Data Analytics with NOSQL
Mukundan Agaram
Chris Weiss
Some initial thoughts about data...
Continual issues with large scale web apps
– Data growth + query response time
● Data growth => performance degradation
● Explosion of big data “analytics” use cases
– Increase in unstructured data
● More interconnectivity, more formats, lack of structure...
● Document-oriented data (XML/JSON) is difficult to
manage and search
– Distributed server configurations
● Large systems, more distribution and HA
Cloud services have aggravated these issues
Agenda for the night
● What is NOSQL?
● Varieties of NOSQL
● Key Industry Use Cases
● Applications for Data Analytics
● Landscape
● Demos/Walkthroughs
● Closing Discussions
What is NOSQL?
● “...mechanism for storage and retrieval of data
that is modeled in means other than tabular
relations used in relational databases.”
Wikipedia
● Non SQL or Non-relational
● Not Only SQL
● Technically around since the late 1960s...
– E.g. IDMS, IMS, MUMPS, Caché, BerkeleyDB
What is NOSQL?
● Drivers for modern day NOSQL
– Web 2.0
– Big Data
– Facebook, Google, Amazon, Expedia etc.
– Horizontal scaling to clusters of computers
● Achilles heel for RDBMS
– Cost
– Hard to provide all of
● HA
● Partition tolerance and horizontal partitioning (a.k.a. sharding)
● Speed
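Horizontal scaling by sharding can be sketched in a few lines. This is a minimal illustration, not how any particular product routes keys: the three in-memory dicts stand in for cluster nodes, and the names `shard_for`, `put`, `get` and `shards` are made up for the example.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical cluster of three nodes, each holding one shard as a plain dict.
shards = [dict() for _ in range(3)]

def put(key: str, value) -> None:
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key: str):
    """Read the value back from the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)

put("user:42", {"name": "Ada"})
put("user:43", {"name": "Grace"})
```

Simple modulo routing moves most keys when the number of nodes changes, which is why distributed stores commonly use consistent hashing or a similar scheme instead.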
NOSQL - Drawbacks and Barriers
● Compromise on consistency (CAP Theorem)
● Custom query languages vs. SQL
● Lack of standardized interfaces
● Existing investments in RDBMS
● Most lack true ACID transactions.
– Many use an “eventually consistent” model
– Data is replicated with a conflict resolution algorithm
– Methods for conflict resolution and distribution vary
significantly
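One common conflict resolution strategy is last-write-wins (LWW). The sketch below assumes each write carries a timestamp and a node id; it is only one of the many approaches the slide alludes to, and the `Versioned`, `lww_merge` and `reconcile` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed write time supplied by the client or node
    node_id: str      # tie-breaker when timestamps are equal

def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Return the winning version: newest timestamp, then highest node_id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))

def reconcile(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas key by key so both converge to the same state."""
    merged = dict(replica_a)
    for key, version in replica_b.items():
        merged[key] = lww_merge(merged[key], version) if key in merged else version
    return merged

# Two replicas that accepted different writes for the same key.
replica_a = {"cart": Versioned("book", 1.0, "a")}
replica_b = {"cart": Versioned("book,pen", 2.0, "b"), "wishlist": Versioned("lamp", 1.5, "b")}

merged_ab = reconcile(replica_a, replica_b)
merged_ba = reconcile(replica_b, replica_a)
```

The merge gives the same result in either direction, which is what lets replicas converge. The trade-off is that the older concurrent write is silently discarded.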
CAP Theorem
● a.k.a. Brewer's theorem
● Impossible for a distributed computer system to
simultaneously provide all three of
– Consistency
● All nodes see the same data at the same time
– Availability
● Every request receives a response (success or failure)
– Partition Tolerance
● The system keeps operating despite network failures that partition the nodes
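The trade-off during a network partition can be shown with a toy simulation. This is a simplified sketch under stated assumptions: three in-memory replicas, a majority rule standing in for a "CP" system, and an accept-anything rule standing in for an "AP" system. `Replica` and `write` are illustrative names only.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.value = None

def write(replicas, reachable, value, mode):
    """Attempt a write while only `reachable` replicas can be contacted.

    mode="CP": refuse unless a majority is reachable (consistent, less available).
    mode="AP": always accept on reachable replicas (available, may diverge).
    """
    majority = len(replicas) // 2 + 1
    if mode == "CP" and len(reachable) < majority:
        return False  # request is rejected during the partition
    for replica in reachable:
        replica.value = value
    return True

nodes = [Replica("n1"), Replica("n2"), Replica("n3")]
write(nodes, nodes, "v1", "CP")              # healthy cluster: every node stores v1
minority = [nodes[0]]                        # a partition isolates n1 from n2 and n3
cp_ok = write(nodes, minority, "v2", "CP")   # rejected: no majority reachable
ap_ok = write(nodes, minority, "v2", "AP")   # accepted: n1 now differs from n2 and n3
```

After the "AP" write the replicas disagree until the partition heals and a reconciliation step runs, which is where eventual consistency comes in.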
CAP alignment for NOSQL
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
NOSQL direction
The landscape is morphing...
● Current NOSQL industry focus
– Addressing large distributed systems in reaction to the
CAP theorem
● The newer breed of NOSQL addresses important
aspects such as ACID
● There is a new buzzword …
– NewSQL
Database Evolution
NOSQL Model Classification
● Key-Value Stores & Caches
– Data is represented as a collection of (K,V) pairs. In-memory,
persistent or eventually persistent.
● Document Databases
– Data is stored in JSON document structures.
● RDF, OWL & Triple Stores
– Meaningful way to connect information. Can run inference over
triples (S,P,O). Can be represented graphically. Queried with SPARQL.
● Wide Column Databases
– Extensible record set. Stores data tables as sections of
columns. Great for EDW.
● Graph Databases
– Stores data as a graph G(V,E). Great for correlation analysis,
recommendation engines and fraud detection.
● Multi-model Databases
– Combination of two or more varieties of the above.
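To make the classification concrete, here is the same small fact set expressed in each model using plain Python structures. This is an illustrative sketch only: the record (a voter named Ada living in Ingham county) and all field names are invented for the example, and real products add indexing, persistence and query languages on top.

```python
import json

# Key-value: an opaque value looked up by key.
key_value = {"voter:v1": '{"name": "Ada", "county": "Ingham"}'}

# Document: a nested structure whose fields can be queried.
document = {"_id": "v1", "name": "Ada", "address": {"county": "Ingham"}}

# Triples (subject, predicate, object), as in RDF.
triples = [("v1", "hasName", "Ada"), ("v1", "livesIn", "Ingham")]

# Wide column: row key -> column family -> column -> value.
wide_column = {"v1": {"profile": {"name": "Ada"}, "geo": {"county": "Ingham"}}}

# Graph G(V, E): vertices plus labeled edges.
vertices = {
    "v1": {"label": "Voter", "name": "Ada"},
    "c1": {"label": "County", "name": "Ingham"},
}
edges = [("v1", "LIVES_IN", "c1")]

def county_from_graph(voter_id: str) -> str:
    """Follow the LIVES_IN edge from a voter vertex to its county name."""
    for src, label, dst in edges:
        if src == voter_id and label == "LIVES_IN":
            return vertices[dst]["name"]
    raise KeyError(voter_id)

# The key-value store has to decode the whole value before reading one field.
county_from_kv = json.loads(key_value["voter:v1"])["county"]
```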
NOSQL Models
● Key-Value
– Cache (Ehcache, BigMemory, Coherence, Memcached)
– Store (Redis, Riak, Aerospike, Oracle NoSQL)
● Document (MongoDB, CouchDB, Amazon DynamoDB)
● Wide Column (Cassandra, HBase, Vertica)
● Graph (Neo4j, Titan, Giraph)
● Multi-model (OrientDB, ArangoDB, Sqrrl)
Source: www.db-engines.com
Consider NOSQL for...
● Enabling “big data” and “web” scale
– Massive distribution through horizontal scaling
● Performant queries (alternatives to RDBMS)
– Denormalization and large horizontal scalability
● Massive write volumes (Facebook, Twitter)
● Fast and dynamic access to key data
● Flexible schemas and data types
● Data/Schema Migration
● Developer centric environments
Consider NOSQL for...
● Diverse data organization options
– Hierarchical correlation
– Graph correlation
– Semantic relationships
– Set based analytics
● Caching in end usage format
● Data Archival
● Big Data Analytics
– Cumulative metrics and insights
– Correlation
Where RDBMS/SQL is better...
● OLTP
● Data Integrity
● SQL centricity
● Complex relationships
– With the exception of graph NOSQL
● Maturity, stability and standardization
Use Cases
● Log management (unstructured data)
● Data synchronization (online vs. offline sources)
– Shopping cart, Field sales/services, PoS, Gaming,
Transportation/telemetry
● User profile management
● Customer 360 degree view
● Fraud detection
● Medical/Healthcare diagnosis
● Data Archival
● Recommendation Engines
Applications for Data Analytics
● Complements (part of) Hadoop and Big Data
● Acts as the persistence infrastructure for larger
machine learning use cases
– Predictive Analytics
– Fraud/Anomaly/Outlier Detection
– Recommendation engines
● Provides a backdrop for interesting data
visualization initiatives
– Integrate with visualization packages such as
Tableau
Interesting links
● Redis in Practice: Who's online?
www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
● Inventory list of NOSQL systems
www.nosql-database.org
● Database Engine ranking and analytics
www.db-engines.com
● Visual guide to NOSQL systems
www.blog.nahurst.com/visual-guide-to-nosql-systems
Case Studies / Demos
● Retail fraud detection
– Neo4j
– Contrasting with OrientDB
– TinkerPop/Gremlin/Blueprints
● 360 degree single view of voter information
– MongoDB
● Schema on read
– Hadoop
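Schema on read means raw records are stored untouched and a schema is applied only when the data is queried. The sketch below shows the idea in plain Python rather than Hadoop. The sample lines, the `schema` mapping and `read_with_schema` are assumptions made for illustration.

```python
import json

# Raw records are stored as-is; note the inconsistent fields and types.
raw_lines = [
    '{"voter_id": "1", "county": "Ingham", "age": "34"}',
    '{"voter_id": 2, "county": "Eaton"}',
    '{"voter_id": "3", "county": "Ingham", "age": 51, "extra": true}',
]

# Hypothetical read-time schema: field name -> (converter, default).
schema = {"voter_id": (int, None), "county": (str, "unknown"), "age": (int, None)}

def read_with_schema(line: str, schema: dict) -> dict:
    """Parse one raw line and coerce it to the schema chosen by the reader."""
    record = json.loads(line)
    row = {}
    for field, (convert, default) in schema.items():
        row[field] = convert(record[field]) if field in record else default
    return row

rows = [read_with_schema(line, schema) for line in raw_lines]
ingham_ages = [r["age"] for r in rows if r["county"] == "Ingham"]
```

A different reader could apply a different schema to the same raw lines, for example one that keeps the `extra` field, without rewriting the stored data.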
Gremlin Blueprints Architecture
Neo4j OrientDB TitanGraph ArangoDB
Qualified Voter – Use Case
● Tracks registration information for all voters in
Michigan
● Uses a tabular geography model
● Highly normalized schema
– Data partitioned into subsets
● Enable local application instances and row level security
● Expensive queries when doing reporting
● Expensive queries for performing “single view”
of voter
● Several tables with tens of millions of records
Voter Schema
Find the first 100 voters in Ingham county with
status and school district
-- Oracle syntax: ROWNUM <= 100 caps the result at 100 rows
SELECT V.VOTER_IDENTIFICATION_NUMBER, V.FIRST_NAME, V.LAST_NAME, G.CODE AS GENDER,
       IDS.NAME AS ID_STATUS, UST.NAME AS UOCAVA_STATUS,
       VA.ADDRESS_LINE_ONE, VA.CITY, VA.ZIP_CODE,
       DIS.NAME AS SCHOOL_DISTRICT
FROM VOTER V
JOIN VOTER_ADDRESS VA ON VA.VOTER_ID = V.ID
JOIN GENDER G ON G.ID = V.GENDER_ID
JOIN IDENTIFICATION_STATUS IDS ON IDS.ID = V.IDENTIFICATION_STATUS_ID
JOIN UOCAVA_STATUS UST ON UST.ID = V.UOCAVA_STATUS_ID
JOIN VOTER_STATUS_TYPE VST ON VST.ID = V.VOTER_STATUS_TYPE_ID
JOIN STREET_RANGE SI ON SI.ID = VA.STREET_RANGE_ID
JOIN DISTINCT_POLITICAL_AREA DPA ON DPA.ID = SI.DISTINCT_POLITICAL_AREA_ID
JOIN COUNTY CO ON CO.ID = DPA.COUNTY_ID
JOIN DISTINCT_POLITICAL_AREA_DIS DPAD ON DPAD.DISTINCT_POLITICAL_AREA_ID = DPA.ID
JOIN DISTRICT DIS ON DIS.ID = DPAD.DISTRICT_ID
JOIN DISTRICT_TYPE DT ON DT.ID = DIS.DISTRICT_TYPE_ID
WHERE VST.NAME = 'Active'
  AND VA.IS_ACTIVE = 'Y'
  AND CO.NAME = 'Ingham'
  AND DT.NAME = 'School'
  AND ROWNUM <= 100;
Expensive in terms of IO
● Multiple objects read
● Two stage IO:
– Read index
– Read entire table row
● Selected and WHERE clause columns
assembled and then filtered
● Resources for a larger volume query would be
high: memory, CPU, fast disk
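For contrast, a document-style "single view" folds the lookup tables into each voter record so that one read returns everything. The sketch below uses plain Python dicts instead of an actual document database. The field names loosely mirror the columns in the SQL query, but the document layout, the sample values and the `single_view` helper are all assumptions made for illustration.

```python
# Hypothetical denormalized documents with placeholder values.
voters = [
    {
        "voter_identification_number": "100001",
        "first_name": "Ada", "last_name": "Example", "gender": "F",
        "status": "Active",
        "address": {"line_one": "1 Main St", "city": "Lansing", "zip_code": "48933",
                    "county": "Ingham", "is_active": True},
        "districts": [{"type": "School", "name": "Sample School District A"}],
    },
    {
        "voter_identification_number": "100002",
        "first_name": "Alan", "last_name": "Sample", "gender": "M",
        "status": "Active",
        "address": {"line_one": "2 Oak Ave", "city": "Charlotte", "zip_code": "48813",
                    "county": "Eaton", "is_active": True},
        "districts": [{"type": "School", "name": "Sample School District B"}],
    },
]

def single_view(docs, county, limit=100):
    """Return up to `limit` active voters in `county` with their school district."""
    results = []
    for doc in docs:
        if doc["status"] != "Active" or not doc["address"]["is_active"]:
            continue
        if doc["address"]["county"] != county:
            continue
        for district in doc["districts"]:
            if district["type"] == "School":
                results.append({
                    "voter": doc["voter_identification_number"],
                    "name": f'{doc["first_name"]} {doc["last_name"]}',
                    "school_district": district["name"],
                })
        if len(results) >= limit:
            break
    return results[:limit]

ingham = single_view(voters, "Ingham")
```

The price of avoiding the joins is duplication: a renamed school district has to be updated in every voter document that embeds it.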
Parting conclusions
● NOSQL is a mixed bag of fruit
● This space is growing
● There are hundreds of products
● Best value is realized from identifying the
correct use case
– Functional requirements
– Non-functional requirements
Finally you can use NOSQL for...
Thank You!!
Questions?
