BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Real-Time Analytics with Apache Cassandra
Cassandra Day Berlin, 11.2.2016
Guido Schmutz
Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://de.slideshare.net/gschmutz
Twitter: gschmutz
Our company.
© Trivadis – The Company 2/11/16
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and and Open
Source technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
OPERATION
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Titan:db for Graph Data
Customer Use Case and
Architecture
Data Science Lab @ Armasuisse W+T
W+T flagship project, standing
for innovation & tech transfer
Building capabilities in the
areas of:
• Social Media Intelligence
(SOCMINT)
• Big Data Technologies &
Architectures
Invest in new, innovative and not yet
widely proven technology
• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph Data
• …
3 Phases: June 2013 – June 2015
SOCMINT System – Time Dimension
Major data model: Time
series (TS)
TS reflect user behaviors
over time
Activities correlate with
events
Anomaly detection
Event detection &
prediction
SOCMINT System – Social Dimension
User-user networks (social
graphs);
Twitter: follower, retweet and
mention graphs
Who is central in a social
network?
Who has retweeted a given
tweet to whom?
SOCMINT System - “Lambda Architecture” for Big Data
[Diagram: Lambda Architecture]
Data Sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine
Data Collection: Channel
(Analytical) Batch Data Processing: Raw Data (Reservoir), Batch compute, Batch Result Store
(Analytical) Real-Time Data Processing: Messaging, Stream/Event Processing, Real-Time Result Store
Data Access: Query Engine, Result Store (Computed Information)
Consumers: Reports, Service, Analytic Tools, Alerting Tools
Legend: Data in Motion / Data at Rest
SOCMINT System – Frameworks & Components in Use
[Diagram: the Lambda Architecture layout (Data Sources, Data Collection, Batch Data Processing, Real-Time Data Processing, Data Access, consumers), annotated with the frameworks and components in use]
Streaming Analytics Processing Pipeline
Kafka provides reliable and efficient queuing
Storm processes the stream (rollups, counts)
Cassandra stores results at the same speed
Pipeline stages: Queuing → Processing → Storing
Inputs: Twitter Sensor 1, Twitter Sensor 2, Twitter Sensor 3
Outputs: Visualization Application (two instances)
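The three stages above can be sketched in a few lines of Python. This is only an in-memory illustration, not the system's actual code: a `deque` stands in for the Kafka topic, a plain function for the Storm rollup step, and a `Counter` for the Cassandra result store. The event fields (`sensor_id`, `term`, `ts`) are assumed names.

```python
from collections import Counter, deque
from datetime import datetime

# Queuing: stand-in for a Kafka topic filled by the Twitter sensors.
queue = deque([
    {"sensor_id": "ABC-001", "term": "dse", "ts": datetime(2015, 10, 14, 10, 5)},
    {"sensor_id": "ABC-001", "term": "dse", "ts": datetime(2015, 10, 14, 10, 40)},
    {"sensor_id": "ABC-001", "term": "nosql", "ts": datetime(2015, 10, 14, 10, 59)},
    {"sensor_id": "ABC-001", "term": "dse", "ts": datetime(2015, 10, 14, 11, 1)},
])


def rollup_key(event):
    """Processing: truncate the timestamp to the hour to build the rollup key."""
    hour = event["ts"].replace(minute=0, second=0, microsecond=0)
    return (event["sensor_id"], event["term"], hour)


# Storing: stand-in for a counter table, keyed by (sensor, term, hour).
store = Counter()

while queue:
    event = queue.popleft()
    store[rollup_key(event)] += 1

for key, count in sorted(store.items()):
    print(key, count)
```

Each event only increments one counter, which is why the store can keep up with the stream: the write is a blind increment and needs no read first.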
Cassandra Data Modeling
Cassandra Data Modeling
• Don’t think relational!
• Denormalize, denormalize, denormalize …
• Rows can be gigantic and are sorted; one row is stored as a whole on one node (and its replicas)
• Know your application/use cases => from query to model
• Indexing is no longer an afterthought => “index” upfront
• Control the physical storage structure
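A minimal sketch of "from query to model", assuming two hypothetical queries (all tweets of a user, all tweets with a hashtag). The table names `tweets_by_user` and `tweets_by_hashtag` and the `store_tweet` helper are illustrative only; dictionaries stand in for tables.

```python
from collections import defaultdict

# One "table" per query, each keyed by what that query filters on.
tweets_by_user = defaultdict(list)     # query: all tweets of a user
tweets_by_hashtag = defaultdict(list)  # query: all tweets with a hashtag


def store_tweet(tweet_id, user, hashtags, text):
    """Denormalize on write: copy the row into every query-specific table."""
    row = {"tweet_id": tweet_id, "user": user, "text": text}
    tweets_by_user[user].append(dict(row))
    for tag in hashtags:
        tweets_by_hashtag[tag].append(dict(row))


store_tweet(1, "alice", ["cassandra", "nosql"], "Wide rows!")
store_tweet(2, "bob", ["nosql"], "Denormalize.")

print([r["tweet_id"] for r in tweets_by_hashtag["nosql"]])
```

The write path does more work and the data is stored several times, but each query becomes a single lookup by key, with no join.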
“Static” Tables – “Skinny Row”
rowkey is the partition key.

CREATE TABLE skinny (rowkey text,
c1 text,
c2 text,
c3 text,
PRIMARY KEY (rowkey));

Grows up to billions of rows:

rowkey   | c1       | c2       | c3
---------+----------+----------+---------
rowkey-1 | value-c1 | value-c2 | value-c3
rowkey-2 | value-c1 |          | value-c3
rowkey-3 | value-c1 | value-c2 | value-c3
“Dynamic” Tables – “Wide Row”
rowkey is the partition key, ckey is the clustering key.

CREATE TABLE wide (rowkey text,
ckey text,
c1 text,
c2 text,
PRIMARY KEY (rowkey, ckey)) WITH CLUSTERING ORDER BY (ckey ASC);

Billions of rows, each holding from 1 up to 2 billion cells:

rowkey-1: ckey-1:c1 = value-c1, ckey-1:c2 = value-c2, ckey-2:c1 = value-c1, ckey-2:c2 = value-c2, ckey-3:c1 = value-c1, ckey-3:c2 = value-c2
rowkey-2: ckey-1:c1 = value-c1, ckey-1:c2 = value-c2, ckey-2:c1 = value-c1, ckey-2:c2 = value-c2
rowkey-3: ckey-1:c1 = value-c1, ckey-1:c2 = value-c2, ckey-2:c1 = value-c1, ckey-2:c2 = value-c2, ckey-3:c1 = value-c1, ckey-3:c2 = value-c2
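As a rough mental model of the wide table above, a partition can be pictured as a map that keeps its entries sorted by clustering key, so a range over `ckey` is a contiguous slice. The sketch below is an in-memory illustration only; `WidePartition`, `upsert`, `slice` and `insert` are hypothetical names, not Cassandra APIs.

```python
import bisect


class WidePartition:
    """One partition: entries kept sorted by clustering key (ASC)."""

    def __init__(self):
        self._ckeys = []  # sorted clustering keys
        self._cells = {}  # ckey -> {column: value}

    def upsert(self, ckey, **columns):
        """Insert or update the columns stored under one clustering key."""
        if ckey not in self._cells:
            bisect.insort(self._ckeys, ckey)
            self._cells[ckey] = {}
        self._cells[ckey].update(columns)

    def slice(self, start, end):
        """Return (ckey, columns) pairs with start <= ckey <= end, in order."""
        lo = bisect.bisect_left(self._ckeys, start)
        hi = bisect.bisect_right(self._ckeys, end)
        return [(k, self._cells[k]) for k in self._ckeys[lo:hi]]


table = {}  # partition key (rowkey) -> WidePartition


def insert(rowkey, ckey, **columns):
    table.setdefault(rowkey, WidePartition()).upsert(ckey, **columns)


# Inserted out of order on purpose; the partition keeps them sorted.
insert("rowkey-1", "ckey-3", c1="v31", c2="v32")
insert("rowkey-1", "ckey-1", c1="v11", c2="v12")
insert("rowkey-1", "ckey-2", c1="v21", c2="v22")
insert("rowkey-2", "ckey-1", c1="x11")

result = table["rowkey-1"].slice("ckey-1", "ckey-2")
print([ckey for ckey, _ in result])  # ['ckey-1', 'ckey-2']
```

The partition key selects the partition and the clustering key selects a position or range inside it, which is why range queries are only cheap within a single partition.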
Cassandra for Timeseries Data
Show Timeseries: Provide list of metrics
CREATE TABLE tweet_count (
sensor_id text,
bucket_id text,
key text,
time_id timestamp,
count counter,
PRIMARY KEY((sensor_id, bucket_id), key, time_id))
WITH CLUSTERING ORDER BY (key ASC, time_id DESC);
Use of “Static” Table
bucket_id defines buckets of values
• HOUR-2015-10 = values collected hourly in one partition for one month
• 30d * 24h * n keys = n * 720 cols

Partition key (sensor_id:bucket_id) with its clustering-key cells:

ABC-001:HOUR-2015-10
  dse:10:00:count = 1’550
  dse:09:00:count = 2’299
  nosql:10:00:count = 25
ABC-001:DAY-2015-10
  dse:14-OCT:count = 105’999
  dse:13-OCT:count = 120’344
  nosql:14-OCT:count = 2’532

Open Source TimeSeries DBs over Cassandra:
KairosDB: https://kairosdb.github.io/
Heroic: http://spotify.github.io/heroic
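The bucketing rule and the column estimate can be written out as a small Python sketch. The helper names (`bucket_id`, `partition_key`, `hourly_cells_per_partition`) are made up for illustration; only the `HOUR-2015-10` format and the 30d * 24h * n formula come from the slide.

```python
from datetime import datetime


def bucket_id(granularity, ts):
    """Build a bucket id such as 'HOUR-2015-10' (granularity plus year-month)."""
    return f"{granularity}-{ts:%Y-%m}"


def partition_key(sensor_id, granularity, ts):
    """Composite partition key (sensor_id, bucket_id) as in tweet_count."""
    return (sensor_id, bucket_id(granularity, ts))


def hourly_cells_per_partition(n_keys, days=30, hours_per_day=24):
    """Estimate from the slide: 30d * 24h * n keys = n * 720 cells."""
    return days * hours_per_day * n_keys


ts = datetime(2015, 10, 14, 10, 0)
print(partition_key("ABC-001", "HOUR", ts))   # ('ABC-001', 'HOUR-2015-10')
print(hourly_cells_per_partition(n_keys=2))   # 1440
```

Putting the month into the partition key bounds how large a single partition can grow, at the cost of having to query several partitions for ranges that span months.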
Show Timeseries: Provide list of metrics
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

sensor_id | bucket_id | key | time_id | count
----------+--------------+-----+--------------------------+-------
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230

Partition key: (sensor_id, bucket_id); clustering key: key, time_id
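To make the semantics of the two statements concrete, here is an in-memory Python stand-in, assuming nothing beyond the table definition: `increment` mirrors the counter `UPDATE`, and `select_since` mirrors the `SELECT` with `time_id >=` and the `time_id DESC` clustering order. Both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# (sensor_id, bucket_id) -> {(key, time_id): count}
tweet_count = defaultdict(lambda: defaultdict(int))


def increment(sensor_id, bucket_id, key, time_id, by=1):
    """Counter update: the cell is created on first use, then incremented."""
    tweet_count[(sensor_id, bucket_id)][(key, time_id)] += by


def select_since(sensor_id, bucket_id, key, since):
    """Range read inside one partition, newest first (time_id DESC)."""
    cells = tweet_count.get((sensor_id, bucket_id), {})
    rows = [(t, c) for (k, t), c in cells.items() if k == key and t >= since]
    return sorted(rows, reverse=True)


for hour, hits in [(7, 1), (8, 2), (10, 3)]:
    for _ in range(hits):
        increment("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, hour))

rows = select_since("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, 8))
for time_id, count in rows:
    print(time_id, count)
```

The read names the full partition key and an equality on `key`, so only the trailing clustering column `time_id` carries the range condition, matching the shape of the CQL query.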
Titan:db & Cassandra for Graph
Data
Supporting Graph Data with Titan:db and Cassandra
http://thinkaurelius.github.io/titan/
Gremlin in Action – Creating the Graph
Gremlin in Action – Graph Traversal
Gremlin in Action – Graph Traversal (II)
Summary - Know your domain
[Diagram: store types placed along an axis "Connectedness of Data" running from low to high]
Store types shown: Document Data Store, Key-Value Stores, Wide-Column Store, Graph Databases, Relational Databases
Guido Schmutz
Email: guido.schmutz@trivadis.com
+41 79 412 05 39
