BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Real-Time Analytics with Apache Cassandra
Cassandra Day Munich, 9.2.2016
Guido Schmutz
Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-author of several books
Consultant, Trainer and Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://de.slideshare.net/gschmutz
Twitter: gschmutz
Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and and Open
Source technologies
in Switzerland, Germany, Austria and Denmark. We offer our services in the following
strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
OPERATION
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Titan:db for Graph Data
Customer Use Case and Architecture
Data Science Lab @ Armasuisse W+T
W+T flagship project, standing for innovation & tech transfer
Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures
Investing in new, innovative and not yet widely proven technology:
• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph Data
• …
3 Phases: June 2013 – June 2015
SOCMINT Demonstrator – Time Dimension
Major data model: time series (TS)
TS reflect user behavior over time
Activities correlate with events
Anomaly detection
Event detection & prediction
SOCMINT Demonstrator – Social Dimension
User-user networks (social graphs)
Twitter: follower, retweet and mention graphs
Who is central in a social network?
Who has retweeted a given tweet to whom?
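The two questions on this slide can be illustrated with a small, self-contained sketch. It is not the demonstrator's implementation: the edge list, the user names and the helper functions in_degree and retweeters are hypothetical, and in-degree on the follower graph is only one simple way to read "central".

```python
from collections import Counter

# Hypothetical sample edges (source, label, target); not data from the talk.
EDGES = [
    ("alice", "follow", "bob"),
    ("carol", "follow", "bob"),
    ("carol", "follow", "alice"),
    ("bob", "retweet", "tweet-1"),
    ("carol", "retweet", "tweet-1"),
]


def in_degree(edges, label):
    """Count incoming edges with the given label for every target vertex."""
    return Counter(target for _, lbl, target in edges if lbl == label)


def retweeters(edges, tweet):
    """Return the users with a retweet edge to the given tweet, sorted by name."""
    return sorted(src for src, lbl, target in edges
                  if lbl == "retweet" and target == tweet)


follower_counts = in_degree(EDGES, "follow")
most_followed, n_followers = follower_counts.most_common(1)[0]
print(most_followed, n_followers)    # most followed user and follower count
print(retweeters(EDGES, "tweet-1"))  # users who retweeted tweet-1
```

Answering "to whom" would additionally walk the follow edges of each retweeting user, which is the kind of multi-hop query a graph database is meant for.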
SOCMINT Demonstrator – “Lambda Architecture” for Big Data
[Figure: Lambda Architecture diagram. Labels: Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine); Data Collection (Channel, Messaging); (Analytical) Batch Data Processing (Raw Data (Reservoir), Batch compute, Batch Result Store, Computed Information); (Analytical) Real-Time Data Processing (Stream/Event Processing, Batch compute, Real-Time Result Store); Result Store; Query Engine; Data Access (Service, Reports, Analytic Tools, Alerting Tools). Legend: Data in Motion, Data at Rest.]
SOCMINT Demonstrator – Frameworks & Components in Use
[Figure: the Lambda Architecture diagram again, here showing the frameworks and components in use]
Streaming Analytics Processing Pipeline
Kafka provides reliable and efficient queuing
Storm processes the stream (rollups, counts)
Cassandra stores the results at the same speed
[Figure: three pipeline stages labelled Queuing, Processing and Storing, with Twitter Sensor 1, Twitter Sensor 2 and Twitter Sensor 3 as inputs and two Visualization Applications as consumers]
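As a rough illustration of the queue, process, store flow, the sketch below uses in-memory stand-ins only: a deque plays the role of the Kafka queue, the process function plays the role of the Storm rollup, and a Counter plays the role of the Cassandra result store. None of this uses the real Kafka, Storm or Cassandra APIs, and the sample tweets are made up.

```python
from collections import Counter, deque
from datetime import datetime

# Stand-in for the Kafka queue: tweets as (sensor_id, timestamp) tuples.
queue = deque([
    ("ABC-001", datetime(2015, 10, 14, 10, 0, 2)),
    ("ABC-001", datetime(2015, 10, 14, 10, 0, 40)),
    ("ABC-001", datetime(2015, 10, 14, 10, 1, 5)),
])

# Stand-in for the Cassandra result store: (sensor_id, minute) -> count.
store = Counter()


def process(queue, store):
    """Stand-in for the Storm topology: roll tweets up into per-minute counts."""
    while queue:
        sensor_id, ts = queue.popleft()
        minute = ts.replace(second=0, microsecond=0)
        store[(sensor_id, minute)] += 1


process(queue, store)
for (sensor_id, minute), count in sorted(store.items()):
    print(sensor_id, minute.isoformat(), count)
```

The point of the sketch is the shape of the data: the processing stage emits small increments keyed by sensor and time slot, which is what the counter tables later in the deck are designed to absorb.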
Cassandra Data Modeling
• Don’t think relational!
• Denormalize, denormalize, denormalize …
• Rows can be gigantic and are sorted; one row (partition) is stored on one node
• Know your application/use cases => from query to model
• Indexing is no longer an afterthought => “index” upfront
• Control the physical storage structure
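Why "one row is stored on one node" follows from the partition key can be sketched as below. This is a deliberate simplification and an assumption-laden illustration: Cassandra's default partitioner is Murmur3 and it assigns token ranges on a ring with replication, whereas the sketch uses SHA-256 and a modulo over a hypothetical three-node list.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a partition key to exactly one node via a stable hash.

    SHA-256 and modulo placement are stand-ins for illustration only.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


# Every cell that shares the partition key lands on the same node,
# no matter how many clustering rows the partition holds.
key = "ABC-001:SECOND-2015-10-14"
print(key, "->", node_for(key))
```

Because placement depends only on the partition key, the choice of that key decides both how data is spread across the cluster and which queries can be served by a single node.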
“Static” Tables – “Skinny Row”
CREATE TABLE skinny (
    rowkey text,
    c1 text,
    c2 text,
    c3 text,
    PRIMARY KEY (rowkey));
[Figure: three example rows keyed by rowkey (the partition key): rowkey-1 with columns c1, c2, c3; rowkey-2 with columns c1, c3; rowkey-3 with columns c1, c2, c3. The table grows to billions of rows.]
“Dynamic” Tables – “Wide Row”
CREATE TABLE wide (
    rowkey text,
    ckey text,
    c1 text,
    c2 text,
    PRIMARY KEY (rowkey, ckey))
WITH CLUSTERING ORDER BY (ckey ASC);
[Figure: billions of rows (partitions) keyed by rowkey (the partition key); each holds cells named by clustering key and column (ckey-1:c1, ckey-1:c2, ckey-2:c1, ...), sorted by ckey, with up to 2 billion cells per partition]
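A toy in-memory model may help to picture the wide-row layout: each partition key maps to a set of rows that are read back in clustering-key order. The names table, upsert and read_partition are hypothetical, and the sketch ignores everything Cassandra does on disk.

```python
from collections import defaultdict

# rowkey (partition key) -> {ckey (clustering key): {column: value}}
table = defaultdict(dict)


def upsert(rowkey, ckey, **columns):
    """Write columns for one (rowkey, ckey); an existing row is updated in place."""
    table[rowkey].setdefault(ckey, {}).update(columns)


def read_partition(rowkey):
    """Return the rows of one partition in clustering order (ckey ASC)."""
    partition = table.get(rowkey, {})
    return [(ckey, partition[ckey]) for ckey in sorted(partition)]


upsert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
upsert("rowkey-1", "ckey-1", c1="value-c1")
upsert("rowkey-1", "ckey-1", c2="value-c2")  # same primary key: an update
for ckey, columns in read_partition("rowkey-1"):
    print(ckey, columns)
```

Writing the same (rowkey, ckey) twice updates one row rather than creating two, which mirrors the upsert behavior of CQL INSERT and UPDATE.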
Cassandra for Timeseries Data
Know your application => From query to model
Show Timeline of Tweets
Show Timeseries on different levels of aggregation (resolution)
• Seconds
• Minutes
• Hours
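Serving several resolutions means every event timestamp has to be mapped to the start of its second, minute and hour slot. A minimal sketch of that mapping follows; the function name truncate and the resolution labels are assumptions for illustration, not names from the talk.

```python
from datetime import datetime


def truncate(ts: datetime, resolution: str) -> datetime:
    """Return the start of the time slot that ts falls into."""
    if resolution == "second":
        return ts.replace(microsecond=0)
    if resolution == "minute":
        return ts.replace(second=0, microsecond=0)
    if resolution == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown resolution: {resolution}")


ts = datetime(2015, 10, 14, 10, 50, 27, 123456)
for resolution in ("second", "minute", "hour"):
    print(resolution, truncate(ts, resolution).isoformat())
```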
Show Timeline: Provide Raw Data (Tweets)
CREATE TABLE tweet (tweet_id bigint,
username text,
message text,
hashtags list<text>,
latitude double,
longitude double,
…
PRIMARY KEY(tweet_id));
• Skinny Row Table
• Holds the sensor raw data => Tweets
• Similar to a relational table
• Primary Key is the partition key
[Figure: two example rows keyed by tweet_id (the partition key): 10000121 (username gschmutz, message “Getting ready for ..”, hashtags [cassandra, nosql], latitude 0, longitude 0) and 20121223 (username DataStax, message “The Speed Factor ..”, hashtags [BigData], latitude 0, longitude 0)]
Show Timeline: Provide Raw Data (Tweets)
INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude)
VALUES (10000121, 'gschmutz',
        'Getting ready for my talk about using Cassandra for Timeseries and Graph Data',
        ['cassandra', 'nosql'], 0, 0);
SELECT tweet_id, username, hashtags, message FROM tweet
WHERE tweet_id = 10000121;
 tweet_id | username | hashtags               | message
----------+----------+------------------------+-----------------------
 10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
Show Timeline: Provide Sequence of Events
CREATE TABLE tweet_timeline (
sensor_id text,
bucket_id text,
time_id timestamp,
tweet_id bigint,
PRIMARY KEY((sensor_id, bucket_id), time_id))
WITH CLUSTERING ORDER BY (time_id DESC);
Wide Row Table
bucket_id creates buckets for columns
• SECOND-2015-10-14
[Figure: example partitions. ABC-001:SECOND-2015-10-14 with cell 10:00:02:tweet-id = 10000121. DEF-931:SECOND-2015-10-14 with cells 10:09:02:tweet-id = 1003121343, 09:12:09:tweet-id = 1002111343, 09:10:02:tweet-id = 1001121343. The partition key is (sensor_id, bucket_id); time_id is the clustering key.]
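The bucket_id keeps a single sensor's timeline from growing into one unbounded partition. Below is a sketch of how a writer might derive the composite partition key; the helper names timeline_bucket and partition_key are hypothetical, and the one-bucket-per-day rule is read off the SECOND-2015-10-14 example rather than stated in the talk.

```python
from datetime import datetime


def timeline_bucket(ts: datetime) -> str:
    """Bucket for second-resolution data: one partition per sensor and day."""
    return f"SECOND-{ts:%Y-%m-%d}"


def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Composite partition key (sensor_id, bucket_id) as in tweet_timeline."""
    return (sensor_id, timeline_bucket(ts))


late = datetime(2015, 10, 14, 23, 59, 59)
next_day = datetime(2015, 10, 15, 0, 0, 0)
print(partition_key("ABC-001", late))      # last second of the day
print(partition_key("ABC-001", next_day))  # rolls over to a new partition
```

A reader that needs a time range spanning several days has to query one bucket per day and merge the results on the client side.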
Show Timeline: Provide Sequence of Events
INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);
SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';
sensor_id | bucket_id | time_id | tweet_id
----------+-------------------+--------------------------+----------
ABC-001 | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121
Sorted by time_id (descending, per the clustering order)
Show Timeseries: Provide list of metrics
CREATE TABLE tweet_count (
sensor_id text,
bucket_id text,
key text,
time_id timestamp,
count counter,
PRIMARY KEY((sensor_id, bucket_id), key, time_id))
WITH CLUSTERING ORDER BY (key ASC, time_id DESC);
Wide Row Table
bucket_id creates buckets for columns
• SECOND-2015-10-14
• HOUR-2015-10
• DAY-2015-10
[Figure: example partitions. ABC-001:HOUR-2015-10 with cells ALL:10:00:count = 1’550, ALL:09:00:count = 2’299, nosql:08:00:count = 25. ABC-001:DAY-2015-10 with cells ALL:14-OCT:count = 105’999, ALL:13-OCT:count = 120’344, nosql:14-OCT:count = 2’532.]
Partition width for an HOUR bucket: 30 d × 24 h × n keys = n × 720 columns
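The bucket names and the sizing estimate on this slide can be reproduced with a few lines. The bucket granularity per resolution (a day for SECOND, a month for HOUR and DAY) is inferred from the three example names, and BUCKET_FORMATS, count_bucket and hour_bucket_cells are hypothetical helper names.

```python
from datetime import datetime

# Assumed bucket granularity per resolution, inferred from the example names.
BUCKET_FORMATS = {
    "SECOND": "%Y-%m-%d",  # one partition per sensor and day
    "HOUR": "%Y-%m",       # one partition per sensor and month
    "DAY": "%Y-%m",        # one partition per sensor and month
}


def count_bucket(resolution: str, ts: datetime) -> str:
    """Build the bucket_id for a resolution, e.g. HOUR-2015-10."""
    return f"{resolution}-{ts.strftime(BUCKET_FORMATS[resolution])}"


def hour_bucket_cells(n_keys: int, days: int = 30) -> int:
    """Cells in one HOUR partition: days * 24 hours * number of keys."""
    return days * 24 * n_keys


ts = datetime(2015, 10, 14, 10, 0)
print([count_bucket(r, ts) for r in BUCKET_FORMATS])
print(hour_bucket_cells(n_keys=1), hour_bucket_cells(n_keys=50))
```

The estimate shows why the number of distinct keys matters: the partition grows linearly with it, so an unbounded key space, such as every hashtag ever seen, would need its own limit or a finer bucket.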
Show Timeseries: Provide list of metrics
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';
SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';
sensor_id | bucket_id | key | time_id | count
----------+--------------+-----+--------------------------+-------
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
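The counter update and the range read can be mimicked in memory to show how the primary key shapes both operations. This is a sketch under assumptions: incrementing one counter for the key ALL plus one per hashtag is a guess based on the ALL and nosql keys in the examples, and tweet_count, record_tweet and read_counts are hypothetical stand-ins, not driver calls.

```python
from collections import Counter
from datetime import datetime

# (sensor_id, bucket_id, key, time_id) -> count, mirroring the primary key.
tweet_count = Counter()


def record_tweet(sensor_id, ts, hashtags):
    """Increment the hourly counter for 'ALL' and for each hashtag."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    bucket_id = f"HOUR-{ts:%Y-%m}"
    for key in ["ALL", *hashtags]:
        tweet_count[(sensor_id, bucket_id, key, hour)] += 1


def read_counts(sensor_id, bucket_id, key, since):
    """Return (time_id, count) pairs with time_id >= since, newest first."""
    rows = [(t, n) for (s, b, k, t), n in tweet_count.items()
            if (s, b, k) == (sensor_id, bucket_id, key) and t >= since]
    return sorted(rows, reverse=True)


record_tweet("ABC-001", datetime(2015, 10, 14, 10, 5), ["nosql"])
record_tweet("ABC-001", datetime(2015, 10, 14, 10, 40), [])
record_tweet("ABC-001", datetime(2015, 10, 14, 9, 15), ["nosql", "cassandra"])
for time_id, count in read_counts("ABC-001", "HOUR-2015-10", "ALL",
                                  datetime(2015, 10, 14, 8)):
    print(time_id.isoformat(), count)
```

In the real table the filtering is not a scan: sensor_id and bucket_id select the partition, key and time_id are clustering columns, so the read is a contiguous slice that is already sorted.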
Titan:db & Cassandra for Graph Data
Introduction to the Graph Model – Property Graph
Vertex (Node)
• Represent entities
• Always have an ID
• Can contain properties (key-value pairs)
Edge (Relationship)
• Lines between nodes
• May be directed or undirected
• Have IDs and properties
Properties
• Values about a node or relationship
• Allow adding semantics to relationships
[Figure: example property graph. User vertices: Id 16134540 (name: cloudera, location: Palo Alto) and Id 18898576 (name: gschmutz, location: Berne). Tweet vertices: Tweet 1 (Id 18898576, text: “Join BigData..”, time: June 11 2015) and Tweet 2 (Id 18898999, text: “CDH5 has been..”, time: July 11 2015). Edge labels: author, follow, retweet; edge properties shown: since: May 2012, time: June 11 2015.]
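The property graph model described here is small enough to sketch directly. The classes below are illustrative only and are not the TinkerPop or Titan API. The property values come from the slide, while the edge ID, the direction of the follow edge and the helper out_neighbors are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    label: str
    out_v: Vertex  # vertex the edge starts from
    in_v: Vertex   # vertex the edge points to
    properties: dict = field(default_factory=dict)


def out_neighbors(edges, vertex, label):
    """Follow outgoing edges with the given label and return the target vertices."""
    return [e.in_v for e in edges if e.out_v is vertex and e.label == label]


gschmutz = Vertex(18898576, "user", {"name": "gschmutz", "location": "Berne"})
cloudera = Vertex(16134540, "user", {"name": "cloudera", "location": "Palo Alto"})
# Edge ID and direction are assumed for the example.
edges = [Edge(1, "follow", gschmutz, cloudera, {"since": "May 2012"})]

for v in out_neighbors(edges, gschmutz, "follow"):
    print(v.properties["name"], v.properties["location"])
```

Both vertices and edges carry their own ID and key-value properties, which is what separates a property graph from a plain adjacency list.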
Titan:db Architecture
http://thinkaurelius.github.io/titan/
TinkerPop 3 Stack
TinkerPop is a framework composed of various interoperable components
Vendor independent (similar to JDBC for RDBMS)
Core API defines Graph, Vertex, Edge, …
Gremlin traversal language is a vendor-independent way to query (traverse) a graph
Gremlin Server can be leveraged to allow over-the-wire communication with a TinkerPop-enabled graph system
http://tinkerpop.incubator.apache.org/
Gremlin Graph Traversal Engine
Language / system agnostic: many graph languages for many graph systems
Provided traversal engine: SPARQL or any other graph query language on the Gremlin Traversal Machine
Native distributed execution: a Gremlin traversal over an OLAP graph processor (Hadoop / Spark)
Gremlin in Action – Creating the Graph
Gremlin in Action – Graph Traversal
Gremlin in Action – Graph Traversal (II)
Summary - Know your domain
[Figure: axis “Connectedness of Data” from low to high, with the labels Document Data Store, Key-Value Stores, Wide-Column Store, Graph Databases and Relational Databases]
Guido Schmutz
Email: guido.schmutz@trivadis.com
+41 79 412 05 39