BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE
HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH
Real-Time Analytics with Apache
Cassandra and Apache Spark
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of several books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
Agenda
1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary
Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? Big Data + Real-Time = Stream Processing
What is Real-Time Analytics?
What is it? Why do we need it?
How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and Dashboards access processed data
[Figure: timeline Events -> Analyze -> Respond, with a short time to analyze and respond]
• Required: for new business models
• Desired: for competitive advantage
Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …

Apache Spark
Motivation – Why Apache Spark?
Hadoop MapReduce: Data Sharing on Disk
Spark: Speed up processing by using Memory instead of Disks
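[Figure: Hadoop MapReduce chains map, reduce, ... between Input and Output with an HDFS read and an HDFS write around every step; Spark chains op1, op2, ... between Input and Output and keeps the intermediate data in memory]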
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 at UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open-sourced in 2010; an Apache top-level project since 2014
Apache Spark
Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
Core Runtime: Spark Core API and Execution Model
Cluster Resource Managers: Spark Standalone, Mesos, YARN
Data Stores: HDFS, Elasticsearch, NoSQL, S3
Resilient Distributed Dataset (RDD)
Are
• Immutable
• Re-computable
• Fault tolerant
• Reusable
Have Transformations
• Produce a new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...
Have Actions
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …
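The split between lazy transformations and actions can be illustrated with a toy, single-process stand-in for an RDD. This is plain Python and not the Spark API: the class `ToyRDD`, the helper `parse` and the sample data are all hypothetical. It only shows that transformations record lineage and return a new object, while an action replays that lineage from the source.

```python
import functools
from itertools import chain


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable returning a fresh iterable of the input
        self._lineage = lineage    # recorded transformations, never mutated

    # Transformations: record one more step and return a NEW ToyRDD.
    def map(self, f):
        return ToyRDD(self._source, self._lineage + (("map", f),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def flatMap(self, f):
        return ToyRDD(self._source, self._lineage + (("flatMap", f),))

    def _compute(self):
        # Replaying the lineage from the source is what makes the data re-computable.
        data = iter(self._source())
        for kind, f in self._lineage:
            if kind == "map":
                data = map(f, data)
            elif kind == "filter":
                data = filter(f, data)
            else:
                data = chain.from_iterable(map(f, data))
        return data

    # Actions: trigger the computation and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return functools.reduce(f, self._compute())


calls = []

def parse(line):
    calls.append(line)             # side effect that makes laziness visible
    return int(line)

numbers = ToyRDD(lambda: ["1", "2", "3", "4"]).map(parse)
evens = numbers.filter(lambda n: n % 2 == 0)

lazy_before_action = (calls == [])   # no transformation has run yet
result = evens.collect()             # action: replays map + filter
total = numbers.count()              # action: replays map again from the source
print(lazy_before_action, result, total, len(calls))
```

Because nothing is cached in this sketch, `parse` runs once per element for every action, which is the re-computation the slide refers to.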

Input Source
• File
• Database
• Stream
• Collection
[Figure: Data from an input source -> RDD -> RDD; the action .count() -> 100]
Partitions RDD
[Figure: the Data of an RDD is split into Partition 0 to Partition 9, which are distributed across Server 1 to Server 5; in the last build of the slide only Server 2 to Server 5 are shown]
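Spark Workflow
Input HDFS File -> HadoopRDD -> MappedRDD -> ShuffledRDD -> Text File Output
• sc.hadoopFile() creates the HadoopRDD
• Stage 1 – flatMap() + map() produce the MappedRDDs
• Stage 2 – reduceByKey() produces the ShuffledRDD
• saveAsTextFile() writes the Text File Output
Transformations (Lazy)
Action (Execute Transformations)
[Figure: on the Master, the DAG Scheduler splits the job into the two stages; the MappedRDD is shown with partitions P0, P1 and P3, the ShuffledRDD with partition P0]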
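The two stages of this workflow are the classic word count. The sketch below reproduces the same steps in plain Python on made-up input lines, so it runs without a cluster; it is an illustration of the data flow, not the author's code. In PySpark the corresponding chain would look roughly like `sc.textFile(path).flatMap(...).map(...).reduceByKey(...).saveAsTextFile(out)`, with `path` and `out` as placeholders.

```python
from collections import defaultdict

# Made-up stand-in for the lines of the HDFS input file.
lines = ["to be or not to be", "to do is to be"]

# Stage 1: flatMap() splits every line into words, map() pairs each word with 1.
words = [word for line in lines for word in line.split()]
pairs = [(word, 1) for word in words]

# Stage 2: reduceByKey() sums the values of all pairs that share a key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# Action: materialize the result as the lines of a text output.
output = sorted(f"{word}\t{n}" for word, n in counts.items())
print("\n".join(output))
```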
Spark Workflow
HDFS File Input 1 -> HadoopRDD -> FilteredRDD -> MappedRDD
HDFS File Input 2 -> HadoopRDD -> MappedRDD
Both MappedRDDs -> ShuffledRDD -> HDFS File Output
• SparkContext.hadoopFile() reads each input
• filter() and map() are applied to Input 1, map() to Input 2
• join() combines both MappedRDDs into the ShuffledRDD
• saveAsHadoopFile() writes the output
Transformations (Lazy)
Action (Execute Transformations)
Spark Execution Model
[Figure: a Master and a Server; on the Server a Worker with three Executors next to the Data Storage]
Stage 1 – flatMap() + map()
Spark Execution Model
Narrow Transformation
• filter()
• map()
• sample()
• flatMap()
[Figure: the Master and four Workers, each with an Executor and its own Data Storage; the RDD partitions P0, P1 and P3 are processed where they are stored]
Stage 2 – reduceByKey()
Spark Execution Model
Wide Transformation
• join()
• reduceByKey()
• union()
• groupByKey()
Shuffle!
[Figure: the Master and four Workers, each with an Executor and its own Data Storage; data from all Workers is shuffled into RDD partition P0]
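The difference between the two stages can be sketched in a few lines of plain Python. In a narrow transformation every output partition depends on exactly one input partition; in a wide transformation records are re-routed by key, so an output partition may need data from every input partition. The partition contents and the simple byte-sum hash below are assumptions chosen to keep the example deterministic; this is not how Spark implements its partitioner.

```python
from collections import defaultdict

# Three partitions of (key, value) pairs, one per worker (made-up sample data).
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]

def narrow_map(partition, f):
    """Narrow: the output partition is computed from one input partition only."""
    return [f(record) for record in partition]

def shuffle(parts, num_out):
    """Wide: every record moves to the output partition selected by its key."""
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            out[sum(key.encode()) % num_out].append((key, value))
    return out

def reduce_partition(partition):
    """Sum the values per key inside one shuffled partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

doubled = [narrow_map(p, lambda kv: (kv[0], kv[1] * 2)) for p in partitions]
shuffled = shuffle(doubled, num_out=2)
reduced = [reduce_partition(p) for p in shuffled]
print(reduced)
```

After the shuffle all records of one key sit in the same partition, which is why the per-key sums can then be computed independently.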
Batch vs. Real-Time Processing
Batch: Petabytes of Data
Real-Time: Gigabytes Per Second
Various Input Sources
Apache Kafka
Distributed publish-subscribe messaging system
Designed for processing real-time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use the JMS API and standards
Kafka maintains feeds of messages in topics
[Figure: Producers publish to the Kafka Cluster; Consumers read from it]
Apache Kafka
[Figure 1: a Weather Station publishes to a Kafka Broker holding a Temperature Topic and a Rainfall Topic, each with messages 1 to 6; a Temperature Processor and a Rainfall Processor consume them]
[Figure 2: the Temperature Topic is split into Partition 0 and Partition 1, each read by its own Temperature Processor; the Rainfall Topic has Partition 0, read by the Rainfall Processor]
[Figure 3: two Kafka Brokers, each holding partitions (P 0, P 1) of the Temperature Topic and the Rainfall Topic with messages 1 to 5]
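How topics, partitions and offsets fit together can be sketched with an in-memory model. The class `ToyTopic` and the station names are hypothetical, and the CRC32-based choice of partition is an assumption for illustration only; Kafka's own client uses a different hash. The point is that one key always lands in one partition, that each partition is an append-only log, and that consumers read from an offset without removing messages.

```python
import zlib


class ToyTopic:
    """Hypothetical in-memory topic: one append-only log per partition."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Same key -> same partition, so the order per key is preserved.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1      # (partition, offset)

    def read(self, partition, offset):
        # Consumers pull from an offset; reading does not delete anything.
        return self.partitions[partition][offset:]


temperature = ToyTopic("temperature", num_partitions=2)
placements = [
    temperature.publish(station, reading)
    for station, reading in [
        ("station-a", 20.1),
        ("station-b", 18.7),
        ("station-a", 20.4),
        ("station-a", 20.9),
    ]
]

part_a = placements[0][0]
readings_a = [v for k, v in temperature.read(part_a, 0) if k == "station-a"]
print(part_a, readings_a)
```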
Discretized Stream (DStream)
[Figure: Weather Stations send individual events to Kafka; the stream is made discrete by time, and each time interval of the DStream is an RDD]
[Figure: every X seconds a Transform turns one DStream into another DStream]
Transformations: .countByValue(), .reduceByKey(), .join, .map
Discretized Stream (DStream)
[Figure: time increases from time 1 to time n. At each time t the Input Stream delivers messages; the Event DStream holds them as RDD @time t (message 1, message 2, ... message n); map() produces the MappedDStream as RDD @time t (f(message 1), f(message 2), ... f(message n)); saveAsHadoopFiles() writes result 1, result 2, ... result n]
DStream Transformation Lineage
Actions Trigger Spark Jobs
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
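The discretization shown on this slide can be sketched as follows: events are grouped into consecutive, fixed-length time intervals, and the same function is then applied to every batch. The function name `discretize`, the two-second interval and the sample events are assumptions for illustration; this is plain Python, not Spark Streaming.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, message) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, message in events:
        batches[int(ts // batch_seconds)].append(message)
    last = max(batches) if batches else -1
    # Intervals without events still yield an (empty) batch.
    return [batches.get(i, []) for i in range(last + 1)]


# Made-up input stream: (seconds since start, raw message).
events = [(0.5, "t=21.0"), (1.2, "t=21.3"), (2.1, "t=21.1"), (6.4, "t=20.8")]

event_batches = discretize(events, batch_seconds=2)

# map(): apply the same function f to every message of every batch.
mapped_batches = [[float(m.split("=")[1]) for m in batch] for batch in event_batches]
print(mapped_batches)
```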
Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
• Custom receivers can easily be written for custom data sources
Operations
• Same as Spark Core + additional stateful transformations (window, reduceByWindow)
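A windowed reduction combines the most recent batches each time the window slides. The sketch below expresses window length and slide interval as numbers of batches; the function `reduce_by_window`, its `zero` parameter and the sample batches are hypothetical and only mimic the idea behind reduceByWindow, not its Spark signature.

```python
def reduce_by_window(batches, window_len, slide, reduce_fn, zero):
    """Reduce the last `window_len` batches, once every `slide` batches."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        window = batches[max(0, end - window_len):end]
        acc = zero
        for batch in window:
            for value in batch:
                acc = reduce_fn(acc, value)
        results.append(acc)
    return results


# Made-up micro-batches, one list per batch interval.
batches = [[1, 2], [3], [], [4, 5], [6]]

sums = reduce_by_window(batches, window_len=3, slide=2,
                        reduce_fn=lambda a, b: a + b, zero=0)
print(sums)
```

With a window of three batches and a slide of two, consecutive windows overlap by one batch, so that batch contributes to two results.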
Apache Cassandra
Apache Cassandra
Apache Cassandra™ is a free
• Distributed …
• High performance …
• Extremely scalable …
• Fault tolerant (i.e. no single point of failure) …
post-relational database solution
Optimized for high write throughput
Apache Cassandra - History
Bigtable Dynamo
Motivation - Why NoSQL Databases?
• Dynamo Paper (2007)
• How to build a data store that is
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
• 24 other papers cited
• Evolutionary
Motivation - Why NoSQL Databases?
• Google Bigtable (2006)
• Richer data model
• 1 key and lots of values
• Fast sequential access
• 38 other papers cited
Motivation - Why NoSQL Databases?
• Cassandra Paper (2008)
• Distributed features of Dynamo
• Data model and storage from Bigtable
• Graduated to a top-level Apache project in February 2010
Apache Cassandra – More than one server
All nodes participate in a cluster
Shared nothing
Add or remove as needed
More capacity? Add more servers
Node is a basic unit inside a cluster
Each node owns a range of partitions
Consistent Hashing
[Figure: a ring of Node 1, Node 2, Node 3 and Node 4; the token ranges [0-25], [26-50], [51-75] and [76-100] each appear three times, i.e. every range is held by three nodes]
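The idea of nodes owning token ranges on a ring can be sketched in plain Python. The pairing of nodes to ranges, the 0..100 token space taken from the slide, the MD5-based `token` function and the rule "owner plus the next nodes clockwise" are assumptions for illustration; a real cluster uses a much larger token space and a configurable partitioner and replication strategy.

```python
import bisect
import hashlib

# Upper bound of each node's token range (assumed pairing): [0-25], [26-50], ...
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]


def token(partition_key):
    """Map a partition key onto the toy token space 0..100."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 101


def owner_index(t):
    """Index of the node whose range contains token t."""
    return bisect.bisect_left([upper for upper, _ in RING], t)


def replicas_for_token(t, replication_factor=3):
    """Owner of the range plus the following nodes clockwise on the ring."""
    first = owner_index(t)
    return [RING[(first + i) % len(RING)][1] for i in range(replication_factor)]


def replicas(partition_key, replication_factor=3):
    return replicas_for_token(token(partition_key), replication_factor)


print(replicas_for_token(80))      # range [76-100] and the next two nodes
print(replicas("sensor-42"))       # same key always yields the same nodes
```

Adding a node to such a ring only changes the ownership of the ranges next to it, which is what makes adding or removing servers cheap.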
Apache Cassandra – Fully Replicated
Client writes local
Data syncs across WAN
Replication per Data Center
[Figure: two data centers, West and East, each with Node 1 to Node 4; the Client writes to its local data center]
Apache Cassandra
What is Cassandra NOT?
• A Data Ocean
• A Data Lake
• A Data Pond
• An In-Memory Database
• A Key-Value Store
• A Data Warehouse
What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data
How Cassandra stores data
• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)
[Figure: a row consists of a Row Key followed by columns 1 to 2 billion; every column carries a Column Name, Column Value, Timestamp and TTL; there can be billions of rows]
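The storage model described here, one row key with many columns kept sorted by name, each with a value, a write timestamp and an optional TTL, can be modelled as a small data structure. The names `Column` and `ToyRow` and the sensor data are hypothetical; the sketch shows the logical model only and says nothing about the on-disk format.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    name: str
    value: str
    timestamp: int              # write time, used to pick the newest version
    ttl: Optional[int] = None   # seconds to live; None means no expiry


class ToyRow:
    """Hypothetical wide row: one row key, columns kept sorted by name."""

    def __init__(self, row_key):
        self.row_key = row_key
        self._names = []        # sorted column names
        self._columns = {}      # name -> Column

    def put(self, column):
        current = self._columns.get(column.name)
        if current is None:
            bisect.insort(self._names, column.name)
        elif current.timestamp > column.timestamp:
            return              # an older write never replaces a newer one
        self._columns[column.name] = column

    def slice(self, start, end, now):
        """Columns with start <= name <= end that have not expired at `now`."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        cols = (self._columns[n] for n in self._names[lo:hi])
        return [c for c in cols if c.ttl is None or c.timestamp + c.ttl > now]


# Time series use case: one row per sensor, one column per reading.
row = ToyRow("sensor-1")
row.put(Column("2015-06-01T10:02", "21.3", timestamp=102))
row.put(Column("2015-06-01T10:00", "21.0", timestamp=100, ttl=60))
row.put(Column("2015-06-01T10:01", "21.1", timestamp=101))
row.put(Column("2015-06-01T10:01", "99.9", timestamp=50))   # stale write, ignored

early = row.slice("2015-06-01T10:00", "2015-06-01T10:01", now=150)
late = row.slice("2015-06-01T10:00", "2015-06-01T10:01", now=170)
print([c.value for c in early], [c.value for c in late])
```

Because the column names are sorted, a range of readings is a contiguous slice of the row, which is the fast sequential access the Bigtable model is known for.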
Combining Spark & Cassandra
Spark and Cassandra Architecture – Great Combo
Spark: good at analyzing a huge amount of data
Cassandra: good at storing a huge amount of data
Spark and Cassandra Architecture
• Spark Streaming (Near Real-Time)
• SparkSQL (Structured Data)
• MLlib (Machine Learning)
• GraphX (Graph Analysis)
[Figure: Weather Stations deliver data to the Spark libraries listed above; the Spark Connector links Spark to Cassandra]
Spark and Cassandra Architecture
• Single node running Cassandra
• Spark Worker is really small
• Spark Master lives outside a node
• Spark Worker starts each Spark Executor in a separate JVM
• Node local
[Figure: a Master outside the Server; on the Server a Worker with three Executors]
Spark and Cassandra Architecture
• Each node runs Spark and Cassandra
• Spark Master can make decisions based on Token Ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
[Figure: a Master and four Workers, one for each of the token ranges 0-25, 26-50, 51-75 and 76-100. Each Worker will only have to analyze 25% of the data!]
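The "25% of data" remark follows from scheduling each task on the node that already stores the relevant token range. A minimal sketch, assuming one worker per range as on the slide and one made-up row per token; the names `LOCAL_RANGES` and `plan_tasks` are hypothetical.

```python
# Token range stored next to each worker (assumed pairing, taken from the slide).
LOCAL_RANGES = {
    "Worker 1": (0, 25),
    "Worker 2": (26, 50),
    "Worker 3": (51, 75),
    "Worker 4": (76, 100),
}


def plan_tasks(row_tokens):
    """Assign every row to the worker whose local token range contains it."""
    tasks = {worker: [] for worker in LOCAL_RANGES}
    for t in row_tokens:
        for worker, (lo, hi) in LOCAL_RANGES.items():
            if lo <= t <= hi:
                tasks[worker].append(t)
                break
    return tasks


tokens = list(range(0, 101))       # one made-up row per token value
tasks = plan_tasks(tokens)
share = {worker: len(rows) / len(tokens) for worker, rows in tasks.items()}
print(share)
```

Every worker ends up with roughly a quarter of the rows and reads all of them from local storage.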
Spark and Cassandra Architecture
[Figure: the token ranges 0-25, 26-50, 51-75 and 76-100 appear in two groups, labelled Transactional and Analytics, together with a Master and four Workers]
Cassandra and Spark
                            Cassandra   Cassandra & Spark
Joins and Unions            No          Yes
Transformations             Limited     Yes
Outside Data Integration    No          Yes
Aggregations                Limited     Yes
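The first and last rows of this comparison, joins and aggregations, are the kind of work that moves to the processing layer. The sketch below performs a hash join followed by a per-key average in plain Python on made-up rows standing in for two tables; table contents and variable names are assumptions, and no Cassandra or Spark API is involved.

```python
from collections import defaultdict

# Made-up rows standing in for two tables.
stations = [("s1", "Zurich"), ("s2", "Berne")]
readings = [("s1", 21.0), ("s2", 18.5), ("s1", 23.0)]

# Join on the station id: build a lookup on the small side, probe with the other.
city_by_station = dict(stations)
joined = [
    (city_by_station[station_id], temperature)
    for station_id, temperature in readings
    if station_id in city_by_station
]

# Aggregation: average temperature per city.
sums = defaultdict(lambda: [0.0, 0])
for city, temperature in joined:
    sums[city][0] += temperature
    sums[city][1] += 1
avg_by_city = {city: total / n for city, (total, n) in sums.items()}
print(avg_by_city)
```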
Summary
Summary
Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience
Cassandra
• Goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key
Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources
Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector
Lambda Architecture with Spark/Cassandra
[Figure: Data Sources (e.g. Social) pass through a Channel into Data Collection. (Analytical) Batch Data Processing: Raw Data (Reservoir) -> Batch compute -> Result Store with Computed Information. (Analytical) Real-Time Data Processing: Messaging -> Stream/Event Processing -> Result Store. A Query Engine on top of the Result Stores provides Data Access for Reports, Services, Analytic Tools and Alerting Tools]
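The interplay of the batch path, the real-time path and the query side can be sketched as follows: the batch layer recomputes its view from all raw data, the speed layer updates a second view incrementally for events the batch has not yet seen, and a query merges both result stores. Function names, the event shape and the counting use case are assumptions for illustration.

```python
from collections import Counter


def batch_view(raw_events):
    """Batch layer: recompute the counts from all raw data of the last batch run."""
    return Counter(event["station"] for event in raw_events)


def update_speed_view(view, event):
    """Speed layer: incrementally add an event the batch layer has not seen yet."""
    view[event["station"]] += 1
    return view


def query(batch, speed, station):
    """Query engine: merge both result stores at read time."""
    return batch.get(station, 0) + speed.get(station, 0)


# Raw data (reservoir) as of the last batch run.
raw = [{"station": "a"}, {"station": "a"}, {"station": "b"}]
batch = batch_view(raw)

# Events that arrived after the batch run go through the speed layer.
speed = Counter()
for event in [{"station": "a"}, {"station": "c"}]:
    update_speed_view(speed, event)

answers = {s: query(batch, speed, s) for s in ("a", "b", "c", "d")}
print(answers)
```

Once the next batch run has absorbed the newer events, the speed view for that period can be discarded.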
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
