Real-Time Analytics with Apache Cassandra and Apache Spark

Guido Schmutz

Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz

Agenda
1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary

Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? Big Data + Real-Time = Stream Processing

What is Real-Time Analytics?
What is it?
• Short time to analyze & respond: Events → Analyze → Respond
Why do we need it?
• Required - for new business models
• Desired - for competitive advantage
How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and Dashboards access processed data

Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …

Motivation – Why Apache Spark?
Hadoop MapReduce: Data Sharing on Disk
Spark: Speed up processing by using Memory instead of Disks
[Diagram: Hadoop MapReduce runs map, reduce, . . . from Input to Output with an HDFS write and an HDFS read between steps; Spark runs op1, op2, . . . from Input to Output]
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 at UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; since 2014 part of the Apache Software Foundation

Apache Spark
Libraries
• Spark SQL (Batch Processing)
• BlinkDB (Approximate Querying)
• Spark Streaming (Real-Time)
• MLlib, SparkR (Machine Learning)
• GraphX (Graph Processing)
Core Runtime
• Spark Core API and Execution Model
Cluster Resource Managers
• Spark Standalone, Mesos, YARN
Data Stores
• HDFS, Elasticsearch, NoSQL, S3

Resilient Distributed Dataset (RDD)
Are
• Immutable
• Re-computable
• Fault tolerant
• Reusable
Have Transformations
• Produce new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...
Have Actions
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …
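
The split between lazy transformations and actions can be sketched in plain Python. ToyRDD below is a hypothetical, single-process stand-in and not Spark's API: each transformation only records a step and returns a new object, and nothing is computed until an action is called.

```python
import functools
from itertools import chain


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # input collection, never modified
        self._steps = steps    # recorded transformations (the lineage)

    # Transformations: record a step and return a NEW ToyRDD; nothing runs yet.
    def map(self, f):
        return ToyRDD(self._source, self._steps + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self._steps + (("filter", f),))

    def flatMap(self, f):
        return ToyRDD(self._source, self._steps + (("flatMap", f),))

    def _compute(self):
        # Replay the lineage from the source, which is why a lost
        # result can always be re-computed.
        data = iter(self._source)
        for kind, f in self._steps:
            if kind == "map":
                data = map(f, data)
            elif kind == "filter":
                data = filter(f, data)
            else:  # flatMap
                data = chain.from_iterable(map(f, data))
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return functools.reduce(f, self._compute())


lines = ToyRDD(["to be or", "not to be"])
words = lines.flatMap(str.split)                  # lazy: no work done yet
long_words = words.filter(lambda w: len(w) > 2)   # still lazy
print(words.count())          # action: runs flatMap
print(long_words.collect())   # action: runs flatMap + filter
```

Because every transformation returns a new object, lines stays usable after words and long_words have been derived from it.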

RDD
Input Source
• File
• Database
• Stream
• Collection
[Diagram: Data from an input source is loaded into an RDD; .count() -> 100]

Partitions RDD
[Diagram: the RDD's Data is split into Partition 0 to Partition 9, spread across Server 1 to Server 5]

Partitions RDD
[Diagram: the same Partition 0 to Partition 9, now spread across Server 2 to Server 5 only]

Spark Workflow
Input HDFS File → HadoopRDD → MappedRDD → MappedRDD → ShuffledRDD → Text File Output
• sc.hadoopFile() creates the HadoopRDD
• Stage 1 – flatMap() + map() produce the MappedRDDs (partitions P0, P1, P3)
• Stage 2 – reduceByKey() produces the ShuffledRDD (partition P0)
• saveAsTextFile() writes the Text File Output
Transformations (Lazy)
Action (Execute Transformations)
Master: DAG Scheduler
Spark Workflow
HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD
HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD
Both MappedRDDs → join() → ShuffledRDD → saveAsHadoopFile() → HDFS File Output
Transformations (Lazy)
Action (Execute Transformations)

Spark Execution Model
[Diagram: Master, Worker, Executors and Data Storage on a Server]

Stage 1 – flatMap() + map()
Narrow Transformation
• filter()
• map()
• sample()
• flatMap()
[Diagram: Master and four Workers, each with an Executor and Data Storage; RDD partitions P0, P1, P3]
Stage 2 – reduceByKey()
Wide Transformation
• join()
• reduceByKey()
• union()
• groupByKey()
Shuffle!
[Diagram: Master and four Workers, each with an Executor and Data Storage; RDD partition P0]
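
The difference between the two stages can be sketched in plain Python as a word count. This is a toy model under stated assumptions, not Spark code: stage1, shuffle and stage2 are hypothetical helper names, partitions are plain lists, and zlib.crc32 stands in for whatever partitioner a real cluster would use.

```python
import zlib
from collections import defaultdict


def stage1(partition):
    """Narrow: flatMap() + map() on one partition, no data leaves it."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions, num_reducers):
    """Wide: send every (key, value) pair to the reducer that owns the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped_partitions:
        for key, value in part:
            target = zlib.crc32(key.encode()) % num_reducers
            buckets[target][key].append(value)
    return buckets


def stage2(bucket):
    """reduceByKey(): all values of a key are now in the same bucket."""
    return {key: sum(values) for key, values in bucket.items()}


partitions = [["to be or", "not to be"], ["to do"]]
mapped = [stage1(p) for p in partitions]                        # Stage 1
reduced = [stage2(b) for b in shuffle(mapped, num_reducers=2)]  # Stage 2
totals = {k: v for part in reduced for k, v in part.items()}
print(totals)
```

Stage 1 can run on each partition independently, while Stage 2 cannot start for a key until the pairs for that key have been moved from all partitions, which is why the shuffle marks the stage boundary.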

Batch vs. Real-Time Processing
Batch: Petabytes of Data
Real-Time: Gigabytes Per Second

Apache Kafka
• Distributed publish-subscribe messaging system
• Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not use JMS API and standards
• Kafka maintains feeds of messages in topics
[Diagram: Producers publish to the Kafka Cluster; Consumers read from it]

Apache Kafka
[Diagram: a Weather Station publishes to a Temperature Topic and a Rainfall Topic on one Kafka Broker; each topic holds messages 1 to 6; a Temperature Processor and a Rainfall Processor consume them]
Apache Kafka
[Diagram: as before, but the Temperature Topic now has Partition 0 and Partition 1, each holding messages 1 to 6 and read by two Temperature Processors; the Rainfall Topic has a single Partition 0]
Apache Kafka
[Diagram: two Kafka Brokers, each holding the Temperature Topic and the Rainfall Topic; partitions P 0 and P 1, each with messages 1 to 5, appear on both brokers; a Weather Station produces, two Temperature Processors and one Rainfall Processor consume]
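
How a topic is broken into partitions can be sketched as a set of append-only lists. ToyTopic is a hypothetical in-memory model, not the Kafka client API, and zlib.crc32 is only a stand-in for a real partitioner. The point it illustrates: messages with the same key land in the same partition, and each message gets an increasing offset within that partition.

```python
import zlib


class ToyTopic:
    """Hypothetical in-memory model of a partitioned topic (not the Kafka API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        # A consumer keeps its own offset and reads everything from there on.
        return self.partitions[partition][offset:]


temperature = ToyTopic("temperature", num_partitions=2)
first = temperature.send("station-1", 21.5)
second = temperature.send("station-1", 21.7)
print(first, second)
print(temperature.read(first[0], 1))
```

Because reading does not remove messages, several processors can consume the same partition at their own pace, each from its own offset.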

Discretized Stream (DStream)
[Diagram: Weather Stations send events to Kafka]
[Diagram: Weather Stations send individual events to Kafka; the event stream is made discrete by time, and each time slice of the DStream is an RDD]
DStream → Transform → DStream
A transformation is applied to each batch of X Seconds, for example:
• .countByValue()
• .reduceByKey()
• .join
• .map
[Diagram: DStream transformation lineage with time increasing from time 1 to time n. At each time, the messages 1..n received from the Input Stream form one RDD of the Event DStream; map() applies f to each message, giving one RDD of the MappedDStream; saveAsHadoopFiles() writes results 1..n. Actions trigger Spark jobs.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV

Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
Input DStreams
• Represents the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
• Custom sources can easily be written for other data sources
Operations
• Same as Spark Core + additional stateful transformations (window, reduceByWindow)
Apache Cassandra
Apache Cassandra™ is a free
• Distributed…
• High performance…
• Extremely scalable…
• Fault tolerant (i.e. no single point of failure)…
post-relational database solution
Optimized for high write throughput

Apache Cassandra - History
Bigtable Dynamo

Motivation - Why NoSQL Databases?
• Dynamo Paper (2007)
• How to build a data store that is
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
• 24 other papers cited
• Evolutionary

• Google Bigtable (2006)
• Richer data model
• 1 key and lots of values
• Fast sequential access
• 38 other papers cited

• Cassandra Paper (2008)
• Distributed features of Dynamo
• Data model and storage from Bigtable
• Graduated to a top-level Apache project in February 2010

Apache Cassandra – More than one server
All nodes participate in a cluster
Shared nothing
Add or remove as needed
More capacity? Add more servers
Node is a basic unit inside a cluster
Each node owns a range of partitions
Consistent Hashing
[Diagram: ring of Node 1 to Node 4; the token ranges [0-25], [26-50], [51-75] and [76-100] are split among the nodes, and each range is stored on three of the four nodes]
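
How a partition key ends up on a set of nodes can be sketched with the 0-100 token space used on the slide. Everything here is an illustrative assumption: the assignment of ranges to nodes in RING, the crc32-based token function, and the rule of placing replicas on the next nodes around the ring. Cassandra itself uses a much larger token space and configurable replication strategies.

```python
import bisect
import zlib

# (upper bound of token range, owning node); the mapping is an assumption.
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]
UPPER_BOUNDS = [upper for upper, _ in RING]


def token(partition_key):
    """Hash a partition key onto the toy token space 0..100."""
    return zlib.crc32(partition_key.encode()) % 101


def replicas_for_token(tok, replication_factor=3):
    """Owner of the range containing tok, plus the next nodes on the ring."""
    first = bisect.bisect_left(UPPER_BOUNDS, tok)
    return [RING[(first + i) % len(RING)][1] for i in range(replication_factor)]


def replicas(partition_key, replication_factor=3):
    return replicas_for_token(token(partition_key), replication_factor)


print(replicas_for_token(10))   # token in [0-25]
print(replicas_for_token(80))   # token in [76-100], wraps around the ring
print(replicas("station-1"))
```

Any node can compute this placement from the key alone, so no central lookup is needed, and adding a node only changes ownership of the ranges next to it.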

Apache Cassandra – Fully Replicated
Client writes locally
Data syncs across WAN
Replication per Data Center
[Diagram: two data centers, West and East, each with Node 1 to Node 4; the Client writes to its local data center]

Apache Cassandra
What is Cassandra NOT?
• A Data Ocean
• A Data Lake
• A Data Pond
• An In-Memory Database
• A Key-Value Store
• A Data Warehouse
What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data

How Cassandra stores data
• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)
Row Key | Column Name  | … | Column Name
        | Column Value |   | Column Value
        | Timestamp    |   | Timestamp
        | TTL          |   | TTL
Columns per row: 1 … 2 billion
Billions of rows
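
The row layout above can be sketched as a dictionary of rows, each holding columns with a value, a write timestamp and an optional TTL. ToyTable is a hypothetical in-memory model, not a Cassandra driver; timestamps and TTLs are plain numbers in the same unit, and the station and column names are made up for the example.

```python
class ToyTable:
    """Hypothetical in-memory model of the row key / sorted columns layout."""

    def __init__(self):
        # row key -> {column name -> (value, timestamp, ttl)}
        self.rows = {}

    def write(self, row_key, column, value, timestamp, ttl=None):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        # The write with the newest timestamp wins.
        if current is None or timestamp >= current[1]:
            columns[column] = (value, timestamp, ttl)

    def read(self, row_key, now):
        columns = self.rows.get(row_key, {})
        # Columns come back sorted by name; expired columns are skipped.
        return [
            (name, value)
            for name, (value, ts, ttl) in sorted(columns.items())
            if ttl is None or now < ts + ttl
        ]


table = ToyTable()
table.write("station-1", "2015-06-01T10:00", 21.5, timestamp=100)
table.write("station-1", "2015-06-01T09:00", 20.0, timestamp=101, ttl=10)
table.write("station-1", "2015-06-01T10:00", 22.0, timestamp=90)  # older, ignored
print(table.read("station-1", now=105))
print(table.read("station-1", now=200))
```

Using the measurement time as the column name keeps a station's readings sorted inside one row, which is the usual shape of the time series use case listed earlier.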

Spark and Cassandra Architecture – Great Combo
Spark: good at analyzing a huge amount of data
Cassandra: good at storing a huge amount of data

Spark and Cassandra Architecture
Spark Streaming
(Near Real-Time)
SparkSQL
(Structured Data)
MLlib
(Machine Learning)
GraphX
(Graph Analysis)

Spark Connector
[Diagram: several Weather Stations as data sources; Spark Streaming (Near Real-Time), SparkSQL (Structured Data), MLlib (Machine Learning) and GraphX (Graph Analysis) sit on top of the Spark Connector]

• Single node running Cassandra
• Spark Worker is really small
• Spark Master lives outside a node
• Spark Worker starts Spark Executor in a separate JVM
• Node local
[Diagram: Master outside the Server; on the Server, a Worker and its Executors]

• Each node runs Spark and Cassandra
• Spark Master can make decisions based on Token Ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
[Diagram: Master and four Workers, one per token range 0-25, 26-50, 51-75, 76-100; each Worker will only have to analyze 25% of the data]
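
The idea of scheduling by token range can be sketched as follows. TOKEN_RANGES and plan_tasks are hypothetical names, and the node-to-range mapping is an assumption based on the four ranges on the slide. The real connector's planning is more involved; this only shows why each worker touches just its local share of the data.

```python
# Token range owned by each node (assumed mapping, toy 0..100 token space).
TOKEN_RANGES = {
    "Node 1": (0, 25),
    "Node 2": (26, 50),
    "Node 3": (51, 75),
    "Node 4": (76, 100),
}


def plan_tasks(low, high):
    """One task per overlapping token range, pinned to the node that owns it."""
    tasks = []
    for node, (start, end) in TOKEN_RANGES.items():
        if start <= high and low <= end:
            tasks.append((node, (max(low, start), min(high, end))))
    return tasks


full_scan = plan_tasks(0, 100)     # every worker reads only its own range
narrow_scan = plan_tasks(30, 40)   # a single worker is involved
print(full_scan)
print(narrow_scan)
```

In a full scan each of the four workers reads one quarter of the token space from its local node, so no data has to cross the network before the first stage runs.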

[Diagram: Master and four Workers; two rings with token ranges 0-25, 26-50, 51-75, 76-100, one labelled Transactional and one labelled Analytics]

Cassandra and Spark
Feature                  | Cassandra | Cassandra & Spark
Joins and Unions         | No        | Yes
Transformations          | Limited   | Yes
Outside Data Integration | No        | Yes
Aggregations             | Limited   | Yes

Summary
Kafka
• Topics store information broken into
partitions
• Brokers store partitions
• Partitions are replicated for data
resilience
Cassandra
• Goals of Apache Cassandra are all
about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a
partition key
Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources
Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the
DataStax connector

Lambda Architecture with Spark/Cassandra
Data Sources (e.g. Social) → Data Collection → Channel / Messaging
(Analytical) Batch Data Processing
• Raw Data (Reservoir) → Batch compute → Result Store (Computed Information)
(Analytical) Real-Time Data Processing
• Stream/Event Processing → Result Store
Data Access
• Query Engine over the Result Stores
• Reports, Service, Analytic Tools, Alerting Tools
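
The way the batch and real-time paths come together at query time can be sketched in a few lines. The names batch_view, SpeedView and query are hypothetical, the events are made-up (station, temperature) pairs, and counting events per station is just a placeholder for whatever is actually computed.

```python
from collections import Counter


def batch_view(raw_events):
    """Batch path: recompute the full result from the raw data reservoir."""
    return Counter(station for station, _ in raw_events)


class SpeedView:
    """Real-time path: result updated incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        station, _ = event
        self.counts[station] += 1


def query(station, batch, speed):
    """Data access: merge the batch result with the real-time result."""
    return batch[station] + speed.counts[station]


raw = [("station-1", 21.5), ("station-2", 19.0), ("station-1", 21.7)]
batch = batch_view(raw)              # covers everything up to the last batch run

speed = SpeedView()
speed.on_event(("station-1", 22.0))  # arrived after the last batch run

print(query("station-1", batch, speed))
```

The batch result is complete but stale, the real-time result is fresh but partial, and the query layer hides that split from reports and services.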

Real-Time Analytics with Apache Cassandra and Apache Spark

Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
