@doanduyhai
Real time data processing
with Spark & Cassandra
DuyHai DOAN, Technical Advocate
@doanduyhai
Who Am I ?!
Duy Hai DOAN
Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
2
@doanduyhai
Datastax!
•  Founded in April 2010 
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquartered in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features
3
Spark & Cassandra Integration!
Spark & its eco-system!
Cassandra & token ranges!
Stand-alone cluster deployment!
@doanduyhai
What is Apache Spark ?!
Created at UC Berkeley (AMPLab)

Open-sourced in 2010, Apache project since 2013

General data processing framework

MapReduce is not the A & Ω

One-framework-many-components approach
5
@doanduyhai
Spark characteristics!
Fast
•  10x-100x faster than Hadoop MapReduce
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell

6
@doanduyhai
Spark code example!
Setup
Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)
val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)

val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))
7
@doanduyhai
RDDs!
RDD = Resilient Distributed Dataset

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

// count people per age (applied to extractAge: groupByAge has one element per key)
val countByAge: Map[Int, Long] = extractAge.countByKey()
8
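The pipeline above can be emulated in plain Python (no Spark involved) to see what each step computes — a toy sketch of the map/groupByKey/countByKey semantics, not Spark or connector code:

```python
# Plain-Python emulation of the RDD pipeline above (no Spark):
# map(tuple => (age, tuple)) -> groupByKey() -> countByKey()
from collections import defaultdict

people = [("jdoe", "John DOE", 33),
          ("hsue", "Helen SUE", 24),
          ("rsmith", "Richard Smith", 33)]

# map: key each record by its age (third field)
extract_age = [(p[2], p) for p in people]

# groupByKey: gather records sharing the same age
group_by_age = defaultdict(list)
for age, person in extract_age:
    group_by_age[age].append(person)

# countByKey: number of records per age
count_by_age = {age: len(rows) for age, rows in group_by_age.items()}
print(count_by_age)  # {33: 2, 24: 1}
```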
@doanduyhai
RDDs!
RDD[A] = distributed collection of A 
•  RDD[Person]
•  RDD[(String,Int)], …

RDD[A] split into partitions

Partitions distributed over n workers → parallel computing
9
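The partitioning idea can be sketched in plain Python — a toy model (not Spark code) of a collection split into partitions, each processed independently; Spark would run each partition on a different worker in parallel:

```python
# Toy sketch: an "RDD-like" collection split into partitions,
# each partition processed independently (here sequentially).
def partition(data, n):
    """Split data into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element, partition by partition."""
    return [[fn(x) for x in part] for part in partitions]

data = list(range(10))
parts = partition(data, 3)
squared = map_partitions(parts, lambda x: x * x)
print(parts)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(squared)  # [[0, 1, 4, 9], [16, 25, 36, 49], [64, 81]]
```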
@doanduyhai
Spark eco-system!
[diagram: Spark stack]
•  Components: Spark SQL | Spark Streaming | MLlib | GraphX | …
•  Spark Core Engine (Scala/Java/Python)
•  Cluster Manager: Local | Standalone cluster | YARN | Mesos
•  Persistence
10
@doanduyhai
What is Apache Cassandra?!
Created at Facebook

Apache Project since 2009

Distributed NoSQL database

Eventual consistency (A & P of the CAP theorem)

Distributed table abstraction
12
@doanduyhai
Cassandra data distribution reminder!
Random: hash of #partition → token = hash(#p)

Hash: ]-X, X]

X = huge number (2^64/2 = 2^63)

[diagram: ring of 8 nodes n1…n8]
13
@doanduyhai
Cassandra token ranges!
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

Murmur3 hash function
[diagram: 8-node ring, token ranges A–H assigned to nodes n1…n8]
14
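The range assignment above can be sketched numerically — a minimal sketch assuming the idealized eight equal ranges shown on the slide, with X = 2^63 (the Murmur3 upper bound):

```python
# Sketch: locate the token range ]i*X/8, (i+1)*X/8] owning a given token,
# for an idealized 8-range ring over ]0, X] with X = 2**63.
X = 2**63
NB_RANGES = 8

def owning_range(token: int) -> str:
    """Return the label (A..H) of the range holding token, for 0 < token <= X."""
    width = X // NB_RANGES
    # index i such that i*width < token <= (i+1)*width
    i = (token - 1) // width
    return "ABCDEFGH"[i]

print(owning_range(1))       # A  (falls in ]0, X/8])
print(owning_range(X // 2))  # D  (X/2 = 4X/8, upper bound of ]3X/8, 4X/8])
print(owning_range(X))       # H  (upper bound of ]7X/8, X])
```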
@doanduyhai
Linear scalability!
[diagram: 8-node ring, ranges A–H, with user_id1…user_id5 mapped onto the ring]
15
@doanduyhai
Cassandra Query Language (CQL)!

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

UPDATE users SET age = 34 WHERE login = 'jdoe';

DELETE age FROM users WHERE login = 'jdoe';

SELECT age FROM users WHERE login = 'jdoe';
17
@doanduyhai
Why Spark on Cassandra ?!
For Spark:
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → Dataframe API)
•  Multi data-center !!!
18
@doanduyhai
Why Spark on Cassandra ?!
For Spark:
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → Dataframe API)
•  Multi data-center !!!

For Cassandra:
•  Cross-table operations (JOIN, UNION, etc.)
•  Real-time/batch processing
•  Complex analytics (e.g. machine learning)
19
@doanduyhai
Use Cases!
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize data
•  Schema migration, data conversion
20
@doanduyhai
Cluster deployment!
[diagram: Spark Master (SparkM) plus five nodes, each running Cassandra (C*) co-located with a Spark Worker (SparkW)]
Stand-alone cluster
21
@doanduyhai
Cluster deployment!
[diagram: Driver Program → Spark Master → Spark Workers, one Executor each, co-located with C* processes]
Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark Worker
22
Spark & Cassandra Connector!
Core API!
SparkSQL!
SparkStreaming!
@doanduyhai
Connector architecture!
All Cassandra types supported and converted to Scala types

Server side data filtering (SELECT … WHERE …)

Uses the Java driver underneath
Scala and Java support
24
@doanduyhai
Connector architecture – Core API!
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples 


25
@doanduyhai
Connector architecture – Spark SQL !

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
26
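The point of pushing predicates down can be illustrated with a toy model (pure Python, not the connector's actual code): when the source applies the filter itself, far fewer rows cross the wire before Spark sees them.

```python
# Toy model of predicate pushdown: filtering at the source
# vs fetching everything and filtering afterwards.
rows = [{"login": f"user{i}", "email": f"user{i}@example.com"} for i in range(1000)]
rows.append({"login": "jdoe", "email": "jdoe@example.com"})

def scan(source, predicate=None):
    """Simulated source scan; returns (rows_shipped, result)."""
    if predicate is not None:                    # pushdown: filter "server-side"
        matched = [r for r in source if predicate(r)]
        return len(matched), matched
    return len(source), list(source)             # no pushdown: ship everything

# Without pushdown: all rows shipped, filtered afterwards
shipped_all, everything = scan(rows)
result_late = [r for r in everything if r["login"] == "jdoe"]

# With pushdown: only the matching row is shipped
shipped_few, result_early = scan(rows, lambda r: r["login"] == "jdoe")

print(shipped_all, shipped_few)  # 1001 1
```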
@doanduyhai
Connector architecture – Spark Streaming !

Streaming data INTO Cassandra table
•  trivial setup
•  be careful about your Cassandra data model !!!
Streaming data OUT of Cassandra tables ?
•  work in progress …
27
Connector API !
Connector API!
Data Locality Implementation!
@doanduyhai
Connector API!
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
29
@doanduyhai
Connector API!
Preparing test data
CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);
30
@doanduyhai
Connector API!
Reading from Cassandra
// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames    // Stream(word, count)
rdd.size           // 2

val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count")  // Int = 30
31
@doanduyhai
Connector API!
Writing data to Cassandra
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50
32
@doanduyhai
Remember token ranges ?!
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
[diagram: 8-node ring with token ranges A–H]
33
@doanduyhai
Data Locality!
[diagram: co-located C*/Spark cluster — each Spark RDD partition maps onto the Cassandra token ranges of the local node]
34
@doanduyhai
Data Locality!
[diagram: co-located Cassandra/Spark stand-alone cluster]
Use Murmur3Partitioner

35
@doanduyhai
Read data locality!
Read from Cassandra
Spark shuffle operations
36
@doanduyhai
Repartition before write !
Write to Cassandra
rdd.repartitionByCassandraReplica("keyspace","table")
37
@doanduyhai
Or async batch writes!
Async batches fan-out writes to Cassandra
Spark shuffle operations
38
@doanduyhai
Write data locality!
39
•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra with async batches
•  in any case, there will be data movement over the network (sorry, no magic)
@doanduyhai
Joins with data locality!
40

CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));


CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));
val join: CassandraJoinRDD[(String,Int), (String,String)] =
sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
// Select only useful columns for join and processing
.select("artist","year")
.as((_:String, _:Int))
// Repartition RDDs by "artists" PK, which is "name"
.repartitionByCassandraReplica(KEYSPACE, ARTISTS)
// Join with "artists" table, selecting only "name" and "country" columns
.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
.on(SomeColumns("name"))
@doanduyhai
Joins pipeline with data locality!
41
val join: CassandraJoinRDD[(String,Int), (String,String)] =
sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
// Select only useful columns for join and processing
.select("artist","year")
.as((_:String, _:Int))
.repartitionByCassandraReplica(KEYSPACE, ARTISTS)
.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
.on(SomeColumns("name"))
.map(…)
.filter(…)
.groupByKey()
.mapValues(…)
.repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
.joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
…
@doanduyhai
Perfect data locality scenario!
42
•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica() → to a table having the same partition key as the original table
•  save back into this Cassandra table
Demo
https://github.com/doanduyhai/Cassandra-Spark-Demo
@doanduyhai
What’s for future ?!
Datastax Enterprise 4.7 
•  Cassandra + Spark + Solr as your analytics platform

Filter out as much data as possible with Solr, inside Cassandra

Fetch the filtered data in Spark and perform aggregations

Save back final data into Cassandra

44
@doanduyhai
What’s for future ?!
What about data locality?
45
@doanduyhai
val join: CassandraJoinRDD[(String,Int), (String,String)] =
sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
// Select only useful columns for join and processing
.select("artist","year").where("solr_query = 'style:*rock* AND ratings:[3 TO *]' ")
.as((_:String, _:Int))
.repartitionByCassandraReplica(KEYSPACE, ARTISTS)
.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
.on(SomeColumns("name")).where("solr_query = 'age:[20 TO 30]' ")
What’s for future ?!
1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan out !)
3.  fetch data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary 
46
@doanduyhai
What’s for future ?!
47

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8    -- targets token range D: ]3X/8, 4X/8]
AND solr_query = 'full text search expression';

1.  Advantages of same-JVM Cassandra + Solr integration
2.  Single-pass local full text search (no fan out)
3.  Data retrieval
Q & R
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
