@doanduyhai
Real-time data processing with Spark & Cassandra
DuyHai DOAN, Technical Advocate
Who am I?
Duy Hai DOAN
Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features
Spark & Cassandra Integration
•  Spark & its eco-system
•  Cassandra & token ranges
•  Stand-alone cluster deployment

What is Apache Spark?
Created at UC Berkeley AMPLab

Open-sourced in 2010, Apache project since 2013

General data processing framework

MapReduce is not the alpha & omega

One-framework-many-components approach
Spark characteristics
Fast
•  Up to 10x-100x faster than Hadoop MapReduce, depending on the workload
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell

Spark code example
Setup
import org.apache.spark.{SparkConf, SparkContext}

// Local master with 3 worker threads
val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)
Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)
val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))
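RDDs
RDD = Resilient Distributed Dataset

import org.apache.spark.rdd.RDD

// Distribute the local list over the cluster
val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

// Key each person by age (third tuple element)
val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

// Group persons having the same age
val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

// Count persons per age; countByKey() runs on the ungrouped pairs,
// since after groupByKey() every key appears exactly once
val countByAge: scala.collection.Map[Int, Long] = extractAge.countByKey()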
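To check what the count-by-age pipeline is meant to produce, here is a minimal sketch of the same logic on plain Scala collections, without Spark. The object name `CountByAgeSketch` is an illustration only and does not come from the talk.

```scala
object CountByAgeSketch {
  // Same sample data as in the Spark example: (login, name, age)
  val people: List[(String, String, Int)] = List(
    ("jdoe", "John DOE", 33),
    ("hsue", "Helen SUE", 24),
    ("rsmith", "Richard Smith", 33))

  // Group persons by age, then count the members of each group
  def countByAge(ps: List[(String, String, Int)]): Map[Int, Long] =
    ps.groupBy(_._3).map { case (age, group) => age -> group.size.toLong }

  def main(args: Array[String]): Unit =
    println(countByAge(people))
}
```

Two persons are 33 and one is 24, so the Spark version should report the same counts per age.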
RDDs
RDD[A] = distributed collection of A 
•  RDD[Person]
•  RDD[(String,Int)], …

RDD[A] split into partitions

Partitions distributed over n workers → parallel computing
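The idea of splitting a collection into partitions and processing them in parallel can be sketched without Spark. This is only an analogy: threads on one JVM stand in for Spark workers, and `PartitionSketch`, `split` and `parallelSum` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionSketch {
  // Split a collection into at most numPartitions chunks of near-equal size
  def split[A](data: Seq[A], numPartitions: Int): List[Seq[A]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(chunkSize).toList
  }

  // Each partition is summed by its own task, then the partial results are merged
  def parallelSum(data: Seq[Int], numPartitions: Int): Int = {
    val partials = split(data, numPartitions).map(part => Future(part.sum))
    Await.result(Future.sequence(partials), 10.seconds).sum
  }
}
```

For example, `PartitionSketch.split(1 to 10, 3)` yields chunks of 4, 4 and 2 elements.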
Spark eco-system
Components: Spark Streaming, MLlib, GraphX, Spark SQL
Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local, Standalone cluster, YARN, Mesos
Persistence: …
What is Apache Cassandra?
Created at Facebook

Apache project since 2009

Distributed NoSQL database

Eventual consistency (A & P of the CAP theorem)

Distributed table abstraction
Cassandra data distribution reminder
Random: hash of #partition → token = hash(#p)

Hash: ]-X, X]

X = huge number (2^64 / 2)

[Figure: ring of 8 nodes, n1 to n8]
Cassandra token ranges
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8, 5X/8]
F: ] 5X/8, 6X/8]
G: ] 6X/8, 7X/8]
H: ] 7X/8, X]

Murmur3 hash function
[Figure: ring of 8 nodes, n1 to n8, each owning one of the token ranges A to H]
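The range table above can be turned into a small lookup. This is a minimal sketch assuming, as the slide does, 8 equal ranges over ]0, X]; real Murmur3 tokens are signed 64-bit values, so the numbers are a simplification. `TokenRangeSketch` and `rangeOf` are hypothetical names.

```scala
object TokenRangeSketch {
  // X = 2^64 / 2, as on the slide; BigInt because 2^63 overflows Long
  val X: BigInt = BigInt(2).pow(64) / 2
  val RangeCount = 8

  // Name ('A' to 'H') of the range ]k*X/8, (k+1)*X/8] containing the token
  def rangeOf(token: BigInt): Char = {
    require(token > 0 && token <= X, "token must be in ]0, X]")
    // Subtracting 1 makes the upper bound of each range inclusive
    val k = ((token * RangeCount - 1) / X).toInt
    ('A' + k).toChar
  }
}
```

The upper bound of each range is inclusive: `rangeOf(X / 8)` is still `'A'`, while `rangeOf(X / 8 + 1)` is `'B'`.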
Linear scalability
[Figure: ring of 8 nodes (n1 to n8, token ranges A to H); partitions user_id1 to user_id5 are spread over the ring according to their token]
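How partition keys spread over the ring can be sketched as key → token → owning node. This sketch makes two assumptions: it uses the 32-bit `MurmurHash3` from the Scala standard library as a stand-in for Cassandra's Murmur3 partitioner, and it gives every node one range of equal width. `RingSketch`, `token` and `nodeFor` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object RingSketch {
  // Token of a partition key (32-bit stand-in for Cassandra's 64-bit tokens)
  def token(partitionKey: String): Int = MurmurHash3.stringHash(partitionKey)

  // Index (0 until nodeCount) of the node owning the token, with equal-width ranges
  def nodeFor(token: Int, nodeCount: Int): Int = {
    require(nodeCount > 0, "nodeCount must be positive")
    val offset = token.toLong - Int.MinValue // shift the token to [0, 2^32 - 1]
    ((offset * nodeCount) >> 32).toInt
  }

  def main(args: Array[String]): Unit = {
    val users = (1 to 5).map(i => s"user_id$i")
    users.foreach(u => println(s"$u -> n${nodeFor(token(u), 8) + 1}"))
  }
}
```

Doubling the node count halves the width of every range, which is the intuition behind linear scalability: each node owns a proportionally smaller share of the tokens.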
Cassandra Query Language (CQL)

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

UPDATE users SET age = 34 WHERE login = 'jdoe';

DELETE age FROM users WHERE login = 'jdoe';

SELECT age FROM users WHERE login = 'jdoe';
Why Spark on Cassandra?
For Spark
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → DataFrame API)
•  Multi data-center !!!
For Cassandra
•  Cross-table operations (JOIN, UNION, etc.)
•  Real-time/batch processing
•  Complex analytics (e.g. machine learning)
Use Cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize data
•  Schema migration, data conversion
Cluster deployment
Stand-alone cluster
[Figure: 5 nodes, each running Cassandra (C*) and a Spark worker (SparkW); one node also runs the Spark master (SparkM)]
Cluster deployment
Cassandra – Spark placement
1 Cassandra process ⟷ 1 Spark worker
[Figure: Driver Program, Spark Master and 4 Spark Workers, each worker with one Executor and one co-located Cassandra (C*) process]
Spark & Cassandra Connector
•  Core API
•  Spark SQL
•  Spark Streaming

Connector architecture
All Cassandra types supported and converted to Scala types

Server side data filtering (SELECT … WHERE …)

Uses the Java driver underneath

Scala and Java support
Connector architecture – Core API
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples 
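The three mapping styles can be illustrated without the connector. In this minimal sketch a plain `Map[String, Any]` is a hypothetical stand-in for a generic row, and `WordCount`, `RowMappingSketch` and its methods are illustrative names rather than connector API.

```scala
final case class WordCount(word: String, count: Int)

object RowMappingSketch {
  // Hypothetical stand-in for a generic row: column name -> value
  type Row = Map[String, Any]

  // Generic access by column name, converting to the requested type
  def getInt(row: Row, column: String): Int = row(column).asInstanceOf[Int]

  // Object-mapper style: one case class field per column
  def toCaseClass(row: Row): WordCount =
    WordCount(row("word").asInstanceOf[String], getInt(row, "count"))

  // Tuple style: columns by position
  def toTuple(row: Row): (String, Int) =
    (row("word").asInstanceOf[String], getInt(row, "count"))
}
```

Generic rows are the most flexible, while case classes and tuples give typed access at the cost of fixing the column set up front.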


Connector architecture – Spark SQL

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
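Predicate push-down can be pictured as splitting the query's predicates into those Cassandra can evaluate and those left to Spark. The sketch below assumes a much simpler rule than the connector's real planner: only equality predicates on the partition key are pushed. `PushdownSketch`, `Equals` and the `domain` column used in the usage note are hypothetical, and values are not escaped.

```scala
object PushdownSketch {
  // Simplified equality predicate: column = 'value'
  final case class Equals(column: String, value: String)

  // Predicates on the partition key go to CQL; the rest stay in Spark
  def split(predicates: Seq[Equals], partitionKey: Set[String]): (Seq[Equals], Seq[Equals]) =
    predicates.partition(p => partitionKey.contains(p.column))

  // Render the pushed-down predicates as a CQL statement (illustration only, no escaping)
  def toCql(table: String, pushed: Seq[Equals]): String = {
    val where = pushed.map(p => s"${p.column} = '${p.value}'").mkString(" AND ")
    if (where.isEmpty) s"SELECT * FROM $table" else s"SELECT * FROM $table WHERE $where"
  }
}
```

With predicates on `login` and on a hypothetical `domain` column, and `login` as partition key, only the `login` predicate ends up in the CQL statement; the other is applied by Spark on the rows that come back.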
Connector architecture – Spark Streaming

Streaming data INTO Cassandra table
•  trivial setup
•  be careful about your Cassandra data model !!!
Streaming data OUT of Cassandra tables ?
•  work in progress …
Connector API
•  Connector API
•  Data Locality Implementation

Connector API
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
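Connector API
Preparing test data
-- The test keyspace must exist first (these replication settings suit a single-node test cluster)
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);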
Connector API
Reading from Cassandra
// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.collect().foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.columnNames      // Stream(word, count)
firstRow.size             // 2
firstRow.getInt("count")  // Int = 30
Connector API
Writing data to Cassandra
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", SomeColumns("word", "count"))
SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50
Remember token ranges?
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8, 5X/8]
F: ] 5X/8, 6X/8]
G: ] 6X/8, 7X/8]
H: ] 7X/8, X]
[Figure: ring of 8 nodes, n1 to n8, each owning one of the token ranges A to H]
Data Locality
[Figure: 5 nodes, each running Cassandra (C*) and a Spark worker (SparkW), one also running the Spark master (SparkM); the partitions of a Spark RDD are mapped to Cassandra token ranges]
Use Murmur3Partitioner
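The mapping between Cassandra token ranges and Spark partitions can be sketched as grouping ranges by owning node, so that each Spark partition has a preferred host. This is a simplified model: the real connector also splits and merges ranges by estimated size. All names here (`LocalitySketch`, `TokenRange`, `SparkPartition`, `partitionsFor`) are hypothetical.

```scala
object LocalitySketch {
  final case class TokenRange(name: String, owner: String)
  final case class SparkPartition(index: Int, ranges: Seq[TokenRange], preferredHost: String)

  // One Spark partition per Cassandra node, holding the token ranges that node owns
  def partitionsFor(ranges: Seq[TokenRange]): Seq[SparkPartition] =
    ranges.groupBy(_.owner).toSeq.sortBy(_._1).zipWithIndex.map {
      case ((owner, owned), index) => SparkPartition(index, owned, owner)
    }
}
```

A scheduler that honours `preferredHost` runs each task on the node that already stores the rows of its token ranges, so reads stay local.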

Read data locality
Read from Cassandra
Spark shuffle operations
Repartition before write
Write to Cassandra
rdd.repartitionByCassandraReplica("keyspace","table")
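Conceptually, repartitioning by replica groups rows by the node that owns their partition key, so that each group can then be written locally. A minimal sketch on plain collections follows; the `ownerOf` function is a hypothetical stand-in for the token-based lookup, and `RepartitionSketch` is not connector API.

```scala
object RepartitionSketch {
  // Group rows by the node owning their partition key, so each group can be written locally
  def byReplica[K, V](rows: Seq[(K, V)], ownerOf: K => String): Map[String, Seq[(K, V)]] =
    rows.groupBy { case (key, _) => ownerOf(key) }
}
```

In Spark this grouping is a shuffle, so the data still crosses the network once before the local writes.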
Or async batch writes
Async batches fan-out writes to Cassandra
Spark shuffle operations
Write data locality
•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra by async batches
•  in any case, there will be data movement on network (sorry no magic)
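The async-batch option can be sketched with Scala futures: rows are cut into batches, every batch is sent without waiting for the previous one, and the caller waits once for all of them. `writeBatch` is a hypothetical callback standing in for an asynchronous Cassandra write; the batch size and the timeout are arbitrary assumptions.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncBatchSketch {
  // Write rows in batches that are all in flight concurrently;
  // returns the total number of rows reported as written.
  def flush[A](rows: Seq[A], batchSize: Int)(writeBatch: Seq[A] => Future[Int])
              (implicit ec: ExecutionContext): Int = {
    require(batchSize > 0, "batchSize must be positive")
    // toList forces the iterator, so every batch is started before we wait
    val inFlight = rows.grouped(batchSize).map(writeBatch).toList
    Await.result(Future.sequence(inFlight), 30.seconds).sum
  }
}
```

A production writer would also cap the number of in-flight batches to avoid overloading the cluster; that is left out here for brevity.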
Joins with data locality

CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));

CREATE TABLE albums(title text, artist text, year int, … PRIMARY KEY(title));

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year")
    .as((_:String, _:Int))
    // Repartition RDDs by "artists" PK, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with "artists" table, selecting only "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
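The shape of the join result can be sketched on plain collections: each (artist, year) row from `albums` is looked up by key in `artists`, and rows without a match are dropped. The in-memory `Map` is a hypothetical stand-in for the per-key Cassandra lookup, and `JoinSketch` is an illustrative name.

```scala
object JoinSketch {
  // albums: (artist, year) rows; artists: name -> country
  // Result pairs each album row with the matching (name, country) row
  def joinByKey(albums: Seq[(String, Int)], artists: Map[String, String])
      : Seq[((String, Int), (String, String))] =
    albums.flatMap { case (artist, year) =>
      // One key lookup per left-hand row; unmatched rows produce nothing
      artists.get(artist).map(country => ((artist, year), (artist, country))).toList
    }
}
```

Because the left side is first repartitioned by the replica owning each artist name, every lookup in the real pipeline can be served by the local Cassandra node.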
Joins pipeline with data locality
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year")
    .as((_:String, _:Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .map(…)
    .filter(…)
    .groupByKey()
    .mapValues(…)
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
    .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
    …
Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
   → to a table having the same partition key as the original table
•  save back into this Cassandra table
Demo
https://github.com/doanduyhai/Cassandra-Spark-Demo
What's in the future?
DataStax Enterprise 4.7
•  Cassandra + Spark + Solr as your analytics platform

Filter out as much data as possible from Cassandra with Solr

Fetch the filtered data in Spark and perform aggregations

Save the final data back into Cassandra

What's in the future?
What about data locality?
What's in the future?
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year").where("solr_query = 'style:*rock* AND ratings:[3 TO *]' ")
    .as((_:String, _:Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name")).where("solr_query = 'age:[20 TO 30]' ")

1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan out !)
3.  fetch data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary
What's in the future?

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8
AND solr_query = 'full text search expression';

D: ] 3X/8, 4X/8]

Advantages of same JVM Cassandra + Solr integration
1.  Single-pass local full text search (no fan out)
2.  Data retrieval
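The per-partition query above can be generated from a token range. This is a minimal sketch that only builds the statement text; `TokenRangeQuerySketch` and `rangeQuery` are hypothetical names, the table and key in the usage note are placeholders, and the Solr expression is inserted without escaping.

```scala
object TokenRangeQuerySketch {
  // CQL restricted to the token range ]start, end], plus a Solr filter.
  // Illustration only: inputs are interpolated as-is, without escaping.
  def rangeQuery(table: String, partitionKey: String,
                 start: BigInt, end: BigInt, solrQuery: String): String =
    s"SELECT * FROM $table " +
      s"WHERE token($partitionKey) > $start AND token($partitionKey) <= $end " +
      s"AND solr_query = '$solrQuery'"
}
```

For instance, `rangeQuery("albums", "title", BigInt(10), BigInt(20), "style:*rock*")` restricts both the token scan and the Solr search to one range, which is what keeps the filtering local to the node owning that range.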
Q & A
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Paris 2015
