Spark-Cassandra Integration 2016
DuyHai DOAN
Apache Cassandra Evangelist
@doanduyhai
Main use-cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize, transform data
•  Schema migration, data conversion
Data import
•  Read data from CSV and dump it into Cassandra?
☞ Spark job to distribute the import!
Load data from various sources
Demo
Data cleaning
Sanitize, validate, normalize, transform data
•  Bugs in your application?
•  Dirty input data?
☞ Spark job to clean it up!
Demo
Schema migration
•  Business requirements change over time?
•  Current data model no longer relevant?
☞ Spark job to migrate the data!
Schema migration, data conversion
Demo
Analytics
Given existing tables of performers and albums, I want:
①  the top 10 most common music styles (pop, rock, RnB, …)
②  performer productivity (album count) by origin country and by decade
☞ Spark job to compute the analytics!
Analytics (join, aggregate, transform, …)
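The two questions above boil down to a group-by and a count. Below is a minimal plain-Python sketch of that aggregation logic, assuming hypothetical performer records with `name`, `country` and `styles`, and album records with `performer` and `year`. The field names and sample data are illustrative and are not taken from the talk's tables. A Spark job would distribute the same steps across the cluster.

```python
from collections import Counter

# Hypothetical sample rows; field names are assumptions for illustration.
performers = [
    {"name": "A", "country": "USA", "styles": ["pop", "rock"]},
    {"name": "B", "country": "France", "styles": ["rock"]},
    {"name": "C", "country": "USA", "styles": ["RnB", "pop"]},
]
albums = [
    {"performer": "A", "year": 1985},
    {"performer": "A", "year": 1992},
    {"performer": "B", "year": 1998},
    {"performer": "C", "year": 2003},
]


def top_styles(performers, n=10):
    """Count how many performers list each style and return the n most common."""
    counts = Counter(style for p in performers for style in p["styles"])
    return counts.most_common(n)


def albums_by_country_and_decade(performers, albums):
    """Join albums to performers by name, then count albums per (country, decade)."""
    country_of = {p["name"]: p["country"] for p in performers}
    counts = Counter()
    for album in albums:
        country = country_of.get(album["performer"])
        if country is None:
            continue  # album without a known performer: skipped
        decade = album["year"] // 10 * 10
        counts[(country, decade)] += 1
    return dict(counts)


print(top_styles(performers))
print(albums_by_country_and_decade(performers, albums))
```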
Connector Architecture
•  Cluster Deployment
•  Data Locality
•  Failure Handling
•  Cross DC/cluster operations
Cluster Deployment
•  Stand-alone cluster
[Diagram: five nodes, each running Cassandra (C*) and a Spark worker (SparkW); one node also hosts the Spark master (SparkM)]
Data Locality – remember token ranges?
A: $\left]-x, -\frac{3x}{4}\right]$
B: $\left]-\frac{3x}{4}, -\frac{2x}{4}\right]$
C: $\left]-\frac{2x}{4}, -\frac{x}{4}\right]$
D: $\left]-\frac{x}{4}, 0\right]$
E: $\left]0, \frac{x}{4}\right]$
F: $\left]\frac{x}{4}, \frac{2x}{4}\right]$
G: $\left]\frac{2x}{4}, \frac{3x}{4}\right]$
H: $\left]\frac{3x}{4}, x\right]$
[Diagram: ring of eight Cassandra nodes (C*), with token ranges A to H]
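To make the ring concrete, here is a small plain-Python model that splits a token space ]−x, x] into eight equal ranges A to H and looks up which range owns a token. The value of `X` and the function names are assumptions for illustration. A real cluster uses the partitioner's token space and usually vnodes rather than eight equal slices.

```python
from bisect import bisect_left
from fractions import Fraction

X = 2 ** 63  # assumed half-width of the token space, for illustration only
NAMES = "ABCDEFGH"


def token_ranges(x=X, names=NAMES):
    """Split ]-x, x] into len(names) equal ranges ]start, end]."""
    step = Fraction(2 * x, len(names))
    bounds = [-x + i * step for i in range(len(names) + 1)]
    return [(name, bounds[i], bounds[i + 1]) for i, name in enumerate(names)]


def owner(token, ranges):
    """Return the name of the range ]start, end] that contains token."""
    ends = [end for _, _, end in ranges]
    # bisect_left finds the first end >= token, matching the inclusive upper bound
    return ranges[bisect_left(ends, token)][0]


ranges = token_ranges()
for name, start, end in ranges:
    print(f"{name}: ]{start}, {end}]")
print(owner(0, ranges))  # 0 is the inclusive end of D
print(owner(1, ranges))  # just above 0 falls in E
```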
Data Locality – how to
[Diagram: labels "Spark RDD partition" and "Cassandra token ranges" over five nodes, each running Cassandra (C*) and a Spark worker (SparkW), with the Spark master (SparkM) on one node]
Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
→ to a table having the same partition key as the original table
•  save back into this Cassandra table
Use case: sanitize, validate, normalize, transform data
Failure Handling
[Diagram: five nodes, each running Cassandra (C*) and a Spark worker (SparkW), with the Spark master (SparkM) on one node]
What if 1 node is down?
What if 1 node is overloaded?
☞ The Spark master will re-assign the job to another worker
Oh no, my data locality!!!
Data Locality Impl
abstract class RDD[T](…) {
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  protected def getPartitions: Array[Partition]

  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
CassandraRDD
def getPreferredLocations(split: Partition): IP addresses of the Cassandra replicas corresponding to this Spark partition
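A toy Python model of the same idea, not the connector's code: each partition carries the token range it covers, and its preferred locations are the addresses of the replicas that own that range. The class name, the replica map and the IP addresses below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRangePartition:
    """Hypothetical stand-in for a Spark partition backed by one token range."""
    index: int
    range_name: str


# Assumed replica placement with RF = 2: range name -> replica IPs (made up).
REPLICAS = {
    "A": ["10.0.0.1", "10.0.0.2"],
    "B": ["10.0.0.2", "10.0.0.3"],
    "C": ["10.0.0.3", "10.0.0.1"],
}


def get_preferred_locations(split, replicas=REPLICAS):
    """Return the replica addresses for the partition, or [] if unknown.

    Mirrors the contract of RDD.getPreferredLocations, whose default is Nil.
    """
    return list(replicas.get(split.range_name, []))


partitions = [TokenRangePartition(i, name) for i, name in enumerate(REPLICAS)]
for p in partitions:
    print(p.index, p.range_name, get_preferred_locations(p))
```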
Failure Handling
If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎
Tune parameters:
•  spark.locality.wait
•  spark.locality.wait.process
•  spark.locality.wait.node
Only works for fixed token ranges (vnodes)
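The fallback to a replica can be pictured with a short Python model. This is a simplification under stated assumptions, not Spark's scheduler: the function `pick_worker` and the host names are hypothetical, and the real scheduler applies the `spark.locality.wait*` delays per locality level before relaxing locality. The configuration values shown are examples only.

```python
def pick_worker(preferred, alive, busy=frozenset()):
    """Pick the first preferred host that is alive and not busy.

    Returns (host, is_local). Falls back to any free alive host, in which
    case the task loses data locality (is_local is False).
    """
    for host in preferred:
        if host in alive and host not in busy:
            return host, True
    for host in sorted(alive):
        if host not in busy:
            return host, False
    return None, False


# Settings named on the slide; the values here are examples, not advice.
locality_conf = {
    "spark.locality.wait": "3s",
    "spark.locality.wait.process": "3s",
    "spark.locality.wait.node": "3s",
}
conf_args = [f"--conf {key}={value}" for key, value in locality_conf.items()]

replicas = ["node1", "node2"]  # RF = 2: two replicas hold the data
print(pick_worker(replicas, alive={"node1", "node2", "node3"}))
print(pick_worker(replicas, alive={"node2", "node3"}))  # node1 is down
print(pick_worker(replicas, alive={"node3"}))           # both replicas down
print(" ".join(conf_args))
```

With RF = 1 there is no second entry in `replicas`, so the first failure already sends the task to a node that does not hold the data.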
Cross cluster/DC operations
Tales from the field, SASI index benchmark
•  Deployment automation
•  Parallel ingestion
•  Migrating data
•  Spark + Cassandra 3.4 SASI index for topK query
Deployment Automation
Use Ansible to bootstrap a cluster
•  role tools (install vim, htop, dstat, fio, jmxterm, …)
•  role Cassandra. Do not configure all nodes as seeds.
•  role Spark (vanilla Spark). Slave on all nodes, master on a random node
DO NOT START ALL CASSANDRA NODES AT THE SAME TIME!
•  bootstrap the seed nodes first
•  wait ≥ 30 secs between two node bootstraps so the nodes can agree on token ranges
•  watch -n 5 nodetool status
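The start-up order above can be expressed as a small schedule. This is a sketch in Python, assuming hypothetical host names and a helper `bootstrap_schedule` that is not part of Ansible or Cassandra. It only computes when each node should be started: seeds first, with a fixed gap between consecutive nodes.

```python
def bootstrap_schedule(seeds, others, gap_secs=30):
    """Return [(start_offset_secs, host)], seeds first, one node every gap_secs."""
    if gap_secs < 30:
        raise ValueError("leave at least 30 secs between two node bootstraps")
    ordered = list(seeds) + [h for h in others if h not in seeds]
    return [(i * gap_secs, host) for i, host in enumerate(ordered)]


# Hypothetical inventory: two seeds out of five nodes (not all nodes are seeds).
seeds = ["cass1", "cass2"]
others = ["cass3", "cass4", "cass5"]

for offset, host in bootstrap_schedule(seeds, others):
    print(f"t+{offset:>3}s  start {host}")
```

In practice each start would be followed by a check of `nodetool status` before moving on, as the last bullet suggests.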
Parallel ingestion for SASI index benchmark
Hardware specs
•  13 nodes
•  6 cores CPU (HT)
•  4 SSD in RAID 0 😎
•  64 GB of RAM
Cassandra conf:
•  G1GC, 32 GB JVM heap
•  compaction throughput = 256 MB/s
•  concurrent compactors = 2
Parallel ingestion for SASI index benchmark
3.2 billion rows in 17 h (compaction disabled)
RF = 2
☞ ≈ 8000 ips
I/O idle, high CPU
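A quick back-of-the-envelope check of these figures in Python. The interpretation is an assumption: it reads "ips" as inserts per second per node, counts each of the RF = 2 replica writes, and uses the 13 nodes from the hardware specs. The talk does not state how the ≈ 8000 figure was derived.

```python
ROWS = 3_200_000_000  # 3.2 billion rows
HOURS = 17
NODES = 13            # from the hardware specs
RF = 2                # replication factor

seconds = HOURS * 3600
cluster_rows_per_sec = ROWS / seconds
# Each row is written RF times, and the writes are spread over all nodes.
writes_per_node_per_sec = cluster_rows_per_sec * RF / NODES

print(f"cluster-wide: {cluster_rows_per_sec:,.0f} rows/sec")
print(f"per node, replicas included: {writes_per_node_per_sec:,.0f} writes/sec")
```

Under that reading the cluster ingests roughly 52,000 rows per second, which comes to about 8,044 replica writes per second per node, in line with the ≈ 8000 on the slide.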
Migrating Data
TopK query
Pass 1, for each music provider
•  sum album sales count by title
•  take the top N and assign a weight in descending order (1st = 1000, 2nd = 999, …)
Pass 2, retrieve all albums from pass 1
•  re-sum the sum(sales count) and the weight, grouped by title
•  order again by sum(sales count) in descending order
•  take the top N
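The two passes can be sketched in plain Python. This is a single-machine model under assumptions: the record layout `(provider, title, sales)`, the function names and the sample data are hypothetical. In the talk the aggregation runs in Spark over data filtered through SASI indices.

```python
from collections import defaultdict


def pass1(rows, n, top_weight=1000):
    """Per provider: sum sales by title, keep the top n, attach rank weights."""
    by_provider = defaultdict(lambda: defaultdict(int))
    for provider, title, sales in rows:
        by_provider[provider][title] += sales
    result = []
    for provider, totals in by_provider.items():
        # Sort by sales descending; ties broken by title for determinism.
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        for rank, (title, total) in enumerate(ranked):
            result.append((title, total, top_weight - rank))  # 1st = 1000, 2nd = 999
    return result


def pass2(partials, n):
    """Across providers: re-sum sales and weights by title, keep the top n."""
    sales, weights = defaultdict(int), defaultdict(int)
    for title, total, weight in partials:
        sales[title] += total
        weights[title] += weight
    ranked = sorted(sales, key=lambda t: (-sales[t], t))[:n]
    return [(t, sales[t], weights[t]) for t in ranked]


rows = [
    ("providerA", "Album X", 50), ("providerA", "Album Y", 30),
    ("providerA", "Album X", 20), ("providerB", "Album Y", 60),
    ("providerB", "Album Z", 10),
]
print(pass2(pass1(rows, n=2), n=2))
```

Because pass 1 truncates each provider's list to N titles, pass 2 only has to merge at most N rows per provider instead of the full data set.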
TopK query
Target data set = 3.2 billion rows
•  minimum filter = 1 month (e.g. period_end_month = 201404)
•  worst filter = 3-month range
•  + 8 other dynamic filters (music provider, distribution type, …)
☞ SASI indices for filtering
☞ Spark for aggregation
TopK query results
3.2 billion rows in total
•  random distribution over 3 years (36 months) → ≈ 88 million rows/month
Filters                          | #rows       | Duration            | #rows/sec
3 months                         | 376 947 612 | 14 mins (840 secs)  | 448 747
1 month                          | 94 239 127  | 6.1 mins (366 secs) | 257 483
1 month + 1 provider             | 7 267 983   | 2.1 mins (126 secs) | 57 682
1 month + 1 provider + 1 country | 2 737 178   | 1.5 mins (90 secs)  | 30 413
Q & A
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
