©2014 DataStax Confidential. Do not distribute without consent.
timothy.vincent@datastax.com
Tim Vincent
Solution Engineer
Lightning-fast analytics with Spark
for Cassandra and DataStax Enterprise
What is Spark?
* Apache Project since 2010
* Fast
* 10x-100x faster than Hadoop MapReduce
* In-memory storage
* Single JVM process per node
* Easy
* Rich Scala, Java and Python APIs
* 2x-5x less code
* Interactive shell
API
* map, reduce
API
* map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
API
* Resilient Distributed Datasets (RDDs)
* Collections of objects spread across a cluster
* Stored in RAM or on Disk
* Built through parallel transformations
* Automatically rebuilt on failure
* Operations
* Transformations (e.g. map, filter, groupBy)
* Actions (e.g. count, collect, save)
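The split between transformations and actions can be sketched as follows. This is a minimal illustration, not taken from the original deck: it assumes Spark 1.3 or later on the classpath (older versions also need `import org.apache.spark.SparkContext._` for `reduceByKey`), runs in local mode, and uses a hypothetical object name and sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-basics")
    val sc = new SparkContext(conf)
    try {
      // An RDD built from a local collection (hypothetical sample data).
      val words = sc.parallelize(Seq("foo", "bar", "foo", "baz", "foo"))

      // Transformations are lazy: they only describe the computation.
      val counts = words
        .filter(_ != "baz")
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // An action triggers execution and returns the result to the driver.
      val result = counts.collect().toMap
      assert(result == Map("foo" -> 3, "bar" -> 1))
      println(result)
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is computed until `collect()` is called; the `filter`, `map` and `reduceByKey` calls only extend the RDD's lineage.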
Operator Graph: Optimization and Fault Tolerance
[Figure: operator graph of RDDs A: to F:, connected by map, filter, groupBy and join, and divided into Stage 1, Stage 2 and Stage 3. Legend: RDD, cached partition.]
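A small sketch of how caching and lineage interact in an operator graph like this one. It is an illustration under assumptions rather than the deck's own example: local mode, Spark 1.3 or later, and hypothetical RDD names `a`, `b` and `sums`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedLineage {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cached-lineage"))
    try {
      val a = sc.parallelize(1 to 1000).map(n => (n % 10, n))

      // groupByKey is a wide dependency, so it marks a stage boundary.
      // cache() asks Spark to keep the resulting partitions in memory.
      val b = a.groupByKey().cache()

      // mapValues is a narrow dependency and is pipelined within its stage.
      val sums = b.mapValues(_.sum)

      sums.count() // first action: computes b and caches its partitions
      sums.count() // second action: can reuse the cached partitions of b

      // Key 0 holds 10, 20, ..., 1000, which sum to 50500.
      assert(sums.collect().toMap.apply(0) == 50500)

      // Prints the lineage Spark would use to rebuild lost partitions.
      println(sums.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

The lineage printed by `toDebugString` is also what makes RDDs fault tolerant: a lost partition is recomputed from its parents instead of being restored from a replica.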
Fast
* Logistic Regression Performance
[Figure: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
* Hadoop: 110 sec / iteration
* Spark: first iteration 80 sec, further iterations 1 sec
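The gap between the first and later iterations comes from keeping the training data in memory between passes. A minimal sketch of that pattern, assuming local mode, Spark 1.3 or later, a toy one-dimensional data set, and a hypothetical learning rate and iteration count (none of these come from the benchmark above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeLogReg {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("iterative-logreg"))
    try {
      // (feature, label) pairs; cached so every iteration after the first
      // reads them from memory instead of rebuilding them.
      val points = sc.parallelize(Seq(
        (-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0))).cache()

      var w = 0.0    // weight
      var b = 0.0    // bias
      val rate = 0.1 // hypothetical learning rate

      for (_ <- 1 to 20) {
        val (cw, cb) = (w, b) // immutable copies captured by the closure
        // Sum the log-loss gradient over all points in parallel.
        val (gw, gb) = points.map { case (x, y) =>
          val p = 1.0 / (1.0 + math.exp(-(cw * x + cb)))
          ((p - y) * x, p - y)
        }.reduce { case ((g1, h1), (g2, h2)) => (g1 + g2, h1 + h2) }
        w -= rate * gw
        b -= rate * gb
      }

      // Positive features carry label 1, so the learned weight is positive.
      assert(w > 0.0)
      println(s"w = $w, b = $b")
    } finally {
      sc.stop()
    }
  }
}
```

Each pass over `points` is a separate Spark job, which is why avoiding a reload of the input on every iteration matters for algorithms of this shape.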
Why Spark on Cassandra?
* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing (coming soon)
* Near real time
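A cross-table operation such as a JOIN followed by an aggregation can be sketched with plain pair RDDs. The two in-memory collections below are hypothetical stand-ins for two Cassandra tables; the sketch assumes local mode and Spark 1.3 or later.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CrossTableJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cross-table-join"))
    try {
      // Hypothetical stand-ins for two tables, both keyed by user id.
      val users = sc.parallelize(Seq((1, "john"), (2, "tom")))
      val orders = sc.parallelize(Seq((1, 9.5), (1, 20.0), (2, 5.0)))

      // Join on the key, then total the order amounts per user name.
      val totals = users.join(orders)
        .map { case (_, (name, amount)) => (name, amount) }
        .reduceByKey(_ + _)
        .collect()
        .toMap

      assert(totals == Map("john" -> 29.5, "tom" -> 5.0))
      println(totals)
    } finally {
      sc.stop()
    }
  }
}
```

A query like this is independent of how either table is partitioned or clustered, which is the sense in which Spark queries are data model independent.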
How to Spark on Cassandra?
* DataStax Cassandra Spark driver
* Open source: https://github.com/datastax/cassandra-driver-spark
* Compatible with
* Spark 0.9+
* Cassandra 2.0+
* DataStax Enterprise 4.5+
Analytics Workload Isolation
[Figure: Mixed Load Cassandra Cluster with a Cassandra + Spark DC and a Cassandra Only DC, serving an Online App and an Analytical App]
Analytics High Availability
* All nodes are Spark Workers
* By default resilient to Worker failures
* First Spark node promoted as Spark Master
* Standby Master promoted on failure
* Master HA available in DataStax Enterprise
[Figure: cluster diagram with a Spark Master, a Spark Standby Master and Spark Workers]
Cassandra Spark Driver
* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Virtual Nodes support
* Scala only driver for now
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._
// Spark classes used below
import org.apache.spark.{SparkConf, SparkContext}

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data
CREATE TABLE test.words (word text PRIMARY KEY, count int);
INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

* Accessing table above as RDD:

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size        // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30
Saving Data
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

* RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
Type Mapping
CQL Type        Scala Type
--------------  ----------------------------------------------------------------
ascii           String
bigint          Long
boolean         Boolean
counter         Long
decimal         BigDecimal, java.math.BigDecimal
double          Double
float           Float
inet            java.net.InetAddress
int             Int
list            Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map             Map, TreeMap, java.util.HashMap
set             Set, TreeSet, java.util.HashSet
text, varchar   String
timestamp       Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid        java.util.UUID
uuid            java.util.UUID
varint          BigInt, java.math.BigInteger
* Nullable values map to Option
Mapping Rows to Objects
CREATE TABLE test.cars (
id text PRIMARY KEY,
model text,
fuel_type text,
year int
);
case class Vehicle(
id: String,
model: String,
fuelType: String,
year: Int
)
sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala Case Classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)
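The underscore-to-camel-case naming convention can be illustrated with a small helper. `ColumnNames.toCamelCase` is a hypothetical function written for this example; it only demonstrates the convention and is not the driver's actual implementation.

```scala
object ColumnNames {
  /** Converts a CQL-style underscore name to a Scala-style camelCase name.
    * Hypothetical helper for illustration; not part of the driver. */
  def toCamelCase(column: String): String = {
    val parts = column.split("_").filter(_.nonEmpty)
    if (parts.isEmpty) column
    else parts.head + parts.tail.map(_.capitalize).mkString
  }

  def main(args: Array[String]): Unit = {
    // Columns of test.cars and the Vehicle properties they correspond to.
    Seq("id", "model", "fuel_type", "year").foreach { column =>
      println(s"$column -> ${toCamelCase(column)}")
    }
  }
}
```

Under this convention the `fuel_type` column lines up with the `fuelType` property of `Vehicle`, while single-word columns such as `id` and `year` keep their names.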
Server Side Data Selection
* Reduce the amount of data transferred
* Selecting columns
* Selecting rows (by clustering columns and/or secondary indexes)
sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
// CassandraRow{username: john}
// CassandraRow{username: tom}
sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
// CassandraRow{model: Ford Mondeo}
// CassandraRow{model: Hyundai x35}
Shark
* SQL query engine on top of Spark
* Not part of Apache Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Supports in-memory tables
* Available as a part of DataStax Enterprise
Shark In-memory Tables
CREATE TABLE CachedStocks TBLPROPERTIES ("shark.cache" = "true")
AS SELECT * from PortfolioDemo.Stocks WHERE value > 95.0;
OK
Time taken: 1.215 seconds
SELECT * FROM CachedStocks;
OK
MQT price 97.9270442241818
SII price 99.69238346610474
.
. (123 additional prices)
.
PBG price 96.09162963505352
Time taken: 0.569 seconds
Spark SQL vs Shark
[Figure: stack diagram with Shark or Spark SQL, Streaming, ML and Graph on top of Spark (General execution engine), over Cassandra; labelled "Compatible"]
Questions?


Editor's Notes

  • #7: The key point on this slide is that computation for the second iteration does not go back beyond cached RDDs. For example, when F: requests a second iteration, it will not hit A: as long as the data is in B:. We basically perform the operation on A: once and keep the result in the B: RDD. The figure is an example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in brown if they are already in memory. To run an action on RDD F, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then stage 3.