Spark + Cassandra 
Carl Yeksigian 
DataStax
Spark 
-Fast large-scale data processing framework 
-Focused on in-memory workloads 
-Supports Java, Scala, and Python 
-Integrated machine learning support (MLlib) 
-Streaming support 
-Simple developer API
Resilient Distributed Dataset (RDD) 
-Presents a simple Collection API to the developer
-Breaks full collection into partitions, which can be operated on independently
-Knows how to recalculate itself if data is lost 
-Abstracts how to complete a job from the tasks
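As a rough illustration of the collection-style API described above, the sketch below runs a small job in Spark's local mode. It assumes Spark is on the classpath; the object name, helper name, app name, and input range are illustrative and not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Sum the squares of the even numbers in 1..upTo using RDD operations.
  def sumOfEvenSquares(sc: SparkContext, upTo: Int): Long =
    sc.parallelize(1 to upTo, numSlices = 4) // split the collection into 4 partitions
      .filter(_ % 2 == 0)                    // lazy transformation: only records lineage
      .map(n => n.toLong * n)                // lazy transformation
      .reduce(_ + _)                         // action: triggers a job, one task per partition

  def main(args: Array[String]): Unit = {
    // Local mode with two worker threads, so no cluster is needed (assumption for illustration)
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-sketch")
    val sc = new SparkContext(conf)
    try println(s"sum = ${sumOfEvenSquares(sc, 100)}")
    finally sc.stop()
  }
}
```

The chain reads like ordinary Scala collection code; the difference is that nothing executes until the `reduce` action, and a lost partition can be rebuilt by replaying the recorded transformations.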
RDD
RDD API
Partitions 
-Partitions can be created so they are on the same machine as the data
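One way to look at partitions and their locality hints is sketched below, again assuming Spark in local mode; the object and helper names are illustrative. For an in-memory collection the preferred-host list is empty, whereas an RDD backed by a data source can report the hosts that hold each partition's data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Count the elements held by each partition, processing partitions independently.
  def partitionSizes(sc: SparkContext): Array[(Int, Int)] =
    sc.parallelize(1 to 1000, numSlices = 8)
      .mapPartitionsWithIndex((index, rows) => Iterator((index, rows.size)))
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 8)
      rdd.partitions.foreach { p =>
        // The scheduler uses these hosts, when present, to place tasks near the data
        println(s"partition ${p.index}: preferred hosts = ${rdd.preferredLocations(p)}")
      }
      partitionSizes(sc).foreach { case (i, n) => println(s"partition $i holds $n elements") }
    } finally sc.stop()
  }
}
```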
Uses for Spark with Cassandra 
-Ad-hoc queries 
-Joins, Unions across tables 
-Rewriting tables 
-Machine Learning
spark-cassandra-connector 
DataStax OSS Project 
https://github.com/datastax/spark-cassandra-connector
Spark Cassandra Connector 
-Exposes Cassandra tables as RDDs 
-Read from and write to Cassandra 
-Data type mapping 
-Scala and Java support
Spark + Bioinformatics 
-ADAM is a bioinformatics project out of UC Berkeley AMPLab
-Combines Spark + Parquet + Avro
https://github.com/bigdatagenomics/adam
http://bdgenomics.org/
Simple Variant 
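case class Variant(
  sampleid: String,
  referencename: String,
  location: Long,
  allele: String)

-- Cassandra requires a primary key; the one below is an assumption, since the slide omits it
create table adam.variants (
  sampleid ascii,
  referencename ascii,
  location bigint,
  allele ascii,
  primary key (sampleid, referencename, location, allele));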
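Because the case class fields and the table columns share names, the connector can map rows straight to objects. The sketch below shows a typed read under a few assumptions that are not in the slides: a Cassandra node on 127.0.0.1, Spark in local mode, and the newer `spark.cassandra.connection.host` property name.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Field names match the column names of adam.variants
case class Variant(sampleid: String, referencename: String, location: Long, allele: String)

object TypedReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[2]")                               // assumption: local mode
      .setAppName("typed-read-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra node
    val sc = new SparkContext(conf)
    try {
      // Each row is converted to a Variant by matching column names to fields
      val variants = sc.cassandraTable[Variant]("adam", "variants")
      variants.take(5).foreach(println)
    } finally sc.stop()
  }
}
```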
Connecting to Cassandra 
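import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Spark connection options (the addresses are placeholders; 345 is not a valid IPv4 octet)
val conf = new SparkConf(true)
  .setMaster("spark://192.168.345.10:7077")
  .setAppName("cassandra-demo")
  // Later connector releases name this key "spark.cassandra.connection.host"
  .set("cassandra.connection.host", "192.168.345.10")
val sc = new SparkContext(conf)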
Saving To Cassandra 
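// getVariant is not shown on the slide; presumably it turns each
// VariantContext into zero or more Variant rows
val variants: RDD[VariantContext] = sc.adamVCFLoad(args(0))
variants.flatMap(getVariant)
  .saveToCassandra("adam", "variants", AllColumns)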
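For readers without a VCF file or ADAM at hand, the same write path can be tried with a hand-built collection. This is only a sketch: the sample rows are made-up values, and the local master, the 127.0.0.1 host, and the newer `spark.cassandra.connection.host` property name are assumptions rather than anything from the slides.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

case class Variant(sampleid: String, referencename: String, location: Long, allele: String)

object SaveSketch {
  // Made-up rows, for illustration only
  val sampleRows: Seq[Variant] = Seq(
    Variant("sample-1", "chr1", 12345L, "A"),
    Variant("sample-1", "chr1", 12346L, "T"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[2]")                               // assumption: local mode
      .setAppName("save-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra node
    val sc = new SparkContext(conf)
    try {
      // Listing the columns explicitly is equivalent to AllColumns for this table
      sc.parallelize(sampleRows).saveToCassandra(
        "adam", "variants",
        SomeColumns("sampleid", "referencename", "location", "allele"))
    } finally sc.stop()
  }
}
```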
Querying Cassandra 
val rdd = sc.cassandraTable("adam", "variants")
  .map(r => (r.get[String]("allele"), 1L))   // (allele, 1)
  .reduceByKey(_ + _)                        // count per allele
  .map(r => (r._2, r._1))                    // (count, allele)
  .sortByKey(ascending = false)              // most frequent first
rdd.collect()
  .foreach(bc => println("%40s\t%d".format(bc._2, bc._1)))
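To see what the query computes without a cluster, the same count-and-sort logic can be written against a plain Scala collection. This is a sketch, not the connector's behaviour: the object and function names are illustrative, the input is made up, and ties are broken by allele name, which the Spark version does not guarantee.

```scala
object AlleleCounts {
  // Count occurrences of each allele and order by descending count.
  def countAlleles(alleles: Seq[String]): Seq[(Long, String)] =
    alleles
      .groupBy(identity)
      .toSeq // leave the Map before building (count, allele) pairs, so equal counts do not collide
      .map { case (allele, occurrences) => (occurrences.size.toLong, allele) }
      .sortBy { case (count, allele) => (-count, allele) }

  def main(args: Array[String]): Unit =
    countAlleles(Seq("A", "T", "A", "G", "A", "T"))
      .foreach { case (count, allele) => println("%40s\t%d".format(allele, count)) }
}
```

`groupBy` followed by a size count plays the role of `map` plus `reduceByKey`, and `sortBy` on the negated count stands in for `sortByKey(ascending = false)`.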
Thanks 
Acknowledgements: 
Timothy Danford (AMPLab) 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Jeff Hammerbacher (Cloudera/Mt Sinai)
