Sorry for the Delay
• There were some technical difficulties, so we are giving folks a few more minutes to join
• Again, sorry for the delay
© 2014 DataStax, All Rights Reserved. Company Confidential 1
Big Data Analytics with Spark
Webinar Housekeeping
• All attendees placed on mute
• Input questions at any time using the online interface
Big Data Analytics with
Cassandra and Spark
Brian Hess
Sr. Product Manager for Analytics
DataStax
Willie Sutton
Bank Robber in the 1930s-1950s
FBI Most Wanted List 1950
Captured in 1952
When asked “Why do you rob banks?”
“Because that’s where the money is.”
Motivating Use Case
Internet of Things
Your
System
Motivating Use Case
Internet of Things
Your System
FAULT
Cassandra
Spark
Spark + Cassandra
Apache Cassandra
• Distributed NoSQL database
– BigTable meets Dynamo
• All nodes are equal
– Always on
– Linear scale out - a lot
• More data
• More transactions
• Multi-Datacenter
– Geographic or Workload
• Cassandra Query Language
– SQL-like
[Figure: throughput labels of 100,000, 200,000, and 400,000 txns/sec, illustrating linear scale-out]
How Cassandra Works – Writes
It’s 72°
Done
Tunable Consistency
• Relax the Consistency in ACID
– Isn’t always needed – and isn’t guaranteed anyway (in distributed DBs)
– Reads may not get the most up-to-date data – but almost always will
• All data is replicated
– Set in the schema
– Distributed to nodes by Token Range
• Options:
– QUORUM, ONE, ALL
• Can ensure reads get most up-to-date value
– E.g. – read/write at QUORUM
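The read/write-at-QUORUM rule above can be checked with simple arithmetic: if the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF), every read overlaps at least one replica that holds the latest acknowledged write. The sketch below is a plain-Scala model only, not driver code; the object and method names are hypothetical and introduced for illustration.

```scala
// Sketch only: models consistency levels as replica counts; not Cassandra driver code.
object ConsistencySketch {
  // Replicas that must acknowledge at QUORUM for a given replication factor.
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  // Replica count required by each level named on the slide.
  def required(level: String, replicationFactor: Int): Int = level match {
    case "ONE"    => 1
    case "QUORUM" => quorum(replicationFactor)
    case "ALL"    => replicationFactor
    case other    => throw new IllegalArgumentException(s"unknown level: $other")
  }

  // A read is guaranteed to overlap the latest acknowledged write when R + W > RF.
  def readsSeeLatestWrite(readLevel: String, writeLevel: String, rf: Int): Boolean =
    required(readLevel, rf) + required(writeLevel, rf) > rf
}
```

With a replication factor of 3, QUORUM reads plus QUORUM writes touch 2 + 2 replicas, which is more than 3, so the read sets and write sets must intersect; ONE plus ONE does not give that guarantee.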
How Cassandra Works – Tunable Consistency
You got it. I’ll make sure everyone gets it.
You got it. A majority got it. The rest will.
You got it. One guy got it. The rest will.
You got it. Everyone has it.
How Cassandra Works – Query
SELECT user_id
FROM users
WHERE name = 'PBCupFan';
Sure thing, let me get that for you.
What do you guys have for PBCup?
Here’s what I have:
Here’s what I have:
Let me resolve any conflicts
Here ya go!
user_id
---------
1234
(1 rows)
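The "resolve any conflicts" step in the query walkthrough can be sketched as last-write-wins: when replicas return different values, the coordinator keeps the one with the newest write timestamp. This is a simplified plain-Scala model under that assumption; the case class, its field names, and the sample values are hypothetical, and details such as tombstones and timestamp ties are left out.

```scala
// Sketch only: a simplified model of coordinator-side conflict resolution.
object ReadResolutionSketch {
  // One replica's answer: the value it holds and the write timestamp attached to it.
  final case class ReplicaValue(replica: String, userId: Int, writeTimestampMicros: Long)

  // Last write wins: return the response carrying the newest write timestamp.
  def resolve(responses: Seq[ReplicaValue]): Option[ReplicaValue] =
    if (responses.isEmpty) None else Some(responses.maxBy(_.writeTimestampMicros))
}
```

For example, if one replica still holds an older value and another holds the value written later, `resolve` returns the later one, which is what the client sees.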
Cassandra for Internet of Things
It’s all about scaling
Cassandra
• Always On
– No down time
• Linear Scalability
– For writes or reads
– For data size
• Terrific choice for Internet of Things, Web, Mobile, etc.
– British Gas, Nike, etc – Thermostats, Manufacturing, Oil/Gas, etc
It’s where the data is!
Cassandra Limitations
• No aggregations
– Optimized for lookups & writes
– No GROUP BYs
– No Windowed Aggregates
• No Joins
– Model your data to avoid them
• Must select by partition key
– There are secondary indexes
• But they are an antipattern
• Not optimized for full-table scans
It actually can’t do everything
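One way to picture the partition-key limitation is to model a table as a map from partition key to rows. A lookup by key touches one partition, while a predicate on any other column has to visit every partition. The sketch below is an in-memory analogy in plain Scala, not Cassandra code; the `Reading` type and the sample rows are hypothetical.

```scala
// Sketch only: an in-memory stand-in for a table partitioned by sensorId.
object PartitionLookupSketch {
  final case class Reading(sensorId: Int, month: Int, temp: Double)

  // Hypothetical sample data, grouped by the partition key.
  val table: Map[Int, Vector[Reading]] = Vector(
    Reading(1, 1, 68.0), Reading(1, 2, 72.0), Reading(2, 1, 101.5)
  ).groupBy(_.sensorId)

  // Lookup by partition key touches a single partition.
  def byPartitionKey(sensorId: Int): Vector[Reading] =
    table.getOrElse(sensorId, Vector.empty)

  // A predicate on a non-key column has to visit every partition.
  def hotterThan(threshold: Double): Vector[Reading] =
    table.values.flatten.filter(_.temp > threshold).toVector
}
```

The second method is the access pattern Cassandra is not optimized for, and it is the kind of work the deck later hands to Spark.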
Apache Spark
• Distributed computing framework
• Generalized DAG execution
• Easy Abstraction for Datasets
• Integrated SQL Queries
• Streaming
• Machine Learning Library
Spark Components
Spark Core Engine
Spark SQL, Spark Streaming, MLlib, GraphX, SparkR
Spark Provides a Simple and Efficient Framework for Distributed Computations
Node Roles: 2
In-Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction for Datasets? DataFrame! (previously Resilient Distributed Dataset (RDD))
[Diagram: Spark Master, three Spark Workers, Spark Executor, Spark Partition, DataFrame (or RDD)]
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by Worker - Workhorse of the Spark application
RDDs Can be Generated from a
Variety of Sources
Textfiles
Parallelized Collections
Spark on Cassandra
Spark Core Engine
Spark SQL, Spark Streaming, MLlib, GraphX, SparkR
Cassandra
DataStax Spark-Cassandra Connector
Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to Cassandra
Each Executor maintains a connection to the C* cluster
RDDs are read into different splits based on sets of tokens
[Diagram: Spark Executor with DataStax Java Driver; the full C* token range divided into Tokens 1-1000, Tokens 1001-2000, Tokens …]
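The idea of reading an RDD in splits based on sets of tokens can be sketched as dividing a token range into contiguous pieces, each of which would become one Spark partition. This is a plain-Scala illustration using small ranges like the ones on the slide; the function name is hypothetical, and the connector's actual splitting logic is more involved and is not reproduced here. The arithmetic below would also overflow on the full 64-bit token range, so treat it as a teaching sketch only.

```scala
// Sketch only: divides a small inclusive token range into near-equal contiguous splits.
object TokenSplitSketch {
  def splitTokenRange(first: Long, last: Long, splits: Int): Vector[(Long, Long)] = {
    require(splits > 0 && last >= first, "need at least one split and a non-empty range")
    val total = last - first + 1                    // number of tokens in the range
    val n = math.min(splits.toLong, total).toInt    // never more splits than tokens
    val base = total / n                            // minimum tokens per split
    val extra = (total % n).toInt                   // first `extra` splits get one more
    var start = first
    (0 until n).map { i =>
      val size = base + (if (i < extra) 1 else 0)
      val range = (start, start + size - 1)
      start += size
      range
    }.toVector
  }
}
```

Calling `TokenSplitSketch.splitTokenRange(1, 2000, 2)` yields the two ranges shown on the slide, 1 to 1000 and 1001 to 2000.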
Co-locate Spark and C* for Best Performance
• Run Cassandra and Spark on the same nodes
• Local reads/writes
• Increased performance
Things you can’t do in Cassandra – Using SparkSQL
// csc is assumed to be a CassandraSQLContext from the Spark Cassandra Connector;
// a plain SparkContext (sc) has no sql method
• JOINs
csc.sql("""SELECT t.sensor_id, t.temp, m.location
  FROM ks.temperatures t JOIN ks.metadata m
  ON t.sensor_id = m.sensor_id
  WHERE t.sensor_id = 12345""")
• Aggregates
csc.sql("""SELECT sensor_id, year, month, MAX(temp) mtemp
  FROM ks.temperatures
  GROUP BY sensor_id, year, month""")
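To make explicit what the aggregate query computes, here is the same grouping expressed with plain Scala collections. It is a small in-memory sketch, not Spark code; the `Reading` type is hypothetical and stands in for rows of ks.temperatures.

```scala
// Sketch only: the GROUP BY / MAX(temp) logic on an in-memory collection.
object MaxTempSketch {
  final case class Reading(sensorId: Int, year: Int, month: Int, temp: Double)

  // Group rows by (sensor_id, year, month) and keep the highest temp in each group.
  def maxTempPerMonth(rows: Seq[Reading]): Map[(Int, Int, Int), Double] =
    rows.groupBy(r => (r.sensorId, r.year, r.month))
      .map { case (key, group) => key -> group.map(_.temp).max }
}
```

Spark does the same grouping, but distributed across the cluster rather than on one machine.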
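Things you can’t do in Cassandra – External Data
• JOIN with HDFS data
// requires: import com.datastax.spark.connector._
// 2014 averages from CSV; columns assumed to be sensor_id, year, month, avgTemp
val temp2014 = sc.textFile("webhdfs://myhadoop/data/temp2014.csv").
  map(x => x.split(",")).
  map(x => ((x(0).toInt, x(2).toInt), x(3).toDouble))
// 2015 averages from Cassandra, keyed the same way: (sensor_id, month)
val temp2015 = sc.cassandraTable("ks", "temperatures").
  filter(x => x.getInt("year") == 2015).
  map(x => ((x.getInt("sensor_id"), x.getInt("month")), x.getDouble("avgTemp")))
// each joined value is (avgTemp2015, avgTemp2014); keep the months that were hotter in 2015
val hotter = temp2015.join(temp2014).filter(x => x._2._1 > x._2._2)
• Non-Partition Key Predicates
csc.sql("SELECT * FROM ks.temperatures WHERE temp > 100")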
Tools
• ODBC and JDBC tools via SparkSQL
– Tableau, Pentaho, R, etc
• Apache Zeppelin (incubating)
– A web-based notebook that enables interactive data analytics
Quick word on Spark Streaming and Cassandra
• Very good combination
– Simple, powerful, useful, scalable, etc, etc, etc.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input (serverIP and serverPort are defined elsewhere)
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output to keyspace "test", table "words"
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()
DataStax Enterprise
Combines Cassandra, Spark, and Solr (and more!)
- Fault Tolerance
- Management
- Visual Monitoring
- Security
- ETC!
Motivating Use Case
Internet of Things
Cassandra + Spark
• Unleash the power of analytics
• On your operational data
– IoT, Web, Mobile, etc
“Because that’s where the Data is.”
Contacts and Links
• Links
– Cassandra Summit: http://cassandrasummit-datastax.com/
– DataStax Academy: https://academy.datastax.com/
• Contacts
– Kevin Pardue, Regional Channel Manager: kevin.pardue@datastax.com
– Brian Hess, Sr Product Manager for Analytics: brian.hess@datastax.com
– Devin Saxon, Marketing Specialist: dsaxon@datastax.com


Editor's Notes

  • #34: Spark has a very simple architecture (see chart). The basic model for RDDs is really nice and easy to grok. RDDs can come from many sources. Lots of fun languages are supported.
  • #35: Spark Master: analogous to the Job Tracker. Initial contact point for applications. Keeps track of the state of the system.
  • #36: Spark Worker: Task Tracker … Manages starting "executors" on machines. Reports on and sets up the environment for executors.
  • #37: Spark Executor: actually does the work. A process started by the worker that communicates directly with the driving application and master. 1 Spark Partition per executor … KEY: Spark Partition != Cassandra Partition.
  • #38: RDDs: where do they come from? All sorts of great places.
  • #39: So how do we act with these RDDs?
  • #42: Basics of how the OSS Connector works. How an RDD is split up.