Jim Hatcher
DFW Cassandra Users - Meetup
5/18/2016
5 Ways to use Spark to Enrich your Cassandra Environment
C*
Agenda
• Introduction
• Data Systems
• What is Cassandra?
• Tradeoffs vs. RDBMS
• Addressing Limitations
• What is Spark?
• Five Ways to Use Spark in a Cassandra Environment
• ETL
• Data Migrations
• Consistency Checking / Syncing Denormalized Data
• Analytics
• Machine Learning
• Spark Resources
Introduction
Jim Hatcher
james_hatcher@hotmail.com
At IHS, we take raw data and turn it into information and insights for our customers.
Automotive Systems (CarFax)
Defense Systems (Jane’s)
Oil & Gas Systems (Petra)
Maritime Systems
Technology Systems (Electronic Parts Database, Root Metrics)
(Pipeline: Sources of Raw Data → Structure Data → Add Value → Customer-facing Systems)
Data Systems
(A quadrant chart: Analytical vs. Operational across the top, Conventional Scale vs. Big Data down the side.)
• Big Data Analytics (Analytical / Big Data): Hadoop, MapReduce, Hive, Spark
• NoSQL (Operational / Big Data): Cassandra, HBase, MongoDB
• Data Warehousing (Analytical / Conventional Scale): SQL Server, Oracle, SAS, Tableau
• Relational Database (Operational / Conventional Scale): SQL Server, Oracle, DB2
Analytical systems: batch processing, minutes-to-hours latency, range queries, visualization / dashboards.
Operational systems: real-time processing, millisecond latency, discrete seeks/updates, line-of-business apps.
Big data systems: “commodity” hardware, scale out.
Conventional-scale systems: large servers using shared storage, scale up.
Data Systems
The same quadrant chart, viewed through the factors that determine where a workload belongs:
• Size/Scale of Data
• Multi-Data Center (with writes)
• Rate of Data Ingest
• Massive Concurrency
• Uptime Requirements
• Operational Complexity
What is Cassandra?
(Diagram: a Cassandra cluster of six nodes, A–F, arranged in a ring, with a client connecting to the cluster. The token space from -9223372036854775808 through 9223372036854775807 is divided evenly across the nodes:)
• -9223372036854775808 through -6148914691236517207
• -6148914691236517206 through -3074457345618258605
• -3074457345618258604 through -3
• -2 through 3074457345618258599
• 3074457345618258600 through 6148914691236517201
• 6148914691236517202 through 9223372036854775807
CREATE KEYSPACE orders
WITH replication =
{
'class': 'SimpleStrategy',
'replication_factor': 3
};
CREATE TABLE orders.customer
(
customer_id uuid,
customer_name varchar,
customer_age int,
PRIMARY KEY ( customer_id )
);
INSERT INTO customer (customer_id, customer_name, customer_age)
VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);
SELECT customer_name, customer_age FROM customer WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
What is Cassandra?
Cassandra is
• A NoSQL (i.e., non-relational) operational database
• Distributed (the data lives on many nodes)
• Highly Scalable (no scale ceiling)
• Highly Available (no single point of failure)
• Open Source
• Fast (optimized for fast reads and fast writes)
Cassandra uses:
• Commodity Hardware (no SAN/NAS or high-end hardware)
• Ring Architecture (not master/slave)
• Flexible Data Model
• CQL (abstraction layer for data access; not tied to a particular language)
DataStax provides consulting, support, and additional software around Cassandra.
Tradeoffs (vs. RDBMS)
What you gain with Cassandra:
• Linear Horizontal Scale (HUGE!)
• Multi-Data Center / Active-Active
• Fast, Scalable Writes
• Fast Reads (by the key(s))
• Continuous Availability
• High Concurrency
• Schema Flexibility
• Cheaper (commodity hardware)?
What you give up with Cassandra:
• Tables only queryable by key
• 3rd Normal Form
• Data Integrity Checks
• Foreign Keys
• Unique Indexes
• Joins
• Secondary Indexes
• Grouping / Aggregation
• ACID
Addressing Limitations
Each limitation maps to one or more solutions; the last three solutions (Consistency Checker, Batch Analytics, Batch ETL) are where Spark comes in:
• Tables only Queryable by Key → Denormalize Data; Index in Another Tool
• No Foreign Keys / Unique Indexes → Idempotent Data Model; Consistency Checker
• No JOINs → Denormalize Data; Batch Analytics
• No GROUP BYs / Aggregation → Batch Analytics
• Keeping Denormalized Data in Sync → Consistency Checker
• Creating New Tables for New Queries → Batch ETL
What is Spark?
Spark is a processing framework designed
to work with distributed data.
“up to 100X faster than MapReduce”
according to spark.apache.org
Used in any ecosystem where you want to
work with distributed data (Hadoop,
Cassandra, etc.)
Includes other specialized libraries:
• SparkSQL
• Spark Streaming
• MLlib
• GraphX
Spark Facts
• Conceptually similar to: MapReduce
• Written in: Scala
• Supported by: Databricks
• Supported languages: Scala, Java, or Python
Spark Architecture
(Diagram: a Spark Client hosts the Driver and its Spark Context; the Driver talks to the Spark Master, which manages several Spark Workers, each running an Executor. The sequence: 1. Request Resources, 2. Allocate Resources, 3. Start Executors, 4. Perform Computation.)
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark Terms / Concepts
Resilient Distributed Dataset (RDD)
Represents an immutable, partitioned collection of elements that can be operated on in parallel.
DataFrame
RDD + schema
This is the “way that everything in Spark is going”
Actions and Transformations
Transformations – create a new RDD but are executed in a lazy fashion (i.e., when an action fires)
Actions – cause a computation to be run and return a response to the driver program
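A minimal sketch of the distinction, assuming a running SparkContext named sc (the RDD names here are hypothetical, not from the slides):
val numbers = sc.parallelize(1 to 1000000) //create an RDD from a local range
val doubled = numbers.map(_ * 2) //transformation: recorded lazily, nothing runs yet
val total = doubled.reduce(_ + _) //action: triggers the actual computation
//collect(), count(), and take(n) are other common actions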
Executing Spark Code
Spark Shell – run Spark commands interactively via the Spark REPL
Spark Submit – execute Spark jobs (i.e., JAR files); you can build a JAR file in the Java IDE of your
choice – Eclipse, IntelliJ, etc.
Spark with Cassandra
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
(Diagram: a three-node Cassandra cluster, nodes A, B, and C, each co-located with a Spark Worker; a Spark Master and a Spark Client drive jobs against the cluster.)
Spark Cassandra Connector – open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
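A minimal wiring sketch, assuming an sbt build; the connector version below is an assumption from the 2016 timeframe, so match it to your Spark version using the project's compatibility table:
//build.sbt (version is an assumption; check the compatibility table)
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
//In application code, this import adds cassandraTable and saveToCassandra:
import com.datastax.spark.connector._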
ETL (Extract, Transform, Load)
(Pipeline: data sources such as a text file, a JDBC data source, Cassandra, or Hadoop feed the Extract step (Spark: create RDD); the Transform step (Spark: map function) reshapes the data; the Load step (Spark: save) writes it to Cassandra.)
ETL
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ //adds saveToCassandra to RDDs
//Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)
//EXTRACT: Using the SparkContext, read a text file and expose it as an RDD
val logfile = sc.textFile("/weblog.csv")
//TRANSFORM: split the CSV into fields and then put the fields into a tuple
val split = logfile.map { line =>
line.split(",")
}
val transformed = split.map { record =>
( record(0), record(1) )
}
//LOAD: write the tuple structure into Cassandra
transformed.saveToCassandra("test", "weblog")
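For the positional save to work, the target table's column order has to line up with the tuples. The slides don't show the schema; a plausible sketch of the assumed test.weblog table, created through the connector's CassandraConnector helper (the keyspace settings, types, and primary key here are guesses):
import com.datastax.spark.connector.cql.CassandraConnector
//Assumed schema; the (logtime, page) column order matches the tuples above
CassandraConnector(sparkConf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
  session.execute("CREATE TABLE IF NOT EXISTS test.weblog " +
    "(logtime timestamp, page text, PRIMARY KEY (logtime, page))")
}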
Data Migrations
(Pipeline: Cassandra is both source and target. Extract (Spark: create RDD from a table), Transform (Spark: map function), and Load (Spark: save) into a different table.)
Data Migrations
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ //adds cassandraTable and saveToCassandra
//Create a SparkConf and a SparkContext
val sparkConf = new SparkConf(true)
  .setAppName("MyEtlApp")
  .setMaster("spark://10.1.1.1:7077")
  .set("spark.cassandra.connection.host", "10.2.2.2")
val sc = new SparkContext(sparkConf)
//EXTRACT: Using the SparkContext, read a C* table and expose it as an RDD
val weblogRecords = sc.cassandraTable("test", "weblog").select("logtime", "page")
//TRANSFORM: pull fields out of the CassandraRow and put the fields into a tuple
//(per the select order above, index 0 = logtime, index 1 = page)
val transformed = weblogRecords.map { row =>
  ( row.getString(1), row.getLong(0) ) //(page, logtime), matching the weblog_bypage column order
}
//LOAD: write the tuple structure into Cassandra into a different table
transformed.saveToCassandra("test", "weblog_bypage")
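Positional tuples are fragile if the table changes; the connector can also map tuple elements to explicitly named columns. A short sketch using SomeColumns against the same assumed tables:
import com.datastax.spark.connector.SomeColumns
//Name the target columns instead of relying on table column order
transformed.saveToCassandra("test", "weblog_bypage", SomeColumns("page", "logtime"))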
Consistency Checking / Syncing Denormalized Data
(Pipeline: Extract, where Spark builds an RDD of records present in the Base Table but missing from the denormalized tables (DenormalizedTable1, DenormalizedTable2); Transform, where a Spark map function reshapes them; Load, where Spark saves the missing records back to Cassandra.)
Consistency Checking / Syncing Denormalized Data
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
//Find base-table rows with no match in the denormalized table
val query1 = """
SELECT w1.logtime, w1.page
FROM test.weblog w1
LEFT JOIN test.weblog_bypage w2 ON w1.page = w2.page
WHERE w2.page IS NULL"""
val results1 = hc.sql(query1)
results1.collect.foreach(println) //initially, nothing is missing
//Write a new record to the base table only, putting the two tables out of sync
val newRecord = Array(("2016-05-17 2:00:00", "page6.html"))
val newRecordRdd = sc.parallelize(newRecord)
newRecordRdd.saveToCassandra("test", "weblog")
results1.collect.foreach(println) //re-running the query now reports the missing record
//Reshape the missing records and save them into the denormalized table
val transformed = results1.map { row =>
  ( row.getString(1), row.get(0) )
}
transformed.saveToCassandra("test", "weblog_bypage")
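Querying test.weblog by name from a HiveContext assumes an integrated environment such as DSE Analytics, where Cassandra tables are exposed to SQL automatically. With the open-source connector you would load and register the tables first; a sketch, assuming the 1.6-era DataFrame API:
//Load a Cassandra table as a DataFrame and register it for SQL queries
val weblog = hc.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "weblog"))
  .load()
weblog.registerTempTable("weblog")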
Analytics
//EXAMPLE of a JOIN
val query2 = """
SELECT w.page, w.logtime, p.owner
FROM test.weblog w
INNER JOIN test.webpage p ON w.page = p.page"""
val results2 = hc.sql(query2)
results2.collect.foreach(println)
//EXAMPLE of a GROUP BY
val query3 = """
SELECT w.page, COUNT(*) AS RecordCount
FROM test.weblog w
GROUP BY w.page
ORDER BY w.page"""
val results3 = hc.sql(query3)
results3.collect.foreach(println)
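The same queries can also be written with the DataFrame API instead of SQL strings. A sketch of the GROUP BY, assuming the weblog DataFrame registered in the earlier sketch:
//Equivalent of query3: one row per page with its record count
val pageCounts = weblog.groupBy("page").count().orderBy("page")
pageCounts.collect.foreach(println)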
Machine Learning
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
case class LabeledDocument(id: Long, text: String, label: Double)
case class DataDocument(id: Long, text: String)
lazy val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Load the training data
val modelTrainingRecords = sc.cassandraTable("test", "ml_training")
.select("id", "text", "label")
val labeledDocuments = modelTrainingRecords.map { record =>
LabeledDocument(record.getLong("id") , record.getString("text"), record.getDouble("label"))
}.toDF
Machine Learning
// Create the pipeline
val pipeline = {
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
}
// Fit the pipeline to training documents.
val model = pipeline.fit(labeledDocuments)
Machine Learning
// Load the data to run against the model
val modelTestRecords = sc.cassandraTable("test", "ml_text")
val dataDocuments = modelTestRecords.map { record => DataDocument(record.getLong(0), record.getString(1)) }.toDF
model.transform(dataDocuments)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
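To close the loop, the predictions could be written back to Cassandra. A sketch, assuming a hypothetical test.ml_predictions table with id and prediction columns:
import com.datastax.spark.connector._
//Score the documents and save (id, prediction) pairs to the assumed table
model.transform(dataDocuments)
  .select("id", "prediction")
  .map { case Row(id: Long, prediction: Double) => (id, prediction) }
  .saveToCassandra("test", "ml_predictions", SomeColumns("id", "prediction"))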
Resources
Spark
• Books
http://shop.oreilly.com/product/0636920028512.do
Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos)
https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books
http://www.scala-lang.org/documentation/books.html
Spark and Cassandra
• DataStax Academy
http://academy.datastax.com/
• Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark – Really Good!
• Tutorials
• Spark Cassandra Connector website – lots of good examples
https://github.com/datastax/spark-cassandra-connector
