Big Data Analytics mit
Spark & Cassandra_
JUG Stuttgart 01/2016
Matthias Niehoff
•Cassandra
•Spark
•Spark & Cassandra
•Spark Applications
•Spark Streaming
•Spark SQL
•Spark MLLib
Agenda_
2
Cassandra
3
•Distributed database
•Highly Available
•Linearly Scalable
•Multi Datacenter Support
•No Single Point Of Failure
•CQL Query Language
• Similar to SQL
• No joins or aggregates
•Eventual Consistency ("Tunable Consistency")
Cassandra_
4
Distributed Data Storage_
5
[Diagram: a ring of four nodes; the token ranges 1-25, 26-50, 51-75 and 76-0 are distributed across Node 1 to Node 4]
CQL - Query Language With Limitations_
6
SELECT	*	FROM	performer	WHERE	name	=	'ACDC'	
—>	ok	
SELECT	*	FROM	performer	WHERE	name	=	'ACDC'	and	country	=	
'Australia'	
—>	not	ok	
SELECT	country,	COUNT(*)	as	quantity	FROM	artists	GROUP	BY	
country	ORDER	BY	quantity	DESC	
—>	not	supported	
performer
name (PK)
genre
country
Spark
7
•Open Source & Apache project since 2010
•Data processing Framework
• Batch processing
• Stream processing
What Is Apache Spark_
8
•Fast
• up to 100 times faster than Hadoop
• a lot of in-memory processing
• scales linearly with additional nodes
•Easy
• Scala, Java and Python APIs
• clean code (e.g. with lambdas in Java 8)
• rich API: map, reduce, filter, groupBy, sort, union, join,
reduceByKey, groupByKey, sample, take, first, count
•Fault-Tolerant
• easily reproducible
Why Use Spark_
9
•RDDs – Resilient Distributed Datasets
• Read-only description of a collection of objects
• Distributed among the cluster (in memory or on disk)
• Determined through transformations
• Allows automatic rebuild on failure
•Operations
• Transformations (map,filter,reduce...) —> new RDD
• Actions (count, collect, save)
•Only Actions start processing!
Easily Reproducible?_
10
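Since only actions trigger processing, the laziness is easy to observe in the shell; a minimal sketch (values are made up):

val numbers = sc.parallelize(1 to 1000000) // builds only a description, no processing yet
val evens = numbers.filter(_ % 2 == 0)     // transformation: returns immediately
val count = evens.count()                  // action: only now does the cluster do any work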
•Partitions
• Describes the partitions (e.g. one per Cassandra partition)
•Dependencies
• Dependencies on parent RDDs
•Compute
• The function to compute the RDD's partitions
•(Optional) Partitioner
• How is the data partitioned? (hash, range, ..)
•(Optional) Preferred Location
• Where to get the data (e.g. list of Cassandra node IPs)
Properties Of An RDD_
11
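These properties can be inspected on any RDD; a small sketch, assuming a spark-shell session with a local README.md:

val rdd = sc.textFile("README.md")
rdd.partitions.length                      // number of partitions
rdd.dependencies                           // dependencies on parent RDDs
rdd.partitioner                            // None here, text files are not hash/range partitioned
rdd.preferredLocations(rdd.partitions(0))  // preferred nodes for the first partition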
RDD Example_
12
scala>	val	textFile	=	sc.textFile("README.md")	
textFile:	spark.RDD[String]	=	spark.MappedRDD@2ee9b6e3	
scala>	val	linesWithSpark	=	textFile.filter(line	=>	
line.contains("Spark"))	
linesWithSpark:	spark.RDD[String]	=	spark.FilteredRDD@7dd4af09	
scala>	linesWithSpark.count()		
res0:	Long	=	126
Reproduce RDDs Using A Tree_
13
[Diagram: a lineage tree rooted at the data source; transformations such as map(..), filter(..), union(..) and sample(..) derive rdd1-rdd6 from their parents, cache() marks an RDD for reuse, and actions such as count() produce the values val1-val3]
•Transformations
• map, flatMap
• sample, filter, distinct
• union, intersection, cartesian
•Actions
• reduce
• count
• collect, first, take
• saveAsTextFile
• foreach
Spark Transformations & Actions_
14
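A compact sketch combining the two kinds of operations listed above: the transformations stay lazy, the final action triggers the job (numbers are made up):

sc.parallelize(1 to 10)
  .filter(_ % 2 == 0) // transformation
  .map(_ * 10)        // transformation
  .reduce(_ + _)      // action: triggers execution, returns 300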
Run Spark In A Cluster_
15
•Memory
• A lot of data in memory
• More memory —> Less disk IO —> Faster processing
• Minimum 8 GB / Node
•Network
• Communication between Driver, Cluster Manager & Worker
• Important for reduce operations
• 10 Gigabit LAN or better
•CPU
• Less communication between threads
• Good to parallelize
• Minimum 8 – 16 Cores / Node
What About Hardware?_
16
•Master Web UI (8080)
How To Monitor? (1/3)_
17
•Worker Web UI (8081)
How To Monitor? (2/3)_
18
•Application Web UI (4040)
How To Monitor? (3/3)_
19
([atomic,collection,object]	,	[atomic,collection,object])	
val	fluege	=		
List(	("Thomas",	"Berlin"),("Mark",	"Paris"),("Thomas",	"Madrid"))	
val	pairRDD	=	sc.parallelize(fluege)	
pairRDD.filter(_._1	==	"Thomas")	
.collect	
.foreach(t	=>	println(t._1	+	"	flog	nach	"	+	t._2))	
Pair RDDs_
20
key (not unique) – value
•Parallelization!
• keys are used for partitioning
• pairs with different keys are distributed across the cluster
•Efficient processing of
• aggregate by key
• group by key
• sort by key
• joins and unions based on keys (see the sketch below)
Why Use Pair RDDs_
21
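A minimal sketch of such key-based processing, with made-up data:

val visits = sc.parallelize(Seq(("Thomas", 1), ("Mark", 1), ("Thomas", 1)))
visits.reduceByKey(_ + _) // aggregate by key: (Thomas,2), (Mark,1)
  .sortByKey()            // sort by key
  .collect()
  .foreach(println)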
RDD Dependencies_
22
"Narrow" (pipeline-able)
map, filter
union
join on co-partitioned data
RDD Dependencies_
23
"Wide" (shuffle)
groupBy on non-partitioned data
join on non-co-partitioned data
Spark Demo
24
Spark & Cassandra
25
Use Spark And Cassandra In A Cluster_
26
[Diagram: a Spark client talks to the Spark driver; the cluster consists of a Spark master and four nodes, each running a Cassandra node (C*) co-located with a Spark worker node (WN)]
Two Datacenter - Two Purposes_
27
[Diagram: two datacenters. DC1 - Online: four Cassandra nodes serving operational traffic. DC2 - Analytics: four Cassandra nodes, each co-located with a Spark worker node, plus the Spark master]
•Spark Cassandra Connector by Datastax
• https://github.com/datastax/spark-cassandra-connector
•Cassandra tables as Spark RDD (read & write)
•Mapping of C* tables and rows onto Java/Scala objects
•Server-side filtering ("where")
•Compatible with
• Spark ≥ 0.9
• Cassandra ≥ 2.0
•Clone & Compile with SBT or download at Maven Central
Connecting Spark With Cassandra_
28
•Start the shell
bin/spark-shell
  --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar
  --conf spark.cassandra.connection.host=localhost
•Import Cassandra classes
scala> import com.datastax.spark.connector._
Use The Connector In The Shell_
29
•Read complete table
val	movies	=	sc.cassandraTable("movie","movies")	
//	returns	CassandraRDD[CassandraRow]	
•Read selected columns
val	movies	=	sc.cassandraTable("movie","movies").select("title","year")	
•Filter rows
val movies = sc.cassandraTable("movie","movies").where("title = 'Die Hard'")
•Access Columns in Result Set
movies.collect.foreach(r	=>	println(r.get[String]("title")))	
Read A Cassandra Table_
30
Read As Tuple
val movies =
  sc.cassandraTable[(String,Int)]("movie","movies")
    .select("title","year")

val movies =
  sc.cassandraTable("movie","movies")
    .select("title","year")
    .as((_: String, _: Int))

// both result in a CassandraRDD[(String,Int)]
Read A Cassandra Table_
31
Read As Case Class
case	class	Movie(title:	String,	year:	Int)	
sc.cassandraTable[Movie]("movie","movies").select("title","year")	
sc.cassandraTable("movie","movies").select("title","year").as(Movie)	
Read A Cassandra Table_
32
•Every RDD can be saved
• Using Tuples
val tuples = sc.parallelize(Seq(("Hobbit",2012),("96 Hours",2008)))
tuples.saveToCassandra("movie","movies", SomeColumns("title","year"))
• Using Case Classes
case class Movie(title: String, year: Int)
val objects =
  sc.parallelize(Seq(Movie("Hobbit",2012),Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")
Write Table_
33
//	Load	and	format	as	Pair	RDD	
val	pairRDD	=	sc.cassandraTable("movie","director")	
.map(r	=>	(r.getString("country"),r))	
//	Directors	/	Country,	sorted	
pairRDD.mapValues(v	=>	1).reduceByKey(_+_)	
.sortBy(-_._2).collect.foreach(println)	
//	or,	unsorted	
pairRDD.countByKey().foreach(println)	
// All countries
pairRDD.keys
Pair RDDs With Cassandra_
34
Table: director – name text (PK), country text
•Joins can be expensive as they may require shuffling
val directors = sc.cassandraTable(..)
  .map(r => (r.getString("name"),r))
val movies = sc.cassandraTable()
  .map(r => (r.getString("director"),r))
movies.join(directors)
// RDD[(String, (CassandraRow, CassandraRow))]
Pair RDDs With Cassandra - Join_
35
Tables: director – name text (PK), country text; movie – title text (PK), director text
•Automatically on read
•Not automatically on write
• No shuffling Spark operations —> writes are local
• Shuffling Spark operations:
  • fan-out writes to Cassandra
  • repartitionByCassandraReplica("keyspace", "table") before write
•Joins with data locality
Using Data Locality With Cassandra_
36
sc.cassandraTable[CassandraRow](KEYSPACE, A)
  .repartitionByCassandraReplica(KEYSPACE, B)
  .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
  .on(SomeColumns("id"))
•cassandraCount()
• Utilizes Cassandra query
• vs load the table into memory and do a count
•spanBy(), spanByKey()
• group data by Cassandra partition key
• does not need shuffling
• should be preferred over groupBy/groupByKey
CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year, month, ts));
sc.cassandraTable("test",	"events")	
		.spanBy(row	=>	(row.getInt("year"),	row.getInt("month")))	
sc.cassandraTable("test",	"events")	
		.keyBy(row	=>	(row.getInt("year"),	row.getInt("month")))	
		.spanByKey	
Further Transformations & Actions_
37
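A one-line sketch of cassandraCount(), assuming the movie keyspace from the earlier slides: the counting is executed by Cassandra instead of materializing all rows in Spark first.

val count = sc.cassandraTable("movie", "movies").cassandraCount()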
Spark & Cassandra Demo
38
Create an Application
39
•Normal Scala Application
•SBT as build tool
•source in src/main/scala-2.10
•assembly.sbt in root and project directory
•build.sbt in root directory
•sbt assembly to build
Scala Application_
40
libraryDependencies	+=	"com.datastax.spark"	%	"spark-cassandra-connector"	%	"1.3.0"	
libraryDependencies	+=	"org.apache.spark"	%	"spark-core"	%	"1.3.1"	%	"provided"	
libraryDependencies	+=	"org.apache.spark"	%	"spark-mllib_2.10"	%	"1.3.1"	%	"provided"	
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" % "provided"
•Normal Java Application
•Java 8!
•MVN as build tool
•source in src/main/java
•in pom.xml
• dependencies (spark-core, spark-streaming, spark-mllib, spark-cassandra-connector)
• assembly-plugin or shade-plugin
•mvn clean install to build
Java Application_
41
•Special classes for Java
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Java")
  .set("spark.cassandra.connection.host", "127.0.0.1");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1L));
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
rdd.filter(e -> e % 2 == 0).foreach(System.out::println);
Java Specials_
42
•Special classes for Java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

CassandraTableScanJavaRDD<CassandraRow> table =
  javaFunctions(sc.sparkContext())
    .cassandraTable("keyspace", "table");

CassandraTableScanJavaRDD<Entity> entities =
  javaFunctions(sc.sparkContext())
    .cassandraTable("keyspace", "table", mapRowTo(Entity.class));

javaFunctions(someRDD)
  .writerBuilder("keyspace", "table", mapToRow(Entity.class))
  .saveToCassandra();
Java Specials - Cassandra_
43
Spark SQL
44
•SQL Queries with Spark (SQL & HiveQL)
• On structured data
• On DataFrame
• Every result of Spark SQL is a DataFrame
• All operations of the generic RDD available
•Supports (even on non-primary-key columns)
• Joins
• Union
• Group By
• Having
• Order By
Spark SQL_
45
val	sqlContext	=	new	SQLContext(sc)	
val	persons	=	sqlContext.jsonFile(path)	
//	Show	the	schema	
persons.printSchema()	
persons.registerTempTable("persons")	
val	adults	=		
sqlContext.sql("SELECT	name	FROM	persons	WHERE	age	>	18")	
adults.collect.foreach(println)	
Spark SQL - JSON Example_
46
{"name":"Michael"}	
{"name":"Jan",	"age":30}	
{"name":"Tim",	"age":17}
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
                     "FROM artists GROUP BY country " +
                     "ORDER BY anzahl DESC")
result.collect.foreach(println)
Spark SQL - Cassandra Example_
47
Spark SQL Demo
48
Spark Streaming
49
•Real Time Processing using micro batches
•Supported sources: TCP, S3, Kafka, Twitter,..
•Data as Discretized Stream (DStream)
•Same programming model as for batches
•All operations of the generic RDD, Spark SQL & MLLib available
•Stateful Operations & Sliding Windows
Stream Processing With Spark Streaming_
50
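A hedged sketch of a sliding window, assuming a DStream like the one created on the next slide: aggregate the events of the last 30 seconds, recomputed every 10 seconds.

val counts = stream.map(x => 1)
  .reduceByWindow(_ + _, Seconds(30), Seconds(10)) // window length, slide interval
counts.print()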
import	org.apache.spark.streaming._	
val	ssc	=	new	StreamingContext(sc,Seconds(1))	
val	stream	=	ssc.socketTextStream("127.0.0.1",9999)	
stream.map(x	=>	1).reduce(_	+	_).print()	
ssc.start()	
//	await	manual	termination	or	error	
ssc.awaitTermination()	
//	manual	termination	
ssc.stop()	
Spark Streaming - Example_
51
•Maintain State for each key in a DStream: updateStateByKey
Spark Streaming - Stateful Operations_
52
def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.size
  Some(newCount)
}
val countStream = stream.updateStateByKey[Int](updateAlbumCount _)
// stream is a DStream of pair RDDs
•One Receiver -> One Node
• Start more receivers and union them
val	numStreams	=	5	
val	kafkaStreams	=	(1	to	numStreams).map	{	i	=>	
KafkaUtils.createStream(...)	}	
val	unifiedStream	=	streamingContext.union(kafkaStreams)	
unifiedStream.print()	
•Received data will be split up into blocks
• 1 block => 1 task
• blocks per batch = batch interval / block interval
•Repartition data to distribute over the cluster (see the sketch below)
Spark Streaming - Parallelism_
53
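A one-line sketch of the repartitioning mentioned above, reusing unifiedStream; the partition count of 10 is an arbitrary example value:

val repartitioned = unifiedStream.repartition(10)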
Spark Streaming Demo
54
Spark MLLib
55
•Fully integrated in Spark
• Scalable
• Scala, Java & Python APIs
• Use with Spark Streaming & Spark SQL
•Packages various algorithms for machine learning
•Includes
• Clustering
• Classification
• Prediction
• Collaborative Filtering
•Still under development
• performance, algorithms
Spark MLLib_
56
MLLib Example - Clustering_
57
[Diagram: data points plotted by age and income – left: the raw set of data points, right: meaningful clusters]
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data
  .map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
  .cache()

// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)

// Evaluate clustering by computing the Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + SSE)
MLLib Example - Clustering (using KMeans)_
58
MLLib Example - Classification_
59
//	Load	training	data	in	LIBSVM	format.	
val	data	=		
MLUtils.loadLibSVMFile(sc,	"sample_libsvm_data.txt")	
//	Split	data	into	training	(60%)	and	test	(40%).	
val	splits	=	data.randomSplit(Array(0.6,	0.4),	seed	=	11L)	
val	training	=	splits(0).cache()	
val	test	=	splits(1)	
//	Run	training	algorithm	to	build	the	model	
val	numIterations	=	100	
val	model	=	SVMWithSGD.train(training,	numIterations)
MLLib Example - Classification (Linear SVM)_
61
//	Compute	raw	scores	on	the	test	set.	
val	scoreAndLabels	=	test.map	{	point	=>	
		val	score	=	model.predict(point.features)	
		(score,	point.label)	
}	
//	Get	evaluation	metrics.	
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val	auROC	=	metrics.areaUnderROC()	
println("Area	under	ROC	=	"	+	auROC)
MLLib Example - Classification (Linear SVM)_
62
MLLib Example - Collaborative Filtering_
63
// Load and parse the data (userid, itemid, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
MLLib Example - Collaborative Filtering using ALS_
64
// Evaluate the model on rating data
val usersProducts = ratings.map {
  case Rating(user, product, rate) => (user, product)
}
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPredictions = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPredictions.map {
  case ((user, product), (r1, r2)) => val err = r1 - r2; err * err
}.mean()
println("Mean Squared Error = " + MSE)
MLLib Example - Collaborative Filtering using ALS_
65
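The trained model can also score a single user/product pair; a minimal sketch with example IDs:

val predictedRating = model.predict(1, 2) // user 1, product 2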
Use Cases
66
•In particular for huge amounts of external data
•Support for CSV, TSV, XML, JSON and others
Use Cases for Spark and Cassandra_
67
Data Loading
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })
users.saveToCassandra("keyspace", "users")
Validate consistency in a Cassandra database (see the sketch below)
•syntactic
• Uniqueness (only relevant for columns not in the PK)
• Referential integrity
• Consistency of duplicated (denormalized) data
•semantic
• Business or application constraints
• e.g.: at least one genre per movie, a maximum of 10 tags per blog post
Use Cases for Spark and Cassandra_
68
Validation & Normalization
•Modelling, Mining, Transforming, ....
•Use Cases
• Recommendation
• Fraud Detection
• Link Analysis (Social Networks, Web)
• Advertising
• Data Stream Analytics (—> Spark Streaming)
• Machine Learning (—> Spark ML)
Use Cases for Spark and Cassandra_
69
Analyses (Joins, Transformations,..)
•Changes on existing tables
• New table required when changing primary key
• Otherwise changes could be performed in-place
•Creating new tables
• data derived from existing tables
• Support new queries
•Use the Cassandra connector in Spark
Use Cases for Spark and Cassandra_
70
Schema Migration
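A minimal sketch of such a migration, assuming a new table movies_by_year (a made-up name) whose primary key differs from the original: read the old table, reshape the rows, write into the new table.

sc.cassandraTable("movie", "movies")
  .map(r => (r.getString("title"), r.getInt("year")))
  .saveToCassandra("movie", "movies_by_year", SomeColumns("title", "year"))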
Thank you for your attention!
71
Questions?
Matthias Niehoff,
IT-Consultant
codecentric AG
Zeppelinstraße 2
76185 Karlsruhe, Germany
mobile: +49 (0) 172.1702676
matthias.niehoff@codecentric.de
www.codecentric.de
blog.codecentric.de
matthiasniehoff