Atlanta Spark User Meetup 09 22 2016

pipeline.io
After Dark 2.0: End-to-End, Real-time, Advanced Analytics and ML
Big Data Reference Pipeline
Atlanta Spark User Group
Sept 22, 2016
Thanks Emory Continuing Education!
Chris Fregly
Research Scientist @ PipelineIO
We’re Hiring - Only Nice People!
pipeline.io
advancedspark.com

Who Am I?
github.com/fluxcapacitor/pipeline
Meetup Founder
Advanced
Book Author
Advanced .

Who Was I?
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center

Advanced Spark and Tensorflow Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of Docker Image
with many Meetup Demos!!!
@ advancedspark.com
Meetup Goals
Dive deep into the code of Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share patterns and idioms of well-designed, distributed, big data processing systems

Atlanta Hadoop User Meetup Last Night
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

MLconf ATL Tomorrow

Current PipelineIO Research
Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing

PipelineIO Deliverables
100% Open Source!!
Github
https://github.com/fluxcapacitor/
DockerHub
https://hub.docker.com/r/fluxcapacitor
Workshop
http://pipeline.io

Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML
Generating Real-time Recommendations
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing

Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML

Audience Participation Required!
Audience Instructions
① Navigate to demo.pipeline.io
② Swipe software used in Production Only!
This is Totally Anonymous!!

Topics of This Talk (15-20 mins each)
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
③ Spark SQL

Mechanical Sympathy
“Hardware and software working together.”
- Martin Thompson
http://mechanical-sympathy.blogspot.com
“Whatever your data structure, my array will win.”
- Scott Meyers
Every C++ Book, basically

Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+)
Minimize Memory & GC
Maximize CPU Cache
100TB GraySort Challenge (Spark 1.1-1.2)
Saturate Network I/O
Saturate Disk I/O

CPU Cache Refresher
[Figure: CPU cache hierarchy of my laptop; the last-level cache is aka “LLC”]

CPU Cache Sympathy (AlphaSort paper)
Ptr: Pointer (4 bytes) = 4 bytes
Must dereference to compare key
Key + Ptr: Key (10 bytes) + Pointer (4 bytes) = 14 bytes
Pre-process & pull key from record
Key + Pad + Ptr: Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
CPU cache-line friendly!
Key-Prefix + Ptr: Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
2x CPU cache-line friendly!
Dereference full key only to resolve prefix duplicates
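The key-prefix layout above can be sketched in plain Scala. This is a minimal illustration, not Spark's or the AlphaSort paper's implementation: the names `KeyPrefixSort`, `prefixOf`, `compareKeys` and `sortedPointers` are hypothetical, the "pointer" is simply an index into an in-memory array, and each 8-byte entry is modeled as one `Long` (4-byte prefix in the high half, 4-byte pointer in the low half).

```scala
object KeyPrefixSort {
  // First 4 key bytes, packed big-endian into the low 32 bits of a Long.
  // Keys shorter than 4 bytes are zero-padded.
  def prefixOf(key: Array[Byte]): Long = {
    var p = 0L
    var i = 0
    while (i < 4) {
      val b = if (i < key.length) key(i) & 0xFF else 0
      p = (p << 8) | b
      i += 1
    }
    p
  }

  // Full unsigned, byte-wise comparison. Only needed when prefixes tie.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    var diff = 0
    while (diff == 0 && i < n) {
      diff = (a(i) & 0xFF) - (b(i) & 0xFF)
      i += 1
    }
    if (diff != 0) diff else a.length - b.length
  }

  // Returns record indices ("pointers") in sorted key order.
  def sortedPointers(records: Array[Array[Byte]]): Array[Int] = {
    // One 8-byte entry per record: (prefix << 32) | pointer.
    val entries: Array[Long] =
      Array.tabulate(records.length)(i => (prefixOf(records(i)) << 32) | i.toLong)
    def ptr(e: Long): Int = (e & 0xFFFFFFFFL).toInt
    entries.sortWith { (x, y) =>
      val px = x >>> 32
      val py = y >>> 32
      if (px != py) px < py // decided from the entry alone, no dereference
      else compareKeys(records(ptr(x)), records(ptr(y))) < 0 // prefix duplicate
    }.map(ptr)
  }
}
```

Most comparisons are settled by the prefix held inside the entry itself, so the full record is touched only for keys that share their first 4 bytes. Note that `sortWith` boxes the entries, so this sketch shows the comparison logic rather than the memory layout benefits.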

Sort Performance Comparison

Sequential vs Random Cache Misses

Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Results of Random vs. Sequential Sort
Naïve Random Access Pointer Sort (Ptr, Key): Must dereference to compare key
Cache Friendly Sequential Key/Pointer Sort (Key, Ptr): Pre-process & pull key from record
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
% Change: -35%, -90%, -68%, -26%, -55%

Demo!
Matrix Multiplication

CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)
Bad: inner loop walks down a column of matB (a different row on every step),
not using full CPU cache line,
ineffective pre-fetching

CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)
// Modify algo for Transpose B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k) // OLD: matB(k)(j)
Good: Full CPU Cache Line,
Effective Prefetching
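For reference, here is a self-contained sketch of both variants that can be run to confirm they produce the same result. The object and method names (`MatMul`, `naive`, `transpose`, `cacheFriendly`) are illustrative, not taken from the demo code, and a JVM `Array[Array[Double]]` stores each row as a separate array, so the cache behavior is only an approximation of a contiguous layout.

```scala
object MatMul {
  type Matrix = Array[Array[Double]]

  // Naive version: the inner loop reads b(k)(j), jumping to a new row of b each step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  def transpose(b: Matrix): Matrix = {
    val bt = Array.ofDim[Double](b(0).length, b.length)
    for (i <- b.indices; j <- b(0).indices) bt(j)(i) = b(i)(j)
    bt
  }

  // Cache-friendly version: after transposing b, both operands are read sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bt = transpose(b)
    val res = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices) {
      val rowA = a(i)
      val rowBT = bt(j)
      var sum = 0.0
      var k = 0
      while (k < rowA.length) { sum += rowA(k) * rowBT(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

The transpose costs one extra pass over B, which is paid back once the matrices are large enough that the naive inner loop misses the cache on most reads.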

Results Of Matrix Multiplication
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change: -96%, -93%, -93%, -70%, -53%, -63%, +8543%?

Demo!
Thread Synchronization

Thread and Context Switch Sympathy
Problem
Atomically increment 2 counters (each at different increments) by 1000’s of simultaneous threads
Possible Solutions
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?
Context Switches are Expen$ive!!

Synchronized Immutable Counters
case class Counters(left: Int, right: Int)
object SynchronizedImmutableCounters {
  var counters = new Counters(0,0)
  def getCounters(): Counters = {
    this.synchronized { counters }
  }
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      counters = new Counters(counters.left + leftIncrement,
                              counters.right + rightIncrement)
    }
  }
}
Locks whole outer object!!

Synchronized Mutable Counters
class MutableCounters(var left: Int, var right: Int) {
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {…}
  }
  def getCountersTuple(): (Int, Int) = {
    this.synchronized { (left, right) }
  }
}
object SynchronizedMutableCounters {
  val counters = new MutableCounters(0,0)
  …
  counters.increment(leftIncrement, rightIncrement)
}
Locks just MutableCounters
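A runnable sketch of the synchronized mutable approach follows. The body of `increment` is an assumption about what the elided slide code does (add each increment under the lock), and `CounterHarness.hammer` is a hypothetical helper added only to exercise the counters from several threads.

```scala
class MutableCounters(private var left: Int, private var right: Int) {
  // Both fields change under one lock, so readers never see a half-applied update.
  def increment(leftIncrement: Int, rightIncrement: Int): Unit =
    this.synchronized {
      left += leftIncrement
      right += rightIncrement
    }

  def getCountersTuple(): (Int, Int) = this.synchronized { (left, right) }
}

object CounterHarness {
  // Hypothetical helper: each thread calls increment(1, 2) perThread times.
  def hammer(counters: MutableCounters, threads: Int, perThread: Int): (Int, Int) = {
    val workers = (1 to threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var i = 0
          while (i < perThread) { counters.increment(1, 2); i += 1 }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    counters.getCountersTuple()
  }
}
```

The lock is held on the `MutableCounters` instance only, so unrelated members of an enclosing object are not blocked, but contended threads can still be parked and context-switched.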

Lock-Free AtomicReference Counters
import java.util.concurrent.atomic.AtomicReference
case class Counters(left: Int, right: Int)
object LockFreeAtomicReferenceCounters {
  val counters = new AtomicReference[Counters](new Counters(0,0))
  def getCounters(): Counters = counters.get()
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters: Counters = null
    var updatedCounters: Counters = null
    do {
      originalCounters = getCounters()
      updatedCounters = new Counters(originalCounters.left + leftIncrement,
                                     originalCounters.right + rightIncrement)
    } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}
Lock Free!!

Lock-Free AtomicLong Counters
import java.util.concurrent.atomic.AtomicLong
object LockFreeAtomicLongCounters {
  // a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
  val counters = new AtomicLong()
  …
  var originalCounters = 0L
  var updatedCounters = 0L
  do {
    originalCounters = counters.get()
    …
    // Store two 32-bit Int into one 64-bit Long
    // Use >>> 32 and << 32 to set and retrieve each Int from the Long
  }
  // Retry lock-free, optimistic compareAndSet() until AtomicLong updates
  while (!counters.compareAndSet(originalCounters, updatedCounters))
}
Lock Free!!
Q: Why not use @volatile long?
A: The JVM does not guarantee atomic updates of 64-bit longs and doubles unless they are volatile, and even a @volatile long only makes single reads and writes atomic; the read-modify-write increment still needs compareAndSet()
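The packing step elided in the slide could look like the sketch below. The class name `PackedCounters` and the helpers `pack`, `leftOf` and `rightOf` are hypothetical; which counter lives in the high half is an assumption, and overflow of either 32-bit counter is not handled.

```scala
import java.util.concurrent.atomic.AtomicLong

class PackedCounters {
  // High 32 bits hold the left counter, low 32 bits hold the right counter.
  private val counters = new AtomicLong(0L)

  private def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL)
  private def leftOf(packed: Long): Int = (packed >>> 32).toInt
  private def rightOf(packed: Long): Int = packed.toInt

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var done = false
    while (!done) {
      val original = counters.get()
      val updated = pack(leftOf(original) + leftIncrement,
                         rightOf(original) + rightIncrement)
      // Optimistic update: succeeds only if no other thread changed the value.
      done = counters.compareAndSet(original, updated)
    }
  }

  def get(): (Int, Int) = {
    val packed = counters.get()
    (leftOf(packed), rightOf(packed))
  }
}
```

Because both counters live in one 64-bit word, a single `compareAndSet` updates them together, and no new object is allocated per increment as in the `AtomicReference` variant.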

Results of Thread Synchronization
Immutable Case Class
case class Counters(left: Int, right: Int)
...
this.synchronized {
  counters = new Counters(counters.left + leftIncrement,
                          counters.right + rightIncrement)
}
Lock-Free AtomicLong
val counters = new AtomicLong()
…
do {
  …
} while (!counters.compareAndSet(originalCounters, updatedCounters))
perf stat --event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change: -64%, -46%, -17%, -33%, -31%, -32%, -33%, -27%

Profile Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad! I/O stalls, heavy CPU serialization, etc.

Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms
Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection
Reuse ByteArrays
In-place updates for aggregations
Maximize CPU Cache Effectiveness
8-byte alignment
AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation
Dynamic optimizations using entire query plan
Developer implements genCode() to create Java source code (String)

Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression
Spark 1.2 updates saturated Network, Disk I/O
10x increase in I/O throughput relative to CPU
More partitioning, pruning, and pushdown support
Newer columnar file formats help reduce I/O

Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap internally
In-place updates of serialized aggregation
No object creation on hot-path
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 steps to avoid single-key OOMs
① Hash-based (grouping) agg spills to disk if needed
② Sort-based agg performs external merge sort on spills

Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining.
UnsafeExternalSorter
In-place external sorting of spilled BytesToBytes data
UnsafeShuffleWriter
Supports merging compressed records (if compression CODEC supports it, e.g. LZF)
UnsafeInMemorySorter
In-place sorting of BytesToBytesMap data
RecordPointerAndKeyPrefix
AlphaSort-based, 8-byte aligned sort key (Ptr + Key-Prefix)
2x CPU Cache-line Friendly!
pipeline.io
Code Generation
Problem
Boxing creates excessive objects
Expression tree evaluations are costly
JVM can’t inline polymorphic impls
Lack of polymorphism == poor code design
Solution
Code generation enables inlining
Rewrite and optimize code using overall plan, 8-byte align
Defer source code generation to each operator, UDF, UDAF
Use Janino to compile generated source code -> bytecode
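The contrast between interpreting an expression tree and generating code for it can be shown with a toy model. Everything here (`CodeGenSketch`, `Expr`, `Col`, `Lit`, `Add`, `Mul`, `genFunction`) is a hypothetical stand-in and not Catalyst's actual classes or signatures; the generated string is Java-like source that a compiler such as Janino could turn into bytecode, but the sketch stops at producing the string.

```scala
object CodeGenSketch {
  sealed trait Expr {
    def eval(row: Array[Int]): Int // interpreted path: one virtual call per tree node
    def genCode(): String          // code-gen path: each node emits its own source fragment
  }
  final case class Col(index: Int) extends Expr {
    def eval(row: Array[Int]): Int = row(index)
    def genCode(): String = s"row[$index]"
  }
  final case class Lit(value: Int) extends Expr {
    def eval(row: Array[Int]): Int = value
    def genCode(): String = value.toString
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Int]): Int = l.eval(row) + r.eval(row)
    def genCode(): String = s"(${l.genCode()} + ${r.genCode()})"
  }
  final case class Mul(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Int]): Int = l.eval(row) * r.eval(row)
    def genCode(): String = s"(${l.genCode()} * ${r.genCode()})"
  }

  // Wraps the fragments into one flat method over primitives: no boxing, no tree walk.
  def genFunction(name: String, e: Expr): String =
    s"int $name(int[] row) { return ${e.genCode()}; }"
}
```

Once compiled, the generated method is a single straight-line expression that the JIT can inline, whereas `eval` dispatches through a polymorphic call at every node.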

Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy :)
SparkContext.requestExecutors() until max is reached
Scaling down is hard :(
SparkContext.killExecutors()
Lose RDD cache inside Executor JVM
Must rebuild active RDD partitions in another Executor JVM
Uses External Shuffle Service from Spark 1.1-1.2
If Executor JVM dies/restarts, shuffle keeps shufflin’!

“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{"action" : "CreateSubmissionRequest",
    "mainClass" : "org.apache.spark.examples.SparkPi",
    "sparkProperties" : {
      "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
      "spark.app.name" : "SparkPi",…
    }}'
Get Spark Job Status
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>

Outline
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
③ Spark SQL

Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Supports nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping
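Min/max chunk skipping can be illustrated without Parquet itself. The sketch below is a simplified in-memory model, assuming each chunk records the minimum and maximum of its values; `ChunkSkipping`, `Chunk`, `toChunks` and `greaterThan` are hypothetical names, not Parquet APIs.

```scala
object ChunkSkipping {
  // A chunk of column values plus the min/max statistics kept alongside it.
  final case class Chunk(min: Int, max: Int, values: Vector[Int])

  def toChunks(values: Seq[Int], chunkSize: Int): Vector[Chunk] =
    values.grouped(chunkSize).map(g => Chunk(g.min, g.max, g.toVector)).toVector

  // Evaluates `value > threshold`. Returns the matches and how many chunks were read.
  def greaterThan(chunks: Vector[Chunk], threshold: Int): (Vector[Int], Int) = {
    // A chunk whose max is not above the threshold cannot contain a match: skip it.
    val candidates = chunks.filter(_.max > threshold)
    (candidates.flatMap(_.values.filter(_ > threshold)), candidates.size)
  }
}
```

Skipping works best when the data is sorted or clustered on the filtered column, since that keeps each chunk's min/max range narrow.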

Partitions
Partition Based on Data Access Patterns
/genders.parquet/gender=M/…
/gender=F/… <-- Use Case: Access Users by Gender
/gender=U/…
Dynamic Partition Creation (Write)
Dynamically create partitions on write based on column (e.g. gender)
SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
DF: gendersDF.write.format("parquet").partitionBy("gender")
.save("/genders.parquet")
Partition Discovery (Read)
Dynamically infer partitions on read based on paths (e.g. /gender=F/…)
SQL: SELECT id FROM genders WHERE gender='F'
DF: sqlContext.read.format("parquet").load("/genders.parquet/")
.where("gender='F'").select($"id")
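Partition discovery and the pruning it enables can be sketched with plain path strings. This is a simplified model of the `key=value` directory convention, not Spark's implementation; `PartitionDiscovery`, `partitionValues` and `prune` are hypothetical names, and the `part-0.parquet` file names in the usage below are made up for illustration.

```scala
object PartitionDiscovery {
  // "/genders.parquet/gender=F/part-0.parquet" -> Map("gender" -> "F")
  def partitionValues(path: String): Map[String, String] =
    path.split("/").toList.collect {
      case segment if segment.contains("=") =>
        val Array(key, value) = segment.split("=", 2)
        key -> value
    }.toMap

  // Keep only the files whose partition directory matches the filter,
  // so the other directories are never opened.
  def prune(paths: Seq[String], column: String, value: String): Seq[String] =
    paths.filter(p => partitionValues(p).get(column).contains(value))
}
```

For example, `PartitionDiscovery.prune(paths, "gender", "F")` over the three gender directories returns only the files under `/gender=F/`.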

Pruning
Partition Pruning
Filter out rows by partition
SELECT id, gender FROM genders WHERE gender = 'F'
Column Pruning
Filter out columns by column filter
Extremely useful for columnar storage formats (Parquet)
Skip entire blocks of columns
SELECT id, gender FROM genders

Pushdowns
aka Predicate or Filter Pushdowns
A predicate is a function that returns true or false for a given row
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan
def buildScan(requiredColumns: Array[String],
filters: Array[Filter]): RDD[Row]
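To show what a source does with the two `buildScan` arguments, here is a dependency-free sketch. The `Filter` case classes below only mimic the names of Spark's filter classes, `InMemorySource` is hypothetical, and the scan returns a plain `Seq` of maps rather than an `RDD[Row]`.

```scala
object PushdownSketch {
  type Row = Map[String, Any]

  // Stand-ins that mirror the names of Spark's source filters; not the real classes.
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter
  final case class StringStartsWith(attribute: String, value: String) extends Filter

  class InMemorySource(rows: Seq[Row]) {
    private def matches(row: Row, f: Filter): Boolean = f match {
      case EqualTo(a, v) => row.get(a).contains(v)
      case GreaterThan(a, v) =>
        row.get(a).exists { case i: Int => i > v; case _ => false }
      case StringStartsWith(a, p) =>
        row.get(a).exists { case s: String => s.startsWith(p); case _ => false }
    }

    // Filters and column pruning are applied inside the source,
    // so only matching rows and requested columns leave it.
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Row] =
      rows
        .filter(r => filters.forall(matches(r, _)))
        .map(r => r.filter { case (k, _) => requiredColumns.contains(k) })
  }
}
```

A real source would translate the filters into its own query language (a SQL WHERE clause, a CQL predicate, an Elasticsearch query) instead of filtering in memory.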

Demo!
File Formats, Partitions, Pushdowns, and Joins

Predicate Pushdowns & Filter Collapsing
Unoptimized: 2 extra passes through the data after retrieval
Filter combining: Only 1 extra pass
Filter pushdown: No extra pass
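Filter combining can be sketched as a rewrite over a tiny logical plan. This is a toy model, not Catalyst: `FilterCollapsing`, `Plan`, `Scan`, `Filter`, `collapse`, `passes` and `execute` are hypothetical names, and rows are plain integers.

```scala
object FilterCollapsing {
  sealed trait Plan
  final case class Scan(rows: Vector[Int]) extends Plan
  final case class Filter(pred: Int => Boolean, child: Plan) extends Plan

  // Rewrite Filter(p1, Filter(p2, child)) into one Filter(p2 && p1, child).
  def collapse(plan: Plan): Plan = plan match {
    case Filter(p1, Filter(p2, child)) => collapse(Filter(x => p2(x) && p1(x), child))
    case Filter(p, child)              => Filter(p, collapse(child))
    case scan: Scan                    => scan
  }

  // Number of filter passes over the data that the plan performs.
  def passes(plan: Plan): Int = plan match {
    case Filter(_, child) => 1 + passes(child)
    case _: Scan          => 0
  }

  def execute(plan: Plan): Vector[Int] = plan match {
    case Scan(rows)       => rows
    case Filter(p, child) => execute(child).filter(p)
  }
}
```

Collapsing takes two stacked filters from two passes to one; pushing the combined predicate into the scan itself would remove the remaining pass.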

Join Between Partitioned & Unpartitioned

Join Between Partitioned & Partitioned

Broadcast Join vs. Normal Shuffle Join

Cartesian Join vs. Inner Join

Visualizing the Query Plan
Effectiveness of Filter
Cost-based Join Optimization
Similar to MapReduce Map-side Join & DistributedCache
Peak Memory for Joins and Aggs
UnsafeFixedWidthAggregationMap getPeakMemoryUsedBytes()

Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl)

Native Spark SQL Data Sources

JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, using the json() convenience method --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

ElasticSearch Data Source
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
"es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
.options(esConfig).save("<index>/<document-type>")

Cassandra Data Source
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Append)
.options(Map("keyspace"->"<keyspace>",
"table"->"<table>")).save(…)

Tips for Cassandra Analytics
By-pass Cassandra CQL “front door”
CQL Optimized for Transactions
Bulk read and write directly against SSTables
Check out Netflix OSS project “Aegisthus”
Cassandra becomes a first-class analytics option
Replicated analytics cluster no longer needed

Creating a Custom Data Source
① Study existing implementations
o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods
o.a.s.sql.sources.{BaseRelation,PrunedFilteredScan}
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation

Demo!
Create a Custom Data Source

Publishing Custom Data Sources
spark-packages.org

Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Every UDF must use Expressions and implement Expression.genCode() to participate in the fun
Lambdas (RDD or Dataset API) and sqlContext.udf.register() are not enough!!

Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring
② Extend and implement base trait
o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don’t forget about Python!
python/pyspark/sql/functions.py

Demo!
Creating a Custom UDF participating in Code Generation

Spark 1.6 and 2.0 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State

Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
Adaptive Hybrid Join: Broadcast Join (popular keys) + Shuffle Join (not-so-popular keys)

Adaptive Memory Management
Spark <1.6
Manually configure the split between 2 memory regions
Spark execution engine (shuffles, joins, sorts, aggs)
spark.shuffle.memoryFraction
RDD Data Cache
spark.storage.memoryFraction
Spark 1.6+
Unified memory regions
Dynamically expand/contract memory regions
Supports minimum for RDD storage (LRU Cache)

Metrics
Shows exact memory usage per operator & node
Helps debugging and identifying skew

Spark SQL API
Datasets: type-safe API (similar to RDDs) utilizing Tungsten
val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF("rating") // RDD-like API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct".desc)
Typed Aggregators used alongside UDFs and UDAFs
val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
def zero: Int = 0
def reduce(b: Int, a: Int) = b + a
def merge(b1: Int, b2: Int) = b1 + b2
def finish(b: Int) = b
}.toColumn
val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`

Spark Streaming State Management
New trackStateByKey() (released as mapWithState() in Spark 1.6)
Store deltas, compact later
More efficient per-key state update
Session TTL
Integrated A/B Testing (?!)
Show Failed Output in Admin UI
Better debugging

Thank You!!
Chris Fregly
(http://pipeline.io)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Contribute on Github!
Run All Demos in Docker
~6000 Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax
