pipeline.io
After Dark 2.0: End-to-End, Real-time, Advanced Analytics and ML
Big Data Reference Pipeline
Atlanta Spark User Group
Sept 22, 2016
Thanks Emory Continuing Education!
Chris Fregly
Research Scientist @ PipelineIO
We’re Hiring - Only Nice People!
pipeline.io
advancedspark.com
Who Am I?
Research Scientist @ PipelineIO
github.com/fluxcapacitor/pipeline
Meetup Founder
Advanced
Book Author
Advanced .
Who Was I?
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Advanced Spark and Tensorflow Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of Docker Image
with many Meetup Demos!!!
@ advancedspark.com
Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch, Tachyon, S3,
BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share patterns and idioms of well-designed,
distributed, big data processing systems
Atlanta Hadoop User Meetup Last Night
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
MLconf ATL Tomorrow
Current PipelineIO Research
Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing
PipelineIO Deliverables
100% Open Source!!
Github
https://github.com/fluxcapacitor/
DockerHub
https://hub.docker.com/r/fluxcapacitor
Workshop
http://pipeline.io
Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML
Generating Real-time Recommendations
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML
Audience Participation Required!
Audience Instructions
① Navigate to demo.pipeline.io
② Swipe software used in Production Only!
This is Totally Anonymous!!
(Image labels: You, Data Scientist)
Topics of This Talk (15-20 mins each)
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
Mechanical Sympathy
“Hardware and software working together.”
- Martin Thompson
http://mechanical-sympathy.blogspot.com
“Whatever your data structure, my array will win.”
- Scott Meyers
Every C++ Book, basically
Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+)
100TB GraySort Challenge (Spark 1.1-1.2)
Minimize Memory & GC
Maximize CPU Cache
Saturate Network I/O
Saturate Disk I/O
CPU Cache Refresher
(Figure: CPU cache hierarchy of "My Laptop"; the last-level cache is labeled aka “LLC”)
CPU Cache Sympathy (AlphaSort paper)
Ptr: Pointer (4 bytes) = 4 bytes
Must Dereference to Compare Key
Key + Ptr: Key (10 bytes) + Pointer (4 bytes) = 14 bytes
Pre-process & Pull Key From Record
Key + Pad + Ptr: Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
CPU Cache-line Friendly!
Key-Prefix + Ptr: Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
2x CPU Cache-line Friendly!
Dereference full key only to resolve prefix duplicates
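The key-prefix layout above can be sketched in plain Scala. This is a minimal illustration, not Spark's or AlphaSort's implementation: the `Record` type, the ASCII-only keys, and packing the 4-byte prefix plus a 4-byte array index (standing in for the pointer) into one `Long` are all assumptions made for the example.

```scala
object KeyPrefixSortSketch {
  // Hypothetical record: a sort key plus some payload we do not want to touch while sorting.
  final case class Record(key: String, payload: String)

  // Pack the first 4 key bytes (zero-padded) into the high 32 bits and the
  // record index (our stand-in for a pointer) into the low 32 bits.
  def pack(key: String, index: Int): Long = {
    val bytes = key.getBytes("US-ASCII") // assumption: ASCII keys
    var prefix = 0L
    var i = 0
    while (i < 4) {
      val b = if (i < bytes.length) bytes(i) & 0xFFL else 0L
      prefix = (prefix << 8) | b
      i += 1
    }
    (prefix << 32) | (index.toLong & 0xFFFFFFFFL)
  }

  def prefixOf(packed: Long): Long = packed >>> 32
  def indexOf(packed: Long): Int = packed.toInt

  def sortByKey(records: Array[Record]): Array[Record] = {
    // The sort works on a dense Array[Long]: 8 bytes per entry, sequential in memory.
    val packed = Array.tabulate(records.length)(i => pack(records(i).key, i))
    val sorted = packed.sortWith { (a, b) =>
      val pa = prefixOf(a)
      val pb = prefixOf(b)
      if (pa != pb) pa < pb
      // Dereference the full key only when the prefixes tie.
      else records(indexOf(a)).key.compareTo(records(indexOf(b)).key) < 0
    }
    sorted.map(p => records(indexOf(p)))
  }
}
```

Most comparisons are decided by the prefix alone, so the random-access dereference into `records` happens only for prefix duplicates.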
Sort Performance Comparison
Sequential vs Random Cache Misses
Demo!
Sorting
Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Results of Random vs. Sequential Sort
Naïve Random Access Pointer Sort (Ptr -> Key): Must Dereference to Compare Key
Cache Friendly Sequential Key/Pointer Sort (Key + Ptr): Pre-process & Pull Key From Record
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
% Change: -35%, -90%, -68%, -26%, -55%
Demo!
Matrix Multiplication
CPU Cache Naïve Matrix Multiplication
// Dot product of each row of A with each column of B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)
Bad: matB(k)(j) jumps to a new row on every step of k,
not using full CPU cache line,
ineffective pre-fetching
CPU Cache Friendly Matrix Multiplication
// Transpose B (matBT has numColsB rows and numRowsB columns)
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)
// Modify algo for Transpose B
// OLD: matB(k)(j)
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
Good: Full CPU Cache Line,
Effective Prefetching
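Both loop nests can be written out as a small, self-contained Scala sketch to confirm that the transposed variant computes the same product. The object and method names are illustrative only, and the sketch says nothing about the measured cache-miss numbers, which depend on hardware and matrix size.

```scala
object MatrixMultiplySketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), stepping to a different row array on every k.
  def multiplyNaive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // bt has one row per column of b.
  def transpose(b: Matrix): Matrix = {
    val bt = Array.ofDim[Double](b(0).length, b.length)
    for (i <- b.indices; j <- b(0).indices)
      bt(j)(i) = b(i)(j)
    bt
  }

  // Cache friendly: the inner loop reads a(i)(k) and bt(j)(k), both sequential in k.
  def multiplyTransposed(a: Matrix, b: Matrix): Matrix = {
    val bt = transpose(b)
    val res = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices; k <- bt(0).indices)
      res(i)(j) += a(i)(k) * bt(j)(k)
    res
  }
}
```

The transpose costs one extra pass over B, which is paid once and amortized over every row of A.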
Results Of Matrix Multiplication
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change: -96%, -93%, -93%, -70%, -53%, -63%, +8543%?
Demo!
Thread Synchronization
Thread and Context Switch Sympathy
Problem
Atomically Increment 2 Counters (each at different increments) by 1000’s of Simultaneous Threads
Possible Solutions
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?
Context Switches are Expen$ive!!
aka
“LLC”
Synchronized Immutable Counters
case class Counters(left: Int, right: Int)
object SynchronizedImmutableCounters {
  var counters = new Counters(0, 0)
  def getCounters(): Counters = {
    this.synchronized { counters }
  }
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      counters = new Counters(counters.left + leftIncrement,
                              counters.right + rightIncrement)
    }
  }
}
Locks whole outer object!!
Synchronized Mutable Counters
class MutableCounters(private var left: Int, private var right: Int) {
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      // Update both fields in place while holding this instance's lock
      left += leftIncrement
      right += rightIncrement
    }
  }
  def getCountersTuple(): (Int, Int) = {
    this.synchronized { (left, right) }
  }
}
object SynchronizedMutableCounters {
  val counters = new MutableCounters(0, 0)
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    counters.increment(leftIncrement, rightIncrement)
  }
}
Locks just MutableCounters
Lock-Free AtomicReference Counters
import java.util.concurrent.atomic.AtomicReference

case class Counters(left: Int, right: Int)
object LockFreeAtomicReferenceCounters {
  val counters = new AtomicReference[Counters](new Counters(0, 0))
  def getCounters(): Counters = counters.get()
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters: Counters = null
    var updatedCounters: Counters = null
    do {
      originalCounters = getCounters()
      updatedCounters = new Counters(originalCounters.left + leftIncrement,
                                     originalCounters.right + rightIncrement)
      // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
    } while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}
Lock Free!!
Lock-Free AtomicLong Counters
import java.util.concurrent.atomic.AtomicLong

object LockFreeAtomicLongCounters {
  // a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
  val counters = new AtomicLong()
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters = 0L
    var updatedCounters = 0L
    do {
      originalCounters = counters.get()
      // Store two 32-bit Int into one 64-bit Long
      // Use >>> 32 and << 32 to set and retrieve each Int from the Long
      // (the slide elides these lines; the packing below is a reconstruction)
      val left = (originalCounters >>> 32).toInt + leftIncrement
      val right = originalCounters.toInt + rightIncrement
      updatedCounters = (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL)
      // Retry lock-free, optimistic compareAndSet() until AtomicLong updates
    } while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}
Lock Free!!
Q: Why not use
@volatile long?
A: The JVM does not
guarantee atomic
updates of non-volatile 64-bit
longs and doubles, and even
with @volatile a read-modify-write
increment is not atomic
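A quick way to check a packed, lock-free counter is to hammer it from several threads and compare the totals. The sketch below is an assumption-laden illustration rather than the demo code from the talk: the object name, the `pack`/`leftOf`/`rightOf` helpers, and the fixed increments of 1 and 2 are all made up for the example, and it uses a tail-recursive retry instead of `do`/`while`.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.annotation.tailrec

object PackedCountersSketch {
  private val counters = new AtomicLong(0L)

  // High 32 bits hold the left counter, low 32 bits hold the right counter.
  def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL)
  def leftOf(packed: Long): Int = (packed >>> 32).toInt
  def rightOf(packed: Long): Int = packed.toInt

  // Optimistic retry: recompute from a fresh read whenever another thread won the CAS.
  @tailrec
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    val original = counters.get()
    val updated = pack(leftOf(original) + leftIncrement, rightOf(original) + rightIncrement)
    if (!counters.compareAndSet(original, updated)) increment(leftIncrement, rightIncrement)
  }

  // Reset, run numThreads threads that each increment by (1, 2), and return the totals.
  def runThreads(numThreads: Int, incrementsPerThread: Int): (Int, Int) = {
    counters.set(0L)
    val threads = (1 to numThreads).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = {
          var i = 0
          while (i < incrementsPerThread) { increment(1, 2); i += 1 }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    val packed = counters.get()
    (leftOf(packed), rightOf(packed))
  }
}
```

Because both counters live in one 64-bit word, a single `compareAndSet` updates them together and no thread ever blocks waiting on a lock.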
Results of Thread Synchronization
Immutable Case Class vs. Lock-Free AtomicLong
Immutable Case Class: this.synchronized { counters = new Counters(...) }
Lock-Free AtomicLong: do { ... } while (!counters.compareAndSet(originalCounters, updatedCounters))
perf stat --event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change: -64%, -46%, -17%, -33%, -31%, -32%, -33%, -27%
Profile Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+
(-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad!
I/O stalls, Heavy CPU serialization, etc
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms
Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection
Reuse ByteArrays
In-place updates for aggregations
Maximize CPU Cache Effectiveness
8-byte alignment
AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation
Dynamic optimizations using entire query plan
Developer implements genCode() to create Java source code (String)
Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression
Spark 1.2 updates saturated Network, Disk I/O
10x increase in I/O throughput relative to CPU
More partition, pruning, and pushdown support
Newer columnar file formats help reduce I/O
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap internally
In-place updates of serialized aggregation
No object creation on hot-path
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 steps to avoid single-key OOMs
① Hash-based (grouping) agg spills to disk if needed
② Sort-based agg performs external merge sort on spills
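The idea of in-place updates with no object creation on the hot path can be illustrated with a flat, fixed-width buffer. This is only a toy sketch: the class name, the two-slot (count, sum) layout, and the integer group ids are assumptions for the example, and it is not how Spark's UnsafeFixedWidthAggregationMap is actually laid out.

```scala
object InPlaceAggSketch {
  // Each group owns two adjacent Long slots in one flat array: (count, sum).
  final class FixedWidthAggBuffer(numGroups: Int) {
    private val data = new Array[Long](numGroups * 2)

    // Hot path: two in-place writes, no boxing and no allocation.
    def add(group: Int, value: Long): Unit = {
      val base = group * 2
      data(base) += 1
      data(base + 1) += value
    }

    def count(group: Int): Long = data(group * 2)
    def sum(group: Int): Long = data(group * 2 + 1)
  }
}
```

A typical use is `val buf = new InPlaceAggSketch.FixedWidthAggBuffer(2)` followed by repeated `buf.add(group, value)` calls, after which `count` and `sum` read the aggregates straight out of the array.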
Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat: SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining.
UnsafeExternalSorter: In-place external sorting of spilled BytesToBytes data
UnsafeShuffleWriter: Supports merging compressed records (if compression CODEC supports it, i.e., LZF)
UnsafeInMemorySorter: In-place sorting of BytesToBytesMap data
RecordPointerAndKeyPrefix: AlphaSort-based, 8-byte aligned sort key
Ptr + Key-Prefix: 2x CPU Cache-line Friendly!
Code Generation
Problem
Boxing creates excessive objects
Expression tree evaluations are costly
JVM can’t inline polymorphic impls
Lack of polymorphism == poor code design
Solution
Code generation enables inlining
Rewrite and optimize code using overall plan, 8-byte align
Defer source code generation to each operator, UDF, UDAF
Use Janino to compile generated source code -> bytecode
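The contrast between interpreting an expression tree and generating flat source for it can be sketched with a toy expression type. Everything here is hypothetical and far simpler than Catalyst: the `Expr` hierarchy, the `int[] row` input, and `genJavaMethod` are invented for the illustration, and the generated string is only printed or inspected, not compiled with Janino.

```scala
object CodeGenSketch {
  sealed trait Expr {
    def eval(row: Array[Int]): Int // interpreted: one virtual call per tree node
    def genCode: String            // generated: one flat source expression
  }
  final case class Col(index: Int) extends Expr {
    def eval(row: Array[Int]): Int = row(index)
    def genCode: String = s"row[$index]"
  }
  final case class Lit(value: Int) extends Expr {
    def eval(row: Array[Int]): Int = value
    def genCode: String = value.toString
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Int]): Int = left.eval(row) + right.eval(row)
    def genCode: String = s"(${left.genCode} + ${right.genCode})"
  }
  final case class Mul(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Int]): Int = left.eval(row) * right.eval(row)
    def genCode: String = s"(${left.genCode} * ${right.genCode})"
  }

  // Wrap the flat expression in a Java method body that a runtime compiler could compile.
  def genJavaMethod(name: String, expr: Expr): String =
    s"public int $name(int[] row) { return ${expr.genCode}; }"
}
```

The generated method contains no polymorphic calls and no boxing, which is what lets the JIT inline and optimize it as a single unit.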
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy :)
SparkContext.requestExecutors() until max is reached
Scaling down is hard :(
SparkContext.killExecutors()
Lose RDD cache inside Executor JVM
Must rebuild active RDD partitions in another Executor JVM
Uses External Shuffle Service from Spark 1.1-1.2
If Executor JVM dies/restarts, shuffle keeps shufflin’!
“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/create \
--header "Content-Type:application/json;charset=UTF-8" \
--data '{"action" : "CreateSubmissionRequest",
"mainClass" : "org.apache.spark.examples.SparkPi",
"sparkProperties" : {
"spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
"spark.app.name" : "SparkPi",
...
}}'
Get Spark Job Status
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
(the snitch)
Outline
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Supports nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping
Min/Max Heuristics
For Chunk Skipping
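Min/max chunk skipping can be sketched without Parquet at all: keep a (min, max) pair per chunk and consult it before touching the values. The `Chunk` type and `scanGreaterThan` below are hypothetical names for this illustration and do not mirror Parquet's actual row-group statistics API.

```scala
object ChunkSkippingSketch {
  // A chunk of column values plus the statistics written alongside it.
  final case class Chunk(min: Int, max: Int, values: Vector[Int])

  def makeChunk(values: Vector[Int]): Chunk = Chunk(values.min, values.max, values)

  // Returns the matching values and the number of chunks that actually had to be read.
  def scanGreaterThan(chunks: Seq[Chunk], threshold: Int): (Vector[Int], Int) = {
    // A chunk whose max is not above the threshold cannot contain a match, so skip it.
    val candidates = chunks.filter(_.max > threshold)
    val matches = candidates.flatMap(_.values.filter(_ > threshold)).toVector
    (matches, candidates.size)
  }
}
```

The skip decision uses only the statistics, so the cost of a skipped chunk is two comparisons rather than a read and decode.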
Partitions
Partition Based on Data Access Patterns
/genders.parquet/gender=M/...
/gender=F/... <-- Use Case: Access Users by Gender
/gender=U/...
Dynamic Partition Creation (Write)
Dynamically create partitions on write based on column (i.e., gender)
SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT ...
DF: gendersDF.write.format("parquet").partitionBy("gender")
.save("/genders.parquet")
Partition Discovery (Read)
Dynamically infer partitions on read based on paths (i.e., /gender=F/...)
SQL: SELECT id FROM genders WHERE gender='F'
DF: sqlContext.read.format("parquet").load("/genders.parquet/")
.where("gender='F'").select($"id")
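Partition discovery boils down to reading `column=value` segments out of directory paths, and partition pruning to dropping paths whose value cannot match. The sketch below shows that idea in plain Scala; the function names are hypothetical and the real Spark implementation also infers types and handles nesting and escaping.

```scala
object PartitionDiscoverySketch {
  // Extract every column=value segment from a path such as /genders.parquet/gender=F/
  def parsePartitions(path: String): Map[String, String] =
    path.split("/").iterator
      .filter(_.contains("="))
      .map { segment =>
        val Array(column, value) = segment.split("=", 2)
        column -> value
      }
      .toMap

  // Keep only the paths whose partition value matches; the rest are never opened.
  def prune(paths: Seq[String], column: String, value: String): Seq[String] =
    paths.filter(p => parsePartitions(p).get(column).contains(value))
}
```

With the three `gender=` directories shown above, a filter on `gender = 'F'` leaves a single directory to scan.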
Pruning
Partition Pruning
Filter out rows by partition
SELECT id, gender FROM genders WHERE gender = 'F'
Column Pruning
Filter out columns by column filter
Extremely useful for columnar storage formats (Parquet)
Skip entire blocks of columns
SELECT id, gender FROM genders
Pushdowns
aka Predicate or Filter Pushdowns
A predicate is a function that returns true or false for a given row
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan
def buildScan(requiredColumns: Array[String],
filters: Array[Filter]): RDD[Row]
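The shape of a `buildScan` that honors both column pruning and pushed-down filters can be sketched without a Spark dependency. The `Filter` classes, the `Row` alias, and `InMemorySource` below are stand-ins defined only for this example; they imitate the names in Spark's sources API but are not the real types, and the result is a plain `Seq` rather than an `RDD[Row]`.

```scala
object PushdownSketch {
  // Stand-ins for the filter classes a data source receives (hypothetical, not Spark's).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter

  type Row = Map[String, Any]

  final class InMemorySource(rows: Seq[Row]) {
    // Apply the filters inside the source, then return only the requested columns.
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Seq[Any]] =
      rows
        .filter(row => filters.forall(f => matches(row, f)))
        .map(row => requiredColumns.toSeq.map(row))

    private def matches(row: Row, filter: Filter): Boolean = filter match {
      case EqualTo(attribute, value) => row.get(attribute).contains(value)
      case GreaterThan(attribute, value) =>
        row.get(attribute) match {
          case Some(i: Int) => i > value
          case _            => false
        }
    }
  }
}
```

Rows rejected inside the source never cross into the engine, which is the whole point of a pushdown.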
Demo!
File Formats, Partitions, Pushdowns, and Joins
Predicate Pushdowns & Filter Collapsing
No pushdown: 2 extra passes through the data after retrieval
Filter combining: Only 1 extra pass
Filter pushdown: No extra pass
Join Between Partitioned & Unpartitioned
Join Between Partitioned & Partitioned
Broadcast Join vs. Normal Shuffle Join
Cartesian Join vs. Inner Join
Visualizing the Query Plan
Effectiveness of Filter
Cost-based Join Optimization
Similar to MapReduce Map-side Join & DistributedCache
Peak Memory for Joins and Aggs
UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl)
Native Spark SQL Data Sources
JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2") // json() convenience method
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")
Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ElasticSearch Data Source
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
"es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
.options(esConfig).save("<index>/<document-type>")
Cassandra Data Source
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Append)
.options(Map("keyspace"->"<keyspace>",
"table"->"<table>")).save()
Tips for Cassandra Analytics
By-pass Cassandra CQL “front door”
CQL Optimized for Transactions
Bulk read and write directly against SSTables
Check out Netflix OSS project “Aegisthus”
Cassandra becomes a first-class analytics option
Replicated analytics cluster no longer needed
Creating a Custom Data Source
① Study existing implementations
o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods
o.a.s.sql.sources.{BaseRelation,PrunedFilteredScan}
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
Demo!
Create a Custom Data Source
Publishing Custom Data Sources
spark-packages.org
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Every UDF must use Expressions and
implement Expression.genCode()
to participate in the fun
Lambdas (RDD or Dataset API)
and sqlContext.udf.register()
are not enough!!
Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring
② Extend and implement base trait
o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don’t forget about Python!
python/pyspark/sql/functions.py
Demo!
Creating a Custom UDF participating in Code Generation
Spark 1.6 and 2.0 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
Adaptive Hybrid Join: Broadcast Join (popular keys) + Shuffle Join (not-so-popular keys)
Adaptive Memory Management
Spark <1.6
Manually configure the split between 2 memory regions
Spark execution engine (shuffles, joins, sorts, aggs)
spark.shuffle.memoryFraction
RDD Data Cache
spark.storage.memoryFraction
Spark 1.6+
Unified memory regions
Dynamically expand/contract memory regions
Supports minimum for RDD storage (LRU Cache)
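The borrow-and-evict behavior of a unified region can be modeled with a toy pool. This is a deliberate simplification under stated assumptions: the `UnifiedPool` class, its byte accounting, and the single `minStorage` floor are invented for the illustration and do not reproduce the real memory manager's policies (for example, it never blocks and never spills).

```scala
object UnifiedMemorySketch {
  final class UnifiedPool(total: Long, minStorage: Long) {
    private var storageUsed = 0L
    private var executionUsed = 0L

    def storage: Long = storageUsed
    def execution: Long = executionUsed

    // Caching succeeds only if there is free space; it never evicts execution memory.
    def cache(bytes: Long): Boolean =
      if (storageUsed + executionUsed + bytes <= total) { storageUsed += bytes; true }
      else false

    // Execution may evict cached blocks, but only down to the protected storage minimum.
    def acquireExecution(bytes: Long): Long = {
      val free = total - storageUsed - executionUsed
      if (free < bytes) {
        val evictable = math.max(0L, storageUsed - minStorage)
        storageUsed -= math.min(evictable, bytes - free)
      }
      val granted = math.min(bytes, total - storageUsed - executionUsed)
      executionUsed += granted
      granted
    }
  }
}
```

The asymmetry is the design point: cached data can be recomputed, while execution memory in use by a running task cannot simply be taken away.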
Metrics
Shows exact memory usage per operator & node
Helps debugging and identifying skew
Spark SQL API
Datasets type safe API (similar to RDDs) utilizing Tungsten
val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF("rating") // typed (RDD-like) API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct".desc)
Typed Aggregators used alongside UDFs and UDAFs
val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
def zero: Int = 0
def reduce(b: Int, a: Int) = b + a
def merge(b1: Int, b2: Int) = b1 + b2
def finish(b: Int) = b
}.toColumn
val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`
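The four methods of a typed aggregator (zero, reduce, merge, finish) are easier to follow when run by hand over a couple of partitions. The trait and `run` helper below are a plain-Scala stand-in written for this illustration; `SimpleAggregator` is a hypothetical name and not Spark's `Aggregator` class, and partitions are modeled as nested `Seq`s.

```scala
object AggregatorSketch {
  // Same four-method contract as the simpleSum example, minus anything Spark-specific.
  trait SimpleAggregator[IN, BUF, OUT] {
    def zero: BUF
    def reduce(b: BUF, a: IN): BUF
    def merge(b1: BUF, b2: BUF): BUF
    def finish(b: BUF): OUT
  }

  // reduce folds values within each partition; merge combines the per-partition buffers.
  def run[IN, BUF, OUT](agg: SimpleAggregator[IN, BUF, OUT], partitions: Seq[Seq[IN]]): OUT = {
    val partials = partitions.map(p => p.foldLeft(agg.zero)((b, a) => agg.reduce(b, a)))
    agg.finish(partials.foldLeft(agg.zero)((b1, b2) => agg.merge(b1, b2)))
  }

  val simpleSum: SimpleAggregator[Int, Int, Int] = new SimpleAggregator[Int, Int, Int] {
    def zero: Int = 0
    def reduce(b: Int, a: Int): Int = b + a
    def merge(b1: Int, b2: Int): Int = b1 + b2
    def finish(b: Int): Int = b
  }
}
```

Calling `AggregatorSketch.run(AggregatorSketch.simpleSum, Seq(Seq(1, 2), Seq(3, 4)))` reduces each partition to a partial sum and then merges the partials.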
Spark Streaming State Management
New trackStateByKey() (released as mapWithState() in Spark 1.6)
Store deltas, compact later
More efficient per-key state update
Session TTL
Integrated A/B Testing (?!)
Show Failed Output in Admin UI
Better debugging
Thank You!!
Chris Fregly
Research Scientist @ PipelineIO
(http://pipeline.io)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Contribute on Github!
Run All Demos in Docker
~6000 Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax

More Related Content

PDF
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
PDF
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
PDF
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
PDF
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
PDF
Advanced Spark and TensorFlow Meetup May 26, 2016
PDF
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
PDF
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
PDF
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Advanced Spark and TensorFlow Meetup May 26, 2016
High Performance TensorFlow in Production - Big Data Spain - Madrid - Nov 15 ...
Optimizing, Profiling, and Deploying TensorFlow AI Models with GPUs - San Fra...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...

What's hot (20)

PDF
The OMR GC talk - Ruby Kaigi 2015
PDF
London Spark Meetup Project Tungsten Oct 12 2015
PDF
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
PDF
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
PPTX
Apache Storm 0.9 basic training - Verisign
PPTX
Inferno Scalable Deep Learning on Spark
PDF
Unit testing of spark applications
PDF
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
PDF
Big Data Beyond the JVM - Strata San Jose 2018
PDF
Python VS GO
PDF
Accelerating Big Data beyond the JVM - Fosdem 2018
PPTX
Scaling Apache Storm (Hadoop Summit 2015)
PPTX
JVM and OS Tuning for accelerating Spark application
PDF
On heap cache vs off-heap cache
PDF
Demystifying DataFrame and Dataset
PDF
Getting The Best Performance With PySpark
PPTX
Introduction to Storm
PDF
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
PDF
Spark summit2014 techtalk - testing spark
PDF
Sparkly Notebook: Interactive Analysis and Visualization with Spark
The OMR GC talk - Ruby Kaigi 2015
London Spark Meetup Project Tungsten Oct 12 2015
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
Apache Storm 0.9 basic training - Verisign
Inferno Scalable Deep Learning on Spark
Unit testing of spark applications
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Big Data Beyond the JVM - Strata San Jose 2018
Python VS GO
Accelerating Big Data beyond the JVM - Fosdem 2018
Scaling Apache Storm (Hadoop Summit 2015)
JVM and OS Tuning for accelerating Spark application
On heap cache vs off-heap cache
Demystifying DataFrame and Dataset
Getting The Best Performance With PySpark
Introduction to Storm
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark summit2014 techtalk - testing spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Ad

Viewers also liked (20)

PDF
Atlanta MLconf Machine Learning Conference 09-23-2016
PDF
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
PDF
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
PDF
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
PDF
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
PDF
Atlanta Hadoop Users Meetup 09 21 2016
PDF
Boston Spark Meetup May 24, 2016
PDF
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
PDF
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
PDF
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
PDF
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
PDF
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
PDF
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
PDF
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
PDF
Spark Summit East NYC Meetup 02-16-2016
PDF
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
PDF
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
PDF
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
PPTX
Dublin Ireland Spark Meetup October 15, 2015
PDF
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Atlanta MLconf Machine Learning Conference 09-23-2016
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Atlanta Hadoop Users Meetup 09 21 2016
Boston Spark Meetup May 24, 2016
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
Spark Summit East NYC Meetup 02-16-2016
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Dublin Ireland Spark Meetup October 15, 2015
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Ad

Similar to Atlanta Spark User Meetup 09 22 2016 (20)

PDF
Protocol Independence
PPTX
The Other HPC: High Productivity Computing in Polystore Environments
PDF
Machine Learning with Apache Flink at Stockholm Machine Learning Group
PPTX
Apache Spark Structured Streaming + Apache Kafka = ♡
PPTX
Apache Flink Deep Dive
PPT
OpenMP And C++
PDF
Streaming 101: Hello World
PDF
Let's Get to the Rapids
PDF
Machine Learning on Code - SF meetup
PPTX
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
PDF
Golang Performance : microbenchmarks, profilers, and a war story
PPTX
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
PPTX
Apache Flink at Strata San Jose 2016
PPTX
Counting Elements in Streams
PDF
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
PDF
SnappyData at Spark Summit 2017
PPTX
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
PDF
Serial-War
PDF
Sperasoft‬ talks j point 2015
PPT
Inside LoLA - Experiences from building a state space tool for place transiti...
Protocol Independence
The Other HPC: High Productivity Computing in Polystore Environments
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Flink Deep Dive
OpenMP And C++
Streaming 101: Hello World
Let's Get to the Rapids
Machine Learning on Code - SF meetup
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Golang Performance : microbenchmarks, profilers, and a war story
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Apache Flink at Strata San Jose 2016
Counting Elements in Streams
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
SnappyData at Spark Summit 2017
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
Serial-War
Sperasoft‬ talks j point 2015
Inside LoLA - Experiences from building a state space tool for place transiti...

More from Chris Fregly (20)

PDF
AWS reInvent 2022 reCap AI/ML and Data
PDF
Pandas on AWS - Let me count the ways.pdf
PDF
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
PDF
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
PDF
Amazon reInvent 2020 Recap: AI and Machine Learning
PDF
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
PDF
Quantum Computing with Amazon Braket
PDF
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
PDF
AWS Re:Invent 2019 Re:Cap
PDF
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
PDF
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
PDF
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
PDF
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PDF
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PDF
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PDF
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
PDF
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
PDF
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PDF
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PDF
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
AWS reInvent 2022 reCap AI/ML and Data
Pandas on AWS - Let me count the ways.pdf
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Amazon reInvent 2020 Recap: AI and Machine Learning
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Quantum Computing with Amazon Braket
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
AWS Re:Invent 2019 Re:Cap
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...

Atlanta Spark User Meetup 09 22 2016

  • 1. pipeline.io After Dark 2.0 End-to-End, Real-time, Advanced Analytics and ML Big Data Reference Pipeline Atlanta Spark User Group Sept 22, 2016 Thanks Emory Continuing Education! Chris Fregly Research Scientist @ PipelineIO We’re Hiring - Only Nice People! pipeline.io advancedspark.com
  • 2. pipeline.io Who Am I? 2 Research Scientist @ PipelineIO github.com/fluxcapacitor/pipeline Meetup Founder Advanced Book Author Advanced .
  • 3. pipeline.io Who Was I? 3 Streaming Data Engineer Netflix Open Source Committer Data Solutions Engineer Apache Contributor Principal Data Solutions Engineer IBM Technology Center
  • 4. pipeline.io Advanced Spark and Tensorflow Meetup Meetup Metrics Top 5 Most Active Spark Meetup! 4000+ Members in just 1 year!! 6000+ Downloads of Docker Image with many Meetup Demos!!! @ advancedspark.com Meetup Goals Code dive deep into Spark and related open source code bases Study integrations with Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc Surface and share patterns and idioms of well-designed, distributed, big data processing systems
  • 5. pipeline.io Atlanta Hadoop User Meetup Last Night http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
  • 7. pipeline.io Current PipelineIO Research Model Deploying and Testing Model Scaling and Serving Online Model Training Dynamic Model Optimizing 7
  • 8. pipeline.io PipelineIO Deliverables 100% Open Source!! Github https://github.com/fluxcapacitor/ DockerHub https://hub.docker.com/r/fluxcapacitor Workshop http://pipeline.io
  • 9. pipeline.io Topics of This Talk (20-30 mins each) ① Spark Streaming and Spark ML Generating Real-time Recommendations ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing
  • 10. pipeline.io Live, Interactive Demo! Kafka, Cassandra, ElasticSearch, Redis, Spark ML
  • 11. pipeline.io Audience Participation Required! You -> Audience Instructions ① Navigate to demo.pipeline.io ② Swipe software used in Production Only! Data -> Scientist This is Totally Anonymous!!
  • 12. pipeline.io Topics of This Talk (15-20 mins each) ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing
  • 13. pipeline.io Mechanical Sympathy “Hardware and software working together.” - Martin Thompson http://mechanical-sympathy.blogspot.com “Whatever your data structure, my array will win.” - Scott Meyers Every C++ Book, basically
  • 14. pipeline.io Spark and Mechanical Sympathy 14 Project Tungsten (Spark 1.4-1.6+) 100TB GraySort Challenge (Spark 1.1-1.2) Minimize Memory & GC Maximize CPU Cache Saturate Network I/O Saturate Disk I/O
  • 16. pipeline.io CPU Cache Sympathy (AlphaSort paper) Pointer only: Pointer (4 bytes) = 4 bytes. Must dereference to compare key. Key + Pointer: Key (10 bytes) + Pointer (4 bytes) = 14 bytes. Pre-process & pull key from record. Key + Pad + Pointer: Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes. CPU cache-line friendly! Key-Prefix + Pointer: Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes. 2x CPU cache-line friendly! Dereference full key only to resolve prefix duplicates.
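
The key-prefix layout above can be sketched in plain Scala. This is a minimal illustration, not Spark's implementation: the names `KeyPrefixSort`, `prefixOf`, `compareKeys`, and `sortPointers` are hypothetical, and the "pointer" is assumed to be an index into an in-memory array of records. Each sort entry packs a 4-byte prefix and a 4-byte pointer into one 8-byte `Long`, and the full key is dereferenced only when two prefixes tie.

```scala
object KeyPrefixSort {
  // First 4 bytes of the key, big-endian, zero-padded if the key is shorter.
  def prefixOf(key: Array[Byte]): Int = {
    var prefix = 0
    var i = 0
    while (i < 4) {
      val b = if (i < key.length) key(i) & 0xFF else 0
      prefix = (prefix << 8) | b
      i += 1
    }
    prefix
  }

  // Unsigned lexicographic comparison of two full keys.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    var result = 0
    while (result == 0 && i < n) {
      result = (a(i) & 0xFF) - (b(i) & 0xFF)
      i += 1
    }
    if (result != 0) result else a.length - b.length
  }

  // Returns record indices ("pointers") in ascending key order.
  def sortPointers(records: Array[Array[Byte]]): Array[Int] = {
    // One 8-byte entry per record: prefix in the high 32 bits, pointer in the low 32 bits.
    val entries: Array[Long] = Array.tabulate(records.length) { i =>
      (prefixOf(records(i)).toLong << 32) | i.toLong
    }
    def lessThan(x: Long, y: Long): Boolean = {
      val byPrefix = java.lang.Integer.compareUnsigned((x >>> 32).toInt, (y >>> 32).toInt)
      if (byPrefix != 0) byPrefix < 0
      // Prefix tie: only now follow the pointers and compare the full keys.
      else compareKeys(records(x.toInt), records(y.toInt)) < 0
    }
    entries.sortWith(lessThan).map(_.toInt)
  }
}
```

Most comparisons touch only the packed entries, which sit contiguously in memory; the records themselves are read only to break prefix ties.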
  • 20. pipeline.io Instrumenting and Monitoring CPU Use Linux perf command! http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 21. pipeline.io Results of Random vs. Sequential Sort Naïve Random Access Pointer Sort (Ptr -> Key: Must Dereference to Compare Key) vs. Cache Friendly Sequential Key/Pointer Sort (Key + Ptr: Pre-process & Pull Key From Record) % Change: -35% -90% -68% -26% -55% perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
  • 23. pipeline.io CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res(i)(j) += matA(i)(k) * matB(k)(j) Bad: the inner loop walks down a column of matB, touching a different row on every step, so it does not use the full CPU cache line and pre-fetching is ineffective
  • 24. pipeline.io CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numColsB) for (j <- 0 until numRowsB) matBT(i)(j) = matB(j)(i) // Modify algo for Transpose B (OLD: matB(k)(j)) for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res(i)(j) += matA(i)(k) * matBT(j)(k) Good: Full CPU Cache Line, Effective Prefetching
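
A self-contained version of the two multiplication loops, for readers who want to run them side by side. This is a sketch under assumptions: the `MatMul` object and its method names are hypothetical, matrices are plain `Array[Array[Double]]`, and both inputs are assumed non-empty and rectangular.

```scala
object MatMul {
  type Matrix = Array[Array[Double]]

  // Naive version: the inner loop reads b(k)(j), jumping to a new row of B on every step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Transpose of B has numColsB rows and numRowsB columns.
  def transpose(b: Matrix): Matrix = {
    val bt = Array.ofDim[Double](b(0).length, b.length)
    for (i <- b.indices; j <- b(0).indices) bt(j)(i) = b(i)(j)
    bt
  }

  // Cache-friendly version: the inner loop reads a(i)(k) and bt(j)(k),
  // so both operands are scanned sequentially along a row.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bt = transpose(b)
    val res = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices; k <- bt(0).indices)
      res(i)(j) += a(i)(k) * bt(j)(k)
    res
  }
}
```

The transpose costs one extra pass over B, which the sequential inner loop is expected to repay for matrices larger than the CPU cache.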
  • 25. pipeline.io Results Of Matrix Multiplication Cache-Friendly Matrix Multiply vs. Naïve Matrix Multiply perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend % Change: -96% -93% -93% -70% -53% -63% +8543%?
  • 27. pipeline.io Thread and Context Switch Sympathy Problem Atomically Increment 2 Counters (each at different increments) by 1000’s of Simultaneous Threads Possible Solutions ① Synchronized Immutable ② Synchronized Mutable ③ AtomicReference CAS ④ Volatile? Context Switches are Expen$ive!! aka “LLC”
  • 28. pipeline.io Synchronized Immutable Counters case class Counters(left: Int, right: Int) object SynchronizedImmutableCounters { var counters = new Counters(0,0) def getCounters(): Counters = { this.synchronized { counters } } def increment(leftIncrement: Int, rightIncrement: Int): Unit = { this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } } } 28 Locks whole outer object!!
  • 29. pipeline.io Synchronized Mutable Counters class MutableCounters(left: Int, right: Int) { def increment(leftIncrement: Int, rightIncrement: Int): Unit={ this.synchronized {
} } def getCountersTuple(): (Int, Int) = { this.synchronized{ (counters.left, counters.right) } } } object SynchronizedMutableCounters { val counters = new MutableCounters(0,0) 
 def increment(leftIncrement: Int, rightIncrement: Int): Unit = { counters.increment(leftIncrement, rightIncrement) } 29 Locks just MutableCounters
  • 30. pipeline.io Lock-Free AtomicReference Counters case class Counters(left: Int, right: Int) object LockFreeAtomicReferenceCounters { val counters = new AtomicReference[Counters](new Counters(0,0)) def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters: Counters = null var updatedCounters: Counters = null do { originalCounters = counters.get() updatedCounters = new Counters(originalCounters.left + leftIncrement, originalCounters.right + rightIncrement) } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates while (!counters.compareAndSet(originalCounters, updatedCounters)) } } Lock Free!!
  • 31. pipeline.io Lock-Free AtomicLong Counters object LockFreeAtomicLongCounters { // a single Long (64-bit) will maintain 2 separate Ints (32-bits each) val counters = new AtomicLong() 
 def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters = 0L var updatedCounters = 0L do { originalCounters = counters.get() 
 // Store two 32-bit Int into one 64-bit Long // Use >>> 32 and << 32 to set and retrieve each Int from the Long } // Retry lock-free, optimistic compareAndSet() until AtomicLong updates while (!counters.compareAndSet(originalCounters, updatedCounters)) } Lock Free!! Q: Why not use @volatile long? A: @volatile makes each individual read or write of a 64-bit long atomic (the JVM does not guarantee that for plain longs and doubles), but the read-modify-write increment as a whole is still not atomic
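
The bit-packing step that the slide leaves as a comment could look like the following. This is a sketch, not the talk's actual code: `PackedCounters`, `pack`, `left`, and `right` are hypothetical names, and the retry loop is written as a plain `while` so it compiles on both Scala 2 and Scala 3.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  // A single 64-bit Long holds two 32-bit Ints: left in the high bits, right in the low bits.
  private val counters = new AtomicLong(0L)

  def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL)

  def left(packed: Long): Int = (packed >>> 32).toInt
  def right(packed: Long): Int = packed.toInt

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var done = false
    while (!done) {
      val original = counters.get()
      val updated = pack(left(original) + leftIncrement, right(original) + rightIncrement)
      // Optimistic, lock-free retry: succeeds only if no other thread changed the value.
      done = counters.compareAndSet(original, updated)
    }
  }

  def get(): (Int, Int) = {
    val snapshot = counters.get()
    (left(snapshot), right(snapshot))
  }
}
```

Because both counters live in one `Long`, a single `compareAndSet` updates them together, and `get()` always sees a consistent pair.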
  • 32. pipeline.io Results of Thread Synchronization Immutable Case Class vs. Lock-Free AtomicLong % Change: -64% -46% -17% -33% -31% -32% -33% -27% perf stat --event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend case class Counters(left: Int, right: Int) ... this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } val counters = new AtomicLong() 
 do { 
 } while (!counters.compareAndSet(originalCounters, updatedCounters))
  • 33. pipeline.io Profile Visualizations: Flame Graphs 33 Example: Spark Word Count Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer) Plateaus are Bad! I/O stalls, Heavy CPU serialization, etc
  • 34. pipeline.io Project Tungsten: CPU and Memory Create Custom Data Structures & Algorithms Operate on serialized and compressed ByteArrays! Minimize Garbage Collection Reuse ByteArrays In-place updates for aggregations Maximize CPU Cache Effectiveness 8-byte alignment AlphaSort-based Key-Prefix Utilize Catalyst Dynamic Code Generation Dynamic optimizations using entire query plan Developer implements genCode() to create Java source code (String)
  • 35. pipeline.io Why is CPU the Bottleneck? CPU for serialization, hashing, & compression Spark 1.2 updates saturated Network, Disk I/O 10x increase in I/O throughput relative to CPU More partition, pruning, and pushdown support Newer columnar file formats help reduce I/O 35
  • 36. pipeline.io Custom Data Structs & Algos: Aggs UnsafeFixedWidthAggregationMap Uses BytesToBytesMap internally In-place updates of serialized aggregation No object creation on hot-path TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 steps to avoid single-key OOMs ① Hash-based (grouping) agg spills to disk if needed ② Sort-based agg performs external merge sort on spills 36
  • 37. pipeline.io Custom Data Structures & Algorithms o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeExternalSorter UnsafeShuffleWriter UnsafeInMemorySorter RecordPointerAndKeyPrefix Ptr + Key-Prefix: 2x CPU Cache-line Friendly! SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]> Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. Supports merging compressed records (if compression CODEC supports it, e.g. LZF) In-place external sorting of spilled BytesToBytes data AlphaSort-based, 8-byte aligned sort key In-place sorting of BytesToBytesMap data
  • 38. pipeline.io Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Lack of polymorphism == poor code design Solution Code generation enables inlining Rewrite and optimize code using overall plan, 8-byte align Defer source code generation to each operator, UDF, UDAF Use Janino to compile generated source code -> bytecode
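
To make the problem and solution above concrete, here is a toy expression tree with both an interpreter and a source generator. It is only an illustration under assumptions: `Expr`, `Lit`, `Col`, `Add`, `Mul`, `eval`, and this `genCode` are hypothetical and are not Catalyst's classes or signatures. The point is that each node emits its own fragment of source, and the combined string is one flat expression a compiler such as Janino could turn into bytecode.

```scala
object CodegenSketch {
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Col(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr

  // Interpreted evaluation: one polymorphic dispatch per node, per row.
  def eval(e: Expr, row: Map[String, Int]): Int = e match {
    case Lit(v)    => v
    case Col(n)    => row(n)
    case Add(l, r) => eval(l, row) + eval(r, row)
    case Mul(l, r) => eval(l, row) * eval(r, row)
  }

  // Code generation: each node contributes source text; the tree walk happens
  // once per query instead of once per row.
  def genCode(e: Expr): String = e match {
    case Lit(v)    => v.toString
    case Col(n)    => n
    case Add(l, r) => s"(${genCode(l)} + ${genCode(r)})"
    case Mul(l, r) => s"(${genCode(l)} * ${genCode(r)})"
  }
}
```

For the tree `Add(Col("a"), Mul(Lit(2), Col("b")))`, `genCode` yields the string `(a + (2 * b))`, which has no virtual calls or boxing left in it.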
  • 39. pipeline.io Autoscaling Spark Workers (Spark 1.5+) Scaling up is easy :) SparkContext.addExecutors() until max is reached Scaling down is hard :( SparkContext.removeExecutors() Lose RDD cache inside Executor JVM Must rebuild active RDD partitions in another Executor JVM Uses External Shuffle Service from Spark 1.1-1.2 If Executor JVM dies/restarts, shuffle keeps shufflin’!
  • 40. pipeline.io “Hidden” Spark Submit REST API http://arturmkrtchyan.com/apache-spark-hidden-rest-api Submit Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{"action" : "CreateSubmissionRequest", "mainClass" : "org.apache.spark.examples.SparkPi", "sparkProperties" : { "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar", "spark.app.name" : "SparkPi",
 }}' Get Spark Job Status curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request> Kill Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request> (the snitch)
  • 41. pipeline.io Outline ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing
  • 42. pipeline.io Parquet Columnar File Format Based on Google Dremel paper ~2010 Collaboration with Twitter and Cloudera Columnar storage format for fast columnar aggs Supports evolving schema Supports pushdowns Supports nested partitions Tight compression Min/max heuristics enable file and chunk skipping
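
The min/max chunk-skipping idea can be sketched without Parquet itself. Everything here is an assumption for illustration: `Chunk`, `makeChunk`, and `scanEquals` are hypothetical names, and a chunk is modelled as an in-memory vector of integers with its statistics stored alongside.

```scala
object MinMaxSkipping {
  // Stand-in for a column chunk that stores min/max statistics next to its values.
  final case class Chunk(min: Int, max: Int, values: Vector[Int])

  // Assumes a non-empty chunk.
  def makeChunk(values: Vector[Int]): Chunk = Chunk(values.min, values.max, values)

  // Returns the matching values and how many chunks actually had to be read.
  def scanEquals(chunks: Seq[Chunk], target: Int): (Vector[Int], Int) = {
    // A chunk whose [min, max] range excludes the target cannot contain it, so skip it.
    val candidates = chunks.filter(c => c.min <= target && target <= c.max)
    val hits = candidates.flatMap(_.values.filter(_ == target)).toVector
    (hits, candidates.size)
  }
}
```

With three chunks covering 1 to 10, 11 to 20, and 21 to 30, a lookup for 15 reads only the middle chunk. Skipping works best when the data is sorted or clustered on the filtered column, so that chunk ranges overlap little.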
  • 43. pipeline.io Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/
 /gender=F/
 <-- Use Case: Access Users by Gender /gender=U/
 Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT 
 DF: gendersDF.write.format("parquet").partitionBy("gender") .save("/genders.parquet") Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie./gender=F/
) SQL: SELECT id FROM genders WHERE gender='F' DF: sqlContext.read.format("parquet").load("/genders.parquet/").where("gender='F'").select($"id")
  • 44. pipeline.io Pruning Partition Pruning Filter out rows by partition SELECT id, gender FROM genders WHERE gender = 'F' Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (Parquet) Skip entire blocks of columns SELECT id, gender FROM genders
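
Partition discovery and pruning both come down to reading `column=value` segments out of directory paths. The sketch below shows that idea in plain Scala; `PartitionPaths`, `partitionValues`, and `prune` are hypothetical helpers, not Spark functions, and the file names used in the usage note are made up.

```scala
object PartitionPaths {
  // Extract column=value pairs from a path such as /genders.parquet/gender=F/<file>.
  def partitionValues(path: String): Map[String, String] =
    path.split("/").filter(_.contains("=")).map { segment =>
      val i = segment.indexOf('=')
      segment.substring(0, i) -> segment.substring(i + 1)
    }.toMap

  // Keep only the paths whose partition value matches; the rest are never opened.
  def prune(paths: Seq[String], column: String, value: String): Seq[String] =
    paths.filter(p => partitionValues(p).get(column).contains(value))
}
```

Given the paths `/genders.parquet/gender=M/a.parquet`, `/genders.parquet/gender=F/b.parquet`, and `/genders.parquet/gender=U/c.parquet`, calling `prune(paths, "gender", "F")` keeps only the second one, which is why a filter on the partition column avoids reading the other directories at all.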
  • 45. pipeline.io Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given function Filters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] 45
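
A toy data source shows what a `buildScan(requiredColumns, filters)` contract buys. This is a plain-Scala sketch with no Spark dependency: `SimpleFilter`, `InMemorySource`, and the `Row` alias are hypothetical stand-ins, and the two filter case classes only borrow their names from the filters listed in this talk. The source applies the filters and drops unneeded columns itself, so fewer rows and fields are returned to the caller.

```scala
object PushdownSketch {
  type Row = Map[String, Any]

  sealed trait SimpleFilter
  final case class EqualTo(attribute: String, value: Any) extends SimpleFilter
  final case class GreaterThan(attribute: String, value: Int) extends SimpleFilter

  final class InMemorySource(rows: Seq[Row]) {
    // A predicate returns true or false for a given row.
    private def matches(row: Row, filter: SimpleFilter): Boolean = filter match {
      case EqualTo(attr, v) => row.get(attr).contains(v)
      case GreaterThan(attr, v) => row.get(attr) match {
        case Some(i: Int) => i > v
        case _            => false
      }
    }

    // Filters and column pruning are applied inside the source, before rows are returned.
    def buildScan(requiredColumns: Array[String], filters: Array[SimpleFilter]): Seq[Row] =
      rows
        .filter(row => filters.forall(f => matches(row, f)))
        .map(row => row.filter { case (name, _) => requiredColumns.contains(name) })
  }
}
```

A real source would translate the filters into its own query language (for example a WHERE clause) rather than filtering in memory.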
  • 47. pipeline.io Predicate Pushdowns & Filter Collapsing 47 Filter pushdown No extra pass Filter combining Only 1 extra pass 2 extra passes through the data after retrieval
  • 50. pipeline.io Broadcast Join vs. Normal Shuffle Join 50
  • 52. pipeline.io Visualizing the Query Plan 52 Effectiveness of Filter Cost-based Join Optimization Similar to MapReduce Map-side Join & DistributedCache Peak Memory for Joins and Aggs UnsafeFixedWidthAggregationMap getPeakMemoryUsedBytes()
  • 53. pipeline.io Data Source API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 53
  • 54. pipeline.io Native Spark SQL Data Sources 54
  • 55. pipeline.io JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or -- val ratingsDF = sqlContext.read.json ("file:/root/pipeline/datasets/dating/ratings.json.bz2") (json() convenience method) SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  • 56. pipeline.io Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 56
  • 57. pipeline.io ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>")
  • 59. pipeline.io Tips for Cassandra Analytics By-pass Cassandra CQL “front door” CQL Optimized for Transactions Bulk read and write directly against SSTables Check out Netflix OSS project “Aegisthus” Cassandra becomes a first-class analytics option Replicated analytics cluster no longer needed
  • 60. pipeline.io Creating a Custom Data Source ① Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ② Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilteredScan} Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
  • 62. pipeline.io Publishing Custom Data Sources 62 spark-packages.org
  • 63. pipeline.io Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in SPARK-8159, SPARK-9571 Every UDF must use Expressions and implement Expression.genCode() to participate in the fun Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
  • 64. pipeline.io Creating a Custom UDF with Code Gen ① Study existing implementations o.a.s.sql.catalyst.expressions.Substring ② Extend and implement base trait o.a.s.sql.catalyst.expressions.Expression.genCode ③ Don’t forget about Python! python.pyspark.sql.functions.py
  • 65. pipeline.io Demo! Creating a Custom UDF participating in Code Generation
  • 66. pipeline.io Spark 1.6 and 2.0 Improvements Adaptiveness, Metrics, Datasets, and Streaming State
  • 67. pipeline.io Adaptive Query Execution Adapt query execution using data from previous stages Dynamically choose spark.sql.shuffle.partitions (default 200) 67 Broadcast Join (popular keys) Shuffle Join (not-so-popular keys) Adaptive Hybrid Join
  • 68. pipeline.io Adaptive Memory Management Spark <1.6 Manual configure between 2 memory regions Spark execution engine (shuffles, joins, sorts, aggs) spark.shuffle.memoryFraction RDD Data Cache spark.storage.memoryFraction Spark 1.6+ Unified memory regions Dynamically expand/contract memory regions Supports minimum for RDD storage (LRU Cache) 68
  • 69. pipeline.io Metrics Shows exact memory usage per operator & node Helps debugging and identifying skew 69
  • 70. pipeline.io Spark SQL API Datasets type safe API (similar to RDDs) utilizing Tungsten val ds = sqlContext.read.text("ratings.csv").as[String] val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct" desc) Typed Aggregators used alongside UDFs and UDAFs val simpleSum = new Aggregator[Int, Int, Int] with Serializable { def zero: Int = 0 def reduce(b: Int, a: Int) = b + a def merge(b1: Int, b2: Int) = b1 + b2 def finish(b: Int) = b }.toColumn val sum = Seq(1,2,3,4).toDS().select(simpleSum) Query files directly without registerTempTable() %sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`
  • 71. pipeline.io Spark Streaming State Management New trackStateByKey() Store deltas, compact later More efficient per-key state update Session TTL Integrated A/B Testing (?!) Show Failed Output in Admin UI Better debugging 71
  • 72. pipeline.io Thank You!! Chris Fregly Research Scientist @ PipelineIO (http://pipeline.io) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Contribute on Github! Run All Demos in Docker ~6000 Docker Downloads!! Find me on LinkedIn, Twitter, Github, Email, Fax