Boosting Spark Performance
on Many-Core Machines
Qifan Pu
Sameer Agarwal (Databricks)
Reynold Xin (Databricks)
Ion Stoica
UC Berkeley
  
Me
Ph.D. student in AMPLab, advised by Prof. Ion Stoica
- Spark-related research projects
  - e.g., how to run Spark in a geo-distributed fashion
- Big data storage (e.g., Alluxio)
  - how to do memory management for multiple users
Intern at Databricks this past summer
- Spark SQL team: aggregates, shuffle
This project
Spark performance on many-core machines
- Ongoing research; feedback is welcome
- Focus of this talk:
  •  Understand shuffle performance
  •  Investigate and implement in-memory shuffle
Moving beyond research
- The hope is to get this into Spark (but no guarantee yet)
Why do we care about many-core machines?
[Figure: rdd.groupBy(…)… running across several machines]
Spark started as a cluster computing framework
Spark on cluster
•  One of Spark’s big successes has been high scalability
  •  Largest known cluster has 8000 nodes
  •  Winner of the 2014 Daytona GraySort Benchmark
•  Typical cluster sizes in practice:
“Our observation is that companies typically experiment with
cluster size of under 100 nodes and expand to 200 or more
nodes in the production stages.”
-- Susheel Kaushik (Senior Director at Pivotal)
Increasingly powerful single-node machines
•  More cores packed onto a single chip
  •  Intel Xeon Phi: 64-72 cores
  •  Berkeley FireBox Project: ~100 cores
•  Larger instances in the cloud
  •  Various 32-core instances on EC2, Azure & GCE
  •  EC2 X1 instance with 128 vCPUs, 2TB memory (May 2016)
Cost of many-core nodes
Instance     Memory (GB)  vCPUs  Hourly cost ($)  Cost/100GB ($)  Cost/8vCPU ($)
x1.32xlarge  1952         128    13.338           0.68            0.83
g2.2xlarge   15           8      0.65             4.33            0.65
i2.2xlarge   61           8      1.705            2.80            1.70
m3.2xlarge   30           8      0.532            1.78            0.53
c3.2xlarge   15           8      0.42             2.80            0.42
One x1.32xlarge instance is a small cluster
(with more memory and a fast inter-core network)
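The two normalized columns follow from the hourly price alone. A minimal Python sketch of that arithmetic, using only the figures listed for each instance; the dictionary layout and function names are illustrative and not from the talk:

```python
# Sizes and hourly prices copied from the instance comparison in this section.
INSTANCES = {
    "x1.32xlarge": {"memory_gb": 1952, "vcpus": 128, "hourly_usd": 13.338},
    "g2.2xlarge": {"memory_gb": 15, "vcpus": 8, "hourly_usd": 0.65},
    "i2.2xlarge": {"memory_gb": 61, "vcpus": 8, "hourly_usd": 1.705},
    "m3.2xlarge": {"memory_gb": 30, "vcpus": 8, "hourly_usd": 0.532},
    "c3.2xlarge": {"memory_gb": 15, "vcpus": 8, "hourly_usd": 0.42},
}


def cost_per_100gb(spec):
    """Hourly price normalized to 100 GB of memory."""
    return spec["hourly_usd"] / spec["memory_gb"] * 100


def cost_per_8vcpu(spec):
    """Hourly price normalized to 8 vCPUs."""
    return spec["hourly_usd"] / spec["vcpus"] * 8


for name, spec in INSTANCES.items():
    print(f"{name:12s} {cost_per_100gb(spec):6.2f} {cost_per_8vcpu(spec):6.2f}")
```

Per 100 GB of memory the x1.32xlarge is the cheapest entry, while per 8 vCPUs it costs more than the c3, m3, and g2 entries.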
Spark’s design was based on many nodes
•  Data communication (a.k.a. shuffle): focus of this talk
  •  Intermediate data is stored on disk
  •  Serialization/deserialization is needed across nodes
  •  Now: plenty of memory to spare, and shuffle is intra-node
•  Resource management: ongoing work
  •  Designed to handle a moderate amount of resources on each node
  •  Now: 1 executor for 100 cores + 2TB memory?
Can we improve shuffle on a single multi-core machine?
1. Memory is fast
2. We can use memory for shuffle
3. Therefore, shuffle will be fast
Will this “common sense” work?
Put in practice…
•  Spark: write to a file stream, and save all bytes to disk
•  Solution: …, …. to memory (bytes on heap)
spark.range(16M).repartition(1).selectExpr("sum(id)").collect()
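One way to picture Attempt 1 is as a change of sink behind an otherwise unchanged write path. The following is a stand-alone Python sketch, not Spark's shuffle writer: `write_shuffle_output` and the length-prefixed pickle framing are assumptions made purely for illustration.

```python
import io
import pickle
import tempfile


def write_shuffle_output(records, sink):
    """Serialize each record and write it to `sink`; return bytes written."""
    written = 0
    for record in records:
        payload = pickle.dumps(record)
        sink.write(len(payload).to_bytes(4, "big"))  # length prefix
        sink.write(payload)
        written += 4 + len(payload)
    sink.flush()
    return written


records = [(i % 4, i) for i in range(1000)]

# Baseline: the byte stream is backed by a file.
with tempfile.TemporaryFile() as disk_sink:
    disk_bytes = write_shuffle_output(records, disk_sink)
    disk_sink.seek(0)
    disk_content = disk_sink.read()

# Attempt 1: the same byte stream is kept on the heap instead.
heap_sink = io.BytesIO()
heap_bytes = write_shuffle_output(records, heap_sink)

print(disk_bytes == heap_bytes)
```

Only the sink differs between the two runs; the per-record serialization loop is identical.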
[Figure: bar chart of runtime (seconds), axis 0 to 5, for Vanilla Spark and Attempt 1]
Why zero improvement?
[Figure: profile of the job. flushBuffer() takes 4.81% of the time; the entire duration of the job is 100%.]
1. I/O throughput is not the bottleneck (in this job)
2. Buffer cache: disk-based shuffle is already exploiting memory
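The 4.81% figure already bounds what a faster sink can buy. A small Python sketch of that bound using Amdahl's law; applying the law here is an added illustration, and the function name is hypothetical:

```python
def max_speedup(fraction_removed):
    """Upper bound on whole-job speedup if a phase that takes
    `fraction_removed` of the runtime became free (Amdahl's law)."""
    return 1.0 / (1.0 - fraction_removed)


FLUSH_FRACTION = 0.0481  # share of job time spent in flushBuffer(), from the profile

bound = max_speedup(FLUSH_FRACTION)
print(f"best-case speedup: {bound:.3f}x")
```

Even if flushing cost nothing at all, the job could run only about 5% faster, which is hard to see in a bar chart of seconds.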
Understanding Shuffle Performance
[Diagram (mappers): Input, Generated Iterator (OP1, OP2, OP3), Queue, flush (x3), Filestreams]
[Diagram (reducers): Generated Iterator (OP1, OP2, OP3), Results or another shuffle]
More complications
•  Sort vs. hash-based shuffle
•  Spill when memory runs out
•  Clear data after shuffle
…
Case 1: thread 1 is slower than thread 2
[Diagram: Input, Generated Iterator (OP1, OP2, OP3), Queue, flush (x3), Filestreams, with Thread 1 and Thread 2 marked]
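The two-thread pipeline can be sketched as a producer and a consumer joined by a bounded queue. This Python sketch assumes that thread 1 is the side running the operators and thread 2 is the side flushing to the streams; the slides do not say which is which, and every name below is hypothetical.

```python
import io
import queue
import threading

NUM_STREAMS = 3
DONE = object()  # sentinel telling the writer to stop


def compute_side(rows, out_queue):
    # Thread 1 (assumed): run the operators and enqueue their output.
    for row in rows:
        out_queue.put(row * 2)  # stand-in for OP1 -> OP2 -> OP3
    out_queue.put(DONE)


def flush_side(in_queue, streams):
    # Thread 2 (assumed): drain the queue and write to one stream per partition.
    while True:
        item = in_queue.get()
        if item is DONE:
            break
        streams[item % NUM_STREAMS].write(f"{item}\n".encode())


def run_pipeline(rows):
    q = queue.Queue(maxsize=64)  # bounded, so the slower side sets the pace
    streams = [io.BytesIO() for _ in range(NUM_STREAMS)]
    t1 = threading.Thread(target=compute_side, args=(rows, q))
    t2 = threading.Thread(target=flush_side, args=(q, streams))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return streams


streams = run_pipeline(range(100))
print(sum(len(s.getvalue().splitlines()) for s in streams))
```

Under this reading, Case 1 is the situation where the queue stays close to empty: the writer is mostly waiting, so making the writes cheaper cannot shorten the job.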
  
Case 2: writing/reading file streams is slow
[Diagram: Input, Generated Iterator (OP1, OP2, OP3), Queue, flush (x3), Filestreams]
Case study 2
[Diagram (reducers): Generated Iterator (OP1, OP2, OP3), Results or another shuffle]
Case 3: I/O contention
[Diagram: Input, Generated Iterator (OP1, OP2, OP3), Queue, flush (x3), Filestreams]
Our first attempt should work in this case!
Can we improve case 2?
[Diagram: Input, Generated Iterator (OP1, OP2, OP3), Queue, flush (x3), Filestreams]
The previous example is case 2.
Attempt 2: get rid of ser/deser, etc.
•  Create NxM queues (N = mappers, M = reducers) and push each record into its corresponding queue
[Diagram: mappers on one side, reducers on the other]
•  No serialization
•  No copy (the data structure is shared by both sides)
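A minimal Python sketch of the NxM-queue idea, assuming records are (key, value) tuples and that the reducer is chosen by hashing the key; both choices, and all function names, are illustrative rather than taken from the talk.

```python
from collections import deque


def make_queues(num_mappers, num_reducers):
    """One queue per (mapper, reducer) pair: N x M queues in total."""
    return [[deque() for _ in range(num_reducers)] for _ in range(num_mappers)]


def map_side(mapper_id, records, queues):
    num_reducers = len(queues[mapper_id])
    for record in records:
        key = record[0]
        # Append a reference to the record: no serialization, no copy.
        queues[mapper_id][hash(key) % num_reducers].append(record)


def reduce_side(reducer_id, queues):
    """Drain this reducer's queue from every mapper."""
    for mapper_queues in queues:
        q = mapper_queues[reducer_id]
        while q:
            yield q.popleft()


queues = make_queues(2, 3)
inputs = [
    [(k, f"m0-{k}") for k in range(6)],
    [(k, f"m1-{k}") for k in range(6)],
]
for mapper_id, recs in enumerate(inputs):
    map_side(mapper_id, recs, queues)

received = {r: list(reduce_side(r, queues)) for r in range(3)}
print({r: len(recs) for r, recs in received.items()})
```

In this sketch every record remains a separate heap object until its reducer consumes it, so the number of live objects grows with the number of records.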
[Figure: bar chart of runtime (seconds), axis 0 to 18, for Vanilla Spark, Attempt 1, and Attempt 2. Attempt 2 is 3x worse.]
[Figure: bar chart of runtime (seconds), axis 0 to 18, for Vanilla Spark, Attempt 1, and Attempt 2, with each bar split into Processing (s) and GC (s)]
Attempt 3: avoiding GC
•  Instead of queues, copy records into memory pages
  •  Number of objects: ~records → ~pages
  •  NxM pages or, alternatively, one page per reducer
[Diagram: records packed back to back in memory pages: record1, record2 (spanning a boundary), record3, …]
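The drop from roughly one object per record to roughly one object per page can be sketched in Python with `bytearray` pages. This is an illustration under stated assumptions, not the talk's implementation: records are fixed-width (key, value) pairs, the page size is tiny, and a record never straddles two pages, unlike in the diagram.

```python
import struct

PAGE_SIZE = 64  # bytes; tiny on purpose, real pages would be far larger
RECORD = struct.Struct("<qq")  # assumed fixed-width record: (key, value)


class PagedBuffer:
    """Append-only record store backed by a list of fixed-size pages."""

    def __init__(self):
        self.pages = [bytearray(PAGE_SIZE)]
        self.used = 0  # bytes used in the last page

    def append(self, key, value):
        if self.used + RECORD.size > PAGE_SIZE:
            self.pages.append(bytearray(PAGE_SIZE))  # one new object per page
            self.used = 0
        RECORD.pack_into(self.pages[-1], self.used, key, value)
        self.used += RECORD.size

    def __iter__(self):
        full = PAGE_SIZE - PAGE_SIZE % RECORD.size  # usable bytes of a full page
        for i, page in enumerate(self.pages):
            end = self.used if i == len(self.pages) - 1 else full
            for offset in range(0, end, RECORD.size):
                yield RECORD.unpack_from(page, offset)


buf = PagedBuffer()
for i in range(10):
    buf.append(i, i * i)

print(len(buf.pages), "pages hold", sum(1 for _ in buf), "records")
```

Ten records end up in three page objects here; with larger pages the ratio of records to objects grows accordingly.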
  
Spark SQL
•  UnsafeRow (a buffer-backed row format):
  •  row.pointTo(buffer)
[Diagram: memory pages holding record1, record2, record3, …]
Instantaneous creation of unsafe rows by pointing to
different offsets in the page
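The idea of a row that merely points into a page can be sketched in Python. `RowView` below is a toy stand-in, not Spark's UnsafeRow: it assumes every row is two 64-bit integers, whereas Spark's actual row layout is richer.

```python
import struct

FIELDS = struct.Struct("<qq")  # assumed row layout: two 64-bit integers


class RowView:
    """Reusable view over a row stored inside a larger buffer."""

    def __init__(self):
        self.buffer = None
        self.offset = 0

    def point_to(self, buffer, offset):
        # Only two fields change: no bytes are copied and no new row is allocated.
        self.buffer = buffer
        self.offset = offset
        return self

    def get_long(self, ordinal):
        return struct.unpack_from("<q", self.buffer, self.offset + 8 * ordinal)[0]


# Pack three rows back to back into one page.
page = bytearray(FIELDS.size * 3)
for i in range(3):
    FIELDS.pack_into(page, i * FIELDS.size, i, i * 100)

# Walk the page with a single view object, moving it from offset to offset.
row = RowView()
values = [row.point_to(page, i * FIELDS.size).get_long(1) for i in range(3)]
print(values)
```

Reading a record is then a matter of moving the view to the right offset, so a reducer can scan a whole page while allocating a single row object.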
•  Attempt 3: copy records onto large memory pages
[Figure: bar chart of runtime (seconds), axis 0 to 18, for Vanilla Spark, Attempt 1, Attempt 2, and Attempt 3]
Improvement comes from avoiding ser/deser, copies, I/O contention, and an expensive code path
Consistent improvement with varying size
spark.range(N).repartition().selectExpr("sum(id)").collect()
Use N from 2^20 to 2^27
[Figure: nanoseconds/row, axis 0 to 300, for disk-based vs. in-memory shuffle, with N = 2^20, 2^21, …, 2^27]
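Reporting nanoseconds per row makes runs of different sizes directly comparable. A small Python sketch of how such a metric can be computed; the timing helper and the stand-in workload are hypothetical and involve no shuffle.

```python
import time


def ns_per_row(job, num_rows):
    """Run `job(num_rows)` once and return elapsed nanoseconds per row."""
    start = time.perf_counter_ns()
    job(num_rows)
    return (time.perf_counter_ns() - start) / num_rows


def sum_ids(n):
    # Stand-in workload: not a shuffle, just work proportional to n.
    return sum(range(n))


for exp in range(10, 15):
    n = 2 ** exp
    print(f"2^{exp}: {ns_per_row(sum_ids, n):.1f} ns/row")
```

A flat ns/row curve across sizes indicates that cost scales linearly with the input, which is what makes a constant gap between two curves a consistent improvement.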
TPC-DS performance (single node)
[Figure: query runtime (s), axis 0 to 60, per TPC-DS query, for In-memory Shuffle vs. Vanilla-Spark]
27 of 33 queries improve, with a median improvement of 31%
Extending to multiple nodes
•  Implementation
  •  All data goes to memory
  •  For remote transfer, copy from memory to network buffer
•  A more memory-preserving way…
  •  Local transfer goes to memory
  •  Remote transfer goes to disk
  •  Con 1: have to enforce stricter locality on reducers
  •  Con 2: cannot avoid I/O contention
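The two multi-node designs differ only in where a shuffle block is placed, depending on whether its reducer is on the same node. A Python sketch of that decision; the enum, function names, and node identifiers are hypothetical.

```python
from enum import Enum


class Sink(Enum):
    MEMORY = "memory"
    DISK = "disk"
    NETWORK_BUFFER = "network buffer"


def all_in_memory(mapper_node, reducer_node):
    # First design: every block is kept in memory; a remote block is
    # additionally copied from memory into a network buffer.
    steps = [Sink.MEMORY]
    if mapper_node != reducer_node:
        steps.append(Sink.NETWORK_BUFFER)
    return steps


def memory_preserving(mapper_node, reducer_node):
    # Second design: only node-local transfers use memory; remote ones use disk.
    return [Sink.MEMORY] if mapper_node == reducer_node else [Sink.DISK]


for policy in (all_in_memory, memory_preserving):
    local = [s.value for s in policy("node-a", "node-a")]
    remote = [s.value for s in policy("node-a", "node-b")]
    print(f"{policy.__name__}: local={local}, remote={remote}")
```

In the second policy the benefit depends on how many reducers run on the same node as their mappers, which is why it requires stricter locality.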
Simple shuffle job
spark.range(N).repartition().selectExpr("sum(id)").collect()
[Figure: ns/row, axis 0 to 600, split into Map Stage and Reduce Stage, for Vanilla-Spark and In-memory Shuffle at 1, 2, 3, and 4 nodes]
Map: consistent improvement
Reduce: improvement decreases with more nodes
TPC-DS performance (x1.32xlarge)
[Figure: query runtime (s), axis 0 to 80, for q13, q20, q18, q11, and q3, In-memory Shuffle vs. Vanilla-Spark]
•  SF=100
•  Top 5 queries picked from the single-node experiment
•  Best of 10 runs
Many other performance bottlenecks need investigation!
Summary
•  Spark on many-core requires many architectural changes
•  In-memory shuffle
  •  How to improve shuffle performance with memory
  •  31% median improvement over vanilla Spark on single-node TPC-DS
•  Ongoing research
  •  Identify other performance bottlenecks
Thank you
Qifan Pu
qifan@cs.berkeley.edu

Spark Summit EU talk by Qifan Pu
