Flash for Apache Spark Shuffle with Cosco
Aaron Gabriel Feldman
Software Engineer at Facebook
Agenda
1. Motivation
2. Intro to shuffle architecture
3. Flash
4. Hybrid RAM + flash techniques
5. Future improvements
6. Testing techniques
Why should you care?
▪ IO efficiency
▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data
▪ Compute efficiency
▪ Flash supports more workload with less Cosco hardware
▪ Query latency is less of a focus
▪ Cosco helps shuffle-heavy queries, but our focus so far has been batch workloads rather than query latency
▪ Flash unlocks new possibilities to improve query latency, but that is future work
▪ Techniques for development and analysis
▪ Hopefully, some of these are applicable outside of Cosco
Intro to Shuffle Architecture
Spark Shuffle Recap
[Diagram: Mappers (Map 0 … Map m) write partitioned Map Output Files to local disk or DFS; Reducers (Reduce 0 … Reduce r) each pull their partition from every map output file, sort by key, and consume it through an iterator]
▪ Map output files are written to local storage or a distributed filesystem
▪ Reducers pull from the map output files, sort by key, and feed an iterator
▪ Write amplification problem: write amplification is ~3x
▪ Small IOs problem: M x R IOs, with an average IO size of ~200 KiB
Later slides use a simplified version of this drawing.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
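To put the small-IOs problem in concrete terms, here is a back-of-the-envelope calculation in Python. The mapper/reducer counts and the total shuffle size are invented round numbers, not figures from the talk:

```python
# Back-of-the-envelope for the "small IOs" problem: every reducer reads a slice of
# every map output file, so a job performs roughly M x R reads. Numbers are made up.

num_mappers = 10_000
num_reducers = 10_000
shuffle_bytes = 20 * 1024**4            # assume ~20 TiB of shuffle data for the job

num_reads = num_mappers * num_reducers  # M x R
avg_read_bytes = shuffle_bytes / num_reads

print(f"reads: {num_reads:,}")                             # 100,000,000 reads
print(f"avg read size: {avg_read_bytes / 1024:.0f} KiB")   # ~215 KiB: small, seek-heavy IOs
```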
Cosco Shuffle for Spark
[Diagram: Mappers (Map 1 … Map m) stream output to Shuffle Services 1 … N (N = thousands); each service keeps an in-memory buffer per partition (Partition 1 … Partition r); full buffers are flushed as files (File 0, File 1, File 2, …) to a Distributed Filesystem (HDFS/Warm Storage); Reducers (Reduce 1 … Reduce r) read those files]
▪ Mappers stream their output to Cosco Shuffle Services, which buffer it in memory
▪ When a partition's buffer is full, the service sorts it (if required by the query) and flushes it to the DFS as that partition's next file
▪ Reducers do a streaming merge of the flushed files after the map stage completes
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
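To make the buffer-sort-flush flow above concrete, here is a minimal single-process sketch. It is not Cosco's implementation; `PartitionBuffer`, `BUFFER_LIMIT`, and the plain dict standing in for the DFS are illustrative simplifications, but the flow it walks through (append, sort if required, flush as the partition's next file, reducer-side streaming merge) follows the slides above:

```python
import heapq

BUFFER_LIMIT = 4  # records per partition buffer; real buffers are sized in MBs, not records

class PartitionBuffer:
    """In-memory buffer for one partition on one shuffle service (illustrative only)."""
    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.records = []
        self.file_number = 0  # which "File n" the next flush will produce

    def append(self, record, dfs, sort_required=True):
        self.records.append(record)
        if len(self.records) >= BUFFER_LIMIT:
            self.flush(dfs, sort_required)

    def flush(self, dfs, sort_required):
        if not self.records:
            return
        chunk = sorted(self.records) if sort_required else list(self.records)
        dfs.setdefault(self.partition_id, []).append(chunk)  # "File <n>" for this partition
        self.records = []
        self.file_number += 1

def reducer_streaming_merge(dfs, partition_id):
    """Reducer side: streaming merge of the sorted files of one partition."""
    return heapq.merge(*dfs.get(partition_id, []))

# Usage: mappers stream records for partition 0; the reducer merges after the map stage.
dfs = {}
buf = PartitionBuffer(partition_id=0)
for record in [("k3", 1), ("k1", 2), ("k2", 3), ("k1", 4), ("k9", 5), ("k0", 6)]:
    buf.append(record, dfs)
buf.flush(dfs, sort_required=True)            # end of map stage: flush the partial buffer
print(list(reducer_streaming_merge(dfs, 0)))  # partition 0's records in key order
```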
Replace DRAM with Flash for Buffering
Buffering Is Appending
[Diagram: Mappers (Map 1 … Map m) send packages to Shuffle Services 1 … N (N = thousands); each package is appended to its partition's buffer (e.g. Partition r)]
▪ Each package is a few 10s of KiB
▪ Buffering a partition is just appending package after package to its buffer
Replace DRAM with Flash for Buffering
[Diagram: the same buffering picture, but the partition buffers live on flash (packages of a few 10s of KiB are appended), and buffered data is read back to main memory for sorting]
▪ Simply buffer to flash instead of memory
▪ Appending is a friendly pattern for flash
▪ It minimizes flash write amplification, which minimizes wear on the drive
▪ Read back to main memory for sorting
▪ Flash write/read latency is negligible
▪ Generally non-blocking
▪ Latency is much less than buffering time
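Below is a minimal sketch of what "simply buffer to flash instead of memory" can look like, assuming a flash-backed directory with one append-only file per partition; the class, method, and path names are mine, not Cosco's:

```python
import os
import tempfile

class FlashPartitionBuffer:
    """Append-only, flash-backed buffer for one partition (illustrative, not Cosco's code).

    Writes are pure appends, which is friendly to flash; data is read back into
    main memory only when the buffer is sorted in preparation for flushing.
    """
    def __init__(self, partition_id, flash_dir):
        self.path = os.path.join(flash_dir, f"partition-{partition_id}.buf")
        self.file = open(self.path, "ab")  # append-only
        self.bytes_buffered = 0

    def append(self, package: bytes):
        # Each package is a few tens of KiB in practice.
        self.file.write(len(package).to_bytes(4, "big") + package)
        self.bytes_buffered += len(package)

    def read_back_and_sort(self):
        """Read the buffered packages back into main memory and sort them for flushing."""
        self.file.flush()
        packages = []
        with open(self.path, "rb") as f:
            while header := f.read(4):
                packages.append(f.read(int.from_bytes(header, "big")))
        return sorted(packages)

# Usage: append a few packages, then read back and sort before flushing to DFS.
with tempfile.TemporaryDirectory() as flash_dir:   # stand-in for a flash mount point
    buf = FlashPartitionBuffer(partition_id=7, flash_dir=flash_dir)
    for pkg in [b"zebra", b"apple", b"mango"]:
        buf.append(pkg)
    print(buf.read_back_and_sort())  # [b'apple', b'mango', b'zebra']
    buf.file.close()
```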
Example Rule of Thumb
Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
▪ Hypothetical example numbers
▪ Assume 1 GB Flash can endure ~10 GB of writes per day for the lifetime of the device
▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB Flash with write throughput at the endurance limit
▪ Then, you would be indifferent between consuming 1 GB DRAM vs ~100 GB/day Flash
▪ Notes
▪ These numbers chosen entirely because they are round -> Easier to illustrate math on slides
▪ DRAM consumes more power than Flash
Basic Evaluation
▪ Example Cosco cluster
▪ 10 nodes
▪ Each node uses 100 GB DRAM for buffering
▪ And has additional DRAM for sorting, RPCs, etc.
▪ So, 1 TB DRAM for buffering in total
▪ Again, numbers are chosen for illustration only
▪ Apply the example rule of thumb
▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day flash endurance
▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with Flash
▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering
▪ Nodes keep some DRAM for sorting, RPCs, etc.
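The rule-of-thumb arithmetic is easy to encode. Here is a small sketch using the talk's hypothetical round numbers; the helper name and the 80 TB/day example workload are mine:

```python
# Worked version of the hypothetical rule of thumb from the previous slides.
# All constants are the illustrative round numbers from the talk, not real hardware specs.

FLASH_ENDURANCE_GB_PER_DAY_PER_GB = 10   # 1 GB of flash endures ~10 GB of writes/day
FLASH_GB_EQUIVALENT_TO_1GB_DRAM = 10     # assumed indifference point: 1 GB DRAM ~ 10 GB flash

def flash_write_budget_gb_per_day(buffering_dram_gb: float) -> float:
    """GB/day of flash writes you could consume instead of this much buffering DRAM."""
    flash_gb = buffering_dram_gb * FLASH_GB_EQUIVALENT_TO_1GB_DRAM
    return flash_gb * FLASH_ENDURANCE_GB_PER_DAY_PER_GB

# Example cluster from the slide: 10 nodes x 100 GB of buffering DRAM = 1 TB.
cluster_dram_gb = 10 * 100
budget_gb = flash_write_budget_gb_per_day(cluster_dram_gb)   # 100,000 GB/day = 100 TB/day
daily_shuffle_gb = 80_000                                    # suppose the cluster shuffles 80 TB/day

print(f"flash write budget: {budget_gb / 1000:.0f} TB/day")
print("replace buffering DRAM with flash" if daily_shuffle_gb <= budget_gb
      else "keep DRAM (or go hybrid)")
```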
Basic Evaluation
Summary for cluster shuffling 100 TB/day
[Diagram: Before: Shuffle Services 1 through 10, each with a CPU, DRAM for sorting, RPCs, etc., and 100 GB of DRAM for buffering. After: the same ten Shuffle Services, each with a CPU, DRAM for sorting, RPCs, etc., and 1 TB of flash for buffering]
Hybrid Techniques for Efficiency
Two Hybrid Techniques
Two ways to use both DRAM and flash for buffering
1. Buffer in DRAM first, flush to flash only under memory pressure
2. Buffer fastest-filling partitions in DRAM, send slowest-filling partitions to flash
Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Chart: bytes buffered in a Cosco Shuffle Service over time. Buffering only in DRAM requires 1 TB of DRAM; buffering only in flash writes 100 TB/day to flash. A hybrid that buffers in DRAM and flash needs only 250 GB of DRAM and writes only 25 TB/day to flash]
Hybrid Technique #1
Buffer in DRAM first, flush to flash only under memory pressure
▪ Example: 25% RAM + 25% flash supports 100% throughput (250 GB DRAM, 25 TB/day written to flash)
▪ Spikier workload -> more win
▪ Safer to push the system to its limits
▪ Run out of memory -> immediate bad consequences
▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
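As a rough illustration of the "25% RAM + 25% flash" claim, the sketch below sizes DRAM well under the workload's peak and counts only the overflow as flash writes. The hourly series is invented and the accounting is deliberately crude:

```python
# Why a spiky workload lets a small DRAM buffer plus a modest flash write budget
# cover 100% of throughput. The hourly "bytes buffered" series below is made up.

buffered_tb_by_hour = [0.15, 0.2, 0.25, 1.0, 0.9, 0.25, 0.2, 0.15] * 3  # one spiky day, in TB

dram_capacity_tb = 0.25   # buffer in DRAM first ...
flash_written_tb = 0.0    # ... and absorb only the overflow on flash

for buffered_tb in buffered_tb_by_hour:
    overflow_tb = max(0.0, buffered_tb - dram_capacity_tb)
    flash_written_tb += overflow_tb   # crude: count each hour's overflow as flash writes

peak_tb = max(buffered_tb_by_hour)
print(f"DRAM-only sizing would need {peak_tb} TB of DRAM (the peak)")
print(f"hybrid: {dram_capacity_tb} TB DRAM + ~{flash_written_tb:.1f} TB/day written to flash")
```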
Hybrid Technique #1
Buffer in DRAM first, flush to flash vs pure-DRAM Cosco
Implementation requires balancing. Flash adds another dimension. How do we adapt the balancing logic?
[Flowcharts: in pure-DRAM Cosco, when a Shuffle Service is out of DRAM, the balancing logic chooses between redirecting to another shuffle service, flushing to DFS, and backpressuring mappers. With flash, "flush to flash" becomes a fourth option, but where does it plug in?]
Hybrid Technique #1
Buffer in DRAM first, flush to flash vs pure-DRAM Cosco: plug into pre-existing balancing logic
[Flowchart: Shuffle Service is out of DRAM -> is the flash working set smaller than THRESHOLD? Yes -> flush to flash. No -> fall through to the same balancing logic as pure-DRAM Cosco: redirect to another shuffle service, flush to DFS, or backpressure mappers]
Hybrid Technique #1
Plug into pre-existing balancing logic
[Flowchart: same decision flow as the previous slide]
▪ THRESHOLD limits flash working set size
▪ Configure THRESHOLD to stay under flash endurance limits
▪ Then predict cluster performance as if working-set flash were DRAM (see the sketch below)
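Here is a minimal sketch of the adapted balancing decision, assuming the pre-existing choices are redirect, flush to DFS, and backpressure as shown on the slides; the enum, the THRESHOLD value, and the service fields are illustrative, not Cosco's API:

```python
from enum import Enum, auto
from types import SimpleNamespace

class BalancingAction(Enum):
    FLUSH_TO_FLASH = auto()
    REDIRECT_TO_ANOTHER_SHUFFLE_SERVICE = auto()
    FLUSH_TO_DFS = auto()
    BACKPRESSURE_MAPPERS = auto()

THRESHOLD_BYTES = 1 << 40  # cap on the flash working set, chosen to stay under endurance limits

def pre_existing_balancing_logic(service) -> BalancingAction:
    """Stand-in for pure-DRAM Cosco's logic (redirect / flush to DFS / backpressure)."""
    if service.can_redirect:
        return BalancingAction.REDIRECT_TO_ANOTHER_SHUFFLE_SERVICE
    if service.has_flushable_buffer:
        return BalancingAction.FLUSH_TO_DFS
    return BalancingAction.BACKPRESSURE_MAPPERS

def on_out_of_dram(service) -> BalancingAction:
    """Flash plugs in ahead of the pre-existing logic, gated by THRESHOLD."""
    if service.flash_working_set_bytes < THRESHOLD_BYTES:
        return BalancingAction.FLUSH_TO_FLASH
    return pre_existing_balancing_logic(service)

# Usage with a toy service state: a 200 GiB working set is under the 1 TiB THRESHOLD.
svc = SimpleNamespace(flash_working_set_bytes=200 << 30, can_redirect=False, has_flushable_buffer=True)
print(on_out_of_dram(svc))  # BalancingAction.FLUSH_TO_FLASH
```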
Hybrid Technique #1
Summary
▪ Take advantage of variation in total shuffle workload over time
▪ Buffer in DRAM first, flush to flash only under memory pressure
▪ Adapt balancing logic
Hybrid Technique #2
Take advantage of variation in partition fill rate
▪ Some partitions fill more slowly than others
▪ Slower partitions wear out flash less quickly
▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
Hybrid Technique #2
Take advantage of variation in partition fill rate: illustrated with numbers
▪ DRAM: 1 TB
▪ Supports 100K streams, each buffering up to 10 MB
▪ Flash: 10 TB, 100 TB written/day
▪ 100K streams each writing 1 GB/day, which is 12 KB/second (sanity check: 5 min map stage -> 3.6 MB partition)
▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash
▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
Hybrid Technique #2
Buffer fastest-filling partitions in DRAM and slowest-filling partitions in flash
▪ Technique
▪ Periodically measure partition fill rate
▪ If fill rate is less than threshold KB/s, then buffer partition data in flash
▪ Else, buffer partition data in DRAM
▪ Evaluation
▪ Assume “break-even” threshold of 12 KB/s from previous slide
▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s
▪ Suppose these slow partitions write an average of 3 KB/s
▪ Then, you can replace half of your buffering DRAM with 25% as much flash
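A sketch of the placement rule, using the illustrative 1 GB/day (~12 KB/s) break-even point from the previous slide; the constant and function names are mine:

```python
# Placement rule from Hybrid Technique #2, with the talk's illustrative numbers.
# Break-even fill rate: a stream writing ~1 GB/day (~12 KB/s) is where flash and the
# DRAM it would replace come out about even; slower partitions go to flash.

SECONDS_PER_DAY = 86_400
BREAK_EVEN_FILL_RATE_KB_S = 1_000_000 / SECONDS_PER_DAY  # ~11.6 KB/s, rounded to 12 on the slide

def choose_buffer_medium(recent_fill_rate_kb_s: float) -> str:
    """Re-evaluated periodically per partition: slow fillers -> flash, fast fillers -> DRAM."""
    return "flash" if recent_fill_rate_kb_s < BREAK_EVEN_FILL_RATE_KB_S else "dram"

for rate_kb_s in (3, 6, 12, 24):
    print(f"{rate_kb_s:>3} KB/s -> {choose_buffer_medium(rate_kb_s)}")
# 3 and 6 KB/s partitions buffer on flash; 12 and 24 KB/s partitions stay in DRAM.
```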
Hybrid Technique #2
Real-world partition fill rates
[Charts: partition fill rate vs percentile of partitions (1st to 99th), on linear and log scales, ranging from ~0 KiB/sec up to MiB's/sec; shown both for raw partition percentiles and for percentiles weighted by buffering time]
Combine both hybrid techniques
Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure
▪ Evaluation
▪ Theoretical estimation is difficult
▪ Alternatively, run a discrete-event simulation -> later in this presentation
Future Improvements
Lower-Latency Queries
Made possible by flash
▪ Serve shuffle data directly from flash for some jobs
▪ This is “free” until the flash drive gets so full that its write amplification factor increases (~80% full)
▪ Prioritize interactive/low-latency queries to serve from flash
▪ Buffer bigger chunks to decrease reducer merging
▪ Fewer chunks -> Less chance that reducer needs to do an external merge
Further Efficiency Wins
Made possible by flash
▪ Decrease Cosco replication factor since flash is non-volatile
▪ Currently Cosco replication is R2: Each map output byte is stored on two shuffle services until it is flushed to durable DFS
▪ Most Shuffle Service crashes in production are resolved in a few minutes with process restart
▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after restart
▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
Practical Evaluation Techniques
Practical Evaluation Techniques
▪ Discrete event simulation
▪ Synthetic load generation on a test cluster
▪ Shadow testing on a test cluster
▪ Special canary in a production cluster
Discrete Event Simulation
https://en.wikipedia.org/wiki/Discrete-event_simulation, 2020-05-18
Discrete Event Simulation
Example
[Animation: a Shuffle Service Model buffering Partition 3 and Partition 42, alongside a DFS Model. Each discrete event advances the simulated clock (00h:01m:30.000s, 00h:01m:30.250s, …, 00h:01m:32.500s) and updates counters: "Total KB written to flash" grows from 9,000 to 9,350, and when a full buffer is sorted & flushed to the DFS as File 0, "Overall avg file size written to DFS" changes from NaN to 9,200]
Discrete Event Simulation
Drive simulation based on production data: the cosco_chunks dataset

Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time | Chunk Size | Chunk Buffering Time | Chunk Fill Rate (derived from size and buffering time)
3 | 10 | 5 | 2020-05-19 00:00:00.000 | 10 MiB | 5000 ms | 2 MiB/s
42 | 10 | 2 | 2020-05-19 00:01:00.000 | 31 MiB | 10000 ms | 3.1 MiB/s
… | | | | | |
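Below is a toy discrete-event simulator that replays records shaped like cosco_chunks rows and tracks the two counters from the example slides (total KB written to flash, average DFS file size). Splitting each chunk into a few appends plus one flush is my simplification, not how the production simulator models events:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time_s: float                        # events are processed in simulated-time order
    kind: str = field(compare=False)     # "append" (buffered to flash) or "flush" (to DFS)
    partition: int = field(compare=False)
    size_kb: float = field(compare=False)

# Rows shaped like cosco_chunks: (partition, start offset in s, chunk size in KiB, buffering time in s)
chunks = [(3, 0.0, 10 * 1024, 5.0), (42, 60.0, 31 * 1024, 10.0)]

events = []
for partition, start_s, size_kb, buffering_s in chunks:
    steps = 4  # model the chunk as a few appends spread over its buffering time, then one flush
    for i in range(steps):
        heapq.heappush(events, Event(start_s + buffering_s * i / steps, "append", partition, size_kb / steps))
    heapq.heappush(events, Event(start_s + buffering_s, "flush", partition, size_kb))

flash_written_kb = 0.0
dfs_file_sizes_kb = []
while events:
    ev = heapq.heappop(events)
    if ev.kind == "append":
        flash_written_kb += ev.size_kb
    else:
        dfs_file_sizes_kb.append(ev.size_kb)
    avg_dfs = sum(dfs_file_sizes_kb) / len(dfs_file_sizes_kb) if dfs_file_sizes_kb else float("nan")
    print(f"t={ev.time_s:6.2f}s  flash written: {flash_written_kb:8.0f} KB  avg DFS file: {avg_dfs:.0f} KB")
```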
Canary on a Production Cluster
▪ Many important metrics are observed on mappers
▪ Example: “percentage of task time spent shuffling”
▪ Example: “map task success rate”
▪ Problem: Mappers talk to many Shuffle Services
▪ Simultaneously
▪ Dynamic balancing can re-route to different Shuffle Services
▪ Solution: Subclusters
▪ Pre-existing feature for large clusters
▪ Each Shuffle Service belongs to one subcluster
▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster
▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t
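The subcluster comparison amounts to grouping mapper-level metrics by subcluster. Here is a toy sketch with invented samples; the two metric names come from the slide, everything else is illustrative:

```python
from collections import defaultdict
from statistics import mean

# Toy mapper-level samples: (subcluster, fraction of task time spent shuffling, task succeeded?)
samples = [
    ("flash-canary", 0.18, True), ("flash-canary", 0.22, True), ("flash-canary", 0.20, False),
    ("control",      0.25, True), ("control",      0.30, True), ("control",      0.27, True),
]

by_subcluster = defaultdict(list)
for subcluster, shuffle_frac, ok in samples:
    by_subcluster[subcluster].append((shuffle_frac, ok))

for subcluster, rows in sorted(by_subcluster.items()):
    pct_time_shuffling = mean(r[0] for r in rows)
    success_rate = mean(1.0 if r[1] else 0.0 for r in rows)
    print(f"{subcluster:12s}  % task time shuffling: {pct_time_shuffling:.0%}  "
          f"map task success rate: {success_rate:.0%}")
```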
Special Thanks
Chen Yang, Software Engineer at Facebook
Sergey Makagonov, Software Engineer at Facebook
Previous Shuffle presentations from Facebook
▪ SOS: Optimizing Shuffle IO, Spark Summit 2018
▪ Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Contact
▪ cosco@fb.com mailing list