Flash for Apache Spark Shuffle with Cosco
Aaron Gabriel Feldman
Software Engineer at Facebook
Agenda
1. Motivation
2. Intro to shuffle architecture
3. Flash
4. Hybrid RAM + flash techniques
5. Future improvements
6. Testing techniques
Why should you care?
▪ IO efficiency
▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data
▪ Compute efficiency
▪ Flash supports more workload with less Cosco hardware
▪ Query latency is less of a focus
▪ Cosco helps shuffle-heavy queries, but our focus so far has been batch workloads rather than query latency
▪ Flash unlocks new possibilities to improve query latency, but that is future work
▪ Techniques for development and analysis
▪ Hopefully, some of these are applicable outside of Cosco
Intro to Shuffle Architecture
Spark Shuffle Recap
[Diagram: Mappers (Map 0 … Map m) write partitioned Map Output Files to local disk or DFS; Reducers (Reduce 0 … Reduce r) each pull their partition from every map output file, sort by key, and consume it through an iterator]
▪ Map output files are written to local storage or a distributed filesystem
▪ Reducers pull from the map output files, sort by key, and feed an iterator
▪ Write amplification problem: write amplification is ~3x
▪ Small IOs problem: M x R IOs, with an average IO size of ~200 KiB
Later slides use a simplified version of this drawing.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
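To put the small-IOs problem in concrete terms, here is a back-of-the-envelope calculation in Python. The mapper/reducer counts and the total shuffle size are invented round numbers, not figures from the talk:

```python
# Back-of-the-envelope for the "small IOs" problem: every reducer reads a slice of
# every map output file, so a job performs roughly M x R reads. Numbers are made up.

num_mappers = 10_000
num_reducers = 10_000
shuffle_bytes = 20 * 1024**4            # assume ~20 TiB of shuffle data for the job

num_reads = num_mappers * num_reducers  # M x R
avg_read_bytes = shuffle_bytes / num_reads

print(f"reads: {num_reads:,}")                             # 100,000,000 reads
print(f"avg read size: {avg_read_bytes / 1024:.0f} KiB")   # ~215 KiB: small, seek-heavy IOs
```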
Cosco Shuffle for Spark
[Diagram: Mappers (Map 1 … Map m) stream output to Shuffle Services 1 … N (N = thousands); each service keeps an in-memory buffer per partition (Partition 1 … Partition r); full buffers are flushed as files (File 0, File 1, File 2, …) to a Distributed Filesystem (HDFS/Warm Storage); Reducers (Reduce 1 … Reduce r) read those files]
▪ Mappers stream their output to Cosco Shuffle Services, which buffer it in memory
▪ When a partition's buffer is full, the service sorts it (if required by the query) and flushes it to the DFS as that partition's next file
▪ Reducers do a streaming merge of the flushed files after the map stage completes
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
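To make the buffer-sort-flush flow above concrete, here is a minimal single-process sketch. It is not Cosco's implementation; `PartitionBuffer`, `BUFFER_LIMIT`, and the plain dict standing in for the DFS are illustrative simplifications, but the flow it walks through (append, sort if required, flush as the partition's next file, reducer-side streaming merge) follows the slides above:

```python
import heapq

BUFFER_LIMIT = 4  # records per partition buffer; real buffers are sized in MBs, not records

class PartitionBuffer:
    """In-memory buffer for one partition on one shuffle service (illustrative only)."""
    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.records = []
        self.file_number = 0  # which "File n" the next flush will produce

    def append(self, record, dfs, sort_required=True):
        self.records.append(record)
        if len(self.records) >= BUFFER_LIMIT:
            self.flush(dfs, sort_required)

    def flush(self, dfs, sort_required):
        if not self.records:
            return
        chunk = sorted(self.records) if sort_required else list(self.records)
        dfs.setdefault(self.partition_id, []).append(chunk)  # "File <n>" for this partition
        self.records = []
        self.file_number += 1

def reducer_streaming_merge(dfs, partition_id):
    """Reducer side: streaming merge of the sorted files of one partition."""
    return heapq.merge(*dfs.get(partition_id, []))

# Usage: mappers stream records for partition 0; the reducer merges after the map stage.
dfs = {}
buf = PartitionBuffer(partition_id=0)
for record in [("k3", 1), ("k1", 2), ("k2", 3), ("k1", 4), ("k9", 5), ("k0", 6)]:
    buf.append(record, dfs)
buf.flush(dfs, sort_required=True)            # end of map stage: flush the partial buffer
print(list(reducer_streaming_merge(dfs, 0)))  # partition 0's records in key order
```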
Replace DRAM with Flash for Buffering
Buffering Is Appending
[Diagram: Mappers (Map 1 … Map m) send packages to Shuffle Services 1 … N (N = thousands); each package is appended to its partition's buffer (e.g. Partition r)]
▪ Each package is a few 10s of KiB
▪ Buffering a partition is just appending package after package to its buffer
Replace DRAM with Flash for Buffering
[Diagram: the same buffering picture, but the partition buffers live on flash (packages of a few 10s of KiB are appended), and buffered data is read back to main memory for sorting]
▪ Simply buffer to flash instead of memory
▪ Appending is a friendly pattern for flash
▪ It minimizes flash write amplification, which minimizes wear on the drive
▪ Read back to main memory for sorting
▪ Flash write/read latency is negligible
▪ Generally non-blocking
▪ Latency is much less than buffering time
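Below is a minimal sketch of what "simply buffer to flash instead of memory" can look like, assuming a flash-backed directory with one append-only file per partition; the class, method, and path names are mine, not Cosco's:

```python
import os
import tempfile

class FlashPartitionBuffer:
    """Append-only, flash-backed buffer for one partition (illustrative, not Cosco's code).

    Writes are pure appends, which is friendly to flash; data is read back into
    main memory only when the buffer is sorted in preparation for flushing.
    """
    def __init__(self, partition_id, flash_dir):
        self.path = os.path.join(flash_dir, f"partition-{partition_id}.buf")
        self.file = open(self.path, "ab")  # append-only
        self.bytes_buffered = 0

    def append(self, package: bytes):
        # Each package is a few tens of KiB in practice.
        self.file.write(len(package).to_bytes(4, "big") + package)
        self.bytes_buffered += len(package)

    def read_back_and_sort(self):
        """Read the buffered packages back into main memory and sort them for flushing."""
        self.file.flush()
        packages = []
        with open(self.path, "rb") as f:
            while header := f.read(4):
                packages.append(f.read(int.from_bytes(header, "big")))
        return sorted(packages)

# Usage: append a few packages, then read back and sort before flushing to DFS.
with tempfile.TemporaryDirectory() as flash_dir:   # stand-in for a flash mount point
    buf = FlashPartitionBuffer(partition_id=7, flash_dir=flash_dir)
    for pkg in [b"zebra", b"apple", b"mango"]:
        buf.append(pkg)
    print(buf.read_back_and_sort())  # [b'apple', b'mango', b'zebra']
    buf.file.close()
```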
Example Rule of Thumb
Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
▪ Hypothetical example numbers
▪ Assume 1 GB Flash can endure ~10 GB of writes per day for the lifetime of the device
▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB Flash with write throughput at the endurance limit
▪ Then, you would be indifferent between consuming 1 GB DRAM vs ~100 GB/day Flash
▪ Notes
▪ These numbers chosen entirely because they are round -> Easier to illustrate math on slides
▪ DRAM consumes more power than Flash
Basic Evaluation
▪ Example Cosco cluster
▪ 10 nodes
▪ Each node uses 100 GB DRAM for buffering
▪ And has additional DRAM for sorting, RPCs, etc.
▪ So, 1 TB DRAM for buffering in total
▪ Again, numbers are chosen for illustration only
▪ Apply the example rule of thumb
▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day flash endurance
▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with Flash
▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering
▪ Nodes keep some DRAM for sorting, RPCs, etc.
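The rule-of-thumb arithmetic is easy to encode. Here is a small sketch using the talk's hypothetical round numbers; the helper name and the 80 TB/day example workload are mine:

```python
# Worked version of the hypothetical rule of thumb from the previous slides.
# All constants are the illustrative round numbers from the talk, not real hardware specs.

FLASH_ENDURANCE_GB_PER_DAY_PER_GB = 10   # 1 GB of flash endures ~10 GB of writes/day
FLASH_GB_EQUIVALENT_TO_1GB_DRAM = 10     # assumed indifference point: 1 GB DRAM ~ 10 GB flash

def flash_write_budget_gb_per_day(buffering_dram_gb: float) -> float:
    """GB/day of flash writes you could consume instead of this much buffering DRAM."""
    flash_gb = buffering_dram_gb * FLASH_GB_EQUIVALENT_TO_1GB_DRAM
    return flash_gb * FLASH_ENDURANCE_GB_PER_DAY_PER_GB

# Example cluster from the slide: 10 nodes x 100 GB of buffering DRAM = 1 TB.
cluster_dram_gb = 10 * 100
budget_gb = flash_write_budget_gb_per_day(cluster_dram_gb)   # 100,000 GB/day = 100 TB/day
daily_shuffle_gb = 80_000                                    # suppose the cluster shuffles 80 TB/day

print(f"flash write budget: {budget_gb / 1000:.0f} TB/day")
print("replace buffering DRAM with flash" if daily_shuffle_gb <= budget_gb
      else "keep DRAM (or go hybrid)")
```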
Basic Evaluation
Summary for cluster shuffling 100 TB/day
[Diagram: Before: Shuffle Services 1 through 10, each with a CPU, DRAM for sorting, RPCs, etc., and 100 GB of DRAM for buffering. After: the same ten Shuffle Services, each with a CPU, DRAM for sorting, RPCs, etc., and 1 TB of flash for buffering]
Hybrid Techniques for Efficiency
Two Hybrid Techniques
Two ways to use both DRAM and flash for buffering
1. Buffer in DRAM first, flush to flash only under memory pressure
2. Buffer fastest-filling partitions in DRAM, send slowest-filling partitions to flash
Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Chart: bytes buffered in a Cosco Shuffle Service over time. Buffering only in DRAM requires 1 TB of DRAM; buffering only in flash writes 100 TB/day to flash. A hybrid that buffers in DRAM and flash needs only 250 GB of DRAM and writes only 25 TB/day to flash]
Hybrid Technique #1
Buffer in DRAM first, flush to flash only under memory pressure
▪ Example: 25% RAM + 25% flash supports 100% throughput (250 GB DRAM, 25 TB/day written to flash)
▪ Spikier workload -> more win
▪ Safer to push the system to its limits
▪ Run out of memory -> immediate bad consequences
▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
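As a rough illustration of the "25% RAM + 25% flash" claim, the sketch below sizes DRAM well under the workload's peak and counts only the overflow as flash writes. The hourly series is invented and the accounting is deliberately crude:

```python
# Why a spiky workload lets a small DRAM buffer plus a modest flash write budget
# cover 100% of throughput. The hourly "bytes buffered" series below is made up.

buffered_tb_by_hour = [0.15, 0.2, 0.25, 1.0, 0.9, 0.25, 0.2, 0.15] * 3  # one spiky day, in TB

dram_capacity_tb = 0.25   # buffer in DRAM first ...
flash_written_tb = 0.0    # ... and absorb only the overflow on flash

for buffered_tb in buffered_tb_by_hour:
    overflow_tb = max(0.0, buffered_tb - dram_capacity_tb)
    flash_written_tb += overflow_tb   # crude: count each hour's overflow as flash writes

peak_tb = max(buffered_tb_by_hour)
print(f"DRAM-only sizing would need {peak_tb} TB of DRAM (the peak)")
print(f"hybrid: {dram_capacity_tb} TB DRAM + ~{flash_written_tb:.1f} TB/day written to flash")
```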
Hybrid Technique #1
Buffer in DRAM first, flush to flash vs pure-DRAM Cosco
Implementation requires balancing. Flash adds another dimension. How do we adapt the balancing logic?
[Flowcharts: in pure-DRAM Cosco, when a Shuffle Service is out of DRAM, the balancing logic chooses between redirecting to another shuffle service, flushing to DFS, and backpressuring mappers. With flash, "flush to flash" becomes a fourth option, but where does it plug in?]
Hybrid Technique #1
Buffer in DRAM first, flush to flash vs pure-DRAM Cosco: plug into pre-existing balancing logic
[Flowchart: Shuffle Service is out of DRAM -> is the flash working set smaller than THRESHOLD? Yes -> flush to flash. No -> fall through to the same balancing logic as pure-DRAM Cosco: redirect to another shuffle service, flush to DFS, or backpressure mappers]
Hybrid Technique #1
Plug into pre-existing balancing logic
[Flowchart: same decision flow as the previous slide]
▪ THRESHOLD limits flash working set size
▪ Configure THRESHOLD to stay under flash endurance limits
▪ Then predict cluster performance as if working-set flash were DRAM (see the sketch below)
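Here is a minimal sketch of the adapted balancing decision, assuming the pre-existing choices are redirect, flush to DFS, and backpressure as shown on the slides; the enum, the THRESHOLD value, and the service fields are illustrative, not Cosco's API:

```python
from enum import Enum, auto
from types import SimpleNamespace

class BalancingAction(Enum):
    FLUSH_TO_FLASH = auto()
    REDIRECT_TO_ANOTHER_SHUFFLE_SERVICE = auto()
    FLUSH_TO_DFS = auto()
    BACKPRESSURE_MAPPERS = auto()

THRESHOLD_BYTES = 1 << 40  # cap on the flash working set, chosen to stay under endurance limits

def pre_existing_balancing_logic(service) -> BalancingAction:
    """Stand-in for pure-DRAM Cosco's logic (redirect / flush to DFS / backpressure)."""
    if service.can_redirect:
        return BalancingAction.REDIRECT_TO_ANOTHER_SHUFFLE_SERVICE
    if service.has_flushable_buffer:
        return BalancingAction.FLUSH_TO_DFS
    return BalancingAction.BACKPRESSURE_MAPPERS

def on_out_of_dram(service) -> BalancingAction:
    """Flash plugs in ahead of the pre-existing logic, gated by THRESHOLD."""
    if service.flash_working_set_bytes < THRESHOLD_BYTES:
        return BalancingAction.FLUSH_TO_FLASH
    return pre_existing_balancing_logic(service)

# Usage with a toy service state: a 200 GiB working set is under the 1 TiB THRESHOLD.
svc = SimpleNamespace(flash_working_set_bytes=200 << 30, can_redirect=False, has_flushable_buffer=True)
print(on_out_of_dram(svc))  # BalancingAction.FLUSH_TO_FLASH
```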
Hybrid Technique #1
Summary
▪ Take advantage of variation in total shuffle workload over time
▪ Buffer in DRAM first, flush to flash only under memory pressure
▪ Adapt balancing logic
Hybrid Technique #2
Take advantage of variation in partition fill rate
▪ Some partitions fill more slowly than others
▪ Slower partitions wear out flash less quickly
▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
Hybrid Technique #2
Take advantage of variation in partition fill rate: illustrated with numbers
▪ DRAM: 1 TB
▪ Supports 100K streams, each buffering up to 10 MB
▪ Flash: 10 TB, 100 TB written/day
▪ 100K streams each writing 1 GB/day, which is 12 KB/second (sanity check: 5 min map stage -> 3.6 MB partition)
▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash
▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
Hybrid Technique #2
Buffer fastest-filling partitions in DRAM and slowest-filling partitions in flash
▪ Technique
▪ Periodically measure partition fill rate
▪ If fill rate is less than threshold KB/s, then buffer partition data in flash
▪ Else, buffer partition data in DRAM
▪ Evaluation
▪ Assume “break-even” threshold of 12 KB/s from previous slide
▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s
▪ Suppose these slow partitions write an average of 3 KB/s
▪ Then, you can replace half of your buffering DRAM with 25% as much flash
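A sketch of the placement rule, using the illustrative 1 GB/day (~12 KB/s) break-even point from the previous slide; the constant and function names are mine:

```python
# Placement rule from Hybrid Technique #2, with the talk's illustrative numbers.
# Break-even fill rate: a stream writing ~1 GB/day (~12 KB/s) is where flash and the
# DRAM it would replace come out about even; slower partitions go to flash.

SECONDS_PER_DAY = 86_400
BREAK_EVEN_FILL_RATE_KB_S = 1_000_000 / SECONDS_PER_DAY  # ~11.6 KB/s, rounded to 12 on the slide

def choose_buffer_medium(recent_fill_rate_kb_s: float) -> str:
    """Re-evaluated periodically per partition: slow fillers -> flash, fast fillers -> DRAM."""
    return "flash" if recent_fill_rate_kb_s < BREAK_EVEN_FILL_RATE_KB_S else "dram"

for rate_kb_s in (3, 6, 12, 24):
    print(f"{rate_kb_s:>3} KB/s -> {choose_buffer_medium(rate_kb_s)}")
# 3 and 6 KB/s partitions buffer on flash; 12 and 24 KB/s partitions stay in DRAM.
```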
Hybrid Technique #2
Real-world partition fill rates
[Charts: partition fill rate vs percentile of partitions (1st to 99th), on linear and log scales, ranging from ~0 KiB/sec up to MiB's/sec; shown both for raw partition percentiles and for percentiles weighted by buffering time]
Combine both hybrid techniques
Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure
▪ Evaluation
▪ Theoretical estimation is difficult
▪ Alternatively, run a discrete-event simulation -> later in this presentation
Future Improvements
Lower-Latency Queries
Made possible by flash
▪ Serve shuffle data directly from flash for some jobs
▪ This is “free” until the flash drive gets so full that its write amplification factor increases (~80% full)
▪ Prioritize interactive/low-latency queries to serve from flash
▪ Buffer bigger chunks to decrease reducer merging
▪ Fewer chunks -> Less chance that reducer needs to do an external merge
Further Efficiency Wins
Made possible by flash
▪ Decrease Cosco replication factor since flash is non-volatile
▪ Currently Cosco replication is R2: Each map output byte is stored on two shuffle services until it is flushed to durable DFS
▪ Most Shuffle Service crashes in production are resolved in a few minutes with process restart
▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after restart
▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
Practical Evaluation Techniques
Practical Evaluation Techniques
▪ Discrete event simulation
▪ Synthetic load generation on a test cluster
▪ Shadow testing on a test cluster
▪ Special canary in a production cluster
Discrete Event Simulation
https://en.wikipedia.org/wiki/Discrete-event_simulation, 2020-05-18
Discrete Event Simulation
Example
[Animation: a Shuffle Service Model buffering Partition 3 and Partition 42, alongside a DFS Model. Each discrete event advances the simulated clock (00h:01m:30.000s, 00h:01m:30.250s, …, 00h:01m:32.500s) and updates counters: "Total KB written to flash" grows from 9,000 to 9,350, and when a full buffer is sorted & flushed to the DFS as File 0, "Overall avg file size written to DFS" changes from NaN to 9,200]
Discrete Event Simulation
Drive simulation based on production data: the cosco_chunks dataset

Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time | Chunk Size | Chunk Buffering Time | Chunk Fill Rate (derived from size and buffering time)
3 | 10 | 5 | 2020-05-19 00:00:00.000 | 10 MiB | 5000 ms | 2 MiB/s
42 | 10 | 2 | 2020-05-19 00:01:00.000 | 31 MiB | 10000 ms | 3.1 MiB/s
… | | | | | |
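Below is a toy discrete-event simulator that replays records shaped like cosco_chunks rows and tracks the two counters from the example slides (total KB written to flash, average DFS file size). Splitting each chunk into a few appends plus one flush is my simplification, not how the production simulator models events:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time_s: float                        # events are processed in simulated-time order
    kind: str = field(compare=False)     # "append" (buffered to flash) or "flush" (to DFS)
    partition: int = field(compare=False)
    size_kb: float = field(compare=False)

# Rows shaped like cosco_chunks: (partition, start offset in s, chunk size in KiB, buffering time in s)
chunks = [(3, 0.0, 10 * 1024, 5.0), (42, 60.0, 31 * 1024, 10.0)]

events = []
for partition, start_s, size_kb, buffering_s in chunks:
    steps = 4  # model the chunk as a few appends spread over its buffering time, then one flush
    for i in range(steps):
        heapq.heappush(events, Event(start_s + buffering_s * i / steps, "append", partition, size_kb / steps))
    heapq.heappush(events, Event(start_s + buffering_s, "flush", partition, size_kb))

flash_written_kb = 0.0
dfs_file_sizes_kb = []
while events:
    ev = heapq.heappop(events)
    if ev.kind == "append":
        flash_written_kb += ev.size_kb
    else:
        dfs_file_sizes_kb.append(ev.size_kb)
    avg_dfs = sum(dfs_file_sizes_kb) / len(dfs_file_sizes_kb) if dfs_file_sizes_kb else float("nan")
    print(f"t={ev.time_s:6.2f}s  flash written: {flash_written_kb:8.0f} KB  avg DFS file: {avg_dfs:.0f} KB")
```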
Canary on a Production Cluster
▪ Many important metrics are observed on mappers
▪ Example: “percentage of task time spent shuffling”
▪ Example: “map task success rate”
▪ Problem: Mappers talk to many Shuffle Services
▪ Simultaneously
▪ Dynamic balancing can re-route to different Shuffle Services
▪ Solution: Subclusters
▪ Pre-existing feature for large clusters
▪ Each Shuffle Service belongs to one subcluster
▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster
▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t
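The subcluster comparison amounts to grouping mapper-level metrics by subcluster. Here is a toy sketch with invented samples; the two metric names come from the slide, everything else is illustrative:

```python
from collections import defaultdict
from statistics import mean

# Toy mapper-level samples: (subcluster, fraction of task time spent shuffling, task succeeded?)
samples = [
    ("flash-canary", 0.18, True), ("flash-canary", 0.22, True), ("flash-canary", 0.20, False),
    ("control",      0.25, True), ("control",      0.30, True), ("control",      0.27, True),
]

by_subcluster = defaultdict(list)
for subcluster, shuffle_frac, ok in samples:
    by_subcluster[subcluster].append((shuffle_frac, ok))

for subcluster, rows in sorted(by_subcluster.items()):
    pct_time_shuffling = mean(r[0] for r in rows)
    success_rate = mean(1.0 if r[1] else 0.0 for r in rows)
    print(f"{subcluster:12s}  % task time shuffling: {pct_time_shuffling:.0%}  "
          f"map task success rate: {success_rate:.0%}")
```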
Special Thanks
Chen Yang, Software Engineer at Facebook
Sergey Makagonov, Software Engineer at Facebook
Previous Shuffle presentations from Facebook
▪ SOS: Optimizing Shuffle IO, Spark Summit 2018
▪ Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Contact
▪ cosco@fb.com mailing list