Metrics-driven tuning of Apache Spark at scale

Edwina Lu and Ye Zhou

Hadoop Infra @ LinkedIn
• 10+ clusters
• 10,000+ nodes
• 1000+ users

Spark @ LinkedIn
• Number of daily Spark apps for one cluster: close to 3K, a 2.4x increase in the last 3 quarters
• Spark applications consume 25% of resources; average daily Spark resource consumption: 1.6 PBHr
[Chart: Number of Applications per Day]
[Chart: Average Daily Resource Usage, Spark vs. Non-Spark]

What We Discovered About Spark Usage
Only ~34% of allocated memory was
actually used.
Example application:
• 200 executors
• spark.driver.memory: 16GB
• spark.executor.memory: 16GB
• Max executor JVM used memory: 6.6GB
• Max driver JVM used memory: 5.4GB
• Total wasted memory: 1.8TB
• Time: 1h
[Chart: Executor Memory. Peak Used JVM Memory 34%, Unused Executor Memory 61%, Reserved Memory 5%]
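As a rough check of the 1.8TB figure in the example application, the arithmetic can be sketched as below. This assumes every executor is charged at the maximum peak (so the result is a lower bound on waste) and that 1TB = 1024GB; the function name is hypothetical.

```python
def wasted_memory_gb(num_executors, executor_memory_gb, peak_executor_gb,
                     driver_memory_gb, peak_driver_gb):
    """Allocated-but-unused JVM memory across all executors plus the driver, in GB."""
    executor_waste = num_executors * (executor_memory_gb - peak_executor_gb)
    driver_waste = driver_memory_gb - peak_driver_gb
    return executor_waste + driver_waste


# Values from the example application on the slide.
total_gb = wasted_memory_gb(200, 16, 6.6, 16, 5.4)
print(f"{total_gb:.1f} GB wasted, about {total_gb / 1024:.2f} TB")
```

Almost all of the waste comes from the 200 executors; the driver contributes only about 10GB.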

Memory Tuning: Motivation
• Memory and CPUs cost money
• These are limited resources, so must be used efficiently
• Only 34% of allocated memory is used; with more efficient memory usage,
we could run 2-3 times as many Spark applications on the same
hardware

Memory Tuning: What and How to Tune?
• Spark tuning can be complicated, with many metrics and configuration parameters
• Many users have limited knowledge about how to tune Spark applications

Memory Tuning: Scaling
• Data scientist and engineer time costs even more money
• Analyzing applications and giving tuning advice in person does not
scale for the Spark team or users who must wait for help
• Infrastructure efficiency vs. developer productivity
– Do we have to choose between these two?

Dr. Elephant
• Performance monitoring and tuning service
• Identify badly tuned applications and causes
• Provide actionable advice for fixing issues
• Compare performance changes over time

Dr. Elephant: How does it Work?
[Diagram: Resource Manager → Application Fetcher; History Server → Metrics Fetcher; Run Rule 1, Run Rule 2, Run Rule 3; Database; Dr. Elephant UI]

Challenges for Dr. Elephant to Support Spark
• Spark tuning heuristics
– What are the necessary metrics to enable effective tuning?
• Fetch Spark history
– Spark components are not equally scalable

Spark Memory Overview
Executor Container
• Overhead (off-heap memory): spark.yarn.executor.memoryOverhead, max(executorMemory * 0.1, 384MB)
• Executor Memory: spark.executor.memory
– Reserved Memory: 300 MB
– User Memory: 1 - spark.memory.fraction = 0.4
– Unified Memory: spark.memory.fraction = 0.6, shared by Execution Memory and Storage Memory (spark.memory.storageFraction)
[Diagram labels: JVM Used Memory, Executor Memory]
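The layout above can be turned into numbers with a short sketch. It applies the fractions to the heap left after the 300 MB reserve, which is how the Spark documentation defines spark.memory.fraction; the 0.5 default for spark.memory.storageFraction also comes from the Spark documentation rather than the slide, and the function and key names are hypothetical.

```python
RESERVED_MB = 300        # fixed reserved memory, from the slide
MIN_OVERHEAD_MB = 384    # floor for the off-heap overhead, from the slide


def executor_memory_layout(executor_memory_mb, memory_fraction=0.6,
                           storage_fraction=0.5, overhead_factor=0.1):
    """Split one executor container into the regions named in the overview."""
    overhead = max(executor_memory_mb * overhead_factor, MIN_OVERHEAD_MB)
    usable = executor_memory_mb - RESERVED_MB    # heap left after the reserve
    unified = usable * memory_fraction           # execution + storage
    return {
        "container_mb": executor_memory_mb + overhead,
        "overhead_mb": overhead,
        "reserved_mb": RESERVED_MB,
        "unified_mb": unified,
        "storage_region_mb": unified * storage_fraction,
        "user_mb": usable * (1 - memory_fraction),
    }


layout = executor_memory_layout(16 * 1024)
for region, size_mb in layout.items():
    print(f"{region}: {size_mb:.1f}")
```

For a 16 GB executor the overhead is 10% of the heap; for small executors the 384MB floor applies instead.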

Executor JVM Used Memory Heuristic
[Diagram: Spark Executor Memory 16GB; Peak JVM Used Memory 275.9MB; Reserved Memory 300MB; the remainder is Wasted Memory]
Executor JVM Used Memory
Severity: Severe
The configured executor memory is much higher than
the maximum amount of JVM used by executors.
Please set spark.executor.memory to a lower value.
spark.executor.memory: 16 GB
Max executor peak JVM used memory: 6.6 GB
Suggested spark.executor.memory: 7 GB
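The slide does not say how the 7 GB suggestion is derived. One rule that is consistent with the example is to add the 300 MB reserved memory to the peak JVM used memory and round up to a whole GB. The sketch below is that assumption only, not Dr. Elephant's documented formula, and the function name is hypothetical.

```python
import math

RESERVED_GB = 300 / 1024  # 300 MB reserved memory, expressed in GB


def suggest_executor_memory_gb(max_peak_jvm_used_gb):
    """Assumed rule: peak JVM used memory plus the reserve, rounded up to whole GB."""
    return math.ceil(max_peak_jvm_used_gb + RESERVED_GB)


suggested_gb = suggest_executor_memory_gb(6.6)
print(f"Suggested spark.executor.memory: {suggested_gb} GB")
```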

Executor Unified Memory Heuristic
[Diagram: Unified Memory 8.36GB; Peak Unified Memory 474.42KB; the remainder is Wasted Memory]
Executor Peak Unified Memory
Severity: Critical
The allocated unified memory is much higher than the
maximum amount of unified memory used by executors.
Please lower spark.memory.fraction.
spark.executor.memory: 10 GB
spark.memory.fraction: 0.6
Allocated unified memory: 6 GB
Max peak JVM used memory: 7.2 GB
Max peak unified memory: 1.2 GB
Suggested spark.memory.fraction: 0.2
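The suggested fraction of 0.2 is not explained on the slide. A rule that reproduces it is to divide the peak unified memory by the executor memory and round up to one decimal place. This is an assumed rule for illustration, with a hypothetical function name; the actual heuristic may differ.

```python
import math


def suggest_memory_fraction(max_peak_unified_gb, executor_memory_gb):
    """Assumed rule: fraction of executor memory actually used, rounded up to 0.1."""
    used_fraction = max_peak_unified_gb / executor_memory_gb
    return math.ceil(used_fraction * 10) / 10


suggested_fraction = suggest_memory_fraction(1.2, 10)
print(f"Suggested spark.memory.fraction: {suggested_fraction}")
```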

Execution Memory Spill Heuristic
[Diagram: Unified Memory within Executor Memory, spilling to Disk]
Execution Memory Spill
Severity: Severe
Execution memory spill has been detected in stage 3. Shuffle
read bytes and spill are evenly distributed. There are 200 tasks
for this stage. Please increase spark.sql.shuffle.partitions, or
modify the code to use more partitions, or reduce the number of
executor cores.
spark.executor.memory 10 GB
spark.executor.cores 3
spark.executor.instances 300
Stage 3:
Median shuffle read bytes: 954 MB
Max shuffle read bytes: 955 MB
Median shuffle write bytes: 359 MB
Max shuffle write bytes: 388 MB
Median memoryBytesSpilled: 1.2 GB
Max memoryBytesSpilled: 1.2 GB
Num tasks: 200
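The "evenly distributed" judgment in this heuristic can be sketched by comparing the maximum and median shuffle read sizes for the stage. The 2.0 skew threshold and the function name are hypothetical; only the input values come from the slide.

```python
def diagnose_spill(median_read_mb, max_read_mb, max_spill_mb, skew_ratio=2.0):
    """Classify execution memory spill for one stage."""
    if max_spill_mb == 0:
        return "no spill"
    if median_read_mb > 0 and max_read_mb / median_read_mb < skew_ratio:
        # Every task reads about the same amount, so each partition is simply
        # too large for the execution memory available to a task.
        return "even: increase partitions or reduce executor cores"
    return "skewed: a few tasks read far more data than the rest"


# Stage 3 from the slide: median 954 MB, max 955 MB, 1.2 GB spilled.
print(diagnose_spill(954, 955, 1.2 * 1024))
```

With a max/median ratio of about 1.0, stage 3 falls in the "even" case, which matches the advice to raise spark.sql.shuffle.partitions.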

Executor GC Heuristic
[Diagram: Executor Runtime 2 Minutes; GC Time 13 Seconds]
Executor GC
Severity: Moderate
Executors are spending too much time in GC. Please
increase spark.executor.memory.
spark.executor.memory: 4 GB
GC time to executor run time ratio: 0.164
Total executor run time: 1 Hour 15 Minutes
Total GC time: 12 Minutes
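The ratio in this heuristic is total GC time divided by total executor run time. A minimal sketch follows; the severity thresholds are hypothetical and chosen only so that the example lands on Moderate. With the rounded times shown on the slide the ratio comes out at 0.16, slightly below the reported 0.164, which was presumably computed from unrounded values.

```python
# Hypothetical thresholds, highest first: (minimum ratio, severity label).
SEVERITY_THRESHOLDS = [
    (0.25, "Critical"),
    (0.20, "Severe"),
    (0.15, "Moderate"),
    (0.10, "Low"),
]


def gc_severity(gc_seconds, run_seconds):
    """Return the GC-to-runtime ratio and the first severity whose threshold it meets."""
    ratio = gc_seconds / run_seconds
    for threshold, label in SEVERITY_THRESHOLDS:
        if ratio >= threshold:
            return ratio, label
    return ratio, "None"


# 12 minutes of GC over 1 hour 15 minutes of executor run time.
ratio, label = gc_severity(12 * 60, 75 * 60)
print(f"GC ratio {ratio:.3f}: {label}")
```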

Automating Spark Tuning with Dr. Elephant @ LinkedIn
[Flowchart: Development → Well Tuned? Yes → Ship It! → Production; No → Tune It! → Development]

Architecture
[Diagram: three Executors, each with Tasks and a Cache, send Heartbeats to the Driver; Driver components: DAG Scheduler, Task Scheduler, Listener Bus, EventLogging Listener, AppState Listener; Spark History Logs on HDFS; Spark History Server; REST API; Web UI]

Upstream Ticket
SPARK-23206: Additional Memory Tuning Metrics
• New executor level memory metrics:
– JVM used memory
– Execution memory
– Storage memory
– Unified memory
• Metrics sent from executors to driver via Heartbeat
• Peak values for executor metrics logged at stage end
• Metrics exposed via web UI and REST API

Overview of our Solution
Scalable application metrics provider
Spark History Server (SHS)
Enhancements on SHS
Benefits brought by enhanced SHS
Scalable application history provider
Dr Elephant: Performance analysis at scale
Debug: Easy investigation of past applications

Spark History Server (SHS) at LinkedIn
[Diagram: Log Parsing; Web UI; REST APIs]

How does SHS work?
[Diagram: SHS (SPARK-18085). Queued Thread Pool: Update Listing DB. Jetty Handlers Thread Pool: Create Apps DBs]

Not Happy
[Diagram: Log Parsing; Web UI; REST APIs]

SHS Issues
• Missing applications
– Users cannot find their applications on the home page
• Extended loading time
– The application details page takes a very long time (up to half an hour) to load
• Handling large history files
– SHS gets completely stalled
• Handling high-volume concurrent requests
– SHS does not return the expected JSON response

Missing Applications
1. Submit Job
2. Start running
3. Job Failed
4. Check it out on SHS

Extended Loading Time
5. Wait for SHS to catch up
6. Finally it shows up
7. Check out the details
8. Keep loading… No response

Extended Listing Delay
[Diagram: History Files → Update → Listing DB]
1. Replay same file multiple times
2. Limited threads for the replay
3. Processing time proportional to file size

How to Decrease the Listing Delay
• Use HDFS Extended Attributes
– Spark Driver: write log file content; write log file extended attributes key/value (NameNode)
– Listing DB: read from extended attributes; read from log content when reading from extended attributes fails
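The read path described above (extended attributes first, log content as the fallback) can be sketched as follows. The attribute name and the two callables are hypothetical stand-ins for whatever HDFS client and log parser are in use; the sketch only illustrates the fallback order.

```python
import json


def load_app_summary(path, read_xattr, parse_log):
    """Return the application summary, preferring the cheap extended-attribute read."""
    try:
        # "user.spark.app.summary" is a hypothetical attribute name.
        raw = read_xattr(path, "user.spark.app.summary")
        if raw is not None:
            return json.loads(raw)
    except (OSError, ValueError):
        pass  # attribute unreadable or not valid JSON: fall through to the slow path
    return parse_log(path)
```

Because the summary is small and stored as file metadata, the listing update no longer has to replay the whole history file in the common case.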

Extended Loading Delay
• Replaying all the events takes a long time for a large log file
[Diagram: Request → SHS → Replay → Apps DBs → Response]

How to Decrease the Loading Delay
• DB creation time is unavoidable
• Start DB creation for every application log file before the user's request
[Diagram: SHS → Replay → Apps DBs ahead of time; Request → SHS → Response]

Results of Improvement
• SHS shows completed/running applications on the home page within 1 minute
• For 90% of applications, DB creation starts within 5 minutes after they finish

Scalability Issues
• Increasing number of Spark applications
• Increasing number of Spark users

Severe Garbage Collection (GC)
[Chart: repeated Full GC events]

What Caused GC?
• Unnecessary events used too much memory while replaying
• SHS got completely stalled
• SHS needs to ignore those unnecessary events
[Chart label: 23GB]
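Ignoring unnecessary events during replay can be sketched as a filter over the event log, which Spark writes as one JSON object per line with an "Event" field. The event name in the ignore set is an assumption for illustration, and the function name is hypothetical.

```python
import json

# Assumed name of the block update event; adjust to match the actual log.
IGNORED_EVENTS = {"SparkListenerBlockUpdated"}


def replayable_events(lines, ignored=IGNORED_EVENTS):
    """Yield parsed events, skipping blank lines and ignored event types."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") in ignored:
            continue  # dropped before any replay state is built for it
        yield event


log = [
    '{"Event": "SparkListenerTaskEnd"}',
    '',
    '{"Event": "SparkListenerBlockUpdated"}',
    '{"Event": "SparkListenerJobEnd"}',
]
kept = [event["Event"] for event in replayable_events(log)]
print(kept)
```

Because the function is a generator, ignored events are discarded one at a time and never accumulate in memory.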

High-Volume Concurrent Requests
• When REST call frequency goes beyond a certain threshold, SHS is likely to return a non-JSON response to users
• Home page shows an empty list

Upstream Tickets
SPARK-23607: Use HDFS extended attributes to store application summary
SPARK-21961: Filter out BlockStatusUpdates in History Server when analyzing logs
SPARK-23608: Synchronize when attaching and detaching SparkUI in History Server

Results
• Users can always find their applications on the SHS home page within 1 minute
• For 90% of applications, SHS starts creating their DBs within 5 minutes after they complete
• Stable and reliable service
• Handles high-volume concurrent requests

Future Work
• More memory metrics:
– Netty memory
– Total memory
• More Tuning:
– Skew in assignment of tasks to executors
– Size/time skew in tasks for a stage
– DAG analysis
• Incremental Replay for History Logs
• Horizontally Scalable History Server
