Hadoop & Spark Performance tuning using Dr. Elephant

Dr. Elephant
github.com/linkedin/dr-elephant
Akshay Rai
Hadoop Dev Team

Scale and Optimize Hardware
● More users, more jobs, more resources
● Large investment in hardware
● Can’t keep upgrading and adding machines forever to solve the problem
● Some tuning is needed to get things running

Users are more valuable than machines
What do we do?

User Productivity
● Freedom to experiment and run jobs on the cluster
● Build tools to help developers (Hadoop DSL, resolvers for Pig/Hive)
○ Improve developer lifecycle
○ Also reduce unnecessary resource wastage

How easy is it to tune a job?
● Problems are not obvious
● Critical information is scattered
● Inter-related settings
● Large parameter space

Expert Intervention
● Not enough support resources available
● Poor coverage
● Difficult to prioritize efforts
● Delays user development
Random suggestions

Training is not at all easy
● Too many users
● Diverse backgrounds
● Scope is large and evolving
● Other responsibilities are more important

What does Dr. Elephant do?
● Automated performance monitoring and tuning tool
● Help every user get the best performance from their jobs
● Highlights common mistakes
● Indicates best practices and tuning tips
● Provides a platform for other performance related tools
● Analyzes a hundred thousand jobs every day

Simplified analysis of a flow’s historical executions
● Monitoring performance, resource usage, and other metrics
● Comparing flows against previous executions
● Impact of tuning a specific parameter or changing a line of code

How does a Heuristic work?
● Fetch Counters and Task Data
● Some logic to compute a value
● Compare value against threshold levels

Heuristic Severity
Severity   Description
CRITICAL   The job is in a critical state and must be tuned
SEVERE     There is scope for improvement
MODERATE   There is scope for further improvement
LOW        There is scope for a few minor improvements
NONE       The job is safe; no tuning is necessary
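The three heuristic steps and the severity levels above can be sketched roughly as follows. This is a minimal illustration, not Dr. Elephant's actual implementation: the spill-ratio value, the threshold numbers, and the function names are assumptions chosen for the example.

```python
from bisect import bisect_right
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_ascending(value, thresholds):
    """Map a value to a severity given four ascending threshold levels.

    thresholds = (low, moderate, severe, critical). A value below the first
    level is NONE; a value at or above the last level is CRITICAL.
    """
    return Severity(bisect_right(thresholds, value))


def spill_ratio(counters):
    """Hypothetical computed value: spilled records per map output record."""
    output = counters["MAP_OUTPUT_RECORDS"]
    return counters["SPILLED_RECORDS"] / output if output else 0.0


# Step 1: counters as they might be fetched for one job (made-up numbers).
counters = {"SPILLED_RECORDS": 2_600_000, "MAP_OUTPUT_RECORDS": 1_000_000}

# Step 2: compute a value. Step 3: compare it against threshold levels.
# The four levels below are illustrative, not Dr. Elephant defaults.
ratio = spill_ratio(counters)
severity = severity_ascending(ratio, (2.01, 2.2, 2.5, 3.0))
print(ratio, severity.name)
```

Keeping the comparison separate from the computed value means the same threshold logic can be reused by every heuristic, with only the levels changing.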

Mapper Skew Problem
● The number of Mappers depends on the number of input splits
● Splits of varying size can skew the Mapper input
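One way to put a number on this skew is to compare the average input of the larger tasks against the smaller ones. The sketch below is a simplified assumption (sort the tasks, split them at the middle, compare the two averages); it is not claimed to be the exact grouping Dr. Elephant uses, and the function name is hypothetical.

```python
from statistics import mean


def input_skew(task_input_bytes):
    """Relative deviation between the average input of the larger half
    of the tasks and the average input of the smaller half."""
    sizes = sorted(task_input_bytes)
    if len(sizes) < 2:
        return 0.0
    mid = len(sizes) // 2
    small, large = mean(sizes[:mid]), mean(sizes[mid:])
    if small == 0:
        return float("inf") if large > 0 else 0.0
    return (large - small) / small


MB = 1024 * 1024
balanced = [128 * MB] * 8                  # every Mapper reads 128 MB
skewed = [16 * MB] * 6 + [512 * MB] * 2    # two Mappers do most of the work

print(input_skew(balanced))
print(input_skew(skewed))
```

A value near zero means the Mappers receive similar amounts of data; larger values mean a few Mappers dominate the job's runtime.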

Solution to Mapper Skewness
● Each Mapper should process the same amount of data
● Combine the small chunks and feed them to a single Mapper
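The idea of combining small chunks can be illustrated with a greedy packing sketch. In Hadoop this kind of grouping is typically handled by the input format (for example `CombineFileInputFormat`); the code below only illustrates the principle, and the function name, sizes, and target are assumptions for the example.

```python
def combine_splits(chunk_sizes, target_size):
    """Greedily pack chunks into combined splits of at most target_size.

    Chunks are taken largest first; a chunk that would push the current
    split over target_size starts a new split instead.
    """
    combined, current, current_size = [], [], 0
    for size in sorted(chunk_sizes, reverse=True):
        if current and current_size + size > target_size:
            combined.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        combined.append(current)
    return combined


# Chunk sizes in MB (made-up): eight chunks would mean eight Mappers.
chunks = [10, 120, 30, 5, 60, 40, 128, 20]
splits = combine_splits(chunks, target_size=128)
print(len(chunks), "chunks ->", len(splits), "combined splits")
print(splits)
```

The packing is deliberately simple and does not produce perfectly equal splits; it only shows how small chunks end up sharing a Mapper instead of each getting their own.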

Example | Spark Executor Load Balance

[Diagram: a Spark Driver with Executors 1, 2, and 3, and an RDD with Partitions 1, 2, and 3]

Adding a New Heuristic
1. Create a new heuristic and test it.
2. Create a new view for the heuristic. For example, helpMapperSpill.scala.html
3. Add the details of the heuristic in the HeuristicConf.xml file.
   <heuristic>
     <applicationtype>mapreduce</applicationtype>
     <heuristicname>Mapper GC</heuristicname>
     <classname>com.linkedin.dre.mapreduce.heuristics.MapperGC</classname>
     <viewname>views.html.help.mapreduce.helpGC</viewname>
   </heuristic>
4. Run Dr. Elephant. It should now include the new heuristic.

Configuring Heuristics/Threshold levels
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
    <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
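To make the shape of these parameters concrete, the sketch below reads the configuration above and turns each comma-separated list into four numeric threshold levels. It only illustrates the file format using Python's standard library; it is not how Dr. Elephant itself loads the file, the function name is hypothetical, and the units of each level are left uninterpreted.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

CONF = """
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
    <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
"""


def load_threshold_params(xml_text):
    """Return {heuristic name: {param name: [four threshold levels]}}."""
    result = {}
    for heuristic in ET.fromstring(xml_text).iter("heuristic"):
        name = heuristic.findtext("heuristicname")
        levels = {}
        params = heuristic.find("params")
        if params is not None:
            for param in params:
                # Fraction accepts both "10" and "1/8" style values.
                levels[param.tag] = [
                    float(Fraction(v.strip())) for v in param.text.split(",")
                ]
        result[name] = levels
    return result


thresholds = load_threshold_params(CONF)
for param, levels in thresholds["Mapper Data Skew"].items():
    print(param, levels)
```

Each list carries four ascending levels, which lines up with the four non-NONE severities (LOW, MODERATE, SEVERE, CRITICAL).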

Workflow monitoring and reports
● Performance characteristics change
○ Data Growth
○ Data distribution change
○ Hardware change
○ Incremental software change
● Monitor performance on each execution
● Compare behaviour across revisions
● Cost to Serve analysis
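Monitoring each execution and comparing it with the previous one can be sketched as a simple regression check. The execution IDs, the resource-usage metric, and the 20% tolerance below are all made-up assumptions for illustration; they do not describe Dr. Elephant's reporting logic.

```python
def find_regressions(executions, tolerance=0.20):
    """Flag consecutive executions whose metric grew by more than tolerance.

    executions is a chronological list of (execution_id, metric_value).
    Returns (previous_id, current_id, relative_increase) tuples.
    """
    flagged = []
    for (prev_id, prev), (cur_id, cur) in zip(executions, executions[1:]):
        if prev > 0 and (cur - prev) / prev > tolerance:
            flagged.append((prev_id, cur_id, round((cur - prev) / prev, 2)))
    return flagged


# Hypothetical resource usage of one workflow across four executions.
runs = [
    ("exec-101", 400.0),
    ("exec-102", 410.0),
    ("exec-103", 620.0),  # e.g. after data growth or a code change
    ("exec-104", 600.0),
]
print(find_regressions(runs))
```

A flagged pair points to the execution where behaviour changed, which is where a comparison across revisions would start.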

Production Reviews | JIRA Bot
● Separate cluster for critical workloads
● Audit before deployment
● Improved accuracy
● Faster turnaround
● Higher throughput

Upcoming
● Job Resource Usage and Wastage
● Job Wait time
● Real time analysis of a job
● Workflow DAG visualization
● Improved Spark heuristics

References
Engineering Blog:
engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
Open Source Github Link:
github.com/linkedin/dr-elephant
Mailing List:
Dr-elephant-users
Hadoop Summit 2015:
https://www.youtube.com/watch?v=aL3OJ4YoxPA
