Dr. Elephant
github.com/linkedin/dr-elephant
Akshay Rai
Hadoop Dev Team
Introduction
Scaling Hadoop Infrastructure
Scale and Optimize Hardware
● More users, more jobs, more resources
● Large investment in hardware
● Can’t keep upgrading and adding machines to solve the problem forever
● Some tuning is needed to get things running
Users are more valuable than machines
What do we do?
Improve User Productivity
User Productivity
● Freedom to experiment and run jobs on the cluster
● Build tools to help developers (Hadoop DSL, resolvers for Pig/Hive)
○ Improve developer lifecycle
○ Also reduce unnecessary resource wastage
The Tuning Problem
How easy is it to tune a job?
● Problems are not obvious
● Critical information is scattered
● Inter-related settings
● Large parameter space
Here’s what we learned!
Expert Intervention
● Not enough support resources available
● Poor coverage
● Difficult to prioritize efforts
● Delays user development
Random Suggestions
Training is not at all easy
● Too many users
● Diverse backgrounds
● Scope is large and evolving
● Other responsibilities are more important
Scaling Productivity is Hard!
Dr. Elephant to the Rescue
What does Dr. Elephant do?
● Automated performance monitoring and tuning tool
● Helps every user get the best performance from their jobs
● Highlights common mistakes
● Indicates best practices and tuning tips
● Provides a platform for other performance related tools
● Analyzes a hundred thousand jobs every day
Architecture
Dashboard
Search
Job Page
MapReduce Report
Failed Job
Help Page
Tuning Tips
Awesome Features
Simplified analysis of a flow’s historical executions
● Monitoring performance, resource usage, and other metrics
● Comparing flows against previous executions
● Impact of tuning a specific parameter or changing a line of code
Flow History
Job History
Heuristics
How does a Heuristic work?
● Fetch Counters and Task Data
● Some logic to compute a value
● Compare value against threshold levels
Heuristic Severity
Severity | Description
CRITICAL | The job is in critical state and must be tuned
SEVERE | There is scope for improvement
MODERATE | There is scope for further improvement
LOW | There is scope for a few minor improvements
NONE | The job is safe. No tuning necessary
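The three heuristic steps and the severity levels above can be sketched roughly as follows. This is a minimal Python illustration, not Dr. Elephant's actual code: the `Severity` enum mirrors the table, while `severity_for`, `spill_ratio_heuristic`, the counter names, and the threshold values are all hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above, ordered from best to worst."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_for(value, thresholds):
    """Map a computed value to a severity.

    `thresholds` holds four ascending limits (LOW, MODERATE, SEVERE, CRITICAL).
    The result is the highest level whose limit the value reaches.
    """
    severity = Severity.NONE
    for level, limit in zip(list(Severity)[1:], thresholds):
        if value >= limit:
            severity = level
    return severity


def spill_ratio_heuristic(counters, thresholds=(2.0, 2.5, 3.0, 4.0)):
    """Hypothetical heuristic: records spilled per map output record."""
    # Step 1: fetch counters (here simply passed in as a plain dict).
    spilled = counters["spilled_records"]
    output = counters["map_output_records"]
    # Step 2: some logic to compute a value.
    ratio = spilled / output if output else 0.0
    # Step 3: compare the value against the threshold levels.
    return severity_for(ratio, thresholds)
```

Keeping the threshold comparison in one shared helper means each heuristic only has to supply its own value and its own four limits.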
Example | Mapper Data Skew
Mapper Skew Problem
● The number of Mappers depends on the number of splits
● Varying split sizes can cause skew in the Mapper input
Solution to Mapper Skewness
● Each Mapper should process the same amount of data
● Combine the small chunks and feed them to a single Mapper
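The idea of combining small chunks can be sketched as a simple greedy packing. This is an illustration only, assuming chunk sizes are known up front; `combine_splits` and `target_size` are hypothetical names, and in Hadoop itself this kind of grouping is typically delegated to a combining input format such as `CombineFileInputFormat` rather than written by hand.

```python
def combine_splits(chunk_sizes, target_size):
    """Greedily pack chunks into groups of roughly `target_size` bytes.

    Each returned group is the list of chunk sizes one Mapper would process.
    A chunk larger than `target_size` ends up in a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in chunk_sizes:
        # Close the current group if adding this chunk would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Seven uneven chunks become three Mapper inputs instead of seven.
groups = combine_splits([10, 20, 30, 128, 5, 60, 60], target_size=128)
```

A greedy pass does not guarantee perfectly equal groups, but it removes the many tiny inputs that make some Mappers finish far earlier than others.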
Example | Spark Executor Load Balance
[Diagram: a Spark Driver with Executors 1, 2, and 3; an RDD split into Partitions 1, 2, and 3]
Custom Heuristics
Adding a New Heuristic
1. Create a new heuristic and test it.
2. Create a new view for the heuristic. For example, helpMapperSpill.scala.html
3. Add the details of the heuristic in the HeuristicConf.xml file.
<heuristic>
  <applicationtype>mapreduce</applicationtype>
  <heuristicname>Mapper GC</heuristicname>
  <classname>com.linkedin.dre.mapreduce.heuristics.MapperGC</classname>
  <viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
4. Run Dr. Elephant. It should now include the new heuristic.
Configuring Heuristics/Threshold levels
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
    <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
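To make the threshold parameters concrete, here is a small Python sketch that reads a configuration like the one above and turns each comma-separated list into numbers. It is not how Dr. Elephant itself loads the file; `parse_levels` and `load_thresholds` are hypothetical helpers, and reading the four values as one limit per severity level from LOW to CRITICAL is an assumption.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# A trimmed copy of the configuration shown above.
CONF = """
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
"""


def parse_levels(text):
    """Parse a comma-separated list of integers or fractions into floats."""
    return [float(Fraction(part.strip())) for part in text.split(",")]


def load_thresholds(xml_text):
    """Return {heuristic name: {param name: [threshold levels]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for heuristic in root.findall("heuristic"):
        name = heuristic.findtext("heuristicname")
        params = heuristic.find("params")
        # A heuristic without a <params> element simply gets no overrides.
        children = params if params is not None else []
        result[name] = {p.tag: parse_levels(p.text) for p in children}
    return result


thresholds = load_thresholds(CONF)
```

Using `Fraction` lets values such as `1/8` be written in the file exactly as they appear in the slide while still comparing as ordinary numbers.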
Elephagent
Workflow monitoring and reports
● Performance characteristics change
○ Data Growth
○ Data distribution change
○ Hardware change
○ Incremental software change
● Monitor performance on each execution
● Compare behaviour across revisions
● Cost to Serve analysis
Production Reviews | JIRA Bot
● Separate cluster for critical workloads
● Audit before deployment
● Improved accuracy
● Faster turnaround
● Higher throughput
Future Plans
Upcoming
● Job Resource Usage and Wastage
● Job Wait time
● Real time analysis of a job
● Workflow DAG visualization
● Improved Spark heuristics
References
Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
Open Source GitHub Link: github.com/linkedin/dr-elephant
Mailing List: Dr-elephant-users
Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA
Thank You
©2014 LinkedIn Corporation. All Rights Reserved.
© 2016

Hadoop & Spark Performance tuning using Dr. Elephant