Hadoop Vaidya
Viraj Bhat, Suhas Gogate, Milind Bhandarkar
Cloud Computing & Data Infrastructure Group, Yahoo! Inc.
Hadoop World, October 2, 2009

Hadoop & Job Optimization: Why?
  • Hadoop is a highly configurable commodity cluster computing framework
  • Performance tuning of Hadoop jobs is a significant challenge
      • 165+ tunable parameters
      • Tuning one parameter can adversely affect others
  • Hadoop job optimization
      • Job performance (user perspective)
          • Reduce end-to-end execution time
          • Yield quicker analysis of data
      • Cluster utilization (provider perspective)
          • Efficient sharing of cluster resources across multiple users
          • Increase overall throughput in terms of jobs per unit time

Hadoop Vaidya: Rule-Based Performance Diagnostics Tool
  • Rule-based performance diagnosis of M/R jobs
      • M/R performance analysis expertise is captured and provided as input through a set of predefined diagnostic rules
      • Detects performance problems by postmortem analysis of a job, executing the diagnostic rules against the job execution counters
      • Provides targeted advice for individual performance problems
  • Extensible framework
      • You can add your own rules, based on a rule template and the published job counter data structures
      • Write complex rules using existing simpler rules
  • Vaidya: an expert (versed in his own profession, esp. in medical science), skilled in the art of healing; a physician

Hadoop Vaidya: Status
  • Input data used for evaluating the rules
      • Job history, job configuration (XML)
  • A contrib project under Apache Hadoop
      • Available in Hadoop version 0.20.0
      • http://issues.apache.org/jira/browse/HADOOP-4179
  • Automated deployment for analysis of thousands of daily jobs on the Yahoo! Grids
      • Helps quickly identify inefficient user jobs that use excess resources, and advise their owners appropriately
      • Helps certify user jobs before they move to production clusters (compliance)

Diagnostic Test Rule

  <DiagnosticTest>
    <Title>Balanced Reduce Partitioning</Title>
    <ClassName>org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning</ClassName>
    <Description>This rule tests as to how well the input to reduce tasks is balanced</Description>
    <Importance>High</Importance>
    <SuccessThreshold>0.20</SuccessThreshold>
    <Prescription>advice</Prescription>
    <InputElement>
      <PercentReduceRecords>0.85</PercentReduceRecords>
    </InputElement>
  </DiagnosticTest>

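To make the rule definition concrete, here is a minimal Python sketch that reads the fields of the rule shown on this slide. It is only an illustration of the XML layout: Vaidya itself is a Java tool, and the function name `load_rule` and the dictionary keys are hypothetical, not part of Vaidya.

```python
import xml.etree.ElementTree as ET

# The rule definition from the slide, copied as-is.
RULE_XML = """
<DiagnosticTest>
  <Title>Balanced Reduce Partitioning</Title>
  <ClassName>org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning</ClassName>
  <Description>This rule tests as to how well the input to reduce tasks is balanced</Description>
  <Importance>High</Importance>
  <SuccessThreshold>0.20</SuccessThreshold>
  <Prescription>advice</Prescription>
  <InputElement>
    <PercentReduceRecords>0.85</PercentReduceRecords>
  </InputElement>
</DiagnosticTest>
"""


def load_rule(xml_text):
    """Parse one <DiagnosticTest> element into a plain dictionary.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    root = ET.fromstring(xml_text.strip())

    def text(tag):
        # Element text on the slide is padded with spaces, so strip it.
        return root.findtext(tag).strip()

    # Every child of <InputElement> is treated as a numeric rule parameter.
    inputs = {child.tag: float(child.text) for child in root.find("InputElement")}
    return {
        "title": text("Title"),
        "class_name": text("ClassName"),
        "importance": text("Importance"),
        "success_threshold": float(text("SuccessThreshold")),
        "inputs": inputs,
    }


rule = load_rule(RULE_XML)
print(rule["title"], rule["success_threshold"], rule["inputs"])
```

Keeping thresholds and parameters in the XML rather than in code is what lets a rule be retuned without recompiling it.
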
Diagnostic Report Element

  <TestReportElement>
    <TestTitle>Balanced Reduce Partitioning</TestTitle>
    <TestDescription>This rule tests as to how well the input to reduce tasks is balanced</TestDescription>
    <TestImportance>HIGH</TestImportance>
    <TestResult>POSITIVE(FAILED)</TestResult>
    <TestSeverity>0.98</TestSeverity>
    <ReferenceDetails>
      * TotalReduceTasks: 1000
      * BusyReduceTasks processing 85% of total records: 2
      * Impact: 0.98
    </ReferenceDetails>
    <TestPrescription>
      * Use the appropriate partitioning function
      * For streaming job consider following partitioner and hadoop config parameters
      * org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
      * -jobconf stream.map.output.field.separator, -jobconf stream.num.map.output.key.fields
    </TestPrescription>
  </TestReportElement>

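The idea behind a reduce-skew check like this one can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vaidya's implementation: the slides do not give the impact formula, so the sketch assumes impact = 1 - (busy reducers / total reducers), where "busy" reducers are the smallest set that together receive the configured share of records, and it assumes the test fails when the impact exceeds the success threshold. The function names are hypothetical, and this formula is not guaranteed to reproduce the 0.98 shown in the report.

```python
def busy_reducers(records_per_reducer, percent_reduce_records):
    """Smallest number of reducers that together hold the given share of records."""
    total = sum(records_per_reducer)
    if total == 0:
        return 0
    target = percent_reduce_records * total
    running = 0
    # Take the heaviest reducers first until the target share is covered.
    for count, records in enumerate(sorted(records_per_reducer, reverse=True), start=1):
        running += records
        if running >= target:
            return count
    return len(records_per_reducer)


def partitioning_impact(records_per_reducer, percent_reduce_records=0.85):
    """Assumed impact formula: 1 - busy/total. Near 1.0 means heavy skew."""
    n = len(records_per_reducer)
    if n == 0:
        return 0.0
    return 1.0 - busy_reducers(records_per_reducer, percent_reduce_records) / n


def rule_fails(records_per_reducer, percent_reduce_records=0.85, success_threshold=0.20):
    """Assumed pass/fail convention: fail when impact exceeds the threshold."""
    return partitioning_impact(records_per_reducer, percent_reduce_records) > success_threshold


# One reducer receives 90% of all records: heavily skewed.
skewed = [9000, 500] + [10] * 50
# Ten reducers with equal load: well balanced.
balanced = [100] * 10

print(rule_fails(skewed), rule_fails(balanced))
```

Under these assumptions, a job where a handful of reducers handle most of the records scores close to 1.0, which is the situation the prescription addresses by recommending a better partitioning function.
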
Hadoop Vaidya Rules: Examples
  • Balanced Reduce Partitioning
      • Checks if intermediate data is well partitioned among reducers
  • Map/Reduce tasks reading HDFS files as side effect
      • Checks if HDFS files are being read as a side effect, causing an access bottleneck across map/reduce tasks
  • Percent re-execution of Map/Reduce tasks
  • Map tasks data locality
      • Checks the % data locality for map tasks
  • Use of Combiner & Combiner efficiency
      • Checks whether there is potential benefit in using a combiner after the map stage
  • Intermediate data compression
      • Checks if intermediate data is compressed to lower the shuffle time
  • Currently there are 15 rules

Performance Analysis for a Sample Set of Jobs
  [Chart not preserved in this transcript. Axis label: Vaidya Rules. Total jobs analyzed = 794]

Future Enhancements
  • Online progress analysis of Map/Reduce jobs to improve utilization
  • Correlation of the various prescriptions suggested by Hadoop Vaidya to detect larger performance bottlenecks
  • Proactive SLA monitoring
      • Detect, early enough, jobs that are executing inefficiently or that would eventually fail due to resource constraints
  • Integration with the Job History viewer
  • Production job certification

Results of Hadoop Vaidya
  • Total jobs analyzed = 22602
  • Rules which yielded POSITIVE (TEST FAILED)
      • Balanced Reduce Partitioning (4247 jobs / 18.79%)
      • Impact of Map tasks re-execution (1 job)
      • Impact of Reduce tasks re-execution (8 jobs)
      • Map/Reduce tasks reading HDFS data as side effect (20570 jobs / 91%)
      • Map side disk spill (864 jobs / 3.8%)
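
The percentages on this slide follow directly from the job counts. As a quick check, the short Python sketch below recomputes each rule's failure rate from the figures given; the dictionary and function names are illustrative only.

```python
TOTAL_JOBS = 22602

# Jobs for which each rule yielded POSITIVE (TEST FAILED), as listed on the slide.
FAILED_JOBS = {
    "Balanced Reduce Partitioning": 4247,
    "Impact of Map tasks re-execution": 1,
    "Impact of Reduce tasks re-execution": 8,
    "Map/Reduce tasks reading HDFS data as side effect": 20570,
    "Map side disk spill": 864,
}


def failure_rates(failed_jobs, total_jobs):
    """Percentage of analyzed jobs that failed each rule."""
    return {rule: 100.0 * count / total_jobs for rule, count in failed_jobs.items()}


for rule, pct in failure_rates(FAILED_JOBS, TOTAL_JOBS).items():
    print(f"{rule}: {pct:.2f}%")
```
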
