RFHOC: A Random-Forest Approach to Auto-Tuning Hadoop's
Configuration
Abstract:
Hadoop is a widely used implementation of the MapReduce programming
model for large-scale data processing. Hadoop performance, however, is
significantly affected by the settings of its configuration parameters.
Unfortunately, tuning these parameters manually is very time-consuming,
if it is practical at all. This paper proposes an approach, called RFHOC, that
automatically tunes the Hadoop configuration parameters to optimize the
performance of a given application running on a given cluster. RFHOC constructs
two ensembles of performance models using a random-forest approach, one for
the map stage and one for the reduce stage. Leveraging these models, RFHOC
employs a genetic algorithm to automatically search the Hadoop configuration
space. An evaluation of RFHOC using five typical Hadoop programs, each with five
different input data sets, shows that it achieves a speedup of 2.11x on
average, and up to 7.4x, over the recently proposed cost-based
optimization (CBO) approach. In addition, RFHOC's performance benefit increases
with input data set size.
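
The search step described in the abstract can be illustrated with a minimal sketch: a genetic algorithm that looks for the configuration with the lowest predicted runtime. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three parameter names and their ranges are hypothetical, and the two `predict_*` functions are toy closed-form stand-ins for the random-forest model ensembles, which in RFHOC would be trained on measured runs.

```python
import random

# Hypothetical search space: parameter name -> (low, high) integer range.
# Names and ranges are illustrative assumptions, not values from the paper.
SPACE = {
    "io_sort_mb": (50, 400),
    "reduce_tasks": (1, 64),
    "sort_factor": (10, 100),
}

def predict_map_time(cfg):
    # Toy stand-in for the map-stage performance model ensemble.
    return 100.0 + 0.4 * abs(cfg["io_sort_mb"] - 250) + 0.5 * abs(cfg["sort_factor"] - 60)

def predict_reduce_time(cfg):
    # Toy stand-in for the reduce-stage performance model ensemble.
    return 80.0 + 2.0 * abs(cfg["reduce_tasks"] - 32)

def predicted_runtime(cfg):
    # Fitness to minimize: sum of the two per-stage predictions.
    return predict_map_time(cfg) + predict_reduce_time(cfg)

def random_config(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}

def crossover(a, b, rng):
    # Uniform crossover: each parameter comes from either parent.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}

def mutate(cfg, rng, rate=0.2):
    # Resample each parameter within its range with probability `rate`.
    out = dict(cfg)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out

def tournament(pop, rng, k=3):
    # Pick the best of k randomly chosen individuals.
    return min(rng.sample(pop, k), key=predicted_runtime)

def genetic_search(generations=40, pop_size=30, elite=2, seed=0):
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predicted_runtime)
        next_pop = pop[:elite]  # elitism: the best configurations survive unchanged
        while len(next_pop) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            next_pop.append(mutate(child, rng))
        pop = next_pop
    return min(pop, key=predicted_runtime)

if __name__ == "__main__":
    best = genetic_search()
    print(best, predicted_runtime(best))
```

Because the fitness function only queries the models, no Hadoop job has to be executed during the search. Elitism is a design choice of this sketch; it guarantees that the best predicted runtime never gets worse from one generation to the next.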
