IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 264
OPTIMIZATION OF WORKLOAD PREDICTION BASED ON MAP
REDUCE FRAME WORK IN A CLOUD SYSTEM
V. Sivaranjani¹, R. Jayamala²
¹Student, Pervasive Computing Technology, Bharathidasan Institute of Technology, Tamil Nadu, India
²Assistant Professor, Computer Science and Engineering, Bharathidasan Institute of Technology, Tamil Nadu, India
Abstract
Cloud computing is an emerging technology that allows resources to be accessed anytime and anywhere through the internet.
Hadoop is an open-source cloud computing environment that implements the Google™ MapReduce framework; it supports
distributed processing of large datasets across large clusters of computers. This paper analyses the workload of jobs run in
cluster mode on Hadoop. MapReduce is the programming model Hadoop uses to process these jobs. Based on statistics from
the job analysis, the future workload of the cluster is predicted with a genetic algorithm, with a view to potential
performance optimization.
Key Words: Cloud computing, Hadoop Framework, MapReduce Analysis, Workload
--------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
Large-scale data processing is an important aspect of a
multi-node cluster setup, and it is a challenging problem.
The MapReduce framework [1], proposed by Google,
provides an efficient and scalable solution for working with
large-scale data. Its basic idea is to distribute the data
among many nodes and process it in parallel. Hadoop is an
open-source implementation of the MapReduce framework
and is used by companies such as Yahoo, Facebook and
Twitter.
MapReduce consists of two phases: 1) Map and 2) Reduce.
The Map phase splits the job into several independent
chunks, and each chunk is assigned to a different data node
for computation. In the Reduce phase, the data is
aggregated, summarized, filtered or combined. The result is
stored in a distributed file system.
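As a minimal illustration of the two phases described above, the following pure-Python sketch (not Hadoop code) splits an input into chunks, maps each chunk independently, and then reduces the partial results into one aggregate. The word-count task and the names `split_into_chunks`, `map_chunk` and `reduce_counts` are assumptions chosen for illustration; they do not come from the paper.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Split the input into independent chunks (round-robin assignment)."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map phase: count the words in one chunk, as one data node would."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce phase: merge two partial results into one aggregate."""
    return left + right


# Illustrative input only; a real job would read from the distributed file system.
lines = ["map reduce map", "reduce hadoop", "map hadoop hadoop"]
chunks = split_into_chunks(lines, n_chunks=2)
partials = [map_chunk(chunk) for chunk in chunks]  # would run in parallel on a cluster
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

On a real cluster the map calls run on different nodes; here they run one after another, which is enough to show that each chunk can be processed without knowledge of the others.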
In Hadoop [2], the components of the MapReduce
framework are 1) Job Tracker, 2) Task Tracker, 3) Name
Node and 4) Data Node.
The Name Node stores the file system metadata: which files
map to which block locations, and which blocks are stored
on which Data Node. The Data Node is where the actual
data resides. Every Data Node sends a heartbeat message to
the Name Node every 3 seconds to signal that it is alive. If
the Name Node does not receive a heartbeat from a Data
Node for 10 minutes, that Data Node is considered dead.
Data Nodes also talk to each other to rebalance, move and
copy data. The Job Tracker manages the Task Trackers and
handles resource management, that is, it tracks resource
availability and the timing of each job. Each Task Tracker is
pre-configured with the number of tasks it can accept. The
Job Tracker keeps a Job History, from which the
information required to predict the future workload is
obtained.
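The heartbeat rule described above can be sketched as a small bookkeeping class. This is an illustrative model only, not Hadoop's implementation: the class `HeartbeatMonitor`, its methods and the node names are hypothetical, and only the 3-second interval and 10-minute timeout are taken from the text.

```python
HEARTBEAT_INTERVAL_S = 3     # interval stated in the text
DEAD_TIMEOUT_S = 10 * 60     # timeout stated in the text


class HeartbeatMonitor:
    """Hypothetical Name Node-side record of when each Data Node last reported."""

    def __init__(self, timeout_s=DEAD_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> time of last heartbeat, in seconds

    def record_heartbeat(self, node_id, now_s):
        self.last_seen[node_id] = now_s

    def dead_nodes(self, now_s):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.record_heartbeat("datanode-1", now_s=0)
monitor.record_heartbeat("datanode-2", now_s=0)
# datanode-1 keeps reporting every 3 s; datanode-2 goes silent after t = 0.
for t in range(HEARTBEAT_INTERVAL_S, 700, HEARTBEAT_INTERVAL_S):
    monitor.record_heartbeat("datanode-1", now_s=t)
print(monitor.dead_nodes(now_s=700))  # prints ['datanode-2']
```

Time is passed in explicitly rather than read from the system clock so that the rule can be checked without waiting ten minutes.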
This paper describes workload prediction on the MapReduce
framework. Section 2 describes the system architecture
design. Section 3 describes load prediction. Section 4
describes the optimization process. Section 5 describes the
implementation and analysis. Section 6 gives the conclusion
and future work.
2. SYSTEM ARCHITECTURE DESIGN
Jobs are executed on the cluster so that job history
information can be obtained from the Job Tracker. Fig -1
shows the architecture for optimizing workload prediction
based on the MapReduce framework in a cloud system. The
MapReduce framework consists of the Name Node, the Job
Tracker and the Task Tracker. The Name Node keeps track
of the files in the distributed file system. The Job Tracker
monitors resource availability and performs resource
management for the MapReduce framework.
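Fig -1: System Architecture Design (labels in the figure: Prediction Tracker, Analysis, GA, MapReduce Framework, Name Node, Job Tracker, Log History, Task Tracker)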
The Job Tracker maintains two kinds of records: 1) logs and
2) job history. The job history keeps the descriptions of past
jobs and provides parameters such as the number of nodes
in the cluster, the number of jobs, the job ID, and the
execution time and memory usage of each job. The Task
Tracker runs where the data resides and maintains data node
information.
This paper proposes a prediction tracker component for the
MapReduce framework. The prediction tracker consists of
two components: 1) Analysis and 2) GA (Genetic
Algorithm). The Analysis component gets the job history
information from the Job Tracker. The GA is used to
predict the future workload in an optimized manner.
3. LOAD PREDICTION PROCESS
Load prediction centres on the prediction tracker. Its
Analysis component acquires the required job history
information from the Job Tracker. The genetic algorithm is
then used to obtain an optimized workload prediction from
the historical data.
The process is as follows.
• Collect the workload of each job from the Hadoop cluster.
• Analyse the workload of each job.
• Based on the results, evaluate the optimization performance.
The trace file [3] of the Job Tracker contains the following
fields: JobID (a unique job identifier), job status (successful,
failed or killed), job submission time, job launch time, job
finish time, the number of map tasks, the number of reduce
tasks, the total duration of map tasks, the total duration of
reduce tasks, bytes read/written on HDFS (Hadoop
Distributed File System), and bytes read/written on local
disks.
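One way to hold a trace entry with the fields listed above is a small record type. The sketch below is an assumption about representation only: the field names, the use of seconds as the time unit, and the two derived properties are illustrative and are not taken from the Hadoop trace format.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One Job Tracker trace entry; field names and units are assumed."""
    job_id: str
    status: str                # "successful", "failed" or "killed"
    submit_time: float         # seconds (assumed unit)
    launch_time: float
    finish_time: float
    num_map_tasks: int
    num_reduce_tasks: int
    map_duration: float        # total duration of map tasks, seconds
    reduce_duration: float     # total duration of reduce tasks, seconds
    hdfs_bytes_read: int
    hdfs_bytes_written: int
    local_bytes_read: int
    local_bytes_written: int

    @property
    def waiting_time(self):
        """Time between submission and launch."""
        return self.launch_time - self.submit_time

    @property
    def completion_time(self):
        """Time between submission and finish."""
        return self.finish_time - self.submit_time


# Made-up values for illustration; not data from the paper.
job = JobRecord(
    job_id="job_0001", status="successful",
    submit_time=100.0, launch_time=104.0, finish_time=160.0,
    num_map_tasks=8, num_reduce_tasks=2,
    map_duration=240.0, reduce_duration=30.0,
    hdfs_bytes_read=1_000_000, hdfs_bytes_written=50_000,
    local_bytes_read=200_000, local_bytes_written=200_000,
)
print(job.waiting_time, job.completion_time)  # prints 4.0 60.0
```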
4. OPTIMIZATION PROCESS
The Hadoop framework provides the Job Tracker trace file,
from which the job submission time is obtained. The
prediction process [3][4] is based on the job submission
time and the time each job takes to complete.
Forecasting (prediction) is an essential part of managing any
organization, because it supports planning for the future. It
is used to estimate future inventory, costs, capacities and
interest rate changes. There are two basic approaches to
forecasting: qualitative and quantitative [6]. Qualitative
approaches are subjective and are appropriate when past
data are not available. Quantitative approaches forecast
future data when past data are available.
This paper focuses on the quantitative approach, based on an
analysis of historical data treated as a time series. A time
series is a set of observations measured at successive points
in time. Time series analysis is used to predict future values
from previously observed values [7].
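To make the notion of a workload time series concrete, the sketch below turns a list of job submission times into counts per fixed-width interval. The function name `jobs_per_interval`, the 60-second interval and the sample timestamps are assumptions for illustration; the paper does not state how its series is built.

```python
from collections import Counter


def jobs_per_interval(submit_times, interval_s):
    """Count job submissions in consecutive fixed-width time intervals."""
    if not submit_times:
        return []
    start = min(submit_times)
    buckets = Counter(int((t - start) // interval_s) for t in submit_times)
    # Fill in empty intervals with zero so the series is evenly spaced.
    return [buckets.get(i, 0) for i in range(max(buckets) + 1)]


# Made-up submission times in seconds; not data from the paper.
submit_times = [0, 10, 50, 70, 130, 140, 150, 250]
series = jobs_per_interval(submit_times, interval_s=60)
print(series)  # prints [3, 1, 3, 0, 1]
```

Keeping the empty interval as an explicit zero matters: a forecaster that sees only the non-empty intervals would treat the observations as equally spaced when they are not.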
A genetic algorithm is used to find the predicted value from
historical data [8]. In the first step, the initial population is
selected from the original data elements. Each element is
converted to a binary number to form a binary string, or
chromosome. A crossover point is then selected, and the
crossover and mutation operations are performed. Finally,
the resulting binary strings are converted back to real
values. The genetic algorithm thus has three types of
operator: selection, crossover and mutation.
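The encoding and the two variation operators described above can be sketched as follows. The 8-bit chromosome length, the function names and the mutation rate are assumptions for illustration; the paper does not give these details.

```python
import random

N_BITS = 8  # chromosome length; an assumption for this sketch


def encode(value, n_bits=N_BITS):
    """Convert a non-negative integer to a fixed-width binary string."""
    return format(value, f"0{n_bits}b")


def decode(chromosome):
    """Convert a binary string back to an integer."""
    return int(chromosome, 2)


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes after the crossover point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flipped = []
    for bit in chromosome:
        if rng.random() < rate:
            flipped.append("1" if bit == "0" else "0")
        else:
            flipped.append(bit)
    return "".join(flipped)


a, b = encode(12), encode(15)                       # '00001100' and '00001111'
child_a, child_b = single_point_crossover(a, b, point=7)
print(child_a, decode(child_a))                     # prints 00001101 13
print(child_b, decode(child_b))                     # prints 00001110 14
mutated = mutate(child_a, rate=0.1, rng=random.Random(0))
print(mutated, decode(mutated))
```

Crossing 12 and 15 at the last bit yields 13 and 14, which shows how recombination produces candidate values that lie between the parents without either parent containing them.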
The genetic algorithm [9] proceeds as follows:
1. Initialize the population with random individuals.
2. Evaluate the fitness value of the individuals.
3. Select good solutions using s-wise tournament
selection without replacement.
4. Create new individuals by recombining the selected
population using single-point crossover.
5. Evaluate the fitness value of all offspring.
6. Repeat steps 3–5 until some convergence criteria are met.
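The six steps above can be sketched as a short loop. This is a sketch under stated assumptions, not the authors' implementation: the fitness function (the negative mean absolute percentage error of a single candidate forecast against a made-up recent history), the 8-bit encoding, the population size and the convergence test are all hypothetical choices. Mutation is left out because the listed steps do not include it.

```python
import random


def tournament_select(population, fitness, s, rng):
    """s-wise tournament selection without replacement.

    The population is shuffled and cut into groups of s, and the fittest
    member of each group is kept. Doing this s times returns as many
    individuals as the population holds when its size is divisible by s.
    """
    selected = []
    for _ in range(s):
        shuffled = population[:]
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), s):
            selected.append(max(shuffled[i:i + s], key=fitness))
    return selected


def crossover(a, b, rng):
    """Single-point crossover at a random interior point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def run_ga(fitness, n_bits=8, pop_size=20, s=2, max_generations=50, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the population with random bit strings.
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    # Step 2: evaluate fitness and remember the best individual seen so far.
    best = max(population, key=fitness)
    for _ in range(max_generations):
        # Step 3: select good solutions.
        parents = tournament_select(population, fitness, s, rng)
        # Step 4: recombine consecutive pairs of selected parents.
        offspring = []
        for i in range(0, len(parents) - 1, 2):
            offspring.extend(crossover(parents[i], parents[i + 1], rng))
        # Step 5: evaluate the offspring.
        population = offspring
        best = max(population + [best], key=fitness)
        # Step 6: stop once every individual is identical (assumed criterion).
        if len(set(population)) == 1:
            break
    return best


history = [12, 14, 15, 13, 14]  # made-up recent workload values


def fitness(chromosome):
    """Assumed fitness: negative MAPE of the decoded forecast; higher is better."""
    forecast = int(chromosome, 2)
    return -sum(abs(a - forecast) / a for a in history) / len(history)


best = run_ga(fitness)
print(best, int(best, 2))
```

Without mutation the search can only recombine bits already present in the initial population, so the result depends on the random seed and is not guaranteed to be the optimum.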
The error rate is calculated using the mean absolute
percentage error (MAPE), also known as the mean absolute
percentage deviation (MAPD). In statistics, it is a measure
of the accuracy of a method for constructing fitted time
series values. The formula for MAPE is:
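$$M = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$
where
A_t = actual value at time t
F_t = forecast value at time t
n = number of observations
M = mean absolute percentage error (multiply by 100 to express it as a percentage)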
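The MAPE definition above translates directly into a few lines of Python. The function name `mape`, the input checks and the sample values are assumptions added for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (multiply by 100 for %)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return total / len(actual)


# Made-up values for illustration; not data from the paper.
actual = [100, 200, 400]
forecast = [110, 180, 400]
print(mape(actual, forecast))  # (0.10 + 0.10 + 0.00) / 3
```

The explicit check for zero actual values reflects a known limitation of MAPE: an interval with no jobs at all cannot be scored with this measure.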
5. IMPLEMENTATION AND ANALYSIS
In this work, the Hadoop framework is installed on the
Ubuntu operating system. Job history details are extracted
from the Job Tracker as a time series. Table -1 shows the
error rate of the workload prediction.
Table -1: Example of error value calculation
Sl. No.   Predicted Value   Actual Value   Error Rate
1         12                15             0.2
2         15                14             0.07142
3         4                 5              0.2
MAPE error rate (%): 9.04733
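The per-row error rates in Table -1 follow from the MAPE definition, |A_t − F_t| / A_t. As a check on the three sample rows only (the symbols e_1 to e_3 are introduced here for the row errors):

```latex
\begin{align*}
e_1 &= \frac{|15 - 12|}{15} = 0.2 \\
e_2 &= \frac{|14 - 15|}{14} \approx 0.07143 \\
e_3 &= \frac{|5 - 4|}{5} = 0.2 \\
\frac{e_1 + e_2 + e_3}{3} &\approx \frac{0.47143}{3} \approx 0.15714 \quad (\approx 15.71\,\%)
\end{align*}
```

The table lists the second value as 0.07142, which is the same quantity truncated rather than rounded. The three rows on their own average to about 15.71 %, not the 9.04733 % given in the last line of the table, so the reported figure was presumably computed over more jobs than the sample rows shown; the source does not say.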
6. CONCLUSION AND FUTURE WORK
In this paper, we have presented an analysis of a Hadoop
trace derived from a single-node production Hadoop cluster.
The trace covers the job execution files. In future, we plan
to build on the implications of this work and apply them to
a multi-node cluster in real time.
REFERENCES
[1]. J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI, 2004, pp. 137–150.
[2]. T. White, Hadoop - The Definitive Guide. O’Reilly, 2009.
[3]. Zujie Ren, Xianghua Xu, Jian Wan et al., “Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao,” Proceedings of the 2012 IEEE International Symposium on Workload Characterization, 2012.
[4]. Sheng Di and Cho-Li Wang, “Error-Tolerant Resource Allocation and Payment Minimization for Cloud System,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 6, 2013, pp. 1097-1106.
[5]. Zhen Xiao, Weijia Song and Qi Chen, “Dynamic Resource Allocation Using Virtual Machines for Cloud Computing Environment,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 6, June 2013, pp. 1107-1117.
[6]. http://www.wikipwedia.com/wiki/Time_series.
[7]. Sam Mahfoud and Ganesh Mani, “Financial Forecasting Using Genetic Algorithms,” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.9698&rep=rep1&type=pdf.
[8]. Satyendra, Arghya Ghosh, Subhojit Roy, J. Pal Choudhury and S. R. Bhadra Chaudhuri, “A Novel Approach of Genetic Algorithm in Prediction of Time Series Data,” in Proc. of Special Issues of International Journal of Computer Application (ACCTHPCA), June 2012.
[9]. Abhishek Verma, Xavier Llora, David E. Goldberg and Roy H. Campbell, “Scaling Genetic Algorithms using MapReduce,” Proceedings of Journal of Cluster Computing, special issue, 2011.
BIOGRAPHIES
V. Sivaranjani is an M.E. student in Pervasive Computing
Technology at Bharathidasan Institute of Technology. Her
current research focuses on cloud computing and parallel
computing.
Mrs. R. Jayamala is an Assistant Professor in the
Department of Computer Science and Engineering at
Bharathidasan Institute of Technology. Her research focuses
on cloud computing and networks.
More Related Content

PDF
An enhanced adaptive scoring job scheduling algorithm with replication strate...
PDF
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...
PDF
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
PDF
Improving the Performance of Mapping based on Availability- Alert Algorithm U...
PDF
A survey on the performance of job scheduling in workflow application
PDF
An adaptive algorithm for task scheduling for computational grid
PDF
A survey of various scheduling algorithm in cloud computing environment
PDF
GROUPING BASED JOB SCHEDULING ALGORITHM USING PRIORITY QUEUE AND HYBRID ALGOR...
An enhanced adaptive scoring job scheduling algorithm with replication strate...
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...
[IJET V2I2P18] Authors: Roopa G Yeklaspur, Dr.Yerriswamy.T
Improving the Performance of Mapping based on Availability- Alert Algorithm U...
A survey on the performance of job scheduling in workflow application
An adaptive algorithm for task scheduling for computational grid
A survey of various scheduling algorithm in cloud computing environment
GROUPING BASED JOB SCHEDULING ALGORITHM USING PRIORITY QUEUE AND HYBRID ALGOR...

What's hot (19)

PDF
A Survey of Job Scheduling Algorithms Whit Hierarchical Structure to Load Ba...
PDF
Optimized Access Strategies for a Distributed Database Design
PDF
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
PDF
Heuristics based multi queue job scheduling for cloud computing environment
PDF
Comparative Analysis of Various Grid Based Scheduling Algorithms
PDF
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...
PDF
Dynamically Partitioning Big Data Using Virtual Machine Mapping
PDF
Job Resource Ratio Based Priority Driven Scheduling in Cloud Computing
PDF
DGBSA : A BATCH JOB SCHEDULINGALGORITHM WITH GA WITH REGARD TO THE THRESHOLD ...
PDF
J0210053057
PDF
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
PDF
Fusion method used to tolerate the faults occurred in disrtibuted system
PDF
IRJET- Enhance Dynamic Heterogeneous Shortest Job first (DHSJF): A Task Schedu...
PDF
Document retrieval using clustering
PDF
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
PDF
Ijarcet vol-2-issue-3-904-915
PDF
Distributed Feature Selection for Efficient Economic Big Data Analysis
PDF
Hybrid Approach for Intrusion Detection Model Using Combination of K-Means Cl...
PDF
Energy efficient task scheduling algorithms for cloud data centers
A Survey of Job Scheduling Algorithms Whit Hierarchical Structure to Load Ba...
Optimized Access Strategies for a Distributed Database Design
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Heuristics based multi queue job scheduling for cloud computing environment
Comparative Analysis of Various Grid Based Scheduling Algorithms
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...
Dynamically Partitioning Big Data Using Virtual Machine Mapping
Job Resource Ratio Based Priority Driven Scheduling in Cloud Computing
DGBSA : A BATCH JOB SCHEDULINGALGORITHM WITH GA WITH REGARD TO THE THRESHOLD ...
J0210053057
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Fusion method used to tolerate the faults occurred in disrtibuted system
IRJET- Enhance Dynamic Heterogeneous Shortest Job first (DHSJF): A Task Schedu...
Document retrieval using clustering
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
Ijarcet vol-2-issue-3-904-915
Distributed Feature Selection for Efficient Economic Big Data Analysis
Hybrid Approach for Intrusion Detection Model Using Combination of K-Means Cl...
Energy efficient task scheduling algorithms for cloud data centers
Ad

Viewers also liked (20)

PDF
Mobility management in heterogeneous wireless networks
PDF
Design and characterization of various shapes of microcantilever for human im...
PDF
A challenge for security and service level agreement in cloud computinge
PDF
Optimization of energy use intensity in a design build framework
PDF
A comparative study on road traffic management systems
PDF
Adaptive transmit diversity selection (atds) based on stbc and sfbc fir 2 x1 ...
PDF
Automated water head controller for domestic application
PDF
On generating functions of biorthogonal polynomials
PDF
Comparison of data security in grid and cloud
PDF
Importance of post processing for improved binarization of text documents
PDF
Novel model for rural housing development
PDF
Mac protocols for cooperative diversity in wlan
PDF
Impact of power electronics on global warming
PDF
Wear behaviour of si c reinforced al6061 alloy metal matrix composites by usi...
PDF
Intelligent location tracking scheme for handling user’s mobility
PDF
Td ams processing for vlsi implementation of ldpc decoder
PDF
An extended database reverse engineering – a key for database forensic invest...
PDF
Degradation of mono azo dye in aqueous solution using
PDF
Hybrid aco iwd optimization algorithm for minimizing weighted flowtime in clo...
PDF
Dsp based implementation of field oriented control of
Mobility management in heterogeneous wireless networks
Design and characterization of various shapes of microcantilever for human im...
A challenge for security and service level agreement in cloud computinge
Optimization of energy use intensity in a design build framework
A comparative study on road traffic management systems
Adaptive transmit diversity selection (atds) based on stbc and sfbc fir 2 x1 ...
Automated water head controller for domestic application
On generating functions of biorthogonal polynomials
Comparison of data security in grid and cloud
Importance of post processing for improved binarization of text documents
Novel model for rural housing development
Mac protocols for cooperative diversity in wlan
Impact of power electronics on global warming
Wear behaviour of si c reinforced al6061 alloy metal matrix composites by usi...
Intelligent location tracking scheme for handling user’s mobility
Td ams processing for vlsi implementation of ldpc decoder
An extended database reverse engineering – a key for database forensic invest...
Degradation of mono azo dye in aqueous solution using
Hybrid aco iwd optimization algorithm for minimizing weighted flowtime in clo...
Dsp based implementation of field oriented control of
Ad

Similar to Optimization of workload prediction based on map reduce frame work in a cloud system (20)

PDF
A hadoop map reduce
PDF
Paper id 25201498
PDF
IRJET- Analysis for EnhancedForecastof Expense Movement in Stock Exchange
PDF
Characterization of hadoop jobs using unsupervised learning
PDF
Survey on Performance of Hadoop Map reduce Optimization Methods
PDF
G017143640
PDF
Big Data Analysis and Its Scheduling Policy – Hadoop
PDF
IRJET- Big Data Processes and Analysis using Hadoop Framework
PDF
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
PDF
C044051215
PDF
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
DOC
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
PDF
IRJET-An Efficient Technique to Improve Resources Utilization for Hadoop Mapr...
PDF
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
PPT
Seminar Presentation Hadoop
PDF
An experimental evaluation of performance
PPTX
Hadoop Introduction
PDF
Earlier stage for straggler detection and handling using combined CPU test an...
PDF
High Dimensionality Structures Selection for Efficient Economic Big data usin...
A hadoop map reduce
Paper id 25201498
IRJET- Analysis for EnhancedForecastof Expense Movement in Stock Exchange
Characterization of hadoop jobs using unsupervised learning
Survey on Performance of Hadoop Map reduce Optimization Methods
G017143640
Big Data Analysis and Its Scheduling Policy – Hadoop
IRJET- Big Data Processes and Analysis using Hadoop Framework
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
C044051215
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
IRJET-An Efficient Technique to Improve Resources Utilization for Hadoop Mapr...
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Seminar Presentation Hadoop
An experimental evaluation of performance
Hadoop Introduction
Earlier stage for straggler detection and handling using combined CPU test an...
High Dimensionality Structures Selection for Efficient Economic Big data usin...

More from eSAT Publishing House (20)

PDF
Likely impacts of hudhud on the environment of visakhapatnam
PDF
Impact of flood disaster in a drought prone area – case study of alampur vill...
PDF
Hudhud cyclone – a severe disaster in visakhapatnam
PDF
Groundwater investigation using geophysical methods a case study of pydibhim...
PDF
Flood related disasters concerned to urban flooding in bangalore, india
PDF
Enhancing post disaster recovery by optimal infrastructure capacity building
PDF
Effect of lintel and lintel band on the global performance of reinforced conc...
PDF
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
PDF
Wind damage to buildings, infrastrucuture and landscape elements along the be...
PDF
Shear strength of rc deep beam panels – a review
PDF
Role of voluntary teams of professional engineers in dissater management – ex...
PDF
Risk analysis and environmental hazard management
PDF
Review study on performance of seismically tested repaired shear walls
PDF
Monitoring and assessment of air quality with reference to dust particles (pm...
PDF
Low cost wireless sensor networks and smartphone applications for disaster ma...
PDF
Coastal zones – seismic vulnerability an analysis from east coast of india
PDF
Can fracture mechanics predict damage due disaster of structures
PDF
Assessment of seismic susceptibility of rc buildings
PDF
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
PDF
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Likely impacts of hudhud on the environment of visakhapatnam
Impact of flood disaster in a drought prone area – case study of alampur vill...
Hudhud cyclone – a severe disaster in visakhapatnam
Groundwater investigation using geophysical methods a case study of pydibhim...
Flood related disasters concerned to urban flooding in bangalore, india
Enhancing post disaster recovery by optimal infrastructure capacity building
Effect of lintel and lintel band on the global performance of reinforced conc...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Shear strength of rc deep beam panels – a review
Role of voluntary teams of professional engineers in dissater management – ex...
Risk analysis and environmental hazard management
Review study on performance of seismically tested repaired shear walls
Monitoring and assessment of air quality with reference to dust particles (pm...
Low cost wireless sensor networks and smartphone applications for disaster ma...
Coastal zones – seismic vulnerability an analysis from east coast of india
Can fracture mechanics predict damage due disaster of structures
Assessment of seismic susceptibility of rc buildings
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...

Recently uploaded (20)

PPTX
additive manufacturing of ss316l using mig welding
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
Construction Project Organization Group 2.pptx
PPTX
Welding lecture in detail for understanding
PPT
Mechanical Engineering MATERIALS Selection
PDF
Structs to JSON How Go Powers REST APIs.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Geodesy 1.pptx...............................................
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
additive manufacturing of ss316l using mig welding
Lesson 3_Tessellation.pptx finite Mathematics
OOP with Java - Java Introduction (Basics)
Construction Project Organization Group 2.pptx
Welding lecture in detail for understanding
Mechanical Engineering MATERIALS Selection
Structs to JSON How Go Powers REST APIs.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Geodesy 1.pptx...............................................
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Strings in CPP - Strings in C++ are sequences of characters used to store and...
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Embodied AI: Ushering in the Next Era of Intelligent Systems
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
UNIT 4 Total Quality Management .pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx

Optimization of workload prediction based on map reduce frame work in a cloud system

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 264 OPTIMIZATION OF WORKLOAD PREDICTION BASED ON MAP REDUCE FRAME WORK IN A CLOUD SYSTEM V.Sivaranjani1 , R.Jayamala2 1 Student, Pervasive Computing Technology, Bharathidasan Institute of Technology,Tamil Nadu, India 2 Assistant Professor, Computer Science and Engineering, Bharathidasan Institute of Technology, Tamil Nadu, India Abstract Nowadays cloud computing is emerging Technology. It is used to access anytime and anywhere through the internet. Hadoop is an open-source Cloud computing environment that implements the Googletm MapReduce framework. Hadoop is a framework for distributed processing of large datasets across large clusters of computers. This paper proposes the workload of jobs in clusters mode using Hadoop. MapReduce is a programming model in hadoop used for maintaining the workload of the jobs. Depend on the job analysis statistics the future workload of the cluster is predicted for potential performance optimization by using genetic algorithm. Key Words: Cloud computing, Hadoop Framework, MapReduce Analysis, Workload --------------------------------------------------------------------***---------------------------------------------------------------------- 1. INTRODUCTION The large scale data processing is very important aspects of the multimode cluster setup. It is very challenging problem. The MapReduce framework [1] is proposed by Google provides an efficient and scalable solution for working large-scale data. The basic concept of MapReduce framework is used to distribute the data among many nodes and process them in parallel manner. Hadoop is a open- source implementation of MapReduce framework. Hadoop use the Yahoo, Facebook, Twitter etc. 
The MapReduce consists of the two Phases. 1) Map and 2) Reduce. The Map is used to split the job into several independent chunks and each chunks assigned to different computing data node. In the reduce phase, the data is aggregated, summarized, filtered or combining the given data. The result is stored in a Distributed File System. Hadoop[2] is an open-source implementation of a MapReduce framework. The components of the MapReduce framework are 1) Job Tracker, 2) Task Tracker, 3) Name Node 4) Data Node. The Name Node stores the file system metadata. Which file are maps to what block locations and which blocks are stored on which data node. The data node is where the actual data resides. All data nodes send the heartbeat messages to name node every 3 seconds to say data nodes are alive. If name node does not receive the heartbeat message from data node for 10 minutes, that data node is dead. All data node talks each other to rebalance the data, move and copy. The Job Tracker is used to managing the Task tracker and resource management that is tracking resource availability and time management of each job. The Task tracker is pre-configured a number of tasks and accept of each task. The Job Tracker consists of Job History. Get the required information from Job History to predict the future workload. This paper describe about work load prediction on map reduce framework. The chapter 2 describes about System Architecture Design. Chapter 3 describes about Load prediction. Chapter 4 describes optimization process. Chapter 5 describes about Implementation and analysis. Chapter 6 describes Conclusion and Future work. 2. SYSTEM ARCHITECTURE DESIGN The Job executes in cluster setup to get the job history information from the job tracker. The architecture design of the optimization of workload prediction based on the map reduce framework in a cloud system. Fig- 1. Represents the MapReduce framework consists of different components are Name Node, Job Tracker and Task Tracker. 
The Name node stores the file in a distribute file system. The Job Tracker monitoring the resource availability and resource management of MapReduce framework.
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 03 Issue: 03 | Mar-2014, Available @ http://www.ijret.org 265 Fig -1: System Architecture Design The Job Tracker consists of two phases. 1) Logs and Job History. Job History maintaining the past job description and provides different parameters like number of nodes in a cluster, number of the jobs, job Id, execution time and memory usage of each job etc. The Task Tracker is where the data is store resides and maintains data node information. This paper proposes the prediction tracker component in MapReduce framework. The prediction tracker consists of two components 1) Analysis 2) GA (Genetic Algorithm).The analysis component get the job history related information from the Job Tracker. The GA is used to predict the future workload in optimized manner. 3. LOAD PREDICTION PROCESS The load prediction mainly focuses the prediction tracker. The Analysis components of prediction tracker acquire the require job history information from the Job Tracker. The genetic algorithm is used to get the optimized solution for workload prediction based on the historical data. The description of paper is listed as follows.  Collect the workload of each job from the Hadoop cluster.  Analysis the workload of each job  Based the results, optimization performance is evaluated. The trace file [3] of the job tracker data are JobID (a unique job identifier), job status (successful, failed or killed), job submission time, job launch time, job finish time, the number of map tasks, the number of reduce tasks, total duration of map tasks, total duration of reduce tasks, read/write bytes on HDFS (Hadoop Distributed File System), read/write bytes on local disks. 4. 
OPTIMIZATION PROCESS Hadoop framework gives the trace file of the job tracker to get the job submission time. Prediction process [3][4] is based on the job submission time, duration of job completion time. Forecast (Prediction) is an essential aspect of managing any organization is planning for the future. It is used to determine future inventory, costs, capacities and interest rate changes. There the two basic approaches of forecasting: qualitative approach, quantitative approach [6]. Qualitative approach is subjective, they are appropriate when past data are not available. Quantitative approach is used to forecast future data when past data are available. This paper focuses on quantitative approach, based on an analysis of historical data which consider time series. A time series is set of observations measured at successive points in time. Time series is used to predict future values based on previously observed value [7]. Genetic algorithm is used to find the predicted value using historical data[8]. First step of the algorithm, select the population depends upon the original data element. Each element converted to the binary number to make a binary string or chromosome. The crossover point is selected and performs the crossover process and mutation process. Binary strings are converted to the real value. All actual value is converted to the binary strings or chromosomes. Operators of the genetic algorithm are three type’s selection, crossover and mutation. The genetic algorithm [9] is used to 1. Initialize the population with random individuals. 2. Evaluate the fitness value of the individuals. 3. Select good solutions by using s-wise tournament selection without replacement 4. Create new individuals by recombining the selected population using single point crossover 5. Evaluate the fitness value of all offspring. 6. Repeat steps 3–5 until some convergence criteria are met. Calculate the error rate using mean absolute percentage error. 
The mean absolute percentage error (MAPE), also known as the mean absolute percentage deviation (MAPD), is a statistical measure of the accuracy of a method for constructing fitted time series values. The formula of MAPE is:
$$ M = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| $$

$A_t$ – Actual value
$F_t$ – Forecast value
$n$ – Number of values
$M$ – Mean absolute percentage error

5. IMPLEMENTATION AND ANALYSIS

In this paper, the Hadoop framework is installed on the Ubuntu operating system. Job history details are extracted from the job tracker as a time series. Table -1 shows the error rate of the workload prediction, with M expressed as a percentage.

Table -1: Example of Error value calculation

Sl. No.   Predicted Value   Actual Value   Error Rate
1         12                15             0.2
2         15                14             0.07142
3         4                 5              0.2
MAPE error rate (%)                        15.714

6. CONCLUSION AND FUTURE WORK

In this paper, we have presented an analysis of a Hadoop trace derived from a single-node production Hadoop cluster. The trace covers the job execution files. In the future, we plan to build on the implications of this work and apply them to a multi-node cluster in real time.

REFERENCES

[1]. J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI, 2004, pp. 137–150.
[2]. T. White, Hadoop - The Definitive Guide. O’Reilly, 2009.
[3]. Zujie Ren, Xianghua Xu, Jian Wan et al., “Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao,” Proceedings of the 2012 IEEE International Symposium on Workload Characterization, 2012.
[4]. Sheng Di, Cho-Li Wang, “Error-Tolerant Resource Allocation and Payment Minimization for Cloud System,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 6, 2013, pp. 1097–1106.
[5]. Zhen Xiao, Weijia Song, and Qi Chen, “Dynamic Resource Allocation Using Virtual Machines for Cloud Computing Environment,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 6, June 2013, pp. 1107–1117.
[6]. http://www.wikipwedia.com/wiki/Time_series.
[7]. Sam Mahfoud and Ganesh Mani, “Financial Forecasting Using Genetic Algorithms,” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.9698&rep=rep1&type=pdf.
[8]. Satyendra, Arghya Ghosh, Subhojit Roy, J. Pal Choudhury, S. R. Bhadra Chaudhuri, “A Novel Approach of Genetic Algorithm in Prediction of Time Series Data,” in Proc. of Special Issues of International Journal of Computer Application (ACCTHPCA), June 2012.
[9]. Abhishek Verma, Xavier Llora, David E. Goldberg and Roy H. Campbell, “Scaling Genetic Algorithms using MapReduce,” Proceedings of Journal of Cluster Computing, special issue, 2011.

BIOGRAPHIES

V. Sivaranjani is an M.E. student in Pervasive Computing Technology at Bharathidasan Institute of Technology. Her current research focuses on cloud computing and parallel computing.

Mrs. R. Jayamala is an Assistant Professor in the Department of Computer Science and Engineering at Bharathidasan Institute of Technology. Her research focuses on cloud computing and networks.