GROUP – B3
CLOUD WORKLOAD ANALYSIS AND SIMULATION
Investigating the behavior of a cloud, focusing on its workload
patterns
Jan – May 2014
TABLE OF CONTENTS
Highlights
  The approach
  Dataset preprocessing and analysis
  Clustering analysis
  Time series analysis
  Workload prediction
  Looking ahead
1. Objective
2. The Approach
3. Dataset preprocessing and analysis
  3.1 Preprocessing
  3.2 Analysis
4. Calculation of resource usage statistics
5. Classification of users and identifying target users
6. Time series analysis
7. Workload Prediction
8. Tools and Algorithms Used
9. Issues faced and possible solutions
10. Looking ahead
Group Members
References
Highlights:
The approach
 Studied the Google trace data schema
 Studied related technical papers and summarized useful observations
 Devised an approach to analyze cloud workload using observations from the technical papers and the Google trace data’s schema
Dataset preprocessing and analysis
 Preprocessed the data to prepare it for analysis
 Visualized important statistics for the feasibility decision and computed relevant attributes
 The main attributes were analyzed and visualized, and observations were made
Clustering analysis
 Applied various clustering algorithms, compared the results and chose the best clustering for user and task analysis
 Users were classified primarily based on estimation ratios, and tasks based on CPU and memory usage
Time series analysis
 Target users and their tasks were identified from the clustering results
 The dynamic time warping algorithm was run on tasks to identify patterns in their resource usage
Workload prediction
 Users with specific resource usage patterns are identified
 Resources for users are allocated based on the identified usage pattern with a threshold value
Looking ahead
 Improvements to our approach
1. Objective:
 Analyze and report the cloud workload data based on Google cloud trace
 Use graphical tools to visualize the data (you may need to write programs to process the data
in order to feed them into visual tools)
 Study and summarize the papers regarding other people’s experience in Google cloud trace
analysis
 Determine the workload characteristics from your analysis of the Google cloud trace
 Try to reallocate unused resources of a user to other users who require them
2. The Approach:
Based on our study of the Google cloud trace data and the observations gathered from the technical
papers, we devised the following approach for the problem:
 Analyze and visualize the data to identify the important attributes that determine user workload
patterns and ignore the rest of the attributes
 Calculate resource usage statistics of users to identify the feasibility of resource re-allocation
 Classify users based on their resource usage quality[1] (amount of unused resource/resource
requested) using clustering analysis
 Identify target users based on the clustering analysis for resource re-allocation
 Study the workload pattern of tasks of the target users and classify tasks based on their lengths
 Perform time series analysis on long tasks
 Identify a pattern for a user (if one exists) and associate that pattern with that user, or form clusters
of tasks of all users that have similar workloads based on time series analysis
 Predict the usage pattern of a user if the current task’s pattern matches the pattern associated
with that user, or matches one of the clusters formed in the previous step
3. Dataset preprocessing and analysis
3.1 Preprocessing
Inconsistent and redundant data was cleaned before analysis. The task-usage table has many
records for the same Job ID-task index pair because the same task might be re-submitted or re-scheduled
due to task failure. To avoid reading many values for the same Job ID-task index pair, the data was
pre-processed.
Pre-processing: all records were grouped by Job ID-task index, and only the last occurring record of
each group of repeating task records was kept and stored as a single record.
Timestamps were converted into days and hours for per-day analysis.
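A minimal sketch of this grouping step, written in Java like the rest of our data extraction code, is shown below. The file name and column positions are illustrative assumptions; the real indices should be taken from the trace schema.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class TaskUsageDedup {
    // Hypothetical column positions; the real indices come from the trace schema.
    private static final int JOB_ID = 2;
    private static final int TASK_INDEX = 3;

    public static void main(String[] args) throws IOException {
        // Later records overwrite earlier ones, so only the last record
        // per JobID-task index pair survives.
        Map<String, String> lastRecord = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("task_usage.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");
                String key = fields[JOB_ID] + "-" + fields[TASK_INDEX];
                lastRecord.put(key, line);
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("task_usage_dedup.csv"))) {
            for (String record : lastRecord.values()) {
                out.println(record);
            }
        }
    }
}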
3.2 Analysis:
The data in the cloud trace tables were visualized. Attributes that were constant or varied within a
small range for most records were not considered for analysis. The attributes that play a major part in
shaping the user profile and task profile were treated as important attributes. The main attributes from
each table were analyzed and visualized, and certain observations were made.
Figure 1: CPU requested per user (blue) vs. CPU used per user (red)
Observation: Most users overestimate the resources they need and use less than 5% of the requested resources.
A few users underestimate the resources and use more than three times the amount they requested.
Figure 2: Memory requested per user (blue) vs. memory used per user (red)
Users with a negative (orange) memory estimation ratio have used more resources than requested.
Users with a memory estimation ratio between 0.9 and 1 have left more than 90% of the requested resource unused.
4. Calculation of resource usage statistics:
As we are concerned with re-allocating unused resources, we should look at the users who over-estimate
their resources, as observed in the previous section. To identify those users, a new attribute is
calculated.
Estimation ratio [1] = (requested resource – used resource)/requested resource.
The estimation ratio normally varies from 0 to 1 (and becomes negative when a user uses more than requested).
0 – User has used up all of the requested resource
1 – User has not used any of the requested resource
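As a worked example of the formula: a user who requests 1.0 normalized CPU and uses 0.05 has an estimation ratio of (1.0 - 0.05) / 1.0 = 0.95, i.e. 95% of the requested CPU goes unused. A small illustrative helper (not taken from our code base):

public class EstimationRatio {
    // Estimation ratio as defined above; values near 1 mean mostly unused,
    // negative values mean the user consumed more than requested.
    static double estimationRatio(double requested, double used) {
        return (requested - used) / requested;
    }

    public static void main(String[] args) {
        // Worked example: 1.0 CPU requested, 0.05 used -> ratio 0.95 (95% unused).
        System.out.println(estimationRatio(1.0, 0.05));
    }
}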
Also from the visualizations and observations made, the following are identified as important
attributes:
User: Submission rate, CPU estimation ratio, Memory estimation ratio
Task: Task length, CPU usage, Memory usage
Figure 3: CPU estimation ratio per user
Users with a negative (red) CPU estimation ratio have used more resources than requested.
Users with a CPU estimation ratio between 0.9 and 1 have left more than 90% of the requested resource unused.
Figure 4: Memory estimation ratio per user
Users with a negative (orange) memory estimation ratio have used more resources than requested.
Users with a memory estimation ratio between 0.9 and 1 have left more than 90% of the requested resource unused.
5. Classification of users and identifying target users:
The dimensions for classification are
User: Submission rate, CPU estimation ratio, Memory estimation ratio
We use the following clustering algorithms to identify the optimal number of clusters for users and tasks:
 K-means
 Expectation – Maximization (EM)
 Cascade Simple K-means
 X-means[2]
We categorize the users and tasks using these clustering algorithms, with the above dimensions for
users and the task attributes identified earlier (task length, CPU and memory usage) for tasks.
We compare and choose the best clustering for users and tasks.
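For illustration, the K-means run can be reproduced with the WEKA Java API roughly as sketched below; the input file name users.arff (one record per user with submission rate and the two estimation ratios) is a hypothetical placeholder.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UserClustering {
    public static void main(String[] args) throws Exception {
        // Per-user records: submission rate, CPU estimation ratio, memory estimation ratio.
        Instances users = new DataSource("users.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);                 // the 4-cluster solution chosen in this report
        kmeans.setPreserveInstancesOrder(true);   // keep user order for the assignments below
        kmeans.buildClusterer(users);
        System.out.println(kmeans);               // centroids and cluster sizes

        // Cluster assignment per user, used later to pick the target clusters.
        for (int i = 0; i < users.numInstances(); i++) {
            int cluster = kmeans.clusterInstance(users.instance(i));
            System.out.println(i + "," + cluster);
        }
    }
}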
[Figures: user clustering results – K-means (4 clusters), EM, X-means, Cascade Simple K-means]
K means clustering with 4 clusters was selected as it offers good clustering of users based on the
CPU and memory estimation ratios.
From the clustering results we observed that 97% of the users have estimation ratios ranging from 0.7 to
1.0; that is, 97% of the users leave at least 70% of the resources they request unused. We targeted User
Cluster 0 and Cluster 3 (more than 90% unused).
We targeted tasks that were long enough to perform efficient resource allocation, and performed
clustering on the task lengths of these users to filter out short tasks.
6. Time series analysis
To identify tasks with similar workloads, we ran the DTW [3] algorithm on the tasks of Cluster 0 and
Cluster 3 users, computing the DTW distance between every target user’s task and a reference sine curve
(see the Looking ahead section).
Tasks with the same DTW value were clustered together; these tasks were identified as having similar
workload curves.
[Figure: two tasks with the same DTW distance showing similar workloads]
For each cluster thus formed, a reference workload curve was randomly selected from one of the task
workloads in that cluster (due to time constraints).
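For reference, a minimal standalone version of the DTW distance we relied on is sketched below. The report used MATLAB's implementation; this Java version is only illustrative.

public class Dtw {
    // Classic O(n*m) dynamic-programming DTW distance between two usage series.
    static double dtw(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) {
            java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // Compare a task's usage series against a reference sine curve;
        // tasks with coinciding distances are grouped, as described above.
        double[] reference = new double[100];
        for (int t = 0; t < reference.length; t++) {
            reference[t] = Math.sin(2 * Math.PI * t / reference.length);
        }
        double[] taskUsage = new double[100];   // in practice, read from the trace
        System.out.println(dtw(taskUsage, reference));
    }
}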
7. Workload Prediction
 When a user from the targeted list of users issues a task, the task’s workload is observed for a
pre-determined amount of time. This time period was determined by trial and error, as the
minimum time at which all reference curves differ from each other.
 During this time period, the task’s workload is compared with the reference curves of all task
clusters formed in the previous step.
 If the current task’s workload curve has zero distance to one of the reference curves, i.e. it is similar
to that reference curve, the current task is expected to behave like the reference curve and its
workload is predicted accordingly.
Resource allocation and de-allocation cannot be done continuously because of:
 Huge overhead
 Delay in allocating resources
Instead, resource allocation happens once in every pre-determined interval of time.
Hence, for stealing the resource, allocation and re-allocation follow a step function: based on the
predicted curve’s slope, a step-up or step-down is performed, and a threshold value is set to
accommodate unexpected spikes in the workload.
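The sketch below illustrates one reading of this step-function allocation, assuming the predicted curve is already sampled once per allocation interval; the threshold and the example numbers are illustrative, not the values used in our experiments.

public class StepAllocator {
    // Step-wise allocation: at each allocation interval the allocation steps up
    // or down to the predicted usage plus a safety threshold, but is never
    // allowed to exceed what the user originally requested.
    static double[] stepAllocation(double[] predictedUsage, double threshold, double requested) {
        double[] allocated = new double[predictedUsage.length];
        for (int t = 0; t < predictedUsage.length; t++) {
            allocated[t] = Math.min(predictedUsage[t] + threshold, requested);
        }
        return allocated;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: predicted CPU per interval, 0.01 threshold,
        // 0.12 CPU originally requested.
        double[] predicted = {0.02, 0.03, 0.05, 0.04, 0.02};
        double[] alloc = stepAllocation(predicted, 0.01, 0.12);
        for (double a : alloc) {
            System.out.println(a);
        }
    }
}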
Successful prediction:
Average unused resource: 94%; average resource stolen: 65% (Req – Allocated)
[Figure: Efficient resource allocation curve – CPU usage vs. time (s), showing Used, Allocated and Requested resource]
Failed prediction:
Reason: The failed-prediction chart shows a case where our algorithm has failed to predict correctly. This is
because of the random selection of the reference curve for task clusters. Though the randomly selected
reference curve has generated a decent resource allocation curve, there are points at which the current
task spikes and exceeds the allocated resource. The solution to this issue is discussed in the Looking
ahead section.
[Figure: Failed prediction – CPU use vs. time (s), showing Allocated and Usage]
8. Tools and Algorithms Used:
 JAVA: For extracting required data out of the datasets, we used Java programming (csv
reader/writer, hashmaps).
 DTW on MATLAB: Implemented DTW using MATLAB’s built-in function.
 WEKA 3.7: To run the clustering algorithms – K-means, EM, Cascade Simple K-means, X-means.
 TABLEAU 8.1: To visualize the datasets and results.
 Naïve Bayes on MATLAB for choosing the right cluster: Couldn’t use this because the data was
continuous and the algorithm needed discrete data.
 Correlation on MATLAB: Since DTW was a better option for comparing two curves, we dropped this
algorithm.
9. Issues faced and possible solutions:
 MATLAB crashing: While executing DTW on each task of each user (nearly 9,000 tasks in total),
MATLAB crashed; this was rectified by running the data in batches. The DTW algorithm in
MATLAB takes two vectors as input, converts them into a matrix and multiplies them, which greatly
increases the time complexity.
 MATLAB numeric types: We had problems getting MATLAB to accept the user ID as a string along
with the other parameters, so we had to run Java programs to map the tasks’ users to the
corresponding DTW values.
 Naïve Bayes algorithm: Learning from the existing data and predicting the given test curve using
Naïve Bayes would have given a better prediction, but since the data was continuous we couldn’t
implement this algorithm.
 We initially considered each user’s tasks separately and ran the DTW algorithm on them to identify
whether a user has a recurring workload pattern. As very few users had such a pattern, we ended up
ignoring a lot of data, so we ran DTW over the tasks of all users instead of per user.
10. Looking ahead
10.1 Improvements and optimizations
 Choosing a good reference curve for running DTW was difficult. Using a straight line as the
reference curve gave us mediocre results, as curves with peaks at different instants of time were
grouped as similar. We compared results using the line x = y and a sine curve as reference curves,
and the sine curve gave good results.
 Choosing a representative curve for a task cluster was done on a random basis due to time
constraints. This can be improved by using a curve-fitting algorithm to derive an overall reference
curve for each cluster.
 During prediction, cases where the workload of a task changes to look like some other cluster are
not handled yet. This can be handled by continuously comparing the current task’s workload with
every cluster’s reference curve and, when the current task looks like it is shifting to some other
cluster, dynamically remapping the step curve to the new cluster’s reference curve. This constant
monitoring and dynamic remapping improves the prediction accuracy.
GROUP MEMBERS
PRABHAKAR GANESAMURHTY – prabhakarg@utdallas.edu
PRIYANKA MEHTA – priyanka.nmehta@gmail.com
ARUNRAJA SRINIVASAN – arunraja.srinivasan@yahoo.com
ABINAYA SHANMUGARAJ – abias1702@gmail.com
References:
1. Solis Moreno, I., Garraghan, P., Townend, P. M. and Xu, J. (2013) An Approach for Characterizing Workloads in Google
Cloud to Derive Realistic Resource Utilization Models. In: Service Oriented System Engineering (SOSE), 2013 IEEE 7th
International Symposium on. IEEE, pp. 49-60. ISBN 978-1-4673-5659-6.
2. Pelleg, D. and Moore, A. (2000) X-means: Extending K-means with Efficient Estimation of the Number of Clusters.
In: Proceedings of the 17th International Conference on Machine Learning (ICML).
3. Wang, X., et al. (2010) Experimental comparison of representation methods and distance measures for time
series data. Data Mining and Knowledge Discovery, pp. 1-35.