Institute of Computing Technology
CloudRank-D: A Benchmark Suite
for Private Cloud Systems
Jing Quan
Institute of Computing Technology, Chinese
Academy of Sciences and University of Science
and Technology of China
HVC tutorial
in conjunction with The 19th IEEE International Symposium
on High Performance Computer Architecture (HPCA 2013)
HVC Tutorial, HPCA 2013
Contents
• Background & Motivation
• Introduction of CloudRank-D
• Use cases
What Is a Private Cloud?
• Private Cloud
– The cloud infrastructure is provisioned for exclusive use by
a single organization comprising multiple consumers (e.g.,
business units). It may be owned, managed, and operated
by the organization, a third party, or some combination of
them, and it may exist on or off premises.
Source: "The NIST Definition of Cloud Computing," National Institute of Standards and Technology. Retrieved 24 July 2011.
http://blogs.technet.com/b/yungchou/archive/2011/03/21/what-is-private-cloud.aspx
Typical Data Processing Application
[Figure: typical data processing application on Hadoop. Labels: applications (recommender system, social network, search engine, ...); client front end; job production; MapReduce jobs; Hadoop master node; job deployment; scheduler; job flow; framework; worker nodes; HDFS.]
User Concerns
[Figure: a group of Xeon nodes next to a group of Atom nodes]
• How to quantitatively measure systems?
• Which one is better (ranking systems)?
• How to guide optimization?
What is CloudRank-D?
[Figure: CloudRank-D, linked to private cloud systems, ranking systems, and data processing]
General Description
CloudRank-D is a benchmark suite for evaluating private cloud systems
that are shared for running data processing applications.
Why CloudRank-D?
Benchmark      Target of evaluation
MineBench      Data mining algorithms
GridMix        Hadoop framework
HiBench        Hadoop framework
WL suite       Hadoop framework
CloudRank-D    The whole system
Our Focus: Evaluating the Whole System
[Figure: two stacks of applications (data analysis), framework (Hadoop), and system platform. GridMix etc. target Hadoop performance; CloudRank-D, run with the default framework, targets the performance of software and hardware.]
Comparison of Different Benchmark Suites
Representative applications   MineBench  GridMix  HiBench  WL suite  CloudSuite  CloudRank-D
Basic operations              n          y        y        y         n           y
Classification                y          n        y        n         y           y
Clustering                    y          n        y        n         n           y
Recommendation                n          n        n        n         n           y
Sequence learning             y          n        n        n         n           y
Association rule mining       y          n        n        n         n           y
Data warehouse operations     n          n        n        y         n           y
Comparison of Different Benchmark Suites (Cont'd)
Workloads description                  MineBench  GridMix  HiBench  WL suite  CloudSuite  CloudRank-D
Submission pattern                     n          n        n        y         n           y
Scheduling strategies                  n          n        n        n         n           y
System software configuration          n          n        n        n         n           y
Data models                            n          n        n        n         n           y
Data semantics                         n          n        n        n         n           y
Scalable data size                     y          y        n        y         n           y
Category of data-centric computation   n          n        n        y         n           y
Contents
• Background & Motivation
• Introduction of CloudRank-D
– Methodology
• Use cases
CloudRank-D Methodology
[Figure: workloads with usage patterns run on the system platform and produce performance reports; feedback from the reports is used to get the peak system performance.]
Ⅰ. Measure systems
Ⅱ. Find a suitable system
Ⅲ. Optimize systems
Configurable Workloads with Tunable Usage Patterns
Components: scalable applications and input datasets; tunable submission patterns; configurable runtime system
• Representative application domains
• User specific
• Scalable data size
• Modeling production system logs
• Experiences from industry and academia
CloudRank-D Methodology: Workloads with Usage Patterns
[Figure: usage patterns comprise scalable applications and input data sets, tunable submission patterns, and a configurable framework.]
Scalable Applications and Input Data Sets
• Submitted jobs composed of appropriate applications
• Expanded data sets
Applications and Input Data Sets
No.  Category         Application                         Data size                  Data semantics
1    basic operation  sort                                scalable (scale to 10 PB)  automatically generated
2                     word count
3                     grep
4    classification   naive Bayes
5                     support vector machine                                         Scientist Search
6    clustering       k-means                             scalable                   Sogou corpus
7    recommendation   item-based collaborative filtering  scalable                   ratings on movies
Applications and Input Data Sets (Cont'd)
No.  Category                 Application              Data size  Data semantics
8    association rule mining  frequent pattern growth  fixed      retail market basket data, click-stream data, traffic accident data, collection of web HTML documents
9    sequence learning        hidden Markov model      scalable   Scientist Search
10   warehouse operation      grep select                         automatically generated table
11                            ranking select
12                            aggregation
13                            uservisits-ranking join
You can add any applications you want!
Application Combinations Demonstration
Usage domains                                                  Applications                         Share
Web crawling, data mining, machine learning, image processing  Naive Bayes & SVM, HMM & IBCF & FPG  35%
Text indexing, log processing                                  Basic operations                     31%
Reporting, data storage                                        Hive                                 34%
Source: wiki.apache.org/hadoop/PoweredBy
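A workload generator could draw job categories in these proportions. The following is a minimal sketch, assuming the three shares above are used directly as sampling weights; the category keys and the function name `sample_job_categories` are illustrative and are not part of CloudRank-D.

```python
import random

# Shares of the three application groups, taken from the slide above.
# The keys are illustrative labels, not CloudRank-D identifiers.
APPLICATION_MIX = {
    "data_mining_and_machine_learning": 35,  # Naive Bayes, SVM, HMM, IBCF, FPG
    "basic_operations": 31,
    "hive": 34,
}


def sample_job_categories(n_jobs, seed=None):
    """Draw n_jobs categories, each with probability proportional to its share."""
    rng = random.Random(seed)
    categories = list(APPLICATION_MIX)
    weights = list(APPLICATION_MIX.values())
    return rng.choices(categories, weights=weights, k=n_jobs)


if __name__ == "__main__":
    jobs = sample_job_categories(1000, seed=0)
    for category in APPLICATION_MIX:
        print(category, jobs.count(category))
```

With enough jobs, the observed counts approach the 35/31/34 split; a small job list can deviate noticeably from it.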
Data Set Sizes Demonstration
Map number   Percentage   Size
<10          40.57%       128 MB~1.25 GB
10~500       39.33%       1.25 GB~62.5 GB
500~2000     12.03%       62.5 GB~250 GB
>2000        8.07%        250 GB~
Source: Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao
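The size boundaries in this table are consistent with one map task per 128 MB of input: 10 × 128 MB = 1.25 GB, 500 × 128 MB = 62.5 GB, and 2000 × 128 MB = 250 GB. The sketch below makes that arithmetic explicit. The 128 MB figure is inferred from the table rather than stated on the slide, and the function name is illustrative.

```python
SPLIT_SIZE_MB = 128  # assumption: inferred from the table, not stated on the slide


def input_size_gb(map_number, split_size_mb=SPLIT_SIZE_MB):
    """Input size in GB for a job with the given number of map tasks."""
    return map_number * split_size_mb / 1024


if __name__ == "__main__":
    for maps in (10, 500, 2000):
        print(f"{maps} map tasks -> {input_size_gb(maps)} GB")
```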
Workloads with Usage Patterns
[Figure: usage patterns comprise scalable applications and input data size, tunable submission patterns, and a configurable framework.]
Submission Patterns
• Submission intervals
• Submission orders
Submission Intervals
According to the Facebook report, the distribution of inter-arrival
times was roughly exponential with a mean of 14 seconds.
Source: Delay scheduling: A simple technique for achieving locality and fairness in
cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems.
[Figure: probability density function]
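An exponential distribution with mean 14 seconds has rate λ = 1/14 per second and density f(t) = λ·exp(−λt) for t ≥ 0. A minimal sketch of generating submission times from it follows; the function name `submission_times` and the seed handling are illustrative and not taken from CloudRank-D.

```python
import random

MEAN_INTERVAL_S = 14.0  # mean inter-arrival time quoted on the slide


def submission_times(n_jobs, mean_interval_s=MEAN_INTERVAL_S, seed=None):
    """Return cumulative submission times in seconds for n_jobs jobs."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(n_jobs):
        # expovariate takes the rate (1 / mean), not the mean itself.
        now += rng.expovariate(1.0 / mean_interval_s)
        times.append(now)
    return times


if __name__ == "__main__":
    for t in submission_times(5, seed=0):
        print(f"submit at {t:.1f} s")
```

A submission script could sleep until each of these times before launching the next job.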
Submission Orders
• For workloads with different resource
sizes and different categories
– Submitting jobs randomly
– Submitting jobs in batch mode
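Both orders can be sketched in a few lines. The slide does not define batch submission, so this sketch assumes it means submitting all jobs of one category back to back; the job tuples and function names are hypothetical.

```python
import random


def random_order(jobs, seed=None):
    """Return the jobs in a uniformly random submission order."""
    rng = random.Random(seed)
    shuffled = list(jobs)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled


def batch_order(jobs):
    """Return the jobs grouped by category (assumed meaning of batch submission)."""
    batches = {}
    for name, category in jobs:
        batches.setdefault(category, []).append((name, category))
    # Categories keep the order in which they were first seen.
    return [job for batch in batches.values() for job in batch]


if __name__ == "__main__":
    jobs = [("sort", "basic"), ("k-means", "clustering"),
            ("grep", "basic"), ("naive bayes", "classification")]
    print(random_order(jobs, seed=0))
    print(batch_order(jobs))
```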
Hadoop Configurations
Dimension            Explanation
Map/reduce number    Affects system utilization
Scheduling policy    Hadoop chooses which job to run according to this policy
Main parameters
• mapred.tasktracker.map.tasks.maximum
• mapred.tasktracker.reduce.tasks.maximum
• mapred.child.java.opts
• dfs.block.size
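Hadoop reads these parameters from XML site files made of name/value property entries. The sketch below renders such a file from a dictionary. It assumes the Hadoop 0.20.x layout, in which the mapred.* settings go in mapred-site.xml and dfs.block.size (in bytes) goes in hdfs-site.xml. The slot counts match the values used in Use Case 1; the heap option is an illustrative value, and the helper name `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example values only; tune them for the cluster under test.
MAPRED_SITE = {
    "mapred.tasktracker.map.tasks.maximum": 4,
    "mapred.tasktracker.reduce.tasks.maximum": 2,
    "mapred.child.java.opts": "-Xmx512m",  # illustrative heap size
}
HDFS_SITE = {
    "dfs.block.size": 64 * 1024 * 1024,  # 64 MB expressed in bytes
}

if __name__ == "__main__":
    print(render_site_xml(MAPRED_SITE))
    print(render_site_xml(HDFS_SITE))
```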
Hadoop Settings
Parameter                                              Value
mapred.tasktracker.reduce.tasks.maximum                Usually equal to the number of cores on the current node
dfs.block.size                                         Default is 64 MB; change it so that most workloads do not produce too many map tasks
Number of map tasks (adjusted through the block size)  10~100 per node; preferably each map task runs for more than 1 minute
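The relation between block size and the number of map tasks can be checked with simple arithmetic. The sketch below assumes one map task per block, which is a simplification (the real split count depends on the input format and file boundaries); the function names are illustrative.

```python
import math


def map_tasks(input_size_mb, block_size_mb=64):
    """Approximate number of map tasks, assuming one map task per block."""
    return math.ceil(input_size_mb / block_size_mb)


def maps_per_node(input_size_mb, nodes, block_size_mb=64):
    """Average number of map tasks each node has to run."""
    return map_tasks(input_size_mb, block_size_mb) / nodes


def within_guideline(input_size_mb, nodes, block_size_mb=64):
    """True if the job falls in the 10~100 map tasks per node range."""
    return 10 <= maps_per_node(input_size_mb, nodes, block_size_mb) <= 100


if __name__ == "__main__":
    one_tb_mb = 1024 * 1024
    for block in (64, 128):
        print(block, maps_per_node(one_tb_mb, nodes=128, block_size_mb=block),
              within_guideline(one_tb_mb, nodes=128, block_size_mb=block))
```

For a 1 TB input on 128 nodes, a 64 MB block gives 128 map tasks per node, which is above the range, while a 128 MB block gives 64 per node, which is inside it.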
Scheduling Policy
• Common scheduling algorithms
– First in, first out (FIFO)
– Fair-share scheduler
– Capacity scheduler
• Fair-share scheduling can do a good job
Source: Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao
CloudRank-D Methodology: Our Metrics
• Focus
– From the user's perspective
– Easy to compare and understand
• Metrics
– Data processed per second (DPS) or per joule (DPJ)
• How to get them?
DPS = Total data input size / Total run time
DPJ = Total data input size / Total energy consumption
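Both metrics are plain ratios, so they are easy to compute once the totals for a run are known. A minimal sketch follows; the function names and the choice of KB as the size unit are assumptions, and the numbers in the usage example are hypothetical, not measurements.

```python
def data_processed_per_second(total_input_kb, total_run_time_s):
    """DPS = total data input size / total run time (KB/s)."""
    if total_run_time_s <= 0:
        raise ValueError("total run time must be positive")
    return total_input_kb / total_run_time_s


def data_processed_per_joule(total_input_kb, total_energy_j):
    """DPJ = total data input size / total energy consumption (KB/J)."""
    if total_energy_j <= 0:
        raise ValueError("total energy consumption must be positive")
    return total_input_kb / total_energy_j


if __name__ == "__main__":
    # Hypothetical totals for illustration only, not measured results.
    total_input_kb = 8_000_000
    print(data_processed_per_second(total_input_kb, total_run_time_s=4_000))
    print(data_processed_per_joule(total_input_kb, total_energy_j=1_600_000))
```

The totals cover the whole workload run, so the input sizes of all submitted jobs are summed before dividing.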
How to Use CloudRank-D?
Use Case 1: Comparing Two Hardware
Platforms
[Figure: Cluster 1 built from Xeon nodes and Cluster 2 built from Atom nodes]
Each of the two clusters comprises 128 nodes.
Procedures
Step 1: Build the foundation platform (prepare the hardware platform)
Step 2: Customize workloads
Step 3: Run workloads
Step 4: Get results and optimize systems
Base Information
• Evaluating two private cloud systems
• Using all the workloads we provide
• Deploying a uniform software platform
• Adopting the same configuration
Software Configuration
Software stack                Hadoop version 0.20.2, Hive version 0.6.0, Mahout version 0.6
Map/reduce slots              4 map slots and 2 reduce slots
Hadoop system configuration   default
Hadoop scheduling algorithm   fair scheduler
Run Your Workloads
Job submission patterns
• Submission interval: exponentially distributed with a specified mean, here 14 seconds
• Submission order: random
An Example of Results
Comparison between Xeon and Atom on the two metrics
• Xeon
– less time, more energy
• Atom
– more time, less energy
[Figure: bar charts of total data processed per second (KB/s) and total data processed per joule (KB/J) for Xeon and Atom]
Optimization (Cont'd)
• Tuning the interval
We can see that the best performance occurs when the
interval value is 70.
Use Case 2: Scheduling Evaluation
"I have designed a new Hadoop scheduling algorithm, but I don't have workloads to test it with."
How can the scheduling algorithm be evaluated so that people trust the evaluation results?
Using CloudRank-D
Step 1: Build the foundation platform with different scheduling policies
Step 2: Customize workloads with production scenarios
Step 3: Run the workloads
Step 4: Get the metrics under the different scheduling policies
Our Result
[Figure: bar charts of total data processed per joule (KB/J) and total data processed per second (KB/s) for the fair scheduler and the FIFO scheduler]
We can see that the fair scheduler works better than the FIFO scheduler.
• Contact us
– Website: http://prof.ict.ac.cn/CloudRank/
– Email: quanjing@ict.ac.cn
Thanks
