Jianfeng Zhan: BigDataBench, Benchmarking Big Data Systems

BigDataBench: Benchmarking Big Data Systems

http://prof.ict.ac.cn/jfzhan

INSTITUTE OF COMPUTING TECHNOLOGY

Jianfeng Zhan
Computer Systems Research Center, ICT, CAS
CCF Big Data Technology Conference
2013-12-06

Why Big Data Benchmarking?

Measuring big data architecture and systems quantitatively

What is BigDataBench?


An open source project on big data benchmarking:
• http://prof.ict.ac.cn/BigDataBench/
• 6 real-world data sets and 19 workloads
  – To be extended in the near future
• 4V characteristics
  – Volume, Variety, Velocity, and Veracity

Comparison of Big Data Benchmarking Efforts

[Table: comparison of big data benchmarking efforts]

Possible Users

BigDataBench supports performance optimization and co-design for:
• Systems: OS for big data, file systems for big data, distributed systems, scheduling, programming systems, ...
• Architecture: processor, memory, networks, ...
• Data management: ...

Research Publications

• Characterizing data analysis workloads in data centers. Zhen Jia, Lei Wang, Jianfeng Zhan, Lixing Zhang, and Chunjie Luo. IISWC 2013. Best paper award.
• BigDataBench: a Big Data Benchmark Suite from Internet Services. Lei Wang, Jianfeng Zhan, et al. HPCA 2014, Industry Session.

Outline

1. Benchmarking Methodology and Decision
2. Case Study
3. How to Use
4. Future Work

BigDataBench Methodology

[Diagram labels: 4V of Big Data; BigDataBench]

Methodology (Cont’)

Investigate typical application domains, then derive:

Representative data sets
• Data types: structured, semi-structured, unstructured
• Data sources: text data, graph data, table data, extended …

Diverse workloads
• Application types: offline analytics, realtime analytics, online services
• Basic and important operations and algorithms, extended …
• Representative software stacks, extended …

Big data sets preserving 4V
• Data generation tool preserving data characteristics

BigDataBench = big data sets + big data workloads

Methodology (Cont’)

[Diagram labels: 4V of Big Data; system and architecture characteristics; similarity analysis; BigDataBench]

Top Sites on the Web

More details in http://www.alexa.com/topsites/global;0

Search engines, social networks, and electronic commerce account for 80% of the page views of all Internet services.

Workloads Chosen

• Cover representative application scenarios: Search Engine, E-commerce, Social Network
• Pay attention to different application types: online service, real-time analytics, offline analytics
• Include different data sources: Text data, Graph data, Table data
• Cover representative software stacks

19 Chosen Workloads

• Micro Benchmarks
• Basic Datastore Operations
• Relational Queries
• Application Scenarios
  – Search engines
  – Social networks
  – E-commerce system

Data Generation Tools

Data sources
• Text, graph, and table
• Six real raw data sets

Synthetic data
• Scale: from GB to PB
• Features: preserve characteristics of real-world data

Naïve Text Generator

• Words: select each word randomly, following a multinomial distribution
  – Example vocabulary: machine, evaluate, big, system, data, mining, architecture, CPU, memory, benchmarking, learning
• Documents: assembled from the selected words
• Only models text at the word level
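The word-level generator described on this slide can be sketched as follows. This is a minimal illustration, not the BigDataBench tool itself: the vocabulary is taken from the slide, while the weights, the function name `generate_document`, and the seed are assumptions made for the example.

```python
import random

# Vocabulary from the slide. The weights are hypothetical; a real generator
# would estimate word probabilities from a real-world corpus.
VOCAB = ["machine", "evaluate", "big", "system", "data", "mining",
         "architecture", "CPU", "memory", "benchmarking", "learning"]
WEIGHTS = [3, 1, 4, 2, 5, 2, 1, 2, 2, 1, 3]  # unnormalized multinomial weights


def generate_document(num_words, rng):
    """Draw num_words words independently from one multinomial over VOCAB."""
    return rng.choices(VOCAB, weights=WEIGHTS, k=num_words)


rng = random.Random(42)  # fixed seed so repeated runs give the same text
doc = generate_document(20, rng)
print(" ".join(doc))
```

Because every word is drawn independently of its neighbours, the output matches word frequencies only, which is the limitation the slide points out.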

Improved Text Generator

• Topics (topic1, topic2, topic3): select a topic randomly, following a multinomial distribution
• Words: select a word randomly, following a multinomial distribution under the selected topic (topic2 in the slide's example)
  – Example vocabulary: machine, evaluate, big, CPU, data, mining, architecture, benchmarking, memory, system, learning
• Document: assembled from the selected words
• Models text at both the topic and word levels
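The two-level sampling described above (pick a topic, then pick a word under that topic) can be sketched as below. The topic weights, the per-topic word distributions, and the names `sample` and `generate_document` are hypothetical values chosen for illustration; the slides do not say how the actual tool estimates or stores them.

```python
import random

# Hypothetical topic mixture and per-topic word distributions.
TOPIC_WEIGHTS = {"topic1": 0.5, "topic2": 0.3, "topic3": 0.2}
WORDS_BY_TOPIC = {
    "topic1": {"machine": 0.4, "learning": 0.4, "data": 0.1, "mining": 0.1},
    "topic2": {"CPU": 0.4, "memory": 0.3, "system": 0.2, "architecture": 0.1},
    "topic3": {"benchmarking": 0.5, "evaluate": 0.3, "big": 0.2},
}


def sample(dist, rng):
    """Draw one key from a {key: probability} multinomial distribution."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_document(num_words, rng):
    """For each position, pick a topic, then a word under that topic."""
    words = []
    for _ in range(num_words):
        topic = sample(TOPIC_WEIGHTS, rng)
        words.append(sample(WORDS_BY_TOPIC[topic], rng))
    return words


rng = random.Random(7)  # fixed seed so repeated runs give the same text
doc = generate_document(30, rng)
print(" ".join(doc))
```

Compared with the word-level generator, words that share a topic now tend to appear together in the generated text, which is what modeling at the topic level adds.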


BigDataBench Case Study

• Performance evaluation and diagnosis: SJTU and XJTU
• Workload characterization: ICT, CAS and SIAT, CAS
• Evaluating big data hardware systems: USTC and Florida International University
• Networks for big data: OSU
• Energy efficiency of big data systems: CNCERT

http://prof.ict.ac.cn/BigDataBench/#users

Workloads Analyzed

http://prof.ict.ac.cn/BigDataBench

Floating Point Operation Intensity

[Charts: Data Analytics; Services]

• Operation intensity: the total number of (floating point or integer) instructions divided by the total number of memory access bytes in a run of a workload.
• Big data workloads have very low floating point operation intensities (0.009), two orders of magnitude lower than the theoretical value of a state-of-practice CPU (1.8).
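The intensity definition above is a simple ratio, and the "two orders of magnitude" claim can be checked from the two figures on the slide. In this sketch the counter readings are hypothetical, chosen only so that the result equals the 0.009 reported on the slide; the function name is also an assumption.

```python
import math


def operation_intensity(instruction_count, memory_access_bytes):
    """Instructions of one kind (floating point or integer) per byte accessed."""
    if memory_access_bytes <= 0:
        raise ValueError("memory_access_bytes must be positive")
    return instruction_count / memory_access_bytes


# Hypothetical counter readings for one run of a workload.
fp_instructions = 9_000_000
memory_bytes = 1_000_000_000
workload_intensity = operation_intensity(fp_instructions, memory_bytes)

cpu_theoretical = 1.8  # theoretical value quoted on the slide
gap_in_orders = math.log10(cpu_theoretical / workload_intensity)
print(f"intensity = {workload_intensity:.3f}, gap = {gap_in_orders:.1f} orders of magnitude")
```

With the slide's figures the ratio 1.8 / 0.009 is 200, a gap of about 2.3 orders of magnitude.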

Instruction Breakdown

[Charts: Data Analytics; Services]

• Fewer floating point operations
• More integer operations

Ratio of Integer to Floating Point Operations

[Charts: Data Analytics; Services]

• The average ratio for big data workloads is 100
• For PARSEC, HPCC, and SPECFP the ratios are 1.4, 1.0, and 0.67, respectively

Integer Operation Intensity

[Charts: Data Analytics; Services]

• The average integer operation intensity of big data workloads is 0.49
• That of PARSEC, HPCC, and SPECFP is 1.5, 0.38, and 0.23, respectively


Cache Behaviors

[Charts: Data Analytics; Services]

• Big data workloads have higher L1I misses than HPC workloads
• Data analysis workloads have better L2 cache behaviors than service workloads, except BFS
• Big data workloads have good L3 behaviors

TLB Behaviors

[Charts: 14 data analysis workloads; 5 service workloads]

• ITLB misses of big data workloads are higher than those of HPC workloads
• DTLB misses of big data workloads are higher than those of HPC workloads


Evaluating Big Data Hardware Systems

Experimental Platforms
• Xeon (common processor)
• Atom (low power processor)
• Tilera (many core processor)

Brief Comparison: Basic Information

CPU Type        Intel Xeon E5310     Intel Atom D510      Tilera TilePro36
CPU Core        4 cores @ 1.6GHz     2 cores @ 1.66GHz    36 cores @ 500MHz
L1 I/D Cache    32KB                 24KB                 16KB/8KB
L2 Cache        4096KB               512KB                64KB

Experimental Platforms: Hadoop Cluster Information

Comparison (the same logical core number)
• Xeon vs. Atom: [1 Xeon master + 7 Xeon slaves] vs. [1 Atom master + 7 Atom slaves]
• Xeon vs. Tilera: [1 Xeon master + 7 Xeon slaves] vs. [1 Xeon master + 1 Tilera slave]

Hadoop setting
• Following the guidance on the Hadoop official website

Benchmark Selection (BigDataBench 1.0)

Application    Time Complexity    Characteristics
Sort           O(n*log2n)         Integer comparison
WordCount      O(n)               Integer comparison and calculation
Grep           O(n)               String comparison
Naïve Bayes    O(m*n)             Floating-point computation
SVM            O(n^3)             Floating-point computation

Metrics
• Performance: data processed per second (DPS)
• Energy efficiency: data processed per joule (DPJ)
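Both metrics reduce to simple ratios once a run's input size, elapsed time, and power draw are known. The sketch below assumes DPJ means data processed per joule and that energy is approximated as average power multiplied by elapsed time; the function names and the sample run figures are hypothetical and do not come from the study.

```python
def data_processed_per_second(bytes_processed, elapsed_seconds):
    """DPS: input data volume divided by wall-clock run time."""
    return bytes_processed / elapsed_seconds


def data_processed_per_joule(bytes_processed, average_power_watts, elapsed_seconds):
    """DPJ: input data volume divided by energy, with energy = power * time."""
    energy_joules = average_power_watts * elapsed_seconds
    return bytes_processed / energy_joules


# Hypothetical run: 10 GiB of input, 200 s elapsed, 400 W average cluster power.
input_bytes = 10 * 1024**3
dps = data_processed_per_second(input_bytes, 200.0)
dpj = data_processed_per_joule(input_bytes, 400.0, 200.0)
print(f"DPS = {dps:.1f} bytes/s, DPJ = {dpj:.1f} bytes/J")
```

Under these assumptions DPJ equals DPS divided by average power, so a slower platform can still win on DPJ if its power draw is proportionally lower.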

Reference
Jing Quan (University of Science and Technology of China), Yingjie Shi (Chinese Academy of Sciences), Ming Zhao (Florida International University), and Wei Yang (University of Science and Technology of China). "The Implications from Benchmarking Three Different Data Center Platforms." The First Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2013), in conjunction with the 2013 IEEE International Conference on Big Data (IEEE Big Data 2013).


BigDataBench Class

For architecture
• 19 of 19 workloads

For OS
• 19 of 19 workloads

For runtime environment (Hadoop)
• 9 of 19 workloads
  – Sort, Grep, WordCount, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering, and Naive Bayes

For data management
• 6 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query

BigDataBench Class: Data Sources

Text related
• 6 of 19 workloads
  – Sort, Grep, WordCount, Index, Collaborative Filtering, and Naive Bayes

Graph related
• 4 of 19 workloads
  – BFS, PageRank, Kmeans, and Connected Components

Table related
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Nutch Server, Olio Server, and Rubis Server

BigDataBench Class: Application Types

Online services
• 6 of 19 workloads
  – Read, Write, Scan, Nutch Server, Olio Server, and Rubis Server

Offline analytics
• 10 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index, Kmeans, Connected Components, Collaborative Filtering, and Naive Bayes

Realtime analytics
• 3 of 19 workloads
  – Select Query, Aggregate Query, and Join Query
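One way to pick a subset of workloads for an experiment is to tag each workload with its classes and filter on them. The sketch below transcribes the data source and application type groupings from the slides into a lookup table; the table layout and the helper `select_workloads` are illustrative and are not part of the BigDataBench distribution.

```python
# (data source, application type) for each of the 19 workloads,
# transcribed from the classification slides.
WORKLOADS = {
    "Sort": ("text", "offline analytics"),
    "Grep": ("text", "offline analytics"),
    "WordCount": ("text", "offline analytics"),
    "Index": ("text", "offline analytics"),
    "Collaborative Filtering": ("text", "offline analytics"),
    "Naive Bayes": ("text", "offline analytics"),
    "BFS": ("graph", "offline analytics"),
    "PageRank": ("graph", "offline analytics"),
    "Kmeans": ("graph", "offline analytics"),
    "Connected Components": ("graph", "offline analytics"),
    "Read": ("table", "online services"),
    "Write": ("table", "online services"),
    "Scan": ("table", "online services"),
    "Nutch Server": ("table", "online services"),
    "Olio Server": ("table", "online services"),
    "Rubis Server": ("table", "online services"),
    "Select Query": ("table", "realtime analytics"),
    "Aggregate Query": ("table", "realtime analytics"),
    "Join Query": ("table", "realtime analytics"),
}


def select_workloads(data_source=None, app_type=None):
    """Return workload names matching the given classes (None matches any)."""
    return sorted(
        name
        for name, (source, kind) in WORKLOADS.items()
        if data_source in (None, source) and app_type in (None, kind)
    )


print(select_workloads(data_source="graph"))
print(select_workloads(data_source="table", app_type="realtime analytics"))
```

Filtering on one class reproduces the per-class counts given on the slides (for example, 4 graph-related workloads), and combining two filters narrows the set further.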

BigDataBench Class: Application Domains

Search engine related: Basic Operations + Search Engine
• 7 of 19 workloads
  – Sort, Grep, WordCount, BFS, PageRank, Index, and Nutch Server

Social network related: Basic Cloud OLTP + Basic Relational Query + Social Network
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Olio Server, Kmeans, and Connected Components

E-commerce related: Basic Cloud OLTP + Basic Relational Query + E-commerce
• 9 of 19 workloads
  – Read, Write, Scan, Select Query, Aggregate Query, Join Query, Rubis Server, Collaborative Filtering, and Naive Bayes


Near Future Work

• Multi-media data
• Deep learning workloads
• HPC
• Refine BigDataBench

Related Resources

BigDataBench project
• http://prof.ict.ac.cn/BigDataBench

BPOE workshop
• http://prof.ict.ac.cn/bpoe
• A series of workshops on Big Data Benchmarks, Performance Optimization, and Emerging Hardware
• BPOE-4: interaction among OS, architecture, and data management
  – Co-located with ASPLOS 2014

BPOE-4 SC
• Christos Kozyrakis, Stanford
• Xiaofang Zhou, University of Queensland
• Dhabaleswar K Panda, Ohio State University
• Raghunath Nambiar, Cisco
• Lizy K John, University of Texas at Austin
• Xiaoyong Du, Renmin University of China
• H. Peter Hofstee, IBM Austin Research Laboratory
• Ippokratis Pandis, IBM Almaden Research Center
• Alexandros Labrinidis, University of Pittsburgh
• Bill Jia, Facebook
• Jianfeng Zhan, ICT, Chinese Academy of Sciences

