Hanborq Optimizations on
  Hadoop MapReduce
          Feb.21, 2012
   Big Data Engineering Team
Motivations
•   MapReduce is a proven and successful data processing framework, and it can be implemented efficiently
    and flexibly.
     – MapReduce: Simplified Data Processing on Large Clusters
     – MapReduce: A Flexible Data Processing Tool

•   Hadoop is the most popular Open Source implementation of MapReduce, but its implementation is not so
    good.
     –   Long latency.
     –   Inefficient, with low performance.
     –   Not so flexible. Not simple enough to build, develop, etc.
     –   It has taken a long time to mature (since 2006).

•   Our customers constantly challenge the latency and performance of Hadoop MapReduce.
•   The majority of real deployments cannot afford to install thousands of servers. The inefficiency of Hadoop
    wastes a lot of hardware and energy.

•   It’s possible to improve the Hadoop Core and applications to achieve a better experience.
     – Tenzing: A SQL Implementation On The MapReduce Framework
     – Dremel: Interactive Analysis of Web-Scale Datasets


To build a fast architecture.

MAPREDUCE RUNTIME ENVIRONMENT


Runtime: Job/Task Schedule & Latency (1)
• Problem: even for an empty job, Hadoop takes tens of seconds to complete it.

• HDH Improvements:
    – Worker Pool (an illustrative sketch follows this list)
         • Like Google Tenzing, the HDH MapReduce runtime does not spawn new JVM processes for each job/task, but
           instead starts these slot/worker processes at the initialization phase and keeps them running constantly.
         • Fast/real-time communication between Workers and the TaskTracker.
    – Transfer job description information (JobConf, splits info, etc.) within the RPC (Client->JobTracker).
         • Reduce the overhead of transferring, persisting, loading and parsing XML files.
         • Strip useless and default attributes from the JobConf before it is transferred.
    – Heartbeat (TaskTracker)
         • Speed up task assignment.
         • Use a triggered (out-of-band) real-time heartbeat when special events happen.
    –   Avoid duplicated loading of configuration files for Configuration objects.
    –   Use the DistributedCache to deploy a job’s jar files.
    –   Avoid some unnecessary “sleep” calls.
    –   Avoid some big buffer/memory allocations which take a long time and are inefficient.
    –   …
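The sketch below illustrates the worker-pool idea only; the class and method names (PooledWorker, TaskTrackerClient, TaskAssignment) are hypothetical and are not HDH's actual code. The point is that the worker JVM starts once, registers with its TaskTracker, and then loops over task assignments instead of being forked per task.

    // Illustrative only: a long-lived worker process in the pool (hypothetical names).
    public class PooledWorker {
      public static void main(String[] args) throws Exception {
        TaskTrackerClient tracker = TaskTrackerClient.connect("localhost", 50050); // assumed local RPC endpoint
        tracker.registerWorker();                        // join the worker pool once, at startup
        while (!Thread.currentThread().isInterrupted()) {
          TaskAssignment task = tracker.pollForTask();   // blocks until the TaskTracker assigns work
          if (task == null) continue;                    // poll timed out, ask again
          try {
            task.run();                                  // run the map/reduce work inside this long-lived JVM
            tracker.reportDone(task.getId());
          } catch (Exception e) {
            tracker.reportFailed(task.getId(), e);       // a task failure does not kill the worker process
          }
        }
      }
    }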

Runtime: Job/Task Schedule & Latency (2)
• Worker Pool, RPC, Heartbeat

[Diagram: the MapReduce Client sends the JobConf to the JobTracker over RPC; each TaskTracker heartbeats to the JobTracker; each TaskTracker manages a local Worker Pool of Child Worker processes.]
Hints
• Enable these configuration properties (the mapred-site.xml fragment below shows the equivalent XML)
  – mapreduce.tasktracker.outofband.heartbeat = true
  – mapred.job.reuse.jvm.num.tasks = -1 (even though it is not
    used in HDH)
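For reference, a minimal mapred-site.xml fragment with the two properties above (the property names exist in stock Hadoop 0.20 / CDH3; only the values are our recommendation):

    <property>
      <name>mapreduce.tasktracker.outofband.heartbeat</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
    </property>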


• Issues
  – Java GC issues for a constantly running worker JVM.



To build a fast engine.

MAPREDUCE PROCESSING ENGINE


Processing Engine Improvements
• Shuffle: Use sendfile to reduce data copies and context
  switches.
• Shuffle: Netty Shuffle Server (map side) and Batch Fetch
  (reduce side).
• Sort Avoidance.
  – Spilling and Partitioning, Counting Sort, Bytes Merge, Early
    Reduce, etc.
  – Hash Aggregation in the job implementation (see the sketch below).
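To make “hash aggregation in the job implementation” concrete, here is a minimal sketch of an in-mapper hash aggregation (roughly SELECT key, COUNT(1) ... GROUP BY key) using the old mapred API that this deck targets. The class name and the assumption that the group-by column is the first comma-separated field are ours, purely for illustration.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative only: partial counts are kept in a HashMap and emitted once in close(),
    // instead of emitting one record per input row and relying on sort + combiner.
    public class HashAggMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private final Map<String, Long> counts = new HashMap<String, Long>();
      private OutputCollector<Text, LongWritable> out;

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        this.out = output;                                    // remember the collector for close()
        String groupKey = line.toString().split(",")[0];      // assumed: group-by column is the first field
        Long c = counts.get(groupKey);
        counts.put(groupKey, c == null ? 1L : c + 1L);
      }

      public void close() throws IOException {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
      }
    }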


Shuffle: Use sendfile to reduce data copy and context switch (1)
• Rewrite the shuffle server with Netty, using the Zero Copy API to
  transfer map output data.
  – Less data copying, more efficiency
  – Less data copying, less CPU usage
  – Fewer context switches, less CPU usage


• Saves more CPU for user tasks.
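In Java, sendfile is reached through FileChannel.transferTo(); Netty exposes the same mechanism through its FileRegion abstraction. Below is a minimal plain-NIO sketch of serving one map output file with zero copy, under our own assumptions (it is not the actual HDH shuffle server, and the class and method names are ours):

    import java.io.FileInputStream;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    // Illustrative only: send a map output file to a reducer's socket with zero copy.
    // FileChannel.transferTo() uses sendfile(2) on Linux, so the bytes never enter user space.
    public class ZeroCopySend {
      static void sendMapOutput(String path, SocketChannel socket) throws Exception {
        FileChannel file = new FileInputStream(path).getChannel();
        try {
          long position = 0;
          long remaining = file.size();
          while (remaining > 0) {                          // transferTo may send fewer bytes than requested
            long sent = file.transferTo(position, remaining, socket);
            position += sent;
            remaining -= sent;
          }
        } finally {
          file.close();
        }
      }
    }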


Shuffle: Use sendfile to reduce data copy and context switch (2)

[Figure: traditional data copy vs. data copy with sendfile; with sendfile (and gather operations, if the NIC supports them) the extra copies through user space are avoided.]
Shuffle: Use sendfile to reduce data copy and context switch (3)

[Figure: traditional context switches vs. context switches with sendfile; sendfile removes the extra user/kernel switches per transfer.]
Shuffle: Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduces the effect of TCP slow start.
• More importantly, better shuffle scheduling in the Reduce phase
  results in better overall performance.

• Configuration
  – mapred-site.xml
     <property>
        <name>mapreduce.shuffle.max.maps</name>
        <value>4</value>
        <description>Reduce side batch fetch for efficient shuffle copy.</description>
     </property>


Shuffle: Netty Server & Batch Fetch (2)

One connection per map: each fetch thread in the reduce copies one map output per connection, even when there are many outputs on that TaskTracker.

Batch fetch: a fetch thread copies multiple map outputs per connection. While it is copying, that fetch thread takes over the TaskTracker; other fetch threads cannot fetch outputs from it during the copying period.

[Figure: one connection per map vs. batch fetch.]
Shuffle: Netty Server & Batch Fetch evaluations
•   Test Data
     –   8 text files, each ~600MB in size
     –   ~50,000,000 records in total
           •   key: 10 bytes, record: 98 bytes
•   Test job
     –   The test job includes the following phases: Map -> Sort -> Shuffle -> Sort/Merge -> Reduce (the reduce only reads its input, no output).

[Charts: job time in minutes for 80 maps with 1 reduce, 2 reduces and 4 reduces, comparing CDH3U2 against Batch Fetch = 1, 2, 4 and 20. The measured times are roughly 03:36–03:48 with 1 reduce, 02:11–02:53 with 2 reduces, and 01:32–01:51 with 4 reduces.]

We find the gains of this improvement are not very distinct when the total M*R is at a low level. To be verified!
Sort Avoidance
• Many real-world jobs require shuffling, but not sorting, and the sorting brings
  much overhead.
    – Hash Aggregations
    – Hash Joins
    – Filtering and simple processing (processing each record independently of other
      records)
    – …, etc.

• When sorting is turned off, the mapper feeds data to the reducer, which directly
  passes the data to the Reduce() function, bypassing the intermediate sorting step.
    – Spilling, Partitioning, Merging and Reducing all become more efficient.

• How to turn off sorting?
    – JobConf job = (JobConf) getConf();
    – job.setBoolean("mapred.sort.avoidance", true);
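A fuller driver sketch with this flag, under our own assumptions (the job name and the mapper/reducer classes are hypothetical; the only HDH-specific piece is the mapred.sort.avoidance property quoted above):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch of a driver for a shuffle-but-no-sort job, e.g. a hash aggregation
    // where the reducer does not care about record order.
    public class SortAvoidanceJob extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), SortAvoidanceJob.class);
        job.setJobName("hash-aggregation-without-sort");

        job.setBoolean("mapred.sort.avoidance", true);   // HDH: bypass the intermediate sort

        job.setMapperClass(HashAggMapper.class);         // hypothetical classes (see the earlier mapper sketch)
        job.setReducerClass(HashAggReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SortAvoidanceJob(), args));
      }
    }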

Sort Avoidance: Spilling and Partitioning
• When spilling, records are compared by partition only.
• The partition comparison uses counting sort [O(n)], not quicksort
  [O(n log n)].
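A minimal sketch of a counting sort over partition ids (illustrative only, not the actual spill code; numPartitions would be the number of reducers):

    // Order spill records by partition id in O(n + numPartitions), instead of O(n log n).
    // recordPartitions[i] is the partition of the i-th buffered record; the returned array
    // gives the order in which to write records so that each partition is contiguous.
    static int[] sortByPartition(int[] recordPartitions, int numPartitions) {
      int[] counts = new int[numPartitions + 1];
      for (int p : recordPartitions) {
        counts[p + 1]++;                          // histogram of records per partition
      }
      for (int p = 1; p <= numPartitions; p++) {
        counts[p] += counts[p - 1];               // prefix sums: start offset of each partition
      }
      int[] order = new int[recordPartitions.length];
      for (int i = 0; i < recordPartitions.length; i++) {
        order[counts[recordPartitions[i]]++] = i; // place record i into its partition's slot
      }
      return order;
    }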




Sort Avoidance: Early Reduce (Remove shuffle barrier)

• Currently the reduce function can’t start until all map
  outputs have been fetched.
• When sorting is unnecessary, the reduce function can start as
  soon as any map output is available.
• This greatly improves overall performance!



Sort Avoidance: Bytes Merge
• No overhead of key/value serialization/deserialization or comparison.
• Doesn’t care about records, just bytes.
• Just concatenates byte streams together – read in bytes, write out bytes.
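A minimal sketch of the idea, assuming plain local files stand in for map output segments (illustrative only, not the actual merge code):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Illustrative bytes merge: segments are concatenated as raw byte streams,
    // with no record deserialization and no key comparison.
    public class BytesMerge {
      static void merge(String[] segmentPaths, String mergedPath) throws IOException {
        byte[] buf = new byte[64 * 1024];
        OutputStream out = new FileOutputStream(mergedPath);
        try {
          for (String segment : segmentPaths) {
            InputStream in = new FileInputStream(segment);
            try {
              int n;
              while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);              // copy bytes straight through
              }
            } finally {
              in.close();
            }
          }
        } finally {
          out.close();
        }
      }
    }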

Sort Avoidance: Sequential Reduce Inputs
• Input files are read sequentially to feed the reduce function, so there are
  no disk seeks and performance is better.




Let’s try HDH Hadoop.

BENCHMARKS


Benchmarks: Runtime Job/Task Schedule & Latency (1)
• Testbed:
    – 5 node cluster (4 slaves), 8 map slots and 1 reduce slot per node.
• Test Jobs:
    – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m maps -r reduces -mt 1 -rt 1

• HDH launches the job and its tasks very fast.

Job Latency (in seconds, lower is better):

    Total Tasks             CDH3u2 (reuse.jvm disabled)   CDH3u2 (reuse.jvm enabled)   HDH3u2 (Hanborq)
    32 maps, 4 reduces      24                            21                           1
    96 maps, 4 reduces      43                            24                           1
Benchmarks: Runtime Job/Task Schedule & Latency (2)
• Another Testbed:
    – 4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
• Test Jobs:
    – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m 24~384 -r 9 -mt 1 -rt 1

[Chart: job latency in seconds vs. number of map tasks (24, 96, 192, 288, 384), lower and flatter is better, comparing CDH3u2 (reuse.jvm disabled), CDH3u2 (reuse.jvm enabled) and HDH; the y-axis runs from 0 to 90 seconds.]
Benchmarks: Sort Avoidance and Aggregation (1)
• Testbed:
    – 5 node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
    – There are only 6GB of RAM and 1 SATA 7.2K disk.
• Test Data:
    – Data size: ~20GB
• Test Cases:
    – Case1
        • Hash aggregation in the map and reduce phases.
        • The map only outputs a limited number of integer key-value pairs, so the shuffled data is very tiny (MBs).
    – Case2
        • Always uses the old method (sort and combiner) to implement the aggregation.
        • The map outputs many integer key-value pairs, but the shuffled data is still not large (tens of MB).
    – Case3
        • Hash aggregation in the reduce phase only; the map does not use hashing but just outputs many longer key-value pairs.
        • The map outputs many long key-value pairs, so the shuffled data is distinctly large (~12 GB) compared with Case1 and Case2.
        • This case is intentionally designed to test and highlight the effect of Sort Avoidance.
Benchmarks: Sort Avoidance and Aggregation (2)
• Case1 and Case2 are like:
    – SELECT intA, COUNT(1) FROM T1 GROUP BY intA;
• Case3 is like:
    – SELECT A, B, C, D, SUM(M), SUM(N), SUM(R), SUM(P), SUM(Q) ... FROM T2 GROUP BY A, B, C, D;

Sort Avoidance and Aggregation, time in seconds (lower is better):

                         Case1   Case2   Case3
    CDH3u2 (Cloudera)    197     216     2186
    HDH (Hanborq)        175     198     615

• Case1:
    – The shuffled data is very small, so sorting it in memory is very fast. The improvement gains only ~11%, which may
      mainly come from the Worker Pool implementation.
• Case2:
    – This case still uses sorting to do the aggregation; the tiny gains may mainly come from the Worker Pool implementation.
    – It also demonstrates that the processing engine improvements do not bring any negative effects.
• Case3:
    – The shuffled (and, in CDH3u2, sorted) data is large enough that the gains from Sort Avoidance become very distinct.
Benchmarks: Sort Avoidance and Aggregation (3)
•   Testbed:
     –   4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
     –   48GB RAM and 5 SATA 7.2k disks.
     –   Large RAM and more disks to avoid the bottleneck of disk IO.

•   Test Data:
     –   400,000,000 rows of comma separated text, ~100 bytes per row.
     –   ~40GB data in total (so we can avoid the bottleneck of disk IO).

•   Query1:
     –   Equals to: select type, sum(reqnum) from cdr group by type;
•   Query2:
     –   Equals to: select userid, sum(reqnum), max(reqnum), min(reqnum), avg(reqnum), sum(dur), max(dur), min(dur), avg(dur) from cdr group by userid;

•   Case1-1: Use sort to implement the aggregation of Query1
•   Case2-1: Use sort to implement the aggregation of Query2
•   Case1-2: Use hash (map and reduce) to implement the aggregation of Query1
•   Case2-2: Use hash (map and reduce) to implement the aggregation of Query2

Real Aggregation Jobs, time in seconds (lower is better):

                         Case1-1   Case2-1   Case1-2   Case2-2
    CDH3u2 (Cloudera)    238       603       136       206
    HDH (Hanborq)        233       578       96        151

Analysis:
    –   Case1-1 and Case2-1 still use sort; the gains may mainly come from the Worker Pool.
    –   Case1-2 and Case2-2 use hash aggregation, so we benefit distinctly from Sort Avoidance.
Benchmarks: TeraSort
• Testbed:
    – 5 node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
    – There is only 6GB memory and 1 SATA 7.2K disk.

• Generate 100GB data:
    – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar teragen 1000000000 /teradata
• TeraSort Job:
    – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar terasort /teradata /terasort

TeraSort: Sort 100GB, time in minutes (lower is better):

    CDH3u2 (Cloudera)    49
    HDH (Hanborq)        43

    •   Since there is only 1 disk on each machine, the bottleneck is the disk (iostat).
    •   We got a 12% improvement under this disk bottleneck.
    •   The gains may come from: Shuffle, and Task Scheduling in the Worker Pool.
Benchmarks: Integration with Hive
•   Testbed:
     –   4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
     –   48GB RAM and 5 SATA 7.2k disks.
     –   Large RAM and more disks to avoid the bottleneck of disk IO.

•   Dataset:
     –   400,000,000 rows of comma separated text, ~100 bytes per row.
     –   ~40GB data in total (so we can avoid the bottleneck of disk IO).

•   Query1:
     –   select type, sum(reqnum) from cdr group by type;
•   Query2:
     –   INSERT OVERWRITE DIRECTORY '/tmp/out' select userid, sum(reqnum), max(reqnum), min(reqnum), avg(reqnum), sum(dur), max(dur), min(dur), avg(dur) from cdr group by userid;

[Chart: Hive query time in seconds for Query1 and Query2 (lower is better), comparing CDH3u2 (Cloudera) and HDH (Hanborq); the y-axis runs from 0 to 500 seconds.]

•   Result Analysis:
     –   Since we did not modify Hive to use the Sort-Avoidance feature, the modest time saved may mainly come from the Worker Pool.
     –   We plan to modify Hive to support Sort-Avoidance for such aggregation and aggregation-join queries.
Benchmark More …
• The above evaluations ran on a small development cluster, to
  give a quick view of our improvements.

• We are working on further improvements and doing
  more comprehensive evaluations on a larger and more
  powerful cluster. The results will be published as soon
  as possible.

To be open.

OPEN SOURCE


HDH: Hanborq Distribution with Hadoop
• HDH makes Hadoop Fast, Simple and Robust.

• HDH delivers a series of improvements on Hadoop Core, plus Hadoop-
  based tools and applications, for putting Hadoop to work solving Big Data
  problems in production.

• HDH may be ideal for enterprises seeking an
  integrated, fast, simple, and robust Hadoop Distribution. In particular, if
  you think your MapReduce jobs are slow and low-performing, HDH
  may be your choice.

• Like, and based on, Apache Hadoop and Cloudera’s CDH, Hanborq
  delivers HDH. Beyond Hadoop Core, it will include various other
  components.
Hanborq Open Source
• Github
   – Welcome to visit Hanborq’s Open Source Repositories
   – https://github.com/hanborq/

• Hadoop
   – A Hanborq optimized Hadoop Core, especially with high performance of
     MapReduce. It's the core part of HDH.

• RockStor (coming soon)
   – An Object Storage System implementation over Hadoop and HBase, which can
     provide a service similar to Amazon S3 (Simple Storage Service).

• We will continue to open source more useful projects in the future …

Thank You Very Much!
 Anty Rao, Guangxian Liao, Schubert Zhang
{ant.rao, liaoguangxian, schubert.zhang}@gmail.com

    https://github.com/hanborq/hadoop
    http://www.slideshare.net/hanborq
 http://www.slideshare.net/schubertzhang

               to be continued …



