Making Pig Fly 
Optimizing Data Processing on Hadoop 
Daniel Dai (@daijy) 
Thejas Nair (@thejasn) 
© Hortonworks Inc. 2011 
What is Apache Pig? 
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig-latin example 
• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
• Faster development
– Fewer lines of code
– Don't re-invent the wheel
• Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/ 
Pig optimizations
• Ideally, the user should not have to bother
• Reality
– Pig is still young and immature
– Pig does not have the whole picture: cluster configuration, data histogram
– Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
– Safe transformations of the query to optimize it
– Optimized operations (join, sort)
• What you do
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use
Rule based optimizer 
• Column pruner 
• Push up filter 
• Push down flatten 
• Push up limit 
• Partition pruning 
• Global optimizer 
Column Pruner
• Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
(Pig will prune a2 automatically.)
• Cases where Pig will not do column pruning automatically
– No schema specified in the load statement:
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';
DIY:
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';
Column Pruner
• Another case where Pig does not do column pruning
– Pig does not keep track of unused columns after grouping:
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
DIY:
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it
[Diagram: in the original query, the filter a0>0 and b0>10 sits above the join of A and B; Pig splits the condition into a0>0 and b0>10, then pushes each piece past the join onto inputs A and B respectively.]
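A sketch of what this rewrite looks like in Pig Latin (relations and fields hypothetical):

A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
C = join A by a0, B by b0;
D = filter C by a0 > 0 and b0 > 10;
-- Pig splits the condition and effectively runs:
--   FA = filter A by a0 > 0;
--   FB = filter B by b0 > 10;
--   C  = join FA by a0, FB by b0;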
Other push up/down
• Push down flatten: Load → Flatten → Order becomes Load → Order → Flatten, so less data is sorted:
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
• Push up limit: Load → Foreach → Limit becomes Load (limited) → Foreach; Load → Order → Limit becomes Load → Order (limited)
Partition pruning
• Prune unnecessary partitions entirely
– HCatLoader
[Diagram: a filter (year>=2011) above an HCatLoader reading partitions 2010, 2011, 2012 is pushed into the loader as HCatLoader(year>=2011), so the 2010 partition is never read.]
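A minimal sketch with HCatLoader (table and column names hypothetical; the loader's package path varies across HCatalog versions):

raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
recent = filter raw by year >= 2011;
-- the filter on the partition column is pushed into the loader,
-- so the 2010 partition is never read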
Intermediate file compression
[Diagram: a Pig script compiles into a chain of MapReduce jobs: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3.]
• Intermediate files between map and reduce
– Snappy
• Temp files between MapReduce jobs
– No compression by default
Enable temp file compression
• Pig temp files are not compressed by default
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache-compatible license
• Enable LZO compression
– Install LZO for Hadoop
– In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
– With LZO: over 90% disk savings and up to a 4x query speedup
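The same properties can also be set from within a script via Pig's set command; a sketch, assuming LZO is installed on the cluster:

set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;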
Multiquery
• Combines two or more MapReduce jobs into one
– Happens automatically
– Cases where we want to control multiquery: it combines too many jobs
[Diagram: a single Load feeds three branches (group by $0, group by $1, group by $2), each with its own Foreach and Store, all executed in one job.]
Control multiquery
• Disable multiquery
– Command line option: -M
• Use "exec" to mark the boundary
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
– Initial (runs in the map)
– Intermediate (runs in the combiner)
– Final (runs in the reduce)
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';
Implement the right UDF
• Accumulator UDF
– Reduce-side UDF
– Normally takes a bag
• Benefit
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.data.Tuple;

// a real UDF also extends EvalFunc<Long>
public class my_accum implements Accumulator<Long> {
    private long result;
    public void accumulate(Tuple b) throws IOException {
        // called once per chunk of the bag
    }
    public Long getValue() {
        // called after all bag chunks are processed
        return result;
    }
    public void cleanup() { result = 0; }
}
Memory optimization
• Control bag size on the reduce side
MapReduce: reduce(Text key, Iterator<Writable> values, ……)
– If a bag's size exceeds the threshold, it spills to disk
– Control the bag size to fit the bag in memory if possible
[Diagram: the reduce-side iterator is materialized into one bag per join input (Bag of Input 1, 2, 3).]
pig.cachedbag.memusage=0.2
Optimization starts before pig 
• Input format 
• Serialization format 
• Compression 
Input format - Test Query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)
Input formats
[Bar chart: RunTime (sec) of the test query under different input formats; see the speaker notes for the formats compared.]
Columnar format
• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns
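A sketch of how such a test reads an HCatalog-managed RCFile table (table and column names hypothetical):

logs = load 'search_logs' using org.apache.hcatalog.pig.HCatLoader();
q = foreach logs generate query;
-- projecting 1 of 5 columns lets the columnar format skip the other 4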
RCFile test results
[Bar chart (sec): "Project 1" vs. "Project all", comparing Plain Text and RCFile.]
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune → repeat
Cost based optimization - Aggregation
• Hash Based Aggregation (HBA)
[Diagram: inside the map task, the map logic's output flows through the HBA operator, which partially aggregates it before it is sent to the reduce task.]
• Use pig.exec.mapPartAgg=true to enable
Cost based optimization – Hash Agg.
• Auto-off feature
– Switches HBA off if the output reduction is not good enough
• Configuring Hash Agg
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
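A sketch of enabling and tuning HBA in-script, assuming Pig 0.10+ (values illustrative; per the speaker notes, a minReduction of 3 or 4 is safe in most cases):

set pig.exec.mapPartAgg true;
set pig.exec.mapPartAgg.minReduction 3;  -- auto-off unless output shrinks at least 3x
set pig.cachedbag.memusage 0.2;          -- fraction of memory for in-memory bags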
Cost based optimization - Join
• Use the appropriate join algorithm
– Skew on the join key → skew join
– One input fits in memory → FR (fragment-replicate) join
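A sketch of the join hints (relations hypothetical; in a replicated join, the small input is listed last and must fit in memory):

J1 = join PVs by uid, USERS by uid using 'skewed';      -- join key is skewed
J2 = join BIG by key, SMALL by key using 'replicated';  -- FR join: SMALL is held in memory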
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
– Control spills using the map-side sort parameters
– Reduce-side shuffle/sort-merge parameters
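As the speaker notes mention, MR properties can be set on the Pig command line or in the properties file; they can also be set in-script. A sketch (buffer size illustrative):

set io.sort.mb 512;  -- a larger map-side sort buffer reduces spills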
Parallelism of reduce tasks
[Chart: query runtime vs. number of reduce tasks (4, 6, 8, 24, 48, 256); runtimes range from roughly 0:14:24 to 0:25:55.]
• Number of reduce slots = 6
• Factors affecting runtime
– Cores simultaneously used / skew
– Cost of having additional reduce tasks
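A sketch of controlling reduce parallelism (values illustrative):

set default_parallel 24;        -- script-wide default number of reducers
B = group A by $0 parallel 48;  -- per-operator override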
Cost based optimization – keep data sorted
• Frequent join operations on the same keys
• Keep data sorted on the keys
• Use merge join
• Optimized group on sorted keys
• Works with few load functions – needs an additional interface implementation
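A sketch of a merge join, assuming both inputs are stored sorted on the join key:

A = load 'sorted_input1' as (key, v1);
B = load 'sorted_input2' as (key, v2);
C = join A by key, B by key using 'merge';  -- avoids the shuffle-and-sort of a regular join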
Optimizations for sorted data
[Stacked bar chart (sec): "sort+sort+join+join" vs. "join + join", broken into Sort1, Sort2, Join 1, Join 2 segments; keeping the data sorted eliminates both sorts.]
Future Directions
• Optimize using stats
– Using historical stats with HCatalog
– Sampling
Questions?

Editor's Notes

  • #8 (Rule based optimizer): Pig's optimizer applies these rules for you in most cases, but the user can often apply them more aggressively.
  • #15 (Enable temp file compression): With gzip we saw better compression (96-99%), but at the cost of a 4% slowdown. Compression of map output is enabled by default and is done using Snappy, which is part of Hadoop. But the output of an MR job currently does not support Snappy, and the lightweight LZO compression algorithm does not ship with Apache Hadoop because its license is GPL.
  • #21 (Optimization starts before pig): Optimizations start before you write your Pig query. The choice of how input is stored is made before you use Pig, so Pig cannot help you there. The important criteria include the serialization format and the choice of compression.
  • #22 (Input format - Test Query): The numbers you see in practice can differ from what theory predicts, so I ran some experiments to see how Pig performs with different input options. I used the famous/infamous AOL search data released back in 2006. I wanted a query that does not do much, so I added a filter that looks for my name in the data. I was quite sure AOL users were not likely to be searching for me! But apparently there is one row out of 36 million that matched my name – though that wasn't actually me!
  • #23 (Input formats): I tried different ways of storing the input. The default PigStorage() uses a human-readable text format; I measured the total time taken by all the map tasks, which was 69 seconds for 36M records, around ½M records per core per second. Then I tried the compressed form of PigStorage, which uses LZO – the data size is reduced to a third. LZO is a lightweight compression, so it does not add too much CPU overhead. The reduced input file size saves on IO, but in this case the data copy was available locally, and since the size is small it is likely to be in the OS cache; compression adds more value when that is not the case. In the first two cases I ran the query without specifying a data type for each column, so columns were not deserialized to the corresponding Java types. When I specify the data types, PigStorage takes a lot longer. I tried the AvroStorage load function with types, and it performs significantly better than PigStorage with types.
  • #28 (Cost based optimization - Aggregation): Pig introduced a new aggregation algorithm in Pig 0.10. The only algorithm supported earlier used the combiner, but the problem with the combiner is that MR serializes map output to a buffer and then deserializes it in the process of getting sorted data to the combiner phase. That serialization-deserialization is expensive. So in 0.10 we use hash-based aggregation within the map itself to avoid this cost: instead of the map logic's output going to the combiner, it goes to the new HBA operator, which does partial aggregation and reduces the output size. In 0.10 hash-based agg is off by default, because it is a new feature and we wanted to let people try it out and give feedback. In most cases it should outperform combiner-based aggregation; in theory there are a few extreme cases where combiner-based aggregation can be useful.
  • #29 (Cost based optimization – Hash Agg.): As the previous diagram shows, HBA's usefulness depends on how much it reduces the map output. If it does not reduce it by much, the CPU cost of using HBA is not worth it, so hash-based agg has an auto-off feature: the operator stops trying to aggregate if it sees that output size is not being reduced much. The threshold is a factor of 10, i.e., if the data size does not get reduced to a tenth, HBA disables itself; but based on performance tests we did, values like 3 or 4 are also safe for most cases. You can also configure the memory used by HBA via pig.cachedbag.memusage, the percentage of memory used for retaining bags in memory. A higher value keeps more records in memory and can help reduce output size, but if it is too high you run the risk of running out of memory. For most cases the default of 20% is likely to work; it is not one of the first things to look at to improve performance.
  • #31 (Cost based optimization – MR tuning): The common MapReduce parameters you can tweak also apply to Pig. Look at the map task spill counts to see whether spilling happens more than once; if so, see whether you can allocate a larger sort buffer by increasing the io.sort.mb configuration parameter. There are also other parameters that decide how the regions within the buffer are allocated, which you can use to optimize further, as well as reduce-side shuffle parameters that can help reduce IO. You can specify the MR properties on the Pig command line or set them in the properties file.
  • #33 (Cost based optimization – keep data sorted): TODO: open a JIRA for optimized group on sorted data.
  • #34 (Optimizations for sorted data): Numbers using the Google 1-gram data, joining the data against itself on word+year.