Making Pig Fly 
Optimizing Data Processing on Hadoop 
Daniel Dai (@daijy) 
Thejas Nair (@thejasn) 
© Hortonworks Inc. 2011 
What is Apache Pig? 
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig-latin example 
• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
• Faster development
– Fewer lines of code
– Don't re-invent the wheel
• Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/ 
Pig optimizations
• Ideally, the user should not have to bother
• Reality
– Pig is still young and immature
– Pig does not have the whole picture: cluster configuration, data histogram
– Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
– Safe transformations of the query to optimize it
– Optimized operations (join, sort)
• What you do
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use
Rule based optimizer 
• Column pruner 
• Push up filter 
• Push down flatten 
• Push up limit 
• Partition pruning 
• Global optimizer 
Column Pruner
• Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
(Pig will prune a2 automatically.)
• Cases where Pig will not do column pruning automatically
– No schema specified in the load statement:
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';
DIY:
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';
Column Pruner
• Another case where Pig does not do column pruning
– Pig does not keep track of unused columns after grouping:
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
DIY:
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it
[Diagram: in the original query, the filter a0>0 and b0>10 sits above the join of A and B; Pig splits the condition into a0>0 and b0>10, then pushes each piece past the join onto inputs A and B respectively.]
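A sketch of what this rewrite looks like in Pig Latin (relations and fields hypothetical):

A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
C = join A by a0, B by b0;
D = filter C by a0 > 0 and b0 > 10;
-- Pig splits the condition and effectively runs:
--   FA = filter A by a0 > 0;
--   FB = filter B by b0 > 10;
--   C  = join FA by a0, FB by b0;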
Other push up/down
• Push down flatten: Load → Flatten → Order becomes Load → Order → Flatten, so less data is sorted:
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
• Push up limit: Load → Foreach → Limit becomes Load (limited) → Foreach; Load → Order → Limit becomes Load → Order (limited)
Partition pruning
• Prune unnecessary partitions entirely
– HCatLoader
[Diagram: a filter (year>=2011) above an HCatLoader reading partitions 2010, 2011, 2012 is pushed into the loader as HCatLoader(year>=2011), so the 2010 partition is never read.]
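A minimal sketch with HCatLoader (table and column names hypothetical; the loader's package path varies across HCatalog versions):

raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
recent = filter raw by year >= 2011;
-- the filter on the partition column is pushed into the loader,
-- so the 2010 partition is never read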
Intermediate file compression
[Diagram: a Pig script compiles into a chain of MapReduce jobs: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3.]
• Intermediate files between map and reduce
– Snappy
• Temp files between MapReduce jobs
– No compression by default
Enable temp file compression
• Pig temp files are not compressed by default
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache-compatible license
• Enable LZO compression
– Install LZO for Hadoop
– In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
– With LZO: over 90% disk savings and up to a 4x query speedup
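The same properties can also be set from within a script via Pig's set command; a sketch, assuming LZO is installed on the cluster:

set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;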
Multiquery
• Combines two or more MapReduce jobs into one
– Happens automatically
– Cases where we want to control multiquery: it combines too many jobs
[Diagram: a single Load feeds three branches (group by $0, group by $1, group by $2), each with its own Foreach and Store, all executed in one job.]
Control multiquery
• Disable multiquery
– Command line option: -M
• Use "exec" to mark the boundary
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
– Initial (runs in the map)
– Intermediate (runs in the combiner)
– Final (runs in the reduce)
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';
Implement the right UDF
• Accumulator UDF
– Reduce-side UDF
– Normally takes a bag
• Benefit
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.data.Tuple;

// a real UDF also extends EvalFunc<Long>
public class my_accum implements Accumulator<Long> {
    private long result;
    public void accumulate(Tuple b) throws IOException {
        // called once per chunk of the bag
    }
    public Long getValue() {
        // called after all bag chunks are processed
        return result;
    }
    public void cleanup() { result = 0; }
}
Memory optimization
• Control bag size on the reduce side
MapReduce: reduce(Text key, Iterator<Writable> values, ……)
– If a bag's size exceeds the threshold, it spills to disk
– Control the bag size to fit the bag in memory if possible
[Diagram: the reduce-side iterator is materialized into one bag per join input (Bag of Input 1, 2, 3).]
pig.cachedbag.memusage=0.2
Optimization starts before pig 
• Input format 
• Serialization format 
• Compression 
Input format - Test Query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)
Input formats
[Bar chart: RunTime (sec) of the test query under different input formats; see the speaker notes for the formats compared.]
Columnar format
• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns
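A sketch of how such a test reads an HCatalog-managed RCFile table (table and column names hypothetical):

logs = load 'search_logs' using org.apache.hcatalog.pig.HCatLoader();
q = foreach logs generate query;
-- projecting 1 of 5 columns lets the columnar format skip the other 4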
RCFile test results
[Bar chart (sec): "Project 1" vs. "Project all", comparing Plain Text and RCFile.]
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune → repeat
Cost based optimization - Aggregation
• Hash Based Aggregation (HBA)
[Diagram: inside the map task, the map logic's output flows through the HBA operator, which partially aggregates it before it is sent to the reduce task.]
• Use pig.exec.mapPartAgg=true to enable
Cost based optimization – Hash Agg.
• Auto-off feature
– Switches HBA off if the output reduction is not good enough
• Configuring Hash Agg
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
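A sketch of enabling and tuning HBA in-script, assuming Pig 0.10+ (values illustrative; per the speaker notes, a minReduction of 3 or 4 is safe in most cases):

set pig.exec.mapPartAgg true;
set pig.exec.mapPartAgg.minReduction 3;  -- auto-off unless output shrinks at least 3x
set pig.cachedbag.memusage 0.2;          -- fraction of memory for in-memory bags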
Cost based optimization - Join
• Use the appropriate join algorithm
– Skew on the join key → skew join
– One input fits in memory → FR (fragment-replicate) join
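A sketch of the join hints (relations hypothetical; in a replicated join, the small input is listed last and must fit in memory):

J1 = join PVs by uid, USERS by uid using 'skewed';      -- join key is skewed
J2 = join BIG by key, SMALL by key using 'replicated';  -- FR join: SMALL is held in memory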
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
– Control spills using the map-side sort parameters
– Reduce-side shuffle/sort-merge parameters
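As the speaker notes mention, MR properties can be set on the Pig command line or in the properties file; they can also be set in-script. A sketch (buffer size illustrative):

set io.sort.mb 512;  -- a larger map-side sort buffer reduces spills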
Parallelism of reduce tasks
[Chart: query runtime vs. number of reduce tasks (4, 6, 8, 24, 48, 256); runtimes range from roughly 0:14:24 to 0:25:55.]
• Number of reduce slots = 6
• Factors affecting runtime
– Cores simultaneously used / skew
– Cost of having additional reduce tasks
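A sketch of controlling reduce parallelism (values illustrative):

set default_parallel 24;        -- script-wide default number of reducers
B = group A by $0 parallel 48;  -- per-operator override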
Cost based optimization – keep data sorted
• Frequent join operations on the same keys
• Keep data sorted on the keys
• Use merge join
• Optimized group on sorted keys
• Works with few load functions – needs an additional interface implementation
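A sketch of a merge join, assuming both inputs are stored sorted on the join key:

A = load 'sorted_input1' as (key, v1);
B = load 'sorted_input2' as (key, v2);
C = join A by key, B by key using 'merge';  -- avoids the shuffle-and-sort of a regular join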
Optimizations for sorted data
[Stacked bar chart (sec): "sort+sort+join+join" vs. "join + join", broken into Sort1, Sort2, Join 1, Join 2 segments; keeping the data sorted eliminates both sorts.]
Future Directions
• Optimize using stats
– Using historical stats with HCatalog
– Sampling
Questions?

Editor's Notes

  • #8 (Rule based optimizer): Pig's optimizer applies these rules for you in most cases, but the user can often apply them more aggressively.
  • #15 (Enable temp file compression): With gzip we saw better compression (96-99%), but at the cost of a 4% slowdown. Compression of map output is enabled by default and is done using Snappy, which is part of Hadoop. But the output of an MR job currently does not support Snappy, and the lightweight LZO compression algorithm does not ship with Apache Hadoop because its license is GPL.
  • #21 (Optimization starts before pig): Optimizations start before you write your Pig query. The choice of how input is stored is made before you use Pig, so Pig cannot help you there. The important criteria include the serialization format and the choice of compression.
  • #22 (Input format - Test Query): The numbers you see in practice can differ from what theory predicts, so I ran some experiments to see how Pig performs with different input options. I used the famous/infamous AOL search data released back in 2006. I wanted a query that does not do much, so I added a filter that looks for my name in the data. I was quite sure AOL users were not likely to be searching for me! But apparently there is one row out of 36 million that matched my name – though that wasn't actually me!
  • #23 (Input formats): I tried different ways of storing the input. The default PigStorage() uses a human-readable text format; I measured the total time taken by all the map tasks, which was 69 seconds for 36M records, around ½M records per core per second. Then I tried the compressed form of PigStorage, which uses LZO – the data size is reduced to a third. LZO is a lightweight compression, so it does not add too much CPU overhead. The reduced input file size saves on IO, but in this case the data copy was available locally, and since the size is small it is likely to be in the OS cache; compression adds more value when that is not the case. In the first two cases I ran the query without specifying a data type for each column, so columns were not deserialized to the corresponding Java types. When I specify the data types, PigStorage takes a lot longer. I tried the AvroStorage load function with types, and it performs significantly better than PigStorage with types.
  • #28 (Cost based optimization - Aggregation): Pig introduced a new aggregation algorithm in Pig 0.10. The only algorithm supported earlier used the combiner, but the problem with the combiner is that MR serializes map output to a buffer and then deserializes it in the process of getting sorted data to the combiner phase. That serialization-deserialization is expensive. So in 0.10 we use hash-based aggregation within the map itself to avoid this cost: instead of the map logic's output going to the combiner, it goes to the new HBA operator, which does partial aggregation and reduces the output size. In 0.10 hash-based agg is off by default, because it is a new feature and we wanted to let people try it out and give feedback. In most cases it should outperform combiner-based aggregation; in theory there are a few extreme cases where combiner-based aggregation can be useful.
  • #29 (Cost based optimization – Hash Agg.): As the previous diagram shows, HBA's usefulness depends on how much it reduces the map output. If it does not reduce it by much, the CPU cost of using HBA is not worth it, so hash-based agg has an auto-off feature: the operator stops trying to aggregate if it sees that output size is not being reduced much. The threshold is a factor of 10, i.e., if the data size does not get reduced to a tenth, HBA disables itself; but based on performance tests we did, values like 3 or 4 are also safe for most cases. You can also configure the memory used by HBA via pig.cachedbag.memusage, the percentage of memory used for retaining bags in memory. A higher value keeps more records in memory and can help reduce output size, but if it is too high you run the risk of running out of memory. For most cases the default of 20% is likely to work; it is not one of the first things to look at to improve performance.
  • #31 (Cost based optimization – MR tuning): The common MapReduce parameters you can tweak also apply to Pig. Look at the map task spill counts to see whether spilling happens more than once; if so, see whether you can allocate a larger sort buffer by increasing the io.sort.mb configuration parameter. There are also other parameters that decide how the regions within the buffer are allocated, which you can use to optimize further, as well as reduce-side shuffle parameters that can help reduce IO. You can specify the MR properties on the Pig command line or set them in the properties file.
  • #33 (Cost based optimization – keep data sorted): TODO: open a JIRA for optimized group on sorted data.
  • #34 (Optimizations for sorted data): Numbers using the Google 1-gram data, joining the data against itself on word+year.