Pig on Tez
Daniel Dai
@daijy
Rohini Palaniswamy
@rohini_pswamy
Hadoop Summit 2014, San Jose
Agenda
 Team Introduction
 Apache Pig
 Why Pig on Tez?
 Pig on Tez
- Design
- Tez features in Pig
- Performance
- Current status
- Future Plan
Apache Pig on Tez Team
 Daniel Dai - Pig PMC, Hortonworks
 Rohini Palaniswamy - Pig PMC, Yahoo!
 Olga Natkovich - Pig PMC, Yahoo!
 Cheolsoo Park - VP Pig, Pig PMC, Netflix
 Mark Wagner - Pig Committer, LinkedIn
 Alex Bain - Pig Contributor, LinkedIn
Pig Latin
 Procedural scripting language
 Closer to relational algebra
 Heavily used for ETL
 Schema / No schema data, Pig eats everything
 More than SQL and feature rich (a short example script follows the feature list below)
- Multiquery, Nested Foreach, Illustrate
- Algebraic and Accumulator Java UDFs
- Non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby)
- Script Embedding, Scalars, Macros
- Distributed Order By, Skewed Join
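To give a flavor of Pig Latin, here is a minimal, hypothetical ETL-style script (file and field names are illustrative and not from the talk):

-- load raw logs, drop rows without a user, and count page views per user
logs    = LOAD 'weblogs' USING PigStorage('\t')
          AS (user:chararray, url:chararray, time:long);
clean   = FILTER logs BY user IS NOT NULL;
by_user = GROUP clean BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(clean) AS views;
STORE counts INTO 'user_view_counts';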
Pig users
 Heavily used for ETL at Web Scale by Major Internet Companies
 At Yahoo!
- 60% of total hadoop jobs run daily
- 12 million monthly pig jobs
 Other heavy users
- Twitter
- Netflix
- LinkedIn
- eBay
- Salesforce
 Standard data science tool, covered in university textbooks
Why Pig on Tez?
 DAG execution framework
 Low level DAG framework
- Build DAG by defining vertices and edges
- Customize scheduling of DAG and routing of data
 Highly customizable with pluggable implementations
 Resource efficient
 Performance
- Without having to increase memory
 Natively built on top of YARN
- Multi-tenancy, resource allocation come for free
 Scale
 Security
 Excellent support from Tez community
- Bikas Saha, Siddharth Seth, Hitesh Shah
PIG on TEZ
Design
(Compilation pipeline diagram) A Pig script compiles to a Logical Plan, which
LogToPhyTranslationVisitor translates into a Physical Plan. From the Physical Plan,
TezCompiler produces a Tez Plan that runs on the Tez Execution Engine, while MRCompiler
produces an MR Plan that runs on the MR Execution Engine.
DAG Plan – Split Group by + Join
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;

(Plan diagram) In the MR plan, the split of f is multiplexed: one map loads foo and feeds
both group-by pipelines, the reducer de-multiplexes the two grouped outputs and writes them
to HDFS, and a second job loads g1 and g2 again to do the join. In the Tez DAG, the "Load
foo" vertex sends multiple outputs directly to the two group-by vertices, and the join
vertex follows them as a reduce after reduce, with no intermediate HDFS writes.
DAG Execution - Visualization
(DAG diagram) Vertex 1 (Load), reading via MRInput, feeds Vertex 2 (Group) and Vertex 3
(Group), which both feed Vertex 4 (Join), writing via MROutput.
DAG Plan – Distributed Orderby

A = LOAD 'foo' AS (x, y);
B = FILTER A BY $0 IS NOT NULL;
C = ORDER B BY x;

(Plan diagram) In the MR plan, order-by takes multiple jobs: the data is loaded and filtered,
a sampling pass aggregates a partition map that is staged on the distributed cache, and a
final map/reduce pass re-reads the data from HDFS to partition and sort it. In the Tez DAG,
a single Load/Filter & Sample vertex feeds the sample Aggregate vertex, whose partition map
is broadcast to and cached by the downstream tasks, and a 1-1 unsorted edge carries the data
to the Partition vertex, which shuffles to the Sort vertex, with no intermediate HDFS writes
or distributed cache staging.
Session Reuse
 Feature
- Session reuse
 Submit more than one DAG to same AM
 Usage
- Each Pig script uses a single session
- Grunt shell uses one session for all commands till timeout (see the sketch below)
- More than one DAG submitted for merge join, ‘exec’
 Benefits
- In MR, a Pig script with 5 MR jobs launches 5 AM containers; with Tez, a single AM per Pig
script saves capacity.
- Eliminates the queue and resource contention that every new MR job in the pipeline of a
multi-stage Pig script faces in MR.
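A rough sketch of the Grunt case (relation and file names are hypothetical): each DUMP below triggers a DAG, and both DAGs are submitted to the same Tez session and AM until the session times out.

grunt> A = LOAD 'events' AS (user:chararray, n:int);
grunt> B = GROUP A BY user;
grunt> DUMP B;                 -- first DAG goes to the session's Tez AM
grunt> C = FILTER A BY n > 10;
grunt> DUMP C;                 -- second DAG reuses the same session and AM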
Container Reuse
 Features
- Container reuse
 Run new tasks on already launched containers (JVMs)
 Usage
- Turned on by default for all pig scripts and grunt shell
 Benefits
- Reduced launch overhead
 Container request and release overhead
 Resource localization overhead
 JVM launch time overhead
- Reduced network IO
 1-1 edge tasks are launched on same node
- Object caching
 User impact
- Have to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables
and memory leaks due to JVM reuse.
Custom Vertex Input/Output/Processor/Manager
 Features
- Custom Vertex Processor
- Custom Input and Output between vertices
- Custom Vertex Manager
 Usage
- PigProcessor instead of MapProcessor and ReduceProcessor
- Unsorted input/output
 with Partitioner – Union
 without Partitioner – Broadcast Edge (Replicate join, Orderby and Skewed join), 1-1
Edge (Order by, Skewed join and Multiquery off)
- Custom Vertex Manager – Automatic Parallelism Estimation
 Benefits
- No framework restrictions like MR
- More efficient processing and algorithms
Broadcast Edge and Object Caching
 Feature
- Broadcast Edge
 Broadcast same data to all tasks in successor vertices
- Object Caching
 Ability to cache objects in memory for scope of Vertex, DAG and Session
- Optional input fetch (fetching can be skipped when the data is already cached)
 Usage
- Replicated join small table (see the example below)
- Orderby and Skewed join partitioning samples
 Benefits
- Replaces use of the distributed cache and avoids the NodeManager localization bottleneck
- Avoids input fetching when the data is already in the cache on container reuse
- Performance gains of up to 3x in tests for replicated join on smaller clusters with
higher container reuse
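For reference, a replicated join is expressed in ordinary Pig Latin as below (relation and file names are hypothetical); on Tez the small relation is broadcast to the join tasks and cached, rather than shipped through the distributed cache:

big   = LOAD 'clicks' AS (user:chararray, url:chararray);
small = LOAD 'users' AS (user:chararray, country:chararray);
-- 'replicated' asks Pig to load the small relation into memory on every join task
J = JOIN big BY user, small BY user USING 'replicated';
STORE J INTO 'clicks_with_country';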
Vertex Groups
 Feature
- Vertex Grouping
 Ability to group multiple vertices into one vertex group and produce a combined output
 Usage
- Union operator
 Benefits
- Better performance due to elimination
of an additional vertex
- Performance gains of 1.2x to 2x over MR
A = LOAD 'a';
B = LOAD 'b';
C = UNION A, B;
D = GROUP C by $0;

(DAG diagram) The Load A and Load B vertices form a vertex group whose combined output feeds
the GROUP vertex directly, with no separate union vertex in between.
Dynamic Parallelism
 Determining parallelism beforehand is hard
 Dynamically adjust parallelism at runtime (see the sketch below)
 Tez VertexManagerPlugin
- Custom policy to determine parallelism at runtime
- Library of common policies: ShuffleVertexManager
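A sketch of the user-facing side (values are hypothetical): parallelism can still be requested explicitly with default_parallel or a PARALLEL clause, while the vertex managers described here let Tez adjust the actual reduce-task count at runtime based on observed input size; how explicit requests and runtime estimates interact depends on configuration.

SET default_parallel 100;                  -- requested default reduce parallelism
A = LOAD 'events' AS (user:chararray, bytes:long);
B = GROUP A BY user PARALLEL 200;          -- per-operator request for this group-by
C = FOREACH B GENERATE group AS user, SUM(A.bytes) AS total_bytes;
STORE C INTO 'bytes_per_user';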
Dynamic Parallelism - ShuffleVertexManager
 Stock VertexManagerPlugin from Tez
 Used by Group, Hash Join, etc.
 Dynamically reduces the parallelism of a vertex based on estimated input size
(DAG diagram) Example: a Join vertex fed by Load A and Load B is planned with 4 tasks and is
scaled down to 2 at runtime.
Dynamic Parallelism – PartitionerDefinedVertexManager
 Custom VertexManagerPlugin used by Order by / Skewed Join (see the example below)
 Dynamically increases / decreases parallelism based on input size
(DAG diagram) The Load/Filter & Sample vertex feeds a Sample Aggregate vertex that calculates
the parallelism, which is then applied to the Partition and Sort vertices.
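Both operators are driven from plain Pig Latin; a hypothetical sketch of the two cases whose partition and sort vertices pick up their parallelism from the runtime sample:

logs  = LOAD 'weblogs' AS (user:chararray, url:chararray, time:long);
users = LOAD 'users' AS (user:chararray, country:chararray);
-- order-by: sort parallelism is derived from the sampled data size
sorted = ORDER logs BY time;
-- skewed join: heavily skewed keys found in the sample are split across multiple reducers
J = JOIN logs BY user, users BY user USING 'skewed';
STORE sorted INTO 'logs_sorted';
STORE J INTO 'logs_joined';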
PERFORMANCE
Performance numbers –
(Bar chart: time in mins, MR vs Tez)
- Prod script 1: 1.5x, 1 MR job, 3172 vs 3172 tasks, 28 vs 18 min
- Prod script 2: 2.1x, 12 MR jobs, 966 vs 941 tasks, 11 vs 5 min
- Prod script 3: 1.5x, 4 MR jobs on 8.4 TB input, 21397 vs 21382 tasks, 50 vs 35 min
- Prod script 4: 2%, 4 MR jobs on 25.2 TB input, 101864 vs 101856 tasks, 74 vs 72 min
Performance numbers –
(Bar chart: time in mins, MR vs Tez)
- Prod script 1: 2.52x, 5 MR jobs
- Prod script 2: 2.02x, 5 MR jobs
- Prod script 3: 2.22x, 12 MR jobs
- Prod script 4: 1.75x, 15 MR jobs
- Runtimes (MR vs Tez): 25 vs 10 min, 34 vs 16 min, 2h 22m vs 1h 21m, 1h 46m vs 48m
Lipstick from Netflix (screenshot)
Performance Numbers – Interactive Query
(Chart: TPC-H Q10, time in secs by input size, MR vs Tez)
- Tez speedup over MR: 2.49x at 10G, 3.41x at 5G, 4.89x at 1G, 6x at 500M input
 When the input data is small, latency dominates
 Tez significantly reduces latency through session/container reuse
Performance Numbers – Iterative Algorithm
 Pig can be used to implement iterative algorithms using embedding
 Iterative algorithms are ideal for container reuse
 Example: k-means algorithm
- Each iteration takes an average of 1.48s after the first iteration (vs 27s for MR)
(Chart: k-means, time in secs by number of iterations, MR vs Tez)
- Tez speedup over MR: 5.37x at 10 iterations, 13.12x at 50, 14.84x at 100
* Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
Performance is proportional to …
 Number of stages in the DAG
- The more stages in the DAG, the better Tez performs relative to MR, due to the
elimination of map read stages.
 Size of intermediate output
- The larger the intermediate output, the better Tez performs relative to MR, due to
reduced HDFS usage.
 Cluster/queue capacity
- The more congested a queue is, the better Tez performs relative to MR, due to
container reuse.
 Size of data in the job
- For smaller data and more stages, Tez performs better relative to MR, because launch
overhead is a larger percentage of total time for smaller jobs.
CURRENT & FUTURE
Where are we?
 90% feature parity with Pig on MR
- No Local mode (TEZ-235)
- Rarely used operators not implemented
 MAPREDUCE (native mapreduce jobs)
 Collected CoGroup
 98% of ~1300 e2e tests pass.
 35% of ~2850 unit tests pass; porting the rest is pending on Tez local mode.
 Tez branch merged into trunk and will be part of Pig 0.14 release
 Netflix has Lipstick working with Pig on Tez
- Credits: Jacob Perkins, Cheolsoo Park
User Impact
 Tez
- Zero pain deployment
- Tez library installation on local disk and copy to HDFS
 Pig
- No pain migration from Pig on MR to Pig on Tez
 Existing scripts work as is without any modification
 Only two additional steps to execute in Tez mode
– export TEZ_HOME=/tez-install-location
– pig -x tez myscript.pig
- Users should review/profile and fix custom LoadFunc/StoreFunc/UDFs for static
variables and memory leaks due to JVM reuse.
What next?
 Support for Tez Local mode
 All unit tests ported
 Improve
- Stability
- Usability
- Debuggability
 Apache Release
- Pig 0.14 with Tez released by Sep 2014
 Deployment
- In research at Yahoo! by early Q3
- In production at Yahoo! and Netflix by Q3/Q4
 Performance
- From 1.2-3x to 1.5x-5x by Q4
Tez Features - WIP
 Tez UI
- Application Master UI and job history UI are in the works, integrating via the
Application Timeline Server.
- Currently only AM logs are easily viewable. Task logs are available, but one has to grep
the AM log to find their URLs.
 Tez Local mode
 Tez AM Recovery
- Tez checkpointing and resuming on AM failure is functional but needs more
work. With single DAG execution of whole script, AM retries can be very costly.
 Input fetch optimizations
- Custom ShuffleHandler on NodeManager
- Local input fetch on container reuse
What next - Performance?
 Shared Edges
- Same output to multiple downstream vertices
 Multiple Vertex Caching
 Unsorted shuffle for skewed join and order by
 Custom edge manager and data routing for skewed join
 Group by and join using hashing to avoid sorting
 Better memory management
 Dynamic reconfiguration of DAG
- Automatically determine type of join - replicate, skewed or hash join
We are hiring!!!
Hortonworks
Stop by Kiosk D5
Yahoo!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
Thank You