Introduction to Pig

Apache Pig – Introduction and Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala

© Hortonworks Inc. 2012

Big Data Platforms
[Figure: bubble chart of big data platforms by cost per TB and adoption; size of bubble = cost effectiveness of solution]

Topics
• What is Pig?
• Why Pig?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
• Q&A

What is Pig?
• System for processing large volumes of unstructured data
• Uses HDFS and MapReduce
• Data flow language
• Data flows are expressed as a Directed Acyclic Graph (DAG)
• Started at Yahoo! Research
• Joined the Apache Incubator in 2007
• Graduated to a subproject of Hadoop in 2008
• Apache top-level project since 2010

Pig Philosophy 
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly

Components
• Pig Engine – Parser, Optimizer and distributed query
execution
• Grunt – CLI shell
• Pig Latin – Procedural Language

Why Pig?
• High level language that increases programmer
productivity.
• Designed for Parallel Data flow.
• Reduces complexity by abstracting low level Map and
Reduce jobs and Map Reduce job chaining
• Can be run on a client/gateway machine with no
configuration on the cluster
• Multiple versions of Pig can co-exist as long as they
are compatible with the cluster's Hadoop version.

Running Pig
A Pig Latin script can be executed in 3 modes
• MapReduce: Code executes as MapReduce on a
Hadoop Cluster
$ pig myscript.pig
• Local: Code executes locally in a single JVM using
local data
$ pig -x local myscript.pig

• Interactive: pig with no script starts the grunt shell
where commands can be run interactively

Grunt shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile

Data Types
• Scalar Types
– int, long, float, double, chararray, bytearray, boolean (new in 0.10.0), datetime (work in progress)
• Complex Types
– Map: collection of key-value pairs, e.g. [name#alan, age#30]
– Tuple: ordered set of values, e.g. (alan,40,engineering)
– Bag: unordered collection of tuples, e.g. {(alan,40,engineering),(bob,45,sales)}
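
The three complex types can be pictured with ordinary Python containers. This is only an illustrative sketch, not a Pig API: the variable names are hypothetical, and a Counter (multiset) is assumed as the stand-in for a bag because a bag is unordered but may hold duplicate tuples.

```python
# Hypothetical Python stand-ins for Pig's complex types (not a Pig API).
from collections import Counter

# Map: keys are chararrays, values may be of any type.
person_map = {"name": "alan", "age": 30}

# Tuple: ordered fields, addressed by position.
person_tuple = ("alan", 40, "engineering")

# Bag: unordered collection of tuples; duplicates are allowed,
# so a multiset is a closer model than a set or a list.
bag_a = Counter([("alan", 40, "engineering"), ("bob", 45, "sales")])
bag_b = Counter([("bob", 45, "sales"), ("alan", 40, "engineering")])

print(person_map["name"])  # lookup by key
print(person_tuple[1])     # positional access, comparable to $1
print(bag_a == bag_b)      # the order of tuples in a bag does not matter
```

In Pig Latin itself a map value is looked up with the `#` operator and tuple fields by name or by `$n` position.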

• Pig Latin works with relations and a set of operations
on relations
• Schema for relations is optional
• Fields can be referenced by position: $0 … $n
• null means the data is undefined.
• Any missing or invalid fields are loaded as null
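
The null-on-load behaviour can be sketched in Python. The helper `load_line` and the sample rows below are hypothetical, and `None` stands in for Pig's null; the sketch only mimics the idea that a field which is missing or cannot be cast to its declared type is loaded as null rather than failing the whole record.

```python
def load_line(line, schema, delimiter=","):
    """Split one text line and cast each field according to the schema.

    schema is a list of (name, cast) pairs. A field that is missing,
    empty, or fails its cast becomes None (standing in for Pig's null).
    """
    raw = line.rstrip("\n").split(delimiter)
    record = []
    for position, (_name, cast) in enumerate(schema):
        try:
            value = raw[position]
            record.append(cast(value) if value != "" else None)
        except (IndexError, ValueError):
            record.append(None)
    return tuple(record)


schema = [("name", str), ("age", int), ("dept", str)]
lines = ["alan,40,engineering", "bob,forty,sales", "carol,35"]
rows = [load_line(line, schema) for line in lines]

for row in rows:
    print(row)
```

The second row has an age that is not an integer and the third row has no department, so both fields come back as `None` while the rest of each record is kept.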

Input and Output
• A = LOAD 'file' USING PigStorage(',') AS
(data1:datatype1, data2:datatype2, …);

• STORE A INTO 'file2' USING PigStorage(',');

• DUMP A;

• DESCRIBE A;

Relational Operations
• GROUP A BY age;

• FOREACH B GENERATE $1 - $3;

• FILTER A BY $1 > 10;

• ORDER A BY $1 DESC, $2;

• JOIN A BY $1, B BY $5;
• JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
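
What these operators produce can be modelled in a few lines of Python. This is a rough sketch of the semantics only, with hypothetical helper names (`group_by`, `inner_join`) and made-up sample data; it says nothing about how Pig executes them on MapReduce.

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Model of GROUP: one output tuple (key, bag) per distinct key."""
    groups = defaultdict(list)
    for record in relation:
        groups[record[key_index]].append(record)
    return [(key, bag) for key, bag in groups.items()]


def inner_join(left, left_index, right, right_index):
    """Model of the default (inner) JOIN: concatenate matching tuples."""
    by_key = defaultdict(list)
    for record in right:
        by_key[record[right_index]].append(record)
    return [l + r for l in left for r in by_key.get(l[left_index], [])]


A = [("alan", 40, "engineering"), ("bob", 45, "sales"), ("carol", 40, "sales")]
B = [("engineering", "bldg1"), ("sales", "bldg2")]

grouped = group_by(A, 1)                           # like GROUP A BY $1
older = [r for r in A if r[1] > 40]                # like FILTER A BY $1 > 40
ordered = sorted(A, key=lambda r: (-r[1], r[0]))   # like ORDER A BY $1 DESC, $0
joined = inner_join(A, 2, B, 0)                    # like JOIN A BY $2, B BY $0

print(grouped)
print(joined)
```

Note that the grouped output nests the original tuples in a bag next to the key, while the joined output keeps the fields of both inputs side by side, including both copies of the join key.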

• LIMIT A 10;

• SAMPLE A 0.1;

• GROUP A BY $1 PARALLEL 10;

• User Defined Functions (UDFs) and Piggybank
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends';
backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);

• Invoking static Java methods

• FLATTEN

• TOKENIZE
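
TOKENIZE and FLATTEN are often used together, for example to turn lines of text into one record per word. The Python sketch below models that combination; the helper names and sample lines are hypothetical, and the tokenizer splits on whitespace only, which is a simplifying assumption.

```python
def tokenize(text):
    """Model of TOKENIZE: split a chararray into a bag of one-field tuples.

    Splits on whitespace only (a simplifying assumption).
    """
    return [(word,) for word in text.split()]


def flatten(records, bag_index):
    """Model of FLATTEN on a bag field.

    Emits one output tuple per element of the bag, with the element's
    fields un-nested next to the record's other fields. A record whose
    bag is empty produces no output.
    """
    out = []
    for record in records:
        for element in record[bag_index]:
            out.append(record[:bag_index] + element + record[bag_index + 1:])
    return out


lines = [(1, "pigs eat anything"), (2, "pigs fly")]
with_bags = [(line_id, tokenize(text)) for line_id, text in lines]
words = flatten(with_bags, 1)

print(with_bags)
print(words)
```

After flattening, each word is its own record and can be grouped and counted like any other field.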

0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
– Boolean type
– OTHERWISE branch for SPLIT
– Maps, Bags and Tuples can be generated without UDFs
– Register collection of jars
• Performance Improvements

Current work in progress
• DateTime datatype
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint

References
• Labs are from
– https://github.com/alanfgates/programmingpig
– https://github.com/michiard/CLOUDS-LAB

• 0.10.0 Features and current WIP
– http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan Gates

Hortonworks Training
The expert source for
Apache Hadoop training & certification

Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available

Comprehensive Apache Hadoop Certification
– Become a trusted and valuable
Apache Hadoop expert

Thank You!
Questions & Answers
Ravi Mutyala
Systems Architect
Hortonworks
Twitter: @rmutyala
www.hortonworks.com
