Introduction to Pig
Agenda
 What Is Pig, and What Is It Used For?
 Pig Philosophy
 Pig’s Data Model
 Pig Example
 Pig Latin
 Pig Latin vs SQL
 Pig Macros
 Pig UDFs
What Is Pig, and What Is It Used For?
 Pig has an execution engine that runs data flows in parallel, much as MapReduce
distributes map tasks across the cluster nodes to get a job done.
 Pig uses its own Pig Latin language for expressing these data flows.
 Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.
 By default, Pig reads input files from HDFS, uses HDFS to store intermediate data
between MapReduce jobs, and writes its output to HDFS.
 Pig Latin use cases tend to fall into three categories: traditional
extract-transform-load (ETL) data pipelines, research on raw data, and iterative
processing.
Pig Philosophy
 Pigs eat anything
 Pigs live anywhere
 Pigs are domestic animals
 Pigs fly
Pig’s Data Model : Types
 Pig’s data types can be divided into two categories: scalar types, which
contain a single value, and complex types, which contain other types.
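As an illustrative sketch of the complex types (the file and field names here are hypothetical), a tuple, a bag, and a map can all be declared in a LOAD schema:

```pig
-- Scalar types: int, long, float, double, chararray, bytearray.
-- Complex types: tuple, bag, and map, which can nest other types.
A = LOAD 'complex.txt'
    AS (point:tuple(x:int, y:int),   -- tuple: a fixed, ordered set of fields
        readings:bag{t:(v:double)},  -- bag: an unordered collection of tuples
        props:map[chararray]);       -- map: chararray keys mapped to values
DESCRIBE A;
```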
Pig’s Data Model : Schemas
 If a schema for the data is available, Pig will make use of it, both for up-front error
checking and for optimization
 Syntax: loads = LOAD 'data.txt' AS (col1:int, col2:chararray, col3:chararray,
col4:float);
 It is also possible to specify the schema without giving explicit data types; in that
case, each field’s type defaults to bytearray.
 Syntax: loads = LOAD 'data.txt' AS (col1, col2, col3, col4);
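When no types are given, every field is a bytearray, so downstream operators typically cast the fields they use. A minimal sketch, reusing the hypothetical data.txt from above:

```pig
loads = LOAD 'data.txt' AS (col1, col2, col3, col4);  -- all fields are bytearray
-- Cast fields explicitly where concrete types are needed:
casted = FOREACH loads GENERATE (int)col1, (chararray)col2, (float)col4;
```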
Pig Example
 Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and
provides a shell for users to interact with HDFS.
 To enter Grunt, run pig -x local, pig -x mapreduce, or pig -x tez.
 records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray,
temperature:int, quality:int);
 filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4,
5, 9);
 grouped_records = GROUP filtered_records BY year;
 max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
 DUMP max_temp;
Pig Latin : Relational Operators
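As a brief sketch of a few common relational operators (the relations A and F, and their columns, are hypothetical):

```pig
B = FILTER A BY col1 > 0;      -- keep only tuples matching a condition
C = ORDER B BY col1 DESC;      -- sort a relation
D = DISTINCT C;                -- remove duplicate tuples
E = JOIN D BY col1, F BY id;   -- inner join two relations on keys
G = LIMIT E 10;                -- keep at most the first 10 tuples
```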
Pig Latin : Diagnostic/UDF Operators
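Applied to the relations defined in the earlier example, the diagnostic operators might be used as follows (a sketch; the actual output depends on the data):

```pig
DESCRIBE records;     -- print the schema of a relation
EXPLAIN max_temp;     -- show the logical, physical, and MapReduce plans
ILLUSTRATE max_temp;  -- walk a small data sample through the flow
DUMP max_temp;        -- execute the flow and print the result
```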
Pig Latin vs SQL
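As an illustrative comparison (reusing the records relation from the earlier example): SQL expresses a query as a single declarative statement, while Pig Latin builds the same result as a step-by-step data flow:

```pig
-- SQL: SELECT year, MAX(temperature) FROM records GROUP BY year;
-- The equivalent Pig Latin pipeline:
grouped = GROUP records BY year;
result  = FOREACH grouped GENERATE group, MAX(records.temperature);
```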
Pig Macros
 Macros provide a way to package reusable pieces of Pig Latin code from within
Pig Latin itself.
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X BY $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int,
quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
Pig UDFs
 A Filter UDF
 An Eval UDF
 A Load UDF
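A sketch of how a UDF is typically registered and invoked from Pig Latin (the jar name, class name, and columns are hypothetical):

```pig
REGISTER 'myudfs.jar';                          -- jar containing the UDF classes
DEFINE IsGood com.example.pig.IsGoodQuality();  -- alias for a filter UDF
good = FILTER records BY IsGood(quality);       -- used like a built-in function
```

A filter UDF returns a boolean for use in FILTER, an eval UDF computes a value per tuple, and a load UDF implements a custom input format for LOAD.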