Pig Latin
By Sadiq Basha
Pig Latin-Basics
• Pig Latin is the language used to analyze
data in Hadoop using Apache Pig.
• Pig Latin – Data Model
• The data model of Pig is fully nested.
• A Relation is the outermost structure of
the Pig Latin data model. And it is a bag
where −
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
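As a sketch of this nesting (the values and names below are hypothetical, not from the slides):

```
-- A field:  'Rajiv'
-- A tuple:  (1, 'Rajiv', 'Hyderabad')
-- A bag:    {(1,'Rajiv','Hyderabad'), (2,'Siddarth','Kolkata')}
-- A relation is the outer bag of such tuples, e.g. the result of a LOAD:
students = LOAD 'student_data.txt' USING PigStorage(',')
           AS (id:int, name:chararray, city:chararray);
```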
Pig Latin – Statements
 While processing data using Pig Latin, statements are the
basic constructs.
 These statements work with relations. They include
expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided
by Pig Latin, through statements.
 Except LOAD and STORE, while performing all other
operations, Pig Latin statements take a relation as input and
produce another relation as output.
 As soon as you enter a LOAD statement in the Grunt shell, only its
semantic checking is carried out. To see the contents of the relation,
you need to use the DUMP operator. Only after the DUMP operation is
performed does the MapReduce job that loads the data from the file
system actually run.
Loading Data using LOAD
statement
• Example
• Given below is a Pig Latin statement,
which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray);
Pig Latin – Data types
Null Values
• Values for all the above data types can
be NULL. Apache Pig treats null
values in a similar way as SQL does.
• A null can be an unknown value or a
non-existent value. It is used as a
placeholder for optional values. These
nulls can occur naturally or can be
the result of an operation.
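For instance, nulls can be tested and filtered explicitly. A minimal sketch (the relation and field names are hypothetical):

```
-- keep only tuples whose age field is present
valid_data   = FILTER student_details BY age IS NOT NULL;
-- keep only tuples whose age field is missing
missing_data = FILTER student_details BY age IS NULL;
```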
Pig Latin – Type Construction
Operators
Pig Latin – Relational
Operations
Apache Pig - Reading Data
• In general, Apache Pig works on top of Hadoop.
• It is an analytical tool that analyzes large
datasets that exist in the Hadoop File System.
• To analyze data using Apache Pig, we have to
initially load the data into Apache Pig.
• Steps to load data into Pig using LOAD
Function.
• Copy the local text file into HDFS under a directory named pig_data.
• The input file contains data separated by ',' (comma).
Loading Data using LOAD
Operator
• You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD
operator of Pig Latin.
 Syntax
• The load statement consists of two parts divided by the “=” operator.
• On the left-hand side, we need to mention the name of the relation where we want
to store the data, and on the right-hand side, we have to define how we store the
data.
 Given below is the syntax of the Load operator.
• Relation_name = LOAD 'Input file path' USING function AS schema;
• Where,
• relation_name − We have to mention the relation in which we want to store the
data.
• Input file path − We have to mention the HDFS directory where the file is stored.
(In MapReduce mode)
• function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required
schema as follows −
 (column1 : data type, column2 : data type, column3 : data type);
Loading Data using LOAD
Operator
 Start the Pig Grunt Shell
• First of all, open the Linux terminal. Start the Pig
Grunt shell in MapReduce mode as shown below.
 $ pig -x mapreduce
 Execute the Load Statement
• Now load the data from the file student_data.txt
into Pig by executing the following Pig Latin
statement in the Grunt shell.
 grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt' USING
PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
Apache Pig - Storing Data
• Previously, we learnt how to load data into Apache Pig. You can store
the loaded data in the file system using the STORE operator. This
chapter explains how to store data in Apache Pig using the
STORE operator.
 Syntax
• Given below is the syntax of the Store statement.
 STORE Relation_name INTO ' required_directory_path '
[USING function];
 Example:
 Now, let us store the relation ‘student’ in the HDFS directory
“/pig_Output/” as shown below.
 grunt> STORE student INTO
'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
 We can verify the contents of the output HDFS file using the cat
command.
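For example, from the Grunt shell itself (the part-file name below is an assumption about what the store job writes; it may differ):

```
grunt> fs -cat hdfs://localhost:9000/pig_Output/part-m-00000
```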
Apache Pig - Diagnostic
Operators
• The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of
diagnostic operators −
 Dump operator
 Describe operator
 Explain operator
 Illustrate operator
 Dump Operator
• The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally
used for debugging purposes.
 Syntax:
 grunt> Dump Relation_Name
 Example
• Assume we have a file student_data.txt in HDFS.
• And we have read it into a relation student using the LOAD operator as shown below.
 grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
 grunt> Dump student
• Once you execute the above Pig Latin statement, it starts a MapReduce job to read data from HDFS.
• Note: running the LOAD statement alone does not load the data into the relation student.
• Executing the Dump statement triggers the actual load.
Apache Pig - Describe
Operator
• The describe operator is used to view the schema of a
relation.
 Syntax:
 grunt> Describe Relation_name
 Example:
 grunt> describe student;
• Where student is the relation name.
 Output
• Once you execute the above Pig Latin statement, it will
produce the following output.
 student: { id: int, firstname: chararray, lastname: chararray,
phone: chararray, city: chararray }
Apache Pig - Explain Operator
• The explain operator is used to display the logical,
physical, and MapReduce execution plans of a
relation.
 Syntax:
 grunt> explain Relation_name;
 Example:
 grunt> explain student;
 Output:
• It will display the logical, physical, and MapReduce execution
plans of the relation.
Apache Pig - Illustrate
Operator
• The illustrate operator gives you the step-by-step
execution of a sequence of statements.
 Syntax:
 grunt> illustrate Relation_name;
 Example:
• Assume we have a relation student.
 grunt> illustrate student;
 Output:
• On executing the above statement, you will get the
following output.
Apache Pig - Group Operator
• The GROUP operator is used to group the data in one or more relations. It collects the data
having the same key.
 Syntax:
 grunt> Group_data = GROUP Relation_name BY column_name;
 Example:
• Assume we have a relation with name student_details with student details like id, name, age
etc.
• Now, let us group the records/tuples in the relation by age as shown below.
 grunt> group_data = GROUP student_details by age;
 Verification:
• Verify the relation group_data using the DUMP operator as shown below.
 grunt> Dump group_data;
 Output:
• Then you will get output displaying the contents of the relation named group_data as shown
below. Here you can observe that the resulting schema has two columns
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples, student records with the respective age.
 You can see the schema of the table after grouping the data using the describe command as
shown below.
 grunt> Describe group_data;
 group_data: {group: int, student_details: {(id: int, firstname:
chararray, lastname: chararray, age: int, phone: chararray, city: chararray)}}
Group Multiple columns
• In the same way, you can get a sample illustration of the schema
using the illustrate command as shown below.
 grunt> illustrate group_data;
 Output:
 Grouping by Multiple Columns:
 We can group the data using multiple columns also as shown below.
 grunt> group_multiple = GROUP student_details by (age, city);
 Group All:
• We can group the data using all columns also as shown below.
 grunt> group_all = GROUP student_details ALL;
 Now, verify the content of the relation group_all as shown below.
 grunt> Dump group_all;
Apache Pig - Cogroup
Operator
• The COGROUP operator works more or less in the same way as the
GROUP operator. The only difference between the two operators is
that the group operator is normally used with one relation, while the
cogroup operator is used in statements involving two or more
relations.
 Grouping Two Relations using Cogroup:
• Assume that we have two relations namely student_details and
employee_details in pig.
• Now, let us group the records/tuples of the relations
student_details and employee_details with the key age, as shown
below.
 grunt> cogroup_data = COGROUP student_details by age,
employee_details by age;
 Verification
• Verify the relation cogroup_data using the DUMP operator as shown
below.
 grunt> Dump cogroup_data;
COGROUP Operator
 Output
• It will produce the following output, displaying the contents of the
relation named cogroup_data as shown below.
• The cogroup operator groups the tuples from each relation
according to age where each group depicts a particular age value.
• For example, if we consider the 1st tuple of the result, it is grouped
by age 21. And it contains two bags −
 the first bag holds all the tuples from the first relation
(student_details in this case) having age 21, and
 the second bag contains all the tuples from the second relation
(employee_details in this case) having age 21.
• In case a relation doesn’t have tuples having the age value 21, it
returns an empty bag.
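As a hedged illustration with made-up tuples: if student_details held (1,Raju,21) and (2,Siddarth,22), and employee_details held (1,Robin,22), the cogrouped result would look roughly like this:

```
-- cogroup_data (illustrative, hypothetical data)
(21, {(1,Raju,21)},     {})
(22, {(2,Siddarth,22)}, {(1,Robin,22)})
```

Note the empty second bag for age 21, since employee_details has no tuple with that age.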
Apache Pig - Join Operator
• The JOIN operator is used to combine records from two or more relations.
• While performing a join operation, we declare one (or a group of) tuple(s) from each
relation, as keys.
• When these keys match, the two particular tuples are matched, else the records
are dropped.
• Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
• Joins in Pig are similar to SQL joins. In Pig we join relations,
where in SQL we join tables.
• We will see only the syntax of the different joins in Pig.
 Self – join:
 grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
 Example
• Let us perform a self-join on the customers data. Since Pig cannot
join a relation with itself directly, the same data is loaded under two
alias names, customers1 and customers2, and joined on the key as shown below.
• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
 Verification:
 grunt> Dump customers3;
JOINS
 Inner Join:
 Syntax:
• grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
 Example
• Let us perform inner join operation on the two relations customers and orders as shown below.
 grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
 Verification:
 grunt> Dump customer_orders;
 Left Outer Join:
 grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
 Example
• Let us perform left outer join operation on the two relations customers and orders as shown below.
 grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_left;
 Right Outer Join:
 Syntax:
• grunt> Relation3_name = JOIN Relation1_name BY key RIGHT OUTER, Relation2_name BY key;
 Example
• Let us perform right outer join operation on the two relations customers and orders as shown below.
 grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_right;
JOINS
 Full Outer Join:
 Syntax:
• grunt> Relation3_name = JOIN Relation1_name BY key FULL OUTER, Relation2_name BY key;
 Example
• Let us perform full outer join operation on the two relations customers and
orders as shown below.
 grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_full;
 Using Multiple Keys:
 grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name
BY (key1, key2);
 Example:
 grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
 Verification
 grunt> Dump emp;
Apache Pig - Cross Operator
• The CROSS operator computes the cross-product of two or more relations.
 Syntax:
• grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
 Example:
• Assume that we have two Pig relations namely customers and orders.
• Let us now get the cross-product of these two relations using the cross operator
on these two relations as shown below.
 grunt> cross_data = CROSS customers, orders;
 Verification:
 grunt> Dump cross_data;
 Output:
• It will produce the following output, displaying the contents of the relation
cross_data.
• The output is the cross product: each tuple in the relation customers is
paired with every tuple in the relation orders.
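As a sketch with hypothetical data: if customers held 2 tuples and orders held 3, cross_data would contain 2 × 3 = 6 tuples:

```
-- customers: (1,Ramesh), (2,Khilan)          (hypothetical)
-- orders:    (100,2009-10-08), (101,2009-11-20), (102,2009-10-08)
-- cross_data then holds every pairing, e.g.
(1,Ramesh,100,2009-10-08)
(1,Ramesh,101,2009-11-20)
(1,Ramesh,102,2009-10-08)
(2,Khilan,100,2009-10-08)
(2,Khilan,101,2009-11-20)
(2,Khilan,102,2009-10-08)
```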
Apache Pig - Union Operator
• The UNION operator of Pig Latin is used to merge the content of two
relations. To perform UNION operation on two relations, their
columns and domains must be identical.
 Syntax:
• grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
 Example:
• Assume that we have two relations namely student1 and student2,
containing the same number and types of columns but different data.
• Let us now merge the contents of these two relations using the
UNION operator as shown below.
 grunt> student = UNION student1, student2;
 Verification:
 grunt> Dump student;
 Output:
• It combines the records of both the relations student1 and student2
into the relation student.
Apache Pig - Split Operator
• The SPLIT operator is used to split a relation into two or more relations.
 Syntax:
 grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1),
Relation3_name IF (condition2);
 Example:
• Assume that we have a relation named student_details.
• Let us now split the relation into two, one listing the students of age less than
23, and the other listing the students having the age between 22 and 25.
 SPLIT student_details into student_details1 if age<23, student_details2 if
(age>22 and age<25);
 Verification:
 grunt> Dump student_details1;
 grunt> Dump student_details2;
 Output:
• It will produce the following output, displaying the contents of the relations
student_details1 and student_details2 respectively.
Apache Pig - Filter Operator
• The FILTER operator is used to select the required tuples from a
relation based on a condition.
 Syntax:
 grunt> Relation2_name = FILTER Relation1_name BY (condition);
 Example:
• Assume that we have a relation named student_details.
• Let us now use the Filter operator to get the details of the students
who belong to the city Chennai.
 filter_data = FILTER student_details BY city == 'Chennai';
 Verification:
• grunt> Dump filter_data;
 Output:
• It will produce the following output, displaying the contents of the
relation filter_data as follows.
 (6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Apache Pig - Distinct
Operator
• The DISTINCT operator is used to remove redundant (duplicate)
tuples from a relation.
 Syntax:
 grunt> Relation_name2 = DISTINCT Relation_name1;
 Example:
• Assume that we have a relation named student_details.
• Let us now remove the redundant (duplicate) tuples from the
relation named student_details using the DISTINCT operator, and
store it as another relation named distinct_data as shown below.
 grunt> distinct_data = DISTINCT student_details;
 Verification:
 grunt> Dump distinct_data;
 Output:
• Dump operator will display the distinct_data producing the distinct
rows from student_details table.
Apache Pig - Foreach
Operator
• The FOREACH operator is used to generate specified data transformations based on the column data.
• The name itself indicates that for each element of a data bag, the respective action will be performed.
 Syntax:
• grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
 Example
• Assume that we have a relation named student_details.
 grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
• Let us now get the id, age, and city values of each student from the relation student_details and store it into
another relation named foreach_data using the foreach operator as shown below.
 grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
 Verification:
 grunt> Dump foreach_data;
 Output:
 Dump operator displays the below data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Apache Pig - Order By
• The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
 Syntax:
• grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
 Example:
• Assume that we have a relation named student_details.
 grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
 Let us now sort the relation in a descending order based on the age of the student
and store it into another relation named order_by_data using the ORDER BY
operator as shown below.
 grunt> order_by_data = ORDER student_details BY age DESC;
 Verification:
 grunt> Dump order_by_data;
 Output:
 Dump operator will produce the student_details data sorted by age in descending
order.
Apache Pig - Limit Operator
• The LIMIT operator is used to get a limited number of tuples from a relation.
 Syntax:
• grunt> Result = LIMIT Relation_name number_of_tuples;
 Example
• Assume that we have a file named student_details.
 grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
 grunt> limit_data = LIMIT student_details 4;
 Verification:
 grunt> Dump limit_data;
 Output:
• Dump operator will produce the data of student_details limited to the
first 4 rows.
 (1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
Apache Pig - Load & Store
Functions
• The Load and Store functions in Apache Pig determine how data goes into and comes out
of Pig. These functions are used with the load and store operators. Given below is the list of load and
store functions available in Pig.
Apache Pig - Bag & Tuple
Functions
Apache Pig - String Functions
Apache Pig - String Functions
Apache Pig - String Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - User Defined
Functions
• In addition to the built-in functions, Apache Pig provides
extensive support for User Defined Functions (UDFs). Using
these UDFs, we can define our own functions and use them.
UDF support is provided in six programming languages,
namely Java, Jython, Python, JavaScript, Ruby and Groovy.
• For writing UDFs, complete support is provided in Java and
limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the
processing, like data load/store, column transformation, and
aggregation. Since Apache Pig itself is written in Java, UDFs
written in Java work more efficiently than those in other
languages.
• In Apache Pig, we also have a Java repository for UDFs
named Piggybank. Using Piggybank, we can access Java
UDFs written by other users, and contribute our own UDFs.
Types of UDF’s in Java
• While writing UDFs using Java, we can create and use
the following three types of functions −
• Filter Functions − The filter functions are used as
conditions in filter statements. These functions accept a
Pig value as input and return a Boolean value.
• Eval Functions − The Eval functions are used in
FOREACH-GENERATE statements. These functions
accept a Pig value as input and return a Pig result.
• Algebraic Functions − The Algebraic functions act on
inner bags in a FOREACH-GENERATE statement. These
functions are used to perform full MapReduce
operations on an inner bag.
• An eval UDF must extend "org.apache.pig.EvalFunc"
and override the "exec" method.
UDF Example
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    // Converts the first field of the input tuple to upper case.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
Create a jar of the above code as myudfs.jar.
Now write the script in a file and save it as .pig. Here I am using script.pig.
• -- script.pig
• REGISTER myudfs.jar;
• A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
• B = FOREACH A GENERATE myudfs.UPPER(name);
• DUMP B;
Finally run the script in the terminal to get the output.
Apache Pig - Running Scripts
• We will see how to run Apache Pig scripts in batch
mode.
 Comments in Pig Script
• While writing a script in a file, we can include
comments in it as shown below.
 Multi-line comments
• We will begin the multi-line comments with '/*', end
them with '*/'.
 /* These are the multi-line comments In the pig script
*/
 Single-line comments
• We will begin the single-line comments with '--'.
 --we can write single line comments like this.
Executing Pig Script in Batch
mode
 Executing Pig Script in Batch mode:
• While executing Apache Pig statements in batch mode, follow the steps given below.
 Step 1
• Write all the required Pig Latin statements in a single file. We can write all the Pig Latin
statements and commands in a single file and save it as .pig file.
 Step 2
• Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown
below.
 Local mode:
 $ pig -x local Sample_script.pig
 MapReduce mode:
 $ pig -x mapreduce Sample_script.pig
• You can execute it from the Grunt shell as well using the exec command as shown below.
 grunt> exec /sample_script.pig
• Executing a Pig Script from HDFS
• We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with
the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as
shown below.
 $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Executing pig script from HDFS
 Example:
• We have a sample script with the name sample_script.pig, in
the same HDFS directory. This file contains statements
performing operations and transformations on the student
relation, as shown below.
 student = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
 student_order = ORDER student BY age DESC;
 student_limit = LIMIT student_order 4;
 Dump student_limit;
• Let us now execute the sample_script.pig as shown below.
 $ pig -x mapreduce
hdfs://localhost:9000/pig_data/sample_script.pig
• Apache Pig gets executed and gives you the output.
WORD COUNT EXAMPLE - PIG
SCRIPT
• How to find the number of occurrences of the words in a file using
the pig script?
 Word Count Example Using Pig Script:
 lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
 words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as
word;
 grouped = GROUP words BY word;
 wordcount = FOREACH grouped GENERATE group, COUNT(words);
 DUMP wordcount;
 The above Pig script first splits each line into words using
the TOKENIZE function, which produces a bag of words. The FLATTEN
operator then un-nests that bag, so each word becomes its own tuple.
In the third statement the words are grouped by word, and in the
fourth the count of each group is computed.
With just 5 lines of Pig, the word count problem is solved.

More Related Content

PPTX
Network Security- Secure Socket Layer
PPT
17. Trees and Graphs
PDF
Chapter 2 - Multimedia Communications
PDF
MULTIMEDIA COMMUNICATION & NETWORKS
PPTX
Binary Search Tree
PPS
PPTX
Graph in data structure
PPT
Strategic capacity planning for products and services
Network Security- Secure Socket Layer
17. Trees and Graphs
Chapter 2 - Multimedia Communications
MULTIMEDIA COMMUNICATION & NETWORKS
Binary Search Tree
Graph in data structure
Strategic capacity planning for products and services

What's hot (20)

PPTX
Apache PIG
PPT
File organization 1
PPTX
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
PPTX
Structure of dbms
PPTX
PDF
Network programming Using Python
PPTX
Decomposition methods in DBMS
PPTX
Degree of relationship set
PPTX
Linux network file system (nfs)
PDF
Nested Queries Lecture
PPTX
Introduction to HDFS
PPTX
Chapter1
PPTX
Map Reduce
PPT
SQLITE Android
PPTX
Concurrency Control in Database Management System
PPTX
Functional dependencies in Database Management System
PPT
Hierarchical Object Oriented Design
PPTX
This pointer
PPT
ADO.NET
PDF
Identifying classes and objects ooad
Apache PIG
File organization 1
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
Structure of dbms
Network programming Using Python
Decomposition methods in DBMS
Degree of relationship set
Linux network file system (nfs)
Nested Queries Lecture
Introduction to HDFS
Chapter1
Map Reduce
SQLITE Android
Concurrency Control in Database Management System
Functional dependencies in Database Management System
Hierarchical Object Oriented Design
This pointer
ADO.NET
Identifying classes and objects ooad
Ad

Similar to Pig latin (20)

PPTX
Session 04 pig - slides
PPTX
Unit-5 [Pig] working and architecture.pptx
PPTX
Pig_Presentation
PDF
06 pig-01-intro
PDF
pig intro.pdf
PPTX
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
PDF
Apache Pig: A big data processor
PPTX
Unit 4 lecture-3
PPT
pig.ppt
PPTX
03 pig intro
PPTX
A slide share pig in CCS334 for big data analytics
PPTX
PDF
Big Data - Lab A1 (SC 11 Tutorial)
PPTX
AWS Hadoop and PIG and overview
PPTX
04 pig data operations
PPTX
Aggregate.pptx
PDF
An Overview of Hadoop
PDF
Spark Performance Tuning .pdf
PDF
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Session 04 pig - slides
Unit-5 [Pig] working and architecture.pptx
Pig_Presentation
06 pig-01-intro
pig intro.pdf
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache Pig: A big data processor
Unit 4 lecture-3
pig.ppt
03 pig intro
A slide share pig in CCS334 for big data analytics
Big Data - Lab A1 (SC 11 Tutorial)
AWS Hadoop and PIG and overview
04 pig data operations
Aggregate.pptx
An Overview of Hadoop
Spark Performance Tuning .pdf
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Ad

Recently uploaded (20)

PDF
Complications of Minimal Access Surgery at WLH
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
Institutional Correction lecture only . . .
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
Cell Types and Its function , kingdom of life
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Business Ethics Teaching Materials for college
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Complications of Minimal Access Surgery at WLH
STATICS OF THE RIGID BODIES Hibbelers.pdf
Cell Structure & Organelles in detailed.
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
VCE English Exam - Section C Student Revision Booklet
Microbial disease of the cardiovascular and lymphatic systems
Institutional Correction lecture only . . .
Final Presentation General Medicine 03-08-2024.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
Cell Types and Its function , kingdom of life
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPH.pptx obstetrics and gynecology in nursing
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Business Ethics Teaching Materials for college
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
3rd Neelam Sanjeevareddy Memorial Lecture.pdf

Pig latin

  • 2. Pig Latin-Basics • Pig Latin is the language used to analyze data in Hadoop using Apache Pig. • Pig Latin – Data Model • The data model of Pig is fully nested. • A Relation is the outermost structure of the Pig Latin data model. And it is a bag where − • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 3. Pig Latin – Statemets  While processing data using Pig Latin, statements are the basic constructs.  These statements work with relations. They include expressions and schemas.  Every statement ends with a semicolon (;).  We will perform various operations using operators provided by Pig Latin, through statements.  Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.  As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation, the MapReduce job for loading the data into the file system will be carried out.
  • 4. Loading Data using LOAD statement • Example • Given below is a Pig Latin statement, which loads data to Apache Pig. grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
  • 5. Pig Latin – Data types
  • 6. Pig Latin – Data types Null Values • Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does. • A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
  • 7. Pig Latin – Type Construction Operators
  • 8. Pig Latin – Relational Operations
  • 9. Pig Latin – Relational Operations
  • 10. Pig Latin – Relational Operations
  • 11. Apache Pig - Reading Data • In general, Apache Pig works on top of Hadoop. • It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. • To analyze data using Apache Pig, we have to initially load the data into Apache Pig. • Steps to load data into Pig using LOAD Function. • Copy the local text file into HDFS file in a directory name pig_data. • Input file contains data separated by ‘,’(comma).
  • 12. Loading Data using LOAD Operator • You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator of Pig Latin.  Syntax • The load statement consists of two parts divided by the “=” operator. • On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how we store the data.  Given below is the syntax of the Load operator. • Relation_name = LOAD 'Input file path' USING function as schema • Where, • relation_name − We have to mention the relation in which we want to store the data. • Input file path − We have to mention the HDFS directory where the file is stored. (In MapReduce mode) • function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader). • Schema − We have to define the schema of the data. We can define the required schema as follows −  (column1 : data type, column2 : data type, column3 : data type);
•	13. Loading Data using LOAD Operator  Start the Pig Grunt Shell • First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.  $ pig -x mapreduce  Execute the Load Statement • Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.  grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
  • 14. Loading Data using LOAD Operator
•	15. Apache Pig - Storing Data • In the previous section, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the Store operator.  Syntax • Given below is the syntax of the Store statement.  STORE Relation_name INTO 'required_directory_path' [USING function];  Example:  Now, let us store the relation ‘student’ in the HDFS directory “/pig_Output/” as shown below.  grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');  We can verify the contents of the output HDFS file using the cat command.
•	16. Apache Pig - Diagnostic Operators • The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −  Dump operator  Describe operator  Explain operator  Illustrate operator  Dump Operator • The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.  Syntax:  grunt> Dump Relation_Name;  Example • Assume we have a file student_data.txt in HDFS. • And we have read it into a relation student using the LOAD operator as shown below.  grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );  grunt> Dump student; • Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. • Note: Running the LOAD statement alone does not read the data into the relation student. • The data is actually read only when the Dump statement is executed.
  • 17. Apache Pig - Describe Operator • The describe operator is used to view the schema of a relation.  Syntax:  grunt> Describe Relation_name  Example:  grunt> describe student; • Where student is the relation name.  Output • Once you execute the above Pig Latin statement, it will produce the following output.  grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
  • 18. Apache Pig - Explain Operator • The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.  Syntax:  grunt> explain Relation_name;  Example:  grunt> explain student;  Output: • It will produce the following output as in the attachment.
  • 19. Apache Pig - Illustrate Operator • The illustrate operator gives you the step-by-step execution of a sequence of statements.  Syntax:  grunt> illustrate Relation_name;  Example: • Assume we have a relation student.  grunt> illustrate student;  Output: • On executing the above statement, you will get the following output.
•	20. Apache Pig - Group Operator • The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.  Syntax:  grunt> Group_data = GROUP Relation_name BY column_name;  Example: • Assume we have a relation named student_details with student details like id, name, age etc. • Now, let us group the records/tuples in the relation by age as shown below.  grunt> group_data = GROUP student_details by age;  Verification: • Verify the relation group_data using the DUMP operator as shown below.  grunt> Dump group_data;  Output: • Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns • One is age, by which we have grouped the relation. • The other is a bag, which contains the group of tuples, student records with the respective age.  You can see the schema of the table after grouping the data using the describe command as shown below.  grunt> Describe group_data; group_data: {group: int,student_details: {(id: int,firstname: chararray, lastname: chararray,age: int,phone: chararray,city: chararray)}}
•	21. Group Multiple columns • In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.  grunt> illustrate group_data;  Output:  Grouping by Multiple Columns:  We can group the data using multiple columns also as shown below.  grunt> group_multiple = GROUP student_details by (age, city);  Group All: • We can group the data using all columns also as shown below.  grunt> group_all = GROUP student_details All;  Now, verify the content of the relation group_all as shown below.  grunt> Dump group_all;
  • 22. Apache Pig - Cogroup Operator • The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.  Grouping Two Relations using Cogroup: • Assume that we have two relations namely student_details and employee_details in pig. • Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.  grunt> cogroup_data = COGROUP student_details by age, employee_details by age;  Verification • Verify the relation cogroup_data using the DUMP operator as shown below.  grunt> Dump cogroup_data;
  • 23. COGROUP Operator  Output • It will produce the following output, displaying the contents of the relation named cogroup_data as shown below. • The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value. • For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −  the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and  the second bag contains all the tuples from the second relation (employee_details in this case) having age 21. • In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.
•	24. Apache Pig - Join Operator • The JOIN operator is used to combine records from two or more relations. • While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. • When these keys match, the two particular tuples are matched, else the records are dropped. • Joins can be of the following types −  Self-join  Inner-join  Outer-join − left join, right join, and full join • Joins in PIG are similar to SQL joins. In Pig, we will be joining the relations where in SQL we will be joining the tables. • We see only the syntaxes of the different joins in PIG.  Self – join:  grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;  Example • To perform a self-join, the same data is loaded twice under different aliases. Let us perform the self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below. • grunt> customers3 = JOIN customers1 BY id, customers2 BY id;  Verification:  grunt> Dump customers3;
•	25. JOINS  Inner Join:  Syntax: • grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;  Example • Let us perform the inner join operation on the two relations customers and orders as shown below.  grunt> customer_orders = JOIN customers BY id, orders BY customer_id;  Verification:  grunt> Dump customer_orders;  Left Outer Join:  grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;  Example • Let us perform the left outer join operation on the two relations customers and orders as shown below.  grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_left;  Right Outer Join:  grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;  Example • Let us perform the right outer join operation on the two relations customers and orders as shown below.  grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_right;
•	26. JOINS  Full Outer Join: • grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;  Example • Let us perform the full outer join operation on the two relations customers and orders as shown below.  grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_full;  Using Multiple Keys:  grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);  Example:  grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);  Verification  grunt> Dump emp;
•	27. Apache Pig - Cross Operator • The CROSS operator computes the cross-product of two or more relations.  Syntax: • grunt> Relation3_name = CROSS Relation1_name, Relation2_name;  Example: • Assume that we have two Pig relations namely customers and orders. • Let us now get the cross-product of these two relations using the cross operator on these two relations as shown below.  grunt> cross_data = CROSS customers, orders;  Verification:  grunt> Dump cross_data;  Output: • It will produce the following output, displaying the contents of the relation cross_data. • The output is the cross product: every row of the relation customers is paired with every row of orders.
  • 28. Apache Pig - Union Operator • The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.  Syntax: • grunt> Relation_name3 = UNION Relation_name1, Relation_name2;  Example: • Assume that we have two relations namely student1 and student2 containing same number and same type of columns. Data in the relations is different. • Let us now merge the contents of these two relations using the UNION operator as shown below.  grunt> student = UNION student1, student2;  Verification:  grunt> Dump student;  Output: • Combine the records in the both relations student1 and student2 into the relation student.
•	29. Apache Pig - Split Operator • The SPLIT operator is used to split a relation into two or more relations.  Syntax:  grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);  Example: • Assume that we have a relation named student_details. • Let us now split the relation into two, one listing the students of age less than 23, and the other listing the students having the age between 22 and 25.  SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);  Verification:  grunt> Dump student_details1;  grunt> Dump student_details2;  Output: • It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.
  • 30. Apache Pig - Filter Operator • The FILTER operator is used to select the required tuples from a relation based on a condition.  Syntax:  grunt> Relation2_name = FILTER Relation1_name BY (condition);  Example: • Assume that we have a relation named student_details. • Let us now use the Filter operator to get the details of the students who belong to the city Chennai.  filter_data = FILTER student_details BY city == 'Chennai';  Verification: • grunt> Dump filter_data;  Output: • It will produce the following output, displaying the contents of the relation filter_data as follows.  (6,Archana,Mishra,23,9848022335,Chennai) (8,Bharathi,Nambiayar,24,9848022333,Chennai)
  • 31. Apache Pig - Distinct Operator • The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.  Syntax:  grunt> Relation_name2 = DISTINCT Relation_name1;  Example: • Assume that we have a relation named student_details. • Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store it as another relation named distinct_data as shown below.  grunt> distinct_data = DISTINCT student_details;  Verification:  grunt> Dump distinct_data;  Output: • Dump operator will display the distinct_data producing the distinct rows from student_details table.
  • 32. Apache Pig - Foreach Operator • The FOREACH operator is used to generate specified data transformations based on the column data. • The name itself is indicating that for each element of a data bag, the respective action will be performed.  Syntax: • grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);  Example • Assume that we have a relation named student_details.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray); • Let us now get the id, age, and city values of each student from the relation student_details and store it into another relation named foreach_data using the foreach operator as shown below.  grunt> foreach_data = FOREACH student_details GENERATE id,age,city;  Verification:  grunt> Dump foreach_data;  Output:  Dump operator displays the below data. (1,21,Hyderabad) (2,22,Kolkata) (3,22,Delhi) (4,21,Pune) (5,23,Bhuwaneshwar) (6,23,Chennai) (7,24,trivendram) (8,24,Chennai)
•	33. Apache Pig - Order By • The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.  Syntax: • grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);  Example: • Assume that we have a relation named student_details.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);  Let us now sort the relation in a descending order based on the age of the student and store it into another relation named order_by_data using the ORDER BY operator as shown below.  grunt> order_by_data = ORDER student_details BY age DESC;  Verification:  grunt> Dump order_by_data;  Output:  Dump operator will produce the student_details data sorted by age in descending order.
•	34. Apache Pig - Limit Operator • The LIMIT operator is used to get a limited number of tuples from a relation.  Syntax: • grunt> Result = LIMIT Relation_name required_number_of_tuples;  Example • Assume that we have a file named student_details.txt.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);  grunt> limit_data = LIMIT student_details 4;  Verification:  grunt> Dump limit_data;  Output: • Dump operator will produce the data of student_details with a limited number of rows, i.e., 4 rows.  (1,Rajiv,Reddy,21,9848022337,Hyderabad) (2,siddarth,Battacharya,22,9848022338,Kolkata) (3,Rajesh,Khanna,22,9848022339,Delhi) (4,Preethi,Agarwal,21,9848022330,Pune)
•	35. Apache Pig - Load & Store Functions • The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.
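The function list on this slide did not survive extraction. As a sketch, the commonly listed functions (PigStorage, TextLoader, BinStorage) in use; file names and schemas are illustrative:

```pig
-- PigStorage: delimited text; the delimiter defaults to tab.
csv = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray);
-- TextLoader: each line of the file becomes a single chararray field.
raw = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);
-- BinStorage: Pig's internal binary format, useful for passing data between jobs.
STORE csv INTO 'binary_out' USING BinStorage();
```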
  • 36. Apache Pig - Bag & Tuple Functions
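The function table for this slide did not survive extraction. As a sketch, the conversion functions TOTUPLE, TOBAG, and TOMAP applied to a hypothetical relation:

```pig
data = LOAD 'student_data.txt' USING PigStorage(',')
       AS (id:int, name:chararray, age:int);
converted = FOREACH data GENERATE
    TOTUPLE(id, name) AS t,   -- wrap the listed fields in a tuple
    TOBAG(id, age)    AS b,   -- turn each expression into a tuple inside a bag
    TOMAP(name, age)  AS m;   -- build a map from alternating key/value arguments
```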
  • 37. Apache Pig - String Functions
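The string-function tables did not survive extraction. As a sketch, a few of the built-in string functions (UPPER, LOWER, SUBSTRING, TRIM) applied to a hypothetical relation:

```pig
names = LOAD 'student_data.txt' USING PigStorage(',')
        AS (id:int, firstname:chararray);
shaped = FOREACH names GENERATE
    UPPER(firstname)           AS upper_name,  -- all upper case
    LOWER(firstname)           AS lower_name,  -- all lower case
    SUBSTRING(firstname, 0, 3) AS prefix,      -- characters 0..2
    TRIM(firstname)            AS trimmed;     -- strip surrounding whitespace
```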
  • 40. Apache Pig - Date-time Functions
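The date-time function tables did not survive extraction. As a sketch, a few of the built-in date-time functions (ToDate, GetYear, CurrentTime) applied to a hypothetical relation of date strings:

```pig
events = LOAD 'events.txt' USING PigStorage(',')
         AS (id:int, date_str:chararray);
dated = FOREACH events GENERATE
    ToDate(date_str, 'yyyy-MM-dd')          AS dt,   -- parse string to datetime
    GetYear(ToDate(date_str, 'yyyy-MM-dd')) AS yr,   -- extract the year
    CurrentTime()                           AS now;  -- current datetime
```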
  • 44. Apache Pig - User Defined Functions • In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDF’s). Using these UDF’s, we can define our own functions and use them. The UDF support is provided in six programming languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy. • For writing UDF’s, complete support is provided in Java and limited support is provided in all the remaining languages. Using Java, you can write UDF’s involving all parts of the processing like data load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, the UDF’s written using Java language work efficiently compared to other languages. • In Apache Pig, we also have a Java repository for UDF’s named Piggybank. Using Piggybank, we can access Java UDF’s written by other users, and contribute our own UDF’s.
•	45. Types of UDF’s in Java • While writing UDF’s using Java, we can create and use the following three types of functions − • Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value. • Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result. • Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag. • Eval UDFs must extend "org.apache.pig.EvalFunc" • Such functions must override the "exec" method.
•	46. UDF Example package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try { String str = (String) input.get(0); return str.toUpperCase(); } catch (Exception e) { throw new IOException("Caught exception processing input row", e); } } } Create a jar of the above code as myudfs.jar. Now write the script in a file and save it as .pig. Here I am using script.pig. • -- script.pig • REGISTER myudfs.jar; • A = LOAD 'data' AS (name: chararray, age: int, gpa: float); • B = FOREACH A GENERATE myudfs.UPPER(name); • DUMP B; Finally, run the script in the terminal to get the output.
•	47. Apache Pig - Running Scripts • We will see how to run Apache Pig scripts in batch mode.  Comments in Pig Script • While writing a script in a file, we can include comments in it as shown below.  Multi-line comments • We begin multi-line comments with '/*' and end them with '*/'.  /* These are the multi-line comments In the pig script */  Single-line comments • We begin single-line comments with '--'.  --we can write single line comments like this.
  • 48. Executing Pig Script in Batch mode  Executing Pig Script in Batch mode: • While executing Apache Pig statements in batch mode, follow the steps given below.  Step 1 • Write all the required Pig Latin statements in a single file. We can write all the Pig Latin statements and commands in a single file and save it as .pig file.  Step 2 • Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.  Local mode:  $ pig -x local Sample_script.pig  MapReduce mode:  $ pig -x mapreduce Sample_script.pig • You can execute it from the Grunt shell as well using the exec command as shown below.  grunt> exec /sample_script.pig • Executing a Pig Script from HDFS • We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.  $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
•	49. Executing pig script from HDFS  Example: • We have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.  student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);  student_order = ORDER student BY age DESC;  student_limit = LIMIT student_order 4;  Dump student_limit; • Let us now execute the sample_script.pig as shown below.  $./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig • Apache Pig gets executed and gives you the output.
•	50. WORD COUNT EXAMPLE - PIG SCRIPT • How to find the number of occurrences of the words in a file using the pig script?  Word Count Example Using Pig Script:  lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);  words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;  grouped = GROUP words BY word;  wordcount = FOREACH grouped GENERATE group, COUNT(words);  DUMP wordcount;  The above Pig script first splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator un-nests that bag so that each word becomes a separate tuple. In the third statement the words are grouped together so that the count can be computed, which is done in the fourth statement. With just five lines of Pig, we have solved the word count problem.