Pig Latin
By Sadiq Basha
Pig Latin-Basics
• Pig Latin is the language used to analyze
data in Hadoop using Apache Pig.
• Pig Latin – Data Model
• The data model of Pig is fully nested.
• A Relation is the outermost structure of
the Pig Latin data model. And it is a bag
where −
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
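As a sketch of this nesting (the values and names below are hypothetical, not from the slides):

```
-- A field:  'Rajiv'
-- A tuple:  (1, 'Rajiv', 'Hyderabad')
-- A bag:    {(1,'Rajiv','Hyderabad'), (2,'Siddarth','Kolkata')}
-- A relation is the outer bag of such tuples, e.g. the result of a LOAD:
students = LOAD 'student_data.txt' USING PigStorage(',')
           AS (id:int, name:chararray, city:chararray);
```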
Pig Latin – Statements
 While processing data using Pig Latin, statements are the
basic constructs.
 These statements work with relations. They include
expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided
by Pig Latin, through statements.
 Except LOAD and STORE, while performing all other
operations, Pig Latin statements take a relation as input and
produce another relation as output.
 As soon as you enter a LOAD statement in the Grunt shell, only its
semantic checking is carried out. To see the contents of the relation,
you need to use the DUMP operator. Only after the DUMP operation is
performed does the MapReduce job that loads the data from the file
system actually run.
Loading Data using LOAD
statement
• Example
• Given below is a Pig Latin statement,
which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray);
Pig Latin – Data types
Null Values
• Values for all the above data types can
be NULL. Apache Pig treats null
values in a similar way as SQL does.
• A null can be an unknown value or a
non-existent value. It is used as a
placeholder for optional values. These
nulls can occur naturally or can be
the result of an operation.
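For instance, nulls can be tested and filtered explicitly. A minimal sketch (the relation and field names are hypothetical):

```
-- keep only tuples whose age field is present
valid_data   = FILTER student_details BY age IS NOT NULL;
-- keep only tuples whose age field is missing
missing_data = FILTER student_details BY age IS NULL;
```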
Pig Latin – Type Construction
Operators
Pig Latin – Relational
Operations
Apache Pig - Reading Data
• In general, Apache Pig works on top of Hadoop.
• It is an analytical tool that analyzes large
datasets that exist in the Hadoop File System.
• To analyze data using Apache Pig, we have to
initially load the data into Apache Pig.
• Steps to load data into Pig using LOAD
Function.
• Copy the local text file into HDFS under a directory named pig_data.
• The input file contains data separated by ',' (comma).
Loading Data using LOAD
Operator
• You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD
operator of Pig Latin.
 Syntax
• The load statement consists of two parts divided by the “=” operator.
• On the left-hand side, we need to mention the name of the relation where we want
to store the data, and on the right-hand side, we have to define how we store the
data.
 Given below is the syntax of the Load operator.
• Relation_name = LOAD 'Input file path' USING function AS schema;
• Where,
• relation_name − We have to mention the relation in which we want to store the
data.
• Input file path − We have to mention the HDFS directory where the file is stored.
(In MapReduce mode)
• function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required
schema as follows −
 (column1 : data type, column2 : data type, column3 : data type);
Loading Data using LOAD
Operator
 Start the Pig Grunt Shell
• First of all, open the Linux terminal. Start the Pig
Grunt shell in MapReduce mode as shown below.
 $ pig -x mapreduce
 Execute the Load Statement
• Now load the data from the file student_data.txt
into Pig by executing the following Pig Latin
statement in the Grunt shell.
 grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt' USING
PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
Apache Pig - Storing Data
• Previously, we learnt how to load data into Apache Pig. You can store
the loaded data in the file system using the STORE operator. This
chapter explains how to store data in Apache Pig using the
STORE operator.
 Syntax
• Given below is the syntax of the Store statement.
 STORE Relation_name INTO ' required_directory_path '
[USING function];
 Example:
 Now, let us store the relation ‘student’ in the HDFS directory
“/pig_Output/” as shown below.
 grunt> STORE student INTO
'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
 We can verify the contents of the output HDFS file using the cat
command.
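For example, from the Grunt shell itself (the part-file name below is an assumption about what the store job writes; it may differ):

```
grunt> fs -cat hdfs://localhost:9000/pig_Output/part-m-00000
```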
Apache Pig - Diagnostic
Operators
• The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of
diagnostic operators −
 Dump operator
 Describe operator
 Explain operator
 Illustrate operator
 Dump Operator
• The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally
used for debugging purposes.
 Syntax:
 grunt> Dump Relation_Name
 Example
• Assume we have a file student_data.txt in HDFS.
• And we have read it into a relation student using the LOAD operator as shown below.
 grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
 grunt> Dump student
• Once you execute the above Pig Latin statement, it starts a MapReduce job to read data from HDFS.
• Note: running the LOAD statement alone does not load the data into the relation student.
• Executing the Dump statement triggers the actual load.
Apache Pig - Describe
Operator
• The describe operator is used to view the schema of a
relation.
 Syntax:
 grunt> Describe Relation_name
 Example:
 grunt> describe student;
• Where student is the relation name.
 Output
• Once you execute the above Pig Latin statement, it will
produce the following output.
 student: { id: int, firstname: chararray, lastname: chararray,
phone: chararray, city: chararray }
Apache Pig - Explain Operator
• The explain operator is used to display the logical,
physical, and MapReduce execution plans of a
relation.
 Syntax:
 grunt> explain Relation_name;
 Example:
 grunt> explain student;
 Output:
• It will display the logical, physical, and MapReduce execution
plans of the relation.
Apache Pig - Illustrate
Operator
• The illustrate operator gives you the step-by-step
execution of a sequence of statements.
 Syntax:
 grunt> illustrate Relation_name;
 Example:
• Assume we have a relation student.
 grunt> illustrate student;
 Output:
• On executing the above statement, you will get the
following output.
Apache Pig - Group Operator
• The GROUP operator is used to group the data in one or more relations. It collects the data
having the same key.
 Syntax:
 grunt> Group_data = GROUP Relation_name BY column_name;
 Example:
• Assume we have a relation with name student_details with student details like id, name, age
etc.
• Now, let us group the records/tuples in the relation by age as shown below.
 grunt> group_data = GROUP student_details by age;
 Verification:
• Verify the relation group_data using the DUMP operator as shown below.
 grunt> Dump group_data;
 Output:
• Then you will get output displaying the contents of the relation named group_data as shown
below. Here you can observe that the resulting schema has two columns
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples, student records with the respective age.
 You can see the schema of the table after grouping the data using the describe command as
shown below.
 grunt> Describe group_data;
 group_data: {group: int, student_details: {(id: int, firstname:
chararray, lastname: chararray, age: int, phone: chararray, city: chararray)}}
Group Multiple columns
• In the same way, you can get a sample illustration of the schema
using the illustrate command as shown below.
 grunt> illustrate group_data;
 Output:
 Grouping by Multiple Columns:
 We can group the data using multiple columns also as shown below.
 grunt> group_multiple = GROUP student_details by (age, city);
 Group All:
• We can group the data using all columns also as shown below.
 grunt> group_all = GROUP student_details ALL;
 Now, verify the content of the relation group_all as shown below.
 grunt> Dump group_all;
Apache Pig - Cogroup
Operator
• The COGROUP operator works more or less in the same way as the
GROUP operator. The only difference between the two operators is
that the group operator is normally used with one relation, while the
cogroup operator is used in statements involving two or more
relations.
 Grouping Two Relations using Cogroup:
• Assume that we have two relations namely student_details and
employee_details in pig.
• Now, let us group the records/tuples of the relations
student_details and employee_details with the key age, as shown
below.
 grunt> cogroup_data = COGROUP student_details by age,
employee_details by age;
 Verification
• Verify the relation cogroup_data using the DUMP operator as shown
below.
 grunt> Dump cogroup_data;
COGROUP Operator
 Output
• It will produce the following output, displaying the contents of the
relation named cogroup_data as shown below.
• The cogroup operator groups the tuples from each relation
according to age where each group depicts a particular age value.
• For example, if we consider the 1st tuple of the result, it is grouped
by age 21. And it contains two bags −
 the first bag holds all the tuples from the first relation
(student_details in this case) having age 21, and
 the second bag contains all the tuples from the second relation
(employee_details in this case) having age 21.
• In case a relation doesn’t have tuples having the age value 21, it
returns an empty bag.
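As a hedged illustration with made-up tuples: if student_details held (1,Raju,21) and (2,Siddarth,22), and employee_details held (1,Robin,22), the cogrouped result would look roughly like this:

```
-- cogroup_data (illustrative, hypothetical data)
(21, {(1,Raju,21)},     {})
(22, {(2,Siddarth,22)}, {(1,Robin,22)})
```

Note the empty second bag for age 21, since employee_details has no tuple with that age.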
Apache Pig - Join Operator
• The JOIN operator is used to combine records from two or more relations.
• While performing a join operation, we declare one (or a group of) tuple(s) from each
relation, as keys.
• When these keys match, the two particular tuples are matched, else the records
are dropped.
• Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
• Joins in Pig are similar to SQL joins. In Pig we join relations,
where in SQL we join tables.
• We will see only the syntax of the different joins in Pig.
 Self – join:
 grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
 Example
• Let us perform a self-join on the customers data. Since Pig cannot
join a relation with itself directly, the same data is loaded under two
alias names, customers1 and customers2, and joined on the key as shown below.
• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
 Verification:
 grunt> Dump customers3;
JOINS
 Inner Join:
 Syntax:
• grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
 Example
• Let us perform inner join operation on the two relations customers and orders as shown below.
 grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
 Verification:
 grunt> Dump customer_orders;
 Left Outer Join:
 grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
 Example
• Let us perform left outer join operation on the two relations customers and orders as shown below.
 grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_left;
 Right Outer Join:
 Syntax:
• grunt> Relation3_name = JOIN Relation1_name BY key RIGHT OUTER, Relation2_name BY key;
 Example
• Let us perform right outer join operation on the two relations customers and orders as shown below.
 grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_right;
JOINS
 Full Outer Join:
 Syntax:
• grunt> Relation3_name = JOIN Relation1_name BY key FULL OUTER, Relation2_name BY key;
 Example
• Let us perform full outer join operation on the two relations customers and
orders as shown below.
 grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
 Verification:
 grunt> Dump outer_full;
 Using Multiple Keys:
 grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name
BY (key1, key2);
 Example:
 grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
 Verification
 grunt> Dump emp;
Apache Pig - Cross Operator
• The CROSS operator computes the cross-product of two or more relations.
 Syntax:
• grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
 Example:
• Assume that we have two Pig relations namely customers and orders.
• Let us now get the cross-product of these two relations using the cross operator
on these two relations as shown below.
 grunt> cross_data = CROSS customers, orders;
 Verification:
 grunt> Dump cross_data;
 Output:
• It will produce the following output, displaying the contents of the relation
cross_data.
• The output is the cross product: each tuple in the relation customers is
paired with every tuple in the relation orders.
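As a sketch with hypothetical data: if customers held 2 tuples and orders held 3, cross_data would contain 2 × 3 = 6 tuples:

```
-- customers: (1,Ramesh), (2,Khilan)          (hypothetical)
-- orders:    (100,2009-10-08), (101,2009-11-20), (102,2009-10-08)
-- cross_data then holds every pairing, e.g.
(1,Ramesh,100,2009-10-08)
(1,Ramesh,101,2009-11-20)
(1,Ramesh,102,2009-10-08)
(2,Khilan,100,2009-10-08)
(2,Khilan,101,2009-11-20)
(2,Khilan,102,2009-10-08)
```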
Apache Pig - Union Operator
• The UNION operator of Pig Latin is used to merge the content of two
relations. To perform UNION operation on two relations, their
columns and domains must be identical.
 Syntax:
• grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
 Example:
• Assume that we have two relations namely student1 and student2,
containing the same number and types of columns but different data.
• Let us now merge the contents of these two relations using the
UNION operator as shown below.
 grunt> student = UNION student1, student2;
 Verification:
 grunt> Dump student;
 Output:
• It combines the records of both the relations student1 and student2
into the relation student.
Apache Pig - Split Operator
• The SPLIT operator is used to split a relation into two or more relations.
 Syntax:
 grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1),
Relation3_name IF (condition2);
 Example:
• Assume that we have a relation named student_details.
• Let us now split the relation into two, one listing the students of age less than
23, and the other listing the students having the age between 22 and 25.
 SPLIT student_details into student_details1 if age<23, student_details2 if
(age>22 and age<25);
 Verification:
 grunt> Dump student_details1;
 grunt> Dump student_details2;
 Output:
• It will produce the following output, displaying the contents of the relations
student_details1 and student_details2 respectively.
Apache Pig - Filter Operator
• The FILTER operator is used to select the required tuples from a
relation based on a condition.
 Syntax:
 grunt> Relation2_name = FILTER Relation1_name BY (condition);
 Example:
• Assume that we have a relation named student_details.
• Let us now use the Filter operator to get the details of the students
who belong to the city Chennai.
 filter_data = FILTER student_details BY city == 'Chennai';
 Verification:
• grunt> Dump filter_data;
 Output:
• It will produce the following output, displaying the contents of the
relation filter_data as follows.
 (6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Apache Pig - Distinct
Operator
• The DISTINCT operator is used to remove redundant (duplicate)
tuples from a relation.
 Syntax:
 grunt> Relation_name2 = DISTINCT Relation_name1;
 Example:
• Assume that we have a relation named student_details.
• Let us now remove the redundant (duplicate) tuples from the
relation named student_details using the DISTINCT operator, and
store it as another relation named distinct_data as shown below.
 grunt> distinct_data = DISTINCT student_details;
 Verification:
 grunt> Dump distinct_data;
 Output:
• Dump operator will display the distinct_data producing the distinct
rows from student_details table.
Apache Pig - Foreach
Operator
• The FOREACH operator is used to generate specified data transformations based on the column data.
• The name itself indicates that for each element of a data bag, the respective action will be performed.
 Syntax:
• grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
 Example
• Assume that we have a relation named student_details.
 grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
• Let us now get the id, age, and city values of each student from the relation student_details and store it into
another relation named foreach_data using the foreach operator as shown below.
 grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
 Verification:
 grunt> Dump foreach_data;
 Output:
 Dump operator displays the below data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Apache Pig - Order By
• The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
 Syntax:
• grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
 Example:
• Assume that we have a relation named student_details.
 grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
 Let us now sort the relation in a descending order based on the age of the student
and store it into another relation named order_by_data using the ORDER BY
operator as shown below.
 grunt> order_by_data = ORDER student_details BY age DESC;
 Verification:
 grunt> Dump order_by_data;
 Output:
 Dump operator will produce the student_details data sorted by age in descending
order.
Apache Pig - Limit Operator
• The LIMIT operator is used to get a limited number of tuples from a relation.
 Syntax:
• grunt> Result = LIMIT Relation_name number_of_tuples;
 Example
• Assume that we have a file named student_details.
 grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
 grunt> limit_data = LIMIT student_details 4;
 Verification:
 grunt> Dump limit_data;
 Output:
• Dump operator will produce the data of student_details limited to the
first 4 rows.
 (1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
Apache Pig - Load & Store
Functions
• The Load and Store functions in Apache Pig determine how data goes into and comes out
of Pig. These functions are used with the load and store operators. Given below is the list of load and
store functions available in Pig.
Apache Pig - Bag & Tuple
Functions
Apache Pig - String Functions
Apache Pig - String Functions
Apache Pig - String Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - Date-time
Functions
Apache Pig - User Defined
Functions
• In addition to the built-in functions, Apache Pig provides
extensive support for User Defined Functions (UDFs). Using
these UDFs, we can define our own functions and use them.
UDF support is provided in six programming languages,
namely Java, Jython, Python, JavaScript, Ruby and Groovy.
• For writing UDFs, complete support is provided in Java and
limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the
processing, like data load/store, column transformation, and
aggregation. Since Apache Pig itself is written in Java, UDFs
written in Java work more efficiently than those in other
languages.
• In Apache Pig, we also have a Java repository for UDFs
named Piggybank. Using Piggybank, we can access Java
UDFs written by other users, and contribute our own UDFs.
Types of UDF’s in Java
• While writing UDFs using Java, we can create and use
the following three types of functions −
• Filter Functions − The filter functions are used as
conditions in filter statements. These functions accept a
Pig value as input and return a Boolean value.
• Eval Functions − The Eval functions are used in
FOREACH-GENERATE statements. These functions
accept a Pig value as input and return a Pig result.
• Algebraic Functions − The Algebraic functions act on
inner bags in a FOREACH-GENERATE statement. These
functions are used to perform full MapReduce
operations on an inner bag.
• An eval UDF must extend "org.apache.pig.EvalFunc"
and override the "exec" method.
UDF Example
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    // Converts the first field of the input tuple to upper case.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
Create a jar of the above code as myudfs.jar.
Now write the script in a file and save it as .pig. Here I am using script.pig.
• -- script.pig
• REGISTER myudfs.jar;
• A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
• B = FOREACH A GENERATE myudfs.UPPER(name);
• DUMP B;
Finally run the script in the terminal to get the output.
Apache Pig - Running Scripts
• We will see how to run Apache Pig scripts in batch
mode.
 Comments in Pig Script
• While writing a script in a file, we can include
comments in it as shown below.
 Multi-line comments
• We will begin the multi-line comments with '/*', end
them with '*/'.
 /* These are the multi-line comments In the pig script
*/
 Single-line comments
• We will begin the single-line comments with '--'.
 --we can write single line comments like this.
Executing Pig Script in Batch
mode
 Executing Pig Script in Batch mode:
• While executing Apache Pig statements in batch mode, follow the steps given below.
 Step 1
• Write all the required Pig Latin statements in a single file. We can write all the Pig Latin
statements and commands in a single file and save it as .pig file.
 Step 2
• Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown
below.
 Local mode:
 $ pig -x local Sample_script.pig
 MapReduce mode:
 $ pig -x mapreduce Sample_script.pig
• You can execute it from the Grunt shell as well using the exec command as shown below.
 grunt> exec /sample_script.pig
• Executing a Pig Script from HDFS
• We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with
the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as
shown below.
 $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Executing pig script from HDFS
 Example:
• We have a sample script with the name sample_script.pig, in
the same HDFS directory. This file contains statements
performing operations and transformations on the student
relation, as shown below.
 student = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
 student_order = ORDER student BY age DESC;
 student_limit = LIMIT student_order 4;
 Dump student_limit;
• Let us now execute the sample_script.pig as shown below.
 $ pig -x mapreduce
hdfs://localhost:9000/pig_data/sample_script.pig
• Apache Pig gets executed and gives you the output.
WORD COUNT EXAMPLE - PIG
SCRIPT
• How to find the number of occurrences of the words in a file using
the pig script?
 Word Count Example Using Pig Script:
 lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
 words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as
word;
 grouped = GROUP words BY word;
 wordcount = FOREACH grouped GENERATE group, COUNT(words);
 DUMP wordcount;
 The above Pig script first splits each line into words using
the TOKENIZE function, which produces a bag of words. The FLATTEN
operator then un-nests that bag, so each word becomes its own tuple.
In the third statement the words are grouped by word, and in the
fourth the count of each group is computed.
With just 5 lines of Pig, the word count problem is solved.

More Related Content

PPTX
Network Security- Secure Socket Layer
PPT
17. Trees and Graphs
PDF
Chapter 2 - Multimedia Communications
PDF
MULTIMEDIA COMMUNICATION & NETWORKS
PPTX
Binary Search Tree
PPS
PPTX
Graph in data structure
PPT
Strategic capacity planning for products and services
Network Security- Secure Socket Layer
17. Trees and Graphs
Chapter 2 - Multimedia Communications
MULTIMEDIA COMMUNICATION & NETWORKS
Binary Search Tree
Graph in data structure
Strategic capacity planning for products and services

What's hot (20)

PPTX
Apache PIG
PPT
File organization 1
PPTX
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
PPTX
Structure of dbms
PPTX
PDF
Network programming Using Python
PPTX
Decomposition methods in DBMS
PPTX
Degree of relationship set
PPTX
Linux network file system (nfs)
PDF
Nested Queries Lecture
PPTX
Introduction to HDFS
PPTX
Chapter1
PPTX
Map Reduce
PPT
SQLITE Android
PPTX
Concurrency Control in Database Management System
PPTX
Functional dependencies in Database Management System
PPT
Hierarchical Object Oriented Design
PPTX
This pointer
PPT
ADO.NET
PDF
Identifying classes and objects ooad
Apache PIG
File organization 1
FUNCTION DEPENDENCY AND TYPES & EXAMPLE
Structure of dbms
Network programming Using Python
Decomposition methods in DBMS
Degree of relationship set
Linux network file system (nfs)
Nested Queries Lecture
Introduction to HDFS
Chapter1
Map Reduce
SQLITE Android
Concurrency Control in Database Management System
Functional dependencies in Database Management System
Hierarchical Object Oriented Design
This pointer
ADO.NET
Identifying classes and objects ooad
Ad

Similar to Pig latin (20)

PPTX
Session 04 pig - slides
PPTX
Unit-5 [Pig] working and architecture.pptx
PPTX
Pig_Presentation
PDF
06 pig-01-intro
PDF
pig intro.pdf
PPTX
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
PDF
Apache Pig: A big data processor
PPTX
Unit 4 lecture-3
PPT
pig.ppt
PPTX
03 pig intro
PPTX
A slide share pig in CCS334 for big data analytics
PPTX
PDF
Big Data - Lab A1 (SC 11 Tutorial)
PPTX
AWS Hadoop and PIG and overview
PPTX
04 pig data operations
PPTX
Aggregate.pptx
PDF
An Overview of Hadoop
PDF
Spark Performance Tuning .pdf
PDF
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Session 04 pig - slides
Unit-5 [Pig] working and architecture.pptx
Pig_Presentation
06 pig-01-intro
pig intro.pdf
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache Pig: A big data processor
Unit 4 lecture-3
pig.ppt
03 pig intro
A slide share pig in CCS334 for big data analytics
Big Data - Lab A1 (SC 11 Tutorial)
AWS Hadoop and PIG and overview
04 pig data operations
Aggregate.pptx
An Overview of Hadoop
Spark Performance Tuning .pdf
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Ad

Recently uploaded (20)

PDF
Complications of Minimal Access Surgery at WLH
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Cell Structure & Organelles in detailed.
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
Institutional Correction lecture only . . .
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
Cell Types and Its function , kingdom of life
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Business Ethics Teaching Materials for college
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Complications of Minimal Access Surgery at WLH
STATICS OF THE RIGID BODIES Hibbelers.pdf
Cell Structure & Organelles in detailed.
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
VCE English Exam - Section C Student Revision Booklet
Microbial disease of the cardiovascular and lymphatic systems
Institutional Correction lecture only . . .
Final Presentation General Medicine 03-08-2024.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
Cell Types and Its function , kingdom of life
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPH.pptx obstetrics and gynecology in nursing
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Business Ethics Teaching Materials for college
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
3rd Neelam Sanjeevareddy Memorial Lecture.pdf

Pig latin

  • 2. Pig Latin-Basics • Pig Latin is the language used to analyze data in Hadoop using Apache Pig. • Pig Latin – Data Model • The data model of Pig is fully nested. • A Relation is the outermost structure of the Pig Latin data model. And it is a bag where − • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 3. Pig Latin – Statemets  While processing data using Pig Latin, statements are the basic constructs.  These statements work with relations. They include expressions and schemas.  Every statement ends with a semicolon (;).  We will perform various operations using operators provided by Pig Latin, through statements.  Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.  As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation, the MapReduce job for loading the data into the file system will be carried out.
  • 4. Loading Data using LOAD statement • Example • Given below is a Pig Latin statement, which loads data to Apache Pig. grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
  • 5. Pig Latin – Data types
  • 6. Pig Latin – Data types Null Values • Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does. • A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
  • 7. Pig Latin – Type Construction Operators
  • 8. Pig Latin – Relational Operations
  • 9. Pig Latin – Relational Operations
  • 10. Pig Latin – Relational Operations
  • 11. Apache Pig - Reading Data • In general, Apache Pig works on top of Hadoop. • It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. • To analyze data using Apache Pig, we have to initially load the data into Apache Pig. • Steps to load data into Pig using LOAD Function. • Copy the local text file into HDFS file in a directory name pig_data. • Input file contains data separated by ‘,’(comma).
  • 12. Loading Data using LOAD Operator • You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator of Pig Latin.  Syntax • The load statement consists of two parts divided by the “=” operator. • On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how we store the data.  Given below is the syntax of the Load operator. • Relation_name = LOAD 'Input file path' USING function as schema • Where, • relation_name − We have to mention the relation in which we want to store the data. • Input file path − We have to mention the HDFS directory where the file is stored. (In MapReduce mode) • function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader). • Schema − We have to define the schema of the data. We can define the required schema as follows −  (column1 : data type, column2 : data type, column3 : data type);
•	13. Loading Data using LOAD Operator  Start the Pig Grunt Shell • First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.  $ pig -x mapreduce  Execute the Load Statement • Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.  grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
  • 14. Loading Data using LOAD Operator
•	15. Apache Pig - Storing Data • In the previous section, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the Store operator.  Syntax • Given below is the syntax of the Store statement.  STORE Relation_name INTO 'required_directory_path' [USING function];  Example:  Now, let us store the relation ‘student’ in the HDFS directory “/pig_Output/” as shown below.  grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');  We can verify the contents of the output HDFS file using the cat command.
•	16. Apache Pig - Diagnostic Operators • The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −  Dump operator  Describe operator  Explain operator  Illustrate operator  Dump Operator • The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.  Syntax:  grunt> Dump Relation_Name;  Example • Assume we have a file student_data.txt in HDFS. • And we have read it into a relation student using the LOAD operator as shown below.  grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );  grunt> Dump student; • Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. • Note: Running the LOAD statement alone does not read the data into the relation student. • The data is actually read only when the Dump statement is executed.
  • 17. Apache Pig - Describe Operator • The describe operator is used to view the schema of a relation.  Syntax:  grunt> Describe Relation_name  Example:  grunt> describe student; • Where student is the relation name.  Output • Once you execute the above Pig Latin statement, it will produce the following output.  grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
  • 18. Apache Pig - Explain Operator • The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.  Syntax:  grunt> explain Relation_name;  Example:  grunt> explain student;  Output: • It will produce the following output as in the attachment.
  • 19. Apache Pig - Illustrate Operator • The illustrate operator gives you the step-by-step execution of a sequence of statements.  Syntax:  grunt> illustrate Relation_name;  Example: • Assume we have a relation student.  grunt> illustrate student;  Output: • On executing the above statement, you will get the following output.
•	20. Apache Pig - Group Operator • The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.  Syntax:  grunt> Group_data = GROUP Relation_name BY column_name;  Example: • Assume we have a relation named student_details with student details like id, name, age etc. • Now, let us group the records/tuples in the relation by age as shown below.  grunt> group_data = GROUP student_details by age;  Verification: • Verify the relation group_data using the DUMP operator as shown below.  grunt> Dump group_data;  Output: • Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns • One is age, by which we have grouped the relation. • The other is a bag, which contains the group of tuples, student records with the respective age.  You can see the schema of the table after grouping the data using the describe command as shown below.  grunt> Describe group_data; group_data: {group: int,student_details: {(id: int,firstname: chararray, lastname: chararray,age: int,phone: chararray,city: chararray)}}
•	21. Group Multiple columns • In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.  grunt> illustrate group_data;  Output:  Grouping by Multiple Columns:  We can group the data using multiple columns also as shown below.  grunt> group_multiple = GROUP student_details by (age, city);  Group All: • We can group the data using all columns also as shown below.  grunt> group_all = GROUP student_details All;  Now, verify the content of the relation group_all as shown below.  grunt> Dump group_all;
  • 22. Apache Pig - Cogroup Operator • The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.  Grouping Two Relations using Cogroup: • Assume that we have two relations namely student_details and employee_details in pig. • Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.  grunt> cogroup_data = COGROUP student_details by age, employee_details by age;  Verification • Verify the relation cogroup_data using the DUMP operator as shown below.  grunt> Dump cogroup_data;
  • 23. COGROUP Operator  Output • It will produce the following output, displaying the contents of the relation named cogroup_data as shown below. • The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value. • For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −  the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and  the second bag contains all the tuples from the second relation (employee_details in this case) having age 21. • In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.
•	24. Apache Pig - Join Operator • The JOIN operator is used to combine records from two or more relations. • While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. • When these keys match, the two particular tuples are matched, else the records are dropped. • Joins can be of the following types −  Self-join  Inner-join  Outer-join − left join, right join, and full join • Joins in PIG are similar to SQL joins. In Pig, we will be joining the relations where in SQL we will be joining the tables. • We see only the syntaxes of the different joins in PIG.  Self – join:  grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;  Example • To perform a self-join, the same data is loaded twice under different aliases. Let us perform the self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below. • grunt> customers3 = JOIN customers1 BY id, customers2 BY id;  Verification:  grunt> Dump customers3;
•	25. JOINS  Inner Join:  Syntax: • grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;  Example • Let us perform the inner join operation on the two relations customers and orders as shown below.  grunt> customer_orders = JOIN customers BY id, orders BY customer_id;  Verification:  grunt> Dump customer_orders;  Left Outer Join:  grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;  Example • Let us perform the left outer join operation on the two relations customers and orders as shown below.  grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_left;  Right Outer Join:  grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;  Example • Let us perform the right outer join operation on the two relations customers and orders as shown below.  grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_right;
•	26. JOINS  Full Outer Join: • grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;  Example • Let us perform the full outer join operation on the two relations customers and orders as shown below.  grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;  Verification:  grunt> Dump outer_full;  Using Multiple Keys:  grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);  Example:  grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);  Verification  grunt> Dump emp;
•	27. Apache Pig - Cross Operator • The CROSS operator computes the cross-product of two or more relations.  Syntax: • grunt> Relation3_name = CROSS Relation1_name, Relation2_name;  Example: • Assume that we have two Pig relations namely customers and orders. • Let us now get the cross-product of these two relations using the cross operator on these two relations as shown below.  grunt> cross_data = CROSS customers, orders;  Verification:  grunt> Dump cross_data;  Output: • It will produce the following output, displaying the contents of the relation cross_data. • The output is the cross product: every row of the relation customers is paired with every row of orders.
  • 28. Apache Pig - Union Operator • The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.  Syntax: • grunt> Relation_name3 = UNION Relation_name1, Relation_name2;  Example: • Assume that we have two relations namely student1 and student2 containing same number and same type of columns. Data in the relations is different. • Let us now merge the contents of these two relations using the UNION operator as shown below.  grunt> student = UNION student1, student2;  Verification:  grunt> Dump student;  Output: • Combine the records in the both relations student1 and student2 into the relation student.
•	29. Apache Pig - Split Operator • The SPLIT operator is used to split a relation into two or more relations.  Syntax:  grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);  Example: • Assume that we have a relation named student_details. • Let us now split the relation into two, one listing the students of age less than 23, and the other listing the students having the age between 22 and 25.  SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);  Verification:  grunt> Dump student_details1;  grunt> Dump student_details2;  Output: • It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.
  • 30. Apache Pig - Filter Operator • The FILTER operator is used to select the required tuples from a relation based on a condition.  Syntax:  grunt> Relation2_name = FILTER Relation1_name BY (condition);  Example: • Assume that we have a relation named student_details. • Let us now use the Filter operator to get the details of the students who belong to the city Chennai.  filter_data = FILTER student_details BY city == 'Chennai';  Verification: • grunt> Dump filter_data;  Output: • It will produce the following output, displaying the contents of the relation filter_data as follows.  (6,Archana,Mishra,23,9848022335,Chennai) (8,Bharathi,Nambiayar,24,9848022333,Chennai)
  • 31. Apache Pig - Distinct Operator • The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.  Syntax:  grunt> Relation_name2 = DISTINCT Relation_name1;  Example: • Assume that we have a relation named student_details. • Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store it as another relation named distinct_data as shown below.  grunt> distinct_data = DISTINCT student_details;  Verification:  grunt> Dump distinct_data;  Output: • Dump operator will display the distinct_data producing the distinct rows from student_details table.
  • 32. Apache Pig - Foreach Operator • The FOREACH operator is used to generate specified data transformations based on the column data. • The name itself is indicating that for each element of a data bag, the respective action will be performed.  Syntax: • grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);  Example • Assume that we have a relation named student_details.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray); • Let us now get the id, age, and city values of each student from the relation student_details and store it into another relation named foreach_data using the foreach operator as shown below.  grunt> foreach_data = FOREACH student_details GENERATE id,age,city;  Verification:  grunt> Dump foreach_data;  Output:  Dump operator displays the below data. (1,21,Hyderabad) (2,22,Kolkata) (3,22,Delhi) (4,21,Pune) (5,23,Bhuwaneshwar) (6,23,Chennai) (7,24,trivendram) (8,24,Chennai)
•	33. Apache Pig - Order By • The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.  Syntax: • grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);  Example: • Assume that we have a relation named student_details.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);  Let us now sort the relation in a descending order based on the age of the student and store it into another relation named order_by_data using the ORDER BY operator as shown below.  grunt> order_by_data = ORDER student_details BY age DESC;  Verification:  grunt> Dump order_by_data;  Output:  Dump operator will produce the student_details data sorted by age in descending order.
•	34. Apache Pig - Limit Operator • The LIMIT operator is used to get a limited number of tuples from a relation.  Syntax: • grunt> Result = LIMIT Relation_name required_number_of_tuples;  Example • Assume that we have a file named student_details.txt.  grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);  grunt> limit_data = LIMIT student_details 4;  Verification:  grunt> Dump limit_data;  Output: • Dump operator will produce the data of student_details with a limited number of rows, i.e., 4 rows.  (1,Rajiv,Reddy,21,9848022337,Hyderabad) (2,siddarth,Battacharya,22,9848022338,Kolkata) (3,Rajesh,Khanna,22,9848022339,Delhi) (4,Preethi,Agarwal,21,9848022330,Pune)
•	35. Apache Pig - Load & Store Functions • The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.
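The function list on this slide did not survive extraction. As a sketch, the commonly listed functions (PigStorage, TextLoader, BinStorage) in use; file names and schemas are illustrative:

```pig
-- PigStorage: delimited text; the delimiter defaults to tab.
csv = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray);
-- TextLoader: each line of the file becomes a single chararray field.
raw = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);
-- BinStorage: Pig's internal binary format, useful for passing data between jobs.
STORE csv INTO 'binary_out' USING BinStorage();
```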
  • 36. Apache Pig - Bag & Tuple Functions
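The function table for this slide did not survive extraction. As a sketch, the conversion functions TOTUPLE, TOBAG, and TOMAP applied to a hypothetical relation:

```pig
data = LOAD 'student_data.txt' USING PigStorage(',')
       AS (id:int, name:chararray, age:int);
converted = FOREACH data GENERATE
    TOTUPLE(id, name) AS t,   -- wrap the listed fields in a tuple
    TOBAG(id, age)    AS b,   -- turn each expression into a tuple inside a bag
    TOMAP(name, age)  AS m;   -- build a map from alternating key/value arguments
```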
  • 37. Apache Pig - String Functions
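The string-function tables did not survive extraction. As a sketch, a few of the built-in string functions (UPPER, LOWER, SUBSTRING, TRIM) applied to a hypothetical relation:

```pig
names = LOAD 'student_data.txt' USING PigStorage(',')
        AS (id:int, firstname:chararray);
shaped = FOREACH names GENERATE
    UPPER(firstname)           AS upper_name,  -- all upper case
    LOWER(firstname)           AS lower_name,  -- all lower case
    SUBSTRING(firstname, 0, 3) AS prefix,      -- characters 0..2
    TRIM(firstname)            AS trimmed;     -- strip surrounding whitespace
```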
  • 40. Apache Pig - Date-time Functions
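The date-time function tables did not survive extraction. As a sketch, a few of the built-in date-time functions (ToDate, GetYear, CurrentTime) applied to a hypothetical relation of date strings:

```pig
events = LOAD 'events.txt' USING PigStorage(',')
         AS (id:int, date_str:chararray);
dated = FOREACH events GENERATE
    ToDate(date_str, 'yyyy-MM-dd')          AS dt,   -- parse string to datetime
    GetYear(ToDate(date_str, 'yyyy-MM-dd')) AS yr,   -- extract the year
    CurrentTime()                           AS now;  -- current datetime
```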
  • 44. Apache Pig - User Defined Functions • In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDF’s). Using these UDF’s, we can define our own functions and use them. The UDF support is provided in six programming languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy. • For writing UDF’s, complete support is provided in Java and limited support is provided in all the remaining languages. Using Java, you can write UDF’s involving all parts of the processing like data load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, the UDF’s written using Java language work efficiently compared to other languages. • In Apache Pig, we also have a Java repository for UDF’s named Piggybank. Using Piggybank, we can access Java UDF’s written by other users, and contribute our own UDF’s.
•	45. Types of UDF’s in Java • While writing UDF’s using Java, we can create and use the following three types of functions − • Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value. • Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result. • Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag. • Eval UDFs must extend "org.apache.pig.EvalFunc" • Such functions must override the "exec" method.
•	46. UDF Example package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try { String str = (String) input.get(0); return str.toUpperCase(); } catch (Exception e) { throw new IOException("Caught exception processing input row", e); } } } Create a jar of the above code as myudfs.jar. Now write the script in a file and save it as .pig. Here I am using script.pig. • -- script.pig • REGISTER myudfs.jar; • A = LOAD 'data' AS (name: chararray, age: int, gpa: float); • B = FOREACH A GENERATE myudfs.UPPER(name); • DUMP B; Finally, run the script in the terminal to get the output.
•	47. Apache Pig - Running Scripts • We will see how to run Apache Pig scripts in batch mode.  Comments in Pig Script • While writing a script in a file, we can include comments in it as shown below.  Multi-line comments • We begin multi-line comments with '/*' and end them with '*/'.  /* These are the multi-line comments In the pig script */  Single-line comments • We begin single-line comments with '--'.  --we can write single line comments like this.
  • 48. Executing Pig Script in Batch mode  Executing Pig Script in Batch mode: • While executing Apache Pig statements in batch mode, follow the steps given below.  Step 1 • Write all the required Pig Latin statements in a single file. We can write all the Pig Latin statements and commands in a single file and save it as .pig file.  Step 2 • Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.  Local mode:  $ pig -x local Sample_script.pig  MapReduce mode:  $ pig -x mapreduce Sample_script.pig • You can execute it from the Grunt shell as well using the exec command as shown below.  grunt> exec /sample_script.pig • Executing a Pig Script from HDFS • We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.  $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
•	49. Executing pig script from HDFS  Example: • We have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.  student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);  student_order = ORDER student BY age DESC;  student_limit = LIMIT student_order 4;  Dump student_limit; • Let us now execute the sample_script.pig as shown below.  $./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig • Apache Pig gets executed and gives you the output.
•	50. WORD COUNT EXAMPLE - PIG SCRIPT • How to find the number of occurrences of the words in a file using the pig script?  Word Count Example Using Pig Script:  lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);  words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;  grouped = GROUP words BY word;  wordcount = FOREACH grouped GENERATE group, COUNT(words);  DUMP wordcount;  The above Pig script first splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator un-nests that bag so that each word becomes a separate tuple. In the third statement the words are grouped together so that the count can be computed, which is done in the fourth statement. With just five lines of Pig, we have solved the word count problem.