Introduction to pig.


 An engine for executing data flow in parallel on Hadoop.
 Pig is an open-source high level data flow system.
 It provides a simple language called Pig Latin, for
queries and data manipulation.
 Pig Latin already have most of the traditional data
operation functionalities built into
• Filtering data
• Sorting data
• Joining data
 Pig users can create their own functions for reading,
processing and writing data under UDF(user defined
functions).
What is Pig?


 Pig Latin program is made up of a series of
operations or transformations that are applied to the
input data to produce output. The job of Pig is to
convert the transformations into a series of
MapReduce jobs.
What is Pig Latin
Program?


 It’s easy to learn, especially if you’re familiar with SQL.
 Pig’s multi-query approach reduces the number of times
data is scanned. This means 1/20th the lines of code and
1/16th the development time when compared to writing
raw MapReduce.
 Performance of Pig is in par with raw MapReduce
 Pig provides data operations like filters, joins, ordering,
etc. and nested data types like tuples, bags, and maps,
that are missing from MapReduce.
 Pig Latin is easy to write and read.
Why Do you Need Pig?


 Pig was originally developed by Yahoo in 2006, for
researchers to have an ad-hoc way of creating and
executing MapReduce jobs on very large data sets.
 It was created to reduce the development time
through its multi-query approach. Pig is also created
for professionals from non-Java background, to
make their job easier.
Why was Pig Created?


Pig can be used under following scenarios:
 When data loads are time sensitive.
 When processing various data sources.
 When analytical insights are required through
sampling.
Where Should Pig be
Used?


 In places where the data is completely unstructured,
like video, audio and readable text.
 In places where time constraints exist, as Pig is
slower than MapReduce jobs.
 In places where more power is required to optimize
the codes.
Where Not to Use Pig?


 Pigs Eat Anything
 Pigs Lives Anywhere
 Pigs are Domestic Animals
 Pigs Fly
Pig Philosophy


 Pig can operate on data whether it has metadata or
not.
 Pig can operate on data that is relational, nested or
unstructured.
 Pig can easily be extended to operate on data beyond
files
• Including key/value stores, databases, etc.
Pigs Eats Everything


 Pig is intended to be a language for parallel data
processing.
 Pig is not tied to one particular parallel framework.
 Pig has been implemented first on Hadoop,
• Not intend that to be only on Hadoop
 Pig on MongoDB
 Pig with Cassandra
Pigs Live Anywhere


 Designed to be easily controlled and modified by its
users.
 Integration of user designed functions(UDF)
• UDF are written in Java, Jython,
 Pig supports customer Loaders and store
• Load and Store
 Pig Supports streaming
• Execution of external executables
 Using Hadoop Streaming Methods
 Pig uses Optimizer by rearranging some of the operations
for better performance.
Pigs are domestic
Animals


 Pig processes data very fast
 The notion is that consistently improve Pig’s
performance, and implementation is done in a way
that Pig performance just go higher above.
Pigs Fly


 Processing of web logs.
 Data processing for search platforms.
 Support for Ad-hoc queries across large data sets.
 Quick prototyping of algorithms for processing large
data sets.
Applications of Apache
Pig:


Yahoo uses Pig for the following purpose:
 In Pipelines – To bring logs from its web servers,
where these logs undergo a cleaning step to remove
bots, company interval views and clicks.
 In Research – To quickly write a script to test a theory.
Pig Integration makes it easy for the researchers to
take a Perl or Python script and run it against a huge
data set.
How Yahoo! Uses Pig:


Here’s the hierarchy of Pig’s program structure:
 Script – Pig Can run a file script that contains Pig Commands.
Eg: pig script .pig runs the command in the local file script.pig
 Grunt – It is an interactive shell for running Pig commands. It is
also possible to run pig scripts from within Grunts using run
and exec.
 Embedded – Can run Pig programs from Java, much like you
can use JDBC to run SQL programs from Java.
Basic Program Structure
of Pig:


 Pig comprises of 4 basic types of data models. They are as follows:
 Atom – It is a simple atomic data value. It is stored as a string but
can be used as either a string or a number
 Tuple – An ordered set of fields
 Bag – An collection of tuples.
 Map – set of key value pairs.
Basic Types of Data
Models in Pig:


 Yahoo Pig Tutorial
• http://developer.yahoo.com/hadoop/tutorial/pigtut
orial.html
 edureka.co
• https://www.edureka.co/blog/introduction-to-pig/
 slideshare.net
• https://www.slideshare.net/Avkashslide/introductio
n-to-apache-pig-18002897
Resources

Introduction to pig.

More Related Content

What's hot (20)

Similar to Introduction to pig. (20)

More from Triloki Gupta (7)

Recently uploaded (20)

Introduction to pig.