Triloki Gupta
205217006

What is Pig?
• An engine for executing data flows in parallel on Hadoop.
• Pig is an open-source, high-level data flow system.
• It provides a simple language, called Pig Latin, for queries and data manipulation.
• Pig Latin already has most of the traditional data operations built in:
  • Filtering data
  • Sorting data
  • Joining data
• Pig users can create their own functions for reading, processing, and writing data as user-defined functions (UDFs).
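To make the three built-in operations concrete, here is a minimal plain-Python sketch of filtering, sorting, and joining over small lists of tuples. It is not Pig itself. The relation names and sample rows are hypothetical, and the Pig Latin in the comments is given only for orientation.

```python
# Plain-Python stand-ins for three built-in Pig Latin operations.
# Relation names and sample rows are hypothetical.
# Roughly equivalent Pig Latin, for orientation only:
#   adults  = FILTER users BY age >= 18;
#   ordered = ORDER adults BY age DESC;
#   joined  = JOIN ordered BY name, visits BY name;

users = [("alice", 34), ("bob", 17), ("carol", 52)]  # (name, age)
visits = [("alice", "/home"), ("carol", "/cart"), ("carol", "/pay")]  # (name, url)

# FILTER: keep only the tuples that satisfy a predicate.
adults = [u for u in users if u[1] >= 18]

# ORDER: sort tuples by a field, here age in descending order.
ordered = sorted(adults, key=lambda u: u[1], reverse=True)

# JOIN: pair every user tuple with every visit tuple that shares its name.
joined = [u + v for u in ordered for v in visits if u[0] == v[0]]

for row in joined:
    print(row)
```

Each step takes a whole relation in and hands a whole relation out, which is the data flow style the slide describes.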

What is a Pig Latin Program?
• A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. The job of Pig is to convert these transformations into a series of MapReduce jobs.
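The idea of turning a transformation into a MapReduce job can be sketched in plain Python. The example below assumes a hypothetical visits relation and shows how a group-and-count step breaks down into map, shuffle, and reduce phases. It illustrates the concept only and does not reproduce Pig's actual compiler.

```python
from collections import defaultdict

# Hypothetical input relation: one (user, url) tuple per page visit.
# Roughly equivalent Pig Latin, for orientation only:
#   grouped = GROUP visits BY url;
#   counts  = FOREACH grouped GENERATE group, COUNT(visits);
visits = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]


def map_phase(records):
    """Map: emit one (key, value) pair per record, keyed by url."""
    for _user, url in records:
        yield url, 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

The two-line transformation in the comment corresponds to all three phases, which is the work Pig takes off the programmer.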

Why Do You Need Pig?
• It’s easy to learn, especially if you’re familiar with SQL.
• Pig’s multi-query approach reduces the number of times data is scanned. Compared with writing raw MapReduce, Pig is said to need about 1/20th the lines of code and 1/16th the development time.
• Performance of Pig is roughly on par with raw MapReduce.
• Pig provides data operations such as filters, joins, and ordering, and nested data types such as tuples, bags, and maps, all of which are missing from MapReduce.
• Pig Latin is easy to write and read.
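The benefit of scanning the data fewer times can be illustrated with a small sketch. The slides do not say how Pig implements its multi-query approach, so the code below only demonstrates the general idea: two queries that each read the input separately cost two scans, while a single shared pass answers both. The CountingSource class and the log records are hypothetical.

```python
class CountingSource:
    """Wraps a list of records and counts full passes over it."""

    def __init__(self, records):
        self.records = records
        self.scans = 0

    def scan(self):
        # Every call represents reading the input from start to end.
        self.scans += 1
        return iter(self.records)


# Hypothetical log records: (url, status_code).
log = [("/home", 200), ("/cart", 500), ("/home", 404), ("/pay", 200)]

# Separate queries: each one reads the input on its own (two scans).
separate = CountingSource(log)
errors_a = [r for r in separate.scan() if r[1] >= 400]
home_a = [r for r in separate.scan() if r[0] == "/home"]

# Shared scan: one pass over the input feeds both queries.
shared = CountingSource(log)
errors_b, home_b = [], []
for record in shared.scan():
    if record[1] >= 400:
        errors_b.append(record)
    if record[0] == "/home":
        home_b.append(record)

print(separate.scans, shared.scans)  # 2 1
```

Both approaches produce the same results. Only the number of passes over the input differs, which matters when the input is very large.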

Why Was Pig Created?
• Pig was originally developed at Yahoo in 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets.
• It was created to reduce development time through its multi-query approach. Pig was also created to make the job easier for professionals from a non-Java background.

Where Should Pig Be Used?
Pig can be used in the following scenarios:
• When data loads are time sensitive.
• When processing various data sources.
• When analytical insights are required through sampling.

Where Not to Use Pig?
• Where the data is completely unstructured, such as video, audio, and readable text.
• Where time constraints exist, as Pig is slower than MapReduce jobs.
• Where more control is needed to optimize the code.

Pig Philosophy
• Pigs Eat Anything
• Pigs Live Anywhere
• Pigs Are Domestic Animals
• Pigs Fly

Pigs Eat Anything
• Pig can operate on data whether it has metadata or not.
• Pig can operate on data that is relational, nested, or unstructured.
• Pig can easily be extended to operate on data beyond files
  • Including key/value stores, databases, etc.
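Working on data with or without metadata can be sketched as follows. In Pig Latin, fields of a relation loaded without a schema are addressed by position ($0, $1, and so on), while a schema gives them names and types. The plain-Python version below mimics both cases. The sample line and the schema are hypothetical.

```python
# One tab-separated input line, as it might appear in a hypothetical file.
raw_line = "alice\t34"

# Without a schema: fields are addressed by position only.
# Pig Latin, for orientation only:
#   A = LOAD 'data';
#   B = FOREACH A GENERATE $0;
fields = raw_line.split("\t")
first_field = fields[0]

# With a schema: every field gets a name and a type.
# Pig Latin, for orientation only:
#   A = LOAD 'data' AS (name:chararray, age:int);
schema = [("name", str), ("age", int)]  # hypothetical schema
record = {name: cast(value) for (name, cast), value in zip(schema, fields)}

print(first_field)
print(record)
```

The same input is usable in both cases. The schema adds names and types but is not a precondition for processing.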

Pigs Live Anywhere
• Pig is intended to be a language for parallel data processing.
• Pig is not tied to one particular parallel framework.
• Pig was implemented first on Hadoop, but it is not intended to run only on Hadoop:
  • Pig on MongoDB
  • Pig with Cassandra

Pigs Are Domestic Animals
• Designed to be easily controlled and modified by its users.
• Integration of user-defined functions (UDFs)
  • UDFs can be written in languages such as Java and Jython
• Pig supports custom load and store functions
  • Load and Store
• Pig supports streaming
  • Execution of external executables
    • Using Hadoop Streaming methods
• Pig’s optimizer rearranges some of the operations for better performance.
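The slide does not say which operations the optimizer rearranges. One familiar rearrangement in query engines is applying a filter before a join instead of after it, and the plain-Python sketch below uses that as an assumed example. The relations and the nested-loop join helper are hypothetical and only illustrate why reordering can pay off.

```python
# Hypothetical relations.
users = [("alice", "US"), ("bob", "DE"), ("carol", "US")]  # (name, country)
orders = [("alice", 30), ("bob", 12), ("carol", 7), ("alice", 5)]  # (name, amount)


def join(left, right):
    """Nested-loop join on the first field; also counts row comparisons."""
    out, compared = [], 0
    for left_row in left:
        for right_row in right:
            compared += 1
            if left_row[0] == right_row[0]:
                out.append(left_row + right_row[1:])
    return out, compared


# As written: join everything, then filter the joined rows.
joined, cost_late = join(users, orders)
late = [row for row in joined if row[1] == "US"]

# Rearranged: filter first, then join the smaller input.
us_users = [u for u in users if u[1] == "US"]
early, cost_early = join(us_users, orders)

print(late == early, cost_late, cost_early)  # True 12 8
```

The result is identical in both orders, but the rearranged plan compares fewer rows. An optimizer looks for reorderings of this kind that leave the result unchanged.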

Pigs Fly
• Pig processes data very fast.
• The intent is to improve Pig’s performance consistently, with features implemented in a way that keeps that performance rising.

Applications of Apache Pig:
• Processing of web logs.
• Data processing for search platforms.
• Support for ad hoc queries across large data sets.
• Quick prototyping of algorithms for processing large data sets.

How Yahoo! Uses Pig:
Yahoo uses Pig for the following purposes:
• In Pipelines – To bring in logs from its web servers, where the logs undergo a cleaning step to remove bots, company-internal views, and clicks.
• In Research – To quickly write a script to test a theory. Pig integration makes it easy for researchers to take a Perl or Python script and run it against a huge data set.
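A cleaning step of the kind described for pipelines might look like the plain-Python sketch below. The record layout, the bot marker, and the internal address prefix are all assumptions made for illustration. The slides do not describe the actual rules Yahoo applies.

```python
# Hypothetical log records: (ip, user_agent, url).
log = [
    ("203.0.113.5", "Mozilla/5.0", "/home"),
    ("203.0.113.9", "ExampleBot/1.0", "/home"),
    ("10.0.0.7", "Mozilla/5.0", "/cart"),
]

INTERNAL_PREFIX = "10."  # assumed marker for company-internal addresses
BOT_MARKER = "bot"  # assumed marker for crawler user agents


def is_clean(record):
    """Keep a record only if it is neither bot traffic nor internal traffic."""
    ip, user_agent, _url = record
    if BOT_MARKER in user_agent.lower():
        return False
    if ip.startswith(INTERNAL_PREFIX):
        return False
    return True


cleaned = [record for record in log if is_clean(record)]
print(cleaned)
```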

Basic Program Structure of Pig:
Pig programs can be run in three ways:
• Script – Pig can run a script file that contains Pig commands. E.g., pig script.pig runs the commands in the local file script.pig.
• Grunt – An interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.
• Embedded – Pig programs can be run from Java, much like you can use JDBC to run SQL programs from Java.

Components of Pig:

Basic Types of Data Models in Pig:
• Pig’s data model comprises four basic types:
  • Atom – A simple atomic data value. It is stored as a string but can be used as either a string or a number.
  • Tuple – An ordered set of fields.
  • Bag – A collection of tuples.
  • Map – A set of key-value pairs.
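The four types map loosely onto everyday Python values, which may help when reading Pig output. The sketch below is only an analogy with hypothetical sample data. A Pig bag is unordered, so the Python list used here models the allowed duplicates but not the lack of ordering.

```python
# Atom: a single value, stored as text but usable as text or as a number.
atom = "42"
as_text = atom
as_number = int(atom)

# Tuple: an ordered set of fields. Pig notation: (alice,34)
person = ("alice", 34)

# Bag: a collection of tuples, duplicates allowed.
# Pig notation: {(alice,34),(bob,17),(alice,34)}
bag = [("alice", 34), ("bob", 17), ("alice", 34)]

# Map: a set of key-value pairs. Pig notation: [city#Paris,plan#free]
props = {"city": "Paris", "plan": "free"}

# Types nest: a tuple whose second field is a bag, as a grouping step yields.
grouped = ("alice", [("alice", "/home"), ("alice", "/cart")])

print(as_number + 1, person[0], len(bag), props["city"], len(grouped[1]))
```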

Resources
• Yahoo Pig Tutorial
  • http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html
• edureka.co
  • https://www.edureka.co/blog/introduction-to-pig/
• slideshare.net
  • https://www.slideshare.net/Avkashslide/introduction-to-apache-pig-18002897



Introduction to Pig
