Introduction to Pig
Agenda
 What Is Pig, and What Is It Used For?
 Pig Philosophy
 Pig’s Data Model
 Pig Example
 Pig Latin
 Pig Latin vs SQL
 Pig Macros
 Pig UDFs
What Is Pig, and What Is It Used For?
 Pig has an execution engine that runs data flows in parallel, much as MapReduce
distributes map tasks across the cluster nodes to get a job done.
 Pig uses its own Pig Latin language for expressing these data flows.
 Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.
 By default, Pig reads input files from HDFS, uses HDFS to store intermediate data
between MapReduce jobs, and writes its output to HDFS.
 Pig Latin use cases tend to fall into three categories: traditional
extract-transform-load (ETL) data pipelines, research on raw data, and iterative
processing.
Pig Philosophy
 Pigs eat anything
 Pigs live anywhere
 Pigs are domestic animals
 Pigs fly
Pig’s Data Model : Types
 Pig’s data types can be divided into two categories: scalar types, which
contain a single value, and complex types, which contain other types.
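As an illustrative sketch of the complex types (the file and field names here are hypothetical), a tuple, a bag, and a map can all be declared in a LOAD schema:

```pig
-- Scalar types: int, long, float, double, chararray, bytearray.
-- Complex types: tuple, bag, and map, which can nest other types.
A = LOAD 'complex.txt'
    AS (point:tuple(x:int, y:int),   -- tuple: a fixed, ordered set of fields
        readings:bag{t:(v:double)},  -- bag: an unordered collection of tuples
        props:map[chararray]);       -- map: chararray keys mapped to values
DESCRIBE A;
```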
Pig’s Data Model : Schemas
 If a schema for the data is available, Pig will make use of it, both for up-front error
checking and for optimization
 Syntax: loads = LOAD 'data.txt' AS (col1:int, col2:chararray, col3:chararray,
col4:float);
 It is also possible to specify the schema without giving explicit data types; in that
case, each field’s type defaults to bytearray.
 Syntax: loads = LOAD 'data.txt' AS (col1, col2, col3, col4);
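When no types are given, every field is a bytearray, so downstream operators typically cast the fields they use. A minimal sketch, reusing the hypothetical data.txt from above:

```pig
loads = LOAD 'data.txt' AS (col1, col2, col3, col4);  -- all fields are bytearray
-- Cast fields explicitly where concrete types are needed:
casted = FOREACH loads GENERATE (int)col1, (chararray)col2, (float)col4;
```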
Pig Example
 Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and
provides a shell for users to interact with HDFS.
 To enter Grunt, run pig -x local, pig -x mapreduce, or pig -x tez.
 records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray,
temperature:int, quality:int);
 filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4,
5, 9);
 grouped_records = GROUP filtered_records BY year;
 max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
 DUMP max_temp;
Pig Latin : Relational Operators
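As a brief sketch of a few common relational operators (the relations A and F, and their columns, are hypothetical):

```pig
B = FILTER A BY col1 > 0;      -- keep only tuples matching a condition
C = ORDER B BY col1 DESC;      -- sort a relation
D = DISTINCT C;                -- remove duplicate tuples
E = JOIN D BY col1, F BY id;   -- inner join two relations on keys
G = LIMIT E 10;                -- keep at most the first 10 tuples
```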
Pig Latin : Diagnostic/UDF Operators
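Applied to the relations defined in the earlier example, the diagnostic operators might be used as follows (a sketch; the actual output depends on the data):

```pig
DESCRIBE records;     -- print the schema of a relation
EXPLAIN max_temp;     -- show the logical, physical, and MapReduce plans
ILLUSTRATE max_temp;  -- walk a small data sample through the flow
DUMP max_temp;        -- execute the flow and print the result
```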
Pig Latin vs SQL
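As an illustrative comparison (reusing the records relation from the earlier example): SQL expresses a query as a single declarative statement, while Pig Latin builds the same result as a step-by-step data flow:

```pig
-- SQL: SELECT year, MAX(temperature) FROM records GROUP BY year;
-- The equivalent Pig Latin pipeline:
grouped = GROUP records BY year;
result  = FOREACH grouped GENERATE group, MAX(records.temperature);
```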
Pig Macros
 Macros provide a way to package reusable pieces of Pig Latin code from within
Pig Latin itself.
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X BY $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int,
quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
Pig UDFs
 A Filter UDF
 An Eval UDF
 A Load UDF
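A sketch of how a UDF is typically registered and invoked from Pig Latin (the jar name, class name, and columns are hypothetical):

```pig
REGISTER 'myudfs.jar';                          -- jar containing the UDF classes
DEFINE IsGood com.example.pig.IsGoodQuality();  -- alias for a filter UDF
good = FILTER records BY IsGood(quality);       -- used like a built-in function
```

A filter UDF returns a boolean for use in FILTER, an eval UDF computes a value per tuple, and a load UDF implements a custom input format for LOAD.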