ETL with Apache Pig
By
Arjun Shah
Under the guidance of
Dr Duc Thanh Tran
Agenda
• What is Pig?
• Introduction to Pig Latin
• Installation of Pig
• Getting Started with Pig
• Examples
What is Pig?
• Pig is a dataflow language
• Language is called PigLatin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are translated into MapReduce jobs
and executed on the cluster
• Built for Hadoop
• Originally developed at Yahoo!
• Huge contributions from Hortonworks, Twitter
What Pig Does
• Pig was designed for performing long series of
data operations, making it ideal for three
categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines (a minimal sketch follows this list),
• Research on raw data, and
• Iterative data processing.
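To make the ETL category concrete, here is a minimal Pig Latin sketch of such a pipeline; the file names, schema, and filter condition are invented for illustration:

-- Extract: read raw, comma-delimited event records (hypothetical file and schema)
raw = LOAD 'raw_events.csv' USING PigStorage(',') AS (user:chararray, action:chararray, amount:int);
-- Transform: keep only purchases and project the needed fields
purchases = FILTER raw BY action == 'purchase';
slim = FOREACH purchases GENERATE user, amount;
-- Load: write the cleaned data out (to HDFS in Hadoop mode)
STORE slim INTO 'clean_purchases';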
Features of Pig
• Joining datasets (a join sketch follows this list)
• Grouping data
• Referring to elements by position rather than name ($0, $1, etc.)
• Loading non-delimited data using a custom load/store function (Pig's analogue of a SerDe: writing a custom reader and writer)
• Creation of user-defined functions (UDFs), written in Java
• And more..
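Joins do not reappear later in the deck, so here is a minimal sketch; the files, schemas, and join key are invented:

emps = LOAD 'employees' USING PigStorage(',') AS (id:int, name:chararray, dept_id:int);
depts = LOAD 'departments' USING PigStorage(',') AS (dept_id:int, dept_name:chararray);
-- Inner join on the shared key; LEFT/RIGHT/FULL OUTER variants also exist
joined = JOIN emps BY dept_id, depts BY dept_id;
DUMP joined;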
Pig: Install
• There are a few prerequisites for installing Pig
(verification commands follow below):
• JAVA_HOME should be set
• Hadoop should be installed (a single-node
cluster is fine)
• Useful link :
http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html
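A quick way to verify the prerequisites before installing, assuming a typical Linux shell:

$ echo $JAVA_HOME    # should print your JDK path
$ java -version      # confirms Java is on the PATH
$ hadoop version     # confirms the Hadoop installation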
Pig: Install(2)
pig.apache.org/docs/r0.12.0/start.html
Pig: Install (3-5): screenshots of downloading and extracting the Pig release (images not reproduced here)
Move the tar file to any location
• $ cd /usr/local
• $ sudo cp ~/Download/pig-0.12.0.tar.gz .
• $ sudo tar xzf pig-0.12.0.tar.gz
• $ sudo mv pig-0.12.0 pig
Change .bashrc
• Edit the .bashrc file:
• $ gedit ~/.bashrc
• Add to .bashrc:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
• Close and reopen the terminal, then try pig -h
pig -h : Output
Pig: Configure
• The user can run Pig in two modes:
• Local mode (pig -x local): everything runs on a single
machine, using the local host and local file system.
• Hadoop mode: the default mode, which
requires access to a Hadoop cluster.
• The user can run Pig in either mode using the “pig”
command or the “java” command (the “pig” form is shown below).
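The two modes map directly onto how the shell is started:

$ pig -x local        # local mode: Grunt shell over the local file system
$ pig                 # Hadoop (MapReduce) mode, the default
$ pig -x mapreduce    # the default mode, stated explicitly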
Pig: Run
• Script: Pig can run a script file that contains Pig commands.
• For example,
% pig script.pig
• runs the commands in the local file "script.pig".
• Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the
command line (illustrated below).
• Grunt: Grunt is an interactive shell for running Pig commands.
• Grunt is started when no file is specified for Pig to run and the -e option is not used.
• Note: it is also possible to run Pig scripts from within Grunt using run and exec.
• Embedded: you can run Pig programs from Java, much like you can use JDBC to run SQL programs
from Java.
• There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig
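A short illustration of the script and Grunt styles; the -e one-liner is just an example command:

% pig script.pig           # run a script file
% pig -e 'fs -ls /user'    # -e runs a script passed as a string
% pig                      # no file and no -e: starts the Grunt shell
grunt> exec script.pig     # run a script from inside Grunt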
Pig Latin: Loading Data
• LOAD
- Reads data from the file system
• Syntax
- LOAD 'input' [USING function] [AS schema];
-Eg, A = LOAD 'input' USING PigStorage('\t') AS
(name:chararray, age:int, gpa:float);
Schema
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
-name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int,
gpa:float);
-name is now a String (chararray), age is integer
and gpa is float
Describing Schema
• Describe
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If schema is not provided, describe will say “Schema for alias unknown”
• grunt> A = load 'data' as (a:int, b: long, c: float);
• grunt> describe A;
• A: {a: int, b: long, c: float}
• grunt> B = load 'somemoredata';
• grunt> describe B;
• Schema for B unknown.
Dump and Store
• Dump writes the output to the console
• grunt> A = load 'data';
• grunt> DUMP A; //This will print the contents of A on the console
• Store writes output to an HDFS location
• grunt> A = load 'data';
• grunt> STORE A INTO '/user/username/output'; //This will
write the contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered
Referencing Fields
• Fields are referred to by positional notation OR by name (alias)
• Positional notation is generated by the system
• Starts with $0
• Names are assigned by you using schemas. Eg, A = load
'data' as (name:chararray, age:int);
• With positional notation, fields can be accessed as
• A = load 'data';
• B = foreach A generate $0, $1; //1st & 2nd columns
Limit
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
• grunt> A = load 'data';
• grunt> B = LIMIT A 10;
• grunt> DUMP B; --Prints only 10 rows
Foreach.. Generate
• Used for data transformations and projections
• Syntax
• alias = FOREACH { block | nested_block };
• nested_block usage: see the sketch after this slide
• grunt> A = load 'data' as (a1,a2,a3);
• grunt> B = FOREACH A GENERATE *;
• grunt> DUMP B;
• (1,2,3)
• (4,2,1)
• grunt> C = FOREACH A GENERATE a1, a3;
• grunt> DUMP C;
• (1,3)
• (4,1)
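The nested_block form deserves a concrete example; this is a minimal sketch with invented relation and field names. Inside the braces, a limited set of operators (ORDER, LIMIT, FILTER, DISTINCT, ...) can be applied to each group's bag before GENERATE:

grunt> A = load 'emps' as (name:chararray, dept:chararray, salary:int);
grunt> G = GROUP A BY dept;
grunt> T = FOREACH G {
           sorted = ORDER A BY salary DESC;  -- sort each department's bag
           top2 = LIMIT sorted 2;            -- keep the two highest-paid
           GENERATE group AS dept, top2.(name, salary);
       };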
Filter
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• Example, to filter for 'marcbenioff'
• A = LOAD 'sfdcemployees' USING PigStorage(',') as
(name:chararray,employeesince:int,age:int);
• B = FILTER A BY name == 'marcbenioff';
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name ==
'marcbenioff'));
Group By
• Groups data in one or more relations (similar to SQL GROUP BY)
• Syntax:
• alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL
n];
• Eg, to group by employee start year at Salesforce:
• A = LOAD 'sfdcemployees' USING PigStorage(',') as (name:chararray,
employeesince:int, age:int);
• B = GROUP A BY employeesince;
• You can also group all records together
• B = GROUP A ALL;
• Or group by multiple fields
• B = GROUP A BY (age, employeesince);
Demo: Sample Data (employee.txt)
• Example contents of ‘employee.txt’ a tab delimited text
• 1 Peter 234000000 none
• 2 Peter_01 234000000 none
• 124163 Jacob 10000 cloud
• 124164 Arthur 1000000 setlabs
• 124165 Robert 1000000 setlabs
• 124166 Ram 450000 es
• 124167 Madhusudhan 450000 e&r
• 124168 Alex 6500000 e&r
• 124169 Bob 50000 cloud
Demo: Employees with salary > 1 lakh (100,000)
• Loading data from employee.txt into the empls relation, with a schema
empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double,
dept:chararray);
• Filtering the data as required
rich = FILTER empls BY $2 > 100000;
• Sorting
sortd = ORDER rich BY salary DESC;
• Storing the final results
STORE sortd INTO 'rich_employees.txt';
• Or, alternatively, we can dump the records on the screen
DUMP sortd;
------------------------------------------------------------------
• Group by salary
grp = GROUP empls BY salary;
• Get count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) as emp_cnt;
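To run the demo end to end, the statements above can be saved to a script file (the name here is illustrative) and executed in local mode:

$ pig -x local demo.pig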
Output
More PigLatin (1/2)
• Load using PigStorage
• empls = LOAD 'employee.txt' USING
PigStorage('\t') AS (id:int, name:chararray,
salary:double, dept:chararray);
• Store using PigStorage
• STORE sortd INTO 'rich_employees.txt' USING
PigStorage('\t');
More PigLatin (2/2)
• To view the schema of a relation
• DESCRIBE empls;
• To view step-by-step execution of a series of
statements
• ILLUSTRATE empls;
• To view the execution plan of a relation
• EXPLAIN empls;
Exploring Pig with Project Data Set
Pig: Local Mode using Project Example (screenshots not reproduced here)
Pig: Hadoop Mode (GUI) using Project Example (screenshots not reproduced here)
Output
Crimes having category as VANDALISM
Output
Crimes occurring on Saturday & Sunday
Output
Grouping crimes by category
Output
PigLatin: UDF
• Pig provides extensive support for user-defined
functions (UDFs) as a way to specify custom
processing; functions can be part of almost
every operator in Pig (a minimal example follows below)
• All UDF names are case sensitive
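The deck says UDFs are written in Java but never shows one, so here is a minimal sketch modeled on the canonical UPPER example from the Pig documentation; the package, class, and jar names are invented:

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial eval UDF: upper-cases its first chararray argument
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return ((String) input.get(0)).toUpperCase();
    }
}

Compiled into a jar, it is registered and invoked from Pig Latin:

grunt> REGISTER myudfs.jar;
grunt> B = FOREACH A GENERATE myudfs.Upper(name);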
UDF: Types
• Eval Functions (EvalFunc)
• Ex: StringConcat (built-in): generates the concatenation of the first two fields
of a tuple.
• Aggregate Functions (EvalFunc & Algebraic)
• Ex: COUNT, AVG (both built-in)
• Filter Functions (FilterFunc)
• Ex: IsEmpty (built-in)
• Load/Store Functions (LoadFunc / StoreFunc)
• Ex: PigStorage (built-in)
• Note: built-in functions are listed at
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
• A short usage sketch of these function types follows.
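A sketch of the first three function types in use, with invented relation and field names (CONCAT is the Pig Latin name backed by the built-in StringConcat):

grunt> A = load 'people' as (first:chararray, last:chararray, friends:bag{});
grunt> B = FOREACH A GENERATE CONCAT(first, last);  -- eval function
grunt> C = FILTER A BY NOT IsEmpty(friends);        -- filter function
grunt> G = GROUP A ALL;
grunt> N = FOREACH G GENERATE COUNT(A);             -- aggregate function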
Summary
• Pig can be used to run ETL jobs on Hadoop. It
saves you from writing MapReduce code in Java
while its syntax may look familiar to SQL users.
Nonetheless, it is important to take some time to
learn Pig and to understand its advantages and
limitations. Who knows, maybe pigs can fly after
all.