Apache Pig – Introduction and
Hands-on
Ravi Mutyala
Systems Architect, Hortonworks
Twitter: @rmutyala




© Hortonworks Inc. 2012
Big Data Platforms
Cost per TB, Adoption
[Figure: bubble chart. Size of bubble = cost effectiveness of solution.]
Topics
• What is Pig?
• Why Pig ?
• Language Features
• Labs
• 0.10.0 Features
• Features in the pipeline
•Q &A




What is Pig?
• System for processing large unstructured data
• Uses HDFS and MapReduce
• Data flow language
• Directed Acyclic Graph (DAG)
• Started at Yahoo! Research
• Joined Apache incubator in 2007
• Graduated to Subproject of Hadoop in 2008
• Top level project in Apache since 2010




Pig Philosophy 
• Pigs eat anything
• Pigs live anywhere
• Pigs are domesticated animals
• Pigs can fly




Components
• Pig Engine – Parser, Optimizer and distributed query
  execution
• Grunt – CLI shell
• Pig Latin – Procedural Language




Why Pig ?
• High level language that increases programmer
  productivity.
• Designed for Parallel Data flow.
• Reduces complexity by abstracting low level Map and
  Reduce jobs and Map Reduce job chaining
• Can be run on a client/gateway machine with no
  configuration on the cluster
• Multiple versions of Pig can co-exist as long as they
  are compatible with Hadoop version.




Running Pig
Pig Latin script executes in 3 modes
• MapReduce: Code executes as MapReduce on a
  Hadoop Cluster
       $ pig myscript.pig
• Local: Code executes locally in a single JVM using
  local data
       $ pig -x local myscript.pig


• Interactive: pig with no script starts the grunt shell
  where commands can be run interactively




GRUNT shell
• fs -ls
• fs -cat filename
• fs -copyFromLocal localfile hdfsfile
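
The fs commands above can be mixed with Pig Latin statements, either typed at the Grunt prompt or placed in a script. A minimal sketch, assuming a hypothetical local file people.csv with three comma-separated columns (the file name and schema are illustrative, not from the slides):

```pig
-- Hypothetical file name; copy it from the local disk into HDFS and list it.
fs -copyFromLocal people.csv people.csv
fs -ls

-- Load the copied file and print it to the console.
people = LOAD 'people.csv' USING PigStorage(',')
         AS (name:chararray, age:int, dept:chararray);
DUMP people;
```

The fs lines take the same arguments as the Hadoop file system shell and do not need a trailing semicolon.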




Data Types
• Scalar Types
  – int, long, float, double, chararray, bytearray, boolean, datetime
• Complex Types
  – Map. Collection of key value pairs
       – [name#alan, age#30]
  – Tuple. Ordered set of values
       – (alan,40,engineering)
  – Bags. Unordered collection of tuples
       – {(alan,40,engineering),(bob,45,sales)}
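
The scalar and complex types listed above can all appear in a LOAD schema. A minimal sketch, assuming a hypothetical tab-delimited file staff.txt whose columns hold a map, a tuple, and a bag in Pig's text notation (all file and field names are made up for illustration):

```pig
-- Hypothetical input and field names.
-- info is a map such as [dept#engineering], home a tuple such as (austin,78701),
-- skills a bag such as {(pig),(hive)}.
staff = LOAD 'staff.txt' AS (
    name:chararray,
    age:int,
    info:map[],
    home:tuple(city:chararray, zip:chararray),
    skills:bag{s:tuple(skill:chararray)}
);

-- '#' looks up a map key, '.' projects a tuple field, COUNT takes a bag.
summary = FOREACH staff GENERATE name,
                                 info#'dept' AS dept,
                                 home.city AS city,
                                 COUNT(skills) AS n_skills;
DESCRIBE summary;
```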




• Relations and a set of operations that work on
  relations
• Schema for relations is optional
• $0… $n can be used for fields in relations
• null means the data is undefined.
• Any missing or invalid fields are loaded as null
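
As a rough illustration of the points above, a relation can be loaded with no schema at all and its fields addressed by position. The file name people.txt and the assumption that the second column holds an age are hypothetical:

```pig
-- No AS clause: fields are unnamed and default to bytearray.
raw = LOAD 'people.txt';

-- Positional references. The cast makes the comparison numeric;
-- a value that cannot be cast becomes null and is dropped by the filter.
adults = FILTER raw BY $1 IS NOT NULL AND (int)$1 >= 18;
names  = FOREACH adults GENERATE $0 AS name;
DUMP names;
```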




Input and Output
• A = LOAD 'file' USING PigStorage(',') AS
  (data1:datatype1, data2:datatype2.. )

• STORE A INTO 'file2' USING PigStorage(',')

• DUMP A

• DESCRIBE A
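
Put together, the four statements above form a complete script. A minimal sketch with hypothetical paths and a hypothetical three-column schema:

```pig
-- Hypothetical paths and schema.
A = LOAD 'input/people.csv' USING PigStorage(',')
    AS (name:chararray, age:int, dept:chararray);

DESCRIBE A;   -- prints the schema of A
DUMP A;       -- runs the pipeline and prints the tuples to the console

-- Write the same data back out, tab-delimited this time.
STORE A INTO 'output/people_tsv' USING PigStorage('\t');
```

STORE fails if the output directory already exists, so the target path should be new.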




Relational Operations
• GROUP A BY age;

• FOREACH A GENERATE $1 - $3;

• FILTER A BY $1 > 10;

• ORDER A BY $1 DESC, $2;

• JOIN A BY $1, B BY $5;
• JOIN A BY ($1, $5) LEFT OUTER, B BY ($2,
  $3);
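
The operators above are usually chained, each statement assigning its result to a new alias. A minimal sketch, assuming two hypothetical comma-delimited files, people.csv (name, age, dept) and depts.csv (dept, city):

```pig
-- Hypothetical inputs and schemas.
people = LOAD 'people.csv' USING PigStorage(',')
         AS (name:chararray, age:int, dept:chararray);
depts  = LOAD 'depts.csv' USING PigStorage(',')
         AS (dept:chararray, city:chararray);

adults  = FILTER people BY age > 21;
by_dept = GROUP adults BY dept;

-- After GROUP, each tuple holds the key (named group) and a bag
-- named after the input relation (here: adults).
stats   = FOREACH by_dept GENERATE group AS dept,
                                   COUNT(adults) AS n,
                                   AVG(adults.age) AS avg_age;
ranked  = ORDER stats BY n DESC, dept;

-- LEFT OUTER keeps departments that have no matching row in depts.
joined  = JOIN ranked BY dept LEFT OUTER, depts BY dept;
DUMP joined;
```

Inside FILTER, ORDER, and JOIN ... BY, fields are referenced by name or position directly; the relation.field form is a bag projection and is used on grouped data, as in AVG(adults.age).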

• LIMIT A 10;

• SAMPLE A 0.1;

• GROUP A BY $1 PARALLEL 10;

• User Defined Functions and Piggybank
  register 'your_path_to_piggybank/piggybank.jar';
  divs = load 'NYSE_dividends';
  backwards = foreach divs generate
  org.apache.pig.piggybank.evaluation.string.Reverse($1);
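
A slightly fuller sketch of the same ideas, combining SAMPLE, PARALLEL, LIMIT, and the Piggybank Reverse UDF. The four-column schema for NYSE_dividends is an assumption added for readability; the slide loads the file without a schema:

```pig
REGISTER 'your_path_to_piggybank/piggybank.jar';

-- DEFINE gives the fully qualified UDF class a short alias.
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Assumed schema; the second field ($1 on the slide) is treated as the stock symbol.
divs = LOAD 'NYSE_dividends'
       AS (exchange:chararray, symbol:chararray, pay_date:chararray, dividend:float);

some  = SAMPLE divs 0.1;                   -- keep roughly 10% of the rows
grpd  = GROUP some BY symbol PARALLEL 10;  -- ask for 10 reducers for this step
backwards = FOREACH grpd GENERATE Reverse(group) AS reversed_symbol,
                                  COUNT(some) AS n;
top10 = LIMIT backwards 10;
DUMP top10;
```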




• Invoking static Java methods

• FLATTEN

• TOKENIZE
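
The three features above can be sketched briefly. The first part is the usual word count, where TOKENIZE and FLATTEN work together; the second wraps a static Java method with one of Pig's built-in invokers. Input file names are hypothetical:

```pig
-- Hypothetical input: one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits a string into a bag of words;
-- FLATTEN turns that bag into one output row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd  = GROUP words BY word;
cnts  = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP cnts;

-- Built-in invoker wrapping the static method java.net.URLDecoder.decode(String, String).
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
urls    = LOAD 'encoded_urls.txt' AS (encoded:chararray);
decoded = FOREACH urls GENERATE UrlDecode(encoded, 'UTF-8');
DUMP decoded;
```

The invokers use reflection, so a purpose-built UDF is likely to be faster when the call sits in a hot path.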




0.10.0 Features
• Ruby UDFs
• PigStorage with schemas
• Additional UDF improvements
• Language Improvements
  – Boolean type
  – OTHERWISE branch for SPLIT
  – Maps, Bags and Tuples can be generated without UDFs
  – Register collection of jars
• Performance Improvements
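
A rough sketch of how the language improvements listed above look in a script. The file students.csv and its schema are hypothetical:

```pig
-- Hypothetical input; enrolled uses the boolean type (true/false in the file).
A = LOAD 'students.csv' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float, enrolled:boolean);

-- OTHERWISE catches every tuple that matches no earlier condition.
SPLIT A INTO honors IF gpa >= 3.5, regular OTHERWISE;

current = FILTER honors BY enrolled == true;

-- Tuple, bag, and map built inline, without calling a UDF.
built = FOREACH current GENERATE (name, age), {(name, gpa)}, [name, gpa];
DUMP built;
```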




Current work in progress
• DateTime datatype
• CUBE, ROLLUP and RANK operators
• Native support for Windows
• Lower memory footprint




References
• Labs are from
  – https://github.com/alanfgates/programmingpig
  – https://github.com/michiard/CLOUDS-LAB


• 0.10.0 Features and current WIP
  – http://www.slideshare.net/hortonworks/pig-out-to-hadoop by Alan
    Gates




Hortonworks Training
                          The expert source for
                          Apache Hadoop training & certification

Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team.
  – The “right” course, with the most extensive and realistic hands-on materials
  – Provide an immersive experience into real-world Hadoop scenarios
  – Public and Private courses available



Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable
    Apache Hadoop expert




Thank You!
Questions & Answers
   Ravi Mutyala
   Systems Architect
   Hortonworks
   Twitter: @rmutyala
   www.hortonworks.com





