Practical Pig + PigUnit




 Michael G. Noll, Verisign
 July 2012
This talk is about Apache Pig

   • High-level data flow language (think: DSL) for writing
     Hadoop MapReduce jobs
   • Why and when should you care about Pig?
           • You are a Hadoop beginner
                  • … and want to implement a JOIN, for instance
           • You are a Hadoop expert
           • You only scratch your head when you see
                public static void main(String args...)
           • You think Java is not the best tool for this job [pun!]
                  • Think: too low-level, too many lines of code, no interactive mode
                    for exploratory analysis, readability > performance, et cetera




                     Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation.
Verisign Public      Java is a trademark of Oracle Corporation.                                      2
A basic Pig script

   • Example: sorting user records by users’ age
           records = LOAD '/path/to/input'
                        AS (user:chararray, age:int);

           sorted_records = ORDER records BY age DESC;

           STORE sorted_records INTO '/path/to/output';



   • Popular alternatives to Pig
           • Hive: ~ SQL for Hadoop
           • Hadoop Streaming: use any programming language for MR
                  • Even though you still write code in a “real” programming
                    language, Streaming makes this more convenient than writing
                    native Hadoop Java code.

Preliminaries

   • Talk is based on Pig 0.10.0, released in April ’12
   • Some notable 0.10.0 improvements
           •      Hadoop 2.0 support
           •      Loading and storing JSON
           •      Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
           •      Amazon S3 support




Testing Pig – a primer




“Testing” Pig scripts – some examples


              DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP


              $ pig -x local


              $ pig [-debug | -dryrun]


              $ pig -param input=/path/to/small-sample.txt




“Testing” Pig scripts (cont.)

   • JobTracker UI              • PigStats, JobStats,
                                  HadoopJobHistoryLoader



  Now what have you been using?



     Also: inspecting Hadoop log files, …


However…

   • Previous approaches are primarily useful (and used)
     for creating the Pig script in the first place
           • Like ILLUSTRATE
   • None of them are really geared towards unit testing
   • Difficult to automate (think: production environment)
                  #!/bin/bash
                  pig -param date=$1 -param output=$2 myscript.pig
                  hadoop fs -copyToLocal $2 /tmp/jobresult
                  if [ ARGH!!! ] ...


   • Difficult to integrate into a typical development
     workflow, e.g. backed by Maven, Java and a CI server
                  $ mvn clean test              ??

Verisign Public     Maven is a trademark of the Apache Software Foundation.               8
PigUnit




PigUnit

   • Available in Pig since version 0.8
              “PigUnit provides a unit-testing framework that plugs into JUnit
              to help you write unit tests that can be run on a regular basis.”
              -- Alan F. Gates, Programming Pig

   • Easy way to add Pig unit testing to your dev workflow
     iff you are a Java developer
           • See “Tips and Tricks” later for working around this constraint
   • Works with both JUnit and TestNG
   • PigUnit docs have “potential”
           • Some basic examples, then it’s looking at the source code of
             both PigUnit and Pig (but it’s manageable)
   • http://pig.apache.org/docs/r0.10.0/test.html#pigunit

Getting PigUnit up and running

   • PigUnit is not included in current Pig releases :(
   • You must manually build the PigUnit jar file

         $ cd /path/to/pig-sources # can be a release tarball
         $ ant jar pigunit-jar
         ...
         $ ls -l pig*jar
         -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
         -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar



   • Add these jar(s) to your CLASSPATH, done!




PigUnit and Maven

   • Unfortunately the Apache Pig project does not yet
     publish an official Maven artifact for PigUnit
                  WILL NOT WORK IN pom.xml :(
                  <dependency>
                      <groupId>org.apache.pig</groupId>
                      <artifactId>pigunit</artifactId>
                      <version>0.10.0</version>
                  </dependency>

   • Alternatives:
           •      Publish to your local Artifactory instance
           •      Use a local file-based <repository>
           •      Use a <system> scope in pom.xml (not recommended)
           •      Use trusted third-party repos like Cloudera’s


Verisign Public       Artifactory is a trademark of JFrog ltd.        12
A simple PigUnit test




A simple PigUnit test

   • Here, we provide input + output data in the Java code
   • Pig script is read from file wordcount.pig
           @Test
           public void testSimpleExample() throws Exception {
               PigTest simpleTest = new PigTest("wordcount.pig");

               String[] input = { "foo", "bar", "foo" };
               String[] expectedOutput = {
                   "(foo,2)",
                   "(bar,1)"
               };

               simpleTest.assertOutput(
                   "aliasInput", input,
                   "aliasOutput", expectedOutput
               );
           }
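To make the expected tuples concrete: below is a plain-Java sketch (no Pig involved) of the word count that the hypothetical wordcount.pig is assumed to implement, emitting results in the same "(word,count)" text form that the expectedOutput strings above use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountTuples {
    // Plain-Java stand-in for the word count that wordcount.pig is
    // assumed to perform. Emits one "(word,count)" string per distinct
    // word, in first-seen order, matching the tuple text form that
    // PigUnit compares against.
    static List<String> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        List<String> tuples = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tuples.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return tuples;
    }
}
```

For the input { "foo", "bar", "foo" } this yields "(foo,2)" and "(bar,1)", i.e. exactly the expectedOutput of the test.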
A simple PigUnit test (cont.)

   • wordcount.pig
           -- PigUnit populates the alias 'aliasInput'
           -- with the test input data
           aliasInput = LOAD '<tmpLoc>' AS <schema>;

           -- ...here comes your actual code...

           -- PigUnit will treat the contents of the alias
           -- 'aliasOutput' as the actual output data in
           -- the assert statement
           aliasOutput = <your_final_statement>;

           -- Note: PigUnit ignores STORE operations by default
           STORE aliasOutput INTO 'output';




A simple PigUnit test (cont.)
                   simpleTest.assertOutput(
       1               "aliasInput", input,
       2               "aliasOutput", expectedOutput
                   );



       1          Pig injects input[] = { "foo", "bar", "foo" } into the
                  alias named aliasInput in the Pig script.
                  For this purpose Pig creates a temporary file, writes the
                  equivalent of StringUtils.join(input, "\n") to the file,
                  and finally makes its location available to the LOAD operation.


       2          Pig opens an iterator on the content of aliasOutput, and runs
                  assertEquals() based on StringUtils.join(..., "\n")
                  with expectedOutput and the actual content.

           See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
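As a rough illustration of that compare step, here is a simplified sketch (not PigUnit's actual code): drain the iterator over the actual alias contents, join both sides with newlines, and compare the resulting strings.

```java
import java.util.Iterator;
import java.util.List;

public class AssertOutputSketch {
    // Simplified stand-in for PigUnit's assertOutput() comparison:
    // the expected tuples and the actual tuples are each joined with
    // "\n" and the two strings are compared for equality.
    static boolean outputsMatch(List<String> expected, Iterator<String> actual) {
        StringBuilder actualJoined = new StringBuilder();
        while (actual.hasNext()) {
            if (actualJoined.length() > 0) actualJoined.append('\n');
            actualJoined.append(actual.next());
        }
        return String.join("\n", expected).contentEquals(actualJoined);
    }
}
```

Note that this makes the comparison order-sensitive, which is why the tuple order in expectedOutput matters.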

PigUnit drawbacks

• How to divide your “main” Pig script into testable units?
       • Only run a single end-to-end test for the full script?
       • Extract testable snippets from the main script?
                  • Argh, code duplication!
       • Split the main script into logical units = smaller scripts; then run
         individual tests and include the smaller scripts in the main script
                  • Ok-ish but splitting too much makes the Pig code hard to
                    understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
       • It does work but it is not as convenient or powerful as you’d like.
                  • Notably you still need to know and write Java to use it. But one
                    compelling reason for Pig is that you can do without Java.
       • You may end up writing your own wrapper/helper lib around it.
                  • Consider contributing this back to the Apache Pig project!


Tips and tricks




Connecting to a real cluster (default: local mode)

     // this is not enough to enable cluster mode in PigUnit
     pigServer = new PigServer(ExecType.MAPREDUCE);
     // ...do PigUnit stuff...

     // rather:
     Properties props = System.getProperties();
     if (clusterMode)
         props.setProperty("pigunit.exectype.cluster", "true");
     else
         props.remove("pigunit.exectype.cluster"); // Properties has no removeProperty()

   • $HADOOP_CONF_DIR must be in CLASSPATH
   • Similar approach for enabling LZO support
           • mapred.output.compress => "true"
           • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
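The toggle above can be wrapped in a small helper. This is a sketch assuming the pigunit.exectype.cluster property name shown on this slide; note that java.util.Properties has no removeProperty() method, so the inherited remove() is the correct call.

```java
import java.util.Properties;

public class PigUnitExecMode {
    static final String CLUSTER_KEY = "pigunit.exectype.cluster";

    // Set or clear the system property that PigUnit inspects to decide
    // whether tests should run against a real cluster instead of local mode.
    static void setClusterMode(boolean clusterMode) {
        Properties props = System.getProperties();
        if (clusterMode) {
            props.setProperty(CLUSTER_KEY, "true");
        } else {
            props.remove(CLUSTER_KEY);
        }
    }
}
```

Calling this once in a test setup method (e.g. a JUnit @Before) keeps the local/cluster decision in one place.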



Write a convenient PigUnit runner for your users

   • Pig user != Java developer
   • Pig users should only need to provide three files:
           •    pig/myscript.pig
           • input/testdata.txt
           • output/expected.txt
   • PigUnit runner discovers and runs tests for users
           • PigTest#assertOutput() can also handle files
           • But you must manage file uploads and similar “glue” yourself

      pigUnitRunner.runPigTest(
          new Path(scriptFile),
          new Path(inputFile),
          new Path(expectedOutputFile)
      );
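The discovery half of such a runner can be sketched in plain Java. The directory layout and naming convention are assumptions based on the three-file list above: a script pig/&lt;name&gt;.pig plus matching input/&lt;name&gt;.txt and output/&lt;name&gt;.txt.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PigTestDiscovery {
    // Hypothetical convention: a test named <name> consists of
    //   <base>/pig/<name>.pig, <base>/input/<name>.txt, <base>/output/<name>.txt
    // Returns the names of all complete test triples under <base>.
    static List<String> discoverTestNames(Path base) throws IOException {
        List<String> names = new ArrayList<>();
        Path pigDir = base.resolve("pig");
        if (!Files.isDirectory(pigDir)) return names;
        try (DirectoryStream<Path> scripts = Files.newDirectoryStream(pigDir, "*.pig")) {
            for (Path script : scripts) {
                String name = script.getFileName().toString().replaceAll("\\.pig$", "");
                boolean hasInput = Files.exists(base.resolve("input").resolve(name + ".txt"));
                boolean hasExpected = Files.exists(base.resolve("output").resolve(name + ".txt"));
                if (hasInput && hasExpected) names.add(name);
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

The runner would then invoke something like runPigTest() per discovered name; the upload and cleanup "glue" mentioned above still has to be written by hand.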


Slightly off-topic: Java/Pig combo

   • Pig API provides nifty features to control Pig workflows
     through Java
           • Similar to how working with PigUnit feels
   • Definitely worth a look!
   // 'pigParams' is the main glue between Java and Pig here,
   // e.g. to specify the location of input data
   pigServer.registerScript(scriptInputStream, pigParams);

   ExecJob job = pigServer.store(
           "aliasOutput",
           "/path/to/output",
           "PigStorage()"
       );

   if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
       System.out.println("Happy world!");

Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.
