Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
• Pig Latin, a high level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
Pig-latin example
• Query : Get the list of pages visited by users whose age is
  between 20 and 25 years.

users = load 'users' as (name, age);

users_20_to_25 = filter users by age >= 20 and age <= 25;

page_views = load 'pages' as (user, url);

page_views_u20_to_25 = join users_20_to_25 by name,
                            page_views by user;
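
For readers who do not know Pig Latin, the query above can be sketched in plain Python on in-memory data. The sample records are made up for illustration, and treating both age bounds as inclusive is an assumption about what "between 20 and 25" means.

```python
# Hypothetical in-memory stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 25), ("dave", 19)]
page_views = [("alice", "/home"), ("bob", "/cart"), ("carol", "/home"),
              ("alice", "/search"), ("dave", "/home")]

# Filter step: keep users aged 20 to 25 (both bounds inclusive, an assumption).
users_20_to_25 = [(name, age) for (name, age) in users if 20 <= age <= 25]

# Join step: index page views by user, then match on the user name.
views_by_user = {}
for user, url in page_views:
    views_by_user.setdefault(user, []).append(url)

# Like a Pig join, each output record carries the fields of both inputs.
joined = [(name, age, name, url)
          for (name, age) in users_20_to_25
          for url in views_by_user.get(name, [])]

for record in joined:
    print(record)
```

The Pig version expresses the same filter and join in four statements and leaves the distribution across a cluster to the engine.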

Why pig ?
• Faster development
  –  Fewer lines of code
  –  Don’t re-invent the wheel

• Flexible
  –  Metadata is optional
  –  Extensible
  –  Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before pig 0.9
  [Diagram: scripts p1.pig, p2.pig, p3.pig]




With pig macros
  [Diagram: scripts p1.pig, p2.pig, p3.pig; macro files macro1.pig, macro2.pig]




With pig macros
  [Diagram: p1.pig on one side; p1.pig with rm_bots.pig and get_top.pig on the other]




Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.




Pig Macro example
Without macros:

  page_views = LOAD ..
  /* get top 5 users by page view */
  u_grp = GROUP .. by uname;
  u_count = FOREACH .. COUNT ..
  ord_u_count = ORDER u_count ..
  top_5_users = LIMIT ordered.. 5;
  DUMP top_5_users;

  /* get top 10 urls by page view */
  url_grp = GROUP .. by url;
  url_count = FOREACH .. COUNT .
  ord_url_count = ORDER url_count..
  top_10_urls = LIMIT ord_url.. 10;
  DUMP top_10_urls;

With a macro:

  /* top x macro */
  DEFINE topCount (rel, col, topNum)
  RETURNS top_num_recs {
   grped = GROUP $rel by $col;
   cnt_grp = FOREACH ..COUNT($rel)..
   ord_cnt = ORDER .. by cnt;
   $top_num_recs = LIMIT.. $topNum;
  }
  -----------------------------------------
  page_views = LOAD ..
  /* get top 5 users by page view */
  top_5_users = topCount(page_views, uname, 5);
  DUMP top_5_users;
  …
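
The idea behind the topCount macro, one parameterized definition reused for both queries, can be sketched as an ordinary Python function. This is only an analogy: the name `top_count`, the sample records, and the tie-breaking rule are assumptions made for illustration and are not part of the slides.

```python
from collections import Counter

def top_count(records, col, top_num):
    """Group records by field `col`, count each group, and return the top_num largest."""
    counts = Counter(rec[col] for rec in records)
    # Order by descending count; break ties by value so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_num]

# Hypothetical page_views records.
page_views = [
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/home"},
    {"uname": "alice", "url": "/search"},
    {"uname": "carol", "url": "/home"},
    {"uname": "alice", "url": "/cart"},
    {"uname": "bob", "url": "/cart"},
]

# The same definition serves both queries, as the macro does.
top_users = top_count(page_views, "uname", 2)
top_urls = top_count(page_views, "url", 1)
print(top_users)
print(top_urls)
```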


Pig macro
• Coming soon – piggybank with pig macros




Writing data flow program
• Writing a complex data pipeline is an iterative process

  [Diagram: data flow with stages Load, Load, Transform, Join, Group, Transform, Filter]




Writing data flow program


  [Diagram: data flow with stages Load, Load, Transform, Join, Group, Transform, Filter, ending in "No output!"]




Writing data flow program
• Debug!

  [Diagram: data flow with stages Load, Load, Transform, Join, Group, Transform, Filter, annotated with questions]
  – Was join on wrong attributes?
  – Bug in transform?
  – Did filter drop everything?



Common approaches to debug
• Running on real (large) data
  – Inefficient, takes longer
• Running on (small) samples
  – Empty results on join, selective filters
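
A small Python experiment illustrates why independently sampled inputs tend to produce empty joins. The data set, the 1% sampling rate, and the helper `join_by_name` are all assumptions chosen for the illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs behave the same way

# Hypothetical data: 10,000 users, one page view per user.
users = [("user%d" % i, 20 + i % 30) for i in range(10000)]
page_views = [("user%d" % i, "/page%d" % (i % 50)) for i in range(10000)]

def join_by_name(users, page_views):
    """Keep the page views whose user appears in the users input."""
    names = set(name for name, _ in users)
    return [(user, url) for user, url in page_views if user in names]

full_join = join_by_name(users, page_views)

# Sample 1% of each input independently, as one might do to debug quickly.
users_sample = random.sample(users, 100)
views_sample = random.sample(page_views, 100)
sample_join = join_by_name(users_sample, views_sample)

print(len(full_join), len(sample_join))
```

Each sampled page view finds its user in the other sample with probability 100 / 10,000, so only about one match is expected out of 100 sampled views, while the full join keeps every record.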




Pig illustrate command
• Objective: show examples of the input and output of each statement that are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  – Downstream – sample and process
  – Prune
  – Upstream – generate realistic missing classes of examples
  – Prune


Illustrate command demo




Pig relation-as-scalar
• In pig each statement alias is a relation
   – Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1.  Compute average load time
   2.  Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 is like
  .. = load ..
  ..= group ..
  al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as
                               (url, ltime, ..);

   slow_views = filter page_views by
                         ltime > avg_ltime




Pig relation-as-scalar
• Getting results of step 1 (avg_ltime)
   – Join result of step 1 with page_views relation, or
   – Write result into file, then use udf to read from file
• Pig scalar feature now simplifies this:
   slow_views = filter page_views by
                         ltime > al_rel.avg_ltime;


   – Runtime exception if al_rel has more than one record.
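
The behaviour of the scalar projection can be sketched in Python. The helper `as_scalar`, the sample load times, and the error message are assumptions for illustration. Only the "more than one record" failure comes from the slide, and the empty-relation case is not modeled.

```python
def as_scalar(relation, field):
    """Mimic al_rel.avg_ltime: read one field from a relation holding a single record."""
    if len(relation) > 1:
        # The slide states that Pig fails at runtime in this case.
        raise RuntimeError("relation used as scalar has more than one record")
    return relation[0][field]

# Hypothetical (url, ltime) records.
page_views = [("/home", 120.0), ("/search", 480.0), ("/cart", 300.0)]

# Step 1: compute the average load time as a one-record relation.
avg_ltime = sum(ltime for _, ltime in page_views) / len(page_views)
al_rel = [{"avg_ltime": avg_ltime}]

# Step 2: use the one-record relation as if it were a scalar value.
threshold = as_scalar(al_rel, "avg_ltime")
slow_views = [(url, ltime) for url, ltime in page_views if ltime > threshold]
print(slow_views)
```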




UDF in Scripting Language
• Benefit
   – Use legacy code
   – Use library in scripting language
   – Leverage Hadoop for non-Java programmers
• Currently supported languages
   – Python
   – JavaScript
   – Ruby
• Extensible Interface
   – Minimum effort to support another language



Writing a Jython UDF
• Write a Jython UDF

  @outputSchema("word:chararray")
  def concat(word):
    return word + word

  @outputSchemaFunction("squareSchema")
  def square(num):
    if num is None:
        return None
    return ((num)*(num))

  def squareSchema(input):
    return input

• Invoke Jython UDF when needed

  register 'util.py' using jython as util;
  B = foreach A generate util.square(i);

• Type conversion
    –  Simple type
    –  Python Array <-> Pig Bag
    –  Python Dict <-> Pig Map
    –  Python Tuple <-> Pig Tuple
• Convey schema to Pig
    –  outputSchema
    –  outputSchemaFunction
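
UDF bodies like these are ordinary Python, so they can be exercised outside Pig. The sketch below assumes a stand-in `outputSchema` decorator that only records the schema string as a function attribute. It is written for local testing and is not the decorator Pig supplies at registration time.

```python
def outputSchema(schema):
    """Stand-in decorator for local testing: attach the schema string to the function."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("word:chararray")
def concat(word):
    return word + word

def square(num):
    # Pig passes nulls as None, so guard against them.
    if num is None:
        return None
    return num * num

print(concat("pig"), concat.outputSchema)
print(square(7), square(None))
```

Keeping the Pig-specific part limited to the decorator means the same function bodies can be covered by a plain unit test.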

Use NLTK in Pig
• Example
   register 'nltk_util.py' using jython as nltk;
   ……
   B = foreach A generate nltk.tokenize(sentence);

 nltk_util.py
   import nltk
   porter = nltk.PorterStemmer()
   @outputSchema("words:{(word:chararray)}")
   def tokenize(sentence):
     tokens = nltk.word_tokenize(sentence)
     words = [porter.stem(t) for t in tokens]
     return words



Writing a Script Engine
Writing a bridge UDF
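class JythonFunction extends EvalFunc<Object> {
   // Python function wrapped by this UDF (looked up in the constructor, not shown)
   private PyFunction function;

   public Object exec(Tuple tuple) {
     // Convert the Pig tuple into Python call arguments
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     PyObject result = function.__call__(params);
     // Convert the Python result back into a Pig type
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     // Read the schema string attached by the @outputSchema decorator
     PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}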




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace,
                                 PigContext pigContext) {
     PythonInterpreter pi = Interpreter.interpreter;
     // Run the script so its functions are defined in the interpreter
     pi.execfile(path);
     // Register each name defined by the script as a Pig function
     for (PyTuple item : pi.getLocals().items()) {
        String key = item.get(0).toString();
        FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                   + path + "','" + key + "')");
        pigContext.registerFunction(namespace + key, funcspec);
     }
   }
}
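
The registration step can be mimicked in a few lines of Python: execute the script, then record every function it defined under the chosen namespace. This is a sketch of the idea only. The `register_functions` helper, the dictionary registry, and the dotted key format are assumptions, not Pig's implementation.

```python
def register_functions(source, namespace, registry):
    """Execute script source and register each callable it defines as namespace.name."""
    scope = {}
    exec(source, scope)  # plays the role of pi.execfile(path)
    for name, value in scope.items():
        # Skip interpreter-provided names such as __builtins__.
        if callable(value) and not name.startswith("__"):
            registry[namespace + "." + name] = value

# Hypothetical contents of util.py, inlined so the example is self-contained.
util_source = """
def square(num):
    return num * num

def concat(word):
    return word + word
"""

registry = {}
register_functions(util_source, "util", registry)
print(sorted(registry))
print(registry["util.square"](4))
```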



Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
   output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end
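
The three phases of an algebraic function can be traced by hand in Python. The sketch below assumes two input splits and plain lists in place of Pig bags. The function names mirror the JRuby methods, but the code is an illustration and not a Pig API.

```python
def initial(t):
    """Per-record phase: a non-null record counts as 1."""
    return 0 if t is None else 1

def intermed(partials):
    """Combiner phase: add up partial counts."""
    return sum(partials)

def final(partials):
    """Reducer phase: for COUNT this is the same as combining."""
    return intermed(partials)

records = ["a", None, "b", "c", None, "d"]

# Pretend the input was read by two map tasks.
split1, split2 = records[:3], records[3:]
partial1 = intermed([initial(t) for t in split1])
partial2 = intermed([initial(t) for t in split2])

total = final([partial1, partial2])
print(partial1, partial2, total)
```

Because partial counts can be combined in any grouping, much of the work can happen before the data is shuffled, which is what makes the function algebraic.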


Pig Embedding
• Embed Pig inside scripting language
  – Python
  – JavaScript
• Algorithms which cannot complete using one Pig script
  – Iterative algorithm
      PageRank, Kmeans, Neural Network, Apriori, etc.
  – Parallel execution
      Random forest
  – Divide and Conquer
  – Branching
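
The control flow that an iterative algorithm needs from the host language is a loop with a convergence test. The sketch below shows only that driver loop. `run_step` is a hypothetical stand-in for binding and launching one Pig job, and here it simply performs a Newton step toward the square root of 2.

```python
def run_step(value):
    """Hypothetical stand-in for one launched Pig job: refine the current estimate."""
    return 0.5 * (value + 2.0 / value)

def iterate(start, tolerance=1e-9, max_iterations=50):
    """Repeat the step until the result stops changing or the iteration cap is hit."""
    current = start
    for i in range(1, max_iterations + 1):
        nxt = run_step(current)
        if abs(nxt - current) < tolerance:
            return nxt, i
        current = nxt
    return current, max_iterations

root, steps = iterate(1.0)
print(root, steps)
```

A single Pig script has no loop or branch construct, so the decision to run another round has to live in the embedding script.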




Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch Pig script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise RuntimeError("Pig job FAILED")
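
Conceptually, the bind step fills the `$in` placeholder in the compiled script with a value from the dictionary. The standard-library `string.Template` gives a rough feel for this in plain Python. It is an analogy only and does not reproduce Pig's own substitution rules.

```python
from string import Template

# The script text from the slide, with $in as a placeholder.
script = Template("A = load '$in' as (name, age, gpa); store A into 'output';")

# The dictionary plays the role of the argument passed to bind.
params = {"in": ":INPATH:/singlefile/studenttab10k"}
bound = script.substitute(params)
print(bound)
```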



Pig Embedding
 • Running embedded Pig script
    pig sample.py
 • What happens within Pig?
    [Diagram: sample.py → Pig → (Python script) → Jython → (Pig script) → Pig]




Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate C0;
  }


• Before Pig 0.10, supported nested operators
  – DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  – CROSS, FOREACH
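
The nested LIMIT example amounts to "group, then truncate each group's bag". A Python sketch under assumed sample data follows. It uses a limit of 2 instead of 10 to keep the output short, and it takes the first records of each bag, which Pig's LIMIT does not promise.

```python
from collections import defaultdict

# Hypothetical (name, value) records standing in for relation A.
A = [("alice", 1), ("bob", 2), ("alice", 3), ("alice", 4), ("bob", 5)]

# B = group A by name;
B = defaultdict(list)
for rec in A:
    B[rec[0]].append(rec)

# C = foreach B { C0 = limit A 2; generate C0; }
C = {name: bag[:2] for name, bag in B.items()}

for name in sorted(C):
    print(name, C[name])
```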



Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration,
                           contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
   C1 = filter A by gpa > 4;
   C2 = filter B by contributions > 500;
   C3 = cross C1, C2;
   C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'),
                                   (chararray)contributions);
   generate flatten(C4);
}
store D into 'output';
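
The per-group logic of this script can be followed step by step in Python. The sample records are invented, and Python's number formatting may differ from Pig's chararray cast, so treat the printed strings as illustrative.

```python
from itertools import product

# Hypothetical student (name, age, gpa) and voter (name, age, registration, contributions) records.
students = [("alice", 20, 4.5), ("bob", 22, 3.1), ("alice", 21, 4.2)]
voters = [("alice", 20, "democrat", 600.0), ("alice", 20, "green", 100.0),
          ("bob", 22, "independent", 900.0)]

D = []
# cogroup A by name, B by name: one pair of bags per distinct name.
for name in sorted(set(r[0] for r in students) | set(r[0] for r in voters)):
    a_bag = [r for r in students if r[0] == name]
    b_bag = [r for r in voters if r[0] == name]
    c1 = [r for r in a_bag if r[2] > 4]        # filter A by gpa > 4
    c2 = [r for r in b_bag if r[3] > 500]      # filter B by contributions > 500
    c3 = product(c1, c2)                       # cross C1, C2
    c4 = ["%s_%s" % (s[2], v[3]) for s, v in c3]
    D.extend(c4)                               # generate flatten(C4)

print(D)
```

Only names with at least one record surviving each filter contribute output, since the cross product with an empty bag is empty.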




Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage




New operators to come
• Will be available in Pig 0.11
   – RANK
       – A distributed RANK implementation for Pig

   – CUBE




