Pig programming is fun

Pig programming is more fun: New features in Pig

Daniel Dai (@daijy)
Thejas Nair (@thejasn)

© Hortonworks Inc. 2011 Page 1

What is Apache Pig?
• Pig Latin, a high level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Architecting the Future of Big Data
Page 2

Pig Latin example
• Query: Get the list of pages visited by users whose age is
between 20 and 25 years.

users = load 'users' as (name, age);

users_20_to_25 = filter users by age >= 20 and age <= 25;

page_views = load 'pages' as (user, url);

page_views_u20_to_25 = join users_20_to_25 by name,
page_views by user;
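As a rough illustration of what the script above computes, here is a plain-Python sketch of the same filter and join on small in-memory lists. The sample records and the helper name `pages_for_age_range` are made up for this example, and "between 20 and 25" is treated as inclusive; Pig would run these steps as jobs over files rather than over Python lists.

```python
# Hypothetical in-memory stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 25), ("dave", 19)]
page_views = [("alice", "/home"), ("bob", "/news"), ("carol", "/cart"), ("alice", "/cart")]

def pages_for_age_range(users, page_views, low, high):
    """Filter users by age (inclusive range), then join with page views on the user name."""
    selected = {name for name, age in users if low <= age <= high}
    return [(user, url) for user, url in page_views if user in selected]

result = pages_for_age_range(users, page_views, 20, 25)
```

The filter runs before the join so that only the selected users are matched against the page views.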

Page 3

Why Pig?
• Faster development
–  Fewer lines of code
–  Don’t re-invent the wheel

• Flexible
–  Metadata is optional
–  Extensible
–  Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Page 4

Before pig 0.9
p1.pig p2.pig p3.pig

Page 5

With pig macros
p1.pig p2.pig p3.pig

macro1.pig macro2.pig

Page 6

With pig macros
p1.pig p1.pig rm_bots.pig

get_top.pig

Page 7

Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.

Page 8

Pig Macro example
Without macros:

page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ordered.. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT .
ord_url_count = ORDER url_count..
top_10_urls = LIMIT ord_url.. 10;
DUMP top_10_urls;

With macros:

/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
    grped = GROUP $rel by $col;
    cnt_grp = FOREACH ..COUNT($rel)..
    ord_cnt = ORDER .. by cnt;
    $top_num_recs = LIMIT.. $topNum;
}
-----------------------------------------
page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
DUMP top_5_users;
…
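The idea behind the macro, one parameterized "group, count, order, limit" routine reused for users and for URLs, can be sketched in plain Python. The function `top_count` and the sample records are hypothetical and only mirror the shape of the macro; they are not Pig code.

```python
from collections import Counter

def top_count(records, col, top_num):
    """Group records by the field `col`, count each group, and keep the `top_num` largest."""
    counts = Counter(rec[col] for rec in records)
    return counts.most_common(top_num)

# Hypothetical page view records.
page_views = [
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/home"},
    {"uname": "alice", "url": "/cart"},
    {"uname": "carol", "url": "/home"},
    {"uname": "alice", "url": "/news"},
    {"uname": "bob", "url": "/cart"},
]

top_users = top_count(page_views, "uname", 2)
top_urls = top_count(page_views, "url", 1)
```

Both queries go through the same function, so a fix to the counting logic is made in one place.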

Page 9

Pig macro
• Coming soon – piggybank with pig macros

Page 10

Writing data flow program
• Writing a complex data pipeline is an iterative process

Load Load

Transform Join

Group Transform Filter

Page 11


Load Load

Transform Join


No output! :(

Page 12

• Debug!

Load Load

Transform Join

– Was join on wrong attributes?
– Bug in transform?
– Did filter drop everything?

Page 13

Common approaches to debug
• Running on real (large) data
– Inefficient, takes longer
• Running on (small) samples
– Empty results on join, selective filters
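A small sketch of why naive samples can give empty join results: two inputs share every key, yet their "first 10 rows" samples share none. The data and the helper `join_on_key` are invented for illustration.

```python
def join_on_key(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in right_by_key.get(key, [])]

# Hypothetical inputs: 1000 users and one page view per user, stored in different orders.
users = [(uid, "user%d" % uid) for uid in range(1000)]
views = [(uid, "/page%d" % uid) for uid in reversed(range(1000))]

full_join = join_on_key(users, views)
sample_join = join_on_key(users[:10], views[:10])  # naive "first 10 rows" samples
```

The full join has one row per user, while the sampled join is empty because the two samples happen to contain disjoint keys.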

Page 14

Pig illustrate command
• Objective: show example inputs and outputs for each statement that are
– Realistic
– Complete
– Concise
– Generated fast
• Steps
– Downstream – sample and process
– Prune
– Upstream – generate realistic missing classes of examples
– Prune

Page 15

Illustrate command demo

Page 16

Pig relation-as-scalar
• In Pig, each statement alias is a relation
– A relation is a set of records
• Task: Get list of pages whose load time was more
than average.
• Steps
1.  Compute average load time
2.  Get list of pages whose load time is > average

Page 17

• Step 1 is like
.. = load ..
.. = group ..
al_rel = foreach .. AVG(ltime) as avg_ltime;

• Step 2 looks like
page_views = load 'pviews.txt' as
(url, ltime, ..);

slow_views = filter page_views by
ltime > avg_ltime

Page 18

• Getting results of step 1 (avg_ltime)
– Join result of step 1 with page_views relation, or
– Write result into file, then use UDF to read from file
• Pig scalar feature now simplifies this:
slow_views = filter page_views by
ltime > al_rel.avg_ltime;

– Runtime exception if al_rel has more than one record.
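The scalar behaviour described above can be mimicked in plain Python: a single-record relation is read as one value, and more than one record is an error. The helper `as_scalar`, the sample load times, and the choice to return `None` for an empty relation are assumptions of this sketch, not Pig's implementation.

```python
def as_scalar(relation, field):
    """Read `field` from a relation that is expected to hold a single record."""
    if len(relation) > 1:
        raise ValueError("scalar has more than one record: %d" % len(relation))
    if not relation:
        return None  # assumption of this sketch for an empty relation
    return relation[0][field]

# Hypothetical (url, ltime) records.
page_views = [("/home", 120), ("/cart", 480), ("/news", 300), ("/help", 100)]

# Step 1: a one-record relation holding the average load time.
al_rel = [{"avg_ltime": sum(lt for _, lt in page_views) / float(len(page_views))}]

# Step 2: use that relation as a scalar inside the filter.
avg_ltime = as_scalar(al_rel, "avg_ltime")
slow_views = [(url, lt) for url, lt in page_views if lt > avg_ltime]
```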

Page 19

UDF in Scripting Language
• Benefit
– Use legacy code
– Use library in scripting language
– Leverage Hadoop for non-Java programmers
• Currently supported languages
– Python
– JavaScript
– Ruby
• Extensible Interface
– Minimum effort to support another language

Page 20

Writing a Jython UDF
Write a Jython UDF

@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num == None:
        return None
    return ((num)*(num))

def squareSchema(input):
    return input

• Invoke Jython UDF when needed

register 'util.py' using jython as util;
B = foreach A generate util.square(i);

• Type conversion
–  Simple type
–  Python Array <-> Pig Bag
–  Python Dict <-> Pig Map
–  Python Tuple <-> Pig Tuple
• Convey schema to Pig
–  outputSchema
–  outputSchemaFunction

Page 21

Use NLTK in Pig
• Example
register 'nltk_util.py' using jython as nltk;
……
B = foreach A generate nltk.tokenize(sentence);

nltk_util.py
import nltk
porter = nltk.PorterStemmer()
@outputSchema("words:{(word:chararray)}")
def tokenize(sentence):
    tokens = nltk.word_tokenize(sentence)
    words = [porter.stem(t) for t in tokens]
    return words

Page 22

Writing a Script Engine
Writing a bridge UDF
class JythonFunction extends EvalFunc<Object> {
    public Object exec(Tuple tuple) {
        // Convert the Pig tuple into Python arguments and call the Python function
        PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
        PyObject result = function.__call__(params);
        return JythonUtils.pythonToPig(result);
    }
    public Schema outputSchema(Schema input) {
        // Read the schema string that the decorator attached to the function
        PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
        return Utils.getSchemaFromString(outputSchemaDef.toString());
    }
}
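The bridge above looks up an `outputSchema` attribute on the Python function. One way a decorator could provide that attribute is sketched below; this is a minimal illustration under that assumption and may not match the decorator that ships with Pig.

```python
def outputSchema(schema_def):
    """Return a decorator that records a schema string on the function it wraps."""
    def decorator(func):
        func.outputSchema = schema_def  # attribute a bridge could look up later
        return func
    return decorator

@outputSchema("word:chararray")
def concat(word):
    return word + word

# What a bridge might read back before calling the function.
schema = getattr(concat, "outputSchema", None)
```

The decorated function still behaves as before; the schema travels with it as plain metadata.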

Page 23

Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
    public void registerFunctions(String path, String namespace, PigContext pigContext) {
        PythonInterpreter pi = Interpreter.interpreter;
        pi.execfile(path);
        // Register every name the script defined as a Pig function
        for (PyTuple item : pi.getLocals().items()) {
            String key = (String) item.get(0);
            FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                + path + "','" + key + "')");
            pigContext.registerFunction(namespace + key, funcspec);
        }
    }
}
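The registration step, which runs the script and then records each function it defined under a namespace, can be sketched in plain Python. The `registry` dictionary, the `"."` separator, and the rule of skipping names that start with an underscore are assumptions made for this illustration.

```python
def register_functions(source, namespace, registry):
    """Execute `source` and register each public function under `namespace`."""
    scope = {}
    exec(source, scope)  # run the script so its definitions land in `scope`
    for name, value in scope.items():
        if callable(value) and not name.startswith("_"):
            registry[namespace + "." + name] = value
    return registry

# Hypothetical contents of util.py.
util_source = '''
def concat(word):
    return word + word

def square(num):
    if num is None:
        return None
    return num * num
'''

registry = register_functions(util_source, "util", {})
```

After this call, `registry["util.square"]` refers to the function defined in the script, much as `util.square` does in the Pig statement above.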

Page 24

Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
  output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end
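The three stages of the algebraic UDF above can be traced in plain Python: `initial` turns each record into a partial count, and `intermed` and `final` combine partial counts. Splitting the input into two chunks is a made-up stand-in for separate map tasks.

```python
def initial(record):
    """Map one input record to a partial count: 0 for a missing record, else 1."""
    return 0 if record is None else 1

def intermed(partials):
    """Combine partial counts into one partial count."""
    return sum(partials)

def final(partials):
    """Combine the remaining partial counts into the result."""
    return intermed(partials)

records = ["a", "b", None, "c", "d", "e"]
# Hypothetical split into two chunks, as if two tasks each saw part of the input.
chunks = [records[:3], records[3:]]
partials = [intermed([initial(r) for r in chunk]) for chunk in chunks]
total = final(partials)
```

Because partial counts can be summed in any grouping, the count can be combined early on each chunk instead of only at the end.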

Page 25

Pig Embedding
• Embed Pig inside scripting language
– Python
– JavaScript
• Algorithms that cannot be completed with a single Pig script
– Iterative algorithms
PageRank, K-means, Neural Network, Apriori, etc.
– Parallel execution
Random forest
– Divide and Conquer
– Branching
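For the iterative case, the host script usually takes the form of a driver loop: run one step, check for convergence, repeat. The sketch below shows only that loop in plain Python. The function names are hypothetical, and a simple numeric update stands in for what would be one Pig job per iteration.

```python
def run_until_converged(step, state, tolerance, max_iterations):
    """Apply `step` repeatedly until the state changes by less than `tolerance`."""
    for iteration in range(1, max_iterations + 1):
        new_state = step(state)
        if abs(new_state - state) < tolerance:
            return new_state, iteration
        state = new_state
    return state, max_iterations

def newton_sqrt2_step(x):
    """Stand-in for one job: a Newton step toward the square root of 2."""
    return 0.5 * (x + 2.0 / x)

value, iterations = run_until_converged(newton_sqrt2_step, 1.0, 1e-9, 50)
```

In an embedded Pig script, `step` would bind the current state into a compiled script, run it, and read the new state from its output.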

Page 26

Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch Pig script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise RuntimeError("Pig job FAILED")

Page 27

Pig Embedding
• Running an embedded Pig script
pig sample.py
• What happens within Pig?
[Diagram: sample.py (Python script) -> Pig -> Python script -> Jython -> Pig script -> Pig]

Page 28

Nested Operator
• Nested Operator: Operator inside foreach
B = group A by name;
C = foreach B {
    C0 = limit A 10;
    generate C0;
}

• Before Pig 0.10, the supported nested operators were
– DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
– CROSS, FOREACH
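The nested LIMIT example above amounts to "group, then trim each group". A plain-Python sketch of that shape follows. The helper `limit_per_group` and the sample tuples are hypothetical, and keeping the first `n` records of each group is a choice of this sketch.

```python
from collections import defaultdict

def limit_per_group(records, key_index, n):
    """Group records by one field, then keep at most `n` records in each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    return {key: recs[:n] for key, recs in groups.items()}

# Hypothetical (name, value) records.
A = [("alice", 1), ("bob", 2), ("alice", 3), ("alice", 4), ("bob", 5)]
C = limit_per_group(A, 0, 2)
```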

Page 29

Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration,
    contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
    C1 = filter A by gpa > 4;
    C2 = filter B by contributions > 500;
    C3 = cross C1, C2;
    C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
    generate flatten(C4);
}
store D into 'output';
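What the nested CROSS and FOREACH compute per name can be traced in plain Python: filter each side, pair every remaining student record with every remaining voter record, and format each pair. The records are reduced to (name, value) tuples, and the thresholds are passed as parameters; both are simplifications made for this sketch.

```python
from itertools import product

def cross_per_name(students, voters, min_gpa, min_contrib):
    """For each name, pair every qualifying gpa with every qualifying contribution."""
    names = sorted({s[0] for s in students} | {v[0] for v in voters})
    out = []
    for name in names:
        gpas = [gpa for n, gpa in students if n == name and gpa > min_gpa]
        contribs = [c for n, c in voters if n == name and c > min_contrib]
        out.extend("%s_%s" % (g, c) for g, c in product(gpas, contribs))
    return out

# Hypothetical (name, gpa) and (name, contributions) records.
students = [("alice", 3.5), ("alice", 3.9), ("bob", 3.8)]
voters = [("alice", 600.0), ("alice", 100.0), ("bob", 50.0)]
D = cross_per_name(students, voters, 3.0, 500.0)
```

A name contributes no rows when either filtered side is empty, which is why "bob" does not appear in the result.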

Page 30

Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage

Page 31

New operators to come
• Will be available in Pig 0.11
– RANK
– A distributed RANK implementation for Pig

– CUBE

Page 32
