Apache Pig Data Operations
An Example 
• Let’s look at a simple example by writing the program to 
calculate the maximum recorded temperature by year for the 
weather dataset in Pig Latin. 
• Data: 
YEAR TMP QUALITY 
1950 0 1 
1950 22 1 
1950 -11 1 
1949 111 1 
1949 78 1 
• Start up Grunt in local mode, then enter the first line of the Pig 
script: 
records = LOAD 'input/ncdc/micro-tab/sample.txt' 
AS (year:chararray, temperature:int, quality:int); 
• For simplicity, the program assumes that the input is tab-delimited 
text, with each line having just year, temperature, and 
quality fields.
An Example 
records = LOAD 'input/ncdc/micro-tab/sample.txt' 
AS (year:chararray, temperature:int, quality:int); 
• This line describes the input data we want to process. 
• The year:chararray notation describes the field’s name and type; a chararray 
is like a Java string, and an int is like a Java int. 
• The LOAD operator takes a URI argument; here we are just using a local file, 
but we could refer to an HDFS URI. 
• The AS clause (which is optional) gives the fields names to make it 
convenient to refer to them in subsequent statements. 
• The result of the LOAD operator, indeed any operator in Pig Latin, is a 
relation, which is just a set of tuples. 
• A tuple is just like a row of data in a database table, with multiple fields in a 
particular order. 
• In this example, the LOAD function produces a set of (year, temperature, 
quality) tuples that are present in the input file. 
• We write a relation with one tuple per line, where tuples are represented as 
comma-separated items in parentheses: (1950,0,1)
An Example 
• Relations are given names, or aliases, so they can be 
referred to. 
• This relation is given the records alias. 
• We can examine the contents of an alias using the 
DUMP operator: 
DUMP records; 
(1950,0,1) 
(1950,22,1) 
(1950,-11,1) 
(1949,111,1) 
(1949,78,1)
An Example 
• We can also see the structure of a relation—the relation’s schema—using 
the DESCRIBE operator on the relation’s alias: DESCRIBE records; 
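• On this dataset Grunt prints something like the following (exact formatting may vary with the Pig version): 
records: {year: chararray,temperature: int,quality: int} 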
filtered_records = FILTER records BY temperature != 9999 AND 
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); 
• This statement removes records that have a missing temperature (indicated 
by a value of 9999) or an unsatisfactory quality reading. 
• For this small dataset, no records are filtered out. 
grouped_records = GROUP filtered_records BY year; 
• The third statement uses GROUP to group the filtered_records relation 
by the year field. 
• Let’s use DUMP to see what it produces for grouped_records. 
• Let’s use DESCRIBE grouped_records; to see the structure of 
grouped_records.
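• On this dataset, DUMP grouped_records prints roughly the following (the ordering of tuples may differ): 
(1949,{(1949,111,1),(1949,78,1)}) 
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)}) 
• and DESCRIBE grouped_records; reports a schema along the lines of: 
grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}} 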
An Example 
• We now have two rows, or tuples, one for each year in the input data. The 
first field in each tuple is the field being grouped by (the year), and the 
second field is a bag of tuples for that year. 
• A bag is just an unordered collection of tuples, which in Pig Latin is 
represented using curly braces. 
• So now all that remains is to find the maximum temperature for the tuples in 
each bag. 
max_temp = FOREACH grouped_records GENERATE group, 
MAX(filtered_records.temperature); 
• FOREACH processes every row to generate a derived set of rows, using a 
GENERATE clause to define the fields in each derived row. 
• In this example, the first field is group, which is just the year. 
• The second field, filtered_records.temperature, is a reference to the 
temperature field of the filtered_records bag in the grouped_records 
relation. 
• MAX is a built-in function for calculating the maximum value of fields in a 
bag.
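• Dumping max_temp on this dataset yields one (year, maximum temperature) tuple per group, roughly: 
DUMP max_temp; 
(1949,111) 
(1950,22) 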
Pig Latin 
• Supports read-only data analysis workloads that are 
scan-centric; no transactions! 
• Fully nested data model. 
– Does not satisfy First normal form! 
– By definition will violate the other normal forms. 
• Extensive support for user-defined functions. 
– UDF as first class citizen. 
• Manages plain input files without any schema 
information. 
• A novel debugging environment.
Nested data/set model 
• The nested set model is a particular technique for 
representing nested sets (also known as trees or 
hierarchies) in relational databases.
Why Nested Data Model? 
• Closer to how programmers think and more natural to 
them. 
– E.g., To capture information about the positional 
occurrences of terms in a collection of documents, a 
programmer may create a structure of the form 
Idx<documentId, Set<positions>> for each term. 
– Normalization of the data creates two tables: 
Term_info: (TermId, termString, ….) 
Pos_info: (TermId, documentId, position) 
– Obtain positional occurrence by joining these two tables on 
TermId and grouping on <TermId, documentId>
Why Nested Data Model? 
• Data is often stored on disk in an inherently nested 
fashion. 
– A web crawler might output for each url, the set of outlinks 
from that url. 
• A nested data model justifies a new algebraic 
language! 
• Adoption by programmers is easier, because it is easier to 
write user-defined functions.
Dataflow Language 
• The user specifies a sequence of steps, where each step 
performs only a single, high-level data transformation. 
This is similar to relational algebra, but procedural, which is 
desirable for programmers. 
• With SQL, the user specifies a set of declarative constraints; 
this is non-procedural and desirable for non-programmers.
Dataflow Language: Example 
• A high level program that specifies a query execution plan. 
• Example: Suppose we have a table urls: (url, category, pagerank). The 
following is a simple SQL query that finds, for each sufficiently large 
category, the average pagerank of high-pagerank urls in that category. 
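• A sketch of such a query (the 0.2 pagerank threshold and the 1,000,000 group-size cutoff are illustrative values): 
SELECT category, AVG(pagerank) 
FROM urls 
WHERE pagerank > 0.2 
GROUP BY category 
HAVING COUNT(*) > 1000000; 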
• In Pig Latin:
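• A sketch of the equivalent dataflow, one transformation per step (the aliases are illustrative): 
good_urls = FILTER urls BY pagerank > 0.2; 
groups = GROUP good_urls BY category; 
big_groups = FILTER groups BY COUNT(good_urls) > 1000000; 
category_avgs = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank); 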
Lazy Execution 
• Database style optimization by lazy processing of 
expressions. 
• Example 
Recall urls: (url, category, pagerank) 
Set of urls of pages that are classified as spam and have a 
high pagerank score. 
1. Spam_urls = FILTER urls BY isSpam(url); 
2. Culprit_urls = FILTER Spam_urls BY pagerank > 0.8; 
Optimized execution: 
1. HighRank_urls = FILTER urls BY pagerank > 0.8; 
2. Culprit_urls = FILTER HighRank_urls BY isSpam(url);
Quick Start/Interoperability 
• To process a file, the user provides a function that 
gives Pig the ability to parse the content of the file 
into records. 
• Output of a Pig program is formatted based on a user-defined 
function. 
• Why don’t conventional DBMSs do the same? (They 
require importing data into system-managed tables.)
Quick Start/Interoperability 
• To process a file, the user provides a function that 
gives Pig the ability to parse the content of the file 
into records. 
• Output of a Pig program is formatted based on a user-defined 
function. 
• Why don’t conventional DBMSs do the same? (They 
require importing data into system-managed tables.) 
– To enable transactional consistency guarantees, 
– To enable efficient point lookups (RIDs), 
– To curate data on behalf of the user, and record the schema 
so that other users can make sense of the data.
Pig Latin - Simple Data Types 
• Pig Latin statements work with relations: 
– A Relation is a Bag (Outer Bag) 
– A Bag is a collection of Tuples 
– A Tuple is an ordered set of Fields 
– A Field can be any simple or complex data type 
• thus supports nested data model 
• Simple data types 
– int => signed 32 bit => 10 
– long => signed 64 bit => 10L 
– float => 32 bit => 10.5f, 10.5e2 
– double => 64 bit => 10.5, 10.5e2 
– Arrays 
• chararray => string in UTF-8 => ‘Hello World’ 
• bytearray => byte array (blob)
Data Model 
• Consists of four types: 
– Atom: Contains a simple atomic value such as a string or a 
number, e.g., ‘Joe’. 
– Tuple: Sequence of fields, each of which might be any data 
type, e.g., (‘Joe’, ‘lakers’) 
– Bag: A collection of tuples with possible duplicates. 
Schema of a bag is flexible. 
– Map: A collection of data items, where each item has an 
associated key through which it can be looked up. Keys 
must be data atoms. Flexibility enables data to change 
without re-writing programs.
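• A combined illustration of these four types in a single tuple (names and values are hypothetical): 
(‘alice’, {(‘lakers’, 1), (‘iPod’, 2)}, [‘age’#20]) 
– the first field is an atom, the second is a bag of tuples, and the third is a map from the key ‘age’ to 20.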
A Comparison with Relational Algebra 
• Pig Latin 
– Everything is a bag. 
– Dataflow language. 
• Relational Algebra 
– Everything is a table. 
– Dataflow language.
Pig Latin – NULL support 
• Same as the SQL definition: unknown or non-existent 
• NULL can be used as a constant expression in place of an expression 
of any type 
• If certain fields in the data are missing, it is the load/store function’s 
responsibility to insert NULL 
– E.g., the text loader returns NULL in place of empty strings in the 
data 
• Operations that produce NULL 
– Divide by zero 
– Dereferencing a field or map key that does not exist 
– UDFs can return NULL 
• NULLs and Operators 
– Comparison, matches, cast, and dereferencing return NULL if one 
of their inputs is NULL 
– AVG, MIN, MAX, SUM functions ignore NULLs 
– COUNT function counts values including NULLs 
– If a FILTER expression evaluates to NULL, the record is rejected
Expressions in Pig Latin
Expressions 
A = LOAD 'data.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]); 
Example tuple in A: (1, {(2,3), (4,6)}, ['yahoo'#'mail']) 
A is the name of the outer bag (relation); the bag and tuple keywords in the schema are optional. 
• Field referred to by position: A.$0 = 1 
• Field referred to by name: A.f1 = 1 (similarly A.f2 or A.$1, A.f3 or A.$2) 
• Projection of a data item: A.f2 = {(2,3), (4,6)}; A.f2.$0 = {(2), (4)} 
• Map lookup: A.f3#'yahoo' = 'mail' 
• Function application: SUM(A.f2.$0) = 6, COUNT(A.f2) = 2L
Comparison Operators 
Recall a: (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]) with the example tuple 
(1, {(2,3), (4,6)}, ['yahoo'#'mail']); the fields can be referred to as f1 or $0, f2 or $1, f3 or $2. 
• Numerical comparison (==, !=, >, >=, <, <=) 
– f1 > 5 
– f3#'yahoo' == 'mail' 
• Regular expression matching: matches 
– f3#'yahoo' matches '(?i)MAIL' 
• Logical operators: AND, OR, NOT 
– f1 == 1 AND f3#'yahoo' eq 'mail' 
• Conditional expression (a.k.a. bincond) 
– (condition ? exp1 : exp2) 
– f3#'yahoo' matches '(?i)MAIL' ? 'matched' : 'notmatched'
Pig Built-in Functions 
• Pig has a variety of built-in functions for each type 
– Storage 
• TextLoader: for loading unstructured text files. Each line is 
loaded as a tuple with a single field which is the entire line. 
– Filter 
• isEmpty: tests if bags are empty 
– Eval Functions 
• COUNT: computes number of elements in a bag 
• SUM: computes the sum of the numeric values in a single-column 
bag 
• AVG: computes the average of the numeric values in a single-column 
bag 
• MIN/MAX: computes the min/max of the numeric values in a 
single-column bag. 
• SIZE: returns the size of any datum (e.g., of a map) 
• CONCAT: concatenates two chararrays or two bytearrays 
• TOKENIZE: splits a string and outputs a bag of words 
• DIFF: compares the two fields of a tuple of size 2
Specifying Input Data 
• Use LOAD command to specify input data file. 
• Input file is query_log.txt 
• Convert input file into tuples using myLoad deserializer. 
• Loaded tuples have 3 fields. 
• USING and AS clauses are optional. 
– A default deserializer that expects a plain-text, tab-delimited file is used. 
• No schema → reference fields by position ($0, $1, ...) 
• Return value, assigned to “queries”, is a handle to a bag. 
– “queries” can be used as input to subsequent Pig Latin expressions. 
– Handles such as “queries” are logical. No data is actually read and no 
processing carried out until the instruction that explicitly asks for output 
(STORE). 
– Think of it as a “logical view”.
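• A minimal sketch of such a LOAD (the third field name, timestamp, is an assumption for illustration): 
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);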
FOREACH 
• Once input data file(s) have been specified through LOAD, one can specify 
the processing that needs to be carried out on the data. 
• One of the basic operations is that of applying some processing to every 
tuple of a data set. 
• This is achieved through the FOREACH command. For example: 
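• A sketch (the expanded_queries alias is illustrative): 
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString); 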
• The above command specifies that each tuple of the bag queries (loaded by 
previous command) should be processed independently to produce an 
output tuple. 
• The first field of the output tuple is the userId field of the input tuple, and 
the second field of the output tuple is the result of applying the UDF 
expandQuery to the queryString field of the input tuple.
Per-tuple Processing with FOREACH 
• Suppose the UDF expandQuery generates a bag of likely expansions of a 
given query string. 
• Then the statement above transforms each input tuple into an output tuple 
whose second field is a bag of likely expansions of the query string. 
• Semantics: 
– No dependence between the processing of different tuples of the input → 
parallelism! 
– GENERATE can be followed by a list of arbitrary expressions.
FOREACH & Flattening 
• To eliminate nesting in data, use FLATTEN. 
• FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and 
makes them fields of the tuple being output by GENERATE, removing one 
level of nesting.
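• A sketch of the flattened variant, reusing the assumed queries fields and the expandQuery UDF: 
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString)); 
• Each (userId, expansion) pair now becomes a separate flat output tuple instead of one tuple carrying a nested bag.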
Discarding Unwanted Data: FILTER 
• Identical to the select operator of relational algebra. 
• Syntax: 
– FILTER bag-id BY expression 
• An expression is: 
field-name op constant 
field-name op UDF 
where op can be ==, eq, !=, neq, <, >, <=, >= 
• A comparison may combine several expressions with the boolean operators 
(AND, OR, NOT) 
• For example, to get rid of bot traffic in the bag queries 
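• A sketch, assuming well-behaved bots identify themselves with the user id ‘bot’: 
real_queries = FILTER queries BY userId neq 'bot'; 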
• Since arbitrary expressions are allowed, it follows that we can use UDFs 
while filtering. 
• Thus, in our less ideal world, where bots don’t identify themselves, we can 
use a sophisticated UDF (isBot) to perform the filtering, e.g.:
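• A sketch (real_queries is an illustrative alias): 
real_queries = FILTER queries BY NOT isBot(userId);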
A Comparison with Relational Algebra 
• Pig Latin 
– Everything is a bag. 
– Dataflow language. 
– FILTER is the same as the 
select operator. 
• Relational Algebra 
– Everything is a table. 
– Dataflow language. 
– The select operator is the same 
as the FILTER command.
Grouping related data 
• COGROUP groups together tuples from one or more data sets that are 
related in some way. 
• Example: 
– For example, suppose we have two data sets that we have specified through a 
LOAD command: 
– Results contains, for different query strings, the urls shown as search results and 
the position at which they are shown. 
– Revenue contains, for different query strings, and different ad slots, the average 
amount of revenue made by the ad for that query string at that slot. 
– Then to group together all search result data and revenue data for the same 
query string, we can write:
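• A sketch of those statements (the file names, the myLoad deserializer, the field names, and the grouped_data alias are illustrative assumptions): 
results = LOAD 'results.txt' USING myLoad() AS (queryString, url, position); 
revenue = LOAD 'revenue.txt' USING myLoad() AS (queryString, adSlot, amount); 
grouped_data = COGROUP results BY queryString, revenue BY queryString;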
COGROUP 
• The output of a COGROUP contains one tuple for each group. 
– First field of the tuple, named group, is the group identifier. 
– Each of the next fields is a bag, one for each input being cogrouped, and is 
named the same as the alias of that input.
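• For example, if both inputs contain tuples for the query string ‘lakers’, the corresponding output tuple might look roughly like this (the values are hypothetical): 
(lakers, {(lakers, nba.com, 1), (lakers, espn.com, 2)}, {(lakers, top, 50), (lakers, side, 20)})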
COGROUP is not JOIN 
• Grouping can be performed according to arbitrary expressions which may 
include UDFs. 
• Grouping is different from JOIN. 
• It is evident that JOIN is equivalent to COGROUP, followed by taking a cross 
product of the tuples in the nested bags. While joins are widely applicable, 
certain custom processing might require access to the tuples of the groups 
before the cross-product is taken.
Example 
• Suppose we were trying to attribute search revenue to search-result urls to 
figure out the monetary worth of each url. We might have a sophisticated 
model for doing so. To accomplish this task in Pig Latin, we can follow the 
COGROUP with the following statement: 
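• A sketch, continuing the grouped_data example above (url_revenues is an illustrative alias): 
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue)); 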
• Where distributeRevenue is a UDF that accepts search results and revenue 
information for a query string at a time, and outputs a bag of urls and the 
revenue attributed to them. 
• For example, distributeRevenue might attribute revenue from the top slot 
entirely to the first search result, while the revenue from the side slot may 
be attributed equally to all the results.
Example… 
• Assign search revenue to search-result urls to figure out the monetary 
worth of each url. A UDF, distributeRevenue attributes revenue from the 
top slot entirely to the first search result, while the revenue from the side 
slot may be attributed equally to all the results.
WITH JOIN 
• To specify the same operation in SQL, one would have to join by queryString, 
then group by queryString, and then apply a custom aggregation function. 
• But while doing the join, the system would compute the cross product of the 
search and revenue information, which the custom aggregation function 
would then have to undo. 
• Thus, the whole process becomes quite inefficient, and the query becomes 
hard to read and understand.
Special Case of COGROUP: GROUP 
• A special case of COGROUP when there is only one data set involved. 
• Example: Find the total revenue for each query string. 
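• A sketch of the two statements (reusing the revenue relation; totalRevenue is an illustrative field name): 
grouped_revenue = GROUP revenue BY queryString; 
query_revenues = FOREACH grouped_revenue GENERATE group, SUM(revenue.amount) AS totalRevenue; 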
• In the second statement above, revenue.amount refers to a projection of 
the nested bag in the tuples of grouped_revenue. 
• Also, as in SQL, the AS clause is used to assign names to fields on the fly. 
• To group all tuples of a data set together (e.g., to compute the overall total 
revenue), one uses the syntax GROUP revenue ALL.
JOIN 
• Pig Latin supports equi-joins. 
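• For example, an equi-join of the results and revenue relations on queryString could be written as: 
join_result = JOIN results BY queryString, revenue BY queryString; 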
• It is easy to verify that JOIN is only a syntactic shortcut for COGROUP 
followed by flattening. 
• The above join command is equivalent to:
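• A sketch of the equivalent form: 
temp_var = COGROUP results BY queryString, revenue BY queryString; 
join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);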
MapReduce in Pig Latin 
• With the GROUP and FOREACH statements, it is trivial to express a MapReduce 
program in Pig Latin. 
• Converting to our data-model terminology, a map function operates on one 
input tuple at a time, and outputs a bag of key-value pairs. 
• The reduce function then operates on all values for a key at a time to produce 
the final result. 
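• A minimal sketch, assuming hypothetical myMap and myReduce UDFs and an already-loaded relation input_data: 
map_result = FOREACH input_data GENERATE FLATTEN(myMap(*)); 
key_groups = GROUP map_result BY $0; 
final_result = FOREACH key_groups GENERATE myReduce(*); 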
• The first line applies the map UDF to every tuple of the input, and flattens the 
bag of key-value pairs that it produces. 
• We use the shorthand * as in SQL to denote that all the fields of the input tuples 
are passed to the map UDF. 
• Assuming the first field of the map output to be the key, the second statement 
groups by key. 
• The third statement then passes the bag of values for every key to the reduce 
UDF to obtain the final result.
Other Commands 
• Pig Latin has a number of other commands that are 
very similar to their SQL counterparts. These are: 
– UNION: Returns the union of two or more bags. 
– CROSS: Returns the cross product of two or more bags. 
– ORDER: Orders a bag by the specified field(s). 
– DISTINCT: Eliminates duplicate tuples in a bag. This 
command is just a shortcut for grouping the bag by all fields, 
and then projecting out the groups.
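• Minimal usage sketches (the relation and field names are hypothetical): 
all_queries = UNION queries1, queries2; 
pairs = CROSS queries1, queries2; 
sorted_queries = ORDER queries BY timestamp; 
unique_queries = DISTINCT queries;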
Asking for Output: STORE 
• The user can ask for the result of a Pig Latin expression sequence to be 
materialized to a file, by issuing the STORE command, e.g., 
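• A sketch of such a statement (names as described below): 
STORE query_revenues INTO 'myoutput' USING myStore(); 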
• The above command specifies that bag query_revenues should be serialized 
to the file myoutput using the custom serializer myStore. 
• As with LOAD, the USING clause may be omitted for a default serializer that 
writes plain-text, tab-delimited files. 
• Pig also comes with a built-in serializer/deserializer that can load/store 
arbitrarily nested data.
Word Count using Pig 
myinput = LOAD 'input.txt' USING TextLoader() AS (text_line:chararray); 
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(text_line)); 
grouped = GROUP words BY $0; 
counts = FOREACH grouped GENERATE group, COUNT(words); 
STORE counts INTO 'pigoutput' USING PigStorage(); 
• The output is written to HDFS as pigoutput/part-* files.
Build Inverted Index 
• Load set of files as string:chararray 
• Associate filenames with their string representation 
• Union all the entries <filename, string> 
• For each entry tokenize string to generate 
– <filename, word> tuples 
• Group by word 
– <word1, {(filename1, word1), (filename2, word1)…}> 
– For each group take records with distinct filenames from the 
associated bag 
– Generate <word1, {(filename1), (filename2), …}> 
• Store it
Build Inverted Index 
t1 = LOAD 'input1.txt' USING TextLoader() AS (string:chararray); 
t2 = FOREACH t1 GENERATE 'input1.txt' AS fname, string; 
t3 = LOAD 'input2.txt' USING TextLoader() AS (string:chararray); 
t4 = FOREACH t3 GENERATE 'input2.txt' AS fname, string; 
text = UNION t2, t4; 
words = FOREACH text GENERATE fname, FLATTEN(TOKENIZE(string)); 
word_groups = GROUP words BY $1; 
index = FOREACH word_groups { files = DISTINCT words.fname; GENERATE group, files; }; -- nested FOREACH 
STORE index INTO 'inverted_index' USING PigStorage();
End of session 
Day – 3: Apache Pig Data Operations
