Apache Pig Data Operations
An Example 
• Let’s look at a simple example by writing the program to 
calculate the maximum recorded temperature by year for the 
weather dataset in Pig Latin. 
• Data: 
YEAR TMP QUALITY 
1950 0 1 
1950 22 1 
1950 -11 1 
1949 111 1 
1949 78 1 
• Start up Grunt in local mode, then enter the first line of the Pig 
script: 
records = LOAD 'input/ncdc/micro-tab/sample.txt' 
AS (year:chararray, temperature:int, quality:int); 
• For simplicity, the program assumes that the input is tab-delimited 
text, with each line having just year, temperature, and 
quality fields.
An Example 
records = LOAD 'input/ncdc/micro-tab/sample.txt' 
AS (year:chararray, temperature:int, quality:int); 
• This line describes the input data we want to process. 
• The year:chararray notation describes the field’s name and type; a chararray 
is like a Java string, and an int is like a Java int. 
• The LOAD operator takes a URI argument; here we are just using a local file, 
but we could refer to an HDFS URI. 
• The AS clause (which is optional) gives the fields names to make it 
convenient to refer to them in subsequent statements. 
• The result of the LOAD operator, indeed any operator in Pig Latin, is a 
relation, which is just a set of tuples. 
• A tuple is just like a row of data in a database table, with multiple fields in a 
particular order. 
• In this example, the LOAD function produces a set of (year, temperature, 
quality) tuples that are present in the input file. 
• We write a relation with one tuple per line, where tuples are represented as 
comma-separated items in parentheses: (1950,0,1)
An Example 
• Relations are given names, or aliases, so they can be 
referred to. 
• This relation is given the records alias. 
• We can examine the contents of an alias using the 
DUMP operator: 
DUMP records; 
(1950,0,1) 
(1950,22,1) 
(1950,-11,1) 
(1949,111,1) 
(1949,78,1)
An Example 
• We can also see the structure of a relation—the relation’s schema—using 
the DESCRIBE operator on the relation’s alias: DESCRIBE records; 
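• On this dataset Grunt prints something like the following (exact formatting may vary with the Pig version): 
records: {year: chararray,temperature: int,quality: int} 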
filtered_records = FILTER records BY temperature != 9999 AND 
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); 
• This statement removes records that have a missing temperature (indicated 
by a value of 9999) or an unsatisfactory quality reading. 
• For this small dataset, no records are filtered out. 
grouped_records = GROUP filtered_records BY year; 
• The third statement uses GROUP to group the filtered_records relation 
by the year field. 
• Let’s use DUMP to see what it produces for grouped_records. 
• Let’s use DESCRIBE grouped_records; to see the structure of 
grouped_records.
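• On this dataset, DUMP grouped_records prints roughly the following (the ordering of tuples may differ): 
(1949,{(1949,111,1),(1949,78,1)}) 
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)}) 
• and DESCRIBE grouped_records; reports a schema along the lines of: 
grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}} 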
An Example 
• We now have two rows, or tuples, one for each year in the input data. The 
first field in each tuple is the field being grouped by (the year), and the 
second field is a bag of tuples for that year. 
• A bag is just an unordered collection of tuples, which in Pig Latin is 
represented using curly braces. 
• So now all that remains is to find the maximum temperature for the tuples in 
each bag. 
max_temp = FOREACH grouped_records GENERATE group, 
MAX(filtered_records.temperature); 
• FOREACH processes every row to generate a derived set of rows, using a 
GENERATE clause to define the fields in each derived row. 
• In this example, the first field is group, which is just the year. 
• The second field, filtered_records.temperature, is a reference to the 
temperature field of the filtered_records bag in the grouped_records 
relation. 
• MAX is a built-in function for calculating the maximum value of fields in a 
bag.
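• Dumping max_temp on this dataset yields one (year, maximum temperature) tuple per group, roughly: 
DUMP max_temp; 
(1949,111) 
(1950,22) 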
Pig Latin 
• Supports read-only data analysis workloads that are 
scan-centric; no transactions! 
• Fully nested data model. 
– Does not satisfy First normal form! 
– By definition will violate the other normal forms. 
• Extensive support for user-defined functions. 
– UDF as first class citizen. 
• Manages plain input files without any schema 
information. 
• A novel debugging environment.
Nested data/set model 
• The nested set model is a particular technique for 
representing nested sets (also known as trees or 
hierarchies) in relational databases.
Why Nested Data Model? 
• Closer to how programmers think and more natural to 
them. 
– E.g., To capture information about the positional 
occurrences of terms in a collection of documents, a 
programmer may create a structure of the form 
Idx<documentId, Set<positions>> for each term. 
– Normalization of the data creates two tables: 
Term_info: (TermId, termString, ….) 
Pos_info: (TermId, documentId, position) 
– Obtain positional occurrence by joining these two tables on 
TermId and grouping on <TermId, documentId>
Why Nested Data Model? 
• Data is often stored on disk in an inherently nested 
fashion. 
– A web crawler might output for each url, the set of outlinks 
from that url. 
• A nested data model justifies a new algebraic 
language! 
• Adoption by programmers is easier, because it is easier to 
write user-defined functions.
Dataflow Language 
• The user specifies a sequence of steps, where each step 
performs only a single, high-level data transformation. 
This is similar to relational algebra, but procedural, which is 
desirable for programmers. 
• With SQL, the user specifies a set of declarative constraints; 
this is non-procedural and desirable for non-programmers.
Dataflow Language: Example 
• A high level program that specifies a query execution plan. 
• Example: Suppose we have a table urls: (url, category, pagerank). The 
following is a simple SQL query that finds, for each sufficiently large 
category, the average pagerank of high-pagerank urls in that category. 
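• A sketch of such a query (the 0.2 pagerank threshold and the 1,000,000 group-size cutoff are illustrative values): 
SELECT category, AVG(pagerank) 
FROM urls 
WHERE pagerank > 0.2 
GROUP BY category 
HAVING COUNT(*) > 1000000; 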
• In Pig Latin:
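• A sketch of the equivalent dataflow, one transformation per step (the aliases are illustrative): 
good_urls = FILTER urls BY pagerank > 0.2; 
groups = GROUP good_urls BY category; 
big_groups = FILTER groups BY COUNT(good_urls) > 1000000; 
category_avgs = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank); 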
Lazy Execution 
• Database style optimization by lazy processing of 
expressions. 
• Example 
Recall urls: (url, category, pagerank) 
Set of urls of pages that are classified as spam and have a 
high pagerank score. 
1. Spam_urls = FILTER urls BY isSpam(url); 
2. Culprit_urls = FILTER Spam_urls BY pagerank > 0.8; 
Optimized execution: 
1. HighRank_urls = FILTER urls BY pagerank > 0.8; 
2. Culprit_urls = FILTER HighRank_urls BY isSpam(url);
Quick Start/Interoperability 
• To process a file, the user provides a function that 
gives Pig the ability to parse the content of the file 
into records. 
• Output of a Pig program is formatted based on a user-defined 
function. 
• Why don’t conventional DBMSs do the same? (They 
require importing data into system-managed tables.)
Quick Start/Interoperability 
• To process a file, the user provides a function that 
gives Pig the ability to parse the content of the file 
into records. 
• Output of a Pig program is formatted based on a user-defined 
function. 
• Why don’t conventional DBMSs do the same? (They 
require importing data into system-managed tables.) 
– To enable transactional consistency guarantees, 
– To enable efficient point lookups (RIDs), 
– To curate data on behalf of the user, and record the schema 
so that other users can make sense of the data.
Pig Latin - Simple Data Types 
• Pig Latin statements work with relations: 
– A Relation is a Bag (Outer Bag) 
– A Bag is a collection of Tuples 
– A Tuple is an ordered set of Fields 
– A Field can be any simple or complex data type 
• thus supports nested data model 
• Simple data types 
– int => signed 32 bit => 10 
– long => signed 64 bit => 10L 
– float => 32 bit => 10.5f, 10.5e2 
– double => 64 bit => 10.5, 10.5e2 
– Arrays 
• chararray => string in UTF-8 => ‘Hello World’ 
• bytearray => byte array (blob)
Data Model 
• Consists of four types: 
– Atom: Contains a simple atomic value such as a string or a 
number, e.g., ‘Joe’. 
– Tuple: Sequence of fields, each of which might be any data 
type, e.g., (‘Joe’, ‘lakers’) 
– Bag: A collection of tuples with possible duplicates. 
Schema of a bag is flexible. 
– Map: A collection of data items, where each item has an 
associated key through which it can be looked up. Keys 
must be data atoms. Flexibility enables data to change 
without re-writing programs.
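• A combined illustration of these four types in a single tuple (names and values are hypothetical): 
(‘alice’, {(‘lakers’, 1), (‘iPod’, 2)}, [‘age’#20]) 
– the first field is an atom, the second is a bag of tuples, and the third is a map from the key ‘age’ to 20.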
A Comparison with Relational Algebra 
• Pig Latin 
– Everything is a bag. 
– Dataflow language. 
• Relational Algebra 
– Everything is a table. 
– Dataflow language.
Pig Latin – NULL support 
• Same as the SQL definition: unknown or non-existent 
• NULL can be used as a constant expression in place of an expression 
of any type 
• If certain fields in the data are missing, it is the load/store function’s 
responsibility to insert NULL 
– E.g., the text loader returns NULL in place of empty strings in the 
data 
• Operations that produce NULL 
– Divide by zero 
– Dereferencing a field or map key that does not exist 
– UDFs can return NULL 
• NULLs and Operators 
– Comparison, matches, cast, and dereferencing return NULL if one 
of their inputs is NULL 
– AVG, MIN, MAX, SUM functions ignore NULLs 
– COUNT function counts values including NULLs 
– If a FILTER expression evaluates to NULL, the record is rejected
Expressions in Pig Latin
Expressions 
A = LOAD 'data.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]); 
Example tuple in A: (1, {(2,3), (4,6)}, ['yahoo'#'mail']) 
A is the name of the outer bag (relation); the bag and tuple keywords in the schema are optional. 
• Field referred to by position: A.$0 = 1 
• Field referred to by name: A.f1 = 1 (similarly A.f2 or A.$1, A.f3 or A.$2) 
• Projection of a data item: A.f2 = {(2,3), (4,6)}; A.f2.$0 = {(2), (4)} 
• Map lookup: A.f3#'yahoo' = 'mail' 
• Function application: SUM(A.f2.$0) = 6, COUNT(A.f2) = 2L
Comparison Operators 
Recall a: (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]) with the example tuple 
(1, {(2,3), (4,6)}, ['yahoo'#'mail']); the fields can be referred to as f1 or $0, f2 or $1, f3 or $2. 
• Numerical comparison (==, !=, >, >=, <, <=) 
– f1 > 5 
– f3#'yahoo' == 'mail' 
• Regular expression matching: matches 
– f3#'yahoo' matches '(?i)MAIL' 
• Logical operators: AND, OR, NOT 
– f1 == 1 AND f3#'yahoo' eq 'mail' 
• Conditional expression (a.k.a. bincond) 
– (condition ? exp1 : exp2) 
– f3#'yahoo' matches '(?i)MAIL' ? 'matched' : 'notmatched'
Pig Built-in Functions 
• Pig has a variety of built-in functions for each type 
– Storage 
• TextLoader: for loading unstructured text files. Each line is 
loaded as a tuple with a single field which is the entire line. 
– Filter 
• isEmpty: tests if bags are empty 
– Eval Functions 
• COUNT: computes number of elements in a bag 
• SUM: computes the sum of the numeric values in a single-column 
bag 
• AVG: computes the average of the numeric values in a single-column 
bag 
• MIN/MAX: computes the min/max of the numeric values in a 
single-column bag. 
• SIZE: returns the size of any datum (e.g., of a map) 
• CONCAT: concatenates two chararrays or two bytearrays 
• TOKENIZE: splits a string and outputs a bag of words 
• DIFF: compares the two fields of a tuple of size 2
Specifying Input Data 
• Use LOAD command to specify input data file. 
• Input file is query_log.txt 
• Convert input file into tuples using myLoad deserializer. 
• Loaded tuples have 3 fields. 
• USING and AS clauses are optional. 
– A default deserializer that expects a plain-text, tab-delimited file is used. 
• No schema → reference fields by position ($0, $1, ...) 
• Return value, assigned to “queries”, is a handle to a bag. 
– “queries” can be used as input to subsequent Pig Latin expressions. 
– Handles such as “queries” are logical. No data is actually read and no 
processing carried out until the instruction that explicitly asks for output 
(STORE). 
– Think of it as a “logical view”.
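• A minimal sketch of such a LOAD (the third field name, timestamp, is an assumption for illustration): 
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);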
FOREACH 
• Once input data file(s) have been specified through LOAD, one can specify 
the processing that needs to be carried out on the data. 
• One of the basic operations is that of applying some processing to every 
tuple of a data set. 
• This is achieved through the FOREACH command. For example: 
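• A sketch (the expanded_queries alias is illustrative): 
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString); 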
• The above command specifies that each tuple of the bag queries (loaded by 
previous command) should be processed independently to produce an 
output tuple. 
• The first field of the output tuple is the userId field of the input tuple, and 
the second field of the output tuple is the result of applying the UDF 
expandQuery to the queryString field of the input tuple.
Per-tuple Processing with FOREACH 
• Suppose the UDF expandQuery generates a bag of likely expansions of a 
given query string. 
• Then the statement above transforms each input tuple into an output tuple 
whose second field is a bag of likely expansions of the query string. 
• Semantics: 
– No dependence between the processing of different tuples of the input → 
parallelism! 
– GENERATE can be followed by a list of arbitrary expressions.
FOREACH & Flattening 
• To eliminate nesting in data, use FLATTEN. 
• FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and 
makes them fields of the tuple being output by GENERATE, removing one 
level of nesting.
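• A sketch of the flattened variant, reusing the assumed queries fields and the expandQuery UDF: 
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString)); 
• Each (userId, expansion) pair now becomes a separate flat output tuple instead of one tuple carrying a nested bag.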
Discarding Unwanted Data: FILTER 
• Identical to the select operator of relational algebra. 
• Syntax: 
– FILTER bag-id BY expression 
• An expression is: 
field-name op constant 
field-name op UDF 
where op can be ==, eq, !=, neq, <, >, <=, >= 
• A comparison may combine several expressions with the boolean operators 
(AND, OR, NOT) 
• For example, to get rid of bot traffic in the bag queries 
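• A sketch, assuming well-behaved bots identify themselves with the user id ‘bot’: 
real_queries = FILTER queries BY userId neq 'bot'; 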
• Since arbitrary expressions are allowed, it follows that we can use UDFs 
while filtering. 
• Thus, in our less ideal world, where bots don’t identify themselves, we can 
use a sophisticated UDF (isBot) to perform the filtering, e.g.:
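• A sketch (real_queries is an illustrative alias): 
real_queries = FILTER queries BY NOT isBot(userId);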
A Comparison with Relational Algebra 
• Pig Latin 
– Everything is a bag. 
– Dataflow language. 
– FILTER is the same as the 
select operator. 
• Relational Algebra 
– Everything is a table. 
– Dataflow language. 
– The select operator is the same 
as the FILTER command.
Grouping related data 
• COGROUP groups together tuples from one or more data sets that are 
related in some way. 
• Example: 
– For example, suppose we have two data sets that we have specified through a 
LOAD command: 
– Results contains, for different query strings, the urls shown as search results and 
the position at which they are shown. 
– Revenue contains, for different query strings, and different ad slots, the average 
amount of revenue made by the ad for that query string at that slot. 
– Then to group together all search result data and revenue data for the same 
query string, we can write:
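• A sketch of those statements (the file names, the myLoad deserializer, the field names, and the grouped_data alias are illustrative assumptions): 
results = LOAD 'results.txt' USING myLoad() AS (queryString, url, position); 
revenue = LOAD 'revenue.txt' USING myLoad() AS (queryString, adSlot, amount); 
grouped_data = COGROUP results BY queryString, revenue BY queryString;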
COGROUP 
• The output of a COGROUP contains one tuple for each group. 
– First field of the tuple, named group, is the group identifier. 
– Each of the next fields is a bag, one for each input being cogrouped, and is 
named the same as the alias of that input.
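• For example, if both inputs contain tuples for the query string ‘lakers’, the corresponding output tuple might look roughly like this (the values are hypothetical): 
(lakers, {(lakers, nba.com, 1), (lakers, espn.com, 2)}, {(lakers, top, 50), (lakers, side, 20)})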
COGROUP is not JOIN 
• Grouping can be performed according to arbitrary expressions which may 
include UDFs. 
• Grouping is different from JOIN. 
• It is evident that JOIN is equivalent to COGROUP, followed by taking a cross 
product of the tuples in the nested bags. While joins are widely applicable, 
certain custom processing might require access to the tuples of the groups 
before the cross-product is taken.
Example 
• Suppose we were trying to attribute search revenue to search-result urls to 
figure out the monetary worth of each url. We might have a sophisticated 
model for doing so. To accomplish this task in Pig Latin, we can follow the 
COGROUP with the following statement: 
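• A sketch, continuing the grouped_data example above (url_revenues is an illustrative alias): 
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue)); 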
• Where distributeRevenue is a UDF that accepts search results and revenue 
information for a query string at a time, and outputs a bag of urls and the 
revenue attributed to them. 
• For example, distributeRevenue might attribute revenue from the top slot 
entirely to the first search result, while the revenue from the side slot may 
be attributed equally to all the results.
Example… 
• Assign search revenue to search-result urls to figure out the monetary 
worth of each url. A UDF, distributeRevenue attributes revenue from the 
top slot entirely to the first search result, while the revenue from the side 
slot may be attributed equally to all the results.
WITH JOIN 
• To specify the same operation in SQL, one would have to join by queryString, 
then group by queryString, and then apply a custom aggregation function. 
• But while doing the join, the system would compute the cross product of the 
search and revenue information, which the custom aggregation function 
would then have to undo. 
• Thus, the whole process becomes quite inefficient, and the query becomes 
hard to read and understand.
Special Case of COGROUP: GROUP 
• A special case of COGROUP when there is only one data set involved. 
• Example: Find the total revenue for each query string. 
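• A sketch of the two statements (reusing the revenue relation; totalRevenue is an illustrative field name): 
grouped_revenue = GROUP revenue BY queryString; 
query_revenues = FOREACH grouped_revenue GENERATE group, SUM(revenue.amount) AS totalRevenue; 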
• In the second statement above, revenue.amount refers to a projection of 
the nested bag in the tuples of grouped_revenue. 
• Also, as in SQL, the AS clause is used to assign names to fields on the fly. 
• To group all tuples of a data set together (e.g., to compute the overall total 
revenue), one uses the syntax GROUP revenue ALL.
JOIN 
• Pig Latin supports equi-joins. 
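• For example, an equi-join of the results and revenue relations on queryString could be written as: 
join_result = JOIN results BY queryString, revenue BY queryString; 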
• It is easy to verify that JOIN is only a syntactic shortcut for COGROUP 
followed by flattening. 
• The above join command is equivalent to:
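• A sketch of the equivalent form: 
temp_var = COGROUP results BY queryString, revenue BY queryString; 
join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);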
MapReduce in Pig Latin 
• With the GROUP and FOREACH statements, it is trivial to express a MapReduce 
program in Pig Latin. 
• Converting to our data-model terminology, a map function operates on one 
input tuple at a time, and outputs a bag of key-value pairs. 
• The reduce function then operates on all values for a key at a time to produce 
the final result. 
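• A minimal sketch, assuming hypothetical myMap and myReduce UDFs and an already-loaded relation input_data: 
map_result = FOREACH input_data GENERATE FLATTEN(myMap(*)); 
key_groups = GROUP map_result BY $0; 
final_result = FOREACH key_groups GENERATE myReduce(*); 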
• The first line applies the map UDF to every tuple of the input, and flattens the 
bag of key-value pairs that it produces. 
• We use the shorthand * as in SQL to denote that all the fields of the input tuples 
are passed to the map UDF. 
• Assuming the first field of the map output to be the key, the second statement 
groups by key. 
• The third statement then passes the bag of values for every key to the reduce 
UDF to obtain the final result.
Other Commands 
• Pig Latin has a number of other commands that are 
very similar to their SQL counterparts. These are: 
– UNION: Returns the union of two or more bags. 
– CROSS: Returns the cross product of two or more bags. 
– ORDER: Orders a bag by the specified field(s). 
– DISTINCT: Eliminates duplicate tuples in a bag. This 
command is just a shortcut for grouping the bag by all fields, 
and then projecting out the groups.
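• Minimal usage sketches (the relation and field names are hypothetical): 
all_queries = UNION queries1, queries2; 
pairs = CROSS queries1, queries2; 
sorted_queries = ORDER queries BY timestamp; 
unique_queries = DISTINCT queries;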
Asking for Output: STORE 
• The user can ask for the result of a Pig Latin expression sequence to be 
materialized to a file, by issuing the STORE command, e.g., 
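• A sketch of such a statement (names as described below): 
STORE query_revenues INTO 'myoutput' USING myStore(); 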
• The above command specifies that bag query_revenues should be serialized 
to the file myoutput using the custom serializer myStore. 
• As with LOAD, the USING clause may be omitted for a default serializer that 
writes plain-text, tab-delimited files. 
• Pig also comes with a built-in serializer/deserializer that can load/store 
arbitrarily nested data.
Word Count using Pig 
myinput = LOAD 'input.txt' USING TextLoader() AS (text_line:chararray); 
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(text_line)); 
grouped = GROUP words BY $0; 
counts = FOREACH grouped GENERATE group, COUNT(words); 
STORE counts INTO 'pigoutput' USING PigStorage(); 
• The output is written to HDFS as pigoutput/part-* files.
Build Inverted Index 
• Load set of files as string:chararray 
• Associate filenames with their string representation 
• Union all the entries <filename, string> 
• For each entry tokenize string to generate 
– <filename, word> tuples 
• Group by word 
– <word1, {(filename1, word1), (filename2, word1)…}> 
– For each group take records with distinct filenames from the 
associated bag 
– Generate <word1, {(filename1), (filename2), …}> 
• Store it
Build Inverted Index 
t1 = LOAD 'input1.txt' USING TextLoader() AS (string:chararray); 
t2 = FOREACH t1 GENERATE 'input1.txt' AS fname, string; 
t3 = LOAD 'input2.txt' USING TextLoader() AS (string:chararray); 
t4 = FOREACH t3 GENERATE 'input2.txt' AS fname, string; 
text = UNION t2, t4; 
words = FOREACH text GENERATE fname, FLATTEN(TOKENIZE(string)); 
word_groups = GROUP words BY $1; 
index = FOREACH word_groups { files = DISTINCT words.fname; GENERATE group, files; }; -- nested FOREACH 
STORE index INTO 'inverted_index' USING PigStorage();
End of session 
Day – 3: Apache Pig Data Operations
