Pig Relational Operators - I: Order, Distinct, Limit, GroupBy
Pig Relational Operator: ORDER
• ORDER sorts the data in ascending or descending order.
1. Numerical fields are sorted numerically.
2. chararray fields are sorted lexically.
3. bytearray fields are sorted lexically, using the byte values.
4. Nulls are considered smaller than any other value, so they
always come first in ascending order and last in descending order.
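The four rules above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Pig code: `pig_order` is a hypothetical helper, `None` stands in for a Pig null, and the sample values are made up. It also shows why amounts loaded as bytearray (as in the LOAD statement of this deck) sort differently from real numbers.

```python
# Plain-Python model of the ORDER rules above; this is not Pig code.

def pig_order(values, descending=False):
    """Sort values with nulls (None) treated as smaller than everything."""
    def key(v):
        # (False, ...) sorts below (True, ...), so nulls come before real values.
        # The placeholder 0 is never compared against a real value.
        return (v is not None, v if v is not None else 0)
    return sorted(values, key=key, reverse=descending)

# Numeric field: compared as numbers.
numeric_asc = pig_order([30, None, 4, 100])
numeric_desc = pig_order([30, None, 4, 100], descending=True)

# The same amounts held as text: compared character by character.
lexical_asc = pig_order(["30", None, "4", "100"])

print(numeric_asc)   # [None, 4, 30, 100]
print(numeric_desc)  # [100, 30, 4, None]
print(lexical_asc)   # [None, '100', '30', '4']
```

If the amounts should sort as numbers, the fields need a numeric type before ORDER is applied.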
Let's perform this with the help of an example:
grunt> dataTransaction = LOAD '/home/hduser/datasets/store.csv'
USING PigStorage(',') AS
(Product_Name:chararray, CustomerName:chararray, Transaction_ID:bytearray,
TransAmt1:bytearray, TransAmt2:bytearray, TransAmt3:bytearray,
Place:chararray, Department:chararray);
Rupak Roy
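grunt> orderbyName = ORDER dataTransaction BY CustomerName;
Example 2:
grunt> orderbyNameNsymbol = ORDER dataTransaction BY date, symbol;
Example 3:
grunt> orderbyCloseDesc = ORDER dataTransaction BY close DESC, open;
Here the close column is sorted in descending order and, since we
did not specify an order for the open column, it is sorted in
ascending order by default.
Note: Examples 2 and 3 assume a relation that has date, symbol, close
and open fields; the store.csv schema loaded above does not define them.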
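The mixed-direction sort of Example 3 can be sketched in plain Python as follows. This models the result only and is not Pig code; the three rows and their values are invented for illustration.

```python
# Plain-Python model of "ORDER ... BY close DESC, open"; not Pig code.
# The rows below are hypothetical sample data.
rows = [
    {"symbol": "A", "open": 10.0, "close": 12.0},
    {"symbol": "B", "open": 9.0, "close": 12.0},
    {"symbol": "C", "open": 11.0, "close": 15.0},
]

# Negating close gives descending order on that key;
# open is left as-is, so ties on close are broken in ascending order.
ordered = sorted(rows, key=lambda r: (-r["close"], r["open"]))
symbols = [r["symbol"] for r in ordered]

print(symbols)  # ['C', 'B', 'A']
```

Row C comes first because it has the highest close; A and B tie on close, so the lower open (B) comes before A.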
Pig Relational Operator: LIMIT
• LIMIT restricts the number of records returned.
• For example, if we have 1 million rows and we dump the results
for testing or any other purpose, displaying them one by one is
very time consuming. To check that our script is working, it is
better to view a few results than to display them all.
• However, Pig will still read all the records even when we limit
the output to, say, 20 records, and it may return a different
20 records each time we run the same LIMIT query. We can
overcome this by using the ORDER operator immediately before the
LIMIT operator, which guarantees the same 20 records each time
we run the same query.
grunt> Trecords = LIMIT dataTransaction 20;
grunt> dump Trecords;
grunt> dump Trecords; -- it may display 20 different records
-- to overcome the limit issue
grunt> ordered = ORDER dataTransaction BY $0;
grunt> Trecords = LIMIT ordered 20;
grunt> dump Trecords;
Again we will test for the same results:
grunt> Trecords = LIMIT ordered 20;
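Why sorting first makes the limited result repeatable can be sketched in plain Python. This is a model under stated assumptions, not Pig code: `limit` and `arrival` are hypothetical helpers, the records are just the numbers 0 to 99, and a seeded shuffle stands in for the unpredictable order in which records reach LIMIT on different runs.

```python
# Plain-Python model of LIMIT with and without a preceding ORDER; not Pig code.
import random

def limit(records, n):
    """Keep the first n records in whatever order they arrive."""
    return records[:n]

records = list(range(100))  # hypothetical dataset

def arrival(seed):
    """Emulate one run: the same records, arriving in a run-specific order."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Without ORDER, the 20 records depend on arrival order and may differ per run.
unordered_run1 = limit(arrival(1), 20)
unordered_run2 = limit(arrival(2), 20)

# With ORDER before LIMIT, every run yields the same 20 records.
run1 = limit(sorted(arrival(1)), 20)
run2 = limit(sorted(arrival(2)), 20)

print(run1 == run2)  # True
```

The repeatability relies on the sort key being unique at the cut-off; records that tie on the key can still swap places.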
Pig Relational Operator: DISTINCT
• DISTINCT removes duplicate records from a relation.
grunt> RD = DISTINCT dataTransaction;
Note: DISTINCT makes use of a combiner (a kind of semi-reducer
that runs between the map phase and the reduce phase) to remove
duplicates.
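The role of the combiner in the note above can be sketched in plain Python. This is a simplified model, not Pig or MapReduce code: `combine` and `distinct` are hypothetical helpers, each inner list stands for the output of one map task, and the records are invented.

```python
# Plain-Python model of DISTINCT with a combiner step; not Pig code.

def combine(split):
    """Combiner: drop duplicates within one map task's output."""
    return set(split)

def distinct(splits):
    """Reducer: merge the partial results and drop duplicates across tasks."""
    partial = [combine(s) for s in splits]
    merged = set().union(*partial)
    # Sorted only to make this example's output stable;
    # DISTINCT itself promises no particular order.
    return sorted(merged)

# Hypothetical (Product_Name, Place) records from two map tasks.
split1 = [("pen", "Delhi"), ("pen", "Delhi"), ("ink", "Pune")]
split2 = [("pen", "Delhi"), ("cap", "Pune")]

result = distinct([split1, split2])
print(result)  # [('cap', 'Pune'), ('ink', 'Pune'), ('pen', 'Delhi')]
```

The combiner shrinks split1 from three records to two before anything is sent to the reducer, which is where the saving comes from.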
Pig Relational Operator: GROUP
• GROUP is one of the most important operators: it collects the
records that share the same key value into one bag.
grunt> grouping = GROUP dataTransaction BY Place;
grunt> describe grouping;
grunt> cnt = FOREACH grouping GENERATE group, COUNT($1);
grunt> dump cnt;
Applied to our dataset store.csv, this counts the records for each
Place. Grouping by CustomerName instead counts the records for each
customer.
Note: to verify the result, load the dataset in Excel (it is only a
subset of a larger dataset, so it is small enough to view there). Use
the filter function and select only Christy Britain; we will see she
has purchased similar products from 6 different places.
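What GROUP followed by COUNT produces can be sketched in plain Python. This models the shape of the result only and is not Pig code: `group_by` is a hypothetical helper and the three records are invented, so the counts say nothing about store.csv.

```python
# Plain-Python model of GROUP ... BY Place followed by COUNT; not Pig code.
from collections import defaultdict

def group_by(records, key):
    """Return {key value: bag of records}, like the (group, bag) pairs GROUP emits."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)

# Hypothetical sample records.
records = [
    {"CustomerName": "Customer A", "Place": "Delhi"},
    {"CustomerName": "Customer B", "Place": "Delhi"},
    {"CustomerName": "Customer A", "Place": "Pune"},
]

grouping = group_by(records, "Place")
cnt = {place: len(bag) for place, bag in grouping.items()}

print(cnt)  # {'Delhi': 2, 'Pune': 1}
```

Each key maps to a bag holding the full original records, which is why `COUNT($1)` in the Pig statement counts the second field of the grouped relation.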
• Group on multiple keys:
grunt> grouped = GROUP dataTransaction BY (Place, Department);
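With more than one key, the group key becomes a tuple of the key fields. A minimal plain-Python sketch of that idea, using invented (Place, Department) pairs and not Pig code:

```python
# Plain-Python model of grouping on a composite (Place, Department) key; not Pig code.
from collections import Counter

# Hypothetical (Place, Department) pairs taken from sample records.
pairs = [
    ("Delhi", "Toys"),
    ("Delhi", "Toys"),
    ("Delhi", "Books"),
    ("Pune", "Toys"),
]

# Each distinct tuple is one group; the counter holds the size of its bag.
counts = Counter(pairs)

print(counts[("Delhi", "Toys")])  # 2
print(len(counts))                # 3
```

Records land in the same group only when both fields match, so Delhi/Toys and Delhi/Books are separate groups.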
Next
• More advanced relational operators such as FOREACH, FILTER, JOIN
and more.