Hadoop: M/R, Pig, Hive
A short intro and demo of each tool
By Zahid Mian (February 2015)
Agenda
• Intro to Map/Reduce (M/R)
• M/R Simple Example
• M/R Joins
• M/R Broadcast Join Example
• Intro to Pig
• Pig Example
• Intro to Hive
• Hive Example
• Resources
What is M/R?
• A way of programming that breaks down work into two tasks:
Mapping and Reducing
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume: <key, <list of values>> <“EMC”, “{(…),(…)}”>
• Produce: <key, value> <“EMC”, 27.2229>
• Shuffling and Sorting:
• Behind the scenes actions done by the framework
• Groups all identical keys from all mappers, sorts them, and passes
each group to a single reducer
What is HDFS?
• HDFS is a filesystem that ensures data availability by
replicating file blocks across several nodes (3 is default)
• Default block size is 64 MB
• A small file (1 KB) still occupies a full block in the namespace, though
only 1 KB on disk; a “large” file of 65 MB spans two blocks
• Namenode stores metadata info about files
• Datanode stores the actual file(s)
• Files must be added to HDFS
• Files cannot be modified once inside HDFS
Working with HDFS
• Similar to working with Linux Filesystem
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/input
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/output
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/input/*
• [cloudera@quickstart ~]$ hadoop fs -copyFromLocal ~/datasets/stock*.txt /user/examples/stocks/input/
• [cloudera@quickstart ~]$ hadoop fs -cat /user/examples/stocks/input/stocks.txt
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/output/*
• Full list of Commands available:
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Structure of Files (demo)
• Symbol, Name, Exchange
• Symbol, Date, Open, High, Low, Close, Volume, AdjClose
Shakespeare Count Words
• Simple text file that contains all of Shakespeare’s works
• Mapper will read each line from the text file and produce a <key,
value> tuple with the word as the key and 1 as the value
• Simply tokenize each line and output each word
• Reducer will get a list of values (all 1s) for each word
• Tuple: <“death”, {1,1,1,1,1,1,1,1}>
• Now simply sum the 1s and output <“death”, 8>
• It’s Hadoop’s job to Shuffle and Sort in order to give the Reducer
the correct tuple
• Reducer output is stored in HDFS; intermediate Mapper output goes to local disk
• Logs are generated outside HDFS
M/R: Mapper (Simple)
All Mappers must extend this class:
org.apache.hadoop.mapreduce.Mapper
Special Hadoop type; for text files,
this is the byte offset of the line
Special Hadoop type; Indicates type of
value Mapper will produce
“Signature” indicates that
Mapper will consume
LongWritable and Text; will
produce Text and
IntWritable
Notice word is of type Text; one is of type IntWritable
setup method is run only once before any
calls to the mapper function
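A minimal sketch of such a Mapper, consistent with the annotations above (the class name WordMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1); // value is always 1
  private Text word = new Text();

  @Override
  protected void setup(Context context) {
    // runs only once, before any calls to map(); one-time initialization goes here
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key is the byte offset of the line; value is the line of text
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, one); // emit <word, 1>
    }
  }
}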
M/R Submitting: Driver
• Compile and Create jar file
• Then from command prompt:
[cloudera@quickstart ~]$ hadoop jar words.jar Driver /user/examples/shakespeare/input/ /user/examples/shakespeare/output/wordcount
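A minimal sketch of such a Driver, assuming the WordMapper sketched above and the WordReducer sketched with the Reducer slide below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(Driver.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(WordReducer.class);
    job.setMapOutputKeyClass(Text.class);          // Mapper emits <Text, IntWritable>
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);             // Reducer emits <Text, DoubleWritable>
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}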
What is Mapper Doing?
• Sample File Segment
• Mapper function gets:
• Byte Offset, Line Text
• <57020, “HAMLET To be, or not to be: that is the question:”>
• Mapper will:
• 1: tokenize string
• 2: for each word, produce a
tuple like:
• <“HAMLET”, 1>
• <“To”, 1>
• <“be”, 1>
• <“or”, 1>
• …
• Repeated for all lines
That’s it?
• Hadoop performs some Magic (Shuffling and Sorting) …
• And now we have tuples like:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• <“To”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• <“be”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• Note: the lists of values aren’t literally correct (there are many more references to
“HAMLET” in the text file), but they’re representative of what the grouped tuples look like
M/R: Reducer (Simple)
All Reducers must extend this class:
org.apache.hadoop.mapreduce.Reducer
This is the “key” for the data that’s
being sent to the Reducer
Special Hadoop type; indicates type of
value the Reducer will produce
“Signature” indicates that
Reducer will consume Text
and IntWritable; will
produce Text and
DoubleWritable
Notice key is of type Text; result is of type DoubleWritable
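A minimal sketch of such a Reducer, matching the annotations above (the class name WordReducer is illustrative; the count is emitted as a DoubleWritable to match the signature shown):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  private DoubleWritable result = new DoubleWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    for (IntWritable val : values) {
      sum += val.get(); // each value is 1, so the sum is the word count
    }
    result.set(sum);
    context.write(key, result); // emit <word, count>
  }
}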
What is Reducer Doing?
• Hadoop will send a tuple to the Reducer:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• Reducer function:
• Iterates over all the values for that key
• Value is always 1, so simply sum
• Reducer outputs:
• <“HAMLET”, 7>
M/R Overview
• Input Files (split across Map Tasks):
HAMLET To be, or not to … / Whether 'tis nobler in the … / Or to take arms against a … / And by opposing end …
• Each line passed to mapper; Map key/value split:
HAMLET, 1 · to, 1 · be, 1 · or, 1 · not, 1 · to, 1 · Whether, 1 · ‘tis, 1 · nobler, 1 · in, 1 · the, 1 · Or, 1 · to, 1 · take, 1 · arms, 1 · against, 1 · a, 1 · And, 1 · by, 1 · opposing, 1 · end, 1
• Sort and Shuffle (identical keys grouped together):
HAMLET, 1 · a, 1 · against, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 1 · or, 1 · take, 1 · to, 1 · to, 1 · to, 1
• Reduce key/value pairs (each Reduce Task sums its groups):
HAMLET, 1 · a, 1 · against, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 2 · take, 1 · to, 3
• Final Output:
HAMLET, 1 · a, 1 · against, 1 · and, 1 · arms, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 2 · take, 1 · the, 1 · 'tis, 1 · to, 3 · whether, 1
Joins with M/R
• Not straightforward (a Mapper sees only one record at a time)
• Two Strategies:
• Re-Partition Join if both tables are Large
• Basic idea is to use Mappers to produce keyed records so that both
data sets land in the same partition
• Assume EmployeeID of 100, then Mapper Produces:
• <100, “FirstName, LastName, Address”> (parent record)
• <100, “Skill1, Date, Level”> (child record)
• <100, “Skill2, Date, Level”> (child record)
• Reducer performs the join
• Expensive/Costly due to Shuffling and Sorting
• Broadcast/Replication Join if one table is small
• Essentially send a copy of small table to each Mapper
• Each Mapper performs join
M/R Mapper: Broadcast Join
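A minimal sketch of a broadcast-join Mapper, assuming the small companies file (Symbol, Name, Exchange) is shipped to every mapper via the distributed cache and the large stock file has the layout shown earlier; all file and field names are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastJoinMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  private final Map<String, String> namesBySymbol = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the small table into memory once per mapper; "companies.txt" is the
    // symlink name given in the Driver's addCacheFile call (see the Driver sketch)
    BufferedReader in = new BufferedReader(new FileReader("companies.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(","); // Symbol, Name, Exchange
      namesBySymbol.put(f[0].trim(), f[1].trim());
    }
    in.close();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Large table: Symbol, date, open, high, low, close, volume, adjclose
    String[] f = value.toString().split(",");
    String name = namesBySymbol.get(f[0].trim());
    if (name != null) { // join succeeded; emit <Name, open price>
      context.write(new Text(name), new DoubleWritable(Double.parseDouble(f[2])));
    }
  }
}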
M/R Reducer: Broadcast Join
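A minimal sketch of the matching Reducer, which averages the joined open prices per company name (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgPriceReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    long count = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
      count++;
    }
    context.write(key, new DoubleWritable(sum / count)); // <Name, average open price>
  }
}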
M/R Driver: Broadcast Join
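The distinctive part of a broadcast-join Driver is shipping the small file to every mapper. A sketch of the relevant lines, set up as in the earlier Driver sketch (the path and the "#companies.txt" symlink name are assumptions; requires java.net.URI):

Job job = Job.getInstance(new Configuration(), "broadcast join");
job.setJarByClass(Driver.class);
job.addCacheFile(new URI("/user/examples/stocks/companies.txt#companies.txt"));
job.setMapperClass(BroadcastJoinMapper.class);
job.setReducerClass(AvgPriceReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);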
Results
Just the Mapper
Mapper and Reducer (calculate Avg Price by Name)
Final Thoughts on M/R
• Java Experience Necessary
• Hadoop Streaming extends M/R to C, Python, etc.
• Can use Combiners to improve performance
• Reduces Network traffic
• “Difficult” to understand all the details, but granular control
over data/process
• Useful when dealing with complex algorithms
• Several file formats available, but can also create custom
formats
• Chaining Jobs to use output of one Job as input for another
• https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
Pig
• Higher level abstraction for writing M/R jobs
• Data Flow “language”
• Sequence of transformations (filtering, grouping, joining, etc.)
• Pig Latin (the language for Pig)
• It’s not SQL, not even close
• Pig scripts are run as M/R jobs in Hadoop
• Pig Shell will compile and optimize script
• Need to understand the data in order to create schemas
• Pig can define simple and complex types, so parent/child
data can exist in one “line” (think JSON)
• User Defined Functions (UDFs) can be written in Java, Jython,
etc. http://pig.apache.org/docs/r0.9.1/udf.html
Generic Example
• This script shows many of the operations within Pig
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Avg Opening Price by Name
Performs a join between two datasets;
describe shows you the structure
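A minimal Pig sketch of the same idea, with assumed file paths and field names:

companies = LOAD '/user/examples/stocks/companies.txt' USING PigStorage(',')
            AS (symbol:chararray, name:chararray, exchange:chararray);
prices    = LOAD '/user/examples/stocks/input/stocks.txt' USING PigStorage(',')
            AS (symbol:chararray, date:chararray, open:double, high:double,
                low:double, close:double, volume:long, adjclose:double);
jnd  = JOIN prices BY symbol, companies BY symbol;
grpd = GROUP jnd BY companies::name;
avgs = FOREACH grpd GENERATE group AS name, AVG(jnd.prices::open) AS avg_open;
DESCRIBE avgs;  -- describe shows the structure of each relation
STORE avgs INTO '/user/examples/stocks/output/avg_open';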
Pig Scripts are Hadoop Jobs
z.pig is the name of the script (run it with: pig z.pig)
Hive
• It’s not Pig
• SQL-based tool for Hadoop (HiveQL, not SQL)
• Friendlier for SQL users
• “Databases” are simply Namespaces
• “Tables” similar to SQL Tables
• Cannot Insert/Update/Delete individual rows
• New data is added when HDFS is updated (add a file to HDFS)
• Metadata is kept in a relational database (MySQL by default)
Hive and HDFS
• When a Table points to an HDFS location, it reads all files in
that location; you cannot specify a single file
• Easy to create Partitions; simply create subdirectories
• That’s why each file is stored in a separate directory
Hive Script
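A minimal HiveQL sketch along these lines, with assumed paths and column names (an external table over the stocks directory, then an average-open query):

CREATE EXTERNAL TABLE stocks (
  symbol STRING, dt STRING, open DOUBLE, high DOUBLE, low DOUBLE,
  close DOUBLE, volume BIGINT, adjclose DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/examples/stocks/input';

SELECT symbol, AVG(open) AS avg_open
FROM stocks
GROUP BY symbol;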
Hive Results
Hive Scripts are Hadoop Jobs