Hadoop: M/R, Pig, Hive
A short intro and demo of each tool
By Zahid Mian (February 2015)
Agenda
• Intro to Map/Reduce (M/R)
• M/R Simple Example
• M/R Joins
• M/R Broadcast Join Example
• Intro to Pig
• Pig Example
• Intro to Hive
• Hive Example
• Resources
What is M/R?
• A way of programming that breaks down work into two tasks:
Mapping and Reducing
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume: <key, <list of values>> <“EMC”, “{(…),(…)}”>
• Produce: <key, value> <“EMC”, 27.2229>
• Shuffling and Sorting:
• Behind the scenes actions done by the framework
• Groups all identical keys from all mappers, sorts them, and passes
each group to a single reducer
What is HDFS?
• HDFS is a filesystem that ensures data availability by
replicating file blocks across several nodes (3 is default)
• Default block size is 64 MB
• A small file (1 KB) still occupies a full block in the namespace, though
only 1 KB on disk; a “large” file of 65 MB spans two blocks
• Namenode stores metadata info about files
• Datanode stores the actual file(s)
• Files must be added to HDFS
• Files cannot be modified once inside HDFS
Working with HDFS
• Similar to working with Linux Filesystem
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/input
• [cloudera@quickstart ~]$ hadoop fs -mkdir /user/examples/stocks/output
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/input/*
• [cloudera@quickstart ~]$ hadoop fs -copyFromLocal ~/datasets/stock*.txt /user/examples/stocks/input/
• [cloudera@quickstart ~]$ hadoop fs -cat /user/examples/stocks/input/stocks.txt
• [cloudera@quickstart ~]$ hadoop fs -rm -r /user/examples/stocks/output/*
• Full list of Commands available:
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Structure of Files (demo)
• Symbol, Name, Exchange
• Symbol, Date, Open, High, Low, Close, Volume, AdjClose
Shakespeare Count Words
• Simple text file that contains all of Shakespeare’s works
• Mapper will read each line from the text file and produce a <key,
value> tuple with the word as the key and 1 as the value
• Simply tokenize each line and output each word
• Reducer will get a list of values (all 1s) for each word
• Tuple: <“death”, {1,1,1,1,1,1,1,1}>
• Now simply sum the 1s and output <“death”, 8>
• It’s Hadoop’s job to Shuffle and Sort in order to give the Reducer
the correct tuple
• Reducer output is stored in HDFS; intermediate Mapper output goes to local disk
• Logs are generated outside HDFS
M/R: Mapper (Simple)
All Mappers must extend this class:
org.apache.hadoop.mapreduce.Mapper
Special Hadoop type; for text files,
this is the byte offset of the line
Special Hadoop type; Indicates type of
value Mapper will produce
“Signature” indicates that
Mapper will consume
LongWritable and Text; will
produce Text and
IntWritable
Notice word is of type Text; one is of type IntWritable
setup method is run only once before any
calls to the mapper function
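A minimal sketch of such a Mapper, consistent with the annotations above (the class name WordMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1); // value is always 1
  private Text word = new Text();

  @Override
  protected void setup(Context context) {
    // runs only once, before any calls to map(); one-time initialization goes here
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key is the byte offset of the line; value is the line of text
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, one); // emit <word, 1>
    }
  }
}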
M/R Submitting: Driver
• Compile and Create jar file
• Then from command prompt:
[cloudera@quickstart ~]$ hadoop jar words.jar Driver /user/examples/shakespeare/input/ /user/examples/shakespeare/output/wordcount
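A minimal sketch of such a Driver, assuming the WordMapper sketched above and the WordReducer sketched with the Reducer slide below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(Driver.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(WordReducer.class);
    job.setMapOutputKeyClass(Text.class);          // Mapper emits <Text, IntWritable>
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);             // Reducer emits <Text, DoubleWritable>
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}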
What is Mapper Doing?
• Sample File Segment
• Mapper function gets:
• Byte Offset, Line Text
• <57020, “HAMLET To be, or not to be: that is the question:”>
• Mapper will:
• 1: tokenize string
• 2: for each word, produce a
tuple like:
• <“HAMLET”, 1>
• <“To”, 1>
• <“be”, 1>
• <“or”, 1>
• …
• Repeated for all lines
That’s it?
• Hadoop performs some Magic (Shuffling and Sorting) …
• And now we have tuples like:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• <“To”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• <“be”, {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}>
• Note: the lists of values aren’t literally correct (there are many more references to
“HAMLET” in the text file), but they’re representative of what the grouped tuples look like
M/R: Reducer (Simple)
All Reducers must extend this class:
org.apache.hadoop.mapreduce.Reducer
This is the “key” for the data that’s
being sent to the Reducer
Special Hadoop type; indicates type of
value the Reducer will produce
“Signature” indicates that
Reducer will consume Text
and IntWritable; will
produce Text and
DoubleWritable
Notice key is of type Text; result is of type DoubleWritable
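A minimal sketch of such a Reducer, matching the annotations above (the class name WordReducer is illustrative; the count is emitted as a DoubleWritable to match the signature shown):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  private DoubleWritable result = new DoubleWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    for (IntWritable val : values) {
      sum += val.get(); // each value is 1, so the sum is the word count
    }
    result.set(sum);
    context.write(key, result); // emit <word, count>
  }
}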
What is Reducer Doing?
• Hadoop will send a tuple to the Reducer:
• <“HAMLET”, {1,1,1,1,1,1,1}>
• Reducer function:
• Iterates over all the values for that key
• Value is always 1, so simply sum
• Reducer outputs:
• <“HAMLET”, 7>
M/R Overview
• Input Files (split across Map Tasks):
HAMLET To be, or not to … / Whether 'tis nobler in the … / Or to take arms against a … / And by opposing end …
• Each line passed to mapper; Map key/value split:
HAMLET, 1 · to, 1 · be, 1 · or, 1 · not, 1 · to, 1 · Whether, 1 · ‘tis, 1 · nobler, 1 · in, 1 · the, 1 · Or, 1 · to, 1 · take, 1 · arms, 1 · against, 1 · a, 1 · And, 1 · by, 1 · opposing, 1 · end, 1
• Sort and Shuffle (identical keys grouped together):
HAMLET, 1 · a, 1 · against, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 1 · or, 1 · take, 1 · to, 1 · to, 1 · to, 1
• Reduce key/value pairs (each Reduce Task sums its groups):
HAMLET, 1 · a, 1 · against, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 2 · take, 1 · to, 3
• Final Output:
HAMLET, 1 · a, 1 · against, 1 · and, 1 · arms, 1 · be, 1 · by, 1 · end, 1 · in, 1 · nobler, 1 · not, 1 · opposing, 1 · or, 2 · take, 1 · the, 1 · 'tis, 1 · to, 3 · whether, 1
Joins with M/R
• Not straightforward (a Mapper sees only one record at a time)
• Two Strategies:
• Re-Partition Join if both tables are Large
• Basic idea is to use Mappers to produce keyed records so that both
data sets land in the same partition
• Assume EmployeeID of 100, then Mapper Produces:
• <100, “FirstName, LastName, Address”> (parent record)
• <100, “Skill1, Date, Level”> (child record)
• <100, “Skill2, Date, Level”> (child record)
• Reducer performs the join
• Expensive/Costly due to Shuffling and Sorting
• Broadcast/Replication Join if one table is small
• Essentially send a copy of small table to each Mapper
• Each Mapper performs join
M/R Mapper: Broadcast Join
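A minimal sketch of a broadcast-join Mapper, assuming the small companies file (Symbol, Name, Exchange) is shipped to every mapper via the distributed cache and the large stock file has the layout shown earlier; all file and field names are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastJoinMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  private final Map<String, String> namesBySymbol = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the small table into memory once per mapper; "companies.txt" is the
    // symlink name given in the Driver's addCacheFile call (see the Driver sketch)
    BufferedReader in = new BufferedReader(new FileReader("companies.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(","); // Symbol, Name, Exchange
      namesBySymbol.put(f[0].trim(), f[1].trim());
    }
    in.close();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Large table: Symbol, date, open, high, low, close, volume, adjclose
    String[] f = value.toString().split(",");
    String name = namesBySymbol.get(f[0].trim());
    if (name != null) { // join succeeded; emit <Name, open price>
      context.write(new Text(name), new DoubleWritable(Double.parseDouble(f[2])));
    }
  }
}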
M/R Reducer: Broadcast Join
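A minimal sketch of the matching Reducer, which averages the joined open prices per company name (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgPriceReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    long count = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
      count++;
    }
    context.write(key, new DoubleWritable(sum / count)); // <Name, average open price>
  }
}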
M/R Driver: Broadcast Join
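The distinctive part of a broadcast-join Driver is shipping the small file to every mapper. A sketch of the relevant lines, set up as in the earlier Driver sketch (the path and the "#companies.txt" symlink name are assumptions; requires java.net.URI):

Job job = Job.getInstance(new Configuration(), "broadcast join");
job.setJarByClass(Driver.class);
job.addCacheFile(new URI("/user/examples/stocks/companies.txt#companies.txt"));
job.setMapperClass(BroadcastJoinMapper.class);
job.setReducerClass(AvgPriceReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);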
Results
Just the Mapper
Mapper and Reducer (calculate Avg Price by Name)
Final Thoughts on M/R
• Java Experience Necessary
• Hadoop Streaming extends M/R to C, Python, etc.
• Can use Combiners to improve performance
• Reduces Network traffic
• “Difficult” to understand all the details, but granular control
over data/process
• Useful when dealing with complex algorithms
• Several file formats available, but can also create custom
formats
• Chaining Jobs to use output of one Job as input for another
• https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
Pig
• Higher level abstraction for writing M/R jobs
• Data Flow “language”
• Sequence of transformations (filtering, grouping, joining, etc.)
• Pig Latin (the language for Pig)
• It’s not SQL, not even close
• Pig scripts are run as M/R jobs in Hadoop
• Pig Shell will compile and optimize script
• Need to understand the data in order to create schemas
• Pig can define simple and complex types, so parent/child
data can exist in one “line” (think JSON)
• User Defined Functions (UDFs) can be written in Java, Jython,
etc. http://pig.apache.org/docs/r0.9.1/udf.html
Generic Example
• This script shows many of the operations within Pig
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Avg Opening Price by Name
Performs a join between two datasets;
describe shows you the structure
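A minimal Pig sketch of the same idea, with assumed file paths and field names:

companies = LOAD '/user/examples/stocks/companies.txt' USING PigStorage(',')
            AS (symbol:chararray, name:chararray, exchange:chararray);
prices    = LOAD '/user/examples/stocks/input/stocks.txt' USING PigStorage(',')
            AS (symbol:chararray, date:chararray, open:double, high:double,
                low:double, close:double, volume:long, adjclose:double);
jnd  = JOIN prices BY symbol, companies BY symbol;
grpd = GROUP jnd BY companies::name;
avgs = FOREACH grpd GENERATE group AS name, AVG(jnd.prices::open) AS avg_open;
DESCRIBE avgs;  -- describe shows the structure of each relation
STORE avgs INTO '/user/examples/stocks/output/avg_open';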
Pig Scripts are Hadoop Jobs
z.pig is the name of the script (run it with: pig z.pig)
Hive
• It’s not Pig
• SQL-based tool for Hadoop (HiveQL, not SQL)
• Friendlier for SQL users
• “Databases” are simply Namespaces
• “Tables” similar to SQL Tables
• Cannot Insert/Update/Delete individual rows
• New data is added when HDFS is updated (add a file to HDFS)
• Metadata is kept in a relational database (MySQL by default)
Hive and HDFS
• When a Table points to an HDFS location, it reads all files in
that location; you cannot specify a single file
• Easy to create Partitions; simply create subdirectories
• That’s why each file is stored in a separate directory
Hive Script
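A minimal HiveQL sketch along these lines, with assumed paths and column names (an external table over the stocks directory, then an average-open query):

CREATE EXTERNAL TABLE stocks (
  symbol STRING, dt STRING, open DOUBLE, high DOUBLE, low DOUBLE,
  close DOUBLE, volume BIGINT, adjclose DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/examples/stocks/input';

SELECT symbol, AVG(open) AS avg_open
FROM stocks
GROUP BY symbol;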
Hive Results
Hive Scripts are Hadoop Jobs