4. THE PROBLEM
Google faced the problem of analyzing huge data sets, on the order of petabytes.
E.g. PageRank computation, web access logs, etc.
The algorithm to process the data can be reasonably simple,
but to finish in an acceptable amount of time the task must be split and forwarded to potentially
thousands of machines.
Programmers were forced to develop software that:
Splits the data
Forwards data and code to the participating nodes
Checks node state to react to errors
Retrieves and organizes the results
Tedious, error-prone, time-consuming... and it had to be done for each problem.
5. THE SOLUTION: MAPREDUCE
MapReduce is an abstraction to organize parallelizable tasks.
The algorithm has to be adapted to fit MapReduce's two main steps:
Map: data processing (an intermediate collect/group/distribute step follows)
Reduce: collecting and digesting the results
The MapReduce architecture provides:
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
6. LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data
elements into lists of output data elements.
A MapReduce program will do this twice, using two different list
processing idioms:
Map
Reduce
These terms are taken from several list processing languages such as
LISP, Scheme, or ML.
7. MAPPING LISTS
A list of data elements is provided, one at a time, to a function
called the Mapper.
It transforms each element individually to an output data element.
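As a minimal illustration of the mapping idiom, here is a sketch in plain Java (using streams, not Hadoop; class and variable names are invented for this example):

import java.util.List;
import java.util.stream.Collectors;

public class MapIdiom {
    public static void main(String[] args) {
        List<String> input = List.of("sweet", "foo", "bar");
        // The mapper transforms each element individually.
        List<String> output = input.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(output); // [SWEET, FOO, BAR]
    }
}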
8. REDUCING LISTS
Reducing lets you aggregate values together.
A reducer function receives an iterator of input values from an input list.
It then combines these values together, returning a single output value.
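The reducing idiom, in the same plain-Java style (names again invented for illustration):

import java.util.List;

public class ReduceIdiom {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4);
        // The reducer combines all input values into a single output value.
        int sum = values.stream().reduce(0, Integer::sum);
        System.out.println(sum); // 10
    }
}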
9. MAPPING IN MAPREDUCE
(KEYS AND VALUES)
In MapReduce, no value stands on its own.
Every value has a key associated with it. Keys identify related values.
For example, a log of time-coded speedometer readings from multiple cars could be
keyed by license-plate number.
The mapping and reducing functions receive not just values, but (key, value) pairs.
The output of each of these functions is the same:
Both a key and a value must be emitted to the next list in the data flow.
Example input, keyed by license plate:
AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm
...
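A plain-Java sketch of pairing each reading with its key (class and field names are illustrative only, not part of any Hadoop API):

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map.Entry;

public class KeyedReadings {
    public static void main(String[] args) {
        List<String> log = List.of(
                "AAA-123 65mph, 12:00pm",
                "ZZZ-789 50mph, 12:02pm",
                "AAA-123 40mph, 12:05pm",
                "CCC-456 25mph, 12:15pm");
        for (String line : log) {
            int split = line.indexOf(' ');
            // Key: the license plate; value: the rest of the reading.
            Entry<String, String> pair = new SimpleEntry<>(
                    line.substring(0, split), line.substring(split + 1));
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}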
10. REDUCING IN MAPREDUCE
(KEYS DIVIDE THE REDUCE SPACE)
In MapReduce, the output values are not usually all reduced together.
All of the values with the same key are presented to a single reducer together.
This happens independently of any reduce operations occurring on other lists of
values with different keys attached.
12. EXAMPLE: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear
in a set of files.
For example, if we had the files:
foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
We would expect the output to be:
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
13. WORD COUNT IN MAPREDUCE (2)
The high-level structure would look like this:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
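A concrete version of this pseudocode, sketched with Hadoop's classic org.apache.hadoop.mapred API (the same OutputCollector/Reporter interfaces described in the later slides):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Mapper: with the standard TextInputFormat the key is the line's byte
  // offset and the value is one line of text (slightly different from the
  // (filename, file-contents) pseudocode above).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        // Lowercase and strip punctuation so "Sweet," counts as "sweet".
        word.set(itr.nextToken().toLowerCase().replaceAll("\\W", ""));
        output.collect(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reducer: sums the 1s emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));  // emit (word, sum)
    }
  }
}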
18. INPUT FILES
This is where the data for a MapReduce task is initially stored.
The input files typically reside in HDFS.
The format of these files can be:
Line-based log files
Binary format files
Multi-line input records
It is typical for these input files to be very large -- tens of gigabytes
or more.
19. INPUT FORMAT
How these input files are split up and read is defined by the InputFormat.
An InputFormat is a class that provides the following functionality:
Selects the files or other objects that should be used for input
Defines the InputSplits that break a file into tasks
Provides a factory for RecordReader objects that read the file
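A sketch of how an InputFormat is selected in the classic API (the input path is hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSetup {
    public static void configure(JobConf conf) {
        // Read line-based text; keys are byte offsets, values are lines.
        conf.setInputFormat(TextInputFormat.class);
        // Select the files (or directories) to use as input.
        FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
    }
}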
20. INPUT SPLITS
An InputSplit describes a unit of work that comprises a single map task in a
MapReduce program.
A MapReduce program applied to a data set, collectively referred to as a Job, is made up of
several (possibly several hundred) tasks
By processing a file in chunks, we allow several map tasks to operate on a
single file in parallel.
The various blocks that make up the file may be spread across several different nodes in the cluster
The individual blocks are thus all processed locally, instead of needing to be transferred from one node to another
The tasks are then assigned to the nodes in the system based on where the
input file chunks are physically resident.
An individual node may have several dozen tasks assigned to it
The node will begin working on the tasks, attempting to perform as many in parallel as it can
21. RECORD READER
The InputSplit has defined a slice of work, but does not describe how
to access it.
The RecordReader class actually loads the data from its source and
converts it into (key, value) pairs suitable for reading by the Mapper.
The RecordReader is invoked repeatedly on the input until the entire
InputSplit has been consumed.
Each invocation of the RecordReader leads to another call to the map() method of the
Mapper.
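Conceptually, the framework drives the Mapper with a loop like the following sketch (a simplified version of what Hadoop's MapRunner does internally; here the Mapper is passed in directly rather than taken from the job configuration):

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MapRunnerSketch<K1, V1, K2, V2> {
    // Pull (key, value) pairs from the RecordReader until the InputSplit
    // is exhausted, calling map() once per record.
    public void run(RecordReader<K1, V1> input,
                    Mapper<K1, V1, K2, V2> mapper,
                    OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
        }
    }
}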
22. MAPPER
Given a key and a value, the map() method emits (key, value) pair(s)
which are forwarded to the Reducers.
The individual mappers are intentionally not provided with a
mechanism to communicate with one another in any way.
This allows the reliability of each map task to be governed solely by the reliability of
the local machine
The map() method receives two parameters in addition to the key and
the value:
The OutputCollector object has a method named collect() which will forward a (key,
value) pair to the reduce phase of the job.
The Reporter object provides information about the current task
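For example, a map() body might use both extra parameters like this (a sketch only; the counter enum is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapBodySketch {
    // Hypothetical counter, for illustration only.
    enum Counters { RECORDS_SEEN }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        reporter.incrCounter(Counters.RECORDS_SEEN, 1);  // report progress info
        output.collect(value, new LongWritable(1));      // forward to the reduce phase
    }
}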
23. PARTITION & SHUFFLE
After the first map tasks have completed, the nodes may still be performing
several more map tasks each.
But they also begin transferring the intermediate outputs of the map tasks to where they are
required by the reducers
This process of moving map outputs to the reducers is known as shuffling
A different subset of the intermediate key space is assigned to each reduce node;
these subsets (known as "partitions") are the inputs to the reduce tasks.
Each map task may emit (key, value) pairs to any partition; all values for the same
key are always reduced together, regardless of which mapper produced them.
Therefore, the map nodes must all agree on where to send the different pieces of
the intermediate data.
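By default this agreement is simply a hash of the key modulo the number of reduce tasks; a sketch equivalent in spirit to Hadoop's HashPartitioner:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashPartitionerSketch<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }

    // The same key hashes to the same partition on every mapper, so all
    // values for a key end up at the same reducer.
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}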
24. REDUCER
Sort
Each reduce task is responsible for reducing the values associated with several
intermediate keys.
The set of intermediate keys on a single node is automatically sorted by Hadoop
before being presented to the Reducer.
Reduce
A Reducer instance is created for each reduce task.
This is an instance of user-provided code that performs the second important phase of
job-specific work.
For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once.
This receives a key as well as an iterator over all the values associated with the key.
The values associated with a key are returned by the iterator in an undefined order.
The Reducer also receives as parameters OutputCollector and Reporter objects; they
are used in the same manner as in the map() method.
25. OUTPUT FORMAT
The (key, value) pairs provided to this OutputCollector are then written to
output files.
The way they are written is governed by the OutputFormat.
Each Reducer writes a separate file in a common output directory.
The output directory is set by the FileOutputFormat.setOutputPath() method.
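A sketch of the output side of a job configuration in the classic API (the output path is hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputSetup {
    public static void configure(JobConf conf) {
        // Write tab-separated "key <TAB> value" text lines.
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // Each Reducer writes its own file into this common directory.
        FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));
    }
}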
26. RECORD WRITER
The OutputFormat class is a factory for RecordWriter objects; these are
used to write the individual records to the files, as directed by the
OutputFormat.
The output files written by the Reducers are then left in HDFS for use by:
Another MapReduce job
A separate program
Human inspection
28. FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is due to its
high degree of fault tolerance.
Map worker failure
Map tasks completed or in progress at the worker are reset to idle
Reduce workers are notified when a task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle
Master failure
The MapReduce job is aborted and the client is notified
Should we have task identities?
29. EXAMPLE: INVERTED INDEX
An inverted index maps each word to the list of documents that contain it.
Thus, if the word "cat" appears in documents A and B, but not C, then the line:
cat A, B
should appear in the output.
If the word "baseball" appears in documents B and C, then the line:
baseball B, C
should appear in the output as well.
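One possible implementation with the classic API (a sketch; it assumes each input split comes from a single file, so that the document name can be recovered by casting the split to FileSplit):

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      // Recover the document name from the split being processed.
      String doc = ((FileSplit) reporter.getInputSplit()).getPath().getName();
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        output.collect(new Text(itr.nextToken()), new Text(doc));  // (word, doc)
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      // Deduplicate: a word may occur many times in the same document.
      Set<String> docs = new LinkedHashSet<String>();
      while (values.hasNext()) {
        docs.add(values.next().toString());
      }
      output.collect(key, new Text(String.join(", ", docs)));  // e.g. cat  A, B
    }
  }
}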