4. THE PROBLEM
Google faced the problem of analyzing huge data sets, on the order of petabytes.
E.g. PageRank computation, web access logs, etc.
The algorithm to process the data can be reasonably simple,
but to finish in an acceptable amount of time the task must be split and forwarded to potentially
thousands of machines.
Programmers were forced to develop software that:
Splits the data
Forwards data and code to the participating nodes
Checks node state to react to errors
Retrieves and organizes the results
Tedious, error-prone, time-consuming... and it had to be done for each problem.
5. THE SOLUTION: MAPREDUCE
MapReduce is an abstraction to organize parallelizable tasks.
The algorithm has to be adapted to fit MapReduce's two main steps:
Map: data processing (an intermediate collect/group/distribute step follows)
Reduce: collecting and digesting the results
The MapReduce architecture provides:
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
6. LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data
elements into lists of output data elements.
A MapReduce program will do this twice, using two different list
processing idioms:
Map
Reduce
These terms are taken from several list processing languages such as
LISP, Scheme, or ML.
7. MAPPING LISTS
A list of data elements is provided, one at a time, to a function
called the Mapper.
It transforms each element individually to an output data element.
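As a minimal illustration of the mapping idiom, here is a sketch in plain Java (using streams, not Hadoop; class and variable names are invented for this example):

import java.util.List;
import java.util.stream.Collectors;

public class MapIdiom {
    public static void main(String[] args) {
        List<String> input = List.of("sweet", "foo", "bar");
        // The mapper transforms each element individually.
        List<String> output = input.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(output); // [SWEET, FOO, BAR]
    }
}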
8. REDUCING LISTS
Reducing lets you aggregate values together.
A reducer function receives an iterator of input values from an input list.
It then combines these values together, returning a single output value.
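The reducing idiom, in the same plain-Java style (names again invented for illustration):

import java.util.List;

public class ReduceIdiom {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4);
        // The reducer combines all input values into a single output value.
        int sum = values.stream().reduce(0, Integer::sum);
        System.out.println(sum); // 10
    }
}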
9. MAPPING IN MAPREDUCE
(KEYS AND VALUES)
In MapReduce, no value stands on its own.
Every value has a key associated with it. Keys identify related values.
For example, a log of time-coded speedometer readings from multiple cars could be
keyed by license-plate number.
The mapping and reducing functions receive not just values, but (key, value) pairs.
The output of each of these functions is the same:
Both a key and a value must be emitted to the next list in the data flow.
Example input, keyed by license plate:
AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm
...
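A plain-Java sketch of pairing each reading with its key (class and field names are illustrative only, not part of any Hadoop API):

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map.Entry;

public class KeyedReadings {
    public static void main(String[] args) {
        List<String> log = List.of(
                "AAA-123 65mph, 12:00pm",
                "ZZZ-789 50mph, 12:02pm",
                "AAA-123 40mph, 12:05pm",
                "CCC-456 25mph, 12:15pm");
        for (String line : log) {
            int split = line.indexOf(' ');
            // Key: the license plate; value: the rest of the reading.
            Entry<String, String> pair = new SimpleEntry<>(
                    line.substring(0, split), line.substring(split + 1));
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}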
10. REDUCING IN MAPREDUCE
(KEYS DIVIDE THE REDUCE SPACE)
In MapReduce, the output values are not usually all reduced together.
All of the values with the same key are presented to a single reducer together.
This happens independently of any reduce operations occurring on other lists of
values with different keys attached.
12. EXAMPLE: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear
in a set of files.
For example, if we had the files:
foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
We would expect the output to be:
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
13. WORD COUNT IN MAPREDUCE (2)
The high-level structure would look like this:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
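A concrete version of this pseudocode, sketched with Hadoop's classic org.apache.hadoop.mapred API (the same OutputCollector/Reporter interfaces described in the later slides):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Mapper: with the standard TextInputFormat the key is the line's byte
  // offset and the value is one line of text (slightly different from the
  // (filename, file-contents) pseudocode above).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        // Lowercase and strip punctuation so "Sweet," counts as "sweet".
        word.set(itr.nextToken().toLowerCase().replaceAll("\\W", ""));
        output.collect(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reducer: sums the 1s emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));  // emit (word, sum)
    }
  }
}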
18. INPUT FILES
This is where the data for a MapReduce task is initially stored.
The input files typically reside in HDFS.
The format of these files can be:
Line-based log files
Binary format files
Multi-line input records
It is typical for these input files to be very large -- tens of gigabytes
or more.
19. INPUT FORMAT
How these input files are split up and read is defined by the InputFormat.
An InputFormat is a class that provides the following functionality:
Selects the files or other objects that should be used for input
Defines the InputSplits that break a file into tasks
Provides a factory for RecordReader objects that read the file
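A sketch of how an InputFormat is selected in the classic API (the input path is hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSetup {
    public static void configure(JobConf conf) {
        // Read line-based text; keys are byte offsets, values are lines.
        conf.setInputFormat(TextInputFormat.class);
        // Select the files (or directories) to use as input.
        FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
    }
}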
20. INPUT SPLITS
An InputSplit describes a unit of work that comprises a single map task in a
MapReduce program.
A MapReduce program applied to a data set, collectively referred to as a Job, is made up of
several (possibly several hundred) tasks
By processing a file in chunks, we allow several map tasks to operate on a
single file in parallel.
The various blocks that make up the file may be spread across several different nodes in the cluster
The individual blocks are thus all processed locally, instead of needing to be transferred from one node to another
The tasks are then assigned to the nodes in the system based on where the
input file chunks are physically resident.
An individual node may have several dozen tasks assigned to it
The node will begin working on the tasks, attempting to perform as many in parallel as it can
21. RECORD READER
The InputSplit has defined a slice of work, but does not describe how
to access it.
The RecordReader class actually loads the data from its source and
converts it into (key, value) pairs suitable for reading by the Mapper.
The RecordReader is invoked repeatedly on the input until the entire
InputSplit has been consumed.
Each invocation of the RecordReader leads to another call to the map() method of the
Mapper.
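Conceptually, the framework drives the Mapper with a loop like the following sketch (a simplified version of what Hadoop's MapRunner does internally; here the Mapper is passed in directly rather than taken from the job configuration):

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MapRunnerSketch<K1, V1, K2, V2> {
    // Pull (key, value) pairs from the RecordReader until the InputSplit
    // is exhausted, calling map() once per record.
    public void run(RecordReader<K1, V1> input,
                    Mapper<K1, V1, K2, V2> mapper,
                    OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
        }
    }
}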
22. MAPPER
Given a key and a value, the map() method emits (key, value) pair(s)
which are forwarded to the Reducers.
The individual mappers are intentionally not provided with a
mechanism to communicate with one another in any way.
This allows the reliability of each map task to be governed solely by the reliability of
the local machine
The map() method receives two parameters in addition to the key and
the value:
The OutputCollector object has a method named collect() which will forward a (key,
value) pair to the reduce phase of the job.
The Reporter object provides information about the current task
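For example, a map() body might use both extra parameters like this (a sketch only; the counter enum is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapBodySketch {
    // Hypothetical counter, for illustration only.
    enum Counters { RECORDS_SEEN }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        reporter.incrCounter(Counters.RECORDS_SEEN, 1);  // report progress info
        output.collect(value, new LongWritable(1));      // forward to the reduce phase
    }
}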
23. PARTITION & SHUFFLE
After the first map tasks have completed, the nodes may still be performing
several more map tasks each.
But they also begin transferring the intermediate outputs of the map tasks to where they are
required by the reducers
This process of moving map outputs to the reducers is known as shuffling
A different subset of the intermediate key space is assigned to each reduce node;
these subsets (known as "partitions") are the inputs to the reduce tasks.
Each map task may emit (key, value) pairs to any partition; all values for the same
key are always reduced together, regardless of which mapper produced them.
Therefore, the map nodes must all agree on where to send the different pieces of
the intermediate data.
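By default this agreement is simply a hash of the key modulo the number of reduce tasks; a sketch equivalent in spirit to Hadoop's HashPartitioner:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashPartitionerSketch<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }

    // The same key hashes to the same partition on every mapper, so all
    // values for a key end up at the same reducer.
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}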
24. REDUCER
Sort
Each reduce task is responsible for reducing the values associated with several
intermediate keys.
The set of intermediate keys on a single node is automatically sorted by Hadoop
before being presented to the Reducer.
Reduce
A Reducer instance is created for each reduce task.
This is an instance of user-provided code that performs the second important phase of
job-specific work.
For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once.
This receives a key as well as an iterator over all the values associated with the key.
The values associated with a key are returned by the iterator in an undefined order.
The Reducer also receives as parameters OutputCollector and Reporter objects; they
are used in the same manner as in the map() method.
25. OUTPUT FORMAT
The (key, value) pairs provided to this OutputCollector are then written to
output files.
The way they are written is governed by the OutputFormat.
Each Reducer writes a separate file in a common output directory.
The output directory is set by the FileOutputFormat.setOutputPath() method.
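A sketch of the output side of a job configuration in the classic API (the output path is hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputSetup {
    public static void configure(JobConf conf) {
        // Write tab-separated "key <TAB> value" text lines.
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // Each Reducer writes its own file into this common directory.
        FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));
    }
}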
26. RECORD WRITER
The OutputFormat class is a factory for RecordWriter objects; these are
used to write the individual records to the files, as directed by the
OutputFormat.
The output files written by the Reducers are then left in HDFS for use by:
Another MapReduce job
A separate program
Human inspection
28. FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is due to its
high degree of fault tolerance.
Map worker failure
Map tasks completed or in progress at the worker are reset to idle
Reduce workers are notified when a task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle
Master failure
The MapReduce job is aborted and the client is notified
Should we have task identities?
29. EXAMPLE: INVERTED INDEX
An inverted index maps each word to the list of documents that contain it.
Thus, if the word "cat" appears in documents A and B, but not C, then the line:
cat A, B
should appear in the output.
If the word "baseball" appears in documents B and C, then the line:
baseball B, C
should appear in the output as well.
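One possible implementation with the classic API (a sketch; it assumes each input split comes from a single file, so that the document name can be recovered by casting the split to FileSplit):

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      // Recover the document name from the split being processed.
      String doc = ((FileSplit) reporter.getInputSplit()).getPath().getName();
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        output.collect(new Text(itr.nextToken()), new Text(doc));  // (word, doc)
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      // Deduplicate: a word may occur many times in the same document.
      Set<String> docs = new LinkedHashSet<String>();
      while (values.hasNext()) {
        docs.add(values.next().toString());
      }
      output.collect(key, new Text(String.join(", ", docs)));  // e.g. cat  A, B
    }
  }
}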