2. The MapReduce Paradigm
Platform for reliable, scalable parallel computing
Abstracts the issues of the distributed and parallel environment away from the programmer
Runs over distributed file systems
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
3. Distributed File Systems
Highly scalable distributed file system for large data-intensive applications
– E.g. 10K nodes, 100 million files, 10 PB
Provides redundant storage of massive amounts of data on cheap and unreliable computers
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Provides a platform over which other systems, like MapReduce and BigTable, operate
4. Distributed File System
Single Namespace for entire cluster
Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
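The block layout described above can be illustrated with a small sketch. This is not HDFS code: the function name `plan_blocks`, the round-robin placement, and the replication factor of 3 are assumptions made for illustration, and a real file system would choose replica locations with its own placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above


def plan_blocks(file_size, datanodes, replication=3, block_size=BLOCK_SIZE):
    """Return a list of (block_index, block_length, replica_nodes) tuples.

    Placement is plain round-robin, a hypothetical policy chosen for brevity.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    plan = []
    for i in range(num_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        # Pick `replication` distinct nodes, starting at a rotating offset.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    for block in plan_blocks(300 * 1024 * 1024, nodes):
        print(block)
```

A client that holds such a plan knows which DataNodes store each block and can read any block directly from one of its replicas, which is the "intelligent client" idea in the list above.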
7. MapReduce: Insight
Consider the problem of counting the number of occurrences of each word in a large collection of documents
How would you do it in parallel?
Solution:
– Divide documents among workers
– Each worker parses its documents to find all words and outputs (word, count) pairs
– Partition (word, count) pairs across workers based on word
– For each word at a worker, locally add up counts
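The four steps of this solution can be sketched in a few lines. The sketch runs sequentially in one process and only simulates workers as list buckets; the function names and the character-code partitioning rule are assumptions for illustration, not part of any MapReduce system.

```python
from collections import Counter


def count_words(document):
    """Worker step 1: parse one document and return (word, count) pairs."""
    return list(Counter(document.lower().split()).items())


def partition(pairs, num_workers):
    """Route each (word, count) pair to a worker chosen from the word itself."""
    buckets = [[] for _ in range(num_workers)]
    for word, count in pairs:
        # Sum of character codes: a simple deterministic stand-in for a hash.
        buckets[sum(map(ord, word)) % num_workers].append((word, count))
    return buckets


def add_up(bucket):
    """Worker step 2: locally add up the counts for each word."""
    totals = Counter()
    for word, count in bucket:
        totals[word] += count
    return dict(totals)


def word_count(documents, num_workers=2):
    # In a real system each call below would run on a different machine.
    pairs = [p for doc in documents for p in count_words(doc)]
    result = {}
    for bucket in partition(pairs, num_workers):
        result.update(add_up(bucket))
    return result


if __name__ == "__main__":
    print(word_count(["a rose is a rose", "a daisy is not a rose"]))
```

Because the partitioning depends only on the word, every pair for a given word lands on the same worker, so the local sums are already the final counts.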
8. MapReduce Programming Model
Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
Input: a set of key/value pairs
User supplies two functions:
– map(k, v) → list(k1, v1)
– reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
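The two signatures above can be made concrete with a minimal single-process driver. This is a sketch under stated assumptions: `map_reduce`, `index_map`, and `index_reduce` are hypothetical names, everything runs in memory, and the keyword-index example is chosen only to show the key and value types flowing through the model.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over (k, v) inputs, group by k1, then reduce each group.

    map_fn(k, v)        -> iterable of (k1, v1) pairs
    reduce_fn(k1, v1s)  -> v2
    Returns the output as a dict {k1: v2}.
    """
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return {k1: reduce_fn(k1, v1s) for k1, v1s in groups.items()}


def index_map(doc_id, contents):
    # Emit one (word, doc_id) pair per word occurrence.
    for word in contents.split():
        yield word, doc_id


def index_reduce(word, doc_ids):
    # Collapse duplicates into a sorted list of documents containing the word.
    return sorted(set(doc_ids))


if __name__ == "__main__":
    docs = [("d1", "to be or not to be"), ("d2", "to do")]
    print(map_reduce(docs, index_map, index_reduce))
```

Here k is a document id, v is the document text, k1 is a word, v1 is a document id, and v2 is the list of documents that contain the word.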
9. MapReduce: The Map Step
[Figure: each map call takes one input key-value pair (k, v) and produces intermediate key-value pairs (k1, v1), (k2, v2), …, (kn, vn)]
E.g. input (doc-id, doc-content), intermediate (word, wordcount-in-a-doc)
Adapted from Jeff Ullman’s course slides
10. MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups; each reduce call turns one group into output key-value pairs]
E.g. (word, wordcount-in-a-doc) → group → (word, list-of-wordcount) → reduce → (word, final-count)
The group step is similar to SQL GROUP BY; the reduce step is similar to SQL aggregation
Adapted from Jeff Ullman’s course slides
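The SQL analogy can be sketched directly: sorting and grouping the intermediate pairs plays the role of GROUP BY, and the reduce function plays the role of an aggregate such as SUM. The function names below are hypothetical, and the sort-then-group approach is one possible way to implement the group step, assumed here for illustration.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(pairs):
    """Group step: (k, v) pairs -> list of (k, [v, ...]), like SQL GROUP BY k."""
    ordered = sorted(pairs, key=itemgetter(0))  # groupby needs sorted input
    return [(k, [v for _, v in grp]) for k, grp in groupby(ordered, key=itemgetter(0))]


def reduce_groups(groups, aggregate=sum):
    """Reduce step: apply an aggregate to each group, like SQL SUM(v)."""
    return [(k, aggregate(vs)) for k, vs in groups]


if __name__ == "__main__":
    intermediate = [("rose", 2), ("is", 1), ("rose", 1), ("is", 1), ("a", 4)]
    groups = group_by_key(intermediate)
    print(groups)
    print(reduce_groups(groups))
```

Passing a different `aggregate` (for example `max` or `len`) changes the reduce step without touching the grouping, much as swapping SUM for MAX or COUNT would in a SQL query.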
11. Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The system performs the group-by step on the keys of the intermediate
// pairs emitted above, then calls reduce on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
14. MapReduce vs. Parallel Databases
MapReduce is widely used for parallel processing
– Google, Yahoo, and hundreds of other companies
– Example uses: computing PageRank, building keyword indices, analyzing web click logs, etc.
Database people say: but parallel databases have been doing this for decades
MapReduce people say:
– We operate at scales of thousands of machines
– We handle failures seamlessly
– We allow procedural code in map and reduce, and allow data of any type
15. Implementations
Google
– Not available outside Google
Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
Aster Data
– Cluster-optimized SQL database that also implements MapReduce
– IITB alumnus among founders
And several others, such as Cassandra at Facebook, etc.
16. Reading
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, http://labs.google.com/papers/gfs.html