MapReduce in cgrid and cloud computinge.ppt

The MapReduce Paradigm
• Platform for reliable, scalable parallel computing
• Abstracts the issues of a distributed and parallel environment away from the programmer
• Runs over distributed file systems
  – Google File System (GFS)
  – Hadoop Distributed File System (HDFS)

Distributed File Systems
• Highly scalable distributed file system for large data-intensive applications
  – E.g. 10K nodes, 100 million files, 10 PB
• Provides redundant storage of massive amounts of data on cheap and unreliable computers
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Provides a platform over which other systems, like MapReduce and BigTable, operate

Distributed File System
• Single namespace for the entire cluster
• Data coherency
  – Write-once-read-many access model
  – Client can only append to existing files
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent client
  – Client can find the location of blocks
  – Client accesses data directly from the DataNode
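The block layout described above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the round-robin placement policy, the DataNode names, and the replication factor of 3 are all assumptions made for the example. Only the 128 MB block size comes from the slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # typical block size quoted on the slide: 128 MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Under these assumptions a 300 MB file occupies two full blocks plus one partial block, and losing any single DataNode still leaves two replicas of every block.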

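HDFS Architecture
[Figure: a Client, the NameNode (with a Secondary NameNode), and the DataNodes. 1. The client sends a filename to the NameNode. 2. The NameNode returns block IDs and DataNodes. 3. The client reads data from the DataNodes.]
NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk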
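The three-step read in the figure can be modelled with a toy in-memory cluster. This is a sketch only: the class names mirror the slide, but the methods `get_block_locations` and `read_block`, the helper `read_file`, and the block IDs and paths are hypothetical and do not correspond to the real HDFS client API.

```python
class DataNode:
    """Maps a block-id to the block's bytes (standing in for a location on disk)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Maps a filename to an ordered list of (block_id, [DataNode names])."""

    def __init__(self):
        self.files = {}

    def get_block_locations(self, filename):
        return self.files[filename]


def read_file(filename, namenode, datanodes):
    """Client-side read: ask the NameNode for locations, then read from DataNodes directly."""
    data = b""
    # Steps 1 and 2: send the filename, receive block IDs and DataNodes.
    for block_id, locations in namenode.get_block_locations(filename):
        for node_name in locations:
            node = datanodes[node_name]
            if block_id in node.blocks:  # skip replicas that are missing
                data += node.read_block(block_id)  # step 3: read the data
                break
        else:
            raise IOError(f"no replica available for block {block_id}")
    return data


# Tiny cluster: two blocks, each replicated on two of three DataNodes.
datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
datanodes["dn1"].blocks["b0"] = b"hello "
datanodes["dn2"].blocks["b0"] = b"hello "
datanodes["dn2"].blocks["b1"] = b"world"
datanodes["dn3"].blocks["b1"] = b"world"

namenode = NameNode()
namenode.files["/docs/a.txt"] = [("b0", ["dn1", "dn2"]), ("b1", ["dn2", "dn3"])]

content = read_file("/docs/a.txt", namenode, datanodes)
```

The NameNode only hands out metadata; the bytes themselves flow between the client and the DataNodes, which is why a lost replica can be tolerated by trying the next location in the list.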

MapReduce: Insight
• Consider the problem of counting the number of occurrences of each word in a large collection of documents
• How would you do it in parallel?
• Solution:
  – Divide documents among workers
  – Each worker parses its documents to find all words and outputs (word, count) pairs
  – Partition the (word, count) pairs across workers based on word
  – For each word at a worker, locally add up the counts
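The four steps of this solution can be traced in a short Python sketch. The workers are simulated sequentially in one process, and the function names, the round-robin split of documents, and the use of a CRC32 hash for partitioning are assumptions made for illustration rather than part of any particular system.

```python
import zlib
from collections import Counter


def count_words(document):
    """Step 2: one worker parses a document and outputs (word, count) pairs."""
    return list(Counter(document.lower().split()).items())


def partition(word, num_workers):
    """Step 3: pick the worker responsible for a word (stable hash of the word)."""
    return zlib.crc32(word.encode("utf-8")) % num_workers


def parallel_word_count(documents, num_workers=3):
    # Step 1: divide documents among workers (round-robin).
    assignments = [documents[i::num_workers] for i in range(num_workers)]

    # Steps 2 and 3: each worker emits pairs, routed to a worker chosen by word.
    inboxes = [[] for _ in range(num_workers)]
    for worker_docs in assignments:
        for doc in worker_docs:
            for word, count in count_words(doc):
                inboxes[partition(word, num_workers)].append((word, count))

    # Step 4: each worker adds up the counts for the words it received.
    totals = {}
    for inbox in inboxes:
        local = Counter()
        for word, count in inbox:
            local[word] += count
        totals.update(local)
    return totals


docs = ["the cat sat", "the dog sat", "the end"]
result = parallel_word_count(docs)
```

Because every occurrence of a given word is routed to the same worker, the per-worker totals never overlap and can simply be merged at the end.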

MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
• Input: a set of key/value pairs
• User supplies two functions:
  – map(k, v) → list(k1, v1)
  – reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs
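The two signatures above can be written down as a small, single-process driver. This is a sketch of the model only: the name `map_reduce` and the in-memory grouping are assumptions, and a real system distributes these steps across machines. The usage builds a keyword index, with hypothetical `index_map` and `index_reduce` functions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")
K1 = TypeVar("K1")
V1 = TypeVar("V1")
V2 = TypeVar("V2")


def map_reduce(
    inputs: Iterable[Tuple[K, V]],
    map_fn: Callable[[K, V], Iterable[Tuple[K1, V1]]],
    reduce_fn: Callable[[K1, List[V1]], V2],
) -> Dict[K1, V2]:
    """Apply map_fn to every input pair, group by intermediate key, reduce each group."""
    groups: Dict[K1, List[V1]] = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):  # map(k, v) -> list(k1, v1)
            groups[k1].append(v1)
    # reduce(k1, list(v1)) -> v2; the output is the set of (k1, v2) pairs
    return {k1: reduce_fn(k1, values) for k1, values in groups.items()}


def index_map(doc_id, content):
    """Emit (word, doc_id) for every word in the document."""
    return [(word, doc_id) for word in content.split()]


def index_reduce(word, doc_ids):
    """Collapse the document IDs for one word into a sorted, duplicate-free list."""
    return sorted(set(doc_ids))


index = map_reduce(
    [("d1", "map reduce"), ("d2", "reduce step")], index_map, index_reduce
)
```

Only `index_map` and `index_reduce` are specific to the job; the driver itself never changes, which is the point of the programming model.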

MapReduce: The Map Step
[Figure: map is applied to each input key-value pair (k, v), e.g. (doc-id, doc-content), and emits intermediate key-value pairs (k1, v1), (k2, v2), …, (kn, vn), e.g. (word, wordcount-in-a-doc).]
Adapted from Jeff Ullman’s course slides

MapReduce: The Reduce Step
[Figure: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped by key into key-value groups, e.g. (word, list-of-wordcount), similar to SQL GROUP BY. reduce is applied to each group and emits output key-value pairs, e.g. (word, final-count), similar to SQL aggregation.]
Adapted from Jeff Ullman’s course slides

Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The group-by step is done by the system on the key of the intermediate
// pairs emitted above; reduce is then called on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
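For readers who want to run the pseudo-code, one possible Python rendering follows. It keeps the string-valued counts of the pseudo-code; the names `map_fn`, `reduce_fn`, and `run`, and the in-memory grouping that stands in for the system's group-by step, are assumptions made for this sketch.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name (unused here); input_value: document contents
    for word in input_value.split():
        yield word, "1"  # plays the role of EmitIntermediate(w, "1")


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: an iterator of counts as strings
    result = 0
    for v in intermediate_values:
        result += int(v)
    return str(result)  # plays the role of Emit(AsString(result))


def run(documents):
    # Stand-in for the group-by step the system performs between map and reduce.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return {word: reduce_fn(word, iter(values)) for word, values in groups.items()}


counts = run({"a.txt": "to be or not to be", "b.txt": "to do"})
```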

MapReduce: Execution overview

Distributed Execution Overview
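[Figure: the User Program forks a Master and several Workers. The Master assigns map tasks and reduce tasks to Workers. Map workers read the input splits (Split 0, Split 1, Split 2) of the input data from the distributed file system and write intermediate results to local disk. Reduce workers read the intermediate data remotely, sort it, and write Output File 0 and Output File 1.]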
Map Reduce vs. Parallel Databases
• Map Reduce is widely used for parallel processing
  – Google, Yahoo, and 100s of other companies
  – Example uses: computing PageRank, building keyword indices, analyzing web click logs, …
• Database people say: but parallel databases have been doing this for decades
• Map Reduce people say:
  – We operate at scales of 1000s of machines
  – We handle failures seamlessly
  – We allow procedural code in map and reduce and allow data of any type

Implementations
• Google
  – Not available outside Google
• Hadoop
  – An open-source implementation in Java
  – Uses HDFS for stable storage
  – Download: http://lucene.apache.org/hadoop/
• Aster Data
  – Cluster-optimized SQL database that also implements MapReduce
  – IITB alumnus among founders
• And several others, such as Cassandra at Facebook

Reading
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, http://labs.google.com/papers/gfs.html
