MapReduce basics
        Harisankar H,
     PhD student, DOS lab,
     Dept. CSE, IIT Madras

          6-Feb-2013

http://harisankarh.wordpress.com
Distributed processing ?
 • Processing distributed across multiple
   machines/servers




Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
Why distributed processing?
– Reduce execution time of large jobs
   • E.g., extracting URLs from terabytes of data
   • 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
   • Other nodes take over the jobs if some of the
     nodes fail
      – Typically, if you have 10,000 servers, on average one will
        fail per day
Issues in distributed processing
• Traditionally realized using special-purpose
  implementations
   – E.g., indexer, log processor
• Implementation is really hard at the socket-programming level
   – Fault-tolerance
      • Keep track of failure, reassignment of tasks
   – Hand-coded parallelization
   – Scheduling across heterogeneous nodes
   – Locality
      • Minimise movement of data for computation
   – How to distribute data?
• Results in:
   – Complex, brittle, non-generic code
   – Reimplementation of common features like fault-tolerance,
     distribution
Need for a generic abstraction for
         distributed processing

App programmer <-> abstraction <-> systems developer

                   Separation of concerns

                  App programmer: express app logic
                  Systems developer: performance, fault handling etc.

 • Tradeoff between genericity and performance
   – More generic => usually less performance
 • MapReduce is probably a sweet spot where you
   get both to some extent
MapReduce abstraction (app programmer’s view)
  • Model input and output as <key,value> pairs
  • Provide map() and reduce() functions which
    act on <k,v> pairs
  • Input: set of <k,v> pairs: {k,v}
      – For each input <k,v>:
             map(k1,v1) → list(k2,v2)
      – For each unique output key from map:
             reduce(k2, combined list(v2)) → list(v3)

The system takes care of distributing the tasks across thousands of machines,
handling locality, fault-tolerance, etc.
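The two-function contract above is all the app programmer writes. As a rough illustration (plain Java with made-up class and interface names — a sketch of the abstraction, not the Hadoop API), the framework's job boils down to: call map() on every input pair, group the intermediate pairs by key (the shuffle), and call reduce() once per unique key:

```java
import java.util.*;

public class MiniMapReduce {
    // Hypothetical single-machine stand-ins for the user-supplied functions
    interface Mapper { List<Map.Entry<String, String>> map(String k1, String v1); }
    interface Reducer { String reduce(String k2, List<String> v2s); }

    // The "framework": map every input pair, shuffle by key, reduce per key
    public static Map<String, String> run(Map<String, String> input, Mapper m, Reducer r) {
        Map<String, List<String>> groups = new TreeMap<>();    // shuffle: group by k2
        for (Map.Entry<String, String> e : input.entrySet())
            for (Map.Entry<String, String> kv : m.map(e.getKey(), e.getValue()))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, String> output = new TreeMap<>();          // one reduce() per unique k2
        for (Map.Entry<String, List<String>> g : groups.entrySet())
            output.put(g.getKey(), r.reduce(g.getKey(), g.getValue()));
        return output;
    }

    public static void main(String[] args) {
        // Word count expressed in the abstraction: map emits (word, "1"),
        // reduce counts the list of "1"s for each word
        Mapper m = (doc, text) -> {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            for (String w : text.split(" ")) out.add(Map.entry(w, "1"));
            return out;
        };
        Reducer r = (word, ones) -> Integer.toString(ones.size());
        System.out.println(run(Map.of("doc1", "a b a", "doc2", "b c"), m, r));
        // prints {a=2, b=2, c=1}
    }
}
```

The grouping step is exactly the work the system performs between the map and reduce phases; here it is a single in-memory map, while a real engine does it across machines.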
Example: word count
• Problem:
   – Count the number of occurrences of each unique
     word in a big collection of documents
• Input <k,v> set:
   – <document name, document contents>
      • Organize the files in this format
• Output:
   – <word, count>
      • Get it in output files
• Next step:
   – Define the map() and reduce() functions
Word count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Program in Java

public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
Implementing MapReduce abstraction

App programmer <-> abstraction <-> systems developer


 • Looked at the application programmer’s view
 • Need a platform which implements the
   MapReduce abstraction
 • Hadoop is a popular open-source
   implementation of the MapReduce abstraction
 • Questions for the platform developer
   – How to
      •   parallelize ?
      •   handle faults ?
      •   provide locality ?
      •   distribute the data ?
Basics of platform implementation
• parallelize ?
   – Each map can be executed independently in parallel
   – After all maps have finished execution, all reduces can be
     executed in parallel
• handle faults ?
   – map() and reduce() have no internal state
      • Simply re-execute in case of a failure
• distribute the data ?
   – Have a distributed file system (e.g., HDFS)
• provide locality ?
   – Prefer to execute map() on the nodes holding its input <k,v>
     pair
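Because map() and reduce() keep no internal state, re-execution alone is a complete fault-handling strategy: there is nothing to roll back. A minimal sketch of the idea (the runner and its names are ours, not from any MR implementation):

```java
import java.util.function.Supplier;

public class RetryingRunner {
    // Re-execute a stateless task until it succeeds or attempts run out.
    // Safe only because the task has no side effects on shared state.
    public static <T> T runWithRetry(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();          // e.g. execute one map task
            } catch (RuntimeException e) {  // e.g. worker crashed mid-task
                last = e;                   // no cleanup needed: task kept no state
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] failures = {2};  // simulate a worker that fails twice, then succeeds
        int result = runWithRetry(() -> {
            if (failures[0]-- > 0) throw new RuntimeException("worker lost");
            return 42;  // the (re)computed task output
        }, 5);
        System.out.println(result);  // 42
    }
}
```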
MapReduce implementation
• Distributed File System (DFS) +
  MapReduce (MR) engine
  – Specifically, the MR engine uses a DFS
• Distributed file system
  – Files split into large chunks and stored in the
    distributed file system (e.g., HDFS)
  – Large chunks: typically 64 MB per block
  – Can have a master-slave architecture
     • Master assigns and manages replicated blocks in the
       slaves
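The chunking arithmetic is plain ceiling division; a small sketch (helper names are ours, 64 MB block size as on the slide):

```java
public class BlockSplitter {
    static final long BLOCK_SIZE = 64L * 1024 * 1024;  // 64 MB per block, as on the slide

    // Number of blocks needed to store a file of the given size (ceiling division)
    public static long numBlocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Byte offset at which block i of a file starts
    public static long blockStart(long i) {
        return i * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;       // a 200 MB file
        System.out.println(numBlocks(fileSize));  // 4 blocks: 64 + 64 + 64 + 8 MB
    }
}
```

Each block is then replicated across slaves by the master; the block boundaries also become natural units for assigning map tasks close to the data.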
MapReduce engine
• Has a master-slave architecture
  – Master co-ordinates the task execution across
    workers
  – Workers perform the map() and reduce()
    functions
     • Read and write blocks to/from the DFS
  – Master keeps track of worker failures and
    reassigns tasks if necessary
     • Failure detection usually done through timeouts
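Timeout-based failure detection can be sketched as the master keeping the last heartbeat time per worker (a simplified, single-threaded illustration with an explicit clock parameter; a real master would use wall-clock time and handle concurrency):

```java
import java.util.*;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // A worker reports in; the master records the heartbeat time
    public void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout are presumed failed,
    // and their tasks become candidates for reassignment
    public Set<String> suspectedFailed(long nowMillis) {
        Set<String> failed = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) failed.add(e.getKey());
        }
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor(10_000);  // 10 s timeout
        m.heartbeat("worker-1", 0);
        m.heartbeat("worker-2", 0);
        m.heartbeat("worker-2", 8_000);                     // worker-2 keeps reporting
        System.out.println(m.suspectedFailed(15_000));      // [worker-1]
    }
}
```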
(Slide: diagram of the master and workers communicating over the network)
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
  – Model map() and reduce() jobs appropriately
  – Use combine() functions
     • combine(<k,[v]>) → <k,[v]>
     • combine() executes after the map()s finish on each node
         – map() →[same node]→ combine() →[network]→ reduce()

• Make map jobs of roughly equal expected
  execution times
• Try to make reduce() jobs less skewed
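A combiner is just a local reduce run on one node's map output before anything crosses the network; for word count, summing locally means at most one pair per distinct word is shipped. A sketch (class and method names are ours):

```java
import java.util.*;

public class CombinerDemo {
    // combine(): locally sum the 1s emitted by map() on one node,
    // so only one pair per distinct word crosses the network
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // map() on one node emitted 6 pairs for the text "to be or not to be"
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : "to be or not to be".split(" ")) mapOutput.add(Map.entry(w, 1));

        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size());  // 6 pairs before combining
        System.out.println(combined.size());   // 4 pairs shipped to reducers
    }
}
```

This only works because addition is associative and commutative: combine() must produce intermediate values that the real reduce() can still merge correctly.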
Pros and cons of MapReduce
• Advantages
  – Simple, easy-to-use distributed processing system
  – Reasonably generic
  – Exploits locality for performance
  – Simple and less buggy implementation
• Issues
  – Not a magic bullet that fits all problems
       • Difficult to model iterative and recursive computations
           – E.g.: k-means clustering
           – Generate-Map-Reduce
       • Difficult to model streaming computations
       • Centralized entities like the master become bottlenecks
       • Most real-world problems require large chains of MR jobs
Summary
  • Today
       –   Distributed processing issues, MR programming model
       –   Sample MR job
       –   How MR can be implemented
       –   Pros and cons of MR, tips for better performance
  • Tomorrow
       – Details specific to Hadoop
       – Downloading and setting up of Hadoop on a cluster

Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data
processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
Hadoop components
• HDFS
  – Master: NameNode
  – Slave: DataNode
• MapReduce engine
  – Master: JobTracker
  – Slave: TaskTracker
