An Introduction to MapReduce. Francisco Pérez-Sorrosal, Distributed Systems Lab (DSL/LSD), Universidad Politécnica de Madrid. 10/Apr/2008
Outline: Motivation. What is MapReduce? Simple Example. What is MapReduce's Main Goal? Main Features. What Does MapReduce Solve? Programming Model. Framework Overview. Example. Other Features. Hadoop: A MapReduce Implementation. Example. References.
Motivation: Increasing demand for large-scale processing applications: web engines, semantic search tools, scientific applications... Most of these applications can be parallelized. There are many ad-hoc implementations for such applications, but...
Motivation (II): ...developing and managing the execution of such ad-hoc parallel applications was too complex. It usually implies the use and management of hundreds or thousands of machines. However, these applications basically share the same problems: parallelization, fault-tolerance, data distribution, and load balancing.
What is MapReduce? It is a framework to... ...automatically partition jobs that have large input data sets into simpler work units or tasks, distribute them to the nodes of a cluster (map) and... ...combine the intermediate results of those tasks (reduce) to produce the required results. Presented by Google in 2004: http://labs.google.com/papers/mapreduce.html
Simple Example (figure): input data, mapped data on Node 1, mapped data on Node 2, result.
What is MapReduce's Main Goal? Simplify the parallelization and distribution of large-scale computations in clusters.
MapReduce Main Features: Simple interface. Automatic partitioning, parallelization and distribution of tasks. Fault-tolerance. Status and monitoring.
What does MapReduce solve? It allows programmers with no experience in parallel and distributed systems to use large distributed systems easily. Used extensively in many applications inside Google and Yahoo that... ...require simple processing tasks... ...but have large input data sets.
What does MapReduce solve? Examples: distributed grep, distributed sort, counting URL access frequency, web crawling, representing the structure of web documents, generating summaries (pages crawled per host, most frequent queries, results returned...).
Programming Model. Input & output: each is a set of key/value pairs. Map: processes input key/value pairs and computes a set of intermediate key/value pairs: map(in_key, in_value) -> list(int_key, intermediate_value). Reduce: combines all the intermediate values that share the same key and produces a set of merged output values (usually just one per key): reduce(int_key, list(intermediate_value)) -> list(out_value).
Programming Model: Example. Problem: count URL access frequency. Input: a log of web page requests. Map: processes the assigned chunk of the log and emits a set of intermediate pairs <URL, 1>. Reduce: processes the intermediate pairs <URL, 1>, adds together all the values that share the same URL, and produces a set of pairs of the form <URL, total count>.
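As a rough sketch of how this example could be written against the (old) Hadoop API shown later in this deck, the mapper and reducer below would be nested inside a driver class like the WordCount example that follows. The class names and the assumption that the requested URL is the first whitespace-separated field of each log line are illustrative assumptions, not part of the original slides.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public static class UrlCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text url = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Assumption: the requested URL is the first field of the log line
    String[] fields = value.toString().split("\\s+");
    if (fields.length > 0 && !fields[0].isEmpty()) {
      url.set(fields[0]);
      output.collect(url, one);          // emit <URL, 1>
    }
  }
}

public static class UrlCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();        // add together all values for the same URL
    }
    output.collect(key, new IntWritable(sum));   // emit <URL, total count>
  }
}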
Framework Overview (figure)
Framework Overview (figure)
Example: Count # of Each Letter in a Big File. 1) The master splits the 640 MB file into 10 pieces of 64 MB each. R = 4 output files (set by the user). There are 26 different key letters in the range [a..z]. (Figure: the big file, its 10 splits, and eight idle workers.)
Example: Count # of Each Letter in a Big File. 2) The master assigns map and reduce tasks to the idle workers. (Figure: the workers on the left become mappers, those on the right become reducers.)
Example: Count # of Each Letter in a Big File. 3) The map tasks read their assigned splits. (Figure: four map tasks in progress, four reduce tasks idle.)
Example: Count # of Each Letter in a Big File. 4) Each map task processes its data in memory. A partition function maps the letters into the R = 4 regions (R1 holds the letters a..g, and the remaining letters are split across R2..R4). On Machine 1, Map Task 1 emits pairs such as (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1), each placed into its region.
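A minimal sketch of what such a partition function might look like; the contiguous alphabetical split and the method name are assumptions chosen to match this example (Hadoop's default partitioner instead hashes the key modulo R):

// Assign a single-letter key to one of R regions by splitting [a..z] into
// contiguous blocks; with R = 4 this sends 'a'..'g' to region 0 (R1),
// 'h'..'n' to region 1, and so on.
static int partition(char letter, int numRegions) {
  int index = Character.toLowerCase(letter) - 'a';            // 0..25
  int lettersPerRegion = (int) Math.ceil(26.0 / numRegions);  // 7 when R = 4
  return index / lettersPerRegion;
}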
Example: Count # of Each Letter in a Big File. 5) Apply the combiner function: the pairs emitted by Map Task 1, (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1), are combined in memory into (a,2) (b,1) (m,1) (o,1) (p,1) (r,1) (y,1).
Example: Count # of Each Letter in a Big File. 6) Map Task 1 stores its combined results, partitioned into regions R1..R4, from memory onto local disk.
Example: Count # of Each Letter in a Big File. 7) The worker informs the master about the position of the intermediate results of Map Task 1 on its local disk.
Example: Count # of Each Letter in a Big File. 8) The master assigns the next task (Map Task 5) to the worker that has just become free.
Example: Count # of Each Letter in a Big File. 9) The master forwards the location of the intermediate results of Map Task 1 to the reducers (one location per region R1..Rx).
Example: Count # of Each Letter in a Big File. The letters in Region 1 are a b c d e f g. Across all map tasks, region 1 accumulates pairs such as (a,2) (b,1) (e,1) (d,1) (c,1) (e,1) (g,1) (e,1) (a,3) (c,1) (c,1) (a,1) (b,1) (a,2) (f,1) (e,1) (a,2) (e,1) (c,1) (e,1), which the still-idle Reduce Task 1 will consume.
Example: Count # of Each Letter in a Big File. 10) Reduce Task 1 reads, from each map task, the intermediate data stored in region 1.
Example: Count # of Each Letter in a Big File. 11) Reduce Task 1 sorts the data by key: (a,2) (a,3) (a,1) (a,2) (a,2) (b,1) (b,1) (c,1) (c,1) (c,1) (c,1) (d,1) (e,1) (e,1) (e,1) (e,1) (e,1) (e,1) (f,1) (g,1).
Example: Count # of Each Letter in a Big File. 12) It then passes each key and its corresponding set of intermediate values to the user's reduce function: (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d, {1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}).
Example: Count # of Each Letter in a Big File. 13) Finally, after executing the user's reduce function, it generates output file 1 of R: (a, 10) (b, 2) (c, 4) (d, 1) (e, 6) (f, 1) (g, 1).
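To make the counting logic of this walkthrough concrete, here is a tiny standalone sketch (plain Java, no Hadoop) that collapses map, combine, and reduce into one local pass over a short string; it only illustrates the per-letter counting, not the distributed execution, and the input string is an arbitrary stand-in.

import java.util.Map;
import java.util.TreeMap;

public class LetterCountSketch {
  public static void main(String[] args) {
    String input = "atbomapreduce";              // illustrative stand-in for the big file
    Map<Character, Integer> counts = new TreeMap<>();
    for (char c : input.toCharArray()) {
      if (Character.isLetter(c)) {
        counts.merge(Character.toLowerCase(c), 1, Integer::sum);   // count each letter
      }
    }
    // Prints pairs such as (a, 2) (b, 1) (c, 1) ... in key order
    counts.forEach((letter, n) -> System.out.println("(" + letter + ", " + n + ")"));
  }
}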
Other Features: Failures. Re-execution is the main mechanism for fault-tolerance. Worker failures: the master detects worker failures via periodic heartbeats and drives the re-execution of tasks; both completed and in-progress map tasks are re-executed, but only in-progress reduce tasks are re-executed. Master failure: the initial implementation did not support failures of the master. Solutions: checkpoint the state of its internal structures in GFS, or use replication techniques. Robust: a run once lost 1600 of 1800 machines, but still finished fine.
Other Features: Locality. Most input data is read locally. Why? To avoid consuming network bandwidth. How does it achieve that? The master attempts to schedule a map task on a machine that contains a replica (in GFS) of the corresponding input data; if that fails, it attempts to schedule the task near a replica (e.g. on the same network switch).
Other Features: Backup Tasks. Some tasks may be delayed (stragglers): a machine takes too long to complete one of the last few map or reduce tasks. Causes: a bad disk, contention with other processes, processor caches disabled. Solution: when the job is close to completion, the master schedules backup tasks for the remaining in-progress tasks; whichever copy finishes first "wins". Effect: dramatically shortens job completion time.
Performance. Tests run on a cluster of ~1800 machines: 4 GB of memory, dual-processor 2 GHz Xeons with Hyper-Threading, dual 160 GB IDE disks, and Gigabit Ethernet per machine. All machines were placed in the same hosting facility.
Performance: Distributed Grep Program. Searches for a rare three-character pattern; the pattern occurs 97,337 times. Scans through 10^10 100-byte records (the input). The input is split into pieces of approx. 64 MB (map tasks = 15,000). The entire output is placed in one file (reducers = 1).
Performance: Grep. The test completes in ~150 sec. The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s; without it, the rack switches would limit the rate to 10 GB/s. Startup overhead is significant for short jobs. (Figure: scan rate over time for 1764 workers; the rate falls off as maps start to finish.)
Hadoop: A MapReduce Implementation. http://hadoop.apache.org. Installing Hadoop MapReduce: install Hadoop Core, then configure the Hadoop site in conf/hadoop-site.xml with the HDFS master (fs.default.name), the MapReduce master (mapred.job.tracker), and the replication factor for files in the cluster (dfs.replication):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Hadoop: A MapReduce Implementation. Create a distributed filesystem: $ bin/hadoop namenode -format. Start the Hadoop daemons: $ bin/start-all.sh (equivalent to $ bin/start-dfs.sh plus $ bin/start-mapred.sh). Check the namenode (HDFS): http://localhost:50070/. Check the job tracker (MapReduce): http://localhost:50030/.
Hadoop: HDFS Console (screenshot)
Hadoop: JobTracker Console (screenshot)
Hadoop: Word Count Example
$ bin/hadoop dfs -ls /tmp/fperez-hadoop/wordcount/input/
/tmp/fperez-hadoop/wordcount/input/file01
/tmp/fperez-hadoop/wordcount/input/file02
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file01
Welcome To Hadoop World
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file02
Goodbye Hadoop World
Hadoop: Running the Example
Run the application:
$ bin/hadoop jar /tmp/fperez-hadoop/wordcount.jar org.myorg.WordCount /tmp/fperez-hadoop/wordcount/input /tmp/fperez-hadoop/wordcount/output
Output:
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/output/part-00000
Goodbye 1
Hadoop 2
To 1
Welcome 1
World 2
Hadoop: Word Count Example
public class WordCount extends Configured implements Tool {
  ...
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    ...  // Map Task Definition
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    ...  // Reduce Task Definition
  }

  public int run(String[] args) throws Exception {
    ...  // Job Configuration
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Hadoop: Job Configuration
public int run(String[] args) throws Exception {
  JobConf conf = new JobConf(getConf(), WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
  return 0;
}
Hadoop: Map Class
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // map(WritableComparable, Writable, OutputCollector, Reporter)
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Hadoop: Reduce Class
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // reduce(WritableComparable, Iterator, OutputCollector, Reporter)
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
References
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04, San Francisco, CA, December 2004.
Ralf Lämmel. Google's MapReduce Programming Model – Revisited. 2006-2007. Accepted for publication in the Science of Computer Programming journal.
Jeff Dean and Sanjay Ghemawat. Slides from OSDI'04. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
Hadoop. http://hadoop.apache.org
Questions?

Editor's Notes

  • #4: There are libraries for programming clusters, such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).
  • #6: Built on top of GFS.
  • #7: Given a collection of shapes, we split the collection into two parts and send each part to a grid node. Each node counts the shapes it receives and returns the count to the caller. The caller then adds the results received from the remote nodes and returns the reduced result to the user (the counts are displayed next to each shape).
  • #8: Simplify the parallelization and distribution of massive data processing.
  • #9: So, in order to achieve this goal, MR provides...
  • #13: The input is a list of web page URLs.
  • #16: Let's see with an example how the MapReduce framework works. The example program counts the number of occurrences of each letter that appears in a big file and classifies the letters into 4 different output files, by ranges of seven letters (e.g. from A to G, from H to N...). For this it uses a partition function defined by the user, which is usually a hash function.
  • #17: The master assigns a specific task to each worker. In this case we assume that the workers on the left are assigned map tasks and those on the right reduce tasks.
  • #18: The next step assigns to each map task its corresponding part of the input file.
  • #19: Using the partition function, task 1 classifies the letters into the regions as it goes: the 'a' goes to region 1, the 'y' to region 4.
  • #30: Completed map tasks need to be re-executed because their results are stored on local disks and become inaccessible. Completed reduce tasks do not need to be re-executed because their results are in GFS, which provides fault-tolerance through data replication.