An Introduction to MapReduce. Francisco Pérez-Sorrosal, Distributed Systems Lab (DSL/LSD), Universidad Politécnica de Madrid. 10/Apr/2008
Outline: Motivation. What is MapReduce? Simple Example. What is MapReduce's Main Goal? Main Features. What Does MapReduce Solve? Programming Model. Framework Overview. Example. Other Features. Hadoop: A MapReduce Implementation. Example. References.
Motivation: Increasing demand for large-scale processing applications: web engines, semantic search tools, scientific applications... Most of these applications can be parallelized. There are many ad-hoc implementations for such applications, but...
Motivation (II): ...developing and managing the execution of such ad-hoc parallel applications was too complex. It usually implies the use and management of hundreds or thousands of machines. However, these applications basically share the same problems: parallelization, fault-tolerance, data distribution, and load balancing.
What is MapReduce? It is a framework to... ...automatically partition jobs that have large input data sets into simpler work units or tasks, distribute them to the nodes of a cluster (map) and... ...combine the intermediate results of those tasks (reduce) to produce the required results. Presented by Google in 2004: http://labs.google.com/papers/mapreduce.html
Simple Example (figure): input data, mapped data on Node 1, mapped data on Node 2, result.
What is MapReduce's Main Goal? Simplify the parallelization and distribution of large-scale computations in clusters.
MapReduce Main Features: Simple interface. Automatic partitioning, parallelization and distribution of tasks. Fault-tolerance. Status and monitoring.
What does MapReduce solve? It allows programmers with no experience in parallel and distributed systems to use large distributed systems easily. Used extensively in many applications inside Google and Yahoo that... ...require simple processing tasks... ...but have large input data sets.
What does MapReduce solve? Examples: distributed grep, distributed sort, counting URL access frequency, web crawling, representing the structure of web documents, generating summaries (pages crawled per host, most frequent queries, results returned...).
Programming Model. Input & output: each is a set of key/value pairs. Map: processes input key/value pairs and computes a set of intermediate key/value pairs: map(in_key, in_value) -> list(int_key, intermediate_value). Reduce: combines all the intermediate values that share the same key and produces a set of merged output values (usually just one per key): reduce(int_key, list(intermediate_value)) -> list(out_value).
Programming Model: Example. Problem: count URL access frequency. Input: a log of web page requests. Map: processes the assigned chunk of the log and emits a set of intermediate pairs <URL, 1>. Reduce: processes the intermediate pairs <URL, 1>, adds together all the values that share the same URL, and produces a set of pairs of the form <URL, total count>.
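As a rough sketch of how this example could be written against the (old) Hadoop API shown later in this deck, the mapper and reducer below would be nested inside a driver class like the WordCount example that follows. The class names and the assumption that the requested URL is the first whitespace-separated field of each log line are illustrative assumptions, not part of the original slides.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public static class UrlCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text url = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Assumption: the requested URL is the first field of the log line
    String[] fields = value.toString().split("\\s+");
    if (fields.length > 0 && !fields[0].isEmpty()) {
      url.set(fields[0]);
      output.collect(url, one);          // emit <URL, 1>
    }
  }
}

public static class UrlCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();        // add together all values for the same URL
    }
    output.collect(key, new IntWritable(sum));   // emit <URL, total count>
  }
}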
Framework Overview (figure)
Framework Overview (figure)
Example: Count # of Each Letter in a Big File. 1) The master splits the 640 MB file into 10 pieces of 64 MB each. R = 4 output files (set by the user). There are 26 different key letters in the range [a..z]. (Figure: the big file, its 10 splits, and eight idle workers.)
Example: Count # of Each Letter in a Big File. 2) The master assigns map and reduce tasks to the idle workers. (Figure: the workers on the left become mappers, those on the right become reducers.)
Example: Count # of Each Letter in a Big File. 3) The map tasks read their assigned splits. (Figure: four map tasks in progress, four reduce tasks idle.)
Example: Count # of Each Letter in a Big File. 4) Each map task processes its data in memory. A partition function maps the letters into the R = 4 regions (R1 holds the letters a..g, and the remaining letters are split across R2..R4). On Machine 1, Map Task 1 emits pairs such as (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1), each placed into its region.
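A minimal sketch of what such a partition function might look like; the contiguous alphabetical split and the method name are assumptions chosen to match this example (Hadoop's default partitioner instead hashes the key modulo R):

// Assign a single-letter key to one of R regions by splitting [a..z] into
// contiguous blocks; with R = 4 this sends 'a'..'g' to region 0 (R1),
// 'h'..'n' to region 1, and so on.
static int partition(char letter, int numRegions) {
  int index = Character.toLowerCase(letter) - 'a';            // 0..25
  int lettersPerRegion = (int) Math.ceil(26.0 / numRegions);  // 7 when R = 4
  return index / lettersPerRegion;
}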
Example: Count # of Each Letter in a Big File. 5) Apply the combiner function: the pairs emitted by Map Task 1, (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1), are combined in memory into (a,2) (b,1) (m,1) (o,1) (p,1) (r,1) (y,1).
Example: Count # of Each Letter in a Big File. 6) Map Task 1 stores its combined results, partitioned into regions R1..R4, from memory onto local disk.
Example: Count # of Each Letter in a Big File. 7) The worker informs the master about the position of the intermediate results of Map Task 1 on its local disk.
Example: Count # of Each Letter in a Big File. 8) The master assigns the next task (Map Task 5) to the worker that has just become free.
Example: Count # of Each Letter in a Big File. 9) The master forwards the location of the intermediate results of Map Task 1 to the reducers (one location per region R1..Rx).
Example: Count # of Each Letter in a Big File. The letters in Region 1 are a b c d e f g. Across all map tasks, region 1 accumulates pairs such as (a,2) (b,1) (e,1) (d,1) (c,1) (e,1) (g,1) (e,1) (a,3) (c,1) (c,1) (a,1) (b,1) (a,2) (f,1) (e,1) (a,2) (e,1) (c,1) (e,1), which the still-idle Reduce Task 1 will consume.
Example: Count # of Each Letter in a Big File. 10) Reduce Task 1 reads, from each map task, the intermediate data stored in region 1.
Example: Count # of Each Letter in a Big File. 11) Reduce Task 1 sorts the data by key: (a,2) (a,3) (a,1) (a,2) (a,2) (b,1) (b,1) (c,1) (c,1) (c,1) (c,1) (d,1) (e,1) (e,1) (e,1) (e,1) (e,1) (e,1) (f,1) (g,1).
Example: Count # of Each Letter in a Big File. 12) It then passes each key and its corresponding set of intermediate values to the user's reduce function: (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d, {1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1}).
Example: Count # of Each Letter in a Big File. 13) Finally, after executing the user's reduce function, it generates output file 1 of R: (a, 10) (b, 2) (c, 4) (d, 1) (e, 6) (f, 1) (g, 1).
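To make the counting logic of this walkthrough concrete, here is a tiny standalone sketch (plain Java, no Hadoop) that collapses map, combine, and reduce into one local pass over a short string; it only illustrates the per-letter counting, not the distributed execution, and the input string is an arbitrary stand-in.

import java.util.Map;
import java.util.TreeMap;

public class LetterCountSketch {
  public static void main(String[] args) {
    String input = "atbomapreduce";              // illustrative stand-in for the big file
    Map<Character, Integer> counts = new TreeMap<>();
    for (char c : input.toCharArray()) {
      if (Character.isLetter(c)) {
        counts.merge(Character.toLowerCase(c), 1, Integer::sum);   // count each letter
      }
    }
    // Prints pairs such as (a, 2) (b, 1) (c, 1) ... in key order
    counts.forEach((letter, n) -> System.out.println("(" + letter + ", " + n + ")"));
  }
}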
Other Features: Failures. Re-execution is the main mechanism for fault-tolerance. Worker failures: the master detects worker failures via periodic heartbeats and drives the re-execution of tasks; both completed and in-progress map tasks are re-executed, but only in-progress reduce tasks are re-executed. Master failure: the initial implementation did not support failures of the master. Solutions: checkpoint the state of its internal structures in GFS, or use replication techniques. Robust: a run once lost 1600 of 1800 machines, but still finished fine.
Other Features: Locality. Most input data is read locally. Why? To avoid consuming network bandwidth. How does it achieve that? The master attempts to schedule a map task on a machine that contains a replica (in GFS) of the corresponding input data; if that fails, it attempts to schedule the task near a replica (e.g. on the same network switch).
Other Features: Backup Tasks. Some tasks may be delayed (stragglers): a machine takes too long to complete one of the last few map or reduce tasks. Causes: a bad disk, contention with other processes, processor caches disabled. Solution: when the job is close to completion, the master schedules backup tasks for the remaining in-progress tasks; whichever copy finishes first "wins". Effect: dramatically shortens job completion time.
Performance. Tests run on a cluster of ~1800 machines: 4 GB of memory, dual-processor 2 GHz Xeons with Hyper-Threading, dual 160 GB IDE disks, and Gigabit Ethernet per machine. All machines were placed in the same hosting facility.
Performance: Distributed Grep Program. Searches for a rare three-character pattern; the pattern occurs 97,337 times. Scans through 10^10 100-byte records (the input). The input is split into pieces of approx. 64 MB (map tasks = 15,000). The entire output is placed in one file (reducers = 1).
Performance: Grep. The test completes in ~150 sec. The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s; without it, the rack switches would limit the rate to 10 GB/s. Startup overhead is significant for short jobs. (Figure: scan rate over time for 1764 workers; the rate falls off as maps start to finish.)
Hadoop: A MapReduce Implementation. http://hadoop.apache.org. Installing Hadoop MapReduce: install Hadoop Core, then configure the Hadoop site in conf/hadoop-site.xml with the HDFS master (fs.default.name), the MapReduce master (mapred.job.tracker), and the replication factor for files in the cluster (dfs.replication):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Hadoop: A MapReduce Implementation. Create a distributed filesystem: $ bin/hadoop namenode -format. Start the Hadoop daemons: $ bin/start-all.sh (equivalent to $ bin/start-dfs.sh plus $ bin/start-mapred.sh). Check the namenode (HDFS): http://localhost:50070/. Check the job tracker (MapReduce): http://localhost:50030/.
Hadoop: HDFS Console (screenshot)
Hadoop: JobTracker Console (screenshot)
Hadoop: Word Count Example
$ bin/hadoop dfs -ls /tmp/fperez-hadoop/wordcount/input/
/tmp/fperez-hadoop/wordcount/input/file01
/tmp/fperez-hadoop/wordcount/input/file02
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file01
Welcome To Hadoop World
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file02
Goodbye Hadoop World
Hadoop: Running the Example
Run the application:
$ bin/hadoop jar /tmp/fperez-hadoop/wordcount.jar org.myorg.WordCount /tmp/fperez-hadoop/wordcount/input /tmp/fperez-hadoop/wordcount/output
Output:
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/output/part-00000
Goodbye 1
Hadoop 2
To 1
Welcome 1
World 2
Hadoop: Word Count Example
public class WordCount extends Configured implements Tool {
  ...
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    ...  // Map Task Definition
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    ...  // Reduce Task Definition
  }

  public int run(String[] args) throws Exception {
    ...  // Job Configuration
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Hadoop: Job Configuration
public int run(String[] args) throws Exception {
  JobConf conf = new JobConf(getConf(), WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
  return 0;
}
Hadoop: Map Class
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // map(WritableComparable, Writable, OutputCollector, Reporter)
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Hadoop: Reduce Class
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // reduce(WritableComparable, Iterator, OutputCollector, Reporter)
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
References
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04, San Francisco, CA, December 2004.
Ralf Lämmel. Google's MapReduce Programming Model – Revisited. 2006-2007. Accepted for publication in the Science of Computer Programming journal.
Jeff Dean and Sanjay Ghemawat. Slides from OSDI'04. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
Hadoop. http://hadoop.apache.org
Questions?

Editor's Notes

  • #4: There are libraries for programming clusters, such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).
  • #6: Built on top of GFS.
  • #7: Given a collection of shapes, we split the collection into two parts and send each part to a grid node. Each node counts the shapes it receives and returns the count to the caller. The caller then adds the results received from the remote nodes and returns the reduced result to the user (the counts are displayed next to each shape).
  • #8: Simplify the parallelization and distribution of massive data processing.
  • #9: So, in order to achieve this goal, MR provides...
  • #13: The input is a list of web page URLs.
  • #16: Let's see with an example how the MapReduce framework works. The example program counts the number of occurrences of each letter that appears in a big file and classifies the letters into 4 different output files, by ranges of seven letters (e.g. from A to G, from H to N...). For this it uses a partition function defined by the user, which is usually a hash function.
  • #17: The master assigns a specific task to each worker. In this case we assume that the workers on the left are assigned map tasks and those on the right reduce tasks.
  • #18: The next step assigns to each map task its corresponding part of the input file.
  • #19: Using the partition function, task 1 classifies the letters into the regions as it goes: the 'a' goes to region 1, the 'y' to region 4.
  • #30: Completed map tasks need to be re-executed because their results are stored on local disks and become inaccessible. Completed reduce tasks do not need to be re-executed because their results are in GFS, which provides fault-tolerance through data replication.