Map Reduce
Muhammad Usman Shahid, Software Engineer
Usman.shahid.st@hotmail.com
October 17, 2011
Parallel Programming
Parallel programming is used for performance and efficiency.
Processing is broken into parts that are done concurrently; the instructions of each part run on a separate CPU, with many processors connected together.
Identifying the set of tasks that can run concurrently is the important step.
Consider the Fibonacci recurrence F_{k+2} = F_{k+1} + F_k: it clearly cannot be parallelized, because each computed value depends on the previous ones.
Now consider a huge array that can be broken up into sub-arrays.
Parallel Programming (continued)
If each element requires some processing, with no dependencies between the computations, we have an ideal parallel computing opportunity.
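As a hedged illustration of that ideal case (not taken from the slides), here is a minimal Java sketch that squares every element of a large array in parallel; because no element depends on any other, the work can be split freely across CPU cores.

```java
import java.util.Arrays;

public class ParallelArraySketch {
    public static void main(String[] args) {
        // A large array whose elements can be processed independently.
        double[] data = new double[1_000_000];
        Arrays.setAll(data, i -> i + 1);

        // Square each element; element i depends only on itself,
        // so the library is free to spread the indices over many cores.
        Arrays.parallelSetAll(data, i -> data[i] * data[i]);

        System.out.println(data[0] + " ... " + data[data.length - 1]);
    }
}
```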
Google Data Center
Google's philosophy is to buy cheap computers, but in very large numbers.
Google applies this parallel processing concept in its data centers.
Map Reduce is a parallel and distributed approach developed by Google for processing large data sets.
Map Reduce Introduction
Map Reduce has two key components: Map and Reduce.
The Map function is applied to the input values to compute a set of intermediate key/value pairs.
Reduce then aggregates this data into a scalar.
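As a hedged, purely conceptual sketch (not from the slides), the two components can be written as the following generic Java signatures; all type names here are placeholders rather than any real library's API.

```java
// K1/V1: input key/value types, K2/V2: intermediate key/value types,
// V3: the aggregated result type produced by Reduce.
interface MapFunction<K1, V1, K2, V2> {
    // Called once per input record; may emit zero or more intermediate pairs.
    Iterable<Pair<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2, V3> {
    // Called once per intermediate key, with all values grouped under that key.
    V3 reduce(K2 key, Iterable<V2> values);
}

// Minimal pair holder used by the signatures above.
record Pair<A, B>(A key, B value) {}
```

The framework applies map to every input record, groups the emitted pairs by key, and then applies reduce once per key; the WordCount example later in the deck is a concrete instance of these shapes.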
Data Distribution
Input files are split into M pieces on the distributed file system.
Intermediate files created by the map tasks are written to local disks.
Output files are written to the distributed file system.
Data Distribution (diagram)
Map Reduce Function
To understand the Map Reduce function by analogy, consider the query "SELECT SUM(stuMarks) FROM student GROUP BY studentSection".
In this query, the select phase does the same job as Map, and the GROUP BY aggregation does the same job as the Reduce phase.
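A hedged, self-contained sketch of that analogy in plain Java follows; the Student record and its field names are hypothetical stand-ins for the student table. The "map" step emits a (studentSection, stuMarks) pair per row, and the "reduce" step sums the marks grouped under each section key.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupBySumSketch {
    // Hypothetical record standing in for one row of the student table.
    record Student(String studentSection, int stuMarks) {}

    public static void main(String[] args) {
        List<Student> students = List.of(
                new Student("A", 70), new Student("B", 55), new Student("A", 80));

        // "Map" phase: emit (studentSection, stuMarks) for every row.
        // "Reduce" phase: sum the marks grouped under each section key.
        Map<String, Integer> marksPerSection = students.stream()
                .collect(Collectors.groupingBy(
                        Student::studentSection,
                        Collectors.summingInt(Student::stuMarks)));

        System.out.println(marksPerSection); // e.g. {A=150, B=55}
    }
}
```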
Classical Example
The classical example of Map Reduce is log file analysis.
Big log files are split, and the mappers search for the different web pages that were accessed.
Every time a web page is found in the log, a key/value pair is emitted to the reducer, with key = web page and value = 1.
The reducer aggregates the counts for each web page; the result is the total number of hits per web page.
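A minimal sketch of that pattern in plain Java (not from the slides): the log format here is a simplifying assumption, with whitespace-separated fields and the page path in the second column, and the explicit grouping step stands in for the framework's shuffle.

```java
import java.util.*;

public class PageHitSketch {
    // "Map": for each log line, emit (page, 1).
    static void map(String logLine, List<Map.Entry<String, Integer>> emitted) {
        String[] fields = logLine.trim().split("\\s+");
        if (fields.length > 1) {
            emitted.add(Map.entry(fields[1], 1));
        }
    }

    // "Reduce": for one page, sum the emitted 1s to get its total hit count.
    static int reduce(String page, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "GET /index.html 200", "GET /about.html 200", "GET /index.html 304");

        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        log.forEach(line -> map(line, emitted));

        // Simulate the shuffle: group emitted values by key, then reduce each group.
        Map<String, List<Integer>> grouped = new HashMap<>();
        emitted.forEach(e ->
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue()));
        grouped.forEach((page, counts) ->
                System.out.println(page + " -> " + reduce(page, counts)));
    }
}
```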
Reverse Web Link Graph
In this example, the Map function outputs a (target URL, source URL) pair for each link found in an input web page (the source).
The Reduce function concatenates the list of all source URLs associated with a given target URL and returns (target, list(source)).
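A hedged sketch of those two functions in plain Java; the link-extraction step is assumed to have already produced, for each source page, the list of target URLs it links to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReverseLinkGraphSketch {
    // "Map": for one source page, emit (target, source) for every outgoing link.
    static List<Map.Entry<String, String>> map(String sourceUrl, List<String> targets) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (String target : targets) {
            emitted.add(Map.entry(target, sourceUrl));
        }
        return emitted;
    }

    // "Reduce": for one target, collect every source that links to it.
    static Map.Entry<String, List<String>> reduce(String target, Iterable<String> sources) {
        List<String> allSources = new ArrayList<>();
        sources.forEach(allSources::add);
        return Map.entry(target, allSources);
    }
}
```

Grouping the emitted pairs by target (the shuffle) and applying reduce to each group yields the (target, list(source)) output described on the slide.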
Other Examples
Map Reduce can be used for a lot of problems.
For example, Google used Map Reduce for the calculation of page ranks.
Counting words in a large set of documents can also be solved very efficiently with Map Reduce.
Google's library for Map Reduce is not open source, but a Java implementation called Hadoop is open source.
Implementation of Example
Word Count is a simple application that counts the number of occurrences of each word in a given set of inputs.
The Hadoop library is used for its implementation.
The code is given in the attached file below.
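The attached file is not reproduced here, so as a stand-in the following is a sketch closely following the classic Hadoop WordCount example built on the old org.apache.hadoop.mapred API, which the line numbers cited in the walkthrough appear to refer to; the package name org.myorg is an assumption, and the cited line numbers belong to the original attachment, not to this sketch.

```java
package org.myorg;  // assumed package name, as in the classic Hadoop tutorial

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

  // Mapper: splits each input line into tokens and emits <word, 1> for each token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts emitted for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // The run method wires up the job: key/value types, mapper, combiner, reducer,
  // input/output formats, and the input/output paths passed on the command line.
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submit the job and monitor its progress
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }
}
```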
Usage of Implementation
For example, the input files on the distributed file system are:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application; Word Count is a straightforward problem.
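The slide only says to run the application; assuming the job above is packaged as a jar under the hypothetical path /usr/joe/wordcount.jar with main class org.myorg.WordCount, the run command would look roughly like:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output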
Walk Through Implementation
The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via StringTokenizer, and emits a key/value pair of <<word>, 1>.
For the given sample input, the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
Walk Through Implementation (continued)
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1> < Hello, 1> < World, 2>
The output of the second map:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
Walk Through Implementation (continued)
The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (i.e., words in this example).
Thus the output of the job is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats, etc., in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
Execution Overview (diagram)
Map Reduce Execution
The Map Reduce library in the user program first splits the input files into M pieces. It then starts up many copies of the program on a cluster of machines.
One of the copies is special, the master; the others are workers. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
A worker assigned a map task reads the contents of the corresponding input split, parses the key/value pairs, and passes them to the user-defined Map function, which generates intermediate key/value pairs buffered in memory.
Periodically, the buffered pairs are written to local disk. The locations of these pairs on local disk are passed back to the master, which is responsible for forwarding them to the reduce workers.
Map Reduce Execution (continued)
When the master notifies a reduce worker about these locations, the worker uses RPC to read the buffered data from the map workers' local disks and then sorts it by the intermediate keys.
The reduce worker iterates over the sorted intermediate data and, for each unique key, passes the key and its values to the Reduce function. The output is appended to the final output file.
Many associated issues are handled by the library, including:
Parallelization
Fault tolerance
Data distribution
Load balancing
Debugging
Human-readable status information is offered over an HTTP server, so the user can see which jobs are in progress, completed, and so on.
The library also allows the use of GDB and other debugging tools.
Conclusions
Map Reduce simplifies large-scale computations that fit this model.
It allows the user to focus on the problem without worrying about the distributed-systems details.
It is used by renowned companies such as Google and Yahoo.
Google's Map Reduce library is not open source, but an Apache project called Hadoop provides an open source Map Reduce library.
