MapReduce
Presentation – Advanced Distributed Systems
MapReduce
 MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
 The Map() procedure performs filtering and sorting.
 The Reduce() procedure performs a summary operation (such as a statistical
aggregation).
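As a rough single-machine sketch of the two halves of the model (plain Python, illustrative data only), the Map step filters and transforms records while the Reduce step summarizes them:

```python
from functools import reduce

records = [3, -1, 4, -1, 5, 9]

# "Map" phase: filter out negative values and transform (square) the rest.
mapped = [x * x for x in records if x >= 0]

# "Reduce" phase: a summary operation (here, a sum).
total = reduce(lambda a, b: a + b, mapped, 0)
print(total)  # 9 + 16 + 25 + 81 = 131
```

This omits the distribution and key grouping that the real framework adds; those appear in the phase-by-phase steps below.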
HADOOP
 Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment. It is
part of the Apache project sponsored by the Apache Software Foundation.
MapReduce - Orchestrating
 The MapReduce system orchestrates the processing by:
1. marshalling (coordinating) the distributed servers
2. running the various tasks in parallel
3. managing all communications and data transfers between the various parts of
the system
4. providing for redundancy and fault tolerance.
MapReduce contributions
 The key contributions of the framework are scalability (across a large
number of nodes) and fault tolerance, achieved for a variety of applications
by optimizing the execution engine once.
MapReduce main steps (3 phases way)
 The steps of MapReduce:
1. "Map" step: Each worker node applies the "map()" function to its local data
and writes the output to temporary storage. A master node ensures that, of
any redundant copies of the input data, only one is processed.
2. "Shuffle" step: Worker nodes redistribute data based on the output keys
(produced by the "map()" function) so that all data belonging to one key is
located on the same worker node.
3. "Reduce" step: Worker nodes now process each group of output data, per key,
in parallel.
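The three steps above can be simulated in-memory with plain Python; the function names and the word-count example are illustrative, not part of any framework API:

```python
from collections import defaultdict

def map_phase(chunks, map_fn):
    # Each "worker" applies map() to its local chunk independently.
    return [kv for chunk in chunks for kv in map_fn(chunk)]

def shuffle_phase(pairs):
    # Regroup so all values for the same key land on one "worker".
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Each key group is reduced independently (and could run in parallel).
    return {key: reduce_fn(key, values) for key, values in groups.items()}

chunks = ["a b a", "b c"]  # two "local" data chunks
pairs = map_phase(chunks, lambda line: [(w, 1) for w in line.split()])
result = reduce_phase(shuffle_phase(pairs), lambda k, vs: sum(vs))
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```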
MapReduce main steps (3 phases way)
MapReduce main steps (5 phases way)
 Another way to look at MapReduce is as a 5-step parallel and distributed
computation:
1. Prepare the Map() input – the system designates Map processors, assigns the
K1 input key each processor will work on, and provides it with the
associated input data.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key
value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the system designates
Reduce processors, assigns the K2 key each will work on, and provides each
with all the Map-generated data associated with that key.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each
K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce
output and sorts it by K2 to produce the final outcome.
MapReduce main steps (5 phases way)
Working in parallel
 Each mapping operation is independent of the others, so the mappings can run
in parallel.
 The limits on parallelism are:
1. the number of independent data sources, and
2. the number of CPUs near each source.
 A set of 'reducers' can perform the reduction phase; all outputs of the map
operation that share the same key are presented to the same reducer at the
same time.
 The main advantage of working in parallel is recovery from partial failure
of servers or storage during the operation: if one mapper or reducer fails,
its work can be rescheduled – assuming the input data is still available.
Working in parallel (cont)
Logical work
 The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs. Map takes one pair of data with a type in
one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
 The Map function is applied in parallel to every pair in the input dataset. This
produces a list of pairs for each call. After that, the MapReduce framework
collects all pairs with the same key from all lists and groups them together,
creating one group for each key.
 The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list (v2)) → list(v3)
 Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all calls
are collected as the desired result list.
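These signatures can be mirrored with Python type hints; the word-count functions below are a hypothetical instantiation where k1 is a document name, v1 its contents, k2 a word, and v2/v3 counts:

```python
# Map(k1, v1) -> list(k2, v2): document name/contents -> (word, 1) pairs
def map_fn(k1: str, v1: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in v1.split()]

# Reduce(k2, list(v2)) -> list(v3): a word and its partial counts -> [total]
def reduce_fn(k2: str, v2s: list[int]) -> list[int]:
    return [sum(v2s)]

print(map_fn("doc1", "to be or not to be")[:2])  # [('to', 1), ('be', 1)]
print(reduce_fn("to", [1, 1]))                   # [2]
```

Note that Reduce returns a list, matching the signature above: most calls yield one value, but empty or multi-value returns are allowed.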
Logical work (cont)
Implementation
 Distributed implementations of MapReduce require a means of connecting
the processes performing the Map and Reduce phases
Implementation (cont)
 The following pseudocode counts the appearance of each word in a set of
documents:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)
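A direct, runnable translation of that pseudocode (single process, with an explicit grouping step standing in for the framework's shuffle):

```python
from collections import defaultdict

def map_doc(name, document):
    # name: document name; document: document contents
    for word in document.split():
        yield word, 1

def reduce_word(word, partial_counts):
    # word: a word; partial_counts: aggregated partial counts
    return word, sum(int(pc) for pc in partial_counts)

documents = {"d1": "deer bear river", "d2": "car car river"}

# The framework would shuffle emitted pairs by key; here we group manually.
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in map_doc(name, text):
        grouped[word].append(count)

counts = dict(reduce_word(w, pcs) for w, pcs in grouped.items())
print(counts)  # {'deer': 1, 'bear': 1, 'river': 2, 'car': 2}
```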
Dataflow
 The application-defined hot spots in the dataflow are:
 an input reader
 a Map function
 a partition function
 a compare function
 a Reduce function
 an output writer
Dataflow (cont)
 Input reader
 The input reader divides the input into appropriately sized splits, and the
framework assigns one split to each Map function. The input reader reads data
from stable storage (typically a distributed file system) and generates
key/value pairs.
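A toy input reader might split line-oriented text into fixed-size splits of (line number, line) pairs; the function name and split size are illustrative:

```python
def input_reader(text: str, lines_per_split: int = 2):
    # Divide the input into splits and emit key/value pairs per split,
    # using the line number as the key and the line text as the value.
    lines = text.splitlines()
    for start in range(0, len(lines), lines_per_split):
        yield [(start + i, line)
               for i, line in enumerate(lines[start:start + lines_per_split])]

text = "deer bear\nriver car\ncar river\ndeer car"
splits = list(input_reader(text))
print(len(splits))  # 4 lines, 2 per split -> 2 splits
```

A real reader would instead consume byte ranges of files in a distributed file system, but the shape of the output (key/value pairs per split) is the same.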
Dataflow (cont)
 Map function
 The Map function takes a series of key/value pairs, processes each, and
generates zero or more output key/value pairs. The input and output types of
the map are often different from each other.
 For example, if the application is doing a word count, the map function
would break the line into words and output a key/value pair for each word.
Each output pair would contain the word as the key and the number of
instances of that word in the line as the value.
Dataflow (cont)
 Partition function
 Each Map function output is allocated to a particular reducer by the
application's partition function for sharding purposes. The partition function
is given the key and the number of reducers and returns the index of the
desired reducer.
 It is important to pick a partition function that gives an approximately
uniform distribution of data per shard for load-balancing purposes; otherwise
the MapReduce operation can be held up waiting for slow reducers (reducers
assigned more than their share of data) to finish.
 Between the map and reduce stages, the data is shuffled (parallel-sorted or
exchanged between nodes) in order to move the data from the map node that
produced it to the shard in which it will be reduced.
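A common choice, sketched here under illustrative names, is a hash-based partition function: a stable hash of the key, taken modulo the reducer count, gives an approximately uniform spread while sending every occurrence of a key to the same reducer:

```python
import hashlib

def partition(key: str, num_reducers: int) -> int:
    # A stable hash (unlike Python's per-process salted hash()) ensures the
    # same key always maps to the same reducer index, run after run.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_reducers

# Every occurrence of a key lands on the same reducer.
assert partition("river", 4) == partition("river", 4)
print(partition("river", 4), partition("car", 4))
```

Skewed key distributions (one very hot key) can still overload a single reducer, which is exactly the slow-reducer problem described above.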
Dataflow (cont)
 Comparison function
 The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
Dataflow (cont)
 Reduce function
 The framework calls the application's Reduce function once for each unique
key in the sorted order. The Reduce can iterate through the values that are
associated with that key and produce zero or more outputs.
 In the word count example, the Reduce function takes the input values, sums
them and generates a single output of the word and the final sum.
Dataflow (cont)
 Output writer
 The Output Writer writes the output of the Reduce to the stable storage,
usually a distributed file system.
Performance
 MapReduce programs are not guaranteed to be fast.
 The partition function and the amount of data written by the Map function
can have a large impact on the performance.
 Additional modules such as the Combiner function can help to reduce the
amount of data written to disk, and transmitted over the network.
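A combiner acts as a local "mini-reduce" on each mapper's output before the shuffle; the sketch below (illustrative names, word-count example) shows how local pre-aggregation shrinks the number of pairs crossing the network:

```python
from collections import Counter

def map_with_combiner(document: str) -> list[tuple[str, int]]:
    pairs = [(w, 1) for w in document.split()]  # raw map output
    combined = Counter()                        # combine locally per mapper
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

raw = [(w, 1) for w in "car car car river".split()]
combined = map_with_combiner("car car car river")
print(len(raw), len(combined))  # 4 raw pairs shrink to 2 combined pairs
```

This only works because word-count's reduce (summation) is associative and commutative; a combiner is an optimization, not a correctness requirement.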
 Communication cost often dominates the computation cost, and many
MapReduce implementations are designed to write all communication to
distributed storage for crash recovery.
Distribution and reliability
 MapReduce achieves reliability by parceling out a number of operations on
the set of data to each node in the network (load distributing).
 Each node is expected to report back periodically with completed work and
status updates.
 If a node falls silent for longer than the expected reporting interval, the
master node records the node as dead and sends out the node's assigned work
to other nodes.
 Individual operations use atomic operations for naming file outputs as a
check to ensure that there are no parallel conflicting threads running.
 Reduce operations work in much the same way, conserving bandwidth across
the backbone network of the datacenter.
Uses
 MapReduce is useful in distributed pattern-based searching, distributed
sorting, web link-graph reversal, Singular Value Decomposition, web access log
statistics, inverted index construction, document clustering, machine learning,
and statistical machine translation.
 It has been adapted to several computing environments, including multi-core
and many-core systems, desktop grids, volunteer computing environments,
dynamic cloud environments, and mobile environments.
Criticism
 Lack of novelty
 Restricted programming framework