MapReduce: Simplified Data Processing on Large Clusters
Papers We Love
Bucharest Chapter
October 12, 2015
Hello!
I am Adrian Florea
Architect Lead @ IBM Bucharest Software Lab
What is MapReduce?
◉MapReduce is a programming model and implementation for processing large data sets
◉Users specify a Map function that processes a key-value pair to generate a set of
intermediate key-value pairs and a Reduce function that aggregates all intermediate
values that share the same intermediate key in order to combine the derived data
appropriately
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
Where else have we seen this?
Map
◉function (mathematics)
◉map (Java)
◉Select (.NET)
Reduce
◉fold (functional programming)
◉reduce (Java)
◉Aggregate (.NET)
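The same two primitives exist in most languages. A minimal sketch in Python (chosen here only for illustration; the slide names Java and .NET), using the built-in `map` and `functools.reduce`. The sample list is made up:

```python
from functools import reduce

words = ["map", "reduce", "fold"]

# map: apply a function to every element independently
lengths = list(map(len, words))

# reduce (fold): combine all elements into one value with a binary operation
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each `map` call is independent of the others, that step is the one that parallelizes trivially; the fold is where results are brought together.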
Other analogies
"To draw an analogy to SQL, map is like
the group-by clause of an aggregate
query. Reduce is analogous to the
aggregate function that is computed
over all the rows with the same group-by attribute"
D.J. DeWitt & M. Stonebraker
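The SQL analogy can be sketched in a few lines of Python. This is an illustration only: the rows, the column meaning (department, amount) and the function names `map_row` and `reduce_group` are hypothetical, and the query being imitated is roughly `SELECT dept, SUM(amount) ... GROUP BY dept`.

```python
from collections import defaultdict

rows = [("eng", 10), ("ops", 4), ("eng", 7), ("ops", 1)]

def map_row(row):
    # Plays the role of GROUP BY: picks the grouping key for each row
    dept, amount = row
    return dept, amount

def reduce_group(dept, amounts):
    # Plays the role of the aggregate function, applied once per group
    return dept, sum(amounts)

# Collect all values that share the same key
groups = defaultdict(list)
for row in rows:
    key, value = map_row(row)
    groups[key].append(value)

result = sorted(reduce_group(k, vs) for k, vs in groups.items())
print(result)
```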
Divide-and-conquer algorithms
“recursively breaking down a problem into two
or more sub-problems of the same (or related)
type (divide), until these become simple enough
to be solved directly (conquer). The solutions to
the sub-problems are then combined to give a
solution to the original problem.”
Jeffrey Dean
Google Fellow
December 6, 2004 4PM
Sanjay Ghemawat
Google Fellow
History
◉April 1960: John McCarthy introduced the concept of “maplist”
◉September 4, 1998: Google founded
◉1998-2003: hundreds of special-purpose large data computation
programs in Google
◉February 2003: 1st version of MapReduce
◉August 2003: MapReduce significant enhancements
◉June 18, 2004: Patent US7650331 B1 filed
◉December 6, 2004: 1st MapReduce public presentation
◉2005: Hadoop implementation started in Java (Douglass R. Cutting &
Michael J. Cafarella)
◉September 4, 2007: Hadoop 0.14.1
◉January 19, 2010: Patent US7650331 B1 published
◉July 6, 2015: Hadoop 2.7.1
Distribution issues
◉Communication and routing
    which nodes should be involved?
    what transport protocol should be used?
    threads/events/connections management
    remote execution of your processing code?
◉Fault tolerance and fault detection
◉Load balancing / partitioning of data
    heterogeneity of nodes
    skew in data
    network topology
◉Parallelization strategy
    algorithmic issues of work splitting
“without having to deal with failures, the rest of the support code just isn’t so complicated”
S. Ghemawat
MapReduce model
Input Data -> Map Operation -> Intermediate Data -> Reduce Operation -> Output Data
MapReduce system
application-independent: Map Module, Reduce Module
application-specific: Map Operation, Reduce Operation
Input Data -> Intermediate Data -> Output Data
[Diagrams: original article; Hadoop wiki]
MapReduce model in practice
map(String key, String value):
    // key: document name
    // value: document contents
    for each Word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, AsString(result));
map:(k1, v1)->[(k2, v2)]
reduce:(k2, [v2])->[(k3, v3)]
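The word-count pseudocode above translates almost line for line into runnable Python. This is a minimal single-process sketch, not the Google library: `map_fn` and `reduce_fn` mirror the slide, while `run` is a hypothetical in-memory driver that stands in for the framework (grouping intermediate values by key, then calling reduce once per key). The sample documents are made up.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused), value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word, values: an iterable of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    yield key, str(result)

def run(documents):
    # Hypothetical driver: the real system distributes these two loops
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k2, v2 in map_fn(name, contents):
            intermediate[k2].append(v2)
    output = {}
    for k2 in sorted(intermediate):
        for k3, v3 in reduce_fn(k2, intermediate[k2]):
            output[k3] = v3
    return output

counts = run({"a.txt": "to be or not to be", "b.txt": "to do"})
print(counts)
```

Counts travel as strings here only because the pseudocode does so (`"1"`, `ParseInt`, `AsString`); plain integers would work just as well in Python.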
How long does it take to go
through 1 TB?
Sequentially: 3 hours
MapReduce: startup overhead 70 seconds + computation 80 seconds
Environment
◉ 1800 Linux dual-processor x86 machines, 2-4 GB memory
◉ Fast Ethernet / Gigabit Ethernet
◉ Inexpensive IDE disks and a distributed Google File System
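A quick back-of-the-envelope check of the timings above, assuming the two MapReduce figures simply add up to the total run time (the slide does not say so explicitly):

```python
sequential_s = 3 * 60 * 60   # 3 hours, from the slide
startup_s = 70               # startup overhead, from the slide
compute_s = 80               # computation, from the slide

mapreduce_s = startup_s + compute_s
speedup = sequential_s / mapreduce_s
overhead_share = startup_s / mapreduce_s

print(f"total: {mapreduce_s} s, speedup: {speedup:.0f}x, startup share: {overhead_share:.0%}")
```

Under that assumption the run is about 72 times faster than the sequential pass, and nearly half of the elapsed time is startup overhead rather than computation.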
Take a walk through a Google data center
Tianhe-2, #1 supercomputer: 3,120,000 cores, 1.5 PB total memory
Execution diagram
◉The master process is itself a task initiated by the WQM (work queue master) and is responsible for
assigning all other tasks to worker processes
◉Each worker invokes at least a map thread and a reduce thread
◉If a worker fails, its tasks are reassigned to another worker process
◉When the WQM receives a job, it allocates the job to the master, which calculates and
requests the M+R+1 processes to be allocated to the job
◉The WQM responds with the process allocation info (possibly fewer processes than requested) to
the master, which then manages the execution of the job
◉Reduce tasks begin work when the master informs them that there are
intermediate files ready
◉Input data (files/DB/memory) is split into data blocks (16-64 MB),
either automatically or as configured
◉The worker to which a map task has been assigned applies the map() operator
to the respective input data block
◉When the worker completes the task, it informs the master of the status
◉Master informs workers where to find intermediate data and schedules their
reading
◉Workers (3 & 4) sort the intermediate key-value pairs, then merge (by applying
reduce()) them and write to output
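The steps above can be simulated in a single process. This is a sketch under stated assumptions, not the actual implementation: all function names are hypothetical, blocks are counted in records rather than megabytes, there are no real workers, master or WQM, and the partitioner (a CRC32 hash of the key modulo the number of reduce tasks) is an assumption chosen so that results are deterministic.

```python
import zlib
from collections import defaultdict

def split_input(records, block_size):
    # Input is split into blocks (the slide cites 16-64 MB; here, N records)
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def partition(key, num_reduce_tasks):
    # Assumption: deterministic hash-mod-R partitioner
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def run_map_task(block, map_fn, num_reduce_tasks):
    # One map task: apply map() to each record, bucket output per reduce task
    buckets = defaultdict(list)
    for record in block:
        for k, v in map_fn(record):
            buckets[partition(k, num_reduce_tasks)].append((k, v))
    return buckets

def run_reduce_task(pairs, reduce_fn):
    # One reduce task: sort by key, merge values per key, apply reduce()
    grouped = defaultdict(list)
    for k, v in sorted(pairs):
        grouped[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

def run_job(records, map_fn, reduce_fn, block_size, num_reduce_tasks):
    blocks = split_input(records, block_size)
    map_outputs = [run_map_task(b, map_fn, num_reduce_tasks) for b in blocks]
    output = {}
    for r in range(num_reduce_tasks):
        # Each reduce task reads its bucket from every map task's output
        pairs = [p for out in map_outputs for p in out.get(r, [])]
        output.update(run_reduce_task(pairs, reduce_fn))
    processes_requested = len(blocks) + num_reduce_tasks + 1  # M + R + 1
    return output, processes_requested

lines = ["a b", "b c", "c c", "a"]
result, requested = run_job(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
    block_size=2,
    num_reduce_tasks=2,
)
print(result, requested)
```

With four input lines and a block size of two there are M = 2 map tasks; with R = 2 reduce tasks plus the master, the job would request M+R+1 = 5 processes.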
Workflow diagram
◉When a process completes a task, it informs the WQM,
which updates the status tables
◉When the WQM discovers that a process has failed, it assigns its
tasks to a new process and updates the status tables
Task Status Table
◉TaskID
◉Status (InProgress, Waiting, Completed, Failed)
◉ProcessID
◉InputFiles (Input, Intermediate)
◉OutputFiles
Process Status Table
◉ProcessID
◉Status (Idle, Busy, Failed)
◉Location (CPU ID, etc.)
◉Current (TaskID, WQM)
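The two status tables and the failure rule above can be modelled with plain data classes. This is an illustrative sketch only: field names follow the slide, but the types, the `assign` and `handle_failure` helpers, and the policy of picking the first idle process are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class TaskStatus(Enum):
    WAITING = "Waiting"
    IN_PROGRESS = "InProgress"
    COMPLETED = "Completed"
    FAILED = "Failed"

class ProcessStatus(Enum):
    IDLE = "Idle"
    BUSY = "Busy"
    FAILED = "Failed"

@dataclass
class TaskRow:
    task_id: int
    status: TaskStatus = TaskStatus.WAITING
    process_id: Optional[int] = None
    input_files: list = field(default_factory=list)
    output_files: list = field(default_factory=list)

@dataclass
class ProcessRow:
    process_id: int
    status: ProcessStatus = ProcessStatus.IDLE
    location: str = ""
    current_task: Optional[int] = None

def assign(task, proc):
    # Update both tables when a task is handed to a process
    task.status, task.process_id = TaskStatus.IN_PROGRESS, proc.process_id
    proc.status, proc.current_task = ProcessStatus.BUSY, task.task_id

def handle_failure(failed_id, tasks, processes):
    # Mark the process failed and move its in-progress tasks to an idle process
    processes[failed_id].status = ProcessStatus.FAILED
    processes[failed_id].current_task = None
    for task in tasks.values():
        if task.process_id == failed_id and task.status is TaskStatus.IN_PROGRESS:
            idle = next((p for p in processes.values()
                         if p.status is ProcessStatus.IDLE), None)
            if idle is None:
                # Nobody free: put the task back in the queue
                task.status, task.process_id = TaskStatus.WAITING, None
            else:
                assign(task, idle)

tasks = {1: TaskRow(1, input_files=["block-0"])}
processes = {10: ProcessRow(10, location="cpu-0"),
             11: ProcessRow(11, location="cpu-1")}
assign(tasks[1], processes[10])
handle_failure(10, tasks, processes)
print(tasks[1].process_id, processes[10].status.value, processes[11].status.value)
```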
Questions from the audience
@ original paper presentation
◉Q: Is there any task that could not be handled using MapReduce?
A: Join operations could not be performed with the current model
◉Q: How does MapReduce differ from parallel databases?
A: MapReduce data is stored across a large number of machines compared with parallel databases,
the abstractions are fairly simple to use in MapReduce, and
MapReduce also benefits greatly from locality optimizations
Bibliography
◉ J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04, Dec. 6, 2004
◉ J. Dean, S. Ghemawat, "System and method for efficient large-scale data processing", Patent US 7650331 B1, Jan. 19, 2010
◉ S. Ghemawat, J. Dean, J. Zhao, M. Austern, A. Spector, "Google Technology RoundTable: Map Reduce", YouTube, Aug. 21, 2008
◉ P. Mahadevan, "OSDI'04 Conference Reports", ;login: Vol. 30, No. 2, Apr. 2005, p. 61
◉ R. Jacotin, "Lecture: The Google MapReduce", SlideShare, Oct. 3, 2014
Any questions?
You can find me at
Thanks!
