gooqwerty777@gmail.com
Introduction
● Motivation: many special-purpose programs
must process large amounts of raw data
– crawled documents, web request logs, etc.
– the work must be spread over many machines to reduce
processing time
– implementing this is time-consuming and complex
● Solution: design a programming model: MapReduce
– Hides the details of parallelization, fault tolerance,
locality optimization, and load balancing.
Map and Reduce
● Divide, Conquer, and Combine
-> Divide, Map, and Reduce
● Users only need to implement the Map() and
Reduce() functions (plus a few arguments)!
Programming Model
Map
• takes an input key/value pair
• produces a set of intermediate key/value pairs
Shuffling
• sorts the intermediate pairs by key
• groups together all intermediate values
with the same intermediate key
Reduce
• takes an intermediate key and the set of values
for that key
• merges these values together
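The three stages above can be illustrated with word counting on a single machine. This is a minimal sketch, not the real distributed library: the function names (map_fn, shuffle, reduce_fn, run_word_count) and the sample documents are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: take one input pair and emit intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> list[tuple[str, list[int]]]:
    """Shuffle: group all values with the same key, then sort by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: merge the values of one key into a single result."""
    return word, sum(counts)


def run_word_count(documents: dict[str, str]) -> dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    intermediate = [
        pair
        for doc_id, text in documents.items()
        for pair in map_fn(doc_id, text)
    ]
    return dict(reduce_fn(key, values) for key, values in shuffle(intermediate))


# Hypothetical sample input: two tiny documents keyed by document ID.
result = run_word_count({"d1": "the quick fox", "d2": "the lazy dog the end"})
```

The user writes only map_fn and reduce_fn; everything else stands in for what the framework would do across machines.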
Introduction of MapReduce
More Examples
● Distributed Grep
● Distributed Sort
● Inverted Index
– Find the documents that contain a given word
– Input: <file split, docID>
– Intermediate: <word, docID>
– Final: <word, list<docID>>
● Count of URL Access Frequency
– <URL, 1> → <URL, TotalCount>
● Term-Vector per Host
– summarizes the most important words in a set of documents
– <term, freq> → vector<term, freq>
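The inverted-index example can be sketched in the same single-process style. The helper names and the sample splits below are assumptions for illustration; only the pair shapes (<word, docID> in, <word, list<docID>> out) come from the slide.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit one <word, docID> pair per distinct word in the split."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Reduce: merge all docIDs of a word into a sorted, duplicate-free list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(splits):
    """Group the intermediate pairs by word, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in splits:
        for word, found_in in map_fn(doc_id, text):
            groups[word].append(found_in)
    return dict(reduce_fn(word, ids) for word, ids in sorted(groups.items()))


# Hypothetical input: (docID, split contents); doc1 is spread over two splits.
index = build_inverted_index(
    [("doc1", "map reduce"), ("doc2", "map only"), ("doc1", "reduce again")]
)
```

Looking up a word is then a dictionary access, for example index["map"].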
Implementation
● Parallelization
– Input of Map: the input data is partitioned into M splits
– Input of Reduce: the intermediate data is partitioned into R
files
● Master program: assigns M map tasks and R
reduce tasks to worker programs
– Map workers: intermediate key/value pairs are written
to local disk. The locations of these files are passed back
to the master.
– Reduce workers: get the locations from the master and use
remote procedure calls (RPC) to read the data from the
map workers' local disks.
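How a map worker might split its intermediate pairs into R regions can be sketched as follows. The referenced paper names hash(key) mod R as the default partitioning function; the use of CRC32 as the hash and the names partition and write_partitions are assumptions for this sketch.

```python
import zlib


def partition(key: str, r: int) -> int:
    """Pick the reduce task for a key: hash(key) mod R.

    CRC32 is used here only because it is stable across runs;
    it is an illustrative choice, not the library's actual hash.
    """
    return zlib.crc32(key.encode("utf-8")) % r


def write_partitions(pairs, r):
    """Distribute intermediate pairs into R in-memory regions.

    A real map worker would write each region to a file on local disk
    and report the file locations to the master.
    """
    regions = [[] for _ in range(r)]
    for key, value in pairs:
        regions[partition(key, r)].append((key, value))
    return regions


regions = write_partitions([("a", 1), ("b", 1), ("a", 1), ("c", 1)], r=3)
```

Because the partition depends only on the key, every pair with the same key ends up at the same reduce worker.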
Fault Tolerance
● Worker Failure
– the master pings every worker periodically
– tasks on a failed machine:
● in-progress tasks are rescheduled
● completed map tasks are reset and rescheduled (their
output is on the failed machine's local disk)
● completed reduce tasks do not need to be reset (their
output files are stored in the global file system)
● Master Failure
– failure of the master is unlikely
– the MapReduce computation is aborted; clients should
check for this and retry
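The worker-failure rules above can be written down as a small piece of master-side bookkeeping. This is a sketch under assumptions: the Task and State types and the handle_worker_failure function are hypothetical names, not part of any real MapReduce API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker the task is or was assigned to


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset every task that must be re-executed after a worker dies."""
    reset = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        in_progress = task.state is State.IN_PROGRESS
        # Completed map output lives on the dead worker's local disk.
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if in_progress or lost_map_output:
            task.state = State.IDLE
            task.worker = None
            reset.append(task)
        # Completed reduce output is in the global file system: keep it.
    return reset


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w1"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w2"),
]
rescheduled = handle_worker_failure(tasks, "w1")
```

After the call, both map tasks from w1 are idle again and eligible for scheduling, while the completed reduce task is left untouched.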
Backup Tasks
● Straggler: a machine that takes an unusually long
time to complete its tasks
– bad disk
– other tasks competing on the same machine
● Solution: when a MapReduce operation is close to
completion, the master schedules backup
executions of the remaining in-progress tasks
– a task is done as soon as either execution completes
– Takes 30% less time to complete, with computational
resource usage increased by no more than a few percent
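The "first execution to finish wins" idea can be sketched with threads standing in for workers. Everything here is an assumption for illustration: run_task simulates a task with a sleep, and the delays are made-up values chosen so that the primary copy is the straggler.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(delay_sec: float, label: str) -> str:
    """Simulate one execution of a task; the label says which copy ran."""
    time.sleep(delay_sec)
    return label


def run_with_backup(primary_delay: float, backup_delay: float) -> str:
    """Start a primary and a backup execution and use whichever ends first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(run_task, primary_delay, "primary")
        backup = pool.submit(run_task, backup_delay, "backup")
        done, _pending = wait([primary, backup], return_when=FIRST_COMPLETED)
        winner = next(iter(done)).result()
    # Note: leaving the `with` block still waits for the slower thread to
    # end; a real master would simply ignore or kill the losing execution.
    return winner


# The primary copy is the straggler here, so the backup finishes first.
winner = run_with_backup(primary_delay=0.5, backup_delay=0.05)
```

Both executions compute the same result, so it does not matter which one the master accepts.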
Performance
● Environment
– 1800 machines
– each with two 2GHz Intel Xeon processors with Hyper-Threading
enabled, 4GB of memory, two 160GB IDE disks, and
a gigabit Ethernet link
● Sorting approximately 1TB of data
– 891 sec
● Intentionally killing 200 out of 1746 workers several
minutes into the computation
– 933 sec (just a 5% increase)
● No backup tasks
– 1283 sec (a 44% increase)
[Figure: bar chart of execution time (sec), axis from 0 to 1400, for three runs: Normal, Fault Tolerance, Without Backup Tasks]
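The percentages quoted for the sort benchmark follow directly from the three timings. A short check, using only the numbers on the Performance slide (the function and variable names are illustrative):

```python
BASELINE_SEC = 891  # normal run with backup tasks enabled


def percent_increase(elapsed_sec: float, baseline_sec: float = BASELINE_SEC) -> float:
    """Extra running time relative to the baseline, in percent."""
    return 100.0 * (elapsed_sec - baseline_sec) / baseline_sec


with_failures = percent_increase(933)    # about 4.7, quoted as 5%
without_backup = percent_increase(1283)  # about 44.0, quoted as 44%

# Seen from the other side: enabling backup tasks saves about 30.6% of the
# 1283 sec run, which matches the "30% less time" figure for backup tasks.
saving_from_backup = 100.0 * (1283 - BASELINE_SEC) / 1283
```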
Advantages
● A large variety of problems are easily expressible
as MapReduce computations
– Any job that can be divided into independent pieces!
● Easy to use for programmers who have no
experience with distributed or parallel systems
– You only need to think about how to process the split data
and how to combine the results
● Code is simpler and easier to understand and modify
Applications
● large-scale machine learning
● extraction of data used to produce reports of
popular queries
● extraction of properties of web pages
– PageRank
● Open-source implementation
– Hadoop
Refinements
● Locality optimization
– map tasks are scheduled near copies of the input files on local disks
● Ordering guarantee: intermediate key/value pairs are processed in key order
– useful for sorted output and efficient random-access lookups by key
● Input and Output Types
– text input: the key is the offset in the file and the value is the contents of the line
– new input types are supported through a reader interface
● Skipping Bad Records
– ignoring a few bad records is acceptable when doing statistical analysis on a large data set
– a signal handler in the worker reports the failing record to the master on error
● Counter object
– counter values are piggybacked on the ping response
● Combiner function
– partial merging on the map worker before the records are sent over the network
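The effect of a combiner can be seen on word counting: the map worker merges its own (word, 1) pairs before anything is sent to the reduce workers. This is a minimal sketch; map_fn, combine, and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def map_fn(text):
    """Map: emit one (word, 1) pair per word occurrence."""
    for word in text.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: partially merge counts on the map worker itself."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


raw = list(map_fn("the cat and the hat and the bat"))
combined = combine(raw)

# Fewer pairs have to cross the network after combining.
pairs_before, pairs_after = len(raw), len(combined)
```

A combiner like this is only safe when the reduce operation is commutative and associative, as summing counts is.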
Reference
● Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," 2004