Map Reduce and Hadoop
The MapReduce Paradigm
• Platform for reliable, scalable parallel computing
• Abstracts the issues of a distributed and parallel environment away from the programmer
• Runs over distributed file systems
  – Google File System
  – Hadoop Distributed File System (HDFS)
Distributed File Systems
• Highly scalable distributed file system for large data-intensive applications
  – E.g. 10K nodes, 100 million files, 10 PB
• Provides redundant storage of massive amounts of data on cheap and unreliable computers
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Provides a platform over which other systems, like MapReduce and BigTable, operate
Distributed File System
• Single namespace for entire cluster
• Data coherency
  – Write-once-read-many access model
  – Client can only append to existing files
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent client
  – Client can find location of blocks
  – Client accesses data directly from DataNode
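The block and replication bullets above can be made concrete with a small sketch. It assumes a replication factor of 3 and a simple round-robin placement; both are illustrative assumptions, as are the function names and the DataNode labels, and a real file system uses a more careful placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size named above
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Round-robin is a stand-in chosen for this sketch, not the actual
    placement policy of any particular file system.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement


# A 300 MB file becomes two full blocks and one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks), place_replicas(len(blocks), ["dn0", "dn1", "dn2", "dn3"]))
```

Because every block lives on several DataNodes, losing one machine leaves at least two other copies of each block it held.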
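HDFS Architecture
[Figure: a Client, the NameNode with a Secondary NameNode, and the DataNodes]
1. Client sends the filename to the NameNode
2. NameNode returns block ids and DataNodes
3. Client reads data from the DataNodes
NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk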
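The three-step read path in the figure can be sketched as a toy, in-memory model. The class and method names here (`NameNode.get_block_locations`, `DataNode.read_block`, `read_file`) are hypothetical and chosen for illustration; they are not the HDFS client API.

```python
class NameNode:
    """Toy metadata server: file name -> list of (block_id, [DataNode names])."""

    def __init__(self):
        self.files = {}

    def add_file(self, filename, block_locations):
        self.files[filename] = block_locations

    def get_block_locations(self, filename):
        return self.files[filename]


class DataNode:
    """Toy storage server: block_id -> bytes (standing in for a disk location)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(filename, namenode, datanodes):
    """Client-side read: ask the NameNode where blocks are, then read directly."""
    data = b""
    for block_id, holders in namenode.get_block_locations(filename):  # steps 1-2
        for name in holders:  # try the replicas in the order returned
            node = datanodes.get(name)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)  # step 3
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


dn = {name: DataNode(name) for name in ("dn0", "dn1", "dn2")}
dn["dn0"].blocks["b1"] = b"hello "
dn["dn1"].blocks["b1"] = b"hello "
dn["dn1"].blocks["b2"] = b"world"
dn["dn2"].blocks["b2"] = b"world"
nn = NameNode()
nn.add_file("/logs/a.txt", [("b1", ["dn0", "dn1"]), ("b2", ["dn1", "dn2"])])
print(read_file("/logs/a.txt", nn, dn))
del dn["dn0"]  # simulate a failed DataNode; the replica on dn1 is used instead
print(read_file("/logs/a.txt", nn, dn))
```

The NameNode only hands out metadata; the file contents never pass through it, which keeps it from becoming a bandwidth bottleneck.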
MapReduce: Insight
• Consider the problem of counting the number of occurrences of each word in a large collection of documents
• How would you do it in parallel?
• Solution:
  – Divide documents among workers
  – Each worker parses its documents to find all words and outputs (word, count) pairs
  – Partition (word, count) pairs across workers based on word
  – For each word at a worker, locally add up counts
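The four solution steps above can be sketched on one machine, with threads standing in for workers. This is an illustration of the data flow only: the function names are hypothetical, the word tokenizer is a simple regular expression, and threads in CPython will not give a real speedup here.

```python
import re
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(documents):
    """One worker: parse its documents and output (word, count) pairs."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return list(counts.items())


def partition(pairs, num_workers):
    """Route each (word, count) pair to a worker chosen by the word."""
    buckets = [[] for _ in range(num_workers)]
    for word, count in pairs:
        # crc32 is deterministic across runs, unlike Python's built-in str hash
        target = zlib.crc32(word.encode("utf-8")) % num_workers
        buckets[target].append((word, count))
    return buckets


def add_up(pairs):
    """One worker: locally add up the counts for the words routed to it."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def parallel_word_count(documents, num_workers=3):
    # Step 1: divide documents among workers.
    shares = [documents[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Step 2: each worker emits (word, count) pairs for its share.
        per_worker_pairs = list(pool.map(count_words, shares))
        # Step 3: partition the pairs across workers based on the word.
        inboxes = [[] for _ in range(num_workers)]
        for pairs in per_worker_pairs:
            for target, bucket in enumerate(partition(pairs, num_workers)):
                inboxes[target].extend(bucket)
        # Step 4: each worker adds up the counts it received.
        results = list(pool.map(add_up, inboxes))
    merged = {}
    for part in results:
        merged.update(part)  # each word lives on exactly one worker
    return merged


print(parallel_word_count(["the cat sat", "the dog sat", "The end"]))
```

Partitioning by word is what makes the final step purely local: all counts for a given word end up at the same worker.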
MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
• Input: a set of key/value pairs
• User supplies two functions:
  – map(k,v) → list(k1,v1)
  – reduce(k1, list(v1)) → v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
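The two signatures above can be written down as a typed, single-process sketch. The `map_reduce` driver and the keyword-index example are illustrative assumptions (the names `index_map` and `index_reduce` are hypothetical); a real system distributes the same contract across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")
K1 = TypeVar("K1")
V1 = TypeVar("V1")
V2 = TypeVar("V2")


def map_reduce(
    inputs: Iterable[Tuple[K, V]],
    map_fn: Callable[[K, V], Iterable[Tuple[K1, V1]]],
    reduce_fn: Callable[[K1, List[V1]], V2],
) -> Dict[K1, V2]:
    """Apply map(k, v) -> list(k1, v1), group by k1, then reduce each group."""
    groups: Dict[K1, List[V1]] = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return {k1: reduce_fn(k1, values) for k1, values in groups.items()}


def index_map(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Emit (word, doc_id) once per distinct word in the document."""
    return [(word, doc_id) for word in set(text.lower().split())]


def index_reduce(word: str, doc_ids: List[str]) -> List[str]:
    """Collapse the list of doc ids for a word into a sorted posting list."""
    return sorted(doc_ids)


docs = [("d1", "map reduce"), ("d2", "reduce tasks"), ("d3", "map tasks run")]
index = map_reduce(docs, index_map, index_reduce)
print(index["map"])  # ['d1', 'd3']
```

Only `index_map` and `index_reduce` are specific to the problem; the driver is the same for any job that fits the two signatures.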
MapReduce: The Map Step
[Figure: each input key-value pair (k, v) is passed to map, which emits intermediate key-value pairs (k1, v1), (k2, v2), …, (kn, vn)]
E.g. input (doc-id, doc-content); intermediate (word, wordcount-in-a-doc)
Adapted from Jeff Ullman’s course slides
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, list of v); reduce is called on each group and emits output key-value pairs]
E.g. (word, wordcount-in-a-doc) → group → (word, list-of-wordcount) → reduce → (word, final-count)
The group step is ~ SQL Group by; the reduce step is ~ SQL aggregation
Adapted from Jeff Ullman’s course slides
Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The system performs the group-by step on the key of the intermediate
// pairs emitted above, then calls reduce on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
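For readers who want to run the pseudo-code, here is one possible Python rendering. `yield` plays the role of `EmitIntermediate` and `Emit`, and the `run` driver is a hypothetical single-process stand-in for the group-by step that the system performs; splitting on whitespace is an assumption about what counts as a word.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for w in input_value.split():
        yield (w, "1")    # EmitIntermediate(w, "1")


def reduce_fn(output_key, intermediate_values):
    # output_key: a word, intermediate_values: an iterator of counts as strings
    result = 0
    for v in intermediate_values:
        result += int(v)  # ParseInt(v)
    yield str(result)     # Emit(AsString(result))


def run(documents):
    """Hypothetical driver: map every document, group by key, reduce each group."""
    intermediate = [pair for name, text in documents.items()
                    for pair in map_fn(name, text)]
    intermediate.sort(key=itemgetter(0))  # group-by step: sort on the key
    output = {}
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        values = (v for _, v in pairs)
        output[word] = list(reduce_fn(word, values))
    return output


print(run({"a.txt": "to be or not to be", "b.txt": "to do"}))
```

As in the pseudo-code, counts travel as strings and are parsed only inside reduce.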
MapReduce: Execution overview
Distributed Execution Overview
[Figure: the User Program forks a Master and several Workers. The Master assigns map tasks and reduce tasks to Workers. Map workers read input splits (Split 0, Split 1, Split 2) from the distributed file system and write intermediate data to local disk. Reduce workers do a remote read of the intermediate data, sort it, and write Output File 0 and Output File 1.]
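The data flow in the execution overview (splits, local write, remote read, sort, one output file per reduce task) can be simulated in a few lines. This is a sketch under stated assumptions: everything runs in one process, lists stand in for local disks and output files, and choosing the reduce task by a crc32 hash of the key is an illustrative partitioning choice. `run_job` is a hypothetical name.

```python
import zlib
from collections import defaultdict


def run_job(splits, map_fn, reduce_fn, num_reduce_tasks=2):
    # Map phase: one map task per input split. Each task writes its output
    # into num_reduce_tasks regions, modelling the partitioned local write.
    local_disks = []
    for split in splits:
        regions = [[] for _ in range(num_reduce_tasks)]
        for key, value in split:
            for k1, v1 in map_fn(key, value):
                r = zlib.crc32(str(k1).encode("utf-8")) % num_reduce_tasks
                regions[r].append((k1, v1))
        local_disks.append(regions)

    # Reduce phase: reduce task r reads region r from every map task
    # (the remote read), sorts by key, and writes one output file.
    output_files = []
    for r in range(num_reduce_tasks):
        fetched = [pair for regions in local_disks for pair in regions[r]]
        fetched.sort(key=lambda pair: pair[0])
        grouped = defaultdict(list)
        for k1, v1 in fetched:
            grouped[k1].append(v1)
        output_files.append([(k1, reduce_fn(k1, vs)) for k1, vs in grouped.items()])
    return output_files


splits = [[("s0", "a b a")], [("s1", "b c")], [("s2", "a c c")]]
files = run_job(splits,
                map_fn=lambda k, v: [(w, 1) for w in v.split()],
                reduce_fn=lambda k, vs: sum(vs))
for i, contents in enumerate(files):
    print(f"Output File {i}: {contents}")
```

Each key appears in exactly one output file, and within a file the keys come out in sorted order because the reduce task sorts before grouping.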
Map Reduce vs. Parallel Databases
• Map Reduce is widely used for parallel processing
  – Google, Yahoo, and hundreds of other companies
  – Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, etc.
• Database people say: but parallel databases have been doing this for decades
• Map Reduce people say:
  – We operate at scales of thousands of machines
  – We handle failures seamlessly
  – We allow procedural code in map and reduce and allow data of any type
Implementations
• Google
  – Not available outside Google
• Hadoop
  – An open-source implementation in Java
  – Uses HDFS for stable storage
  – Download: http://lucene.apache.org/hadoop/
• Aster Data
  – Cluster-optimized SQL database that also implements MapReduce
  – IITB alumnus among founders
• And several others; Cassandra at Facebook, for example, is a related large-scale distributed data store rather than a MapReduce engine
Reading
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, http://labs.google.com/papers/mapreduce.html
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, http://labs.google.com/papers/gfs.html
