K. Madurai and B. Ramamurthy
MapReduce and Hadoop Distributed File System
Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
Partially Supported by
NSF DUE Grant: 0737243
CCSCNE 2009, Plattsburgh, April 24, 2009
The Context: Big-data
- Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
- Google collected 270PB of data in a month (2007) and processed 20,000PB a day (2008)
- The 2010 census data is expected to be a huge gold mine of information
- Data mining the huge amounts of data collected in domains from astronomy to healthcare has become essential for planning and performance.
- We are in a knowledge economy.
  - Data is an important asset to any organization
  - Discovery of knowledge; enabling discovery; annotation of data
- We are looking at newer
  - programming models, and
  - supporting algorithms and data structures.
- NSF refers to it as "data-intensive computing"; industry calls it "big data" and "cloud computing"
Purpose of this talk
- To provide a simple introduction to:
  - "Big-data computing": an important advancement with the potential to significantly impact the CS undergraduate curriculum.
  - A programming model called MapReduce for processing "big data"
  - A supporting file system called the Hadoop Distributed File System (HDFS)
- To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum.
The Outline
- Introduction to MapReduce
- From CS foundations to MapReduce
- The MapReduce programming model
- Hadoop Distributed File System
- Relevance to the undergraduate curriculum
- Demo (Internet access needed)
- Our experience with the framework
- Summary
- References
MapReduce
What is MapReduce?
- MapReduce is a programming model Google has used successfully in processing its "big-data" sets (~20,000 petabytes per day):
  - Users specify the computation in terms of a map and a reduce function (signatures sketched below),
  - The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
  - The underlying system also handles machine failures, efficient communication, and performance issues.

-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
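In the cited paper these functions have simple abstract signatures; here is a minimal rendering in Java generics. This is an illustrative shape only, with a hypothetical Pair helper class; it is not the Hadoop API.

```java
import java.util.Iterator;
import java.util.List;

// Abstract shape of the model, after Dean & Ghemawat (2008):
//   map:    (k1, v1)       -> list(k2, v2)
//   reduce: (k2, list(v2)) -> list(v2)
// The runtime groups intermediate values by k2 between the two phases.
interface MapFunction<K1, V1, K2, V2> {
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2> {
    List<V2> reduce(K2 key, Iterator<V2> values);
}

// Minimal pair carrier used above (hypothetical helper class).
class Pair<K, V> {
    final K key;
    final V value;
    Pair(K key, V value) { this.key = key; this.value = value; }
}
```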
From CS Foundations to MapReduce
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
- We will start from scratch
- We will add and relax constraints
- We will do incremental design, improving the solution for performance and scalability
Word Counter and Result Table
Data collection: {web, weed, green, sun, moon, land, part, web, green, …}
[Figure: a single-machine design. Main drives the DataCollection through a WordCounter with parse() and count() methods, filling a ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
A sequential sketch of this design follows.
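This baseline can be written down directly; a minimal sequential sketch (hypothetical class and method names mirroring the diagram, not the authors' demo code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded word counter: parse() tokenizes, count() tallies.
public class WordCounter {
    private final Map<String, Integer> resultTable = new LinkedHashMap<>();

    public String[] parse(String dataCollection) {
        return dataCollection.toLowerCase().split("\\W+"); // split on non-word characters
    }

    public void count(String[] words) {
        for (String w : words)
            resultTable.merge(w, 1, Integer::sum); // increment this word's count
    }

    public static void main(String[] args) {
        WordCounter wc = new WordCounter();
        wc.count(wc.parse("web weed green sun moon land part web green"));
        wc.resultTable.forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```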
Multiple Instances of Word Counter
[Figure: Main spawns 1..* Threads, each running a WordCounter (parse(), count()) over the shared DataCollection and updating one shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Observe:
- Multi-threaded
- Lock on shared data
Improve Word Counter for Performance
[Figure: Main spawns 1..* Threads; 1..* Parsers emit a WordList of KEY/VALUE pairs (KEY: web, weed, green, sun, moon, land, part, web, green, …; VALUE: one count per occurrence), and 1..* Counters aggregate the list into the ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
No need for a lock: separate counters. A threaded sketch follows.
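A minimal sketch of the lock-free variant, assuming each thread keeps a private counter map that is merged only after all threads join (a hypothetical structure consistent with the "separate counters" note):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each thread counts its own stride of the data into a private map;
// no lock is needed because nothing is shared until the final merge.
public class ThreadedWordCounter {
    public static void main(String[] args) throws InterruptedException {
        String[] data = "web weed green sun moon land part web green".split(" ");
        int nThreads = 3;
        List<Map<String, Integer>> partials = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();

        for (int t = 0; t < nThreads; t++) {
            Map<String, Integer> local = new HashMap<>();
            partials.add(local);
            final int id = t;
            Thread th = new Thread(() -> {
                for (int i = id; i < data.length; i += nThreads) // strided split of the input
                    local.merge(data[i], 1, Integer::sum);
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) th.join();

        Map<String, Integer> result = new HashMap<>(); // single-threaded merge of partial counts
        for (Map<String, Integer> p : partials)
            p.forEach((k, v) -> result.merge(k, v, Integer::sum));
        System.out.println(result);
    }
}
```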
Peta-scale Data
[Figure: the same parse/count design, unchanged, now pointed at a peta-scale data collection; the KEY/VALUE word list and result table are as before.]
Addressing the Scale Issue
- A single machine cannot serve all the data: you need a distributed special (file) system
- Large number of commodity hardware disks: say, 1000 disks of 1TB each
- Issue: with a mean time between failures (MTBF) equivalent to a failure rate of 1/1000, at least one of those 1000 disks is expected to be down at any given time.
- Thus failure is the norm, not an exception.
- The file system has to be fault-tolerant: replication, checksums
- Data transfer bandwidth is critical (location of data)
- Critical aspects: fault tolerance + replication + load balancing + monitoring
- Exploit the parallelism afforded by splitting parsing and counting
- Provision and locate computing at the data locations
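The failure claim is a one-line expectation (and a short probability) calculation; a tiny sketch making it explicit:

```java
// With 1000 disks, each down with probability 1/1000 at any instant:
public class FailureMath {
    public static void main(String[] args) {
        int disks = 1000;
        double pDown = 1.0 / 1000;
        double expectedDown = disks * pDown;                  // expected disks down = 1.0
        double pAtLeastOne = 1 - Math.pow(1 - pDown, disks);  // ~0.63, so failure is routine
        System.out.printf("expected down: %.1f, P(at least one down): %.2f%n",
                          expectedDown, pAtLeastOne);
    }
}
```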
Peta Scale Data is Commonly Distributed
[Figure: the pipeline is unchanged, but the input is now many separate data collections distributed across machines.]
Issue: managing the large-scale data.
Write Once Read Many (WORM) data
[Figure: the same pipeline over the distributed data collections; the data is written once and then only read.]
WORM Data is Amenable to Parallelism
[Figure: the pipeline over the distributed data collections.]
1. Data with WORM characteristics yields to parallel processing.
2. Data without dependencies yields to out-of-order processing.
Divide and Conquer: Provision Computing at Data Location
[Figure: four nodes, each running the complete parse/count pipeline (Main, Threads, Parser, Counter, WordList, ResultTable) against its local data collection; one node per data location.]
For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
This is a particular solution; let's generalize it:
- Our parse is a mapping operation. MAP: input → <key, value> pairs
- Our count is a reduce operation. REDUCE: <key, value> pairs → reduced result
Map/Reduce originated in Lisp but have a different meaning here.
The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
Mapper and Reducer
[Figure: a MapReduceTask composes YourMapper and YourReducer, the counterparts of our Parser and Counter, which specialize the framework's Mapper and Reducer classes.]
Remember: MapReduce is simplified processing for large data sets. The MapReduce version of the WordCount source code is sketched below.
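A condensed sketch of the classic Hadoop WordCount, written against the old org.apache.hadoop.mapred API that was current in 2009; it follows the Apache tutorial cited in the references rather than reproducing the authors' exact demo code:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
    // Map: tokenize each input line and emit <word, 1> for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                output.collect(word, one);
            }
        }
    }

    // Reduce: sum the 1s collected for each word and emit <word, total>.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }
}
```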
Map Operation
MAP: input data → <key, value> pairs
[Figure: the data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split feeds a Map task, and every Map emits its own KEY/VALUE list: web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, ….]
Reduce Operation

MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>
[Figure: as before, the splits feed Map tasks; the emitted <key, value> pairs now flow into Reduce tasks that produce the results.]
[Figure: large-scale data splits pass through parse-hash Map tasks that emit <key, 1> pairs; hashing each key routes it to one of the reducers (say, Count), yielding output partitions P-0000 <key, count1>, P-0001 <key, count2>, P-0002 <key, count3>.]
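The "parse-hash" routing is simply a hash partition on the key; a minimal sketch of the idea (matching the behavior of Hadoop's default hash partitioner; the class itself is illustrative, not the authors' code):

```java
// Route a key to one of numReduceTasks output partitions (P-0000, P-0001, ...).
// Masking with Integer.MAX_VALUE keeps the hash code non-negative.
public class HashPartition {
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"web", "weed", "green"})
            System.out.printf("%s -> P-%04d%n", key, partitionFor(key, 3));
    }
}
```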
MapReduce Example in my operating systems class

[Figure: a terabyte-scale collection of words (cat, bat, dog, other words) is divided into splits; each split goes through a map task, the map outputs are locally combined, and three reduce tasks write the final output partitions part0, part1, part2.]
MapReduce Programming Model
MapReduce programming model
- Determine whether the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? is the data set large?).
- Design and implement the solution as Mapper classes and a Reducer class.
- Compile the source code with the Hadoop core.
- Package the code as an executable jar.
- Configure the application (job) as to the number of mappers and reducers (tasks) and the input and output streams (a configuration sketch follows this list).
- Load the data (or use previously available data).
- Launch the job and monitor it.
- Study the result.
- Detailed steps.
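A sketch of the configure-and-launch steps using the old JobConf API (the paths and the task count are placeholder values; the combiner line is the optional optimization mentioned on the next slide):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);            // <word, count> output types
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class); // optional local pre-aggregation
        conf.setReducerClass(WordCount.Reduce.class);
        conf.setNumReduceTasks(3);                     // the task counts are configurable

        FileInputFormat.setInputPaths(conf, new Path("input"));   // placeholder path
        FileOutputFormat.setOutputPath(conf, new Path("output")); // placeholder path

        JobClient.runJob(conf); // launch the job and block until completion
    }
}
```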
MapReduce Characteristics
- Very large-scale data: petabytes, exabytes
- Write-once, read-many data: allows for parallelism without mutexes
- Map and Reduce are the main operations: simple code
- There are other supporting operations such as combine and partition (out of the scope of this talk).
- All map tasks must be completed before the reduce operation starts.
- Map and reduce operations are typically performed by the same physical processor.
- The numbers of map tasks and reduce tasks are configurable.
- Operations are provisioned near the data.
- Commodity hardware and storage.
- The runtime takes care of splitting and moving data for the operations.
- A special distributed file system; example: the Hadoop Distributed File System and the Hadoop Runtime.
Classes of problems “mapreducable”
- Benchmark for comparing: Jim Gray's challenge on data-intensive computing. Ex: "Sort"
- Google uses it (we think) for wordcount, AdWords, PageRank, and indexing data.
- Simple algorithms such as grep, text indexing, reverse indexing (a grep mapper is sketched below)
- Bayesian classification: the data-mining domain
- Facebook uses it for various operations: demographics
- Financial services use it for analytics
- Astronomy: Gaussian analysis for locating extraterrestrial objects.
- Expected to play a critical role in the Semantic Web and Web 3.0
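Grep is about the simplest "mapreducable" problem: the map emits only matching lines and the reduce is the identity. A minimal mapper sketch (illustrative; in practice the pattern would come from the job configuration rather than being hard-coded):

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits <line, offset> only for lines that match; with an identity reducer
// (or zero reduce tasks) the job's output is just the matching lines.
public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private final Pattern pattern = Pattern.compile("map.*reduce"); // hypothetical pattern

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        if (pattern.matcher(line.toString()).find())
            output.collect(line, offset);
    }
}
```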
Scope of MapReduce
[Figure: a spectrum of parallelism granularity against data size, from small to large: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level).]
Hadoop
What is Hadoop?
- At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
- GFS is not open source.
- Doug Cutting and Yahoo! reverse-engineered GFS and called the result the Hadoop Distributed File System (HDFS).
- The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
- It is open source and distributed by Apache.
Basic Features: HDFS
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data (see the sketch below)
- Can be built out of commodity hardware
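"Streaming access" means an application opens a file through the HDFS client and reads it sequentially; a minimal sketch using the standard FileSystem API (the input path is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up the cluster settings
        FileSystem fs = FileSystem.get(conf);      // the HDFS client
        FSDataInputStream in = fs.open(new Path("/user/demo/words.txt")); // placeholder path
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) // stream the file line by line
            System.out.println(line);
        reader.close();
    }
}
```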
Hadoop Distributed File System

[Figure: an Application uses the HDFS Client alongside the local file system (block size: 2K); the client talks to the HDFS Server on the master node, which runs the Name Node (block size: 128M, replicated). Data nodes report to the name node with heartbeat and blockmap messages.]
More details: we discuss this in great detail in my Operating Systems course.
Relevance and Impact on Undergraduate courses
- Data structures and algorithms: a new look at traditional algorithms such as sort. Quicksort may not be your choice! It is not easily parallelizable; merge sort is better (see the sketch below).
- You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for the algorithms relevant to your applications.
- Large-scale data and analytics are indeed concepts to reckon with, much as we addressed "programming in the large" with OO concepts.
- While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.
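The merge-sort remark can be made concrete: the divide step is a natural map (sort each split independently) and the combine step a natural reduce (merge the sorted runs). A toy single-machine sketch of that decomposition, not actual MapReduce code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// "Map" phase: sort each split independently (perfectly parallelizable).
// "Reduce" phase: merge the sorted runs with a two-way merge loop.
public class MergeSortAsMapReduce {
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size())
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        out.addAll(a.subList(i, a.size())); // append whichever run has leftovers
        out.addAll(b.subList(j, b.size()));
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                new ArrayList<>(Arrays.asList(5, 3, 9)),
                new ArrayList<>(Arrays.asList(8, 1, 4)),
                new ArrayList<>(Arrays.asList(7, 2, 6)));
        splits.forEach(Collections::sort);      // map: sort each split
        List<Integer> result = new ArrayList<>();
        for (List<Integer> run : splits)        // reduce: merge the runs
            result = merge(result, run);
        System.out.println(result);             // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```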
Demo
- A VMware-simulated Hadoop and MapReduce demo
- Remote access to the NEXOS system at my Buffalo office
- A 5-node cluster running HDFS on Ubuntu 8.04
- 1 name node and 4 data nodes
- Each is an old commodity PC with 512 MB RAM and 120GB-160GB of external storage
- Zeus (namenode); datanodes: hermes, dionysus, aphrodite, athena
Summary
- We introduced the MapReduce programming model for processing large-scale data
- We discussed the supporting Hadoop Distributed File System
- The concepts were illustrated using a simple example
- We reviewed some important parts of the source code for the example
- Relationship to cloud computing
References
1. Apache Hadoop Tutorial: http://hadoop.apache.org and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html