K. MADURAI AND B. RAMAMURTHY
MapReduce and Hadoop Distributed File System
B.Ramamurthy & K.Madurai
Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
Partially Supported by
NSF DUE Grant: 0737243
CCSCNE 2009, Plattsburgh, April 24, 2009
The Context: Big-data
 Man on the moon with 32KB of memory (1969); my laptop had 2GB RAM (2009)
 Google collected 270PB of data in a month (2007) and processed about 20PB a day (2008)
 The 2010 census data is expected to be a huge gold mine of information
 Data mining the huge amounts of data collected in domains ranging from astronomy to healthcare has become essential for planning and performance.
 We are in a knowledge economy.
 Data is an important asset to any organization
 Discovery of knowledge; enabling discovery; annotation of data
 We are looking at newer
 programming models, and
 supporting algorithms and data structures.
 NSF refers to it as “data-intensive computing”; industry calls it “big-data” and “cloud computing”
Purpose of this talk
 To provide a simple introduction to:
 “Big-data computing”: an important advancement with the potential to significantly impact the CS undergraduate curriculum.
 A programming model called MapReduce for processing “big-data”
 A supporting file system called the Hadoop Distributed File System (HDFS)
 To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum.
The Outline
 Introduction to MapReduce
 From CS Foundations to MapReduce
 MapReduce programming model
 Hadoop Distributed File System
 Relevance to Undergraduate Curriculum
 Demo (Internet access needed)
 Our experience with the framework
 Summary
 References
MapReduce
What is MapReduce?
 MapReduce is a programming model that Google has used successfully in processing its “big-data” sets (~20 petabytes per day):
 Users specify the computation in terms of a map and a reduce function.
 The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
 The underlying system also handles machine failures, efficient communication, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
From CS Foundations to MapReduce
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green,
…}
Problem: Count the occurrences of the different words
in the collection.
Let's design a solution for this problem:
 We will start from scratch
 We will add and relax constraints
 We will do incremental design, improving the solution for performance and scalability
Word Counter and Result Table
[Diagram: Main drives a WordCounter class with parse( ) and count( ) operations over a DataCollection, filling a ResultTable.]
Input: {web, weed, green, sun, moon, land, part, web, green, …}
Result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
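The design above can be sketched as a short serial program. This is an illustrative sketch, not the talk's actual demo code; the names parse and count simply mirror the class diagram.

```python
# Minimal serial word counter mirroring the WordCounter design above:
# parse() tokenizes the collection, count() tallies a result table.

def parse(data_collection):
    """Split the raw collection into individual words."""
    return data_collection.split()

def count(words):
    """Tally each word into a result table (word -> occurrences)."""
    result_table = {}
    for word in words:
        result_table[word] = result_table.get(word, 0) + 1
    return result_table

data = "web weed green sun moon land part web green"
print(count(parse(data)))
# {'web': 2, 'weed': 1, 'green': 2, 'sun': 1, 'moon': 1, 'land': 1, 'part': 1}
```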
Multiple Instances of Word Counter
[Diagram: Main spawns 1..* Thread instances of WordCounter, each with parse( ) and count( ), sharing one DataCollection and one ResultTable; the result table is the same as before.]
Observe:
Multi-threaded
Lock on shared data
Improve Word Counter for Performance
[Diagram: Main spawns 1..* Thread instances; each runs a Parser that emits words into a WordList (KEY: web weed green sun moon land part web green …) and a Counter that tallies into the ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
No need for a lock
Separate counters
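The "no lock, separate counters" idea can be sketched as follows: each worker counts its own slice of the data into a private table, and the private tables are merged once all workers finish. A hypothetical sketch (the slicing scheme and worker names are illustrative):

```python
# Sketch of the lock-free design: each thread gets its own slice of the
# data and its own private counter table, so no lock on shared data is
# needed; the per-thread tables are merged after all threads finish.
from collections import Counter
from threading import Thread

def worker(slice_, out, i):
    out[i] = Counter(slice_)        # private table: no shared mutable state

words = "web weed green sun moon land part web green".split()
n = 3
slices = [words[i::n] for i in range(n)]
tables = [None] * n
threads = [Thread(target=worker, args=(slices[i], tables, i)) for i in range(n)]
for t in threads: t.start()
for t in threads: t.join()

result = sum(tables, Counter())     # merge the separate counters
print(dict(result))                 # same totals as the serial version
```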
Peta-scale Data
[Diagram: the same Parser/Counter design as the previous slide, now applied to peta-scale data.]
Addressing the Scale Issue
 A single machine cannot serve all the data: you need a distributed, special-purpose (file) system
 Large number of commodity hardware disks: say, 1000 disks of 1TB each
 Issue: with a disk failure rate of 1/1000, at least one of the 1000 disks is expected to be down at any given time.
 Thus failure is the norm and not an exception.
 The file system has to be fault-tolerant: replication, checksums
 Data transfer bandwidth is critical (location of data)
 Critical aspects: fault tolerance + replication + load balancing + monitoring
 Exploit the parallelism afforded by splitting parsing and counting
 Provision and locate computing at the data locations
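The 1-in-1000 claim above can be checked with a quick calculation: with 1000 disks each down with probability 1/1000 at any instant, the expected number of down disks is exactly 1, and the chance that at least one is down is about 63%.

```python
# Back-of-the-envelope check of the failure claim above.
n, p = 1000, 1.0 / 1000            # 1000 disks, each down with probability 1/1000

expected_down = n * p              # expected number of down disks
p_at_least_one = 1 - (1 - p) ** n  # P(at least one disk down) = 1 - (1-p)^n

print(expected_down)               # 1.0
print(round(p_at_least_one, 2))    # ~0.63 (approaches 1 - 1/e as n grows)
```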
Peta-scale Data
[Diagram: the same Parser/Counter design over peta-scale data, repeated as the build-up continues.]
Peta Scale Data is Commonly Distributed
[Diagram: the same design, but the input is now spread across many separate data collections.]
Issue: managing the large-scale data
Write Once Read Many (WORM) data
[Diagram: the same design over the distributed data collections; the data is written once and read many times (WORM).]
WORM Data is Amenable to Parallelism
[Diagram: the same design over the distributed data collections.]
1. Data with WORM characteristics yields to parallel processing;
2. Data without dependencies yields to out-of-order processing
Divide and Conquer: Provision Computing at Data Location
[Diagram: four nodes, each running the complete Main/Thread/Parser/Counter design over its own local data collection (one node per data collection).]
For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
This is a particular solution; let's generalize it:
Our parse is a mapping operation:
MAP: input → <key, value> pairs
Our count is a reduce operation:
REDUCE: <key, value> pairs → reduced result
Map/Reduce originated in Lisp, but the terms have a different meaning here.
The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
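The generalization above — parse as MAP, count as REDUCE, with the runtime doing the rest — can be sketched in miniature. The shuffle step that groups pairs by key is what the runtime provides between the two phases; this single-process sketch only illustrates the data flow, not the distribution.

```python
# Miniature single-process sketch of the generalized pipeline:
# MAP emits <key, value> pairs, a shuffle groups them by key, and
# REDUCE collapses each group. The real runtime distributes these
# phases across nodes and adds fault tolerance, replication, etc.
from itertools import groupby

def map_fn(split):                      # our parse: input -> <word, 1> pairs
    return [(word, 1) for word in split.split()]

def reduce_fn(key, values):             # our count: key + values -> result
    return (key, sum(values))

splits = ["web weed green", "sun moon land", "part web green"]

pairs = [pair for s in splits for pair in map_fn(s)]       # map phase
pairs.sort(key=lambda kv: kv[0])                           # shuffle: group by key
results = dict(reduce_fn(k, [v for _, v in grp])
               for k, grp in groupby(pairs, key=lambda kv: kv[0]))
print(results)
```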
Mapper and Reducer
Remember: MapReduce is simplified processing for larger data sets.
[Slide shows the MapReduce version of the WordCount source code.]
Map Operation
MAP: input data → <key, value> pairs
[Diagram: the data collection is split (split 1 … split n) to supply multiple processors; each split feeds a Map task, which emits a <word, 1> pair per word — web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, … — as KEY/VALUE output.]
Reduce Operation
MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>
[Diagram: the splits feed Map tasks as before; the Map outputs are then consumed by multiple Reduce tasks.]
[Diagram: large-scale data splits flow through parse-hash Map tasks that emit <key, 1> pairs; the pairs are partitioned (P-0000, P-0001, P-0002) and routed to Reducers (say, Count), which emit <key, count> results.]
MapReduce Example in my operating systems class
[Diagram: a terabyte-sized collection of words (cat, bat, dog, other words) flows through split → map → combine → reduce stages, producing output partitions part0, part1, part2.]
MapReduce Programming Model
MapReduce programming model
 Determine if the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?).
 Design and implement the solution as Mapper and Reducer classes.
 Compile the source code against the Hadoop core.
 Package the code as an executable jar.
 Configure the application (job): the number of mapper and reducer tasks, and the input and output streams.
 Load the data (or use previously loaded data).
 Launch the job and monitor it.
 Study the results.
 Detailed steps.
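The steps above map onto a command sequence roughly like the following. The paths, class names, and jar names are illustrative for a Hadoop-0.x-era setup like the one in this talk, not the authors' exact commands; file names vary by version.

```shell
# Illustrative command sequence for the steps above (names are examples).
# Compile the Mapper/Reducer sources against the Hadoop core jar:
javac -classpath hadoop-core.jar -d classes WordCount.java

# Package the compiled classes as an executable jar:
jar cf wordcount.jar -C classes .

# Load the input data into HDFS:
hadoop fs -put local-input/ input/

# Launch the job (monitor it via the JobTracker web UI), then study the results:
hadoop jar wordcount.jar WordCount input/ output/
hadoop fs -cat output/part-00000
```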
MapReduce Characteristics
 Very large scale data: peta-, exabytes
 Write-once, read-many data: allows for parallelism without mutexes
 Map and Reduce are the main operations: simple code
 There are other supporting operations such as combine and partition (out of the scope of this talk).
 All the map tasks should be completed before the reduce operation starts.
 Map and reduce operations are typically performed by the same physical processor.
 The numbers of map tasks and reduce tasks are configurable.
 Operations are provisioned near the data.
 Commodity hardware and storage.
 The runtime takes care of splitting and moving data for operations.
 A special distributed file system. Example: the Hadoop Distributed File System and the Hadoop runtime.
Classes of problems “mapreducable”
 Benchmark for comparison: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
 Google uses it (we think) for wordcount, AdWords, PageRank, and indexing data.
 Simple algorithms such as grep, text indexing, reverse indexing
 Bayesian classification: data mining domain
 Facebook uses it for various operations: demographics
 Financial services use it for analytics
 Astronomy: Gaussian analysis for locating extraterrestrial objects.
 Expected to play a critical role in the semantic web and Web 3.0
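One of the simple examples above, grep, fits the model with an almost trivial reducer: the map emits matching lines and the reduce is the identity. A hypothetical sketch (function names are illustrative, not a real Hadoop API):

```python
# Sketch of "grep" as a MapReduce job: map emits a <line, 1> pair for
# every line matching the pattern; reduce simply passes matches through.
import re

def grep_map(split, pattern):
    return [(line, 1) for line in split if re.search(pattern, line)]

def grep_reduce(key, values):
    return key                      # identity: the matched line itself

splits = [["map reduce", "hello world"], ["hadoop hdfs", "reduce side"]]
matches = [grep_reduce(k, v) for s in splits for k, v in grep_map(s, r"reduce")]
print(matches)                      # ['map reduce', 'reduce side']
```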
Scope of MapReduce
[Diagram: levels of parallelism ordered by data size, from small to large: pipelined / instruction level, concurrent / thread level, service / object level, indexed / file level, mega / block level, virtual / system level.]
Hadoop
What is Hadoop?
 At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
 GFS is not open source.
 Doug Cutting and Yahoo! reverse-engineered GFS and called the result the Hadoop Distributed File System (HDFS).
 The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
 It is open source and distributed by Apache.
Basic Features: HDFS
 Highly fault-tolerant
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware
Hadoop Distributed File System
[Diagram: an Application uses an HDFS Client to talk to the HDFS Server; the master node runs the Name Node. The local file system uses small blocks (e.g., 2K), while HDFS uses large, replicated 128M blocks.]
More details: we discuss this in great detail in my Operating Systems course.
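The contrast in block sizes shown in the figure matters at scale. As a back-of-the-envelope sketch (assuming 3-way replication, a common HDFS default not stated on the slide), a 1TB file in 128MB blocks works out as follows:

```python
# Why HDFS uses large blocks: block count and raw storage for a 1TB file.
import math

file_size = 1 * 1024**4            # 1 TB in bytes
block_size = 128 * 1024**2         # 128 MB HDFS block
replication = 3                    # assumed replication factor

blocks = math.ceil(file_size / block_size)
raw = file_size * replication

print(blocks)                      # 8192 blocks for the Name Node to track
print(raw // 1024**4)              # 3 TB of raw storage consumed
```

With 2K blocks instead, the Name Node would have to track over half a billion blocks for the same file, which is why HDFS blocks are so large.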
Hadoop Distributed File System
[Same diagram as the previous slide, with heartbeat and blockmap messages flowing to the Name Node.]
More details: we discuss this in great detail in my Operating Systems course.
Relevance and Impact on Undergraduate courses
 Data structures and algorithms: a new look at traditional algorithms such as sort: Quicksort may not be your choice! It is not easily parallelizable. Merge sort is better.
 You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant to your applications.
 Large-scale data and analytics are indeed concepts to reckon with, similar to how we addressed “programming in the large” with OO concepts.
 While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.
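The quicksort-vs-mergesort point can be made concrete: mergesort's two halves are independent subproblems (a natural map step), and merging the sorted runs is a natural reduce. A sketch:

```python
# Mergesort in MapReduce shape: sorting the halves is the embarrassingly
# parallel "map" step (the halves share no data), and merging the sorted
# runs is the "reduce" step.
def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def merge_sort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    # The two recursive calls are independent: they could run on
    # different nodes, with merge() combining their results.
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

print(merge_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]
```

Quicksort, by contrast, partitions around a pivot chosen from the whole input, so its subproblems are not known until the data has been scanned; that is what makes it harder to parallelize in this style.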
Demo
 VMware-simulated Hadoop and MapReduce demo
 Remote access to the NEXOS system at my Buffalo office
 5-node HDFS cluster running on Ubuntu 8.04
 1 name node and 4 data nodes
 Each node is an old commodity PC with 512 MB RAM and 120GB–160GB of disk
 Zeus (name node); data nodes: hermes, dionysus, aphrodite, athena
Summary
 We introduced the MapReduce programming model for processing large-scale data
 We discussed the supporting Hadoop Distributed File System
 The concepts were illustrated using a simple example
 We reviewed some important parts of the source code for the example
 Relationship to cloud computing
References
1. Apache Hadoop Tutorial: http://hadoop.apache.org
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html