CS350 - MAPREDUCE USING HADOOP
Spring 2012
PARALLELIZATION: BASIC IDEA
Parallelization is “easy” if processing can be cleanly split into n units:
[Diagram: a problem is partitioned into independent work units w1, w2, w3]
PARALLELIZATION: BASIC IDEA
[Diagram: worker threads are spawned, one per work unit (w1, w2, w3); each thread processes its unit and the results are reported back]
THE PROBLEM
Google faced the problem of analyzing huge sets of data (on the order of petabytes).
 E.g. PageRank, web access logs, etc.
The algorithm to process the data can be reasonably simple,
 But to finish in an acceptable amount of time, the task must be split and forwarded to potentially thousands of machines
Programmers were forced to develop software that:
 Splits data
 Forwards data and code to participant nodes
 Checks node state to react to errors
 Retrieves and organizes results
Tedious, error-prone, time-consuming... and had to be done for each problem.
THE SOLUTION: MAPREDUCE
MapReduce is an abstraction to organize parallelizable tasks.
The algorithm has to be adapted to fit MapReduce's two main steps:
 Map: data processing (with an intermediate collecting/grouping/distribution step)
 Reduce: data collection and digesting
MapReduce Architecture provides
 Automatic parallelization & distribution
 Fault tolerance
 I/O scheduling
 Monitoring & status updates
LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data
elements into lists of output data elements.
A MapReduce program will do this twice, using two different list
processing idioms:
 Map
 Reduce
These terms are taken from several list processing languages such as
LISP, Scheme, or ML.
MAPPING LISTS
A list of data elements is provided, one at a time, to a function
called the Mapper.
It transforms each element individually to an output data element.
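The same idiom exists outside Hadoop. As a minimal Java sketch (plain java.util.stream, not the Hadoop API), mapping transforms each element independently:

import java.util.List;
import java.util.stream.Collectors;

public class MapIdiom {
    public static void main(String[] args) {
        List<String> input = List.of("sweet", "foo", "bar");
        // The "mapper" transforms each element on its own, with no shared state.
        List<String> output = input.stream()
                                   .map(String::toUpperCase)
                                   .collect(Collectors.toList());
        System.out.println(output);   // prints [SWEET, FOO, BAR]
    }
}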
REDUCING LISTS
Reducing lets you aggregate values together.
A reducer function receives an iterator of input values from an input list.
It then combines these values together, returning a single output value.
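Again as a minimal Java sketch (java.util.stream, not the Hadoop API), reducing folds a whole list into one value:

import java.util.List;

public class ReduceIdiom {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 1, 2, 5);
        // The "reducer" combines all input values into a single output value.
        int sum = values.stream().reduce(0, Integer::sum);
        System.out.println(sum);   // prints 9
    }
}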
MAPPING IN MAPREDUCE
(KEYS AND VALUES)
In MapReduce, no value stands on its own.
Every value has a key associated with it. Keys identify related values.
 For example, a log of time-coded speedometer readings from multiple cars could be
keyed by license-plate number.
The mapping and reducing functions receive not just values, but (key, value) pairs.
The output of each of these functions is the same:
 Both a key and a value must be emitted to the next list in the data flow.
For the speedometer example, the keyed readings would look like:
AAA-123  65mph, 12:00pm
ZZZ-789  50mph, 12:02pm
AAA-123  40mph, 12:05pm
CCC-456  25mph, 12:15pm
...
REDUCING IN MAPREDUCE
(KEYS DIVIDE THE REDUCE SPACE)
In MapReduce, the output values are not usually all reduced together.
All of the values with the same key are presented to a single reducer together.
This is performed independently of any reduce operations occurring on other lists of
values, with different keys attached.
MAPREDUCE DATA FLOW
EXAMPLE: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear
in a set of files.
For example, if we had the files:
 foo.txt: Sweet, this is the foo file
 bar.txt: This is the bar file
We would expect the output to be:
  sweet 1
  this 2
  is 2
  the 2
  foo 1
  bar 1
  file 2
WORD COUNT IN MAPREDUCE (2)
The high-level structure would look like this:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
WORD COUNT IN MAPREDUCE
WORD COUNT SOURCE
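The original slide showed the source as an image. A minimal sketch using the classic org.apache.hadoop.mapred API (the OutputCollector/Reporter style discussed later); class names are illustrative, one file per class:

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per record: key is the byte offset, value is one line of text.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            // Lower-cased to match the example output; punctuation stripping omitted.
            word.set(tokens.nextToken().toLowerCase());
            output.collect(word, ONE);   // emit (word, 1)
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per distinct word, with an iterator over all its 1s.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // emit (word, total)
    }
}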
WORD COUNT DRIVER
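A matching driver sketch (again the old mapred API; input and output paths are taken from the command line for illustration):

// WordCountDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // types of the emitted (key, value)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submit the job and block until it completes
    }
}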
A CLOSER LOOK
INPUT FILES
This is where the data for a MapReduce task is initially stored.
The input files typically reside in HDFS.
The format of these files can be:
 Line-based log files
 Binary format files
 Multi-line input records
It is typical for these input files to be very large -- tens of gigabytes
or more.
INPUT FORMAT
How these input files are split up and read is defined by the InputFormat.
An InputFormat is a class that provides the following functionality:
 Selects the files or other objects that should be used for input
 Defines the InputSplits that break a file into tasks
 Provides a factory for RecordReader objects that read the file
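For instance, in the old mapred API the InputFormat is selected on the JobConf; a brief sketch (TextInputFormat is Hadoop's default for line-based text, and the "input" path is illustrative):

conf.setInputFormat(TextInputFormat.class);               // key = byte offset, value = line
FileInputFormat.setInputPaths(conf, new Path("input"));   // which files/directories to read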
INPUT SPLITS
An InputSplit describes a unit of work that comprises a single map task in a
MapReduce program.
 A MapReduce program applied to a data set, collectively referred to as a Job, is made up of
several (possibly several hundred) tasks
By processing a file in chunks, we allow several map tasks to operate on a
single file in parallel.
 The various blocks that make up the file may be spread across several different nodes in the cluster
 The individual blocks are thus all processed locally, instead of needing to be transferred from one node to another
The tasks are then assigned to the nodes in the system based on where the
input file chunks are physically resident.
 An individual node may have several dozen tasks assigned to it
 The node will begin working on the tasks, attempting to perform as many in parallel as it can
RECORD READER
The InputSplit has defined a slice of work, but does not describe how
to access it.
The RecordReader class actually loads the data from its source and
converts it into (key, value) pairs suitable for reading by the Mapper.
The RecordReader is invoked repeatedly on the input until the entire
InputSplit has been consumed.
 Each invocation of the RecordReader leads to another call to the map() method of the
Mapper.
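A simplified sketch of how the framework drives a RecordReader (old mapred API; error handling omitted):

RecordReader<LongWritable, Text> reader =
        inputFormat.getRecordReader(split, conf, reporter);
LongWritable key = reader.createKey();
Text value = reader.createValue();
while (reader.next(key, value)) {               // each successful read...
    mapper.map(key, value, output, reporter);   // ...triggers one map() call
}
reader.close();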
MAPPER
Given a key and a value, the map() method emits (key, value) pair(s)
which are forwarded to the Reducers.
The individual mappers are intentionally not provided with a
mechanism to communicate with one another in any way.
 This allows the reliability of each map task to be governed solely by the reliability of
the local machine
The map() method receives two parameters in addition to the key and
the value:
 The OutputCollector object has a method named collect() which will forward a (key,
value) pair to the reduce phase of the job.
 The Reporter object provides information about the current task
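For illustration, a map() body might use these two parameters like this (a sketch continuing the word count mapper above; the counter group and name are made up):

reporter.setStatus("processing " + key);              // shown as the task's status
reporter.incrCounter("WordCount", "INPUT_LINES", 1);  // increment a custom counter
output.collect(word, ONE);                            // forward a pair to the reduce phase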
PARTITION & SHUFFLE
After the first map tasks have completed, the nodes may still be performing
several more map tasks each.
 But they also begin exchanging the intermediate outputs from the map tasks to where they are
required by the reducers
 This process of moving map outputs to the reducers is known as shuffling
A different subset of the intermediate key space is assigned to each reduce node;
these subsets (known as "partitions") are the inputs to the reduce tasks.
Each map task may emit (key, value) pairs to any partition; all values for the same
key are always reduced together, regardless of which mapper produced them.
Therefore, the map nodes must all agree on where to send the different pieces of
the intermediate data.
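That agreement is encapsulated in a Partitioner. The core of Hadoop's default hash partitioner amounts to the following (a sketch; the mask keeps the result non-negative):

public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Same key always hashes to the same partition, on every mapper.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}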
REDUCER
Sort
 Each reduce task is responsible for reducing the values associated with several
intermediate keys.
 The set of intermediate keys on a single node is automatically sorted by Hadoop
before they are presented to the Reducer.
Reduce
 A Reducer instance is created for each reduce task.
 This is an instance of user-provided code that performs the second important phase of
job-specific work.
 For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once.
 This receives a key as well as an iterator over all the values associated with the key.
 The values associated with a key are returned by the iterator in an undefined order.
 The Reducer also receives as parameters OutputCollector and Reporter objects; they
are used in the same manner as in the map() method.
OUTPUT FORMAT
The (key, value) pairs provided to this OutputCollector are then written to
output files.
 The way they are written is governed by the OutputFormat.
Each Reducer writes a separate file in a common output directory.
 The output directory is set by the FileOutputFormat.setOutputPath() method
RECORD WRITER
The OutputFormat class is a factory for RecordWriter objects;
 These are used to write the individual records to the files as directed by the
OutputFormat
The output files written by the Reducers are then left in HDFS for use by:
 Another MapReduce job
 A separate program
 Human inspection
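A simplified sketch of the factory relationship on the output side (old mapred API; the variables stand in for objects the framework supplies):

RecordWriter<Text, IntWritable> writer =
        outputFormat.getRecordWriter(fileSystem, conf, fileName, progress);
writer.write(key, value);   // one call per (key, value) pair emitted by the Reducer
writer.close(reporter);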
ADDITIONAL FUNCTIONALITY
FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is its
high degree of fault tolerance.
Map worker failure
 Map tasks completed or in-progress at worker are reset to idle
 Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
 Only in-progress tasks are reset to idle
Master failure
 The MapReduce job is aborted and the client is notified
Should we have task identities?
EXAMPLE: INVERTED INDEX
An inverted index maps each word to the list of documents that contain it.
Thus, if the word "cat" appears in documents A and B, but not C, then the line:
 cat A, B
should appear in the output.
If the word "baseball" appears in documents B and C, then the line:
 baseball B, C
should appear in the output as well.
INVERTED INDEX CODE Using Eclipse and Hadoop
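The original slide showed the code as a screenshot. A minimal sketch of the mapper and reducer logic (old mapred API; class names are illustrative, and it assumes an input format whose keys are document names, e.g. a custom RecordReader):

// InvertedIndexMapper.java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndexMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    public void map(Text docName, Text line,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                output.collect(new Text(word), docName);   // emit (word, document)
            }
        }
    }
}

// InvertedIndexReducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> docs,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        StringBuilder list = new StringBuilder();
        while (docs.hasNext()) {
            if (list.length() > 0) list.append(", ");
            list.append(docs.next().toString());
        }
        output.collect(word, new Text(list.toString()));   // e.g. cat -> "A, B"
    }
}

Note: the same document name may appear more than once per word; a real implementation would deduplicate, e.g. with a java.util.Set.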
REFERENCES
Yahoo! Hadoop tutorial
 http://developer.yahoo.com/hadoop/tutorial/index.html
Processing of massive data: MapReduce
 http://lsd.ls.fi.upm.es/lsd/nuevas-tendencias-en-sistemas-distribuidos/IntroToMapReduce.pdf
Hadoop webpage
 http://hadoop.apache.org/common/docs/current/
CS-350 Concurrency in the Cloud (for the masses)