MapReduce
Unit 2
Objective
What is Apache MapReduce?
Features of MapReduce
How does the Hadoop
MapReduce algorithm work?
MapReduce Example
What is Apache MapReduce?
 Apache MapReduce is the processing engine of Hadoop
that processes and computes vast volumes of data. The
MapReduce programming paradigm allows you to scale the
processing of unstructured data across hundreds or
thousands of commodity servers in an Apache Hadoop
cluster.
 It has two main components or phases, the map
phase and the reduce phase.
 The input data is fed to the mapper phase to map the
data. The shuffle, sort, and reduce operations are
then performed to give the final output.
The MapReduce programming paradigm offers several features
and benefits that help gain insights from vast volumes of data.
Features of MapReduce
 MapReduce helps organizations process, in parallel, vast
amounts of data stored in the Hadoop Distributed File
System (HDFS).
 It reduces the processing time and supports faster
processing of data. This is because all the nodes are
working with their part of the data, in parallel.
 Developers can write MapReduce codes in a range
of languages such as Java, C++, and Python.
 It is fault-tolerant: in case of a failure, it uses the replicated
copies of the blocks on other machines to continue
processing.
How Does the Hadoop MapReduce Algorithm Work?
Let’s understand how the MapReduce algorithm works by
understanding the job execution flow in detail.
 The input data to process using the MapReduce task is
stored in input files that reside on HDFS.
 The input format defines the input specification and how
the input files are split and read.
 The input split logically represents the data to be
processed by an individual Mapper.
 The record reader communicates with the input split and
converts the data into key-value pairs suitable for reading
by the mapper (k, v).
 The mapper class processes input records from the
RecordReader and generates intermediate key-value pairs
(k’, v’). The map logic is applied in parallel to the ‘n’ data
blocks spread across the various data nodes.
 The combiner is a mini reducer that runs on each mapper's
output: for every mapper, there is one combiner. It is used to
optimize the performance of MapReduce jobs by reducing the
data transferred to the reducers.
 The partitioner decides which reducer each intermediate key
(and the output of its combiner, if any) is sent to; a minimal
custom partitioner is sketched after this list.
 The output of the partitioner is shuffled and sorted: the
values are grouped by key, and this grouped output is fed as
input to the reducer. For each intermediate key, the reducer
receives the list of its intermediate values and combines
them into the final output tuples.
 The record writer writes these output key-value pairs
from the reducer to the output files. The output data is
stored on the HDFS.
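To make the combiner and partitioner steps concrete, here is a minimal sketch of a custom partitioner. It is not taken from these notes; the class name and the rule (routing keys by their first character) are illustrative assumptions, and the key/value types match the word-count example that follows.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: picks a reducer for each intermediate (word, count)
// pair based on the first character of the key. Returning the same partition
// number for equal keys guarantees that all values for a key reach one reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0; // map-only jobs have no partitions to choose from
    }
    String k = key.toString();
    char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
    return first % numReduceTasks; // always in the range [0, numReduceTasks)
  }
}

In the driver, such a class would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); by default Hadoop uses HashPartitioner, which simply hashes the key.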
MapReduce workflow
Shown below is a MapReduce example to count the frequency of each
word in a given input text. Our input text is, “Big data comes in various
formats. This data can be stored in multiple data servers.”
MapReduce Example to count the occurrences of words
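The word-count code appeared only as a figure in the original deck, so here is a hedged sketch of what such a job's mapper and reducer could look like (the class names WordCountMapper and WordCountReducer are illustrative assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative mapper: emits (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Illustrative reducer: sums the counts for each word.
// (Package-private here only so both classes fit in one sketch file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

For the sample sentence above, such a reducer would emit pairs like (data, 3) and (in, 2); note that punctuation stays attached to tokens such as "formats." unless it is stripped first.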
Call Records
 Shown below is a sample data of call records. It has
the information regarding phone numbers from which
the call was made, and to which phone number it was
made. The data also gives information about the total
duration of each call. It also tells you if the call made
was a local (0) or an STD call (1).
 We will use this data to perform certain operations with the
help of a MapReduce algorithm. One of the operations you
can perform is to find all the phone numbers that made more
than 60 minutes of STD calls. We will use the Java
programming language for this task.
1. Let’s first declare our constants for the fields.
2. Import all the necessary packages to make sure
we use the classes in the right way.
3. The order of the driver, mapper, and reducer class does
not matter. So, let’s create a mapper that will do the map
task.
•Create a TokenizerMapper that extends the Mapper class and accepts the
desired data types (line 69-70).
•Assign the phone numbers and the duration of the calls in minutes (line 72-73).
•The map task works on a string and breaks it into individual elements based on
a delimiter (line 75-78).
•Then, check whether the record has the STD flag set (line 79).
•If it does, set the phone numbers using the constants class and find the duration
(line 81-83).
•Finally, emit the phone number and the duration of each STD call; a hedged
sketch of such a mapper appears below.
 This mapper class will return an intermediate output,
which would then be sorted and shuffled and passed
on to the reducer.
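The mapper listing itself is not reproduced in these notes (only its line numbers are referenced), so below is a hedged sketch of what such a TokenizerMapper could look like. The record layout, the pipe delimiter, and the field positions (caller number, STD flag, duration in minutes) are assumptions made purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (callerPhoneNumber, durationInMinutes) for STD calls only.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Assumed field positions in each pipe-delimited call record (illustrative only).
  private static final int CALLER = 0;
  private static final int STD_FLAG = 4;
  private static final int DURATION = 5;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\\|");
    if (fields.length > DURATION && "1".equals(fields[STD_FLAG].trim())) { // 1 = STD call
      Text phoneNumber = new Text(fields[CALLER].trim());
      IntWritable duration = new IntWritable(Integer.parseInt(fields[DURATION].trim()));
      context.write(phoneNumber, duration);
    }
  }
}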
4. Next, we define our reducer class.
So, we define a reducer class called SumReducer. The reducer uses the data types
specific to Hadoop MapReduce (line 50-52).
The reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs. The output of the reduce task is written to a
RecordWriter via TaskInputOutputContext.write(Object, Object) (line 54-56).
It looks at all the keys and values. Wherever it finds keys that repeat and whose total
duration is more than 60 minutes, it returns an aggregated result. A hedged sketch of
such a reducer is shown below.
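Again, only the line numbers of the reducer listing are referenced in the original slide, so this is a hedged sketch of such a SumReducer, assuming the durations arrive as whole minutes and the threshold is 60 minutes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums STD call minutes per phone number and emits only
// the subscribers whose total exceeds 60 minutes.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int totalMinutes = 0;
    for (IntWritable value : values) {
      totalMinutes += value.get();
    }
    if (totalMinutes > 60) {
      context.write(key, new IntWritable(totalMinutes));
    }
  }
}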
5. The driver class holds all the job configuration and registers the mapper, reducer,
and, optionally, a combiner class. It is responsible for setting up a MapReduce job to
run on the Hadoop cluster. You specify the names of the Mapper and Reducer classes
along with their data types and the job name.
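A hedged sketch of such a driver follows, reusing the class name STDSubscribers from the run command in step 7; the job name and the decision to omit a combiner are assumptions.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: wires the mapper and reducer for the STD-call job.
public class STDSubscribers {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(STDSubscribers.class);
    job.setJobName("STD subscribers over 60 minutes");
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. sampleMRIn/calldatarecords.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. sampleMROutput-2
    job.setMapperClass(TokenizerMapper.class);
    // A summing-only combiner (without the 60-minute filter) could be set here;
    // SumReducer itself is not safe as a combiner because it drops partial sums.
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}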
6. Now, package the classes as a .jar file, transfer it to the Hadoop cluster, and run it
on top of YARN.
You can locate your call records file using hdfs dfs -ls “Location of the file”
7. Now, submit the call records file for processing. In the command below, give the
class name, the input file location, and another location where the output should be
saved.
hadoop jar STDSubscribers.jar org.example.hadoopcodes.STDSubscribers
sampleMRIn/calldatarecords.txt sampleMROutput-2
8. Once you run the above command successfully, you can see the output by
checking the directory.
hdfs dfs -cat sampleMROutput-2/part-r-00000
Introduction To Map Reduce
Programs:
 Hadoop MapReduce is a software framework for easily
writing applications that process vast amounts of data in
parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
 The term MapReduce actually refers to the following two
different tasks that Hadoop programs perform:
The Map Task: This is the first task, which takes input
data and converts it into a set of data, where individual
elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map
task as input and combines those data tuples into a
smaller set of tuples. The reduce task is always
performed after the map task.
Typically both the input and the output are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks. The MapReduce framework consists of a single master JobTracker
and one slave TaskTracker per cluster-node. The master is responsible for
resource management, tracking resource consumption/availability and
scheduling the job's component tasks on the slaves, monitoring them, and
re-executing the failed tasks.
The TaskTracker slaves execute the tasks as directed by the master and provide
task-status
information to the master periodically. The JobTracker is a single point of failure for
the Hadoop MapReduce service which means if JobTracker goes down, all running
jobs are halted.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely
a configuration change.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster. The framework manages all the details of
data-passing such as issuing tasks, verifying task completion, and copying
data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which
reduces network traffic. After completion of the given tasks, the cluster
collects and reduces the data to form an appropriate result, and sends it back
to the Hadoop server.
Topic – 1: A Weather Dataset
MapReduce is a programming model for data processing.
For our example, we will write a program that mines
weather data. Weather sensors collecting data every hour
at many locations across the globe gather a large volume of
log data, which is a good candidate for analysis with
MapReduce, since it is semistructured and record-oriented.
Data Format:
The data we will use is from the National Climatic Data
Center (NCDC, http://www.ncdc.noaa.gov/). The data is
stored using a line-oriented ASCII format, in which each line
is a record. The format supports a rich set of meteorological
elements, many of which are optional or with variable data
lengths. For simplicity, we shall focus on the basic
elements, such as temperature, which are always present
and are of fixed width.
Example : Format of a National Climate Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
Data files are organized by date and weather station. There is a directory for
each year from 1901 to 2001, each containing a gzipped file for each weather
station with its readings for that year. For example, here are the first entries for
1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is
made up of a large number of relatively small files. It’s generally easier and
more efficient to process a smaller number of relatively large files, so the data
was preprocessed so that each year’s readings were concatenated into a
single file.
Example : A program for finding the maximum recorded temperature by
year from NCDC weather records
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data:
the air temperature and the quality code. The air temperature value is turned
into an integer by adding 0. Next, a test is applied to see if the temperature is
valid (the value 9999 signifies a missing value in the NCDC dataset) and if the
quality code indicates that the reading is not suspect or erroneous. If the
reading is OK, the value is compared with the maximum value seen so far,
which is updated if a new maximum is found. The END block is executed after
all the lines in the file have been processed, and it prints the maximum value.
Note: “What is meant by “awk”
awk (also written as Awk and AWK) is a utility that enables a programmer to write
tiny but effective programs in the form of statements that define text patterns that
are to be searched for in each line of a document and the action that is to be taken
when a match is found within a line. awk comes with most UNIX-based operating
systems such as Linux, and also with some other operating systems, such as
Windows 95/98/NT.
An awk program is made up of patterns and actions to be performed when a
pattern match is found. awk scans input lines sequentially and examines each one
to determine whether it contains a pattern matching one specified by the user.
When the matching pattern is found, awk carries out the instructions in the
program. For example, awk could scan text for a critical portion and reformat the
text contained in it according to the user's command. If no pattern is specified, the
program will carry out the command on all of the input data.
awk breaks each line into fields, which are groups of characters with spaces acting
as separators, so that a word, for example, would be a field. A search pattern is
enclosed in slashes and the actions to be performed are enclosed in curly brackets.
Fields are referenced with "$", with "$0" referring to the entire line. So, for example,
to search for a line containing the word "nutmeg," and to print each line in which the
word occurs, the awk program would consist of:
/nutmeg/ { print $0 }
The name "awk" is derived from the names of its three developers: Alfred Aho,
Peter Weinberger, and Brian Kernighan.
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum
temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is
plausible). The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra
Large Instance.
Analyzing the Data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a
MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
Map and Reduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in
the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file,
but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these are the only fields we
are interested in. In this case, the map function is just a data preparation phase, setting up the data in such
a way that the reducer function can do its work on it: finding the maximum temperature for each year. The
map function is also a good place to drop bad records: here we filter out temperatures that are missing,
suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input
data (some unused columns have been dropped to fit the page, indicated by
ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature (indicated in
bold text), and emits them as its output (the temperature values have been
interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the
key-value pairs by key. So, continuing the example, our reduce function sees the
following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum
reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each
year.
The whole data flow is illustrated in the below Figure. At the bottom of the
diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which
we will see again later in the chapter when we look at Hadoop Streaming.
Java MapReduce:
Having run through how the MapReduce program works, the next step is to
express it in code. We need three things: a map function, a reduce
function, and some code to run the job. The map function is represented by
the Mapper class, which declares an abstract map() method. The below
example shows the implementation of our map method.
Example: Mapper for maximum temperature example :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
The Mapper class is a generic type, with four formal type parameters that specify
the input key, input value, output key, and output value types of the map function.
For the present example, the input key is a long integer offset, the input value is a
line of text, the output key is a year, and the output value is an air temperature (an
integer). Rather than use built-in Java types, Hadoop provides its own set of basic
types that are optimized for network serialization. These are found in the
org.apache.hadoop.io package. Here we use LongWritable, which corresponds to
a Java Long, Text (like Java String), and IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value
containing the line of input into a Java String, then use its substring() method to
extract the columns we are interested in. The map() method also provides an
instance of Context to write the output to. In this case, we write the year as a Text
object (since we are just using it as a key), and the temperature is wrapped in an
IntWritable. We write an output record only if the temperature is present and the
quality code indicates that the temperature reading is OK.
Example: Reducer for maximum temperature example:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Again, four formal type parameters are used to specify the input and output
types, this time for the reduce function. The input types of the reduce function
must match the output types of the map function: Text and IntWritable. And in this
case, the output types of the reduce function are Text and IntWritable, for a year
and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
The third piece of code runs the MapReduce job as shown in the below example.
Example: Application to find the maximum temperature in the weather
dataset:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
A Job object forms the specification of the job. It gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which
Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR
file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate
the relevant JAR file by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be a single
file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As
the name suggests, addInputPath() can be called more than once to use input from multiple
paths.
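For instance, a driver could take input from several directories in one job; the paths below are hypothetical, and this fragment would sit inside main() after the Job is constructed:

// Hypothetical fragment: multiple input paths for a single job.
FileInputFormat.addInputPath(job, new Path("ncdc/1901"));
FileInputFormat.addInputPath(job, new Path("ncdc/1902"));
// A glob pattern such as new Path("ncdc/19*") could be used instead of listing
// the directories one by one.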
The output path (of which there is only one) is specified by the static setOutputPath() method
on FileOutputFormat. It specifies a directory where the output files from the reducer functions
are written. The directory shouldn’t exist before running the job, as Hadoop will complain and
not run the job. This precaution is to prevent data loss.
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
map and the reduce functions, which are often the same, as they are in our case. If they are
different, then the map output types can be set using the methods setMapOutputKeyClass()
and setMapOutputValueClass().
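As a hypothetical illustration (not part of the weather example, whose map and reduce output types are the same), a driver whose reducer emits a different value type could declare the types like this inside main():

// Hypothetical: map emits (Text, IntWritable), reduce emits (Text, DoubleWritable).
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class); // requires org.apache.hadoop.io.DoubleWritable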
The input types are controlled via the input format, which we have not explicitly set since we are
using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish. The method’s
boolean argument is a verbose flag, so in this case the job writes information about its progress
to the console.
Topic – 2: The Old and New Java MapReduce APIs
Topic – 3: Basic programs of Hadoop MapReduce
Driver code, Mapper code, Reducer code, RecordReader, Combiner,
Partitioner
Hadoop Data Types:
Despite our many discussions regarding keys and values, we have yet to
mention their types. The MapReduce framework won’t allow them to be any
arbitrary class. For example, although we can and often do talk about certain
keys and values as integers, strings, and so on, they aren’t exactly standard
Java classes, such as Integer, String, and so forth. This is because the
MapReduce framework has a certain defined way of serializing the key/value
pairs to move them across the cluster’s network, and only classes that
support this kind of serialization can function as keys or values in the
framework.
More specifically, classes that implement the Writable interface can be values,
and classes that implement the WritableComparable<T> interface can be
either keys or values. Note that the WritableComparable<T> interface is a
combination of the Writable and java.lang.Comparable<T> interfaces . We
need the comparability requirement for keys because they will be sorted at
the reduce stage, whereas values are simply passed through.
Hadoop comes with a number of predefined classes that implement
WritableComparable, including wrapper classes for the basic Java data types:
BooleanWritable (boolean), ByteWritable (byte), IntWritable (int), LongWritable
(long), FloatWritable (float), DoubleWritable (double), and Text (String).
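Beyond the built-in wrappers, a user-defined type can serve as a key by implementing WritableComparable itself. The class below is a hypothetical sketch (not from these notes) of a composite key holding a year and a station identifier:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a year plus a station id, usable as a MapReduce key
// because it can be serialized (write/readFields) and sorted (compareTo).
public class YearStationKey implements WritableComparable<YearStationKey> {
  private int year;
  private String stationId = "";

  public YearStationKey() { }                      // required no-arg constructor
  public YearStationKey(int year, String stationId) {
    this.year = year;
    this.stationId = stationId;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeUTF(stationId);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    stationId = in.readUTF();
  }

  @Override
  public int compareTo(YearStationKey other) {     // sort by year, then station id
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : stationId.compareTo(other.stationId);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearStationKey)) return false;
    YearStationKey k = (YearStationKey) o;
    return year == k.year && stationId.equals(k.stationId);
  }

  @Override
  public int hashCode() {                          // used by the default HashPartitioner
    return 31 * year + stationId.hashCode();
  }

  @Override
  public String toString() {
    return year + "\t" + stationId;
  }
}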