MapReduce
Unit 2
Objective
What is Apache MapReduce?
Features of MapReduce
How does the Hadoop
MapReduce algorithm work?
MapReduce Example
What is Apache MapReduce?
 Apache MapReduce is the processing engine of Hadoop
that processes and computes vast volumes of data. The
MapReduce programming paradigm allows you to scale the
processing of unstructured data across hundreds or
thousands of commodity servers in an Apache Hadoop
cluster.
 It has two main components or phases, the map
phase and the reduce phase.
 The input data is fed to the mapper phase to map the
data. The shuffle, sort, and reduce operations are
then performed to give the final output.
The MapReduce programming paradigm offers several features
and benefits that help gain insights from vast volumes of data.
Features of MapReduce
 MapReduce helps organizations process, in parallel, vast
amounts of data stored in the Hadoop Distributed File
System (HDFS).
 It reduces the processing time and supports faster
processing of data. This is because all the nodes are
working with their part of the data, in parallel.
 Developers can write MapReduce codes in a range
of languages such as Java, C++, and Python.
 It is fault-tolerant: in case of a failure, it uses the replicated
copies of the blocks on other machines to continue
processing.
How Does the Hadoop MapReduce Algorithm Work?
Let’s understand how the MapReduce algorithm works by
understanding the job execution flow in detail.
 The input data to process using the MapReduce task is
stored in input files that reside on HDFS.
 The input format defines the input specification and how
the input files are split and read.
 The input split logically represents the data to be
processed by an individual Mapper.
 The record reader communicates with the input split and
converts the data into key-value pairs suitable for reading
by the mapper (k, v).
 The mapper class processes input records from the
RecordReader and generates intermediate key-value pairs
(k’, v’). The map logic is applied in parallel to the ‘n’ data
blocks spread across the various data nodes.
 The combiner is a mini reducer that runs on each mapper's
output: for every mapper, there is one combiner. It is used to
optimize the performance of MapReduce jobs by reducing the
data transferred to the reducers.
 The partitioner decides which reducer each intermediate key
(and the output of its combiner, if any) is sent to; a minimal
custom partitioner is sketched after this list.
 The output of the partitioner is shuffled and sorted: the
values are grouped by key, and this grouped output is fed as
input to the reducer. For each intermediate key, the reducer
receives the list of its intermediate values and combines
them into the final output tuples.
 The record writer writes these output key-value pairs
from the reducer to the output files. The output data is
stored on the HDFS.
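To make the combiner and partitioner steps concrete, here is a minimal sketch of a custom partitioner. It is not taken from these notes; the class name and the rule (routing keys by their first character) are illustrative assumptions, and the key/value types match the word-count example that follows.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: picks a reducer for each intermediate (word, count)
// pair based on the first character of the key. Returning the same partition
// number for equal keys guarantees that all values for a key reach one reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0; // map-only jobs have no partitions to choose from
    }
    String k = key.toString();
    char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
    return first % numReduceTasks; // always in the range [0, numReduceTasks)
  }
}

In the driver, such a class would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); by default Hadoop uses HashPartitioner, which simply hashes the key.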
MapReduce workflow
Shown below is a MapReduce example to count the frequency of each
word in a given input text. Our input text is, “Big data comes in various
formats. This data can be stored in multiple data servers.”
MapReduce Example to count the occurrences of words
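The word-count code appeared only as a figure in the original deck, so here is a hedged sketch of what such a job's mapper and reducer could look like (the class names WordCountMapper and WordCountReducer are illustrative assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative mapper: emits (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Illustrative reducer: sums the counts for each word.
// (Package-private here only so both classes fit in one sketch file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

For the sample sentence above, such a reducer would emit pairs like (data, 3) and (in, 2); note that punctuation stays attached to tokens such as "formats." unless it is stripped first.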
Call Records
 Shown below is a sample data of call records. It has
the information regarding phone numbers from which
the call was made, and to which phone number it was
made. The data also gives information about the total
duration of each call. It also tells you if the call made
was a local (0) or an STD call (1).
 We will use this data to perform certain operations with the
help of a MapReduce algorithm. One of the operations you
can perform is to find all the phone numbers that made more
than 60 minutes of STD calls. We will use the Java
programming language for this task.
1. Let’s first declare our constants for the fields.
2. Import all the necessary packages to make sure
we use the classes in the right way.
3. The order of the driver, mapper, and reducer class does
not matter. So, let’s create a mapper that will do the map
task.
•Create a TokenizerMapper that extends the Mapper class and accepts the
desired data types (line 69-70).
•Assign the phone numbers and the duration of the calls in minutes (line 72-73).
•The map task works on a string and breaks it into individual elements based on
a delimiter (line 75-78).
•Then, check whether the record has the STD flag set (line 79).
•If it does, set the phone numbers using the constants class and find the duration
(line 81-83).
•Finally, emit the phone number and the duration of each STD call; a hedged
sketch of such a mapper appears below.
 This mapper class will return an intermediate output,
which would then be sorted and shuffled and passed
on to the reducer.
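The mapper listing itself is not reproduced in these notes (only its line numbers are referenced), so below is a hedged sketch of what such a TokenizerMapper could look like. The record layout, the pipe delimiter, and the field positions (caller number, STD flag, duration in minutes) are assumptions made purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (callerPhoneNumber, durationInMinutes) for STD calls only.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Assumed field positions in each pipe-delimited call record (illustrative only).
  private static final int CALLER = 0;
  private static final int STD_FLAG = 4;
  private static final int DURATION = 5;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\\|");
    if (fields.length > DURATION && "1".equals(fields[STD_FLAG].trim())) { // 1 = STD call
      Text phoneNumber = new Text(fields[CALLER].trim());
      IntWritable duration = new IntWritable(Integer.parseInt(fields[DURATION].trim()));
      context.write(phoneNumber, duration);
    }
  }
}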
4. Next, we define our reducer class.
So, we define a reducer class called SumReducer. The reducer uses the data types
specific to Hadoop MapReduce (line 50-52).
The reduce (Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs. The output of the reduce task is written to a
RecordWriter via TaskInputOutputContext.write(Object, Object) (line 54-56).
It looks at all the keys and values. Wherever it finds keys that repeat and whose total
duration is more than 60 minutes, it returns an aggregated result. A hedged sketch of
such a reducer is shown below.
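Again, only the line numbers of the reducer listing are referenced in the original slide, so this is a hedged sketch of such a SumReducer, assuming the durations arrive as whole minutes and the threshold is 60 minutes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums STD call minutes per phone number and emits only
// the subscribers whose total exceeds 60 minutes.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int totalMinutes = 0;
    for (IntWritable value : values) {
      totalMinutes += value.get();
    }
    if (totalMinutes > 60) {
      context.write(key, new IntWritable(totalMinutes));
    }
  }
}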
5. The driver class holds all the job configuration and registers the mapper, reducer,
and, optionally, a combiner class. It is responsible for setting up a MapReduce job to
run on the Hadoop cluster. You specify the names of the Mapper and Reducer classes
along with their data types and the job name.
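A hedged sketch of such a driver follows, reusing the class name STDSubscribers from the run command in step 7; the job name and the decision to omit a combiner are assumptions.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: wires the mapper and reducer for the STD-call job.
public class STDSubscribers {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(STDSubscribers.class);
    job.setJobName("STD subscribers over 60 minutes");
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. sampleMRIn/calldatarecords.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. sampleMROutput-2
    job.setMapperClass(TokenizerMapper.class);
    // A summing-only combiner (without the 60-minute filter) could be set here;
    // SumReducer itself is not safe as a combiner because it drops partial sums.
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}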
6. Now, package the classes as a .jar file, transfer it to the Hadoop cluster, and run it
on top of YARN.
You can locate your call records file using hdfs dfs -ls “Location of the file”
7. Now, submit the call records file for processing. In the command below, give the
class name, the input file location, and another location where the output should be
saved.
hadoop jar STDSubscribers.jar org.example.hadoopcodes.STDSubscribers
sampleMRIn/calldatarecords.txt sampleMROutput-2
8. Once you run the above command successfully, you can see the output by
checking the directory.
hdfs dfs -cat sampleMROutput-2/part-r-00000
Introduction To Map Reduce
Programs:
 Hadoop MapReduce is a software framework for easily
writing applications that process vast amounts of data in
parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
 The term MapReduce actually refers to the following two
different tasks that Hadoop programs perform:
The Map Task: This is the first task, which takes input
data and converts it into a set of data, where individual
elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map
task as input and combines those data tuples into a
smaller set of tuples. The reduce task is always
performed after the map task.
Typically both the input and the output are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks. The MapReduce framework consists of a single master JobTracker
and one slave TaskTracker per cluster-node. The master is responsible for
resource management, tracking resource consumption/availability and
scheduling the job's component tasks on the slaves, monitoring them, and
re-executing the failed tasks.
The TaskTracker slaves execute the tasks as directed by the master and provide
task-status
information to the master periodically. The JobTracker is a single point of failure for
the Hadoop MapReduce service which means if JobTracker goes down, all running
jobs are halted.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely
a configuration change.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster. The framework manages all the details of
data-passing such as issuing tasks, verifying task completion, and copying
data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which
reduces network traffic. After completion of the given tasks, the cluster
collects and reduces the data to form an appropriate result, and sends it back
to the Hadoop server.
Topic – 1: A Weather Dataset
MapReduce is a programming model for data processing.
For our example, we will write a program that mines
weather data. Weather sensors collecting data every hour
at many locations across the globe gather a large volume of
log data, which is a good candidate for analysis with
MapReduce, since it is semistructured and record-oriented.
Data Format:
The data we will use is from the National Climatic Data
Center (NCDC, http://www.ncdc.noaa.gov/). The data is
stored using a line-oriented ASCII format, in which each line
is a record. The format supports a rich set of meteorological
elements, many of which are optional or with variable data
lengths. For simplicity, we shall focus on the basic
elements, such as temperature, which are always present
and are of fixed width.
Example : Format of a National Climate Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
Data files are organized by date and weather station. There is a directory for
each year from 1901 to 2001, each containing a gzipped file for each weather
station with its readings for that year. For example, here are the first entries for
1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is
made up of a large number of relatively small files. It’s generally easier and
more efficient to process a smaller number of relatively large files, so the data
was preprocessed so that each year’s readings were concatenated into a
single file.
Example : A program for finding the maximum recorded temperature by
year from NCDC weather records
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data:
the air temperature and the quality code. The air temperature value is turned
into an integer by adding 0. Next, a test is applied to see if the temperature is
valid (the value 9999 signifies a missing value in the NCDC dataset) and if the
quality code indicates that the reading is not suspect or erroneous. If the
reading is OK, the value is compared with the maximum value seen so far,
which is updated if a new maximum is found. The END block is executed after
all the lines in the file have been processed, and it prints the maximum value.
Note: “What is meant by “awk”
awk (also written as Awk and AWK) is a utility that enables a programmer to write
tiny but effective programs in the form of statements that define text patterns that
are to be searched for in each line of a document and the action that is to be taken
when a match is found within a line. awk comes with most UNIX-based operating
systems such as Linux, and also with some other operating systems, such as
Windows 95/98/NT.
An awk program is made up of patterns and actions to be performed when a
pattern match is found. awk scans input lines sequentially and examines each one
to determine whether it contains a pattern matching one specified by the user.
When the matching pattern is found, awk carries out the instructions in the
program. For example, awk could scan text for a critical portion and reformat the
text contained in it according to the user's command. If no pattern is specified, the
program will carry out the command on all of the input data.
awk breaks each line into fields, which are groups of characters with spaces acting
as separators, so that a word, for example, would be a field. A search pattern is
enclosed in slashes and the actions to be performed are enclosed in curly brackets.
Fields are referenced with "$", with "$0" referring to the entire line. So, for example,
to search for a line containing the word "nutmeg," and to print each line in which the
word occurs, the awk program would consist of:
/nutmeg/ { print $0 }
The name "awk" is derived from the names of its three developers: Alfred Aho,
Peter Weinberger, and Brian Kernighan.
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum
temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is
plausible). The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra
Large Instance.
Analyzing the Data with Hadoop:
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a
MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
Map and Reduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in
the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file,
but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these are the only fields we
are interested in. In this case, the map function is just a data preparation phase, setting up the data in such
a way that the reducer function can do its work on it: finding the maximum temperature for each year. The
map function is also a good place to drop bad records: here we filter out temperatures that are missing,
suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input
data (some unused columns have been dropped to fit the page, indicated by
ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature (indicated in
bold text), and emits them as its output (the temperature values have been
interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the
key-value pairs by key. So, continuing the example, our reduce function sees the
following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum
reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each
year.
The whole data flow is illustrated in the below Figure. At the bottom of the
diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which
we will see again later in the chapter when we look at Hadoop Streaming.
Java MapReduce:
Having run through how the MapReduce program works, the next step is to
express it in code. We need three things: a map function, a reduce
function, and some code to run the job. The map function is represented by
the Mapper class, which declares an abstract map() method. The below
example shows the implementation of our map method.
Example: Mapper for maximum temperature example :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
The Mapper class is a generic type, with four formal type parameters that specify
the input key, input value, output key, and output value types of the map function.
For the present example, the input key is a long integer offset, the input value is a
line of text, the output key is a year, and the output value is an air temperature (an
integer). Rather than use built-in Java types, Hadoop provides its own set of basic
types that are optimized for network serialization. These are found in the
org.apache.hadoop.io package. Here we use LongWritable, which corresponds to
a Java Long, Text (like Java String), and IntWritable (like Java Integer).
The map() method is passed a key and a value. We convert the Text value
containing the line of input into a Java String, then use its substring() method to
extract the columns we are interested in. The map() method also provides an
instance of Context to write the output to. In this case, we write the year as a Text
object (since we are just using it as a key), and the temperature is wrapped in an
IntWritable. We write an output record only if the temperature is present and the
quality code indicates that the temperature reading is OK.
Example: Reducer for maximum temperature example:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Again, four formal type parameters are used to specify the input and output
types, this time for the reduce function. The input types of the reduce function
must match the output types of the map function: Text and IntWritable. And in this
case, the output types of the reduce function are Text and IntWritable, for a year
and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
The third piece of code runs the MapReduce job as shown in the below example.
Example: Application to find the maximum temperature in the weather
dataset:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
A Job object forms the specification of the job. It gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which
Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR
file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate
the relevant JAR file by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be a single
file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As
the name suggests, addInputPath() can be called more than once to use input from multiple
paths.
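For instance, a driver could take input from several directories in one job; the paths below are hypothetical, and this fragment would sit inside main() after the Job is constructed:

// Hypothetical fragment: multiple input paths for a single job.
FileInputFormat.addInputPath(job, new Path("ncdc/1901"));
FileInputFormat.addInputPath(job, new Path("ncdc/1902"));
// A glob pattern such as new Path("ncdc/19*") could be used instead of listing
// the directories one by one.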
The output path (of which there is only one) is specified by the static setOutputPath() method
on FileOutputFormat. It specifies a directory where the output files from the reducer functions
are written. The directory shouldn’t exist before running the job, as Hadoop will complain and
not run the job. This precaution is to prevent data loss.
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
map and the reduce functions, which are often the same, as they are in our case. If they are
different, then the map output types can be set using the methods setMapOutputKeyClass()
and setMapOutputValueClass().
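As a hypothetical illustration (not part of the weather example, whose map and reduce output types are the same), a driver whose reducer emits a different value type could declare the types like this inside main():

// Hypothetical: map emits (Text, IntWritable), reduce emits (Text, DoubleWritable).
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class); // requires org.apache.hadoop.io.DoubleWritable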
The input types are controlled via the input format, which we have not explicitly set since we are
using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish. The method’s
boolean argument is a verbose flag, so in this case the job writes information about its progress
to the console.
Topic – 2: The Old and New Java MapReduce APIs
Topic – 3: Basic programs of Hadoop MapReduce
Driver code, Mapper code, Reducer code, RecordReader, Combiner,
Partitioner
Hadoop Data Types:
Despite our many discussions regarding keys and values, we have yet to
mention their types. The MapReduce framework won’t allow them to be any
arbitrary class. For example, although we can and often do talk about certain
keys and values as integers, strings, and so on, they aren’t exactly standard
Java classes, such as Integer, String, and so forth. This is because the
MapReduce framework has a certain defined way of serializing the key/value
pairs to move them across the cluster’s network, and only classes that
support this kind of serialization can function as keys or values in the
framework.
More specifically, classes that implement the Writable interface can be values,
and classes that implement the WritableComparable<T> interface can be
either keys or values. Note that the WritableComparable<T> interface is a
combination of the Writable and java.lang.Comparable<T> interfaces . We
need the comparability requirement for keys because they will be sorted at
the reduce stage, whereas values are simply passed through.
Hadoop comes with a number of predefined classes that implement
WritableComparable, including wrapper classes for the basic Java data types:
BooleanWritable (boolean), ByteWritable (byte), IntWritable (int), LongWritable
(long), FloatWritable (float), DoubleWritable (double), and Text (String).
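Beyond the built-in wrappers, a user-defined type can serve as a key by implementing WritableComparable itself. The class below is a hypothetical sketch (not from these notes) of a composite key holding a year and a station identifier:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a year plus a station id, usable as a MapReduce key
// because it can be serialized (write/readFields) and sorted (compareTo).
public class YearStationKey implements WritableComparable<YearStationKey> {
  private int year;
  private String stationId = "";

  public YearStationKey() { }                      // required no-arg constructor
  public YearStationKey(int year, String stationId) {
    this.year = year;
    this.stationId = stationId;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeUTF(stationId);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    stationId = in.readUTF();
  }

  @Override
  public int compareTo(YearStationKey other) {     // sort by year, then station id
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : stationId.compareTo(other.stationId);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearStationKey)) return false;
    YearStationKey k = (YearStationKey) o;
    return year == k.year && stationId.equals(k.stationId);
  }

  @Override
  public int hashCode() {                          // used by the default HashPartitioner
    return 31 * year + stationId.hashCode();
  }

  @Override
  public String toString() {
    return year + "\t" + stationId;
  }
}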