Hadoop - Mapper In MapReduce

Last Updated : 31 Jul, 2025

In Hadoop’s MapReduce framework, the Mapper is the core component of the Map Phase, responsible for processing raw input data and converting it into a structured form (key-value pairs) that Hadoop can efficiently handle.

A Mapper is a user-defined Java class that takes input splits (chunks of data from HDFS), processes each record and emits intermediate key-value pairs. These pairs are then shuffled and sorted before being passed to the Reducer (or directly stored in case of a Map-only job).

For example:

class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Parameters :

  • KEYIN : Input key (e.g., line offset in a file).
  • VALUEIN : Input value (e.g., a line of text).
  • KEYOUT : Output key (e.g., word).
  • VALUEOUT : Output value (e.g., integer count).
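
For instance, a hypothetical mapper that keeps only lines containing "ERROR" could bind the four type parameters as follows (a minimal sketch; the class name ErrorLineMapper is illustrative, not part of the article):

Java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN = LongWritable (byte offset), VALUEIN = Text (the input line),
// KEYOUT = Text (the matching line), VALUEOUT = NullWritable (no value needed)
public class ErrorLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get());  // emit the whole matching line as the key
        }
    }
}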

Mapper Workflow

The Mapper’s task is completed with the help of five key components:

[Figure: Mapper in Hadoop MapReduce]

1. Input

The Mapper process starts with the input, which consists of raw datasets stored in HDFS. An InputFormat is used to locate and interpret this data so it can be processed properly.
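
For example, in a driver class the input location and InputFormat are usually declared as shown below (a minimal sketch; the HDFS path and the pre-existing Job object named job are assumptions, and FileInputFormat/TextInputFormat come from org.apache.hadoop.mapreduce.lib.input):

Java
// Point the job at the raw data in HDFS and use the default line-oriented format
FileInputFormat.addInputPath(job, new Path("/data/input"));  // illustrative input path
job.setInputFormatClass(TextInputFormat.class);              // TextInputFormat is the default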

2. Input Splits

The input is divided into input splits, allowing Hadoop to process data in parallel. Each split is handled by a separate Mapper task. The split size can be configured with mapreduce.input.fileinputformat.split.maxsize (formerly mapred.max.split.size), and the number of Mappers is calculated as:

Number of Mappers = Total Data Size / Input Split Size

For example, a 10 TB file (10 × 1,048,576 MB) with 128 MB splits results in 81,920 Mappers.
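
A minimal sketch of controlling the split size from a driver (assuming the newer org.apache.hadoop.mapreduce API and an existing Job named job):

Java
// Cap each input split at 128 MB; a 10 TB input then yields 81,920 map tasks
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
// Optionally raise the lower bound to avoid very small splits
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);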

3. RecordReader

Each split is then converted into key-value pairs by a RecordReader. By default, Hadoop uses TextInputFormat, where the key is the byte offset of a line and the value is the text itself.
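
For instance, with the two-line input used in the WordCount example below (a plain text file with Unix line endings), the default LineRecordReader hands the Mapper these pairs, where each key is the byte offset at which the line starts:

(0, "Hello Hadoop")
(13, "Hello Mapper")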

4. Map Function

The map() function contains the user-defined logic. It processes each key-value pair and produces intermediate key-value pairs, which serve as input for the Reduce phase.

5. Intermediate Output (Local Disk)

The Mapper’s output is stored temporarily, first in an in-memory buffer (100 MB by default, configurable via mapreduce.task.io.sort.mb, formerly io.sort.mb). When the buffer fills up, the data is spilled to the local disk. These results are not written to HDFS unless it is a Map-only job with no Reducer.
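
A minimal driver-side sketch of the related settings (the buffer size of 256 is only an example value; the property name applies to Hadoop 2.x and later):

Java
Configuration conf = job.getConfiguration();
conf.setInt("mapreduce.task.io.sort.mb", 256);  // enlarge the in-memory sort buffer (default 100 MB)
job.setNumReduceTasks(0);                       // Map-only job: Mapper output is written directly to HDFS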

Example: WordCount Mapper

The WordCount program demonstrates the Mapper’s role clearly.

Java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reusable Writable objects: the constant count 1 and the current word
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] words = value.toString().split("\\s+"); // Split line into words

        for (String w : words) {
            word.set(w);
            context.write(word, one);  // Emit (word, 1)
        }
    }
}
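
To run this Mapper, it has to be wired into a job by a driver class. A minimal sketch is shown below; the WordCountReducer referenced here is assumed to exist (it would sum the 1s per word) and is not part of the listing above:

Java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);   // assumed companion Reducer (not shown)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}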

Input:

Hello Hadoop
Hello Mapper

Mapper Output (Intermediate Data):

(Hello, 1)
(Hadoop, 1)
(Hello, 1)
(Mapper, 1)

Explanation:

  • Mapper Definition : extends Mapper<LongWritable, Text, Text, IntWritable> defines input (line offset, line text) and output (word, count).
  • Setup: IntWritable one = new IntWritable(1) holds the constant count of 1, and Text word is reused to store each emitted word.
  • In map(), each line (value) is split into words using whitespace.
  • For every word, context.write(word, one) emits (word, 1) as the intermediate key-value pair.

Key Features of Hadoop Mapper

  • Parallelism : Each input split is handled by a separate Mapper task running in parallel.
  • Intermediate Data : Produces temporary key-value pairs for the Reducer.
  • Flexibility : Logic can be customized depending on the use case (filtering, parsing, transformation).
  • Map-Only Jobs : If no Reducer is needed, Mapper output itself can be written to HDFS.
  • Local Storage of Output : To avoid replication overhead, intermediate results are kept on local disk until shuffled.

How to Calculate the Number of Mappers in Hadoop

The number of Mappers is determined by the input split size, not directly by the number of HDFS blocks. Each split is handled by one Mapper task. By default, the split size equals the HDFS block size (e.g., 128 MB), but it can be configured.

Formula:

Number of Mappers = Total Data Size / Input Split Size

Example: For a dataset of 10 TB (10 × 1,048,576 MB = 10,485,760 MB) with a split size of 128 MB:

10,485,760 MB ÷ 128 MB = 81,920 Mappers
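
As a quick sanity check, the same arithmetic as a minimal runnable sketch (the class name is illustrative):

Java
public class MapperCount {
    public static void main(String[] args) {
        long totalMb = 10L * 1024 * 1024;       // 10 TB expressed in MB (binary units)
        long splitMb = 128;                     // input split size in MB
        System.out.println(totalMb / splitMb);  // prints 81920
    }
}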

