Hadoop - Mapper In MapReduce
Last Updated: 31 Jul, 2025
In Hadoop’s MapReduce framework, the Mapper is the core component of the Map Phase, responsible for processing raw input data and converting it into a structured form (key-value pairs) that Hadoop can efficiently handle.
A Mapper is a user-defined Java class that takes input splits (chunks of data from HDFS), processes each record and emits intermediate key-value pairs. These pairs are then shuffled and sorted before being passed to the Reducer (or directly stored in case of a Map-only job).
For Example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Parameters :
- KEYIN : Input key (e.g., line offset in a file).
- VALUEIN : Input value (e.g., a line of text).
- KEYOUT : Output key (e.g., word).
- VALUEOUT : Output value (e.g., integer count).
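As a minimal illustration (a hypothetical mapper, not part of the WordCount example later in this article), the sketch below shows how the four generic parameters are bound to concrete Writable types:
Java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN = LongWritable (byte offset), VALUEIN = Text (the line),
// KEYOUT = Text (the line itself), VALUEOUT = IntWritable (its length in bytes)
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one (line, length) pair per input record
        context.write(value, new IntWritable(value.getLength()));
    }
}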
Mapper Workflow
The Mapper’s task is completed with the help of five key components:

1. Input
The Mapper process starts with the input, which consists of raw datasets stored in HDFS. An InputFormat is used to locate and interpret this data so it can be processed properly.
2. Input Splits
The input is divided into input splits, allowing Hadoop to process data in parallel. Each split is handled by a separate Mapper task. The split size can be configured with mapred.max.split.size (renamed mapreduce.input.fileinputformat.split.maxsize in newer releases), and the number of Mappers is calculated as:
Number of Mappers = Total Data Size / Input Split Size
For example, a 10 TB file with 128 MB splits results in about 81,920 Mappers.
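As a rough sketch of how the split size can be tuned in a job driver (class and job names here are placeholders), the maximum split size can be set either through the FileInputFormat helper or directly via the configuration property:
Java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Cap each split at 64 MB -> roughly twice as many Mappers as with 128 MB splits
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // Equivalent, via the configuration property itself
        job.getConfiguration().set("mapreduce.input.fileinputformat.split.maxsize",
                String.valueOf(64L * 1024 * 1024));
    }
}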
3. RecordReader
Each split is then converted into key-value pairs by a RecordReader. By default, Hadoop uses TextInputFormat, where the key is the byte offset of a line and the value is the text itself.
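For instance, with the two-line input used in the WordCount example below, TextInputFormat's RecordReader would hand the Mapper pairs like these (the second key is 13 because "Hello Hadoop" plus the newline occupies 13 bytes):
(0, "Hello Hadoop")
(13, "Hello Mapper")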
4. Map Function
The map() function contains the user-defined logic. It processes each key-value pair and produces intermediate key-value pairs, which serve as input for the Reduce phase.
5. Intermediate Output
The Mapper’s output is stored temporarily, first in an in-memory buffer (100 MB by default, configurable via io.sort.mb, now mapreduce.task.io.sort.mb). When the buffer is full, the data is spilled to the local disk. These results are not written to HDFS unless it is a Map-only job with no Reducer.
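A small sketch of raising that buffer in a driver (mapreduce.task.io.sort.mb is the current key; io.sort.mb is the older, deprecated spelling):
Java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 200 MB map-side sort buffer instead of the 100 MB default
        conf.set("mapreduce.task.io.sort.mb", "200");
        Job job = Job.getInstance(conf, "bigger-sort-buffer");
        // ... then set the Mapper, input/output paths, etc., as in the driver sketch later on
    }
}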
Example: WordCount Mapper
The WordCount program demonstrates the Mapper’s role clearly.
Java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+"); // Split the line into words on whitespace
        for (String w : words) {
            word.set(w);
            context.write(word, one); // Emit (word, 1) for each word
        }
    }
}
Input:
Hello Hadoop
Hello Mapper
Mapper Output (Intermediate Data):
(Hello, 1)
(Hadoop, 1)
(Hello, 1)
(Mapper, 1)
Explanation:
- Mapper Definition : extends Mapper<LongWritable, Text, Text, IntWritable> defines input (line offset, line text) and output (word, count).
- Setup: IntWritable one = new IntWritable(1) holds the constant count of 1, and Text word is a reusable object that stores each emitted word.
- In map(), each line (value) is split into words using whitespace.
- For every word, context.write(word, one) emits (word, 1) as the intermediate key-value pair.
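To show how this Mapper is plugged into a job, here is a minimal driver sketch (assuming the WordCountMapper class above is on the classpath; IntSumReducer is Hadoop's built-in reducer that sums the emitted 1s):
Java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);   // the Mapper shown above
        job.setReducerClass(IntSumReducer.class);    // sums the (word, 1) pairs

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}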
Key Features of Hadoop Mapper
- Parallelism : Each input split is handled by a separate Mapper task running in parallel.
- Intermediate Data : Produces temporary key-value pairs for the Reducer.
- Flexibility : Logic can be customized depending on the use case (filtering, parsing, transformation).
- Map-Only Jobs : If no Reducer is needed, Mapper output itself can be written to HDFS (see the sketch after this list).
- Local Storage of Output : To avoid replication overhead, intermediate results are kept on local disk until shuffled.
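A minimal sketch of that Map-only case, reusing the driver pattern above (the output key/value classes then describe the Mapper's own output types):
Java
// Inside a driver such as WordCountDriver above:
job.setNumReduceTasks(0);                    // no Reducer; shuffle and sort are skipped
job.setOutputKeyClass(Text.class);           // Mapper's output key type
job.setOutputValueClass(IntWritable.class);  // Mapper's output value type
// Each Mapper's output is now written directly to HDFS as part-m-NNNNN files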
How to Calculate the Number of Mappers in Hadoop
The number of Mappers is determined by the input split size, not directly by the number of HDFS blocks. Each split is handled by one Mapper task. By default, the split size equals the HDFS block size (e.g., 128 MB), but it can be configured.
Formula:
Number of Mappers = Total Data Size / Input Split Size
Example: For a dataset of 10 TB (10 × 1,048,576 MB = 10,485,760 MB) with a split size of 128 MB:
10,485,760 / 128 = 81,920 Mappers