
MapReduce in Hadoop

Last Updated: 04 Aug, 2025

MapReduce is the processing engine of Hadoop. While HDFS is responsible for storing massive amounts of data, MapReduce handles the actual computation and analysis. It provides a simple yet powerful programming model that allows developers to process large datasets in a distributed and parallel manner. It is a two-phase data processing model in Hadoop:

  • Map Phase: Breaks the data into smaller chunks, processes them, and generates intermediate (key, value) pairs.
  • Reduce Phase: Aggregates and merges the intermediate results to produce the final output.

[Figure: Map and Reduce interfaces]

How MapReduce Works – Step by Step

Example Input File: sample.txt

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths

When stored in HDFS, this file is logically divided into input splits. Each split is assigned to one Mapper for processing.

Step 1: Input Splitting & Record Reader

  • The input file is divided into splits.
  • Each split is converted into (key, value) pairs using a RecordReader.
  • In TextInputFormat (default), the key is the byte offset of the line and the value is the line itself.

Example:

(0, "Hello I am GeeksforGeeks")
(26, "How can I help you")

File Formats in Hadoop

The way a RecordReader converts text into (key, value) pairs depends on the input file format. Hadoop provides several built‑in formats:

  • TextInputFormat (default): Each line becomes (byte offset, line).
  • KeyValueTextInputFormat: Splits each line into a (key, value) pair using a delimiter (tab by default).
  • SequenceFileInputFormat: Reads binary (key, value) sequence files, which are efficient for data exchange between MapReduce jobs.
  • SequenceFileAsTextInputFormat: Reads binary sequence files but presents keys and values as text.

By default, Hadoop uses TextInputFormat.
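
The input format is chosen in the job driver. A sketch of switching to KeyValueTextInputFormat (the job name and the tab delimiter are illustrative; tab is also the default separator):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Character that splits each input line into key and value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "kv-input-example");
        // Replace the default TextInputFormat for this job.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}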

Step 2: Map Phase

Each Mapper processes its assigned (key, value) pairs and generates intermediate data.

For (0, "Hello I am GeeksforGeeks"):

(Hello, 1)
(I, 1)
(am, 1)
(GeeksforGeeks, 1)

For (25, "How can I help you"):

(How, 1)
(can, 1)
(I, 1)
(help, 1)
(you, 1)

All Mappers run in parallel, one per input split.
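
A minimal word-count Mapper matching this walkthrough might look like the sketch below (class and field names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}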

Step 3: Shuffling and Sorting

The Mapper outputs are not final. Before reducing:

  • Shuffling groups all values of identical keys:

(How, [1,1])
(Are, [1,1,1])
(I, [1,1,1])

  • Sorting arranges keys so that they can be processed sequentially.

This prepares data for the reducer.
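
Shuffling and sorting are performed by the framework, not by user code. The plain-Java sketch below only imitates the grouping to make the transformation concrete (the sample pairs are a subset of the mapper output above):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleDemo {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("How", 1), Map.entry("can", 1), Map.entry("How", 1),
            Map.entry("Are", 1), Map.entry("Are", 1), Map.entry("Are", 1));

        // Group values by key; TreeMap keeps keys sorted, as the framework does.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        grouped.forEach((k, v) -> System.out.println("(" + k + ", " + v + ")"));
    }
}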

Step 4: Reduce Phase

The Reducer aggregates values for each key:

(How, [1,1]) → (How, 2)
(Are, [1,1,1]) → (Are, 3)
(I, [1,1,1]) → (I, 3)
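
The matching word-count Reducer is equally short; this sketch simply sums the 1s collected for each key (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get(); // add up the 1s emitted by the mappers
        }
        result.set(sum);
        context.write(key, result);
    }
}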

Final output (written under the result.output directory as part-r-* files):

Are - 3
GeeksforGeeks - 1
Hello - 1
How - 2
I - 3
am - 1
are - 2
what - 2
you - 6
... (remaining words omitted)

Keys appear in sorted order because of the sort step; uppercase letters sort before lowercase in Hadoop's default Text ordering.
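
For completeness, a minimal driver that wires the Mapper and Reducer sketches above into one job; the hard-coded paths match this example, though real jobs usually take them from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);          // final key type: the word
        job.setOutputValueClass(IntWritable.class); // final value type: the count
        FileInputFormat.addInputPath(job, new Path("sample.txt"));
        FileOutputFormat.setOutputPath(job, new Path("result.output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}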

MapReduce Fundamentals

  • Number of Mappers = Number of Input Splits
  • Reducer Output Files: By default, one reducer produces one output file; multiple reducers produce multiple output files (see the sketch after this list).
  • Intermediate Data: Mapper output is not final; it must still be shuffled, sorted, and reduced.
  • Fault Tolerance: If a task fails, it is reassigned to another node holding a replica of the data (the JobTracker handled this in Hadoop 1.x; YARN does in Hadoop 2 and later).
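
Unlike the mapper count, the reducer count is set explicitly on the job. A small sketch (the value 4 is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count");
        // Four reducers -> four output files: part-r-00000 .. part-r-00003.
        // The mapper count cannot be set this way; it follows from the input splits.
        job.setNumReduceTasks(4);
    }
}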

Advantages of MapReduce in Hadoop

  • Scalable: Efficiently processes petabytes of data across thousands of nodes.
  • Parallel: Executes tasks simultaneously on distributed data blocks.
  • Fault-Tolerant: Recovers automatically by reassigning failed tasks.
  • Simple: Developers write only the Map and Reduce logic; Hadoop manages execution.
