
How a Job Runs on MapReduce

Last Updated : 11 Aug, 2025

Running a job in Hadoop MapReduce may look simple from the user’s side, but behind the scenes, it involves a complex series of steps. From job submission to resource allocation and finally task execution, Hadoop handles everything through its distributed architecture.

Let’s explore how it all begins.

Submitting a Job in MapReduce

The MapReduce job starts with just one method call in your Java code:

job.submit(); // Submits job asynchronously

  • This triggers the internal method submitJobInternal() via a JobSubmitter object.
  • It sets off a full job lifecycle handled by Hadoop including input validation, resource setup and task scheduling.

Alternatively, you can use:

job.waitForCompletion(true); // Submits job and waits for it to finish

  • This submits the job and monitors its progress.
  • It logs updates every second and prints final job statistics on success or failure.

Although the method call is simple, it kicks off a complex sequence of operations involving several key components of the Hadoop ecosystem.
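To see where these calls fit, here is a minimal driver sketch. The class names WordCountDriver, WordCountMapper and WordCountReducer are illustrative placeholders for your own classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);    // identifies the job JAR to ship
        job.setMapperClass(WordCountMapper.class);   // your Mapper implementation
        job.setReducerClass(WordCountReducer.class); // your Reducer implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        // Submit and block until the job finishes, printing progress every second
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}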

Key Components Involved

  1. Client: Submits the MapReduce job
  2. YARN NodeManager: Manages execution containers on worker nodes
  3. YARN ResourceManager: Allocates and tracks compute resources across the cluster
  4. ApplicationMaster (MRAppMaster): Coordinates the job: schedules tasks, tracks progress and handles failures
  5. HDFS (Distributed File System): Stores job JAR, input splits and configuration so cluster nodes can access them.

(Figure: MapReduce job submission flowchart)

Internal Process of JobSubmitter

When a MapReduce job is submitted, the JobSubmitter prepares it for execution: it validates paths, copies configuration files and stages the job's resources on HDFS. This ensures everything is ready before execution begins.

Let’s explore what happens behind the scenes.

1. Request Application ID

  • The ResourceManager generates a unique Application ID that identifies the MapReduce job throughout its lifecycle.
  • This unique ID helps track the job’s status, logs and progress across the Hadoop cluster.
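On the client side, the ID becomes visible through the Job object once the job has been submitted; in YARN, the MapReduce job ID (job_...) corresponds directly to the application ID (application_...). Continuing the driver sketch above, with the asynchronous submit:

job.submit();                                     // asynchronous submission
System.out.println("Job ID: " + job.getJobID());  // e.g. job_1700000000000_0001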

2. Validate Output Path

Before launching the job, Hadoop ensures the output directory is valid and won’t cause issues during execution. It checks if the job's output directory:

  • Is correctly specified: The job proceeds.
  • Already exists: Throws an error to avoid overwriting existing results.
  • Is not specified: Throws an error because Hadoop needs a target location to save the output.

This step ensures a safe and clean output path, preventing data conflicts or job failure.
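Because an existing output directory fails the submission, a common client-side pattern is to delete a stale directory before resubmitting, when the old results are disposable. A small sketch (the helper name is illustrative):

// Deletes a pre-existing output directory so resubmission does not fail.
// Only do this when the old results are safe to discard.
static void clearOutputDir(org.apache.hadoop.conf.Configuration conf,
                           org.apache.hadoop.fs.Path outputDir) throws java.io.IOException {
    org.apache.hadoop.fs.FileSystem fs = outputDir.getFileSystem(conf);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);  // true = recursive delete
    }
}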

3. Compute Input Splits

Hadoop needs to divide the input data into manageable pieces (splits) so it can process them in parallel. It checks if the job's input splits:

  • Are already specified: Skips computation and uses them directly.
  • Are not provided: Automatically computes them based on input size and block size.
  • Cannot be computed (for example, the input path does not exist): Job submission is aborted and the error is returned to the client.

This step ensures that the data is properly partitioned for parallel processing across multiple mapper tasks.
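For FileInputFormat-based jobs, the effective split size is max(minSize, min(maxSize, blockSize)), so a job can influence the number of map tasks by bounding the split size in the driver (the byte values below are illustrative):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Effective split size = max(minSize, min(maxSize, blockSize))
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB upper bound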

4. Upload Job Resources to HDFS

Once the job is prepared, Hadoop needs to share all necessary files across the cluster so that every node involved in processing can access them. It copies the following into a shared HDFS staging directory (named after the job ID):

  • Job JAR file: Contains the Mapper, Reducer and Driver classes.
  • Input splits metadata: Tells where and how to read chunks of input data.
  • Configuration files: Include job settings and environment details.

The JAR file is also replicated across the cluster, with the replication factor controlled by the configuration property:

mapreduce.client.submit.file.replication

The default is 10, so plenty of copies of the JAR file are available across the cluster for NodeManagers to read when they launch tasks, even if some nodes go down during execution.
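The value can also be set from the driver, for example on a large cluster where many NodeManagers will localize the JAR at the same time (10 is the stock default, shown here explicitly):

// Higher replication of submitted job files (JAR, splits, configuration)
// reduces hot-spotting when many tasks localize them at startup.
conf.setInt("mapreduce.client.submit.file.replication", 10);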

5. Submit Application

This is the final step in the job submission process. The job is officially handed over to YARN by calling:

submitApplication()

This method notifies the ResourceManager that the job is ready to run and hands over all the necessary metadata, configuration and resource locations.
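User code never calls this directly; MapReduce's YARN runner performs it through YARN's client API. Purely as an illustration of what that hand-off looks like (not something a MapReduce job needs to write), the underlying YarnClient calls are roughly:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start();

YarnClientApplication app = yarnClient.createApplication();
ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
// ... the ApplicationMaster container spec, resources and queue are set here ...

ApplicationId appId = yarnClient.submitApplication(ctx);  // hand-off to the ResourceManager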

6. Launch ApplicationMaster

Once the ResourceManager accepts the job, it allocates a container on a NodeManager, and the NodeManager launches the MRAppMaster process in it. The ApplicationMaster:

  • Creates one map task per input split and the configured number of reduce tasks
  • Tracks task progress and reschedules failed ones
  • Communicates status back to the client

7. Once Submitted: Monitoring & Completion

After the job is submitted, Hadoop enters the monitoring phase:

  • The client polls for the job's current status (e.g., running, completed, failed) every second; the interval is controlled by mapreduce.client.progressmonitor.pollinterval.
  • Real-time updates about map and reduce progress are logged to the console, so users can track how the job is performing.
  • On success: Hadoop displays the final output location and job counters (e.g., number of input records, output records, bytes read/written).
  • On failure: A clear error message is printed with details about why the job failed, making debugging easier.
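With the asynchronous job.submit(), the same information is available programmatically through the Job API; a minimal polling sketch (run inside a method that declares throws Exception):

job.submit();
while (!job.isComplete()) {
    System.out.printf("map %.0f%% reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(1000);  // poll once per second, like waitForCompletion(true)
}
if (job.isSuccessful()) {
    System.out.println(job.getCounters());  // input/output records, bytes, etc.
} else {
    System.err.println("Job " + job.getJobID() + " failed");
}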

8. Final Output Commit

Each reducer writes results to a temporary path. After successful execution:

  • OutputCommitter moves files to the final output directory.
  • This prevents partial/corrupted output if tasks fail midway.

By default, Hadoop uses FileOutputCommitter.
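Recent Hadoop releases also offer a second commit algorithm that moves task output directly to the final directory during task commit, speeding up the job-commit phase at the cost of weaker guarantees if the job fails mid-commit:

// Algorithm version 1 (the classic default) renames task output twice:
// task attempt dir -> job temp dir -> final dir. Version 2 renames it
// straight into the final directory when each task commits.
conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);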

