What is Hadoop Streaming?

Last Updated : 11 Aug, 2025

Hadoop MapReduce was originally built for Java, which limited its accessibility to developers familiar with other languages. To address this, Hadoop introduced Streaming, a utility that enables writing MapReduce programs in any language that supports standard input and output, such as Python, Bash or Perl.

Hadoop Streaming, available since version 0.14.1, allows external scripts to be used as Mapper and Reducer tasks. These scripts process input from STDIN and produce output to STDOUT, enabling non-Java programs to participate fully in Hadoop’s distributed data processing.

Use Cases of Hadoop Streaming

  1. Suitable for developers preferring Python, Perl, Bash or other non-Java languages
  2. Enables reuse of existing legacy scripts in MapReduce workflows
  3. Facilitates rapid prototyping of data processing tasks using simple scripts
  4. Supports development of custom mappers and reducers for non-standard or binary data formats

Data Flow in Hadoop Streaming

Hadoop Streaming processes key-value pairs through external mapper and reducer scripts using standard input and output. Let’s see how data flows through each stage in the diagram below.

[Diagram: Data flow in a Hadoop Streaming job]

Let’s walk through how data flows in a typical Hadoop Streaming job:

1. Input Reader / Format

Hadoop reads the input data using the configured InputFormat class.

  • The data is split into <key, value> pairs.
  • These pairs are passed to the Mapper script.

2. Mapper Stream

  • Each input pair is sent to an external Mapper script via STDIN.
  • The script can be written in any language that supports standard input/output.
  • It processes the input and writes output to STDOUT in the form of intermediate <key, value> pairs (a minimal example is sketched below).
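
For example, a minimal word-count Mapper written in Python might look like the sketch below. The file name mymapper.py and the tab-separated output are illustrative assumptions; by default, Hadoop Streaming treats everything up to the first tab on an output line as the key and the rest as the value.

#!/usr/bin/env python3
# mymapper.py - illustrative word-count mapper for Hadoop Streaming.
# Reads raw text lines from STDIN and emits tab-separated <word, 1> pairs on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Everything before the first tab becomes the key; the rest is the value.
        print(f"{word}\t1")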

3. Intermediate Key-Value Pairs

  • Hadoop automatically collects all intermediate output.
  • It shuffles and groups values by keys across all Mappers.
  • Sorted data is now ready for Reducers.

4. Reducer Stream

  • The sorted intermediate data is passed to an external Reducer script via STDIN, one <key, value> line at a time.
  • Because the lines arrive sorted by key, the script can detect where one key ends and the next begins and aggregate the values for each key (see the sketch below).
  • The output, written via STDOUT, contains the final <key, value> results.
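
A matching word-count Reducer, again a hedged sketch rather than the only possible implementation, could look like this. Unlike the Java API, Streaming does not hand the script pre-grouped lists of values, so the script tracks key boundaries itself:

#!/usr/bin/env python3
# myreducer.py - illustrative word-count reducer for Hadoop Streaming.
# Lines arrive on STDIN already sorted by key, as tab-separated <word, count> pairs.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        # A new key has started; emit the total for the previous one.
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

# Emit the final key.
if current_word is not None:
    print(f"{current_word}\t{current_count}")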

5. Output Format

  • Final output pairs from the Reducer are collected by Hadoop.
  • They are written to HDFS using the configured OutputFormat (usually plain text files).

Running a Streaming Job in Hadoop

To run a Hadoop Streaming job, use the hadoop jar command with the hadoop-streaming.jar file. This lets you plug in external scripts as the Mapper and Reducer, even if they’re written in languages like Python, Bash or Perl.

Let’s see an example of how to run a streaming job using Python scripts:

hadoop jar /path/to/hadoop-streaming.jar \
-input /data/input.txt \
-output /data/output \
-mapper mymapper.py \
-reducer myreducer.py \
-file mymapper.py \
-file myreducer.py

What the Command Does

  • Runs a MapReduce job using external Python scripts.
  • -input: Specifies the input data stored in HDFS.
  • -output: Directory to store the final output (must not already exist).
  • -mapper: Python script used to process the input data.
  • -reducer: Python script used to process grouped key-value pairs.
  • -file: Uploads the mapper and reducer scripts to Hadoop nodes.

Internal Workflow

  • Hadoop passes each line of input to mymapper.py via STDIN.
  • The mapper emits key-value pairs using STDOUT.
  • Hadoop shuffles and groups the data by key.
  • Grouped data is passed to myreducer.py via STDIN.
  • The reducer outputs final results via STDOUT, which are written to /data/output (this flow can be simulated locally, as shown below).
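
Because the Mapper and Reducer only read STDIN and write STDOUT, the whole pipeline can be rehearsed on a local machine before submitting the job. The command below is a rough stand-in, with sort playing the role of Hadoop’s shuffle-and-sort phase; the file and script names are the hypothetical ones from the example above:

cat input.txt | python3 mymapper.py | sort | python3 myreducer.py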

Useful Hadoop Streaming Options

  • -input : Input path for the mapper
  • -output : Output path after the reduce phase
  • -mapper : Command or script to run as the mapper
  • -reducer : Command or script to run as the reducer
  • -file : Uploads the mapper/reducer script to all compute nodes
  • -inputformat : Custom InputFormat class
  • -outputformat : Custom OutputFormat class
  • -partitioner : Defines how keys are divided among reducers
  • -combiner : Reduce logic applied after the map phase (a local mini-reduce)
  • -verbose : Enables detailed logs
  • -numReduceTasks : Number of reducer tasks
  • -mapdebug, -reducedebug : Scripts to run on task failure (for debugging)
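
As an illustration, several of these options can be combined in a single job submission. The command below is a sketch that reuses the word-count reducer as a combiner and requests two reducer tasks; the paths and script names are placeholders carried over from the earlier example:

hadoop jar /path/to/hadoop-streaming.jar \
-input /data/input.txt \
-output /data/output \
-mapper mymapper.py \
-reducer myreducer.py \
-combiner myreducer.py \
-numReduceTasks 2 \
-verbose \
-file mymapper.py \
-file myreducer.py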
