What is Hadoop Streaming?
Last Updated: 11 Aug, 2025
Hadoop MapReduce was originally built for Java, which limited its accessibility to developers familiar with other languages. To address this, Hadoop introduced Streaming, a utility that enables writing MapReduce programs in any language that supports standard input and output, such as Python, Bash or Perl.
Hadoop Streaming, available since version 0.14.1, allows external scripts to be used as Mapper and Reducer tasks. These scripts process input from STDIN and produce output to STDOUT, enabling non-Java programs to participate fully in Hadoop’s distributed data processing.
Use Cases of Hadoop Streaming
- Suitable for developers preferring Python, Perl, Bash or other non-Java languages
- Enables reuse of existing legacy scripts in MapReduce workflows
- Facilitates rapid prototyping of data processing tasks using simple scripts
- Supports development of custom mappers and reducers for non-standard or binary data formats
Data Flow in Hadoop Streaming
Hadoop Streaming processes key-value pairs through external mapper and reducer scripts using standard input and output. Let’s see how data flows through each stage.
Let’s walk through how data flows in a typical Hadoop Streaming job:
1. Input Reader
- Hadoop reads the input data using the configured InputFormat class.
- The data is split into <key, value> pairs.
- These pairs are passed to the Mapper script.
2. Mapper Stream
- Each input pair is sent to an external Mapper script via STDIN.
- The script can be written in any language that supports standard input/output.
- It processes the input and writes output to STDOUT in the form of intermediate <key, value> pairs.
3. Intermediate Key-Value Pairs
- Hadoop automatically collects all intermediate output.
- It shuffles and groups values by keys across all Mappers.
- Sorted data is now ready for Reducers.
4. Reducer Stream
- The grouped intermediate data is passed to an external Reducer script via STDIN.
- The script processes each group of keys and values.
- The output, written via STDOUT, contains final <key, value> results.
5. Output Format
- Final output pairs from the Reducer are collected by Hadoop.
- They are written to HDFS using the configured OutputFormat (usually plain text files).
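The stages above can be sketched end-to-end in plain Python. This is a small in-process simulation of a word-count flow, not Hadoop's API: the function names and sample lines are illustrative, and `sorted()` stands in for Hadoop's shuffle/sort step.

```python
import itertools

def mapper(lines):
    # Map phase: emit (word, 1) for every word, mirroring what a
    # streaming mapper would print to STDOUT as "word\t1" lines.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so consecutive
    # occurrences of the same word can be summed with groupby.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(mapper(lines))   # stands in for Hadoop's shuffle/sort
result = dict(reducer(shuffled))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real job the `mapper` and `reducer` steps run as separate OS processes on different nodes, with Hadoop handling the sort and the data movement between them.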
Running a Streaming Job in Hadoop
To run a Hadoop Streaming job, use the hadoop jar command with the hadoop-streaming.jar file. This lets you plug in external scripts as the Mapper and Reducer, even if they’re written in languages like Python, Bash or Perl.
Let’s see an example of how to run a streaming job using Python scripts:
hadoop jar /path/to/hadoop-streaming.jar \
-input /data/input.txt \
-output /data/output \
-mapper mymapper.py \
-reducer myreducer.py \
-file mymapper.py \
-file myreducer.py
What the Command Does
- Runs a MapReduce job using external Python scripts.
- -input: Specifies the input data stored in HDFS.
- -output: Directory to store the final output (must not already exist).
- -mapper: Python script used to process the input data.
- -reducer: Python script used to process grouped key-value pairs.
- -file: Uploads the mapper and reducer scripts to Hadoop nodes.
Internal Workflow
- Hadoop passes each line of input to mymapper.py via STDIN.
- The mapper emits key-value pairs using STDOUT.
- Hadoop shuffles and groups the data by key.
- Grouped data is passed to myreducer.py via STDIN.
- The reducer outputs final results via STDOUT, which are written to /data/output.
Useful Hadoop Streaming Options
| Option | Description |
|---|---|
| -input | Input path for the mapper |
| -output | Output path after the reduce phase |
| -mapper | Command or script to run as the mapper |
| -reducer | Command or script to run as the reducer |
| -file | Uploads the mapper/reducer script to all compute nodes |
| -inputformat | Custom InputFormat class |
| -outputformat | Custom OutputFormat class |
| -partitioner | Defines how keys are divided among reducers |
| -combiner | Reduce logic applied locally after the map phase (a mini-reduce) |
| -verbose | Enables detailed logging |
| -numReduceTasks | Number of reducer tasks |
| -mapdebug, -reducedebug | Scripts to run on task failure (for debugging) |
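The -combiner option is worth a closer look: it runs reduce-style logic on each mapper's local output before the shuffle, cutting the amount of data sent over the network. For an associative operation like word counting, the same logic can serve as both combiner and reducer; a small Python sketch of the idea (illustrative names, not a Hadoop API):

```python
import itertools

def wordcount_reduce(pairs):
    # The same logic a streaming -combiner and -reducer would apply:
    # sort by key, then sum the counts of consecutive equal keys.
    for key, grp in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in grp)

# Map output from one mapper before the shuffle
map_out = [("the", 1), ("fox", 1), ("the", 1)]

# With a combiner, counts are pre-aggregated locally, so fewer pairs
# cross the network; the final reduced result is identical either way.
combined = list(wordcount_reduce(map_out))
print(combined)  # [('fox', 1), ('the', 2)]
```

A combiner is only safe when the operation is associative and commutative (sums, counts, maxima); averages, for example, cannot be combined this way directly.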