Hadoop Version 3.0 - What's New?

Last Updated : 04 Aug, 2025

Hadoop is a Java-based framework for distributed storage and processing of large datasets. Introduced in 2006 by Doug Cutting and Mike Cafarella for the Nutch project, it soon became central to Big Data technologies. By 2008, it outperformed supercomputers in sorting terabytes of data. With Hadoop 2.x enabling scalability and Hadoop 3.x improving fault tolerance, efficiency, and flexibility, it continues to power modern data-intensive industries.

Key New Features in Hadoop 3.0

1. JDK 8 is the Minimum Java Version Supported by Hadoop 3.x

Oracle ended public updates for JDK 7 in 2015, so Hadoop 3 requires JDK 8 or later. All Hadoop JARs are compiled to target the Java 8 runtime, which means users must upgrade to JDK 8 or above to compile and run Hadoop. JDK versions below 8 are no longer supported.

2. Erasure Coding is Supported

Erasure coding in Hadoop 3 provides fault tolerance by reconstructing lost data from parity information, similar to RAID technology. Hadoop 2 relied on replication, which with the default replication factor of 3 carries a 200% storage overhead. Erasure coding needs roughly half the raw storage while offering comparable reliability. This reduces disk usage and storage costs in Hadoop clusters built on commodity hardware.
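The storage saving can be checked with a little arithmetic. The sketch below assumes 3-way replication and a Reed-Solomon layout with 6 data units and 3 parity units per stripe; the function names are hypothetical and exist only for this illustration.

```python
def replication_raw_blocks(data_blocks: int, replication_factor: int = 3) -> int:
    """Raw blocks stored when every data block is copied replication_factor times."""
    return data_blocks * replication_factor


def erasure_coded_raw_blocks(data_blocks: int, data_units: int = 6,
                             parity_units: int = 3) -> int:
    """Raw blocks stored when each stripe of data_units blocks gets parity_units parity blocks."""
    stripes = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + stripes * parity_units


if __name__ == "__main__":
    blocks = 6
    print("replication:   ", replication_raw_blocks(blocks), "raw blocks")
    print("erasure coding:", erasure_coded_raw_blocks(blocks), "raw blocks")
```

Under these assumptions, 6 data blocks occupy 18 raw blocks with replication and 9 with erasure coding, which is where the "nearly half" figure comes from.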

3. More Than Two NameNodes Supported

Hadoop 3.x extends fault tolerance by supporting multiple standby NameNodes instead of just one, as in Hadoop 2.x. Edit log entries are replicated to a quorum of three or more JournalNodes, which makes the cluster more resilient. For example, a cluster configured with three NameNodes and five JournalNodes can tolerate the failure of two nodes rather than just one, ensuring higher availability for big data applications.
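The failure counts in the example follow from simple majority arithmetic. The sketch below assumes that a JournalNode quorum needs a strict majority of nodes to stay up, and that the cluster needs at least one NameNode to remain available; the function names are hypothetical.

```python
def tolerated_journalnode_failures(journalnodes: int) -> int:
    """A majority quorum of N nodes survives the loss of (N - 1) // 2 of them."""
    return (journalnodes - 1) // 2


def tolerated_namenode_failures(namenodes: int) -> int:
    """At least one NameNode must survive to take the active role."""
    return max(namenodes - 1, 0)


if __name__ == "__main__":
    for jn in (3, 5):
        print(jn, "JournalNodes tolerate", tolerated_journalnode_failures(jn), "failure(s)")
    print("3 NameNodes tolerate", tolerated_namenode_failures(3), "failure(s)")
```

With three JournalNodes only one may fail, while five JournalNodes and three NameNodes each tolerate two failures.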

4. Shell Script Rewrite

Hadoop provides shell commands that interact directly with HDFS and the other filesystems Hadoop supports, such as WebHDFS, Local FS, and S3 FS. Much of Hadoop's functionality is controlled through its shell scripts. In Hadoop 3.x these scripts have been rewritten to fix many long-standing bugs and to add some new features. Although compatibility was kept in mind, some of the changes may break existing installations.

5. Timeline Service v.2 for YARN

The YARN Timeline Service stores and retrieves application information, both current and historical. Timeline Service v.2 improves the scalability and reliability of the Timeline Service, and it enhances usability by introducing flows and aggregation. Timeline Service v.1, which shipped with Hadoop 2.x, is limited to a single instance of the writer/reader and storage, so it cannot scale beyond small clusters.

Timeline Service v.2 uses a distributed writer architecture in which the collection (writes) of data is separated from the serving (reads) of data. It provides distributed collectors, essentially one for each YARN (Yet Another Resource Negotiator) application. For storage it uses Apache HBase, which scales to very large sizes while keeping good response times for read and write operations.

The information that Timeline Service v.2 stores falls into two main types:

A. Generic information about completed applications

user information
queue name
number of attempts made per application
information about the containers run for each application attempt

B. Per-framework information about running and completed applications

count of Map and Reduce tasks
counters
information published by the application developer to the Timeline Server through the Timeline client

6. Filesystem Connector Support

Hadoop 3.x adds support for Microsoft Azure Data Lake and Aliyun Object Storage System (OSS) as alternative Hadoop-compatible filesystems.

7. Default Ports of Multiple Services Have Been Changed

In earlier versions of Hadoop, the default ports of several services fell within the Linux ephemeral port range (32768-61000). As a result, a service could fail to bind at startup because another application was already using its port. To avoid these conflicts, Hadoop 3.x moves the conflicting ports out of the ephemeral range. The affected services and their new ports are listed below.

// Newly assigned ports (old -> new)
NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868
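When migrating configuration or firewall rules, it can help to keep this mapping in a small lookup table. The sketch below copies the port numbers listed above and uses the ephemeral range quoted in this section; the helper names are hypothetical.

```python
# Linux ephemeral port range as quoted in this article (inclusive).
EPHEMERAL_RANGE = range(32768, 61001)

# Old default port -> new default port, grouped by service.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
}


def in_ephemeral_range(port: int) -> bool:
    """Return True if the port lies inside the ephemeral range."""
    return port in EPHEMERAL_RANGE


def new_port(old_port: int) -> int:
    """Look up the Hadoop 3.x default for an old default port."""
    for mapping in PORT_CHANGES.values():
        if old_port in mapping:
            return mapping[old_port]
    raise KeyError(f"no port change recorded for {old_port}")


if __name__ == "__main__":
    for service, mapping in PORT_CHANGES.items():
        for old, new in mapping.items():
            print(f"{service}: {old} -> {new} (new port ephemeral: {in_ephemeral_range(new)})")
```

None of the new ports fall inside the ephemeral range, which is the point of the change.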

8. Intra-Datanode Balancer

DataNodes provide the storage in a Hadoop cluster, and a single DataNode manages multiple disks. During normal write operations these disks are filled evenly. Adding or replacing disks, however, can cause significant skew within a DataNode. The existing HDFS balancer cannot handle this situation, because it concerns itself with inter-DataNode skew, not intra-DataNode skew. The new intra-DataNode balancing feature handles it and is invoked through the hdfs diskbalancer CLI.
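To make the idea of intra-DataNode skew concrete, the sketch below compares each disk's utilization with the node-wide average. It is only an illustration of the concept under that assumption, not the metric the disk balancer necessarily uses, and the function name is hypothetical.

```python
def volume_skew(volumes):
    """Return, per disk, the node-wide utilization minus that disk's utilization.

    volumes is a list of (used_bytes, capacity_bytes) pairs for one DataNode.
    A positive value means the disk is emptier than average and has room to
    receive data; a negative value means it is fuller than average.
    """
    total_used = sum(used for used, _ in volumes)
    total_capacity = sum(capacity for _, capacity in volumes)
    ideal = total_used / total_capacity
    return [ideal - used / capacity for used, capacity in volumes]


if __name__ == "__main__":
    # Two disks at 80% utilization plus one newly added, empty disk.
    disks = [(800, 1000), (800, 1000), (0, 1000)]
    for index, skew in enumerate(volume_skew(disks)):
        print(f"disk {index}: skew {skew:+.3f}")
```

In this example the new disk shows a large positive value while the two older disks are negative, which is the kind of imbalance that moving blocks between disks on the same node would even out.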

9. Shaded Client Jars

Hadoop 3.x introduces the new hadoop-client-api and hadoop-client-runtime artifacts, which shade Hadoop's dependencies into a single jar. hadoop-client-api has compile scope, while hadoop-client-runtime has runtime scope and contains the relocated third-party dependencies from hadoop-client. Developers can now bundle the dependencies in a single jar file and test it for version conflicts. This avoids leaking Hadoop's dependencies onto the application's classpath.

10. Task Heap and Daemon Management

Hadoop 3.x adds new ways to configure the heap size of Hadoop daemons. The HADOOP_HEAPSIZE variable is deprecated; in its place, developers can use HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN to set the maximum and minimum heap sizes. The internal variable JAVA_HEAP_MAX has been removed. The default heap sizes have also been removed, which lets the JVM (Java Virtual Machine) auto-tune the heap based on the memory size of the host. To restore the older default, set HADOOP_HEAPSIZE_MAX in the hadoop-env.sh file.
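The effect of these variables can be sketched as a mapping from environment settings to JVM options. The example below assumes that HADOOP_HEAPSIZE_MAX corresponds to -Xmx, that HADOOP_HEAPSIZE_MIN corresponds to -Xms, and that a bare number is read as megabytes; the function is a hypothetical illustration, not Hadoop's actual shell logic.

```python
def jvm_heap_flags(env: dict) -> list:
    """Translate heap-size variables into JVM flags; unset variables yield no flag."""
    flags = []
    for variable, flag in (("HADOOP_HEAPSIZE_MAX", "-Xmx"),
                           ("HADOOP_HEAPSIZE_MIN", "-Xms")):
        value = env.get(variable, "").strip()
        if not value:
            continue  # nothing set: leave heap sizing to JVM auto-tuning
        if value.isdigit():
            value += "m"  # assumption: a bare number means megabytes
        flags.append(flag + value)
    return flags


if __name__ == "__main__":
    print(jvm_heap_flags({}))
    print(jvm_heap_flags({"HADOOP_HEAPSIZE_MAX": "1g"}))
    print(jvm_heap_flags({"HADOOP_HEAPSIZE_MAX": "2048", "HADOOP_HEAPSIZE_MIN": "512m"}))
```

When neither variable is set, no heap flag is produced, which corresponds to letting the JVM pick the heap size from the host's memory.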


dikshantmalidev
