
Hadoop - File Blocks and Replication Factor

Last Updated : 24 Jun, 2025

Hadoop Distributed File System (HDFS) is the storage layer of Hadoop: all of our data is stored in HDFS, and Hadoop is known for its efficient and reliable storage. The Replication Factor is the number of copies (replicas) HDFS keeps of each block of data. Let's discuss file blocks and replication one by one, with examples for better understanding.

File Blocks in Hadoop 

Figure: Data Blocks in Hadoop HDFS

Whenever a file is written to the Hadoop Distributed File System, it is divided into blocks of a configured size, and these blocks are stored across various slave nodes. Splitting files into blocks is normal in almost all types of file systems.

By default, these blocks are 64MB in size in Hadoop 1 and 128MB in Hadoop 2, which means a file is divided into blocks of 64MB or 128MB (the last block may be smaller). You can manually change the block size in the hdfs-site.xml file.
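For instance, the block size is controlled by the dfs.blocksize property (the Hadoop 2 name; Hadoop 1 used dfs.block.size). A minimal hdfs-site.xml entry might look like the following sketch; the value is given in bytes (134217728 bytes = 128MB), and newer Hadoop versions also accept suffixed values such as 128m:

```xml
<!-- hdfs-site.xml (illustrative snippet): set the HDFS block size to 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```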

Hadoop doesn't know or care about the data stored in these blocks, so if a file does not divide evenly, the final block is simply stored as a smaller, partial block; no space is wasted padding it out. In the Linux file system, the size of a file block is about 4KB, which is very much less than the default size of file blocks in the Hadoop file system.

Hadoop is mainly configured for storing very large datasets, often petabytes in size, and the large block size is part of what lets the Hadoop file system scale where other file systems cannot. Nowadays, block sizes of 128MB to 256MB are commonly used in Hadoop.

Figure: DFS Block Size

Example

Suppose you have uploaded a file of 400MB to your HDFS with a 128MB block size. The file is divided as 128MB + 128MB + 128MB + 16MB = 400MB, meaning 4 blocks are created: three of 128MB each, and a final partial block of 16MB.
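The arithmetic above can be sketched in a few lines of Python. This is illustrative only; split_into_blocks is a hypothetical helper, not part of any Hadoop API:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the HDFS blocks for a file,
    where the last block may be a smaller, partial block."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(400))  # [128, 128, 128, 16]
```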

Why Are File Blocks So Large in Hadoop?

  1. Reduce Metadata Overhead: Smaller blocks lead to more metadata, making management cumbersome.
  2. Minimize Disk Seek Time: Larger blocks reduce the number of disk seek operations, increasing efficiency.
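A rough back-of-the-envelope calculation shows why point 2 matters. The numbers below are assumptions (typical figures for spinning disks, not measurements): a 10 ms seek time and a 100 MB/s sequential transfer rate.

```python
SEEK_TIME_S = 0.010          # assumed time to seek to the start of a block
TRANSFER_MB_PER_S = 100.0    # assumed sequential transfer rate

def seek_overhead(block_size_mb):
    """Fraction of total read time spent seeking rather than transferring."""
    transfer_time_s = block_size_mb / TRANSFER_MB_PER_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)

print(seek_overhead(128))       # a 128 MB block: under 1% of time spent seeking
print(seek_overhead(4 / 1024))  # a 4 KB block: reads dominated by seeks
```

Under these assumptions, reading a 128MB block spends less than 1% of its time seeking, while reading the same data as 4KB blocks would spend almost all of its time seeking.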

Advantages of File Blocks

  • A file can be larger than any single disk in the cluster, because its blocks are spread across many disks.
  • We don't need to store metadata, such as permissions, with the file blocks; it is handled separately (by the NameNode), which may be a different system.
  • Making replicas of the blocks is quite easy, which provides us fault tolerance and high availability in our Hadoop cluster.
  • As the blocks are of a fixed configured size, it is easy to keep track of them.

Replication and Replication Factor

Replication simply means making a copy of something, and the number of copies you keep of a particular thing is its Replication Factor. When HDFS stores a file as blocks, Hadoop is also configured to make copies of those file blocks.

By default, the Replication Factor for Hadoop is set to 3. It can be configured manually as per your requirements.

Example:

  • Master Node: 64 GB RAM, 50 GB Disk
  • 4 Slave Nodes: 16 GB RAM, 40 GB Disk each
  • Replication Factor: 3

Here the Master has more RAM than the slaves. It needs more because the Master coordinates the slaves, so it has to process metadata and requests quickly.

Now suppose you have a file of size 150MB. It will be split into 2 blocks, as shown below.

Figure: Master System

Block 1 = 128MB
Block 2 = 22MB

As the Replication Factor is 3 by default, we have 3 copies of each file block:

FileBlock1-Replica1(B1R1) FileBlock2-Replica1(B2R1)
FileBlock1-Replica2(B1R2) FileBlock2-Replica2(B2R2)
FileBlock1-Replica3(B1R3) FileBlock2-Replica3(B2R3)
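The replica layout above can be generated programmatically. This is an illustrative sketch: replica_labels is a made-up helper, and real HDFS places replicas according to its own placement policy rather than by label.

```python
import math

def replica_labels(file_size_mb, block_size_mb=128, replication=3):
    """List one label per stored copy of a file: B<block>R<replica>."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return [f"B{b}R{r}"
            for b in range(1, num_blocks + 1)
            for r in range(1, replication + 1)]

print(replica_labels(150))
# ['B1R1', 'B1R2', 'B1R3', 'B2R1', 'B2R2', 'B2R3']
```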

These blocks are stored on our slaves as shown in the figure above. If Slave 1 crashes, then B1R1 and B2R3 are lost, but B1 and B2 can still be recovered from the other slaves, since replicas of these file blocks are present there.

Similarly, if any other slave crashes, we can recover its file blocks from some other slave.

Replication increases our storage requirements, but the data is even more important to us.

Need for Replication

Hadoop uses commodity hardware, which can fail unexpectedly. Replication ensures:

  • High Availability: Data is not lost even if a node crashes.
  • Fault Tolerance: System can recover blocks from other nodes.

Although replication increases storage usage, enterprises prioritize data reliability over saving storage space.

Configuration of Replication Factor

You can configure the Replication factor in your hdfs-site.xml file. 

Figure: Replication Factor setting in hdfs-site.xml

Here, we have set the Replication Factor to 1 because we have only a single system to work with, i.e. a single laptop, not a cluster with many nodes.

You simply need to change the value of the dfs.replication property as per your need.
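For example, a minimal hdfs-site.xml entry for such a single-node setup might look like this sketch:

```xml
<!-- hdfs-site.xml (illustrative snippet): keep only one copy of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Note that dfs.replication only sets the default for newly created files; the replication of an existing file can be changed with the hdfs dfs -setrep shell command.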

Advantages of Replication in Hadoop

  • Fault Tolerance: Ensures data is available even if a node fails.
  • High Availability: Keeps data accessible at all times without downtime.
  • Data Reliability: Protects against data loss or corruption.
  • Parallel Access: Allows multiple users or processes to read data simultaneously.
  • Load Balancing: Distributes read requests across replicas to avoid overloading any one node.
