Hadoop - File Blocks and Replication Factor
Last Updated: 24 Jun, 2025
The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop: all of the data in a Hadoop cluster lives in HDFS. Hadoop is also known for its efficient and reliable storage technique. HDFS splits files into blocks, and the Replication Factor is simply the number of copies kept of each block, so let's discuss file blocks and replication one by one with examples for better understanding.
File Blocks in Hadoop
Whenever a file is imported into your Hadoop Distributed File System, it gets divided into blocks of a certain size, and these blocks of data are stored on various slave nodes. This is a normal thing that happens in almost all types of file systems.
By default, these blocks are 64 MB in Hadoop 1 and 128 MB in Hadoop 2 and later, which means every block obtained by dividing a file is 64 MB or 128 MB in size, except possibly the last one, which can be smaller. You can manually change the block size in the hdfs-site.xml file, as sketched below.
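For reference, a block-size override in hdfs-site.xml looks roughly like the sketch below; dfs.blocksize is the standard property name in Hadoop 2 and later (older releases used dfs.block.size), and the 128 MB value is shown only as an illustration:

```xml
<!-- hdfs-site.xml: illustrative block-size setting of 128 MB, expressed in bytes -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```

Note that the block size is recorded per file when it is written, so changing this setting affects only files written afterwards.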
Hadoop neither knows nor cares about the data stored in these blocks, so the final block of a file may even end with a partial record. For comparison, in the Linux file system the block size is about 4 KB, which is much smaller than the default block size in the Hadoop file system.
Since Hadoop is mainly built to store very large datasets, up to petabytes in size, its large block size is one of the things that sets the Hadoop file system apart from other file systems and lets it scale; nowadays block sizes of 128 MB to 256 MB are commonly used in Hadoop.
Example
Suppose you upload a 400 MB file to HDFS. The file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB. That is, 4 blocks are created, each of 128 MB except the last one, which holds the remaining 16 MB.
Why Are File Blocks So Large in Hadoop?
- Reduce Metadata Overhead: Smaller blocks lead to more metadata, making management cumbersome.
- Minimize Disk Seek Time: Larger blocks reduce the number of disk seek operations, increasing efficiency.
Advantages of File Blocks
- A single file is easy to store and maintain because it can be larger than any single disk in the cluster; its blocks are simply spread across multiple nodes.
- Metadata, such as permissions, does not need to be stored with the file blocks; it can be handled separately on another system, so the blocks themselves stay simple.
- Making replicas of the blocks is quite easy, which gives us fault tolerance and high availability in our Hadoop cluster.
- As the blocks have a fixed, configured size, it is easy to keep track of them.
Replication and Replication Factor
Replication simply means making a copy of something, and the number of copies you keep of that particular thing is its Replication Factor. When HDFS stores data in the form of file blocks, Hadoop is also configured to make copies of those blocks at the same time.
By default, the Replication Factor in Hadoop is set to 3. It can be configured manually as per your requirement.
Example:
- Master Node: 64 GB RAM, 50 GB Disk
- 4 Slave Nodes: 16 GB RAM, 40 GB Disk each
- Replication Factor: 3
Here the Master has more RAM than the slaves. It needs more because the Master is the one that coordinates the slaves, so it has to process requests quickly.
Now suppose you have a file of size 150 MB; it will be split into 2 blocks, as shown below.
128 MB = Block 1
22 MB = Block 2
Since the replication factor is 3 by default, we have 3 copies of each of these file blocks:
FileBlock1-Replica1(B1R1) FileBlock2-Replica1(B2R1)
FileBlock1-Replica2(B1R2) FileBlock2-Replica2(B2R2)
FileBlock1-Replica3(B1R3) FileBlock2-Replica3(B2R3)
These replicas are distributed across the slave nodes, which means that if Slave 1 crashes and B1R1 and B2R3 are lost, Block 1 and Block 2 can still be recovered from the other slaves, because replicas of these file blocks are already present there.
Similarly, if any other slave crashes, we can recover its file blocks from some other slave.
Replication increases our storage usage, but the data is even more important to us.
Need for Replication
Hadoop uses commodity hardware, which can fail unexpectedly. Replication ensures:
- High Availability: Data is not lost even if a node crashes.
- Fault Tolerance: System can recover blocks from other nodes.
Although replication increases storage usage, enterprises prioritize data reliability over saving storage space.
Configuration of Replication Factor
You can configure the Replication factor in your hdfs-site.xml file.
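A minimal hdfs-site.xml entry along these lines could be used; dfs.replication is the standard HDFS property, and the value 1 shown here simply matches the single-machine setup described below:

```xml
<!-- hdfs-site.xml: illustrative replication setting for a single-node setup -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```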

Here, we have set the Replication Factor to one because we have only a single system to work with, i.e. a single laptop, rather than a cluster with many nodes.
You simply need to change the value of the dfs.replication property as per your need.
Advantages of Replication in Hadoop
- Fault Tolerance: Ensures data is available even if a node fails.
- High Availability: Keeps data accessible at all times without downtime.
- Data Reliability: Protects against data loss or corruption.
- Parallel Access: Allows multiple users or processes to read data simultaneously.
- Load Balancing: Distributes read requests across replicas to avoid overloading any one node.