Hadoop Architecture | Features and Objectives
What is Hadoop?
Hadoop is an open-source Apache framework, written in Java, that allows the distributed
processing of large datasets across clusters of computers using simple programming models.
A Hadoop application works on a platform that provides distributed storage and computation
across clusters of computers. Building on Google's solution, Doug Cutting and his team
developed an open-source project named HADOOP. Using the MapReduce algorithm, Hadoop runs
applications in which the data is processed in parallel across nodes. In simple terms, Hadoop
is used to develop applications that can perform complete statistical analysis of huge
amounts of data.
Architecture of Hadoop
Hadoop has two major layers, namely:
• Processing/computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System, HDFS).
MapReduce
MapReduce is a parallel programming model for writing distributed applications. It was
devised at Google for efficient processing of large amounts of data on large clusters of
commodity hardware, in a reliable, fault-tolerant manner. MapReduce programs run on the
Hadoop framework.
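To make the model concrete, here is a minimal word-count sketch against the standard
org.apache.hadoop.mapreduce Java API. The class names and the whitespace tokenization are
illustrative choices, not mandated by Hadoop: the map phase emits (word, 1) pairs, and the
reduce phase sums the counts that the framework has grouped by word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts the framework has grouped by word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```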
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System and provides a
distributed file system designed to run on commodity hardware. HDFS has many similarities
with existing distributed file systems. It is designed to be deployed on low-cost hardware
and to be highly fault-tolerant, and it provides high-throughput access to application data.
The Hadoop framework also includes the following two modules:
a. Hadoop Common
Java libraries and utilities required by the other Hadoop modules.
b. Hadoop YARN
A framework for job scheduling and cluster resource management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations to handle large-scale
processing. As an alternative, we can use Hadoop, since a cluster of low-cost machines is
cheaper than one high-end server. This is the major factor behind using Hadoop: it runs
across clustered, low-cost machines.
Hadoop performs the following core tasks (a minimal job-driver sketch follows this list):
1. Data is first organized into directories and files. Files are then divided into
uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
2. These blocks are then distributed across various cluster nodes for further
processing.
3. Sitting on top of the local file system, HDFS supervises the processing.
4. Blocks are replicated to handle hardware failure.
5. Hadoop checks that the code was executed successfully.
6. It performs the sort that takes place between the map and reduce stages.
7. It sends the sorted data to a particular machine.
8. It writes debugging logs for each job.
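The execution-related tasks above (steps 5 through 8) are handled when a job is configured
and submitted through the Job API. Here is a minimal driver sketch, reusing the word-count
classes from the earlier sketch; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths in HDFS (placeholders).
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        // waitForCompletion(true) prints progress and returns success or
        // failure, covering the execution check and per-job logging above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```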
The Hadoop Distributed File System follows a distributed file system design and runs on
commodity hardware. Compared to other distributed systems, HDFS is highly fault-tolerant
and designed for low-cost hardware. HDFS holds very large amounts of data while keeping
access easy; to store such huge data, files are spread across multiple machines. HDFS also
makes application data available for parallel processing.
Features of HDFS
1. Hadoop provides a command interface for interacting with HDFS.
2. Users can easily check the status of the cluster with the help of the name node and data
nodes (see the sketch after this list).
3. Streaming access to file system data is available.
4. HDFS provides file permissions and authentication.
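As an illustration of feature 2, a client can query file and cluster state through the Java
FileSystem API; the directory path below is a placeholder, and the same information is
available through the command interface of feature 1:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStatusCheck {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from the
        // core-site.xml/hdfs-site.xml files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List a directory, reporting per-file size and replication.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}
```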
HDFS Architecture
HDFS follows a master-slave architecture.
Name node
The name node is commodity hardware, consisting of the GNU/Linux operating system and the
name node software, and it acts as the master server. It performs the following tasks:
a. It manages the file system namespace.
b. It regulates clients' access to files.
c. It executes file system operations such as renaming, closing, and opening files and
directories.
Data node
A data node is commodity hardware consisting of the GNU/Linux operating system and the data
node software. For every node in a cluster there is a data node, and these nodes manage the
data storage of their system.
a. Data nodes perform read-write operations on the file system, as per client request (see
the sketch after this list).
b. They also perform operations such as block creation, deletion, and replication.
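A minimal sketch of (a): a client writes and then reads a file through the HDFS client API.
The client contacts the name node for metadata, while the bytes themselves stream to and
from the data nodes; the path and contents are illustrative:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt");

        // Write: blocks are created on data nodes chosen by the name node.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read: bytes stream back from the data nodes holding the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```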
Block
A file in a file system is divided into one or more segments, called blocks. In simple
words, the minimum amount of data that HDFS can read or write is called a block. Generally,
the default block size is 64 MB (128 MB in newer Hadoop versions), but the block size can be
increased as needed by changing the HDFS configuration (see the sketch below).
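As a sketch, a client can override the block size for the files it creates through the
dfs.blocksize property (the property's name in Hadoop 2.x and later); the same property can
also be set cluster-wide in hdfs-site.xml. The 128 MB value is just an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size used for files this client creates:
        // 128 MB = 134217728 bytes.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size in effect: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
        fs.close();
    }
}
```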
Objectives of HDFS
1. Fault detection and recovery
Since HDFS includes a large amount of commodity hardware, component failures are likely. To
overcome this, HDFS should have mechanisms for quick, automatic fault detection and
recovery; block replication, sketched after these objectives, is the main such mechanism.
2. Huge datasets
To manage applications with huge datasets, an HDFS cluster should scale to hundreds of nodes.
3. Hardware at data
A requested task can be done efficiently when the computation takes place near the data.
This reduces network traffic and increases throughput.
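Block replication is what makes the recovery of objective 1 automatic: each block is stored
on several data nodes (three by default), and the name node re-creates copies lost to
failures. As a sketch, the replication factor can also be changed per file through the
FileSystem API; the path and factor below are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Set the replication factor of one file to 3 copies; the name node
        // schedules the copies and re-replicates them after node failures.
        boolean ok = fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
        System.out.println("Replication change accepted: " + ok);
        fs.close();
    }
}
```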
Advantages of Hadoop
1. Varied data sources
2. Availability
3. Scalability
4. Cost-effectiveness
5. Low network traffic
6. Ease of use
7. Performance
8. High throughput
9. Compatibility
10. Fault tolerance
11. Open source
12. Multi-language support
Limitations of Hadoop
1. Issues with small files
2. Slow processing speed
3. Latency
4. Security
5. No real-time data processing
6. Uncertainty
7. Lengthy code base
8. Not easy to use
9. No caching
10. Supports only batch processing
Summary
This brings us to the end of this article on Hadoop. In this article you have learned what
Hadoop is, the architecture of Hadoop, its features, and the HDFS architecture. We have also
put together a curriculum that covers exactly what you would need to become an expert in
Hadoop development; you can have a look at the course details for Hadoop.