Hadoop Distributed File System
Big Data Analytics
Nadar Saraswathi College of Arts & Science
Submitted By
N. Nagapandiyammal
M.Sc Computer Science
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary
data storage system used by Hadoop applications.
• It employs a NameNode and DataNode architecture to
implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop
clusters.
• HDFS is a key part of many Hadoop ecosystem
technologies, as it provides a reliable means for managing
pools of big data and supporting related big data
analytics applications.
• HDFS is a distributed, scalable, and portable file system
written in Java for the Hadoop framework.
Hadoop 1.x has five services
Three of them belong to HDFS:
1. NameNode
2. Secondary NameNode
3. DataNode
Two belong to MapReduce:
4. JobTracker
5. TaskTracker
NameNode
• HDFS has a single active NameNode, also called the master
node. It tracks the files and manages the file system. It holds
only the metadata; the file contents themselves are stored on
the DataNodes.
• Specifically, the NameNode records how many blocks each file
has, which DataNodes store each block, and where the replicas
of each block are kept.
• Because there is only one NameNode, it is a single point of
failure (Hadoop 2 and later can run a standby NameNode for high
availability). Clients connect to the NameNode directly.
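The bookkeeping described above can be pictured as a lookup table from file path to blocks, and from each block to the DataNodes holding a replica. The following is a minimal sketch of that idea, not HDFS's actual data structures; the class names, the file path and the node names (`dn1` and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BlockInfo:
    """One block of a file and the DataNodes that hold a replica of it."""
    block_id: int
    datanodes: List[str]


@dataclass
class NameNodeMetadata:
    """Toy namespace: file path -> ordered list of blocks (metadata only)."""
    files: Dict[str, List[BlockInfo]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: List[BlockInfo]) -> None:
        self.files[path] = blocks

    def block_count(self, path: str) -> int:
        return len(self.files[path])

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        """Answer a client's question: which DataNodes hold each block?"""
        return [(b.block_id, b.datanodes) for b in self.files[path]]


# Hypothetical file with two blocks, each replicated on three DataNodes.
meta = NameNodeMetadata()
meta.add_file(
    "/logs/day1.txt",
    [BlockInfo(1, ["dn1", "dn2", "dn3"]), BlockInfo(2, ["dn2", "dn3", "dn4"])],
)
print(meta.block_count("/logs/day1.txt"))
print(meta.locate("/logs/day1.txt"))
```

Note that the table contains no file contents at all: a client uses the answer from `locate` to read the blocks from the DataNodes themselves.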
DataNode
• A DataNode stores data as blocks. It is also known as a
slave node: it holds the actual HDFS data and serves the
client's read and write requests.
• DataNodes run as slave daemons. Every DataNode sends a
heartbeat message to the NameNode every 3 seconds (the default
interval) to signal that it is alive.
• If the NameNode receives no heartbeat from a DataNode for a
timeout period (about 10.5 minutes with the default settings;
the value is configurable), it treats that DataNode as dead and
starts re-replicating its blocks on other DataNodes.
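The dead-node timeout is derived from two settings rather than set directly. The sketch below assumes the stock Apache Hadoop defaults for `dfs.heartbeat.interval` (3 seconds) and `dfs.namenode.heartbeat.recheck-interval` (5 minutes) and the usual rule of twice the recheck interval plus ten heartbeat intervals; the function names, node names and timestamps are hypothetical.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 3   # dfs.heartbeat.interval, default 3 seconds
RECHECK_INTERVAL_S = 300   # dfs.namenode.heartbeat.recheck-interval, default 5 minutes


def dead_node_timeout(heartbeat_s: int, recheck_s: int) -> int:
    """Seconds of silence after which a DataNode is considered dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat: Dict[str, float], now: float,
                    timeout_s: float) -> List[str]:
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


timeout = dead_node_timeout(HEARTBEAT_INTERVAL_S, RECHECK_INTERVAL_S)

# Hypothetical "last heartbeat received" timestamps, in seconds.
last_seen = {"dn1": 1000.0, "dn2": 300.0, "dn3": 998.0}
dead = find_dead_nodes(last_seen, now=1000.0, timeout_s=timeout)

print(timeout)  # 630 seconds, i.e. 10 minutes 30 seconds
print(dead)     # ['dn2'], silent for 700 seconds
```

Blocks that lived on a node reported as dead would then be copied from their surviving replicas to other DataNodes.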
Secondary NameNode
• The Secondary NameNode only takes care of checkpoints of the
file system metadata held by the NameNode.
• It is also known as the checkpoint node. It is a helper
for the NameNode, not a standby replacement for it.
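A checkpoint, in HDFS terms, means folding the NameNode's log of recent changes (the edit log) into its last saved snapshot of the namespace (the fsimage), so the log can start again from empty. The following is a simplified sketch of that merge: the namespace is modelled as a set of paths and only two hypothetical operations, `create` and `delete`, are supported.

```python
from typing import FrozenSet, List, Set, Tuple

Edit = Tuple[str, str]  # (operation, path)


def apply_edit(namespace: Set[str], edit: Edit) -> None:
    """Replay one logged change against the in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage: FrozenSet[str],
               edit_log: List[Edit]) -> Tuple[FrozenSet[str], List[Edit]]:
    """Merge the edit log into the snapshot; return (new snapshot, empty log)."""
    merged = set(fsimage)
    for edit in edit_log:
        apply_edit(merged, edit)
    return frozenset(merged), []


old_image = frozenset({"/a"})
edits = [("create", "/b"), ("delete", "/a")]
new_image, new_log = checkpoint(old_image, edits)
print(sorted(new_image))  # ['/b']
print(new_log)            # []
```

Doing this merge on a helper node keeps the edit log short without loading the NameNode, which is why the helper cannot simply take over if the NameNode fails.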
JobTracker
• The JobTracker handles the processing of the data. It
receives requests for MapReduce execution from the client.
• The JobTracker asks the NameNode where the data to be
processed is located.
• The NameNode responds by giving the metadata to the
JobTracker.
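One reason the JobTracker needs block locations is that it can then try to run each task on a node that already stores the data. The sketch below shows such a locality-first assignment in its simplest form; it is an illustration under that assumption, not the JobTracker's real scheduler, and the block and node names are hypothetical.

```python
from typing import Dict, List


def assign_tasks(block_locations: Dict[str, List[str]],
                 free_trackers: List[str]) -> Dict[str, str]:
    """Give each block's task to a free node, preferring one that holds the block."""
    assignments: Dict[str, str] = {}
    available = list(free_trackers)
    for block_id, holders in block_locations.items():
        if not available:
            break  # no free node left; remaining tasks would have to wait
        local = [t for t in available if t in holders]
        chosen = local[0] if local else available[0]
        assignments[block_id] = chosen
        available.remove(chosen)
    return assignments


# Block locations as the NameNode might report them (hypothetical values).
locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node4"],
}
plan = assign_tasks(locations, ["node2", "node3", "node5"])
print(plan)  # {'blk_1': 'node2', 'blk_2': 'node3', 'blk_3': 'node5'}
```

Here `blk_1` and `blk_2` are processed where they are stored, while `blk_3` falls back to a remote node because its only holder is not free.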
TaskTracker
• The TaskTracker is the slave node for the JobTracker. It
takes tasks from the JobTracker and also receives the code to
run from it.
• The TaskTracker applies that code to the file. This step is
the map phase, and the code applied to the file is known as
the mapper.
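As an illustration of code being applied to a file, the sketch below runs a word-count mapper over the lines of one block. It is plain Python rather than the Hadoop MapReduce API, and the function names and input lines are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def word_count_mapper(line: str) -> Iterator[Pair]:
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def run_map_task(block_lines: Iterable[str],
                 mapper: Callable[[str], Iterable[Pair]]) -> List[Pair]:
    """Apply the mapper to every line of a block and collect its output."""
    output: List[Pair] = []
    for line in block_lines:
        output.extend(mapper(line))
    return output


pairs = run_map_task(["Hadoop stores data", "Hadoop processes data"],
                     word_count_mapper)
print(pairs)
```

A later reduce step would group these pairs by word and add up the counts.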
Other file systems
• HDFS: Hadoop's own rack-aware file system. This is designed
to scale to tens of petabytes of storage and runs on top of the
file systems of the underlying operating systems.
• FTP file system: This stores all its data on remotely accessible
FTP servers.
• Amazon S3 (Simple Storage Service) object storage: This is
targeted at clusters hosted on the Amazon Elastic Compute
Cloud server-on-demand infrastructure. There is no
rack-awareness in this file system, as it is all remote.
• Windows Azure Storage Blobs (WASB) file system: This is an
extension of HDFS that allows distributions of Hadoop to
access data in Azure blob stores without moving the data
permanently into the cluster.
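In practice a Hadoop path names its storage system through the URI scheme. The sketch below maps the schemes `hdfs`, `ftp`, `s3a` and `wasb` to the four entries above using only Python's standard URL parser; the host, bucket, container and account names are hypothetical, and the sketch does not connect to anything.

```python
from typing import Tuple
from urllib.parse import urlsplit

SCHEME_TO_STORE = {
    "hdfs": "HDFS",
    "ftp": "FTP file system",
    "s3a": "Amazon S3 object storage",
    "wasb": "Windows Azure Storage Blobs",
}


def describe(uri: str) -> Tuple[str, str, str]:
    """Return (storage system, authority, path) for a Hadoop-style URI."""
    parts = urlsplit(uri)
    store = SCHEME_TO_STORE.get(parts.scheme)
    if store is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return store, parts.netloc, parts.path


# Hypothetical locations, one per storage system.
for uri in [
    "hdfs://namenode.example.com:8020/user/data.csv",
    "ftp://ftp.example.com/pub/data.csv",
    "s3a://example-bucket/logs/data.csv",
    "wasb://container@account.blob.core.windows.net/data.csv",
]:
    print(describe(uri))
```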
Why use HDFS?
• The Hadoop Distributed File System arose at Yahoo as a
part of that company's ad serving and search engine
requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were
accessed by a growing number of users, who were creating
more and more data.
• Facebook, eBay, LinkedIn and Twitter are among the web
companies that used HDFS to underpin big data analytics to
address these same requirements.
• HDFS was used by The New York Times as part of
large-scale image conversions, Media6Degrees for log processing
and machine learning, LiveBet for log storage and odds
analysis, Joost for session analysis and Fox Audience
Network for log analysis and data mining.
• HDFS is also at the core of many open source data
warehouse alternatives, sometimes called data lakes.
HDFS and Hadoop history
• In 2006, Hadoop's originators ceded their work on HDFS and
MapReduce to the Apache Software Foundation. In 2012,
HDFS and Hadoop became available in Version 1.0. The basic HDFS
standard has been continuously updated since its inception.
• With Version 2.0 of Hadoop in 2013, a general-purpose YARN
resource manager was added, and MapReduce and HDFS were
effectively decoupled. Thereafter, diverse data processing frameworks
and file systems were supported by Hadoop.
• While MapReduce was often replaced by Apache Spark, HDFS
continued to be a prevalent file system for Hadoop. After four alpha
releases and one beta, Apache Hadoop 3.0.0 became generally
available in December 2017, with HDFS enhancements supporting
additional NameNodes, erasure coding facilities and greater data
compression.
• At the same time, HDFS tooling, such as LinkedIn's open
source Dr. Elephant and Dynamometer performance testing tools, has
advanced to enable development of ever larger HDFS
implementations.
Thank You
