HDFS DATA STORAGE
M2,C3
• RECALL:
– The Hadoop data store concept implies storing the data across a number of clusters.
• Each cluster has a number of data stores called racks.
• Each rack holds a number of DataNodes.
• Each DataNode has a large number of data blocks.
• The racks are distributed across a cluster.
• The nodes have processing and storage capabilities.
• By default, each data block is replicated on three DataNodes, which may be in the same rack or in remote racks.
• A file containing the data is divided into data blocks.
• The default data block size is 64 MB (the Hadoop 1 default; Hadoop 2 raised it to 128 MB).
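The arithmetic behind these defaults can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the function names `split_into_blocks` and `raw_storage_mb` and the 200 MB example file are assumptions made for the sketch, while the 64 MB block size and the replication factor of three come from the bullets above.

```python
BLOCK_SIZE_MB = 64       # default block size stated on the slide
REPLICATION_FACTOR = 3   # default number of replicas stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * int(full_blocks)
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Total storage consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


blocks = split_into_blocks(200)   # hypothetical 200 MB file
print(blocks)                     # [64, 64, 64, 8]
print(len(blocks))                # 4
print(raw_storage_mb(200))        # 600
```

A 200 MB file therefore occupies four blocks, and with three replicas it consumes about 600 MB of raw cluster storage.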
HDFS FEATURES
Hadoop Physical organization
• The conventional file system uses directories.
• A directory consists of folders.
• A folder consists of files.
• When data is processed, the data sources are identified by pointers.
• A data dictionary stores the resource pointers.
• The dictionary's master tables are stored at a central location.
• The centrally stored tables make administration easier when the data sources change during processing.
Recall: NameNode & DataNode
• The NameNode stores the files' metadata.
• Metadata gives information about the files of a user application but does not participate in the computations.
• The DataNodes store the actual file data in data blocks.
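The division of labour between the two node types can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the HDFS implementation: the classes `NameNode` and `DataNode`, their methods, the node names `dn1` to `dn3`, and the path `/user/demo.txt` are all hypothetical and exist only to show that the metadata server holds block locations while the DataNodes hold the bytes.

```python
class DataNode:
    """Holds block contents; knows nothing about file names (toy model)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_to_blocks = {}   # path -> [block_id, ...]
        self.block_locations = {}  # block_id -> [DataNode name, ...]

    def register(self, path, block_id, datanode_names):
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanode_names)

    def locate(self, path):
        """Return (block_id, locations) pairs for every block of the file."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]


namenode = NameNode()
datanodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}  # hypothetical names

# Store two blocks of a hypothetical file on all three DataNodes.
for block_id, chunk in enumerate([b"hello ", b"world"]):
    for dn in datanodes.values():
        dn.store(block_id, chunk)
    namenode.register("/user/demo.txt", block_id, datanodes.keys())

# A client asks the NameNode where the blocks are, then reads from a DataNode.
locations = namenode.locate("/user/demo.txt")
content = b"".join(datanodes[nodes[0]].blocks[b] for b, nodes in locations)
print(locations)
print(content)
```

Note that the `NameNode` object never sees the file contents; a client only uses it to find out which DataNodes to read from.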
Figure: The client, master NameNode, MasterNodes and slave nodes
• Clients are users who run the application with the help of Hadoop ecosystem projects.
• Examples of ecosystem projects: Hive, Mahout and Pig.
• A single MasterNode provides HDFS, MapReduce and HBase using threads in small to medium-sized clusters.
MasterNode
• A MasterNode fundamentally plays the role of a coordinator.
• The MasterNode
– receives client connections,
– maintains the description of the global file system namespace and the allocation of file blocks.
• It also monitors the state of the system in order to detect any failure.
Components of Master
1. NameNode
2. Secondary NameNode
3. JobTracker
• The NameNode stores all the file system related information, such as:
– Which part of the cluster stores each section of a file
– Last access time for the files
– User permissions, such as which users have access to the file
Secondary NameNode
• When the cluster size is large, multiple servers are used, for example to balance the load.
• The secondary NameNode provides NameNode management services, and ZooKeeper is used by HBase for metadata storage.
Secondary NameNode
• An auxiliary node for the NameNode; it is not a failover node
• It keeps a checkpointed copy of the NameNode metadata
• The stored metadata can be used to rebuild the namespace in case of NameNode failure
• The JobTracker coordinates the parallel processing of data.
• Masters, slaves and the Hadoop client (node):
– Load the data into the cluster
– Submit the processing job
– Retrieve the data to see the response after job completion
Hadoop 2
• Limitations of Hadoop 1:
– The single NameNode is a single point of failure, which is an operational limitation
– Scaling up was restricted to a few thousand DataNodes and a small number of clusters
• Hadoop 2 provides multiple NameNodes.
• This enables higher resource availability.
Components of Hadoop 2
• An associated NameNode
• ZooKeeper coordination client, which functions as a centralized repository for distributed applications.
– Uses synchronisation, serialization and coordination activities.
– Enables a distributed system to function as a single function.
• Associated JournalNode (JN):
– Keeps the records of the state, the resources assigned, and the intermediate results or execution of application tasks.
– Distributed applications can read data from and write data to a JN.
Steps taken by the system in case of failure
• One set of resources is in active state.
• The other one remains in standby state.
• Two masters:
– One, MN1, is in active state
– Another, MN2, is in secondary state
• That ensures availability in case of a network fault at the active NameNode NM1.
• The system activates the secondary NameNode NM2 and creates a new secondary in another MasterNode, MN3, which was unused earlier.
• The entries are copied from JN1 in MN1 into JN2, which is at the newly active MasterNode MN2.
• Therefore, the application runs uninterrupted and resources remain available.
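The failover steps described above can be traced with a small Python sketch. It is a toy model of the slide's description only and makes no claim about how HDFS implements high availability internally: the `MasterNode` class, the `fail_over` function, the state strings and the journal entries are all assumptions introduced for illustration.

```python
class MasterNode:
    """Toy MasterNode with a state and a journal standing in for its JN."""

    def __init__(self, name):
        self.name = name
        self.state = "unused"   # "active", "secondary", "unused" or "failed"
        self.journal = []       # stands in for the JournalNode entries


def fail_over(active, secondary, spare):
    """Promote the secondary master and prepare a new secondary from a spare."""
    secondary.journal = list(active.journal)  # copy JN1 entries into JN2
    active.state = "failed"
    secondary.state = "active"                # MN2 takes over
    spare.state = "secondary"                 # MN3, unused earlier, becomes the secondary
    return secondary


mn1, mn2, mn3 = MasterNode("MN1"), MasterNode("MN2"), MasterNode("MN3")
mn1.state, mn2.state = "active", "secondary"
mn1.journal.extend(["task-1 started", "task-1 finished"])  # hypothetical entries

new_active = fail_over(mn1, mn2, mn3)
print(new_active.name, new_active.state)  # MN2 active
print(new_active.journal)                 # ['task-1 started', 'task-1 finished']
print(mn3.name, mn3.state)                # MN3 secondary
```

Because the journal entries travel with the promotion, the newly active master can continue from the recorded state, which is the reason the slide gives for uninterrupted operation.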
HDFS COMMANDS
• The HDFS shell is not POSIX compliant.
• Thus, the shell does not behave exactly like a UNIX or Linux shell.
• Commands for interacting with files in HDFS take the form
– bin/hdfs dfs <args>
– where args stands for the command arguments.
– All Hadoop commands are invoked by the bin/hadoop script:
• % hadoop fsck / -files -blocks
Command examples
• -copyToLocal
– Copies a file from HDFS to the local file system.
• -cat
– Copies a file's contents to standard output (stdout).
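As a rough illustration of how these two options are combined with `hdfs dfs`, the Python sketch below builds the argument lists without executing anything. The helper name `hdfs_dfs_command` and the paths `/user/demo/report.txt` and `./report.txt` are hypothetical; on a machine with Hadoop installed, such a list could be passed to `subprocess.run`.

```python
def hdfs_dfs_command(option, *args):
    """Build the argument list for an `hdfs dfs` invocation without running it."""
    return ["hdfs", "dfs", option, *args]


# Copy a (hypothetical) HDFS file to the local working directory.
copy_cmd = hdfs_dfs_command("-copyToLocal", "/user/demo/report.txt", "./report.txt")

# Print the same (hypothetical) HDFS file to standard output.
cat_cmd = hdfs_dfs_command("-cat", "/user/demo/report.txt")

print(" ".join(copy_cmd))  # hdfs dfs -copyToLocal /user/demo/report.txt ./report.txt
print(" ".join(cat_cmd))   # hdfs dfs -cat /user/demo/report.txt
```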
HDFS COMMANDS usage
HDFS Summary
• HDFS uses a master/slave model designed for large file reading/streaming.
• The NameNode is a metadata server or "data traffic cop."
• HDFS provides a single namespace that is managed by the NameNode.
• Data is redundantly stored on DataNodes; there is no data on the NameNode.
• The SecondaryNameNode performs checkpoints of the NameNode file system's state but is not a failover node.
END OF M2,C3