Module 2_Chapter 3_HDFS DATA STORAGE.pptx

HDFS DATA STORAGE
• RECALL:
– The Hadoop data store concept implies storing the
data across a number of clusters.
• Each cluster has a number of data stores called
racks.
• Each rack holds a number of DataNodes.

• Each DataNode has a large number of data
blocks.
• The racks are distributed across a cluster.
• The nodes have processing and storage
capabilities.

• By default, each data block is replicated on at
least three DataNodes, on the same or remote
racks.
• A file containing the data is divided into data
blocks.
• The default data block size is 64 MB (in Hadoop 1; Hadoop 2 and later default to 128 MB).
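As a rough worked example of the block arithmetic above, the sketch below splits a file into fixed-size blocks and counts the replicas. The function name `plan_blocks` and the 200 MB file size are illustrative assumptions; the 64 MB block size and the replication factor of three are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64      # default block size quoted on the slide
REPLICATION = 3         # default number of replicas per block


def plan_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, size of last block, total replicas, raw storage in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0, 0
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies the data that remains; it is not padded.
    last_block_mb = file_size_mb - (n_blocks - 1) * block_size_mb
    total_replicas = n_blocks * replication
    raw_storage_mb = file_size_mb * replication
    return n_blocks, last_block_mb, total_replicas, raw_storage_mb


# Hypothetical 200 MB file: 4 blocks, the last one 8 MB, 12 replicas, 600 MB raw.
blocks, last_mb, replicas, raw_mb = plan_blocks(200)
print(blocks, last_mb, replicas, raw_mb)
```

Under these assumptions a 200 MB file needs four blocks (three full blocks and one 8 MB block), and the cluster stores twelve block replicas in total.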

Hadoop Physical organization
• A conventional file system uses directories.
• A directory consists of folders.
• A folder consists of files.
• When data is processed, the data sources are identified by
pointers.
• A data dictionary stores the resource pointers.

• The master tables of the dictionary are stored at a
central location.
• The centrally stored tables make
administration easier when the data sources
change during processing.

Recall: NameNode & DataNode
• The NameNode stores the file metadata.
• The metadata gives information about the files of the user
application, but does not participate in the
computations.
• The DataNode stores the actual data files in the data
blocks.
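The split between metadata and data can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names `ToyNameNode` and `ToyDataNode`, the block ids, and the file path are all hypothetical, and no real Hadoop API is involved.

```python
# Toy model: the NameNode holds only metadata, the DataNodes hold the bytes.
# All names (ToyNameNode, ToyDataNode, block ids, paths) are hypothetical.

class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    def __init__(self):
        self.block_map = {}         # file path -> ordered list of (block id, DataNode name)

    def register(self, path, placements):
        self.block_map[path] = list(placements)

    def locate(self, path):
        # Returns locations only; no file data passes through the NameNode.
        return self.block_map[path]


datanodes = {"dn1": ToyDataNode("dn1"), "dn2": ToyDataNode("dn2")}
namenode = ToyNameNode()

datanodes["dn1"].store("blk_1", b"hello ")
datanodes["dn2"].store("blk_2", b"hdfs")
namenode.register("/user/demo/greeting.txt", [("blk_1", "dn1"), ("blk_2", "dn2")])

# The client asks the NameNode where the blocks are, then reads from the DataNodes.
content = b"".join(
    datanodes[dn].read(blk) for blk, dn in namenode.locate("/user/demo/greeting.txt")
)
print(content.decode())
```

The design point is that the lookup and the read are separate steps, so the metadata server never handles the file contents.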

[Figure: The client, master NameNode, MasterNodes and slave nodes]

• Clients are users who run applications with the
help of Hadoop ecosystem projects.
• Examples of ecosystem projects: Hive, Mahout and Pig.
• A single MasterNode provides HDFS, MapReduce and
HBase using threads in small to medium-sized clusters.

MasterNode
• A MasterNode fundamentally plays the role of a coordinator.
• The MasterNode
– receives client connections,
– maintains the description of the global file system namespace and
the allocation of file blocks.
• It also monitors the state of the system in order to detect any failure.
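The slides do not say how the MasterNode detects failures. One common approach is to track periodic heartbeats from the slave nodes, and the sketch below assumes that mechanism. The class name `HeartbeatMonitor`, the node names, and the 30-second timeout are hypothetical choices for illustration.

```python
# Hypothetical heartbeat monitor: a node is flagged as failed when its last
# heartbeat is older than the timeout. Names and the timeout are assumptions.

class HeartbeatMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}         # node name -> time of last heartbeat (seconds)

    def heartbeat(self, node, now_s):
        self.last_seen[node] = now_s

    def failed_nodes(self, now_s):
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("slave1", now_s=0)
monitor.heartbeat("slave2", now_s=0)
monitor.heartbeat("slave1", now_s=25)   # slave2 stays silent after time 0

print(monitor.failed_nodes(now_s=40))   # only slave2 has exceeded the timeout
```

Passing the clock value in explicitly keeps the sketch deterministic; a real monitor would read the system clock instead.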

Components of Master
1. NameNode
2. Secondary NameNode
3. JobTracker
• The NameNode stores all the file system related information,
such as:
– Which part of the cluster stores each section of a file
– The last access time of the files
– User permissions, such as which users have access to the
file
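The three kinds of information listed above could be grouped into one record per file. The following is a minimal sketch, not the NameNode's actual data structure: the `FileMetadata` class, its field names, and the sample values are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    """Hypothetical record of the per-file information listed above."""
    path: str
    block_locations: dict = field(default_factory=dict)  # block id -> list of DataNode names
    last_access: str = ""                                 # timestamp kept as text for simplicity
    readers: set = field(default_factory=set)             # users allowed to read the file

    def can_read(self, user):
        return user in self.readers


meta = FileMetadata(
    path="/user/demo/sales.csv",
    block_locations={"blk_1": ["dn1", "dn4", "dn7"]},
    last_access="2024-01-01T00:00:00",
    readers={"alice"},
)

print(meta.can_read("alice"), meta.can_read("bob"))
```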

Secondary NameNode
• When the cluster size is large, multiple servers are
used, for example to balance the load.
• The Secondary NameNode provides NameNode
management services, and ZooKeeper is used by
HBase for metadata storage.

Secondary NameNode
• A helper for the NameNode rather than a failover replacement
• It keeps a copy of the NameNode metadata
• The stored metadata can be rebuilt easily in case
of NameNode failure
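In HDFS, the Secondary NameNode keeps its copy current by periodically merging the NameNode's namespace image with the log of recent edits. The sketch below shows that merge in a heavily simplified form. The function name `checkpoint`, the two operation names, and the paths are illustrative assumptions, not the real on-disk formats.

```python
# Simplified checkpoint: replay an edit log on top of the last namespace image.
# The operation names ("create", "delete") and paths are illustrative only.

def checkpoint(image, edit_log):
    """Return a new namespace image with every logged edit applied in order."""
    merged = set(image)
    for op, path in edit_log:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


fsimage = {"/data/a.txt", "/data/b.txt"}
edits = [("create", "/data/c.txt"), ("delete", "/data/a.txt")]

new_image = checkpoint(fsimage, edits)
print(sorted(new_image))
```

After a checkpoint the edit log can start empty again, which is why the metadata can be rebuilt quickly from the latest image.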

• The JobTracker coordinates the parallel
processing of data.
• Working with the masters and slaves, the Hadoop client (node):
– Loads the data into the cluster
– Submits the processing job
– Retrieves the data to see the response after
job completion

Hadoop2
• Limitations of Hadoop1:
– A single NameNode failure is an operational limitation
– Scaling up was restricted to little more than a few thousand
DataNodes and a small number of clusters
• Hadoop2 provides multiple NameNodes
• This enables higher resource availability

Components of Hadoop2
• An associated NameNode
• A ZooKeeper coordination client, which functions as a centralized
repository for distributed applications.
– Uses synchronization, serialization and coordination
activities.
– Enables a distributed system to function as a single
function
• An associated JournalNode (JN):
– Keeps records of the state, the resources
assigned, and the intermediate results or execution of application
tasks.
– Distributed applications can read data from and write data to a JN

Steps taken by the system in case of failure
• One set of resources is in the active state
• The other one remains in the standby state.
• There are two masters:
– One, MN1, is in the active state
– The other, MN2, is in the secondary (standby) state
• This ensures availability in case of a network fault of
the active NameNode NM1
• The system then activates the secondary NameNode NM2
and creates a new secondary in another MasterNode, MN3,
that was unused earlier.

• The entries are copied from JN1 in MN1 into
JN2, which is at the newly active MasterNode
MN2
• Therefore, the application runs
uninterrupted and the resources remain
available
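The failover steps described above can be sketched as a small state change. MN1, MN2, MN3, JN1 and JN2 are the labels used on the slides; the `MasterNode` class, the `fail_over` function, and the journal entries are hypothetical and only model the sequence, not Hadoop's real implementation.

```python
# Toy failover following the steps above. MN1, MN2, MN3, JN1 and JN2 are the
# labels used on the slides; the class and function names are hypothetical.

class MasterNode:
    def __init__(self, name, state="unused"):
        self.name = name
        self.state = state          # "active", "standby", "unused" or "failed"
        self.journal = []           # entries held by this node's JournalNode


def fail_over(active, standby, spare):
    """Promote the standby, copy the journal entries, and enlist the spare."""
    standby.journal = list(active.journal)   # entries copy from JN1 into JN2
    active.state = "failed"
    standby.state = "active"
    spare.state = "standby"                  # new standby created in MN3
    return standby


mn1 = MasterNode("MN1", "active")
mn2 = MasterNode("MN2", "standby")
mn3 = MasterNode("MN3")

mn1.journal += ["task-1 assigned", "task-1 intermediate result"]
new_active = fail_over(mn1, mn2, mn3)

print(new_active.name, new_active.journal, mn3.state)
```

Because the journal entries travel with the promotion, the newly active node can continue from the recorded state instead of restarting the application.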

HDFS COMMANDS
• The HDFS shell is not compliant with POSIX.
• Thus, the shell does not behave exactly like a UNIX or Linux shell.
• Commands for interacting with the files in HDFS take the form
– bin/hdfs dfs <args>
– where args stands for the command arguments
– All Hadoop commands are invoked by the bin/hadoop
script:
• % hadoop fsck / -files -blocks

Commands examples
• -copyToLocal
– Copies a file from HDFS to
the local file system.
• -cat
– Copies the source file to standard output (stdout).
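To show how the two commands above are put together, the sketch below builds their argument lists without running them. The helper name `hdfs_dfs` and the file paths are hypothetical; it assumes the `hdfs` executable would be on the PATH of a machine with Hadoop installed.

```python
import shlex

# Builds argument lists for the two commands above without running them.
# The HDFS path and the local path are hypothetical examples.

def hdfs_dfs(*args):
    """Return the argv list for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs", *args]


copy_cmd = hdfs_dfs("-copyToLocal", "/user/demo/report.txt", "./report.txt")
cat_cmd = hdfs_dfs("-cat", "/user/demo/report.txt")

for cmd in (copy_cmd, cat_cmd):
    print(shlex.join(cmd))

# On a machine with Hadoop installed, a command could be run with, for example:
#   import subprocess
#   subprocess.run(copy_cmd, check=True)
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems when a path contains spaces.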

HDFS Summary
• HDFS uses a master/slave model designed for large file
reading/streaming.
• The NameNode is a metadata server or "data traffic cop."
• HDFS provides a single namespace that is managed by the
NameNode.
• Data is redundantly stored on DataNodes; there is no data on
the NameNode.
• The Secondary NameNode performs checkpoints of the NameNode
file system's state but is not a failover node.
