Introduction to Hadoop
Big Data Overview
• Big Data is a collection of data that is huge in volume and keeps growing exponentially with time.
• Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
• It cannot be processed using traditional computing techniques.
• In short, big data is still just data, but at a huge scale.
The 4 V’s in Big Data
• Volume of Big Data
• The volume of data refers to the size of the data sets that need to be analyzed and processed, which now frequently run into terabytes and petabytes. In other words, the data sets in Big Data are too large to process with a regular laptop or desktop processor. An example of a high-volume data set would be all credit card transactions on a single day within Europe.
• Velocity of Big Data
• Velocity refers to the speed with which data is generated. Examples of data generated with high velocity are Twitter messages and Facebook posts.
• Variety of Big Data
• Variety makes Big Data really big. Big Data comes from a great variety of sources and generally falls into one of three types: structured, semi-structured and unstructured data. An example of high-variety data sets would be the CCTV audio and video files that are generated at various locations in a city.
• Veracity of Big Data
• Veracity refers to the quality of the data being analyzed. High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. Low-veracity data, on the other hand, contains a high percentage of meaningless data. The non-valuable portion of these data sets is referred to as noise. An example of a high-veracity data set would be data from a medical experiment or trial.
Types of Big Data
• Structured data:- Structured data is data that has been predefined and formatted to a set structure before being placed in data storage. The best example of structured data is the relational database: the data has been formatted into precisely defined fields, such as credit card numbers or addresses, so that it can be easily queried with SQL.
• Semi-structured data:- Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Unstructured data:- Unstructured data is data stored in its native format and not processed until it is used, which is known as schema-on-read. It comes in a myriad of file formats, including email, social media posts, presentations, chats, IoT sensor data, and satellite imagery.
How Big Data is generated
• The bulk of big data generated comes from three primary sources: social data, machine data and transactional data.
Apache Hadoop Framework
• Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Core components of Hadoop
Hadoop Distributed File System (HDFS): the storage system for Hadoop, spread out over multiple machines to reduce cost and increase reliability.
MapReduce engine: the processing framework that filters and sorts the input data (the map and shuffle phases) and then aggregates it (the reduce phase); a word-count sketch follows below.
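To make the MapReduce programming model concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce Java API. The class names and the input/output paths taken from the command line are illustrative; the job reads text files from the first path argument and writes word counts to the second.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emits (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sums the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer logic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same pattern scales from a single machine to a full cluster: the framework splits the input, runs the mapper on each split in parallel, shuffles and sorts the intermediate (word, 1) pairs, and hands each word's values to one reducer call.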
Some Hadoop Users
Difference Between Hadoop and RDBMS
Cluster Modes in Hadoop
• Hadoop mainly works in three different modes:
1. Standalone Mode
2. Pseudo-distributed Mode (configured as sketched below)
3. Fully-distributed Mode
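As a rough sketch, pseudo-distributed mode on a single machine typically boils down to two small configuration entries: point the default file system at a local HDFS NameNode and drop the block replication factor to 1. The hostname, port 9000 and the replication value shown are common single-node defaults, not requirements.

<!-- core-site.xml: point the default file system at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so keep only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In standalone mode neither file is changed and everything runs inside a single local JVM against the local file system; fully-distributed mode uses the same properties but points them at a real NameNode host across a multi-node cluster.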
Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming-based data processing
•Spark: In-memory data processing
•Pig, Hive: Query-based processing of data services
•HBase: NoSQL database
•Mahout, Spark MLlib: Machine learning algorithm libraries
•Solr, Lucene: Searching and indexing
•ZooKeeper: Cluster management
•Oozie: Job scheduling
HDFS Daemons and MapReduce Daemons
Hadoop Cluster Architecture
• A Hadoop cluster architecture consists of data centres, racks and the nodes that actually execute the jobs. A data centre consists of racks, and racks consist of nodes. A medium to large cluster has a two- or three-level Hadoop cluster architecture built with rack-mounted servers. Every rack of servers is interconnected through 1 gigabit Ethernet (1 GbE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch, and the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
• With the Hadoop Distributed File System, data is written once on the server and subsequently read and re-used many times thereafter.
• The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across the different data nodes. A minimal client-side example of talking to HDFS follows below.
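As a minimal sketch, the snippet below uses the Hadoop Java client to ask the NameNode for the metadata of a directory. It assumes the Hadoop client libraries and a core-site.xml pointing at the cluster are on the classpath; the /user path is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the configuration on the classpath,
    // so FileSystem.get() returns a client for the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // listStatus() is a pure metadata query answered by the NameNode;
    // no DataNode is contacted until file contents are actually read.
    for (FileStatus status : fs.listStatus(new Path("/user"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}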
HDFS Components
• A Hadoop cluster consists of three components –
• Master Node – The master node in a Hadoop cluster is responsible for coordinating data storage in HDFS and parallel computation on the stored data using MapReduce. It runs three daemons – NameNode, Secondary NameNode and JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function of HDFS. The NameNode keeps track of all the metadata on files, such as the access time of a file, which user is accessing a file at the current time, and on which nodes in the cluster each file's blocks are saved. The Secondary NameNode keeps a backup (checkpoint) of the NameNode metadata.
• Slave/Worker Node – This component in a Hadoop cluster is responsible for storing the data and performing computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node in the cluster. The DataNode service reports to the NameNode and the TaskTracker service reports to the JobTracker.
• Client Nodes – A client node has Hadoop installed with all the required cluster configuration settings and is responsible for loading data into the Hadoop cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and then retrieves the output once the job processing is completed.
HDFS Architecture
• Hadoop follows a master-slave architecture design for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster, which store data and perform complex computations. Every slave node runs a TaskTracker daemon and a DataNode daemon that synchronize with the JobTracker and the NameNode respectively. In a Hadoop architectural implementation, the master and slave systems can be set up in the cloud or on-premises.
HDFS Read File
• Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
• Step 2: The DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node for the first block in the file.
• Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the connection to that data node and finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
• Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
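As a minimal client-side sketch of this read path, the snippet below opens a file on HDFS and streams its contents to standard output. The path /new_file/test reuses the example path from the command section later on; the Hadoop client libraries and cluster configuration are assumed to be on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // DistributedFileSystem when fs.defaultFS is an hdfs:// URI

    // Steps 1-2: open() asks the name node (over RPC) for the block locations
    // and hands back an FSDataInputStream wrapping a DFSInputStream.
    try (FSDataInputStream in = fs.open(new Path("/new_file/test"))) {
      // Steps 3-5: read() streams data from the closest data node holding each block.
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    } // Step 6: close() on the stream happens automatically via try-with-resources.
  }
}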
HDFS Write File
• Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
• Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and the client receives an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
• Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
• Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
• Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.
• Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
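A minimal client-side sketch of this write path: it creates a small file on HDFS and lets close() complete it on the name node. The path /new_file/output.txt is purely illustrative, and overwrite is set to false so the name node's existence check described in Step 2 applies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Steps 1-2: create() asks the name node to add the file to the namespace;
    // with overwrite=false it fails with an IOException if the file already exists.
    try (FSDataOutputStream out = fs.create(new Path("/new_file/output.txt"), false)) {
      // Steps 3-5: writes are split into packets and pushed through the
      // data node replication pipeline behind the scenes.
      out.writeUTF("hello from the HDFS write path");
    } // Step 6: close() flushes the remaining packets and completes the file on the name node.
  }
}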
Some Basic Hadoop Commands
• cat
HDFS Command that reads a file on HDFS and prints the content of that file to the
standard output.
Command: hdfs dfs -cat /new_file/test
• text
HDFS Command that takes a source file and outputs the file in text format.
Command: hdfs dfs -text /new_file/test
• copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Command: hdfs dfs -copyFromLocal /home/file/test /new_file
• copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Command: hdfs dfs -copyToLocal /file/test /home/file
• put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Command: hdfs dfs -put /home/file/test /user
• get
HDFS Command to copy files from hdfs to the local file system.
Command: hdfs dfs -get /user/test /home/file
• count
HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
Command: hdfs dfs -count /user
• rm
HDFS Command to remove the file from HDFS.
Command: hdfs dfs -rm /new_file/test
• rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
Command: hdfs dfs -rm -r /new_file
• cp
HDFS Command to copy files from source to destination. This command allows multiple
sources as well, in which case the destination must be a directory.
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
• mv
HDFS Command to move files from source to destination. This command allows multiple
sources as well, in which case the destination needs to be a directory.
Command: hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
• rmdir
HDFS Command to remove the directory.
Command: hdfs dfs -rmdir /user/hadoop
• help
HDFS Command that displays help for given command or all commands if none is
specified.
Command: hdfs dfs -help