Introduction to Hadoop
Big Data Overview
• Big Data is a collection of data that is huge in volume and keeps growing exponentially with time.
• Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
• It cannot be processed using traditional computing techniques.
• In short, big data is still just data, but at a huge scale.
The 4 V’s in Big Data
• Volume of Big Data
• The volume of data refers to the size of the data sets that need to be analyzed and processed, which now frequently run into terabytes and petabytes. In other words, the data sets in Big Data are too large to process with a regular laptop or desktop processor. An example of a high-volume data set would be all credit card transactions on a single day within Europe.
• Velocity of Big Data
• Velocity refers to the speed with which data is generated. Examples of data generated with high velocity are Twitter messages and Facebook posts.
• Variety of Big Data
• Variety makes Big Data really big. Big Data comes from a great variety of sources and generally falls into one of three types: structured, semi-structured and unstructured data. An example of high-variety data sets would be the CCTV audio and video files that are generated at various locations in a city.
• Veracity of Big Data
• Veracity refers to the quality of the data being analyzed. High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. Low-veracity data, on the other hand, contains a high percentage of meaningless data. The non-valuable portion of these data sets is referred to as noise. An example of a high-veracity data set would be data from a medical experiment or trial.
Types of Big Data
• Structured data:- Structured data is data that has been predefined and formatted to a set structure before being placed in data storage. The best example of structured data is the relational database: the data has been formatted into precisely defined fields, such as credit card numbers or addresses, so that it can be easily queried with SQL.
• Semi-structured data:- Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Unstructured data:- Unstructured data is data stored in its native format and not processed until it is used, which is known as schema-on-read. It comes in a myriad of file formats, including email, social media posts, presentations, chats, IoT sensor data, and satellite imagery.
How Big Data is generated
• The bulk of big data generated comes from three primary sources: social data, machine data and transactional data.
Apache Hadoop Framework
• Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Core components of Hadoop
Hadoop Distributed File System (HDFS): the storage system for Hadoop, spread out over multiple machines to reduce cost and increase reliability.
MapReduce engine: the processing framework that filters and sorts the input data (the map and shuffle phases) and then aggregates it (the reduce phase); a word-count sketch follows below.
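To make the MapReduce programming model concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce Java API. The class names and the input/output paths taken from the command line are illustrative; the job reads text files from the first path argument and writes word counts to the second.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emits (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sums the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer logic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same pattern scales from a single machine to a full cluster: the framework splits the input, runs the mapper on each split in parallel, shuffles and sorts the intermediate (word, 1) pairs, and hands each word's values to one reducer call.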
Some Hadoop Users
Difference Between Hadoop and RDBMS
Cluster Modes in Hadoop
• Hadoop mainly works in three different modes:
1. Standalone Mode
2. Pseudo-distributed Mode (configured as sketched below)
3. Fully-distributed Mode
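As a rough sketch, pseudo-distributed mode on a single machine typically boils down to two small configuration entries: point the default file system at a local HDFS NameNode and drop the block replication factor to 1. The hostname, port 9000 and the replication value shown are common single-node defaults, not requirements.

<!-- core-site.xml: point the default file system at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so keep only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In standalone mode neither file is changed and everything runs inside a single local JVM against the local file system; fully-distributed mode uses the same properties but points them at a real NameNode host across a multi-node cluster.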
Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming-based data processing
•Spark: In-memory data processing
•Pig, Hive: Query-based processing of data services
•HBase: NoSQL database
•Mahout, Spark MLlib: Machine learning algorithm libraries
•Solr, Lucene: Searching and indexing
•ZooKeeper: Cluster management
•Oozie: Job scheduling
HDFS Daemons and MapReduce Daemons
Hadoop Cluster Architecture
• A Hadoop cluster architecture consists of data centres, racks and the nodes that actually execute the jobs. A data centre consists of racks, and racks consist of nodes. A medium to large cluster has a two- or three-level Hadoop cluster architecture built with rack-mounted servers. Every rack of servers is interconnected through 1 gigabit Ethernet (1 GbE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch, and the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
• With the Hadoop Distributed File System, data is written once on the server and subsequently read and re-used many times thereafter.
• The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across the different data nodes. A minimal client-side example of talking to HDFS follows below.
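As a minimal sketch, the snippet below uses the Hadoop Java client to ask the NameNode for the metadata of a directory. It assumes the Hadoop client libraries and a core-site.xml pointing at the cluster are on the classpath; the /user path is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the configuration on the classpath,
    // so FileSystem.get() returns a client for the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // listStatus() is a pure metadata query answered by the NameNode;
    // no DataNode is contacted until file contents are actually read.
    for (FileStatus status : fs.listStatus(new Path("/user"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}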
HDFS Components
• A Hadoop cluster consists of three components –
• Master Node – The master node in a Hadoop cluster is responsible for coordinating data storage in HDFS and parallel computation on the stored data using MapReduce. It runs three daemons – NameNode, Secondary NameNode and JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function of HDFS. The NameNode keeps track of all the metadata on files, such as the access time of a file, which user is accessing a file at the current time, and on which nodes in the cluster each file's blocks are saved. The Secondary NameNode keeps a backup (checkpoint) of the NameNode metadata.
• Slave/Worker Node – This component in a Hadoop cluster is responsible for storing the data and performing computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node in the cluster. The DataNode service reports to the NameNode and the TaskTracker service reports to the JobTracker.
• Client Nodes – A client node has Hadoop installed with all the required cluster configuration settings and is responsible for loading data into the Hadoop cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and then retrieves the output once the job processing is completed.
HDFS Architecture
• Hadoop follows a master-slave architecture design for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster, which store data and perform complex computations. Every slave node runs a TaskTracker daemon and a DataNode daemon that synchronize with the JobTracker and the NameNode respectively. In a Hadoop architectural implementation, the master and slave systems can be set up in the cloud or on-premises.
HDFS Read File
• Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
• Step 2: The DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node for the first block in the file.
• Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the connection to that data node and finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
• Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
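As a minimal client-side sketch of this read path, the snippet below opens a file on HDFS and streams its contents to standard output. The path /new_file/test reuses the example path from the command section later on; the Hadoop client libraries and cluster configuration are assumed to be on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // DistributedFileSystem when fs.defaultFS is an hdfs:// URI

    // Steps 1-2: open() asks the name node (over RPC) for the block locations
    // and hands back an FSDataInputStream wrapping a DFSInputStream.
    try (FSDataInputStream in = fs.open(new Path("/new_file/test"))) {
      // Steps 3-5: read() streams data from the closest data node holding each block.
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    } // Step 6: close() on the stream happens automatically via try-with-resources.
  }
}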
HDFS Write File
• Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
• Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and the client receives an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
• Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
• Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
• Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.
• Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
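A minimal client-side sketch of this write path: it creates a small file on HDFS and lets close() complete it on the name node. The path /new_file/output.txt is purely illustrative, and overwrite is set to false so the name node's existence check described in Step 2 applies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Steps 1-2: create() asks the name node to add the file to the namespace;
    // with overwrite=false it fails with an IOException if the file already exists.
    try (FSDataOutputStream out = fs.create(new Path("/new_file/output.txt"), false)) {
      // Steps 3-5: writes are split into packets and pushed through the
      // data node replication pipeline behind the scenes.
      out.writeUTF("hello from the HDFS write path");
    } // Step 6: close() flushes the remaining packets and completes the file on the name node.
  }
}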
Some Basic Hadoop Commands
• cat
HDFS Command that reads a file on HDFS and prints the content of that file to the
standard output.
Command: hdfs dfs -cat /new_file/test
• text
HDFS Command that takes a source file and outputs the file in text format.
Command: hdfs dfs -text /new_file/test
• copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Command: hdfs dfs -copyFromLocal /home/file/test /new_file
• copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Command: hdfs dfs -copyToLocal /file/test /home/file
• put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Command: hdfs dfs -put /home/file/test /user
• get
HDFS Command to copy files from hdfs to the local file system.
Command: hdfs dfs -get /user/test /home/file
• count
HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
Command: hdfs dfs -count /user
• rm
HDFS Command to remove the file from HDFS.
Command: hdfs dfs -rm /new_file/test
• rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
Command: hdfs dfs -rm -r /new_file
• cp
HDFS Command to copy files from source to destination. This command allows multiple
sources as well, in which case the destination must be a directory.
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
• mv
HDFS Command to move files from source to destination. This command allows multiple
sources as well, in which case the destination needs to be a directory.
Command: hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
• rmdir
HDFS Command to remove the directory.
Command: hdfs dfs -rmdir /user/hadoop
• help
HDFS Command that displays help for given command or all commands if none is
specified.
Command: hdfs dfs -help