2. What is Hadoop
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
Hadoop consists of:
◦ the Hadoop Common package, which provides file system and operating-system-level abstractions
◦ the MapReduce engine
◦ the Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
4. HDFS
HDFS has five services as follows:
◦ Name Node
◦ Secondary Name Node
◦ Job Tracker
◦ Data Node
◦ Task Tracker
5. Nodes
Name Node, also known as the master:
◦ The master node tracks files, manages the file system, and holds the metadata of all of the stored data.
◦ The Name Node records the number of blocks, the Data Nodes on which the data is stored, where the replicas are kept, and other details (see the sketch after this list).
◦ The Name Node is in direct contact with the client.
Data Node, also known as the slave:
◦ A Data Node stores data as blocks.
◦ It stores the actual data in HDFS and serves the client's read and write requests.
◦ Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive.
◦ If a Data Node dies, the Name Node starts the process of replicating its blocks on other Data Nodes.
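To make the Name Node's bookkeeping concrete, here is a minimal Java sketch, assuming a client machine whose classpath carries the cluster configuration and a hypothetical file /user/hadoop/sample.txt, that asks the Name Node for a file's block locations and replica hosts through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);             // client handle; metadata calls go to the Name Node
        Path file = new Path("/user/hadoop/sample.txt");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        // The Name Node answers this from its metadata: one entry per block,
        // each listing the Data Nodes that hold a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}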
6. Nodes
Secondary Name Node:
◦ Takes care of the checkpoints of the file system metadata held by the Name Node.
◦ It is also known as the Checkpoint Node.
Job Tracker:
◦ The Job Tracker receives requests for MapReduce execution from the client.
◦ The Job Tracker talks to the Name Node to learn the location of the data that will be used in processing (a job-submission sketch follows this list).
Task Tracker:
◦ It is the slave node of the Job Tracker; it takes tasks from the Job Tracker.
◦ The Task Tracker takes the code and applies it to the file; the process of applying that code to the file is known as the Mapper.
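To show what a client's MapReduce execution request looks like in code, here is a minimal driver sketch against the org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are hypothetical class names, sketched in the word-count example at the end of this section:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // hypothetical; see the word-count example
        job.setReducerClass(WordCountReducer.class);      // hypothetical; see the word-count example
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file or directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
    }
}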
7. Commands on HDFS
There are two types of commands:
◦ Admin commands, e.g.:
  ◦ Get status
  ◦ Generate a report
◦ Shell-like filesystem commands, e.g.:
  ◦ Put a file in the DFS
  ◦ Create a directory in the DFS
  ◦ Show the contents of a file
8. Hadoop Admin command list
hadoop classpath --Prints the class path needed to get the Hadoop jar and the required libraries.
hadoop conftest --Validates configuration XML files.
hadoop version --Prints the Hadoop version.
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo --DistCp (distributed copy) is a tool used for large inter-/intra-cluster copying.
hadoop envvars --Displays the computed Hadoop environment variables.
mapred historyserver --Starts the JobHistoryServer.
hdfs dfs -ls -d /user/hadoop or hadoop fs -ls -d /user/hadoop --Lists the directory itself rather than its contents.
hadoop fs -expunge --Empties the trash (permanently removes the files in it).
hadoop dfsadmin -help --Prints help for the admin commands.
hadoop namenode -format --Formats the NameNode.
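As a rough Java counterpart to the get-status and report commands above, this sketch reads the cluster-wide capacity figures (the same numbers hdfs dfs -df prints) through the FileSystem API; it assumes the cluster configuration is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus();   // capacity, used, and remaining bytes
        System.out.println("capacity=" + status.getCapacity()
                + " used=" + status.getUsed()
                + " remaining=" + status.getRemaining());
        fs.close();
    }
}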
9. Hadoop File system commands
Create a directory: hadoop fs -mkdir <directory path>
List the contents of a path: hadoop fs -ls <directory path>
Put a file into the distributed file system: hadoop fs -put <local path> <destination path>
Copy from a local source: hadoop fs -copyFromLocal <local path> <destination path>
Find a file: hadoop fs -find / -name <file name with extension>
Print the head of a file: hadoop fs -head <file path>
10. Hadoop File system commands
Cat to display the contents of a file: hdfs dfs -cat <path to file>
Append the contents of one file to another: hdfs dfs -appendToFile <source> <destination>
Get a file from the distributed file system: hdfs dfs -get <file path in dfs> <destination path>
Move a file: hadoop fs -mv <source path> <destination path>
Show capacity, free, and used space: hdfs dfs -df <location in the dfs>
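The shell commands above have direct Java equivalents. Here is a minimal sketch, with hypothetical paths, that creates a directory, copies a local file into it, and prints the file's contents through the FileSystem API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCommands {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -mkdir /user/hadoop/demo
        fs.mkdirs(new Path("/user/hadoop/demo"));

        // hadoop fs -copyFromLocal input.txt /user/hadoop/demo/input.txt
        fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/hadoop/demo/input.txt"));

        // hdfs dfs -cat /user/hadoop/demo/input.txt
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/hadoop/demo/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}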
11. MapReduce
MapReduce is a processing technique and a programming paradigm for distributed computing.
The paradigm consists of two parts: Map and Reduce.
Map stage:
◦ The map or mapper's job is to process the input data.
◦ Map takes a set of data and converts it into another set of intermediate data.
◦ The intermediate data is usually individual elements broken down into tuples (key/value pairs).
Reduce stage:
◦ The Reduce task takes the output of the Map stage and reduces it to more compact output data.
◦ After processing, it produces a new set of output, which is stored in HDFS (a minimal simulation of both stages follows this list).
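Here is a minimal in-memory simulation of the two stages in plain Java, with made-up input and no Hadoop cluster required; the map stage emits (word, 1) tuples, and grouping plus the reduce stage collapses them into counts:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> input = List.of("bus car bus", "train car");

        // Map stage: each input record becomes a list of intermediate (key, value) tuples.
        List<Map.Entry<String, Integer>> intermediate = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Group by key, then Reduce stage: sum the values collected under each key.
        Map<String, Integer> output = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));

        System.out.println(output);   // e.g. {bus=2, car=2, train=1} (order may vary)
    }
}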
12. MapReduce: The Map Step
[Diagram: map tasks transform input key-value pairs (k1, v1) … (kn, vn) into intermediate key-value pairs (k, v). E.g. (doc-id, doc-content) → (word, wordcount-in-a-doc).]
13. MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped by key into key-value groups (k, [v, v, …]), analogous to SQL GROUP BY; reduce tasks then aggregate each group into output key-value pairs, e.g. (word, list-of-wordcounts) → (word, final-count), analogous to SQL aggregation.]
14. Case of word count using MapReduce
MAP
◦ Input (set of data): Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
◦ Output (converted into another set of (key, value) pairs): (Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
REDUCE
◦ Input (the output of the Map function): the set of tuples produced by MAP above
◦ Output (converted into a smaller set of tuples): (BUS,7), (CAR,7), (TRAIN,4)
Note that the words are counted case-insensitively: Bus, bus, BUS, and buS are all normalized to the single key BUS before the counts are summed.
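Expressed against Hadoop's Java API, the same computation might look roughly like this (a sketch with hypothetical class names; tokens are upper-cased so that Bus, bus, BUS, and buS all fall under the single key BUS, matching the output above):

// WordCountMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map stage: emit (WORD, 1) for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("[,\\s]+")) {   // split on commas and whitespace
            if (!token.isEmpty()) {
                word.set(token.toUpperCase());   // fold Bus/bus/BUS/buS into one key
                context.write(word, ONE);        // e.g. (BUS, 1)
            }
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce stage: sum the 1s grouped under each key, e.g. (BUS, [1, 1, ...]) -> (BUS, 7).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}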