Hadoop HDFS and MAPREDUCE
(BTDS603-20 )
Module -1
What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
Hadoop consists of the
◦ Hadoop Common package, which provides file system
and operating system level abstractions
◦ MapReduce engine
◦ Hadoop Distributed File System (HDFS).
The Hadoop distributed file system (HDFS) is a distributed,
scalable, and portable file system written in Java for the
Hadoop framework
Hadoop Architecture
[Figure: Hadoop architecture diagram.]
HDFS
A classic Hadoop (1.x) deployment has five services, as follows:
◦ Name Node
◦ Secondary Name Node
◦ Job tracker
◦ Data Node
◦ Task Tracker
Nodes
Name Node, also known as the master:
◦ The master node tracks files, manages the file system, and holds
the metadata of all of the data stored within it.
◦ The Name Node records the number of blocks, the Data Nodes on
which the data is stored, where the replicas are kept, and other
details.
◦ The Name Node has direct contact with the client.
Data Node, also known as the slave:
◦ A Data Node stores data in the form of blocks.
◦ It holds the actual data in HDFS and serves client read and write
requests.
◦ Every Data Node sends a heartbeat message to the Name Node
every 3 seconds to convey that it is alive.
◦ If a Data Node is dead, the Name Node starts replicating its
blocks on other Data Nodes.
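As a rough illustration of the heartbeat mechanism above, here is a minimal Python sketch (not actual HDFS code; the class, node names, and the roughly-10-minute dead-node timeout are illustrative) of how a Name Node could track Data Node liveness:

```python
import time

HEARTBEAT_INTERVAL = 3      # seconds, as on the slide
DEAD_AFTER = 10 * 60        # roughly HDFS's default dead-node timeout

class NameNodeMonitor:
    """Toy model of how the Name Node tracks Data Node liveness."""
    def __init__(self):
        self.last_heartbeat = {}          # data-node id -> last heartbeat time

    def heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items() if now - t > DEAD_AFTER]

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=700)          # dn2 stopped sending heartbeats
# dn2 exceeds the timeout, so its blocks would be re-replicated elsewhere
print(monitor.dead_nodes(now=710))         # -> ['dn2']
```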
Nodes
Secondary Name Node:
◦ Takes periodic checkpoints of the file system metadata held by
the Name Node.
◦ It is also known as the checkpoint node.
Job Tracker:
◦ The Job Tracker receives requests for MapReduce execution
from the client.
◦ The Job Tracker talks to the Name Node to learn the locations
of the data that will be used in processing.
Task Tracker:
◦ It is the slave node for the Job Tracker; it takes tasks from
the Job Tracker.
◦ The Task Tracker takes the code and applies it to the file. The
process of applying that code to the file is known as the Mapper.
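The Job Tracker's conversation with the Name Node can be sketched as a toy Python model (all names and data structures here are hypothetical, not Hadoop APIs): given the block locations reported by the Name Node, the Job Tracker prefers a Task Tracker running on a node that already holds the block (data locality):

```python
# Toy sketch: the Job Tracker asks the Name Node where each input block
# lives, then prefers a Task Tracker on one of those nodes (data locality).

block_locations = {            # from the Name Node: block -> data nodes
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
}
free_task_trackers = ["dn3", "dn1"]   # task trackers with open map slots

def assign(block):
    """Pick a local Task Tracker if possible, else fall back to any free one."""
    local = [tt for tt in free_task_trackers if tt in block_locations[block]]
    return local[0] if local else free_task_trackers[0]

assignments = {blk: assign(blk) for blk in block_locations}
print(assignments)   # -> {'blk_1': 'dn1', 'blk_2': 'dn3'}
```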
Commands on HDFS
There are two types of commands:
◦ Admin Commands
◦ Get status
◦ Generate a report
◦ Shell-like filesystem commands
◦ Put a file in the DFS
◦ Create a directory in DFS
◦ Show the contents of a file
Hadoop Admin
command list
hadoop classpath --Prints the class path needed to get the Hadoop jar
and the required libraries.
hadoop conftest --Validates configuration XML files.
hadoop version --Prints the Hadoop version.
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
--DistCp (distributed copy) is a tool used for large inter/intra-cluster
copying.
hadoop envvars --Displays computed Hadoop environment variables.
mapred historyserver --Starts the JobHistoryServer.
hdfs dfs -ls -d /user/hadoop or hadoop fs -ls -d /user/hadoop
--Lists the directory entry itself.
hadoop fs -expunge --Empties the trash.
hdfs dfsadmin -help --Lists the admin commands (e.g. -report for a
cluster status report).
hadoop namenode -format --Formats the NameNode, initializing a new
file system.
Hadoop File system commands
Create a directory: hadoop fs -mkdir <directory path>
List contents of a file or path: hadoop fs -ls <directory path>
Put a file into the distributed file system: hadoop fs -put <local path> <destination path>
Copy from a local source: hadoop fs -copyFromLocal <local path> <destination path>
Find a file: hadoop fs -find / -name <file name with extension>
Print the head of a file: hadoop fs -head <file path>
Hadoop File system commands
Cat to display contents: hdfs dfs -cat <path to file>
Append contents of one file to another: hdfs dfs -appendToFile <source> <destination>
Get a file from the distributed file system: hdfs dfs -get <file path in dfs> <destination path>
Move a file: hadoop fs -mv <source path> <destination path>
Show capacity, free and used space: hdfs dfs -df <location in the dfs>
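These shell commands can also be scripted. A minimal Python sketch (assuming the `hadoop` binary is on the PATH; the helper name and paths are illustrative) that builds the argument lists for `subprocess`:

```python
import subprocess

def hdfs_cmd(*args):
    """Build an `hadoop fs` invocation as an argument list for subprocess."""
    return ["hadoop", "fs", *args]

# Examples mirroring the tables above
mkdir = hdfs_cmd("-mkdir", "/user/demo")
put   = hdfs_cmd("-put", "local.txt", "/user/demo/local.txt")
cat   = hdfs_cmd("-cat", "/user/demo/local.txt")

print(put)   # -> ['hadoop', 'fs', '-put', 'local.txt', '/user/demo/local.txt']

# On a machine with Hadoop installed you would execute one with, e.g.:
# subprocess.run(put, check=True)
```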
MapReduce
MapReduce is a processing technique and a programming paradigm for distributed computing.
The paradigm consists of two parts: Map and Reduce.
Map stage:
◦ The map (or mapper's) job is to process the input data.
◦ Map takes a set of data and converts it into another set of intermediate data.
◦ The intermediate data consists of individual elements broken down into tuples (key/value pairs).
Reduce stage:
◦ The Reduce task takes the output of the Map task and reduces it to a smaller, more compact set of output data.
◦ After processing, it produces a new set of output, which is stored in HDFS.
MapReduce: The Map Step
[Figure: input key-value pairs, e.g. (doc-id, doc-content), are fed to map tasks, each of which emits intermediate key-value pairs, e.g. (word, wordcount-in-a-doc).]
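The map step can be sketched in Python for the word-count case: the mapper receives a (doc-id, doc-content) pair and emits one (word, 1) intermediate pair per word (a toy sketch, not Hadoop's Java Mapper API):

```python
# A word-count mapper: takes (doc_id, doc_content) and emits one
# intermediate (word, 1) pair per word.

def map_fn(doc_id, doc_content):
    for word in doc_content.split():
        yield (word.lower(), 1)       # key = word, value = partial count

pairs = list(map_fn("doc1", "bus car bus"))
print(pairs)   # -> [('bus', 1), ('car', 1), ('bus', 1)]
```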
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped by key into key-value groups, e.g. (word, list-of-wordcounts), much like an SQL GROUP BY; each reduce task then turns a group into an output key-value pair, e.g. (word, final-count), much like an SQL aggregation.]
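The group and reduce steps can likewise be sketched in Python (a toy model of the shuffle/group and reduce phases, not Hadoop's API):

```python
from collections import defaultdict

# Group intermediate (word, count) pairs by key (the "SQL GROUP BY" step),
# then reduce each group to a final count (the "SQL aggregation" step).

def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups                      # word -> list of counts

def reduce_fn(key, values):
    return (key, sum(values))

intermediate = [("bus", 1), ("car", 1), ("bus", 1)]
grouped = group(intermediate)          # {'bus': [1, 1], 'car': [1]}
output = [reduce_fn(k, v) for k, v in grouped.items()]
print(output)   # -> [('bus', 2), ('car', 1)]
```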
Case of word count using MapReduce
MAP
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into another set of (key, value) data):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
REDUCE
Input (the output of the Map function): the set of tuples above.
Output (converted into a smaller set of tuples; keys are normalized to a single case before grouping):
(BUS,7), (CAR,7), (TRAIN,4)
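The whole example can be reproduced in a few lines of Python (a simulation of the job, not a Hadoop program; case is normalized so Bus/bus/BUS collapse to one key, matching the result above):

```python
from collections import Counter

# End-to-end simulation of the slide's word-count job.
data = ("Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, "
        "buS, caR, CAR, car, BUS, TRAIN")
words = [w.strip(", ").upper() for w in data.split(", ") if w.strip()]

# Map: emit (word, 1); shuffle + reduce: sum per key (Counter does both here)
counts = Counter(words)
print(sorted(counts.items()))   # -> [('BUS', 7), ('CAR', 7), ('TRAIN', 4)]
```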
Lec 2 & 3 _Unit 1_Hadoop _MapReduce1.pptx