Hadoop Introduction

Apache Hadoop is a Java-based software framework that allows for the distributed
processing of large data sets across clusters of computers using a simple
programming model.

Hadoop Distributed File System
•  Distributed, scalable and reliable
•  Fault-tolerant storage system
Map-Reduce Programming Model
•  High-performance parallel data processing
•  Employs the divide-and-conquer principle

A class teacher of class 5 needs to find out the name of the student with the highest
marks for each subject.
Input
Total students : 50
Total subjects : 5
Time to process each subject per student : 1 min
Our Goal
To minimize the total time spent
Total time spent by one person working alone : 50 x 5 x 1 min = 250 mins
Output
Subject 1 : S1-98
Subject 2 : S13-95
Subject 3 : S1-97
Subject 4 : S23-100
Subject 5 : S8-99

a)  HDFS: Distribute the data into blocks across multiple nodes
Distribute the papers across 5 peons. Each peon gets the papers of 10 students for
each subject (50 papers each).
b)  Map Phase: Apply business logic on the distributed data in parallel
Each peon provides, for each subject, the name and marks of the top student among
his 10 students.
Total time spent: 50 mins (in parallel)
c)  Reduce Phase: Iterate over the map phase output and get the final result
Total records left: 5 candidates (one per peon) for each of the 5 subjects, i.e. 25
records. Time to get the student name with the highest marks for each subject: 25 mins
Total time spent: 50 + 25 = 75 mins
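
The classroom example above can be sketched in a few lines of Python. This is only an illustration of the map and reduce steps, not Hadoop code: the marks are made-up values from a hypothetical formula, and the names map_phase, reduce_phase and make_papers are assumptions chosen for this sketch.

```python
SUBJECTS = 5
STUDENTS = 50
WORKERS = 5  # the 5 peons

def make_papers():
    """Hypothetical data: one (subject, student, marks) record per paper."""
    return [(subj, stud, (stud * 37 + subj * 11) % 101)
            for stud in range(1, STUDENTS + 1)
            for subj in range(1, SUBJECTS + 1)]

def map_phase(papers):
    """Map: one worker scans its own share and keeps the best
    (student, marks) pair it has seen for each subject."""
    best = {}
    for subject, student, marks in papers:
        if subject not in best or marks > best[subject][1]:
            best[subject] = (student, marks)
    return best

def reduce_phase(partial_results):
    """Reduce: merge the per-worker winners into one winner per subject."""
    final = {}
    for partial in partial_results:
        for subject, (student, marks) in partial.items():
            if subject not in final or marks > final[subject][1]:
                final[subject] = (student, marks)
    return final

papers = make_papers()
# Stand-in for HDFS: each worker gets the papers of 10 students.
shares = [[p for p in papers if (p[1] - 1) // 10 == w] for w in range(WORKERS)]
partials = [map_phase(share) for share in shares]  # would run in parallel
result = reduce_phase(partials)

# Timing at 1 minute per record, as in the example.
map_minutes = max(len(share) for share in shares)   # 50, workers run side by side
reduce_minutes = sum(len(p) for p in partials)      # 25 records left to compare
total_minutes = map_minutes + reduce_minutes        # 75
```

Here the map calls run one after another for simplicity; on a cluster each share would be processed on a different node at the same time, which is where the saving from 250 to 75 minutes comes from.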

Social Media Data
Analyzing Web Clickstream Data
Server Log Data
Machine and Sensor Data

HDFS Layer:
Stores files across storage nodes in a Hadoop cluster
Consists of:
•  Namenode & Datanodes
Map-Reduce Engine:
Processes vast amounts of data in parallel on large clusters in a reliable &
fault-tolerant manner
Consists of:
•  Job Tracker & Task Trackers

Storage & Replication of Blocks in HDFS
[Figure: An HDFS client sends a file write request. The file is divided into blocks
(Block 1 to Block 4), which are stored under the Namenode on Datanode_1 (Block 1),
Datanode_2 (Block 2) and Datanode_3 (Block 3, Block 4).]
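
As a rough illustration of how a file is divided into blocks and how block copies are spread over the datanodes, here is a toy Python sketch. The file size (500 MB), the block size (128 MB), the replication factor (2) and the round-robin placement are all assumptions made for this example; real HDFS uses a rack-aware placement policy, and the function names are hypothetical.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the size of each block; only the last one may be short."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication):
    """Toy placement: each block's replicas go to consecutive datanodes,
    round-robin. This is NOT the real HDFS placement policy."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for distinct replicas")
    placement = {}
    for b in range(num_blocks):
        placement[b + 1] = [datanodes[(b + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(file_size=500 * MB, block_size=128 * MB)
nodes = ["Datanode_1", "Datanode_2", "Datanode_3"]
layout = place_replicas(len(blocks), nodes, replication=2)

for block_id, holders in layout.items():
    print(f"Block {block_id}: {blocks[block_id - 1] // MB} MB on {holders}")
```

With these assumed sizes the file becomes four blocks, and losing any single datanode still leaves one copy of every block available.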

[Figure: A client submits a Map-Reduce job to the Job Tracker. Task Tracker_1,
Task Tracker_2 and Task Tracker_3 execute the individual Map-Reduce tasks assigned
by the Job Tracker. Each slave machine runs both a Task Tracker and a Data Node
holding HDFS blocks (Block 1 to Block 4).]
Wherever possible, a Task Tracker reads its input from the HDFS blocks stored on the
Data Node running on the same machine as the Task Tracker.
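
The idea of running a task on the machine that already holds its data can be sketched as below. This is a simplified illustration, not the actual Job Tracker scheduler: the host names, the slot counts and the function assign_tasks are assumptions made for this example.

```python
def assign_tasks(block_locations, slots):
    """For each block, prefer a host that stores the block and has a free
    task slot; otherwise fall back to any host with a free slot."""
    slots = dict(slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, hosts in block_locations.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            host = local[0]
        else:
            candidates = [h for h, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            host = candidates[0]
        slots[host] -= 1
        # Second field records whether the task reads its block locally.
        assignment[block] = (host, host in hosts)
    return assignment

# Hypothetical cluster: which slave machine stores which block.
block_locations = {
    "Block 1": ["slave1"],
    "Block 2": ["slave2"],
    "Block 3": ["slave3"],
    "Block 4": ["slave3"],
}
free_slots = {"slave1": 2, "slave2": 2, "slave3": 1}

plan = assign_tasks(block_locations, free_slots)
for block, (host, is_local) in plan.items():
    print(block, "->", host, "(local)" if is_local else "(remote read)")
```

In this run three tasks read their block locally, and the task for Block 4 is sent to another machine because slave3 has only one free slot.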

Hadoop Daemons
NameNode
•  Maps each block to the Datanodes that store it
•  Controls read/write access to files
•  Manages the replication engine for blocks
DataNode
•  Serves read and write requests, and performs block creation, deletion, and
replication
JobTracker
•  Accepts Map-Reduce jobs from the clients
•  Assigns tasks to the Task Trackers and monitors their status
TaskTracker
•  Worker daemon that runs Map-Reduce tasks
•  Sends heartbeats to the Job Tracker
•  Retrieves job resources from HDFS
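
The heartbeat mechanism between Task Trackers and the Job Tracker can be pictured with a small sketch. The class name HeartbeatMonitor and the 30-second timeout are assumptions for illustration only, not Hadoop's actual classes or settings.

```python
class HeartbeatMonitor:
    """Toy stand-in for the Job Tracker's view of its Task Trackers."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # tracker name -> time of last heartbeat

    def heartbeat(self, tracker, now):
        """Record that `tracker` reported in at time `now` (seconds)."""
        self.last_seen[tracker] = now

    def lost_trackers(self, now):
        """Trackers silent for longer than the timeout; their tasks
        would need to be rescheduled on other nodes."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)

monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("TaskTracker_1", now=0)
monitor.heartbeat("TaskTracker_2", now=0)
monitor.heartbeat("TaskTracker_1", now=25)  # TaskTracker_2 stays silent
lost = monitor.lost_trackers(now=40)
print(lost)
```

Times are passed in explicitly so the sketch behaves the same on every run; a real daemon would use the system clock.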

Hadoop Services
HDFS, MapReduce, YARN
YARN stands for "Yet Another Resource Negotiator", a framework that provides a
generic resource management solution for Hadoop clusters.

YARN allows easy integration of multiple data processing algorithms with the data
stored in HDFS.

Query Language
Pig Scripting
Coordination Service
Columnar Database
Log Management
Data Exchange
Designing Workflow
Machine Learning
Messaging System

a)  Apache Website
    http://hadoop.apache.org/
b)  Learning YARN
    https://www.packtpub.com/big-data-and-business-intelligence/learning-yarn
c)  Hadoop: The Definitive Guide
    http://shop.oreilly.com/product/0636920033448.do
