Hadoop Developer

Course Topics
Slide 2 www.edureka.in/hadoop
- Module 1
  - Understanding Big Data
  - Hadoop Architecture
- Module 2
  - Hadoop Cluster Configuration
  - Data Loading Techniques
  - Hadoop Project Environment
- Module 3
  - Hadoop MapReduce Framework
  - Programming in MapReduce
- Module 4
  - Advanced MapReduce
  - MRUnit Testing Framework
- Module 5
  - Analytics using Pig
  - Understanding Pig Latin
- Module 6
  - Analytics using Hive
  - Understanding HiveQL
- Module 7
  - Advanced Hive
  - NoSQL Databases and HBase
- Module 8
  - Advanced HBase
  - ZooKeeper Service
- Module 9
  - Hadoop 2.0 – New Features
  - Programming in MRv2
- Module 10
  - Apache Oozie
  - Real World Datasets and Analysis
  - Project Discussion

Topics for Today
- Revision
- Hadoop 2.0 New Features
- HDFS High Availability
- HDFS Federation
- YARN or MRv2
- YARN and Hadoop Ecosystem
- YARN Components
- YARN Application Execution
- Running an Application with YARN
- Writing a YARN Application

Let’s Revise – HBase and ZooKeeper
[Figure: HBase architecture. Components shown: Master, RegionServers (memstore, WAL, HFile), ZooKeeper (/hbase/region 1, /hbase/region 2, ...), HDFS.]

Hadoop 1.0 – In Summary
[Figure: Hadoop 1.0 cluster. A Client talks to the HDFS layer (Name Node, Secondary Name Node, Data Nodes holding data blocks) and to the MapReduce layer (Job Tracker, Task Trackers). Each slave machine runs a Data Node and a Task Tracker that executes Map and Reduce tasks.]

Hadoop 1.0 - Challenges
Problem: Description
- NameNode – No Horizontal Scalability: Single NameNode and single namespace, limited by NameNode RAM.
- NameNode – No High Availability (HA): The NameNode is a single point of failure; after a failure it must be recovered manually using the Secondary NameNode.
- Job Tracker – Overburdened: Spends a significant portion of its time and effort managing the life cycle of applications.
- MRv1 – Only Map and Reduce tasks: The huge volume of data stored in HDFS remains underutilized and cannot be used for other workloads such as graph processing.
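
To make "limited by NameNode RAM" concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure is an approximation and not something stated in these slides, and directories are ignored for simplicity.

```python
# Assumption: ~150 bytes of heap per namespace object (rule of thumb only).
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate the heap needed to hold metadata for files and their blocks."""
    num_objects = num_files + num_files * blocks_per_file
    return num_objects * BYTES_PER_OBJECT


estimate = namenode_heap_bytes(100_000_000)
print(f"{estimate / 1e9:.0f} GB")
```

Under these assumptions, 100 million single-block files already need tens of gigabytes of heap on one machine, which is why a single NameNode cannot scale horizontally.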

NameNode – Scale and HA
[Figure: Hadoop 1.0 NameNode. A single NameNode holds the namespace (NS) and block management. The Client gets block locations from the NameNode and reads data from the Data Nodes. Callouts: NameNode - No High Availability; NameNode - No Horizontal Scale.]

Name Node – Single Point of Failure
- Secondary NameNode:
  - “Not a hot standby” for the NameNode
  - Connects to NameNode every hour*
  - Housekeeping, backup of NameNode metadata
  - Saved metadata can rebuild a failed NameNode
[Figure: The NameNode (single point of failure) hands its metadata to the Secondary NameNode, which says: “You give me metadata every hour, I will make it secure.”]

Job Tracker – Overburdened
- CPU
  - Spends a very significant portion of time and effort managing the life cycle of applications
- Network
  - Single listener thread to communicate with thousands of Map and Reduce jobs
[Figure: One Job Tracker communicating with many Task Trackers.]
MRv1 – Unpredictability in Large Clusters
As the cluster size grows and reaches 4000 nodes:
- Cascading Failures
  - DataNode failures result in a serious deterioration of overall cluster performance, because the attempts to re-replicate data overload the live nodes and flood the network.
- Multi-tenancy
  - As clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, so they cannot be repurposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.

Unutilized Data in HDFS
- Terabytes and petabytes of data in HDFS can be used only for MapReduce processing

Hadoop 2.0 New Features
Other important Hadoop 2.0 features
- HDFS Snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary Compatibility for MapReduce applications built on Hadoop 1.0
- Substantial amount of integration testing with the rest of the projects (such as PIG, HIVE) in the Hadoop ecosystem
Property: Federation
  Hadoop 1.0: One Namenode and namespace
  Hadoop 2.0: Multiple Namenodes and namespaces
Property: High Availability
  Hadoop 1.0: Not present
  Hadoop 2.0: Highly available
Property: YARN - Processing Control and Multi-tenancy
  Hadoop 1.0: JobTracker, Task Tracker
  Hadoop 2.0: Resource Manager, Node Manager, App Master, Capacity Scheduler

Hadoop 2.0 Cluster Architecture - Federation
[Figure: HDFS layering in Hadoop 1.0 versus Hadoop 2.0. Hadoop 1.0: a single Namenode holds the namespace (NS) and block management, on top of block storage provided by the Datanodes. Hadoop 2.0: multiple Namenodes (NN-1 ... NN-k ... NN-n) each manage their own namespace (NS1 ... NSk ... NSn) and block pool (Pool 1 ... Pool k ... Pool n); Datanode 1 ... Datanode m provide common storage for all block pools.]
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html

Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
How does HDFS Federation help HDFS scale horizontally?
- It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
- It provides cross-data center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.

Annie’s Answer
A. In order to scale the name service horizontally, HDFS federation uses multiple independent Namenodes. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other.

Annie’s Question
You have configured two name nodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?

Annie’s Answer
The put will fail. Neither namespace manages the /accounting path, so you will get an IOException with a No such file or directory error.
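
The behaviour in the two federation questions can be pictured with a small toy model. This is only a sketch: the class name, the NN-1/NN-2 labels and the prefix matching are illustrative assumptions, not Hadoop's actual client-side implementation. Each path prefix is owned by exactly one NameNode, and a path outside every prefix has no owner.

```python
class FederatedNamespace:
    """Toy mount table: each path prefix is managed by one independent NameNode."""

    def __init__(self, mounts):
        # mounts maps a path prefix to a (hypothetical) NameNode label.
        self.mounts = mounts

    def resolve(self, path):
        """Return the NameNode that manages `path`, or fail if none does."""
        for prefix, namenode in self.mounts.items():
            if path == prefix or path.startswith(prefix + "/"):
                return namenode
        # Python's FileNotFoundError stands in for the Java IOException here.
        raise FileNotFoundError(f"No such file or directory: {path}")


fs = FederatedNamespace({"/marketing": "NN-1", "/finance": "NN-2"})
print(fs.resolve("/marketing/q1.csv"))
print(fs.resolve("/finance/ledger.csv"))
try:
    fs.resolve("/accounting/invoice.csv")
except FileNotFoundError as err:
    print("put failed:", err)
```

Because the NameNodes never coordinate with each other, no NameNode can step in for a path that falls outside the configured prefixes.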

Hadoop 2.0 Cluster Architecture - HA
[Figure: Hadoop 2.0 cluster with NameNode High Availability (HDFS) and Next Generation MapReduce (YARN). A Client talks to the Active NameNode and the Resource Manager. The Active NameNode logs all namespace edits to shared edit logs on shared NFS storage, with a single writer (fencing). The Standby NameNode reads the edit logs and applies them to its own namespace. Each slave machine runs a Data Node and a Node Manager hosting Containers and App Masters. A Secondary Name Node is also drawn, with the note below.]
*Not necessary to configure Secondary NameNode
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

NameNode Recovery Vs. Failover
[Figure: NameNode recovery in Hadoop 1.0 versus High Availability in Hadoop 2.0. Hadoop 1.0: the Secondary Name Node keeps a copy of the NameNode metadata (FSImage and edit logs), and a failed NameNode is recovered manually using the Secondary NameNode. Hadoop 2.0: an Active NameNode and a Standby NameNode, with automatic failover to the Standby NameNode.]
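
The difference between manual recovery and failover comes down to the Standby NameNode continuously replaying the shared edit log. The sketch below is a deliberately simplified model under assumed names (`NameNode`, `HAPair`, the `create`/`delete` operations are all hypothetical); fencing and failure detection are left out.

```python
class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}
        self.applied = 0  # how many shared edits this node has applied so far

    def apply(self, edit_log):
        """Replay any edits in the shared log that have not been applied yet."""
        for op, path in edit_log[self.applied:]:
            if op == "create":
                self.namespace[path] = True
            elif op == "delete":
                self.namespace.pop(path, None)
        self.applied = len(edit_log)


class HAPair:
    def __init__(self):
        self.edit_log = []  # stands in for the shared edit log storage
        self.active = NameNode("nn1")
        self.standby = NameNode("nn2")

    def write(self, op, path):
        # Only the active NameNode appends to the shared edit log.
        self.edit_log.append((op, path))
        self.active.apply(self.edit_log)

    def failover(self):
        # The standby catches up on the log, then the two roles are swapped.
        self.standby.apply(self.edit_log)
        self.active, self.standby = self.standby, self.active


pair = HAPair()
pair.write("create", "/a")
pair.write("create", "/b")
pair.write("delete", "/a")
pair.failover()
print(pair.active.name, sorted(pair.active.namespace))
```

After the failover the new active node holds the same namespace as the old one, with no manual rebuild from saved metadata.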

Annie’s Question
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
- Single Point Of Failure Of NameNode
- Only one version can be run in classic MapReduce
- Too much burden on Job Tracker

Annie’s Answer
Single Point of Failure of NameNode.

Hadoop 2.0 – In Summary
[Figure: Hadoop 2.0 cluster. Masters: HDFS (Distributed Data Storage) with an Active NameNode, a Standby NameNode, shared edit logs and a Secondary Name Node; YARN (Distributed Data Processing) with the Resource Manager, made up of the Scheduler and the Applications Manager (AsM). Slaves: each runs a Data Node and a Node Manager hosting Containers and App Masters. The Client talks to both layers. Labels: NameNode High Availability; Next Generation MapReduce.]

YARN and Hadoop Ecosystem
[Figure: Hadoop ecosystem stack without and with YARN. Without YARN: Apache Oozie (Workflow); Pig Latin (Data Analysis) and Hive (DW System); MapReduce Framework and HBase; HDFS (Hadoop Distributed File System). With YARN: the same stack plus YARN (Cluster Resource Management) and other YARN frameworks (MPI, GIRAPH) next to the MapReduce Framework.]
YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework.

YARN – Moving beyond MapReduce
[Figure: Workloads that can run on YARN: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, ..).]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Multi-tenancy - Capacity Scheduler
- Organizes jobs into queues
- Queue shares as percentages of the cluster
- FIFO scheduling within each queue
- Data locality-aware scheduling
- Hierarchical Queues: to manage resources within an organization.
- Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.
- Security: to safeguard applications from other users.
- Elasticity: resources are available to queues in a predictable and elastic manner.
- Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.
- Operability: runtime configuration of queues.
- Resource-based scheduling: if needed, applications can request more resources than the default.
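
A minimal sketch of the first three points (queues, percentage shares, FIFO inside each queue) might look like the following. The class and method names are hypothetical, resources are counted in whole containers, and elasticity, hierarchy and locality are deliberately left out, so this is a toy model rather than the real Capacity Scheduler.

```python
from collections import deque


class ToyCapacityScheduler:
    def __init__(self, total_containers, shares):
        # shares maps a queue name to its percentage of the cluster.
        assert sum(shares.values()) == 100
        self.capacity = {q: total_containers * pct // 100 for q, pct in shares.items()}
        self.used = {q: 0 for q in shares}
        self.pending = {q: deque() for q in shares}

    def submit(self, queue, app, containers):
        self.pending[queue].append((app, containers))

    def schedule(self):
        """Start queued apps in FIFO order while each queue stays within its share."""
        started = []
        for queue, apps in self.pending.items():
            # FIFO: if the head of the queue does not fit, later apps wait too.
            while apps and self.used[queue] + apps[0][1] <= self.capacity[queue]:
                app, containers = apps.popleft()
                self.used[queue] += containers
                started.append((queue, app))
        return started


scheduler = ToyCapacityScheduler(100, {"prod": 70, "dev": 30})
scheduler.submit("prod", "A", 50)
scheduler.submit("prod", "B", 30)
scheduler.submit("prod", "C", 10)
scheduler.submit("dev", "D", 20)
started = scheduler.schedule()
print(started)
```

Here application B does not fit in the remaining prod capacity, and C waits behind it even though it would fit, which is what FIFO ordering within a queue implies in this simplified model.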

Executing MapReduce Application on YARN
MapReduce Application Execution

YARN MR Application Execution Flow
- MapReduce Job Execution
  - Job Submission
  - Job Initialization
  - Tasks Assignment
  - Tasks’ Memory
  - Status Updates
  - Failure Recovery
- Client
  - Submits a MapReduce job
- Resource Manager
  - Manages resource utilization across the Hadoop cluster
- Node Manager
  - Runs on each Data Node
  - Creates execution containers
  - Monitors each container’s usage
- MapReduce Application Master
  - Coordinates and manages MapReduce jobs
  - Negotiates with the Resource Manager to schedule tasks
  - The tasks are started by NodeManager(s)
- HDFS
  - Shares resources and job artefacts among YARN components

Hadoop 2.0 - YARN
YARN = Yet Another Resource Negotiator
- Resource Manager
  - Cluster-level resource manager
  - Long life, high-quality hardware
- Node Manager
  - One per Data Node
  - Monitors resources on the Data Node
- Application Master
  - One per application
  - Short life
  - Manages tasks/scheduling
[Figure: Two Clients, the Resource Manager, three Node Managers hosting Containers and App Masters, and a JobHistory Server. Arrow labels: MapReduce Status, Job Submission, Node Status, Resource Request.]

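[Figure: MapReduce job submission on YARN, steps 1-4. The Job object in the Client JVM: 1. Run Job; 2. Get New Application; 3. Copy Job Resources; 4. Submit Application. Also shown: Resource Manager (management node), HDFS, and a DataNode machine running a Node Manager, the MR AppMaster, and a YarnChild Task JVM executing a Map or Reduce Task.]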

[Figure: MapReduce job initialization on YARN, steps 5-9. 5. Start MR AppMaster container; 6. Create container; 7. Get Input Splits; 8. Request Resources; 9. Start container. Components shown: Client JVM with the Job object, Resource Manager (management node), HDFS, and a DataNode machine running a Node Manager, the MR AppMaster, and a YarnChild Task JVM executing a Map or Reduce Task.]

[Figure: MapReduce task execution on YARN. Repeats steps 5-9 (5. Start MR AppMaster container; 6. Create container; 7. Get Input Splits; 8. Request Resources; 9. Start container) and adds: 10. Create Container; 11. Acquire Job Resources; 12. Execute. Components shown: Client JVM with the Job object, Resource Manager (management node), HDFS, and a DataNode machine running a Node Manager, the MR AppMaster, and a YarnChild Task JVM executing a Map or Reduce Task.]

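[Figure: MapReduce status updates on YARN. Arrow labels: Poll for Status; Update Status. Components shown: Client JVM with the Job object, Resource Manager (management node), HDFS, and a DataNode machine running a Node Manager, the MR AppMaster, and a YarnChild Task JVM executing a Map or Reduce Task.]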
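
The numbered arrows in the execution diagrams can be read as one ordered sequence. The sketch below simply lists that sequence for a job with a given set of input splits; the function name and the split labels are hypothetical, the sender and receiver of each step are an interpretation of the diagram, and real YARN batches resource requests rather than handling one split at a time.

```python
def mapreduce_on_yarn_steps(input_splits):
    """Return the ordered steps of a toy MapReduce run, one task per input split."""
    steps = [
        "client: run job",
        "client -> ResourceManager: get new application",
        "client -> HDFS: copy job resources",
        "client -> ResourceManager: submit application",
        "ResourceManager -> NodeManager: start MR AppMaster container",
        "NodeManager: create container for MR AppMaster",
        "MR AppMaster <- HDFS: get input splits",
    ]
    for split in input_splits:
        steps += [
            f"MR AppMaster -> ResourceManager: request resources for {split}",
            f"MR AppMaster -> NodeManager: start container for {split}",
            f"NodeManager: create container (YarnChild) for {split}",
            f"YarnChild <- HDFS: acquire job resources for {split}",
            f"YarnChild: execute task on {split}",
        ]
    steps.append("client -> MR AppMaster: poll for status")
    return steps


for number, step in enumerate(mapreduce_on_yarn_steps(["split-0", "split-1"]), 1):
    print(number, step)
```

Reading the output top to bottom shows that the Resource Manager is involved only in handing out containers, while the MR AppMaster drives the individual tasks.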

Hadoop 2.0 : YARN Workflow
[Figure: The Resource Manager (Scheduler and Applications Manager (AsM)) above a grid of Node Managers. App Master 1 runs on one Node Manager, with Container 1.1 and Container 1.2 on other Node Managers. App Master 2 runs on another Node Manager, with Container 2.1, Container 2.2 and Container 2.3 on other Node Managers.]

Annie’s Question
YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
- Single Point Of Failure Of NameNode
- Only one version can be run in classic MapReduce
- Too much burden on Job Tracker

Annie’s Answer
Too much burden on Job Tracker.

Annie’s Question
In YARN, the functionality of JobTracker has been replaced by which of the following YARN features?
- Job Scheduling
- Task Monitoring
- Resource Management
- Node management

Annie’s Answer
Task Monitoring and Resource Management. The fundamental idea of YARN is to split up the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.

Annie’s Question
In YARN, which of the following daemons takes care of the container and the resource utilization by the applications?
- Node Manager
- Job Tracker
- Task Tracker
- Application Master
- Resource Manager

Annie’s Answer
ApplicationMaster.

Annie’s Question
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
- Yes
- No

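Annie’s Answer
Yes. MapReduce applications built on Hadoop 1.0 with the old org.apache.hadoop.mapred API are binary compatible and run unchanged. Applications that use the new org.apache.hadoop.mapreduce API may need to be recompiled against MRv2 before they run successfully in a YARN enabled Hadoop Cluster.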

Annie’s Question
Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer

Annie’s Answer
ApplicationMaster.

Module-10 Pre-work
- Write the MapReduce programs using MRv2 for your Module-3 assignments.
  - Assignment Programs – WordCount, Patents, Temperature, and Alphabets
- Create a Single Node Apache Hadoop Cluster using the documents present in the LMS.
  - Execute the ‘teragen’ example to test the Cluster

Thank You
See You in Class Next Week

NN High Availability – Quorum Journal Manager
[Figure: An Active Name Node and a Standby Name Node share a group of three Journal Nodes, above a row of Data Nodes holding the data blocks.]
- The Active Name Node writes namespace modifications to the edit logs on the Journal Nodes; only the Active Name Node is the writer.
- The Standby Name Node reads the edit logs and applies them to its own namespace.
- Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
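
The "quorum" in Quorum Journal Manager refers to a majority of the Journal Nodes acknowledging each edit. The sketch below illustrates only that majority rule; the class and function names are hypothetical, and recovery of edits that reached a minority of nodes is not modelled.

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def append(self, edit):
        """Store the edit and acknowledge it, unless this node is down."""
        if not self.up:
            return False
        self.edits.append(edit)
        return True


def quorum_write(journal_nodes, edit):
    """Send the edit to every Journal Node; succeed if a majority acknowledge."""
    acks = sum(node.append(edit) for node in journal_nodes)
    return acks > len(journal_nodes) // 2


one_down = [JournalNode(), JournalNode(), JournalNode(up=False)]
two_down = [JournalNode(), JournalNode(up=False), JournalNode(up=False)]
print(quorum_write(one_down, "mkdir /a"))
print(quorum_write(two_down, "mkdir /a"))
```

With three Journal Nodes the Active Name Node can keep writing while one of them is down, but not while two are.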
