Slide 1 www.edureka.in/hadoop
Course Topics
- Module 1
  - Understanding Big Data
  - Hadoop Architecture
- Module 2
  - Hadoop Cluster Configuration
  - Data Loading Techniques
  - Hadoop Project Environment
- Module 3
  - Hadoop MapReduce Framework
  - Programming in MapReduce
- Module 4
  - Advanced MapReduce
  - MRUnit Testing Framework
- Module 5
  - Analytics using Pig
  - Understanding Pig Latin
- Module 6
  - Analytics using Hive
  - Understanding HiveQL
- Module 7
  - Advanced Hive
  - NoSQL Databases and HBase
- Module 8
  - Advanced HBase
  - ZooKeeper Service
- Module 9
  - Hadoop 2.0 – New Features
  - Programming in MRv2
- Module 10
  - Apache Oozie
  - Real-world Datasets and Analysis
  - Project Discussion
Topics for Today
- Revision
- Hadoop 2.0 New Features
  - HDFS High Availability
  - HDFS Federation
  - YARN or MRv2
- YARN and Hadoop ecosystem
- YARN Components
- YARN Application Execution
- Running an Application with YARN
- Writing a YARN Application
Let’s Revise – HBase and ZooKeeper
[Diagram: HBase architecture with the labels Master, RegionServers (memstore, WAL), HFile, /hbase/region 1, /hbase/region 2, ..., ZooKeeper and HDFS.]
Hadoop 1.0 – In Summary
[Diagram: a Client in front of two layers. HDFS: Name Node, Secondary Name Node, and Data Nodes holding the data blocks. MapReduce: Job Tracker, plus a Task Tracker running Map and Reduce tasks next to each Data Node.]
Hadoop 1.0 - Challenges
Problem | Description
NameNode – No Horizontal Scalability | A single NameNode and a single namespace, limited by NameNode RAM.
NameNode – No High Availability (HA) | The NameNode is a single point of failure; after a failure it needs manual recovery using the Secondary NameNode.
Job Tracker – Overburdened | Spends a significant portion of its time and effort managing the life cycle of applications.
MRv1 – Only Map and Reduce tasks | The huge volume of data stored in HDFS remains underutilized because it cannot be used for other workloads, such as graph processing.
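The "limited by NameNode RAM" point above can be made concrete with a rough estimate. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block); that constant and the function name are assumptions for illustration and do not come from the slides.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is a rule-of-thumb
# assumption (about 150 bytes per file, directory, or block), not a
# measured value.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Return an approximate heap size needed to hold the namespace in RAM."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_dirs + num_blocks) * BYTES_PER_OBJECT


if __name__ == "__main__":
    heap = estimate_namenode_heap_bytes(num_files=100_000_000, blocks_per_file=2)
    print(f"approx. {heap / 1024**3:.1f} GiB of NameNode heap")
```

Because every object lives in the memory of one NameNode, the namespace can only grow as far as that one machine's heap allows, which is the scaling limit the table describes.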
NameNode – Scale and HA
- NameNode - No High Availability
- NameNode - No Horizontal Scale
[Diagram: a Client gets block locations from the single NameNode (namespace NS plus block management) and reads data from the Data Nodes.]
Name Node – Single Point of Failure
- Secondary NameNode:
  - “Not a hot standby” for the NameNode
  - Connects to the NameNode every hour*
  - Housekeeping and backup of NameNode metadata
  - Saved metadata can rebuild a failed NameNode
[Diagram: the NameNode (single point of failure) passes its metadata to the Secondary NameNode, which says: “You give me metadata every hour, I will make it secure.”]
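The housekeeping done by the Secondary NameNode can be pictured as a periodic checkpoint that folds the edit log into the last saved namespace image. The following is only a toy model: the dictionary-based image, the tuple-based edits and the function name are assumptions chosen for illustration, not HDFS data structures.

```python
# Toy model of a Secondary NameNode checkpoint: merge the edit log into
# the last saved image. The data layout (dict of path -> metadata, list
# of (op, path) edits) is a simplifying assumption.

def checkpoint(fsimage, edits):
    """Return a new image with every logged edit applied in order."""
    merged = dict(fsimage)          # never mutate the saved image
    for op, path in edits:
        if op == "create":
            merged[path] = {"type": "file"}
        elif op == "mkdir":
            merged[path] = {"type": "dir"}
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return merged


if __name__ == "__main__":
    image = {"/": {"type": "dir"}}
    log = [("mkdir", "/data"), ("create", "/data/a.txt"), ("delete", "/data/a.txt")]
    print(sorted(checkpoint(image, log)))
```

A checkpoint like this is a backup, not a live copy: edits made after the last merge are missing from it, which is why the slide calls the Secondary NameNode "not a hot standby".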
Job Tracker – Overburdened
- CPU
  - Spends a very significant portion of its time and effort managing the life cycle of applications
- Network
  - A single listener thread communicates with thousands of Map and Reduce jobs
[Diagram: one Job Tracker connected to many Task Trackers.]
MRv1 – Unpredictability in Large Clusters
As the cluster size grows and reaches 4000 nodes:
- Cascading Failures
  - DataNode failures cause a serious deterioration of overall cluster performance, because the attempts to re-replicate data overload the live nodes and flood the network.
- Multi-tenancy
  - As clusters increase in size, you may want to use them for a variety of processing models. MRv1 dedicates its nodes to Hadoop, so they cannot be repurposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.
Unutilized Data in HDFS
- Terabytes and petabytes of data in HDFS can be used only for MapReduce processing
Hadoop 2.0 New Features
Property | Hadoop 1.0 | Hadoop 2.0
Federation | One NameNode and namespace | Multiple NameNodes and namespaces
High Availability | Not present | Highly available
YARN - Processing Control and Multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler
Other important Hadoop 2.0 features
- HDFS Snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- Substantial integration testing with the rest of the projects in the Hadoop ecosystem (such as Pig and Hive)
Hadoop 2.0 Cluster Architecture - Federation
[Diagram: Hadoop 1.0 has one Namenode (namespace NS plus block management) above the Datanode storage. Hadoop 2.0 has Namenodes NN-1 ... NN-k ... NN-n, one per namespace (NS1 ... NSk ... NSn), with block pools Pool 1 ... Pool k ... Pool n on common storage provided by Datanode 1, Datanode 2, ... Datanode m.]
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
How does HDFS Federation help HDFS scale horizontally?
- Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace
- Provides cross-data center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
Annie’s Answer
A. In order to scale the name service horizontally, HDFS federation uses multiple independent Namenodes. The Namenodes are federated, that is, they are independent and don’t require coordination with each other.
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
You have configured two name nodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
Annie’s Answer
The put will fail. Neither namespace manages the /accounting directory, so you will get an IOException with a "No such file or directory" error.
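The answer above can be illustrated with a toy client-side mount table that maps directory prefixes to NameNodes. The NameNode names (nn1, nn2) and the resolve and put helpers are hypothetical; this is a sketch of the idea, not the HDFS client API.

```python
# Toy client-side mount table for a federated namespace. The NameNode
# names (nn1, nn2) and both helper functions are hypothetical.
MOUNT_TABLE = {
    "/marketing": "nn1",
    "/finance": "nn2",
}


def resolve(path):
    """Return the NameNode that owns `path`, or raise if none does."""
    for prefix, namenode in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise IOError(f"No such file or directory: {path}")


def put(path):
    """Pretend to write a file; only succeeds under a mounted directory."""
    return f"wrote {path} via {resolve(path)}"


if __name__ == "__main__":
    print(put("/finance/q1.csv"))
    try:
        put("/accounting/ledger.csv")
    except IOError as err:
        print("put failed:", err)
```

Each NameNode only knows its own part of the namespace, so a path that no entry covers has nowhere to go and the write is rejected.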
Hadoop 2.0 Cluster Architecture - HA
[Diagram: HDFS with NameNode High Availability (Active NameNode, Standby NameNode, shared edit logs, Secondary Name Node*) next to YARN, the Next Generation MapReduce (Resource Manager; Node Managers with Containers and App Masters running alongside the Data Nodes); a Client connects to the cluster.]
- Active NameNode: all namespace edits are logged to shared NFS storage; single writer (fencing)
- Standby NameNode: reads the edit logs and applies them to its own namespace
*Not necessary to configure Secondary NameNode
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
NameNode Recovery Vs. Failover
- NameNode recovery in Hadoop 1.0: NameNode plus Secondary Name Node (metadata: FSImage and edit logs); manually recover using the Secondary NameNode.
- High Availability in Hadoop 2.0: Active NameNode plus Standby NameNode; automatic failover to the Standby NameNode.
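The difference between recovery and failover can be sketched with a toy model of two NameNodes sharing one edit log with a single writer. All class and function names here are hypothetical; in a real cluster fencing and failover are configured rather than hand-coded.

```python
# Toy model of active/standby NameNodes sharing an edit log. Class and
# function names are hypothetical and only illustrate the single-writer rule.

class SharedEditLog:
    def __init__(self):
        self.entries = []
        self.writer = None           # only one NameNode may write

    def append(self, node, entry):
        if node is not self.writer:
            raise PermissionError(f"{node.name} is fenced off from the edit log")
        self.entries.append(entry)


class NameNode:
    def __init__(self, name, log):
        self.name, self.log = name, log
        self.namespace = []          # edits applied so far

    def write(self, entry):
        self.log.append(self, entry)
        self.namespace.append(entry)

    def catch_up(self):
        """Standby behaviour: read the shared log and apply its edits."""
        self.namespace = list(self.log.entries)


def failover(log, standby):
    """Promote the standby: replay the log, then take over as single writer."""
    standby.catch_up()
    log.writer = standby


if __name__ == "__main__":
    log = SharedEditLog()
    active, standby = NameNode("nn1", log), NameNode("nn2", log)
    log.writer = active
    active.write("mkdir /data")
    failover(log, standby)
    standby.write("create /data/a.txt")
    print(standby.namespace)
```

Because the standby has been reading the same log all along, promotion needs no manual rebuild from a saved image, and the old active can no longer write once it has been fenced.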
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
- Single Point Of Failure Of NameNode
- Only one version can be run in classic MapReduce
- Too much burden on Job Tracker
Annie’s Answer
Single Point of Failure of NameNode.
Hadoop 2.0 – In Summary
[Diagram: a Client in front of two layers. HDFS (distributed data storage, NameNode High Availability): Active NameNode, Standby NameNode, Secondary Name Node, shared edit logs. YARN (distributed data processing, Next Generation MapReduce): Resource Manager with Scheduler and Applications Manager (AsM). Masters are drawn above the slaves; each slave runs a Data Node and a Node Manager with a Container and an App Master.]
YARN and Hadoop Ecosystem
[Diagram: two stacks. First stack: Apache Oozie (Workflow), Pig Latin (Data Analysis), Hive (DW System), MapReduce Framework, HBase, HDFS (Hadoop Distributed File System). Second stack: the same components plus Other YARN Frameworks (MPI, Giraph) and YARN (Cluster Resource Management).]
YARN adds a more general interface to run non-MapReduce jobs (such as graph processing) within the Hadoop framework.
YARN – Moving beyond MapReduce
- BATCH (MapReduce)
- INTERACTIVE (Tez)
- ONLINE (HBase)
- STREAMING (Storm, S4, …)
- GRAPH (Giraph)
- IN-MEMORY (Spark)
- HPC MPI (OpenMPI)
- OTHER (Search) (Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Multi-tenancy - Capacity Scheduler
- Organizes jobs into queues
- Queue shares are percentages of the cluster
- FIFO scheduling within each queue
- Data locality-aware scheduling
- Hierarchical Queues: to manage resources within an organization.
- Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.
- Security: to safeguard applications from other users.
- Elasticity: resources are available to queues in a predictable and elastic manner.
- Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.
- Operability: runtime configuration of queues.
- Resource-based scheduling: if needed, applications can request more resources than the default.
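The ideas of hierarchical queues and capacity guarantees listed above can be sketched as a small calculation. The queue names and percentages below are made-up examples, and the helper functions are hypothetical; each queue's share is treated as a percentage of its parent's capacity.

```python
# Toy calculation of guaranteed capacity for hierarchical queues.
# Queue names and percentages are made-up examples; each queue's share
# is a percentage of its parent's capacity.
QUEUES = {
    "root": {"share": 100, "parent": None},
    "root.prod": {"share": 70, "parent": "root"},
    "root.dev": {"share": 30, "parent": "root"},
    "root.dev.etl": {"share": 50, "parent": "root.dev"},
    "root.dev.adhoc": {"share": 50, "parent": "root.dev"},
}


def guaranteed_fraction(queue):
    """Multiply shares from the queue up to the root (result in 0..1)."""
    fraction = 1.0
    while queue is not None:
        fraction *= QUEUES[queue]["share"] / 100
        queue = QUEUES[queue]["parent"]
    return fraction


def guaranteed_memory_mb(queue, cluster_memory_mb):
    """Return the memory a queue is guaranteed on a cluster of the given size."""
    return guaranteed_fraction(queue) * cluster_memory_mb


if __name__ == "__main__":
    for name in sorted(QUEUES):
        print(name, round(guaranteed_memory_mb(name, 1_000_000)), "MB")
```

The guarantee is a floor, not a ceiling: the elasticity item above means a queue may use idle capacity beyond its share, within the limits the administrator sets.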
Executing MapReduce Application on YARN
MapReduce Application Execution
YARN MR Application Execution Flow
- MapReduce Job Execution
  - Job Submission
  - Job Initialization
  - Task Assignment
  - Task Memory
  - Status Updates
  - Failure Recovery
- Client
  - Submits a MapReduce job
- Resource Manager
  - Manages resource utilization across the Hadoop cluster
- Node Manager
  - Runs on each Data Node
  - Creates execution containers
  - Monitors each container’s usage
- MapReduce Application Master
  - Coordinates and manages MapReduce jobs
  - Negotiates with the Resource Manager to schedule tasks
  - The tasks are started by the NodeManager(s)
- HDFS
  - Shares resources and job artefacts among YARN components
Hadoop 2.0 - YARN
YARN = Yet Another Resource Negotiator
- Resource Manager
  - Cluster-level resource manager
  - Long life, high-quality hardware
- Node Manager
  - One per Data Node
  - Monitors resources on the Data Node
- Application Master
  - One per application
  - Short life
  - Manages tasks and scheduling
[Diagram: Clients, a Resource Manager, Node Managers hosting Containers and App Masters, and a JobHistory Server. Arrow types: Job Submission, Node Status, Resource Request, MapReduce Status.]
YARN MR Application Execution Flow
[Diagram, built up over four slides. Labels: Client, Client JVM, Job Object, Application, Resource Manager, Management Node, Node Manager, MR AppMaster, DataNode, YarnChild, Task JVM, Map or Reduce Task, HDFS.]
1. Run Job
2. Get New Application
3. Copy Job Resources
4. Submit Application
5. Start MR AppMaster container
6. Create container
7. Get Input Splits
8. Request Resources
9. Start container
10. Create Container
11. Acquire Job Resources
12. Execute
While the job runs: Poll for Status and Update Status.
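The order of the steps above can be followed in a toy walk-through. Everything in it is a hypothetical stand-in (the ResourceManager class, the slot counts, the run_job function); it only mirrors the sequence of the numbered steps and is not the YARN API.

```python
# Toy walk-through of a MapReduce job on YARN. All classes and functions
# are hypothetical stand-ins that mirror the order of the numbered steps.

class ResourceManager:
    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)   # node name -> free containers
        self.next_app_id = 1

    def new_application(self):
        app_id = f"application_{self.next_app_id:04d}"
        self.next_app_id += 1
        return app_id

    def allocate(self, count):
        """Hand out up to `count` containers, filling nodes in name order."""
        granted = []
        for node in sorted(self.free_slots):
            while self.free_slots[node] > 0 and len(granted) < count:
                self.free_slots[node] -= 1
                granted.append(node)
        return granted


def run_job(rm, input_splits):
    """Return a log of what happened; assumes the cluster has free slots."""
    log = []
    app_id = rm.new_application()                  # step 2: get new application
    log.append(f"submitted {app_id}")              # step 4: submit application
    am_node = rm.allocate(1)[0]                    # steps 5-6: AppMaster container
    log.append(f"AppMaster on {am_node}")
    nodes = rm.allocate(len(input_splits))         # steps 7-8: splits, then request
    for split, node in zip(input_splits, nodes):   # steps 9-12: start and execute
        log.append(f"map task for {split} on {node}")
    return log


if __name__ == "__main__":
    rm = ResourceManager({"node1": 2, "node2": 2})
    for line in run_job(rm, ["split0", "split1", "split2"]):
        print(line)
```

The point of the design is visible even in this sketch: the Resource Manager only hands out containers, while the per-job coordination (one task per input split) happens in the application's own master.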
Hadoop 2.0 : YARN Workflow
[Diagram: the Resource Manager (Scheduler and Applications Manager (AsM)) above a grid of Node Managers. App Master 1 with Containers 1.1 and 1.2, and App Master 2 with Containers 2.1, 2.2 and 2.3, are spread across the Node Managers.]
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
- Single Point Of Failure Of NameNode
- Only one version can be run in classic MapReduce
- Too much burden on Job Tracker
Annie’s Answer
Too much burden on Job Tracker.
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
In YARN, the functionality of JobTracker has been replaced by which of the following YARN features:
- Job Scheduling
- Task Monitoring
- Resource Management
- Node management
Annie’s Answer
Task Monitoring and Resource Management. The fundamental idea of YARN is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications?
- Node Manager
- Job Tracker
- Task Tracker
- Application Master
- Resource Manager
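Annie’s Answer
Node Manager. The NodeManager is the per-node agent that is responsible for containers; it monitors their resource usage (CPU, memory, disk, network) and reports it to the ResourceManager.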
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
- Yes
- No
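Annie’s Answer
Yes. Hadoop 2.0 keeps binary compatibility for MapReduce applications built on Hadoop 1.0 with the old org.apache.hadoop.mapred API; jobs written against the new org.apache.hadoop.mapreduce API may need to be recompiled against MRv2 before they run successfully in a YARN enabled Hadoop Cluster.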
Annie’s Question
Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer
Annie’s Answer
ApplicationMaster.
Module-10 Pre-work
- Write the MapReduce programs using MRv2 for your Module-3 assignments.
  - Assignment programs: WordCount, Patents, Temperature, and Alphabets
- Create a single-node Apache Hadoop cluster using the documents present in the LMS.
  - Execute the ‘teragen’ example to test the cluster
Thank You
See You in Class Next Week
NN High Availability – Quorum Journal Manager
[Diagram: an Active Name Node and a Standby Name Node connected to three Journal Nodes, above the Data Nodes that hold the data blocks.]
- Active Name Node: writes namespace modifications to the edit logs; only the Active Name Node is the writer.
- Standby Name Node: reads the edit logs and applies them to its own namespace.
- Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
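With the Quorum Journal Manager, an edit counts as written once a majority of the Journal Nodes have accepted it. The sketch below shows only that majority rule; the JournalNode class, its up flag and the quorum_write function are hypothetical names for illustration.

```python
# Toy quorum write to Journal Nodes. The majority rule is the point;
# the class, the `up` flag and the function name are hypothetical.

class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, edit):
        """Store the edit and acknowledge it, unless this node is down."""
        if not self.up:
            return False
        self.edits.append(edit)
        return True


def quorum_write(journal_nodes, edit):
    """Return True when a majority of Journal Nodes acknowledged the edit."""
    acks = sum(1 for jn in journal_nodes if jn.write(edit))
    return acks > len(journal_nodes) // 2


if __name__ == "__main__":
    nodes = [JournalNode(), JournalNode(), JournalNode(up=False)]
    print(quorum_write(nodes, "mkdir /data"))
```

With three Journal Nodes, as drawn on the slide, the edit log stays writable while one of them is down, which removes the shared NFS storage of the earlier design as a single point of failure.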

Hadoop Developer

  • 2. Course Topics Slide 2 www.edureka.in/hadoop  Module 1  Understanding Big Data  Hadoop Architecture  Module 2  Hadoop Cluster Configuration  Data loading Techniques  Hadoop Project Environment  Module 3  Hadoop MapReduce framework  Programming in Map Reduce  Module 4  Advance MapReduce  MRUnit testing framework  Module 5  Analytics using Pig  Understanding Pig Latin  Module 6  Analytics using Hive  Understanding HIVE QL  Module 7  Advance Hive  NoSQL Databases and HBASE  Module 8  Advance HBASE  Zookeeper Service  Module 9  Hadoop 2.0 – New Features  Programming in MRv2  Module 10  Apache Oozie  Real world Datasets and Analysis  Project Discussion
  • 3. Topics for Today Slide 3 www.edureka.in/hadoop  Revision  Hadoop 2.0 New Features  HDFS High Availability  HDFS Federation  YARN or MRv2  YARN and Hadoop ecosystem  YARN Components  YARN Application Execution  Running an Application with YARN  Writing a YARN Application
  • 4. Lets’s Revise – HBase and ZooKeeper Master HFile RegionServers ReRgeigoinoSneSrevrevresrs memstore WAL /hbase/region 1 /hbase/region 2 ….. ….. /hbase/region Zookeeper HDFS Slide 4 www.edureka.in/hadoop
  • 5. Client HDFS Map Reduce Secondary Name Node Data Blocks …. Data Node Name Node Job Tracker Task Tracker Map Reduce Data Node Task Tracker Map Reduce Hadoop 1.0 – In Summary Data Node Data NodeTask Tracker Map Reduce Task Tracker Map Reduce Slide 5 www.edureka.in/hadoop
  • 6. Hadoop 1.0 - Challenges Slide 6 www.edureka.in/hadoop Problem Description NameNode – No Horizontal Scalability Single NameNode and Single Namespaces, limited by NameNode RAM NameNode – No High Availability (HA) NameNode is Single Point of Failure, Need manual recovery using Secondary NameNode in case of failure Job Tracker – Overburdened Spends significant portion of time and effort managing the life cycle of applications MRv1 – Only Map and Reduce tasks Humongous Data stored in HDFS remains unutilized and cannot be used for other workloads such as Graph processing etc.
  • 7. NameNode – Scale and HA NameNode - No High Availability NameNode - No Horizontal Scale Data Node Data Node Data Node …. Client get Block Locations Block Management Read Data NameNode NS Slide 7 www.edureka.in/hadoop
  • 8.  Secondary NameNode:  “Not a hot standby” for the NameNode  Connects to NameNode every hour*  Housekeeping, backup of NemeNode metadata  Saved metadata can build a failed NameNode You give me metadata every hour, I will make it secure Secondary NameNode Single Point of FailureNameNode metadata Slide 8 www.edureka.in/hadoop metadata Name Node –Single Point of Failure
  • 9. Job Tracker – Overburdened CPU  Spends a very significant portion of time and effort managing the life cycle of applications Network  Single Listener Thread to communicate thousands of Map and Reduce Jobs with Task Tracker Task Tracker Task Tracker…. Job Tracker Slide 9 www.edureka.in/hadoop
  • 10. MRv1 – Unpredictability in Large Clusters As the cluster size grow and reaches to 4000 Nodes  Cascading Failures  The DataNode failures results in a serious deterioration of the overall cluster performance because of attempts to replicate data and overload live nodes, through network flooding.  Multi-tenancy  As clusters increase in size, you may want to employ these clusters for a variety of models. MRv1 dedicates its nodes to Hadoop and cannot be re-purposed for other applications and workloads in an Organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important. Slide 10 www.edureka.in/hadoop
  • 11.  Terabytes and Petabytes of data in HDFS can be used only for MapReduce processing Unutilized Data in HDFS www.edureka.in/hadoop
  • 12. Hadoop 2.0 New Features Slide 12 www.edureka.in/hadoop Other important Hadoop 2.0 features  HDFS Snapshots  NFSv3 access to data in HDFS  Support for running Hadoop on MS Windows  Binary Compatibility for MapReduce applications built on Hadoop 1.0  Substantial amount of Integration testing with rest of the projects (such as Hadoop ecosystem PIG, HIVE) in Property Hadoop 1.0 Hadoop 2.0 Federation One Namenode and Namespaces Multiple Namenode and Namespaces High Availability Not present Highly Available YARN - Processing Control and Multi- tenancy JobTracker, Task Tracker Resource Manager, Node Manager, App Master, Capacity Scheduler
  • 13. Hadoop 2.0 Cluster Architecture - Federation Namenode Block Management NS Storage Datanode Datanode… NamespaceBlockStorage Namespace NS1 NSk NSn NN-1 NN-k NN-n Common Storage Datanode 1 … Datanode 2 … Datanode m … BlockStorage Pool 1 Pool k Pool n Block Pools … … Hadoop 1.0 Hadoop 2.0 Slide 13 www.edureka.in/hadoop http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
  • 14. - provides cross-data center (non- Hlo eca lll o) s Tup hp eo rr et !fo !r HDFS, allowing Slide 14 www.edureka.in/hadoop My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. How does HDFS Federation help HDFS Scale horizontally? - reduces the load on any single NameNode by using the multiple, independent NameNode to manage individual parts of the filesystem namespace a cluster administrator to split the Block Storage outside the local cluster. Annie’s Question
  • 15. A. In order to scale the name service horizontally, HDFS federation uses multiple independent Namenodes. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. Slide 15 www.edureka.in/hadoop Annie’s Answer
  • 16. /finance respectively. What will happ Hen elif loyo Tu htr ey reto !!put a file in Slide 16 www.edureka.in/hadoop My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. You have configured two name nodes to manage /marketing and /accounting directory? Annie’s Question
  • 17. The put will fail. None of the namespace will manage the file and you will get an IOException with a No such file or directory error. Slide 17 www.edureka.in/hadoop Annie’s Answer
  • 18. Node Manager HDFS YARN Resource Manager Shared edit logs All name space edits logged to shared NFS storage; single writer (fencing) Read edit logs and applies to its own namespace Secondary Name Node Data Node Standby NameNode Active NameNode Container App Master Node Manager Data Node Container App Master Data Node Container Data Node NDoadtae NMoadneager Client App Master Node Manager Container App Master Hadoop 2.0 Cluster Architecture - HA NameNode High Availability Slide 18 www.edureka.in/hadoop Next Generation MapReduce HDFS HIGH AVAILABILITY http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html *Not necessary to configure Secondary NameNode
  • 19. NameNode Recovery Vs. Failover High Availability in Hadoop 2.0 NameNode recovery in Hadoop 1.0 Secondary Name Node Standby NameNode Active NameNode Secondary Name Node NameNode Edit logs Meta-Data Automatic failover to Standby NameNode Manually Recover using Secondary NameNode FSImage www.edureka.in/hadoop
  • 20. HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0? - Single Point Of Failure Of Name•Node - Only one version can be run in cHlaseslilcoMTaph•eRreed!u!ce - Too much burden on Job TracMkeyr name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. Slide 20 www.edureka.in/hadoop Annie’s Question
  • 21. Single Point of Failure of NameNode. Slide 21 www.edureka.in/hadoop Annie’s Answer
  • 22. Hadoop 2.0 – In Summary [diagram]. Client. HDFS (Distributed Data Storage, NameNode High Availability): Active NameNode, Standby NameNode, and Secondary Name Node as masters, with shared edit logs; Data Nodes as slaves. YARN (Distributed Data Processing, Next Generation MapReduce): Resource Manager as master, comprising the Scheduler and the Applications Manager (AsM); Node Managers as slaves, hosting Containers and App Masters.
  • 23. YARN and Hadoop Ecosystem [diagram]. Stack: Apache Oozie (Workflow); Pig Latin (Data Analysis); Hive (DW System); MapReduce Framework; HBase; Other YARN Frameworks (MPI, GIRAPH); YARN (Cluster Resource Management); HDFS (Hadoop Distributed File System). YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework.
  • 24. YARN – Moving beyond MapReduce: BATCH (MapReduce); INTERACTIVE (Tez); ONLINE (HBase); STREAMING (Storm, S4, …); GRAPH (Giraph); IN-MEMORY (Spark); HPC MPI (OpenMPI); OTHER (Search) (Weave..). http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
  • 25. Multi-tenancy - Capacity Scheduler. Organizes jobs into queues; queue shares are percentages of the cluster; FIFO scheduling within each queue; data locality-aware scheduling. Hierarchical Queues - to manage resources within an organization. Capacity Guarantees - a fraction of the total available capacity is allocated to each queue. Security - to safeguard applications from other users. Elasticity - resources are available to queues in a predictable and elastic manner. Multi-tenancy - a set of limits to prevent over-utilization of resources by a single application. Operability - runtime configuration of queues. Resource-based scheduling - if needed, applications can request more resources than the default.
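The two core ideas on this slide, percentage shares per queue and FIFO ordering inside a queue, can be sketched in a few lines. This is an illustration only and not the Capacity Scheduler's actual algorithm; `CapacityQueue`, `guaranteed_capacity`, `next_job`, and the queue names are hypothetical.

```python
from collections import deque


class CapacityQueue:
    def __init__(self, name, share_percent):
        self.name = name
        self.share_percent = share_percent  # share of the cluster, in percent
        self.jobs = deque()                 # jobs wait in arrival order


def guaranteed_capacity(queues, cluster_containers):
    """Map each queue name to the containers its percentage share guarantees."""
    if sum(q.share_percent for q in queues) != 100:
        raise ValueError("queue shares must add up to 100 percent")
    return {q.name: cluster_containers * q.share_percent // 100 for q in queues}


def next_job(queue):
    """Pick the oldest waiting job: FIFO scheduling within a queue."""
    return queue.jobs.popleft() if queue.jobs else None


marketing = CapacityQueue("marketing", 60)
finance = CapacityQueue("finance", 40)
marketing.jobs.extend(["wordcount", "patents"])
shares = guaranteed_capacity([marketing, finance], cluster_containers=50)
```

With a 60/40 split of 50 containers, `shares` guarantees 30 containers to `marketing` and 20 to `finance`, and `next_job(marketing)` hands out `wordcount` before `patents`.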
  • 26. Executing MapReduce Application on YARN: MapReduce Application Execution
  • 27. YARN MR Application Execution Flow. MapReduce Job Execution: Job Submission; Job Initialization; Tasks Assignment; Tasks’ Memory; Status Updates; Failure Recovery. Client: submits a MapReduce Job. Resource Manager: manages the resource utilization across the Hadoop Cluster. Node Manager: runs on each Data Node; creates execution containers; monitors each container’s usage. MapReduce Application Master: coordinates and manages MapReduce Jobs; negotiates with the Resource Manager to schedule tasks; the tasks are started by NodeManager(s). HDFS: shares resources and the Job’s artefacts among YARN components.
  • 28. Hadoop 2.0 - YARN. YARN = Yet Another Resource Negotiator [diagram: Clients, Resource Manager, Node Managers hosting Containers and App Masters, and a JobHistory Server; flows labeled MapReduce Status, Job Submission, Node Status, and Resource Request]. Resource Manager: cluster-level resource manager; long life, high-quality hardware. Node Manager: one per Data Node; monitors resources on the Data Node. Application Master: one per application; short life; manages tasks/scheduling.
  • 29. YARN MR Application Execution Flow [diagram, part 1]. Components shown: Client, Application, Job Object (Client JVM); Resource Manager (Management Node); Node Manager, MR AppMaster, YarnChild, Map or Reduce Task (Task JVM) on the DataNode; HDFS. Steps: 1. Run Job; 2. Get New Application; 3. Copy Job Resources; 4. Submit Application.
  • 30. YARN MR Application Execution Flow [diagram, part 2]. Components shown: Client, Application, Job Object (Client JVM); Resource Manager (Management Node); Node Manager, MR AppMaster, YarnChild, Map or Reduce Task (Task JVM) on the DataNode; HDFS. Steps: 1. Run Job; 2. Get New Application; 3. Copy Job Resources; 4. Submit Application; 5. Start MR AppMaster container; 6. Create container; 7. Get Input Splits; 8. Request Resources; 9. Start container.
  • 31. YARN MR Application Execution Flow [diagram, part 3]. Components shown: Client, Application, Job Object (Client JVM); Resource Manager (Management Node); Node Manager, MR AppMaster, YarnChild, Map or Reduce Task (Task JVM) on the DataNode; HDFS. Steps: 1. Run Job; 2. Get New Application; 3. Copy Job Resources; 4. Submit Application; 5. Start MR AppMaster container; 6. Create container; 7. Get Input Splits; 8. Request Resources; 9. Start container; 10. Create Container; 11. Acquire Job Resources; 12. Execute.
  • 32. YARN MR Application Execution Flow [diagram, part 4]. Components shown: Client, Application, Job Object (Client JVM); Resource Manager (Management Node); Node Manager, MR AppMaster, YarnChild, Map or Reduce Task (Task JVM) on the DataNode; HDFS. Status flows: Poll for Status; Update Status.
  • 33. Hadoop 2.0 : YARN Workflow [diagram]. Resource Manager, comprising the Scheduler and the Applications Manager (AsM), over a cluster of Node Managers. App Master 1 with Containers 1.1 and 1.2; App Master 2 with Containers 2.1, 2.2, and 2.3.
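The workflow diagram, with one App Master per application and task containers spread over Node Managers, can be mimicked with a toy allocator. This is a deliberately simplified sketch under stated assumptions: the classes and the "most free slots first" placement rule are hypothetical, and the App Master's own resource requests are folded into `submit` instead of being modelled as a separate daemon.

```python
class NodeManager:
    def __init__(self, name, capacity):
        self.name = name
        self.free = capacity   # container slots still available on this node
        self.containers = []   # (app_id, role) for each running container

    def launch(self, app_id, role):
        self.free -= 1
        self.containers.append((app_id, role))


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, role):
        # Grant one container on the node with the most free slots
        # (an assumed placement rule, chosen only for illustration).
        node = max(self.nodes, key=lambda n: n.free)
        if node.free == 0:
            return None  # cluster is full; the request has to wait
        node.launch(app_id, role)
        return node.name

    def submit(self, app_id, num_tasks):
        # One App Master container per application, then its task containers.
        placements = [self.allocate(app_id, "app-master")]
        placements += [self.allocate(app_id, "task") for _ in range(num_tasks)]
        return placements


nodes = [NodeManager("nm1", 2), NodeManager("nm2", 2)]
rm = ResourceManager(nodes)
app1_placements = rm.submit("app1", num_tasks=2)
```

Each entry in `app1_placements` names the Node Manager that hosts one container, the App Master first. A second application submitted to the same small cluster would get `None` for any container that no longer fits.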
  • 34. Annie’s Question: Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework? - Single Point Of Failure Of NameNode - Only one version can be run in classic MapReduce - Too much burden on Job Tracker
  • 35. Annie’s Answer: Too much burden on Job Tracker.
  • 36. Annie’s Question: Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. In YARN, the functionality of JobTracker has been replaced by which of the following YARN features? - Job Scheduling - Task Monitoring - Resource Management - Node management
  • 37. Annie’s Answer: Task Monitoring and Resource Management. The fundamental idea of YARN is to split up the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
  • 38. Annie’s Question: Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications? - Node Manager - Job Tracker - Task tracker - Application Master - Resource manager
  • 40. Annie’s Question: Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster? - Yes - No
  • 41. Annie’s Answer: Yes. After enabling YARN, you need to recompile the Jobs in MRv2 for them to run successfully in a YARN enabled Hadoop Cluster.
  • 42. Annie’s Question: Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions. Which of the following YARN/MRv2 daemons is responsible for launching the tasks? - Task Tracker - Resource Manager - ApplicationMaster - ApplicationMasterServer
  • 44. Module-10 Pre-work: Write the MapReduce programs using MRv2 for your Module-3 assignments (Assignment Programs: WordCount, Patents, Temperature, and Alphabets). Create a Single Node Apache Hadoop Cluster using the documents present in the LMS. Execute the ‘teragen’ example to test the Cluster.
  • 45. Thank You See You in Class Next Week
  • 46. NN High Availability – Quorum Journal Manager [diagram]. Components: Active Name Node, Standby Name Node, three Journal Nodes, and Data Nodes holding Data Blocks. The Active Name Node writes namespace modifications to the edit logs on the Journal Nodes; only the Active Name Node is the writer. The Standby Name Node reads the edit logs and applies them to its own namespace. Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
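The quorum idea behind the Journal Nodes can be shown with a short sketch: an edit counts as durable once a majority of Journal Nodes have stored it, so the log survives the loss of a minority. The `JournalNode` class and `quorum_write` helper are hypothetical simplifications and not the Quorum Journal Manager's real protocol, which also handles epochs and recovery.

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up       # whether this node is currently reachable
        self.edits = []

    def write(self, edit):
        if not self.up:
            return False   # an unreachable node cannot acknowledge
        self.edits.append(edit)
        return True


def quorum_write(journal_nodes, edit):
    """Return True when a majority of JournalNodes stored the edit."""
    acks = sum(jn.write(edit) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2


# Three Journal Nodes, one of them down: two acknowledgements still form a majority.
journal_nodes = [JournalNode(), JournalNode(), JournalNode(up=False)]
first_ok = quorum_write(journal_nodes, "mkdir /data")
```

With three Journal Nodes the write succeeds while one is down, but it would be rejected if a second node also became unreachable.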