Failsafe Hadoop Infrastructure and the way they work

High Availability Hadoop Clusters

Impact
• Planned downtime
− Upgrades
− Config changes
• Unplanned downtime
− Hardware failure
− Server unresponsive
− Software failures
− Occurs infrequently

Different Kinds Of HA Configurations
• HDFS HA using QJM (Quorum Journal Manager)
• HDFS HA using NFS for shared storage
• Resource manager HA

HDFS HA - Necessary Hardware Resources
• NameNode machines
− Active NN
− Standby NN
Both should ideally run on equivalent hardware.
• Journal nodes
− Lightweight daemons that can run on machines already running other Hadoop daemons.
− At least 3 journal node daemons are needed, because each shared edit log entry must be
written to a majority of the journal nodes.
− Journal node daemons should be run in odd numbers (3, 5, 7, etc.).
− When running N journal nodes, the system tolerates at most (N-1)/2 failures.
• ZooKeeper service (needed for automatic failover)
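
The journal node sizing rules above can be checked with a little arithmetic. The following is a minimal sketch, not Hadoop code: the function names (majority, tolerated_failures, can_commit_edit) are hypothetical helpers introduced only for illustration. It assumes an edit counts as committed once a majority of journal nodes acknowledge it.

```python
def majority(n: int) -> int:
    """Smallest number of journal nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Maximum journal node failures tolerated with n nodes: (N-1)/2, rounded down."""
    return (n - 1) // 2


def can_commit_edit(n: int, acks: int) -> bool:
    """An edit is treated as durable once a majority of journal nodes acknowledge it."""
    return acks >= majority(n)


if __name__ == "__main__":
    # Compare odd and even cluster sizes.
    for n in (3, 4, 5, 7):
        print(f"N={n}: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

Under these assumptions, 4 journal nodes tolerate the same single failure as 3, which is one way to see why odd numbers are recommended: the extra node adds cost without adding fault tolerance.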

HDFS HA Architecture Using The Quorum Journal Manager

RM HA - Necessary Hardware Resources
• Resource manager machines
− Active RM
− Standby RM
Both should ideally run on equivalent hardware.
• ZooKeeper service

Resource Manager HA Architecture

RM Failover
• Two failover mechanisms
− Manual transition: transition the current active RM to standby, then transition the
standby RM to active.
− Automatic failover: an embedded ZooKeeper-based ActiveStandbyElector decides which
RM is in the active state.
• Each client must be configured with the full list of resource managers. Clients try them in
round-robin order until they reach the active resource manager.
• The promoted RM continues from where the previous RM left off. The new RM
spawns new attempts for each of the managed applications. Applications can create
checkpoints to avoid losing work. All state is stored in the ZooKeeper state store, which
grants write access to only a single RM at a time.
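
The round-robin client behaviour described above can be sketched as follows. This is a toy simulation, not the YARN client: ResourceManager, StandbyError, submit and submit_with_failover are hypothetical names, and it assumes a standby RM rejects requests so that the client moves on to the next RM in its list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceManager:
    rm_id: str
    state: str  # "active" or "standby"


class StandbyError(Exception):
    """Raised by the simulated RM when it is not the active one."""


def submit(rm: ResourceManager, request: str) -> str:
    # Assumption: a standby RM refuses to serve client requests.
    if rm.state != "active":
        raise StandbyError(rm.rm_id)
    return f"{rm.rm_id} handled {request}"


def submit_with_failover(rms: List[ResourceManager], request: str, start: int = 0) -> str:
    """Try each configured RM once, in round-robin order, until the active one answers."""
    n = len(rms)
    for offset in range(n):
        rm = rms[(start + offset) % n]
        try:
            return submit(rm, request)
        except StandbyError:
            continue  # this RM is standby; try the next one in the list
    raise RuntimeError("no active ResourceManager found")


if __name__ == "__main__":
    rms = [ResourceManager("rm1", "active"), ResourceManager("rm2", "standby")]
    print(submit_with_failover(rms, "job-1"))

    # Simulate a failover: rm1 goes to standby, rm2 is promoted to active.
    rms[0].state, rms[1].state = "standby", "active"
    print(submit_with_failover(rms, "job-2"))
```

The client needs no knowledge of which RM is active at any moment; it only needs the complete list, which is why every client must be configured with all resource managers.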
