Hadoop World 2011: HDFS NameNode High Availability - Aaron Myers, Cloudera & Sanjay Radia, Hortonworks
NameNode HA
Suresh Srinivas - Hortonworks
Aaron T. Myers - Cloudera
Overview
Part 1 – Suresh Srinivas (Hortonworks)
- HDFS Availability and Data Integrity – what is the record?
- NN HA Design
Part 2 – Aaron T. Myers (Cloudera)
- NN HA Design continued
- Client-NN Connection failover
- Operations and Admin of HA
- Future Work
Current HDFS Availability & Data Integrity
- Simple design, storage fault tolerance
  - Storage: rely on the OS's file system rather than use raw disk
  - Storage fault tolerance: multiple replicas, active monitoring
  - Single NameNode master
    - Persistent state: multiple copies + checkpoints
    - Restart on failure
- How well did it work?
  - Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009 – seven 9's of reliability
    - Fixed in 0.20 and 0.21
  - 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster
    - Only 8 would have benefited from HA failover!! (0.23 failures per cluster year)
- NN is very robust and can take a lot of abuse
  - NN is resilient against overload caused by misbehaving apps
HA NameNode
- Active work has started on HA NameNode (failover)
  - Detailed design and sub-tasks in HDFS-1623
- HA: related work
  - Backup NN (0.21)
  - Avatar NN (Facebook)
  - HA NN prototype using Linux HA (Yahoo!)
  - HA NN prototype with Backup NN and block report replicator (eBay)
- HA is the highest priority
Approach and Terminology
- Initial goal is Active-Standby
  - With Federation, each namespace volume has a NameNode
  - Single active NN for any namespace volume
- Terminology
  - Active NN – actively serves read/write operations from clients
  - Standby NN – waits, becomes active when the Active dies or is unhealthy
    - Could serve read operations
  - Standby's state may be cold, warm, or hot
    - Cold: Standby has zero state (e.g. started after the Active is declared dead)
    - Warm: Standby has partial state: has loaded fsImage and editLogs, but has not received any block reports
    - Hot: Standby has almost all of the Active's state and can take over immediately
High Level Use Cases
- Supported failures
  - Single hardware failure
    - Double hardware failure not supported
  - Some software failures
    - Same software failure affects both active and standby
- Planned downtime
  - Upgrades
  - Config changes
  - Main reason for downtime
- Unplanned downtime
  - Hardware failure
  - Server unresponsive
  - Software failures
  - Occurs infrequently
Use Cases
- Deployment models
  - Single NN configuration; no failover
  - Active and Standby with manual failover
    - Standby could be cold/warm/hot
    - Addresses downtime during upgrades – main cause of unavailability
  - Active and Standby with automatic failover
    - Hot standby
    - Addresses downtime during upgrades and other failures
- See HDFS-1623 for detailed use cases
Design
- Failover control outside NN
- Parallel block reports to Active and Standby (hot failover)
- Shared or non-shared NN state
- Fencing of shared resources/data
  - Datanodes
  - Shared NN state (if any)
- Client failover
  - IP failover
  - Smart clients (e.g. configuration, or ZooKeeper for coordination)
Failover Control Outside NN
- HA daemon outside NameNode
- Daemon manages resources
  - All resources modeled uniformly
  - Resources – OS, HW, network, etc.
  - NameNode is just another resource
- Heartbeat with other nodes
- Quorum-based leader election
  - ZooKeeper for coordination and quorum
- Fencing during split brain
  - Prevents data corruption
(Diagram: a quorum service provides heartbeat and leader election; an HA daemon manages resources with actions start, stop, failover, monitor, …; fencing/STONITH protects shared resources)
NN HA with Shared Storage and ZooKeeper
(Diagram: a ZooKeeper ensemble exchanges heartbeats with a FailoverController next to each of the Active and Standby NNs; each controller monitors the health of its NN, OS, and HW and issues commands to it; NN state is shared with a single writer, enforced by fencing; DataNodes send block reports to both the Active and the Standby, and DN fencing ensures update commands are accepted from only one NN. A hedged configuration sketch for the automatic-failover piece follows.)
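The failover controller in this diagram later shipped as the ZKFailoverController (ZKFC) in Hadoop 2.x, after this talk was given. As a hedged illustration only – the property names come from those later releases, not from the talk, and the ZooKeeper hostnames are made up – the automatic-failover piece is wired up with roughly the following hdfs-site.xml and core-site.xml entries:

    <!-- hdfs-site.xml: run a failover controller (ZKFC) next to each NN -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml: ZooKeeper quorum the failover controllers use for
         coordination and leader election -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>

With these set, a ZKFC process on each NN host performs the health monitoring, leader election, and fencing shown in the diagram.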
HA Design Details
Client Failover Design
- Smart clients
  - Users use one logical URI; the client selects the correct NN to connect to (config sketch below)
- Implementing two options out of the box
  - Client knows of multiple NNs
  - Use a coordination service (ZooKeeper)
- Common things between these
  - Which operations are idempotent, and therefore safe to retry on a failover
  - Failover/retry strategies
- Some differences
  - Expected time for client failover
  - Ease of administration
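To make the "one logical URI, smart client" option concrete, here is a minimal sketch of how it surfaced in later Hadoop 2.x releases. The nameservice name mycluster and the hostnames are hypothetical, and the property names post-date this talk; they are shown only as an illustration of the idea:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>nn2.example.com:8020</value>
    </property>
    <!-- client-side plugin that picks the active NN and retries idempotent
         operations after a failover -->
    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

Clients then address the filesystem as hdfs://mycluster/path and never need to know which physical NN is currently active.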
Ops/Admin: Shared Storage
- To share NN state, you need shared storage
  - Needs to be HA itself to avoid just shifting the SPOF
  - BookKeeper, etc. will likely take care of this in the future
  - Many come with IP fencing options
  - Recommended mount options: tcp,soft,intr,timeo=60,retrans=10
- Not all edits directories are created equal
  - It used to be that all edits dirs were just a pool of redundant dirs
  - Can now configure some edits directories to be required (see the sketch below)
  - Can now configure the number of tolerated failures
  - You want at least 2 for durability, 1 remote for HA
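A hedged sketch of the shared-storage and "required directory" settings: the NFS path below is hypothetical, the mount options are the ones recommended above, and the property names are from later Hadoop releases rather than from this talk:

    <!-- NFS-mounted directory (mounted with tcp,soft,intr,timeo=60,retrans=10)
         to which the active NN synchronously writes its edit log -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer/hdfs/shared-edits</value>
    </property>
    <!-- mark the shared directory as required, so the NN treats its loss as
         fatal instead of quietly dropping it from the pool -->
    <property>
      <name>dfs.namenode.edits.dir.required</name>
      <value>file:///mnt/filer/hdfs/shared-edits</value>
    </property>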
Ops/Admin: NN Fencing
- Client failover does not solve this problem
- Out of the box
  - RPC to the active NN to tell it to go to standby (graceful failover)
  - SSH to the active NN and `kill -9` the NN
- Pluggable options
  - Many filers have protocols for IP-based fencing options
  - Many PDUs have protocols for IP-based plug-pulling (STONITH)
  - Nuke the node from orbit. It's the only way to be sure.
- Configure extra options if available to you (example below)
  - Will be tried in order during a failover event
  - Escalate the aggressiveness of the methods
- Fencing is critical for correctness of NN metadata
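The escalating fencing chain maps onto configuration roughly as follows in later Hadoop 2.x releases (a hedged sketch: the shell script path is hypothetical and would wrap whatever filer or PDU protocol is available; methods are tried in the order listed):

    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>
        sshfence
        shell(/usr/local/bin/fence-storage-or-pdu.sh)
      </value>
    </property>
    <!-- key sshfence uses to log in to the previously active NN host and
         kill the NN process -->
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hdfs/.ssh/id_rsa</value>
    </property>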
Ops/Admin: Monitoring
- New NN metrics
  - Size of pending DN message queues
  - Seconds since the standby NN last read from the shared edit log
  - DN block report lag
  - All are measurements of standby NN lag – monitor/alert on all of these
- Monitor the shared storage solution
  - Volumes fill up, disks go bad, etc.
  - Should configure a paranoid edit log retention policy (default is 2; sketch below)
- Canary-based monitoring of HDFS is a good idea
  - Pinging both NNs is not sufficient
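Assuming "default is 2" refers to the number of retained checkpoints (an interpretation, not stated on the slide), a more paranoid retention policy in later Hadoop releases looks roughly like this; the values are illustrative, not a recommendation from the talk:

    <!-- keep more than the default 2 checkpoints -->
    <property>
      <name>dfs.namenode.num.checkpoints.retained</name>
      <value>10</value>
    </property>
    <!-- also retain extra edit-log transactions beyond what the retained
         checkpoints strictly require -->
    <property>
      <name>dfs.namenode.num.extra.edits.retained</name>
      <value>10000000</value>
    </property>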
Ops/Admin: Hardware
- Active/Standby NNs should be on separate racks
- Shared storage system should be on a separate rack
- Active/Standby NNs should have close to the same hardware
  - Same amount of RAM – they need to store the same things
  - Same number of processors – they need to serve the same number of clients
- All the usual NN recommendations still apply
  - ECC memory, 48GB
  - Several separate disks for NN metadata directories
  - Redundant disks for OS drives, probably RAID 5 or mirroring
  - Redundant power
Future Work
- Other options to share NN metadata
  - BookKeeper
  - Multiple, potentially non-HA filers
  - Entirely different metadata system
- More advanced client failover/load shedding
  - Serve stale reads from the standby NN
  - Speculative RPC
  - Non-RPC clients (IP failover, DNS failover, proxy, etc.)
- Even higher HA
  - Multiple standby NNs
Q&A
- Detailed design (HDFS-1623)
- Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073


Editor's Notes

  • #4: Data – can I read what I wrote? Is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit – random writers. Simplicity is key. Raw disk – file systems take time to stabilize; we can take advantage of ext4, xfs, or zfs.