Hadoop World 2011: HDFS NameNode High Availability - Aaron Myers, Cloudera & Sanjay Radia, Hortonworks
NameNode HA
Suresh Srinivas - Hortonworks
Aaron T. Myers - Cloudera
Overview
Part 1 – Suresh Srinivas (Hortonworks)
- HDFS Availability and Data Integrity – what is the record?
- NN HA Design
Part 2 – Aaron T. Myers (Cloudera)
- NN HA Design continued
- Client-NN Connection failover
- Operations and Admin of HA
- Future Work
Current HDFS Availability & Data Integrity
- Simple design, storage fault tolerance
  - Storage: rely on the OS's file system rather than use raw disk
  - Storage fault tolerance: multiple replicas, active monitoring
  - Single NameNode master
    - Persistent state: multiple copies + checkpoints
    - Restart on failure
- How well did it work?
  - Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009 – seven 9's of reliability
    - Fixed in 0.20 and 0.21
  - 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster
    - Only 8 would have benefited from HA failover!! (0.23 failures per cluster year)
- NN is very robust and can take a lot of abuse
  - NN is resilient against overload caused by misbehaving apps
HA NameNode
- Active work has started on HA NameNode (failover)
  - Detailed design and sub-tasks in HDFS-1623
- HA: related work
  - Backup NN (0.21)
  - Avatar NN (Facebook)
  - HA NN prototype using Linux HA (Yahoo!)
  - HA NN prototype with Backup NN and block report replicator (eBay)
- HA is the highest priority
Approach and Terminology
- Initial goal is Active-Standby
  - With Federation, each namespace volume has a NameNode
  - Single active NN for any namespace volume
- Terminology
  - Active NN – actively serves read/write operations from clients
  - Standby NN – waits, becomes active when the Active dies or is unhealthy
    - Could serve read operations
  - Standby's state may be cold, warm, or hot
    - Cold: Standby has zero state (e.g. started after the Active is declared dead)
    - Warm: Standby has partial state: has loaded fsImage and editLogs, but has not received any block reports
    - Hot: Standby has almost all of the Active's state and can take over immediately
High Level Use Cases
- Supported failures
  - Single hardware failure
    - Double hardware failure not supported
  - Some software failures
    - Same software failure affects both active and standby
- Planned downtime
  - Upgrades
  - Config changes
  - Main reason for downtime
- Unplanned downtime
  - Hardware failure
  - Server unresponsive
  - Software failures
  - Occurs infrequently
Use Cases
- Deployment models
  - Single NN configuration; no failover
  - Active and Standby with manual failover
    - Standby could be cold/warm/hot
    - Addresses downtime during upgrades – main cause of unavailability
  - Active and Standby with automatic failover
    - Hot standby
    - Addresses downtime during upgrades and other failures
- See HDFS-1623 for detailed use cases
Design
- Failover control outside NN
- Parallel block reports to Active and Standby (hot failover)
- Shared or non-shared NN state
- Fencing of shared resources/data
  - Datanodes
  - Shared NN state (if any)
- Client failover
  - IP failover
  - Smart clients (e.g. configuration, or ZooKeeper for coordination)
Failover Control Outside NN
- HA daemon outside NameNode
- Daemon manages resources
  - All resources modeled uniformly
  - Resources – OS, HW, network, etc.
  - NameNode is just another resource
- Heartbeat with other nodes
- Quorum-based leader election
  - ZooKeeper for coordination and quorum
- Fencing during split brain
  - Prevents data corruption
(Diagram: a quorum service provides heartbeat and leader election; an HA daemon manages resources with actions start, stop, failover, monitor, …; fencing/STONITH protects shared resources)
NN HA with Shared Storage and ZooKeeper
(Diagram: a ZooKeeper ensemble exchanges heartbeats with a FailoverController next to each of the Active and Standby NNs; each controller monitors the health of its NN, OS, and HW and issues commands to it; NN state is shared with a single writer, enforced by fencing; DataNodes send block reports to both the Active and the Standby, and DN fencing ensures update commands are accepted from only one NN. A hedged configuration sketch for the automatic-failover piece follows.)
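The failover controller in this diagram later shipped as the ZKFailoverController (ZKFC) in Hadoop 2.x, after this talk was given. As a hedged illustration only – the property names come from those later releases, not from the talk, and the ZooKeeper hostnames are made up – the automatic-failover piece is wired up with roughly the following hdfs-site.xml and core-site.xml entries:

    <!-- hdfs-site.xml: run a failover controller (ZKFC) next to each NN -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml: ZooKeeper quorum the failover controllers use for
         coordination and leader election -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>

With these set, a ZKFC process on each NN host performs the health monitoring, leader election, and fencing shown in the diagram.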
HA Design Details
Client Failover Design
- Smart clients
  - Users use one logical URI; the client selects the correct NN to connect to (config sketch below)
- Implementing two options out of the box
  - Client knows of multiple NNs
  - Use a coordination service (ZooKeeper)
- Common things between these
  - Which operations are idempotent, and therefore safe to retry on a failover
  - Failover/retry strategies
- Some differences
  - Expected time for client failover
  - Ease of administration
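To make the "one logical URI, smart client" option concrete, here is a minimal sketch of how it surfaced in later Hadoop 2.x releases. The nameservice name mycluster and the hostnames are hypothetical, and the property names post-date this talk; they are shown only as an illustration of the idea:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>nn2.example.com:8020</value>
    </property>
    <!-- client-side plugin that picks the active NN and retries idempotent
         operations after a failover -->
    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

Clients then address the filesystem as hdfs://mycluster/path and never need to know which physical NN is currently active.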
Ops/Admin: Shared Storage
- To share NN state, you need shared storage
  - Needs to be HA itself to avoid just shifting the SPOF
  - BookKeeper, etc. will likely take care of this in the future
  - Many come with IP fencing options
  - Recommended mount options: tcp,soft,intr,timeo=60,retrans=10
- Not all edits directories are created equal
  - It used to be that all edits dirs were just a pool of redundant dirs
  - Can now configure some edits directories to be required (see the sketch below)
  - Can now configure the number of tolerated failures
  - You want at least 2 for durability, 1 remote for HA
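A hedged sketch of the shared-storage and "required directory" settings: the NFS path below is hypothetical, the mount options are the ones recommended above, and the property names are from later Hadoop releases rather than from this talk:

    <!-- NFS-mounted directory (mounted with tcp,soft,intr,timeo=60,retrans=10)
         to which the active NN synchronously writes its edit log -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer/hdfs/shared-edits</value>
    </property>
    <!-- mark the shared directory as required, so the NN treats its loss as
         fatal instead of quietly dropping it from the pool -->
    <property>
      <name>dfs.namenode.edits.dir.required</name>
      <value>file:///mnt/filer/hdfs/shared-edits</value>
    </property>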
Ops/Admin: NN Fencing
- Client failover does not solve this problem
- Out of the box
  - RPC to the active NN to tell it to go to standby (graceful failover)
  - SSH to the active NN and `kill -9` the NN
- Pluggable options
  - Many filers have protocols for IP-based fencing options
  - Many PDUs have protocols for IP-based plug-pulling (STONITH)
  - Nuke the node from orbit. It's the only way to be sure.
- Configure extra options if available to you (example below)
  - Will be tried in order during a failover event
  - Escalate the aggressiveness of the methods
- Fencing is critical for correctness of NN metadata
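The escalating fencing chain maps onto configuration roughly as follows in later Hadoop 2.x releases (a hedged sketch: the shell script path is hypothetical and would wrap whatever filer or PDU protocol is available; methods are tried in the order listed):

    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>
        sshfence
        shell(/usr/local/bin/fence-storage-or-pdu.sh)
      </value>
    </property>
    <!-- key sshfence uses to log in to the previously active NN host and
         kill the NN process -->
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hdfs/.ssh/id_rsa</value>
    </property>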
Ops/Admin: Monitoring
- New NN metrics
  - Size of pending DN message queues
  - Seconds since the standby NN last read from the shared edit log
  - DN block report lag
  - All are measurements of standby NN lag – monitor/alert on all of these
- Monitor the shared storage solution
  - Volumes fill up, disks go bad, etc.
  - Should configure a paranoid edit log retention policy (default is 2; sketch below)
- Canary-based monitoring of HDFS is a good idea
  - Pinging both NNs is not sufficient
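Assuming "default is 2" refers to the number of retained checkpoints (an interpretation, not stated on the slide), a more paranoid retention policy in later Hadoop releases looks roughly like this; the values are illustrative, not a recommendation from the talk:

    <!-- keep more than the default 2 checkpoints -->
    <property>
      <name>dfs.namenode.num.checkpoints.retained</name>
      <value>10</value>
    </property>
    <!-- also retain extra edit-log transactions beyond what the retained
         checkpoints strictly require -->
    <property>
      <name>dfs.namenode.num.extra.edits.retained</name>
      <value>10000000</value>
    </property>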
Ops/Admin: Hardware
- Active/Standby NNs should be on separate racks
- Shared storage system should be on a separate rack
- Active/Standby NNs should have close to the same hardware
  - Same amount of RAM – they need to store the same things
  - Same number of processors – they need to serve the same number of clients
- All the usual NN recommendations still apply
  - ECC memory, 48GB
  - Several separate disks for NN metadata directories
  - Redundant disks for OS drives, probably RAID 5 or mirroring
  - Redundant power
Future Work
- Other options to share NN metadata
  - BookKeeper
  - Multiple, potentially non-HA filers
  - Entirely different metadata system
- More advanced client failover/load shedding
  - Serve stale reads from the standby NN
  - Speculative RPC
  - Non-RPC clients (IP failover, DNS failover, proxy, etc.)
- Even higher HA
  - Multiple standby NNs
Q&A
- Detailed design (HDFS-1623)
- Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073


Editor's Notes

  • #4: Data – can I read what I wrote? Is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit – random writers. Simplicity is key. Raw disk – file systems take time to stabilize; we can take advantage of ext4, xfs, or zfs.