BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automatic Service Failover by Mike Dalton of Zettaset

Big Data Cloud Meetup: Big Data & Cloud Computing - Help, Educate & Demystify. September 8th 2011

Fail-Proofing Hadoop Clusters with Automated Service Failover. Michael Dalton, CTO, Zettaset. Sept 8th 2011 Meetup

Problem: Hadoop environments have many single points of failure (SPOFs): NameNode, JobTracker, Oozie, Kerberos.

Ideal Solution: Automated failover. No data loss. Handle all failover aspects (IP failover, etc.). Fail over all services: no JobTracker = no MR; no Kerberos = no new Kerberos authentication.

Existing Solutions: AvatarNode (NameNode, patch from Facebook): replicates writes to a backup service. BackupNameNode (NN, not committed): 'hot' copy of the NameNode, replicated. All failover is manual.

Why is Failover Hard? [Diagram: nodes M1, M2, C1, C2]

Data Loss: Split-brain issues lose data. Multiple masters = data corruption. Clients are confused about who is up. This is a problem for traditional HA environments (Linux-HA, etc.). Heartbeat failure != death.

Theoretical Limits: Can we solve this reliably? Fischer-Lynch-Paterson (FLP) theorem: consensus is impossible in an asynchronous distributed system when even a single process can fail. No free lunch.

Revisiting Our Assumptions: Drop the fully asynchronous requirement. What about leases? Masters obtain and renew a lease, and shut down if the lease expires (not asynchronous). This assumes only bounded relative clock skew: everyone should agree on how fast time elapses.
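The bounded-skew assumption can be made concrete with a small sketch. It is illustrative only and not from the talk: the `Lease` class, its field names, and the 1% drift bound are hypothetical. The idea is that a master measures its lease from the moment it sent the request and stops acting slightly early, so that a slow local clock cannot make it outlive the lease as seen by the lock service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """Holder-side view of a lease (hypothetical names, for illustration)."""
    requested_at: float  # local clock reading when the lease request was sent
    duration: float      # lease length in seconds, as measured by the lock service
    max_drift: float     # assumed bound on relative clock-rate error (0.01 = 1%)

    def safe_deadline(self) -> float:
        # A local clock running slow by up to max_drift counts only
        # duration * (1 - max_drift) seconds while the service counts `duration`,
        # so the holder must stop at that earlier local reading.
        return self.requested_at + self.duration * (1.0 - self.max_drift)

    def may_act_as_master(self, local_now: float) -> bool:
        # Strictly before the conservative deadline, otherwise shut down.
        return local_now < self.safe_deadline()


lease = Lease(requested_at=100.0, duration=10.0, max_drift=0.01)
```

With these assumed numbers the master gives up about 0.1 s of its 10 s lease in exchange for never overlapping with a successor, provided the drift bound actually holds.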

Master Failover: Requires a highly available lock / lease system. The master obtains a lease to be master and replicates writes to a backup master. If the master loses its lease, hold a new election. The old master will shut down when its lease expires. If clock skew is bounded, no split-brain!
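The failover sequence above can be sketched as a toy, single-process simulation. This is an assumption-laden illustration rather than the speaker's implementation: `LeaseService` is a hypothetical in-memory stand-in for the highly available lease system, and time is passed in explicitly instead of being read from a clock.

```python
class LeaseService:
    """In-memory stand-in for a highly available lock/lease service (hypothetical)."""

    def __init__(self, duration: float):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Grant the lease if it is free, expired, or already held by `node`."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.duration
            return True
        return False


def run_failover(duration: float = 10.0):
    """Master A holds the lease, then goes silent; backup B takes over after expiry."""
    svc = LeaseService(duration)
    results = [
        svc.acquire_or_renew("A", 0.0),   # A becomes master, lease runs to t=10
        svc.acquire_or_renew("B", 5.0),   # B is refused while A's lease is live
        svc.acquire_or_renew("A", 8.0),   # A renews, lease now runs to t=18
        # A crashes or is partitioned after t=8 and never renews again.
        svc.acquire_or_renew("B", 15.0),  # still refused: A's lease has not expired
        svc.acquire_or_renew("B", 18.0),  # lease expired, B wins the new election
    ]
    return results, svc.holder


results, final_holder = run_failover()
```

The property that matters is the fourth call: the backup cannot become master until the old lease has run out, which is also the moment by which the old master must have shut itself down.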

Failover: Locks/Consensus. Apache ZooKeeper (Hadoop subproject) is a highly available distributed filesystem for distributed consensus problems. Create election, membership, etc. using special-purpose FS semantics. 'Ephemeral' files disappear when the session lease expires. 'Sequential' files have an auto-incremented suffix.
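How ephemeral and sequential files combine into an election can be shown with a minimal in-memory model. `FakeZk` below is a hypothetical stand-in written for this sketch and is not the ZooKeeper client API; the paths and host names are made up. Each candidate creates an ephemeral, sequential file, and the candidate with the lowest suffix is the leader until its session expires.

```python
class FakeZk:
    """Toy model of ephemeral + sequential files (not the real ZooKeeper API)."""

    def __init__(self):
        self._counter = 0
        self._nodes = {}  # path -> (session_id, data)

    def create_ephemeral_sequential(self, prefix: str, session_id: str, data: str) -> str:
        # Sequential: append a zero-padded, auto-incremented suffix to the prefix.
        path = f"{prefix}{self._counter:010d}"
        self._counter += 1
        self._nodes[path] = (session_id, data)
        return path

    def expire_session(self, session_id: str) -> None:
        # Ephemeral: every file owned by the expired session disappears.
        self._nodes = {p: v for p, v in self._nodes.items() if v[0] != session_id}

    def children(self, prefix: str) -> list:
        return sorted(p for p in self._nodes if p.startswith(prefix))

    def get(self, path: str) -> str:
        return self._nodes[path][1]


def current_leader(zk: FakeZk, prefix: str = "/election/n_"):
    """The candidate holding the lowest sequence number is the leader."""
    candidates = zk.children(prefix)
    return zk.get(candidates[0]) if candidates else None


zk = FakeZk()
path_a = zk.create_ephemeral_sequential("/election/n_", "session-a", "hostA:60000")
path_b = zk.create_ephemeral_sequential("/election/n_", "session-b", "hostB:60000")
leader_before = current_leader(zk)
zk.expire_session("session-a")  # hostA dies or is partitioned away
leader_after = current_leader(zk)
```

Because the file's data is the candidate's address, the same lookup doubles as service discovery: a client reads the lowest-numbered file to find out where the current master is.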

ZooKeeper Internals: ZooKeeper consists of a quorum of nodes (typically 3-9). A majority vote elects a leader (via leases). The leader proposes all FS modifications. A majority must approve a modification for it to be committed.
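The majority rule is simple arithmetic, sketched here with hypothetical helper names. It also suggests why odd quorum sizes are the usual choice: going from 3 to 4 nodes raises the majority needed without tolerating any additional failure.

```python
def majority(quorum_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return quorum_size // 2 + 1


def tolerated_failures(quorum_size: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return quorum_size - majority(quorum_size)


def is_committed(approvals: int, quorum_size: int) -> bool:
    """A modification commits once a majority of nodes has approved it."""
    return approvals >= majority(quorum_size)


# Majority size and failure tolerance for a few quorum sizes.
table = {n: (majority(n), tolerated_failures(n)) for n in (3, 4, 5, 9)}
```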

Example: HBase. Apache HBase has fully automated multi-master failover. Prospective masters register in ZooKeeper. ZooKeeper ephemeral/sequential files are used for election. Clients look up the current address of the master in ZooKeeper. Failover is fully automated. All files are stored on HDFS, so there are no replication issues.

Failover: Replication. The HBase approach avoids replication issues by storing its files on HDFS. Kerberos, NN, Oozie, etc. can't use HDFS: legacy compatibility (and for NN, circular dependencies). How can we add synchronous write replication? We can't break compatibility or change apps.

Failover: Networking. HBase avoids networking failover by storing the master address in ZK. Legacy services use IPs or hostnames, not ZK, to connect to the master. There are out-of-trunk patches to make ZK a DNS server, but Java doesn't respect DNS TTLs anyway, complicating the maximum time for failover. DNS introduces its own issues anyway...

IP Failover: Instead, you can fail over IP addresses. Use virtual IPs if supported by the router; otherwise, dynamically update routes as part of your failover (the new leader updates the routing tables). For local area networks, ensure ARP tables are updated: use gratuitous ARP or store ARP information in ZK.
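For the gratuitous ARP step, the sketch below only builds the Ethernet frame a new leader would broadcast to announce that the failed-over IP now lives at its MAC address. The function name, MAC, and IP are made up for illustration. Actually sending the frame needs a raw socket and elevated privileges, which is left out here.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast ARP request announcing that `ip` is at `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth_header = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP payload: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request). In a gratuitous ARP the
    # sender IP and target IP are both the address being announced.
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4, 1,
        mac_bytes, ip_bytes,      # sender hardware / protocol address
        b"\x00" * 6, ip_bytes,    # target hardware (unset) / protocol address
    )
    return eth_header + arp_payload


# Hypothetical new leader announcing the service IP it has just taken over.
frame = build_gratuitous_arp("02:00:00:aa:bb:cc", "10.0.0.50")
```

Hosts on the same LAN that see this frame can refresh their ARP cache entry for the address, so traffic to the service IP starts reaching the new leader without waiting for the old entry to time out.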

Putting it all together. Consensus/Election: use ZooKeeper with a 3-9 node quorum. State Replication: small data in ZK, large data in HDFS; if neither is possible, DRBD. Network Failover: store the master address in ZK, or perform IP failover (dynamically update routing tables, update the ARP cache).

Conclusion: Fully automated failover is possible. Design for synchronous replication. Prevent split-brain. Manage legacy compatibility. Coming to Hadoop. Zettaset provides fully HA Hadoop.
