A High Availability story for HDFS
AvatarNode
Dhruba Borthakur & Dmytro Molkov
dhruba@apache.org & dms@facebook.com
Presented at The Hadoop User Group Meeting,
Sept 29, 2010
How infrequently does the NameNode (NN) stop?
  Hadoop Software Bugs
–  Two directories in fs.name.dir, but when a write to the first directory failed, the NN ignored the second one (once)
–  Upgrade from 0.17 to 0.18 caused data corruption (once)
  Configuration errors
–  Fsimage partition ran out of space (once)
–  Network Load Anomalies (about 10 times)
  Maintenance:
–  Deploy new patches (once every month)
What does the SecondaryNameNode do?
  Periodically merges the transaction log into the fsimage
  Requires the same amount of memory as NN
  Why is it separate from NN?
–  Avoids fine-grained locking of NN data structures
–  Avoids implementing copy-on-write for NN data structures
  Renamed CheckpointNode (CN) in the 0.21 release
Shortcomings of the SecondaryNameNode?
  Does not have a copy of the latest transaction log
  Checkpointing is periodic, not continuous
–  Typically configured to run every hour
  If the NN dies, the SecondaryNameNode does not take over the responsibilities of the NN
BackupNode (BN)
  NN streams transaction log to BackupNode
  BackupNode applies log to in-memory and disk image
  BN always commits to disk before acknowledging success to the NN
  If the BN restarts, it has to catch up with the NN
  Available in HDFS 0.21 release
Limitations of BackupNode (BN)
  Maximum of one BackupNode per NN
–  Supports only a two-machine failover configuration
  NN does not forward block reports to BackupNode
  Time to restart from a 12 GB image (70M files + 100M blocks):
–  3 to 5 minutes to read the image from disk
–  20 minutes to process block reports
–  the BN would still take 25+ minutes to fail over!
Overlapping Clusters for HA
  “Always available for write” model
  Two logical clusters each with their own NN
  Each physical machine runs two instances of DataNode
  Two DataNode instances share the same physical storage device
  The application has logic to fail over writes from one HDFS cluster to the other (see the sketch below)
  More details at http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
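
A minimal sketch of that application-side write failover, assuming two clusters reachable at hypothetical NameNode addresses (nn-a, nn-b); the URIs and the bare try/catch fallback are illustrative and not the actual Scribe integration:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DualClusterWriter {
  private final FileSystem primary;
  private final FileSystem secondary;

  public DualClusterWriter(Configuration conf) throws IOException {
    // Two logical clusters, each with its own NameNode (hypothetical addresses).
    primary = FileSystem.get(URI.create("hdfs://nn-a.example.com:8020"), conf);
    secondary = FileSystem.get(URI.create("hdfs://nn-b.example.com:8020"), conf);
  }

  // Try the primary cluster first; if the write cannot be started there,
  // fall back to the secondary cluster ("always available for write").
  public FSDataOutputStream create(Path path) throws IOException {
    try {
      return primary.create(path);
    } catch (IOException e) {
      return secondary.create(path);
    }
  }
}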
HDFS+Zookeeper
  HDFS can store transaction logs in ZooKeeper/BookKeeper
–  http://issues.apache.org/jira/browse/HDFS-234
  The transaction log need not be stored on an NFS filer
  A new NN will still have to process block reports
–  Not good for HA yet, because NN failover will take 30 minutes
Our use case for High Availability
  Failover should occur in less than a minute
  Failovers are needed only for new software upgrades
Challenges
  DataNodes send block location information to only one NameNode
  The NameNode needs block locations in memory to serve clients
  The in-memory metadata for 100 million files could be 60 GB, huge!
[Diagram: DataNodes report block locations (“yes, I have blockid 123”) to the Primary NameNode only; clients retrieve block locations from the NameNode.]
Introduction to AvatarNode
  Active-Standby Pair
–  Coordinated via ZooKeeper
–  Failover in a few seconds
–  Wrapper over NameNode
  Active AvatarNode
–  Writes transaction log to filer
  Standby AvatarNode
–  Reads transactions from filer
–  Latest metadata in memory
http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
[Diagram: the Active AvatarNode (NameNode) writes transactions to an NFS filer and the Standby AvatarNode (NameNode) reads them; DataNodes send block location messages to both; clients retrieve block locations from the Primary or the Standby.]
ZooKeeper integration for Clients
  DistributedAvatarFileSystem:
–  Connects to ZooKeeper to figure out who the Primary node is. There is a znode in ZooKeeper that holds the current address of the Primary; clients read it on creation and during failover (see the sketch below).
–  Is aware of the failover state and pauses until it is over. If the znode is empty, the cluster is failing over to a new Primary; the client simply waits for that to finish.
–  Handles failures in calls to the NameNode. If a call fails with a network exception, it checks whether a failover is in progress and retries the call once the new Primary is up.
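
A minimal sketch of that client-side primary lookup, assuming the Primary's address lives in a znode at an illustrative path /avatarnode/primary; the real znode path, timeouts, and retry policy in DistributedAvatarFileSystem may differ:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class PrimaryLocator {
  // Assumed znode path; the address of the current Primary is stored as its data.
  private static final String PRIMARY_ZNODE = "/avatarnode/primary";

  private final ZooKeeper zk;

  public PrimaryLocator(String zkQuorum) throws Exception {
    // 30s session timeout is an arbitrary illustrative choice; no watcher logic here.
    zk = new ZooKeeper(zkQuorum, 30000, event -> { });
  }

  // Block until the znode holds a Primary address; an empty znode means a
  // failover is in progress, so pause and retry.
  public String waitForPrimary() throws KeeperException, InterruptedException {
    while (true) {
      byte[] data = zk.getData(PRIMARY_ZNODE, false, null);
      if (data != null && data.length > 0) {
        return new String(data);   // e.g. "nn-a.example.com:8020"
      }
      Thread.sleep(1000);
    }
  }
}

An empty znode is treated here as "failover in progress", matching the pause-and-wait behavior described above.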
Four steps to failover
  Wipe the ZooKeeper entry. Clients will know the failover is in progress. (0 seconds)
  Stop the primary NameNode. The last bits of data are flushed to the transaction log before it exits. (seconds)
  Switch the Standby to Primary. It consumes the rest of the transaction log and leaves safemode, ready to serve traffic. (seconds)
  Update the entry in ZooKeeper. All the clients waiting for failover pick up the new connection. (0 seconds)
  After: start the first node in Standby mode (takes a while, but the cluster is already up and running). The sequence is sketched below.
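
The same four steps as a hedged orchestration sketch; only the ZooKeeper calls use the real client API, while shutdownPrimary() and promoteStandby() are placeholders for the actual AvatarNode admin commands, which are not shown here:

import org.apache.zookeeper.ZooKeeper;

public class FailoverDriver {
  private static final String PRIMARY_ZNODE = "/avatarnode/primary"; // assumed path

  public void failover(ZooKeeper zk, String newPrimaryAddress) throws Exception {
    // 1. Wipe the ZooKeeper entry so clients know a failover is in progress.
    zk.setData(PRIMARY_ZNODE, new byte[0], -1);

    // 2. Stop the primary NameNode; it flushes the last transactions and exits.
    shutdownPrimary();

    // 3. Promote the Standby: it consumes the remaining transactions and leaves safemode.
    promoteStandby();

    // 4. Publish the new Primary address; waiting clients pick up the new connection.
    zk.setData(PRIMARY_ZNODE, newPrimaryAddress.getBytes(), -1);
  }

  private void shutdownPrimary() { /* placeholder for the real admin command */ }
  private void promoteStandby()  { /* placeholder for the real admin command */ }
}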
Why add ZooKeeper to the mix
  Provides a clean way to execute failovers in the application
layer.
  Centralized control of all the clients: gives us the ability to pause clients until the failover is done.
  A good stepping stone for future improvements needed to
perform automatic failover:
–  Nodes voting on who will be the primary (a standard ZooKeeper election pattern is sketched below)
–  DataNodes knowing who has the authority to delete blocks
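
As an illustration of the "nodes voting" direction, here is a standard ZooKeeper leader-election sketch (ephemeral sequential znodes, lowest sequence number wins); the election path is an assumption and this is not part of the AvatarNode code described in the talk:

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
  // Assumed election parent znode; it must already exist.
  private static final String ELECTION_PATH = "/avatarnode/election";

  public static boolean isPrimary(ZooKeeper zk, String nodeName) throws Exception {
    // Each candidate registers an ephemeral sequential znode under the election path.
    String me = zk.create(ELECTION_PATH + "/n-", nodeName.getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // The candidate whose znode has the lowest sequence number is the primary.
    List<String> children = zk.getChildren(ELECTION_PATH, false);
    Collections.sort(children);
    return me.endsWith(children.get(0));
  }
}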
ZooKeeper is an option
  Failover can instead be implemented using IP failover on the existing infrastructure.
  With IP failover, clients cannot tell when the failover is done and it is safe to retry a call.
  IP failover works well in a tightly coupled system (both nodes in one rack), so a Single Point of Failure remains (the rack switch).
  There is no need to run a dedicated ZooKeeper cluster.
It is not all about failover
  The Standby node has a lot of unused CPU
  The Standby node has a full (if slightly delayed) picture of the blocks and the Namesystem
  Send all reads that can tolerate stale data to the Standby (see the sketch below)
  Have pluggable services run as part of the Standby node and use the filesystem metadata directly from shared memory instead of querying the NameNode all the time
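
A hedged sketch of routing stale-tolerant reads to the Standby, assuming a hypothetical Standby address that clients can open directly; the production path would go through DistributedAvatarFileSystem rather than a raw FileSystem handle:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleReader {
  // Hypothetical Standby address; listings served from it may lag the Primary slightly.
  public static FileStatus[] listFromStandby(Configuration conf, Path dir) throws Exception {
    FileSystem standby =
        FileSystem.get(URI.create("hdfs://standby.example.com:8020"), conf);
    return standby.listStatus(dir);
  }
}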
  My Hadoop Blog:
–  http://hadoopblog.blogspot.com/
–  http://www.facebook.com/hadoopfs
