HDFS ARCHITECTURE
How HDFS is evolving to meet new needs
✛  Aaron T. Myers
    ✛  Hadoop PMC Member / Committer at ASF
    ✛  Software Engineer at Cloudera
    ✛  Primarily work on HDFS and Hadoop Security




✛  HDFS architecture circa 2010
    ✛  New requirements for HDFS
       >  Random read patterns
       >  Higher scalability
       >  Higher availability
    ✛  HDFS evolutions to address requirements
       >  Read pipeline performance improvements
       >  Federated namespaces
       >  Highly available Name Node



HDFS ARCHITECTURE: 2010
✛  Each cluster has…
       >  A single Name Node
           ∗  Stores file system metadata
           ∗  Stores “Block ID” -> Data Node mapping
       >  Many Data Nodes
           ∗  Store actual file data
       >  Clients of HDFS…
           ∗  Communicate with Name Node to browse file system, get
              block locations for files
           ∗  Communicate directly with Data Nodes to read/write files
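
For concreteness, here is a minimal client-side sketch of the read path just described, using the standard Hadoop FileSystem API (the NameNode address and file paths are placeholders, not from the talk):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // Metadata operations (browsing, file lengths) go to the Name Node
        for (FileStatus stat : fs.listStatus(new Path("/user/example"))) {
          System.out.println(stat.getPath() + " " + stat.getLen());
        }

        // open() fetches block locations from the Name Node; the bytes
        // themselves are then streamed directly from the Data Nodes
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) {
          byte[] buf = new byte[4096];
          int n = in.read(buf);
          System.out.println("read " + n + " bytes");
        }
      }
    }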




✛  Want to support larger clusters
       >  ~4,000 node limit with 2010 architecture
       >  New nodes beefier than old nodes
          ∗  2009: 8 cores, 16GB RAM, 4x1TB disks
          ∗  2012: 16 cores, 48GB RAM, 12x3TB disks

    ✛  Want to increase availability
       >  With the rise of HBase, HDFS is now serving live traffic
       >  Downtime means immediate user-facing impact
    ✛  Want to improve random read performance
       >  HBase usually does small, random reads, not bulk


✛  Single Name Node
       >  If Name Node goes offline, cluster is unavailable
       >  Name Node must fit all FS metadata in memory
    ✛  Inefficiencies in read pipeline
       >  Designed for large, streaming reads
       >  Not small, random reads (like the HBase use case)




✛  Fine for offline, batch-oriented applications
    ✛  If cluster goes offline, external customers don’t
      notice
    ✛  Can always use separate clusters for different
      groups
    ✛  HBase didn’t exist when Hadoop was first created
       >  MapReduce was the only client application




HDFS PERFORMANCE IMPROVEMENTS
HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measured by putting ~1GB in HDFS and cat-ing the file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time
      instead of 512 bytes (better instruction cache locality,
      amortize JNI overhead)
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned
      assembly (~8 bytes per cycle with instruction level
      parallelism!)
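
As a rough illustration of the bulk idea only (not the actual HDFS code, which does this in native code with JNI and SSE4.2 assembly): compute one CRC32C per 512-byte chunk, but walk an entire 64KB buffer per call so fixed per-call costs are amortized. The sketch uses java.util.zip.CRC32C, which requires JDK 9+.

    import java.util.zip.CRC32C;

    public class BulkCrcSketch {
      // One checksum per 512-byte chunk, computed over a large buffer in a single pass.
      // Conceptual only: HDFS performs this in native code (JNI + SSE4.2 assembly).
      static int[] bulkCrc32c(byte[] buf, int bytesPerChecksum) {
        int chunks = (buf.length + bytesPerChecksum - 1) / bytesPerChecksum;
        int[] sums = new int[chunks];
        CRC32C crc = new CRC32C();
        for (int i = 0; i < chunks; i++) {
          int off = i * bytesPerChecksum;
          int len = Math.min(bytesPerChecksum, buf.length - off);
          crc.reset();
          crc.update(buf, off, len);
          sums[i] = (int) crc.getValue();
        }
        return sums;
      }

      public static void main(String[] args) {
        byte[] buffer = new byte[64 * 1024];          // one 64KB bulk unit
        int[] checksums = bulkCrc32c(buffer, 512);    // 128 per-chunk checksums
        System.out.println(checksums.length + " checksums computed");
      }
    }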


Checksum improvements (lower is better)

[Chart: CDH3u0 vs. optimized, normalized bars for random-read latency, random-read
CPU usage, and sequential-read CPU usage. Random-read latency drops from ~1360us
(CDH3u0) to ~760us (optimized).]

 Post-optimization: only 16% overhead vs un-checksummed access
 Maintain ~800MB/sec from a single thread reading OS cache

HDFS Random access

•  0.20.2:
    •  Each individual read operation reconnects to
       DataNode
    •  Significant TCP handshake overhead, thread creation,
       etc.
•  2.0.0:
    •  Clients cache open sockets to each datanode (like
       HTTP Keepalive)
    •  Local readers can bypass the DN in some
       circumstances to directly read data
    •  Rewritten BlockReader to eliminate a data copy
    •  Eliminated lock contention in DataNode’s
       FSDataset class
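
For reference, HBase-style random reads map onto positioned reads on FSDataInputStream, as in this hedged sketch (the path and offsets are invented); on 0.20.2 every such read paid the reconnect cost, while a 2.0.0 client reuses cached DataNode sockets and can short-circuit to local files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[16 * 1024];

        try (FSDataInputStream in = fs.open(new Path("/data/example.bin"))) {  // placeholder path
          // Positioned ("pread") calls: each reads a small range at an arbitrary offset.
          // On 0.20.2 every such read re-established a DataNode connection; a 2.0.0
          // client keeps sockets open and may read local replicas directly.
          in.read(0L, buf, 0, buf.length);
          in.read(1_000_000L, buf, 0, buf.length);
          in.read(50_000_000L, buf, 0, buf.length);
        }
      }
    }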

Random-read micro benchmark (higher is better)

Speed (MB/sec):
                         0.20.2   Trunk (no native)   Trunk (native)
   4 threads, 1 file        106                 253              299
   16 threads, 1 file       247                 488              635
   8 threads, 2 files       187                 477              633

       TestParallelRead benchmark, modified to 100% random read
       proportion.
       Quad core Core i7 Q820@1.73Ghz
Random-read macro benchmark (HBase YCSB)

[Chart: reads/sec over time; CDH4 vs. CDH3u1.]
HDFS FEDERATION ARCHITECTURE
✛  Instead of one Name Node per cluster, several
   >  Before: Only one Name Node, many Data Nodes
   >  Now: A handful of Name Nodes, many Data Nodes
✛  Distribute file system metadata between the
  NNs
✛  Each Name Node operates independently
   >  Potentially overlapping ranges of block IDs
   >  Introduce a new concept: block pool ID
   >  Each Name Node manages a single block pool
HDFS Architecture: Federation
✛  Improve scalability to 6,000+ Data Nodes
    >  Bumping into single Data Node scalability now
 ✛  Allow for better isolation
    >  Could locate HBase dirs on dedicated Name Node
    >  Could locate /user dirs on dedicated Name Node
 ✛  Clients still see a unified view of the FS namespace
    >  Use ViewFS – client-side mount table configuration (see the
       configuration sketch below)


     Note: Federation != Increased Availability
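
A hedged sketch of that client-side mount table (cluster name, hosts, and paths are placeholders; property names follow the Hadoop ViewFS convention fs.viewfs.mounttable.<cluster>.link.<path>):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side mount table: each top-level directory maps to the
        // Name Node that owns it. Cluster name, hosts, and paths are placeholders.
        conf.set("fs.defaultFS", "viewfs://myCluster/");
        conf.set("fs.viewfs.mounttable.myCluster.link./user",
                 "hdfs://nn-user.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.myCluster.link./hbase",
                 "hdfs://nn-hbase.example.com:8020/hbase");

        FileSystem fs = FileSystem.get(conf);
        // Resolves through the mount table to the Name Node serving /user
        System.out.println(fs.getFileStatus(new Path("/user")).getPath());
      }
    }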

HDFS HIGH AVAILABILITY ARCHITECTURE
Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: Rely on the OS’s file system rather
      than raw disk
   •  Storage Fault Tolerance: multiple replicas,
      active monitoring
   •  Single NameNode Master
  •  Persistent state: multiple copies + checkpoints
  •  Restart on failure
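
As a hedged illustration of “multiple copies” of the persistent state (not from the talk): the NameNode is pointed at several metadata directories, commonly two local disks plus a remote NFS mount, and writes its image and edit log to all of them. The property name below is the Hadoop 2.x one (dfs.namenode.name.dir); older releases used dfs.name.dir, and the paths are placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class NameDirSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The NameNode writes its fsimage and edit log to every listed directory,
        // so losing a single disk does not lose the namespace.
        conf.set("dfs.namenode.name.dir",
            "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/remote-nas/dfs/nn");
        System.out.println(conf.get("dfs.namenode.name.dir"));
      }
    }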




Current HDFS Availability & Data Integrity

•  How well did it work?

•  Lost 19 out of 329 million blocks on 10 clusters with 20K
  nodes in 2009
   •  Seven 9s of reliability, and the responsible bug was fixed in 0.20


•  18-month study: 22 failures across 25 clusters, i.e. roughly 0.58
  failures per cluster per year (22 / (25 clusters x 1.5 years))
   •  Only 8 of those would have benefited from HA failover! (0.23
      failures per cluster-year)



So why build an HA NameNode?

•  Most cluster downtime in practice is planned
  downtime
   •  Cluster restart for a NN configuration change (e.g.
      new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new
      Hadoop, new kernel, new JVM)
•  Planned downtimes cause the vast majority of
  outages!

•  Manual failover solves all of the above!
   •  Failover to NN2, fix NN1, fail back to NN1, zero
      downtime
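
Operationally, manual failover of this kind is driven from the Hadoop 2.x HA admin CLI, roughly as below (nn1/nn2 stand for whatever NameNode IDs are configured for the nameservice; verify the exact syntax against your release):

    # Check which NameNode is currently active
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Fail over from nn1 to nn2, do maintenance on nn1, then fail back
    hdfs haadmin -failover nn1 nn2
    # ... upgrade / repair nn1 ...
    hdfs haadmin -failover nn2 nn1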
Approach and Terminology
•  Initial goal: Active-Standby with Hot
  Failover

•  Terminology
   •  Active NN: actively serves read/write
      operations from clients
   •  Standby NN: waits, becomes active when
      Active dies or is unhealthy
   •  Hot failover: standby able to take over
      instantly

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades – main
      cause of unavailability
•  Active and Standby with automatic
  failover
   •  Addresses downtime during unplanned outages
       (kernel panics, bad memory, double PDU failure,
       etc)
    •  See HDFS-1623 for detailed use cases
•  With Federation each namespace volume has an
   active-standby NameNode pair

HDFS Architecture: High Availability

•  Failover controller runs outside the NN
•  Data Nodes send block reports to both Active and
   Standby in parallel
•  NNs share namespace state via a shared
   edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping replication”
•  Client failover
   •  Smart clients (e.g. configuration, or ZooKeeper for
      coordination)
   •  IP Failover in the future
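
A hedged sketch of the “smart client” configuration (property names as in the Apache Hadoop 2.x HA documentation; the nameservice ID and hosts are placeholders): the client is given a logical nameservice, both NameNode addresses, and a failover proxy provider that retries against the other NameNode when the active one is unreachable.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host (placeholders below)
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Proxy provider that fails over to the other NameNode
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());   // hdfs://mycluster
      }
    }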
HDFS Architecture: High Availability
HDFS ARCHITECTURE: WHAT’S NEXT
✛  Increase scalability of single Data Node
   >  Currently the most-noticed scalability limit
✛  Support for point-in-time snapshots
   >  To better support DR, backups
✛  Completely separate block / namespace layers
   >  Increase scalability even further, new use cases
✛  Fully distributed NN metadata
   >  No pre-determined “special nodes” in the system