HDFS What’s New and Future

Suresh Srinivas
suresh@hortonworks.com
@suresh_m_s




© Hortonworks Inc. 2013      Page 1
About Me
• Architect & Founder at Hortonworks
• Apache Hadoop committer and PMC member
• > 4.5 years working on HDFS




    Architecting the Future of Big Data
Agenda

• HDFS – What’s new
 – Federation
 – HA
 – Snapshots
 – Other features
• Future
 – Major Architectural Directions
 – Short term and long term features


We have been hard at work…
• Progress is being made in many areas
  – Scalability
  – Performance
  – Enterprise features
  – Ongoing operability improvements
  – Enhancements for other projects in the ecosystem
  – Expand Hadoop ecosystem to more platforms and use cases
• 2192 commits in Hadoop in the last year
  – Almost a million lines of changes
  – ~150 contributors
  – Many new contributors: ~80 with < 3 patches
• 350K lines of changes in HDFS and common

Building on Rock-solid Foundation
• Original design choices - simple and robust
   – Storage: Rely on the OS’s file system rather than use raw disk
   – Storage Fault Tolerance: multiple replicas, active monitoring
   – Single Namenode Master
• Reliability
   – Over seven 9’s of data reliability
   – Less than 0.38 failures across 25 clusters
• Operability
   – Small teams can manage large clusters
      • One operator per 3K-node cluster
   – Fast time to repair on node or disk failure
      • Minutes to an hour vs. RAID array repairs taking many hours
• Scalable - proven by large-scale deployments, not bits
   – > 100 PB storage, > 400 million files, > 4500 nodes in a single cluster
   – > 70K nodes of HDFS in deployment and use


Federation
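[Diagram: Namespace layer with Namenodes NN-1 … NN-k … NN-n, each owning a namespace (NS1 … NS k … foreign NS n). Block Storage layer with one Block Pool per namespace (Pool 1 … Pool k … Pool n). Datanodes DN 1, DN 2 … DN m provide Common Storage for all Block Pools.]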

• Block Storage as generic storage service
  – DNs store blocks in Block Pools for all the Namespace Volumes
• Multiple independent Namenodes and Namespace Volumes in a cluster
  – Scalability by adding more namenodes/namespaces
  – Isolation – separating applications to their own namespaces
  – Client side mount tables/ViewFS for integrated views
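
The client-side mount table idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ViewFS API: the mount points, the namenode URIs, and the `resolve` function are hypothetical names chosen for illustration. It only shows how a client could map a path in the integrated view to the namenode that owns that part of the namespace.

```python
from typing import Dict, Tuple

# Hypothetical mount table: mount point -> namenode URI (illustrative values only).
MOUNT_TABLE: Dict[str, str] = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
    "/tmp": "hdfs://nn3.example.com:8020",
}


def resolve(path: str, table: Dict[str, str] = MOUNT_TABLE) -> Tuple[str, str]:
    """Return (namenode_uri, path) for the mount point that covers `path`."""
    for mount, namenode in table.items():
        # Match whole path components only, so /database does not match /data.
        if path == mount or path.startswith(mount + "/"):
            return namenode, path
    raise KeyError(f"no mount point covers {path}")


namenode, target = resolve("/user/alice/report.txt")
print(namenode, target)
```

Because the table lives on the client, adding a namenode only requires a new entry; the namenodes themselves stay independent of each other.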

High Availability
• Support standby namenode and failover
 – Planned downtime
 – Unplanned downtime
• Release 1.1
 – Cold standby
 – Uses NFS as shared storage
 – Standard HA frameworks as failover controller
   • Linux HA and VMware vSphere
 – Suitable for small clusters up to 500 nodes



Hadoop Full Stack HA


[Diagram: Jobs run on the slave nodes of the Hadoop cluster, and apps run outside it. The master daemons (NN, JT) run on servers in an HA cluster for master daemons. On failover the NN moves to another server, and the JT goes into safemode.]

High Availability – Release 2.0
• Supports manual and automatic failover
• Automatic failover with Failover Controller
  – Active NN election and failure detection using ZooKeeper
  – Periodic NN health check
  – Failover on NN failure
• Removed shared storage dependency
  – Quorum Journal Manager
    • 3 to 5 Journal Nodes for storing the edit log
    • Each edit must be written to a quorum of Journal Nodes



                 Available in Release 2.0.3-alpha
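
The quorum rule for edits can be illustrated with a small Python sketch. It assumes that "quorum" means a simple majority of the Journal Nodes, and it models each Journal Node as a plain callable; `quorum_size`, `write_edit`, and the two stand-in journals are hypothetical names, not part of the Quorum Journal Manager code.

```python
from typing import Callable, List


def quorum_size(num_journal_nodes: int) -> int:
    """Smallest majority of the journal nodes (assumed meaning of 'quorum')."""
    return num_journal_nodes // 2 + 1


def write_edit(edit: bytes, journals: List[Callable[[bytes], bool]]) -> bool:
    """Send the edit to every journal; succeed only if a majority acknowledge."""
    acks = 0
    for send in journals:
        try:
            if send(edit):
                acks += 1
        except OSError:
            # An unreachable journal simply does not count as an acknowledgement.
            pass
    return acks >= quorum_size(len(journals))


def healthy_journal(edit: bytes) -> bool:
    return True


def unreachable_journal(edit: bytes) -> bool:
    raise OSError("journal node unreachable")


# With 3 journals, one failure is tolerated; two failures are not.
print(write_edit(b"mkdir /a", [healthy_journal, healthy_journal, unreachable_journal]))
print(write_edit(b"mkdir /a", [healthy_journal, unreachable_journal, unreachable_journal]))
```

Under this majority assumption, 3 Journal Nodes tolerate one failure and 5 tolerate two, which is consistent with the 3 to 5 nodes suggested above.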

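[Diagram: Namenode HA with automatic failover and a quorum of JournalNodes]
• Three ZooKeeper (ZK) nodes receive heartbeats from two FailoverControllers, one on the active side and one on the standby side
• Each FailoverController monitors the health of its NN, OS, and HW, and sends commands (Cmds) to its NN
• Active and standby NN share NN state through a quorum of JournalNodes (JN)
• DNs send block reports to both active and standby
• DN fencing: DNs only obey commands from the active
• Namenode HA has no external dependency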
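
The interplay of health monitoring and active election can be sketched as a toy state machine in Python. This is a simplified model under stated assumptions: the `Lock` class stands in for a ZooKeeper lock and does not use any real ZooKeeper client, and the `FailoverController` class, its `tick` method, and the names `nn1`/`nn2` are hypothetical. Fencing and journal recovery are deliberately left out.

```python
from typing import Optional


class Lock:
    """Stand-in for a ZooKeeper lock; holds at most one owner at a time."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_acquire(self, name: str) -> bool:
        if self.owner is None:
            self.owner = name
        return self.owner == name

    def release(self, name: str) -> None:
        if self.owner == name:
            self.owner = None


class FailoverController:
    """One controller per namenode: checks health and competes for the lock."""

    def __init__(self, name: str, lock: Lock) -> None:
        self.name = name
        self.lock = lock
        self.healthy = True
        self.state = "standby"

    def tick(self) -> None:
        """Run one health-check round and update the namenode state."""
        if not self.healthy:
            # Unhealthy namenode: give up the lock so the peer can take over.
            self.lock.release(self.name)
            self.state = "standby"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"


lock = Lock()
fc1, fc2 = FailoverController("nn1", lock), FailoverController("nn2", lock)
fc1.tick(); fc2.tick()
print(fc1.state, fc2.state)   # nn1 wins the election
fc1.healthy = False           # simulate a failed health check on nn1
fc1.tick(); fc2.tick()
print(fc1.state, fc2.state)   # nn2 takes over
```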
Snapshots (HDFS-2802)
• Support for read-only COW snapshots
  – Design allows read-write snapshots
• Namenode only operation – no data copy made
  – Metadata in namenode - no complicated distributed mechanism
  – Datanodes have no knowledge
• Snapshot entire namespace or sub directories
  – Nested snapshots allowed
  – Managed by Admin
    • Users can take snapshots of directories they own
• Efficient
  – Instantaneous creation
  – Memory used is highly optimized
  – Does not affect regular HDFS operations

Snapshot Design
[Diagram: the Current state is linked through diffs ∆n, ∆n-1 … ∆0 to snapshots Sn, Sn-1 … S0.]




• Based on Persistent Data Structures
  – Maintains changes in the diff list at the Inodes
     • Tracks creation, deletion, and modification
  – Snapshot state Sn = current - ∆n
• A large number of snapshots supported
  – State proportional to the changes between the snapshots
  – Supports millions of snapshots
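
The relation Sn = current - ∆n can be made concrete with a minimal Python sketch. It models one directory as a dictionary and keeps one diff per snapshot that stores the pre-change value of each name modified after that snapshot. The class `SnapshottableDir` and its methods are hypothetical names for illustration; this is not the Namenode's actual data structure.

```python
from typing import Dict, List, Optional

ABSENT = None  # marks "this name did not exist when the snapshot was taken"


class SnapshottableDir:
    def __init__(self) -> None:
        self.current: Dict[str, str] = {}
        # diffs[i] holds pre-change values for names modified after snapshot i
        # and before snapshot i + 1.
        self.diffs: List[Dict[str, Optional[str]]] = []

    def create_snapshot(self) -> int:
        """Constant time: start a new empty diff; no data is copied."""
        self.diffs.append({})
        return len(self.diffs) - 1

    def _record(self, name: str) -> None:
        # Save the old value once, and only in the newest diff.
        if self.diffs and name not in self.diffs[-1]:
            self.diffs[-1][name] = self.current.get(name, ABSENT)

    def put(self, name: str, content: str) -> None:
        self._record(name)
        self.current[name] = content

    def delete(self, name: str) -> None:
        self._record(name)
        self.current.pop(name, None)

    def read_snapshot(self, snap_id: int) -> Dict[str, str]:
        """Rebuild a snapshot by undoing diffs from the newest back to snap_id."""
        state = dict(self.current)
        for diff in reversed(self.diffs[snap_id:]):
            for name, old in diff.items():
                if old is ABSENT:
                    state.pop(name, None)
                else:
                    state[name] = old
        return state


d = SnapshottableDir()
d.put("a", "v1")
s0 = d.create_snapshot()
d.put("a", "v2")
s1 = d.create_snapshot()
d.put("b", "new")
d.delete("a")
print(d.read_snapshot(s0), d.read_snapshot(s1), d.current)
```

The stored state grows only with the number of changes between snapshots, which is the property the slide relies on to support a large number of snapshots.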
Snapshot – APIs and CLIs
• All regular commands & APIs can be used with snapshot path
  – /<path>/.snapshot/<snapshot_name>/file.txt
• CLIs
  – Allow snapshots
     • dfsadmin -allowSnapshot <dir>
     • dfsadmin -disallowSnapshot <dir>
  – Create/delete/rename snapshots
     • fs -createSnapshot <dir> [snapshot_name]
     • fs -deleteSnapshot <dir> <snapshot_name>
     • fs -renameSnapshot <dir> <old_name> <new_name>
  – Tool to print diff between snapshots
  – Admin tool to print all snapshottable directories and snapshots
• Status
  – Work almost complete – ready to be integrated into trunk
  – Additional work needed for integration with Ambari
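
Because a snapshot is addressed through an ordinary path, client code only needs to build that path and pass it to the regular commands and APIs. A small Python helper shows the layout `/<path>/.snapshot/<snapshot_name>/file.txt`; the function name `snapshot_path` and the example directory and snapshot names are hypothetical.

```python
def snapshot_path(directory: str, snapshot_name: str, relative_path: str = "") -> str:
    """Build <directory>/.snapshot/<snapshot_name>/<relative_path>."""
    parts = [directory.rstrip("/"), ".snapshot", snapshot_name]
    rel = relative_path.strip("/")
    if rel:
        parts.append(rel)
    return "/".join(parts)


print(snapshot_path("/data/warehouse", "nightly", "file.txt"))
```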

Performance Improvements
• Many Improvements
  – SSE4.2 CRC32C – ~3x less CPU on read path
  – Read path improvements for fewer memory copies
  – Short-circuit read for 2-3x faster random reads (HBase workloads)
  – Unix domain socket based local reads (almost done)
    • Simpler to configure and generic for many applications
  – I/O improvements using posix_fadvise()
  – libhdfs improvements for zero copy reads
• Significant improvements - IO 2.5x to 5x faster
  – Many improvements backported to release 1.x
    • Available in Apache release 1.1 and HDP 1.1
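
For readers unfamiliar with CRC32C, the checksum itself can be sketched in pure Python. This is a table-driven software version for illustration only; the slide refers to the SSE4.2 hardware instruction, which computes the same function far faster. The per-chunk size of 512 bytes is an assumption made for this example, and the function names are hypothetical.

```python
from typing import List


def _make_table() -> List[int]:
    """Precompute the byte-wise table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Return the CRC32C (Castagnoli) checksum of `data`."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def chunk_checksums(data: bytes, bytes_per_checksum: int = 512) -> List[int]:
    """Checksum each fixed-size chunk separately (chunk size is an assumption)."""
    return [
        crc32c(data[i:i + bytes_per_checksum])
        for i in range(0, len(data), bytes_per_checksum)
    ]


print(hex(crc32c(b"123456789")))  # standard check value: 0xe3069283
```

Checksumming per chunk lets a reader verify only the chunks it actually reads, which is why the CPU cost of this function shows up directly on the read path.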




Other Features
• New append pipeline
• Protobuf, wire compatibility
  – Stronger wire compatibility after 2.0 GA in Apache Hadoop and HDP releases
• Rolling upgrades
  – With relaxed version checks
• Improvements for other projects
  – Stale node to improve HBase MTTR
• Block placement enhancements
  – Better support for other topologies such as VMs and Cloud
• On the wire encryption
  – Both data and RPC
• Support for NFS gateway
  – Work in progress – available soon
• Expanding ecosystem, platforms and applicability
  – Native support for Windows

Enterprise Readiness
• Storage fault-tolerance – built into HDFS ✓
  – Over seven 9’s of data reliability
• High Availability ✓
• Standard Interfaces ✓
  – WebHDFS (REST) & HttpFS, FUSE, NFS, libwebhdfs and libhdfs
• Wire protocol compatibility ✓
• Rolling upgrades ✓
• Snapshots ✓
• Disaster Recovery ✓
  – DistCp for parallel and incremental copies across clusters
  – Apache Ambari and HDP for automated management


HDFS Futures




Storage Abstraction
• Fundamental storage abstraction improvements
• Short Term
  – Heterogeneous storage
     • Support SSDs and disks for different storage categories
     • Match storage to different access patterns
     • Disk/storage addressing/locality and status collection
  – Block level APIs for apps that don’t need file system interface
  – Granular block placement policies
• Long Term
  – Explore support for objects/Key value store and APIs
  – Serving from Datanodes optimized based on file structure



Higher Scalability
• Even higher scalability of namespace
 – Only working set in Namenode memory
 – Namenode as container of namespaces
   • Support large number of namespaces
 – Explore new types of namespaces


• Further scale the block storage
 – Move block management to Datanodes
 – Block collection/Mega block group abstraction



High Availability
• Further enhancements to HA
 – Expand Full stack HA to include other dependent services
 – Support multiple standby nodes
 – Use standby for reads
 – Simplify management – eliminate special daemons for journals
    • Move Namenode metadata to HDFS




Q&A
• Myths and misinformation
 – Not reliable (was never true)
 – If the Namenode dies, all state is lost (was never true)
 – Hard to operate
 – Slow and not performant
 – Namenode is a single point of failure
 – Needs shared NFS storage
 – Does not have point in time recovery
 – Does not support disaster recovery


                                  Thank You!
