Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes

© Hortonworks Inc. 2017
Jing Zhao
Tsz-Wo Nicholas Sze
June 14, 2017

About Us
• Tsz-Wo Nicholas Sze, Ph.D.
– Software Engineer at Hortonworks
– PMC member/Committer of Apache Hadoop
– Active contributor and committer of Apache Ratis
– Ph.D. from University of Maryland, College Park
– MPhil & BEng from Hong Kong University of Sci & Tech

• Jing Zhao, Ph.D.
– Software Engineer at Hortonworks
– PMC member/Committer of Apache Hadoop
– Active contributor and committer of Apache Ratis
– Ph.D. from University of Southern California
– B.S. from Tsinghua University, Beijing

Outline
• Current HDFS Architecture
• Namespace Scaling
• Storage Container Architecture
– Storage Containers
– Next Generation HDFS
– Ozone – Hadoop Object Store
– cBlock
• Current Development Status

Current HDFS
Architecture

HDFS Architecture
[Figure: the Namenode holds the Namespace Tree (File Path → Block IDs) and the Block Map (Block ID → Block Locations). Datanodes hold the block storage (Block ID → Data) for blocks b1, b2, b3, b5 and send Heartbeats & Block Reports to the Namenode. Datanodes horizontally scale IO and storage.]
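The two Namenode maps can be pictured with a small sketch. This is an illustrative in-memory model, not HDFS code: the paths, block IDs, datanode names, and the functions `locate` and `process_block_report` are all hypothetical.

```python
# Hypothetical model of the two maps the Namenode keeps in memory.
namespace = {  # Namespace Tree: file path -> block IDs
    "/logs/a.txt": ["b1", "b2"],
    "/logs/b.txt": ["b3"],
}
block_map = {  # Block Map: block ID -> datanodes holding a replica
    "b1": ["dn1", "dn2", "dn3"],
    "b2": ["dn1", "dn3", "dn4"],
    "b3": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Return [(block_id, [datanodes])] for a file, as a reader would need."""
    return [(block_id, block_map[block_id]) for block_id in namespace[path]]

def process_block_report(datanode, reported_blocks):
    """Refresh one datanode's entries in the block map from a full report."""
    # Forget everything previously known about this datanode.
    for locations in block_map.values():
        if datanode in locations:
            locations.remove(datanode)
    # Record exactly what the datanode says it holds now.
    for block_id in reported_blocks:
        block_map.setdefault(block_id, []).append(datanode)
```

Both maps grow with the number of blocks, and every block report touches one entry per reported block, which is why this design is sensitive to the block count.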

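HDFS Layering
[Figure: two layers. Namespace layer: Namenodes NN-1 ... NN-k ... NN-n serving namespaces NS1 ... NS k ... Foreign NS n. Block Storage layer: Block Pools (Pool 1 ... Pool k ... Pool n) on common storage shared by datanodes DN 1, DN 2 ... DN m.]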

Scalability – What HDFS Does Well?
• HDFS NN stores all metadata in memory
– Scales to large clusters (5k nodes), and since all metadata is in memory:
• 60K-100K tasks (large # of parallel ops) can share Namenode
• Low latency
• Large data if files are large
– Proof points of large data and large clusters
• Single Organizations have over 600PB in HDFS
• Single clusters with over 200PB using federation
Metadata in memory is the strength of the original GFS and HDFS design,
but it is also the weakness when scaling the number of files and blocks

Scalability – The Challenges
• Large number of files (> 350 million)
– The files may be small in size.
– NN’s strength has become a limitation
• Number of file operations
– Need to improve concurrency – move to multiple name servers
• HDFS Federation is the current solution
– Add NameNodes to scale number of files & operations
– Deployed at Twitter
• A 5000+ node cluster with three NameNodes (plans to grow to 10,000 nodes)
– Backported and used at Facebook to scale HDFS

Scalability – Large Number of Blocks
• Block report processing
– Datanode block reports also become huge
– Processing them takes a long time
[Figure: Datanodes holding blocks b1, b2, b3, b5 send Heartbeats & Block Reports to the Namenode.]

Namespace Scaling

Partial Namespace in Memory
• Use a key-value store to represent the namespace tree
– Every INode has a unique id.
– Map: id -> INode
– Map: (Parent id, child name) -> child id
• Keep only the working set in memory
– Keep part of the namespace in memory and the rest on disk
– Various caching strategies
• LRU, caching hot directories, etc.
• LevelDB
– A fast key-value store
– Used in a prototype of partial namespace implementation
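A minimal sketch of the two maps and an LRU working set follows. It is not the prototype's code: plain dictionaries stand in for the on-disk key-value store (LevelDB in the prototype), only INodes are cached, and the class `PartialNamespace` and its method names are hypothetical.

```python
from collections import OrderedDict

class PartialNamespace:
    ROOT_ID = 1

    def __init__(self, cache_size):
        # Stand-ins for the on-disk key-value store.
        self.inodes = {self.ROOT_ID: {"name": "", "is_dir": True}}  # id -> INode
        self.children = {}  # (parent id, child name) -> child id
        # In-memory working set, ordered from least to most recently used.
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.next_id = self.ROOT_ID + 1
        self.disk_reads = 0

    def create(self, parent_id, name, is_dir=False):
        inode_id = self.next_id
        self.next_id += 1
        self.inodes[inode_id] = {"name": name, "is_dir": is_dir}
        self.children[(parent_id, name)] = inode_id
        return inode_id

    def _load(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)  # mark as most recently used
            return self.cache[inode_id]
        self.disk_reads += 1  # cache miss: read from the store
        inode = self.inodes[inode_id]
        self.cache[inode_id] = inode
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return inode

    def resolve(self, path):
        """Walk the path one component at a time via (parent id, name) -> id."""
        inode_id = self.ROOT_ID
        for name in filter(None, path.split("/")):
            inode_id = self.children[(inode_id, name)]
        return inode_id, self._load(inode_id)
```

Resolving the same hot path repeatedly is served from the cache, while a cold path costs one store read and may evict the least recently used INode.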

Partial Namespace in Memory
• Has been prototyped
– Benchmarks show that the model works well
– Most file systems keep only a partial namespace in memory, but not at this scale
• Hence cache replacement policies for the working set are important
• In Big Data, only the last 3, 6, or 12 months of five or ten years of data are actively used => the working set is small
• Work in progress to get it into HDFS
• Partial Namespace has other benefits
– Faster NN startup – load the working set as needed
– Partial Namespace in Memory will allow multiple namespace volumes

Previous Talks on Partial Namespace
• Evolving HDFS to a Generalized Storage Subsystem
– Sanjay Radia, Jitendra Pandey (@Hortonworks)
– Hadoop Summit 2016
• Scaling HDFS to Manage Billions of Files with Key Value Stores
– Haohui Mai, Jing Zhao (@Hortonworks)
– Hadoop Summit 2015
• Removing the NameNode's memory limitation
– Lin Xiao (PhD student @CMU, intern @Hortonworks)
– Hadoop User Group 2013

Container Architecture

Containers
• Storage Container – a storage unit
• Local block map
– Map block IDs to local block locations
• Small in size
– 4GB or 32GB (configurable)
[Figure: two storage containers, each with its own block map: c1 holding blocks b1, b3, b6 and c2 holding blocks b2, b7, b8.]

Distributed Block Map
• The block map is moved from the namenode to datanodes
– The block map becomes distributed
– Entire container is replicated
– A datanode has multiple containers
[Figure: container c1, with its block map and blocks b1, b3, b6, is replicated in full on three datanodes. Each datanode hosts multiple containers (c1 ... c6).]

SCM – Storage Container Manager
[Figure: the SCM holds the Container Map (Container ID → Container Locations). Datanodes host containers c1, c2, c3, c5 and send Heartbeats & Container Reports to the SCM.]
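The two-level lookup can be sketched as follows: the SCM tracks only containers, and each container replica carries its own block map. This is an illustrative model under assumed names (`Container`, `Datanode`, `StorageContainerManager`, `replicate`, `read_block`), not the Ozone or HDFS API.

```python
import copy

class Container:
    """A storage container with its own local block map."""
    def __init__(self, container_id):
        self.container_id = container_id
        self.block_map = {}  # block ID -> block data (stand-in for a local location)

class Datanode:
    def __init__(self, name):
        self.name = name
        self.containers = {}  # container ID -> Container replica

    def container_report(self):
        # One entry per container, regardless of how many blocks each holds.
        return sorted(self.containers)

class StorageContainerManager:
    def __init__(self):
        self.container_map = {}  # container ID -> set of datanode names

    def process_container_report(self, datanode):
        # Drop stale entries for this datanode, then record what it reported.
        for locations in self.container_map.values():
            locations.discard(datanode.name)
        for container_id in datanode.container_report():
            self.container_map.setdefault(container_id, set()).add(datanode.name)

def replicate(container, datanodes):
    # The entire container, block map included, is copied to each datanode.
    for dn in datanodes:
        dn.containers[container.container_id] = copy.deepcopy(container)

def read_block(scm, datanodes_by_name, container_id, block_id):
    # Step 1: ask the SCM where the container lives (pick any replica).
    name = min(scm.container_map[container_id])
    # Step 2: that datanode resolves the block in the container's local map.
    return datanodes_by_name[name].containers[container_id].block_map[block_id]

c1 = Container("c1")
c1.block_map.update({"b1": b"one", "b3": b"three", "b6": b"six"})
dns = [Datanode("dn1"), Datanode("dn2"), Datanode("dn3")]
replicate(c1, dns)
scm = StorageContainerManager()
for dn in dns:
    scm.process_container_report(dn)
```

In this model a report carries one entry per container rather than one per block, so the central map and the reports grow with the container count only.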

Next Generation HDFS
[Figure: the NameNode holds the Namespace Tree (File Path → Block IDs and Container IDs). The SCM holds the Container Map (Container ID → Container Locations). Datanodes host containers c1, c2, c3, c5.]

Billions of Files
• Next generation HDFS architecture
– Support up to 1 million blocks per container
• Provided that the total block size can fit into a container.
– A 5k-node cluster could have 1 million containers
– The cluster can store up to 1 trillion (small) blocks.
– HDFS can easily scale to manage billions of files!
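The arithmetic behind these figures can be checked in a few lines. The per-node count and the average block sizes are derived here for illustration; they assume containers are counted once (ignoring replication) and that 4GB and 32GB mean 4 and 32 GiB.

```python
blocks_per_container = 1_000_000
containers_per_cluster = 1_000_000
nodes = 5_000

# Upper bound on blocks in the cluster: 10**12, i.e. 1 trillion.
max_blocks = blocks_per_container * containers_per_cluster

# Containers each node hosts if every container is counted once.
containers_per_node = containers_per_cluster // nodes

# A full 1 million blocks fit only if blocks are small enough on average.
GIB = 2**30
avg_block_bytes_4g = 4 * GIB // blocks_per_container
avg_block_bytes_32g = 32 * GIB // blocks_per_container

print(max_blocks, containers_per_node, avg_block_bytes_4g, avg_block_bytes_32g)
```

So the one-trillion figure applies to small blocks of a few KB to a few tens of KB each; larger blocks fill a container well before it reaches 1 million entries.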

Ozone – Hadoop Object Store
• Store KV (key-value) pairs
– Similar to Amazon S3
• Need a Key Map – a key-to-container-id map
• Containers are partial object stores (partial KV maps)
[Figure: Ozone holds the Key Map (Key → Container IDs) alongside the Container Map (Container ID → Container Locations). Datanodes host containers c1, c2, c3, c5.]

Challenge – Trillions of Key-Value Pairs
• Values (Objects) are distributed in DataNodes
– 5k nodes can handle a trillion objects (no problem)
• Trillions of keys in the Key Map
– The Key Map becomes huge (TB in size)
– Cannot fit in memory – the same old problem
• Avoid storing all keys in the Key Map
– Hash partitioning
– Range partitioning
– Partitions can be split/merged
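The two partitioning schemes can be sketched as below. This is a generic illustration, not Ozone's implementation: the function `hash_partition`, the class `RangePartitioner`, and the choice of SHA-256 are assumptions made for the example.

```python
import bisect
import hashlib

def hash_partition(key, num_partitions):
    """Hash partitioning: a stable hash of the key picks the partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

class RangePartitioner:
    """Range partitioning: sorted split points divide the key space.

    Partition i holds keys k with split_points[i-1] <= k < split_points[i].
    """
    def __init__(self, split_points):
        self.split_points = sorted(split_points)

    def partition_of(self, key):
        # Number of split points <= key is the partition index.
        return bisect.bisect_right(self.split_points, key)

    def split(self, split_key):
        # Splitting a partition adds a boundary inside it.
        bisect.insort(self.split_points, split_key)

    def merge(self, index):
        # Merging partitions index and index + 1 removes their boundary.
        del self.split_points[index]

ranges = RangePartitioner(["g", "p"])
```

Hash partitioning spreads keys evenly but scatters neighbouring keys, whereas range partitioning keeps sorted keys together and makes split and merge a matter of adding or removing a boundary.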
[Figure: Ozone Key Map (Key → Container IDs).]

Closed Containers
• Initially, a container is open for read and write
– Using Raft for its replication
• Close the container
– once the container has reached a certain size, say 4GB or 32GB
– No longer managed by Raft
• Closed containers are immutable
– Cannot add new KV entries
– Cannot overwrite/delete KV entries
• Open containers
– New KV entries are always written to open containers
– Only need a small number of open containers (thousands)
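The open/closed lifecycle can be sketched as a small state model. It leaves out Raft entirely and uses a tiny size limit for illustration; the classes `KVContainer` and `ContainerWriter` and the exception `ContainerClosedError` are hypothetical names, not Ozone code.

```python
class ContainerClosedError(Exception):
    pass

class KVContainer:
    def __init__(self, container_id, max_bytes):
        self.container_id = container_id
        self.max_bytes = max_bytes
        self.entries = {}
        self.used_bytes = 0
        self.is_open = True

    def _check_open(self):
        if not self.is_open:
            raise ContainerClosedError(f"{self.container_id} is closed and immutable")

    def put(self, key, value):
        self._check_open()
        # Account for an overwrite by subtracting the old value's size.
        self.used_bytes += len(value) - len(self.entries.get(key, b""))
        self.entries[key] = value
        # Close the container once it reaches the size limit.
        if self.used_bytes >= self.max_bytes:
            self.is_open = False

    def delete(self, key):
        self._check_open()
        self.used_bytes -= len(self.entries.pop(key))

    def get(self, key):
        # Reads are allowed in both states.
        return self.entries[key]

class ContainerWriter:
    """Routes every new entry to an open container, allocating one if needed."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.containers = []

    def put(self, key, value):
        if not self.containers or not self.containers[-1].is_open:
            new_id = f"c{len(self.containers) + 1}"
            self.containers.append(KVContainer(new_id, self.max_bytes))
        self.containers[-1].put(key, value)
        return self.containers[-1].container_id
```

Because writes only ever reach open containers, the number of containers that need consensus-based replication stays small while the closed, immutable majority can be handled like ordinary blocks.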

Container Replication
• Closed containers
– Replication or Erasure Coding
– The same way HDFS does for blocks
• Open containers are replicated by Raft
– Raft: a consensus algorithm
– Apache Ratis – an implementation of Raft
• More detail in later slides
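As a reminder of how Raft decides that a write to an open container is durable, the sketch below computes the highest log index stored on a majority of servers. It is a simplification: real Raft additionally requires the entry at that index to belong to the leader's current term, and the function name `majority_commit_index` is made up for this example (it is not a Ratis API).

```python
def majority_commit_index(match_index):
    """Highest log index replicated on a majority of servers.

    match_index lists, for each server (leader included), the highest
    log index known to be stored on that server.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is held by at least `majority` servers.
    return ordered[majority - 1]

print(majority_commit_index([5, 3, 4]))        # 3 replicas
print(majority_commit_index([7, 7, 2, 1, 1]))  # 5 replicas
```

With three replicas, an entry is committed as soon as two of them have it, so one slow or failed datanode does not block writes.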

Big Picture
[Figure: applications (HDFS, HBase, Object Store Metadata) sit on top of the Container Management Services, which run on DataNodes: Cluster Membership, Replication Management, Container Location Service. DataNodes hold Block Containers and Object Store Containers on shared physical storage.]

Current Development
Status

HDFS-7240 – Object store in HDFS
• The umbrella JIRA for Ozone, including the container framework
– 235 subtasks
– 182 subtasks resolved (as of June 13)
– Code contributors
• Anu Engineer, Arpit Agarwal, Chen Liang, Mingliang Liu, Chris Nauroth, Kanaka
Kumar Avvaru, Mukul Kumar Singh, Tsz Wo Nicholas Sze, Weiwei Yang, Xiaobing
Zhou, Xiaoyu Yao, Yuanbo Liu, …

HDFS-11118: Block Storage for HDFS
• The umbrella JIRA for additional work for cBlock
– 23 subtasks
– 20 subtasks resolved (as of June 13)
– Code contributors
• Chen Liang
• Mukul Kumar Singh
• Xiaoyu Yao
• cBlock has already been deployed in Hortonworks’ QE environment for several months!

Raft – A Consensus Algorithm
• “In Search of an Understandable Consensus Algorithm”
– The Raft paper by Diego Ongaro and John Ousterhout
– USENIX ATC’14
• “In Search of a Usable Raft Library”
– A long list of Raft implementations is available
– Most of them are tied to another project or a part of another project.
• We need a Raft implementation with high throughput!

Apache Ratis – A Raft Library
• A brand new, incubating Apache project
– Open source, open development
– Written in Java 8
• Emphasis on pluggability
– Pluggable state machine
– Pluggable Raft log
– Pluggable RPC
• RPC implementations currently supported in examples: gRPC, Netty, Hadoop RPC
• Users may provide their own RPC implementation
• Supports high-throughput data ingest
– For more general data replication use cases
– Pipeline support for log replication
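The idea of a pluggable state machine can be illustrated generically: the replication layer only orders and commits log entries, and the application supplies the code that applies them. The sketch below is not the Ratis API; `StateMachine`, `KeyValueStateMachine`, `Replica`, and the entry format are hypothetical.

```python
from abc import ABC, abstractmethod

class StateMachine(ABC):
    """What the application plugs in: how a committed entry changes state."""
    @abstractmethod
    def apply(self, entry):
        ...

class KeyValueStateMachine(StateMachine):
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        op, key, value = entry
        if op == "put":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")

class Replica:
    """Applies committed log entries, in order, to whatever state machine it is given."""
    def __init__(self, state_machine):
        self.state_machine = state_machine
        self.applied = 0  # number of log entries applied so far

    def apply_committed(self, log, commit_index):
        while self.applied < commit_index:
            self.state_machine.apply(log[self.applied])
            self.applied += 1

log = [("put", "k1", "v1"), ("put", "k2", "v2"), ("delete", "k1", None)]
replicas = [Replica(KeyValueStateMachine()) for _ in range(3)]
for replica in replicas:
    replica.apply_committed(log, commit_index=len(log))
```

Since every replica applies the same entries in the same order, all three end in the same state, and swapping in a different `StateMachine` subclass changes the service without touching the replication logic.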

Apache Ratis – Use cases
• General use case:
– You already have a service running on a single server
• You want to:
– replicate the server log/states to multiple machines
• The replication number/cluster membership can be changed at runtime
– have an HA service
• When a server fails, another server will automatically take over
• Clients automatically fail over to the new server
• Apache Ratis is for you!
• Use cases in Ozone/HDFS
– Replicating open containers (HDFS-11519, committed on April 3)
– Support HA in SCM
– Replacing the current NameNode HA solution

Apache Ratis – Development Status
• A brief history
– 2016-03: Project started at Hortonworks
– 2016-04: First commit “leader election (without tests)”
– 2017-01: Entered Apache incubation.
– 2017-03: Started preparing the first Alpha release (RATIS-53).
– 2017-04: Hadoop Ozone branch started using Ratis (HDFS-11519)
– 2017-05: first 0.1.0-alpha release entered distribution
• Committers
– Anu Engineer, Arpit Agarwal, Chen Liang, Chris Nauroth, Devaraj Das,
Enis Soztutar, Hanisha Koneru, Jakob Homan, Jing Zhao, Jitendra
Pandey, Li Lu, Mayank Bansal, Mingliang Liu, Tsz Wo Nicholas Sze,
Uma Maheswara Rao G, Xiaobing Zhou, Xiaoyu Yao
• Contributions are welcome!
– http://incubator.apache.org/projects/ratis.html
– dev@ratis.incubator.apache.org

Thank You!

Backup Slides