HDFS ARCHITECTURE
How HDFS is evolving to meet new needs
✛  Aaron T. Myers
    ✛  Hadoop PMC Member / Committer at ASF
    ✛  Software Engineer at Cloudera
    ✛  Primarily work on HDFS and Hadoop Security




✛  HDFS architecture circa 2010
    ✛  New requirements for HDFS
       >  Random read patterns
       >  Higher scalability
       >  Higher availability
    ✛  HDFS evolutions to address requirements
       >  Read pipeline performance improvements
       >  Federated namespaces
       >  Highly available Name Node



HDFS ARCHITECTURE: 2010
✛  Each cluster has…
       >  A single Name Node
           ∗  Stores file system metadata
           ∗  Stores “Block ID” -> Data Node mapping
       >  Many Data Nodes
           ∗  Store actual file data
       >  Clients of HDFS…
           ∗  Communicate with Name Node to browse file system, get
              block locations for files
           ∗  Communicate directly with Data Nodes to read/write files
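
For concreteness, here is a minimal client-side sketch of the read path just described, using the standard Hadoop FileSystem API (the NameNode address and file paths are placeholders, not from the talk):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // Metadata operations (browsing, file lengths) go to the Name Node
        for (FileStatus stat : fs.listStatus(new Path("/user/example"))) {
          System.out.println(stat.getPath() + " " + stat.getLen());
        }

        // open() fetches block locations from the Name Node; the bytes
        // themselves are then streamed directly from the Data Nodes
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) {
          byte[] buf = new byte[4096];
          int n = in.read(buf);
          System.out.println("read " + n + " bytes");
        }
      }
    }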




✛  Want to support larger clusters
       >  ~4,000 node limit with 2010 architecture
       >  New nodes beefier than old nodes
          ∗  2009: 8 cores, 16GB RAM, 4x1TB disks
          ∗  2012: 16 cores, 48GB RAM, 12x3TB disks

    ✛  Want to increase availability
       >  With the rise of HBase, HDFS is now serving live traffic
       >  Downtime means immediate user-facing impact
    ✛  Want to improve random read performance
       >  HBase usually does small, random reads, not bulk


✛  Single Name Node
       >  If Name Node goes offline, cluster is unavailable
       >  Name Node must fit all FS metadata in memory
    ✛  Inefficiencies in read pipeline
       >  Designed for large, streaming reads
       >  Not small, random reads (like the HBase use case)




✛  Fine for offline, batch-oriented applications
    ✛  If cluster goes offline, external customers don’t
      notice
    ✛  Can always use separate clusters for different
      groups
    ✛  HBase didn’t exist when Hadoop was first created
       >  MapReduce was the only client application




HDFS PERFORMANCE IMPROVEMENTS
HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measured by putting ~1GB in HDFS and cat-ing the file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time
      instead of 512 bytes (better instruction cache locality,
      amortize JNI overhead)
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned
      assembly (~8 bytes per cycle with instruction level
      parallelism!)
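
As a rough illustration of the bulk idea only (not the actual HDFS code, which does this in native code with JNI and SSE4.2 assembly): compute one CRC32C per 512-byte chunk, but walk an entire 64KB buffer per call so fixed per-call costs are amortized. The sketch uses java.util.zip.CRC32C, which requires JDK 9+.

    import java.util.zip.CRC32C;

    public class BulkCrcSketch {
      // One checksum per 512-byte chunk, computed over a large buffer in a single pass.
      // Conceptual only: HDFS performs this in native code (JNI + SSE4.2 assembly).
      static int[] bulkCrc32c(byte[] buf, int bytesPerChecksum) {
        int chunks = (buf.length + bytesPerChecksum - 1) / bytesPerChecksum;
        int[] sums = new int[chunks];
        CRC32C crc = new CRC32C();
        for (int i = 0; i < chunks; i++) {
          int off = i * bytesPerChecksum;
          int len = Math.min(bytesPerChecksum, buf.length - off);
          crc.reset();
          crc.update(buf, off, len);
          sums[i] = (int) crc.getValue();
        }
        return sums;
      }

      public static void main(String[] args) {
        byte[] buffer = new byte[64 * 1024];          // one 64KB bulk unit
        int[] checksums = bulkCrc32c(buffer, 512);    // 128 per-chunk checksums
        System.out.println(checksums.length + " checksums computed");
      }
    }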


Checksum improvements (lower is better)

[Chart: CDH3u0 vs. optimized, normalized bars for random-read latency, random-read
CPU usage, and sequential-read CPU usage. Random-read latency drops from ~1360us
(CDH3u0) to ~760us (optimized).]

 Post-optimization: only 16% overhead vs un-checksummed access
 Maintain ~800MB/sec from a single thread reading OS cache

HDFS Random access

•  0.20.2:
    •  Each individual read operation reconnects to
       DataNode
    •  Significant TCP handshake overhead, thread creation,
       etc.
•  2.0.0:
    •  Clients cache open sockets to each datanode (like
       HTTP Keepalive)
    •  Local readers can bypass the DN in some
       circumstances to directly read data
    •  Rewritten BlockReader to eliminate a data copy
    •  Eliminated lock contention in DataNode’s
       FSDataset class
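
For reference, HBase-style random reads map onto positioned reads on FSDataInputStream, as in this hedged sketch (the path and offsets are invented); on 0.20.2 every such read paid the reconnect cost, while a 2.0.0 client reuses cached DataNode sockets and can short-circuit to local files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[16 * 1024];

        try (FSDataInputStream in = fs.open(new Path("/data/example.bin"))) {  // placeholder path
          // Positioned ("pread") calls: each reads a small range at an arbitrary offset.
          // On 0.20.2 every such read re-established a DataNode connection; a 2.0.0
          // client keeps sockets open and may read local replicas directly.
          in.read(0L, buf, 0, buf.length);
          in.read(1_000_000L, buf, 0, buf.length);
          in.read(50_000_000L, buf, 0, buf.length);
        }
      }
    }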

Random-read micro benchmark (higher is better)

Speed (MB/sec):
                         0.20.2   Trunk (no native)   Trunk (native)
   4 threads, 1 file        106                 253              299
   16 threads, 1 file       247                 488              635
   8 threads, 2 files       187                 477              633

       TestParallelRead benchmark, modified to 100% random read
       proportion.
       Quad core Core i7 Q820@1.73Ghz
Random-read macro benchmark (HBase YCSB)

[Chart: reads/sec over time; CDH4 vs. CDH3u1.]
HDFS FEDERATION ARCHITECTURE
✛  Instead of one Name Node per cluster, several
   >  Before: Only one Name Node, many Data Nodes
   >  Now: A handful of Name Nodes, many Data Nodes
✛  Distribute file system metadata between the
  NNs
✛  Each Name Node operates independently
   >  Potentially overlapping ranges of block IDs
   >  Introduce a new concept: block pool ID
   >  Each Name Node manages a single block pool
HDFS Architecture: Federation
✛  Improve scalability to 6,000+ Data Nodes
    >  Bumping into single Data Node scalability now
 ✛  Allow for better isolation
    >  Could locate HBase dirs on dedicated Name Node
    >  Could locate /user dirs on dedicated Name Node
 ✛  Clients still see a unified view of the FS namespace
    >  Use ViewFS – client-side mount table configuration (see the
       configuration sketch below)


     Note: Federation != Increased Availability
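
A hedged sketch of that client-side mount table (cluster name, hosts, and paths are placeholders; property names follow the Hadoop ViewFS convention fs.viewfs.mounttable.<cluster>.link.<path>):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ViewFsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side mount table: each top-level directory maps to the
        // Name Node that owns it. Cluster name, hosts, and paths are placeholders.
        conf.set("fs.defaultFS", "viewfs://myCluster/");
        conf.set("fs.viewfs.mounttable.myCluster.link./user",
                 "hdfs://nn-user.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.myCluster.link./hbase",
                 "hdfs://nn-hbase.example.com:8020/hbase");

        FileSystem fs = FileSystem.get(conf);
        // Resolves through the mount table to the Name Node serving /user
        System.out.println(fs.getFileStatus(new Path("/user")).getPath());
      }
    }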

HDFS HIGH AVAILABILITY ARCHITECTURE
Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: Rely on the OS’s file system rather
      than raw disk
   •  Storage Fault Tolerance: multiple replicas,
      active monitoring
   •  Single NameNode Master
  •  Persistent state: multiple copies + checkpoints
  •  Restart on failure
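
As a hedged illustration of “multiple copies” of the persistent state (not from the talk): the NameNode is pointed at several metadata directories, commonly two local disks plus a remote NFS mount, and writes its image and edit log to all of them. The property name below is the Hadoop 2.x one (dfs.namenode.name.dir); older releases used dfs.name.dir, and the paths are placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class NameDirSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The NameNode writes its fsimage and edit log to every listed directory,
        // so losing a single disk does not lose the namespace.
        conf.set("dfs.namenode.name.dir",
            "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/remote-nas/dfs/nn");
        System.out.println(conf.get("dfs.namenode.name.dir"));
      }
    }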




Current HDFS Availability & Data Integrity

•  How well did it work?

•  Lost 19 out of 329 million blocks on 10 clusters with 20K
  nodes in 2009
   •  Seven 9s of reliability, and the responsible bug was fixed in 0.20


•  18-month study: 22 failures across 25 clusters, i.e. roughly 0.58
  failures per cluster per year (22 / (25 clusters x 1.5 years))
   •  Only 8 of those would have benefited from HA failover! (0.23
      failures per cluster-year)



So why build an HA NameNode?

•  Most cluster downtime in practice is planned
  downtime
   •  Cluster restart for a NN configuration change (e.g.
      new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new
      Hadoop, new kernel, new JVM)
•  Planned downtimes cause the vast majority of
  outages!

•  Manual failover solves all of the above!
   •  Failover to NN2, fix NN1, fail back to NN1, zero
      downtime
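
Operationally, manual failover of this kind is driven from the Hadoop 2.x HA admin CLI, roughly as below (nn1/nn2 stand for whatever NameNode IDs are configured for the nameservice; verify the exact syntax against your release):

    # Check which NameNode is currently active
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Fail over from nn1 to nn2, do maintenance on nn1, then fail back
    hdfs haadmin -failover nn1 nn2
    # ... upgrade / repair nn1 ...
    hdfs haadmin -failover nn2 nn1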
Approach and Terminology
•  Initial goal: Active-Standby with Hot
  Failover

•  Terminology
   •  Active NN: actively serves read/write
      operations from clients
   •  Standby NN: waits, becomes active when
      Active dies or is unhealthy
   •  Hot failover: standby able to take over
      instantly

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades – main
      cause of unavailability
•  Active and Standby with automatic
  failover
   •  Addresses downtime during unplanned outages
       (kernel panics, bad memory, double PDU failure,
       etc)
    •  See HDFS-1623 for detailed use cases
•  With Federation each namespace volume has an
   active-standby NameNode pair

HDFS Architecture: High Availability

•  Failover controller runs outside the NN
•  Data Nodes send block reports to both Active and
   Standby in parallel
•  NNs share namespace state via a shared
   edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping replication”
•  Client failover
   •  Smart clients (e.g. configuration, or ZooKeeper for
      coordination)
   •  IP Failover in the future
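
A hedged sketch of the “smart client” configuration (property names as in the Apache Hadoop 2.x HA documentation; the nameservice ID and hosts are placeholders): the client is given a logical nameservice, both NameNode addresses, and a failover proxy provider that retries against the other NameNode when the active one is unreachable.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host (placeholders below)
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Proxy provider that fails over to the other NameNode
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());   // hdfs://mycluster
      }
    }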
HDFS Architecture: High Availability
HDFS ARCHITECTURE: WHAT’S NEXT
✛  Increase scalability of single Data Node
   >  Currently the most-noticed scalability limit
✛  Support for point-in-time snapshots
   >  To better support DR, backups
✛  Completely separate block / namespace layers
   >  Increase scalability even further, new use cases
✛  Fully distributed NN metadata
   >  No pre-determined “special nodes” in the system