HopsFS: 10X your HDFS with NDB
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io
@hopshadoop
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn,
Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
K “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Marketing 101: Celebrity Endorsements
*Turing Award Winner 2014, Father of Distributed Systems
Hi!
I’m Leslie Lamport* and
even though you’re not
using Paxos, I approve
this product.
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS*)
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
Hadoop in Context
Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
Storage: HDFS, MapR, S3, Colossus, WAS
Resource Management: YARN, Mesos, Borg
Metadata: Hive, Parquet, Authorization, Search
HDFS v2
[Diagram] An HDFS Client issues metadata operations (ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.) to the Active NameNode; a Standby NameNode takes over on failure; DataNodes (up to ~5K) store the blocks.
- Asynchronous replication of the EditLog via the Journal Nodes
- ZooKeeper: agreement on the Active NameNode
- Snapshots (fsimage) cut the EditLog
The NameNode is the Bottleneck for Hadoop
Max Pause Times for NameNode Heap Sizes*
[Chart] Max pause times (ms; 100–10,000) against JVM heap size (10–150 GB).
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
[Chart] Size (GB, 0–1,000) by year (2016–2020).
Externalizing the NameNode State
•Problem:
NameNode not scaling up with lower RAM prices
•Solution:
Move the metadata off the JVM Heap
•Move it where?
An in-memory storage system that can be efficiently
queried and managed. Preferably Open-Source.
•MySQL Cluster (NDB)
HopsFS Architecture
[Diagram] HDFS Clients talk to multiple NameNodes (one elected Leader); the NameNodes store their metadata in NDB; DataNodes store the blocks.
Pluggable DBs: Data Abstraction Layer (DAL)
NameNode (Apache v2) → DAL API (Apache v2) → NDB-DAL-Impl (GPL v2) or Other DB (Other License)
hops-2.5.0.jar, dal-ndb-2.5.0-7.5.3.jar
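To make the layering concrete, here is a rough sketch of what a storage-agnostic DAL boundary can look like (all names here are illustrative, not the actual Hops DAL API): the NameNode codes only against interfaces like these, and the NDB module implements them behind the GPL'd jar.

// Illustrative sketch only – not the actual Hops DAL API.
import java.util.Collection;
import java.util.List;
import java.util.Properties;

class StorageException extends Exception {
  StorageException(String msg) { super(msg); }
}

// Minimal metadata row as seen by the NameNode.
interface INode {
  long getId();
  long getParentId();
  String getName();
}

// The NameNode depends only on storage-agnostic data-access interfaces like this one.
interface INodeDataAccess {
  INode findByParentIdAndName(long parentId, String name) throws StorageException;
  List<INode> findByParentId(long parentId) throws StorageException;
  void prepare(Collection<INode> removed, Collection<INode> added,
               Collection<INode> modified) throws StorageException;
}

// A database-specific module (e.g. the NDB/ClusterJ implementation) provides
// the connector and the data-access objects at runtime.
interface StorageConnector {
  void setConfiguration(Properties conf) throws StorageException;
  void beginTransaction() throws StorageException;
  void commit() throws StorageException;
  void rollback() throws StorageException;
  <T> T getDataAccess(Class<T> type);   // e.g. getDataAccess(INodeDataAccess.class)
}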
The Global Lock in the NameNode
HDFS NameNode Internals
[Diagram] Client RPCs (mkdir, getBlockLocations, createFile, ...) arrive at a Listener (NIO thread), are read by Reader threads (ipc.server.read.threadpool.size, default 1) into the Call Queue, and are picked up by Handler threads (dfs.namenode.service.handler.count, default 10). Handlers take the global FSNamesystem lock to update the in-memory namespace and append to the EditLog buffer, which is flushed to the Journal Nodes (EditLog1..3); a Responder (NIO thread) returns done RPCs (ackIds) to clients via the ConnectionList.
HopsFS NameNode Internals
[Diagram] The same RPC pipeline (Listener, Readers, Call Queue, Handlers, Responder), but there is no global lock and no EditLog: Handlers execute transactions against NDB through the DAL API and DAL-Impl, with the metadata stored in NDB tables (inodes, block_infos, replicas, leases, ...). Making this transactional mapping correct and fast is the HARD PART.
Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
•Acquire FS locks in agreed order using FS Hierarchy.
•Block-level operations follow the same agreed order.
•No cycles => Freedom from deadlock
•Pessimistic Concurrency Control ensures progress
[Diagram] A client mv on /user/jim/myFile, a client read, and a DataNode block_report all lock the inodes they touch in the same hierarchical order.
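A minimal sketch of the idea, reusing the illustrative DAL interfaces from earlier (not the actual Hops code): every operation resolves its path root-to-leaf and takes row locks in that order, so no two operations can ever wait on each other in a cycle.

// Illustrative hierarchical lock ordering – not the actual Hops implementation.
// Row locks are acquired root-to-leaf inside the surrounding transaction.
List<INode> lockPath(INodeDataAccess dataAccess, String path) throws StorageException {
  List<INode> locked = new java.util.ArrayList<>();
  long parentId = 0L;                         // assume 0 is the root inode id
  for (String component : path.substring(1).split("/")) {
    // Each lookup takes a lock on the row, always in path order.
    INode inode = dataAccess.findByParentIdAndName(parentId, component);
    if (inode == null) break;                 // path does not exist (yet)
    locked.add(inode);
    parentId = inode.getId();
  }
  return locked;                              // locks are released when the transaction commits
}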
Per Transaction Cache
•Reusing the HDFS codebase resulted in too many
roundtrips to the database per transaction.
•We cache intermediate transaction results at
NameNodes (i.e., snapshot).
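As a sketch, again with the illustrative interfaces from above rather than the real Hops classes: each file-system operation carries a transaction context that snapshots the rows it has already read, so repeated lookups within one operation cost no extra round-trips.

// Illustrative per-transaction snapshot cache – not the actual Hops implementation.
class TransactionContext {
  private final java.util.Map<Long, INode> snapshot = new java.util.HashMap<>();
  private final INodeDataAccess dataAccess;

  TransactionContext(INodeDataAccess dataAccess) { this.dataAccess = dataAccess; }

  INode find(long parentId, String name) throws StorageException {
    for (INode cached : snapshot.values()) {            // hit: no database round-trip
      if (cached.getParentId() == parentId && name.equals(cached.getName())) return cached;
    }
    INode inode = dataAccess.findByParentIdAndName(parentId, name);  // miss: one round-trip
    if (inode != null) snapshot.put(inode.getId(), inode);
    return inode;
  }
  // On commit, the new/modified/removed rows collected here are sent to NDB in one batched prepare().
}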
Sometimes, Transactions Just ain’t Enough
•Large Subtree Operations (delete, mv, set-quota)
can’t always be executed in a single Transaction.
•4-phase Protocol (see the sketch after this list)
• Isolation and Consistency
• Aggressive batching
• Transparent failure handling
• Failed ops retried on new NN.
• Lease timeout for failed clients.
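One plausible shape for such a protocol, sketched with hypothetical helper methods (the real Hops protocol differs in its details):

// Illustrative skeleton of a multi-transaction subtree operation – not the actual Hops protocol.
void subtreeDelete(long subtreeRootId) throws StorageException {
  // Phase 1: set a subtree-lock flag on the root in its own small transaction,
  //          so new operations entering the subtree fail fast and retry.
  setSubtreeLockFlag(subtreeRootId, myNameNodeId);
  try {
    // Phase 2: quiesce – wait for in-flight transactions inside the subtree to finish.
    waitForActiveOperations(subtreeRootId);
    // Phase 3: run the operation bottom-up as many small, aggressively batched transactions.
    for (List<Long> batch : listInodeIdsBottomUp(subtreeRootId, 10_000)) {
      deleteInodesInOneTransaction(batch);
    }
  } finally {
    // Phase 4: clear the flag; if this NameNode dies, another NameNode sees the stale
    //          lock (its owner is no longer alive) and retries the whole operation.
    clearSubtreeLockFlag(subtreeRootId);
  }
}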
Leader Election using NDB
•Leader to coordinate replication/lease management
•NDB as shared memory for Leader Election of NN.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
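A minimal sketch of database-backed leader election (schema and method names are illustrative, not the protocol from the paper): every NameNode refreshes a heartbeat row in a small transaction, and the live NameNode with the smallest id acts as leader.

// Illustrative leader election using the database as shared memory – not the actual Hops protocol.
boolean tryBecomeLeader(long myId, long nowMillis, long leaseMillis) throws StorageException {
  connector.beginTransaction();
  try {
    heartbeats.upsert(myId, nowMillis);                      // refresh my heartbeat row
    long leaderId = Long.MAX_VALUE;
    for (Heartbeat hb : heartbeats.findAll()) {              // rows of (nameNodeId, lastSeen)
      if (hb.getLastSeen() >= nowMillis - leaseMillis) {     // still alive within the lease
        leaderId = Math.min(leaderId, hb.getId());
      }
    }
    connector.commit();
    return leaderId == myId;                                 // smallest live id is the leader
  } catch (StorageException e) {
    connector.rollback();
    throw e;
  }
}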
Path Component Caching
•The most common operation in HDFS is resolving
pathnames to inodes
- 67% of operations in Spotify’s Hadoop workload
•We cache recently resolved inodes at NameNodes so
that we can resolve them using a single batch
primary key lookup.
- We validate cache entries as part of transactions
- The cache converts O(N) round trips to the database to
O(1) for a hit for all inodes in a path.
Path Component Caching
•Resolving a path of length N gives O(N) round-trips
•With our cache, O(1) round-trip for a cache hit
[Diagram] Without the cache, resolving /user/jim/myFile takes three sequential round-trips to NDB: getInode(0, “user”), getInode(1, “jim”), getInode(2, “myFile”). With the cache, the NameNode resolves the path locally via getInodes(“/user/jim/myFile”) and issues a single batched validateInodes([(0, “user”), (1, “jim”), (2, “myFile”)]) primary-key read against NDB.
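A sketch of the cache-hit path (helper names are illustrative): look the components up locally, then confirm them with a single batched primary-key read instead of N sequential lookups.

// Illustrative path-component cache – not the actual Hops implementation.
INode[] resolvePath(String path) throws StorageException {
  INode[] cached = cache.getInodes(path);                 // local lookup, no database round-trip
  if (cached != null) {
    // One batched primary-key read validates every (parentId, name) pair in the path.
    if (dataAccess.validateInodes(cached)) return cached; // O(1) round-trips on a hit
    cache.invalidate(path);                               // stale entry: fall back to full resolve
  }
  // Cache miss: resolve component by component, O(N) round-trips.
  String[] components = path.substring(1).split("/");
  INode[] resolved = new INode[components.length];
  long parentId = 0L;                                     // assume 0 is the root inode id
  for (int i = 0; i < components.length; i++) {
    resolved[i] = dataAccess.findByParentIdAndName(parentId, components[i]);
    parentId = resolved[i].getId();
  }
  cache.putInodes(path, resolved);
  return resolved;
}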
Hotspots
•Mikael saw 1-2 maxed out LDM threads
•Partitioning by parent inodeId meant
fantastic performance for ‘ls’
- Partition-pruned index scans
- At high load hotspots appeared at the
top of the directory hierarchy
•Current Solution:
- Cache the root inode at NameNodes
- Pseudo-random partition key for top-level
directories, but keep partition by parent
inodeId at lower levels
- At least 4x throughput increase!
[Example namespace tree] / → /Users, /Projects → /NSA, /MyProj → /Dataset1, /Dataset2
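The rule can be sketched as follows (illustrative, not the exact Hops code): top-level directories get a pseudo-random partition key so they spread over the NDB data nodes, while deeper inodes keep the parent-id key that makes ‘ls’ a partition-pruned index scan.

// Illustrative choice of NDB partition key for an inode – not the actual Hops code.
long partitionKeyFor(INode inode, long rootId) {
  if (inode.getId() == rootId) {
    return rootId;                               // the root itself is cached at the NameNodes
  }
  if (inode.getParentId() == rootId) {
    // Top-level directories: pseudo-random key to avoid hotspots at the top of the hierarchy.
    return inode.getName().hashCode();
  }
  // Everything else: partition by parent id, so 'ls <dir>' stays a partition-pruned index scan.
  return inode.getParentId();
}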
Scalable Block Reporting
•On 100PB+ clusters, internal maintenance protocol
traffic makes up much of the network traffic
•Block Reporting
- Leader Load Balances
- Work-steal when exiting
safe-mode
[Diagram] DataNodes send block reports to NameNodes, which store Blocks/SafeBlocks in NDB; the Leader NameNode load-balances reports across NameNodes, and NameNodes work-steal safe-block processing when exiting safe mode.
HopsFS Performance
HopsFS Metadata Scaleout
[Chart] Assuming a 256MB block size and a 100 GB JVM heap for Apache Hadoop.
Spotify Workload
HopsFS Throughput (Spotify Workload - PM)
[Chart] Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances.
HopsFS Throughput (Spotify Workload - PM)
[Chart] Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances.
HopsFS Throughput (Spotify Workload - AM)
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE.
NameNodes: Xeon E5-2620 2.40GHz machines and 10GbE.
Per Operation HopsFS Throughput
NDB Performance Lessons
•NDB is quite stable!
•ClusterJ is (nearly) good enough
- sun.misc.Cleaner has trouble keeping up at high
throughput – OOM for ByteBuffers
- Transaction hint behavior not respected
- DTO creation time affected by Java Reflection
- Nice features would be:
• Projections
• Batched scan operations support
• Event API
•Event API and Asynchronous API needed for
performance in Hops-YARN
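For reference, a minimal ClusterJ example (the ClusterJ API calls are real; the table, columns and connect string are illustrative): a table is mapped to an annotated interface and read by primary key through a Session.

// Minimal ClusterJ sketch – table and column names are hypothetical.
import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

@PersistenceCapable(table = "hdfs_inodes")                 // hypothetical table
interface InodeDTO {
  @PrimaryKey long getId();                      void setId(long id);
  @Column(name = "parent_id") long getParentId(); void setParentId(long parentId);
  @Column(name = "name")      String getName();   void setName(String name);
}

public class ClusterJExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("com.mysql.clusterj.connectstring", "mgmhost:1186"); // NDB mgm server
    props.setProperty("com.mysql.clusterj.database", "hops");
    SessionFactory factory = ClusterJHelper.getSessionFactory(props);
    Session session = factory.getSession();
    InodeDTO inode = session.find(InodeDTO.class, 42L);    // primary-key read from NDB
    if (inode != null) {
      System.out.println(inode.getName());
    }
    session.close();
  }
}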
Heterogeneous Storage in HopsFS
•Storage Types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead - triple replication on spinning disks
- SSD: 3X overhead - triple replication on SSDs
- EC-RAID5: 1.4X overhead with low reconstruction overhead!
Erasure Coding
[Diagram] A sealed HDFS file is encoded into data blocks plus parity blocks:
- RS(6,3): d0..d5 + p0..p2, overhead (6+3)/6 = 1.5X
- RS(12,4): d0..d11 + p0..p3, overhead (12+4)/12 ≈ 1.33X
Global/Local Reconstruction with EC-RAID5
[Diagram] Blocks are laid out across hosts (host0..host10); every group of five data blocks (d0..d4) carries one local parity p0 (as in ZFS RAID-Z), with Reed-Solomon global parities across the stripe:
- LR(5,1) + RS(10,2): overhead (10+2+2)/10 = 1.4X
- LR(5,1) + RS(10,4): overhead (10+2+4)/10 = 1.6X
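The overhead numbers on these two slides are simply (data + parity) / data; a small helper (illustrative) makes the comparison with 3X replication explicit.

// Storage overhead of an erasure-coding layout: (data + parity blocks) / data blocks.
static double overhead(int dataBlocks, int parityBlocks) {
  return (dataBlocks + parityBlocks) / (double) dataBlocks;
}
// overhead(6, 3)      == 1.5     RS(6,3)
// overhead(12, 4)     ≈  1.33    RS(12,4)
// overhead(10, 2 + 2) == 1.4     LR(5,1) + RS(10,2) global parities
// overhead(10, 2 + 4) == 1.6     LR(5,1) + RS(10,4) global parities
// versus 3.0 for triple replication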
ePipe: Indexing HopsFS’ Namespace
[Diagram] ePipe streams metadata changes from NDB to ElasticSearch via the NDB Event API, enabling free-text search over the namespace (polyglot persistence: MetaData Designer, MetaData Entry).
•The Distributed Database is the Single Source of Truth.
•Foreign keys ensure the integrity of Extended Metadata.
Hops-YARN
YARN Architecture
[Diagram] YARN Clients and NodeManagers talk to the active ResourceMgr, with a Standby ResourceMgr ready to take over.
1. Master-slave replication of RM state (ZooKeeper nodes)
2. Agreement on the Active ResourceMgr
ResourceManager – Monolithic but Modular
[Diagram] The ResourceManager bundles several services: ApplicationMaster Service, ResourceTracker Service, Scheduler, Client Service, Admin Service, and Security. YARN Clients, App Masters and NodeManagers each talk to their respective service. In Hops-YARN, the HopsResourceTracker and HopsScheduler persist and share Cluster State through NDB (workloads on the order of ~2k and ~10k ops/s), connected via the ClusterJ Event API.
Hops-YARN Architecture
[Diagram] YARN Clients talk to the Scheduler ResourceMgr; NodeManagers report to Resource Tracker ResourceMgrs; all ResourceMgrs share state in NDB, and leader election selects a new Scheduler if the current one fails.
Hopsworks
Hopsworks – Project-Based Multi-Tenancy
•A project is a collection of
- Users with Roles
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•Per-Project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style Pricing
•Sharing across Projects
- Datasets/Topics
[Diagram] A project groups HDFS datasets (dataset 1..N) and Kafka topics (Topic 1..N).
Hopsworks – Dynamic Roles
[Diagram] Alice@gmail.com authenticates to Glassfish, which uses secure impersonation to act as her per-project identities (e.g. NSA__Alice, Users__Alice) towards HopsFS, HopsYARN and Kafka, backed by X.509 certificates.
SICS ICE - www.hops.site
A 2 MW datacenter research and test environment
Purpose: Increase knowledge, strengthen universities, companies and researchers
R&D institute, 5 lab modules, 3-4000 servers, 2-3000 square meters
Karamel/Chef for Automated Installation
[Diagram] Deployment targets: Google Compute Engine, bare metal.
Summary
•HopsFS is the world’s fastest, most scalable HDFS
implementation
•Powered by NDB, the world’s fastest database
•Thanks to Mikael, Craig, Frazer, Bernt and others
•Still room for improvement….
www.hops.io
Hops
[Hadoop For Humans]
Join us!
http://github.com/hopshadoop
Editor's Notes
  • #6: I am going to talk about realizing Bill Gates’ vision for a filesystem in the Hadoop ecosystem. “WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.” [Brian Welcker]** **http://blogs.msdn.com/b/bwelcker/archive/2013/02/11/the-vision-thing.aspx
  • #10: The main challenges with the NameNode are managing large clusters and configuring the NN.
  • #11: The slope of the bottom line reflects improvements in garbage-collection technology (Azul JVM, Shenandoah, etc.); the slope of the top line follows Moore’s Law.
  • #12: Apache Spark already moving in this direction – Tachyon
  • #15: The NameNode has multi-reader, single writer concurrency semantics. Operations that would hold the write lock for too long, starving clients, are not executed atomically. For example, deleting a directory subtree with millions of files, involves deleting batches of files, yielding the global lock for a period, then re-acquiring it, to continue the operation.
  • #19: With global lock, it’s easy.
  • #21: If something is not atomic, you have to handle all possible failures.
  • #26: The only new Protocol Buffer message we added to the DataNodes.
  • #36: Reconstruction reads are expensive.
  • #40: The Resource Manager (RM) is a bottleneck. Zookeeper throughput not high enough to persist all RM state Standby resource manager can only recover partial state All running jobs must be restarted. RM state not queryable. The RM is a State-Machine. Almost no session state to manage.
  • #44: Privileges: upload/download data, run analysis jobs. Similar to an RBAC solution. All access is via Hopsworks.
  • #49: I need some sound-effects to go with that.