HopsFS: 10X your HDFS with NDB
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io
@hopshadoop
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn,
Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
K “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Marketing 101: Celebrity Endorsements
*Turing Award Winner 2014, Father of Distributed Systems
Hi!
I’m Leslie Lamport* and
even though you’re not
using Paxos, I approve
this product.
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS*)
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
Hadoop in Context
Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
Storage: HDFS, MapR, S3, Colossus, WAS
Resource Management: YARN, Mesos, Borg
Metadata: Hive, Parquet, Authorization, Search
HDFS v2
[Diagram] An HDFS Client issues metadata operations (ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.) to the Active NameNode; a Standby NameNode takes over on failure; DataNodes (up to ~5K) store the blocks.
- Asynchronous replication of the EditLog via the Journal Nodes
- ZooKeeper: agreement on the Active NameNode
- Snapshots (fsimage) cut the EditLog
The NameNode is the Bottleneck for Hadoop
Max Pause Times for NameNode Heap Sizes*
[Chart] Max pause times (ms; 100–10,000) against JVM heap size (10–150 GB).
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
[Chart] Size (GB, 0–1,000) by year (2016–2020).
Externalizing the NameNode State
•Problem:
NameNode not scaling up with lower RAM prices
•Solution:
Move the metadata off the JVM Heap
•Move it where?
An in-memory storage system that can be efficiently
queried and managed. Preferably Open-Source.
•MySQL Cluster (NDB)
HopsFS Architecture
[Diagram] HDFS Clients talk to multiple NameNodes (one elected Leader); the NameNodes store their metadata in NDB; DataNodes store the blocks.
Pluggable DBs: Data Abstraction Layer (DAL)
NameNode (Apache v2) → DAL API (Apache v2) → NDB-DAL-Impl (GPL v2) or Other DB (Other License)
hops-2.5.0.jar, dal-ndb-2.5.0-7.5.3.jar
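To make the layering concrete, here is a rough sketch of what a storage-agnostic DAL boundary can look like (all names here are illustrative, not the actual Hops DAL API): the NameNode codes only against interfaces like these, and the NDB module implements them behind the GPL'd jar.

// Illustrative sketch only – not the actual Hops DAL API.
import java.util.Collection;
import java.util.List;
import java.util.Properties;

class StorageException extends Exception {
  StorageException(String msg) { super(msg); }
}

// Minimal metadata row as seen by the NameNode.
interface INode {
  long getId();
  long getParentId();
  String getName();
}

// The NameNode depends only on storage-agnostic data-access interfaces like this one.
interface INodeDataAccess {
  INode findByParentIdAndName(long parentId, String name) throws StorageException;
  List<INode> findByParentId(long parentId) throws StorageException;
  void prepare(Collection<INode> removed, Collection<INode> added,
               Collection<INode> modified) throws StorageException;
}

// A database-specific module (e.g. the NDB/ClusterJ implementation) provides
// the connector and the data-access objects at runtime.
interface StorageConnector {
  void setConfiguration(Properties conf) throws StorageException;
  void beginTransaction() throws StorageException;
  void commit() throws StorageException;
  void rollback() throws StorageException;
  <T> T getDataAccess(Class<T> type);   // e.g. getDataAccess(INodeDataAccess.class)
}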
The Global Lock in the NameNode
HDFS NameNode Internals
[Diagram] Client RPCs (mkdir, getBlockLocations, createFile, ...) arrive at a Listener (NIO thread), are read by Reader threads (ipc.server.read.threadpool.size, default 1) into the Call Queue, and are picked up by Handler threads (dfs.namenode.service.handler.count, default 10). Handlers take the global FSNamesystem lock to update the in-memory namespace and append to the EditLog buffer, which is flushed to the Journal Nodes (EditLog1..3); a Responder (NIO thread) returns done RPCs (ackIds) to clients via the ConnectionList.
HopsFS NameNode Internals
[Diagram] The same RPC pipeline (Listener, Readers, Call Queue, Handlers, Responder), but there is no global lock and no EditLog: Handlers execute transactions against NDB through the DAL API and DAL-Impl, with the metadata stored in NDB tables (inodes, block_infos, replicas, leases, ...). Making this transactional mapping correct and fast is the HARD PART.
Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
•Acquire FS locks in agreed order using FS Hierarchy.
•Block-level operations follow the same agreed order.
•No cycles => Freedom from deadlock
•Pessimistic Concurrency Control ensures progress
[Diagram] A client mv on /user/jim/myFile, a client read, and a DataNode block_report all lock the inodes they touch in the same hierarchical order.
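A minimal sketch of the idea, reusing the illustrative DAL interfaces from earlier (not the actual Hops code): every operation resolves its path root-to-leaf and takes row locks in that order, so no two operations can ever wait on each other in a cycle.

// Illustrative hierarchical lock ordering – not the actual Hops implementation.
// Row locks are acquired root-to-leaf inside the surrounding transaction.
List<INode> lockPath(INodeDataAccess dataAccess, String path) throws StorageException {
  List<INode> locked = new java.util.ArrayList<>();
  long parentId = 0L;                         // assume 0 is the root inode id
  for (String component : path.substring(1).split("/")) {
    // Each lookup takes a lock on the row, always in path order.
    INode inode = dataAccess.findByParentIdAndName(parentId, component);
    if (inode == null) break;                 // path does not exist (yet)
    locked.add(inode);
    parentId = inode.getId();
  }
  return locked;                              // locks are released when the transaction commits
}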
Per Transaction Cache
•Reusing the HDFS codebase resulted in too many
roundtrips to the database per transaction.
•We cache intermediate transaction results at
NameNodes (i.e., snapshot).
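As a sketch, again with the illustrative interfaces from above rather than the real Hops classes: each file-system operation carries a transaction context that snapshots the rows it has already read, so repeated lookups within one operation cost no extra round-trips.

// Illustrative per-transaction snapshot cache – not the actual Hops implementation.
class TransactionContext {
  private final java.util.Map<Long, INode> snapshot = new java.util.HashMap<>();
  private final INodeDataAccess dataAccess;

  TransactionContext(INodeDataAccess dataAccess) { this.dataAccess = dataAccess; }

  INode find(long parentId, String name) throws StorageException {
    for (INode cached : snapshot.values()) {            // hit: no database round-trip
      if (cached.getParentId() == parentId && name.equals(cached.getName())) return cached;
    }
    INode inode = dataAccess.findByParentIdAndName(parentId, name);  // miss: one round-trip
    if (inode != null) snapshot.put(inode.getId(), inode);
    return inode;
  }
  // On commit, the new/modified/removed rows collected here are sent to NDB in one batched prepare().
}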
Sometimes, Transactions Just ain’t Enough
•Large Subtree Operations (delete, mv, set-quota)
can’t always be executed in a single Transaction.
•4-phase Protocol (see the sketch after this list)
• Isolation and Consistency
• Aggressive batching
• Transparent failure handling
• Failed ops retried on new NN.
• Lease timeout for failed clients.
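One plausible shape for such a protocol, sketched with hypothetical helper methods (the real Hops protocol differs in its details):

// Illustrative skeleton of a multi-transaction subtree operation – not the actual Hops protocol.
void subtreeDelete(long subtreeRootId) throws StorageException {
  // Phase 1: set a subtree-lock flag on the root in its own small transaction,
  //          so new operations entering the subtree fail fast and retry.
  setSubtreeLockFlag(subtreeRootId, myNameNodeId);
  try {
    // Phase 2: quiesce – wait for in-flight transactions inside the subtree to finish.
    waitForActiveOperations(subtreeRootId);
    // Phase 3: run the operation bottom-up as many small, aggressively batched transactions.
    for (List<Long> batch : listInodeIdsBottomUp(subtreeRootId, 10_000)) {
      deleteInodesInOneTransaction(batch);
    }
  } finally {
    // Phase 4: clear the flag; if this NameNode dies, another NameNode sees the stale
    //          lock (its owner is no longer alive) and retries the whole operation.
    clearSubtreeLockFlag(subtreeRootId);
  }
}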
Leader Election using NDB
•Leader to coordinate replication/lease management
•NDB as shared memory for Leader Election of NN.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
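A minimal sketch of database-backed leader election (schema and method names are illustrative, not the protocol from the paper): every NameNode refreshes a heartbeat row in a small transaction, and the live NameNode with the smallest id acts as leader.

// Illustrative leader election using the database as shared memory – not the actual Hops protocol.
boolean tryBecomeLeader(long myId, long nowMillis, long leaseMillis) throws StorageException {
  connector.beginTransaction();
  try {
    heartbeats.upsert(myId, nowMillis);                      // refresh my heartbeat row
    long leaderId = Long.MAX_VALUE;
    for (Heartbeat hb : heartbeats.findAll()) {              // rows of (nameNodeId, lastSeen)
      if (hb.getLastSeen() >= nowMillis - leaseMillis) {     // still alive within the lease
        leaderId = Math.min(leaderId, hb.getId());
      }
    }
    connector.commit();
    return leaderId == myId;                                 // smallest live id is the leader
  } catch (StorageException e) {
    connector.rollback();
    throw e;
  }
}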
Path Component Caching
•The most common operation in HDFS is resolving
pathnames to inodes
- 67% of operations in Spotify’s Hadoop workload
•We cache recently resolved inodes at NameNodes so
that we can resolve them using a single batch
primary key lookup.
- We validate cache entries as part of transactions
- The cache converts O(N) round trips to the database to
O(1) for a hit for all inodes in a path.
Path Component Caching
•Resolving a path of length N gives O(N) round-trips
•With our cache, O(1) round-trip for a cache hit
[Diagram] Without the cache, resolving /user/jim/myFile takes three sequential round-trips to NDB: getInode(0, “user”), getInode(1, “jim”), getInode(2, “myFile”). With the cache, the NameNode resolves the path locally via getInodes(“/user/jim/myFile”) and issues a single batched validateInodes([(0, “user”), (1, “jim”), (2, “myFile”)]) primary-key read against NDB.
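A sketch of the cache-hit path (helper names are illustrative): look the components up locally, then confirm them with a single batched primary-key read instead of N sequential lookups.

// Illustrative path-component cache – not the actual Hops implementation.
INode[] resolvePath(String path) throws StorageException {
  INode[] cached = cache.getInodes(path);                 // local lookup, no database round-trip
  if (cached != null) {
    // One batched primary-key read validates every (parentId, name) pair in the path.
    if (dataAccess.validateInodes(cached)) return cached; // O(1) round-trips on a hit
    cache.invalidate(path);                               // stale entry: fall back to full resolve
  }
  // Cache miss: resolve component by component, O(N) round-trips.
  String[] components = path.substring(1).split("/");
  INode[] resolved = new INode[components.length];
  long parentId = 0L;                                     // assume 0 is the root inode id
  for (int i = 0; i < components.length; i++) {
    resolved[i] = dataAccess.findByParentIdAndName(parentId, components[i]);
    parentId = resolved[i].getId();
  }
  cache.putInodes(path, resolved);
  return resolved;
}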
Hotspots
•Mikael saw 1-2 maxed out LDM threads
•Partitioning by parent inodeId meant
fantastic performance for ‘ls’
- Partition-pruned index scans
- At high load hotspots appeared at the
top of the directory hierarchy
•Current Solution:
- Cache the root inode at NameNodes
- Pseudo-random partition key for top-level
directories, but keep partition by parent
inodeId at lower levels
- At least 4x throughput increase!
[Example namespace tree] / → /Users, /Projects → /NSA, /MyProj → /Dataset1, /Dataset2
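The rule can be sketched as follows (illustrative, not the exact Hops code): top-level directories get a pseudo-random partition key so they spread over the NDB data nodes, while deeper inodes keep the parent-id key that makes ‘ls’ a partition-pruned index scan.

// Illustrative choice of NDB partition key for an inode – not the actual Hops code.
long partitionKeyFor(INode inode, long rootId) {
  if (inode.getId() == rootId) {
    return rootId;                               // the root itself is cached at the NameNodes
  }
  if (inode.getParentId() == rootId) {
    // Top-level directories: pseudo-random key to avoid hotspots at the top of the hierarchy.
    return inode.getName().hashCode();
  }
  // Everything else: partition by parent id, so 'ls <dir>' stays a partition-pruned index scan.
  return inode.getParentId();
}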
Scalable Block Reporting
•On 100PB+ clusters, internal maintenance protocol
traffic makes up much of the network traffic
•Block Reporting
- Leader Load Balances
- Work-steal when exiting
safe-mode
[Diagram] DataNodes send block reports to NameNodes, which store Blocks/SafeBlocks in NDB; the Leader NameNode load-balances reports across NameNodes, and NameNodes work-steal safe-block processing when exiting safe mode.
HopsFS Performance
HopsFS Metadata Scaleout
[Chart] Assuming a 256MB block size and a 100 GB JVM heap for Apache Hadoop.
Spotify Workload
HopsFS Throughput (Spotify Workload - PM)
[Chart] Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances.
HopsFS Throughput (Spotify Workload - PM)
[Chart] Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances.
HopsFS Throughput (Spotify Workload - AM)
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE.
NameNodes: Xeon E5-2620 2.40GHz machines and 10GbE.
Per Operation HopsFS Throughput
NDB Performance Lessons
•NDB is quite stable!
•ClusterJ is (nearly) good enough
- sun.misc.Cleaner has trouble keeping up at high
throughput – OOM for ByteBuffers
- Transaction hint behavior not respected
- DTO creation time affected by Java Reflection
- Nice features would be:
• Projections
• Batched scan operations support
• Event API
•Event API and Asynchronous API needed for
performance in Hops-YARN
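For reference, a minimal ClusterJ example (the ClusterJ API calls are real; the table, columns and connect string are illustrative): a table is mapped to an annotated interface and read by primary key through a Session.

// Minimal ClusterJ sketch – table and column names are hypothetical.
import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

@PersistenceCapable(table = "hdfs_inodes")                 // hypothetical table
interface InodeDTO {
  @PrimaryKey long getId();                      void setId(long id);
  @Column(name = "parent_id") long getParentId(); void setParentId(long parentId);
  @Column(name = "name")      String getName();   void setName(String name);
}

public class ClusterJExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("com.mysql.clusterj.connectstring", "mgmhost:1186"); // NDB mgm server
    props.setProperty("com.mysql.clusterj.database", "hops");
    SessionFactory factory = ClusterJHelper.getSessionFactory(props);
    Session session = factory.getSession();
    InodeDTO inode = session.find(InodeDTO.class, 42L);    // primary-key read from NDB
    if (inode != null) {
      System.out.println(inode.getName());
    }
    session.close();
  }
}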
Heterogeneous Storage in HopsFS
•Storage Types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead - triple replication on spinning disks
- SSD: 3X overhead - triple replication on SSDs
- EC-RAID5: 1.4X overhead with low reconstruction overhead!
Erasure Coding
[Diagram] A sealed HDFS file is encoded into data blocks plus parity blocks:
- RS(6,3): d0..d5 + p0..p2, overhead (6+3)/6 = 1.5X
- RS(12,4): d0..d11 + p0..p3, overhead (12+4)/12 ≈ 1.33X
Global/Local Reconstruction with EC-RAID5
[Diagram] Blocks are laid out across hosts (host0..host10); every group of five data blocks (d0..d4) carries one local parity p0 (as in ZFS RAID-Z), with Reed-Solomon global parities across the stripe:
- LR(5,1) + RS(10,2): overhead (10+2+2)/10 = 1.4X
- LR(5,1) + RS(10,4): overhead (10+2+4)/10 = 1.6X
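The overhead numbers on these two slides are simply (data + parity) / data; a small helper (illustrative) makes the comparison with 3X replication explicit.

// Storage overhead of an erasure-coding layout: (data + parity blocks) / data blocks.
static double overhead(int dataBlocks, int parityBlocks) {
  return (dataBlocks + parityBlocks) / (double) dataBlocks;
}
// overhead(6, 3)      == 1.5     RS(6,3)
// overhead(12, 4)     ≈  1.33    RS(12,4)
// overhead(10, 2 + 2) == 1.4     LR(5,1) + RS(10,2) global parities
// overhead(10, 2 + 4) == 1.6     LR(5,1) + RS(10,4) global parities
// versus 3.0 for triple replication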
ePipe: Indexing HopsFS’ Namespace
[Diagram] ePipe streams metadata changes from NDB to ElasticSearch via the NDB Event API, enabling free-text search over the namespace (polyglot persistence: MetaData Designer, MetaData Entry).
•The Distributed Database is the Single Source of Truth.
•Foreign keys ensure the integrity of Extended Metadata.
Hops-YARN
YARN Architecture
[Diagram] YARN Clients and NodeManagers talk to the active ResourceMgr, with a Standby ResourceMgr ready to take over.
1. Master-slave replication of RM state (ZooKeeper nodes)
2. Agreement on the Active ResourceMgr
ResourceManager – Monolithic but Modular
[Diagram] The ResourceManager bundles several services: ApplicationMaster Service, ResourceTracker Service, Scheduler, Client Service, Admin Service, and Security. YARN Clients, App Masters and NodeManagers each talk to their respective service. In Hops-YARN, the HopsResourceTracker and HopsScheduler persist and share Cluster State through NDB (workloads on the order of ~2k and ~10k ops/s), connected via the ClusterJ Event API.
Hops-YARN Architecture
[Diagram] YARN Clients talk to the Scheduler ResourceMgr; NodeManagers report to Resource Tracker ResourceMgrs; all ResourceMgrs share state in NDB, and leader election selects a new Scheduler if the current one fails.
Hopsworks
Hopsworks – Project-Based Multi-Tenancy
•A project is a collection of
- Users with Roles
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•Per-Project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style Pricing
•Sharing across Projects
- Datasets/Topics
[Diagram] A project groups HDFS datasets (dataset 1..N) and Kafka topics (Topic 1..N).
Hopsworks – Dynamic Roles
[Diagram] Alice@gmail.com authenticates to Glassfish, which uses secure impersonation to act as her per-project identities (e.g. NSA__Alice, Users__Alice) towards HopsFS, HopsYARN and Kafka, backed by X.509 certificates.
SICS ICE - www.hops.site
A 2 MW datacenter research and test environment
Purpose: Increase knowledge, strengthen universities, companies and researchers
R&D institute, 5 lab modules, 3-4000 servers, 2-3000 square meters
Karamel/Chef for Automated Installation
[Diagram] Deployment targets: Google Compute Engine, bare metal.
Summary
•HopsFS is the world’s fastest, most scalable HDFS
implementation
•Powered by NDB, the world’s fastest database
•Thanks to Mikael, Craig, Frazer, Bernt and others
•Still room for improvement….
www.hops.io
Hops
[Hadoop For Humans]
Join us!
http://github.com/hopshadoop
Editor's Notes
  • #6: I am going to talk about realizing Bill Gates’ vision for a filesystem in the Hadoop ecosystem. “WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.” [Brian Welcker]** **http://blogs.msdn.com/b/bwelcker/archive/2013/02/11/the-vision-thing.aspx
  • #10: The main challenges with the NameNode are managing large clusters and configuring the NN.
  • #11: The slope of the bottom line reflects improvements in garbage-collection technology (Azul JVM, Shenandoah, etc.); the slope of the top line follows Moore’s Law.
  • #12: Apache Spark already moving in this direction – Tachyon
  • #15: The NameNode has multi-reader, single writer concurrency semantics. Operations that would hold the write lock for too long, starving clients, are not executed atomically. For example, deleting a directory subtree with millions of files, involves deleting batches of files, yielding the global lock for a period, then re-acquiring it, to continue the operation.
  • #19: With global lock, it’s easy.
  • #21: If something is not atomic, you have to handle all possible failures.
  • #26: The only new Protocol Buffer message we added to the DataNodes.
  • #36: Reconstruction reads are expensive.
  • #40: The Resource Manager (RM) is a bottleneck. Zookeeper throughput not high enough to persist all RM state Standby resource manager can only recover partial state All running jobs must be restarted. RM state not queryable. The RM is a State-Machine. Almost no session state to manage.
  • #44: Privileges: upload/download data, run analysis jobs. Similar to an RBAC solution. All access is via Hopsworks.
  • #49: I need some sound-effects to go with that.