(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
Richard Wareing
Production Engineer
October 27th, 2017
Realtime XFS
Solving The Metadata Problem
Agenda
1 The Problem
2 XFS w/ Realtime Subvolumes
3 GlusterFS Application
4 Facebook Enhancements
5 Questions?
Quick Tangent
Some Perspective: Strengths
▪ GlusterFS xlator architecture is elegant; don’t give this up… ever
▪ Tends to enforce good design on new developers
▪ Good isolation between components (with some cheating in some places)
▪ IO latencies: We have yet to see anything in the open source realm come close to the metadata latencies we see on GlusterFS
▪ Simplicity: Very easy to set up, configure & manage
▪ Open Source: Community is strong and growing, with diverse use-cases
Quick Tangent
Some Perspective: Things we need to work on…
▪ Scaling: There continues to be a huge disparity between GlusterFS and its contemporaries. Ceph & HDFS both scale dramatically further.
▪ Code Quality: More commenting! This helps bring in more developers and eases code review.
▪ Lack of JBOD support: GlusterFS requires some form of RAID[5-6], which adds complexity and expense
▪ Drains/Rebuilds: Without use of some XFS tricks, this is still quite slow, taking weeks vs. days.
The Problem
GlusterFS & Metadata
▪ To preface this…
▪ Leveraging underlying FS for metadata storage was a good decision
▪ Simplicity of design, reliability
▪ But this design choice isn’t without problems…
▪ Heavy reliance on metadata
▪ xattrs → AFR “journal”, xlator attributes for DHT, quota, etc.
The Problem
GlusterFS & Metadata
▪ DHT directory operations “magnify” → Single “ls” → 1000’s of VFS system calls
[Figure: “ls -l /foo” issues readdir across Subvol 0, Subvol 1 and Subvol 2 (AFR), backed by Bricks 1-9. Per-brick latencies shown: 10, 8, 5, 15, 2, 11, 50, 15 and 20 ms. The slow sub-volume sets the overall latency at 50 ms. Callouts: “readdir Optimize”, “AFR Optimizations”.]
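The effect of one slow brick can be illustrated with a small Python sketch. It assumes, as the 50 ms total in the diagram suggests, that the per-brick requests are issued in parallel and that the client waits for all of them. The function names are hypothetical, and the latencies are the nine values from the slide without any claim about which brick each belongs to.

```python
# Per-brick latencies (ms) as listed on the slide; the brick order is not assumed.
BRICK_LATENCIES_MS = [10, 8, 5, 15, 2, 11, 50, 15, 20]


def parallel_fanout_latency(latencies_ms):
    """Latency of a fan-out that waits for every reply: the slowest reply wins."""
    return max(latencies_ms)


def serial_latency(latencies_ms):
    """Latency if the same requests were issued one after another."""
    return sum(latencies_ms)


print("parallel fan-out:", parallel_fanout_latency(BRICK_LATENCIES_MS), "ms")
print("serial:", serial_latency(BRICK_LATENCIES_MS), "ms")
```

Under this assumption, speeding up the eight faster bricks changes nothing; only the slowest reply matters.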
The Problem
Quantifying the Problem
▪ blktrace to examine physical block device I/Os
▪ Examine RWBS field
▪ Example 1: CREATE heavy
▪ Blob-store-ish workload
▪ 51% metadata
[Chart: CREATE heavy workload]
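A minimal sketch of how such a percentage could be derived from blkparse's default text output, where the RWBS field carries an `M` flag for metadata I/O. The function name and the sample lines are made up for illustration; the sketch counts completed events rather than bytes, and the slide does not say which of the two its figures refer to.

```python
def metadata_fraction(lines, action="C"):
    """Fraction of block I/O events whose RWBS field carries the 'M' (metadata) flag."""
    total = 0
    metadata = 0
    for line in lines:
        fields = line.split()
        # Default blkparse event line:
        #   dev cpu seq timestamp pid action RWBS sector + nblocks [process]
        if len(fields) < 7 or fields[5] != action:
            continue  # skip other actions and non-event lines
        total += 1
        if "M" in fields[6]:
            metadata += 1
    return metadata / total if total else 0.0


# Hypothetical sample lines, NOT measurements from the talk.
SAMPLE = [
    "8,16 0 1 0.000000000 1234 Q WM 1000 + 8 [glusterfsd]",
    "8,16 0 2 0.000100000 0 C WM 1000 + 8 [0]",
    "8,16 0 3 0.000200000 0 C W 2000 + 64 [0]",
    "8,16 0 4 0.000300000 0 C WSM 3000 + 8 [0]",
    "8,16 0 5 0.000400000 0 C R 4000 + 16 [0]",
]

print(f"{metadata_fraction(SAMPLE):.0%} metadata")
```

Filtering on a single action (here `C`, completion) avoids counting the same request once per trace stage.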
The Problem
Quantifying the Problem
▪ Example 2: MKDIR heavy workload
▪ Deep directory structures
▪ “FS as database” use-case
▪ 83% metadata
[Chart: MKDIR heavy workload]
The Problem
Traditional Solutions
▪ Page Cache
▪ Pros: Simplicity → just “works”; works with any storage system
▪ Cons: DRAM is expensive, limited space to cache objects, doesn’t help write-heavy use-cases
▪ Dedicated Metadata Stores
▪ Pros: Good designs can scale well; works with write-heavy workloads
▪ Cons: Added complexity → reliability, management overhead; maintaining consistency can be challenging; very specific to storage system
▪ Can we combine the best of both?
XFS w/ Realtime Subvol
Overview
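[Figure: two layouts side by side. Standard XFS filesystem: metadata, intent log (journal) and data blocks all on one standard block device. XFS filesystem with a realtime subvolume: metadata, intent log (journal) and data blocks on the standard block device, plus realtime data blocks on a separate realtime block device. Annotation: “Optional: Can be moved.”]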
XFS w/ Realtime Subvol
Realtime Allocator
[Figure: the realtime device divided into RT Extent 1 … RT Extent n, tracked by the realtime bitmap block allocator. Extents are tunable to any fixed size and guaranteed contiguous.]
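The idea of fixed-size extents tracked by a bitmap can be sketched as a toy model in Python. This is not how the XFS realtime allocator is implemented; the class name, the first-fit search and the sizes are assumptions chosen only to show why fixed-size extents make contiguity easy to guarantee.

```python
class FixedExtentBitmap:
    """Toy allocator: one bit per fixed-size extent, first-fit contiguous runs."""

    def __init__(self, n_extents, extent_size_bytes):
        self.extent_size = extent_size_bytes
        self.used = [False] * n_extents

    def allocate(self, nbytes):
        """Reserve enough whole extents for nbytes; return (start, count) or None."""
        need = max(1, -(-nbytes // self.extent_size))  # ceiling division
        run = 0
        for i, in_use in enumerate(self.used):
            run = 0 if in_use else run + 1
            if run == need:
                start = i - need + 1
                for j in range(start, i + 1):
                    self.used[j] = True
                return start, need
        return None  # no contiguous run is long enough

    def free(self, start, count):
        """Release a previously allocated run of extents."""
        for j in range(start, start + count):
            self.used[j] = False


MIB = 1 << 20
bitmap = FixedExtentBitmap(n_extents=8, extent_size_bytes=MIB)
print(bitmap.allocate(3 * MIB))  # three whole extents
print(bitmap.allocate(1))        # even one byte consumes a whole extent
```

The second call shows the trade-off of a fixed extent size: small files waste space, which is one reason to keep small allocations off the realtime device.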
GlusterFS Application
Brick Filesystem Layout
/mnt/brick1
▪ sdb1 on SSD (sdb): Metadata, Intent Log, File Cache (Optional)
▪ sdc1 on HDD (sdc): Data Blocks
/mnt/brick2
▪ sdb2 on SSD (sdb): Metadata, Intent Log, File Cache (Optional)
▪ sdd1 on HDD (sdd): Data Blocks
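A layout like this could be set up roughly as follows. The Python sketch below only builds the command lines and does not run them. The `/dev/` prefixes, the helper name and the final `chattr +t` step (marking the brick root so new files inherit the realtime flag) are assumptions rather than details from the slide; `mkfs.xfs -r rtdev=`, the `rtdev=` mount option and `xfs_io`'s `chattr +t` are standard xfsprogs features.

```python
import shlex


def brick_commands(meta_dev, rt_dev, mountpoint, rt_extent_size=None):
    """Build (but do not run) commands for one brick with a realtime subvolume."""
    rt_opts = f"rtdev={rt_dev}"
    if rt_extent_size:
        rt_opts += f",extsize={rt_extent_size}"  # fixed realtime extent size
    return [
        # Metadata and log on the SSD partition, realtime data on the HDD partition.
        ["mkfs.xfs", "-r", rt_opts, meta_dev],
        ["mount", "-t", "xfs", "-o", f"rtdev={rt_dev}", meta_dev, mountpoint],
        # Assumption: new files under the brick root should inherit the realtime flag.
        ["xfs_io", "-c", "chattr +t", mountpoint],
    ]


for argv in brick_commands("/dev/sdb1", "/dev/sdc1", "/mnt/brick1"):
    print(shlex.join(argv))
```

Building argument lists instead of shell strings keeps the sketch safe to adapt: each list can be passed to `subprocess.run` unchanged once the device names have been checked.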
GlusterFS Application
Benefits: Combines best of both worlds (mostly)
▪ Pros
▪ Simplicity → just “works”, no changes to GlusterFS core
▪ Works with any storage system
▪ Works with write- or read-heavy metadata workloads
▪ SSD-based file caching → Trivial to implement → “.cache” directory
▪ Cons
▪ Does not scale independently of bricks
▪ Changes potentially require kernel patches
▪ Realtime allocator is optimized for a single thread, which may present problems for some workloads → JBOD support?
Facebook Enhancements
Kernel Patches (Pending XFS maintainer review)
▪ statfs
▪ Return realtime device usage if rtinherit flag is set → More intuitive
▪ rtallocsize
▪ Small (initial) allocations stored on realtime subvolume automatically
▪ rtfallbackpct
▪ Allocate data to realtime device if data block device (e.g. SSD) usage is above rtfallbackpct
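The rtfallbackpct rule as stated above can be written out as a small userspace illustration. The actual patches live in the kernel and are not shown in the slides, so the function names, the percentage arithmetic and the example threshold of 90 are all assumptions; only the rule itself comes from the slide.

```python
def used_pct(total_blocks, free_blocks):
    """Percentage of the data block device that is in use."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return 100.0 * (total_blocks - free_blocks) / total_blocks


def pick_device(data_dev_used_pct, rtfallbackpct):
    """Fall back to the realtime device once the data device is fuller than the threshold."""
    return "realtime" if data_dev_used_pct > rtfallbackpct else "data"


# Hypothetical numbers: an SSD with 1000 blocks, 50 of them free, threshold 90%.
usage = used_pct(total_blocks=1000, free_blocks=50)
print(usage, pick_device(usage, rtfallbackpct=90))
```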
Thank-you!
Questions?
GlusterFS w/ Tiered XFS
