Ceph Tech Talks:
CephFS Update
John Spray
john.spray@redhat.com
Feb 2016
2 Ceph Tech Talks: CephFS
Agenda
● Recap: CephFS architecture
● What's new for Jewel?
● Scrub/repair
● Fine-grained authorization
● RADOS namespace support in layouts
● OpenStack Manila
● Experimental multi-filesystem functionality
3 Ceph Tech Talks: CephFS
Ceph architecture
● RADOS: a software-based, reliable, autonomous, distributed object store
comprised of self-healing, self-managing, intelligent storage nodes and
lightweight monitors
● LIBRADOS: a library allowing apps to directly access RADOS
(C, C++, Java, Python, Ruby, PHP)
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out
metadata management
● [Diagram: apps sit on LIBRADOS/RGW, hosts/VMs on RBD, and clients on CephFS,
all on top of RADOS]
4 Ceph Tech Talks: CephFS
CephFS
● POSIX interface: drop-in replacement for any local or
network filesystem
● Scalable data: files stored directly in RADOS
● Scalable metadata: cluster of metadata servers
● Extra functionality: snapshots, recursive statistics
● Same storage backend as object (RGW) + block
(RBD): no separate silo needed for file
5 Ceph Tech Talks: CephFS
Components
● [Diagram: a CephFS client on a Linux host sends metadata operations to the
MDS daemons and reads/writes file data directly to the OSDs; the Ceph server
daemons are the Monitors (M), OSDs and MDSs]
6 Ceph Tech Talks: CephFS
Why build a distributed filesystem?
● Existing filesystem-using workloads aren't going away
● POSIX filesystems are a lingua franca, for
administrators as well as applications
● Interoperability with other storage systems in the data
lifecycle (e.g. backup, archival)
● Container “volumes” on newer platforms are filesystems
● Permissions and directories are actually useful concepts!
7 Ceph Tech Talks: CephFS
Why not build a distributed filesystem?
● Harder to scale than object stores: entities
(inodes, dentries, dirs) are related to one another, so good
locality is needed for performance
● Some filesystem-using applications are gratuitously
inefficient (e.g. redundant “ls -l” calls, using files for
IPC) because they expect local-filesystem latencies
● Complexity from stateful clients: taking locks and
opening files requires coordination, and clients
can interfere with one another's responsiveness
8 Ceph Tech Talks: CephFS
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data <pg-num>
ceph osd pool create fs_metadata <pg-num>
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
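● With cephx authentication enabled (the default), the mount also needs client
credentials; a minimal sketch, where the client name and secretfile path are
placeholders:
mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse -n client.admin /mnt/ceph   # FUSE client alternative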
9 Ceph Tech Talks: CephFS
Scrub and repair
10 Ceph Tech Talks: CephFS
Scrub/repair status
● In general, resilience and self-repair is RADOS's job:
all CephFS data & metadata lives in RADOS objects
● CephFS scrub/repair is for handling disasters: serious
software bugs, or permanently lost data in RADOS
● In Jewel, can now handle and recover from many
forms of metadata damage (corruptions, deletions)
● Repair tools require expertise: primarily for use during
(rare) support incidents, not everyday user activity
11 Ceph Tech Talks: CephFS
Scrub/repair: handling damage
● Fine-grained damage status (“damage ls”) instead of
taking whole rank offline
● Detect damage during normal load of metadata, or
during scrub
ceph tell mds.<id> damage ls
ceph tell mds.<id> damage rm <damage_id>
ceph mds repaired <rank>
● Can repair damaged statistics online; other repairs
happen offline (i.e. stop the MDS and write directly to
the metadata pool)
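● A typical flow once a rank has been marked damaged might look like this (the
daemon name, damage ID and rank are placeholders):
ceph tell mds.a damage ls              # inspect recorded damage entries
ceph tell mds.a damage rm <damage_id>  # clear an entry after fixing the cause
ceph mds repaired 0                    # allow the rank to be started again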
12 Ceph Tech Talks: CephFS
Scrub/repair: online scrub commands
● Forward scrub: traversing metadata from root
downwards
ceph daemon mds.<id> scrub_path <path>
ceph daemon mds.<id> scrub_path <path> recursive
ceph daemon mds.<id> scrub_path <path> repair
ceph daemon mds.<id> tag path <path> <tag>
● These commands will give you success or failure info
on completion, and emit cluster log messages about
issues.
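● For example, a recursive repair scrub of the whole tree from the root (the
daemon name is a placeholder):
ceph daemon mds.a scrub_path / recursive repair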
13 Ceph Tech Talks: CephFS
Scrub/repair: offline repair commands
● Backward scrub: iterating over all data objects and
trying to relate them back to the metadata
● Potentially long running, but can run workers in parallel
● Find all the objects in files:
● cephfs-data-scan scan_extents <data pool>
● Find (or insert) all the files into the metadata:
● cephfs-data-scan scan_inodes <data pool>
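● Put together, an offline recovery pass might look roughly like this (the pool
name is a placeholder, and all MDS daemons must be stopped first):
systemctl stop ceph-mds.target               # stop the MDS cluster
cephfs-data-scan scan_extents fs_data        # rebuild file size/extent info
cephfs-data-scan scan_inodes fs_data         # re-link recovered inodes into the metadata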
14 Ceph Tech Talks: CephFS
Scrub/repair: parallel execution
● New functionality in RADOS to enable iterating over
subsets of the overall set of objects in a pool
● Currently one must coordinate the collection of workers by
hand (or with a short shell script)
● Example: invoke worker 3 of 10 like this:
● cephfs-data-scan scan_inodes --worker_n 3
--worker_m 10
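● A minimal sketch of launching all ten workers in parallel from a shell (the
data pool name is a placeholder):
for n in $(seq 0 9); do
  cephfs-data-scan scan_inodes --worker_n $n --worker_m 10 fs_data &
done
wait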
15 Ceph Tech Talks: CephFS
Scrub/repair: caveats
● This is still disaster recovery functionality: don't run
“repair” commands for fun.
● Not multi-MDS aware: commands operate directly on a
single MDS's share of the metadata.
● Not yet auto-run in background like RADOS scrub
16 Ceph Tech Talks: CephFS
Fine-grained authorisation
17 Ceph Tech Talks: CephFS
CephFS authorization
● Clients need to talk to MDS daemons, mons and
OSDs.
● OSD auth caps could already limit clients to
particular data pools, but couldn't control which parts of
the filesystem metadata they saw
● New MDS auth caps enable limiting access by path
and uid
18 Ceph Tech Talks: CephFS
MDS auth caps
● Example: we create a dir `foodir` whose layout is
set to pool `foopool`, and a client key 'foo' that
can only see metadata within that dir and data within
that pool.
ceph auth get-or-create client.foo \
  mds 'allow rw path=/foodir' \
  osd 'allow rw pool=foopool' \
  mon 'allow r'
● Client must mount with “-r /foodir” to treat that as its
root (it doesn't have the capability to see the real root)
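● For instance, with the FUSE client the restricted key might be used like this
(the keyring path is a placeholder):
ceph-fuse -n client.foo --keyring=/etc/ceph/ceph.client.foo.keyring -r /foodir /mnt/foo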
19 Ceph Tech Talks: CephFS
RADOS namespaces in file
layouts
20 Ceph Tech Talks: CephFS
RADOS namespaces
● Namespaces offer a cheaper way to divide up objects
than pools.
● Pools consume physical resources (i.e. they create
PGs), whereas namespaces are effectively just a prefix
to object names.
● OSD auth caps can be limited by namespaces: when
we need to isolate two clients (e.g. two cephfs clients)
we can give them auth caps that allow access to
different namespaces.
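● To see the idea at the RADOS level, the rados CLI can write and list objects
within a namespace (pool and namespace names are placeholders):
rados -p fs_data --namespace=clientA put obj1 ./somefile
rados -p fs_data --namespace=clientA ls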
21 Ceph Tech Talks: CephFS
Namespaces in layouts
● Existing fields: pool, stripe_unit, stripe_count,
object_size
● New field: pool_namespace
● setfattr -n ceph.file.layout.pool_namespace -v <ns> <file>
● setfattr -n ceph.dir.layout.pool_namespace -v <ns> <dir>
● As with setting layout.pool, the data gets written there,
but the backtrace continues to be written to the default
pool (and default namespace). The backtrace is not
accessed by the client, so this doesn't affect client-side
auth configuration.
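● Putting this together with the auth caps from earlier, a per-client namespace
might be set up roughly like this (names are placeholders; assumes the OSD cap
syntax that accepts a namespace= clause alongside pool=):
mkdir /mnt/ceph/clientA
setfattr -n ceph.dir.layout.pool_namespace -v clientA /mnt/ceph/clientA
ceph auth get-or-create client.clientA \
  mds 'allow rw path=/clientA' \
  osd 'allow rw pool=fs_data namespace=clientA' \
  mon 'allow r'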
22 Ceph Tech Talks: CephFS
OpenStack Manila
23 Ceph Tech Talks: CephFS
Manila
● The OpenStack shared filesystem service
● Manila users request filesystem storage as shares
which are provisioned by drivers
● CephFS driver implements shares as directories:
● Manila expects shares to be size-constrained, so we
use CephFS quotas
● Client mount commands include the -r flag to treat the
share dir as the root
● Capacity stats reported for that directory using rstats
● Clients restricted to their directory and
pool/namespace using new auth caps
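● Under the hood these are ordinary CephFS virtual xattrs; a sketch of what the
driver effectively does for a share directory (the path is a placeholder):
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/volumes/share1   # 10 GiB quota
getfattr -n ceph.dir.rbytes /mnt/cephfs/volumes/share1                       # recursive usage stats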
24 Ceph Tech Talks: CephFS
CephFSVolumeClient
● A new Python interface in the Ceph tree, designed for
Manila and similar frameworks.
● Wraps up the directory + auth caps mechanism as a
“volume” concept.
● [Diagram: Manila's CephFS driver calls CephFSVolumeClient, which talks to the
Ceph cluster over the network via libcephfs and librados]
github.com/openstack/manila
github.com/ceph/ceph
25 Ceph Tech Talks: CephFS
Experimental multi-filesystem
functionality
26 Ceph Tech Talks: CephFS
Multiple filesystems
● Historical 1:1 mapping between Ceph cluster (RADOS)
and Ceph filesystem (cluster of MDSs)
● Artificial limitation: no reason we can't have multiple
CephFS filesystems, with multiple MDS clusters, all
backed by one RADOS cluster.
● Use case ideas:
● Physically isolate workloads on separate MDS clusters
(vs. using dirs within one cluster)
● Disaster recovery: recover into a new filesystem on the
same cluster, instead of trying to repair in place
● Resilience: multiple filesystems become separate failure
domains in case of issues.
27 Ceph Tech Talks: CephFS
Multiple filesystems initial implementation
● You can now run “fs new” more than once (with
different pools)
● Old clients get the default filesystem (you can
configure which one that is)
● New userspace client config opt to select which
filesystem should be mounted
● MDS daemons are all equal: any one may get used for
any filesystem
● Switched off by default: must set a special flag to use
this (like snapshots, inline data)
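● A rough sketch of trying this on a test cluster, assuming the Jewel-era
enable_multiple flag and the client_mds_namespace client option (pool names and
pg counts are placeholders):
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create fs2_data 64
ceph osd pool create fs2_metadata 64
ceph fs new secondfs fs2_metadata fs2_data
ceph-fuse --client_mds_namespace=<fscid> /mnt/secondfs   # takes the FS ID in Jewel; by-name selection is future work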
28 Ceph Tech Talks: CephFS
Multiple filesystems future work
● Enable use of RADOS namespaces (not just separate
pools) for different filesystems to avoid needlessly
creating more pools
● Authorization capabilities to limit MDS and clients to
particular filesystems
● Enable selecting FS in kernel client
● Enable manually configured affinity of MDS daemons
to filesystem(s)
● More user friendly FS selection in userspace client
(filesystem name instead of ID)
29 Ceph Tech Talks: CephFS
Wrap up
30 Ceph Tech Talks: CephFS
Tips for early adopters
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
● Does the most recent development release or kernel
fix your issue?
● What is your configuration? MDS config, Ceph
version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
31 Ceph Tech Talks: CephFS
Questions?
