Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN
Dan van der Ster (daniel.vanderster@cern.ch)
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
●  CERN studies the fundamental laws of nature
○  Why do particles have mass?
○  What is our universe made of?
○  Why is there no antimatter left?
○  What was matter like right after the “Big Bang”?
○  …
●  The Large Hadron Collider (LHC)
○  Built in a 27km long tunnel, ~200m underground
○  Dipole magnets operated at -271°C (1.9K)
○  Particles do ~11’000 turns/sec, 600 million collisions/sec
○  …
●  Detectors
○  Four main experiments, each the size of a cathedral
○  DAQ systems processing petabytes/sec
Scaling Ceph at CERN - D. van der Ster 3
Big Data at CERN
Physics Data on CASTOR/EOS
●  LHC experiments produce ~10GB/s, 25PB/year
User Data on OpenAFS & DFS
●  Home directories for 30k users
●  Physics analysis development
●  Project spaces for applications
Service Data on AFS/NFS
●  Databases, admin applications
Tape archival with CASTOR/TSM
●  RAW physics outputs
●  Desktop/Server backups
Scaling Ceph at CERN - D. van der Ster 4
Service Size Files
OpenAFS 290TB 2.3B
CASTOR 89.0PB 325M
EOS 20.1PB 160M
IT Evolution at CERN
Scaling Ceph at CERN - D. van der Ster 5
Cloudifying CERN’s IT infrastructure ...
●  Centrally-managed and uniform hardware
○  No more service-specific storage boxes
●  OpenStack VMs for most services
○  Building for 100k nodes (mostly for batch processing)
●  Attractive desktop storage services
○  Huge demand for a local Dropbox, Google Drive …
●  Remote data centre in Budapest
○  More rack space and power, plus disaster recovery
… brings new storage requirements
●  Block storage for OpenStack VMs
○  Images and volumes
●  Backend storage for existing and new services
○  AFS, NFS, OwnCloud, Data Preservation, ...
●  Regional storage
○  Use of our new data centre in Hungary
●  Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
Scaling Ceph at CERN - D. van der Ster 6
12 racks of disk server quads
Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
Our 3PB Ceph Cluster
Monitors (5 nodes):
Dual Intel Xeon L5640 (24 threads incl. HT)
Dual 1Gig-E NICs (only one connected)
2x 2TB Hitachi system disks (RAID-1 mirror)
1x 240GB OCZ Deneva 2 for /var/lib/ceph/mon
48GB RAM

Disk servers (47 servers / 1128 OSDs):
Dual Intel Xeon E5-2650 (32 threads incl. HT)
Dual 10Gig-E NICs (only one connected)
24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
3x 2TB Hitachi system disks (triple mirror)
64GB RAM

Scaling Ceph at CERN - D. van der Ster 8
# df -h /mnt/ceph
Filesystem   Size  Used  Avail  Use%  Mounted on
xxx:6789:/   3.1P  173T   2.9P    6%  /mnt/ceph
Use-Cases Being Evaluated
1.  Images and Volumes for OpenStack
2.  S3 Storage for Data Preservation / Public
Dissemination
3.  Physics data storage for archival and/or
analysis
Scaling Ceph at CERN - D. van der Ster 9
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
OpenStack Volumes & Images
•  Glance: using RBD for ~3 months now.
•  Only issue was to increase ulimit -n above 1024 (10k is good; see the sketch below).
•  Cinder: testing with close colleagues.
•  126 Cinder Volumes attached today – 56TB used
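As an illustration of the ulimit point above, one way to raise the open-file limit persistently is via pam_limits; the user name, path and values here are assumptions, not taken from the talk:

# /etc/security/limits.d/91-ceph-clients.conf  (illustrative)
glance  soft  nofile  10240
glance  hard  nofile  10240
# or, for a one-off test, raise the limit in the shell that starts the client:
ulimit -n 10240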
Scaling Ceph at CERN - D. van der Ster 10
[Chart] Growing # of volumes/images
[Chart] Usual traffic is ~50-100MB/s with current usage (~idle)
RBD for OpenStack Volumes
•  Before general availability, we need to test and enable qemu iops/bps throttling (see the sketch at the end of this slide)
•  Otherwise VMs with many IOs can disrupt other users.
•  One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
•  Happens on VMs with many attached RBDs.
•  Difficult to get a complete (16GB) core dump.
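One possible way to apply per-VM disk throttling is through Nova flavor extra specs for the libvirt driver (flavor name and values are illustrative; Cinder QoS specs or libvirt's <iotune> are alternatives):

nova flavor-key m1.medium set quota:disk_total_iops_sec=300
nova flavor-key m1.medium set quota:disk_total_bytes_sec=52428800   # ~50 MB/s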
Scaling Ceph at CERN - D. van der Ster 11
CASTOR & XRootD/EOS
•  Exploring RADOS backend for these two HEP-developed
file systems
•  Gateway model, similar to S3 via RADOSGW
•  CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
•  Striped RWs across many OSDs are important.
•  XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
•  Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
•  Developments are in early stages.
Scaling Ceph at CERN - D. van der Ster 12
Operations & Lessons Learned
Scaling Ceph at CERN - D. van der Ster 13
Configuration and Deployment
•  Dumpling 0.67.7
•  Fully Puppet-ized
•  Automated server deployment,
automated OSD replacement
•  Very few custom ceph.conf options →
•  Experimenting with the filestore wbthrottle
•  we find that disabling it
completely gives better IOps
performance
•  But don’t do this!!!
Scaling Ceph at CERN - D. van der Ster 14
mon osd down out interval = 900

osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true

osd max backfills = 1
osd recovery max active = 1
Cluster Activity
Scaling Ceph at CERN - D. van der Ster 15
General Comments…
•  In these ~7 months of running the cluster, there have been very
few problems
•  No outages
•  No data losses/corruptions
•  No unfixable performance issues
•  Behaves well during stress tests
•  But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
•  “No amount of stress testing can prepare you for real users”
- Unknown
•  (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
Scaling Ceph at CERN - D. van der Ster 16
Latency & Slow Requests
•  Best latency we can achieve is 20-40ms
•  Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
•  Latency can increase dramatically with heavy usage
•  Don’t mix latency-bound and throughput-bound users on the same
OSDs
•  Local processes scanning the disks can hurt performance
•  Add /var/lib/ceph to the updatedb PRUNEPATHS (see the sketch at the end of this slide)
•  If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads!
•  Scrubbing!
•  Kernel tuning: vm.* sysctl, dirty page flushing, memory
reclaiming…
•  “Something is flushing the buffers, blocking the OSD processes”
•  Slow requests: monitor them, eliminate them.
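A sketch of a few of these host-level knobs as they might look on an OSD node; all values are illustrative, not the settings used at CERN:

# keep updatedb/mlocate away from OSD data (in /etc/updatedb.conf):
#   PRUNEPATHS = "... /var/lib/ceph"

# deadline scheduler (when in use): default write_expire is 5000ms vs 500ms for reads
echo 1500 > /sys/block/sdb/queue/iosched/write_expire    # example value only

# vm.* sysctls: flush dirty pages earlier, keep some memory in reserve
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.min_free_kbytes=262144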
Scaling Ceph at CERN - D. van der Ster 17
Life with 250 million objects
•  Recently, a user decided to write 250 million 1kB objects
•  Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
•  It worked – no big problems from holding this many objects.
•  Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
•  But now we want to clean up, and it is not trivial to remove 250M objects!
•  rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
•  Probably due to a mistake in our wbthrottle tuning
Scaling Ceph at CERN - D. van der Ster 18
Other backfilling issues
•  During a backfilling event (draining a whole server),
we started observing repeated monitor elections
•  Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
•  When a mon falls behind, it calls an election
•  Could be due to LevelDB compaction…
•  We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
•  Avoid double backfilling when taking an OSD out of
service:
•  Start with ceph osd crush rm <osd id> !!
•  If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
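A sketch of the two orderings, with an illustrative OSD id; the remaining removal steps follow the standard procedure:

# recommended: remove from CRUSH first, so PGs are remapped only once
ceph osd crush rm osd.123
# ... wait for backfill to complete, stop the daemon, then:
ceph auth del osd.123
ceph osd rm 123

# the double-backfill variant to avoid:
#   ceph osd out 123           # first remapping
#   ceph osd crush rm osd.123  # second remapping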
Scaling Ceph at CERN - D. van der Ster 19
Fun with CRUSH
•  CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
•  But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
•  Early CRUSH planning is crucial!
•  A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right? (a rule sketch follows below)
•  But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
– bottleneck!
•  Unclear tradeoff between uptime and performance.
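For illustration only, a cross-switch replication rule might look roughly like this in the decompiled CRUSH map, assuming a 'rack' (or custom 'switch') bucket type in the hierarchy; this is not the rule used at CERN:

rule replicate_across_switches {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}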
Scaling Ceph at CERN - D. van der Ster 20
CRUSH & Data distribution
•  CRUSH may give your cluster
an uneven data distribution
•  An OSD’s used space will
scale with the number of PGs
assigned to it
•  After you have designed your
cluster, created your pools,
started adding data, check the
PG and volume distributions
•  reweight-by-utilization is useful to iron out an uneven PG distribution (sketch after the chart below)
•  The hashpspool flag is also
important if you have many
active pools
Scaling Ceph at CERN - D. van der Ster 21
[Chart] Number of OSDs having N PGs (for pool = volumes); x-axis: n PGs, y-axis: number of OSDs
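A minimal illustration of the rebalancing step (the threshold is an assumed example):

# reweight OSDs whose utilization exceeds 120% of the cluster mean
ceph osd reweight-by-utilization 120
# then watch the resulting data movement before iterating
ceph -w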
RBD Reliability with 3 Replicas
•  RBD devices are chunked across thousands of objects:
•  A full 1TB volume is composed of 250,000 4MB objects
•  If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
•  If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
•  Our incorrect & irrational fears:
•  Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
•  As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
•  But this is wrong!!
•  The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
•  N_PGs * ~(P_diskfailure^3) / 3!
•  We use 4 replicas for the RBD volumes, but this is probably overkill.
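As a purely illustrative back-of-envelope using the formula above (numbers assumed, not from the talk): with 4096 PGs and a per-disk failure probability of 10^-3 within a single recovery window,

\[
N_{\mathrm{PGs}} \cdot \frac{P_{\mathrm{diskfailure}}^{3}}{3!}
\approx 4096 \cdot \frac{(10^{-3})^{3}}{6}
\approx 7 \times 10^{-7},
\]

i.e. far smaller than the naive "any three disks in the cluster" intuition suggests.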
Scaling Ceph at CERN - D. van der Ster 22
Trust your clients
•  There is no server-side per-client throttling
•  A few nasty clients can overwhelm an OSD, leading to slow requests
for everyone.
•  When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
•  Could use some help in the monitoring: per-client perf stats?
•  One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
•  Could saturate the mon network, lead to disruptions
•  RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
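Where a keyring must be handed out, restricting its capabilities to a single pool limits the blast radius; an illustrative example (client name and pool assumed):

ceph auth get-or-create client.cinder mon 'allow r' osd 'allow rwx pool=volumes'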
Scaling Ceph at CERN - D. van der Ster 23
Fat fingers
•  A healthy cluster is always vulnerable to human errors
•  We’ve thus far avoided any big mistakes
•  Used PG splitting to grow a pool from 8 to 2048 PGs
•  Leads to unresponsive OSDs which get marked down → degraded objects
•  Safer (and now enforced) to grow in 2x or 4x steps (see the sketch below)
•  ulimits, ulimits, ulimits
•  With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere:
•  Glance, qemu, radosgw, ceph/rados CLI, …
•  If you use XFS, don’t put your OSD journal as a file on the disk
•  Use a separate partition, the first partition!
•  We still need to reinstall our whole cluster to re-partition the OSDs
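A sketch of the stepped approach (pool name and step sizes illustrative):

# grow pg_num gradually, letting backfill settle between steps,
# and bump pgp_num to match after each increase
ceph osd pool set volumes pg_num 16
ceph osd pool set volumes pgp_num 16
ceph osd pool set volumes pg_num 64
ceph osd pool set volumes pgp_num 64
# ... continue in 2x-4x steps up to the target ...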
Scaling Ceph at CERN - D. van der Ster 24
Scale up and out
•  Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
•  What about 10,000 or 100,000 OSDs?
•  What about 10,000 or 100,000 clients?
•  Many Ceph instances is always an option, but not ideal
•  Scale out: our growing data centre in Budapest
brings many options:
•  Replicate over the WAN (though, 30ms RTT)
•  Tiering / Caching pools (new feature, need to get
experience…)
•  Data locality – direct IOs to nearby replica or caching pool
Scaling Ceph at CERN - D. van der Ster 25
Summary
Scaling Ceph at CERN - D. van der Ster 26
Summary
•  CERN IT infrastructure is undergoing a private
cloud revolution, and Ceph is providing the
underlying storage.
•  Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
•  In seven months with a 3PB cluster, we’ve not
had any disasters. Actually it’s working quite
well.
•  Presented some lessons learned; I hope they prove useful in your Ceph explorations.
Scaling Ceph at CERN - D. van der Ster 27