Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN
Dan van der Ster (daniel.vanderster@cern.ch)
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
●  CERN studies the fundamental laws of nature
○  Why do particles have mass?
○  What is our universe made of?
○  Why is there no antimatter left?
○  What was matter like right after the “Big Bang”?
○  …
●  The Large Hadron Collider (LHC)
○  Built in a 27km long tunnel, ~200m underground
○  Dipole magnets operated at -271°C (1.9K)
○  Particles do ~11’000 turns/sec, 600 million collisions/sec
○  …
●  Detectors
○  Four main experiments, each the size of a cathedral
○  DAQ systems processing petabytes/sec
Scaling Ceph at CERN - D. van der Ster 3
Big Data at CERN
Physics Data on CASTOR/EOS
●  LHC experiments produce ~10GB/s, 25PB/year
User Data on OpenAFS & DFS
●  Home directories for 30k users
●  Physics analysis development
●  Project spaces for applications
Service Data on AFS/NFS
●  Databases, admin applications
Tape archival with CASTOR/TSM
●  RAW physics outputs
●  Desktop/Server backups
Scaling Ceph at CERN - D. van der Ster 4
Service Size Files
OpenAFS 290TB 2.3B
CASTOR 89.0PB 325M
EOS 20.1PB 160M
IT Evolution at CERN
Scaling Ceph at CERN - D. van der Ster 5
Cloudifying CERN’s IT infrastructure ...
●  Centrally-managed and uniform hardware
○  No more service-specific storage boxes
●  OpenStack VMs for most services
○  Building for 100k nodes (mostly for batch processing)
●  Attractive desktop storage services
○  Huge demand for a local Dropbox, Google Drive …
●  Remote data centre in Budapest
○  More rack space and power, plus disaster recovery
… brings new storage requirements
●  Block storage for OpenStack VMs
○  Images and volumes
●  Backend storage for existing and new services
○  AFS, NFS, OwnCloud, Data Preservation, ...
●  Regional storage
○  Use of our new data centre in Hungary
●  Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
Scaling Ceph at CERN - D. van der Ster 6
12 racks of disk server quads
Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph
Our 3PB Ceph Cluster
Monitors (5 nodes):
Dual Intel Xeon L5640 (24 threads incl. HT)
Dual 1Gig-E NICs (only one connected)
2x 2TB Hitachi system disks (RAID-1 mirror)
1x 240GB OCZ Deneva 2 for /var/lib/ceph/mon
48GB RAM

Disk servers (47 servers / 1128 OSDs):
Dual Intel Xeon E5-2650 (32 threads incl. HT)
Dual 10Gig-E NICs (only one connected)
24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
3x 2TB Hitachi system disks (triple mirror)
64GB RAM

Scaling Ceph at CERN - D. van der Ster 8
# df -h /mnt/ceph
Filesystem   Size  Used  Avail  Use%  Mounted on
xxx:6789:/   3.1P  173T   2.9P    6%  /mnt/ceph
Use-Cases Being Evaluated
1.  Images and Volumes for OpenStack
2.  S3 Storage for Data Preservation / Public
Dissemination
3.  Physics data storage for archival and/or
analysis
Scaling Ceph at CERN - D. van der Ster 9
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
OpenStack Volumes & Images
•  Glance: using RBD for ~3 months now.
•  Only issue was to increase ulimit -n above 1024 (10k is good; see the sketch below).
•  Cinder: testing with close colleagues.
•  126 Cinder Volumes attached today – 56TB used
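As an illustration of the ulimit point above, one way to raise the open-file limit persistently is via pam_limits; the user name, path and values here are assumptions, not taken from the talk:

# /etc/security/limits.d/91-ceph-clients.conf  (illustrative)
glance  soft  nofile  10240
glance  hard  nofile  10240
# or, for a one-off test, raise the limit in the shell that starts the client:
ulimit -n 10240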
Scaling Ceph at CERN - D. van der Ster 10
[Chart] Growing # of volumes/images
[Chart] Usual traffic is ~50-100MB/s with current usage (~idle)
RBD for OpenStack Volumes
•  Before general availability, we need to test and enable qemu iops/bps throttling (see the sketch at the end of this slide)
•  Otherwise VMs with many IOs can disrupt other users.
•  One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
•  Happens on VMs with many attached RBDs.
•  Difficult to get a complete (16GB) core dump.
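One possible way to apply per-VM disk throttling is through Nova flavor extra specs for the libvirt driver (flavor name and values are illustrative; Cinder QoS specs or libvirt's <iotune> are alternatives):

nova flavor-key m1.medium set quota:disk_total_iops_sec=300
nova flavor-key m1.medium set quota:disk_total_bytes_sec=52428800   # ~50 MB/s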
Scaling Ceph at CERN - D. van der Ster 11
CASTOR & XRootD/EOS
•  Exploring RADOS backend for these two HEP-developed
file systems
•  Gateway model, similar to S3 via RADOSGW
•  CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
•  Striped RWs across many OSDs are important.
•  XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
•  Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
•  Developments are in early stages.
Scaling Ceph at CERN - D. van der Ster 12
Operations & Lessons Learned
Scaling Ceph at CERN - D. van der Ster 13
Configuration and Deployment
•  Dumpling 0.67.7
•  Fully Puppet-ized
•  Automated server deployment,
automated OSD replacement
•  Very few custom ceph.conf options →
•  Experimenting with the filestore wbthrottle
•  we find that disabling it
completely gives better IOps
performance
•  But don’t do this!!!
Scaling Ceph at CERN - D. van der Ster 14
mon osd down out interval = 900

osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true

osd max backfills = 1
osd recovery max active = 1
Cluster Activity
Scaling Ceph at CERN - D. van der Ster 15
General Comments…
•  In these ~7 months of running the cluster, there have been very
few problems
•  No outages
•  No data losses/corruptions
•  No unfixable performance issues
•  Behaves well during stress tests
•  But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
•  “No amount of stress testing can prepare you for real users”
- Unknown
•  (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
Scaling Ceph at CERN - D. van der Ster 16
Latency & Slow Requests
•  Best latency we can achieve is 20-40ms
•  Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
•  Latency can increase dramatically with heavy usage
•  Don’t mix latency-bound and throughput-bound users on the same
OSDs
•  Local processes scanning the disks can hurt performance
•  Add /var/lib/ceph to the updatedb PRUNEPATHS (see the sketch at the end of this slide)
•  If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads!
•  Scrubbing!
•  Kernel tuning: vm.* sysctl, dirty page flushing, memory
reclaiming…
•  “Something is flushing the buffers, blocking the OSD processes”
•  Slow requests: monitor them, eliminate them.
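A sketch of a few of these host-level knobs as they might look on an OSD node; all values are illustrative, not the settings used at CERN:

# keep updatedb/mlocate away from OSD data (in /etc/updatedb.conf):
#   PRUNEPATHS = "... /var/lib/ceph"

# deadline scheduler (when in use): default write_expire is 5000ms vs 500ms for reads
echo 1500 > /sys/block/sdb/queue/iosched/write_expire    # example value only

# vm.* sysctls: flush dirty pages earlier, keep some memory in reserve
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.min_free_kbytes=262144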
Scaling Ceph at CERN - D. van der Ster 17
Life with 250 million objects
•  Recently, a user decided to write 250 million 1kB objects
•  Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
•  It worked – no big problems from holding this many objects.
•  Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
•  But now we want to clean up, and it is not trivial to remove 250M objects!
•  rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
•  Probably due to a mistake in our wbthrottle tuning
Scaling Ceph at CERN - D. van der Ster 18
Other backfilling issues
•  During a backfilling event (draining a whole server),
we started observing repeated monitor elections
•  Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
•  When a mon falls behind, it calls an election
•  Could be due to LevelDB compaction…
•  We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
•  Avoid double backfilling when taking an OSD out of
service:
•  Start with ceph osd crush rm <osd id> !!
•  If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
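A sketch of the two orderings, with an illustrative OSD id; the remaining removal steps follow the standard procedure:

# recommended: remove from CRUSH first, so PGs are remapped only once
ceph osd crush rm osd.123
# ... wait for backfill to complete, stop the daemon, then:
ceph auth del osd.123
ceph osd rm 123

# the double-backfill variant to avoid:
#   ceph osd out 123           # first remapping
#   ceph osd crush rm osd.123  # second remapping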
Scaling Ceph at CERN - D. van der Ster 19
Fun with CRUSH
•  CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
•  But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
•  Early CRUSH planning is crucial!
•  A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right? (a rule sketch follows below)
•  But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
– bottleneck!
•  Unclear tradeoff between uptime and performance.
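For illustration only, a cross-switch replication rule might look roughly like this in the decompiled CRUSH map, assuming a 'rack' (or custom 'switch') bucket type in the hierarchy; this is not the rule used at CERN:

rule replicate_across_switches {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}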
Scaling Ceph at CERN - D. van der Ster 20
CRUSH & Data distribution
•  CRUSH may give your cluster
an uneven data distribution
•  An OSD’s used space will
scale with the number of PGs
assigned to it
•  After you have designed your
cluster, created your pools,
started adding data, check the
PG and volume distributions
•  reweight-by-utilization is useful to iron out an uneven PG distribution (sketch after the chart below)
•  The hashpspool flag is also
important if you have many
active pools
Scaling Ceph at CERN - D. van der Ster 21
[Chart] Number of OSDs having N PGs (for pool = volumes); x-axis: n PGs, y-axis: number of OSDs
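A minimal illustration of the rebalancing step (the threshold is an assumed example):

# reweight OSDs whose utilization exceeds 120% of the cluster mean
ceph osd reweight-by-utilization 120
# then watch the resulting data movement before iterating
ceph -w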
RBD Reliability with 3 Replicas
•  RBD devices are chunked across thousands of objects:
•  A full 1TB volume is composed of 250,000 4MB objects
•  If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
•  If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
•  Our incorrect & irrational fears:
•  Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
•  As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
•  But this is wrong!!
•  The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
•  N_PGs * ~(P_diskfailure^3) / 3!
•  We use 4 replicas for the RBD volumes, but this is probably overkill.
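As a purely illustrative back-of-envelope using the formula above (numbers assumed, not from the talk): with 4096 PGs and a per-disk failure probability of 10^-3 within a single recovery window,

\[
N_{\mathrm{PGs}} \cdot \frac{P_{\mathrm{diskfailure}}^{3}}{3!}
\approx 4096 \cdot \frac{(10^{-3})^{3}}{6}
\approx 7 \times 10^{-7},
\]

i.e. far smaller than the naive "any three disks in the cluster" intuition suggests.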
Scaling Ceph at CERN - D. van der Ster 22
Trust your clients
•  There is no server-side per-client throttling
•  A few nasty clients can overwhelm an OSD, leading to slow requests
for everyone.
•  When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
•  Could use some help in the monitoring: per-client perf stats?
•  One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
•  Could saturate the mon network, lead to disruptions
•  RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
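Where a keyring must be handed out, restricting its capabilities to a single pool limits the blast radius; an illustrative example (client name and pool assumed):

ceph auth get-or-create client.cinder mon 'allow r' osd 'allow rwx pool=volumes'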
Scaling Ceph at CERN - D. van der Ster 23
Fat fingers
•  A healthy cluster is always vulnerable to human errors
•  We’ve thus far avoided any big mistakes
•  Used PG splitting to grow a pool from 8 to 2048 PGs
•  Leads to unresponsive OSDs which get marked down → degraded objects
•  Safer (and now enforced) to grow in 2x or 4x steps (see the sketch below)
•  ulimits, ulimits, ulimits
•  With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere:
•  Glance, qemu, radosgw, ceph/rados CLI, …
•  If you use XFS, don’t put your OSD journal as a file on the disk
•  Use a separate partition, the first partition!
•  We still need to reinstall our whole cluster to re-partition the OSDs
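A sketch of the stepped approach (pool name and step sizes illustrative):

# grow pg_num gradually, letting backfill settle between steps,
# and bump pgp_num to match after each increase
ceph osd pool set volumes pg_num 16
ceph osd pool set volumes pgp_num 16
ceph osd pool set volumes pg_num 64
ceph osd pool set volumes pgp_num 64
# ... continue in 2x-4x steps up to the target ...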
Scaling Ceph at CERN - D. van der Ster 24
Scale up and out
•  Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
•  What about 10,000 or 100,000 OSDs?
•  What about 10,000 or 100,000 clients?
•  Many Ceph instances is always an option, but not ideal
•  Scale out: our growing data centre in Budapest
brings many options:
•  Replicate over the WAN (though, 30ms RTT)
•  Tiering / Caching pools (new feature, need to get
experience…)
•  Data locality – direct IOs to nearby replica or caching pool
Scaling Ceph at CERN - D. van der Ster 25
Summary
Scaling Ceph at CERN - D. van der Ster 26
Summary
•  CERN IT infrastructure is undergoing a private
cloud revolution, and Ceph is providing the
underlying storage.
•  Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
•  In seven months with a 3PB cluster, we’ve not
had any disasters. Actually it’s working quite
well.
•  Presented some lessons learned; I hope they prove useful in your Ceph explorations.
Scaling Ceph at CERN - D. van der Ster 27