Building reliable Ceph clusters with SUSE Enterprise Storage
Survival skills for the real world
Lars Marowsky-Brée
Distinguished Engineer
lmb@suse.com
What this talk is not
● A comprehensive introduction to Ceph
● A SUSE Enterprise Storage roadmap session
● A discussion of Ceph performance tuning
SUSE Enterprise Storage - Reprise
The Ceph project
● An Open Source Software-Defined Storage project
● Multiple front-ends
  – S3/Swift object interface
  – Native Linux block IO
  – Heterogeneous block IO (iSCSI)
  – Native Linux network file system (CephFS)
  – Heterogeneous network file system (nfs-ganesha)
  – Low-level C++/Python/… libraries (see the librados sketch below)
  – Linux, UNIX, Windows, applications, cloud, containers
● Common, smart data store (RADOS)
  – Pseudo-random, algorithmic data distribution
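As an illustration of those low-level libraries, here is a minimal sketch using the python-rados bindings. The pool name 'test-pool' and the ceph.conf path are assumptions, not part of the original slides; the pool must already exist.

    import rados

    # Connect to the cluster described by the local ceph.conf (path assumed)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # 'test-pool' is a hypothetical pool name; create it beforehand
        ioctx = cluster.open_ioctx('test-pool')
        try:
            # RADOS stores flat objects; CRUSH decides placement automatically
            ioctx.write_full('greeting', b'hello rados')
            print(ioctx.read('greeting'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()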
Software-Defined-Storage
Ceph Cluster: Logical View
[Diagram: three MONs, two MDSs, and many OSDs form the RADOS cluster; iSCSI, S3/Swift, and NFS gateways sit on top of RADOS]
Introducing Dependability
Introducing dependability
● Availability
● Reliability
  – Durability
● Safety
● Maintainability
The elephant in the room
● Before we discuss technology …
● … guess what causes most outages?
Improve your human factor
● Great, you are already here!
● Training
● Documentation
● Team your team with a world-class support and consulting organization
High-level considerations
Advantages of Homogeneity
● Eases system administration
● Components are interchangeable
● Lower purchasing costs
● Standardized ordering process
Murphy’s Law, 2016 version
● “At scale, everything fails.”
● Distributed systems protect against individual failures causing service failures by eliminating Single Points of Failure
● Distributed systems are still vulnerable to correlated failures
Advantages of Heterogeneity
Everything is broken …
… but everything is broken differently
Homogeneity is not sustainable
● Hardware gets replaced
  – Replacement with the same model is not available, or
  – not desirable given current prices
● Software updates are not (yet) globally immediate
● Requirements change
● Your cluster ends up being heterogeneous anyway …
● … you might as well benefit from it.
Failure is inevitable; suffering is optional
● If you want uptime, prepare for downtime
● Architect your system to survive single or multiple failures
● Test whether the system meets your SLA
  – while degraded and during recovery!
How much availability do you need?
● Availability and durability are not free
● Cost and complexity increase exponentially
● Scale-out makes some things easier
A bag of suggestions
Embrace diversity
● Automatic recovery requires a >50% majority
  – Splitting into multiple different categories/models
  – Feasible for some components
  – Multiple architectures?
  – Mix them across different racks/pods
● A 50:50 split still allows manual recovery in case of catastrophic failures
  – Different UPS and power circuits
Hardware choices
● SUSE offers Reference Architectures
  – e.g., Lenovo, HPE, Cisco, Dell
● Partners offer turn-key solutions
  – e.g., HPE, Thomas-Krenn
● SUSE YES certification reduces risk
  – https://www.suse.com/newsroom/post/2016/suse-extends-partner-software-certification-for-cloud-and-storage-customers/
● Small variations can have a huge impact!
Not all the eggs in one basket^Wrack
● Distribute servers physically to limit the impact of power outages, spills, …
● Ceph’s CRUSH map allows you to describe the physical topology of your fault domains (engineering speak for “availability zones”)
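A hedged sketch of describing rack-level fault domains to CRUSH, driven from Python via the standard ceph CLI. The rack and host names are hypothetical, and the commands assume an existing cluster with admin credentials and host buckets already present.

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; raises if a command fails
        subprocess.run(['ceph', *args], check=True)

    # Hypothetical topology: two racks, each holding two of our hosts
    topology = {'rack1': ['node1', 'node2'], 'rack2': ['node3', 'node4']}

    for rack, hosts in topology.items():
        # Create a rack bucket and hang it under the default root
        ceph('osd', 'crush', 'add-bucket', rack, 'rack')
        ceph('osd', 'crush', 'move', rack, 'root=default')
        for host in hosts:
            # Move each host bucket under its rack so CRUSH can keep
            # replicas in separate racks (separate fault domains)
            ceph('osd', 'crush', 'move', host, f'rack={rack}')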
How many MONitors do I need?
● 2n+1, i.e., an odd number: a quorum of n+1 out of 2n+1 monitors survives n monitor failures (3 MONs tolerate 1, 5 tolerate 2)
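A worked illustration of the 2n+1 rule (plain arithmetic, not from the slides): with m monitors, a majority quorum needs m // 2 + 1 of them, so (m - 1) // 2 failures are survivable.

    def tolerable_mon_failures(monitors: int) -> int:
        # A majority quorum needs monitors // 2 + 1 voters,
        # so up to (monitors - 1) // 2 monitors may fail.
        return (monitors - 1) // 2

    for m in (1, 3, 5, 7):
        print(f'{m} MONs tolerate {tolerable_mon_failures(m)} failure(s)')
    # 3 MONs tolerate 1, 5 tolerate 2, 7 tolerate 3; an even count adds
    # no extra fault tolerance, hence 2n+1.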
To converge roles or not
● “Hyper-converged” equals correlated failures
● It does drive down the cost of implementation
● Sizing becomes less deterministic
● Services might recover at the same time
● At scale, don’t co-locate the MONs and OSDs
Storage diversity
● Spinning disks (HDDs):
  – Avoid desktop HDDs
  – Avoid sequential serial numbers
  – Mount at different angles if paranoid
  – Multiple vendors
● Flash (SSDs):
  – Avoid desktop SSDs
  – Monitor wear-leveling
  – Remember that the journals see all writes
Storage Node Sizing
● Node failures are the most common failure granularity
  – Admin mistake, network, kernel crash
● Consider the impact of an outage on:
  – performance (degraded and during recovery)
  – and capacity!
● A single node should not hold more than 10% of your total capacity
● Free capacity should be larger than the largest node (a quick sanity-check sketch follows below)
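The two sizing rules above can be turned into a quick sanity check. This is only a sketch with made-up capacities, not a SUSE tool.

    # Hypothetical raw capacities per node, in TB
    nodes = {'node1': 40, 'node2': 40, 'node3': 40, 'node4': 60}
    used_tb = 90  # currently used raw capacity, also hypothetical

    total = sum(nodes.values())
    largest = max(nodes.values())
    free = total - used_tb

    for name, cap in nodes.items():
        if cap > 0.10 * total:
            print(f'{name} holds {cap / total:.0%} of total capacity (> 10%)')

    if free <= largest:
        print(f'Free capacity ({free} TB) does not cover the largest node '
              f'({largest} TB); recovery after a node loss would fill up.')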
Data availability and durability
● Replication (n copies, e.g., 2n+1):
  – Number of copies
  – Linear overhead
● Erasure Coding (k data + m coding blocks):
  – Flexible number of data and coding blocks
  – Can survive a configurable number of outages (up to m)
  – Fractional overhead ((k+m)/k)
  – https://www.youtube.com/watch?v=-KyGv6AZN9M
Durability: Three-way Replication
● Usable capacity: 33%
● Durability: 2 faults
Durability: 4+3 Erasure Coding
● Usable capacity: 57%
● Durability: 3 faults
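A small calculation backing the two slides above: usable capacity is 1/n for n-way replication and k/(k+m) for erasure coding, which reproduces the 33% and 57% figures.

    def replication_usable(copies: int) -> float:
        # n-way replication stores every byte n times
        return 1 / copies

    def ec_usable(k: int, m: int) -> float:
        # k data chunks plus m coding chunks; any m chunks may be lost
        return k / (k + m)

    print(f'3-way replication: {replication_usable(3):.0%} usable, 2 faults')
    print(f'4+3 erasure coding: {ec_usable(4, 3):.0%} usable, 3 faults')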
Consider Cache Tiering
● Data in the cache tier is replicated
● The backing tier may be slower, but more durable
Durability 201
● Different strokes for different pools
● Erasure coding schemes galore
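“Different strokes for different pools” in practice might look like the following sketch: a 3-way replicated pool for hot data and a 4+3 erasure-coded pool for colder data, driven via the ceph CLI. Pool names, PG counts, and the profile name are assumptions.

    import subprocess

    def ceph(*args):
        subprocess.run(['ceph', *args], check=True)

    # Hypothetical replicated pool for latency-sensitive data
    ceph('osd', 'pool', 'create', 'hotpool', '128', '128', 'replicated')
    ceph('osd', 'pool', 'set', 'hotpool', 'size', '3')

    # Hypothetical 4+3 erasure-coded pool for bulk/cold data
    ceph('osd', 'erasure-code-profile', 'set', 'ec-4-3', 'k=4', 'm=3')
    ceph('osd', 'pool', 'create', 'coldpool', '128', '128', 'erasure', 'ec-4-3')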
Finding and correcting bad data
● Ceph “scrubbing” periodically detects inconsistent or missing placement groups
  – http://ceph.com/planet/ceph-manually-repair-object/
  – http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#scrubbing
● SUSE Enterprise Storage 5 will validate checksums on every read
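A hedged sketch of locating and repairing placement groups that scrubbing has flagged. It assumes the Jewel-era `rados list-inconsistent-pg` JSON output and a hypothetical pool name; in practice, inspect the affected objects first, as described in the manual-repair article linked above.

    import json
    import subprocess

    pool = 'hotpool'  # hypothetical pool name

    # Scrubbing marks damaged PGs; list them for this pool as JSON
    out = subprocess.run(['rados', 'list-inconsistent-pg', pool],
                         check=True, capture_output=True, text=True)
    inconsistent = json.loads(out.stdout)

    for pgid in inconsistent:
        # Inspect the objects first (see the linked article), then repair
        print('repairing', pgid)
        subprocess.run(['ceph', 'pg', 'repair', pgid], check=True)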
Automatic fault detection and recovery
● Do you want this in your cluster?
● Consider setting “noout”:
  – during maintenance windows
  – in small clusters
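A sketch of a maintenance-window wrapper around “noout”: while the flag is set, OSDs that go down are not marked out, so no rebalancing starts. The maintenance step is only a placeholder for whatever work you actually perform.

    import subprocess

    def ceph(*args):
        subprocess.run(['ceph', *args], check=True)

    def do_maintenance():
        # Placeholder: reboot a node, swap a disk, apply updates, ...
        pass

    ceph('osd', 'set', 'noout')        # down OSDs stay "in"; no rebalancing
    try:
        do_maintenance()
    finally:
        ceph('osd', 'unset', 'noout')  # restore normal fault handling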
Network considerations
● Have both the public and cluster network bonded
● Consider different NICs
  – Use last year’s NICs and switches
● One channel from each network to each switch
Gateway considerations
● RadosGW (S3/Swift):
  – Use HTTP/TCP load balancers
  – Possible to build using SLE HA with LVS or haproxy
● iSCSI targets:
  – Multiple gateways, natively supported by iSCSI
    ● Improves availability and throughput
  – Make sure you meet your performance SLAs during degraded modes
Avoid configuration drift
● Ensure that systems are configured consistently
  – Installed packages
  – Package versions
  – Configuration (NTP, logging, passwords, …)
● Avoid manual configuration
● Use Salt instead
  – http://ourobengr.com/2016/11/hello-salty-goodness/
  – https://www.suse.com/communities/blog/managing-configuration-drift-salt-snapper/
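One way to spot drift, sketched with real Salt execution modules but hypothetical minion targeting and output handling: compare the installed ceph package version across all minions and flag any disagreement. Run this on the Salt master; the aggregated JSON output shape is an assumption and may need adjusting for your Salt release.

    import json
    import subprocess

    # Ask every minion which ceph version it has installed
    out = subprocess.run(['salt', '*', 'pkg.version', 'ceph',
                          '--out=json', '--static'],
                         check=True, capture_output=True, text=True)
    versions = json.loads(out.stdout)   # assumed: {minion_id: version_string}

    if len(set(versions.values())) > 1:
        print('Configuration drift detected:')
        for minion, version in sorted(versions.items()):
            print(f'  {minion}: {version}')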
Trust but verify, a.k.a. monitoring
● Performance as the system ages
● SSD degradation / wear leveling
● Capacity utilization
● “Free” capacity is usable for recovery
● React to issues in a timely fashion!
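A hedged sketch of one such check, cluster-wide capacity headroom via `ceph df`. The JSON field names follow Jewel-era output and may differ on other releases, and the 70% threshold is an arbitrary example.

    import json
    import subprocess

    out = subprocess.run(['ceph', 'df', '--format=json'],
                         check=True, capture_output=True, text=True)
    stats = json.loads(out.stdout)['stats']

    # Field names assumed from Jewel-era output; verify against your release
    total = stats['total_bytes']
    used = stats['total_used_bytes']
    utilization = used / total

    if utilization > 0.70:   # arbitrary example threshold
        print(f'Cluster {utilization:.0%} full; remember that "free" '
              f'capacity is what recovery will need.')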
Update, always (but with care)
● Updates are good for your system
  – Security
  – Performance
  – Stability
● Ceph remains available even while updates are being rolled out
● SUSE’s tested maintenance updates are a core part of the product’s value
Trust nobody (not even SUSE)
● If at all possible, use a staging system
  – Ideally: a (reduced) version of your production environment
  – At least: a virtualized environment
● Test updates before rolling them out in production
  – Not just code, but also processes!
● Long-term maintainability:
  – Avoid vendor lock-in; use Open Source
Disaster can and will strike
● Does it matter?
● If it does:
  – Backups
  – Replicate to other sites
    ● rbd-mirror, radosgw multi-site
● Have fire drills!
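A sketch of what “replicate to other sites” can look like for block devices with rbd-mirror, using real rbd commands but hypothetical pool, peer, and cluster names. Both sites also need an rbd-mirror daemon running, and pool-mode mirroring requires the journaling feature on the images; neither is shown here.

    import subprocess

    def rbd(*args):
        subprocess.run(['rbd', *args], check=True)

    # Mirror every image in the (hypothetical) 'rbd' pool to the remote site
    rbd('mirror', 'pool', 'enable', 'rbd', 'pool')

    # Register the remote cluster as a peer; 'client.remote' and 'site-b'
    # are placeholder names for the peer's CephX user and cluster
    rbd('mirror', 'pool', 'peer', 'add', 'rbd', 'client.remote@site-b')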
Avoid complexity (KISS)
● Be aggressive in what you test
  – Test all the features
● Be conservative in what you deploy
  – Deploy only what you need
In conclusion
Don’t panic.
SUSE’s here to help.