Erasure codes and storage tiers on gluster

Dan Lambright
SA summit
Sep 23, 2014

AGENDA
● Why erasure codes (ec) in Gluster
● How ec works
● Brief peek at underlying mathematics
● Storage tiering in gluster
● Demo
● “One more thing”

Why erasure codes in gluster?
● Desire protection from double failure
● RAID6 controllers are expensive
● Imagine a 64 node volume
● Each brick on a separate bare metal machine
● Cost is 64 x $ for LSI MegaRaid controller = 20K

Why erasure codes in gluster?
● Triplication (3 way replication) is expensive
● Two redundant disks for every data disk
● 200% overhead! :(

Erasure codes
● Store m disks' worth of data on k disks (k > m)
● n redundant disks (n = k - m)
● Pick n to choose the failure tolerance
● A generalization of RAID6
● Distributed across nodes

Overhead analysis
● Can also consider mean time before failure
k (total disks)   n (failures admitted)   m (data disks)   Capacity overhead (n/k)   RAID level
 3                1                       2                33.33%                    5
 5                1                       4                20%                       5
 6                2                       4                33.33%                    6
 7                3                       4                42.86%                    E
 9                1                       8                11.11%                    5
10                2                       8                20%                       6
11                3                       8                27.27%                    E
12                4                       8                33.33%                    E
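
The overhead column can be reproduced with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of Gluster. It also shows that the 200% figure quoted earlier for triplication counts redundant disks per data disk (n/m), while the table divides by the total number of disks (n/k).

```python
def capacity_overhead(k: int, n: int) -> float:
    """Fraction of raw capacity spent on redundancy (n/k), as in the table."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / k


def redundancy_per_data_disk(k: int, n: int) -> float:
    """Redundant disks per data disk (n/m, with m = k - n)."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / (k - n)


# (k, n) pairs taken from the table
for k, n in [(3, 1), (5, 1), (6, 2), (7, 3), (9, 1), (10, 2), (11, 3), (12, 4)]:
    print(f"k={k:2d} n={n} m={k - n} overhead={capacity_overhead(k, n):.2%}")

# Triplication: one data disk plus two copies (k=3, n=2)
print(f"triplication n/m = {redundancy_per_data_disk(3, 2):.0%}")
print(f"triplication n/k = {capacity_overhead(3, 2):.2%}")
```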

ERASURE CODE TERMS
● m data disks
● n parity disks
● k total number of disks = m + n
● Symbol – Smallest data unit. w bits.
● Typically w = 8 = a byte
● Chunk (aka fragment) – r symbols per disk
● Stripe – collection of m+n chunks across k disks
● Unit of manipulation for recovery
● Also known as a “slice”
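
These terms can be made concrete with the simplest erasure code: a single XOR parity chunk (n = 1, the RAID5 case). A minimal sketch assuming byte symbols (w = 8); the helper names are illustrative and this is not the code Gluster uses.

```python
from functools import reduce


def xor_chunks(chunks: list) -> bytes:
    """XOR equal-length chunks together, symbol by symbol."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def make_stripe(data: bytes, m: int, r: int) -> list:
    """Split m * r symbols into m data chunks and append one parity chunk."""
    if len(data) != m * r:
        raise ValueError("data must hold exactly m * r symbols")
    chunks = [data[i * r:(i + 1) * r] for i in range(m)]
    return chunks + [xor_chunks(chunks)]  # k = m + 1 chunks in the stripe


def recover(stripe: list, lost: int) -> bytes:
    """Rebuild the chunk at index `lost` from the k - 1 surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost]
    return xor_chunks(survivors)


m, r = 4, 6
stripe = make_stripe(b"abcdefghijklmnopqrstuvwx", m, r)
original = stripe[2]
stripe[2] = None                      # simulate losing one disk
assert recover(stripe, 2) == original
print(len(stripe), "chunks of", r, "symbols each")
```

With n = 1 only one lost chunk can be rebuilt; tolerating more failures needs a code such as Reed-Solomon.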

ERASURE CODE TERMS
● [Figure: a “stripe” of 6 fragments (m=4, n=2, k=6), each fragment holding r=6 symbols of w=1 bit, e.g. 011010]

Systematic
● m data chunks, n coding chunks
● (can stripe parity and data chunks on the same disk)
● Reads are simple, only decode on repairs
[Figure: Slice 1, Slice 2, Slice 3]

Non-Systematic
● All k chunks in a stripe are coded
● No distinction between data servers and code servers
● Encode/decode on writes and reads
[Figure: Slice 1, Slice 2, Slice 3]

Encoding / Decoding Overhead
● Network RTT dominates the encode/decode overhead
● Packages exist to implement the math
● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
● Jerasure library from academia
● Gluster's is purpose built and fast

GLUSTERFS “Disperse Volumes”
● Developed at Datalab corp. by Xavier Hernandez
● Use case: archiving medical records
● Developed over the last 2 years
● Now part of gluster upstream

CLI
Two new options have been added to the 'create' command of the CLI:
gluster volume create <name> disperse <count> redundancy <count>
Disperse is “k” (total number of bricks)
Redundancy is “n”
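
A small Python sketch of how the two options combine into a command line. Several things here are assumptions: the volume name and brick paths are hypothetical, the brick list is not shown on the slide (it is appended because `gluster volume create` takes the bricks after the options), and the check that disperse > 2 x redundancy follows the constraint documented for dispersed volumes in later GlusterFS releases. The function only builds the argument list; it does not run anything.

```python
def disperse_create_cmd(name: str, bricks: list, redundancy: int) -> list:
    """Build the argv for creating a dispersed volume over the given bricks."""
    disperse = len(bricks)  # "k": one fragment per brick
    # Assumed constraint: redundancy >= 1 and k > 2 * n
    if redundancy < 1 or disperse <= 2 * redundancy:
        raise ValueError("need redundancy >= 1 and disperse > 2 * redundancy")
    return [
        "gluster", "volume", "create", name,
        "disperse", str(disperse),
        "redundancy", str(redundancy),
        *bricks,
    ]


# Hypothetical hosts and paths: k = 6 bricks, n = 2 redundant, so m = 4
bricks = [f"server{i}:/bricks/ec" for i in range(1, 7)]
print(" ".join(disperse_create_cmd("ecvol", bricks, redundancy=2)))
```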

“Disperse volumes” design choices
● The “symbols” are bytes: w = 8
● The fragment size r = 128
● Algorithm: Reed-Solomon
● Generator matrix: Vandermonde
● Non-systematic
● Encoding / decoding done on client side
● Modeled after AFR
● Concurrent writes must be processed in order
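
To illustrate these choices, here is a toy non-systematic Reed-Solomon code over byte symbols (w = 8) with a Vandermonde generator matrix. It is a sketch for study, not the Gluster implementation: the field polynomial 0x11d, the evaluation points 1..k, and all function names are assumptions, and it ignores the fragment size r = 128 and any performance concerns.

```python
# GF(2^8) arithmetic via log/antilog tables (assumed polynomial 0x11d)
EXP, LOG = [0] * 510, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_pow(a, e):
    return 1 if e == 0 else EXP[(LOG[a] * e) % 255]  # a is never 0 here


def vandermonde_row(i, m):
    """Row for fragment i: powers 0..m-1 of the nonzero point i + 1."""
    return [gf_pow(i + 1, j) for j in range(m)]


def encode(data, m, k):
    """Every m data bytes become one coded byte in each of the k fragments."""
    assert len(data) % m == 0 and m < k <= 255
    rows = [vandermonde_row(i, m) for i in range(k)]
    frags = [bytearray() for _ in range(k)]
    for off in range(0, len(data), m):
        word = data[off:off + m]
        for i, row in enumerate(rows):
            acc = 0
            for coef, d in zip(row, word):
                acc ^= gf_mul(coef, d)
            frags[i].append(acc)
    return [bytes(f) for f in frags]


def invert(matrix):
    """Gauss-Jordan inversion over GF(2^8); subtraction is XOR."""
    n = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = EXP[255 - LOG[aug[col][col]]]
        aug[col] = [gf_mul(v, inv) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def decode(frags, m):
    """Rebuild the data from any m fragments, given as {index: bytes}."""
    ids = sorted(frags)[:m]
    inv = invert([vandermonde_row(i, m) for i in ids])
    out = bytearray()
    for pos in range(len(frags[ids[0]])):
        col = [frags[i][pos] for i in ids]
        for row in inv:
            acc = 0
            for coef, v in zip(row, col):
                acc ^= gf_mul(coef, v)
            out.append(acc)
    return bytes(out)


data = b"disperse volume!"                      # 16 bytes, m = 4
frags = encode(data, m=4, k=6)
survivors = {i: frags[i] for i in (1, 2, 4, 5)}  # fragments 0 and 3 lost
assert decode(survivors, m=4) == data
```

Because no row of the matrix is a unit vector, no fragment holds plain data, which is why reads as well as writes go through the codec.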

Storage Tiers
● Different “subvolume” tiers presented as a single volume
● HDD, SSD, tape, “persistent memory”, etc.
● Plug-in policy describes how data moves between tiers
● V1 policy: Cache
● slow and fast tiers
● CLI to add/remove cache tier from existing volume

Example: Erasure codes + SSD
● User sees one volume
● SSD “caches” ec data
[Figure: tiered volume with a hot “cache” tier on SSD and a cold ec tier on HDD; data is promoted to the hot tier and demoted to the cold tier]
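
The promote/demote cycle can be sketched in Python. Everything here is a hypothetical illustration rather than the Gluster tiering code: the class, the read counter, and the threshold of 3 reads per scan are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TieredVolume:
    promote_after: int = 3                      # hypothetical "hot" threshold
    hot: set = field(default_factory=set)       # files on the SSD cache tier
    cold: set = field(default_factory=set)      # files on the ec/HDD tier
    reads: dict = field(default_factory=dict)   # reads since the last scan

    def create(self, name: str) -> None:
        self.cold.add(name)                     # new files start on the cold tier

    def read(self, name: str) -> str:
        if name not in self.hot and name not in self.cold:
            raise FileNotFoundError(name)
        self.reads[name] = self.reads.get(name, 0) + 1
        return "hot" if name in self.hot else "cold"

    def migrate(self) -> None:
        """One scan: promote busy cold files, demote hot files with no reads."""
        idle = {n for n in self.hot if self.reads.get(n, 0) == 0}
        busy = {n for n in self.cold if self.reads.get(n, 0) >= self.promote_after}
        self.hot = (self.hot - idle) | busy
        self.cold = (self.cold - busy) | idle
        self.reads.clear()


vol = TieredVolume()
vol.create("a.dat")
vol.create("b.dat")
for _ in range(3):
    vol.read("a.dat")
vol.migrate()
print(sorted(vol.hot), sorted(vol.cold))   # a.dat promoted, b.dat stays cold
vol.migrate()
print(sorted(vol.hot), sorted(vol.cold))   # a.dat idle for a scan, demoted
```

The user only ever calls `create` and `read` on one object, which mirrors the "user sees one volume" point.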

Future : Data classification (DC)
● Add rules to storage graph
● Rule determines subvolume
● File name
● Attribute (size, content)
● Etc.
[Figure: decision node “Filename = *.lock ?” with Yes / No branches leading to “Secure / Encrypted” and “HDD” subvolumes]
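
A file-name rule of this kind could look like the following Python sketch. The rule table, the subvolume labels, and in particular the mapping of the “Yes” branch to the secure/encrypted subvolume are assumptions read from the figure, not a confirmed design.

```python
from fnmatch import fnmatchcase

# Assumed mapping: files matching *.lock go to the secure/encrypted
# subvolume, everything else falls through to HDD.
RULES = [("*.lock", "secure-encrypted")]
DEFAULT_SUBVOLUME = "hdd"


def classify(filename: str, rules=RULES, default=DEFAULT_SUBVOLUME) -> str:
    """Return the subvolume chosen by the first matching file-name rule."""
    for pattern, subvolume in rules:
        if fnmatchcase(filename, pattern):
            return subvolume
    return default


for name in ("db.lock", "report.pdf"):
    print(name, "->", classify(name))
```

Rules on attributes such as size or content would need a predicate per rule instead of a glob pattern.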

Future flexibility
● Many use cases
● Compliance
● Multi-tenancy
● Rack-aware placement (for performance)
● Policies described by language
● Arbitrary number of tiers, rules, subvolumes, ...
● Template based

Bitrot
● A daemon that scans gluster volumes
● Finds corrupted data
● Digest associated with each file
● Alert / recover on mismatch
● Tuning parameters to be non-intrusive to performance
● “Plug-ins” to daemon may do other things
● Encryption
● Compression
● Etc.
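
The digest check at the heart of such a scan can be sketched as follows. The choice of SHA-256 and the in-memory dictionary of recorded digests are assumptions for illustration; the slide does not say which digest is used or where it is stored.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def scan(known: dict) -> list:
    """Return the paths whose current digest differs from the recorded one."""
    return [p for p, recorded in known.items() if file_digest(p) != recorded]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "brick-file")
    with open(path, "wb") as f:
        f.write(b"payload")
    known = {path: file_digest(path)}   # digest recorded at write time
    clean = scan(known)
    with open(path, "wb") as f:
        f.write(b"paylaod")             # simulate silent corruption
    corrupted = scan(known)

print("before corruption:", clean)
print("after corruption:", corrupted)
```

A real daemon would throttle the scan and trigger an alert or repair for each mismatch.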

Do it!
● Learn the math:
● http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html
● Get the bits:
● https://forge.gluster.org/disperse

Thank You!
● dlambright@redhat.com
● RHS: www.redhat.com/storage/
● GlusterFS: www.gluster.org
● @Glusterorg, @RedHatStorage
● Gluster, Red Hat Storage
Slides Available on Mojo
