Erasure Codes and
Storage Tiers on
Gluster
Dan Lambright
SA summit
Sep 23, 2014
AGENDA
● Why erasure codes (ec) in Gluster
● How ec works
● Brief peek at underlying mathematics
● Storage tiering in gluster
● Demo
● “One more thing”
Why erasure codes in gluster?
● Desire protection from double failure
● RAID6 controllers are expensive
● Imagine a 64 node volume
● Each brick on a separate bare metal machine
● Cost is 64 x the price of one LSI MegaRAID controller = $20K
Why erasure codes in gluster?
● Triplication (3 way replication) is expensive
● Two redundant disks for every data disk
● 200% overhead! :(
Erasure codes
● Store m disks' worth of data on k disks (k > m)
● n = k - m redundant disks
● Pick n to choose the failure tolerance
● A generalization of RAID6
● Distributed across nodes
Overhead analysis
● Can also consider mean time before failure
k (total disks)   n (failures admitted)   m (data disks)   Capacity overhead (n/k)   RAID level
3                 1                       2                33.33%                    5
5                 1                       4                20%                       5
6                 2                       4                33.33%                    6
7                 3                       4                42.86%                    E
9                 1                       8                11.11%                    5
10                2                       8                20%                       6
11                3                       8                27.27%                    E
12                4                       8                33.33%                    E
(E = erasure code)
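The overhead column can be recomputed with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of any Gluster tool. It also shows why triplication is quoted as 200%: that figure measures redundant disks per data disk (n/m), whereas the table measures redundancy as a share of all disks (n/k).

```python
from fractions import Fraction

def capacity_overhead(k: int, n: int) -> Fraction:
    """Share of all k disks used for redundancy: n / k (the table's measure)."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return Fraction(n, k)

def redundancy_per_data_disk(k: int, n: int) -> Fraction:
    """Redundant disks per data disk: n / m, with m = k - n."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return Fraction(n, k - n)

# A few rows of the table, plus 3-way replication (k=3, n=2, m=1).
for k, n in [(3, 1), (7, 3), (12, 4), (3, 2)]:
    print(f"k={k} n={n} m={k - n} "
          f"n/k={float(capacity_overhead(k, n)):.2%} "
          f"n/m={float(redundancy_per_data_disk(k, n)):.2%}")
```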
ERASURE CODES PRIMER
ERASURE CODE TERMS
● m data disks
● n parity disks
● k total number of disks = m + n
● Symbol – Smallest data unit. w bits.
● Typically w = 8 = a byte
● Chunk (aka fragment) – r symbols per disk
● Stripe – collection of m+n chunks across k disks
● Unit of manipulation for recovery
● Also known as a “slice”
ERASURE CODE TERMS
[Figure: example stripe with r=6, m=4, n=2, k=6, w=1. Each fragment holds 6 one-bit symbols (e.g. 011010); the “stripe” consists of 6 fragments.]
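The example parameters (r=6, m=4, n=2, w=1) can be turned into sizes with simple arithmetic. The sketch below only restates the definitions from the terms slide; the function name and dictionary keys are made up for illustration.

```python
def stripe_geometry(m: int, n: int, r: int, w: int) -> dict:
    """Sizes implied by the terms: w bits per symbol, r symbols per fragment,
    and k = m + n fragments per stripe."""
    k = m + n
    fragment_bits = r * w
    return {
        "k": k,
        "fragment_bits": fragment_bits,
        "stripe_bits": k * fragment_bits,   # data plus redundancy
        "data_bits": m * fragment_bits,     # user data carried by one stripe
    }

print(stripe_geometry(m=4, n=2, r=6, w=1))
```

With the example values, each fragment is 6 bits, a stripe stores 36 bits in total, and 24 of those bits are user data.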
Systematic
● m data chunks, n coding chunks
● (can stripe parity and data chunks on the same disk)
● Reads are simple, only decode on repairs
[Figure: slices 1 to 3]
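A systematic code can be illustrated with the simplest possible case: a single XOR parity chunk (n = 1, as in RAID 5). This is a sketch for intuition only and is not how Gluster encodes data. The data chunks are stored unchanged, so a normal read needs no decoding, and the XOR is only used to repair a lost chunk.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_systematic(data_chunks: list) -> list:
    """Return the m data chunks unchanged, followed by one XOR parity chunk."""
    parity = reduce(xor_bytes, data_chunks)
    return list(data_chunks) + [parity]

def read(stripe: list, m: int) -> bytes:
    """Normal read: concatenate the first m chunks, no decoding needed."""
    return b"".join(stripe[:m])

def repair(stripe: list, lost: int) -> bytes:
    """Rebuild the single chunk at index `lost` by XOR-ing all survivors."""
    survivors = [c for i, c in enumerate(stripe) if i != lost]
    return reduce(xor_bytes, survivors)

stripe = encode_systematic([b"glus", b"ter ", b"ec!!"])
assert read(stripe, 3) == b"gluster ec!!"
assert repair(stripe, 1) == b"ter "
```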
Non-Systematic
● All k chunks in a stripe are coded
● Does not distinguish data servers from code servers
● Encode/decode on writes and reads
[Figure: slices 1 to 3]
Encoding / Decoding Overhead
● Network RTT dominates the encode/decode overhead
● Packages exist to implement the math
● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
● Jerasure library from academia
● Gluster's is purpose-built and fast
GLUSTER IMPLEMENTATION
GLUSTERFS “Disperse Volumes”
● Done by Xavier Hernandez at Datalab corp.
● Use case : archiving medical records
● Developed over last 2 years
● Now part of gluster upstream
CLI
Two new options have been added to the 'create' command of the CLI:
gluster volume create <name> disperse <count> redundancy <count>
Disperse is “k” (total number of subvolumes, i.e. bricks)
Redundancy is “n”
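As an illustration, the command line for a k=6, n=2 volume could be assembled as below. This is a sketch under assumptions: the slide's command omits the brick list, so the trailing host:/path arguments, the volume name, and the helper function are all hypothetical, and the check 0 < n < k only reflects the definitions in this deck (a real Gluster release may enforce stricter limits).

```python
import shlex

def disperse_create_cmd(name: str, k: int, n: int, bricks: list) -> str:
    """Build a 'gluster volume create' command line for a disperse volume.

    k is the disperse count (total bricks), n the redundancy count.
    """
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    if len(bricks) != k:
        raise ValueError("expected exactly k bricks")
    args = ["gluster", "volume", "create", name,
            "disperse", str(k), "redundancy", str(n), *bricks]
    return shlex.join(args)

# Hypothetical hosts and brick paths, one brick per server.
bricks = [f"server{i}:/bricks/ec" for i in range(1, 7)]
print(disperse_create_cmd("ecvol", k=6, n=2, bricks=bricks))
```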
“Disperse volumes” design choices
● The “symbols” are bytes: w = 8
● The fragment size r = 128
● Algorithm: Reed-Solomon
● Generator matrix: Vandermonde
● Non-systematic
● Encoding / decoding done on client side
● Modeled after AFR
● Concurrent writes must be processed in order
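To make the design choices concrete, here is a toy non-systematic Reed-Solomon code over bytes (w = 8) with a Vandermonde generator matrix. It is a sketch for intuition, not Gluster's implementation: the field polynomial (0x11D), the evaluation points 1..k, and the data layout are assumptions chosen for brevity. Every fragment is a coded mix of the data, and any m of the k fragments recover the original.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return p

def gf_pow(a: int, e: int) -> int:
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a: int) -> int:
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    return gf_pow(a, 254)  # a^254 == a^-1 for nonzero a in GF(2^8)

def vandermonde(k: int, m: int) -> list:
    """k x m generator matrix; row i is [x^0, x^1, ..., x^(m-1)] with x = i + 1."""
    return [[gf_pow(i + 1, j) for j in range(m)] for i in range(k)]

def mat_vec(M: list, v) -> list:
    out = []
    for row in M:
        s = 0
        for c, x in zip(row, v):
            s ^= gf_mul(c, x)  # addition in GF(2^8) is XOR
        out.append(s)
    return out

def invert(M: list) -> list:
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    n = len(M)
    A = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c])
        A[c], A[p] = A[p], A[c]
        inv = gf_inv(A[c][c])
        A[c] = [gf_mul(inv, x) for x in A[c]]
        for r in range(n):
            if r != c and A[r][c]:
                f = A[r][c]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def encode(data: bytes, k: int, m: int) -> list:
    """Code each group of m data bytes into one byte for each of k fragments."""
    if len(data) % m:
        raise ValueError("data length must be a multiple of m")
    V = vandermonde(k, m)
    frags = [bytearray() for _ in range(k)]
    for off in range(0, len(data), m):
        for i, s in enumerate(mat_vec(V, data[off:off + m])):
            frags[i].append(s)
    return [bytes(f) for f in frags]

def decode(frags: dict, k: int, m: int) -> bytes:
    """Recover the data from any m fragments, given as {fragment index: bytes}."""
    idx = sorted(frags)[:m]
    if len(idx) < m:
        raise ValueError("need at least m fragments")
    inv = invert([vandermonde(k, m)[i] for i in idx])
    out = bytearray()
    for pos in range(len(frags[idx[0]])):
        out.extend(mat_vec(inv, [frags[i][pos] for i in idx]))
    return bytes(out)

data = b"erasure codes on gluster"
frags = encode(data, k=6, m=4)
survivors = {i: frags[i] for i in (1, 2, 4, 5)}  # fragments 0 and 3 are lost
assert decode(survivors, k=6, m=4) == data
```

Because no row of this generator matrix is a unit vector, no fragment holds plain data, which is why a non-systematic scheme has to decode on every read as well as on repair.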
STORAGE TIERS
Storage Tiers
● Different “subvolume” tiers presented as a single volume
● HDD, SSD, tape, “persistent memory”, etc.
● Plug-in policy describes how data moves between tiers
● V1 policy: Cache
● slow and fast tiers
● CLI to add/remove cache tier from existing volume
Example: Erasure codes + SSD
● User sees one volume
● SSD “caches” ec data
[Figure: tiered volume with a hot “cache” tier on SSD and a cold ec tier on HDD; data is promoted from cold to hot and demoted from hot to cold.]
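The promote/demote cycle of a cache policy can be sketched as follows. This is a toy model, not Gluster's tiering code: the class, the hit-count threshold, and the idea of a fixed measurement window are assumptions made for illustration.

```python
from collections import Counter

class TieredVolume:
    """Toy cache policy: files read often in a window live on the hot tier."""

    def __init__(self, promote_hits: int = 3):
        self.tier = {}            # path -> "hot" (SSD) or "cold" (ec on HDD)
        self.hits = Counter()     # reads seen in the current window
        self.promote_hits = promote_hits

    def create(self, path: str) -> None:
        self.tier[path] = "cold"  # new files start on the cold tier

    def read(self, path: str) -> str:
        """Record an access and report which tier served it."""
        self.hits[path] += 1
        return self.tier[path]

    def rebalance(self) -> None:
        """End of a window: promote busy files, demote idle ones."""
        for path in self.tier:
            busy = self.hits[path] >= self.promote_hits
            self.tier[path] = "hot" if busy else "cold"
        self.hits.clear()

vol = TieredVolume(promote_hits=2)
vol.create("a.dat")
vol.create("b.dat")
vol.read("a.dat")
vol.read("a.dat")
vol.rebalance()
print(vol.tier)
```

The user still addresses a single volume; only the placement behind each path changes.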
Future : Data classification (DC)
● Add rules to storage graph
● Rule determines subvolume
● File name
● Attribute (size, content)
● Etc.
[Figure: decision rule “Filename = *.lock?”. Yes: Secure / Encrypted subvolume. No: HDD subvolume.]
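A file-name rule like the one in the figure could be expressed as an ordered list of patterns. The sketch below is hypothetical: the rule format and the subvolume names are invented for illustration, since the slide describes data classification as future work.

```python
from fnmatch import fnmatchcase

# Ordered rules: first matching pattern wins. Subvolume names are made up.
RULES = [("*.lock", "secure-encrypted")]
DEFAULT_SUBVOLUME = "hdd"

def classify(filename: str, rules=RULES, default=DEFAULT_SUBVOLUME) -> str:
    """Pick the subvolume for a file based on its name."""
    for pattern, subvolume in rules:
        if fnmatchcase(filename, pattern):
            return subvolume
    return default

print(classify("db.lock"), classify("movie.mp4"))
```

Rules on other attributes (size, content) would follow the same shape, with a predicate in place of the glob pattern.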
Future flexibility
● Many use cases
● Compliance
● Multi-tenancy
● Rack-aware placement (for performance)
● Policies described by language
● Arbitrary number of tiers, rules, subvolumes ..
● Template based
DEMO
ONE MORE THING..
Bitrot
● A daemon that scans gluster volumes
● Finds corrupted data
● Digest associated with each file
● Alert / recover on mismatch
● “Plug-ins” to daemon may do other things..
● Tuning parameters to be non-intrusive to performance
● Encryption
● Compression
● Etc.
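The digest-and-compare idea can be sketched in a few lines. This is not the Gluster daemon: the choice of SHA-256, the in-memory digest table, and the function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB blocks to keep memory use flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def sign(root: Path) -> dict:
    """Associate a digest with every regular file under root."""
    return {p: digest(p) for p in sorted(root.rglob("*")) if p.is_file()}

def scrub(signed: dict) -> list:
    """Return the files that are missing or whose content no longer matches."""
    return [p for p, d in signed.items() if not p.exists() or digest(p) != d]
```

A scan would call `sign` once when data is known to be good and `scrub` periodically afterwards; each path returned by `scrub` is a candidate for an alert or a repair.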
Do it!
● Learn the math:
● http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html
● Get the bits:
● https://forge.gluster.org/disperse
Thank You!
● dlambright@redhat.com
● RHS: www.redhat.com/storage/
● GlusterFS: www.gluster.org
● @Glusterorg, @RedHatStorage
Gluster
Red Hat Storage
Slides Available on Mojo
