Challenges in Using Persistent Memory in Gluster
Gluster Summit 2016
Dan Lambright
Storage System Software Developer
Adjunct Professor University of Massachusetts Lowell
Aug. 23, 2016
RED HAT
● Technologies
○ Persistent memory, aka storage class memory (SCM)
○ Distributed storage
● Challenges
○ Network latency (studied Gluster)
○ Accelerating parts of the system with SCM
○ CPU latency (studied Ceph)
Overview
● Near DRAM speeds
● Endurance better than SSDs (Intel’s claim for 3D XPoint)
● API available
○ Crash-proof transactions
○ Byte or block addressable
● Likely to be at least as expensive as SSDs
● Fast random access
● Has support in Linux
● NVDIMMs available today.
Storage Class Memory
What do we know / expect?
Media                 Latency
HDD                   10ms
SSD                   1ms
SCM                   < 1us
CPU (Ceph, approx.)   ~1000us
Network (RDMA)        ~10-50us
The problem
Must lower latencies throughout system: storage, network, CPU
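A rough back-of-the-envelope sketch of why this matters, using only the figures above. The additive model (media + CPU + network) and the 30us midpoint chosen for the 10-50us RDMA range are simplifying assumptions for illustration, not measurements.

```python
# Latency figures from the slide, in microseconds.
MEDIA_US = {"HDD": 10_000.0, "SSD": 1_000.0, "SCM": 1.0}  # SCM: upper bound (< 1us)
CPU_US = 1_000.0    # Ceph software stack, approximate
NETWORK_US = 30.0   # assumed midpoint of the 10-50us RDMA range


def media_share(media: str) -> float:
    """Fraction of a simple additive latency budget spent in the media itself."""
    total = MEDIA_US[media] + CPU_US + NETWORK_US
    return MEDIA_US[media] / total


if __name__ == "__main__":
    for name in MEDIA_US:
        print(f"{name}: media is {media_share(name):.1%} of total latency")
```

Under these assumptions the media accounts for roughly 90% of the budget with an HDD but only about 0.1% with SCM, so the CPU and network terms dominate.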
● “Primary copy”: the primary updates replicas in parallel
○ Primary processes reads and writes
○ Ceph’s choice, also JBR
● Other design options
○ Read at “tail” - the data there is always committed
Server Side Replication
Latency cost to replicate across nodes
[Diagram: client, primary server, Replica 1, Replica 2]
● Uses more client side bandwidth
● Likely client has slower network than server.
Client Side Replication
Latency cost lower than server-side replication
[Diagram: client, Replica 1, Replica 1, Replica 2]
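The latency difference between the two replication styles can be sketched with a simple hop-counting model. This is an illustrative assumption, not a measurement: every message costs one fixed one-way latency, replicas are written in parallel, and processing time and bandwidth are ignored. The function and parameter names are hypothetical.

```python
def server_side_latency_us(client_hop_us: float, server_hop_us: float) -> float:
    """Primary-copy write: client -> primary -> replicas (in parallel), then acks back."""
    client_round_trip = 2 * client_hop_us   # request to primary, final ack to client
    replica_round_trip = 2 * server_hop_us  # parallel fan-out counts as one round trip
    return client_round_trip + replica_round_trip


def client_side_latency_us(client_hop_us: float) -> float:
    """Client writes to all replicas in parallel and waits for their acks."""
    return 2 * client_hop_us


if __name__ == "__main__":
    print(server_side_latency_us(10.0, 10.0))  # two sequential round trips
    print(client_side_latency_us(10.0))        # one round trip
```

With a 10us hop everywhere, the model gives 40us for server-side and 20us for client-side replication. It does not capture the extra client bandwidth that client-side replication consumes.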
● Avoid OS data copy; free CPU from transfer
● Application must manage buffers
○ Reserve memory up-front, sized properly
○ Both Ceph and Gluster have good RDMA interfaces
● Extend protocol to further improve latency?
○ Proposed protocol extensions could shrink latency to ~3us.
○ RDMA write completion does not indicate data was persisted (“ship and pray”)
○ ACK in higher level protocol - adds overhead
○ Add “commit bit”, perhaps combine with last write?
Improving Network Latency
RDMA
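A toy model of the “commit bit” idea, for illustration only. It does not use any real RDMA API; the `Target` class and its `write` method are hypothetical. The assumption is that data sits in a volatile buffer until a write carrying the commit flag arrives, and only then is persistence acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """Toy storage target: writes land in a volatile buffer until committed."""
    volatile: list = field(default_factory=list)
    persisted: list = field(default_factory=list)

    def write(self, data: bytes, commit: bool = False) -> bool:
        """Accept a write; return True only once the data has been persisted."""
        self.volatile.append(data)
        if commit:
            # The commit bit rides on the last write, so no separate
            # higher-level ACK exchange is needed to request the flush.
            self.persisted.extend(self.volatile)
            self.volatile.clear()
            return True
        # Completion without commit says nothing about persistence.
        return False
```

A sender would set `commit=True` only on the final write of an update and treat the `True` result as the durability acknowledgement.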
IOPS, RDMA vs 10Gbps: GlusterFS 2x replication, 2 clients
Biggest gain with reads, little gain for small I/O.
[Charts: Sequential I/O and Random I/O, 1024-byte transfers]
● Reduce protocol traffic (discussed further in the next section)
● Coalesce protocol operations
○ With this, observed a 10% gain in small file creates on Gluster
● Pipelining
○ In Ceph, on two updates to the same object, start replicating the second before the first completes
Improving Network Latency
Other techniques
Adding SCM to Parts of System
Kernel and application level tiering
[Diagrams: DM-cache, Ceph tiering]
● Heterogeneous storage in single volume
○ Fast/expensive storage cache for slower storage
○ Fast “Hot tier” (e.g. SSD, SCM)
○ Slow “Cold tier” (e.g. erasure coded)
● Database tracks files
○ Pro: easy metadata manipulation
○ Con: very slow O(n) enumeration of objects for promotion/demotion scans.
● Policies:
○ Data put on hot tier, until “full”
○ Once “full”, data “promoted/demoted” based on access frequency
Gluster Tiering
Illustration of network problem
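The two policies above can be sketched as a toy model. This is not Gluster's implementation: the `TieredVolume` class, its capacity measured in number of files, and the in-memory counter standing in for the tracking database are all assumptions for illustration.

```python
from collections import Counter


class TieredVolume:
    """Toy two-tier volume: files go to the hot tier until it is full,
    then are promoted or demoted by access frequency."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = set()
        self.cold = set()
        self.hits = Counter()  # stand-in for the database that tracks files

    def create(self, name):
        # Policy 1: new data lands on the hot tier until it is "full".
        tier = self.hot if len(self.hot) < self.hot_capacity else self.cold
        tier.add(name)

    def access(self, name):
        self.hits[name] += 1

    def rebalance(self):
        # Policy 2: keep the most frequently accessed files on the hot tier.
        # This visits every file, like the full enumeration noted above.
        ranked = sorted(self.hot | self.cold, key=lambda n: (-self.hits[n], n))
        self.hot = set(ranked[: self.hot_capacity])
        self.cold = set(ranked[self.hot_capacity:])
```

Calling `rebalance()` periodically plays the role of the promotion/demotion scan; its cost grows with the total number of files, which is the weakness the slide attributes to the database approach.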
[Figure: Gluster Tiering]
● Tiering helped large I/Os, not small
● Pattern seen elsewhere:
○ RDMA tests
○ Customer feedback, overall GlusterFS reputation
● Observed many “LOOKUP operations” over network
● Hypothesis: metadata transfers dominate data transfers for small files
○ Speeding up small-file data transfer therefore fails to improve overall I/O latency
Gluster’s “Small File” Tiering Problem
Analysis
● On an open(), the client’s VFS layer tests each directory in the path
○ Full path traversal, e.g. d1/d2/d3/f1
○ Checks existence
○ Checks permission
Understanding LOOKUPs in Gluster
Problem : Path Traversal
[Diagram: d1, d2, d3, f1]
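A minimal sketch of the traversal, listing the prefixes that would each be checked for a relative path. The helper name `lookups_for` is hypothetical, and the sketch only enumerates the prefixes; it does not perform real existence or permission checks.

```python
from pathlib import PurePosixPath


def lookups_for(path):
    """Return each path prefix that must be looked up, in traversal order."""
    parts = PurePosixPath(path).parts
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


if __name__ == "__main__":
    for prefix in lookups_for("d1/d2/d3/f1"):
        print("LOOKUP", prefix)
```

For d1/d2/d3/f1 this yields four entries, one per path component.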
● Distributed hash space is split in parts
○ Unique space for each directory
○ Each node owns a piece of this “layout”
○ Stored in extended attributes
○ When new nodes are added to the cluster, the layouts are rebuilt
● When a file is opened, the entire layout is rechecked for each directory
○ Each node receives a lookup to retrieve its part of the layout
● Work is underway to improve this.
Understanding LOOKUPs in Gluster
Problem : Coalescing Distributed Hash Ranges
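A simplified sketch of a per-directory layout, assuming a 32-bit hash space divided evenly among nodes. The use of `zlib.crc32` is a stand-in and not Gluster's actual hash function, and the helper names `build_layout` and `locate` are hypothetical.

```python
import zlib

HASH_SPACE = 2**32  # assumed 32-bit hash space


def build_layout(nodes):
    """Split the hash space into one contiguous range per node."""
    step = HASH_SPACE // len(nodes)
    layout = []
    for i, node in enumerate(nodes):
        start = i * step
        # The last node absorbs any remainder so the ranges cover the whole space.
        end = HASH_SPACE - 1 if i == len(nodes) - 1 else start + step - 1
        layout.append((start, end, node))
    return layout


def locate(layout, filename):
    """Return the node whose range contains the hash of the file name."""
    h = zlib.crc32(filename.encode("utf-8"))  # stand-in hash, NOT Gluster's
    for start, end, node in layout:
        if start <= h <= end:
            return node
    raise LookupError("layout has a hole")


if __name__ == "__main__":
    layout = build_layout(["S1", "S2", "S3", "S4"])
    print(locate(layout, "f1"))
```

Because every node holds only its own range, a client that wants to validate the whole layout of a directory has to hear from every node, which is where the per-directory LOOKUP fan-out comes from. Adding a node changes the ranges, so the layout must be rebuilt.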
LOOKUP Amplification
[Diagram: client (VFS layer, Gluster client) sending LOOKUPs for d1, d2, d3, f1 to Gluster servers S1, S2, S3, S4]
● Path d1/d2/d3/f1: four LOOKUPs
● Four servers
● 16 LOOKUPs total in the worst case
● Path used in Ben England’s “smallfile” utility: /mnt/p6.1/file_dstdir/192.168.1.2/thrd_00/d_000
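The worst-case arithmetic can be written down directly: one LOOKUP per path component, each fanned out to every server. The function name is hypothetical, and the model assumes every component triggers a LOOKUP on every server.

```python
def worst_case_lookups(path, servers):
    """Worst case: every path component is looked up on every server."""
    components = [p for p in path.split("/") if p]
    return len(components) * servers


if __name__ == "__main__":
    print(worst_case_lookups("d1/d2/d3/f1", servers=4))
```

For d1/d2/d3/f1 on four servers this gives 4 x 4 = 16, matching the figure above. Deeper paths, such as the one used by smallfile, scale the count linearly with depth.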
● Cache file metadata at client long term
○ WIP - under development
● Invalidate cache entry on another client’s change
○ Invalidate intelligently, not spuriously
○ Some attributes may change a lot (ctime, ..)
Client Metadata Cache: Tiering
Gluster’s “md-cache” translator
[Chart legend: red is cached]
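A toy sketch of a client-side metadata cache with selective invalidation. This is not the md-cache translator's code: the `MetadataCache` class, the `fetch` callback standing in for a network LOOKUP, and the choice to ignore ctime-only changes are all assumptions made to illustrate “invalidate intelligently, not spuriously”.

```python
class MetadataCache:
    """Toy client-side metadata cache with server-driven invalidation."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable: path -> metadata (a network LOOKUP)
        self._entries = {}
        self.network_lookups = 0

    def lookup(self, path):
        # Only go to the network when the entry is not cached.
        if path not in self._entries:
            self.network_lookups += 1
            self._entries[path] = self._fetch(path)
        return self._entries[path]

    def invalidate(self, path, changed_attrs):
        # Assumed policy: a notification that only touches a frequently
        # changing attribute (here ctime) does not evict the entry.
        if set(changed_attrs) - {"ctime"}:
            self._entries.pop(path, None)
```

Repeated lookups of the same path then cost one network round trip until another client changes something the cache cares about.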
Client Metadata Cache: RDMA
One client, 16K 64-byte files, 8 threads, 14Gbps Mellanox IB
[Chart legend: red is cached]
● Near term (2016)
○ Add-remove brick, partially in review stages
○ Md-cache integration, working with SMB team
○ Better performance counters
● Longer term (2017)
○ Policy based tiering
○ More than two tiers?
○ Replace database with “hitsets” (Ceph-like bloom filters)?
Tiering
Development notes
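To make the “hitsets” idea concrete, here is a minimal Bloom filter that records which files were accessed during a period. It is a generic sketch, not Ceph's HitSet code; the class name, the sizes, and the use of SHA-256 to derive bit positions are illustrative assumptions.

```python
import hashlib


class HitSet:
    """Tiny Bloom filter recording which objects were accessed in a period."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8 + 1)

    def _positions(self, name):
        # Derive `hashes` bit positions by salting the name with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, name):
        for pos in self._positions(name):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, name):
        # May return a false positive, never a false negative.
        return all(
            self.array[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(name)
        )
```

Recording an access is constant time and fixed memory, in contrast to a database that must be enumerated; the trade-off is that membership answers can be false positives.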
● Network latency reductions
○ Use RDMA
○ Reduce round trips by streamlining the protocol, coalescing operations, etc.
○ Cache at client
● CPU latency reductions
○ Aggressively optimize / shrink stack
○ Remove/replace large components
● Consider
○ SCM as a tier/cache
○ 2-way replication
Summary and Discussion
Distributed storage and SCM pose unique problems with latency.
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
