Gluster Metrics: why they are crucial for running stable deployments of all sizes
Gluster Metrics
Running stable deployments of all sizes
David Hasson
Production Engineering
• Collect data locally on each host
• Use a transport system on each host to a central location
• Aggregate at central location
• Interact with and write detectors by querying the central system
Reporting and monitoring at web-scale
(and you can too!)
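The four steps above can be sketched in a few lines. This is a minimal illustration only, not the tooling described in the talk: the record layout, the `CentralStore` class, the host names, and the `factor` threshold are all hypothetical, and an in-memory list stands in for the transport system.

```python
def collect_local(host, latencies_us):
    """Step 1: each host summarizes its own raw measurements."""
    return {
        "host": host,
        "count": len(latencies_us),
        "mean_latency_us": sum(latencies_us) / len(latencies_us),
    }


class CentralStore:
    """Steps 2 and 3: receive per-host records and aggregate them."""

    def __init__(self):
        self.records = []

    def ingest(self, record):
        # Stand-in for the transport system; a real one ships records over the network.
        self.records.append(record)

    def fleet_mean_latency_us(self):
        # Weight each host's mean by its request count to get the fleet-wide mean.
        total = sum(r["count"] for r in self.records)
        return sum(r["mean_latency_us"] * r["count"] for r in self.records) / total


def slow_hosts(store, factor=2.0):
    """Step 4: a detector written as a query against the central store."""
    fleet = store.fleet_mean_latency_us()
    return [r["host"] for r in store.records if r["mean_latency_us"] > factor * fleet]


store = CentralStore()
store.ingest(collect_local("host-a", [100, 120, 110]))
store.ingest(collect_local("host-b", [90, 95, 100]))
store.ingest(collect_local("host-c", [900, 1100, 1000]))
print(slow_hosts(store))
```

Because the detector only queries the central store, new detectors can be added without touching the hosts.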
How we identify issues
• Alarms
• Request latency
• Error rates for specific errors
• Space usage alarms
• Customers
• Also, our customers tell us about real problems
• Key customer metrics
Using Data to find problems
• Log data
• Useful for finding errors on the various daemons
• Categorized by a tailing system
• The iostats translator
• Key metric calculations
• I/O sampling
• Changes in GlusterFS to export data
• Periodic exports of timeseries data, columned records
Gluster Metrics
1. What we look at today
2. In the pipeline: next steps
3. What we’d like to see in the future
Top line metrics
• SLA unweighted average
• SLA error rate / sec
Full system level stats
1. System load
2. Network I/O
Aggregate vs. Host Level
Disk stats on the system
• IO Scheduler / Block Layer
• Await
• Queue depths
• IO sizes
• read / write sec
• capacity of the filesystem
• Overall I/O operations/sec
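On Linux, several of these block-layer numbers can be derived from two snapshots of /proc/diskstats, in the same way iostat derives them. The sketch below assumes the standard field layout of that file; the two snapshot lines and the 10-second interval are made-up values for illustration, not data from the talk.

```python
def parse_diskstats_line(line):
    """Pick the counters we need out of one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),         # reads completed
        "read_ms": int(f[6]),       # milliseconds spent reading
        "writes": int(f[7]),        # writes completed
        "write_ms": int(f[10]),     # milliseconds spent writing
        "weighted_ms": int(f[13]),  # weighted milliseconds spent doing I/O
    }


def disk_metrics(before, after, interval_s):
    """Derive await, IOPS, and average queue depth from two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    busy_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return {
        "await_ms": busy_ms / ios if ios else 0.0,
        "iops": ios / interval_s,
        "avg_queue_depth": (after["weighted_ms"] - before["weighted_ms"]) / (interval_s * 1000),
    }


# Hypothetical snapshots of the same device taken 10 seconds apart.
t0 = parse_diskstats_line("8 0 sda 1000 0 80000 5000 2000 0 160000 20000 0 9000 25000")
t1 = parse_diskstats_line("8 0 sda 1500 0 120000 7000 2500 0 200000 26000 3 12000 33000")
metrics = disk_metrics(t0, t1, interval_s=10)
print(metrics)
```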
System's view of gluster
• CPU, memory on:
• Management Daemon (glusterd)
• Brick Daemons (glusterfsd)
• Protocol Daemons (glusterd / nfs, glusterd / gfproxy)
The Brick Daemons
Gluster IOstats (timeseries)
• Major statistics piece, available upstream
• Outputs important metrics
• Mean / Top latency
• Broken down by op type (WRITE, READ, etc)
• Includes aggregate (“unweighted”) average of operations
• Useful for top line / SLA alarms
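The difference between an unweighted and a weighted average can be shown with a small example. It assumes that "unweighted" means the plain mean of the per-operation mean latencies, ignoring how many calls each operation type received; the slides do not spell out the definition, and the numbers below are invented.

```python
# Hypothetical per-op statistics: op type -> (mean latency in microseconds, call count).
per_op = {
    "READ": (200, 9000),
    "WRITE": (400, 900),
    "CREATE": (3000, 100),
}

# Unweighted: every op type counts equally, however rarely it is called.
unweighted = sum(mean for mean, _ in per_op.values()) / len(per_op)

# Weighted: each op type contributes in proportion to its call count.
total_calls = sum(count for _, count in per_op.values())
weighted = sum(mean * count for mean, count in per_op.values()) / total_calls

print(unweighted, weighted)
```

Under this reading, a regression in a rare operation such as CREATE moves the unweighted figure far more than the weighted one, which is one reason it can serve as a top line alarm.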
Io-stats / Latency
Gluster IOstats (sampling)
• Implemented within the io-stats translator
• Dumped to a file for ingestion by infrastructure
• Controlled with volume options
• diagnostics.fop-sample-interval: 10
1490120858.99,NORMAL,CREATE,33727.0,...host01.fb.com,777,Unknown,/neat/path/name,0,No Error
1490120860.52,LEAST,STAT,67.0,...host03.fb.com,980,/groot/namespace1,<gfid:xyz>,0,No Error
1490120865.51,LEAST,STAT,279033.0,...host18.fb.com,980,/groot/namespace1,<gfid:928>,0,No Error
1490120870.54,HIGH,STAT,70.0,...host21.fb.com,911,/groot/namespace5,<gfid:842>,0,No Error
1490120880.55,HIGH,STATFS,31.0,...host29.fb.com,870,Unknown,<gfid:842>,0,No Error
1490120895.53,LEAST,READDIRP,867.0,...host28.fb.com,874,/groot/namespace1,<gfid:6b3>,0,Success
Sampling data
NFS Daemon samples
Gluster IOstats (sampling)
• Data can be brought into a centralized datastore
• Derived metrics/diagnostics
• Fleetwide error rate “drilldowns”
• Information per-client, per-datacenter, etc.
• Captures outliers
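As a rough sketch of such derived metrics, the sample records shown earlier can be parsed and queried directly. The column meanings used here (timestamp, priority, operation, latency, client, and the last two columns as error code and error string) are guesses from the sample lines, not a documented format, and the elided "..." parts are kept exactly as they appear. The 10x-median outlier rule is also just an illustrative choice.

```python
import csv
from collections import defaultdict
from statistics import median

SAMPLES = """\
1490120858.99,NORMAL,CREATE,33727.0,...host01.fb.com,777,Unknown,/neat/path/name,0,No Error
1490120860.52,LEAST,STAT,67.0,...host03.fb.com,980,/groot/namespace1,<gfid:xyz>,0,No Error
1490120865.51,LEAST,STAT,279033.0,...host18.fb.com,980,/groot/namespace1,<gfid:928>,0,No Error
1490120870.54,HIGH,STAT,70.0,...host21.fb.com,911,/groot/namespace5,<gfid:842>,0,No Error
1490120880.55,HIGH,STATFS,31.0,...host29.fb.com,870,Unknown,<gfid:842>,0,No Error
1490120895.53,LEAST,READDIRP,867.0,...host28.fb.com,874,/groot/namespace1,<gfid:6b3>,0,Success
"""


def parse_samples(text):
    """Turn CSV sample lines into dicts; field names are assumed, not official."""
    rows = []
    for f in csv.reader(text.splitlines()):
        rows.append({
            "ts": float(f[0]),
            "priority": f[1],
            "fop": f[2],
            "latency": float(f[3]),
            "client": f[4],
            "errno": int(f[-2]),
            "error": f[-1],
        })
    return rows


def error_rate(rows):
    """Fraction of sampled operations that returned a non-zero error code."""
    return sum(1 for r in rows if r["errno"] != 0) / len(rows)


def latency_outliers(rows, factor=10.0):
    """Samples whose latency exceeds `factor` times the median for their op type."""
    by_fop = defaultdict(list)
    for r in rows:
        by_fop[r["fop"]].append(r["latency"])
    medians = {fop: median(vals) for fop, vals in by_fop.items()}
    return [r for r in rows if r["latency"] > factor * medians[r["fop"]]]


rows = parse_samples(SAMPLES)
outliers = latency_outliers(rows)
print(error_rate(rows), [(r["client"], r["fop"], r["latency"]) for r in outliers])
```

The same rows can be grouped by client or by any other column to produce per-client or per-datacenter views.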
Gluster Metrics
1. What we look at today
2. In the pipeline: next steps
3. What we’d like to see in the future
p90/p95 data from iostats
• p95 is recognized as a better indicator of “overall badness” in system metrics
• Implemented in patches to io-stats
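A small worked example shows why a high percentile can expose problems that other summaries hide. This is a generic nearest-rank percentile over invented latency values, not the method used in the io-stats patches, which the slides do not describe.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in microseconds: 90 fast requests and 10 very slow ones.
latencies = [50] * 90 + [5000] * 10

mean_latency = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p90 = percentile(latencies, 90)
p95 = percentile(latencies, 95)
print(mean_latency, p50, p90, p95)
```

In this data set the median and p90 both report the fast path, and the mean describes no request that actually happened, while p95 lands on the slow requests that one client in ten experiences.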
Client stats (nfusr)
• Better “empathy” when looking at client side metrics
• Implemented in the nfusr code, but designed to the same output specification as Gluster
Measuring Reliability: where we measure it
[Diagram: Capture Metrics. Labels: NFS Server, Kernel NFS Mount, Traffic, User Application, I/O]
Measuring Reliability: where we measure it
[Diagram: Capture Metrics. Labels: NFS Server, nfusr, Traffic, User Application, I/O]
Per namespace stats
• Currently done in the first version of the throttling changes
• New stats are done in the new QoS code (see our other talk)
• Allows customer-based reporting, and provides data on how well QoS is working
Gluster Metrics
1. What we look at today
2. In the pipeline: next steps
3. What we’d like to see in the future
Future Ideas
• Per-translator metrics
• Loose (accounting only) quotas