SRE Demystified - 07 - Practical Alerting
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer
SRE
[Figure; image source: https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257]
Monitoring
• Monitoring a very large system is challenging for two reasons:
  • The sheer number of components being analyzed
  • The need to keep the maintenance burden reasonably low for the engineers responsible for the system
• A large system should therefore be designed to aggregate signals and prune outliers
• Monitoring systems need to support alerting on high-level service objectives while retaining the granularity to inspect individual components as needed
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Borgmon monitoring at Google
• White-box monitoring
• Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format
• This enables mass data collection with low overhead and avoids the costs of subprocess execution and network connection setup
• The data is used both for rendering charts and for creating alerts, both of which are accomplished using simple arithmetic
• To facilitate mass collection, the metrics format had to be standardized
Instrumentation of applications
• Exported variables can be map-valued, so that one variable name carries a table of values keyed by a label
• An example map-valued variable, showing 25 HTTP 200 responses and 12 HTTP 500s:
  http_responses map:code 200:25 404:0 500:12
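To make the format concrete, here is a minimal sketch of reading the example line back into a structure. The function name `parse_map_variable` is hypothetical, and the grammar is inferred from this single example only; it is not Borgmon's actual parser.

```python
def parse_map_variable(line: str) -> tuple[str, str, dict[str, int]]:
    """Split 'name map:label key:value ...' into (name, label, values)."""
    name, kind, *pairs = line.split()
    prefix, _, label = kind.partition(":")
    if prefix != "map" or not label:
        raise ValueError(f"not a map-valued variable: {line!r}")
    values = {}
    for pair in pairs:
        # rpartition keeps any earlier colons inside the key.
        key, _, raw = pair.rpartition(":")
        values[key] = int(raw)
    return name, label, values


name, label, values = parse_map_variable(
    "http_responses map:code 200:25 404:0 500:12"
)
print(name, label, values)
```

Each key in the resulting dictionary corresponds to one value of the `code` label, which is what lets a collector turn one exported line into several labeled time-series.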
Storage in the Time-Series Arena
• A service is typically made up of many binaries running as
many tasks, on many machines, in many clusters
• Borgmon needs to keep all that data organized while allowing flexible querying and slicing
• Borgmon stores all the data in an in-memory database that is regularly checkpointed to disk
• Data points have the form (timestamp, value) and are stored in chronological lists called time-series
• Each time-series is named by a unique set of labels of the form name=value
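The storage model described above can be sketched as a dictionary from labelsets to chronological lists of points. This is an illustrative toy under stated assumptions: the class name `TimeSeriesArena` and its methods are hypothetical, and checkpointing to disk is left out.

```python
from collections import defaultdict


class TimeSeriesArena:
    """Toy in-memory store: one chronological list of points per labelset."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, labels: dict, timestamp: float, value: float) -> None:
        key = frozenset(labels.items())  # unique, order-independent name
        points = self._series[key]
        if points and timestamp <= points[-1][0]:
            raise ValueError("points must arrive in chronological order")
        points.append((timestamp, value))

    def series(self, labels: dict) -> list:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._series.get(frozenset(labels.items()), []))


arena = TimeSeriesArena()
host0 = {"var": "http_requests", "job": "webserver", "instance": "host0:80"}
arena.append(host0, 0, 10)
arena.append(host0, 60, 25)
print(arena.series(host0))
```

Using a `frozenset` of (name, value) pairs as the key means two labelsets with the same labels in a different order name the same time-series.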
Storage in the Time-Series Arena
[Figure: A time-series for errors, labeled by the original host each was collected from]
Labels and Vectors
• Time-series are stored as sequences of numbers and
timestamps, which are referred to as vectors
• Like vectors in linear algebra, these vectors are slices and cross-sections of
the multidimensional matrix of data points in the arena
• The name of a time-series is a labelset, because it is implemented as a set of labels expressed as key=value pairs
• One of these labels is the variable name itself: the key that appears on the varz page
Labels and Vectors
• Example variable expression
{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}
Label     Meaning
var       The name of the variable
job       The name given to the type of server being monitored
service   A loosely defined collection of jobs that provide a service to users, either internal or external
zone      Location of the Borgmon that performed the collection of this variable
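As a rough illustration of how a labelset can be used to slice the data, the sketch below parses the example expression and tests it against a partial selector. The helpers `parse_labelset` and `matches` are hypothetical names, and the parsing assumes label values never contain a comma.

```python
def parse_labelset(expr: str) -> dict:
    """Turn '{k=v,k=v}' into a dict; assumes values contain no ','."""
    body = expr.strip().strip("{}")
    return dict(item.split("=", 1) for item in body.split(","))


def matches(labelset: dict, selector: dict) -> bool:
    """True when every label in the selector has the same value in labelset."""
    return all(labelset.get(k) == v for k, v in selector.items())


full = parse_labelset(
    "{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}"
)
print(matches(full, {"var": "http_requests", "job": "webserver"}))
print(matches(full, {"zone": "us-east"}))
```

A selector that names only some labels matches every time-series that agrees on those labels, which is how a query can return a vector of many series at once.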
Rule Evaluation
• The Borgmon program code, also known as Borgmon
rules, consists of simple algebraic expressions that
compute time-series from other time-series
• Rules run in a parallel threadpool where possible, but are
dependent on ordering when using previously defined
rules as input
• Aggregation is the cornerstone of rule evaluation in a
distributed environment
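The two ideas above, rules that build on earlier rules and aggregation across tasks, can be sketched in a few lines. This is not Borgmon rule syntax: `rate`, `sum_without`, and `name` are hypothetical helpers, the sample counters are made up, and counter resets are not handled.

```python
from collections import defaultdict


def rate(points, window):
    """Per-second increase of a counter over the trailing window (seconds)."""
    latest_ts, latest_val = points[-1]
    recent = [(t, v) for t, v in points if t >= latest_ts - window]
    first_ts, first_val = recent[0]
    if latest_ts == first_ts:
        return 0.0
    return (latest_val - first_val) / (latest_ts - first_ts)


def sum_without(values, label):
    """Sum values across series after dropping one label from each labelset."""
    totals = defaultdict(float)
    for labels, value in values.items():
        key = frozenset((k, v) for k, v in labels if k != label)
        totals[key] += value
    return dict(totals)


def name(**labels):
    """Build a labelset usable as a dictionary key."""
    return frozenset(labels.items())


counters = {
    name(job="webserver", instance="host0:80"): [(0, 0), (600, 1200)],
    name(job="webserver", instance="host1:80"): [(0, 0), (600, 600)],
}
# Rule 1: per-task rate. Rule 2 consumes rule 1, so it must run afterwards.
task_rates = {labels: rate(points, 600) for labels, points in counters.items()}
dc_rates = sum_without(task_rates, "instance")
print(dc_rates)
```

The per-task rates could be computed in parallel, but the aggregate has to wait for them, which is the ordering dependency the slide mentions.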
Example Rule
[Figure: example Borgmon rule (slide image not reproduced)]
Example Alert Rule
• Creates an alert when the error ratio over 10 minutes exceeds
1% and the total number of errors exceeds 1 per second
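The condition described above can be written as a small predicate. This is a sketch of the logic only, not the Borgmon alert rule itself; the function name `should_alert` and its parameters are assumptions, with counts taken over one 10-minute window.

```python
def should_alert(errors: int, requests: int, window_seconds: int = 600) -> bool:
    """Both conditions from the slide must hold over the same window."""
    if requests == 0:
        return False  # no traffic, so no meaningful ratio
    error_ratio = errors / requests
    errors_per_second = errors / window_seconds
    return error_ratio > 0.01 and errors_per_second > 1.0


print(should_alert(errors=900, requests=60_000))   # both thresholds crossed
print(should_alert(errors=30, requests=1_000))     # high ratio, too few errors
print(should_alert(errors=900, requests=200_000))  # many errors, low ratio
```

Requiring an absolute error rate in addition to the ratio presumably keeps the alert quiet when traffic is so low that a handful of errors would push the ratio over 1%.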
Maintaining the configuration
• Borgmon configuration separates the definition of the rules from the targets being monitored
• Borgmon also supports language templates, which fall into two classes:
  • The first class simply codifies the emergent schema of variables exported from a given library of code; such templates exist for the HTTP server library, memory allocation, and the storage client library
  • The second class manages the aggregation of data from a single-server task to the global service footprint
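One way to picture the separation of rules from targets, and the two classes of template, is the sketch below. Everything in it is illustrative: the target list, the rule dictionaries, the output names, and the functions `http_library_rules`, `aggregation_rules`, and `build_config` are assumptions, not Borgmon's template language.

```python
TARGETS = {"webserver": ["host0:80", "host1:80"]}  # hypothetical targets


def http_library_rules(job):
    """First class: rules for variables a (hypothetical) HTTP library exports."""
    return [{"output": "task:http_requests:rate10m", "op": "rate",
             "input": "http_requests", "job": job}]


def aggregation_rules(job, task_rules):
    """Second class: roll each per-task series up to the whole job."""
    return [{"output": rule["output"].replace("task:", "dc:", 1),
             "op": "sum_without_instance", "input": rule["output"], "job": job}
            for rule in task_rules]


def build_config(targets):
    """Apply the same rule templates to every job in the target list."""
    config = {}
    for job, instances in targets.items():
        task_rules = http_library_rules(job)
        config[job] = {"targets": instances,
                       "rules": task_rules + aggregation_rules(job, task_rules)}
    return config


config = build_config(TARGETS)
for rule in config["webserver"]["rules"]:
    print(rule["output"], "<-", rule["op"], rule["input"])
```

Because the templates take the job as a parameter, adding a new job to `TARGETS` reuses the same rules without any rule being rewritten.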