Monitoring service level metrics

Ayush Pandey

🚀 PhD Candidate | Secure Edge Architect | Embedded AI Systems | Resilience & Runtime Monitoring | Cloud-to-Firmware Security

Published Sep 23, 2021

To judge whether your system is reliable, available, or useful, you need a solid understanding of SLOs, SLAs, and SLIs.

A Service-Level Objective (SLO) sets a precise numerical target for system availability. Any future discussion about whether the system is running reliably, and whether it needs design or architectural changes, must be framed in terms of the system continuing to meet this SLO.

A Service-Level Agreement (SLA) normally involves a promise to a service user that the service's availability SLO will be met at a certain level over a certain period. Because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month in the SLA, with an internal availability SLO of 99.95%.
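To make the gap between the two targets concrete, here is a minimal sketch that turns an availability SLO into minutes of allowed downtime. The function name and the 30-day month are assumptions made for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of allowed unavailability for an availability SLO.

    `slo` is a fraction such as 0.999; `window_days` is the measurement
    window (a 30-day month is assumed here).
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


external = error_budget_minutes(0.999)   # SLO promised in the SLA
internal = error_budget_minutes(0.9995)  # tighter internal SLO

print(f"SLA budget: {external:.1f} min, internal budget: {internal:.1f} min")
```

Under these assumptions the SLA allows roughly 43.2 minutes of downtime per month and the internal SLO roughly 21.6 minutes, so the team has about half of the contractual budget as a safety margin.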

A Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of the system.
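As a rough sketch of that definition, the SLI can be computed as successful probes divided by total probes and then compared against an SLO. The helper names and the sample probe counts below are made up for illustration.

```python
from typing import Iterable


def availability_sli(probe_results: Iterable[bool]) -> float:
    """Fraction of probes that succeeded (True = success)."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return sum(results) / len(results)


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo


# Hypothetical sample: 10,000 probes, 8 of which failed.
probes = [True] * 9992 + [False] * 8
sli = availability_sli(probes)

print(f"SLI = {sli:.4f}")
print("meets 99.9% SLO:", meets_slo(sli, 0.999))
print("meets 99.95% SLO:", meets_slo(sli, 0.9995))
```

With this sample the service would satisfy a 99.9% SLO but miss a 99.95% internal target.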

The number of risks identified for each application varies greatly depending on the maturity of your application and team, and on the target level for reliability or performance.

Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant:

The likelihood of the risk occurring in a given time period.
The impact that would be felt if the risk materializes.

Qualitative concepts of impact and likelihood can be converted to quantified values and used to calculate expected loss, using the concepts of Mean Time Between Failures (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF and MTTR values of each risk can then be used to develop a prioritized list of risks based on the expected impact on the annual error budget.
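One possible way to do this ranking is sketched below. It assumes a simple model in which a risk's expected loss per year is (incidents per year) x (MTTR) x (fraction of users affected); the `Risk` class, the `impact` field, and every number in the sample list are hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    mtbf_days: float     # mean time between failures, in days
    mttr_minutes: float  # mean time to recover, in minutes
    impact: float        # assumed fraction of users affected, 0..1

    def expected_bad_minutes_per_year(self) -> float:
        incidents_per_year = 365 / self.mtbf_days
        return incidents_per_year * self.mttr_minutes * self.impact


def rank_risks(risks, slo):
    """Sort risks by expected loss and report each one's budget share."""
    budget = (1 - slo) * MINUTES_PER_YEAR  # annual error budget in minutes
    ranked = sorted(
        risks, key=lambda r: r.expected_bad_minutes_per_year(), reverse=True
    )
    return [
        (
            r.name,
            r.expected_bad_minutes_per_year(),
            r.expected_bad_minutes_per_year() / budget,
        )
        for r in ranked
    ]


# Made-up risks, for illustration only.
risks = [
    Risk("bad config push", mtbf_days=90, mttr_minutes=60, impact=0.5),
    Risk("regional outage", mtbf_days=365, mttr_minutes=240, impact=1.0),
    Risk("dependency timeout", mtbf_days=30, mttr_minutes=15, impact=0.1),
]

ranked = rank_risks(risks, slo=0.999)
for name, minutes, share in ranked:
    print(f"{name}: {minutes:.1f} bad min/year ({share:.0%} of budget)")
```

The share of the annual budget makes it easier to see which risks deserve engineering effort first and which ones the budget can simply absorb.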

Avoiding a self-inflicted DDoS attack means using exponential back-off and retry marking to manage system load, and recovering gracefully from errors, which is quite a complex undertaking.
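A minimal client-side sketch of exponential back-off is shown below. Random jitter is added here as an assumption (it is a common companion to back-off, though the text does not name it), and the attempt number is passed to the operation so the caller can mark retries, for example in a request header. The function name, parameters, and defaults are all hypothetical.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[int], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation(attempt)` and retry transient failures with back-off.

    The delay ceiling doubles on every attempt and is capped at `max_delay`.
    The actual wait is drawn uniformly from [0, ceiling] so that many
    clients do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            # `attempt` lets the operation tag the request as a retry.
            return operation(attempt)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Capping both the number of attempts and the delay keeps a failing dependency from being hammered indefinitely, and the `sleep` parameter exists only so the waiting can be replaced in tests.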

It's important to understand SLO escalation policy thresholds and the rationale behind the trade-offs that particular teams make to maintain a high development velocity, if the latter is a business priority.

Dark launching can be a valuable tool when launching a new service on existing traffic. However, understanding the practicalities of a dark launch is equally important.

Overall, Site Reliability Engineering (SRE) aims to understand:

what the service does
day-to-day service operation (traffic variation, releases, experiment management, config pushes)
how the service tends to break and how this manifests in alerts
rough edges in monitoring and alerting
where the service configuration diverges from the SRE team’s practices
major operational risks for the service

The SRE team also considers:

whether the service follows SRE team best practices, and if not, how to retrofit it
how to integrate the service with the SRE team’s existing tools and processes
the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?

I found this book by Google really helpful: https://sre.google/sre-book/table-of-contents/

