Monitoring service level metrics

To judge whether your system is reliable, available, or even useful, you need a solid understanding of SLOs, SLAs, and SLIs.

A Service-Level Objective (SLO) sets a precise numerical target for system availability. Any future discussion about whether the system is running reliably, and whether any design or architectural changes are needed, must be framed in terms of the system continuing to meet this SLO.

A Service-Level Agreement (SLA) normally involves a promise to a service user that the service’s availability will meet a certain level over a certain period. Because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month in the SLA, with an internal availability SLO of 99.95%.
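To make the 99.9% and 99.95% figures concrete, here is a minimal sketch that turns an availability target into the downtime it allows. The function name and the 30-day month are my own assumptions, not something taken from the SRE book.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window.

    Assumes a 30-day month by default (an illustrative choice).
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


# Looser objective promised in the SLA vs. the tighter internal objective.
sla_budget = allowed_downtime_minutes(99.9)        # roughly 43 minutes per month
internal_budget = allowed_downtime_minutes(99.95)  # roughly 22 minutes per month
print(f"SLA: {sla_budget:.1f} min, internal: {internal_budget:.1f} min")
```

The gap between the two numbers is the safety margin the team has before the external promise is at risk.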

A Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of the system.
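As a rough illustration, an availability SLI of this kind can be computed as the share of probes that succeeded and then compared with the SLO. The function names and probe counts below are hypothetical.

```python
def availability_sli(successful_probes: int, total_probes: int) -> float:
    """Fraction of probes that succeeded (0.0 to 1.0)."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return successful_probes / total_probes


def meets_slo(sli: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_percent / 100


# Hypothetical probe counts for one measurement window.
sli = availability_sli(successful_probes=9_995, total_probes=10_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```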

The number of risks identified for each application varies greatly depending on the maturity of your application and team, and on the target level for reliability or performance.

Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant:

  • The likelihood of the risk occurring in a given time period.
  • The impact that would be felt if the risk materializes.

The qualitative concepts of impact and likelihood can be converted to quantified values and used to calculate expected loss, using the concepts of Mean Time Between Failure (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF and MTTR values of each risk can then be used to develop a prioritized list of risks based on their expected impact on the annual error budget.
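One way to sketch this prioritization: estimate how many "bad minutes" per year each risk is expected to cost, and rank the risks by the share of the annual error budget they consume. The formula (incidents per year × MTTR × fraction of users affected), the `Risk` class, and the example risks are simplifying assumptions of mine, not a prescribed method.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    mtbf_days: float        # mean time between failures, in days
    mttr_minutes: float     # mean time to recover, in minutes
    impact_fraction: float  # assumed share of users affected (0..1)

    def expected_bad_minutes_per_year(self) -> float:
        incidents_per_year = 365 / self.mtbf_days
        return incidents_per_year * self.mttr_minutes * self.impact_fraction


def prioritize(risks, slo_percent):
    """Return (name, bad minutes/year, share of error budget), costliest first."""
    budget = MINUTES_PER_YEAR * (1 - slo_percent / 100)
    ranked = sorted(risks, key=lambda r: r.expected_bad_minutes_per_year(),
                    reverse=True)
    return [(r.name,
             r.expected_bad_minutes_per_year(),
             r.expected_bad_minutes_per_year() / budget) for r in ranked]


# Hypothetical risks, for illustration only.
risks = [
    Risk("zone outage", mtbf_days=365, mttr_minutes=240, impact_fraction=1.0),
    Risk("bad config push", mtbf_days=30, mttr_minutes=60, impact_fraction=0.5),
]
for name, minutes, share in prioritize(risks, slo_percent=99.9):
    print(f"{name}: {minutes:.0f} bad min/year, {share:.0%} of error budget")
```

A frequent but small risk can outrank a rare but dramatic one, which is the kind of result that is hard to see from qualitative labels alone.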

Avoiding a self-inflicted DDoS attack, by using exponential back-off and retry marking to manage system load, and recovering gracefully from errors is quite a complex undertaking.
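A minimal sketch of the exponential back-off part, with random jitter and a cap on attempts so that many clients do not retry in lockstep. The function name, parameters, and default values are illustrative assumptions, and retry marking is not covered here.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=10.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponential back-off and jitter.

    The delay ceiling doubles after each failed attempt (capped at max_delay),
    and the actual wait is drawn uniformly between 0 and that ceiling.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up instead of adding more load
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` as a parameter is only a convenience for testing; in production code you would also want to retry only on errors that are actually retryable.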

It's important to understand SLO escalation policy thresholds and the rationale behind the trade-offs that particular teams make to maintain a high development velocity, if the latter is a business priority.

Sometimes, dark launching is a valuable tool when launching a new service on existing traffic. However, understanding the practicalities of a dark launch is equally important.
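To give a feel for the idea, here is a very simplified sketch of a dark launch: a sample of live requests is also sent to the new service, but its response is thrown away and its failures never reach the user. All names here are hypothetical, and a real setup would typically mirror traffic asynchronously and record metrics rather than call the new service inline.

```python
import random


def handle_request(request, primary, candidate, sample_rate=0.1,
                   rng=random.random):
    """Serve from the primary service and mirror a sample to the candidate."""
    response = primary(request)
    if rng() < sample_rate:
        try:
            candidate(request)  # result intentionally discarded
        except Exception:
            pass  # candidate failures must not affect the user
    return response
```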

Overall, Site Reliability Engineering (SRE) aims to understand:

  • what the service does
  • day-to-day service operation (traffic variation, releases, experiment management, config pushes)
  • how the service tends to break and how this manifests in alerts
  • rough edges in monitoring and alerting
  • where the service configuration diverges from the SRE team’s practices
  • major operational risks for the service

The SRE team also considers:

  • whether the service follows SRE team best practices, and if not, how to retrofit it
  • how to integrate the service with the SRE team’s existing tools and processes
  • the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?


I found this book by Google really helpful: https://sre.google/sre-book/table-of-contents/





