Monitoring service level metrics
To understand whether your system is reliable, available, or useful, a deep understanding of SLOs, SLAs, and SLIs is essential.
A Service-Level Objective (SLO) sets a precise numerical target for system availability. Any future discussion about whether the system is running reliably, and whether it needs design or architectural changes, must be framed in terms of the system continuing to meet this SLO.
A Service-Level Agreement (SLA) normally involves a promise to a service user that the service availability SLO will meet a certain level over a certain period. Because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month in the SLA, with an internal availability SLO of 99.95%.
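To make those two numbers concrete, here is a minimal sketch that converts an availability SLO into the downtime it tolerates. The function name and the 30-day month are assumptions made for illustration, not part of any standard.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that an availability SLO tolerates.

    Assumes a 30-day month by default; adjust window_days as needed.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


external = allowed_downtime_minutes(99.9)   # SLO promised in the SLA
internal = allowed_downtime_minutes(99.95)  # tighter internal SLO
print(f"SLA SLO allows about {external:.1f} minutes of downtime per month")
print(f"Internal SLO allows about {internal:.1f} minutes of downtime per month")
```

The gap between the two figures is the safety margin the team has before an internal miss becomes a broken promise to the customer.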
A Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of the system.
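As a rough sketch of that definition, an availability SLI can be computed as the share of probes that succeeded and then compared with the SLO. The function names and the boolean probe format are hypothetical, chosen only for this example.

```python
from typing import Iterable


def availability_sli(probe_results: Iterable[bool]) -> float:
    """Fraction of probes that succeeded (True means success)."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded in this window")
    return sum(results) / len(results)


def meets_slo(sli: float, slo_percent: float) -> bool:
    """Check a measured SLI (0..1) against an SLO given in percent."""
    return sli >= slo_percent / 100


# 10,000 probes in the window, 5 of which failed.
probes = [True] * 9995 + [False] * 5
sli = availability_sli(probes)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```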
The number of risks identified for each application varies greatly depending on the maturity of your application and team, and on the target level for reliability or performance.
Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant: impact and likelihood.
These qualitative concepts of impact and likelihood can be converted to quantified values and used to calculate expected loss with the concepts of Mean Time Between Failures (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF and MTTR values of each risk can then be used to develop a prioritized list of risks based on their expected impact on the annual error budget.
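One possible way to turn this into numbers is sketched below. It assumes the expected annual cost of a risk is incidents per year (365 / MTBF in days) times MTTR times the share of users affected; the risk names, their values, and the impact_fraction field are all hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    mtbf_days: float        # likelihood: mean time between failures
    mttr_minutes: float     # mean time to recover from one failure
    impact_fraction: float  # assumed share of users affected (0..1)

    def expected_bad_minutes_per_year(self) -> float:
        incidents_per_year = 365 / self.mtbf_days
        return incidents_per_year * self.mttr_minutes * self.impact_fraction


def annual_error_budget_minutes(slo_percent: float) -> float:
    return MINUTES_PER_YEAR * (1 - slo_percent / 100)


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Most expensive risk first."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes_per_year(), reverse=True)


# Hypothetical risk catalog, for illustration only.
risks = [
    Risk("zone outage", mtbf_days=730, mttr_minutes=240, impact_fraction=0.3),
    Risk("bad config push", mtbf_days=90, mttr_minutes=60, impact_fraction=0.5),
    Risk("database failover", mtbf_days=365, mttr_minutes=120, impact_fraction=1.0),
]

budget = annual_error_budget_minutes(99.9)
for risk in prioritize(risks):
    cost = risk.expected_bad_minutes_per_year()
    print(f"{risk.name}: {cost:.0f} min/year ({cost / budget:.0%} of budget)")
```

Sorting by expected cost puts frequent, moderately painful risks next to rare, severe ones on a single scale, which is what makes the list usable for prioritization.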
Avoiding a self-inflicted DDoS attack, by using exponential back-off and retry marking to manage system load and by recovering gracefully from errors, is quite a complex undertaking.
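A minimal client-side sketch of exponential back-off follows. It adds random jitter so that many clients do not retry in lockstep, and it passes the attempt number to the operation as one simple reading of retry marking, so a server could tell retries from first attempts. The function name, parameters, and defaults are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[int], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(attempt), retrying failures with capped, jittered back-off."""
    for attempt in range(max_attempts):
        try:
            # The attempt number lets the callee mark the request as a retry.
            return operation(attempt)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

In real use you would catch only errors that are safe to retry, such as timeouts, rather than every exception.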
It's important to understand SLO escalation policy thresholds and the rationale behind the trade-offs that particular teams make to maintain a high development velocity, if the latter is a business priority.
Sometimes, dark launching is a valuable tool to have when launching a new service on existing traffic. However, understanding the practicalities of a dark launch is equally important.
Overall, Site Reliability Engineering (SRE) aims to understand:
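As an illustration of what such thresholds might look like, the sketch below maps the share of the monthly error budget already consumed to an action. The threshold values and the actions are entirely hypothetical; each team negotiates its own.

```python
def budget_consumed(bad_minutes: float, slo_percent: float, window_days: int = 30) -> float:
    """Share of the window's error budget used so far (1.0 means fully spent)."""
    budget_minutes = window_days * 24 * 60 * (1 - slo_percent / 100)
    return bad_minutes / budget_minutes


# Hypothetical policy, checked from the strictest threshold down.
ESCALATION_THRESHOLDS = [
    (1.00, "freeze feature releases until the SLO is met again"),
    (0.75, "prioritize reliability work over new features"),
    (0.50, "notify the owning team and review recent incidents"),
]


def escalation_action(consumed: float) -> str:
    for threshold, action in ESCALATION_THRESHOLDS:
        if consumed >= threshold:
            return action
    return "no action; continue normal development"


# 30 bad minutes against a 99.9% monthly SLO (43.2 minutes of budget).
used = budget_consumed(30, 99.9)
print(f"{used:.0%} of budget used: {escalation_action(used)}")
```

A team that values development velocity would typically set these thresholds higher, accepting more risk before slowing releases.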
The SRE team also considers:
I found this book by Google really helpful: https://sre.google/sre-book/table-of-contents/