Site Reliability Engineering
Presenter Name: Keet Malin Sugathadasa
Designation: Associate Technical Lead
Presented By
Keet Malin Sugathadasa
Associate Tech Lead at Cognite
More than 3 years of experience in
various roles related to Software
Engineering
Contributor to npm and
Stack Overflow
Research interests: cyber
security, cloud computing,
distributed computing.
AGENDA
• What is Site Reliability Engineering (SRE)
• The 5 Pillars of SRE
• SLOs, SLIs, SLAs
• Error Budgets
• Toil
• Ensuring successful operation of a
production system
What is DevOps?
Just as Agile came in to close the gap between business analysts (BA)
and Dev, DevOps closed the gap between Dev and Ops
What is SRE?
• DevOps is a community-built set of practices, a culture;
• while SRE was developed inside Google as its secret sauce.
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Reduce Organizational Silos
• SRE teams share ownership of production with
developers
• SRE teams get involved in development at very early
stages
• Products may not have SRE support from the start.
When a product is onboarded, the following items get checked:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
Accept Failure as Normal
Blameless Postmortems
• When things have actually gone badly wrong,
whose fault is it?
• Answer: Nobody's. It's the system's fault.
It allowed people to act that way!
• Ask WHY, not WHO!
If nobody is blamed, people open up, and
then the chain of root causes comes out.
Agility [Devs] vs Stability [Ops]
• What is availability?
• Clear definitions
• How available do you want to be?
• Clear numerical indicators
• What do you do when the availability target is
not met?
SLI - SLO - SLA: Service Level what?
Service Level Indicator (SLI): a metric aggregated over time (e.g. 90th percentile, median)
• Batch throughput
• Failures per request
  • Is the ratio of errors to total requests received in the last 5 minutes < 1%?
• Request latency
  • Is the average latency of requests in the last 5 minutes < 300 ms?
  • Is the 90th percentile of request latency in the last 5 minutes < 300 ms?
Service Level Objective (SLO): the target the SLI needs to meet
• Is the answer to the indicator above YES 99.9% of the time?
• Monitor the SLIs over a long period before deciding on this number
Service Level Agreement (SLA): a legal agreement
• The level of reliability I promise, and what I will do if I do not deliver it
• Usually based on SLOs, but a business agreement
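
The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is only an illustration, not something from the talk: the `Window` type, the function names, and the treatment of empty windows are assumptions. Only the 1% error-ratio threshold and the 99.9% target are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one 5-minute window (hypothetical data shape)."""
    total: int
    errors: int


def error_ratio_ok(window: Window, threshold: float = 0.01) -> bool:
    """SLI check: is the error ratio in this window below the threshold?"""
    if window.total == 0:
        return True  # assumption: a window with no traffic counts as good
    return window.errors / window.total < threshold


def slo_compliance(windows: list[Window]) -> float:
    """Fraction of windows in which the SLI check answered YES."""
    if not windows:
        raise ValueError("need at least one window")
    good = sum(1 for w in windows if error_ratio_ok(w))
    return good / len(windows)


def slo_met(windows: list[Window], target: float = 0.999) -> bool:
    """SLO check: did the SLI answer YES often enough?"""
    return slo_compliance(windows) >= target


if __name__ == "__main__":
    # Made-up sample: 998 healthy windows and 2 with a 5% error ratio.
    sample = [Window(1000, 2)] * 998 + [Window(1000, 50)] * 2
    print(slo_compliance(sample), slo_met(sample))
```

The SLA does not appear in the code on purpose: it is a contract built on top of numbers like these, not a measurement.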
Risk and availability
• 100% availability is impossible.
• Each 9 you add to the SLO
increases your cost
• With each 9 you add, you lose some of your
comfort
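
To make "each 9" concrete, the sketch below converts a number of nines into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for `nines` nines (2 -> 99%, 3 -> 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        minutes = downtime_minutes_per_year(n)
        print(f"{availability:.3f}% -> {minutes:.1f} min/year")
```

Every extra nine cuts the allowed downtime by a factor of ten: 99.9% leaves roughly 525.6 minutes a year, 99.99% roughly 52.6.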
Error Budgets
• Once you decide on the SLO, you get X minutes in which the service may be unavailable.
• X is your error budget
• Once you have used up that budget, you can no longer release new features
• Both underspending and overspending are bad.
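
The error budget rule above can be sketched as a small tracker. The `ErrorBudget` class, its method names, and the 30-day window are hypothetical choices made for this example; the talk only states that the SLO fixes a budget of X minutes and that releases stop once it is spent.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float                     # e.g. 0.999 for 99.9%
    period_days: int = 30          # assumption: 30-day window
    downtime_minutes: float = 0.0  # unavailability recorded so far

    @property
    def budget_minutes(self) -> float:
        """X: total minutes of unavailability the SLO allows in the period."""
        return (1 - self.slo) * self.period_days * 24 * 60

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.downtime_minutes

    def record_outage(self, minutes: float) -> None:
        self.downtime_minutes += minutes

    def can_release(self) -> bool:
        """Feature releases are allowed only while budget remains."""
        return self.remaining_minutes > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999)
    budget.record_outage(30)
    print(round(budget.remaining_minutes, 1), budget.can_release())
```

With a 99.9% SLO over 30 days the budget is about 43.2 minutes, so a single 30-minute outage uses up most of it.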
Implement Gradual Change
Gradual change
• Updates should be rolled out as canaries, not as bulk version changes
• Smaller code changes mean a shorter mean time to recover on failure
• The rate of change depends on the chosen SLO
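
A canary rollout can be sketched as a loop over growing traffic shares that stops at the first unhealthy stage. Everything here is an assumption made for illustration: the stage percentages, the `error_rate_at` callback (standing in for whatever monitoring a real system queries), and the 1% threshold.

```python
from typing import Callable, Sequence


def canary_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> tuple[str, int]:
    """Expose the new version to growing traffic shares, one stage at a time.

    Returns ("rolled_back", percent) at the first stage whose observed error
    rate exceeds the threshold, otherwise ("promoted", last_percent).
    """
    if not stages:
        raise ValueError("need at least one stage")
    for percent in stages:
        # Hypothetical hook: observe the error rate at this traffic share.
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", stages[-1])


if __name__ == "__main__":
    # Made-up behaviour: the new version degrades at 10% of traffic or more.
    def flaky(percent: int) -> float:
        return 0.05 if percent >= 10 else 0.0

    print(canary_rollout([1, 10, 50, 100], flaky))
```

A failure caught at a small stage only affects that share of traffic, which is the point of not shipping the change in bulk.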
Tooling & Automation
Toil
Toil is the manual, repetitive work tied to running in PROD (which can be
automated)
Toil & toil budget
SREs actively measure toil. The toil budget should be
around 30% to 50% of working time.
If toil is not kept within its limits, it easily fills up to
100%.
But a small amount of toil is not harmful.
• Automation might be harder than the manual
work
• Helps newcomers to orient themselves
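
Measuring toil can be as simple as comparing logged toil hours with total working hours. The sketch below assumes hours are tracked per person per week and uses 50%, the upper end of the range on the slide, as the default budget; the function names are hypothetical.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def within_toil_budget(
    toil_hours: float, total_hours: float, budget: float = 0.5
) -> bool:
    """True while toil stays at or below the budget (assumed default: 50%)."""
    return toil_fraction(toil_hours, total_hours) <= budget


if __name__ == "__main__":
    # Made-up week: 12 hours of toil in a 40-hour week.
    print(toil_fraction(12, 40), within_toil_budget(12, 40))
```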
Measuring
Service reliability needs to be measured
• Uptime
• Mean time to failure
• Mean time to recover
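
The three measures are related: steady-state availability can be estimated as MTTF / (MTTF + MTTR). The sketch below computes all three from a list of running periods and repair periods; the function name and the sample figures are made up for illustration.

```python
def reliability_metrics(
    uptimes_h: list[float], repairs_h: list[float]
) -> dict[str, float]:
    """Compute MTTF, MTTR and availability from observed periods, in hours.

    uptimes_h: length of each period the service ran before failing.
    repairs_h: length of each period it took to recover afterwards.
    """
    if not uptimes_h or not repairs_h:
        raise ValueError("need at least one uptime and one repair period")
    mttf = sum(uptimes_h) / len(uptimes_h)
    mttr = sum(repairs_h) / len(repairs_h)
    return {
        "mttf_h": mttf,
        "mttr_h": mttr,
        "availability": mttf / (mttf + mttr),
    }


if __name__ == "__main__":
    # Made-up history: three failures, each followed by a repair.
    print(reliability_metrics([100, 200, 300], [1, 2, 3]))
```

The formula shows two ways to raise availability: fail less often (higher MTTF) or recover faster (lower MTTR).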
WhatsApp (example use case)
• Message delivery time
• Message throughput
• Image resolution (compression algorithm)
• Video compression quality
• Etc.
Hope is not a
Strategy!
Thank you
