SITE RELIABILITY ENGINEERING*
SEEN FROM DEVOPS AND AGILE PERSPECTIVES
*SERVICE
M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN
GAPS IN AGILE, DEVOPS APPROACHES
WHY ADDITIONAL OR SUPPLEMENTARY APPROACHES ARE NEEDED
*EDITORIAL
HOW OPS GETS OVERLOOKED
• No obvious “product” release cycle
• Keeping complex systems running is not primarily a software problem
• Ops troubleshooting may not follow any SDLC model
• Some Ops entails managing systems for which no code is readily available
PHILOSOPHICAL NOTES
• Technical approaches to privacy are inextricably tied to security
• Similarly, reliability engineering is also tied to security, and not just to “Availability”
• Quality engineering comfortably straddles both Dev and Ops
• Most quality engineering in practice is pure Ops
• Software engineering has immature notions of quality
• Supporting legacy systems may be more Ops than Dev
USE CASES
• Call center operations
• Field service
• Sales, sales support
• Most of health care (17.8% of US GDP spending)
• Rework and repair (all sectors)
• Financial services
• Government operations (e.g., voting systems, regulation, transportation management)
• Utilities
• Even the less obvious: decision support
SOFTWARE SUPPORTS OPS, BUT . . .
• Complex systems lack human-machine controls
• Humans are almost always “man in the middle” by design
• Ops were not designed to be automated
• Software only lightly mitigates labor increases when service
load increases
• Ops must encompass non-automated tasks
SITE RELIABILITY ENGINEERING
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.
SITE RELIABILITY WORKBOOK
Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne (O’Reilly Media)
Source
CREDIT GOOGLE
GOOGLE DEVELOPED SRE AND PUBLISHES A FREE ONLINE TEXT.
BEN TREYNOR SLOSS ORIGINATED THE TERM.
GOOGLE’S DEFINITION
“SRE IS WHAT YOU GET WHEN YOU TREAT OPERATIONS AS IF IT’S A SOFTWARE
PROBLEM. OUR MISSION IS TO PROTECT, PROVIDE FOR, AND PROGRESS THE SOFTWARE
AND SYSTEMS BEHIND ALL OF GOOGLE’S PUBLIC SERVICES — GOOGLE SEARCH, ADS,
GMAIL, ANDROID, YOUTUBE, AND APP ENGINE, TO NAME JUST A FEW — WITH AN EVER-
WATCHFUL EYE ON THEIR AVAILABILITY, LATENCY, PERFORMANCE, AND CAPACITY.”
SOURCE
WHAT IS IT?
• Quasi open standardized process (vs. “standard”)
• Scalable, proven (albeit inside deep pocket enterprises)
• Begun in 2003, it predated DevOps
• Left-shifts sysadmin functions
• But with strong skills in UNIX system internals and networking (Layers 1 to 3)
IS IT DEVOPS?
• “. . . We are distinct from the industry term DevOps, because
although we definitely regard infrastructure as code, we
have reliability as our main focus. Additionally, we are strongly
oriented toward removing the necessity for operations—
see The Evolution of Automation at Google for more details.”
IS IT DEVOPS? (PER GOOGLE)
“One could view DevOps as a generalization of several core SRE
principles to a wider range of organizations, management
structures, and personnel. One could equivalently view SRE as a
specific implementation of DevOps with some idiosyncratic
extensions.” (Chapter 1)
OPS SRE RESPONSIBILITIES
• Availability
• Latency
• Performance [sic]
• Efficiency*
• Change Management
• Monitoring*
• Emergency Response
• Capacity Planning
HOW SRE LEFT-SHIFTS OPS
• No more than 50% duty in Ops
• Remaining 50% is “coding skills on project work”
• Heavy reliance on “blame-free postmortem culture”
• Ed: Quality principle
• Ed: Implies analytics, evidence-, data-driven processes
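Ed: The 50% cap only works if someone measures it. A minimal Python sketch, assuming hours are logged per activity; the activity names and the sample week are hypothetical, not taken from the SRE book.

```python
OPS_CAP = 0.50  # the "no more than 50% duty in Ops" cap from the slide


def ops_fraction(hours_by_activity, ops_activities):
    """Share of logged hours spent on operational work."""
    total = sum(hours_by_activity.values())
    ops = sum(h for name, h in hours_by_activity.items() if name in ops_activities)
    return ops / total


# Hypothetical week for one engineer; category names are illustrative only.
week = {"on-call": 10, "tickets": 8, "manual tasks": 6, "project work": 16}
ops_load = ops_fraction(week, ops_activities={"on-call", "tickets", "manual tasks"})
over_cap = ops_load > OPS_CAP

print(f"ops load {ops_load:.0%}, over cap: {over_cap}")
```

When the measured share exceeds the cap, the book's remedy is to redirect excess operational work rather than to absorb it.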
SRE EVENT ANALYTICS
• Max of two events per 8- to 12-hour on-call shift
• No equivalent to these events in software engineering
• Tied to monitoring (alerts, tickets, logging)
• Emergency response is a useful event + event metrics
• MTTF and MTTR – MTTR is key
• Playbook* building as synthetic event / scenario construction
• “We have found that thinking through and recording the best practices ahead of time
in a ‘playbook’ produces roughly a 3x improvement in MTTR as compared to the
strategy of ‘winging it.’”
• “Wheel of Misfortune” (software engineering equivalent: Adversarial testing?)
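Ed: To make the MTTF / MTTR bullet concrete, here is a minimal Python sketch that derives both from an incident log. The incident timestamps and the observation window are made up, and MTTF is estimated simply as uptime divided by the number of incidents, which is one common convention rather than the only one.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 17, 22, 10), datetime(2024, 1, 17, 22, 40)),
    (datetime(2024, 2, 2, 4, 30), datetime(2024, 2, 2, 5, 45)),
]


def total_downtime(log):
    """Sum of all incident durations."""
    return sum((end - start for start, end in log), timedelta())


def mean_time_to_repair(log):
    """Average time from detection to resolution."""
    return total_downtime(log) / len(log)


def mean_time_to_failure(log, window_start, window_end):
    """Average uptime per incident over the observation window."""
    uptime = (window_end - window_start) - total_downtime(log)
    return uptime / len(log)


mttr = mean_time_to_repair(incidents)
mttf = mean_time_to_failure(incidents, datetime(2024, 1, 1), datetime(2024, 3, 1))
availability = mttf / (mttf + mttr)  # timedelta / timedelta yields a float

print(f"MTTR={mttr}, MTTF={mttf}, availability={availability:.4%}")
```

Because availability depends on the ratio of the two, shortening repair time raises it without any change in how often failures occur, which is one reading of "MTTR is key."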
CHANGE MANAGEMENT IN @RL
• SRE: “roughly 70% of outages are due to changes in a live system.”
• SRE automation enables:
• Progressive rollouts (Ed: not just “promote to QA”)
• Rapid problem diagnosis
• Automated rollback (Ed: typically not an app “requirement”)
• Mitigate user exposure to service disruptions
• Automation reduces impact of fatigue, familiarity/contempt, challenges of
highly repetitive tasks
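Ed: The rollout and rollback bullets above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's tooling: the stage fractions, the 1% error threshold, and the `error_rate_for` probe are all hypothetical stand-ins for a real deployment system and its monitoring.

```python
def progressive_rollout(stages, error_rate_for, max_error_rate=0.01):
    """Widen exposure stage by stage; roll back on the first unhealthy stage.

    stages: increasing fractions of traffic, e.g. [0.01, 0.1, 0.5, 1.0]
    error_rate_for: callable returning the observed error rate at a fraction
    Returns (final_fraction, history).
    """
    history = []
    for fraction in stages:
        observed = error_rate_for(fraction)
        history.append((fraction, observed))
        if observed > max_error_rate:
            # Automated rollback: no traffic stays on the new version.
            return 0.0, history
    return stages[-1], history


# Hypothetical health probes standing in for real monitoring.
def healthy(fraction):
    return 0.001


def breaks_at_scale(fraction):
    return 0.001 if fraction < 0.5 else 0.08


stages = [0.01, 0.1, 0.5, 1.0]
print(progressive_rollout(stages, healthy))
print(progressive_rollout(stages, breaks_at_scale))
```

In the second run the change reaches at most half of the traffic before it is withdrawn, which is the "mitigate user exposure" point in miniature.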
SRE TACKLES PLANNING, CAPACITY
• Dev rarely has eyes on metrics, processes for provisioning
• Provisioning is higher risk than load shifting: a class of Ops use cases
• Dev rarely accounts for ingest of demand data streams
• Dev has little insight into aperiodic spikes, trends, schedules,
dependencies
• Weather, cascading power outages
• Resource utilization entails variables Dev may be blind to
• Monitoring must utilize alerting from time series data (Few
devs get it)
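Ed: A minimal Python sketch of alerting from time series data, assuming samples arrive at a fixed interval. The rule fires only after the metric has stayed above a threshold for several consecutive samples, loosely analogous to the `for` clause in a Prometheus alerting rule. The CPU series, threshold, and function name are hypothetical.

```python
def sustained_breaches(samples, threshold, min_consecutive):
    """Return the sample indices at which an alert would fire.

    samples: metric values at a fixed scrape interval (oldest first)
    The alert fires once when the value has exceeded `threshold` for
    `min_consecutive` samples in a row, and re-arms after a recovery.
    """
    fired_at = []
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            fired_at.append(i)
    return fired_at


# Hypothetical CPU utilization series.
cpu = [0.42, 0.91, 0.55, 0.93, 0.95, 0.97, 0.96, 0.60, 0.92, 0.94, 0.99]
alerts = sustained_breaches(cpu, threshold=0.90, min_consecutive=3)
print(alerts)
```

The single spike at index 1 does not page anyone; only sustained breaches do, which keeps human alerts few and meaningful.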
SRE LEFT-SHIFTED COMPONENTS
• Abstract Machine (Apache Mesos-like)
• Distributed Storage
• OpenFlow-based SDN
• Prometheus-like Monitoring & Alerting for:
• Acute incidents
• A/B and E1/E2 comparisons
DEV FOR OPS @GOOGLE
• Single shared repo
• “All software is reviewed before being submitted”
• Even large builds are fast
• Same infrastructure for continuous testing
SOFTWARE-CENTRIC OPS
“Unlike traditional operations groups, we view software as the
primary tool through which our systems are managed,
maintained, and minded; to that end, we have the source-level
access and moral authority required to fix, extend and scale code
to keep it working, harden it against the vagaries of the Internet,
and develop our own planet-scale platforms.”
“FULL DEPTH OF THE STACK”
“In Google, we have the good fortune to have developed many
large systems ranging from planet-spanning databases to near
real-time scalable data warehousing to fault-tolerant datastream
joining. In SRE, we flip between the fine-grained detail of disk
driver IO scheduling to the big picture of continental-level
service capacity, across a range of systems and a user population
measured in billions. We own those products in production. We
drive reliability and performance across massive scale by
mastering the full depth of the stack.”
PRINCIPLES
• Embracing Risk (Ed: Listen up, FinTechs)
• Service Level Objectives
• Eliminating Toil (Ed: More than efficiency, velocity)
• Monitor (Ed: Integrated monitoring)
• Release Engineering
• Simplicity (Ed: Complexity evolved from simplicity?)
RISK MANAGEMENT IN SRE
“We strive to make a service reliable enough, but
no more reliable than it needs to be. That is, when we set an
availability target of 99.99%, we want to exceed it, but not by
much: that would waste opportunities to add features to the
system, clean up technical debt, or reduce its operational costs.
In a sense, we view the availability target as both a minimum and
a maximum. The key advantage of this framing is that it unlocks
explicit, thoughtful risktaking.” Source
SRE RISK PROCESS INSIGHTS
• Risk tolerance of consumer services
• Differential impact of failure types on product/service offering
• Google Apps for Business vs. Consumer
• Cost vs. availability (“an extra nine of availability means . . . ”)
• Google + Google Partner latency objectives
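Ed: The cost of "an extra nine" is easier to argue about once it is expressed as permitted downtime. A small Python sketch of that arithmetic, assuming a 365-day year and treating availability purely as time-based uptime:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(availability):
    """Minutes per year a service may be down and still meet the target."""
    return (1 - availability) * MINUTES_PER_YEAR


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} -> {minutes:,.1f} min/year")
```

Each added nine cuts the allowance by a factor of ten, so moving from 99.9% to 99.99% shrinks it from roughly 8.8 hours to under an hour per year.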
SRE “ERROR BUDGET”
“In order to base these decisions [product velocity vs. reliability] on
objective data, the two teams jointly define a quarterly error budget
based on the service’s service level objective, or SLO (see Service Level
Objectives). The error budget provides a clear, objective metric that
determines how unreliable the service is allowed to be within a single
quarter. This metric removes the politics from negotiations between
the SREs and the product developers when deciding how much risk to
allow.”
“The main benefit of an error budget is that it provides a common
incentive that allows both product development and SRE to focus on
finding the right balance between innovation and reliability.”
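Ed: A minimal Python sketch of how an error budget falls out of an SLO, assuming a request-based SLO. The request and failure counts are hypothetical, and the "release only while budget remains" rule is one simple policy, not a quotation of Google's.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    return (1 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget(slo, total_requests)
    return (budget - failed_requests) / budget


# Hypothetical quarter: 99.9% SLO, one million requests, 600 failures so far.
slo = 0.999
quarter_requests = 1_000_000
failed = 600

remaining = budget_remaining(slo, quarter_requests, failed)
can_release = remaining > 0  # simple policy: ship only while budget remains

print(f"budget left: {remaining:.0%}, releases allowed: {can_release}")
```

Both teams read the same number: Dev sees how much risk is left to spend on launches, and SRE sees when to slow down.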
KEY INSIGHT
Ed: Ops has a perspective on product performance that Dev will
rarely have. SRE leverages this by integrating processes to
monitor and manage the product while making improvements.
SERVICE ABSTRACTIONS
• SLA: Set by product owners, not SRE
• SLI: Service Level Indicator (Ed: Domain-specific dependent
measure)
• SLO: Service Level Objective (Ed: Complex target range of
values; sets expectations)
• Agreements (usually, what happens when SLO not met)
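Ed: A short Python sketch of the SLI / SLO relationship, assuming a latency-based indicator. The 200 ms threshold, the sample latencies, and the target bounds are hypothetical. The SLO is modeled as a range (lower bound ≤ SLI ≤ upper bound) to match the "target range of values" note above.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, lower, upper=1.0):
    """SLO: the indicator must fall inside the target range."""
    return lower <= sli <= upper


# Hypothetical request latencies in milliseconds.
latencies = [87, 120, 95, 310, 101, 99, 250, 80, 91, 105]
sli = latency_sli(latencies, threshold_ms=200)

print(f"SLI={sli:.0%}, meets 90% objective: {meets_slo(sli, lower=0.90)}")
```

The SLA sits outside this code: it is the agreement about what happens when `meets_slo` returns false.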
OPS-DRIVEN TARGET GOALS
“Choosing targets (SLOs) is not a purely technical activity
because of the product and business implications, which should
be reflected in both the SLIs and SLOs (and maybe SLAs) that are
selected. Similarly, it may be necessary to trade off certain
product attributes against others within the constraints posed by
staffing, time to market, hardware availability, and funding.”
• SRE Ops-driven concepts: safety margin, throttling, systems
engineering (mod configs, OS tuning, load balancing, physical
updates)
SRE KEY MONITORING INSIGHT
“Monitoring a complex application is a significant engineering
endeavor in and of itself.”
Ed: Software engineering is 7-20 years away from fully
integrating monitoring concepts into IDEs
ALERTING INSIGHTS
• Human alerts must be simple and fast
• Monitoring should identify what’s broken and why (Ed: Domain
dependent!)
• Focus should be on better post hoc analysis (Ed: Forensics; big data)
• “Google SRE has experienced only limited success with complex
dependency hierarchies”
• “Different aspects of a system should be measured with different
levels of granularity.”
• “In Google’s experience, basic collection and aggregation of metrics,
paired with alerting and dashboards, has worked well as a relatively
standalone system.”
TYPES OF AUTOMATION
• No automation
• Externally maintained system-specific automation
• Externally maintained generic automation
• Internally maintained system-specific automation
• Systems that need no automation
• Ed: Conclude Ops is closer to automation (except domain
specific)
LEFT-SHIFTING OPS ISN’T ONE-AND-DONE
“Automation code, like unit test code, dies when the maintaining
team isn’t obsessive about keeping the code in sync with the
codebase it covers. The world changes around the code: the DNS
team adds new configuration options, the storage team changes
their package names, and the networking team needs to support
new devices.”
TYPICAL SRE RELEASE PROCESS
• A typical release process proceeds as follows:
• Rapid uses the requested integration revision number (often obtained automatically from
our continuous test system) to create a release branch.
• Rapid uses Blaze to compile all the binaries and execute the unit tests, often performing
these two steps in parallel. Compilation and testing occur in environments dedicated to
those specific tasks, as opposed to taking place in the Borg job where the Rapid workflow
is executing. This separation allows us to parallelize work easily.
• Build artifacts are then available for system testing and canary deployments. A typical
canary deployment involves starting a few jobs in our production environment after the
completion of system tests.
• The results of each step of the process are logged. A report of all changes since the last
release is created.
• Rapid allows us to manage our release branches and cherry picks; individual cherry pick
requests can be approved or rejected for inclusion in a release. Source
SOME CONCLUSIONS
BY ED
1. Complex IT operations are challenging to left-shift at scale
2. Python (+ Go etc.) have facilitated left-shift
3. SDN (5-6G) is a game-changer; Ops is in the game, like it or
not
4. Monitoring and alerting are beyond current SE skills
5. SRE treats security as a feature (casual?)
6. SRE measures manual processes as part of using automation
to drive reliability
7. SRE has a more formal, Ops-driven approach to trade-off
compacts with product owners
8. Current DevOps SDLC practices have not formalized how to
capture and manage quality, reliability
9. Except for CMMI, risk is weakly integrated into the DevOps
SDLC
10. DevOps does not identify “toil,” hence may not participate in
PDCA cycle from Ops
11. Dev teams may not know what can/should be automated.

Site (Service) Reliability Engineering

  • 1. SITE RELIABILITY ENGINEERING* SEEN FROM DEVOPS AND AGILE PERSPECTIVES *SERVICE M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 1
  • 2. GAPS IN AGILE, DEVOPS APPROACHES WHY ADDITIONAL OR SUPPLEMENTARY APPROACHES ARE NEEDED *EDITORIAL M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 2
  • 3. HOW OPS GETS OVERLOOKED • No obvious “product” release cycle • Keeping complex systems running is not primarily a software problem • Ops troubleshooting may not follow any SDLC model • Some Ops entail managing systems in which no code readily available M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 3
  • 4. PHILOSOPHICAL NOTES • Technical approaches to privacy are inextricably tied to security • Similarly, reliability engineering is also tied to security • -- and not just “Availability” • Quality engineering comfortably straddles both Dev and Ops • Most quality engineering in practice is pure Ops • Software engineering has immature notions of quality • Supporting legacy systems may be more Ops than Dev M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 4
  • 5. USE CASES • Call center operations • Field service • Sales, sales support • Most of health care (17.8% of US GDP spending) • Rework and repair (all sectors) • Financial services • Government operations (e.g., voting systems, regulation, transportation management) • Utilities • Even the less obvious: decision support M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 5
  • 6. SOFTWARE SUPPORTS OPS, BUT . . . • Complex systems lack human-machine controls • Humans are almost always “man in the middle” by design • Ops were not designed to be automated • Software only lightly mitigates labor increases when service load increases • Ops must encompass non-automated tasks M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 6
  • 7. SITE RELIABILITY ENGINEERING M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 7 Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491- 92912-4.”
  • 8. SITE RELIABILITY WORKBOOK M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 8 Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne O’Reilly Media Source
  • 9. CREDIT GOOGLE GOOGLE DEVELOPED SRE AND PUBLISHES A FREE ONLINE TEXT. BEN TREYNOR SLOSS ORIGINATED THE TERM. M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 9
  • 10. GOOGLE’S DEFINITION “SRE IS WHAT YOU GET WHEN YOU TREAT OPERATIONS AS IF IT’S A SOFTWARE PROBLEM. OUR MISSION IS TO PROTECT, PROVIDE FOR, AND PROGRESS THE SOFTWARE AND SYSTEMS BEHIND ALL OF GOOGLE’S PUBLIC SERVICES — GOOGLE SEARCH, ADS, GMAIL, ANDROID, YOUTUBE, AND APP ENGINE, TO NAME JUST A FEW — WITH AN EVER- WATCHFUL EYE ON THEIR AVAILABILITY, LATENCY, PERFORMANCE, AND CAPACITY.” SOURCE M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 10
  • 11. WHAT IS IT? • Quasi open standardized process (vs. “standard”) • Scalable, proven (albeit inside deep pocket enterprises) • Begun in 2003, it predated DevOps • Left-shift Sysadmin functions • But with healthy skills in layers 1-3 in UNIX network stack M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 11
  • 12. IS IT DEVOPS? • “. . . We are distinct from the industry term DevOps, because although we definitely regard infrastructure as code, we have reliability as our main focus. Additionally, we are strongly oriented toward removing the necessity for operations— see The Evolution of Automation at Google for more details.” M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 12
  • 13. IS IT DEVOPS? (PER GOOGLE) “One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.” (Chapter 1) M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 13
  • 14. OPS SRE RESPONSIBILITIES • Availability • Latency • Performance [sic] • Efficiency* • Change Management • Monitoring* • Emergency Response • Capacity Planning M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 14
  • 15. HOW SRE LEFT-SHIFTS OPS • No more than 50% duty in Ops • Remaining 50% is “coding skills on project work” • Heavy reliance on “blame-free postmortem culture” • Ed: Quality principle • Ed: Implies analytics, evidence-, data-driven processes M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 15
  • 16. SRE EVENT ANALYTICS • Max of two events per 8/12 hr on-call shift • No equivalent to these events in software engineering • Tied to monitoring (alerts, tickets, logging) • Emergency response is a useful event + event metrics • MTTF and MTTR – MTTR is key • Playbook* building as synthetic event / scenario construction • “We have found that thinking through and recording the best practices ahead of time in a ‘playbook’ produces roughly a 3x improvement in MTTR as compared to the strategy of "winging it." • “Wheel of Misfortune” (software engineering equivalent: Adversarial testing?) M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 16
  • 17. CHANGE MANAGEMENT IN @RL • “SRE: 70% of outages due to changes in a live system.” • SRE automation enables: • Progressive rollouts (Ed not just “promote to QA”) • Rapid problem diagnosis • Automated rollback (Ed Typically not an app ‘requirement’) • Mitigate user exposure to service disruptions • Automation reduces impact of fatigue, familiarity/contempt, challenges of highly repetitive tasks M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 17
  • 18. SRE TACKLES PLANNING, CAPACITY • Dev rarely has eyes on metrics, processes for provisioning • Provisioning is higher risk than load shifting: a class of Ops use cases • Dev rarely accounts for ingest of demand data streams • Dev has little insight into aperiodic spikes, trends, schedules, dependencies • Weather, cascading power outages • Resource utilization entails variables Dev may be blind to • Monitoring must utilize alerting from time series data (Few devs get it) M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 18
  • 19. SRE LEFT-SHIFTED COMPONENTS • Abstract Machine (Apache Mesos-like) • Distributed Storage • OpenFlow-based SDN • Prometheus-like Monitoring & Alerting for: • Acute incidents • A/B and E1/E2 comparisons M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 19
  • 20. DEV FOR OPS @GOOGLE • Single shared repo • “All software is reviewed before being submitted” • Even large builds are fast • Same infrastructure for continuous testing M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 20
  • 21. SOFTWARE-CENTRIC OPS “Unlike traditional operations groups, we view software as the primary tool through which our systems are managed, maintained, and minded; to that end, we have the source-level access and moral authority required to fix, extend and scale code to keep it working, harden it against the vagaries of the Internet, and develop our own planet-scale platforms.” M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 21
  • 22. “FULL DEPTH OF THE STACK” “In Google, we have the good fortune to have developed many large systems ranging from planet-spanning databases to near real-time scalable data warehousing to fault-tolerant datastream joining. In SRE, we flip between the fine-grained detail of disk driver IO scheduling to the big picture of continental-level service capacity, across a range of systems and a user population measured in billions. We own those products in production. We drive reliability and performance across massive scale by mastering the full depth of the stack.“M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 22
  • 23. PRINCIPLES • Embracing Risk (Ed: Listen up, FinTechs) • Service Level Objectives • Eliminating Toil (Ed: More than efficiency, velocity) • Monitor (Ed: Integrated monitoring) • Release Engineering • Simplicity (Ed: Complexity evolved from simplicity?) M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 23
  • 24. RISK MANAGEMENT IN SRE “We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%,we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs. In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risktaking.” Source M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 24
  • 25. SRE RISK PROCESS INSIGHTS • Risk tolerance of consumer services • Differential impact of failure types on product/service offering • Google Apps for Business vs. Consumer • Cost vs. availability (“an extra nine of availability means . . . “) • Google + Google Partner latency objectives M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 25
  • 26. SRE “ERROR BUDGET” “In order to base these decisions [product velocity vs. reliability] on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO (see Service Level Objectives). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.” “The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.” M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 26
  • 27. KEY INSIGHT Ed: Ops has a perspective on product performance that Dev will rarely have. SRE leverages this by integrating processes to monitor and manage the product while making improvements.
  • 28. SERVICE ABSTRACTIONS • SLA: Set by product owners, not SRE • SLI: Service Level Indicator (Ed: Domain-specific dependent measure) • SLO: Service Level Objective (Ed: Complex target range of values; sets expectations) • Agreements (usually, what happens when an SLO is not met)
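The relationship between an SLI (what is measured) and an SLO (the target for that measure) on slide 28 can be sketched as follows. The latency-based indicator, the 100 ms threshold, the sample values, and the function names `latency_sli` and `meets_slo` are assumptions for illustration only.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, objective: float) -> bool:
    """SLO check: the measured indicator must reach the target fraction."""
    return sli >= objective


if __name__ == "__main__":
    samples = [50.0, 80.0, 120.0, 300.0]  # made-up latency samples in milliseconds
    sli = latency_sli(samples, threshold_ms=100.0)
    print(f"SLI: {sli:.2f}, meets a 0.99 SLO: {meets_slo(sli, 0.99)}")
```

An SLA would sit on top of this check as a business consequence (credits, penalties) when the objective is missed, which is why the slide assigns it to product owners rather than to SRE.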
  • 29. OPS-DRIVEN TARGET GOALS “Choosing targets (SLOs) is not a purely technical activity because of the product and business implications, which should be reflected in both the SLIs and SLOs (and maybe SLAs) that are selected. Similarly, it may be necessary to trade off certain product attributes against others within the constraints posed by staffing, time to market, hardware availability, and funding.” • SRE Ops-driven concepts: safety margin, throttling, systems engineering (configuration changes, OS tuning, load balancing, physical updates)
  • 30. SRE KEY MONITORING INSIGHT “Monitoring a complex application is a significant engineering endeavor in and of itself.” Ed: Software engineering is 7-20 years away from fully integrating monitoring concepts into IDEs.
  • 31. ALERTING INSIGHTS • Human alerts must be simple and fast • Monitoring should identify what’s broken and why (Ed: Domain dependent!) • Focus should be on better post hoc analysis (Ed: Forensics; big data) • “Google SRE has experienced only limited success with complex dependency hierarchies” • “Different aspects of a system should be measured with different levels of granularity.” • “In Google’s experience, basic collection and aggregation of metrics, paired with alerting and dashboards, has worked well as a relatively standalone system.”
  • 32. TYPES OF AUTOMATION • No automation • Externally maintained system-specific automation • Externally maintained generic automation • Internally maintained system-specific automation • Systems that need no automation • Ed: Conclude that Ops is closer to automation (except domain-specific)
  • 33. LEFT-SHIFTING OPS ISN’T ONE-AND-DONE “Automation code, like unit test code, dies when the maintaining team isn’t obsessive about keeping the code in sync with the codebase it covers. The world changes around the code: the DNS team adds new configuration options, the storage team changes their package names, and the networking team needs to support new devices.”
  • 34. TYPICAL SRE RELEASE PROCESS • A typical release process proceeds as follows: • Rapid uses the requested integration revision number (often obtained automatically from our continuous test system) to create a release branch. • Rapid uses Blaze to compile all the binaries and execute the unit tests, often performing these two steps in parallel. Compilation and testing occur in environments dedicated to those specific tasks, as opposed to taking place in the Borg job where the Rapid workflow is executing. This separation allows us to parallelize work easily. • Build artifacts are then available for system testing and canary deployments. A typical canary deployment involves starting a few jobs in our production environment after the completion of system tests. • The results of each step of the process are logged. A report of all changes since the last release is created. • Rapid allows us to manage our release branches and cherry picks; individual cherry pick requests can be approved or rejected for inclusion in a release. Source
  • 35. SOME CONCLUSIONS BY ED
  • 36. 1. Complex IT operations are challenging to left-shift at scale 2. Python (+ Go, etc.) has facilitated left-shift 3. SDN (5-6G) is a game-changer; Ops is in the game, like it or not 4. Monitoring and alerting are beyond current SE skills 5. SRE treats security as a feature (casual?) 6. SRE measures manual processes as part of using automation to drive reliability 7. SRE has a more formal, Ops-driven approach to trade-off compacts with product owners 8. Current DevOps SDLC practices have not formalized how to capture and manage quality and reliability 9. Except for CMMI, risk is weakly integrated into the DevOps SDLC 10. DevOps does not identify “toil,” hence may not participate in the PDCA cycle from Ops 11. Dev teams may not know what can or should be automated
