GOOGLE SRE
CONCEPTS
Torino DevOps Meetup
FROM THE KICK-OFF
PRESENTATION
Is SRE an alternative to DevOps?
Google created and evolved the concept
independently from the DevOps movement.
“If you think of DevOps like an interface in a programming language,
class SRE implements DevOps. SRE includes additional practices and
recommendations that are not necessarily part of the DevOps
interface. DevOps and SRE are not two competing methods for
software development and operations, but rather close friends
designed to break down organizational barriers to deliver better
software faster.
DevOps emerged as a culture and a set of practices
that aims to reduce the gaps between software
development and software operation.
The DevOps movement does not explicitly define how
to succeed.
SRE prescribes how to succeed in the various DevOps
areas.” (Liz Fong-Jones, Seth Vargo)
DEVOPS VS SRE (SITE
RELIABILITY ENGINEERING)
DEVOPS MANIFESTO –
SRE MANIFESTO
GOOGLE SRE - HISTORY
Site reliability engineering (SRE) was born at Google in 2003, prior to the
DevOps movement.
“SRE is what happens when you ask a software engineer to design an
operations team.” - Ben Treynor, VP of engineering at Google and founder of
Google SRE
“SRE is fundamentally doing work that has historically been done by an
operations team but using engineers with software expertise and banking on
the fact that these engineers are inherently both predisposed to, and have
the ability to, substitute automation for human labour. In general, an SRE
team is responsible for availability, latency, performance, efficiency, change
management, monitoring, emergency response, and capacity planning.” -
Ben Treynor, VP of engineering at Google and founder of Google SRE
Site reliability engineers create a bridge between development and
operations by applying a software engineering mindset to system
administration topics.
They split their time between operations/on-call duties and developing
systems and software that help increase site reliability and performance.
Google puts a lot of emphasis on SREs not spending more than 50% of their
time on operations and considers any violation of this rule a sign of system
ill-health.
The ideal site reliability engineer candidate is either a software engineer with
a good administration background or a highly skilled system administrator
with knowledge of coding and automation.
“SRE teams are characterized by both rapid innovation and a large
acceptance of change” - Ben Treynor, VP of engineering at Google and
founder of Google SRE
THE ORIGIN
Ben Treynor, VP of engineering at Google
“Software engineering has this in common with having
children: the labour before the birth is painful and difficult, but
the labour after the birth is where you actually spend most of
your effort. Yet software engineering as a discipline spends
much more time talking about the first period as opposed to
the second [...] there must be another discipline that
focuses on the whole lifecycle of software objects, from
inception, through deployment and operation, refinement,
and eventual peaceful decommissioning.”
“Running a service with a team that relies on manual
intervention for both change management and event handling
becomes expensive as the service and/or traffic to the service
grows”
“Traditional operations teams and their counterparts in
product development often end up in conflict, most visibly
over how quickly software can be released to production. At
their core, the development teams want to launch new
features and see them adopted by users. At their core, the ops
teams want to make sure the service doesn’t break while they
are holding the pager. Because most outages are caused by
some kind of change (a new configuration, a new feature
launch, or a new type of user traffic), the two teams’ goals are
fundamentally in tension.”
THE PROBLEM TO
SOLVE
Site: the term “site” comes from the original duty of
the newly created role, which focused on the google.com
website.
Reliability: “reliability is the most fundamental
feature of any product: a system isn’t very useful if
nobody can use it!” (Ben Treynor)
Engineering: “we apply the principles of computer
science and engineering to the design and
development of computing systems” “an SRE team
must spend 50% of its time actually doing
development.” (Ben Treynor)
THE NAME
GOOGLE SRE – CONCEPTS –
SRE VS DEVOPS
The DevOps movement in the community and the SRE initiative
at Google started from the same problem: the inefficiency
of having developers and operators working on
opposite sides of a wall, the former pushing for features and the
latter for stability.
“One could view DevOps as a generalization of several
core SRE principles to a wider range of organizations,
management structures, and personnel. One could
equivalently view SRE as a specific implementation of
DevOps with some idiosyncratic extensions.” (Ben Treynor)
A DevOps engineer is someone who understands the full
SDLC (Software Development Life Cycle).
DevOps focuses more on automation.
SRE focuses more on aspects such as system availability,
observability, and scale.
“The basic tenet of SRE is that doing operations well is a
software problem”
SRE VS DEVOPS
Reduce organisational silos: SREs share the ownership
of production with development teams and use the
same tools
Accept failure as normal: SRE’s concept of Error
Budget, avoiding 100% SLO
Implement gradual change: SRE is prescriptive about
usage of Canary Deployment
Leverage tooling and automation: SRE concept of
Toil and the guideline of “automate this year’s job”
Measure everything: SRE is prescriptive about
measuring Error Budget and Toil
SRE VS DEVOPS
DEVOPS VS. SRE: COMPETING STANDARDS OR FRIENDS?
(CLOUD NEXT ‘19) (@SETHVARGO)
GOOGLE SRE – CONCEPTS
- SLO
Service Level Indicators (SLIs): metrics over time such
as request latency, throughput of requests per
second, or failures per request
Service Level Objectives (SLOs): targets for the
cumulative success of SLIs over a window of time
Service Level Agreements (SLAs): a promise by a
service provider, to a service consumer, about the
availability of a service
SLI, SLO, SLA – A
DEFINITION
“In general, for any software service or system, 100% is
not the right reliability target because no user can tell the
difference between a system being 100% available and
99.999% available.”
100% is not a viable availability target. Having a 100%
availability requirement severely limits a team or
developer’s ability to deliver updates and improvements
to a system.
Setting the target is not a technical question at all; it is a
product question. Business leads and product
management must establish the system’s availability
target by asking:
• What level of availability will the users be happy with,
given how they use the product?
• What alternatives are available to users who are
dissatisfied with the product’s availability?
• What happens to users’ usage of the product at
different availability levels?
SLO - TARGET
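The cost of each extra "nine" becomes easier to discuss once the target is translated into allowed downtime. The following is a minimal sketch, not part of the original slides: the function name `allowed_downtime` and the 30-day window are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return how long a service may be unavailable within `window`
    while still meeting an availability target of `slo_percent`.

    Hypothetical helper for illustration only.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    unavailable_fraction = 1 - slo_percent / 100
    return window * unavailable_fraction


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (99.0, 99.9, 99.99, 99.999, 100.0):
        print(f"{target}% -> {allowed_downtime(target, window)}")
```

Under these assumptions, a 99.9% target over 30 days leaves about 43 minutes of downtime, while a 100% target leaves none, which is why it blocks every risky change.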
SLIs must be selected on the basis of business value considerations.
Valuable SLIs:
• Request latency
• Batch throughput
• Failures per request (error rate)
Non-valuable SLIs:
• CPU Time
• Memory Usage
• Operating System Uptime
Metrics must be aggregated over time (for example, “last 5 minutes”),
with a function such as a percentile applied (for example, “99th
percentile”).
Metrics must be defined in advance and be known and accepted
across the whole organisation.
Measurement must be reliable: not only the definition but also the
implementation of the measurement must be clearly defined and
approved.
SLI - SELECTION
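As an illustration of "aggregate over time, then apply a percentile", here is a minimal sketch of a latency SLI. It assumes samples arrive as `(timestamp_seconds, latency_ms)` pairs and uses the nearest-rank percentile definition; both choices, and the function names, are assumptions rather than anything prescribed by the slides.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    `pct` percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_sli(samples, now, window_seconds=300, pct=99):
    """Return the `pct` percentile latency over the last `window_seconds`.

    `samples` is a list of (timestamp_seconds, latency_ms) pairs
    (a hypothetical input shape). Returns None when the window is empty.
    """
    recent = [
        latency
        for timestamp, latency in samples
        if now - window_seconds < timestamp <= now
    ]
    return percentile(recent, pct) if recent else None
```

Fixing the window and the percentile in code is one way to make the measurement's implementation, and not only its definition, explicit and reviewable.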
GOOGLE SRE – CONCEPTS –
ERROR BUDGET
The error budget is a quantitative
measure used to set the balance
between work on implementing new
features and work on improving stability.
“The error budget provides a clear,
objective metric that determines how
unreliable the service is allowed to be
within a single quarter. This metric
removes the politics from negotiations
between the SREs and the product
developers when deciding how much
risk to allow”
ERROR BUDGET AND
RISK
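The arithmetic behind an error budget is short. The sketch below assumes a request-based SLO (fraction of successful requests) measured over one window such as a quarter; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one measurement window (assumed model)."""

    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is everything the SLO does not promise.
        return (1 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_requests

    @property
    def consumed_fraction(self) -> float:
        if self.allowed_failures == 0:
            # A 100% SLO leaves no budget: any failure overspends it.
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures
```

With a 99.9% SLO and one million requests, the budget is about 1,000 failed requests; 250 failures would consume roughly a quarter of it.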
“If SLO violations occur frequently
enough to expend the error budget,
releases are temporarily halted while
additional resources are invested in
system testing and development to
make the system more resilient,
improve its performance, and so on.”
ERROR BUDGET AND
RISK
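The policy in the quote above can be expressed as a simple release gate. This is a sketch under assumptions: the function name, the "proceed"/"halt" return values, and the request-based SLO are illustrative, and a real policy would usually be agreed between SRE and product teams rather than hard-coded.

```python
def release_decision(slo_target: float, good_events: int, total_events: int) -> str:
    """Return "proceed" while error budget remains in the window, else "halt".

    Hypothetical gate: `slo_target` is the required success ratio (e.g. 0.999),
    and the event counts cover the current SLO window.
    """
    if total_events == 0:
        return "proceed"  # nothing measured yet, so no budget has been spent
    allowed_bad = (1 - slo_target) * total_events
    bad_events = total_events - good_events
    return "proceed" if bad_events < allowed_bad else "halt"
```

A gate like this could run in a deployment pipeline, so that the decision to pause releases follows from the measured budget rather than from negotiation.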
Use measurements to calculate the expected
cost of each risk in terms of error
budget.
Prioritise improvement work on the
basis of each risk’s computed impact
on the error budget.
The cost of mitigating the risks may
require a review of the SLO, based on
business value considerations.
RISK MANAGEMENT AND
PRIORITISATION
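One possible way to put numbers on this is to estimate, for each risk, the "bad minutes" it is expected to cost per year and compare that with the yearly budget. The model below (frequency × time to detect and repair × share of users affected) is an assumption for illustration, as are all names and example figures.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass(frozen=True)
class Risk:
    """A risk with estimated frequency, duration and blast radius (assumed model)."""

    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_repair: float
    fraction_of_users_affected: float  # 0.0 to 1.0

    @property
    def expected_bad_minutes(self) -> float:
        outage_minutes = self.minutes_to_detect + self.minutes_to_repair
        return self.incidents_per_year * outage_minutes * self.fraction_of_users_affected


def prioritise(risks, slo: float):
    """Rank risks by expected yearly cost, most expensive first.

    Returns (name, expected_bad_minutes, share_of_yearly_budget) tuples.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    budget_minutes = (1 - slo) * MINUTES_PER_YEAR
    ranked = sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)
    return [
        (r.name, r.expected_bad_minutes, r.expected_bad_minutes / budget_minutes)
        for r in ranked
    ]
```

If the shares add up to more than 1.0, the risks cannot all be accepted under the current SLO: either some must be mitigated or the SLO itself must be revisited.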
GOOGLE SRE – CONCEPTS
– TOIL
Toil is not simply “work I don't like to do.”
Toil is not overhead (commuting, expense reports,
meetings, …)
Toil is specifically tied to the running of a production
service.
It is work that tends to be manual, repetitive,
automatable, tactical and devoid of long-term value.
Toil is also reactive, such as an intervention triggered by an alert.
When SREs find tasks that can be automated, they work to
engineer a solution to prevent that toil in the future.
Toil is not always bad. Predictable, repetitive tasks are
great ways to onboard a new team member and often
produce an immediate sense of accomplishment and
satisfaction with low risk and low stress.
TOIL AND TOIL
BUDGET
“We ensure that the teams consistently spending less than
50% of their time on development work change their
practices. Often this means shifting some of the
operations burden back to the development team, or
adding staff to the team without assigning that team
additional operational responsibilities.”
SREs must not spend more than 50% of their time
on support and operations activities.
Developers should be involved in support regularly, but
their involvement becomes mandatory if a product
requires SREs to spend more than 50% of their time on
operations.
At least eight people need to be part of the on-call team to
balance the load correctly.
Each on-call engineer should handle no more than two events
per on-call shift to ensure quality.
To stay ready for effective on-call support,
practise handling hypothetical outages.
ON CALL SUPPORT
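The three numeric guidelines above (50% operations cap, eight people on the rotation, two events per shift) lend themselves to a periodic health check. The sketch below is illustrative only: the function name, inputs and warning texts are assumptions, and the thresholds are simply the ones stated on the slide.

```python
MAX_OPS_FRACTION = 0.5      # at most 50% of time on operations
MIN_ROTATION_SIZE = 8       # at least eight people on call
MAX_EVENTS_PER_SHIFT = 2    # at most two events per on-call shift


def on_call_warnings(ops_hours, total_hours, team_size, events_per_shift):
    """Return a list of human-readable warnings; an empty list means healthy.

    `events_per_shift` is a list with the number of events handled in each
    recent shift (a hypothetical input shape).
    """
    warnings = []
    if total_hours > 0 and ops_hours / total_hours > MAX_OPS_FRACTION:
        warnings.append("operations work exceeds 50% of team time")
    if team_size < MIN_ROTATION_SIZE:
        warnings.append("on-call rotation has fewer than eight people")
    overloaded = sum(1 for n in events_per_shift if n > MAX_EVENTS_PER_SHIFT)
    if overloaded:
        warnings.append(f"{overloaded} shift(s) handled more than two events")
    return warnings
```

A non-empty result would be the trigger for the remedies described in the quote: handing operational work back to developers or adding staff.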
“If you currently assign tickets randomly to victims on
your team, stop. Doing so is extremely disrespectful
of your team’s time, and works completely counter to
the principle of not being interruptible as much as
possible. Tickets should be a full-time role, for an
amount of time that’s manageable for a person. If you
happen to be in the unenviable position of having
more tickets than can be closed by the primary and
secondary on-call engineers combined, then structure
your ticket rotation to have two people handling
tickets at any given time. Don’t spread the load across
the entire team.”
SUPPORT TICKET
GOOGLE SRE – CONCEPTS –
OBSERVABILITY AND MONITORING
Each monitoring system should address
two questions: what’s broken
(symptom) and why (cause)
Two types of monitoring can be defined:
• White-box monitoring: it inspects the
internal state of the target service
(application component metrics,
traces, logs). Focus on causes.
• Black-box monitoring: it accesses the
system from the outside, as a real user would
(HTTP/TCP probes, DNS resolution,
network ping). Symptom-oriented.
Active detection of error conditions.
WHITE-BOX AND BLACK-
BOX MONITORING
Monitoring may have only three output types:
• Pages - A human must do something now
• Tickets - A human must do something within a few
days
• Logging - No one need look at this output
immediately, but it’s available for later analysis if
needed
“Putting alerts into email and hoping that someone
will read all of them and notice the important ones is
the moral equivalent of piping them to /dev/null:
they will eventually be ignored.”
Every alert must be analysed; if an alert is routinely ignored,
remove the alerting rule.
MONITORING AND
ALERTING
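The three output types can be read as a routing rule with two questions: does a human need to act, and does it have to happen now? The sketch below encodes that reading; the enum and function names are assumptions for illustration.

```python
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human must do something now
    TICKET = "ticket"  # a human must do something within a few days
    LOG = "log"        # kept for later analysis, nobody is notified


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Map a monitoring signal to exactly one of the three output types."""
    if not needs_human_action:
        return Output.LOG
    return Output.PAGE if urgent else Output.TICKET
```

Note that there is deliberately no "email" branch: anything that is not worth a page or a ticket goes to the log.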
Collecting metrics and alerting are
business driven activities.
Collecting metrics and managing alerts
are different things. You need to collect
metrics to have visibility into your
systems, but you do not have to use all
metrics to generate alerts.
Measurement, monitoring and alerting
must be related to SLOs.
Do not alert on everything: alerts must
report a service impact and be
actionable.
OBSERVABILITY
Latency
The time it takes to service a request. It’s important to
distinguish between the latency of successful requests and the
latency of failed requests. For example, an HTTP 500 error
triggered due to loss of connection to a database or other critical
backend might be served very quickly; however, as an HTTP 500
error indicates a failed request, factoring 500s into your overall
latency might result in misleading calculations. On the other
hand, a slow error is even worse than a fast error! Therefore, it’s
important to track error latency, as opposed to just filtering out
errors.
Traffic
A measure of how much demand is being placed on your system,
measured in a high-level system-specific metric. For a web
service, this measurement is usually HTTP requests per second,
perhaps broken out by the nature of the requests (e.g., static
versus dynamic content). For an audio streaming system, this
measurement might focus on network I/O rate or concurrent
sessions. For a key-value storage system, this measurement
might be transactions and retrievals per second.
THE FOUR GOLDEN
SIGNALS (1/2)
Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s),
implicitly (for example, an HTTP 200 success response, but
coupled with the wrong content), or by policy (for example, "If
you committed to one-second response times, any request over
one second is an error"). Where protocol response codes are
insufficient to express all failure conditions, secondary (internal)
protocols may be necessary to track partial failure modes.
Monitoring these cases can be drastically different: catching
HTTP 500s at your load balancer can do a decent job of catching
all completely failed requests, while only end-to-end system
tests can detect that you’re serving the wrong content.
Saturation
How "full" your service is. A measure of your system fraction,
emphasizing the resources that are most constrained (e.g., in a
memory-constrained system, show memory; in an I/O-
constrained system, show I/O). Note that many systems degrade
in performance before they achieve 100% utilization, so having a
utilization target is essential.
THE FOUR GOLDEN
SIGNALS (2/2)
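To make the four signals concrete, here is a minimal sketch that derives them from a batch of request records for a web service. The record shape, the use of HTTP 5xx as the only error criterion, the mean as the latency aggregate, and the capacity inputs are all simplifying assumptions; a production system would more likely use percentiles and also count implicit and policy errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def _mean(values):
    return sum(values) / len(values) if values else None


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Compute latency, traffic, errors and saturation for one time window.

    `requests` is a list of Request records observed during the window.
    """
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency is tracked separately for successes and failures,
        # so that fast errors do not mask slow successful requests.
        "latency_ok_ms": _mean(ok),
        "latency_error_ms": _mean(failed),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": used_capacity / total_capacity,
    }
```

The `used_capacity` and `total_capacity` arguments stand for whichever resource is most constrained in the system (memory, I/O, connections), as the Saturation definition suggests.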
GOOGLE SRE – CONCEPTS –
INCIDENT MANAGEMENT
Effective incident management is key to limiting the
disruption caused by an incident and restoring normal
business operations as quickly as possible.
A well-designed incident management process has
the following features:
• Recursive separation of responsibilities
• Incident command
• Operational work
• Communication
• Planning
• A recognised command post
• Live incident state document
• Clear handoff
INCIDENT
MANAGEMENT
“It is better to declare an incident early and then find
a simple fix and close out the incident than to have to
spin up the incident management framework hours
into a burgeoning problem.”
If the answer to any of the following questions is yes, the event is an incident:
• Do you need to involve a second team in fixing the
problem?
• Is the outage visible to customers?
• Is the issue unsolved even after an hour’s
concentrated analysis?
WHEN TO DECLARE AN
INCIDENT
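Because the three criteria are combined with "any", the rule is easy to encode in a runbook tool or chat bot. A minimal sketch, with hypothetical parameter names:

```python
def is_incident(
    needs_second_team: bool,
    customer_visible: bool,
    unsolved_after_an_hour: bool,
) -> bool:
    """Return True if at least one of the declaration criteria holds."""
    return any((needs_second_team, customer_visible, unsolved_after_an_hour))
```

Keeping the rule this simple supports the advice in the quote: declaring early should take no deliberation.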
Incident Management Roles:
• Incident Commander
• The incident commander holds the high-level state about the
incident. They structure the incident response task force,
assigning responsibilities according to need and priority.
• Operations lead
• The Ops lead works with the incident commander to respond
to the incident by applying operational tools to the task at
hand. The operations team should be the only group
modifying the system during an incident.
• Communication lead
• This person is the public face of the incident response task
force.
• Planning lead
• The planning role supports Ops by dealing with longer-term
issues, such as filing bugs, arranging handoffs, and tracking
how the system has diverged from the norm so it can be
reverted once the incident is resolved.
• Logistics lead
• The logistics role supports Ops by dealing with things such as
ordering dinner or dealing with vendors for spare parts.
INCIDENT
MANAGEMENT
GOOGLE SRE – CONCEPTS –
POSTMORTEMS
“Postmortems should be blameless and
focus on process and technology, not
people. Assume the people involved in
an incident are intelligent, are well
intentioned, and were making the best
choices they could given the
information they had available at the
time.”
Ask questions about which data were
available, how the system was
behaving, and which actions were taken and their
effect. Avoid questions about why an
action was or was not taken.
POSTMORTEMS AND
RETROSPECTIVES
Content of a postmortem:
• Incident summary
• Detailed timeline
• Detection
• Impact
• Root Causes
• Triggers
• Mitigation and Resolution
• Lessons learned
• What went well
• What went wrong
• Where we got lucky
• Action Items (with clear owner)
POSTMORTEMS AND
RETROSPECTIVES
GOOGLE SRE – CONCEPTS –
ENGAGEMENT MODEL
Not all Google services receive close SRE
engagement.
Google defines three different
engagement models
• Simple PRR (Production Readiness
Review) Model
• Early Engagement Model
• Frameworks and SRE Platform
SRE ENGAGEMENT
MODEL
When a development team requests that SRE take over
production management of a service, one to three SREs
are selected to conduct the PRR process.
The discussion covers matters such as:
• Establishing an SLO/SLA for the service
• Planning for potentially disruptive design changes
required to improve reliability
• Planning and training schedules
The Training phase unblocks onboarding of the service by
the SRE team. It involves a progressive transfer of
responsibilities and ownership of various production
aspects of the service, including parts of operations, the
change management process, access rights, and so forth.
To complete the transition, the development team must
be available to back up and advise the SRE team for a
period of time as it settles in managing production for
the service. This relationship becomes the basis for the
ongoing work between the teams.
SIMPLE PRR MODEL
The Early Engagement Model introduces
SRE earlier in the development lifecycle.
SRE participates in Design and later
phases, eventually taking over the
service any time during or after the
Build phase.
This model is based on active
collaboration between the development
and SRE teams.
EARLY ENGAGEMENT
MODEL
SRE builds framework modules to implement canonical
solutions for the concerned production area. As a result,
development teams can focus on the business logic,
because the framework already takes care of correct
infrastructure use. A framework essentially is a
prescriptive implementation for using a set of software
components and a canonical way of combining these
components.
The service frameworks implement infrastructure code
in a standardized fashion and address various production
concerns. Each concern is encapsulated in one or more
framework modules, each of which provides a cohesive
solution for a problem domain or infrastructure
dependency.
Framework modules address the various SRE concerns
enumerated earlier, such as:
• Instrumentation and metrics
• Request logging
• Control systems involving traffic and load
management
FRAMEWORKS AND
SRE PLATFORM
AM I DOING SRE OR
DEVOPS?
How can I know if I’m doing DevOps?
• Do I consider Developers, Test Engineers and Sys Admins as
members of the same team having one common goal?
• Do I know and care about end-user features and
functionalities (business value)?
• Do I automate infrastructure and code deployment? Do I
understand and manage SDLC?
• Do I focus on collecting as many metrics as possible from
build process to production runtime?
How can I know if I’m doing SRE?
• Do I protect my 50% of the time for engineering activities?
• Do I define SLO and manage Error Budget?
• Do I continuously work to improve self-healing of
production solutions? Do I target “autonomous” solutions
and not only “automated”?
• Do I use a structured Incident Management process with
clearly defined roles?
AM I DOING SRE OR
DEVOPS?
THE END – Q&A ?

DevOps Torino Meetup - SRE Concepts

  • 3. Is SRE an alternative to DevOps? Google creates and evolved the concept independently from DevOps movement. «If you think of DevOps like an interface in a programming language, class SRE implements DevOps. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface. DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. The DevOps movement does not explicitly define how to succeed. SRE prescribes how to succeed in the various DevOps areas.” (Liz Fong-Jones, Seth Vargo) DEVOPS VS SRE (SITE RELIABILITY ENGINEERING)
  • 5. GOOGLE SRE - HISTORY
  • 6. Site reliability engineering (SRE) was born at Google in 2003, prior to the DevOps movement. “SRE is what happens when you ask a software engineer to design an operations team.” - Ben Traynor, VP of engineering at Google and founder of Google SRE “SRE is fundamentally doing work that has historically been done by an operations team but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labour. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” - Ben Traynor, VP of engineering at Google and founder of Google SRE Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics. They split their time between operations/on-call duties and developing systems and software that help increase site reliability and performance. Google puts a lot of emphasis on SREs not spending more than 50% of their time on operations and considers any violation of this rule a sign of system ill-health. The ideal site reliability engineer candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation. “SRE teams are characterized by both rapid innovation and a large acceptance of Change” - Ben Traynor, VP of engineering at Google and founder of Google SRE THE ORIGIN Ben Traynor, VP of engineering at Google
  • 7. “Software engineering has this in common with having children: the labour before the birth is painful and difficult, but the labour after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second [omissis] there must be another discipline that focuses on the whole lifecycle of software objects, from inception, through deployment and operation, refinement, and eventual peaceful decommissioning.” “Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows” “Traditional operations teams and their counterparts in product development often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change a new configuration, a new feature launch, or a new type of user traffic the two teams’ goals are fundamentally in tension.” THE PROBLEM TO SOLVE
  • 8. Site: the term “site” comes from the original duty of the newly created role, which focused on the google.com web site. Reliability: “reliability is the most fundamental feature of any product: a system isn’t very useful if nobody can use it!” (Ben Treynor) Engineering: “we apply the principles of computer science and engineering to the design and development of computing systems”; “an SRE team must spend 50% of its time actually doing development.” (Ben Treynor) THE NAME
  • 9. GOOGLE SRE – CONCEPTS – SRE VS DEVOPS
  • 10. The DevOps movement in the community and the SRE initiative at Google started from the same problem: the inefficiency of having developers and operators working on opposite sides of a wall, the former looking for features and the latter for stability. “One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.” (Ben Treynor) A DevOps engineer is someone who understands the full SDLC (Software Development Life Cycle). DevOps focuses more on automation; SRE focuses more on aspects such as system availability, observability, and scale. “The basic tenet of SRE is that doing operations well is a software problem.” SRE VS DEVOPS
  • 11. Reduce organisational silos: SREs share ownership of production with development teams and use the same tools. Accept failure as normal: SRE’s concept of Error Budget, avoiding a 100% SLO. Implement gradual change: SRE is prescriptive about the use of Canary Deployment. Leverage tooling and automation: SRE’s concept of Toil and the guideline to “automate this year’s job away”. Measure everything: SRE is prescriptive about measuring Error Budget and Toil. SRE VS DEVOPS
  • 12. DEVOPS VS. SRE: COMPETING STANDARDS OR FRIENDS? (CLOUD NEXT ‘19) (@SETHVARGO)
  • 13. GOOGLE SRE – CONCEPTS - SLO
  • 14. Service Level Indicators (SLIs): metrics over time such as request latency, throughput of requests per second, or failures per request. Service Level Objectives (SLOs): targets for the cumulative success of SLIs over a window of time. Service Level Agreements (SLAs): a promise by a service provider, to a service consumer, about the availability of a service. SLI, SLO, SLA – A DEFINITION
  • 15. “In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.” 100% is not a viable availability target. A 100% availability requirement severely limits a team’s or developer’s ability to deliver updates and improvements to a system. Setting the target isn’t a technical question at all; it’s a product question. The business lead and product management must establish the system’s availability target: • What level of availability will the users be happy with, given how they use the product? • What alternatives are available to users who are dissatisfied with the product’s availability? • What happens to users’ usage of the product at different availability levels? SLO - TARGET
  • 16. SLIs must be selected on the basis of business value considerations. Valuable SLIs: • Request latency • Batch throughput • Failures per request (error rate) Non-valuable SLIs: • CPU time • Memory usage • Operating system uptime Metrics must be aggregated over time (for example, “last 5 minutes”) with a function such as a percentile applied (for example, “99th percentile”). Metrics must be defined in advance and be known and accepted across the whole organisation. Measurement must be reliable: not only the definition but also the implementation of the measurement must be clearly defined and approved. SLI - SELECTION
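As a rough illustration of aggregating SLIs over a window and applying a percentile, the sketch below computes a 99th-percentile latency and an error rate for one window and compares them with SLO targets. The `Request` record, the nearest-rank percentile method, the sample data and the target values are all assumptions made for this example; they do not come from the slides.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples in the window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(requests):
    """Failures per request over the window."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def meets_slo(requests, p99_target_ms, max_error_rate):
    """True when both SLIs are within their (hypothetical) targets."""
    p99 = percentile([r.latency_ms for r in requests], 99)
    return p99 <= p99_target_ms and error_rate(requests) <= max_error_rate


# One aggregation window, e.g. the last 5 minutes of traffic (made-up data):
# 100 requests with latencies 100..199 ms, exactly one of which failed.
window = [Request(latency_ms=100 + i, ok=(i != 0)) for i in range(100)]
print(meets_slo(window, p99_target_ms=200, max_error_rate=0.01))
```

CPU time or memory usage could be fed through the same functions, but they would say little about what users experience, which is why the slide classes them as non-valuable SLIs.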
  • 17. GOOGLE SRE – CONCEPTS – ERROR BUDGET
  • 18. Error budget is a quantitative measurement to establish the ratio between work on implementing new features and work on improving stability “The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow” ERROR BUDGET AND RISK
  • 19. “If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on.” ERROR BUDGET AND RISK
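The arithmetic behind an error budget is simple: the budget is the share of events the SLO allows to fail. With a 99.9% SLO and 1,000,000 requests in the quarter, for instance, about 1,000 requests may fail. The sketch below is one possible way to express that and the release halt described in the quote; the function names and the rule "halt when nothing is left" are assumptions for illustration.

```python
def error_budget(slo, total_events):
    """Number of bad events the SLO allows in the window (slo given as e.g. 0.999)."""
    return (1 - slo) * total_events


def budget_remaining(slo, total_events, bad_events):
    """Fraction of the error budget still unspent; negative once overspent."""
    budget = error_budget(slo, total_events)
    return (budget - bad_events) / budget


def releases_allowed(slo, total_events, bad_events):
    """Hypothetical release gate: halt launches once the budget is spent."""
    return budget_remaining(slo, total_events, bad_events) > 0


# 400 failed requests out of 1,000,000 leave roughly 60% of the budget.
print(releases_allowed(0.999, 1_000_000, 400))
# 1,200 failures overspend the budget, so releases would be halted.
print(releases_allowed(0.999, 1_000_000, 1_200))
```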
  • 20. Use measurements to calculate the expected cost of each risk in terms of error budget. Prioritise improvement work on the basis of each risk’s computed impact on the error budget. The cost of mitigating the risks may require a review of the SLO in light of business value considerations. RISK MANAGEMENT AND PRIORITISATION
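One way to put a number on each risk, sketched below, is to estimate its expected yearly cost in "bad minutes" as frequency × duration × share of users affected, and then compare the total with the yearly budget implied by the SLO. This cost model, the `Risk` fields and the example figures are assumptions chosen for illustration, not a method stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    incidents_per_year: float
    minutes_per_incident: float  # detection plus repair time
    fraction_of_users: float     # share of users affected, between 0 and 1

    def expected_bad_minutes(self):
        """Expected yearly cost of this risk, in error-budget minutes."""
        return self.incidents_per_year * self.minutes_per_incident * self.fraction_of_users


def prioritise(risks):
    """Most expensive risks first."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes(), reverse=True)


def yearly_budget_minutes(slo):
    """Minutes of unavailability a given SLO allows per year."""
    return (1 - slo) * 365 * 24 * 60


# Made-up risk catalogue.
risks = [
    Risk("bad config push", 4, 60, 0.5),
    Risk("database failover", 2, 30, 1.0),
    Risk("certificate expiry", 1, 240, 1.0),
]
total_cost = sum(r.expected_bad_minutes() for r in risks)
for r in prioritise(risks):
    print(r.name, r.expected_bad_minutes())
print(total_cost <= yearly_budget_minutes(0.999))
```

With these figures the risks fit within a 99.9% yearly budget but not within a 99.99% one, which is the kind of result that would prompt either mitigation work or a review of the SLO.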
  • 21. GOOGLE SRE – CONCEPTS – TOIL
  • 22. Toil is not simply “work I don’t like to do.” Toil is not overhead (commuting, expense reports, meetings, …). Toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Toil is also reactive, such as an intervention triggered by an alert. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. Toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. TOIL AND TOIL BUDGET
  • 23. “We ensure that the teams consistently spending less than 50% of their time on development work change their practices. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities.” SREs must not spend more than 50% of their time on support/operation activities. Developers should be involved in support regularly, but their involvement becomes mandatory if a product requires SREs to spend more than 50% of their time on operations. At least eight people need to be part of the on-call team to balance the load correctly. Each on-call person must handle no more than two events per on-call shift to ensure adequate quality. To preserve readiness for effective on-call support, practise handling hypothetical outages. ON CALL SUPPORT
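The figures of eight people and two events per shift follow from simple arithmetic. The sketch below reproduces it under the assumptions given in the SRE book: two engineers on call at any time (primary and secondary), at most 25% of each engineer's time on call, about 6 hours of work per incident, and 12-hour shifts. The function names are hypothetical.

```python
import math


def min_oncall_engineers(people_on_call_at_once, max_oncall_fraction):
    """Smallest team size that keeps each engineer under the on-call time cap."""
    return math.ceil(people_on_call_at_once / max_oncall_fraction)


def max_incidents_per_shift(shift_hours, hours_per_incident):
    """How many incidents fit in one shift if each gets its full follow-up time."""
    return int(shift_hours // hours_per_incident)


def operations_overloaded(ops_hours, total_hours, cap=0.5):
    """True when operational work exceeds the cap (50% in the slides)."""
    return ops_hours / total_hours > cap


print(min_oncall_engineers(2, 0.25))    # team size for primary + secondary
print(max_incidents_per_shift(12, 6))   # incidents per 12-hour shift
print(operations_overloaded(22, 40))    # 22 of 40 weekly hours on operations
```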
  • 24. “If you currently assign tickets randomly to victims on your team, stop. Doing so is extremely disrespectful of your team’s time, and works completely counter to the principle of not being interruptible as much as possible. Tickets should be a full-time role, for an amount of time that’s manageable for a person. If you happen to be in the unenviable position of having more tickets than can be closed by the primary and secondary on-call engineers combined, then structure your ticket rotation to have two people handling tickets at any given time. Don’t spread the load across the entire team.” SUPPORT TICKET
  • 25. GOOGLE SRE – CONCEPTS – OBSERVABILITY AND MONITORING
  • 26. Each monitoring system should address two questions: what’s broken (symptom) and why (cause). Two types of monitoring can be defined: • White-box monitoring: it inspects the internal state of the target service (application component metrics, traces, logs). Focus on causes. • Black-box monitoring: it accesses the system from outside, as a real user would (HTTP/TCP probes, DNS resolution, network ping). Symptom-oriented. Active recognition of error conditions. WHITE-BOX AND BLACK-BOX MONITORING
  • 27. Monitoring may have only three output types: • Pages - A human must do something now • Tickets - A human must do something within a few days • Logging - No one need look at this output immediately, but it’s available for later analysis if needed “Putting alerts into email and hoping that someone will read all of them and notice the important ones is the moral equivalent of piping them to /dev/null: they will eventually be ignored.” Every alert must be analysed; if an alert is routinely ignored, remove the alerting rule. MONITORING AND ALERTING
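The three output types can be read as a routing decision on each alert. The following is a minimal sketch of such a decision; the `Alert` fields and the routing rule are assumptions made for illustration, and a real alerting system would encode them in its own rule language.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human must do something now
    TICKET = "ticket"  # a human must do something within a few days
    LOG = "log"        # kept for later analysis only


@dataclass
class Alert:
    name: str
    user_impact_now: bool  # is the service being hurt right now?
    needs_human: bool      # does it require human action at all?


def route(alert):
    """Map an alert to exactly one of the three allowed outputs."""
    if alert.needs_human and alert.user_impact_now:
        return Output.PAGE
    if alert.needs_human:
        return Output.TICKET
    return Output.LOG


print(route(Alert("checkout errors above SLO", True, True)))
print(route(Alert("disk 80% full", False, True)))
print(route(Alert("cache miss spike", False, False)))
```

There is deliberately no email output: anything that does not deserve a page or a ticket is only logged.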
  • 28. Collecting metrics and alerting are business-driven activities. Collecting metrics and managing alerts are different things. You need to collect metrics to have visibility into your systems, but you do not have to use all metrics to generate alerts. Measurement, monitoring and alerting must be related to SLOs. Do not alert on everything; alerts must report a service impact and be actionable. OBSERVABILITY
  • 29. Latency: The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors. Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second. THE FOUR GOLDEN SIGNALS (1/2)
  • 30. Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content. Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. THE FOUR GOLDEN SIGNALS (2/2)
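To make the four golden signals concrete, the sketch below derives them from one window of request samples. The `Sample` record, the use of mean latency, the capacity inputs and the sample data are assumptions for illustration. Only explicit failures (HTTP 5xx) are counted here; implicit and policy errors would need end-to-end checks that this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    status: int  # HTTP status code


def _mean(values):
    return sum(values) / len(values) if values else None


def golden_signals(samples, window_seconds, used_capacity, total_capacity):
    """Summarise one window of request samples into the four signals."""
    ok = [s.latency_ms for s in samples if s.status < 500]
    failed = [s.latency_ms for s in samples if s.status >= 500]
    return {
        # Latency is tracked separately for successful and failed requests.
        "latency_ok_ms": _mean(ok),
        "latency_error_ms": _mean(failed),
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": len(failed) / len(samples) if samples else 0.0,
        "saturation": used_capacity / total_capacity,
    }


# Made-up window: two successes, two fast failures, over 2 seconds.
window = [Sample(100, 200), Sample(300, 200), Sample(20, 500), Sample(40, 503)]
signals = golden_signals(window, window_seconds=2, used_capacity=6, total_capacity=8)
print(signals)
```

Mixing the two fast failures into a single average would make the service look quicker than it is for successful requests, which is the misleading calculation the slides warn about.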
  • 31. GOOGLE SRE – CONCEPTS – INCIDENT MANAGEMENT
  • 32. Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. A well-designed incident management process has the following features: • Recursive separation of responsibilities (incident command, operational work, communication, planning) • A recognised command post • Live incident state document • Clear handoff INCIDENT MANAGEMENT
  • 33. “It is better to declare an incident early and then find a simple fix and close out the incident than to have to spin up the incident management framework hours into a burgeoning problem.” If the answer to any of the following questions is yes, the event is an incident: • Do you need to involve a second team in fixing the problem? • Is the outage visible to customers? • Is the issue unsolved even after an hour’s concentrated analysis? WHEN TO DECLARE AN INCIDENT
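The three questions amount to a logical OR, which a team could encode in tooling so that incidents are declared consistently. A minimal sketch, with hypothetical parameter names and the one-hour limit expressed as 60 minutes:

```python
def is_incident(needs_second_team, visible_to_customers, minutes_unsolved):
    """True as soon as any one of the three declaration criteria holds."""
    return needs_second_team or visible_to_customers or minutes_unsolved >= 60


# An internal problem, solved by one team, but still open after 75 minutes.
print(is_incident(False, False, 75))
```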
  • 34. Incident Management Roles: • Incident commander: The incident commander holds the high-level state about the incident. They structure the incident response task force, assigning responsibilities according to need and priority. • Operations lead: The Ops lead works with the incident commander to respond to the incident by applying operational tools to the task at hand. The operations team should be the only group modifying the system during an incident. • Communication lead: This person is the public face of the incident response task force. • Planning lead: The planning role supports Ops by dealing with longer-term issues, such as filing bugs, arranging handoffs, and tracking how the system has diverged from the norm so it can be reverted once the incident is resolved. • Logistics lead: The logistics role supports Ops by dealing with things such as ordering dinner or dealing with vendors for spare parts. INCIDENT MANAGEMENT
  • 35. GOOGLE SRE – CONCEPTS – POSTMORTEMS
  • 36. “Postmortems should be blameless and focus on process and technology, not people. Assume the people involved in an incident are intelligent, are well intentioned, and were making the best choices they could given the information they had available at the time.” Ask which data were available, how the system was behaving, which actions were taken and what effect they had. Avoid asking why an action was or was not taken. POSTMORTEMS AND RETROSPECTIVES
  • 37. Content of a postmortem: • Incident summary • Detailed timeline • Detection • Impact • Root causes • Triggers • Mitigation and resolution • Lessons learned (what went well, what went wrong, where we got lucky) • Action items (with clear owner) POSTMORTEMS AND RETROSPECTIVES
  • 38. GOOGLE SRE – CONCEPTS – ENGAGEMENT MODEL
  • 39. Not all Google services receive close SRE engagement. Google defines three different engagement models: • Simple PRR (Production Readiness Review) Model • Early Engagement Model • Frameworks and SRE Platform SRE ENGAGEMENT MODEL
  • 40. When a development team requests that SRE take over production management of a service, one to three SREs are selected to conduct the PRR process. The discussion covers matters such as: • Establishing an SLO/SLA for the service • Planning for potentially disruptive design changes required to improve reliability • Planning and training schedules The Training phase unblocks onboarding of the service by the SRE team. It involves a progressive transfer of responsibilities and ownership of various production aspects of the service, including parts of operations, the change management process, access rights, and so forth. To complete the transition, the development team must be available to back up and advise the SRE team for a period of time as it settles in managing production for the service. This relationship becomes the basis for the ongoing work between the teams. SIMPLE PRR MODEL
  • 41. The Early Engagement Model introduces SRE earlier in the development lifecycle. SRE participates in Design and later phases, eventually taking over the service any time during or after the Build phase. This model is based on active collaboration between the development and SRE teams. EARLY ENGAGEMENT MODEL
  • 42. SRE builds framework modules to implement canonical solutions for the concerned production area. As a result, development teams can focus on the business logic, because the framework already takes care of correct infrastructure use. A framework essentially is a prescriptive implementation for using a set of software components and a canonical way of combining these components. The service frameworks implement infrastructure code in a standardized fashion and address various production concerns. Each concern is encapsulated in one or more framework modules, each of which provides a cohesive solution for a problem domain or infrastructure dependency. Framework modules address the various SRE concerns enumerated earlier, such as: • Instrumentation and metrics • Request logging • Control systems involving traffic and load management FRAMEWORKS AND SRE PLATFORM
  • 43. AM I DOING SRE OR DEVOPS?
  • 44. How can I know if I’m doing DevOps? • Do I consider Developers, Test Engineers and Sys Admins as members of the same team with one common goal? • Do I know and care about end-user features and functionalities (business value)? • Do I automate infrastructure and code deployment? Do I understand and manage the SDLC? • Do I focus on collecting as many metrics as possible, from the build process to production runtime? How can I know if I’m doing SRE? • Do I protect my 50% of the time for engineering activities? • Do I define SLOs and manage Error Budgets? • Do I continuously work to improve self-healing of production solutions? Do I target “autonomous” solutions and not only “automated” ones? • Do I use a structured Incident Management process with clearly defined roles? AM I DOING SRE OR DEVOPS?
  • 45. THE END – Q&A ?

Editor's Notes

  • #9: Text from preface and introduction of “Site Reliability Engineering”, authors Murphy, Niall Richard; Beyer, Betsy; Jones, Chris; Petoff, Jennifer, published by O'Reilly Media.
  • #10: Text from preface and introduction of “Site Reliability Engineering”, authors Murphy, Niall Richard; Beyer, Betsy; Jones, Chris; Petoff, Jennifer, published by O'Reilly Media.
  • #25: No more than 25% of an engineer’s time can be spent on-call, leaving up to another 25% for other types of operational, nonproject work. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: each engineer is on-call (primary or secondary) for one week every month. For each on-call shift, an engineer should have sufficient time to deal with any incidents and follow-up activities such as writing postmortems. We’ve found that on average, dealing with the tasks involved in an on-call incident (root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs) takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift.
  • #46: When we mention “same team” it does not necessarily mean a concrete unique team; developers, test engineers and sys admins should consider themselves as part of a common virtual team, independently from the practical and concrete organisation.