SRE Book Ch 1,2
KKStream SRE Study Group
Presenter: Chris Huang
2018/08/20
Chapter 1 -
Introduction
Sysadmin Approach
● Involves assembling existing software components and deploying them to
work together to produce a service. Sysadmins are then tasked with
running the service and responding to events and updates as they occur.
● As the system grows in complexity and traffic volume, generating a
corresponding increase in events and updates, the sysadmin team grows
to absorb the additional work.
● Because the sysadmin role requires a markedly different skill set than that
required of a product’s developers, developers and sysadmins are divided
into discrete teams: "development" and "operations" or "ops."
3
Sysadmin Approach
4
Pros:
● Easy to implement
● Many examples to learn from
● Relevant talent pool widely available
● Many existing tools and software components
Cons:
● Direct cost: manual intervention becomes expensive as the service grows (team size grows with it)
● Indirect cost: the two teams ("development" vs. "operations") differ in background, skill set, and incentives
● The traditional approach often ends in conflict
Google’s Approach
● Our Site Reliability Engineering teams focus on hiring software
engineers to run our products and to create systems to accomplish the
work that would otherwise be performed, often manually, by sysadmins.
Two main categories
● 50–60% are Google Software Engineers, or more precisely, people who
have been hired via the standard procedure for Google Software
Engineers.
● The other 40–50% are candidates who were very close to the Google
Software Engineering qualifications (i.e., 85–99% of the skill set required),
and who in addition had a set of technical skills that is useful to SRE but
is rare for most software engineers. By far, UNIX system internals and
networking (Layer 1 to Layer 3) expertise are the two most common types
of alternate technical skills we seek.
5
Google’s Approach
● End up with a team of people who
(a) will quickly become bored by performing tasks by hand
(b) have the skill set necessary to write software to replace their previously
manual work
● Without constant engineering, operations load increases and teams will
need more people just to keep pace with the workload.
50% cap on the aggregate "ops" work for all SREs
● This cap ensures that the SRE team has enough time in their schedule to
make the service stable and operable.
● The SRE team should end up with very little operational load and almost
entirely engage in development tasks, because the service basically runs
and repairs itself: we want systems that are automatic, not just
automated.
6
50% Cap on Ops Work
● Often this means shifting some of the operations burden back to the
development team, or adding staff to the team without assigning that team
additional operational responsibilities.
● Consciously maintaining this balance between ops and development work
allows us to ensure that SREs have the bandwidth to engage in creative,
autonomous engineering, while still retaining the wisdom gleaned from the
operations side of running a service.
7
SRE Model Challenges
● SREs compete for the same candidates as
the product development hiring pipeline
● The fact that we set the hiring bar so high in
terms of both coding and system
engineering skills means that our hiring pool
is necessarily small
8
● Not much industry information exists on
how to build and manage an SRE team
(hopefully this book helps)
● Require strong management support: the
decision to stop releases for the remainder
of the quarter once an error budget is
depleted might not be embraced by a
product development team unless
mandated by their management
DevOps or SRE?
● One could view DevOps as a generalization of several core SRE
principles to a wider range of organizations, management structures, and
personnel.
● One could equivalently view SRE as a specific implementation of
DevOps with some idiosyncratic extensions.
9
https://medium.com/kkstream/%E5%A5%BD%E6%96%87%E7%BF%BB%E8%AD%AF-%E4%BD%A0%E5%9C%A8%E6%89%BE%E7%9A%84%E6%98%AF-sre-%E9%82%84%E6%98%AF-devops-2ded43c2852
Tenets of SRE
● In general, an SRE team is responsible for the tasks below
● These rules and work practices help us maintain our focus on engineering work, as opposed
to operations work
10
● availability
● latency
● performance
● efficiency
● change management
● monitoring
● emergency response
● capacity planning
Ensuring a Durable Focus on Engineering
● Cap operational work for SREs at 50% of their time; the remaining time should be spent using
their coding skills on project work
● Redirecting excess operational work to the product development teams
● Reassigning bugs and tickets to development managers
● Reintegrating developers into on-call pager rotations
● This also provides an effective feedback mechanism, guiding developers to build systems that
don’t need manual intervention.
● SREs should receive a maximum of two events per 8–12-hour on-call shift. This target
volume gives the on-call engineer enough time to handle the event accurately and quickly, clean
up and restore normal service, and then conduct a postmortem.
● Postmortems should be written for all significant incidents, regardless of whether or not they
paged. This investigation should establish what happened in detail, find all root causes of the
event, and assign actions to correct the problem or improve how it is addressed next time.
● A blame-free postmortem culture, with the goal of exposing faults and applying engineering to
fix these faults, rather than avoiding or minimizing them.
11
Maximum Change Velocity w/o Violating Service SLO
● The structural conflict is between pace of innovation and product
stability. In SRE we bring this conflict to the fore, and then resolve it with
the introduction of an error budget.
● 100% is the wrong reliability target for basically everything
● In general, no user can tell the difference between a system being 100%
available and 99.999% available. There are many other systems in the
path between user and service (their laptop, their home WiFi, their ISP,
the power grid…) and those systems collectively are far less than
99.999% available.
● The user receives no benefit from the enormous effort required to add that
last 0.001% of availability.
12
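The "many other systems in the path" argument can be made concrete: availabilities of serial components multiply, so the user's own path dominates whatever the backend does. A minimal sketch, with purely hypothetical component figures:

```python
def end_to_end(availabilities):
    """Serial components: the path is up only when every component is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# laptop, home WiFi, ISP, power grid (illustrative numbers, not measurements)
user_path = end_to_end([0.99, 0.99, 0.999, 0.9999])
print(f"user path availability: {user_path:.4f}")  # ~0.9790
```

At roughly 97.9% path availability, the user cannot observe the difference between a 99.99% and a 99.999% backend.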
Error Budget
● The business or the product must establish the system’s availability target.
● What level of availability will the users be happy with, given how they use the
product?
● What alternatives are available to users who are dissatisfied with the product’s
availability?
● What happens to users’ usage of the product at different availability levels?
● A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01%
unavailability is the service’s error budget. We can spend the budget on anything we
want, as long as we don’t overspend it.
● SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to
spend the error budget getting maximum feature velocity. This change makes all the
difference. An outage is no longer a "bad" thing—it is an expected part of the
process of innovation, and an occurrence that both development and SRE teams
manage rather than fear.
13
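The budget arithmetic is simple enough to sketch; the 30-day window below is an assumption, not something the book prescribes:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assuming a 30-day window

def error_budget_seconds(slo: float, window: int = SECONDS_PER_MONTH) -> float:
    """Permitted unavailability implied by an availability SLO."""
    return (1.0 - slo) * window

# A 99.99% monthly target leaves ~4.3 minutes of downtime to spend on
# launches, experiments, and other planned risk.
budget_minutes = error_budget_seconds(0.9999) / 60
print(f"monthly error budget: {budget_minutes:.1f} minutes")
```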
Monitoring
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software
should do the interpreting, and humans should be notified only when they need to take action.
14
Alerts
Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Tickets
Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
Logging
No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
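The three categories reduce to a tiny decision rule; this is a sketch of the classification, not Google's actual tooling:

```python
def route(needs_human: bool, immediate: bool) -> str:
    """Map a monitoring event to one of the three output categories."""
    if not needs_human:
        return "log"     # recorded for diagnostics; nobody is notified
    if immediate:
        return "alert"   # page a human right now
    return "ticket"      # a human must act, but days of delay cause no damage

assert route(needs_human=True, immediate=True) == "alert"
assert route(needs_human=True, immediate=False) == "ticket"
assert route(needs_human=False, immediate=False) == "log"
```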
Emergency Response
● Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
● The most relevant metric in evaluating the effectiveness of emergency response is how quickly
the response team can bring the system back to health—that is, the MTTR.
● Humans add latency. A system that can avoid emergencies that require human intervention will
have higher availability than a system that requires hands-on intervention.
● When humans are necessary, we have found that thinking through and recording the best
practices ahead of time in a "playbook" produces roughly a 3x improvement in MTTR as
compared to the strategy of "winging it."
● The hero jack-of-all-trades on-call engineer does work, but the
practiced on-call engineer armed with a playbook works much better.
15
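The MTTF/MTTR relationship and the playbook effect can be worked through numerically; the failure and repair figures below are hypothetical, only the 3x MTTR factor comes from the text:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime fraction = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical service: one failure every 30 days, one-hour manual repairs.
base = availability(720, 1.0)
# Only MTTR changes with a playbook; the book's ~3x improvement gives:
with_playbook = availability(720, 1.0 / 3)
print(f"{base:.5f} -> {with_playbook:.5f}")
```

Shrinking MTTR is the lever emergency response controls, which is why the playbook matters more than the hero.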
Change Management
SRE has found that roughly 70% of outages are due to changes in a live system.
Best practices in this domain use automation to accomplish the following:
● Implementing progressive rollouts
● Quickly and accurately detecting problems
● Rolling back changes safely when problems arise
By removing humans from the loop, these practices avoid the normal problems
of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a
result, both release velocity and safety increase.
16
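The three automated practices combine into a canary-style loop; this is a sketch under stated assumptions, with `observed_error_rate` standing in for a real monitoring probe:

```python
def progressive_rollout(stages, observed_error_rate, slo_error_rate=0.001):
    """Advance a release through traffic stages; roll back on a bad signal.

    `observed_error_rate(fraction)` is a hypothetical probe returning the
    error rate seen while the new version serves `fraction` of traffic.
    Real systems would also gate on latency, crashes, and other signals.
    """
    for fraction in stages:
        if observed_error_rate(fraction) > slo_error_rate:
            return ("rolled_back", fraction)   # automatic, safe rollback
    return ("fully_deployed", stages[-1])

good = progressive_rollout([0.01, 0.1, 0.5, 1.0], lambda f: 0.0002)
bad = progressive_rollout([0.01, 0.1, 0.5, 1.0], lambda f: 0.02)
print(good, bad)  # healthy release completes; bad one stops at the 1% canary
```

Because the detection and rollback run without a human in the loop, a bad change is caught while it serves only a sliver of traffic.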
Demand Forecasting and Capacity Planning
● Capacity planning should take both organic growth (which stems
from natural product adoption and usage by customers) and
inorganic growth (which results from events like feature launches,
marketing campaigns, or other business-driven changes) into
account.
● An accurate organic demand forecast, which extends beyond
the lead time required for acquiring capacity
● An accurate incorporation of inorganic demand sources into
the demand forecast
● Regular load testing of the system to correlate raw
capacity(servers, disks, and so on) to service capacity
● Because capacity is critical to availability, it naturally follows that the
SRE team must be in charge of capacity planning, which means
they also must be in charge of provisioning.
17
KKS: for teams that leverage
AWS or GCP, this also means
being responsible for costs.
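Turning a forecast into raw capacity is arithmetic; the demand figures, per-server capacity, and 30% headroom below are all illustrative assumptions:

```python
import math

def servers_needed(organic_qps, inorganic_qps, qps_per_server, headroom=1.3):
    """Translate a demand forecast into raw capacity.

    `headroom` (a hypothetical 30%) covers failures and short-term peaks;
    `qps_per_server` would come from regular load testing that correlates
    raw capacity to service capacity.
    """
    peak = (organic_qps + inorganic_qps) * headroom
    return math.ceil(peak / qps_per_server)

# Illustrative forecast: 8,000 QPS organic plus a 2,000 QPS feature launch,
# with load tests showing ~500 QPS of service capacity per server.
print(servers_needed(8000, 2000, 500))  # 26
```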
Provisioning
18
● Provisioning combines both change management and capacity
planning.
● In our experience, provisioning must be conducted quickly and only
when necessary, as capacity is expensive.
● Adding new capacity often involves spinning up a new instance or
location, making significant modification to existing systems
(configuration files, load balancers, networking), and validating that
the new capacity performs and delivers correct results.
KKS: increase the service
limits in AWS?
Efficiency and Performance
● Resource use is a function of demand (load), capacity, and software
efficiency. SREs predict demand, provision capacity, and can modify the
software. These three factors are a large part of a service’s efficiency.
● Software systems become slower as load is added to them. A slowdown in
a service equates to a loss of capacity. At some point, a slowing system
stops serving, which corresponds to infinite slowness.
● SREs provision to meet a capacity target at a specific response speed,
and thus are keenly interested in a service’s performance.
● SREs and product developers will (and should) monitor and modify a
service to improve its performance, thus adding capacity and improving
efficiency.
19
The End of the Beginning
● Motivated originally by familiarity—"as a software engineer, this is how I
would want to invest my time to accomplish a set of repetitive tasks"
● It has become much more: a set of principles, a set of practices, a set of
incentives, and a field of endeavor within the larger software engineering
discipline.
20
Chapter 2 -
Production Environment
at Google, from the
Viewpoint of an SRE
Google-designed Datacenter
22
● To eliminate the confusion between server hardware and server software, we use the following
terminology throughout the book:
Machine
A piece of hardware (or perhaps a VM). Machines can run any server, so we don’t dedicate
specific machines to specific server programs. There’s no specific machine that runs our mail
server, for example; instead, resource allocation is handled by our cluster operating system, Borg.
Server
A piece of software that implements a service. The common use of the word conflates "binary
that accepts network connections" with "machine."
The topology of a Google datacenter
● Tens of machines are placed in a rack.
● Racks stand in a row.
● One or more rows form a cluster.
● Usually a datacenter building houses
multiple clusters.
● Multiple datacenter buildings that are
located close together form a campus.
23
Managing Machines
● Borg is a distributed cluster operating
system, similar to Apache Mesos. Borg
manages its jobs at the cluster level.
● Borg is responsible for running users’
jobs, which can either be indefinitely
running servers or batch processes like a
MapReduce.
● Jobs can consist of more than one
identical task. When Borg starts a job, it
finds machines for the tasks and tells the
machines to start the server program.
● Borg then continually monitors these
tasks. If a task malfunctions, it is killed and
restarted, possibly on a different machine.
24
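Borg's monitor-and-restart behavior amounts to a supervision loop. A minimal sketch, where `healthy` and `restart` are hypothetical hooks standing in for Borg's health checks and rescheduling:

```python
def supervise(tasks, healthy, restart):
    """One monitoring pass over a job's tasks.

    `healthy(task)` and `restart(task)` are hypothetical stand-ins for
    Borg's health checks and scheduling; a restarted task may be placed
    on a different machine.
    """
    restarted = []
    for task in tasks:
        if not healthy(task):
            restart(task)          # kill-and-restart, possibly elsewhere
            restarted.append(task)
    return restarted

# Tasks 0 and 2 are healthy; task 1 malfunctions and is restarted.
actions = supervise([0, 1, 2], healthy=lambda t: t != 1, restart=lambda t: None)
print(actions)  # [1]
```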
Storage
● D is a fileserver running on almost all machines in a
cluster.
● A layer on top of D called Colossus creates a
cluster-wide filesystem that offers the usual filesystem
semantics, as well as replication and encryption. Colossus
is the successor to GFS, the Google File System.
● There are several database-like services built on top of
Colossus:
● Bigtable is a NoSQL database system that can
handle databases that are petabytes in size.
● Spanner offers an SQL-like interface for users that
require real consistency across the world.
● Several other database systems, such as
Blobstore, are available.
25
Networking
● Instead of using "smart" routing hardware, we rely on less expensive "dumb" switching
components in combination with a central (duplicated) controller that precomputes best paths
across the network. Therefore, we’re able to move compute-expensive routing decisions away
from the routers and use simple switching hardware.
● Our Global Software Load Balancer (GSLB) performs load balancing on three levels:
● Geographic load balancing for DNS requests (for example, to www.google.com), described
in Load Balancing at the Frontend
● Load balancing at a user service level (for example, YouTube or Google Maps)
● Load balancing at the Remote Procedure Call (RPC) level, described in Load Balancing in
the Datacenter
● Service owners specify a symbolic name for a service, a list of BNS addresses of servers, and the
capacity available at each of the locations (typically measured in queries per second). GSLB then
directs traffic to the BNS addresses.
26
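The last bullet can be sketched as a capacity-weighted split; the BNS addresses are hypothetical, and real GSLB also weighs geography and backend health, which this ignores:

```python
def split_traffic(capacities, demand_qps):
    """Split demand across locations in proportion to declared capacity.

    `capacities` maps a BNS-style address (hypothetical names) to the QPS
    the service owner declared for that location.
    """
    total = sum(capacities.values())
    return {addr: demand_qps * cap / total for addr, cap in capacities.items()}

shares = split_traffic(
    {"/bns/us-east/shakespeare/0": 600, "/bns/eu-west/shakespeare/0": 400},
    demand_qps=500,
)
print(shares)  # 300 QPS to us-east, 200 QPS to eu-west
```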
Other System Software
The Chubby lock service provides a
filesystem-like API for maintaining locks.
Chubby handles these locks across
datacenter locations. It uses the Paxos
protocol for asynchronous consensus.
Data that must be consistent is well suited to
storage in Chubby. For this reason, BNS uses
Chubby to store the mapping between BNS paths
and IP address:port pairs.
Borgmon regularly "scrapes" metrics from
monitored servers. These metrics can be used
instantaneously for alerting and also stored for use
in historic overviews (e.g., graphs).
27
Our Software Infrastructure
● Our code is heavily multithreaded, so one task can easily use many cores. To facilitate
dashboards, monitoring, and debugging, every server has an HTTP server that provides
diagnostics and statistics for a given task.
● All of Google’s services communicate using a Remote Procedure Call (RPC) infrastructure
named Stubby; an open source version, gRPC, is available.
● Often, an RPC call is made even when a call to a subroutine in the local program needs to be
performed. This makes it easier to refactor the call into a different server if more modularity is
needed, or when a server’s codebase grows. GSLB can load balance RPCs in the same way it
load balances externally visible services.
● A server receives RPC requests from its frontend and sends RPCs to its backend. In traditional
terms, the frontend is called the client and the backend is called the server.
● Data is transferred to and from an RPC using protocol buffers, often abbreviated to
"protobufs," which are similar to Apache’s Thrift.
28
Our Development Environment
● Google Software Engineers work from a single shared repository
● If engineers encounter a problem in a component outside of their
project, they can fix the problem, send the proposed changes
("changelist," or CL) to the owner for review
● Changes to source code in an engineer’s own project require a
review. All software is reviewed before being submitted.
● When software is built, the build request is sent to build servers in a
datacenter. Even large builds are executed quickly, as many build servers
can compile in parallel.
29
Shakespeare: A Sample Service
30
We’re Hiring
31
https://jobs.lever.co/kkstream
Thank you!
KKStream (Japan) KKStream (Taiwan)
www.kkstream.com.tw
More Related Content

PPTX
SRE vs DevOps
PPTX
Site reliability engineering
PPTX
SRE-iously! Reliability!
PDF
SRE 101
PPTX
How Small Team Get Ready for SRE (public version)
PPTX
SRE 101 (Site Reliability Engineering)
PPTX
Site (Service) Reliability Engineering
PDF
Building an SRE Organization @ Squarespace
SRE vs DevOps
Site reliability engineering
SRE-iously! Reliability!
SRE 101
How Small Team Get Ready for SRE (public version)
SRE 101 (Site Reliability Engineering)
Site (Service) Reliability Engineering
Building an SRE Organization @ Squarespace

What's hot (20)

PPTX
A Crash Course in Building Site Reliability
PDF
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
PDF
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
PPTX
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
PDF
Sre summary
PDF
DevOps Vs SRE Major Differences That You Need To Know - Hidden Brains Infotech
PPTX
SRE (service reliability engineer) on big DevOps platform running on the clou...
PDF
Getting started with Site Reliability Engineering (SRE)
PPTX
What is Site Reliability Engineering (SRE)
PDF
Overview of Site Reliability Engineering (SRE) & best practices
PDF
OpenStack構築手順書 Kilo版
PPTX
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
PPTX
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
PDF
Api observability
PDF
Cloud Native Engineering with SRE and GitOps
PDF
Service Level Terminology : SLA ,SLO & SLI
PPTX
DevOps Torino Meetup - SRE Concepts
PDF
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
PDF
kintoneがAWSで目指すDevOpsQAな開発
PDF
How to SRE when you have no SRE
A Crash Course in Building Site Reliability
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Sre summary
DevOps Vs SRE Major Differences That You Need To Know - Hidden Brains Infotech
SRE (service reliability engineer) on big DevOps platform running on the clou...
Getting started with Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
Overview of Site Reliability Engineering (SRE) & best practices
OpenStack構築手順書 Kilo版
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
How to Move from Monitoring to Observability, On-Premises and in a Multi-Clou...
Api observability
Cloud Native Engineering with SRE and GitOps
Service Level Terminology : SLA ,SLO & SLI
DevOps Torino Meetup - SRE Concepts
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
kintoneがAWSで目指すDevOpsQAな開発
How to SRE when you have no SRE
Ad

Similar to Kks sre book_ch1,2 (20)

PPT
PDF
Project management part 2
PDF
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
PPT
Chapter 3 - Agile Software Development.ppt
PPTX
Agile is a flexible and iterative approach to software development that empha...
PPTX
Agile is a flexible and iterative approach to software development that empha...
PDF
aw_survivalguide_r2opt
PDF
Defect effort prediction models in software
PDF
Agile ERP_ Continuous Improvements Through Rapid, Incremental Implementations...
PDF
Site-Reliability-Engineering-v2[6241].pdf
PDF
A Pattern-Language-for-software-Development
PDF
Agile software development and challenges
PDF
7 deadly sins of backup and recovery
PDF
Observability at Scale
PDF
Basic-Project-Estimation-1999
PDF
Erp implementation lifecycle
PDF
Agile software development and challenges
PDF
Map r whitepaper_zeta_architecture
PDF
Netreo whitepaper 5 ways to avoid it management becoming shelfware
PDF
Towards preventing software from becoming legacy a road map
Project management part 2
Kept up by Potential IT Disasters? Your Guide to Disaster Recovery as a Servi...
Chapter 3 - Agile Software Development.ppt
Agile is a flexible and iterative approach to software development that empha...
Agile is a flexible and iterative approach to software development that empha...
aw_survivalguide_r2opt
Defect effort prediction models in software
Agile ERP_ Continuous Improvements Through Rapid, Incremental Implementations...
Site-Reliability-Engineering-v2[6241].pdf
A Pattern-Language-for-software-Development
Agile software development and challenges
7 deadly sins of backup and recovery
Observability at Scale
Basic-Project-Estimation-1999
Erp implementation lifecycle
Agile software development and challenges
Map r whitepaper_zeta_architecture
Netreo whitepaper 5 ways to avoid it management becoming shelfware
Towards preventing software from becoming legacy a road map
Ad

More from Chris Huang (20)

PDF
Data compression, data security, and machine learning
PDF
Kks sre book_ch10
PPTX
Real time big data applications with hadoop ecosystem
PPTX
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
PPTX
Approaching real-time-hadoop
PPTX
20130310 solr tuorial
PDF
Scaling big-data-mining-infra2
PPT
Applying Media Content Analysis to the Production of Musical Videos as Summar...
PDF
Wissbi osdc pdf
PDF
Hbase status quo apache-con europe - nov 2012
PDF
Hbase schema design and sizing apache-con europe - nov 2012
PPTX
重構—改善既有程式的設計(chapter 12,13)
PPTX
重構—改善既有程式的設計(chapter 10)
PPTX
重構—改善既有程式的設計(chapter 9)
PPTX
重構—改善既有程式的設計(chapter 8)part 2
PPTX
重構—改善既有程式的設計(chapter 8)part 1
PPTX
重構—改善既有程式的設計(chapter 7)
PPTX
重構—改善既有程式的設計(chapter 6)
PPTX
重構—改善既有程式的設計(chapter 4,5)
PPTX
重構—改善既有程式的設計(chapter 2,3)
Data compression, data security, and machine learning
Kks sre book_ch10
Real time big data applications with hadoop ecosystem
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
Approaching real-time-hadoop
20130310 solr tuorial
Scaling big-data-mining-infra2
Applying Media Content Analysis to the Production of Musical Videos as Summar...
Wissbi osdc pdf
Hbase status quo apache-con europe - nov 2012
Hbase schema design and sizing apache-con europe - nov 2012
重構—改善既有程式的設計(chapter 12,13)
重構—改善既有程式的設計(chapter 10)
重構—改善既有程式的設計(chapter 9)
重構—改善既有程式的設計(chapter 8)part 2
重構—改善既有程式的設計(chapter 8)part 1
重構—改善既有程式的設計(chapter 7)
重構—改善既有程式的設計(chapter 6)
重構—改善既有程式的設計(chapter 4,5)
重構—改善既有程式的設計(chapter 2,3)

Recently uploaded (20)

PDF
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
PDF
Modernizing your data center with Dell and AMD
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Machine learning based COVID-19 study performance prediction
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
PDF
Approach and Philosophy of On baking technology
PDF
Advanced Soft Computing BINUS July 2025.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
NewMind AI Weekly Chronicles - August'25 Week I
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
Modernizing your data center with Dell and AMD
Advanced methodologies resolving dimensionality complications for autism neur...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Machine learning based COVID-19 study performance prediction
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
GamePlan Trading System Review: Professional Trader's Honest Take
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Network Security Unit 5.pdf for BCA BBA.
Dropbox Q2 2025 Financial Results & Investor Presentation
Chapter 3 Spatial Domain Image Processing.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Reach Out and Touch Someone: Haptics and Empathic Computing
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
Approach and Philosophy of On baking technology
Advanced Soft Computing BINUS July 2025.pdf
NewMind AI Monthly Chronicles - July 2025
NewMind AI Weekly Chronicles - August'25 Week I

Kks sre book_ch1,2

  • 1. SRE Book Ch 1,2 KKStream SRE Study Group Presenter: Chris Huang 2018/08/20
  • 3. Sysadmin Approach ● Involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. ● As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. ● Because the sysadmin role requires a markedly different skill set than that required of a product’s developers, developers and sysadmins are divided into discrete teams: "development" and "operations" or "ops." 3
  • 4. Sysadmin Approach 4 ● Easy to implement ● Many examples to learn ● Relevant talent pool widely available ● Many of existing tools, software components ● Direct cost: manual intervention becomes expensive when service grows (team size grows as well) ● Indirect cost: two teams (development v.s. operation) are different from background, skill set, and incentives. ● Traditional approach often end up in conflicts
  • 5. Google’s Approach ● Our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins. Two main categories ● 50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. ● The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek. 5
  • 6. Google’s Approach ● End up with a team of people who (a) will quickly become bored by performing tasks by hand (b) have the skill set necessary to write software to replace their previously manual work ● Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. 50% cap on the aggregate "ops" work for all SREs ● This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. ● The SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. 6
  • 7. 50% Cap on Ops Work ● Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities. ● Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service. 7
  • 8. 0201 SRE Model Challenges ● SRE compete for the same candidates as the product development hiring pipeline ● The fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small 8 ● Not much industry information exists on how to build and manage an SRE team (hope this book would help) ● Require strong management support: the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management Section
  • 9. DevOps or SRE? ● One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. ● One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions. 9 https://medium.com/kkstream/%E5%A5%BD%E6%96%87%E7%BF%BB%E8%AD%AF-%E4%BD%A0%E5%9C%A8%E6%89%BE%E7%9A%84 %E6%98%AF-sre-%E9%82%84%E6%98%AF-devops-2ded43c2852
  • 10. Tenets of SRE ● In general, an SRE team is responsible for below tasks ● Those rules and work practices help us to maintain our focus on engineering work, as opposed to operations work 10 availability latency performance efficiency change management monitoring emergency response capacity planning
  • 11. Ensuring a Durable Focus on Engineering ● Caps operational work for SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work ● Redirecting excess operational work to the product development teams ● Reassigning bugs and tickets to development managers ● Reintegrating developers into on-call pager rotations ● This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention. ● SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. ● Postmortems should be written for all significant incidents, regardless of whether or not they paged. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. ● A blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them. 11
  • 12. Maximum Change Velocity w/o Violating Service SLO ● The structural conflict is between pace of innovation and product stability. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget. ● 100% is the wrong reliability target for basically everything ● In general, no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. ● The user receives no benefit from the enormous effort required to add that last 0.001% of availability. 12
  • 13. Error Budget ● The business or the product must establish the system’s availability target. ● What level of availability will the users be happy with, given how they use the product? ● What alternatives are available to users who are dissatisfied with the product’s availability? ● What happens to users’ usage of the product at different availability levels? ● A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it. ● SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a "bad" thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
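The error-budget arithmetic is easy to make concrete. A minimal sketch (not from the book; function name and 30-day period are my choices) that converts an availability target into allowed downtime:

```python
def error_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)

# 99.99% availability over 30 days leaves roughly 4.3 minutes of budget;
# 99.9% leaves roughly 43 minutes.
print(round(error_budget_minutes(0.9999), 1))
print(round(error_budget_minutes(0.999), 1))
```

This is why "four nines" changes how a team operates: the whole month's budget can be consumed by a single short outage.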
  • 14. Monitoring ● Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action. ● Alerts: signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation. ● Tickets: signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result. ● Logging: no one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
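The three categories amount to a routing decision made by software rather than by a human reading dashboards. A hypothetical sketch (all names invented) of that decision:

```python
from enum import Enum

class Route(Enum):
    ALERT = "page a human now"
    TICKET = "human action needed within days"
    LOG = "record only; read when something else prompts it"

def route_signal(needs_human_action: bool, urgent: bool) -> Route:
    """Software interprets the signal; humans are notified only when action is needed."""
    if not needs_human_action:
        return Route.LOG
    return Route.ALERT if urgent else Route.TICKET

print(route_signal(needs_human_action=True, urgent=False))  # Route.TICKET
```

The point of the sketch is the first branch: most signals should terminate in logging, never reaching a person at all.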
  • 15. Emergency Response ● Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR). ● The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the MTTR. ● Humans add latency. A system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. ● When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a "playbook" produces roughly a 3x improvement in MTTR as compared to the strategy of "winging it." ● The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.
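The "reliability is a function of MTTF and MTTR" claim corresponds to the standard steady-state availability formula, A = MTTF / (MTTF + MTTR). A worked sketch (numbers invented) showing why a 3x MTTR improvement from a playbook matters even when failures are rare:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Same failure rate; the playbook cuts repair time from 3 hours to 1 hour.
winging_it = availability(mttf_hours=1000, mttr_hours=3)
with_playbook = availability(mttf_hours=1000, mttr_hours=1)
print(f"{winging_it:.5f} -> {with_playbook:.5f}")
```

Since MTTF is often outside the responder's control, MTTR is the lever emergency response can actually pull.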
  • 16. Change Management SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following: ● Implementing progressive rollouts ● Quickly and accurately detecting problems ● Rolling back changes safely when problems arise By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.
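The three automated practices can be sketched together. In this hypothetical miniature, `healthy` stands in for real problem detection (a real system would compare canary error rates or SLIs against a baseline):

```python
def healthy(stage_fraction: float) -> bool:
    """Stand-in for automatic problem detection at each rollout stage."""
    return stage_fraction < 0.5  # pretend a problem surfaces at the 50% stage

def progressive_rollout(stages=(0.01, 0.1, 0.5, 1.0)) -> str:
    """Roll the change out in increasing stages; roll back automatically
    if any stage looks unhealthy, with no human in the loop."""
    for fraction in stages:
        if not healthy(fraction):
            return f"rolled back at {int(fraction * 100)}%"
    return "fully rolled out"

print(progressive_rollout())  # rolled back at 50%
```

Because the rollback decision is mechanical, it executes the same way at 3 a.m. as at 3 p.m., which is the fatigue/inattention point above.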
  • 17. Demand Forecasting and Capacity Planning ● Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account. ● An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity ● An accurate incorporation of inorganic demand sources into the demand forecast ● Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity ● Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning. KKS: For companies that leverage AWS or GCP, this also means being responsible for costs.
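As a hypothetical illustration (all numbers and names invented), the two demand sources combine into a single forecast that must extend past the provisioning lead time:

```python
def forecast_qps(current_qps: float,
                 organic_growth_per_quarter: float,
                 quarters_of_lead_time: int,
                 planned_launch_qps: float) -> float:
    """Project organic growth over the capacity lead time, then add
    inorganic demand from planned launches or campaigns."""
    organic = current_qps * (1 + organic_growth_per_quarter) ** quarters_of_lead_time
    return organic + planned_launch_qps

# 10k QPS growing 10%/quarter, 2 quarters of lead time, plus a 3k QPS launch.
print(round(forecast_qps(10_000, 0.10, 2, 3_000)))  # 15100
```

Load testing then tells you how many machines that forecast translates into, closing the loop from raw capacity to service capacity.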
  • 18. Provisioning ● Provisioning combines both change management and capacity planning. ● In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive. ● Adding new capacity often involves spinning up a new instance or location, making significant modifications to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. KKS: increase the service limits in AWS?
  • 19. Efficiency and Performance ● Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part of a service’s efficiency. ● Software systems become slower as load is added to them. A slowdown in a service equates to a loss of capacity. At some point, a slowing system stops serving, which corresponds to infinite slowness. ● SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service’s performance. ● SREs and product developers will (and should) monitor and modify a service to improve its performance, thus adding capacity and improving efficiency.
  • 20. The End of the Beginning ● Motivated originally by familiarity—"as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks" ● It has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline.
  • 21. Chapter 2 - Production Environment at Google, from the Viewpoint of an SRE
  • 22. Google-designed Datacenter ● To eliminate the confusion between server hardware and server software, we use the following terminology throughout the book: ● Machine: a piece of hardware (or perhaps a VM). Machines can run any server, so we don’t dedicate specific machines to specific server programs. There’s no specific machine that runs our mail server, for example. Instead, resource allocation is handled by our cluster operating system, Borg. ● Server: a piece of software that implements a service. The common use of the word conflates "binary that accepts network connections" with machine.
  • 23. The topology of a Google datacenter ● Tens of machines are placed in a rack. ● Racks stand in a row. ● One or more rows form a cluster. ● Usually a datacenter building houses multiple clusters. ● Multiple datacenter buildings that are located close together form a campus.
  • 24. Managing Machines ● Borg is a distributed cluster operating system, similar to Apache Mesos. Borg manages its jobs at the cluster level. ● Borg is responsible for running users’ jobs, which can either be indefinitely running servers or batch processes like a MapReduce. ● A job can consist of many identical tasks. When Borg starts a job, it finds machines for the tasks and tells the machines to start the server program. ● Borg then continually monitors these tasks. If a task malfunctions, it is killed and restarted, possibly on a different machine.
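A toy model (all names invented, vastly simpler than Borg) of the monitor-and-restart behavior described above: tasks whose machine has failed are restarted on a different, free machine:

```python
def reschedule(assignments: dict[str, str],
               failed: set[str],
               free: list[str]) -> dict[str, str]:
    """Move tasks off failed machines onto free ones; healthy tasks stay put."""
    new_assignments = {}
    spare = list(free)
    for task, machine in assignments.items():
        if machine in failed:
            new_assignments[task] = spare.pop(0)  # restart on a different machine
        else:
            new_assignments[task] = machine
    return new_assignments

print(reschedule({"t0": "m1", "t1": "m2"}, failed={"m2"}, free=["m9"]))
# {'t0': 'm1', 't1': 'm9'}
```

Because any machine can run any server, the scheduler needs no special-case logic per service; it only needs spare capacity somewhere in the cluster.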
  • 25. Storage ● D is a fileserver running on almost all machines in a cluster. ● A layer on top of D called Colossus creates a cluster-wide filesystem that offers the usual filesystem semantics, as well as replication and encryption. Colossus is the successor to GFS, the Google File System. ● There are several database-like services built on top of Colossus: ● Bigtable is a NoSQL database system that can handle databases that are petabytes in size. ● Spanner offers an SQL-like interface for users that require real consistency across the world. ● Several other database systems, such as Blobstore, are available.
  • 26. Networking ● Instead of using "smart" routing hardware, we rely on less expensive "dumb" switching components in combination with a central (duplicated) controller that precomputes best paths across the network. Therefore, we’re able to move compute-expensive routing decisions away from the routers and use simple switching hardware. ● Our Global Software Load Balancer (GSLB) performs load balancing on three levels: ● Geographic load balancing for DNS requests (for example, to www.google.com), described in Load Balancing at the Frontend ● Load balancing at a user service level (for example, YouTube or Google Maps) ● Load balancing at the Remote Procedure Call (RPC) level, described in Load Balancing in the Datacenter ● Service owners specify a symbolic name for a service, a list of BNS addresses of servers, and the capacity available at each of the locations (typically measured in queries per second). GSLB then directs traffic to the BNS addresses.
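The service-owner specification can be pictured as data plus a capacity-aware choice. A minimal hypothetical sketch (service name, BNS paths, and QPS figures are all invented; real GSLB is far more sophisticated):

```python
service = {
    "name": "shakespeare-search",            # symbolic service name (illustrative)
    "backends": {                            # BNS address -> capacity in QPS
        "/bns/us-east/shakespeare/0": 1000,
        "/bns/eu-west/shakespeare/0": 500,
    },
}

def pick_backend(backends: dict[str, int], current_qps: dict[str, int]) -> str:
    """Direct traffic to the backend with the most spare capacity."""
    return max(backends, key=lambda b: backends[b] - current_qps.get(b, 0))

# us-east is near its 1000 QPS capacity, so traffic shifts to eu-west.
print(pick_backend(service["backends"], {"/bns/us-east/shakespeare/0": 900}))
```

The design point: clients address a symbolic name, and the balancer, not the client, resolves it to a concrete backend based on declared capacity.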
  • 27. Other System Software ● Chubby: the Chubby lock service provides a filesystem-like API for maintaining locks. Chubby handles these locks across datacenter locations. It uses the Paxos protocol for asynchronous consensus. Data that must be consistent is well suited to storage in Chubby; for this reason, BNS uses Chubby to store the mapping between BNS paths and IP address:port pairs. ● Borgmon: regularly "scrapes" metrics from monitored servers. These metrics can be used instantaneously for alerting and also stored for use in historic overviews (e.g., graphs).
  • 28. Our Software Infrastructure ● Our code is heavily multithreaded, so one task can easily use many cores. To facilitate dashboards, monitoring, and debugging, every server has an HTTP server that provides diagnostics and statistics for a given task. ● All of Google’s services communicate using a Remote Procedure Call (RPC) infrastructure named Stubby; an open source version, gRPC, is available. ● Often, an RPC call is made even when a call to a subroutine in the local program needs to be performed. This makes it easier to refactor the call into a different server if more modularity is needed, or when a server’s codebase grows. GSLB can load balance RPCs in the same way it load balances externally visible services. ● A server receives RPC requests from its frontend and sends RPCs to its backend. In traditional terms, the frontend is called the client and the backend is called the server. ● Data is transferred to and from an RPC using protocol buffers, often abbreviated to "protobufs," which are similar to Apache’s Thrift.
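A hypothetical Python sketch (not Stubby or gRPC; all names invented) of why routing even local calls through an RPC-shaped interface eases refactoring: the call site depends only on the interface, so a local implementation can later be swapped for a stub that talks to a remote server without touching callers:

```python
class LocalSpellChecker:
    """In-process implementation of an invented spell-check interface."""
    def check(self, word: str) -> bool:
        return word.isalpha()

class RemoteSpellCheckerStub:
    """Client stub with the same interface; `transport` stands in for an RPC channel."""
    def __init__(self, transport):
        self._transport = transport
    def check(self, word: str) -> bool:
        return self._transport("SpellChecker.Check", word)

def count_valid(words, checker) -> int:
    # The call site is identical whether the checker is local or remote.
    return sum(1 for w in words if checker.check(w))

print(count_valid(["brillig", "sl1thy"], LocalSpellChecker()))  # 1
```

This is the modularity point in the bullet above: moving `check` into a different server changes which object is constructed, not how it is called.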
  • 29. Our Development Environment ● Google Software Engineers work from a single shared repository ● If engineers encounter a problem in a component outside of their project, they can fix the problem, send the proposed changes ("changelist," or CL) to the owner for review ● Changes to source code in an engineer’s own project require a review. All software is reviewed before being submitted. ● When software is built, the build request is sent to build servers in a datacenter. Even large builds are executed quickly, as many build servers can compile in parallel.
  • 30. Shakespeare: A Sample Service
  • 32. Thank you! KKStream (Japan) KKStream (Taiwan) www.kkstream.com.tw