When things go bump
in the night
Some thoughts on building and
running reliable software
About me
Senior Site Reliability Engineer, Dollar Shave Club
Part-time Faculty, California State University, Northridge
Mentor, CSUN META+LAB
Formerly: Katch, Zefr, Prevoty, Twitter, Eucalyptus, CSUN, LAUSD
@ahamilton55
Warning: Opinions ahead
Thinking about reliability
"The thing that amazes you
is not that your system goes
down sometimes. It's that
it's up at all."
Richard Cook
Velocity NY 2013, "Resilience In Complex Adaptive Systems"
Failure is always an option
Failure is almost the only guarantee that we have
We can't guarantee that what we are building will be successful and that it is
really what our users want
We can guarantee that if we run software for long enough, it will eventually fail
We need to keep this in mind when building software
100% reliability isn't feasible
"It turns out that past a certain point, however, increasing reliability is worse for a
service (and its users) rather than better! Extreme reliability comes at a cost:
maximizing stability limits how fast new features can be developed and how
quickly products can be delivered to users, and dramatically increases their cost,
which in turn reduces the numbers of features a team can afford to offer."
Site Reliability Engineering, Chapter 3 - Embracing Risk
Your users are probably okay with some degree of failure
Reliability is a feature, not a by-product
If reliability isn't a focus as part of what you're building, you won't get it
How will that new feature you're building fail?
Be a bit of a pessimist when it comes to the software you write
Understand your dependencies and figure out what should happen if they break
Systems are complex
Systems are complex:
● we run into unexpected behaviors
● we see unexpected responses to interventions
● new forms of failure are constantly being discovered and created
● changes are always occurring
The more complexity in a system, the more possibilities we have for a failure
Every new feature or service we add increases this complexity
Systems require change
Changes are always required
There are no systems that are perfect the first time and won't be updated
These changes or adaptations will introduce unreliability into the system over
time if we aren't careful
Changes can happen quickly or slowly over time, but both can cause problems
Build systems that are resilient
A resilient system can help itself
A service's dependencies can fail and the service can automatically react in a
way that allows itself to continue functioning in a degraded way
If one part of a page is not working, should we fail the entire page or decide
whether the user really needs that content?
Services can check to see if the failure was temporary by trying additional
requests to get the information from other nodes
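A minimal sketch of this idea in Go; fetchFrom, defaultRecommendations and the
Item type are hypothetical stand-ins for a real client and catalog, not this deck's
actual code:

package main

import (
    "context"
    "errors"
    "log"
)

// Item is a placeholder result type for this sketch.
type Item struct{ Name string }

// fetchFrom stands in for a real client call to a single replica.
func fetchFrom(ctx context.Context, addr, userID string) ([]Item, error) {
    return nil, errors.New("not implemented in this sketch")
}

// defaultRecommendations is the generic, non-personalized fallback content.
func defaultRecommendations() []Item {
    return []Item{{Name: "popular item"}}
}

// getRecommendations tries each replica in turn and degrades gracefully
// instead of failing the whole page when every dependency call fails.
func getRecommendations(ctx context.Context, replicas []string, userID string) []Item {
    for _, addr := range replicas {
        items, err := fetchFrom(ctx, addr, userID)
        if err == nil {
            return items
        }
        log.Printf("replica %s failed: %v", addr, err)
    }
    return defaultRecommendations()
}

func main() {
    items := getRecommendations(context.Background(),
        []string{"10.0.0.1:8080", "10.0.0.2:8080"}, "20253471")
    log.Printf("serving %d items", len(items))
}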
Reliability is a cultural aspect
It is difficult to just say, build more reliable software
There are usually reasons why people aren't focusing on reliability
We need to look at our current situation and determine why reliability isn't a focal
point
Perhaps you need to slow down a little bit and focus on pushing some
improvements
Does your organization have a "stop the line" feature?
Run as little code as possible
The less code you run, the less you have to worry about
Try to simplify and remove code from your system as much as possible
If all you're ever doing is adding more code then eventually you're going to hit a
wall and have a mess
Refactoring is a good practice that will help keep your code lean and help prune
out old code that could eventually be hit by a wayward user
It's difficult to decide if code is ever "done" as people and systems evolve over
time
How things fail
Legacy code
Legacy code makes your company money yet most try to avoid it as much as
possible
The new, shiny and perfect code that is being written now will be so much better
and have none of the problems from before so why worry about my legacy
services and code?
Yet it is what is currently running and giving our users the information they want
We shouldn't shun our old code until the new code is deployed and we have
completely moved away from the legacy service; until then it still needs some
TLC.
Beware of single points of failure (SPOF)
There are always single points of failure in any system
For example, if we only have a single load balancer for our service, what happens
if it goes down?
Try to limit the impact of a single system going down
Understand the dependencies in your system and don't fail if you don't have to
People are also points of failure
Beware of a Brent or a Hank or a Ben or a Gerardo or a Harold or ...
You want to spread out as much information as possible
If everyone always goes to that one person to fix everything, you should assume
that you need to start extracting as much of their knowledge as possible
Software is a team sport that requires the work of the team to be successful
Spreading that information out around your organization will help both the
organization and the person
You probably still need system experts
Having an expert in your system is still important
What you need to do is get them to work on ways of sharing their knowledge with
others as often as possible
Senior engineers become "10x engineers" by increasing the amount of work the
team as a whole is able to accomplish
This can be done in plenty of ways including automation and documentation
Users don't always use the "happy path"
The easiest way to add entropy to a software system is to have users
Many users will not proceed down the happy path that we designed for them and
will end up using our software in ways that we didn't think of
We need to look at the flow the user is supposed to take and figure out how
they'll break it
How will the user give us unusual data and what can we do to prevent its impact
on our services?
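A small Go sketch of checking that kind of unusual data up front; the route and
parameters mirror the log example later in this deck, and the limits are made up
for illustration:

package main

import (
    "log"
    "net/http"
    "strconv"
)

// purchaseItem rejects malformed input up front so bad data from a wayward
// user can't propagate into downstream services.
func purchaseItem(w http.ResponseWriter, r *http.Request) {
    userID, err := strconv.ParseInt(r.URL.Query().Get("userId"), 10, 64)
    if err != nil || userID <= 0 {
        http.Error(w, "invalid userId", http.StatusBadRequest)
        return
    }
    itemID, err := strconv.ParseInt(r.URL.Query().Get("itemId"), 10, 64)
    if err != nil || itemID <= 0 {
        http.Error(w, "invalid itemId", http.StatusBadRequest)
        return
    }
    // ... hand the validated IDs to the real inventory logic ...
    _, _ = userID, itemID
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/inventoryService/inventory/purchaseItem", purchaseItem)
    log.Fatal(http.ListenAndServe(":8080", nil))
}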
Everybody makes trade-offs
When developing software we make trade-offs for what we want to accomplish
It isn't that developers are okay with building software with bugs
People are always working within constraints
It's no one's fault when issues arise as we should assume everyone is doing their
best with whatever task they perform
We need to try and help as an organization by prioritizing reliability and making
sure that it is part of every engineer's process
Know your limits
Don't shun boring technology
Boring and tested technologies can be very helpful when trying to build a reliable
infrastructure
You don't always need to use the coolest thing on HackerNews because
"everyone else is doing it"
Figure out what problems new tech will truly solve for you rather than adopting it
for resume-driven development
Just because Google or Facebook or Twitter or Uber or <TechCo> has a certain
problem and used technology X to fix it doesn't mean that you have that problem
as well
Do your best when failing
There is little worse than a blank page when a failure has occurred
Try to give the user as much of the page as possible when a failure occurs
If you can't give the user what they wanted, can you give them a default set of
data?
Worst case, tell them you screwed up and that you're going to try and fix it
Expand your view
Move away from viewing hosts to viewing services
You should be able to look at the service as a whole and not worry so much if a
single host is having issues
If it is, you should be able to quarantine the host or remove it from customer
traffic, spin up another host and troubleshoot as needed
You should build microservices!
Microservices are the only logical choice to make
Groups can work on things without needing to talk to other groups
Everybody is doing them and they are much more successful
With the loose coupling, different groups can do whatever they want
Velocity is much faster for different groups in the company because they are no
longer waiting to release their features
Facebook, Twitter, Google, Uber, Lyft, AirBnB all talk about how microservices are
so much better too
Monoliths or bust!
Monoliths don't require you to talk over the network to another service
All of my data is available from a single service
I know how my service is generating a piece of data because the code is in
another file in this repo
Everybody knows where everything comes from
Don't need to worry about things like service discovery, mesh networks, sidecars,
dependent services, etc
Your architecture matters
Putting a network between calls, instead of keeping them in memory, forces us to
think more directly about reliability and adds complexity
Having a large, single service takes effort to effectively scale both development
and production
Architecture will change what we focus on making work properly for us
Understand "best practices"
Best practices tend to be best for the organization talking about them
Best practices can be helpful but might not be a one-to-one match
Learning about best practices is great but you should understand how they will
fit in your organization
Don't blindly follow what a company like Google or Facebook or Netflix does just
because it helped them
Take others' best practices and make them work in your own context
Do more before it's needed
The more work that is put in before we have an incident, the easier resolving that
incident should be
If we put in the work for metrics, documentation and other reliability features,
then when an incident occurs, we should have a better understanding of what is
failing so that we can more reliably determine the why(s)
Detection and recovery are key
Limit the impact of failures by detecting them and being able to recover the
service quickly
The faster the detection and recovery cycle, the shorter the failure will be
This pushes detection toward automation, and you should have automated ways
of pulling in additional information to determine the problem and eventually fix it
Detection should rarely involve a human and recovery can be automated where
applicable
Service tiers
There are different priorities for different systems that an org runs
If we break our systems up into tiers, we will get a better idea of what is important
and what can wait until the morning to fix
For example, processing payments for orders is probably a big deal and must be
fixed right away
The social media widgets on your site that show a listing of posts from your
accounts are probably less urgent to fix and can wait for the morning
Think about your infrastructure and break it into pieces based on how each part
affects the business and how the business prioritizes things
Build vs buy
If you're a small business or group, it's important to focus on what makes your
company money
Choose to buy before you build unless you have the resources available to
effectively build, troubleshoot and repair the service on a regular basis
Building might seem cheaper, but assume that if the service is tier 0 or 1, you'll
need to have someone focusing on that at least half of their time for the first
year it is running
Beware of service sprawl
Preparation
Practice makes better
Reliability is like anything else you want to be good at: it requires practice
Practice allows us to use our skills in a controlled environment and helps us find
areas to improve
We should be using deliberate practice where we set out to practice a particular
skill each time
Disaster Recovery Test (DiRT)
Chaos Engineering
Remember that practice is never 100% accurate
Define "done"
Create a definition of done for the work that your group does
Create an understanding that there is more to the work we do than closing the
ticket in JIRA
We should be making sure that it is easy to find as much information as possible
about the different systems we have
The added items can slow down work but will help if an issue occurs in the
future
Additional requirements of a task can include adding metrics, logging, updating
documentation, updating automation scripts, updating dashboards with new
data, updating runbooks with possible failures and how to fix them, etc
Documentation
Document as much as possible
Most engineers don't seem to like writing documentation, but when things break
it's very helpful
The more information that is available about a service, the less likely the system
expert will be bothered
Documentation should be living and updated
Try to consolidate where your documentation lives and make it easy to update
Beware of documentation in ephemeral sources
Runbooks
Runbooks are a set of vital, condensed information used during an incident
Runbooks hold basic information about a service such as which team owns it,
on-call lists, email lists, etc
Runbooks should also hold instructions for different common operations tasks
Every alert that you create for a service should have a corresponding runbook
section that describes the issue, how to gather data about the issue and
hopefully how to solve the issue
Try not to "assume" your reader knows everything about the service and that it is
being read at 3am
Automate whatever you can
Automation allows engineers to increase their effectiveness over time
Automation can help reduce errors when doing a task with many steps
Start by automating your most common tasks such as deployments or
infrastructure provisioning
Move on to other tasks that are performed regularly by one person
Look for "toil" tasks to help free up engineers to work on more difficult problems
You don't necessarily need to automate everything right away and humans can
still be involved
Make sure your automation is available when needed
Testing
Create different types of automated tests to make sure your system does what
you want
Testing should be as quick as possible to give the developers fast feedback
Extended testing can be pushed out into your production environment as well
End-to-end automated tests that are run every few minutes can let you know
when a problem is affecting users
Load testing can give you a baseline for how your service should act as its
workload increases
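A minimal sketch of such an end-to-end probe in Go; the /health URL and the
two-minute interval are assumptions for illustration, and in practice the result
would feed metrics or alerting rather than just a log line:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// probe exercises the service end to end, the same way a user would.
func probe(url string) error {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status: %d", resp.StatusCode)
    }
    return nil
}

func main() {
    // Run the probe every few minutes so problems that affect users are
    // noticed in production, not only at build time.
    for range time.Tick(2 * time.Minute) {
        if err := probe("https://example.com/health"); err != nil {
            log.Printf("end-to-end probe failed: %v", err)
        }
    }
}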
Metrics
Metrics are data points over time for how your service performs
We have different types of metrics such as counters, gauges, histograms, etc
Each of the different types has different uses
Collect at least RED metrics (Requests, Errors and Duration)
If you don't have metrics, start with a general view and become more specific
over time
Try to make metrics easy at your company by using shims or middleware or
services like linkerd or envoy
RED in one metric...
DataDog (statsd with tags)
dd.Timing("service.requests", duration,
[]string{fmt.Sprintf("code=%s", code)}, 1)
Prometheus
histogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name: "requests",
}, []string{"code"})
histogram.WithLabelValues(fmt.Sprintf("%d", code)).Observe(duration.Seconds())
Logging
Logging gives details on events from a service
When something goes wrong, you should log an error with as much detail as
possible
The more information provided, the easier it will be to fix the problem
Use log levels to allow operators to turn up or down the verbosity as needed
Be careful with what you add to a log message and exclude sensitive data
Make logs easier for both humans and machines to understand
Send all logs to a single location
What does this message mean?
What does each piece of information here represent?
10.185.248.71 - - [09/Jan/2015:19:12:06 +0000] 808840 "GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1" 500 17 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
Structured logging FTW
level="info" remote_ip="10.185.248.71" time="09/Jan/2015:19:12:06+0000"
request="GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1" code=500 resp_bytes=17 user_agent="Apache-HttpClient/4.2.6 (java
1.5)"
{"level"="info", "remote_ip": "10.185.248.71", "time":
"09/Jan/2015:19:12:06+0000", "request": "GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1", "code": 500, "resp_bytes": 17, "user_agent": "Apache-HttpClient/4.2.6
(java 1.5)"}
Define "availability"
What is availability for your services?
The definition can be different for every service
How can we go about defining availability for our services in a way that is easy
for us to understand and track?
How will we discover that we are not as reliable as we are expected to be?
Service Level Indicators (SLI)
An SLI is used by the internal group to keep track of how a service is functioning
An SLI can be a single metric or it can be an aggregate of metrics
SLIs will be used to alert engineers when something is not running properly
SLIs do not need to be set in stone and can change
Service Level Objectives (SLO)
An SLO is the goal for a specific SLI
SLOs can be set as a percentage (99.99%), a maximum (500ms) or a minimum
(50qps)
SLOs can have a time period for computation (per week, per month)
SLOs are used internally by the engineering groups to track how well a service
performs over time
SLOs can be used with error budgets to slow down changes if the SLO is broken
Base SLOs on historic data and expectations for the future
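A toy Go sketch of turning an availability SLI into an SLO check and an error
budget; the target and request counts are made-up numbers for illustration:

package main

import "fmt"

func main() {
    const (
        sloTarget = 0.999       // 99.9% of requests succeed over the window
        total     = 1_000_000.0 // requests seen in the SLO window
        failed    = 1_200.0     // requests that violated the SLI
    )

    // SLI: the fraction of requests that succeeded in the window.
    sli := (total - failed) / total

    // Error budget: how many failed requests the SLO allows in the window.
    budget := (1 - sloTarget) * total
    remaining := budget - failed

    fmt.Printf("SLI: %.5f (target %.3f)\n", sli, sloTarget)
    fmt.Printf("error budget: %.0f failures, remaining: %.0f\n", budget, remaining)
    if remaining < 0 {
        fmt.Println("SLO broken: slow down releases until the budget recovers")
    }
}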
Service Level Agreements (SLA)
An SLA is an external SLO given to users
An SLA is usually slightly lower than the SLO so that the engineering team has
some room if something goes wrong
SLAs should be based on the same SLIs as the internal SLOs
If internal to your company, an SLA lets other groups know how they should
prepare when using your service
If external to your company, an SLA should be verified with other business
entities to check for possible liabilities when broken for users
Peer review
People make mistakes and that isn't necessarily a bad thing
Peer review can help us catch bugs based on the different understandings
people have of a system
Peer review can help newer team members better understand a system by getting
feedback
Actionable alerts
Assume alerts will fire in the middle of the night and wake someone up
Every alert must have some action that can be taken to fix the issue
Every alert should have some information in the runbook about how to diagnose
and hopefully fix it
Limit the number of alerts received and fix noisy alerts
As you progress, try to let a machine fix the issue automatically and bring in a
human only if it can't remedy it
Point responders to runbooks inside the alert message for a quick response
Try to use metrics for alerts over log parsing
On-call rotations
Create rotations for who will be contacted when something goes wrong
Primary and secondary rotations with management escalation
Have at least four people on-call to spread the load
Try to have each part of the rotation be a week, max
Prepare engineers for on-call by having them shadow experienced engineers
Have your on-call person be the point of contact for other groups
Use on-call to find issues or inefficiencies and plan long term fixes
Backups aren't important
Being able to restore a backup is what's important
If you're unable to get the data that was backed up, you're wasting your
resources and time
Test your restoration process as often as possible
Automate your restoration so that something can automatically verify that your
backups are restorable
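A sketch of automating that verification; restore_backup.sh and
check_row_counts.sh are hypothetical scripts standing in for your real tooling:

package main

import (
    "log"
    "os/exec"
)

// verifyRestore restores the latest backup into a scratch environment and
// runs a basic sanity check against it.
func verifyRestore() error {
    if out, err := exec.Command("./restore_backup.sh", "--target", "scratch").CombinedOutput(); err != nil {
        log.Printf("restore failed: %v\n%s", err, out)
        return err
    }
    if out, err := exec.Command("./check_row_counts.sh", "scratch").CombinedOutput(); err != nil {
        log.Printf("sanity check failed: %v\n%s", err, out)
        return err
    }
    return nil
}

func main() {
    // Run from cron/CI so a broken backup is noticed before you need it.
    if err := verifyRestore(); err != nil {
        log.Fatal("backup is not restorable")
    }
    log.Println("backup restored and verified")
}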
Build checklists
Checklists are used by many industries when dealing with issues (pilots, power
plants, oil rigs)
When an issue occurs, a quick run through a checklist will give you more details
Create checklists to help those working on an incident easily gather data
Create checklists for the different problems you expect to see
Automate checklists as appropriate
There's always more to do
Feature gates
A/B testing
Retries
Dynamic service discovery
Distributed Tracing
And on and on...
When things go bump
Focus on what you know
Look at the alert that was sent and the data it uses
Look at other metrics your service outputs for more information
Look at logs for errors to see if you can begin to narrow down the problem
Check your upstream and downstream services to see if your service is part of a
larger issue
Gather as much data as you can from the information you're collecting and start
working on a solution
Try not to panic and make sure there is a reason you're performing each action
Incidents
The important thing to do is to communicate
Incidents occur when a large number of users are being affected by an issue
Determine a leveled framework for determining the severity of an issue and what
to do for each level
Reach out for help if you need it and this should become a team effort if required
The end goal is to get the service back up and running for the user as quickly as
possible
Incident command
You should set up an Incident Command when an incident has reached a certain
level in your severity framework
Determine a set of severity levels for when an incident requires more help
Divide up tasks for communication as well as triage and debugging
There should be an "Incident Commander" whose job it is to gather information
and disseminate it to the required groups
The incident commander shouldn't be the person directly working to fix the issue
Create a checklist or automate as much as possible to make this less stressful
Use your normal processes
Try not to change your processes when working on fixing an issue
If you have automation for completing different tasks, such as changing
configurations or deploying software, continue to use it
If in an incident, continue to use peer review before pushing changes to make
sure you don't cause a bigger issue in a stressful situation
If your process is too slow to use during an incident then you should also look at
making that process more effective
Learning is the key
Be able to restart if interrupted
Restart the processes that were killed or delayed
Make sure that you're able to pick up and fix any data that might have been
corrupted during an incident
Processes should be able to retrace steps in unfinished jobs and complete them
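One simple way to make a batch job restartable is a checkpoint file; a Go sketch
under that assumption:

package main

import (
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

const checkpointFile = "job.checkpoint" // last successfully processed ID

func loadCheckpoint() int {
    data, err := os.ReadFile(checkpointFile)
    if err != nil {
        return 0 // no checkpoint yet: start from the beginning
    }
    n, _ := strconv.Atoi(strings.TrimSpace(string(data)))
    return n
}

func saveCheckpoint(id int) error {
    return os.WriteFile(checkpointFile, []byte(fmt.Sprintf("%d\n", id)), 0o644)
}

func main() {
    // If the job was killed mid-run, it picks up after the last checkpoint
    // instead of redoing (or double-applying) earlier work.
    for id := loadCheckpoint() + 1; id <= 100; id++ {
        // ... process record id ...
        if err := saveCheckpoint(id); err != nil {
            log.Fatalf("could not save checkpoint: %v", err)
        }
    }
}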
Beware of the "root cause"
There is usually no single thing that fails to cause an incident
There is usually a chain of events that occur, and all of those events must occur
for a particular issue to happen
Make sure that you don't limit your view on a single aspect of the issue and miss
the others
Blameless postmortems
Gather as much information as you can about what happened before, during and
after an incident
Make this information available as widely as possible in the organization and
beyond
Don't blame people for issues that occurred as it isn't helpful
A postmortem should stick to provable facts from the incident
Try to do your postmortem as quickly as possible after the incident to keep the
information from changing due to hindsight bias
Postmortems should be a learning exercise from the incident
Postmortem meetings
Gather as a group either monthly or as needed
Share experiences and discoveries due to issues in your services
Postmortems might not be read by every team but there can still be insights that
will be helpful in other services
Do an overview of what happened, discuss next steps for long term fixing and
stay up-to-date on long term fix completion
Fix for the long term
Fix issues for the long term
Try not to constantly add band-aids for issues that happen often
Fixing issues can be easy or can require a complete rethinking of the problem
Understand that the company does have other priorities but that constant issues
can drive down morale and cause other problems
Updating all the things
Fixing code is not the only thing that needs to be updated after an incident
Did you add new metrics to the code that need to have new alerts or graphs
added to a dashboard?
Did you find new information for how to fix a particular type of issue that you
should add to documentation?
Are there other services that are also vulnerable to the issue you found that other
teams should be aware of?
References
Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems"
https://www.youtube.com/watch?v=PGLYEDpNu60
How Complex Systems Fail, Richard Cook
http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
How Complex Systems Fail -- The Morning Paper
https://blog.acolyer.org/2016/02/10/how-complex-systems-fail/
SRE Book
https://landing.google.com/sre/book.html
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE&feature=youtu.be
On-call Doesn't Have to Suck
https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0
References
Practical Postmortems at Etsy
https://www.infoq.com/articles/postmortems-etsy
Blameless PostMortems and a Just Culture
https://codeascraft.com/2012/05/22/blameless-postmortems/
Tackling Alert Fatigue
https://speakerdeck.com/caitiem20/tackling-alert-fatigue
Observability at Google
https://speakerdeck.com/rakyll/observability-at-google
Testing Microservices the Sane Way
https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Each necessary, but only jointly sufficient
https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/
STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity
https://drive.google.com/file/d/0B7kFkt5WxLeDTml5cTFsWXFCb1U/view
Weekly digests
DevOps Weekly
http://www.devopsweekly.com/
SRE Weekly
https://sreweekly.com/
Kubernetes Weekly
http://kube.news/
O’Reilly Systems Engineering and Operations
http://www.oreilly.com/webops-perf/newsletter.html
High Scalability
http://highscalability.com/
Thanks. Questions?
