When things go bump
in the night
Some thoughts on building and
running reliable software
About me
Senior Site Reliability Engineer, Dollar Shave Club
Part-time Faculty, California State University, Northridge
Mentor, CSUN META+LAB
Formerly: Katch, Zefr, Prevoty, Twitter, Eucalyptus, CSUN, LAUSD
@ahamilton55
Warning: Opinions ahead
Thinking about reliability
"The thing that amazes you
is not that your system goes
down sometimes. It's that
it's up at all."
Richard Cook
Velocity NY 2013, "Resilience In Complex Adaptive Systems"
Failure is always an option
Failure is almost the only guarantee that we have
We can't guarantee that what we are building will be successful and that it is
really what our users want
We can guarantee that if we run software for long enough, it will eventually fail
We need to keep this in mind when building software
100% reliability isn't feasible
"It turns out that past a certain point, however, increasing reliability is worse for a
service (and its users) rather than better! Extreme reliability comes at a cost:
maximizing stability limits how fast new features can be developed and how
quickly products can be delivered to users, and dramatically increases their cost,
which in turn reduces the numbers of features a team can afford to offer."
Site Reliability Engineering, Chapter 3 - Embracing Risk
Your users are probably okay with some degree of failure
Reliability is a feature, not a by-product
If reliability isn't a focus as part of what you're building, you won't get it
How will that new feature you're building fail?
Be a bit of a pessimist when it comes to the software you write
Understand your dependencies and figure out what should happen if they break
Systems are complex
Systems are complex:
● we run into unexpected behaviors
● we see unexpected responses to interventions
● new forms of failure are constantly being discovered and created
● changes are always occurring
The more complexity in a system, the more possibilities we have for a failure
Every new feature or service we add increases this complexity
Systems require change
Changes are always required
There are no systems that are perfect the first time and won't be updated
These changes or adaptations will introduce unreliability into the system over
time if we aren't careful
Changes can happen quickly or slowly over time, but both can cause problems
Build systems that are resilient
A resilient system can help itself
A service's dependencies can fail and the service can automatically react in a
way that allows itself to continue functioning in a degraded way
If one part of a page is not working, should we fail the entire page or decide
whether the user really needs that content?
Services can check to see if the failure was temporary by trying additional
requests to get the information from other nodes
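A minimal sketch of this idea in Go; fetchFrom, defaultRecommendations and the
Item type are hypothetical stand-ins for a real client and catalog, not this deck's
actual code:

package main

import (
    "context"
    "errors"
    "log"
)

// Item is a placeholder result type for this sketch.
type Item struct{ Name string }

// fetchFrom stands in for a real client call to a single replica.
func fetchFrom(ctx context.Context, addr, userID string) ([]Item, error) {
    return nil, errors.New("not implemented in this sketch")
}

// defaultRecommendations is the generic, non-personalized fallback content.
func defaultRecommendations() []Item {
    return []Item{{Name: "popular item"}}
}

// getRecommendations tries each replica in turn and degrades gracefully
// instead of failing the whole page when every dependency call fails.
func getRecommendations(ctx context.Context, replicas []string, userID string) []Item {
    for _, addr := range replicas {
        items, err := fetchFrom(ctx, addr, userID)
        if err == nil {
            return items
        }
        log.Printf("replica %s failed: %v", addr, err)
    }
    return defaultRecommendations()
}

func main() {
    items := getRecommendations(context.Background(),
        []string{"10.0.0.1:8080", "10.0.0.2:8080"}, "20253471")
    log.Printf("serving %d items", len(items))
}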
Reliability is a cultural aspect
It is difficult to just say, build more reliable software
There are usually reasons why people aren't focusing on reliability
We need to look at our current situation and determine why reliability isn't a focal
point
Perhaps you need to slow down a little bit and focus on pushing some
improvements
Does your organization have a "stop the line" feature?
Run as little code as possible
The less code you run, the less you have to worry about
Try to simplify and remove code from your system as much as possible
If all you're ever doing is adding more code then eventually you're going to hit a
wall and have a mess
Refactoring is a good practice that will help keep your code lean and help prune
out old code that could eventually be hit by a wayward user
It's difficult to decide if code is ever "done" as people and systems evolve over
time
How things fail
Legacy code
Legacy code makes your company money yet most try to avoid it as much as
possible
The new, shiny and perfect code that is being written now will be so much better
and have none of the problems from before so why worry about my legacy
services and code?
Yet it is what is currently running and giving our users the information they want
We shouldn't shun our old code until the new code is deployed and we have
completely moved away from the legacy service; until then it still needs some
TLC.
Beware of single points of failure (SPOF)
There are always single points of failure in any system
For example, if we only have a single load balancer for our service, what happens
if it goes down?
Try to limit the impact of a single system going down
Understand the dependencies in your system and don't fail if you don't have to
People are also points of failure
Beware of a Brent or a Hank or a Ben or a Gerardo or a Harold or ...
You want to spread out as much information as possible
If everyone always goes to that one person to fix everything, you should assume
that you need to start extracting as much of their knowledge as possible
Software is a team sport that requires the work of the team to be successful
Spreading that information out around your organization will help both the
organization and the person
You probably still need system experts
Having an expert in your system is still important
What you need to do is get them to work on ways of sharing their knowledge with
others as often as possible
Senior engineers become "10x engineers" by increasing the amount of work the
team as a whole is able to accomplish
This can be done in plenty of ways including automation and documentation
Users don't always use the "happy path"
The easiest way to add entropy to a software system is to have users
Many users will not proceed down the happy path that we designed for them and
will end up using our software in ways that we didn't think of
We need to look at the flow the user is supposed to take and figure out how
they'll break it
How will the user give us unusual data and what can we do to prevent its impact
on our services?
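A small Go sketch of checking that kind of unusual data up front; the route and
parameters mirror the log example later in this deck, and the limits are made up
for illustration:

package main

import (
    "log"
    "net/http"
    "strconv"
)

// purchaseItem rejects malformed input up front so bad data from a wayward
// user can't propagate into downstream services.
func purchaseItem(w http.ResponseWriter, r *http.Request) {
    userID, err := strconv.ParseInt(r.URL.Query().Get("userId"), 10, 64)
    if err != nil || userID <= 0 {
        http.Error(w, "invalid userId", http.StatusBadRequest)
        return
    }
    itemID, err := strconv.ParseInt(r.URL.Query().Get("itemId"), 10, 64)
    if err != nil || itemID <= 0 {
        http.Error(w, "invalid itemId", http.StatusBadRequest)
        return
    }
    // ... hand the validated IDs to the real inventory logic ...
    _, _ = userID, itemID
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/inventoryService/inventory/purchaseItem", purchaseItem)
    log.Fatal(http.ListenAndServe(":8080", nil))
}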
Everybody makes trade-offs
When developing software we make trade-offs for what we want to accomplish
It isn't that developers are okay with building software with bugs
People are always working within constraints
It's no one's fault when issues arise as we should assume everyone is doing their
best with whatever task they perform
We need to try and help as an organization by prioritizing reliability and making
sure that it is part of every engineer's process
Know your limits
Don't shun boring technology
Boring and tested technologies can be very helpful when trying to build a reliable
infrastructure
You don't always need to use the coolest thing on HackerNews because
"everyone else is doing it"
Figure out what problems new tech will truly solve for you rather than adopting it
for resume-driven development
Just because Google or Facebook or Twitter or Uber or <TechCo> has a certain
problem and used technology X to fix it doesn't mean that you have that problem
as well
Do your best when failing
There is little worse than a blank page when a failure has occurred
Try to give the user as much of the page as possible when a failure occurs
If you can't give the user what they wanted, can you give them a default set of
data?
Worst case, tell them you screwed up and that you're going to try and fix it
Expand your view
Move away from viewing hosts to viewing services
You should be able to look at the service as a whole and not worry so much if a
single host is having issues
If it is, you should be able to quarantine the host or remove it from customer
traffic, spin up another host and troubleshoot as needed
You should build microservices!
Microservices are the only logical choice to make
Groups can work on things without needing to talk to other groups
Everybody is doing them and they are much more successful
With the loose coupling, different groups can do whatever they want
Velocity is much faster for different groups in the company because they are no
longer waiting to release their features
Facebook, Twitter, Google, Uber, Lyft, AirBnB all talk about how microservices are
so much better too
Monoliths or bust!
Monoliths don't require you to talk over the network to another service
All of my data is available from a single service
I know how my service is generating a piece of data because the code is in
another file in this repo
Everybody knows where everything comes from
Don't need to worry about things like service discovery, mesh networks, sidecars,
dependent services, etc
Your architecture matters
Putting a network between calls, instead of keeping them in memory, forces us to
think more directly about reliability and adds complexity
Having a large, single service takes effort to effectively scale both development
and production
Architecture will change what we focus on making work properly for us
Understand "best practices"
Best practices tend to be best for the organization talking about them
Best practices can be helpful but might not be a one-to-one match
Learning about best practices is great but you should understand how they will
fit in your organization
Don't blindly follow what a company like Google or Facebook or Netflix does just
because it helped them
Take others' best practices and make them work in your own context
Do more before it's needed
The more work that is put in before we have an incident, the easier resolving that
incident should be
If we put in the work for metrics, documentation and other reliability features,
then when an incident occurs, we should have a better understanding of what is
failing so that we can more reliably determine the why(s)
Detection and recovery are key
Limit the impact of failures by detecting them and being able to recover the
service quickly
The faster the detection and recovery cycle, the shorter the failure will be
This pushes detection toward automation, and you should have automated ways
of pulling in additional information to determine the problem and eventually fix it
Detection should rarely involve a human and recovery can be automated where
applicable
Service tiers
There are different priorities for different systems that an org runs
If we break our systems up into tiers, we will get a better idea of what is important
and what can wait until the morning to fix
For example, processing payments for orders is probably a big deal and must be
fixed right away
The social media widgets on your site that show a listing of posts from your
accounts are probably less urgent to fix and can wait for the morning
Think about your infrastructure and break it into pieces based on how each part
affects the business and how the business prioritizes things
Build vs buy
If you're a small business or group, it's important to focus on what makes your
company money
Choose to buy before you build unless you have the resources available to
effectively build, troubleshoot and repair the service on a regular basis
Building might seem cheaper, but assume that if the service is tier 0 or 1, you'll
need to have someone focusing on that at least half of their time for the first
year it is running
Beware of service sprawl
Preparation
Practice makes better
Reliability is like anything else you want to be good at: it requires practice
Practice allows us to use our skills in a controlled environment and helps us find
areas to improve
We should be using deliberate practice where we set out to practice a particular
skill each time
Disaster Recovery Test (DiRT)
Chaos Engineering
Remember that practice is never 100% accurate
Define "done"
Create a definition of done for the work that your group does
Create an understanding that there is more to the work we do than closing the
ticket in JIRA
We should be making sure that it is easy to find as much information as possible
about the different systems we have
The added items can slow down work but will help if an issue occurs in the
future
Additional requirements of a task can include adding metrics, logging, updating
documentation, updating automation scripts, updating dashboards with new
data, updating runbooks with possible failures and how to fix them, etc
Documentation
Document as much as possible
Most engineers don't seem to like writing documentation, but when things break
it's very helpful
The more information that is available about a service, the less likely the system
expert will be bothered
Documentation should be living and updated
Try to consolidate where your documentation lives and make it easy to update
Beware of documentation in ephemeral sources
Runbooks
Runbooks are a set of vital, condensed information used during an incident
Runbooks hold basic information about a service such as which team owns it,
on-call lists, email lists, etc
Runbooks should also hold instructions for different common operations tasks
Every alert that you create for a service should have a corresponding runbook
section that describes the issue, how to gather data about the issue and
hopefully how to solve the issue
Try not to "assume" your reader knows everything about the service and that it is
being read at 3am
Automate whatever you can
Automation allows engineers to increase their effectiveness over time
Automation can help reduce errors when doing a task with many steps
Start by automating your most common tasks such as deployments or
infrastructure provisioning
Move on to other tasks that are performed regularly by one person
Look for "toil" tasks to help free up engineers to work on more difficult problems
You don't necessarily need to automate everything right away and humans can
still be involved
Make sure your automation is available when needed
Testing
Create different types of automated tests to make sure your system does what
you want
Testing should be as quick as possible to give the developers fast feedback
Extended testing can be pushed out into your production environment as well
End-to-end automated tests that are run every few minutes can let you know
when a problem is affecting users
Load testing can give you a baseline for how your service should act as its
workload increases
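A minimal sketch of such an end-to-end probe in Go; the /health URL and the
two-minute interval are assumptions for illustration, and in practice the result
would feed metrics or alerting rather than just a log line:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// probe exercises the service end to end, the same way a user would.
func probe(url string) error {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status: %d", resp.StatusCode)
    }
    return nil
}

func main() {
    // Run the probe every few minutes so problems that affect users are
    // noticed in production, not only at build time.
    for range time.Tick(2 * time.Minute) {
        if err := probe("https://example.com/health"); err != nil {
            log.Printf("end-to-end probe failed: %v", err)
        }
    }
}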
Metrics
Metrics are data points over time for how your service performs
We have different types of metrics such as counters, gauges, histograms, etc
Each of the different types has different uses
Collect at least RED metrics (Requests, Errors and Duration)
If you don't have metrics, start with a general view and become more specific
over time
Try to make metrics easy at your company by using shims or middleware or
services like linkerd or envoy
RED in one metric...
DataDog (statsd with tags)
dd.Timing("service.requests", duration,
[]string{fmt.Sprintf("code=%s", code)}, 1)
Prometheus
histogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name: "requests",
}, []string{"code"})
histogram.WithLabelValues(fmt.Sprintf("%d", code)).Observe(duration.Seconds())
Logging
Logging gives details on events from a service
When something goes wrong, you should log an error with as much detail as
possible
The more information provided, the easier it will be to fix the problem
Use log levels to allow operators to turn up or down the verbosity as needed
Be careful with what you add to a log message and exclude sensitive data
Make logs easier for both humans and machines to understand
Send all logs to a single location
What does this message mean?
What does each piece of information here represent?
10.185.248.71 - - [09/Jan/2015:19:12:06 +0000] 808840 "GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1" 500 17 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
Structured logging FTW
level="info" remote_ip="10.185.248.71" time="09/Jan/2015:19:12:06+0000"
request="GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1" code=500 resp_bytes=17 user_agent="Apache-HttpClient/4.2.6 (java
1.5)"
{"level"="info", "remote_ip": "10.185.248.71", "time":
"09/Jan/2015:19:12:06+0000", "request": "GET
/inventoryService/inventory/purchaseItem?userId=20253471&itemId=23434300
HTTP/1.1", "code": 500, "resp_bytes": 17, "user_agent": "Apache-HttpClient/4.2.6
(java 1.5)"}
Define "availability"
What is availability for your services?
The definition can be different for every service
How can we go about defining availability for our services in a way that is easy
for us to understand and track?
How will we discover that we are not as reliable as we are expected to be?
Service Level Indicators (SLI)
An SLI is used by the internal group to keep track of how a service is functioning
An SLI can be a single metric or it can be an aggregate of metrics
SLIs will be used to alert engineers when something is not running properly
SLIs do not need to be set in stone and can change
Service Level Objectives (SLO)
An SLO is the goal for a specific SLI
SLOs can be set as a percentage (99.99%), a maximum (500ms) or a minimum
(50qps)
SLOs can have a time period for computation (per week, per month)
SLOs are used internally by the engineering groups to track how well a service
performs over time
SLOs can be used with error budgets to slow down changes if the SLO is broken
Base SLOs on historic data and expectations for the future
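A toy Go sketch of turning an availability SLI into an SLO check and an error
budget; the target and request counts are made-up numbers for illustration:

package main

import "fmt"

func main() {
    const (
        sloTarget = 0.999       // 99.9% of requests succeed over the window
        total     = 1_000_000.0 // requests seen in the SLO window
        failed    = 1_200.0     // requests that violated the SLI
    )

    // SLI: the fraction of requests that succeeded in the window.
    sli := (total - failed) / total

    // Error budget: how many failed requests the SLO allows in the window.
    budget := (1 - sloTarget) * total
    remaining := budget - failed

    fmt.Printf("SLI: %.5f (target %.3f)\n", sli, sloTarget)
    fmt.Printf("error budget: %.0f failures, remaining: %.0f\n", budget, remaining)
    if remaining < 0 {
        fmt.Println("SLO broken: slow down releases until the budget recovers")
    }
}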
Service Level Agreements (SLA)
An SLA is an external SLO given to users
An SLA is usually slightly lower than the SLO so that the engineering team has
some room if something goes wrong
SLAs should be based on the same SLIs as the internal SLOs
If internal to your company, an SLA lets other groups know how they should
prepare when using your service
If external to your company, an SLA should be verified with other business
entities to check for possible liabilities when broken for users
Peer review
People make mistakes and that isn't necessarily a bad thing
Peer review can help us catch bugs based on the different understandings
people have of a system
Peer review can help newer team members better understand a system by getting
feedback
Actionable alerts
Assume alerts will fire in the middle of the night and wake someone up
Every alert must have some action that can be taken to fix the issue
Every alert should have some information in the runbook about how to diagnose
and hopefully fix it
Limit the number of alerts received and fix noisy alerts
As you progress, try to let a machine fix the issue automatically and bring in a
human only if it can't remedy it
Point responders to runbooks inside the alert message for a quick response
Try to use metrics for alerts over log parsing
On-call rotations
Create rotations for who will be contacted when something goes wrong
Primary and secondary rotations with management escalation
Have at least four people on-call to spread the load
Try to have each part of the rotation be a week, max
Prepare engineers for on-call by having them shadow experienced engineers
Have your on-call person be the point of contact for other groups
Use on-call to find issues or inefficiencies and plan long term fixes
Backups aren't important
Being able to restore a backup is what's important
If you're unable to get the data that was backed up, you're wasting your
resources and time
Test your restoration process as often as possible
Automate your restoration so that something can automatically verify that your
backups are restorable
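A sketch of automating that verification; restore_backup.sh and
check_row_counts.sh are hypothetical scripts standing in for your real tooling:

package main

import (
    "log"
    "os/exec"
)

// verifyRestore restores the latest backup into a scratch environment and
// runs a basic sanity check against it.
func verifyRestore() error {
    if out, err := exec.Command("./restore_backup.sh", "--target", "scratch").CombinedOutput(); err != nil {
        log.Printf("restore failed: %v\n%s", err, out)
        return err
    }
    if out, err := exec.Command("./check_row_counts.sh", "scratch").CombinedOutput(); err != nil {
        log.Printf("sanity check failed: %v\n%s", err, out)
        return err
    }
    return nil
}

func main() {
    // Run from cron/CI so a broken backup is noticed before you need it.
    if err := verifyRestore(); err != nil {
        log.Fatal("backup is not restorable")
    }
    log.Println("backup restored and verified")
}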
Build checklists
Checklists are used by many industries when dealing with issues (pilots, power
plants, oil rigs)
When an issue occurs, a quick run through a checklist will give you more details
Create checklists to help those working on an incident easily gather data
Create checklists for the different problems you expect to see
Automate checklists as appropriate
There's always more to do
Feature gates
A/B testing
Retries
Dynamic service discovery
Distributed Tracing
And on and on...
When things go bump
Focus on what you know
Look at the alert that was sent and the data it uses
Look at other metrics your service outputs for more information
Look at logs for errors to see if you can begin to narrow down the problem
Check your upstream and downstream services to see if your service is part of a
larger issue
Gather as much data as you can from the information you're collecting and start
working on a solution
Try not to panic and make sure there is a reason you're performing each action
Incidents
The important thing to do is to communicate
Incidents occur when a large number of users are being affected by an issue
Determine a leveled framework for determining the severity of an issue and what
to do for each level
Reach out for help if you need it and this should become a team effort if required
The end goal is to get the service back up and running for the user as quickly as
possible
Incident command
You should set up an Incident Command when an incident has reached a certain
level in your severity framework
Determine a set of severity levels for when an incident requires more help
Divide up tasks for communication as well as triage and debugging
There should be an "Incident Commander" whose job it is to gather information
and disseminate it to the required groups
The incident commander shouldn't be the person directly working to fix the issue
Create a checklist or automate as much as possible to make this less stressful
Use your normal processes
Try not to change your processes when working on fixing an issue
If you have automation for completing different tasks, such as changing
configurations or deploying software, continue to use it
If in an incident, continue to use peer review before pushing changes to make
sure you don't cause a bigger issue in a stressful situation
If your process is too slow to use during an incident then you should also look at
making that process more effective
Learning is the key
Be able to restart if interrupted
Restart the processes that were killed or delayed
Make sure that you're able to pick up and fix any data that might have been
corrupted during an incident
Processes should be able to retrace steps in unfinished jobs and complete them
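One simple way to make a batch job restartable is a checkpoint file; a Go sketch
under that assumption:

package main

import (
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

const checkpointFile = "job.checkpoint" // last successfully processed ID

func loadCheckpoint() int {
    data, err := os.ReadFile(checkpointFile)
    if err != nil {
        return 0 // no checkpoint yet: start from the beginning
    }
    n, _ := strconv.Atoi(strings.TrimSpace(string(data)))
    return n
}

func saveCheckpoint(id int) error {
    return os.WriteFile(checkpointFile, []byte(fmt.Sprintf("%d\n", id)), 0o644)
}

func main() {
    // If the job was killed mid-run, it picks up after the last checkpoint
    // instead of redoing (or double-applying) earlier work.
    for id := loadCheckpoint() + 1; id <= 100; id++ {
        // ... process record id ...
        if err := saveCheckpoint(id); err != nil {
            log.Fatalf("could not save checkpoint: %v", err)
        }
    }
}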
Beware of the "root cause"
There is usually no single thing that fails to cause an incident
There is usually a chain of events that occur, and all of those events must occur
for a particular issue to happen
Make sure that you don't limit your view on a single aspect of the issue and miss
the others
Blameless postmortems
Gather as much information as you can about what happened before, during and
after an incident
Make this information available as widely as possible in the organization and
beyond
Don't blame people for issues that occurred as it isn't helpful
A postmortem should stick to provable facts from the incident
Try to do your postmortem as quickly as possible after the incident to keep the
information from changing due to hindsight bias
Postmortems should be a learning exercise from the incident
Postmortem meetings
Gather as a group either monthly or as needed
Share experiences and discoveries due to issues in your services
Postmortems might not be read by every team but there can still be insights that
will be helpful in other services
Do an overview of what happened, discuss next steps for long term fixing and
stay up-to-date on long term fix completion
Fix for the long term
Fix issues for the long term
Try not to constantly add band-aids for issues that happen often
Fixing issues can be easy or can require a complete rethinking of the problem
Understand that the company does have other priorities but that constant issues
can drive down morale and cause other problems
Updating all the things
Fixing code is not the only thing that needs to be updated after an incident
Did you add new metrics to the code that need to have new alerts or graphs
added to a dashboard?
Did you find new information for how to fix a particular type of issue that you
should add to documentation?
Are there other services that are also vulnerable to the issue you found that other
teams should be aware of?
References
Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems"
https://www.youtube.com/watch?v=PGLYEDpNu60
How Complex Systems Fail, Richard Cook
http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
How Complex Systems Fail -- The Morning Paper
https://blog.acolyer.org/2016/02/10/how-complex-systems-fail/
SRE Book
https://landing.google.com/sre/book.html
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE&feature=youtu.be
On-call Doesn't Have to Suck
https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0
References
Practical Postmortems at Etsy
https://www.infoq.com/articles/postmortems-etsy
Blameless PostMortems and a Just Culture
https://codeascraft.com/2012/05/22/blameless-postmortems/
Tackling Alert Fatigue
https://speakerdeck.com/caitiem20/tackling-alert-fatigue
Observability at Google
https://speakerdeck.com/rakyll/observability-at-google
Testing Microservices the Sane Way
https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Each necessary, but only jointly sufficient
https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/
STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity
https://drive.google.com/file/d/0B7kFkt5WxLeDTml5cTFsWXFCb1U/view
Weekly digests
DevOps Weekly
http://www.devopsweekly.com/
SRE Weekly
https://sreweekly.com/
Kubernetes Weekly
http://kube.news/
O’Reilly Systems Engineering and Operations
http://www.oreilly.com/webops-perf/newsletter.html
High Scalability
http://highscalability.com/
Thanks. Questions?
