Joe Smith
November 14, 2017
A Culture Of Automation
Joe Smith
Operations Engineer, Slack Application Operations Team
● Build/Run core systems responsible for Slack
● CDNs, Edge Regions, Web tier, Websockets, development workflow, etc
● Previously:
○ Tech Lead, Aurora/Mesos SRE at Twitter
○ Internal Technology Resident at Google
Folks with the desire to build
resilient systems
These roles may have different names, but
share the above goal
● Reliability Engineering
● Operations
● DevOps
● Site Reliability Engineering
● Production Engineering
● Systems Engineering
Audience
Agenda
1. Running Production Services
2. Runbooks
3. Automation
Running Production Services
Two Pizza Rule
A team should be sized to share
around two large pizzas
“We will take the site down today! 💥”
– No one when they wake up
● Careful Planning and Procedures
● Extensive Documentation
● Good Communication
Strategies
Planning
● Some components are prioritized for speed while others are meant to be
canaried and analyzed
● Changes need to be staged to coordinate with each other
● Give teams the tools and visibility they need to make improvements and
understand impact
● Identify your rollback strategy ahead of time
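One way to have a rollback strategy ready ahead of time is to treat "which version should be running" as a declared value, so that rolling back is the same operation as rolling forward. The sketch below is only an illustration: the `Service` record and the `deploy` and `rollback` functions are hypothetical names, not tools from the talk, and a real deploy would change hosts rather than a field in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical record of what is deployed; not a real tool's API."""
    name: str
    version: str
    history: list = field(default_factory=list)  # versions that ran before


def deploy(service: Service, target_version: str) -> Service:
    """Converge the service on target_version.

    Roll-forward and rollback both go through this one function, so the
    rollback path is exercised on every normal deploy.
    """
    if service.version != target_version:
        service.history.append(service.version)
        service.version = target_version
    return service


def rollback(service: Service) -> Service:
    """Rollback is just a deploy of the most recently replaced version."""
    if not service.history:
        raise RuntimeError(f"no earlier version recorded for {service.name}")
    return deploy(service, service.history[-1])
```

Because `rollback` only calls `deploy`, there is no separate, rarely used code path to discover under pressure.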
Documentation
● Each code or procedure change should be paired with an update to easily-
readable text
● Help your teammates and yourself weeks from now when you need to
understand how things work.
● Do not just describe how systems are structured, explain why they are built
that way!
● Additional context can inform future decisions
Communication
● Change Management - Coordinating release schedules can be difficult
● Launch Channel - Announce changes, link to more details in a feature-
specific Slack channel for the change
● Add links to commits, code reviews, threads in Slack, mailing list posts, and
StackOverflow questions
● This enables your team to benefit from the research you've done
● Careful Planning and Procedures
● Extensive Documentation
● Good communication
Strategies
● Unexpected changes, forced roll-forwards
● Outdated Runbooks
● Missed Notifications
Growing Pains
Good Problems to Have
As the team grows, it's no
longer possible to understand
everything that's happening at
once.
The scope of work is also
increasing!
Runbooks
“Checklists for commonly
repeated operational tasks.”
– Slack, Runbook README
1. Location
2. Format
3. Contents
Location
● Google Docs
○ Good Formatting, Mobile Apps, external service
● Wiki
○ Web Interface, Track Changes
● Markdown in git repo (paired with Github)
○ Formatting, offline support, normal Pull Request flow
Markdown in Git Repo
● Track changes across revisions
● Optional peer review
● Link to relevant sections
● Clone repo for offline support
Runbook Template
(thanks to my teammate Megan!)
● ApplicationServer
○ README.md
○ standard_actions.md
○ other_actions.md
○ alerts/
■ box_failure.md
■ some_alert_name.md
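The template above could be stamped out with a small script so that every service starts with the same layout. This is a minimal sketch, assuming the file names from the template slide; `scaffold_runbook` and `title_for` are hypothetical helper names, and using the service name as the README heading is an assumption made for the example.

```python
from pathlib import Path

# File names taken from the template slide; "some_alert_name.md" stands in
# for one file per real alert.
TEMPLATE_FILES = [
    "README.md",
    "standard_actions.md",
    "other_actions.md",
    "alerts/box_failure.md",
    "alerts/some_alert_name.md",
]


def title_for(filename: str) -> str:
    """Turn 'some_alert_name.md' into 'Some Alert Name'."""
    stem = Path(filename).stem
    return " ".join(word.capitalize() for word in stem.split("_"))


def scaffold_runbook(root: Path, service: str) -> list:
    """Create the template tree for one service; never overwrite a file."""
    created = []
    for relative in TEMPLATE_FILES:
        path = root / service / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            # Assumption: the README is headed by the service name, every
            # other file by a title derived from its file name.
            heading = service if relative == "README.md" else title_for(relative)
            path.write_text(f"# {heading}\n")
            created.append(path)
    return created
```

Running it a second time creates nothing, so it is safe to use on a directory that already holds hand-written runbooks.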
README.md
standard_actions.md
other_actions.md
alert.md
box_failure.md
Example
Content
This is not the place for Design
Documentation.
These are highly-actionable,
succinct descriptions of next
steps.
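Consistent formatting is easier to keep when a script checks it. The sketch below tests two conventions described in the speaker notes: the first-level heading reiterates the file name, and every ordered-list item uses the "1. " prefix. The function names are hypothetical, and the check is deliberately simple (for instance, it does not skip numbered lines inside code samples).

```python
import re
from pathlib import Path

# Matches Markdown ordered-list items such as "1. Do the thing".
ORDERED_ITEM = re.compile(r"^\s*(\d+)\.\s")


def expected_heading(filename: str) -> str:
    """'some_alert_name.md' should start with '# Some Alert Name'."""
    stem = Path(filename).stem
    return "# " + " ".join(word.capitalize() for word in stem.split("_"))


def lint_runbook(filename: str, text: str) -> list:
    """Return a list of readable problems; an empty list means it passes."""
    problems = []
    lines = text.splitlines()
    first = next((line for line in lines if line.strip()), "")
    if first.strip() != expected_heading(filename):
        problems.append(f"first heading should be {expected_heading(filename)!r}")
    for number, line in enumerate(lines, start=1):
        match = ORDERED_ITEM.match(line)
        if match and match.group(1) != "1":
            problems.append(f"line {number}: use '1. ' for every ordered item")
    if not any(ORDERED_ITEM.match(line) for line in lines):
        problems.append("no ordered list of steps found")
    return problems
```

A check like this could run as part of the normal Pull Request flow, so that formatting problems are caught in review rather than at 2am.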
Automation
"Test until fear turns
to boredom."
– JUnit FAQ,
http://junit.sourceforge.net/doc/faq/faq.htm#tests_6
"Automate once fear
turns to boredom"
– Ancient SRE Corollary
Beyond Runbooks
● Turn a manual checklist into a testable, repeatable set of steps anyone can
run
● Anytime you discover a sharp edge or workaround, this can be codified in the
tool
● Reduce sections of "but if this happens, check this dashboard and then do
one of three things"
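The "if this happens, check this dashboard" sections are a natural first target, because a tool can query the downstream services itself and point the operator at the right runbook. A minimal sketch, assuming each health check is a plain function that returns True when healthy; the `triage` function, the check names, and the runbook paths are hypothetical, and real checks would query whatever metrics system is in use.

```python
from typing import Callable, Dict, List

# A health check returns True when the dependency looks healthy.
HealthCheck = Callable[[], bool]


def triage(checks: Dict[str, HealthCheck], runbooks: Dict[str, str]) -> List[str]:
    """Run every downstream check and return next steps for the operator."""
    next_steps = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception as error:  # a check that cannot run is itself a finding
            next_steps.append(f"{name}: check failed ({error}); verify by hand")
            continue
        if not healthy:
            link = runbooks.get(name, "no runbook yet; please write one")
            next_steps.append(f"{name}: unhealthy; see {link}")
    return next_steps or ["all downstreams look healthy; inspect the service itself"]
```

The tool only reports; deciding what to do with the findings stays with the person on call until the team trusts it to do more.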
The Tooling Workflow
Initial Steps
This process can evolve over a long time and generally improve things.
● One person has all the knowledge in their head
● That person writes down everything they know in a runbook
● Someone sees an annoying or complicated piece and writes a small script
to be run instead for a tiny part of the process
The next jump will be the most difficult part!
The Tooling Workflow
Maintenance
● It feels great to have written your first tool!
● You may be lucky and have no bugs
● Most likely there are some edge cases; that is okay and expected!
● Take some time to figure out what went wrong and how to make things better.
The Tooling Workflow
Completion
● Later on, another part of the process can be added in and the documentation
further updated
● Over time, the runbook becomes "Run this tool we wrote, send bugs to the
authors"
● Finally, there is no longer an entry! The tool is run automatically, or the
system itself is able to solve that problem
"The value of humans is to
execute Judgement, the value
of computers is to execute
instructions"
– Aron, teammate at Slack
Runbook to Automated Workflow
1. Brain
2. Runbook
3. Start of automation
4. Automation evolution (safeguards)
5. Self-contained Tool
6. Fully Automated
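Stages 3 and 4 can be pictured as a checklist expressed in code, with safeguards added around the risky steps. The sketch below is an assumption about what such a runner might look like, not the speaker's tool: `Step` and `run_steps` are hypothetical names, the dry run is on by default, and destructive steps need an explicit confirmation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]
    destructive: bool = False  # destructive steps require confirmation


def run_steps(steps: List[Step], confirm: Callable[[str], bool],
              dry_run: bool = True) -> List[str]:
    """Walk the checklist in order and return a log of what happened."""
    log = []
    for index, step in enumerate(steps, start=1):
        label = f"{index}. {step.description}"
        if dry_run:
            # Safeguard: by default nothing is executed, only listed.
            log.append(f"DRY RUN {label}")
            continue
        if step.destructive and not confirm(label):
            # Safeguard: a declined confirmation stops the whole run.
            log.append(f"STOPPED {label}")
            break
        step.action()
        log.append(f"DONE {label}")
    return log
```

For interactive use, `confirm` could wrap `input()`; once the team trusts the tool, it could be replaced by an automatic check, which is the move toward stages 5 and 6.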
Process in Code
● Using libraries like fabric, pychef, and boto3 can ease automation
● When there are issues, the code can be reviewed for process changes, git
history can be consulted, etc
● No more "I forgot that was changed and followed the old process!"
● Each time someone submits an improvement or workflow tweak, that will
always be useful from now on!
Thank You!
For more information go to: slack.com/jobs
Joe Smith
Operations Engineer, Slack Application Operations Team
● Come build the future of work!
○ https://slack.com/careers/641062/senior-site-reliability-engineer
● Please reach out and say hi!
○ @Yasumoto on Twitter
● Tools Scaffold
○ https://github.com/Yasumoto/tools

A culture of Automation - Joe Smith - DevOpsDays Tel Aviv 2017


Editor's Notes

  • #4: I do believe that folks in these roles all have different approaches, and that is excellent. There is a focus on continuity with Operations, and deep architectural expertise with Site Reliability Engineering. I’d love to talk to you all afterward to hear your approaches to building out teams!
  • #5: We're going to break this talk into 3 pieces, and I would love to talk about any of them for hours. But we only have time for one talk this year, so let's see what we'd like to focus on. Who here considers themselves an experienced Operations Engineer or SRE? Who has written at least a few hundred lines of Python? Who has published their own Python distribution to the Cheeseshop (aka PyPI)?
  • #6: This is going to set the stage for priorities and approaches based on years running large web services. The focus is primarily around working with a larger group of people— a team that has gone beyond the “two pizza rule”
  • #7: Once you go beyond this number of engineers, your problems begin to center around organization and collaboration instead of technology (usually!)
  • #8: There are a few universal best practices which we can use to inform how we structure our work.
  • #9: There are different approaches to doing this. For this talk, we're going to consider three pillars of a great operations team. These are aspects of rolling out a new tool, procedure, service, or automation.
  • #10: ^ Consider strategies for dealing with mistakes vs. just "do a rollback" ^ Is everyone familiar with rollback strategies? ^ Whenever you make a change in production, you should consider how you are going to roll back if something breaks! ^ DO NOT come up with your rollback strategy under pressure after things break! ^ Use the _feature-flag_ approach: hide chef/puppet changes behind a per-node attribute ^ Deploy processes should be the same for a roll-forward as for a rollback ^ Specify which version number should be running and structure the config to match
  • #11: These written pieces of text will generally be of two types. The first is Design Documentation, describing architecture and function. The second, which we will dig into, is called “Runbooks”. These are descriptions of how to run a service. They contain important information about how to spin the service up from scratch and how to repair components when they fail. Most importantly, there is a corresponding runbook for each page/alert that goes to an engineer.
  • #12: Documentation is written in the past, and consumed in the present. “Communication” is real-time: this means information is conveyed at the same time it is received.
  • #13: Any questions about bringing these together? So these are the ideal, but unfortunately that isn't always the case! Let’s talk about the corresponding issues we can run into.
  • #14: ^ These all get more difficult over time ^ More people, more difficult to communicate
  • #16: Here’s where we get to a really fun part of the talk. I’ve been on a team where our runbook was one really long wiki page, and now a team with much more structured information. I’m biased, but I think this approach is much better! A little effort has helped make oncall and remediation much easier. Even if you don’t follow anything else I suggest here— make time explicitly for your team to improve documentation!
  • #17: First, consider whether what you're writing is really a runbook. If you're documenting an architecture or a workflow that's expected to become muscle memory (like how Logstash works or our Chef workflow) then your file belongs in the ops directory. Runbooks are for repeatable processes. We are all required to refer to the runbook each and every time we execute a process documented in one so that changes to runbooks will reliably take effect.
  • #18: We’re going to talk about three different components to good runbooks.
  • #21: Runbooks should be organized into directories in the runbooks directory named descriptively and broadly. For example, message_server as a top level directory, with descriptive files within. Within those files there should be a first-level heading reiterating the name of the file (some_alert_name.md becomes "# Some Alert Name") and many second-level headings, following the template laid out in the example directory.
  • #22: There should also be a README.md file with a basic explanation of the service, related services, contacts/channels for that service, and related links/graphs. Now remember, when someone realizes something is wrong, this will likely be the first place they go to get the lay of the land. This should be generally helpful information.
  • #23: Each process documented should present, as prose, any context that may be required to decide if use of this runbook is appropriate or otherwise set the stage for execution. Then an ordered list should be presented. Use "1. " as the prefix for every element in the list to keep diffs tidier; when the Markdown is rendered the numbers will be sequential. Finally, any parting considerations should be presented, again as prose.
  • #24: It is very important that Markdown be used precisely when writing runbooks. In the heat of the moment, consistent formatting adds clarity and helps us move quickly and confidently.
  • #25: Specify where a command is to be run before specifying the command ("From staging:", "From your chef-repo:", etc.). Link to graphs, God pages, and other specific resources required during execution of the runbook. Include actions like updating the status site, sending particular messages in Slack, deploying the site, enabling notifications, and so on, lest they be forgotten or their omission confuse the operator executing your runbook.
  • #27: This is an example of something that gives some context— it clearly points the reader to the next services to investigate. Following along with what Corey said this morning, it could be improved in a few ways, most notably by helpfully linking to runbooks about how to diagnose whether memcache, databases, and other downstreams are healthy or not! This is something that an experienced team member can read and have no issues with. Someone new on the team may need to burn time getting help from a teammate before isolating the root cause.
  • #28: Ensure you are creating something that can be acted upon by a teammate at 2am. Otherwise expect to get woken up to chip in!
  • #29: So here’s my favorite part! To start we’re going to talk about one of my favorite maxims, followed by an ancient proverb.
  • #31: This is the best rule of thumb we have for ensuring we are directing efforts appropriately. For something that is so standard that you know you can handle it, and have seen the edge cases, it’s going to be boring. Don’t let yourself or your team stay working on the same issue for too long. This is the perfect opportunity to put in some extra effort to make sure it is automated.
  • #32: Automatically detect which procedures to follow. As we saw, that example runbook was helpful, but sort of passed the buck when it came to further troubleshooting. A potential tool might automatically query health metrics for downstream services to help identify any problems. Let’s walk through what that procedure might look like.
  • #36: Once you have defined the logic for a process, it can be written out in a language for the future. ^ That language could also happen to be understood by a machine! Humans can also read code, but machines can execute explicit instructions.
  • #37: ^ We have found that investing the time to write tools as we understand a process has been worth the effort ^ We are able to work on much more interesting problems, and not repeatedly wake up for the same issues again and again
  • #38: This can be a large amount of work, and the payoff isn’t always immediate. It can be slow-going to write something which is resilient enough to be run by someone not familiar with both the process and the tool. The payoff will be well worth it, however!
  • #39: I’m hopeful I’ve motivated you not only to invest in your runbooks, but also to look into improvements beyond just text. Using machines to do some work for us will allow our brains to focus on bigger and more impactful challenges.
  • #40: If you agree, disagree, or have any questions, I’d love to hear it! Thank you!