DATA-DRIVEN POSTMORTEMS
JASON YEE, DATADOG @GITBISECT
“THE ONLY REAL
MISTAKE IS THE
ONE FROM WHICH
WE LEARN
NOTHING.”
- Henry Ford
TW: @gitbisect @datadoghq
@gitbisect
Technical Writer/Evangelist
“Docs & Talks”
Travel Hacker & Whiskey Hunter
@datadoghq
SaaS-based monitoring
Trillions of data points per day
http://jobs.datadoghq.com
“The problems we work on at Datadog are hard
and often don't have obvious, clean-cut
solutions, so it's useful to cultivate your
troubleshooting skills, no matter what role you
work in.”
Internal Datadog Developer Guide
BLAMELESS
POSTMORTEMS
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
OUR FOCUS AREA
▸ Culture
▸ Sharing
CULTURE & SHARING RESOURCES
BLAMELESS POSTMORTEMS
▸ Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave Zwieback
http://bit.ly/human-postmortem
CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT
METRICS?
COLLECTING DATA IS CHEAP;
NOT HAVING IT WHEN YOU
NEED IT CAN BE EXPENSIVE
SO INSTRUMENT ALL THE THINGS!
4 QUALITIES OF GOOD METRICS
NOT ALL METRICS ARE EQUAL
1. WELL UNDERSTOOD
2. SUFFICIENT GRANULARITY
▸ 1 second: Peak 46%
▸ 1 minute: Peak 36%
▸ 5 minutes: Peak 12%
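The effect above, where the same signal shows a lower peak at coarser granularity, can be sketched in a few lines. This is a minimal illustration, assuming rollups are computed by averaging fixed windows; the data and the `rollup_peak` helper are hypothetical and do not reproduce the graphs on the slide.

```python
from statistics import mean

def rollup_peak(samples, window):
    """Average samples into fixed-size windows, then return the highest window average."""
    buckets = [samples[i:i + window] for i in range(0, len(samples), window)]
    return max(mean(bucket) for bucket in buckets)

# Hypothetical data: 10 minutes of per-second CPU readings at 5%,
# with one 20-second burst at 46%.
samples = [5.0] * 600
for i in range(290, 310):
    samples[i] = 46.0

# Peak seen at 1-second, 1-minute, and 5-minute granularity.
peaks = {window: rollup_peak(samples, window) for window in (1, 60, 300)}
for window, peak in peaks.items():
    print(f"{window:>3}s rollup: peak {peak:.1f}%")
```

The burst is fully visible at 1-second granularity and shrinks as the window grows, because each average mixes the burst with the quiet seconds around it.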
3. TAGGED & FILTERABLE
Query Based Monitoring
“What’s the average throughput of application:nginx per version?”
“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”
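Questions like the ones above amount to filtering on some tags and grouping by another. The following is a minimal sketch of that idea in plain Python; the data points and the `average_by` helper are hypothetical and are not a Datadog API.

```python
from collections import defaultdict

# Hypothetical data points: (requests per second, tags).
points = [
    (120.0, {"application": "nginx", "version": "1.10", "region": "us-east-1"}),
    (80.0, {"application": "nginx", "version": "1.10", "region": "us-west-1"}),
    (150.0, {"application": "nginx", "version": "1.12", "region": "us-east-1"}),
    (40.0, {"application": "postgresql", "version": "9.6", "region": "us-west-1"}),
]

def average_by(points, filters, group_key):
    """Keep points whose tags match every filter, then average the values per group_key."""
    groups = defaultdict(list)
    for value, tags in points:
        matches = all(tags.get(key) == wanted for key, wanted in filters.items())
        if matches and group_key in tags:
            groups[tags[group_key]].append(value)
    return {group: sum(values) / len(values) for group, values in groups.items()}

# "What's the average throughput of application:nginx per version?"
nginx_by_version = average_by(points, {"application": "nginx"}, "version")
print(nginx_by_version)
```

Swapping the arguments to `average_by(points, {"application": "postgresql"}, "region")` gives the per-region comparison from the second question. None of this works unless the tags were attached when the data was collected.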
4. LONG-LIVED
METRICS 101
HOW LONG?
▸ AWS CloudWatch: Up to 15 months at 1h granularity
▸ MS Azure Monitoring Service: Up to 90d at 1d granularity
▸ Google Stackdriver: Up to 6 weeks at 1m granularity
▸ Datadog: Up to 15 months at 1s granularity
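To compare the figures above on one scale, retention divided by granularity gives a rough count of data points kept per series. This is back-of-the-envelope arithmetic using the "up to" values from the slide, assuming 30-day months.

```python
DAY = 24 * 60 * 60  # seconds in a day
MONTH = 30 * DAY    # assumption: 30-day months, for a rough comparison only

# (retention in seconds, granularity in seconds), as listed on the slide.
retention = {
    "AWS CloudWatch": (15 * MONTH, 60 * 60),
    "MS Azure Monitoring Service": (90 * DAY, DAY),
    "Google Stackdriver": (6 * 7 * DAY, 60),
    "Datadog": (15 * MONTH, 1),
}

# Approximate number of data points retained per series.
points_per_series = {name: span // step for name, (span, step) in retention.items()}
for name, count in points_per_series.items():
    print(f"{name}: ~{count:,} points per series")
```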
P.S. - June 1! Mark your calendar!
RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES
There is no
singular
“Root Cause”
HUMAN ELEMENT
TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES
IF YOU’RE STILL
RESPONDING TO
THE INCIDENT,
IT’S NOT TIME FOR
A POSTMORTEM
HUMAN DATA
DATA COLLECTION: WHO?
▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users
HUMAN DATA
DATA COLLECTION: WHAT?
▸ Their perspective
▸ What they did
▸ What they thought
▸ Why they thought/did it
“WRITING IS NATURE’S WAY OF
LETTING YOU KNOW HOW SLOPPY
YOUR THINKING IS.”
RICHARD GUINDON
TELLING STORIES
“A PICTURE IS
WORTH A
THOUSAND
WORDS”
- ANCIENT PROVERB
HUMAN DATA
DATA COLLECTION: WHEN?
▸ As soon as possible.
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
▸ Get your project managers involved!
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Stress
▸ Sleep deprivation
▸ Burnout
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Blame
▸ Fear of punitive action
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Bias
▸ Anchoring
▸ Hindsight
▸ Outcome
▸ Availability (Recency)
▸ Bandwagon Effect
HOW WE
DO POSTMORTEMS
AT DATADOG
DATADOG POSTMORTEMS
A FEW NOTES
▸ Postmortems emailed company-wide
▸ Scheduled recurring postmortem meetings
DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT HAPPENED?
▸ Describe what happened here at a high level;
think of it as the abstract of a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE OUTAGE DETECTED?
▸ We want to make sure we detected the issue
early and would catch the same issue if it were to
repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
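The detection questions above are easier to answer when the timeline is recorded as timestamps. A minimal sketch follows, assuming four hypothetical milestones; the event names and times are illustrative and are not taken from a real incident.

```python
from datetime import datetime

# Hypothetical incident timeline; timestamps are illustrative only.
timeline = {
    "started": datetime(2017, 1, 1, 10, 0),
    "alerted": datetime(2017, 1, 1, 10, 4),
    "declared": datetime(2017, 1, 1, 10, 12),
    "resolved": datetime(2017, 1, 1, 11, 30),
}

def minutes_between(timeline, start, end):
    """Return the elapsed minutes between two named timeline events."""
    return (timeline[end] - timeline[start]).total_seconds() / 60

time_to_detect = minutes_between(timeline, "started", "alerted")
time_to_declare = minutes_between(timeline, "started", "declared")
time_to_resolve = minutes_between(timeline, "started", "resolved")
print(f"detect: {time_to_detect:.0f} min, "
      f"declare: {time_to_declare:.0f} min, "
      f"resolve: {time_to_resolve:.0f} min")
```

If there is no "alerted" timestamp to record because a person noticed first, that gap answers the question of whether a monitor existed on the metric.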
DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE RESPOND?
▸ Who was the incident owner & who else was
involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
*Names changed
CHATOPS
ARCHIVES
FTW!
TRACK LEARNINGS AS YOU GO
DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?
▸ Deep dive into the cause
▸ Examples from this incident:
▸ http://bit.ly/dd-statuspage
▸ http://bit.ly/alq-postmortem
DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow up notes
DATADOG’S POSTMORTEM TEMPLATE
RECAP:
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
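One way to keep a draft honest against the five sections recapped above is to model them as a small record and check which ones are still empty. This is a hypothetical sketch, not Datadog's actual tooling; the `Postmortem` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    """Hypothetical container mirroring the five template sections."""
    summary: str
    detection: str
    response: str
    cause: str
    next_steps: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        sections = {
            "summary": self.summary,
            "detection": self.detection,
            "response": self.response,
            "cause": self.cause,
            "next_steps": self.next_steps,
        }
        return [name for name, content in sections.items() if not content]

# Illustrative draft with only the first two sections filled in.
draft = Postmortem(
    summary="Elevated API error rate for some customers.",
    detection="Monitor on 5xx rate paged the on-call engineer.",
    response="",
    cause="",
)
print(draft.missing_sections())
```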
KEEP LEARNING
MORE RESOURCES
▸ Postmortem Template
http://bit.ly/postmortem-template
▸ Post-Incident Reviews by Jason Hand
http://bit.ly/post-incident-review
QUESTIONS?
LET’S TALK! @GITBISECT
@DATADOGHQ
SLIDES: bit.ly/cm-postmortems
