Learning from Learnings:
Anatomy of Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
App Engine Outage
Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine
Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identified 8-10 themes
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at
weekly team meeting
•  Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
Stitch Fix Outages
Oct / Nov 2016
• (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
• (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
• (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack]
• (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
• (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
• (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
• (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
• (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
• (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
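The incident list above can be tallied by cause to see which source of outages recurs most often. The sketch below is illustrative only: the durations are the approximate figures from the slide converted to minutes (3 ½ hours taken as 210), and the names `OUTAGES` and `tally_by_cause` are made up for this example.

```python
from collections import defaultdict

# (date, affected system, approximate minutes, cause), transcribed from the slide.
OUTAGES = [
    ("11/08/2016", "Spectre", 3, "Shared Database"),
    ("11/05/2016", "Spectre", 5, "Shared Database"),
    ("10/25/2016", "All systems", 5, "Shared Database"),
    ("10/24/2016", "All systems", 5, "Shared Database"),
    ("10/21/2016", "All systems", 210, "DDOS attack"),
    ("10/18/2016", "All systems", 3, "Shared Database"),
    ("10/17/2016", "All systems", 20, "Shared Database"),
    ("10/13/2016", "Minx escalation", 120, "Zendesk outage"),
    ("10/11/2016", "Label printing", 10, "FedEx outage"),
    ("10/10/2016", "Label printing", 15, "FedEx outage"),
    ("10/10/2016", "All systems", 10, "Shared Database"),
]


def tally_by_cause(outages):
    """Return {cause: (incident_count, total_minutes)}."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for _date, _system, duration, cause in outages:
        counts[cause] += 1
        minutes[cause] += duration
    return {cause: (counts[cause], minutes[cause]) for cause in counts}


if __name__ == "__main__":
    # Most frequent cause first.
    ranked = sorted(tally_by_cause(OUTAGES).items(), key=lambda kv: -kv[1][0])
    for cause, (count, total) in ranked:
        print(f"{cause}: {count} incidents, ~{total} minutes")
```

On these figures the shared database accounts for 7 of the 11 listed incidents but only about 51 minutes of downtime, while the single DDoS attack accounts for roughly 210 minutes.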
Database Stability
Problems
• 1. Applications contended on common tables
• 2. Scalability limited by database connections
• 3. One application could take down entire company
Stitch Fix
Stability Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
•  Results
Stability
Solutions
• 1. Focus on expensive queries
o Log
o Eliminate
o Rewrite
o Reduce
• 2. Manage database connections via connection
concentrator
• 3. Stability and Scalability Program
o Ongoing 25% investment in services migration
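As one possible reading of the "Log" step under expensive queries, the sketch below wraps a database connection and records every statement that takes longer than a threshold. The slides do not say how Stitch Fix logged its queries; `TimedConnection`, the threshold value, and the use of `sqlite3` are assumptions chosen only to keep the example self-contained.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("slow_queries")


class TimedConnection:
    """Wraps a sqlite3 connection and records statements slower than a threshold."""

    def __init__(self, conn, threshold_seconds=0.1):
        self._conn = conn
        self._threshold = threshold_seconds
        self.slow = []  # (sql, elapsed_seconds) pairs kept for later review

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self._conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        if elapsed >= self._threshold:
            # Candidate for the eliminate / rewrite / reduce steps.
            self.slow.append((sql, elapsed))
            logger.warning("slow query (%.3fs): %s", elapsed, sql)
        return rows


if __name__ == "__main__":
    db = TimedConnection(sqlite3.connect(":memory:"), threshold_seconds=0.0)
    db.execute("CREATE TABLE orders (id INTEGER)")
    db.execute("SELECT id FROM orders")
    print(len(db.slow), "statements at or above the threshold")
```

With a log like this in hand, the slowest or most frequent statements become the shortlist for the eliminate, rewrite, and reduce steps.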
WeWork
Login Issues
• Problem: Some members unable to log in
• Inconsistent representations across different services in
the system
• Over time, simple system interactions grew increasingly
complex and convoluted
• Not enough graceful degradation or automated repair
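To make "graceful degradation" concrete, here is a generic sketch, not a description of WeWork's system: when a downstream call fails, the caller falls back to the last known good copy instead of failing the whole request. `ServiceUnavailable`, `load_member`, `fetch_primary`, and the dictionary cache are all hypothetical names introduced for illustration.

```python
class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) downstream client when a call fails."""


def load_member(member_id, fetch_primary, cache):
    """Return (record, degraded).

    Prefer the primary service; if it is down, serve the last known good
    copy from the cache and flag the result as degraded.
    """
    try:
        record = fetch_primary(member_id)
    except ServiceUnavailable:
        if member_id in cache:
            return cache[member_id], True  # stale but usable
        raise  # nothing to fall back to
    cache[member_id] = record  # refresh the last known good copy
    return record, False


if __name__ == "__main__":
    cache = {}
    print(load_member("m1", lambda m: {"id": m}, cache))

    def down(_member_id):
        raise ServiceUnavailable()

    print(load_member("m1", down, cache))
```

The `degraded` flag lets the caller decide what a stale record is still good enough for, which is the kind of tradeoff the retrospective steps would need to settle.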
WeWork
Login Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
Common
Elements
• Unintentional, long-term accumulation of small,
individually reasonable decisions
• “Compelling event” catalyzes long-term change
• Blameless culture makes learning and improvement
possible
• Structured post-incident approach
If you don’t end up regretting
your early technology
decisions, you probably
over-engineered.
Lessons
• Leadership and Collaboration
• Quality and Discipline
• Driving Change
Blameless
Post-Mortems
• Open and Honest Discussion
o Detect
o Diagnose
o Mitigate
o Remediate
o Prevent
Engineers compete to take personal responsibility (!)
“Finally we can prioritize
fixing that broken system!”
Cross-Functional
Collaboration
• Best decisions made through partnership
• Shared context
o Tradeoffs and implications
o Given common context, well-meaning people generally agree
• “Disagree and Commit”
15 Million
“Never let a
good crisis go
to waste.”
Lessons
• Leadership and Collaboration
• Quality and Discipline
• Driving Change
Quality and reliability are
business concerns
“Do you have time to do it
twice?”
“We don’t have time to do it
right!”
The more constrained you are
on time or resources, the more
important it is to build it right
the first time.
“Do not try to
do everything.
Do one thing
well.”
Definition of
Done
• Feature complete
• Automated Tests
• Released to Production
• Feature gate turned on
• Monitored
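The list above treats "Released to Production" and "Feature gate turned on" as separate steps, which implies code can be deployed while still switched off. A minimal in-memory sketch of that idea follows; `FeatureGates`, `checkout_flow`, and the gate name `new_checkout` are hypothetical, and a real system would typically keep gate state in shared configuration rather than in process memory.

```python
class FeatureGates:
    """In-memory feature gates; any gate not explicitly turned on is off."""

    def __init__(self):
        self._enabled = set()

    def turn_on(self, name):
        self._enabled.add(name)

    def turn_off(self, name):
        self._enabled.discard(name)

    def is_on(self, name):
        return name in self._enabled


def checkout_flow(gates):
    # The new path is already deployed but only runs once its gate is on.
    return "new" if gates.is_on("new_checkout") else "old"


if __name__ == "__main__":
    gates = FeatureGates()
    print(checkout_flow(gates))  # gate off: existing behavior
    gates.turn_on("new_checkout")
    print(checkout_flow(gates))  # gate on: new behavior
```

Turning the gate off again restores the old path without a redeploy, which is what makes the final "Monitored" step actionable.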
Vicious Cycle
of Technical Debt
Technical
Debt
“No time
to do it
right”
Quick-
and-dirty
Virtuous Cycle
of Investment
Solid
Foundation
Confidence
Faster and
Better
Quality
Investment
Lessons
• Leadership and Collaboration
• Quality and Discipline
• Driving Change
Top-down
and
Bottom-up
Funding the
Improvement Program
• Approach 1: Standard project
o Prioritize and fund like any other project
o Works best when project is within one team
• Approach 2: Explicit global investment
o Agree on an up-front investment
(e.g., 25%, 30% of engineering efforts)
o Works best when program is across many teams
Finance
An unexpected ally …
Lessons
• Leadership and Collaboration
• Quality and Discipline
• Driving Change
“If you can’t change your
organization,
change your organization.”
-- Martin Fowler
We are Hiring!
700 software engineers
globally, in
• New York
• Tel Aviv
• San Francisco
• Seattle
• Shanghai
• Singapore
