This document provides guidance on conducting post-mortem analyses of outages. It discusses establishing standard procedures and templates to minimize defensiveness and promote learning. Key recommendations include focusing each analysis on process improvement rather than blame, having a third party lead the investigation, and following up to ensure that findings lead to action. Templates are provided to structure data collection, including timelines, root cause analysis, and a review of monitoring and logging. The overall goal is for teams to walk away from outages looking like heroes by continually improving based on lessons learned.