The evolving role of context in Incident Management
Matthew Boeckman
Developer Advocate
Victorops.com/blog
@matthewboeckman
Background
● 18 years on-call Ops
● 15 years w/ software teams
● Startup junkie
● DevOps enthusiast
3
What is VictorOps?
VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical
layer between your alerts and the people who receive them.
victorops.com/IMA
5
5 Phases of Incident Management
Detection: monitoring, metrics, thresholds
Response: alerting, on-call, escalation
Remediation: fixes, tickets, deployments
Analysis: postmortem, how or why, understand
Readiness: improvement, game days, learning
6
Standard Incident Workflow
Detection Response Remediation Analysis Readiness
7
Incident Management Assessment Matrix
Detection Response Remediation Analysis Preparedness
Novice
Beginner
Competent
Proficient
Expert
8
Incident Management Maturity Matrix
Detection Response Remediation Analysis Preparedness
Novice
Beginner
x
Competent
x x
Proficient
x x
Expert
9
Self Assessment
Poll: How would you rate your overall team maturity?
A. Novice
B. Beginner
C. Competent
D. Proficient
E. Expert
10
The Focus Question
How can we help teams
mature their incident management practice?
(Stated plainly: Make On-Call suck less)
11
Situational Context
12
Incident Management Key Metrics
● Mean Time to Repair (MTTR)
● Availability (SLA)
● Ticket Volumes
● Escalations
● Customer Satisfaction
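
As a rough illustration of the first two metrics, MTTR and availability can be derived from incident timestamps. This is a minimal sketch: the incident records, the 30-day window, and the function names are assumptions made up for the example, not data or tooling from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 45)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 15)),
    (datetime(2024, 1, 20, 23, 30), datetime(2024, 1, 21, 0, 30)),
]


def total_downtime(records):
    """Sum of (resolved - detected) across all incidents."""
    return sum((resolved - detected for detected, resolved in records), timedelta())


def mean_time_to_repair(records):
    """Average incident duration, measured from detection to resolution."""
    return total_downtime(records) / len(records)


def availability(records, window):
    """Fraction of the window not spent in an incident (assumes no overlaps)."""
    return 1 - total_downtime(records) / window


mttr = mean_time_to_repair(incidents)
uptime = availability(incidents, timedelta(days=30))
print(f"MTTR: {mttr}, availability: {uptime:.3%}")
```

Measuring from detection to resolution is one common convention; teams that start the clock at first customer impact will get a different number.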
13
Incident Management Key Metrics
14
Time Spent Managing Incidents - Low Maturity
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
15
Time Spent Managing Incidents - Medium Maturity
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
16
Time Spent Managing Incidents - High Maturity
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
17
A New Core Metric
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
Time to Learn (TTL)
Identify trends
Capacity plan
Improve infrastructure
Gamedays
Cross train
Update runbooks
18
Beep Beep Beep
19
Standard Incident Workflow
20
Standard Diagnostic Procedure
1. Fire up the VPN
2. Navigate dashboards, find relevant section
3. Review ticket or incident history for host
4. Review Runbooks for associated host
21
Common Bottlenecks to Establishing Context
● Multiple sources of record
● Duplicate Runbooks or documentation
● Metric overload
● New responders unfamiliar with systems
22
Where Does it Hurt?
Poll: Which is the most painful problem you experience in establishing context?
A. Multiple sources of record
B. Duplicate documentation
C. Metric overload
D. Everything is equally on fire
E. Everything is fantastic
23
Beep Beep Beep
24
A Tale of Two Graphs
Massive spike above expected norm
Response: Fire up the laptop and put a pot of coffee on
25
A Tale of Two Graphs
Small spike for a consistently loaded box.
Response: ACK alert, go back to sleep
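
The difference between the two graphs can be sketched as a comparison against each host's own history rather than a fixed threshold. This is only an illustration: the sample values, the 3-standard-deviation cutoff, and the `spike_severity` name are assumptions, not anything shown in the talk.

```python
from statistics import mean, stdev


def spike_severity(history, current, threshold=3.0):
    """Return 'page' when the current value is far from this host's own
    baseline (more than `threshold` standard deviations), otherwise 'ack'."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        return "page" if current != baseline else "ack"
    z_score = (current - baseline) / spread
    return "page" if abs(z_score) > threshold else "ack"


# Hypothetical CPU utilisation samples (percent) for two hosts.
quiet_box = [5, 6, 5, 7, 6, 5, 6]        # normally idle
busy_box = [80, 85, 78, 90, 82, 88, 84]  # consistently loaded

# The same reading of 92% means very different things on each host.
print("quiet box:", spike_severity(quiet_box, 92))
print("busy box:", spike_severity(busy_box, 92))
```

The same raw value pages on the idle host and is acknowledged on the loaded one, which is the kind of judgment a responder otherwise makes by eye from the graph.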
26
This Time, with Context!
27
Enhanced Contextual Workflow
28
Alert Enhancements
Poll: My team is doing some enhancement of alerts today.
A. True
B. False
Many incidents can be tracked to deploys
Developer Velocity = Constant Change
Silos impair communication
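
Since many incidents can be tracked to deploys, one possible enhancement is to attach recent deploys to an alert before it reaches the responder. A minimal sketch follows; the `Deploy` and `Alert` types, the one-hour lookback, and the sample data are all hypothetical and do not represent any VictorOps API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    version: str
    at: datetime


@dataclass
class Alert:
    service: str
    message: str
    at: datetime
    context: list = field(default_factory=list)


def attach_recent_deploys(alert, deploys, lookback=timedelta(hours=1)):
    """Annotate the alert with deploys to the same service that happened
    within `lookback` before the alert fired."""
    for deploy in deploys:
        age = alert.at - deploy.at
        if deploy.service == alert.service and timedelta(0) <= age <= lookback:
            alert.context.append(f"deployed {deploy.version} at {deploy.at:%H:%M}")
    return alert


# Hypothetical deploy log and alert.
deploys = [
    Deploy("web", "v1.4.2", datetime(2024, 1, 3, 1, 40)),
    Deploy("web", "v1.4.1", datetime(2024, 1, 2, 9, 0)),
    Deploy("billing", "v7.0.0", datetime(2024, 1, 3, 1, 50)),
]
alert = Alert("web", "HTTP 5xx rate high", datetime(2024, 1, 3, 2, 0))
print(attach_recent_deploys(alert, deploys).context)
```

A responder who sees the deploy next to the alert can form the "because we just deployed?" hypothesis without leaving the page.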
29
CI/CD Exacerbates the Contextual Challenge
30
A Tale of Two Incidents
31
A Tale of Two Incidents
32
Introducing: The Scientific Method
Make Observations (the measurement)
Ask a question (why would a webserver quit working?)
Form a hypothesis (because we just deployed?)
33
The Sandstorm
34
No. Do not.
35
Measure Everything: the Anti-pattern
Measurements cost time and money
Busy dashboards lead to subconscious filtering
Measurements create a natural impulse to alert
36
Enhance
37
Stop
38
An Embarrassment of Dashboards
39
Rule of Thumb
Measure much
Alert on some
Contextualize all
40
Iteration is Key
Dialing in context takes time
Conduct blameless postmortems
Experiment with more and less context
Be objective in your assessment of what works
41
Leverage Situational Context
Providing incident responders with context
can meaningfully impact MTTR,
paying dividends in time
to move your practice forward
42
The Beginning
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
43
The Goal
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
Time to Learn (TTL)
Identify trends
Capacity plan
Improve infrastructure
Gamedays
Cross train
Update runbooks
Take the IMA!
http://victorops.com/ima
Questions?
44
Thank you!
Matthew Boeckman
@matthewboeckman
Slides on devops.com & slideshare.com
45
Context Matters
● Simplifies diagnosis
● Context can be thought of as out-of-band signalling to a team: “recently patched”,
“often hangs”, “specific process must be followed for reboot”
● Context collapses informational walls (runbooks)
● Context collapses communication walls (escalations)
● Responders cannot be experts in everything
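
The out-of-band notes above could be kept in a small lookup and merged into each alert. This sketch assumes a hypothetical in-memory store; the host name, runbook path, and escalation label are placeholders, and in practice the notes might live in a wiki or CMDB.

```python
# Hypothetical annotation store keyed by host name.
HOST_NOTES = {
    "db01": {
        "notes": ["recently patched", "specific process must be followed for reboot"],
        "runbook": "runbooks/db01.md",      # placeholder path
        "escalation": "database on-call",   # placeholder team name
    },
}


def with_context(alert: dict) -> dict:
    """Return a copy of the alert with any known notes for its host merged in.
    Hosts without notes get an empty context rather than an error."""
    extra = HOST_NOTES.get(alert.get("host"), {})
    return {**alert, "context": extra}


enriched = with_context({"host": "db01", "check": "disk latency"})
unknown = with_context({"host": "web99", "check": "cpu"})
print(enriched["context"].get("notes"))
print(unknown["context"])
```

Keeping the runbook link and escalation target alongside the notes means a new responder does not need to know where that information normally lives.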

Sandstorm or Significant: The evolving role of context in Incident Management