Incident Response
Orchestration
October 26, 2017 | Berkay Mollamustafaoglu
The rise of the always-on services
Failures may be inevitable...
“We understand that we are operating large, complex
systems and build in redundancy and resiliency to
maximize uptime, we must also understand that
failures will still occur.”
Incident Management for Operations
Rob Schnepp, Ron Vidal & Chris Hawley
It’s how we respond that matters
Incident response is about how we respond to those
failures; it's about minimizing the likelihood that
failures become business-impacting outages.
Journey from alerting, on-call
management & escalations to Incident
Response
What happens during an incident?
● Detection
● Assessment
● Dispatching the responders
● Collaboration during the incident
● Stakeholder communications
● Post incident analysis
Assessing the priority
● Impact & urgency determine the priority
● There may be multiple dimensions for classification, but
keep it simple
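
One way to keep classification simple is a direct lookup from impact and urgency to a priority. The sketch below is only an illustration: the three-step scale, the P1 to P5 labels, and the scoring rule are assumptions, not something the slides prescribe.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label (P1 is the most severe).

    Assumed rule: the higher the combined score, the higher the priority.
    """
    score = int(impact) + int(urgency)   # 2 (low/low) up to 6 (high/high)
    return f"P{7 - score}"               # 6 -> P1, 5 -> P2, ..., 2 -> P5
```

With this rule a high-impact, low-urgency incident lands in the middle of the scale, which keeps the table small enough to memorize.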
Dispatching the responders
One or more teams can
be notified as the
responders (as well as
individuals)
Collaboration during the incident
● Chat
● Conference calls
● Web conference
Stakeholder communications
● Stakeholders need different
information
● Provide a simple way to
check status of services
and incident updates
● Enable users to control
which services they are
interested in
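
Letting users choose the services they care about could be modeled as a per-service subscription list. This is a minimal sketch under assumed names (`StatusSubscriptions`, `recipients`); a real status page would add persistence and delivery.

```python
from collections import defaultdict


class StatusSubscriptions:
    """Tracks which services each stakeholder chose to follow (hypothetical model)."""

    def __init__(self):
        # service name -> set of user names following that service
        self._by_service = defaultdict(set)

    def subscribe(self, user, service):
        self._by_service[service].add(user)

    def unsubscribe(self, user, service):
        self._by_service[service].discard(user)

    def recipients(self, affected_services):
        """Return, in sorted order, the users following any affected service."""
        users = set()
        for service in affected_services:
            users |= self._by_service.get(service, set())
        return sorted(users)
```

An incident update is then sent only to `recipients(...)` for the services it touches, so stakeholders are not flooded with updates they did not ask for.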
Post incident analysis
● Time to detect
● Time to engage
○ First responder
○ Each responder team
● Time to resolve
● How many people were involved?
● How many person-hours were spent?
● How many alerts were received?
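
The timing questions above can be answered from a handful of timestamps recorded per incident. The sketch below assumes four such timestamps; the field names are illustrative, and treating "time to resolve" as the span from first engagement to resolution is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Timestamps for one incident; field names are illustrative."""
    started: datetime        # when the failure began
    detected: datetime       # when the first alert fired
    acknowledged: datetime   # when the first responder engaged
    resolved: datetime       # when service was restored

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_engage(self) -> timedelta:
        return self.acknowledged - self.detected

    def time_to_resolve(self) -> timedelta:
        # Assumed definition: measured from first engagement to resolution.
        return self.resolved - self.acknowledged
```

Recording one `acknowledged` timestamp per responder team would give the per-team engagement times listed above.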
We race against time!
Preparation is the key!
Effective communication &
collaboration are essential.
Heroism is not sustainable
Don’t expect your people to compensate for systemic
problems forever
Don’t become the crutch for systemic flaws
What can we do to minimize MTTR?
MTTR
(Mean Time to Repair)
Time to detect
Time to respond
Time to resolve
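
Read this way, each incident's repair time is the sum of its three phases, and MTTR is the mean of those sums. A minimal sketch, assuming the phase durations have already been measured (the function name and tuple layout are illustrative):

```python
from datetime import timedelta
from statistics import mean


def mttr(incidents):
    """Mean time to repair over (detect, respond, resolve) phase durations.

    `incidents` is a list of 3-tuples of timedelta, one tuple per incident.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    totals = [(detect + respond + resolve).total_seconds()
              for detect, respond, resolve in incidents]
    return timedelta(seconds=mean(totals))
```

Keeping the phases separate shows where the time goes: shortening any one of the three lowers the mean.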
Reliable notifications are essential for rapid
response
Use multiple notification methods for reliability.
Don’t rely on your own infrastructure for
alerting.
How to communicate with:
● The first responder
● Best people to resolve the incident
● The stakeholders
● Customers/users
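
Using multiple notification methods can be sketched as trying channels in order until one confirms delivery. The channels here are plain callables standing in for real push, SMS, or voice providers (hypothetical placeholders), which would ideally run outside the infrastructure being monitored.

```python
def notify(recipient, message, channels):
    """Try each (name, send) channel in order; return the name that delivered.

    `send(recipient, message)` is assumed to return True on confirmed delivery.
    Returns None if every channel fails.
    """
    for name, send in channels:
        try:
            if send(recipient, message):
                return name
        except Exception:
            # A broken channel must not block the remaining ones.
            continue
    return None
```

A `None` result is the signal to escalate to the next responder rather than keep retrying the same person.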
Effective communication with customers &
users is crucial
[Diagram: audiences for incident communications: developers, operations, PM, management, customer service, internal customers, sales, users, customers]
Easy access to all supporting data accelerates
resolution
Identify the relevant contextual data in
advance:
● Alerts
● Logs
● Configuration changes
● Metrics
● Dependencies
Make it easy to access or, better yet,
gather the data in advance.
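
Gathering the data in advance might look like running a set of collectors when the incident is created and attaching the results. This is a sketch only: the collector names and the callable interface are assumptions, and real collectors would query alerting, logging, and change-tracking systems.

```python
def gather_context(collectors):
    """Run each named collector and return its result, or an error note if it fails.

    `collectors` maps a label (e.g. "alerts", "logs") to a zero-argument callable.
    """
    context = {}
    for name, collect in collectors.items():
        try:
            context[name] = collect()
        except Exception as exc:
            # One unavailable source should not hide the others from responders.
            context[name] = f"unavailable: {exc}"
    return context
```

Responders then open the incident with alerts, logs, and recent changes already in hand instead of hunting for them under time pressure.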
Automate or at least document incident
resolution
● Investigative steps
● Remedial steps
● Runbooks
Measure, track and report for continuous
improvement
Metrics
● Detection time
● First response time
● Diagnosis time
● Resolution time
● Number of alerts
● Number of people
involved
Update response
procedures
● Classification, alerting,
escalation policies
● Runbooks
● Communications


Editor's Notes

  • #3: A growing number of applications and services will be required to be “always-on.” Because your consumers are always digitally connected, your business approach should change: you have to provide them with always-on services, especially in a fast-growing IT market full of innovative solutions. ITIL or ITSM alone is no longer enough, and Composable Incident Service Management, with its broad possibilities and customer-centric approach, is becoming more relevant.
  • #15: Time is precious, and fast, effective incident resolution is vital when you are running an IT business. Every millisecond counts, and end-user experience shows that timing is one of the most important aspects.
  • #17: Nowadays, people outside your company, such as your users, customers, and partners, or anyone who may be affected by an event such as a critical system outage or an anomaly in any component of their IT infrastructure, want to stay informed. To ensure better communication and transparency, notify them through whichever communication channel they prefer.
  • #21: Not long ago, all day-to-day IT infrastructure management, including the provisioning, capacity, performance, and availability of the computing, networking, and application environment, was handled by the operations team. IT Operations is responsible for the smooth functioning of the infrastructure and operational environments for internal and external customers, including the network infrastructure, server and device management, computer operations, ITIL management, etc.