Incident Management Framework
Preparing for System Failure
Our Approach at Rentman
About Me
- Software Architect at Gapstars / Rentman
- ~15 years of experience, mistakes and learning
- Primarily APIs and Web tech
- I sporadically blog on randomcoding.com
- I tweet as @jomanlk
About Rentman
- Provides resource management and planning for the AV & Event industry
- Industry leader in rentals management for the events industry
- 10+ years in the events space
- Customers across 75 countries
- 70+ employees spread across NA, Europe and Sri Lanka
- Tech stack primarily on AWS
- Most services multi-region / multi-AZ
- Primarily running on top of AWS ECS
- Heavily use Atlassian products
Agenda
- Introduction ←
- Approach
- Learnings
Why Now, for Rentman?
- Increase our ‘bus factor’
- Reduce loss of institutional knowledge
- Increase active monitoring coverage
- Growing pains. Reduce stress & panic
What Is An Incident Response Plan?
- Well-defined framework to deal with incidents
- No ambiguity
- Clear command structure
- Refer to the Incident Command System (ICS)
“An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program.”
Goals: The 3 Cs
- Coordinate response effort.
- Communicate between incident responders, within the organization, and to the outside world.
- Maintain control over the incident response.
The Approach
Step 0
- Documentation
- Documentation
- Create a Playbook
- Set up teams & organizational support
- Tiered teams. T1, T2
Incident Response Phases
- Triage
- Coordinate
- Mitigate
- Resolve
- Learnings
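The five phases can be read as an ordered progression. A minimal Python sketch, assuming phases advance strictly in order (a real response may loop back, for example re-triaging when the severity changes); `Phase` and `next_phase` are illustrative names, not part of Rentman's tooling:

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Incident response phases, in the order shown on the slide."""

    TRIAGE = 1
    COORDINATE = 2
    MITIGATE = 3
    RESOLVE = 4
    LEARNINGS = 5


# Assumption: phases advance strictly in order, with no loops back.
ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after `current`, or None once Learnings is done."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


if __name__ == "__main__":
    phase: Optional[Phase] = Phase.TRIAGE
    while phase is not None:
        print(phase.name.title())
        phase = next_phase(phase)
```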
Common Terms
- IC: Incident Commander
- CL: Comms Lead
- LI: Lead Investigator
- DS: Domain Specialist
Triage
- What’s going on?
- How bad is it?
- Depends on
  - Monitoring
  - User reports
- P3, P2?
  - Not great, but it can wait
- P1
  - BIG problem
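The P1/P2/P3 split can be written down as an explicit rule, so triage gives the same answer regardless of who is on call. The sketch below is hypothetical: the `TriageInput` fields and the 25% threshold are assumptions chosen for illustration, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class TriageInput:
    # Hypothetical signals gathered from monitoring and user reports.
    core_feature_down: bool         # e.g. customers cannot use a core workflow
    affected_customer_ratio: float  # estimated share of customers affected, 0.0 to 1.0
    workaround_available: bool      # can support offer a workaround?


# Assumed threshold; tune it to your own customer base and SLAs.
P1_AFFECTED_RATIO = 0.25


def classify(signal: TriageInput) -> str:
    """Map triage signals to a priority label."""
    if signal.core_feature_down or signal.affected_customer_ratio >= P1_AFFECTED_RATIO:
        return "P1"  # BIG problem: start a full incident response
    if signal.affected_customer_ratio > 0 and not signal.workaround_available:
        return "P2"  # not great, but it can wait
    return "P3"      # minor impact, or a workaround exists
```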
Coordinate
- Use tooling
  - Scheduling
  - Alerting
- Who needs to be involved?
  - Small incident?
  - Big incident?
- Who’s available?
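Deciding who to involve can likewise be captured as a rule over the on-call roster. A sketch under assumed inputs: the `Responder` fields, the mapping of T1/T2 onto a `tier` number, and the policy of giving command to the highest available tier are all assumptions; the role names follow the Common Terms slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Responder:
    name: str
    tier: int                # assumption: 1 = T1 (first line), 2 = T2 (escalation)
    available: bool          # e.g. derived from the on-call schedule
    domains: Set[str] = field(default_factory=set)


def assign_roles(priority: str, domain: str, roster: List[Responder]) -> Dict[str, str]:
    """Assign IC / LI / CL / DS roles for an incident (illustrative policy)."""
    available = [r for r in roster if r.available]
    if not available:
        raise RuntimeError("No one available on the roster; escalate manually")

    # Lower tiers first, so T1 responders take the hands-on investigation.
    by_tier = sorted(available, key=lambda r: r.tier)

    if priority != "P1":
        # Small incident: one responder investigates and coordinates.
        lead = by_tier[0].name
        return {"IC": lead, "LI": lead}

    # Big incident: the highest available tier takes command.
    ic = max(available, key=lambda r: r.tier)
    others = [r for r in by_tier if r is not ic] or [ic]
    roles = {"IC": ic.name, "LI": others[0].name}
    # Comms lead is a separate person if possible; otherwise the IC covers comms.
    roles["CL"] = others[1].name if len(others) > 1 else ic.name
    # Domain specialist: anyone available who knows the affected area.
    specialist = next((r for r in available if domain in r.domains), None)
    if specialist is not None:
        roles["DS"] = specialist.name
    return roles
```

For a P2 or P3 a single T1 responder holds both IC and LI; for a P1 the roles are spread across people, and a Domain Specialist is added when someone available knows the affected area.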
Mitigate
- STOP THE BLEED!
- Goal ≠ finding and fixing the issue
- Goal = getting things working
- Collaborate
- Keep it DRY
- Keep it documented
[Screenshot of responder chat messages: "Reviewing recent releases", "Disabling demo creation", "Support is asking me for an update, do we have anything?", "Joining the incident response!", "Where are we at?"]
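Disabling demo creation is an example of stopping the bleed without fixing the underlying bug. One common way to make that possible is a kill switch checked at the feature's entry point. This is a minimal in-memory sketch; `KillSwitches` and `create_demo_account` are hypothetical names, and a production setup would keep the flags in shared configuration so every instance sees the change.

```python
import threading
from typing import Dict, Optional


class KillSwitches:
    """Hypothetical in-memory flag store (a real one would be shared config)."""

    def __init__(self) -> None:
        self._disabled: Dict[str, str] = {}
        self._lock = threading.Lock()

    def disable(self, feature: str, reason: str) -> None:
        with self._lock:
            self._disabled[feature] = reason

    def enable(self, feature: str) -> None:
        with self._lock:
            self._disabled.pop(feature, None)

    def is_enabled(self, feature: str) -> bool:
        with self._lock:
            return feature not in self._disabled

    def reason(self, feature: str) -> Optional[str]:
        """Why a feature was disabled, useful when support asks for an update."""
        with self._lock:
            return self._disabled.get(feature)


switches = KillSwitches()


def create_demo_account(name: str) -> str:
    # Guard the feature so responders can turn it off without a deploy.
    if not switches.is_enabled("demo_creation"):
        return "Demo creation is temporarily unavailable"
    return f"Demo account created for {name}"
```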
Resolve
- Make sure the root cause is addressed
- This could be days or sometimes weeks after the incident
[Screenshot of responder updates: "Creating hotfix branch", "Added extra logs for this specific issue"]
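Extra logs for a specific issue are easier to act on when every new line carries the incident reference, so they can be searched for later and removed once the fix is verified. A sketch using the standard `logging` module's `LoggerAdapter`; the logger name and the `INC-42`-style identifier are hypothetical.

```python
import logging
from typing import Optional, TextIO


def incident_logger(
    name: str, incident_id: str, stream: Optional[TextIO] = None
) -> logging.LoggerAdapter:
    """Return a logger adapter that tags every record with an incident id.

    Use the dedicated logger only through the returned adapter: its formatter
    expects the `incident` attribute that the adapter injects.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # stream=None means stderr
        handler.setFormatter(
            logging.Formatter("%(levelname)s [%(incident)s] %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep these lines out of the root handlers
    return logging.LoggerAdapter(logger, {"incident": incident_id})


if __name__ == "__main__":
    # Hypothetical logger name and incident id.
    log = incident_logger("demo.creation", "INC-42")
    log.info("starting demo creation account_id=%s", 1234)
```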
Follow Up
- Document the JIRA issue timeline
- Psychological safety
  - Learn from the experience
  - Failure is in the process, not the individual
  - Blame-free / owned by the team
- Review the process
  - What went well / not well?
  - What was missing?
[Screenshot of follow-up items: "Improvements to process", "Additional logging added", "Learnings", "Create RCA"]
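Documenting the timeline is mostly a matter of collecting timestamped notes during the incident and ordering them afterwards. A sketch that renders such notes as plain text, with elapsed minutes since the first entry, ready to paste into a JIRA issue; the `TimelineEntry` type and the output format are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    at: datetime
    note: str


def render_timeline(entries: List[TimelineEntry]) -> str:
    """Order notes chronologically; show elapsed minutes since the first one."""
    if not entries:
        return ""
    ordered = sorted(entries, key=lambda e: e.at)
    start = ordered[0].at
    lines = []
    for entry in ordered:
        elapsed = int((entry.at - start).total_seconds() // 60)
        lines.append(f"{entry.at:%Y-%m-%d %H:%M} (+{elapsed} min) {entry.note}")
    return "\n".join(lines)
```

Notes can be captured as they are posted during the incident (for example "Disabling demo creation") and rendered once for the follow-up, so the review works from the same record everyone saw.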
The Learnings
Learnings
- Leverage existing workflows / tools
- Practice. Practice. Practice.
  - Breakathons
  - Simulations
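Breakathons and simulations need failures to practise on. One way to create them in a staging environment is a fault-injection wrapper around a dependency call. A minimal sketch; the `inject_faults` decorator, its parameters, the choice of `ConnectionError` and the `fetch_equipment_availability` function are all assumptions, and `enabled` is meant to stay `False` outside drills.

```python
import functools
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def inject_faults(
    rate: float, enabled: bool, rng: Optional[random.Random] = None
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Make the wrapped call raise ConnectionError with probability `rate`.

    Only active while `enabled` is True; intended for staging drills.
    """
    source = rng if rng is not None else random.Random()

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            if enabled and source.random() < rate:
                raise ConnectionError(f"Injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical dependency call exercised during a breakathon.
@inject_faults(rate=0.3, enabled=True, rng=random.Random(7))
def fetch_equipment_availability(item_id: int) -> str:
    return f"item {item_id}: available"
```

A seeded `random.Random` makes a drill repeatable, so the same sequence of failures can be replayed when comparing how different teams respond.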
Learnings Continued
- Plan. Do. Review. Improve.
- Incorporate Organizational Requirements Early
  - Compensation for on-call
  - Uptime guarantees
  - SLA with customers
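Uptime guarantees translate directly into a downtime budget, which helps when agreeing SLAs and on-call expectations. For example, 99.9% over a 30-day month allows 30 × 24 × 60 × 0.001 = 43.2 minutes of downtime. A small helper, assuming a fixed-length period with no exclusions for scheduled maintenance:

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime allowed by an uptime target over `period_days`."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} min per 30 days")
```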
Fin.
- Questions: Stay tuned for the panel discussion
- Want to reach out?
  - @jomanlk on Twitter
  - linkedin.com/in/jnxpereira on LinkedIn
  - john@jnx.me on Email
