Paging/Alerting
Workshop
#sre-office-hours | 04.04.2018
What qualifies as an
incident?
What should happen in
an incident?
Who should be
alerted/paged?
What tools would we
use to gather data?
How reliable/accurate
would this information be?
What’s the acceptable time
to be alerted/respond?
An Initial Framework for
Discussing SEVs
What is a SEV?
SEV is a term used to refer to an incident; it is derived from the
word "severity".
Common types of SEV
- Availability Drop
- Product Issue / Feature Broken
- Data Loss
- Security Risk
- etc.
SEV Levels
Any SEV which involves a
loss of customer data should
be classified as SEV0.
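A minimal sketch of how the data-loss rule above could be encoded. Only the "customer data loss means SEV0" rule comes from the slides; the `Incident` fields, the other levels, and every threshold below are hypothetical placeholders for the team to replace.

```python
from dataclasses import dataclass

# Hypothetical placeholders: (minimum % of requests failing, SEV level).
# These numbers are NOT from the workshop; they only make the sketch runnable.
AVAILABILITY_THRESHOLDS = [(50.0, 1), (10.0, 2), (0.0, 3)]


@dataclass
class Incident:
    customer_data_lost: bool
    failing_request_percent: float


def classify(incident: Incident) -> int:
    """Return a SEV level, where 0 is the most severe."""
    # Rule from the slides: any loss of customer data is SEV0.
    if incident.customer_data_lost:
        return 0
    # Everything else falls through the hypothetical availability thresholds.
    for minimum, level in AVAILABILITY_THRESHOLDS:
        if incident.failing_request_percent >= minimum:
            return level
    return 3


print(classify(Incident(customer_data_lost=True, failing_request_percent=0.0)))
```

The data-loss check runs first on purpose, so a small availability impact can never downgrade an incident that lost customer data.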
Calculate Critical
Uptime in 9’s
https://uptime.is/
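To make the "9's" concrete, here is a small sketch that converts an uptime percentage into the downtime it allows. The function name and the 365-day period are assumptions chosen for illustration, not part of the workshop.

```python
def allowed_downtime_seconds(uptime_percent: float, period_days: float = 365.0) -> float:
    """Seconds of downtime permitted in the period at the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - uptime_percent / 100.0)


# Print the yearly downtime budget for two to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per 365-day year")
```

Each extra nine cuts the budget by a factor of ten, which is worth keeping in mind when choosing the number to publish.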
What uptime do you
think we could safely
publish?
Would we be okay
telling our
customers/vendors that
number?
SEV Terminology
Lifecycle of a SEV event
How to measure a SEV?
% loss * outage duration
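A sketch of the formula above, assuming "% loss" means the percentage of requests (or customers) affected and that duration is measured in minutes. The function name and the "full-outage minutes" unit are illustrative assumptions.

```python
def sev_impact(loss_percent: float, outage_minutes: float) -> float:
    """Impact = fraction lost * outage duration, in equivalent full-outage minutes."""
    if not 0.0 <= loss_percent <= 100.0:
        raise ValueError("loss_percent must be between 0 and 100")
    if outage_minutes < 0:
        raise ValueError("outage_minutes must not be negative")
    return (loss_percent / 100.0) * outage_minutes


# A 25% loss for 40 minutes scores the same as a total outage for 10 minutes.
print(sev_impact(25, 40))
print(sev_impact(100, 10))
```

Expressing impact in one unit makes a short total outage and a long partial one directly comparable when tracking SEVs over time.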
Visualizing/tracking SEVs
How do we maintain
combat effectiveness
during a SEV?
Incident Manager On-Call (IMOC)
- Should be a small rotation of Engineering Leaders
- Only one person is on-call in this role at any point in time
- These people should possess a wide knowledge of services and
engineering teams
- Will be our version of Air-Traffic-Control for the SEV, ensuring
different people working on the SEV are organized and working
coherently as a unit!
Tech Lead On-Call (TLOC)
- This would be the engineer driving resolution of the SEV
- Should have deep knowledge of a specific domain;
be a SME (Subject Matter Expert)
- Should have a deep knowledge of upstream and downstream
dependencies
What we need to define to have these roles:
- IMOC runbook/guide
- Designate a Primary and Secondary IMOC at all times
- Escalation should be automatic
- Monthly sync for all IMOC and TLOC
- Way to quickly triage what systems are affected/find root cause
- How would we do this?
- How do we record / document SEVs?
- Google Form? Git repo? Suggestions??
- SEV naming convention
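The "Primary and Secondary IMOC" and "escalation should be automatic" items above could be sketched as follows. This is not PagerDuty's API; the class, the names, and the 5-minute acknowledgement timeout are hypothetical and only illustrate the decision logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationPolicy:
    primary: str
    secondary: str
    ack_timeout_minutes: int = 5  # hypothetical value, to be agreed by the team

    def who_to_page(self, minutes_since_page: float, acknowledged: bool) -> Optional[str]:
        """Return who should be paged right now, or None if the page was acknowledged."""
        if acknowledged:
            return None
        # Inside the timeout the primary IMOC still owns the page.
        if minutes_since_page < self.ack_timeout_minutes:
            return self.primary
        # No acknowledgement in time: escalate automatically to the secondary.
        return self.secondary


policy = EscalationPolicy(primary="imoc-primary", secondary="imoc-secondary")
print(policy.who_to_page(minutes_since_page=2, acknowledged=False))
print(policy.who_to_page(minutes_since_page=7, acknowledged=False))
```

Keeping the rule as plain data (two names and a timeout) makes it easy to review in the monthly IMOC/TLOC sync.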
What happens when we
don’t meet our uptime
requirements?!?
What causes SEVs?
Technical Issues
● Dependency Failure
● Cloud Provider Region/Zone Failure
● Provider Failure
● Connectivity Issues
● Power issues (our local office power affects AWS RDS!)
● DNS outage/latency
● Misconfiguration of machines/docker images
● Software Bugs
● Corrupt/unavailable backups
Cultural Issues
● Lack of knowledge sharing
● Lack of knowledge handover
● Lack of on-call training
● Lack of chaos engineering
● Lack of a high severity incident management program
● Lack of documentation and playbooks
● Lack of alerts and pages
● Lack of effective alerting thresholds
● Lack of backup strategy
How do we prevent SEVs from repeating?
● Combination of:
○ Record outages
○ Correlate failures
○ Track SEVs
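One way to combine recording, correlating, and tracking is to tally recorded SEVs by cause and surface the ones that repeat. The log entries below are made-up sample data, and the field names and `minimum` parameter are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Made-up sample records; a real log would come from wherever SEVs are documented.
sev_log: List[Dict[str, str]] = [
    {"id": "example-1", "cause": "DNS outage/latency"},
    {"id": "example-2", "cause": "Software Bugs"},
    {"id": "example-3", "cause": "DNS outage/latency"},
    {"id": "example-4", "cause": "Dependency Failure"},
]


def recurring_causes(log: List[Dict[str, str]], minimum: int = 2) -> List[Tuple[str, int]]:
    """Return (cause, count) pairs seen at least `minimum` times, most frequent first."""
    counts = Counter(entry["cause"] for entry in log)
    return [(cause, n) for cause, n in counts.most_common() if n >= minimum]


print(recurring_causes(sev_log))
```

A cause that shows up more than once is a natural candidate for a playbook, an alert, or a chaos experiment.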
Chaos Engineering!
What if we could break
things safely!?
What lessons/data could
we gather?
Chaos Engineering...yes, it is a real thing!
● 2010 - Netflix created Chaos Monkey, which randomly terminates
instances in AWS (fully customizable/controllable); it has been
open source since 2012
● 2011 - Netflix created the Simian Army, a host of chaos tools to
test failure modes in your infrastructure and applications
● 2014 - The role of Chaos Engineer was created at Netflix
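The core idea behind a Chaos Monkey style tool can be sketched as a dry run: pick one eligible instance at random and report it, without touching any real infrastructure. This is not Netflix's implementation; the instance names, the `protected` opt-out set, and the function name are hypothetical.

```python
import random
from typing import Optional, Sequence, Set


def pick_victim(instances: Sequence[str], protected: Set[str], rng: random.Random) -> Optional[str]:
    """Choose one instance to terminate, skipping anything that has opted out."""
    eligible = [name for name in instances if name not in protected]
    if not eligible:
        return None
    return rng.choice(eligible)


# Hypothetical fleet; "db-1" is opted out so the experiment stays controlled.
fleet = ["web-1", "web-2", "worker-1", "db-1"]
victim = pick_victim(fleet, protected={"db-1"}, rng=random.Random())

# Dry run only: print the decision instead of calling a cloud API.
print(f"[dry run] would terminate: {victim}")
```

Starting with a dry run and an explicit opt-out list is one way to "break things safely" while the team builds confidence.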
Principles of Chaos Engineering
Can we do this?
Paging, Alerting, Chaos Eng Overview
And we all get to “share
the pain” with our new
tool PagerDuty...
Credits
- https://www.gremlin.com/community/tutorials/how-to-establish-a-high-severity-incident-management-program/
- https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
- https://www.gremlin.com/the-discipline-of-chaos-engineering/
- https://github.com/tammybutow/chaos_engineering_bootcamp
- https://www.usenix.org/conference/srecon17americas/program/presentation/andrus
They did a cool workshop about Chaos Engineering
with hands-on labs at SREcon this year. If you like this
notion of chaos, more of us should go next year!
#notjustforSRE
Questions/thoughts?
#sre-office-hours
