@paulcarletonjr
How our security requirements
turned us into
accidental chaos engineers
“We don’t make mistakes, we have
happy accidents”
- Bob Ross
Hello!
I’m Paul Carleton
@ Stripe
(we’re hiring!)
Topics / Spoilers
1. Old Instances are bad
2. Enter Lifespan Management
3. Some stories of how things broke
4. What we learned
1.
Instance Age
What is it and why do I care?
Terminology
▷ Instance: Cloud Hosted VM (EC2)
▷ Age: Time since launch
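The definition of age above can be sketched in a few lines. This is a minimal illustration, not the tooling from the talk: the function names, the 30-day bucket, and the sample launch times are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def instance_age(launch_time, now):
    """Age is the time elapsed since the instance launched."""
    return now - launch_time


def age_histogram(launch_times, now, bucket=timedelta(days=30)):
    """Count hosts per age bucket; bucket 0 holds the youngest hosts."""
    counts = {}
    for launched in launch_times:
        index = instance_age(launched, now) // bucket
        counts[index] = counts.get(index, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical fleet: launch times are invented for illustration.
now = datetime(2019, 6, 1, tzinfo=timezone.utc)
launches = [
    now - timedelta(days=2),
    now - timedelta(days=45),
    now - timedelta(days=50),
    now - timedelta(days=200),
]
histogram = age_histogram(launches, now)
print(histogram)
```

In a real fleet the launch times would come from the cloud provider's inventory rather than a hard-coded list.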
[Figure sequence: histogram of # of hosts by instance age, young to old. Panel labels: Just launched; Weeks later; Months later; Terminate & Replace; New hadoop cluster; Big migration]
What’s wrong with old instances?
Replacement is like a fire extinguisher...
… that might catch on fire
[Figure: # of hosts by instance age, young to old, with "Last replacement" and "Breaking changes" marked]
Will replacing a [image] with a [image] work?
[Figure sequence: # of hosts by instance age, young to old. Panel label: CVE Patch Released]
Old Instances are bad
2.
Lifespan Management
Components
▷ ASG
▷ Terminator
▷ Lifespan Manager
What is an auto-scaling group?
✨ ASG ✨
What is a terminator?
[Diagram: AWS and the Terminator, with four numbered steps. Labels: Terminate, Wait, Shave yaks, Shutdown]
Lifespan Manager
[Diagram: the Lifespan Manager (waiting) alongside an ASG]
Terminate First vs. Launch First
[Figure: ASG size over time relative to steady state, for each strategy]
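The difference between the two orderings can be sketched as a tiny simulation of group size. This is an illustrative model only; `replace_instance` and the steady-state size of 3 are assumptions, not anything shown in the slides.

```python
def replace_instance(steady_state, launch_first):
    """Return the ASG size at each step while one instance is replaced."""
    size = steady_state
    sizes = [size]
    if launch_first:
        steps = ("launch", "terminate")
    else:
        steps = ("terminate", "launch")
    for step in steps:
        size += 1 if step == "launch" else -1
        sizes.append(size)
    return sizes


# Hypothetical group with a steady state of 3 instances.
terminate_first = replace_instance(steady_state=3, launch_first=False)
launch_first = replace_instance(steady_state=3, launch_first=True)
print("terminate first:", terminate_first)
print("launch first:   ", launch_first)
```

In this model, terminate first dips below steady state during the replacement, while launch first temporarily runs one instance above it.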
“Now what?”
Rollout Plan
Breaking the problem up with labels
▷ Stateless: Safe to replace
▷ Stateful Automated: Replaceable with some graceful state hand-off
▷ Requires Operator: Not safe to replace automatically. Want someone watching
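One way to picture the three labels is as an enum that drives what the automation is allowed to do. This is a sketch under assumptions: the enum values and the wording of each plan are hypothetical, chosen to mirror the slide text rather than any real configuration.

```python
from enum import Enum


class ReplacementLabel(Enum):
    # Hypothetical label values; only the three categories come from the slides.
    STATELESS = "stateless"
    STATEFUL_AUTOMATED = "stateful-automated"
    REQUIRES_OPERATOR = "requires-operator"


def replacement_plan(label):
    """Map a label to what automated lifespan management may do."""
    if label is ReplacementLabel.STATELESS:
        return "terminate automatically"
    if label is ReplacementLabel.STATEFUL_AUTOMATED:
        return "hand off state, then terminate automatically"
    return "wait for an operator"


for label in ReplacementLabel:
    print(label.value, "->", replacement_plan(label))
```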
Automated termination
What could go wrong?
A Year Long Journey
5 Chaotic Discoveries
3.1
How NOT to health check
“The lifespan manager terminated
all the LDAP servers.
We’re locked out of QA.”
Don’t we check for this?
How did this happen?
[Diagram sequence: the question "What’s your health?" is asked while the LDAP hosts are in Maintenance]
Everything’s green!
Let’s terminate!
0 unhealthy != healthy
How NOT to health check
▷ Pick good defaults
▷ Use pre-shared knowledge to
verify health
Explicit Expectations
[Diagram sequence: the question becomes "What’s your LDAP health?" while the LDAP hosts are in Maintenance]
No LDAP?
I better wait.
3.2
RIP Kubernetes Workers
“The Kubernetes workers are going
down… HARD!”
Terminator Recap
[Diagram: AWS and the Terminator, with four numbered steps. Labels: Terminate, Wait, Shave yaks, Shutdown]
[Diagram sequence: repeated Terminate calls]
RIP Kubernetes Workers
● Track feature usage
● Make the chaos easy to turn off
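Both bullets can be sketched together as a small control object. This is one hypothetical reading of the lessons: the class, method names, and feature strings are invented for illustration and do not describe the real tooling.

```python
from collections import Counter


class ChaosControls:
    """Hypothetical switchboard: a global off switch plus usage counters."""

    def __init__(self):
        self.enabled = True
        self.feature_usage = Counter()

    def disable(self):
        """The easy way to turn the chaos off."""
        self.enabled = False

    def request_termination(self, instance_id, feature):
        # Record which feature was exercised, even when terminations are off.
        self.feature_usage[feature] += 1
        if not self.enabled:
            return f"skipped {instance_id}: terminations are disabled"
        return f"terminating {instance_id}"


controls = ChaosControls()
first = controls.request_termination("i-aaa", feature="graceful-drain")
controls.disable()
second = controls.request_termination("i-bbb", feature="graceful-drain")
print(first)
print(second)
print(dict(controls.feature_usage))
```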
3.3
Blackhole Scenario
Terminator Recap
[Diagram: AWS and the Terminator, with four numbered steps. Labels: Terminate, Wait, Shave yaks, Shutdown]
Heartbeat Options
▷ Delay termination
▷ Proceed with termination
… but no Cancel
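The consequence of having no Cancel can be sketched as a decision function with only two outcomes. This is a simplified model under assumptions: `HookAction`, `choose_action`, and the wait limit are hypothetical names, though the shape is consistent with how EC2 Auto Scaling termination lifecycle hooks behave, where a heartbeat only extends the wait.

```python
from enum import Enum


class HookAction(Enum):
    DELAY = "delay termination"
    PROCEED = "proceed with termination"
    # Deliberately no CANCEL member: once termination starts it cannot be undone.


def choose_action(tasks_finished, seconds_waited, max_wait_seconds):
    """Pick between the only two options the waiting instance has."""
    if tasks_finished or seconds_waited >= max_wait_seconds:
        return HookAction.PROCEED
    return HookAction.DELAY


# Hypothetical timings: pre-termination tasks never finish.
early = choose_action(tasks_finished=False, seconds_waited=10, max_wait_seconds=60)
late = choose_action(tasks_finished=False, seconds_waited=60, max_wait_seconds=60)
print(early.value)
print(late.value)
```

Even when the pre-termination tasks never succeed, the only thing delaying buys is time; the instance is terminated in the end.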
Blackhole Scenario
[Diagram: the Terminator flow (Terminate, Wait, Shave yaks, Shutdown) with only three numbered steps]
● Non-zero exit
● Timeouts
● Rate limits
“The terminations will continue until
morale improves!”
Two Touch Termination
[Diagram: two Terminator flows. One shaves yaks (steps 1 to 3); the other runs Terminate through AWS, answers "Already done!", and ends in Shutdown (steps 1 to 4)]
The Blackhole Scenario
● Align incentives
● Systems vary, so adapt to match!
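One possible reading of two touch termination is sketched below: the first touch does the pre-termination work while the instance is still safe to leave running, and only a second touch asks for termination. The function, its parameters, and the failure message are assumptions for illustration; the slides do not show the real implementation.

```python
def two_touch_terminate(instance_id, shave_yaks, terminate, prepared):
    """prepared: set of instance ids whose pre-termination work already succeeded."""
    if instance_id in prepared:
        # Second touch: the yaks are already shaved, so terminating is safe.
        terminate(instance_id)
        return "terminated"
    try:
        # First touch: a failure here leaves the instance running and untouched.
        shave_yaks(instance_id)
    except Exception as error:
        return f"left running: {error}"
    prepared.add(instance_id)
    return "prepared"


def failing_drain(_instance_id):
    # Stand-in for a non-zero exit, timeout, or rate limit.
    raise TimeoutError("drain timed out")


terminated = []
prepared = set()
attempt_1 = two_touch_terminate("i-1", failing_drain, terminated.append, prepared)
attempt_2 = two_touch_terminate("i-1", lambda _id: None, terminated.append, prepared)
attempt_3 = two_touch_terminate("i-1", lambda _id: None, terminated.append, prepared)
print(attempt_1, "|", attempt_2, "|", attempt_3)
```

Because the risky work happens before the point of no return, a failure keeps the instance alive instead of terminating it anyway.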
3.4
Self-Service Meltdown
A Year Long Journey
Enhance!
“If you would like to never think
about a kernel upgrade again,
consider Lifespan Management!”
Part 1:
Turning it On
I want to enable lifespan management!
Great! Here are some docs!
Part 2:
Who does what?
What happens during termination?
Let me tell you!
yadda yadda yadda yadda yadda yadda
What part of that is relevant to me?
… Great question!
Part 3:
False Alarms
Did lifespan management just break my thing?
Let me check!
aws
5 minutes later...
Nope!
Okay… what did break my thing?
How we changed
Part 1: Turning it On
Part 2: Who does what?
Part 3: False Alarm
aws
go/whydead/$instance_id
Self-service Meltdown
▷ Make it easy to adopt safely
▷ Explicitly state the contract
▷ Make it easy to rule chaos out
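Ruling chaos out quickly amounts to a lookup keyed by instance id. The sketch below is purely hypothetical: how go/whydead works is not described in the slides, so the log structure, field names, and messages here are invented to show the idea.

```python
def why_dead(instance_id, termination_log):
    """Answer "did lifespan management kill this?" from a recorded log."""
    record = termination_log.get(instance_id)
    if record is None:
        return f"{instance_id}: no termination recorded by lifespan management"
    return f"{instance_id}: terminated by {record['actor']} ({record['reason']})"


# Hypothetical log entries, keyed by instance id.
termination_log = {
    "i-0abc": {"actor": "lifespan-manager", "reason": "exceeded maximum age"},
}

print(why_dead("i-0abc", termination_log))
print(why_dead("i-0def", termination_log))
```

A single lookup like this replaces the five minutes of digging through the cloud console shown earlier.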
3.5
Death by a thousand
JIRA tickets
Something’s wrong, I can’t terminate anything
These warnings should be tickets!
Great!
Death by a thousand JIRA tickets
● File against ourselves first, then
automate
● 1% case matters more with 10x
terminations
● Measure Quantity and Reliability
of tickets
4.
Calling it Done
The End… for now
5.
Summary and Closing
Takeaway
● What automation problems can you solve with a little chaos?
● Do you know how old your instances are?
Thank you!
Credits
● Photo by rawpixel on Unsplash
● Photo by Jens Lelie on Unsplash
● Photo by JohnsonMartin https://pixabay.com/en/wormhole-space-time-light-tunnel-739872/
