Chaos Engineering
What is it and why should I care?
Matthew Brahms | Site Reliability Engineer | @matthewbrahms
For realz?
Oh yeah, really!
● 2010 - Netflix creates Chaos
Monkey, which can wreak havoc in
AWS at will by terminating
instances (fully
customizable/controllable);
it has been open source since 2012
● 2011 - Netflix creates the
Simian Army, a host of chaos
tools to test failure modes in
your infrastructure and
applications
● 2014 - The role of Chaos
Engineer is created at Netflix
What is it?
Chaos Engineering is the discipline
of experimenting on a distributed
system in order to build confidence
in the system’s capability to
withstand turbulent conditions in
production.
- http://principlesofchaos.org/
A working definition
Alternative definition
Bad things will happen to
your system, no matter
how well designed it is.
You cannot afford to be
ignorant of it.
Do you really wanna be that person?
What you should do
A fast primer on getting
started with Chaos Eng
(like 75 seconds short…)
It’s a discipline...
- Not a process
- Principled effort
- Each org/user is unique in
implementation and
philosophy
Start with a question
● Know your distributed system
well
● Whiteboard your entire system
with another person
● Find domains or services where
you think a failure case may
exist
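One way to turn the whiteboard into a first list of questions is to write the dependencies down and ask which services are hit when one of them fails. This is only a minimal sketch: the service names and the `DEPENDS_ON` map are hypothetical, and a real system would need its own map.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "frontend": ["api"],
    "api": ["auth", "database"],
    "auth": ["database"],
    "database": [],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, which services call it?
    callers = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    # Breadth-first walk up the caller graph starting from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    for service in DEPENDS_ON:
        print(service, "->", sorted(blast_radius(service, DEPENDS_ON)))
```

Services with a large blast radius are natural candidates for a first question, but they are also the ones where a careless experiment hurts the most.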
Be responsible...
Test Case 0
Hypothesis: “Deleting/attacking
our production database *may*
cause a service outage?”
Good examples...
“What happens if we were
to unexpectedly lose a
node in our cloud
provider?”
“What happens if CPU
utilization were to be at
100% on all cores of most
of our frontend servers?”
Test your ideas
You need two things to
proceed:
1. Means to test your
ideas
2. Means to measure
the outcome of your
experiment
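The two needs above (a way to test, a way to measure) can be sketched as one small loop: measure steady state, inject a fault, measure again, and always roll back. This is an illustration under stated assumptions, not how any particular chaos tool works. `ToyService`, `run_experiment`, and the tolerance value are all hypothetical stand-ins for a real system and real metrics.

```python
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    under_fault: float
    tolerance: float

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds if the metric stayed within tolerance of baseline.
        return abs(self.under_fault - self.baseline) <= self.tolerance


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
    samples: int = 20,
) -> ExperimentResult:
    """Measure steady state, inject a fault, measure again, always roll back."""
    baseline = statistics.mean(measure() for _ in range(samples))
    inject()
    try:
        under_fault = statistics.mean(measure() for _ in range(samples))
    finally:
        rollback()  # undo the fault even if measuring raises
    return ExperimentResult(baseline, under_fault, tolerance)


class ToyService:
    """Hypothetical stand-in for a real system: fewer nodes, higher latency."""

    def __init__(self) -> None:
        self.nodes = 3

    def latency_ms(self) -> float:
        return 300.0 / self.nodes

    def lose_node(self) -> None:
        self.nodes -= 1

    def restore_node(self) -> None:
        self.nodes += 1


if __name__ == "__main__":
    service = ToyService()
    result = run_experiment(
        measure=service.latency_ms,
        inject=service.lose_node,
        rollback=service.restore_node,
        tolerance=25.0,  # assumed: latency may drift by at most 25 ms
    )
    print(result, "hypothesis held:", result.hypothesis_held)
```

In a real experiment `measure` would read from your monitoring and `inject` would call your chaos tooling; the shape of the loop is the part worth keeping.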
You’ll need tools...
Chaos tooling is available!
Available tooling resources
- Chaos Toolkit (http://chaostoolkit.org/)
- Netflix (https://github.com/Netflix/SimianArmy)
- Gremlin (https://gremlin.com)
- Chaos Engineering Hands-On Bootcamp with k8s
(https://github.com/tammybutow/chaos_engineering_bootcamp)
You gotta see it...
Observability is a key tenet!
Pick a tool, any tool.
...the list goes on and on.
You should see something like this...
Here is a sample of
what a single-core
CPU attack against
a Kubernetes host
looks like…
#somuchfun
Cool bro, now what?
● Important to codify lessons
learned from failure
● If your attack fails, you still may
learn something!
● Repeating this process will refine
and improve your attacks
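Codifying lessons can be as simple as one structured record per experiment, appended to a log that the next run can read back. A minimal sketch, assuming a JSON-lines file; the `Lesson` fields and function names are hypothetical and should be adapted to whatever your team actually wants to remember.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Lesson:
    hypothesis: str        # what you expected to happen
    attack: str            # what you actually did
    outcome: str           # what you observed
    hypothesis_held: bool  # did the system behave as expected?
    follow_ups: list = field(default_factory=list)  # fixes or next attacks


def to_line(lesson: Lesson) -> str:
    """Serialize one lesson as a single JSON line."""
    return json.dumps(asdict(lesson))


def from_line(line: str) -> Lesson:
    """Rebuild a lesson from a line written by `to_line`."""
    return Lesson(**json.loads(line))


def append_to_log(lesson: Lesson, log_path: Path) -> None:
    """Append the lesson to a JSON-lines log so past runs stay searchable."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(to_line(lesson) + "\n")


if __name__ == "__main__":
    lesson = Lesson(
        hypothesis="Losing one node does not affect latency",
        attack="Shut down a single node",
        outcome="Latency rose until the node was replaced",
        hypothesis_held=False,
        follow_ups=["Add capacity headroom", "Repeat with two nodes"],
    )
    print(to_line(lesson))
```

Recording attacks where the hypothesis held is just as useful: it tells the next person which questions have already been answered.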
Wanna go next level?
Plan a gameday!
It is a dedicated team day focused
on using chaos engineering to
reveal weaknesses in your system.
You’ll need buy-in from all levels
of your org for this.
Level-up your resiliency!
Chaos Engineering
vs. DevOps
I mean seriously, should you
care?
Let’s check some boxes:
The Three Ways of DevOps:
✓ Systems Thinking
✓ Amplify Feedback Loops
✓ Culture of Continual Learning
and Experimentation
Where do I go from here?
Good news! There is a
community forming!
Chaos Engineering Slack
https://slofile.com/slack/chaosengineering
Austin Chaos Engineering
https://www.meetup.com/Austin-Chaos-Engineering-Meetup/
Thank you for your time!
Any questions, feel free to reach out to me!

Chaos Engineering Talk at DevOps Days Austin
