Introduction to
Chaos Engineering
with Microsoft Azure
Ana Margarita Medina
Gremlin, Sr. Chaos Engineer
ana@gremlin.com
@ana_m_medina
Our businesses, health,
and safety rely on
applications and systems
that will fail
Complexity Keeps Increasing
Failures are inherent to
complex systems and will
cause downtime unless
tested for.
Application
Code
Dependencies
Configuration
Infrastructure
People and
Processes
Traditional testing only
covers a small portion
of the software stack
Traditional Testing
is not enough
anymore.
Cost = R + E + C + ( B + A )
During the Outage
R = Revenue Lost
E = Employee Productivity
After the Outage
C = Customer Chargebacks
(SLA Breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
The average company loses $300,000 per hour of downtime
Measuring the Cost of Downtime
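As a rough illustration of the formula on this slide, the quantifiable terms can be added up in a few lines of Python. This is only a sketch: the function name, the parameter names, and every sample figure are assumptions made for illustration. B and A are left out because the slide marks them as unquantifiable.

```python
def downtime_cost(revenue_lost: float,
                  employee_productivity: float,
                  customer_chargebacks: float) -> float:
    """Quantifiable part of Cost = R + E + C + (B + A)."""
    return revenue_lost + employee_productivity + customer_chargebacks


# Hypothetical two-hour outage; all figures below are made up.
hours = 2
cost = downtime_cost(
    revenue_lost=100_000 * hours,          # R, during the outage
    employee_productivity=20_000 * hours,  # E, during the outage
    customer_chargebacks=50_000,           # C, after the outage
)
print(f"Quantifiable cost: ${cost:,.0f}")
```

Brand defamation and employee attrition would sit on top of this number, so the result is best read as a lower bound.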
Chaos Engineering
Gremlin, Sr. Chaos Engineer
Developer since 2007
SRE/Chaos Engineer since 2016
Ana Margarita Medina
@ana_m_medina
Chaos Engineering
Chaos Engineering is the practice of
running thoughtful, planned
experiments designed to
reveal weaknesses in our
systems.
Experiment
Hypothesis
Blast Radius
Magnitude
Abort Conditions
Terminology
How to do Chaos Engineering?
1. Observe Your System
2. Baseline your metrics (set
SLOs/SLIs per service)
3. Form a Hypothesis with Abort
Conditions
4. Define Blast Radius
5. Run Experiment
6. Analyze Results
7. Expand Scope and Re-Test
8. Share Results
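The steps above can be sketched as a small Python harness. Everything named here (`Experiment`, `run_experiment`, the `inject`, `halt`, and `read_metrics` callables, and the sample numbers) is hypothetical and stands in for whatever tooling and monitoring a team already has. A real run would watch the abort conditions continuously; this sketch checks them once after injection to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    hypothesis: str
    blast_radius: List[str]                           # targets the experiment may touch
    magnitude: int                                    # e.g. milliseconds of added latency
    abort_conditions: List[Callable[[Dict], bool]]    # each returns True to abort


def run_experiment(exp: Experiment,
                   inject: Callable[[str, int], None],
                   halt: Callable[[], None],
                   read_metrics: Callable[[], Dict]) -> Dict:
    baseline = read_metrics()                   # step 2: baseline the metrics
    try:
        for target in exp.blast_radius:         # step 4: stay inside the blast radius
            inject(target, exp.magnitude)       # step 5: run the experiment
        observed = read_metrics()
        aborted = any(check(observed) for check in exp.abort_conditions)
    finally:
        halt()                                  # always roll the failure back
    return {"baseline": baseline, "observed": observed, "aborted": aborted}


# In-memory fakes so the sketch runs without touching a real system.
state = {"latency_ms": 0}

def fake_inject(target: str, magnitude: int) -> None:
    state["latency_ms"] += magnitude

def fake_halt() -> None:
    state["latency_ms"] = 0

def fake_metrics() -> Dict:
    return {"p95_ms": 120 + state["latency_ms"]}

exp = Experiment(
    hypothesis="p95 stays under 500 ms with 300 ms of added latency",
    blast_radius=["catalog-pod-1"],
    magnitude=300,
    abort_conditions=[lambda m: m["p95_ms"] > 500],
)
result = run_experiment(exp, fake_inject, fake_halt, fake_metrics)
print(result)
```

The returned dictionary is the input to step 6 (Analyze Results). Keeping the blast radius as an explicit list makes step 7 (Expand Scope and Re-Test) a matter of adding targets and running again.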
Gradually increase
experiment magnitude
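One way to picture a gradual increase is a schedule of magnitudes that grows in steps up to a ceiling. The `ramp` helper below is a hypothetical sketch, and the doubling factor and millisecond values are assumptions chosen only for illustration.

```python
from typing import Iterator


def ramp(start: int, limit: int, factor: int = 2) -> Iterator[int]:
    """Yield start, start*factor, ... and finish with limit itself."""
    if start <= 0 or factor < 2:
        raise ValueError("start must be positive and factor at least 2")
    value = start
    while value < limit:
        yield value
        value *= factor
    yield limit


print(list(ramp(100, 1000)))  # [100, 200, 400, 800, 1000]
```

Each value would be used for one experiment run, moving to the next only when the previous run confirmed the hypothesis.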
We want to, but...
Excuse:
We don’t have time.
Excuse:
We don’t need to break
things. They break on their
own!
Excuse:
We don’t know how to get started.
No more excuses!
We test proactively,
instead of waiting for an
outage.
How do we do Chaos
Engineering when working
on Azure?
AKS
Azure Kubernetes Service
Experiment 1:
Pod Restart/Replication
Hypothesis
If a Pod shuts down, K8s automatically restarts the
Pod or deploys a replica.
Procedure
Use a shutdown Gremlin to stop the Pod.
Observation
The Pod enters a shutdown state, and Kubernetes
immediately restarts it. The application returns to
a healthy state.
Conclusion
K8s can successfully recover from a Pod
failure without significant interruption to users.
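A minimal sketch of how the observation in Experiment 1 might be checked from outside the cluster, assuming the Pod list is captured as JSON before and after the attack (for example with `kubectl get pods -o json`). The helper names and the sample Pod name are hypothetical. The fields read here (`items`, `metadata.name`, `status.containerStatuses[].restartCount`) come from the Kubernetes Pod object.

```python
import json
from typing import Dict, List


def restart_counts(pod_list_json: str) -> Dict[str, int]:
    """Map each Pod name to the sum of its containers' restart counts."""
    counts = {}
    for pod in json.loads(pod_list_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(
            s.get("restartCount", 0) for s in statuses
        )
    return counts


def restarted_pods(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Pods whose restart count went up between the two snapshots."""
    return sorted(name for name, n in after.items() if n > before.get(name, 0))


def new_pods(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Pods that only exist in the second snapshot (replacement replicas)."""
    return sorted(set(after) - set(before))


# Hand-written snapshots standing in for real kubectl output.
snapshot_before = '{"items": [{"metadata": {"name": "web-1"}, "status": {"containerStatuses": [{"restartCount": 0}]}}]}'
snapshot_after = '{"items": [{"metadata": {"name": "web-1"}, "status": {"containerStatuses": [{"restartCount": 1}]}}]}'

before = restart_counts(snapshot_before)
after = restart_counts(snapshot_after)
print(restarted_pods(before, after), new_pods(before, after))
```

Either a higher restart count or a new Pod name would be consistent with the hypothesis that Kubernetes restarts the Pod or deploys a replica.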
Experiment 2:
Validating Autoscaling
Hypothesis
Using excessive CPU triggers the Cluster
Autoscaler to provision a new node.
Procedure
Use a CPU Gremlin to consume 50% of CPU on all
nodes for 2 minutes.
Observation
The Cluster Autoscaler provisions a new node
after one minute. Meanwhile, new Pods appear
in the Pending state.
Conclusion
Our autoscaling policy added a node and
avoided evicting Pods due to resource
exhaustion.
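The two observations in Experiment 2 (a new node, and Pods waiting in Pending) could be checked with a sketch like the one below. It assumes node and Pod lists have already been loaded as dictionaries, for instance from `kubectl get nodes -o json` and `kubectl get pods -o json`. The function names and the sample node and Pod names are hypothetical.

```python
from typing import Dict, List


def pending_pods(pod_list: Dict) -> List[str]:
    """Names of Pods whose status.phase is Pending."""
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Pending"
    ]


def nodes_added(nodes_before: Dict, nodes_after: Dict) -> int:
    """How many more nodes exist after the attack than before it."""
    return len(nodes_after.get("items", [])) - len(nodes_before.get("items", []))


# Hand-written sample data; the names are made up.
before = {"items": [{"metadata": {"name": "node-a"}}]}
after = {"items": [{"metadata": {"name": "node-a"}},
                   {"metadata": {"name": "node-b"}}]}
pods = {"items": [
    {"metadata": {"name": "web-1"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "web-2"}, "status": {"phase": "Pending"}},
]}

print(nodes_added(before, after), pending_pods(pods))  # 1 ['web-2']
```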
Experiment 3:
Latency and Networking
Hypothesis
Increasing the latency of a downstream Pod will
cause poor performance upstream.
Procedure
Use a latency Gremlin to add 300 ms of latency
to the product catalog Pod, then use
ApacheBench to time requests.
Observation
Requests are notably delayed.
Conclusion
We need to optimize network throughput or
develop a load balancing strategy.
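The slide times requests with ApacheBench. The same before-and-after comparison can be sketched in Python, assuming the request is wrapped in a callable. `time_calls` and `added_latency_ms` are hypothetical helpers, and the sample durations are made up to mirror the 300 ms attack.

```python
import statistics
import time
from typing import Callable, List


def time_calls(call: Callable[[], object], n: int) -> List[float]:
    """Run call() n times and return each duration in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


def added_latency_ms(baseline: List[float], degraded: List[float]) -> float:
    """Difference between the median durations of the two runs."""
    return statistics.median(degraded) - statistics.median(baseline)


# Made-up durations: one set before the latency attack, one set during it.
baseline_ms = [10, 12, 11]
degraded_ms = [310, 312, 311]
print(added_latency_ms(baseline_ms, degraded_ms))  # 300
```

In practice `time_calls` would be given a function that issues one request to the upstream service, run once before the attack and once while it is active.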
Experiment 4:
Storage Constraints
Hypothesis
Pods with defined storage limits do not exceed
those limits.
Procedure
Use a disk Gremlin to consume 90% of storage on
a Redis Pod.
Observation
Kubernetes identifies disk pressure and starts
evicting Pods.
Conclusion
We need to increase our storage limits, review
disk pressure thresholds, or set resource
limits/quotas.
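To confirm the disk pressure seen in Experiment 4, one option is to read the `DiskPressure` condition that Kubernetes reports on each node. The sketch below assumes the node list has been loaded as a dictionary (for example from `kubectl get nodes -o json`). The function name and the node names are hypothetical.

```python
from typing import Dict, List


def nodes_under_disk_pressure(node_list: Dict) -> List[str]:
    """Names of nodes whose DiskPressure condition has status "True"."""
    flagged = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            # Condition status is a string: "True", "False", or "Unknown".
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                flagged.append(node["metadata"]["name"])
    return flagged


# Hand-written sample data; the node names are made up.
nodes = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "False"}]}},
]}

print(nodes_under_disk_pressure(nodes))  # ['node-a']
```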
Many More Recommended
Scenarios
Demo
Latency - 1000ms
app:mhc-front
120 seconds
What would cause you to HALT
this experiment?
● User is unable to use site
● Data Loss
Demo Environment
Chaos Engineering
Automation
Why Automate?
- Prevent Regression
- Prevent Human Error
- Reduce time spent manually running tests
- Reduce Toil
Chaos Engineering Automation
- Status Checks
- Scheduling
- API Calls
- SDK Implementations
- CI/CD
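One reading of the status check item is a gate that confirms the system is healthy before an automated experiment starts. The sketch below is an assumption about how such a gate could look, not a description of any particular product's API. `status_checks_pass`, `run_if_healthy`, and the fake checks are all hypothetical.

```python
from typing import Callable, Iterable, List


def status_checks_pass(checks: Iterable[Callable[[], bool]]) -> bool:
    """True only if every check returns True; an exception counts as a failure."""
    for check in checks:
        try:
            if not check():
                return False
        except Exception:
            return False
    return True


def run_if_healthy(checks: Iterable[Callable[[], bool]],
                   start_experiment: Callable[[], None]) -> bool:
    """Start the experiment only when all status checks pass."""
    if not status_checks_pass(checks):
        return False
    start_experiment()
    return True


# Fakes standing in for real health probes and a real experiment trigger.
started: List[str] = []

def healthy() -> bool:
    return True

def unhealthy() -> bool:
    return False

ran = run_if_healthy([healthy], lambda: started.append("latency-attack"))
skipped = run_if_healthy([healthy, unhealthy], lambda: started.append("cpu-attack"))
print(ran, skipped, started)  # True False ['latency-attack']
```

In a CI/CD job, the boolean result could decide whether the pipeline stage proceeds or is skipped.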
Development Staging Production
Chaos @ Pipeline
Chaos @ Release
continuous delivery
observe
If needed, manual step
available
What does Chaos Engineering
Maturity look like?
- The end goal is automated Chaos
Engineering experiments run continuously
across all environments for all critical
services.
About Gremlin
The leading enterprise Chaos Engineering platform
Founded in 2015 by
Chaos Engineering
pioneers from Amazon
and Netflix.
Industry Leaders
Enterprise-grade
security, scale, control,
and support.
Built for Enterprise
Guided expertise
throughout the customer
journey.
Trusted Partner
Mission
Build a more reliable internet.
Trusted By Teams Worldwide
Why Gremlin
Simple: Intuitive interface and well-documented API
Safe: Safely halt and roll back any experiment
Secure: SOC II Certified; RBAC, MFA, SSO
Cloud Native: Runs in all cloud environments - AWS, Azure, GCP
Guided: Recommended attacks and scenarios
Breadth: Supports Linux, Windows, Kubernetes, and more
Failure happens.
Run Chaos Engineering
experiments on everything.
Failures are inherent to
complex systems and will
cause downtime unless
tested for.
Interested in hands-on
learning with Azure?
Chaos Engineering on
Azure
March 31, 2021
10AM PT • 1PM ET
gremlin.com/bootcamps
Interactive
Learning
@datadoghq | @gremlininc
Join the Chaos
Engineering Community:
gremlin.com/slack
gremlin.com/talk/chaos-azure
Free Stickers!
Chaos Engineering
Community
Ana Margarita Medina
Sr. Chaos Engineer, Gremlin
ana@gremlin.com
@ana_m_medina
Thank You!
Try Gremlin Free:
go.gremlin.com/ana
Free Stickers: gremlin.com/talk/chaos-azure
