Introduction to Chaos Engineering with Microsoft Azure

Ana Margarita Medina
Gremlin, Sr. Chaos Engineer
ana@gremlin.com
@ana_m_medina

Our businesses, health,
and safety rely on
applications and systems
that will fail

Failures are inherent to
complex systems and will
cause downtime unless
tested for.

Application
Code
Dependencies
Configuration
Infrastructure
People and Processes
Traditional testing only covers a small portion of the software stack.
Traditional testing is not enough anymore.

Measuring the Cost of Downtime
Cost = R + E + C + ( B + A )
During the Outage
R = Revenue Lost
E = Employee Productivity
After the Outage
C = Customer Chargebacks (SLA Breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
The average company loses $300,000 per hour of downtime
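
The formula above can be turned into a quick estimate. The following is a minimal sketch in Python, not a costing model from the talk: every figure is a made-up placeholder, and B and A are left out because the slide lists them as unquantifiable.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Quantifiable terms of Cost = R + E + C + (B + A).

    B (brand defamation) and A (employee attrition) are omitted because
    the slide treats them as unquantifiable.
    """
    revenue_lost_per_hour: float        # R, during the outage
    productivity_lost_per_hour: float   # E, during the outage
    customer_chargebacks: float         # C, after the outage (SLA breaches)

    def total(self, outage_hours: float) -> float:
        # R and E accrue for as long as the outage lasts; C is a one-off.
        during = (self.revenue_lost_per_hour
                  + self.productivity_lost_per_hour) * outage_hours
        return during + self.customer_chargebacks


# Hypothetical numbers, for illustration only.
example = OutageCost(
    revenue_lost_per_hour=200_000,
    productivity_lost_per_hour=50_000,
    customer_chargebacks=100_000,
)
print(example.total(outage_hours=2))  # 600000
```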

Chaos Engineering

Gremlin, Sr. Chaos Engineer
Developer since 2007
SRE/Chaos Engineer since 2016

Chaos Engineering is
thoughtful, planned
experiments designed to
reveal weakness in our
systems.

Terminology
Experiment
Hypothesis
Blast Radius
Magnitude
Abort Conditions
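
One way to keep these terms together is to record them in a single structure before an experiment runs. The sketch below is only an illustration: the class name, field names, and example values are assumptions, not Gremlin's data model. It reads blast radius as "which targets are affected" and magnitude as "how intense the injected failure is".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ExperimentPlan:
    hypothesis: str               # what we expect to happen
    blast_radius: List[str]       # which targets are affected
    magnitude: str                # how intense the injected failure is
    abort_conditions: List[str] = field(default_factory=list)  # when to halt

    def is_ready(self) -> bool:
        """A plan is runnable only when every term has been filled in."""
        return bool(self.hypothesis and self.blast_radius
                    and self.magnitude and self.abort_conditions)


plan = ExperimentPlan(
    hypothesis="If a Pod shuts down, Kubernetes restarts it or deploys a replica.",
    blast_radius=["one Pod"],     # hypothetical target
    magnitude="shutdown",
    abort_conditions=["User is unable to use site", "Data Loss"],
)
print(plan.is_ready())  # True
```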

How to do Chaos Engineering?
1. Observe Your System
2. Baseline your metrics (set
SLOs/SLIs per service)
3. Form a Hypothesis with Abort
Conditions
4. Define Blast Radius
5. Run Experiment
6. Analyze Results
7. Expand Scope and Re-Test
8. Share Results
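
The eight steps above can be sketched as a small loop. This is a hedged outline, not a real chaos tool: `measure`, `inject`, and `halt` are hypothetical callables the reader would supply, and the 5% threshold in the usage example is an arbitrary abort condition.

```python
from typing import Callable, Dict, List


def run_experiment(
    measure: Callable[[], float],    # current value of one SLI, e.g. error rate
    inject: Callable[[int], None],   # starts the failure at a given blast radius
    halt: Callable[[], None],        # stops the failure and rolls back
    abort_threshold: float,          # abort condition on the SLI
    blast_radii: List[int],          # e.g. [1, 2, 4] targets, smallest first
) -> List[Dict[str, object]]:
    results = []
    baseline = measure()                        # steps 1-2: observe and baseline
    for radius in blast_radii:                  # step 4, then step 7 on later passes
        inject(radius)                          # step 5: run the experiment
        try:
            observed = measure()
        finally:
            halt()                              # always stop the failure
        aborted = observed > abort_threshold    # step 3: abort condition
        results.append({"blast_radius": radius, "baseline": baseline,
                        "observed": observed, "aborted": aborted})  # step 6
        if aborted:
            break                               # never expand scope after an abort
    return results                              # step 8: share these results


# Usage with a fake system whose error rate grows with the blast radius.
fake = {"radius": 0}
report = run_experiment(
    measure=lambda: 0.02 * fake["radius"],
    inject=lambda radius: fake.update(radius=radius),
    halt=lambda: fake.update(radius=0),
    abort_threshold=0.05,
    blast_radii=[1, 2, 4],
)
print([entry["aborted"] for entry in report])  # [False, False, True]
```

Calling `halt` in a `finally` block is a deliberate choice: the injected failure is rolled back even if the measurement itself raises.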

Gradually increase
experiment magnitude

Excuse:
We don’t need to break
things. They break on their
own!

Excuse:
We don’t know how to get started.

We test proactively,
instead of waiting for an
outage.

How do we do Chaos
Engineering when working
on Azure?

AKS
Azure Kubernetes Service

Experiment 1:
Pod Restart/Replication
Hypothesis
If a Pod shuts down, K8s automatically restarts the
Pod or deploys a replica.
Procedure
Use a shutdown Gremlin to stop the Pod.
Observation
Pod enters shutdown state, and Kubernetes
immediately restarts it. The application returns to
a healthy state.
Conclusion
K8s can successfully recover from a Pod
failure without significant interruption to users.

Experiment 2:
Validating Autoscaling
Hypothesis
Using excessive CPU triggers the Cluster
Autoscaler to provision a new node.
Procedure
Use a CPU Gremlin to consume 50% of CPU on all
nodes for 2 minutes.
Observation
The Cluster Autoscaler provisions a new node
after one minute. Meanwhile, new Pods appear
in the Pending state.
Conclusion
Our autoscaling policy added a node and
avoided evicting Pods due to resource
exhaustion.
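
Timing the scale-up, as in the observation above, can be automated with a small polling helper. This is a sketch under assumptions: `count_nodes` is a hypothetical callable (it could wrap `kubectl get nodes` or a Kubernetes client), and the timeout and polling interval are arbitrary defaults. The clock and sleep functions are injectable so the example runs instantly against fake data.

```python
import time
from typing import Callable, Optional


def wait_for_scale_up(
    count_nodes: Callable[[], int],
    baseline: int,
    timeout_s: float = 300.0,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds elapsed until the node count exceeds baseline, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if count_nodes() > baseline:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_s)


# Usage with a fake clock and a fake cluster that gains a node on the third poll.
fake_time = {"now": 0.0}
node_counts = iter([3, 3, 4])
elapsed = wait_for_scale_up(
    count_nodes=lambda: next(node_counts),
    baseline=3,
    poll_s=30.0,
    clock=lambda: fake_time["now"],
    sleep=lambda seconds: fake_time.update(now=fake_time["now"] + seconds),
)
print(elapsed)  # 60.0
```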

Experiment 3:
Latency and Networking
Hypothesis
Increasing the latency of a downstream Pod will
cause poor performance upstream.
Procedure
Use a latency Gremlin to add 300 ms of latency
to the product catalog Pod, then use
ApacheBench to time requests.
Observation
Requests are notably delayed.
Conclusion
We need to optimize network throughput or
develop a load balancing strategy.
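
The talk uses ApacheBench for timing. As a rough Python equivalent, the sketch below times sequential requests and compares a baseline run against a run made while the latency attack is active. The URL passed to `time_requests` would be the reader's own endpoint, and the sample timings in the usage example are invented.

```python
import math
import time
import urllib.request
from typing import Dict, List


def time_requests(url: str, count: int) -> List[float]:
    """Issue `count` sequential GET requests and return each duration in ms."""
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def compare(baseline: List[float], attacked: List[float]) -> Dict[str, float]:
    """Added latency at the median and the 95th percentile, in ms."""
    return {
        "p50_added_ms": percentile(attacked, 50) - percentile(baseline, 50),
        "p95_added_ms": percentile(attacked, 95) - percentile(baseline, 95),
    }


# Hypothetical timings (ms) before and during a 300 ms latency attack.
baseline_ms = [20, 22, 25, 21, 30]
attacked_ms = [320, 325, 322, 330, 400]
print(compare(baseline_ms, attacked_ms))
# {'p50_added_ms': 303, 'p95_added_ms': 370}
```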

Experiment 4:
Storage Constraints
Hypothesis
Pods with defined storage limits do not exceed
those limits.
Procedure
Use a disk Gremlin to consume 90% of storage on
a redis Pod.
Observation
Kubernetes identifies disk pressure and starts
evicting Pods.
Conclusion
We need to increase our storage limits, review
disk pressure thresholds, or set resource
limits/quotas.
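
One of the remedies named in the conclusion is setting resource limits. Kubernetes accepts `ephemeral-storage` under a container's `resources.requests` and `resources.limits`. The sketch below expresses such a Pod spec as a Python dictionary and flags containers that lack the limit. The sizes, the sidecar container, and the helper name are assumptions for illustration, and it presumes the Pod writes to ephemeral storage.

```python
from typing import Dict, List

pod_spec = {
    "containers": [
        {
            "name": "redis",
            "image": "redis",  # placeholder image reference
            "resources": {
                "requests": {"ephemeral-storage": "1Gi"},  # hypothetical sizes
                "limits": {"ephemeral-storage": "2Gi"},
            },
        },
        # Hypothetical second container with no limits set.
        {"name": "sidecar", "image": "example/sidecar"},
    ]
}


def containers_missing_storage_limit(spec: Dict) -> List[str]:
    """Names of containers that do not declare an ephemeral-storage limit."""
    return [
        container["name"]
        for container in spec.get("containers", [])
        if "ephemeral-storage"
        not in container.get("resources", {}).get("limits", {})
    ]


print(containers_missing_storage_limit(pod_spec))  # ['sidecar']
```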

Many More Recommended
Scenarios

Demo Environment
Latency - 1000ms
app:mhc-front
120 seconds
What would cause you to HALT this experiment?
● User is unable to use site
● Data Loss

Chaos Engineering
Automation

Why Automate?
- Prevent Regression
- Prevent Human Error
- Reduce time spent manually running tests
- Reduce Toil

Chaos Engineering Automation
- Status Checks
- Scheduling
- API Calls
- SDK Implementations
- CI/CD
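
Status checks and API calls combine naturally: confirm the system is healthy, then start the experiment. The sketch below is tool-agnostic. `start_experiment` is a hypothetical callable standing in for whatever API or SDK call launches the attack, and no real Gremlin endpoint is shown.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status_check(url: str, timeout_s: float = 5.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures and non-2xx responses both count as unhealthy.
        return False


def run_if_healthy(status_check: Callable[[], bool],
                   start_experiment: Callable[[], str]) -> str:
    """Start the experiment only when the status check passes."""
    if not status_check():
        return "skipped: status check failed"
    return start_experiment()


# Usage with stand-in callables; a real check might be
# lambda: http_status_check("<your health endpoint>").
print(run_if_healthy(lambda: False, lambda: "started"))
# skipped: status check failed
```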

Development → Staging → Production

Chaos @ Pipeline

If needed, manual step
available
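
A pipeline like this can be sketched as a sequence of environments, each gated on the previous stage's experiment passing, with an optional manual approval step. `run_chaos` and `approve` are hypothetical hooks, and which stage requires approval is an assumption made for the example.

```python
from typing import Callable, Iterable, List, Set

STAGES = ["development", "staging", "production"]


def promote(run_chaos: Callable[[str], bool],
            approve: Callable[[str], bool],
            manual_gates: Set[str],
            stages: Iterable[str] = STAGES) -> List[str]:
    """Return the stages whose experiments passed, stopping at the first failure."""
    passed = []
    for stage in stages:
        # Manual step, if needed: a human must approve before this stage runs.
        if stage in manual_gates and not approve(stage):
            break
        if not run_chaos(stage):
            break
        passed.append(stage)
    return passed


# Usage: experiments pass everywhere except production, which is also gated.
result = promote(
    run_chaos=lambda stage: stage != "production",
    approve=lambda stage: True,
    manual_gates={"production"},
)
print(result)  # ['development', 'staging']
```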

What does Chaos Engineering
Maturity look like?
- The end goal is automated Chaos Engineering experiments run continuously across all environments for all critical services.

About Gremlin
The leading enterprise Chaos Engineering platform
Industry Leaders: Founded in 2015 by Chaos Engineering pioneers from Amazon and Netflix.
Built for Enterprise: Enterprise-grade security, scale, control, and support.
Trusted Partner: Guided expertise throughout the customer journey.

Mission
Build a more reliable internet.

Why Gremlin
Simple: Intuitive interface and well-documented API
Safe: Safely halt and roll back any experiment
Secure: SOC II Certified; RBAC, MFA, SSO
Cloud Native: Runs in all cloud environments - AWS, Azure, GCP
Guided: Recommended attacks and scenarios
Breadth: Supports Linux, Windows, Kubernetes, and more

Failure happens.
Run Chaos Engineering
experiments on everything.

Interested in hands-on
learning with Azure?
Chaos Engineering on
Azure
March 31, 2021
10AM PST • 1PM EST
gremlin.com/bootcamps
Interactive
Learning
@datadoghq | @gremlininc

Join the Chaos
Engineering Community:
gremlin.com/slack

gremlin.com/talk/chaos-azure
Free Stickers!
Chaos Engineering
Community

Thank You!
Sr. Chaos Engineer, Gremlin
ana@gremlin.com
@ana_m_medina
Try Gremlin Free:
go.gremlin.com/ana
Free Stickers: gremlin.com/talk/chaos-azure
