© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Arun Gupta, @arungupta
Principal Open Source Technologist,
Amazon Web Services
Using Chaos to Bring Resiliency to Your Applications in Kubernetes
Failures are a given and everything will eventually fail over time.
https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Amazon 2006
GameDay: Creating Resiliency Through Destruction
Jesse Robbins
Chaos Monkeys
https://github.com/Netflix/SimianArmy
Chaos Engineering
Resilience
Ability of a system to adapt to changes, failures, and disturbances
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production
Credit: https://www.flickr.com/photos/loseryouthcrew/8775130600/
https://principlesofchaos.org/
Bad things will happen to your system, no matter how well designed it is
You cannot afford to ignore that
Break your systems on purpose
Find out their weaknesses and fix them before they break when least expected
Chaos doesn’t cause problems.
It reveals them.
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
Where do you inject Chaos?
Phases of chaos engineering
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
"Normal" behavior of your system
Business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
Phases of chaos engineering
• a service gives 404 or 503?
• latency increases by 300ms?
• the port is not accessible?
• security group rules changed?
• the database stops?
• an excessive number of requests arrives?
• iptables are wiped out?
Phases of chaos engineering
Pick hypothesis
Scope the experiment
Identify metrics
Notify the organization
Start very small
As close as possible to production
Minimize the blast radius.
Have an emergency STOP!
[Diagram: Canary deployment. Users are split 99% / 1%.]
Start with...
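One way to keep the blast radius at roughly 1% of users is to bucket users deterministically, so the same user always lands on the same side of the split. The sketch below is only an illustration: the function name `in_canary`, the hashing scheme, and the user IDs are assumptions, not something the slides specify.

```python
import hashlib


def in_canary(user_id: str, percent: float = 1.0) -> bool:
    """Return True if this user falls into the canary slice (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to one of 10,000 stable buckets (0..9999).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=1.0 selects buckets 0..99, i.e. about 1% of users.
    return bucket < percent * 100


# Made-up user population, used only to check the size of the slice.
users = [f"user-{i}" for i in range(100_000)]
canary_share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {canary_share:.2%}")
```

Because the bucket depends only on the user ID, widening the experiment later (say from 1% to 5%) keeps the original canary users inside the slice.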
Phases of chaos engineering
Time to detect?
Time for notification? And escalation?
Time to public notification?
Time for graceful degradation to kick in?
Time for self healing to happen?
Time to recovery—partial and full?
Time to all-clear and stable?
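The questions above become concrete numbers if each milestone of an experiment is timestamped. The following is a minimal sketch; the milestone names and all timestamps are invented for illustration and are not measurements from the talk.

```python
from datetime import datetime, timedelta


def elapsed_since_injection(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Express every milestone as time elapsed since the fault was injected."""
    injected = events["injected"]
    return {name: ts - injected for name, ts in events.items() if name != "injected"}


# Hypothetical timeline of one experiment.
t0 = datetime(2018, 1, 1, 10, 0, 0)
events = {
    "injected": t0,
    "detected": t0 + timedelta(seconds=45),
    "notified": t0 + timedelta(minutes=2),
    "degraded_gracefully": t0 + timedelta(minutes=3),
    "recovered": t0 + timedelta(minutes=11),
    "all_clear": t0 + timedelta(minutes=20),
}

durations = elapsed_since_injection(events)
for name, delta in durations.items():
    print(f"time to {name}: {delta}")
```

Tracking these durations across repeated runs shows whether fixes actually shorten detection and recovery.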
DON’T blame that one person…
PostMortems—COE (Correction of Errors)
The 5 WHYs
Phases of chaos engineering
Fix
Failure free operations require experience with failure.
http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
Kubernetes cluster
Reconciles desired and actual state for pods
Distributes pods across AZs
Automatic health-check based restarts
Rolling deployment of a service
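The first point, reconciling desired and actual state, is what lets a cluster absorb a killed pod. The toy loop below illustrates the idea only; it is not how the Kubernetes controllers are implemented, and the function and pod names are hypothetical.

```python
def reconcile(desired: int, running: list[str], next_id: int) -> tuple[list[str], int]:
    """Add or remove pods until the running set matches the desired replica count."""
    running = list(running)  # work on a copy, leave the caller's list untouched
    while len(running) < desired:
        running.append(f"pod-{next_id}")  # start a replacement pod
        next_id += 1
    while len(running) > desired:
        running.pop()  # scale down surplus pods
    return running, next_id


pods, next_id = reconcile(3, [], 0)          # initial rollout of three replicas
pods.remove("pod-1")                         # chaos: one pod is killed
pods, next_id = reconcile(3, pods, next_id)  # the loop restores the replica count
print(pods)
```

A chaos experiment against such a loop checks that the replacement appears quickly enough for users not to notice.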
Kubernetes cluster with Amazon EKS
AWS managed
Customer account
Kubernetes cluster with Amazon EKS
[Diagram: kubectl, cluster endpoint mycluster.eks.amazonaws.com, Availability Zones 1, 2, and 3]
Region and Availability Zones
Control Plane is highly available
Master and Workers are configured in ASG
Master instance type auto-scaling
Etcd is HA and backed up every hour
Chaos in a Kubernetes cluster
[Diagram: kubectl, cluster endpoint mycluster.eks.amazonaws.com, Availability Zones 1, 2, and 3, with three components marked x; callouts: Health check? Dead node?]
Istio
Chaos Toolkit
Kube Monkey
PowerfulSeal
Gremlin
Simian Army
Istio
Intelligent routing and load balancing
Resilience across languages and platforms
Fleet-wide policy enforcement
In-depth telemetry
Timeouts
Bounded retries with timeout budget
Concurrent connections limit and request load
Active health checks (periodic)
Passive health checks (circuit breakers)
AZ-aware load balancing with automatic failover
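"Bounded retries with timeout budget" means a caller retries a limited number of times and also stops once an overall deadline has passed. Istio applies this in the sidecar proxy; the sketch below only illustrates the idea in application code. All names (`call_with_retries`, `BudgetExceeded`, the parameters) are hypothetical, and a single slow attempt is not interrupted, only further retries are prevented.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the attempts or the overall time budget are used up."""


def call_with_retries(op, max_attempts=3, budget_s=1.0, backoff_s=0.05,
                      clock=time.monotonic, sleep=time.sleep):
    """Call op() up to max_attempts times, never starting an attempt past the deadline."""
    deadline = clock() + budget_s
    last_error = None
    for attempt in range(1, max_attempts + 1):
        if clock() >= deadline:
            break  # time budget spent: do not start another attempt
        try:
            return op()
        except Exception as exc:  # a real client would catch only retryable errors
            last_error = exc
            if attempt < max_attempts:
                # Back off, but never sleep past the deadline.
                sleep(min(backoff_s, max(0.0, deadline - clock())))
    raise BudgetExceeded("retries or time budget exhausted") from last_error


# Toy upstream that fails twice and then answers.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "hello"


result = call_with_retries(flaky, sleep=lambda s: None)
print(result, "after", calls["n"], "calls")
```

Bounding both the attempt count and the total time keeps retries from amplifying load on an upstream that is already struggling.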
• Timing failures
• Increased network latency
• Overloaded upstream service
• Crashes
• HTTP error codes
• TCP connection failures
Fault injection using Istio—timeout
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting
spec:
  hosts:
  - greeting
  http:
  - fault:
      delay:
        fixedDelay: 10s
        percent: 100
    route:
    - destination:
        host: greeting
        subset: greeting-hello
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello
Fault injection using Istio—HTTP abort
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting
spec:
  hosts:
  - greeting
  http:
  - fault:
      abort:
        httpStatus: 500
        percent: 100
    route:
    - destination:
        host: greeting
        subset: greeting-hello
Istio traffic management
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting-virtual-service
spec:
  hosts:
  - greeting
  http:
  - route:
    - destination:
        host: greeting
        subset: greeting-hello
      weight: 75
    - destination:
        host: greeting
        subset: greeting-howdy
      weight: 25
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello
  - name: greeting-howdy
    labels:
      greeting: howdy
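The effect of the 75/25 weights can be pictured as a weighted random choice per request. The simulation below is only a sketch of that effect, not of how the Envoy sidecar picks a destination; the `ROUTES` table and `pick_subset` are hypothetical names.

```python
import random
from collections import Counter

# Subset names and weights copied from the VirtualService above.
ROUTES = [("greeting-hello", 75), ("greeting-howdy", 25)]


def pick_subset(rng: random.Random) -> str:
    """Choose a destination subset with probability proportional to its weight."""
    subsets = [name for name, _ in ROUTES]
    weights = [weight for _, weight in ROUTES]
    return rng.choices(subsets, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
counts = Counter(pick_subset(rng) for _ in range(10_000))
for subset, hits in counts.most_common():
    print(f"{subset}: {hits / 10_000:.1%}")
```

The same mechanism supports a canary-style chaos experiment: send a small weight to the subset under test and leave the rest of the traffic alone.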
Istio circuit breaker
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 100
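A connection limit such as `maxConnections: 100` caps how much concurrent work reaches a service, so an overload fails fast instead of piling up. The sketch below shows only that capping idea with a semaphore; it is an assumption-laden illustration (the `ConnectionPool` class is hypothetical) and does not reproduce Envoy's queuing or error responses.

```python
import threading


class ConnectionPool:
    """Hypothetical pool that refuses connections beyond a hard limit."""

    def __init__(self, max_connections: int):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: reject immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


pool = ConnectionPool(max_connections=100)
accepted = sum(pool.try_acquire() for _ in range(120))  # 120 concurrent attempts
rejected = 120 - accepted
print(f"accepted={accepted} rejected={rejected}")
```

In a chaos experiment, the interesting question is how callers behave when they are among the rejected requests.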
https://istio.io/docs/
Chaos Toolkit
Open API for Chaos Engineering
CLI-driven
Experiments declared in JSON/YAML files
Open specification
Extensible: Kubernetes, AWS, Spring, others
Chaos Toolkit follows the principles of chaos
Probes: query a system to observe a behavior
• Check state of a pod with a specific label
• Multiple probes to define steady state
Actions: simulate real-world events
• Terminate a deployment
• Multiple actions simulate events
Provider types for probes and actions
• Process: Run a binary
• HTTP: Invoke an HTTP endpoint
• Python: Call a Python function to perform richer operations
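The flow that ties probes and actions together is: check the steady state, run the method, check the steady state again, then roll back. The sketch below illustrates that flow in plain Python. It is not Chaos Toolkit's implementation; the function names, the status strings, and the toy `system` are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# A probe is paired with the value it must return for the steady state to hold.
Probe = Tuple[Callable[[], object], object]


def steady_state_met(probes: Iterable[Probe]) -> bool:
    """The hypothesis holds only if every probe returns its tolerated value."""
    return all(probe() == tolerance for probe, tolerance in probes)


def run_experiment(probes, method, rollbacks=()) -> str:
    probes = list(probes)
    if not steady_state_met(probes):
        return "not-started"  # already unhealthy: do not inject more failure
    try:
        for action in method:
            action()
        return "completed" if steady_state_met(probes) else "failed"
    finally:
        for rollback in rollbacks:  # rollbacks run whatever the outcome
            rollback()


# Toy system: the web app answers 200 only while the greeting service is up.
system = {"greeting_up": True}


def webapp_status() -> int:
    return 200 if system["greeting_up"] else 503


def terminate_greeting() -> None:
    system["greeting_up"] = False


def restore_greeting() -> None:
    system["greeting_up"] = True


status = run_experiment(
    probes=[(webapp_status, 200)],
    method=[terminate_greeting],
    rollbacks=[restore_greeting],
)
print(status)
```

A "failed" result here is the useful outcome: it reveals that the web app depends on the greeting service with no fallback.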
Chaos Toolkit metadata
{
  "version": "1.0.0",
  "title": "Terminating the greeting service should not impact users",
  "description": "How does the greeting service unavailability impact our users? Do they see an error or does the webapp get slower?",
  "tags": [
    "kubernetes",
    "aws"
  ],
  "configuration": {
    "web_app_url": {
      "type": "env",
      "key": "WEBAPP_URL"
    }
  },
Chaos Toolkit steady state & hypothesis
  "steady-state-hypothesis": {
    "title": "Services are all available and healthy",
    "probes": [
      {
        "type": "probe",
        "name": "alive-and-healthy",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "pods_in_phase",
          "arguments": {
            "label_selector": "app=webapp-pod",
            "phase": "Running",
            "ns": "default"
          }
        }
      },
      {
        "type": "probe",
        "name": "application-must-respond-normally",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "${web_app_url}",
          "timeout": 3
        }
      }
    ]
  },
Chaos Toolkit experiment & verify
  "method": [
    {
      "type": "action",
      "name": "terminate-greeting-service",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=greeter-pod",
          "ns": "default"
        }
      }
    },
    {
      "type": "probe",
      "name": "fetch-application-logs",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.probes",
        "func": "read_pod_logs",
        "arguments": {
          "label_selector": "app=webapp-pod",
          "last": "20s",
          "ns": "default"
        }
      }
    }
  ],
Chaos Toolkit run
$ chaos run experiments/experiment.json
[2018-03-10 14:42:38 INFO] Validating the experiment's syntax
[2018-03-10 14:42:38 INFO] Experiment looks valid
[2018-03-10 14:42:38 INFO] Running experiment: Terminate the greeting service should not impact users
[2018-03-10 14:42:38 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:38 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:38 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:39 INFO] Steady state hypothesis is met!
[2018-03-10 14:42:39 INFO] Action: terminate-greeting-service
[2018-03-10 14:42:40 INFO] Probe: fetch-application-logs
[2018-03-10 14:42:41 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:41 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:42 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:45 ERROR] => failed: activity took too long to complete
[2018-03-10 14:42:45 CRITICAL] Steady state probe 'application-must-respond-normally' is not in the given tolerance so failing this experiment
[2018-03-10 14:42:45 INFO] Let's rollback...
[2018-03-10 14:42:45 INFO] No declared rollbacks, let's move on.
[2018-03-10 14:42:45 INFO] Experiment ended with status: failed
https://github.com/chaostoolkit/chaostoolkit/
Implementation of Netflix’s Chaos Monkey for Kubernetes
Randomly deletes pods in the cluster
Applications opt in using labels
Run Kube-Monkey—create configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 8
    start_hour = 10
    end_hour = 16
    blacklisted_namespaces = ["kube-system"]
    whitelisted_namespaces = [""]
Kube-Monkey application opt-in
apiVersion: apps/v1
kind: Deployment
. . .
  template:
    metadata:
      labels:
        app: greeting
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim-pods
        # Label values must be strings, so numeric values are quoted.
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: random-max-percent
        kube-monkey/kill-value: '40'
    spec:
      containers:
      - name: greeting
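With `kill-mode: random-max-percent` and `kill-value: 40`, up to 40% of the matching pods may be terminated in one run. The sketch below shows one plausible reading of that setting. It is an assumption: kube-monkey's actual rounding and selection logic may differ, and `pods_to_kill` and `pick_victims` are hypothetical names.

```python
import math
import random


def pods_to_kill(total_pods: int, max_percent: int, rng: random.Random) -> int:
    """Pick a random percentage up to max_percent and convert it to a pod count."""
    percent = rng.randint(0, max_percent)  # inclusive on both ends
    return math.floor(total_pods * percent / 100)


def pick_victims(pods: list[str], max_percent: int, rng: random.Random) -> list[str]:
    """Choose distinct pods to terminate, never more than max_percent of them."""
    return rng.sample(pods, pods_to_kill(len(pods), max_percent, rng))


# Made-up pod names for a deployment with ten replicas.
pods = [f"greeting-{i}" for i in range(10)]
victims = pick_victims(pods, max_percent=40, rng=random.Random())
print(f"terminating {len(victims)} of {len(pods)} pods: {victims}")
```

Capping the percentage is a way to bound the blast radius: with ten replicas, at least six keep serving traffic during any single run.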
https://github.com/asobti/kube-monkey
Chaos Engineering working group @ CNCF
https://github.com/chaoseng/wg-chaoseng
Chaos Engineering mind map
https://bit.ly/2uKOJMQ
You don’t choose the moment, the moment chooses you.
You only choose how prepared you are when it does.
Thank you!

Chaos Engineering with Kubernetes

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Arun Gupta, @arungupta Principal Open Source Technologist, Amazon Web Services Using Chaos to Bring Resiliency to Your Applications in Kubernetes
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Failures are a given and everything will eventually fail over time. https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark https://www.youtube.com/watch?v=zoz0ZjfrQ9s Amazon 2006 GameDay: Creating Resiliency Through Destruction Jesse Robbins
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Chaos Monkeys https://github.com/Netflix/SimianArmy
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Chaos Engineering
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Resilience Ability of a system to adapt to changes, failures, and disturbances
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production Credit: https://www.flickr.com/photos/loseryouthcrew/8775130600/ https://principlesofchaos.org/
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Bad things will happen to your system, no matter how well designed it is You cannot become ignorant to it
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Break your systems on purpose Find out their weaknesses and fix them before they break when least expected
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Chaos doesn’t cause problems. It reveals them.
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Where do you inject Chaos?
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero ”Normal” behavior of your system
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Business metric https://medium.com/netflix- techblog/sps-the-pulse-of- netflix-streaming- ae4db0e05f8a
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark • a service gives 404 or 503? • latency increases by 300ms? • the port is not accessible? • security group rules changed? • the database stops? • excessive number of requests come? • iptables are wiped out?
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Pick hypothesis Scope the experiment Identify metrics Notify the organization
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Start with very small As close as possible to production Minimize the blast radius. Have an emergency STOP!
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Users Canary deployment 99% users 1% users Start with...
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Phases of chaos engineering
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark Time to detect? Time for notification? And escalation? Time to public notification? Time for graceful degradation to kick-in? Time for self healing to happen? Time to recovery—partial and full? Time to all-clear and stable?
  • 29. DON’T blame that one person…
  • 30.
      • Postmortems: COE (Correction of Errors)
      • The 5 WHYs
  • 31. Phases of chaos engineering
  • 32. Phases of chaos engineering
  • 33. Fix
  • 34. Failure free operations require experience with failure. http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
  • 35. Kubernetes cluster
  • 36.
      • Reconciles desired and actual state for pods
      • Distributes pods across AZs
      • Automatic health-check based restarts
      • Rolling deployment of a service
  • 37. Kubernetes cluster with Amazon EKS (diagram: AWS managed, Customer account)
  • 38. Kubernetes cluster with Amazon EKS (diagram: Kubectl, mycluster.eks.amazonaws.com, Availability Zone 1, Availability Zone 2, Availability Zone 3)
  • 39.
      • Region and Availability Zones
      • Control Plane is highly available
      • Master and Workers are configured in ASG
      • Master instance type auto-scaling
      • Etcd is HA and backed up every hour
  • 40. Chaos in a Kubernetes cluster (diagram: Kubectl, mycluster.eks.amazonaws.com, Availability Zone 1, Availability Zone 2, Availability Zone 3, with failed components marked x). Health check? Dead node?
  • 41.
      • Istio
      • Chaos Toolkit
      • Kube Monkey
      • PowerfulSeal
      • Gremlin
      • Simian Army
  • 42. Istio
      • Intelligent routing and load balancing
      • Resilience across languages and platforms
      • Fleet-wide policy enforcement
      • In-depth telemetry
  • 43.
      • Timeouts
      • Bounded retries with timeout budget
      • Concurrent connections limit and request load
      • Active health checks (periodic)
      • Passive health checks (circuit breakers)
      • AZ-aware load balancing with automatic failover
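"Bounded retries with timeout budget" from the list above means that retries stop when either the attempt count or an overall deadline is exhausted. The sketch below illustrates that idea on the client side in Python; it is not how Istio or Envoy implements it, and `call_with_retry_budget`, `RetryBudgetExceeded`, and all parameter names are hypothetical.

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when the attempts or the overall time budget run out."""


def call_with_retry_budget(operation, max_attempts=3, budget_seconds=2.0,
                           backoff_seconds=0.1, sleep=time.sleep):
    """Call operation(remaining_seconds) until it succeeds or the budget ends."""
    deadline = time.monotonic() + budget_seconds
    last_error = None
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent: stop even if attempts remain
        try:
            # The callee receives the remaining budget to use as its own timeout.
            return operation(remaining)
        except Exception as exc:
            last_error = exc
        if attempt < max_attempts:
            # Linear backoff, capped so we never sleep past the deadline.
            pause = min(backoff_seconds * attempt,
                        max(deadline - time.monotonic(), 0.0))
            sleep(pause)
    raise RetryBudgetExceeded(f"gave up; last error: {last_error!r}")
```

Passing the remaining budget into each attempt keeps the total latency bounded, which is the property a chaos experiment with injected delays would be checking.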
  • 44.
      • Timing failures
      • Increased network latency
      • Overloaded upstream service
      • Crashes
      • HTTP error codes
      • TCP connection failures
  • 45. Fault injection using Istio: timeout
      apiVersion: networking.istio.io/v1alpha3
      kind: VirtualService
      metadata:
        name: greeting
      spec:
        hosts:
        - greeting
        http:
        - fault:
            delay:
              fixedDelay: 10s
              percent: 100
          route:
          - destination:
              host: greeting
              subset: greeting-hello
      ---
      apiVersion: networking.istio.io/v1alpha3
      kind: DestinationRule
      metadata:
        name: greeting-destination-rule
      spec:
        host: greeting
        subsets:
        - name: greeting-hello
          labels:
            greeting: hello
  • 46. Fault injection using Istio: HTTP abort
      apiVersion: networking.istio.io/v1alpha3
      kind: VirtualService
      metadata:
        name: greeting
      spec:
        hosts:
        - greeting
        http:
        - fault:
            abort:
              httpStatus: 500
              percent: 100
          route:
          - destination:
              host: greeting
              subset: greeting-hello
  • 47. Istio traffic management
      apiVersion: networking.istio.io/v1alpha3
      kind: VirtualService
      metadata:
        name: greeting-virtual-service
      spec:
        hosts:
        - greeting
        http:
        - route:
          - destination:
              host: greeting
              subset: greeting-hello
            weight: 75
          - destination:
              host: greeting
              subset: greeting-howdy
            weight: 25
      ---
      apiVersion: networking.istio.io/v1alpha3
      kind: DestinationRule
      metadata:
        name: greeting-destination-rule
      spec:
        host: greeting
        subsets:
        - name: greeting-hello
          labels:
            greeting: hello
        - name: greeting-howdy
          labels:
            greeting: howdy
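The 75/25 weights in the traffic management slide describe a proportional split between the `greeting-hello` and `greeting-howdy` subsets. As a rough illustration of what such a split means per request, here is a Python sketch; the `pick_subset` helper is hypothetical and only models the proportions, not Istio's actual load balancer.

```python
import random
from collections import Counter

# Subset names and weights taken from the VirtualService on the slide.
ROUTES = [("greeting-hello", 75), ("greeting-howdy", 25)]


def pick_subset(routes, rng=random):
    """Choose one subset with probability proportional to its weight."""
    names = [name for name, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    picks = Counter(pick_subset(ROUTES) for _ in range(10_000))
    for name, count in picks.most_common():
        print(f"{name}: {count / 10_000:.1%}")
```

Over many requests the observed shares approach the configured weights, while any single request may go to either subset.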
  • 48. Istio circuit breaker apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: greeting-destination-rule spec: host: greeting subsets: - name: greeting-hello labels: greeting: hello trafficPolicy: connectionPool: tcp: maxConnections: 100
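The `maxConnections: 100` setting caps how many connections may be open to the service at once. The Python sketch below shows the general fail-fast idea behind such a cap, assuming a hypothetical `ConnectionLimiter` class; it does not describe how Envoy handles connections that exceed the limit.

```python
import threading


class ConnectionLimiter:
    """Admit at most max_connections callers at a time; reject the rest."""

    def __init__(self, max_connections: int):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: return False immediately when the limit is reached,
        # so an overloaded service is not pushed further.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_connections=2)
    print([limiter.try_acquire() for _ in range(3)])
```

Rejecting early instead of queueing is what lets callers degrade gracefully rather than pile up behind a struggling dependency.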
  • 49. https://istio.io/docs/
  • 50. Chaos Toolkit: Open API for Chaos Engineering
  • 51.
      • CLI-driven
      • Experiments declared in JSON/YAML files
      • Open specification
      • Extensible: Kubernetes, AWS, Spring, others
  • 52. Chaos Toolkit follows the principles of chaos
  • 53.
      • Probes: query a system to observe a behavior
          • Check state of a pod with a specific label
          • Multiple probes to define steady state
      • Actions: real-world events
          • Terminate a deployment
          • Multiple actions simulate events
      • Types of probe and method
          • Process: Run a binary
          • HTTP: Invoke an HTTP endpoint
          • Python: Call a Python function to perform richer operations
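For the "Python: Call a Python function" provider type above, a probe can be an ordinary function in an importable module that the experiment names through `module`, `func`, and `arguments`. The following is a minimal sketch assuming a hypothetical function `responds_within`; the function name and its parameters are illustrative and not part of Chaos Toolkit or its extensions.

```python
from urllib.error import URLError
from urllib.request import urlopen


def responds_within(url: str, timeout: float = 3.0) -> bool:
    """Probe: True if the URL answers with a 2xx status before the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

A probe like this returns a plain value, so an experiment could compare it against a `tolerance` of `true` in the same way the `pods_in_phase` probe is used in the talk's example.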
  • 54. Chaos Toolkit metadata
      {
        "version": "1.0.0",
        "title": "Terminating the greeting service should not impact users",
        "description": "How does the greeting service unavailability impact our users? Do they see an error or does the webapp get slower?",
        "tags": [
          "kubernetes",
          "aws"
        ],
        "configuration": {
          "web_app_url": {
            "type": "env",
            "key": "WEBAPP_URL"
          }
        },
  • 55. Chaos Toolkit steady state & hypothesis
      "steady-state-hypothesis": {
        "title": "Services are all available and healthy",
        "probes": [
          {
            "type": "probe",
            "name": "alive-and-healthy",
            "tolerance": true,
            "provider": {
              "type": "python",
              "module": "chaosk8s.pod.probes",
              "func": "pods_in_phase",
              "arguments": {
                "label_selector": "app=webapp-pod",
                "phase": "Running",
                "ns": "default"
              }
            }
          },
          {
            "type": "probe",
            "name": "application-must-respond-normally",
            "tolerance": 200,
            "provider": {
              "type": "http",
              "url": "${web_app_url}",
              "timeout": 3
            }
          }
        ]
      },
  • 56. Chaos Toolkit experiment & verify
      "method": [
        {
          "type": "action",
          "name": "terminate-greeting-service",
          "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {
              "label_selector": "app=greeter-pod",
              "ns": "default"
            }
          }
        },
        {
          "type": "probe",
          "name": "fetch-application-logs",
          "provider": {
            "type": "python",
            "module": "chaosk8s.pod.probes",
            "func": "read_pod_logs",
            "arguments": {
              "label_selector": "app=webapp-pod",
              "last": "20s",
              "ns": "default"
            }
          }
        }
      ],
  • 57. Chaos Toolkit run
      $ chaos run experiments/experiment.json
      [2018-03-10 14:42:38 INFO] Validating the experiment's syntax
      [2018-03-10 14:42:38 INFO] Experiment looks valid
      [2018-03-10 14:42:38 INFO] Running experiment: Terminate the greeting service should not impact users
      [2018-03-10 14:42:38 INFO] Steady state hypothesis: Services are all available and healthy
      [2018-03-10 14:42:38 INFO] Probe: application-should-be-alive-and-healthy
      [2018-03-10 14:42:38 INFO] Probe: application-must-respond-normally
      [2018-03-10 14:42:39 INFO] Steady state hypothesis is met!
      [2018-03-10 14:42:39 INFO] Action: terminate-greeting-service
      [2018-03-10 14:42:40 INFO] Probe: fetch-application-logs
      [2018-03-10 14:42:41 INFO] Steady state hypothesis: Services are all available and healthy
      [2018-03-10 14:42:41 INFO] Probe: application-should-be-alive-and-healthy
      [2018-03-10 14:42:42 INFO] Probe: application-must-respond-normally
      [2018-03-10 14:42:45 ERROR] => failed: activity took too long to complete
      [2018-03-10 14:42:45 CRITICAL] Steady state probe 'application-must-respond-normally' is not in the given tolerance so failing this experiment
      [2018-03-10 14:42:45 INFO] Let's rollback...
      [2018-03-10 14:42:45 INFO] No declared rollbacks, let's move on.
      [2018-03-10 14:42:45 INFO] Experiment ended with status: failed
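The run log above follows a fixed order: check the steady state, apply the method, check the steady state again, then roll back. A simplified Python sketch of that control flow follows; `run_experiment` and the returned status strings are illustrative and are not Chaos Toolkit's implementation.

```python
def run_experiment(probes, method, rollbacks=()):
    """Run a chaos experiment.

    probes:    list of (check, tolerance) pairs; check() must equal tolerance
    method:    list of zero-argument callables that inject the failure
    rollbacks: list of zero-argument callables that undo the method
    """
    def steady_state_met():
        return all(check() == tolerance for check, tolerance in probes)

    # 1. Do not inject failures into a system that is already unhealthy.
    if not steady_state_met():
        return "not-started"
    # 2. Apply the experimental method.
    for step in method:
        step()
    # 3. Verify the hypothesis again after the disturbance.
    deviated = not steady_state_met()
    # 4. Always attempt to restore the system.
    for undo in rollbacks:
        undo()
    return "failed" if deviated else "completed"


# Usage with an in-memory stand-in for the greeting service.
state = {"greeting_up": True}
status = run_experiment(
    probes=[(lambda: state["greeting_up"], True)],
    method=[lambda: state.update(greeting_up=False)],
    rollbacks=[lambda: state.update(greeting_up=True)],
)
print(status)  # prints "failed"
```

A "failed" result here is the useful outcome of the exercise: it shows that the hypothesis did not hold and points at what needs fixing.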
  • 58. https://github.com/chaostoolkit/chaostoolkit/
  • 59.
      • Implementation of Netflix’s Chaos Monkey for Kubernetes
      • Randomly deletes pods in the cluster
      • Applications opt in using annotations
  • 60. Run Kube-Monkey: create configuration
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kube-monkey-config-map
        namespace: kube-system
      data:
        config.toml: |
          [kubemonkey]
          run_hour = 8
          start_hour = 10
          end_hour = 16
          blacklisted_namespaces = ["kube-system"]
          whitelisted_namespaces = [""]
  • 61. Kube-Monkey application opt-in
      apiVersion: apps/v1
      kind: Deployment
      . . .
        template:
          metadata:
            labels:
              app: greeting
              kube-monkey/enabled: enabled
              kube-monkey/identifier: monkey-victim-pods
              # Label values must be strings, so numeric values are quoted.
              kube-monkey/mtbf: '2'
              kube-monkey/kill-mode: random-max-percent
              kube-monkey/kill-value: '40'
          spec:
            containers:
            - name: greeting
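With `kill-mode: random-max-percent` and `kill-value: 40`, a random share of the matching pods, up to 40%, is selected for deletion. The Python sketch below models that selection; `pick_victims` is a hypothetical helper, and the rounding-down of the pod count is an assumption rather than kube-monkey's documented behavior.

```python
import math
import random


def pick_victims(pods, max_percent, rng=random):
    """Select a random subset of pods, at most max_percent of them."""
    percent = rng.randint(0, max_percent)  # random share between 0 and the cap
    count = math.floor(len(pods) * percent / 100)  # assumption: round down
    return rng.sample(pods, count)


if __name__ == "__main__":
    pods = [f"greeting-{i}" for i in range(10)]
    print(pick_victims(pods, max_percent=40))
```

With ten replicas and a cap of 40, this never selects more than four pods, which bounds the blast radius of each run.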
  • 62. https://github.com/asobti/kube-monkey
  • 63. Chaos Engineering working group @ CNCF https://github.com/chaoseng/wg-chaoseng
  • 64. Chaos Engineering mind map https://bit.ly/2uKOJMQ
  • 65. You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.
  • 66. Thank you!