Applying principles of chaos engineering to Serverless
Yan Cui @theburningmonk
Berlin | November 20 - 21, 2018
Agenda
What is chaos engineering?
New challenges with serverless
Applying latency injection to serverless
Applying error injection to serverless
What is chaos engineering?
Chaos Engineering is the discipline of experimenting on a distributed system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
- principlesofchaos.org
Smallpox
Earliest evidence of disease in 3rd Century BC Egyptian Mummy.
est. 400K deaths per year in 18th Century Europe.
History of vaccination
First vaccine was developed in 1798 by Edward Jenner.
WHO certified global eradication of smallpox in 1980.
https://en.wikipedia.org/wiki/Edward_Jenner
https://en.wikipedia.org/wiki/Vaccine
History of vaccination
Vaccination is the most effective method to prevent infectious diseases.
Vaccines stimulate the immune system so that it can recognise and destroy
the disease when you contract it for real.
Chaos engineering
Use controlled experiments to inject failures into our system.
Help us learn about our system’s behaviour and uncover unknown failure
modes, before they manifest like wildfire in production.
Lets us build confidence in its ability to withstand turbulent conditions.
Chaos Engineering is the vaccine to frailties in modern software.
Who am I?
Principal Engineer at DAZN.
AWS Serverless Hero.
Author of Production-Ready Serverless* course by Manning.
Blogger**, speaker.
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com
https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914
https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn
About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platforms.
Around 1,000,000 concurrent viewers at peak.
follow @dazneng for
updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.
WE’RE HIRING!
Chaos engineering has an image problem
Too much emphasis is on breaking things.
Easy to conflate the action of injecting failures with the payback.
Why did you break
production?
Because I can!
Chaos engineering has an image problem
The goal is to learn about the system and build confidence.
The goal is not to break things.
Chaos in practice
4 steps to start running chaos
experiments yourself.
STEP 1. Define “steady state”
aka. what does a normal, working
condition look like?
this is not a
steady state
STEP 2. Hypothesise steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
Chaos in practice
Explore unknown unknowns away from production.
Experiments that graduate to production should be carefully
considered and planned.
You should have reasonable confidence in the system before
running experiments in production.
Chaos in practice
Treat production with the care it deserves.
The goal is not to break things.
Chaos in practice
If you know the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called being irresponsible.
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
Chaos in practice
Look for evidence that steady state was impacted by the
injected failure.
Address weaknesses before failures happen for real.
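One way to make step 4 concrete is to compare the same steady-state metric across the control group and the experiment group. The sketch below is only an illustration: it assumes error rate is the chosen metric, and the names `GroupStats` and `hypothesis_disproved` and the one-percentage-point tolerance are hypothetical, not from the talk.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request and error counts observed for one group during the experiment."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def hypothesis_disproved(control: GroupStats, experiment: GroupStats,
                         tolerance: float = 0.01) -> bool:
    """True when the experiment group's error rate exceeds the control
    group's by more than `tolerance` (an assumed, team-chosen threshold)."""
    return experiment.error_rate - control.error_rate > tolerance


control = GroupStats(requests=10_000, errors=20)
experiment = GroupStats(requests=10_000, errors=450)
print(hypothesis_disproved(control, experiment))
```

A result of True means steady state was impacted by the injected failure, which points at a weakness to address.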
Containment
Experiments need to be controlled.
The goal is not to break things.
Containment
Ensure everyone knows what you are doing.
Don’t surprise your teammates.
Containment
Run experiments during office hours.
Avoid important dates.
Containment
Make the smallest change necessary to prove or disprove hypothesis.
Have a rollback plan.
Stop the experiment right away if things start to go wrong.
Containment
Don’t start in production.
You can learn a lot by running experiments in staging.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
New challenges with serverless
chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
Serverless challenges
There are no servers that you can access and kill.
There is more inherent chaos and complexity in
a serverless architecture.
Serverless challenges
Smaller unit of deployment, but a lot more of them.
serverful
serverless
Serverless challenges
Every function needs to be correctly configured and secured.
[Diagram: managed services in a serverless architecture: Kinesis, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES]
Serverless challenges
A lot of managed, intermediate services.
Each with its own set of failure modes.
Serverless challenges
Unknown failure modes in the infrastructure we don’t control.
Often there’s little we can do when an outage occurs in the platform.
Common weaknesses
Improperly tuned timeouts.
Missing error handling.
Missing fallback.
Missing regional failover.
Latency injection with serverless
STEP 1. Define “steady state”
aka. what does a normal, working
condition look like?
Defining steady state
What metrics do you use?
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentage of requests completed
** completeness of the returned response
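As a rough illustration of the metrics above, the sketch below computes tail latencies, yield and harvest from raw numbers. The helper names and the nearest-rank percentile method are assumptions made for this example; in practice these values would usually come from your monitoring system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def request_yield(completed, received):
    """Yield: fraction of received requests that were completed."""
    return completed / received if received else 1.0


def harvest(fields_returned, fields_expected):
    """Harvest: completeness of a response, as a fraction of expected fields."""
    return fields_returned / fields_expected if fields_expected else 1.0


latencies_ms = list(range(1, 101))  # made-up sample: 1ms .. 100ms
print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(request_yield(completed=990, received=1000))
print(harvest(fields_returned=4, fields_expected=5))
```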
STEP 2. Hypothesise steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
Serverless considerations
Consider the effect of cold starts.
How does it affect your strategy
for handling slow responses?
Request timeouts
Strategy should:
1. Give requests the best chance to succeed
2. Not allow a slow response to time out the calling function
Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Too long : risk timing out the calling function.
Even more complicated when you have multiple integration points.
Approach 1: split invocation time equally
(e.g. 3 requests, 6s function timeout = 2s timeout per request)
Approach 2: every request is given nearly all the invocation time
(e.g. 3 requests, 6s function timeout = 5s timeout per request)
Request timeouts
Proposal: set request timeouts dynamically based on
invocation time left
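The proposal can be sketched as follows. This is a minimal illustration, not the code from the talk: it assumes the Python Lambda runtime, whose context object exposes `get_remaining_time_in_millis()`. The helper `request_timeout_ms`, the 500ms reserve for recovery steps and the `FakeContext` stand-in are hypothetical.

```python
def request_timeout_ms(remaining_ms: int, reserve_ms: int = 500) -> int:
    """Timeout for the next request: the invocation time left, minus a
    reserve kept back for recovery steps (logging, metrics, fallback).
    Returns 0 when there is no time left to attempt the request."""
    return max(remaining_ms - reserve_ms, 0)


class FakeContext:
    """Stand-in for the Lambda context object, for local experimentation."""

    def __init__(self, remaining_ms: int):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms


# 6s function timeout, 1s already spent on an earlier request
context = FakeContext(remaining_ms=5000)
timeout_ms = request_timeout_ms(context.get_remaining_time_in_millis())
print(timeout_ms)
```

Compared with the two approaches above, each request gets as much time as the invocation can still afford, and the reserve keeps a slow response from timing out the calling function.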
Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, etc.
Record custom metrics.
Use fallbacks.
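The three recovery steps could look roughly like this. Everything here is an assumption made for illustration: `call_with_fallback` is a hypothetical helper, the wrapped call is assumed to raise Python's built-in `TimeoutError`, and the `metrics` dict stands in for a real metrics client.

```python
import json
import logging

logger = logging.getLogger("timeouts")
metrics = {}  # stand-in for a real custom-metrics client


def call_with_fallback(api_name, call, fallback, timeout_ms, correlation_id):
    """Run `call(timeout_ms)`; on timeout, log context, count it, fall back."""
    try:
        return call(timeout_ms)
    except TimeoutError:
        # 1. log the timeout with as much context as possible
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeout_ms": timeout_ms,
            "correlation_id": correlation_id,
        }))
        # 2. record a custom metric
        key = f"{api_name}.timeout"
        metrics[key] = metrics.get(key, 0) + 1
        # 3. use the fallback
        return fallback()


def slow_call(timeout_ms):
    """Simulates a downstream call that always times out."""
    raise TimeoutError(f"no response within {timeout_ms}ms")


result = call_with_fallback("recommendations", slow_call,
                            fallback=lambda: [], timeout_ms=2000,
                            correlation_id="abc-123")
print(result, metrics)
```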
Recovery steps
Be mindful when you sacrifice precision for availability.
User experience is king.
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
Where to inject latency?
hypothesis:
Function has appropriate timeout on its HTTP communications
and can degrade gracefully when these requests time out
Where to inject latency?
Should be applied to 3rd party services too.
DynamoDB, Twilio, Auth0, …
Where to inject latency?
Be mindful of the blast radius of the experiment.
The goal is not to break things.
[Diagram: public-api-a and public-api-b (each with an http client), internal-api]
Where to inject latency?
hypothesis:
All functions have appropriate timeout on their HTTP
communications to this internal API, and can degrade
gracefully when requests time out
Where to inject latency?
Large blast radius, can cause cascade failures unintentionally
[Diagram labels: development, production]
Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulus, without
conscious guidance or intention.
It is a technique in psychology used to train a person's
memory both in positive and negative ways.
Use failure injection to programme your colleagues into
thinking about failure modes early.
Where to inject latency?
Make X% of all requests slow
in the dev environment.
hypothesis:
The client app has an appropriate timeout on its HTTP
communication with the server, and can degrade gracefully
when requests are timed out
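Making X% of requests slow can be as simple as a probabilistic delay at the start of a handler. The sketch below is a hedged illustration: `maybe_delay` and its parameters are hypothetical names, and the random source and sleep function are injectable only so the behaviour can be checked without waiting.

```python
import random
import time


def maybe_delay(rate, min_ms, max_ms, rng=random, sleep=time.sleep):
    """On roughly `rate` (0.0 to 1.0) of calls, sleep for a random delay
    between min_ms and max_ms. Returns the injected delay in milliseconds,
    or 0 when the call was left alone."""
    if rng.random() >= rate:
        return 0
    delay_ms = rng.randint(min_ms, max_ms)
    sleep(delay_ms / 1000)
    return delay_ms


# Example: slow down about 10% of requests by 1 to 3 seconds.
# A recording function replaces time.sleep so the example returns instantly.
slept = []
delays = [maybe_delay(0.1, 1000, 3000, rng=random.Random(7), sleep=slept.append)
          for _ in range(5)]
print(delays)
```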
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
How to inject latency?
Static weavers (e.g. PostSharp, AspectJ).
Dynamic proxies.
https://theburningmonk.com/2015/04/design-for-latency-issues/
How to inject latency?
Manually crafted wrapper libraries.
configured in SSM Parameter Store
no injected latency
with injected latency
factory wrapper function
(think bluebird’s promisifyAll function)
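The code shown on these slides is not part of this transcript, so the following is only a sketch of the factory wrapper idea in Python: take any client object and return a proxy whose public methods inject latency before delegating. The name `inject_latency_all`, the config shape (`enabled`, `delay_ms`) and the `get_config` callable (which could read from SSM Parameter Store) are all assumptions.

```python
import time


def inject_latency_all(client, get_config, sleep=time.sleep):
    """Return a proxy for `client` whose public methods sleep according to
    the current config before calling the real method."""

    class Proxy:
        def __getattr__(self, name):
            attr = getattr(client, name)
            if name.startswith("_") or not callable(attr):
                return attr  # leave private and non-callable attributes alone

            def wrapped(*args, **kwargs):
                config = get_config()  # re-read so changes apply without redeploying
                if config.get("enabled"):
                    sleep(config.get("delay_ms", 0) / 1000)
                return attr(*args, **kwargs)

            return wrapped

    return Proxy()


class HttpClient:
    """Toy client used only to demonstrate the wrapper."""

    def get(self, url):
        return f"GET {url}"


slept = []
http = inject_latency_all(HttpClient(),
                          get_config=lambda: {"enabled": True, "delay_ms": 250},
                          sleep=slept.append)
print(http.get("/users"), slept)
```

Wrapping the whole client in one call means every integration point gets the same injection logic without touching each call site.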
Error injection with serverless
Common errors
HTTP 5XX
DynamoDB provisioned throughput exceeded
Throttled Lambda invocations
Where to inject errors?
hypothesis:
Function has appropriate error handling on its HTTP communications
and can degrade gracefully when downstream dependencies fail
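A hypothesis like this can be exercised by wrapping the downstream call so that a configurable share of calls fail. The sketch below is illustrative only: `InjectedHttpError`, `with_error_injection` and `get_recommendations` are hypothetical names, and a real handler would catch the error type raised by its actual HTTP client as well.

```python
import random


class InjectedHttpError(Exception):
    """Raised in place of a real HTTP 5XX response (hypothetical type)."""

    def __init__(self, status):
        super().__init__(f"injected HTTP {status}")
        self.status = status


def with_error_injection(call, rate, status=503, rng=random):
    """Wrap `call` so that roughly `rate` (0.0 to 1.0) of invocations raise."""

    def wrapped(*args, **kwargs):
        if rng.random() < rate:
            raise InjectedHttpError(status)
        return call(*args, **kwargs)

    return wrapped


def get_recommendations(fetch):
    """Degrades gracefully: an empty list instead of a failed request."""
    try:
        return fetch()
    except InjectedHttpError:
        return []


always_failing = with_error_injection(lambda: ["a", "b"], rate=1.0)
print(get_recommendations(always_failing))
```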
Where to inject errors?
hypothesis:
Function has appropriate error handling on DynamoDB operations and
can degrade gracefully when DynamoDB provisioned throughput is exceeded
Where to inject errors?
Induce Lambda throttling by temporarily setting reserved concurrency.
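One possible way to do this with a rollback built in is a context manager around the Lambda concurrency APIs. The sketch assumes a boto3 Lambda client is passed in (its `get_function_concurrency`, `put_function_concurrency` and `delete_function_concurrency` calls are used); the helper name `reserved_concurrency` and the function name in the usage comment are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def reserved_concurrency(lambda_client, function_name, limit):
    """Temporarily set reserved concurrency on a function and restore the
    previous setting when the experiment ends, even if it ends in an error."""
    current = lambda_client.get_function_concurrency(FunctionName=function_name)
    previous = current.get("ReservedConcurrentExecutions")  # None if not set
    lambda_client.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=limit)
    try:
        yield
    finally:
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=previous)


# Usage (assumes boto3 is installed and AWS credentials are configured):
#
#   import boto3
#   with reserved_concurrency(boto3.client("lambda"), "my-function", 1):
#       ...  # drive traffic and watch how callers handle throttling
```

Restoring the previous value in `finally` keeps the experiment contained if it has to be stopped early.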
Recap
failures are INEVITABLE
the only way to truly know your system’s
resilience against failures is to test it
through CONTROLLED experiments
the goal of chaos engineering is NOT to
actually break production
CONTAINMENT should be front and
centre of your thinking
STEP 1. Define “steady state”
aka. what does a normal, working
condition look like?
STEP 2. Hypothesise steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
STEP 4. Disprove hypothesis
i.e. look for difference in steady state
there is more inherent chaos and
complexity in a serverless application
even without servers, you can still inject
CONTROLLED failures at the application level
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/prod-ready-serverless
get 40% off with:
ytcui
@theburningmonk
theburningmonk.com
github.com/theburningmonk