Applying principles of chaos engineering to Serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Principal Engineer @ DAZN
“Netflix for sports”
offices in London, Leeds, Katowice and Tokyo
We’re hiring ;-)
http://engineering.dazn.com
history of Smallpox
earliest evidence of disease in 3rd Century BC Egyptian Mummy
est. 400K deaths per year in 18th Century Europe.
1798: first vaccine developed (Edward Jenner)
1980: WHO certified global eradication
Applying principles of chaos engineering to serverless (ServerlessCPH)
Vaccination is the most effective method of
preventing infectious diseases
stimulates the immune system to recognize
and destroy the disease before contracting
the disease for real
Chaos Engineering
controlled experiments to help us learn about
our system’s behaviour and build confidence
in its ability to withstand turbulent conditions
it’s about building confidence,
NOT breaking things
I’m gonna inject
you with a deadly
disease now
http://principlesofchaos.org
STEP 1. define “Steady State”
aka. what does a normal, working
condition look like?
this is not a
steady state
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
explore unknown unknowns
away from production
treat production with the
care it deserves
the goal is NOT
to actually hurt production
If you know the system would break,
and you did it anyway…
then it’s NOT a chaos experiment.
It’s called being IRRESPONSIBLE.
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
https://github.com/Netflix/SimianArmy
http://oreil.ly/2tZU1Sn
STEP 4.
disprove hypothesis
i.e. look for difference with steady state
if a WEAKNESS is uncovered,
IMPROVE it before the behaviour
manifests in the system at large
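One way to look for a difference with steady state is to compare the same metrics for the control group and the experiment group. A minimal sketch, assuming the metrics arrive as plain dictionaries; the function name, the metric names and the 10% tolerance are assumptions for illustration, not part of the talk.

```python
def steady_state_disrupted(control, experiment, tolerance_pct=10.0):
    """Return the metrics where the experiment group deviates from the
    control group by more than tolerance_pct (hypothetical threshold)."""
    deviations = {}
    for name, baseline in control.items():
        observed = experiment[name]
        if baseline == 0:
            # no relative change can be computed, so any non-zero value counts
            changed = observed != 0
        else:
            changed = abs(observed - baseline) / abs(baseline) * 100 > tolerance_pct
        if changed:
            deviations[name] = (baseline, observed)
    return deviations
```

An empty result means the hypothesis survived this run; anything else is a weakness to look at before widening the experiment.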
communication
ensure everyone knows what you’re doing
NO surprises!
Timing
run experiments during office hours
AVOID important dates
contain Blast radius
smallest change that allows
you to detect a signal that
steady state is disrupted
rollback at the first sign of
TROUBLE!
don’t try to run before you
know how to walk.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an
AWS Availability Zone
chaos kong kills an
entire AWS region
there is no server…
that you can kill
there is more inherent chaos and
complexity in a Serverless architecture
smaller units of deployment
but A LOT more of them!
more difficult to harden
around boundaries
serverful vs. serverless
more intermediary services,
and greater variety too
e.g. SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
each with its own set of
failure modes
more configurations,
more opportunities for misconfiguration
more unknown failure modes in
infrastructure that we don’t control
often there’s little we can do when an
outage occurs in the platform
improperly tuned timeouts
missing error handling
missing fallback when downstream is unavailable
LATENCY INJECTION
STEP 1. define “Steady State”
aka. what does a normal, working
condition look like?
what metrics do you monitor?
9X-percentile latency
error count
yield (% of requests completed)
harvest (completeness of results)
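As a rough illustration, the four metrics above could be computed from a batch of request records like this. The `Request` record and its field names are assumptions made for the sketch, and it expects at least one request in the batch.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float    # observed latency of the request
    completed: bool      # True if the request returned a response
    items_returned: int  # items the response actually contained
    items_expected: int  # items a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def steady_state(requests):
    completed = [r for r in requests if r.completed]
    returned = sum(r.items_returned for r in completed)
    expected = sum(r.items_expected for r in completed)
    return {
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
        "error_count": len(requests) - len(completed),
        # yield: share of requests that completed
        "yield_pct": 100 * len(completed) / len(requests),
        # harvest: how complete the completed responses were
        "harvest_pct": 100 * returned / expected if expected else 0.0,
    }
```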
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
API Gateway
consider the effect of cold-starts
& API Gateway overhead
use short timeouts for API calls
the goal of a timeout strategy is to give HTTP
requests the best chance to succeed,
provided that doing so does not cause the
calling function itself to err
fixed timeouts are tricky to get right…
too short and you don’t
give requests the best
chance to succeed
too long and you run the
risk of letting the request
time out the calling function
and it gets worse when you make multiple
API calls in one function…
set the request timeout based on the
amount of invocation time left
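A minimal sketch of that idea for a Python Lambda function. `context.get_remaining_time_in_millis()` is provided by the Lambda runtime; the helper name and the 500 ms reserve are assumptions chosen for illustration.

```python
def request_timeout_ms(remaining_ms, reserve_ms=500):
    """Timeout for the next HTTP call: the invocation time left, minus a
    reserve the function keeps to log the incident and return a fallback.
    A result of 0 means there is no time left to attempt the call."""
    return max(0, remaining_ms - reserve_ms)


def handler(event, context):
    # ask the Lambda runtime how much invocation time is left
    timeout_ms = request_timeout_ms(context.get_remaining_time_in_millis())
    # pass timeout_ms / 1000 as the timeout of whichever HTTP client is in use
    return {"timeout_ms": timeout_ms}
```

When a function makes several API calls in a row, recomputing the timeout before each call lets later calls shrink to fit whatever time the earlier ones left.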
log the timeout incident with
as much context as possible
e.g. timeout value, correlation IDs,
request object, …
report custom metrics
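Logging the timeout incident could look something like the sketch below: one structured record per timeout, carrying the context listed above. The field names and the logger name are assumptions, not a format taken from the talk.

```python
import json
import logging

logger = logging.getLogger("http-timeouts")  # hypothetical logger name


def timeout_log_entry(url, timeout_ms, correlation_id, request):
    """Build one structured record describing a timed-out request."""
    return {
        "level": "ERROR",
        "message": "HTTP request timed out",
        "url": url,
        "timeout_ms": timeout_ms,
        "correlation_id": correlation_id,
        "request": request,
    }


def log_timeout(url, timeout_ms, correlation_id, request):
    entry = timeout_log_entry(url, timeout_ms, correlation_id, request)
    # one JSON document per line keeps the record easy to search and aggregate
    logger.error(json.dumps(entry))
    return entry
```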
be mindful when you sacrifice precision for
availability; user experience is king
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
where to inject latency?
hypothesis:
function has appropriate timeout on its HTTP
communications and can degrade gracefully
when these requests time out
should also be applied to the 3rd-party
services we depend on, e.g. DynamoDB
what’s the blast radius?
[diagram: public-api-a and public-api-b each call internal-api through an http client]
hypothesis:
all functions have appropriate timeout on
their HTTP communications to this internal
API, and can degrade gracefully when
requests are timed out
large blast radius, risky..
could be effective when used away from
production environment, to weed out
weaknesses quickly
not priming developers to
build more resilient systems
development
production
Priming (psychology):
Priming is a technique whereby exposure to one
stimulus influences a response to a subsequent
stimulus, without conscious guidance or intention.
It is a technique in psychology used to train a
person's memory both in positive and negative ways.
make dev environments better resemble the
turbulent conditions you should realistically
expect your system to survive in production
hypothesis:
the client app has appropriate timeout on
its HTTP communication with the server,
and can degrade gracefully when requests
are timed out
STEP 4.
disprove hypothesis
i.e. look for difference with steady state
how to inject latency?
static weaver (e.g. AspectJ, PostSharp),
or dynamic proxies
https://theburningmonk.com/2015/04/design-for-latency-issues/
manually crafted wrapper library
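A manually crafted wrapper can be very small. The sketch below delays a share of calls before they run; the config keys (`is_enabled`, `probability`, `min_delay_ms`, `max_delay_ms`) and the function name are assumptions for illustration, and `sleep` and `rng` are parameters only so the behaviour can be tested without waiting.

```python
import random
import time


def with_injected_latency(func, config, sleep=time.sleep, rng=random):
    """Wrap func so that a configurable share of calls is delayed first."""
    def wrapper(*args, **kwargs):
        if config.get("is_enabled") and rng.random() < config.get("probability", 0):
            # pick a delay between the configured bounds (milliseconds)
            delay_ms = rng.uniform(config["min_delay_ms"], config["max_delay_ms"])
            sleep(delay_ms / 1000)
        return func(*args, **kwargs)
    return wrapper
```

Wrapping the function that makes the HTTP call, rather than the handler, keeps the injected delay where a slow downstream would really show up.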
configured in SSM Parameter Store
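Keeping the settings in a parameter means an experiment can be switched on or off without a redeploy. A sketch of the reading side, assuming the parameter holds a JSON document: `fetch` stands in for whatever reads the raw value (for example a call to SSM `GetParameter`), and the keys, defaults and 60 second cache are assumptions, not the talk's actual configuration.

```python
import json
import time

# safe defaults: injection stays off unless the parameter says otherwise
DEFAULTS = {"is_enabled": False, "probability": 0.0, "min_delay_ms": 0, "max_delay_ms": 0}


def parse_chaos_config(raw):
    """Merge a JSON parameter value over the defaults; bad input disables injection."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return dict(DEFAULTS)
    if not isinstance(parsed, dict):
        return dict(DEFAULTS)
    return {**DEFAULTS, **parsed}


def cached_config(fetch, ttl_seconds=60, clock=time.monotonic):
    """Return a loader that re-reads the parameter at most once per ttl_seconds."""
    state = {"value": None, "expires": 0.0}

    def load():
        now = clock()
        if state["value"] is None or now >= state["expires"]:
            state["value"] = parse_chaos_config(fetch())
            state["expires"] = now + ttl_seconds
        return state["value"]

    return load
```

Falling back to the disabled defaults on unreadable input keeps a typo in the parameter from turning into an unplanned experiment.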
no injected latency
with injected latency
factory wrapper function
(think bluebird’s promisifyAll function)
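In Python, a factory wrapper in that spirit could return a proxy that delays every public method of a client object. This is a sketch under assumptions: the names `inject_latency_all` and `_LatencyProxy` and the config keys are made up for illustration.

```python
import random
import time


class _LatencyProxy:
    """Forwards attribute access to target, delaying public method calls."""

    def __init__(self, target, config, sleep, rng):
        self._target = target
        self._config = config
        self._sleep = sleep
        self._rng = rng

    def __getattr__(self, name):
        # only reached for names not set in __init__, i.e. the target's own
        attr = getattr(self._target, name)
        if name.startswith("_") or not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            cfg = self._config
            if cfg.get("is_enabled") and self._rng.random() < cfg.get("probability", 0):
                delay_ms = self._rng.uniform(cfg["min_delay_ms"], cfg["max_delay_ms"])
                self._sleep(delay_ms / 1000)
            return attr(*args, **kwargs)

        return wrapper


def inject_latency_all(client, config, sleep=time.sleep, rng=random):
    """Factory: wrap every public method of client with latency injection."""
    return _LatencyProxy(client, config, sleep, rng)
```

Callers keep using the client exactly as before, which makes it cheap to wrap an SDK client once at module load.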
ERROR INJECTION
failures are INEVITABLE
the only way to truly know your system’s
resilience against failures is to test it
through controlled experiments
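Error injection can follow the same wrapper pattern as latency injection. A minimal sketch, with hypothetical names and config keys: a share of calls raises instead of running, and the caller under test is expected to degrade to a fallback.

```python
import random


class InjectedError(Exception):
    """Raised in place of a real downstream failure during an experiment."""


def with_injected_errors(func, config, rng=random):
    """Wrap func so that a configurable share of calls fails before running."""
    def wrapper(*args, **kwargs):
        if config.get("is_enabled") and rng.random() < config.get("probability", 0):
            raise InjectedError(config.get("message", "injected failure"))
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(fetch, fallback):
    """The behaviour the experiment checks: degrade gracefully on failure."""
    try:
        return fetch()
    except Exception:
        return fallback
```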
vaccinate your serverless
architecture against failures
Yan Cui
@theburningmonk
theburningmonk.com
github.com/theburningmonk
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/production-ready-serverless
get 40% off with code:
ytcui