Applying principles of chaos engineering to serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Offices in London and Katowice
WE’RE HIRING!
https://bit.ly/2HfLzGE
history of Smallpox
earliest evidence of disease in 3rd Century BC Egyptian Mummy
est. 400K deaths per year in 18th Century Europe.
1798: first vaccine developed (Edward Jenner)
1980: WHO certified global eradication
Vaccination is the most effective method of
preventing infectious diseases
stimulates the immune system to recognize
and destroy the disease before contracting
the disease for real
Chaos Engineering
controlled experiments to help us learn about
our system’s behaviour and build confidence
in its ability to withstand turbulent conditions
it’s about building confidence,
NOT breaking things
I’m gonna inject
you with a deadly
disease now
http://principlesofchaos.org
STEP 1. define “Steady State”
aka what does a normal, working
condition look like?
this is not a
steady state
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
explore unknown unknowns
away from production
treat production with the
care it deserves
the goal is NOT
to actually hurt production
If you know the system would break,
and you did it anyway…
then it’s NOT a chaos experiment.
It’s called being IRRESPONSIBLE.
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
https://github.com/Netflix/SimianArmy
http://oreil.ly/2tZU1Sn
STEP 4.
disprove hypothesis
i.e. look for a difference from steady state
if a WEAKNESS is uncovered,
improve it before the behaviour
manifests in the system at large
Chaos Engineering
controlled experiments to help us learn about
our system’s behaviour and build confidence
in its ability to withstand turbulent conditions
communication
ensure everyone knows what you’re doing
NO surprises!
Timing
run experiments during office hours
avoid important dates
contain Blast radius
smallest change that allows
you to detect a signal that
steady state is disrupted
rollback at the first sign of
TROUBLE!
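A minimal sketch of the "rollback at the first sign of trouble" rule. It assumes steady state is tracked as an error rate and that the experiment is controlled by a simple flag. The `Experiment` class, its fields and the thresholds are hypothetical names chosen for illustration, not part of the original talk.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical experiment toggle with a simple abort condition."""
    baseline_error_rate: float  # steady-state error rate, e.g. 0.01
    tolerance: float            # allowed absolute deviation before rollback
    enabled: bool = True

    def observe(self, errors: int, requests: int) -> bool:
        """Check one observation window; disable the experiment on trouble.

        Returns True if the experiment is still running afterwards.
        """
        if not self.enabled or requests == 0:
            return self.enabled
        error_rate = errors / requests
        if error_rate > self.baseline_error_rate + self.tolerance:
            self.enabled = False  # roll back: stop injecting failures
        return self.enabled
```

Once disabled, the experiment stays off until someone deliberately starts it again, which keeps the blast radius from growing while the deviation is investigated.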
Don’t try to run before you
know how to walk.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an
AWS Availability Zone
chaos kong kills an
entire AWS region
there is no server…
that you can kill
there is more inherent chaos and
complexity in a Serverless architecture
smaller units of deployment
and A LOT more of them!
more difficult to harden
around boundaries
[diagram: serverful vs. serverless; intermediary services shown: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES]
more intermediary services,
and greater variety too
each with its own set of
failure modes
more configurations,
more opportunities for misconfiguration
more unknown failure modes in
infrastructure that we don’t control
often there’s little we can do when an
outage occurs in the platform
improperly tuned timeouts
missing error handling
missing fallback when downstream is unavailable
LATENCY INJECTION
STEP 1. define “Steady State”
aka what does a normal, working
condition look like?
what metrics do you monitor?
9X-percentile latency
error count
yield (% of requests completed)
harvest (completeness of results)
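The metrics above can be sketched as small functions. This is an illustrative sketch only: it assumes latencies are collected as a list of millisecond samples and uses the nearest-rank definition of a percentile. The function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_yield(completed: int, total: int) -> float:
    """Yield: fraction of requests that completed."""
    return completed / total if total else 1.0


def harvest(returned_items: int, expected_items: int) -> float:
    """Harvest: fraction of the full result that a response actually contained."""
    return returned_items / expected_items if expected_items else 1.0
```

A steady state could then be written down as a set of bounds on these values, for example a ceiling on the 9X-percentile latency and a floor on yield.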
STEP 2.
hypothesize steady state will
continue in both control group
& the experiment group
i.e. you should have a reasonable degree of
confidence the system would handle the failure
before you proceed with the experiment
API Gateway
consider the effect of cold-starts
& API Gateway overhead
use short timeouts for API calls
the goal of a timeout strategy is to give HTTP
requests the best chance to succeed,
provided that doing so does not cause the
calling function itself to err
fixed timeouts are tricky to get right…
too short and you don’t
give requests the best
chance to succeed
too long and you run the
risk of letting the request
time out the calling function
and it gets worse when you make multiple
API calls in one function…
set the request timeout based on the
amount of invocation time left
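A minimal sketch of this idea for a Python Lambda handler, where the `context` object exposes `get_remaining_time_in_millis()`. The `reserve_ms` and `max_ms` parameters and their defaults are assumptions for illustration. The reserve is time held back so the function can still log, fall back and respond after a request times out.

```python
def request_timeout_seconds(context, reserve_ms: int = 500, max_ms: int = 3000) -> float:
    """Derive an HTTP timeout from the time left in this Lambda invocation.

    reserve_ms is kept aside for error handling after a timeout;
    max_ms caps the timeout even when plenty of invocation time remains.
    Both defaults are illustrative assumptions.
    """
    remaining_ms = context.get_remaining_time_in_millis()
    budget_ms = min(max_ms, remaining_ms - reserve_ms)
    return max(budget_ms, 0) / 1000
```

A result of 0.0 means there is no time left to make the call safely, so the caller would likely skip the request and go straight to its fallback.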
log the timeout incident with
as much context as possible
e.g. timeout value, correlation IDs,
request object, …
report custom metrics
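One way to capture that context is a single structured log line per timeout. The sketch below assumes JSON logs; the field names are hypothetical and would follow whatever logging convention the system already uses.

```python
import json
import time


def timeout_log_entry(url: str, timeout_ms: int, correlation_id: str, request: dict) -> str:
    """Build one structured (JSON) log line describing a timed-out request."""
    return json.dumps({
        "level": "ERROR",
        "message": "request timed out",
        "url": url,
        "timeout_ms": timeout_ms,          # the timeout value that was in force
        "correlation_id": correlation_id,  # ties the log to the user request
        "request": request,                # the request object that timed out
        "timestamp_ms": int(time.time() * 1000),
    })
```

In a handler this would be printed or passed to the logger in the `except` branch that catches the timeout, alongside incrementing a timeout counter as a custom metric.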
be mindful when you sacrifice precision for
availability; user experience is king
STEP 3.
inject realistic failures
e.g. server crash, network error,
HD malfunction, etc.
where to inject latency?
hypothesis:
function has appropriate timeout on its HTTP
communications and can degrade gracefully
when these requests time out
should also be applied to 3rd-party
services we depend on, e.g. DynamoDB
what’s the blast radius?
[diagram: http clients, public-api-a, public-api-b, internal-api]
hypothesis:
all functions have appropriate timeout on
their HTTP communications to this internal
API, and can degrade gracefully when
requests are timed out
large blast radius, risky…
could be effective when used away from
production environment, to weed out
weaknesses quickly
not priming developers to
build more resilient systems
[diagram: development vs. production]
Priming (psychology):
Priming is a technique whereby exposure to one
stimulus influences a response to a subsequent
stimulus, without conscious guidance or intention.
It is a technique in psychology used to train a
person's memory both in positive and negative ways.
make dev environments better resemble the
turbulent conditions you should realistically
expect your system to survive in production
hypothesis:
the client app has appropriate timeouts on
its HTTP communication with the server,
and can degrade gracefully when requests
time out
STEP 4.
disprove hypothesis
i.e. look for a difference from steady state
how to inject latency?
static weaver (e.g. AspectJ, PostSharp),
or dynamic proxies
https://theburningmonk.com/2015/04/design-for-latency-issues/
manually crafted wrapper library
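A hand-written wrapper can be as small as a decorator. The sketch below is an assumption about what such a wrapper might look like in Python, not the code shown in the talk. `with_latency` and its parameters are hypothetical names, and `sleep` and `rng` are injectable only so the behaviour can be tested without real delays.

```python
import functools
import random
import time


def with_latency(min_ms: int, max_ms: int, rate: float = 1.0,
                 sleep=time.sleep, rng=random.random):
    """Decorator: delay a fraction `rate` of calls by min_ms..max_ms before running them."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                # Pick a delay uniformly between the two bounds.
                delay_ms = min_ms + (max_ms - min_ms) * rng()
                sleep(delay_ms / 1000)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Applied to the function that makes an HTTP call, this lets an experiment check whether the caller's timeout and fallback actually trigger.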
configured in SSM Parameter Store
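The slide does not show the parameter's contents, so the JSON shape below (`enabled`, `probability`, `minDelayMs`, `maxDelayMs`) is purely a hypothetical example. The sketch takes the raw parameter value as a string, however it was fetched, and falls back to "injection off" when the value is missing or malformed.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyConfig:
    """Hypothetical latency-injection settings; defaults mean 'off'."""
    enabled: bool = False
    probability: float = 0.0
    min_delay_ms: int = 0
    max_delay_ms: int = 0


def parse_latency_config(raw: str) -> LatencyConfig:
    """Parse the parameter value; return the 'off' config on bad input."""
    try:
        data = json.loads(raw)
        return LatencyConfig(
            enabled=bool(data.get("enabled", False)),
            probability=float(data.get("probability", 0.0)),
            min_delay_ms=int(data.get("minDelayMs", 0)),
            max_delay_ms=int(data.get("maxDelayMs", 0)),
        )
    except (ValueError, TypeError, AttributeError):
        # Malformed or unexpected config must never break the function itself.
        return LatencyConfig()
```

Failing closed matters here: a typo in the configuration should turn the experiment off rather than take the function down.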
no injected latency
with injected latency
factory wrapper function
(think bluebird’s promisifyAll function)
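A rough Python analogue of a factory wrapper: one call takes an existing client object and returns a stand-in whose public methods are all delayed. `wrap_client` and `LatencyProxy` are hypothetical names, and the fixed delay is a simplification for the sketch.

```python
import time


class LatencyProxy:
    """Wraps any client object and delays every public method call."""

    def __init__(self, target, delay_ms: int, sleep=time.sleep):
        self._target = target
        self._delay_ms = delay_ms
        self._sleep = sleep

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, name)
        if name.startswith("_") or not callable(attr):
            return attr  # plain attributes pass through untouched

        def delayed(*args, **kwargs):
            self._sleep(self._delay_ms / 1000)
            return attr(*args, **kwargs)

        return delayed


def wrap_client(client, delay_ms: int, sleep=time.sleep):
    """Factory: return a proxy of `client` with latency added to its methods."""
    return LatencyProxy(client, delay_ms, sleep)
```

The calling code keeps using the client exactly as before, so injection can be switched on for a whole SDK client without touching each call site.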
ERROR INJECTION
failures are INEVITABLE
the only way to truly know your system’s
resilience against failures is to test it
through controlled experiments
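A minimal sketch of error injection, assuming a decorator that makes a configurable fraction of calls fail before they reach the real dependency. `with_errors` and `InjectedError` are hypothetical names. `rng` is injectable so the failure rate can be tested deterministically.

```python
import functools
import random


class InjectedError(Exception):
    """Raised in place of a real call when error injection fires."""


def with_errors(rate: float, rng=random.random):
    """Decorator: fail a fraction `rate` of calls with InjectedError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                raise InjectedError(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

The hypothesis under test is then that callers catch the failure and degrade gracefully, for example by serving a cached or default response.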
vaccinate your serverless
architecture against failures
Yan Cui
@theburningmonk
theburningmonk.com
github.com/theburningmonk
