Applying principles of chaos engineering to serverless (reinvent DVC305)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Applying Principles of Chaos
Engineering to Serverless
Yan Cui
Principal Engineer
DAZN
DVC305
Agenda
What is chaos engineering?
New challenges with serverless
Applying latency injection to serverless
Applying error injection to serverless
After the talk
Slides will be shared on Slideshare
Recording will be posted on YouTube within 48 hours
Find the links on https://theburningmonk.com/reinvent2018
What is chaos engineering?
Chaos engineering is the discipline of experimenting on a distributed system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
- principlesofchaos.org
Smallpox
Earliest evidence of disease in third century BC Egyptian mummy
Estimated 400K deaths per year in eighteenth century Europe
History of vaccination
First vaccine was developed in 1798 by Edward Jenner
WHO certified global eradication of smallpox in 1980
Vaccination is the most effective method to prevent infectious diseases
Vaccines stimulate the immune system to recognize and destroy the
pathogen before you contract the disease for real
https://en.wikipedia.org/wiki/Edward_Jenner
https://en.wikipedia.org/wiki/Vaccine
Chaos engineering
Use controlled experiments to inject failures into our system
Helps us learn about our system’s behavior and uncover unknown failure
modes before they spread like wildfire in production
Lets us build confidence in its ability to withstand turbulent conditions
Chaos engineering is the vaccine against frailties in modern software
Who am I?
Principal engineer at DAZN
AWS Serverless hero
Author of Production-Ready Serverless* course by Manning.
Blogger**, speaker.
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com
About DAZN
Available in seven countries: Austria, Switzerland, Germany,
Japan, Canada, Italy, and USA
Available on 30+ platforms
Around 1,000,000 concurrent viewers at peak
Chaos engineering has an image problem
Too much emphasis is on breaking things
Easy to conflate the action of injecting failures with the payoff
The goal is to learn about the system and build confidence
The goal is not to break things
Chaos in practice
Four steps to start running chaos
experiments yourself
STEP 1. Define “steady state”
What does the normal, working
condition look like?
this is not a
steady state
STEP 2. Hypothesize steady state will
continue in both the control group
& the experiment group
In other words, you should have a reasonable degree of
confidence the system would handle the failure before you
proceed with the experiment
Chaos in practice
Explore unknown unknowns away from production
Experiments that graduate to production should be carefully
considered and planned
You should have reasonable confidence in the system before
running experiments in production
Treat production with the care it deserves
The goal is not to break things
If you knew the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called being irresponsible.
STEP 3. Inject realistic failures
For example, server crash, network
error, hard drive malfunction, and more
Chaos in practice
Netflix’s Simian Army:
https://github.com/Netflix/SimianArmy
Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
STEP 4. Disprove hypothesis
In other words, look for a difference
in steady state
Chaos in practice
Look for evidence that steady state was impacted by the
injected failure
Address weaknesses before failures happen for real
Containment
Experiments need to be controlled
The goal is not to break things
Ensure everyone knows what you are doing
Don’t surprise your teammates
Run experiments during office hours
Avoid important dates
Make the smallest change necessary to prove or disprove the hypothesis
Have a rollback plan
Stop the experiment right away if things start to go wrong
Don’t start in production
You can learn a lot by running experiments in staging
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
New challenges with serverless
Chaos Monkey kills an
Amazon Elastic Compute Cloud
(Amazon EC2) instance
Latency Monkey induces
artificial delay in APIs
Chaos Gorilla kills an AWS
Availability Zone
Chaos Kong kills an entire
AWS region
Serverless challenges
There are no servers that you can access and kill
There is more inherent chaos and complexity in a
serverless architecture.
Smaller units of deployment, but a lot more of them
[Diagram labels: serverful, serverless]
Every function needs to be correctly configured and secured
Serverless challenges
[Diagram labels: Kinesis, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES]
A lot of managed, intermediate services
Each with its own set of failure modes
Unknown failure modes in the infrastructure we don’t control
Often there’s little we can do when an outage occurs in the platform
Common weaknesses
Improperly tuned timeouts
Missing error handling
Missing fallback
Missing regional failover
Latency injection with serverless
STEP 1. Define “steady state”
What does the normal, working
condition look like?
Defining steady state
What metrics do you use?
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentage of requests completed
** completeness of the returned response
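The steady-state metrics above could be computed along these lines. This is a minimal sketch, not code from the talk: the `RequestRecord` fields and the nearest-rank percentile method are assumptions made for illustration, and backlog size is left out because it depends on the queue or stream in use.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One observed request. Field names are hypothetical."""
    latency_ms: float
    completed: bool        # did the caller get a response at all?
    fields_returned: int   # parts of the response that were populated
    fields_expected: int   # parts a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(records):
    """Summarise a non-empty window of requests into steady-state metrics."""
    completed = [r for r in records if r.completed]
    latencies = [r.latency_ms for r in records]
    expected = sum(r.fields_expected for r in completed)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(records) - len(completed),
        # yield: fraction of requests that completed
        "yield": len(completed) / len(records),
        # harvest: how complete the returned responses were
        "harvest": (sum(r.fields_returned for r in completed) / expected
                    if expected else 0.0),
    }
```

Comparing this summary between the control group and the experiment group is one way to check whether steady state held during an experiment.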
STEP 2. Hypothesize steady state will
continue in both the control group
& the experiment group
In other words, you should have a reasonable degree of
confidence the system would handle the failure before you
proceed with the experiment
Serverless considerations
[Diagram label: API Gateway]
Consider the effect of cold starts
How do they affect your strategy
for handling slow responses?
Request timeouts
Strategy should:
1. Give requests the best chance to succeed
2. Not allow slow responses to time out the calling function
Finding the right timeout value is tricky
Too short: requests are not given the best chance to succeed
Too long: risk timing out the calling function
Even more complicated when you have multiple integration points
Approach 1: Split invocation time equally
(for example, 3 requests, 6s function timeout = 2s timeout per request)
Approach 2: Every request is given nearly all the invocation time
(for example, 3 requests, 6s function timeout = 5s timeout per request)
Request timeouts
Proposal: set request timeouts dynamically based on
invocation time left
Set timeout based on remaining invocation time
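A minimal sketch of the proposal, assuming a Python Lambda handler. `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object. The helper name and the `reserve_ms` and `floor_ms` values are hypothetical and would need tuning per function.

```python
def request_timeout_s(context, reserve_ms=500, floor_ms=100):
    """Derive a per-request timeout from the invocation time left.

    reserve_ms is held back so the function can still log, record
    metrics and return a fallback after the request times out.
    floor_ms stops the timeout from collapsing to zero or going negative.
    """
    remaining_ms = context.get_remaining_time_in_millis()
    return max(floor_ms, remaining_ms - reserve_ms) / 1000.0


# Usage inside a handler (the HTTP client is your choice, for example):
#   urllib.request.urlopen(url, timeout=request_timeout_s(context))
```

Because the timeout is recomputed before every call, the first request can use nearly all of the invocation time while later requests automatically get less.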
Recovery steps
Log the timeout with as much context as possible
The API, timeout value, correlation IDs, request object, and more
Record custom metrics
Use fallbacks
Be mindful when you sacrifice precision for availability
User experience is king
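The recovery steps can be sketched as one helper. Everything here is an assumption made for illustration: the `call_with_fallback` name, the in-memory `metrics` counter standing in for a real metrics client, and the expectation that the client raises the built-in `TimeoutError` (many HTTP libraries raise their own timeout type instead).

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("timeouts")
metrics = Counter()  # hypothetical stand-in for a real metrics client


def call_with_fallback(api_name, call, fallback, timeout_s,
                       correlation_id, request):
    """Run call(timeout_s); on timeout, log context, count it, fall back."""
    try:
        return call(timeout_s)
    except TimeoutError:
        # Log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeout_s": timeout_s,
            "correlation_id": correlation_id,
            "request": request,
        }, default=str))
        # Record a custom metric so timeouts show up on dashboards.
        metrics[f"{api_name}.timeout"] += 1
        # Serve a less precise answer instead of failing the caller.
        return fallback
```

Whether a fallback is acceptable depends on the feature: an empty recommendations list may be fine, while a stale account balance may not be.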
STEP 3. Inject realistic failures
For example, server crash, network
error, hard drive malfunction, and more
Where to inject latency?
Hypothesis:
Function has appropriate timeouts on its HTTP communications
and can degrade gracefully when these requests time out
Should be applied to third-party services too
DynamoDB, Twilio, Auth0 …
Be mindful of the blast radius of the experiment
The goal is not to break things
[Diagram labels: http client, public-api-a, http client, public-api-b, internal-api]
Hypothesis:
All functions have appropriate timeouts on their HTTP
communications to this internal API and can degrade
gracefully when requests time out
Large blast radius, can cause cascading failures unintentionally
Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulus, without
conscious guidance or intention
It is a technique in psychology used to train a person's
memory both in positive and negative ways
Use failure injection to program your colleagues into
thinking about failure modes early.
Where to inject latency?
Make X% of all requests slow
in the dev environment
Hypothesis:
The client app has appropriate timeouts on its HTTP
communication with the server and can degrade gracefully
when requests time out
STEP 4. Disprove hypothesis
In other words, look for a difference
in steady state
How to inject latency?
Static weavers (such as PostSharp, AspectJ)
Dynamic proxies
https://theburningmonk.com/2015/04/design-for-latency-issues/
Manually crafted wrapper libraries
Configured in SSM Parameter Store
No injected latency
With injected latency
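A hand-written latency wrapper might look like the sketch below. The original code slides did not survive extraction, so none of this is the author's implementation: `ChaosConfig`, its fields and `with_latency` are hypothetical names. `load_config` is any callable that returns the current settings. In a real function it could read and cache a parameter from SSM Parameter Store, which this sketch deliberately does not do.

```python
import functools
import random
import time
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    """Hypothetical settings, e.g. parsed from a JSON parameter."""
    is_enabled: bool = False
    probability: float = 0.0   # fraction of calls to slow down (0.0 to 1.0)
    min_delay_ms: float = 0
    max_delay_ms: float = 0


def with_latency(func, load_config, sleep=time.sleep, rng=random):
    """Wrap func so that a configurable share of calls is delayed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        cfg = load_config()  # re-read on every call so changes apply quickly
        if cfg.is_enabled and rng.random() < cfg.probability:
            delay_ms = rng.uniform(cfg.min_delay_ms, cfg.max_delay_ms)
            sleep(delay_ms / 1000.0)
        return func(*args, **kwargs)
    return wrapper
```

Keeping the switch in external configuration means the experiment can be started, narrowed or stopped without redeploying the function, which supports the containment rules earlier in the deck.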
Factory wrapper function
(think bluebird’s promisifyAll function)
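In the spirit of promisifyAll, a factory could wrap every public method of a client in one call. This is a sketch under assumptions: `wrap_all` and `delay_by` are hypothetical helpers, and the approach only works for objects that allow their attributes to be reassigned.

```python
import functools
import time


def delay_by(delay_s, sleep=time.sleep):
    """Build a decorator that delays each call by a fixed amount."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def wrap_all(target, decorator):
    """Apply decorator to every public callable attribute of target."""
    for name in dir(target):
        if name.startswith("_"):
            continue  # leave private and dunder members untouched
        attr = getattr(target, name)
        if callable(attr):
            setattr(target, name, decorator(attr))
    return target
```

The calling code keeps using the same client object, so the experiment can be switched on at the point where the client is created.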
Error injection with serverless
Common errors
HTTP 5XX
Amazon DynamoDB provisioned throughput exceeded
Throttled AWS Lambda invocations
Where to inject errors?
Hypothesis:
Function has appropriate error handling on its HTTP communications
and can degrade gracefully when downstream dependencies fail
Hypothesis:
Function has appropriate error handling on DynamoDB operations and
can degrade gracefully when DynamoDB throughput is exceeded
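Error injection can follow the same wrapper idea as latency injection. In this sketch `ThroughputExceeded` is a hypothetical local exception. With boto3 the real failure arrives as a `botocore.exceptions.ClientError` whose error code is `ProvisionedThroughputExceededException`. `with_errors` and `get_profile` are also illustrative names.

```python
import functools
import random


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for a DynamoDB throttling error."""


def with_errors(func, probability, make_error, rng=random):
    """Wrap func so that a share of calls raises make_error() instead."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            raise make_error()
        return func(*args, **kwargs)
    return wrapper


def get_profile(read_item, user_id, default=None):
    """The behaviour under test: degrade gracefully when throttled."""
    try:
        return read_item(user_id)
    except ThroughputExceeded:
        return default
```

The hypothesis holds if callers of `get_profile` keep working, with a default value, while the wrapped read is failing.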
Induce Lambda throttling by temporarily setting reserved concurrency
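One way to make the change temporary is a context manager that restores the previous setting. This is a sketch, assuming `lambda_client` is a boto3 Lambda client (`boto3.client("lambda")`). The three calls used are boto3 Lambda client methods, while `throttled` itself is a hypothetical helper. A limit of 0 throttles every invocation, so a higher limit gives a smaller blast radius.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, limit=0):
    """Temporarily cap a function's reserved concurrency, then restore it."""
    previous = lambda_client.get_function_concurrency(
        FunctionName=function_name
    ).get("ReservedConcurrentExecutions")
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    try:
        yield
    finally:
        # Roll back even if the experiment body raises.
        if previous is None:
            lambda_client.delete_function_concurrency(
                FunctionName=function_name
            )
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=previous,
            )
```

The `finally` block is the rollback plan from the containment slides: the limit is restored as soon as the experiment ends or is aborted.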
Recap
Failures are INEVITABLE
The only way to truly know your system’s
resilience against failures is to test it
through CONTROLLED experiments
The goal of chaos engineering is NOT to
actually break production
CONTAINMENT should be front and
centre of your thinking
STEP 1. Define “steady state”
STEP 2. Hypothesize steady state will continue in both the control group & the experiment group
STEP 3. Inject realistic failures
STEP 4. Disprove hypothesis
There is more inherent chaos and
complexity in a serverless application
Even without servers, you can still inject
CONTROLLED failures at the application level
Thank you!
Yan Cui
@theburningmonk
https://theburningmonk.com
Related breakouts
Wednesday, Nov 28
SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications
4:00PM – 5:00PM | Venetian, Level 4, Lando 4305
Wednesday, Nov 28
SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway
4:45PM – 5:45PM | MGM, Level 1, South Concourse 105
Thursday, Nov 29
ARC308 - Chaos Engineering and Scalability at Audible.com
1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5
Please complete the session
survey in the mobile app.

More Related Content

PDF
Introduction to Chaos Engineering with Microsoft Azure
PDF
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
PPTX
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
PPTX
Next Level Chaos Engineering - Chaos Conf 2018
PDF
InfoQ Live - Reducing Uncertainty in Software Delivery - Building reliability...
PDF
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
PPTX
Jets: A Ruby Serverless Framework
PDF
Amazon guard duty_lab
Introduction to Chaos Engineering with Microsoft Azure
Applying principles of chaos engineering to Serverless (CodeMotion Berlin)
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
Next Level Chaos Engineering - Chaos Conf 2018
InfoQ Live - Reducing Uncertainty in Software Delivery - Building reliability...
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
Jets: A Ruby Serverless Framework
Amazon guard duty_lab

What's hot (9)

PDF
5 Essential Techniques for Building Fault-tolerant Systems
PPTX
Chaos Engineering with Containers - QCon SF 2018
PDF
Akamai Tech day Amsterdam 2019
PDF
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
PPTX
Chaos engineering & Gameday on AWS
PDF
IoT from Cloud to Edge & Back Again - WebSummit 2018
PDF
Akamai Tech day Amsterdam 2019
PPTX
re:Invent CON320 Tracing and Debugging for Containerized Services
PDF
Top 10 Tips for Securing and Scaling Atlassian Cloud
5 Essential Techniques for Building Fault-tolerant Systems
Chaos Engineering with Containers - QCon SF 2018
Akamai Tech day Amsterdam 2019
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
Chaos engineering & Gameday on AWS
IoT from Cloud to Edge & Back Again - WebSummit 2018
Akamai Tech day Amsterdam 2019
re:Invent CON320 Tracing and Debugging for Containerized Services
Top 10 Tips for Securing and Scaling Atlassian Cloud

Similar to Applying principles of chaos engineering to serverless (reinvent DVC305) (20)

PDF
Applying principles of chaos engineering to serverless (O'Reilly Software Arc...
PDF
Yan Cui - Applying principles of chaos engineering to Serverless - Codemotion...
PDF
Yan Cui - Applying principles of chaos engineering to Serverless - Codemotion...
PDF
Principles Of Chaos Engineering - Chaos Engineering Hamburg
PPTX
Chaos Engineering when you're not Netflix
PDF
Principles of Chaos Engineering
PDF
Chaos Engineering Talk at DevOps Days Austin
PPTX
Introduction to Chaos Engineering
PPTX
Chaos engineering
PPTX
ChaosEngineeringITEA.pptx
PDF
Chaos Engineering
PDF
Chaos Engineering – why we should all practice breaking things on purpose by ...
PPTX
Green Custard Friday Talk 19: Chaos Engineering
PDF
Chaos Engineering
PDF
Chaos Engineering - The Art of Breaking Things in Production
PDF
Using chaos to bring resiliency to your applications
PPTX
CHAOS ENGINEERING – OR LET'S SHAKE THE TREE
PDF
Chaos Engineering, When should you release the monkeys?
PDF
Introduction to Chaos Engineering | SRECon Asia - Ana Medina
PPTX
#ATAGTR2021 Presentation : "Chaos engineering: Break it to make it" by Anupa...
Applying principles of chaos engineering to serverless (O'Reilly Software Arc...
Yan Cui - Applying principles of chaos engineering to Serverless - Codemotion...
Yan Cui - Applying principles of chaos engineering to Serverless - Codemotion...
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Chaos Engineering when you're not Netflix
Principles of Chaos Engineering
Chaos Engineering Talk at DevOps Days Austin
Introduction to Chaos Engineering
Chaos engineering
ChaosEngineeringITEA.pptx
Chaos Engineering
Chaos Engineering – why we should all practice breaking things on purpose by ...
Green Custard Friday Talk 19: Chaos Engineering
Chaos Engineering
Chaos Engineering - The Art of Breaking Things in Production
Using chaos to bring resiliency to your applications
CHAOS ENGINEERING – OR LET'S SHAKE THE TREE
Chaos Engineering, When should you release the monkeys?
Introduction to Chaos Engineering | SRECon Asia - Ana Medina
#ATAGTR2021 Presentation : "Chaos engineering: Break it to make it" by Anupa...

More from Yan Cui (20)

PDF
How to win the game of trade-offs
PDF
How to choose the right messaging service
PDF
How to choose the right messaging service for your workload
PDF
Patterns and practices for building resilient serverless applications.pdf
PDF
Lambda and DynamoDB best practices
PDF
Lessons from running AppSync in prod
PDF
Serverless observability - a hero's perspective
PDF
How to ship customer value faster with step functions
PDF
How serverless changes the cost paradigm
PDF
Why your next serverless project should use AWS AppSync
PDF
Build social network in 4 weeks
PDF
Patterns and practices for building resilient serverless applications
PDF
How to bring chaos engineering to serverless
PDF
Migrating existing monolith to serverless in 8 steps
PDF
Building a social network in under 4 weeks with Serverless and GraphQL
PDF
FinDev as a business advantage in the post covid19 economy
PDF
How to improve lambda cold starts
PDF
What can you do with lambda in 2020
PDF
A chaos experiment a day, keeping the outage away
PDF
How to debug slow lambda response times
How to win the game of trade-offs
How to choose the right messaging service
How to choose the right messaging service for your workload
Patterns and practices for building resilient serverless applications.pdf
Lambda and DynamoDB best practices
Lessons from running AppSync in prod
Serverless observability - a hero's perspective
How to ship customer value faster with step functions
How serverless changes the cost paradigm
Why your next serverless project should use AWS AppSync
Build social network in 4 weeks
Patterns and practices for building resilient serverless applications
How to bring chaos engineering to serverless
Migrating existing monolith to serverless in 8 steps
Building a social network in under 4 weeks with Serverless and GraphQL
FinDev as a business advantage in the post covid19 economy
How to improve lambda cold starts
What can you do with lambda in 2020
A chaos experiment a day, keeping the outage away
How to debug slow lambda response times

Recently uploaded (20)

PDF
Machine learning based COVID-19 study performance prediction
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Approach and Philosophy of On baking technology
PPTX
Cloud computing and distributed systems.
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
Spectroscopy.pptx food analysis technology
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
MYSQL Presentation for SQL database connectivity
PDF
cuic standard and advanced reporting.pdf
Machine learning based COVID-19 study performance prediction
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Approach and Philosophy of On baking technology
Cloud computing and distributed systems.
20250228 LYD VKU AI Blended-Learning.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
MIND Revenue Release Quarter 2 2025 Press Release
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Review of recent advances in non-invasive hemoglobin estimation
Spectroscopy.pptx food analysis technology
Building Integrated photovoltaic BIPV_UPV.pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
MYSQL Presentation for SQL database connectivity
cuic standard and advanced reporting.pdf

Applying principles of chaos engineering to serverless (reinvent DVC305)

  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Applying Principles of Chaos Engineering to Serverless Yan Cui Principal Engineer DAZN D V C 3 0 5
  • 3. Agenda: What is chaos engineering? New challenges with serverless. Applying latency injection to serverless. Applying error injection to serverless.
  • 4. After the talk: Slides will be shared on Slideshare. Recording will be posted on YouTube within 48 hours. Find the links on https://theburningmonk.com/reinvent2018
  • 5. What is chaos engineering?
  • 6. “Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” - principlesofchaos.org
  • 7. Smallpox: Earliest evidence of the disease is in a third-century BC Egyptian mummy. Estimated 400K deaths per year in eighteenth-century Europe.
  • 8. History of vaccination: The first vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  • 9. History of vaccination: WHO certified the global eradication of smallpox in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  • 10. History of vaccination: https://en.wikipedia.org/wiki/Vaccine
  • 11. History of vaccination: Vaccination is the most effective method to prevent infectious diseases.
  • 12. History of vaccination: Vaccines stimulate the immune system to recognize and destroy the disease before the body contracts it for real.
  • 13. Chaos engineering: Use controlled experiments to inject failures into our system.
  • 14. Chaos engineering: Helps us learn about our system’s behavior and uncover unknown failure modes, before they spread like wildfire in production.
  • 15. Chaos engineering: Lets us build confidence in the system’s ability to withstand turbulent conditions.
  • 16. Chaos engineering is the vaccine to frailties in modern software.
  • 17. Who am I? Principal engineer at DAZN. AWS Serverless Hero. Author of the Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  • 20. About DAZN: Available in seven countries (Austria, Switzerland, Germany, Japan, Canada, Italy, and the USA). Available on 30+ platforms.
  • 21. About DAZN: Around 1,000,000 concurrent viewers at peak.
  • 22. Chaos engineering has an image problem
  • 23. Chaos engineering has an image problem: Too much emphasis is on breaking things.
  • 24. Chaos engineering has an image problem: It is easy to conflate the action of injecting failures with the payoff.
  • 25. Chaos engineering has an image problem: The goal is to learn about the system and build confidence.
  • 26. Chaos engineering has an image problem: The goal is not to break things.
  • 27. Chaos in practice: Four steps to start running chaos experiments yourself.
  • 28. STEP 1. Define “steady state”: What does the normal, working condition look like?
  • 29. This is not a steady state.
  • 30. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  • 31. Chaos in practice: Explore unknown unknowns away from production.
  • 32. Chaos in practice: Experiments that graduate to production should be carefully considered and planned. You should have reasonable confidence in the system before running experiments in production.
  • 33. Chaos in practice: Treat production with the care it deserves. The goal is not to break things.
  • 34. Chaos in practice: If you knew the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  • 35. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more.
  • 36. Chaos in practice: Netflix’s Simian Army: https://github.com/Netflix/SimianArmy Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
  • 37. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  • 38. Chaos in practice: Look for evidence that steady state was impacted by the injected failure.
  • 39. Chaos in practice: Address weaknesses before failures happen for real.
  • 40. Containment: Experiments need to be controlled. The goal is not to break things.
  • 41. Containment: Ensure everyone knows what you are doing. Don’t surprise your teammates.
  • 42. Containment: Run experiments during office hours.
  • 43. Containment: Avoid important dates.
  • 44. Containment: Make the smallest change necessary to prove or disprove the hypothesis.
  • 45. Containment: Have a rollback plan. Stop the experiment right away if things start to go wrong.
  • 46. Containment: Don’t start in production. You can learn a lot by running experiments in staging.
  • 47. By Russ Miles @russmiles. Source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  • 48. New challenges with serverless
  • 49. Chaos Monkey kills an Amazon Elastic Compute Cloud (Amazon EC2) instance. Latency Monkey induces artificial delay in APIs. Chaos Gorilla kills an AWS Availability Zone. Chaos Kong kills an entire AWS Region.
  • 51. Serverless challenges: There are no servers that you can access and kill.
  • 54. There is more inherent chaos and complexity in a serverless architecture.
  • 55. Serverless challenges: Smaller units of deployment, but a lot more of them.
  • 56. Serverless challenges: serverful vs. serverless
  • 57. Serverless challenges: Every function needs to be correctly configured and secured.
  • 58. Serverless challenges: Kinesis, ?, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES
  • 59. Serverless challenges: A lot of managed, intermediate services, each with its own set of failure modes.
  • 60. Serverless challenges: Unknown failure modes in the infrastructure we don’t control.
  • 61. Serverless challenges: Often there’s little we can do when an outage occurs in the platform.
  • 62. Common weaknesses
  • 63. Common weaknesses: Improperly tuned timeouts
  • 64. Common weaknesses: Missing error handling
  • 65. Common weaknesses: Missing fallback
  • 66. Common weaknesses: Missing regional failover
  • 67. Latency injection with serverless
  • 68. STEP 1. Define “steady state”: What does the normal, working condition look like?
  • 69. Defining steady state: What metrics do you use?
  • 70. Defining steady state: p95/p99 latencies, error count, backlog size, yield*, harvest**. * percentage of requests completed ** completeness of the returned response
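As a rough illustration of the metrics on this slide, the sketch below computes p95/p99 latency, error count, yield, and harvest from a batch of request records. The record shape (`latency_ms`, `completed`, `fields_returned`, `fields_expected`), the function names, and the nearest-rank percentile method are assumptions made for this example, not something the talk prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical shape of one observed request (an assumption, not from the talk)."""
    latency_ms: float
    completed: bool        # did the request complete at all?
    fields_returned: int   # how much of the response was populated
    fields_expected: int   # how much a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(records):
    """Summarise a non-empty batch of records into steady-state metrics."""
    completed = [r for r in records if r.completed]
    latencies = [r.latency_ms for r in completed]
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(records) - len(completed),
        # yield: fraction of requests that completed
        "yield": len(completed) / len(records),
        # harvest: how complete the returned responses were
        "harvest": sum(r.fields_returned for r in completed)
        / sum(r.fields_expected for r in completed),
    }
```

Computing the same summary for the control group and the experiment group gives two dictionaries that can be compared side by side.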
  • 71. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  • 72. Serverless considerations: API Gateway
  • 73. Serverless considerations: Consider the effect of cold starts. How do they affect your strategy for handling slow responses?
  • 74. Request timeouts: Strategy should:
  • 75. Request timeouts: Strategy should: 1. Give requests the best chance to succeed
  • 76. Request timeouts: Strategy should: 1. Give requests the best chance to succeed 2. Not allow a slow response to time out the caller function
  • 77. Request timeouts: Finding the right timeout value is tricky.
  • 78. Request timeouts: Too short: requests are not given the best chance to succeed.
  • 79. Request timeouts: Too long: risk timing out the calling function.
  • 80. Request timeouts: Even more complicated when you have multiple integration points.
  • 81. Approach 1: Split invocation time equally (for example, 3 requests, 6s function timeout = 2s timeout per request)
  • 82. Approach 2: Every request is given nearly all the invocation time (for example, 3 requests, 6s function timeout = 5s timeout per request)
  • 83. Request timeouts: Proposal: set request timeouts dynamically based on invocation time left.
  • 84. Request timeouts
  • 85. Set timeout based on remaining invocation time
  • 86. Set timeout based on remaining invocation time
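The proposal above can be sketched as follows: derive each request's timeout from the invocation time still available, minus a reserve kept back for recovery steps. `get_remaining_time_in_millis()` is the method the AWS Lambda Python context object exposes; the 500 ms reserve, the `FakeContext` class, and the function names are assumptions made for this example.

```python
class FakeContext:
    """Stand-in for the Lambda context object, for local experimentation only."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


def equal_split_timeout_ms(function_timeout_ms, request_count):
    """Approach 1: divide the function timeout evenly across requests."""
    return function_timeout_ms // request_count


def dynamic_timeout_ms(context, reserve_ms=500):
    """Proposal: give the request all remaining invocation time, minus a
    reserve (an assumed 500 ms) for logging, metrics and fallbacks."""
    return max(0, context.get_remaining_time_in_millis() - reserve_ms)
```

With a 6s function timeout, the first request would get about 5.5s under this scheme. If it returns after 1.2s, the second request gets about 4.3s, so a fast early call leaves more time for later ones instead of wasting a fixed 2s slice.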
  • 87. Recovery steps: Log the timeout with as much context as possible: the API, timeout value, correlation IDs, request object, and more.
  • 88. Recovery steps: Record custom metrics.
  • 89. Recovery steps: Use fallbacks.
  • 91. Recovery steps: Be mindful when you sacrifice precision for availability. User experience is king.
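The recovery steps above (log with context, record a metric, fall back) might look like the sketch below. The function name, the log field names, and the in-memory `Counter` used in place of a real metrics client are all assumptions made for illustration.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("recovery")
metrics = Counter()  # stand-in for a real custom-metrics client (assumption)


def call_with_fallback(api_name, request, timeout_ms, call, fallback, correlation_id):
    """Run call(request, timeout_ms); on timeout, log context, count it, and fall back."""
    try:
        return call(request, timeout_ms)
    except TimeoutError:
        # Log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeout_ms": timeout_ms,
            "correlation_id": correlation_id,
            "request": request,
        }, default=str))
        # Record a custom metric so timeouts show up on dashboards.
        metrics[f"{api_name}.timeout"] += 1
        # Serve a less precise answer rather than failing the caller.
        return fallback(request)
```

The fallback here trades precision for availability, so it is worth checking that the degraded response is still acceptable from the user's point of view.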
  • 92. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more.
  • 93. Where to inject latency?
  • 94. Hypothesis: The function has appropriate timeouts on its HTTP communications and can degrade gracefully when these requests time out.
  • 95. Where to inject latency?
  • 96. Where to inject latency? It should be applied to third-party services too: DynamoDB, Twilio, Auth0 …
  • 97. Where to inject latency?
  • 98. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  • 99. Where to inject latency? http client, public-api-a, http client, public-api-b, internal-api
  • 100. Hypothesis: All functions have appropriate timeouts on their HTTP communications to this internal API and can degrade gracefully when requests time out.
  • 101. Where to inject latency?
  • 102. Where to inject latency?
  • 103. Where to inject latency? Large blast radius; can cause cascading failures unintentionally.
  • 105. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory in both positive and negative ways.
  • 106. Use failure injection to program your colleagues into thinking about failure modes early.
  • 107. Where to inject latency? Make X% of all requests slow in the dev environment.
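Making X% of requests slow comes down to a random draw per request. The helper below is a minimal sketch of that idea; the name `maybe_delay` and its parameters are hypothetical, and the random source and sleep function are injectable so the behaviour can be checked without waiting.

```python
import random
import time


def maybe_delay(percent, delay_s, rng=random.random, sleep=time.sleep):
    """Sleep for delay_s on roughly `percent`% of calls; return True if delayed."""
    if rng() * 100 < percent:
        sleep(delay_s)
        return True
    return False
```

A handler in the dev environment could call `maybe_delay(10, 3.0)` at the start of each request so that about one request in ten takes three seconds longer, which exposes clients that lack a timeout.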
  • 108. Hypothesis: The client app has an appropriate timeout on its HTTP communication with the server and can degrade gracefully when requests time out.
  • 109. Where to inject latency?
  • 110. Where to inject latency?
  • 112. Where to inject latency?
  • 113. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  • 114. How to inject latency?
  • 115. How to inject latency? Static weavers (such as PostSharp, AspectJ). Dynamic proxies.
  • 116. https://theburningmonk.com/2015/04/design-for-latency-issues/
  • 117. How to inject latency? Manually crafted wrapper libraries.
  • 123. Configured in SSM Parameter Store
  • 125. No injected latency
  • 127. With injected latency
  • 129. Factory wrapper function (think bluebird’s promisifyAll function)
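The code on the slides is not preserved in this transcript, so the following is only a guess at the general shape of such a factory wrapper, written in Python rather than the talk's language. It wraps every public method of a client object so that calls may be delayed according to a JSON config. The config keys (`isEnabled`, `probability`, `minDelay`, `maxDelay`) and the function name are hypothetical; in practice the JSON string would be loaded from SSM Parameter Store, a step omitted here.

```python
import json
import random
import time


def inject_latency_all(client, config_json, rng=random, sleep=time.sleep):
    """Return a proxy whose public methods may sleep before delegating to client."""
    config = json.loads(config_json)  # e.g. a value fetched from Parameter Store

    class Proxy:
        def __getattr__(self, name):
            attr = getattr(client, name)
            # Pass private names and plain data attributes through untouched.
            if name.startswith("_") or not callable(attr):
                return attr

            def wrapped(*args, **kwargs):
                if config["isEnabled"] and rng.random() < config["probability"]:
                    delay_ms = rng.uniform(config["minDelay"], config["maxDelay"])
                    sleep(delay_ms / 1000)
                return attr(*args, **kwargs)

            return wrapped

    return Proxy()
```

Because the wrapper is applied once at the point where the client is created, calling code stays unchanged, and setting `isEnabled` to false in the config turns the experiment off without a redeploy.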
  • 135. Error injection with serverless
  • 136. Common errors: HTTP 5XX. Amazon DynamoDB provisioned throughput exceeded. Throttled AWS Lambda invocations.
  • 137. Hypothesis: The function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail.
  • 138. Where to inject errors?
  • 139. Hypothesis: The function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB provisioned throughput is exceeded.
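Error injection can follow the same wrapper pattern as latency injection. In the sketch below, `ThroughputExceeded` is a hypothetical stand-in for a DynamoDB provisioned-throughput error, and `with_injected_errors` and `get_profile` are illustrative names; none of them come from the talk or from an AWS SDK.

```python
import random


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for a DynamoDB provisioned-throughput error."""


def with_injected_errors(operation, probability, error_factory, rng=random.random):
    """Wrap operation so that it raises error_factory() with the given probability."""
    def wrapped(*args, **kwargs):
        if rng() < probability:
            raise error_factory()
        return operation(*args, **kwargs)
    return wrapped


def get_profile(read_item, user_id):
    """Example caller: degrade to a default profile when the read is throttled."""
    try:
        return read_item(user_id)
    except ThroughputExceeded:
        return {"user_id": user_id, "name": "guest", "degraded": True}
```

If `get_profile` had no `except` branch, the injected error would surface as a failed invocation, which is exactly the kind of missing error handling the hypothesis is meant to catch.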
  • 140. Where to inject errors?
  • 141. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency.
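One way to make the reserved-concurrency change temporary is a context manager that always removes the cap on exit, in line with the earlier advice to have a rollback plan. The sketch assumes a boto3 Lambda client is passed in (`put_function_concurrency` and `delete_function_concurrency` are boto3 Lambda client methods); the name `throttled` is hypothetical, and the code assumes the function had no reserved concurrency before the experiment.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, reserved=0):
    """Temporarily cap a function's concurrency, then remove the cap.

    Assumes the function had no reserved concurrency beforehand; deleting
    the setting on exit would otherwise discard the original value.
    """
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    try:
        yield
    finally:
        # Rollback runs even if the experiment body raises.
        lambda_client.delete_function_concurrency(FunctionName=function_name)
```

A run might look like `with throttled(boto3.client("lambda"), "my-function"): run_experiment()`, where `my-function` and `run_experiment` are placeholders.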
  • 142. Recap
  • 144. The only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
  • 146. The goal of chaos engineering is NOT to actually break production
  • 147. CONTAINMENT should be front and centre of your thinking
  • 149. STEP 1. Define “steady state”: What does the normal, working condition look like?
  • 150. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  • 151. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more.
  • 152. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  • 153. There is more inherent chaos and complexity in a serverless application
  • 154. Even without servers, you can still inject CONTROLLED failures at the application level
  • 159. Thank you! Yan Cui @theburningmonk https://theburningmonk.com
  • 160. Related breakouts: Wednesday, Nov 28: SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications, 4:00PM – 5:00PM | Venetian, Level 4, Lando 4305; Wednesday, Nov 28: SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway, 4:45PM – 5:45PM | MGM, Level 1, South Concourse 105; Thursday, Nov 29: ARC308 - Chaos Engineering and Scalability at Audible.com, 1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5
  • 161. Please complete the session survey in the mobile app.