How to Fail with Serverless
Jeremy Daly
CTO, AlertMe.news
@jeremy_daly
Jeremy Daly
• CTO at AlertMe.news
• Consult with companies building in the cloud
• 20+ year veteran of technology startups
• Started working with AWS in 2009 and started using
Lambda in 2015
• Blogger (jeremydaly.com), OSS contributor, speaker
• Publish the Off-by-none serverless newsletter
• Host of the Serverless Chats podcast
Agenda
• Distributed systems and serverless
• Writing code FOR the cloud
• Failure modes in the cloud
• Serverless patterns to deal with failure
• Decoupling our services
Distributed Systems are…
Systems whose components are located on
different networked computers, which communicate
and coordinate their actions by passing messages to
one another. ~Wikipedia
They’re also really hard!
EVERYTHING FAILS
ALL THE TIME
Werner Vogels
CTO, Amazon.com
8 Fallacies of Distributed Computing
• The network is reliable 🚫
• Latency is zero 🚫
• Bandwidth is infinite 🚫
• The network is secure 🚫
• Topology doesn’t change 🚫
• There is one administrator 🚫
• Transport cost is zero 🚫
• The network is homogeneous 🚫
Serverless applications are…
Distributed systems on steroids! 💪 💪 💪
• Smaller, more distributed compute units
• Stateless, requiring network access to state
• Uncoordinated, requires buses, queues, pub/sub, state machines
• Heavily reliant on other networked cloud services
What does it mean to be Serverless?
• No server management
• Flexible scaling
• Pay for value / never pay for idle
• Automated high availability
• LOTS of configuration and knowledge of cloud services
• Highly event-driven
Lots of services to communicate with!
Categories shown: “Serverless”, Managed, Non-Serverless
Services shown: ElastiCache, RDS, EMR, Amazon ES, Redshift, Fargate, anything “on EC2”, Lambda, Cognito, Kinesis, S3, DynamoDB, SQS, SNS, API Gateway, AppSync, IoT, Comprehend, DocumentDB (MongoDB), Managed Streaming for Kafka, EventBridge
Reliability or High Availability is…
A characteristic of a system, which aims to ensure an
agreed level of operational performance,
usually uptime, for a higher than normal period.
~Wikipedia
Resiliency is…
The ability of a software solution to absorb the impact
of a problem in one or more parts of a system, while
continuing to provide an acceptable service level to
the business. ~ IBM
IT’S NOT ABOUT PREVENTING FAILURE
IT’S UNDERSTANDING HOW TO GRACEFULLY DEAL WITH IT
Writing code FOR the Cloud
Using Lambda for our business logic
• Ephemeral compute service
• Runs your code in response to events
• Automatically manages the runtime,
compute, and scaling
• Single concurrency model
• No sticky-sessions or guaranteed
lifespan
Traditional Error Handling & Retries
try {
// Do something important
} catch (err) {
// Do some error handling
// Do some logging
// Maybe retry the operation
}
What happens to the original event?
What happens if the function container crashes?
What happens if there is a network issue?
👨✈ Losing events is very bad!
What happens if the function never runs?
Function Error Types
• Unhandled Exception ‼
• Function Timeouts ⏳
• Out-of-Memory Errors 🧠
• Throttling Errors 🚦
The cloud is better than you…
…at error handling
…at retrying failures
…at understanding network failures
…at mapping the network topology
…at handling failover and redundancy
So why not let the
cloud do those
things for you? 🤷
Fail up the stack
• Don’t swallow errors with try/catch – fail the function
• Return errors directly to the invoking service
• Configure built-in retry mechanisms to reprocess events
• Utilize dead letter queues to capture failed events
(sometimes 😉)
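As a minimal sketch of "fail up the stack", the two handlers below differ only in what they do after catching an error. The function names and the `chargeCustomer` business logic are hypothetical, and no AWS SDK calls are involved. The first handler swallows the error, so the invoking service sees a success. The second rethrows, so the invocation fails and the platform's retry and dead letter mechanisms can take over.

```javascript
// Hypothetical business logic: throws when the order cannot be processed.
function chargeCustomer(order) {
  if (!order || typeof order.total !== "number") {
    throw new Error("InvalidOrder");
  }
  return { orderId: order.id, charged: order.total };
}

// Anti-pattern: the error is swallowed, so the invoker sees a successful
// invocation and the failed event is never retried or dead-lettered.
async function swallowingHandler(event) {
  try {
    return chargeCustomer(event);
  } catch (err) {
    console.log("charge failed:", err.message);
    return { ok: false };
  }
}

// Failing up the stack: log for context, then rethrow so the invocation
// itself fails and the invoking service can apply its retry behavior.
async function failingHandler(event) {
  try {
    return chargeCustomer(event);
  } catch (err) {
    console.log("charge failed:", err.message);
    throw err;
  }
}
```

The "(sometimes)" caveat still applies: swallowing is reasonable when a failure is expected and fully handled inside the function.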
Types of Lambda Functions
• The Lambdalith 😬
• The Fat Lambda 🤔
• The Single-Purpose Function 👍
The Mighty Lambdalith
• The entire application is in one
Lambda function
• Oftentimes these are “lift and
shift” Express.js or Flask apps
• Events are synchronous via API
Gateway or ALB
• Partial failures are handled “in the
code”
The Fat Lambda
• Several related methods are collocated in a single Lambda function
• Generally used to optimize the speed of synchronous operations
• Partial failures are still handled “in the code”
• Under the right circumstances, this can be useful
The Single-Purpose Function
• Tightly scoped function that handles a single discrete piece of
business logic
• Can be invoked synchronously or asynchronously
• Failures are generally “total failures” and are passed back to the
invoking service
• Can be reused as part of other “workflows”, can scale (or throttle)
independently, and can utilize the Principle of Least Privilege
Failure Modes in the Cloud
WARNING: Firehose of overly-technical content ahead 👩🚒
A quick word about retries…
• Retries are a vital part of distributed systems
• Most cloud services guarantee “at least once” delivery
• It is possible for the same event to be received more than once
• Retried operations should be idempotent
Idempotent means that…
An operation can be repeated multiple times and
always provide the same result, with no side effects
to other objects in the system. ~ Computer Hope
Idempotent operations:
• Update a database record
• Authenticate a user
• Check if a record exists and create if not
There are lots of strategies to ensure idempotency!
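One such strategy, sketched here under simplifying assumptions, is to record each event ID with its result and return the stored result when the same event is delivered again. The in-memory `Map` and the `handleOnce` helper are hypothetical stand-ins. A real deployment would need a shared, durable store with an atomic "write only if absent" operation.

```javascript
// In-memory stand-in for a durable store (assumption for illustration only).
const processed = new Map();

// Runs `work` at most once per event ID; duplicates get the stored result.
function handleOnce(eventId, work) {
  if (processed.has(eventId)) {
    return processed.get(eventId); // duplicate delivery: no side effects
  }
  const result = work();
  processed.set(eventId, result);
  return result;
}

// Simulate "at least once" delivery of the same event.
let charges = 0;
const charge = () => {
  charges += 1;
  return { charged: true, attempt: charges };
};

const first = handleOnce("evt-123", charge);
const second = handleOnce("evt-123", charge); // redelivery
console.log(first, second, charges);
```

The side effect runs once even though the event arrives twice, which is what makes retries safe.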
What are Dead Letter Queues (DLQs)?
• Capture messages/events that fail to process or are skipped
• Allows for alarming, inspection, and potential replay
• Can be added to SQS queues, SNS subscriptions, Lambda functions
Lambda Invocation Types
• Synchronous – request/response model
• Asynchronous – set it and forget it
• Stream-based – push
• Poller-based – pull
Synchronous Lambda Retry Behavior
• Functions are invoked directly using request/response method
• Failures are returned to the invoker
• Retries are delegated to the invoking application
• Some AWS services automatically retry (e.g. Alexa & Cognito)
• Other services do not retry (e.g. API Gateway, ALB, Step Functions)
• API Gateway and ALB can return errors to the client for retry
Asynchronous Lambda Retry Behavior
• The Lambda Service accepts requests and adds them to a queue
• The invoker receives a 202 status code and disconnects
• The Lambda Service will attempt to reprocess failed events up to
2 times, configured using the MaximumRetryAttempts setting
• If the Lambda function is throttled, the event will be retried for up to
6 hours, configured using MaximumEventAgeInSeconds
• Failed and expired events can be sent to a Dead Letter Queue (DLQ)
or an on-failure destination
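The settings above can be collected into a single configuration object. This is a sketch only: the helper name, function name, and ARN are hypothetical, and the range checks (0 to 2 retries, 60 to 21,600 seconds) reflect my understanding of the Lambda limits, so confirm them against the current documentation. The object shape follows the event invoke configuration that Lambda accepts, but no API call is made here.

```javascript
// Builds an asynchronous invocation config; validation limits are assumptions.
function buildAsyncInvokeConfig({ functionName, retries = 2, maxAgeSeconds = 21600, onFailureArn }) {
  if (!Number.isInteger(retries) || retries < 0 || retries > 2) {
    throw new RangeError("MaximumRetryAttempts must be 0, 1, or 2");
  }
  if (!Number.isInteger(maxAgeSeconds) || maxAgeSeconds < 60 || maxAgeSeconds > 21600) {
    throw new RangeError("MaximumEventAgeInSeconds must be between 60 and 21600");
  }
  const config = {
    FunctionName: functionName,
    MaximumRetryAttempts: retries,
    MaximumEventAgeInSeconds: maxAgeSeconds,
  };
  if (onFailureArn) {
    // Failed and expired events are routed to the on-failure destination.
    config.DestinationConfig = { OnFailure: { Destination: onFailureArn } };
  }
  return config;
}

// Hypothetical function name and queue ARN.
const asyncConfig = buildAsyncInvokeConfig({
  functionName: "process-order",
  retries: 1,
  maxAgeSeconds: 3600,
  onFailureArn: "arn:aws:sqs:us-east-1:123456789012:failed-orders",
});
console.log(JSON.stringify(asyncConfig, null, 2));
```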
Stream-based Lambda Retry Behavior
• Records are pushed synchronously to Lambda from Kinesis or
DynamoDB streams in batches (10k and 1k limits per batch)
• MaximumRetryAttempts: number of retry attempts for batches
before they can be skipped (up to 10,000)
• MaximumRecordAgeInSeconds: store records up to 7 days
• BisectBatchOnFunctionError: recursively split failed batches
(poison pill)
• Skipped records are sent to an On-failure Destination (SQS or SNS)
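To illustrate why bisecting helps with a poison pill, the sketch below simulates the idea locally: a failed batch is split in half and each half is retried until the bad record is isolated. This is not how the Lambda service is implemented. `processBatch` and the `poison` flag are hypothetical.

```javascript
// Hypothetical handler: the whole batch fails if any record is bad.
function processBatch(records) {
  for (const record of records) {
    if (record.poison) throw new Error(`cannot process record ${record.id}`);
  }
}

// Recursively splits failed batches until the failing records are isolated.
// Returns the records that would be skipped (sent to an on-failure destination).
function bisectOnError(records, handler, skipped = []) {
  try {
    handler(records);
  } catch (err) {
    if (records.length === 1) {
      skipped.push(records[0]);
    } else {
      const mid = Math.ceil(records.length / 2);
      bisectOnError(records.slice(0, mid), handler, skipped);
      bisectOnError(records.slice(mid), handler, skipped);
    }
  }
  return skipped;
}

const batch = [{ id: 1 }, { id: 2 }, { id: 3, poison: true }, { id: 4 }, { id: 5 }];
const skippedRecords = bisectOnError(batch, processBatch);
console.log(skippedRecords.map((record) => record.id));
```

Note that healthy records are handed to the handler more than once while the batch is being narrowed down, which is another reason the processing should be idempotent.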
Poller-based Lambda Retry Behavior
• The Lambda Poller pulls records synchronously from SQS in
batches (up to 10)
• Errors fail the entire batch
• MaxReceiveCount: number of times messages can be returned to
the queue before being sent to the DLQ (up to 1,000)
• Polling frequency is tied to function concurrency
• VisibilityTimeout should be set to at least 6 times the timeout
configured on your consuming function
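A small sketch of the two queue settings mentioned above, assuming the SQS convention that queue attributes are passed as strings and that the redrive policy is a JSON string. The helper name and ARN are hypothetical, and nothing is sent to AWS.

```javascript
// Derives queue attributes from the consuming function's timeout.
function buildQueueAttributes({ functionTimeoutSeconds, dlqArn, maxReceiveCount = 5 }) {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1 || maxReceiveCount > 1000) {
    throw new RangeError("maxReceiveCount must be between 1 and 1000");
  }
  return {
    // At least 6 times the function timeout, per the guidance above.
    VisibilityTimeout: String(6 * functionTimeoutSeconds),
    // Messages received more than maxReceiveCount times move to the DLQ.
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount }),
  };
}

// Hypothetical DLQ ARN; a 30 second function timeout gives a 180 second visibility timeout.
const queueAttributes = buildQueueAttributes({
  functionTimeoutSeconds: 30,
  dlqArn: "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
  maxReceiveCount: 3,
});
console.log(queueAttributes);
```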
Lambda Destinations
• Only for asynchronous invocations
• Routing based on SUCCESS and/or FAILURE
• OnFailure should be favored over a standard DLQ
• Destinations can be an SQS queue, SNS topic,
Lambda function, or EventBridge event bus
Lambda Destinations (continued)
Destination-specific JSON format
• SQS/SNS: JSON object is passed as the Message
• Lambda: JSON is passed as the payload to the function
• EventBridge: JSON is passed as the Detail in the PutEvents call
• Source is “lambda”
• DetailType is “Lambda Function Invocation Result – Success/Failure”
• Resource fields contain the function and destination ARNs
SQS Redrive Policies
• Only supports another SQS queue as the DLQ
• Messages are sent to the DLQ if the Maximum Receives value is
exceeded
SNS Redrive Policies
• Dead Letter Queues are attached to Subscriptions, not Topics
• Only supports SQS queues as the DLQ
• Client-side errors (e.g. Lambda doesn’t exist) do not retry
• Messages to SQS or Lambda are retried 100,015 times over 23 days
• Messages to SMTP, SMS, and Mobile retry 50 times over 6 hours
• HTTP endpoints support customer-defined retry policies
(number of retries, delays, and backoff strategy)
EventBridge Retry Behavior
• Will attempt to deliver events for up to 24 hours with backoff
• Lambda functions are invoked asynchronously
• Failed events are lost (this is very unlikely)
• Once events are accepted by the target service, failure modes of
those services are used
Step Functions
• State Machines: Orchestration workflows
• Lambdas are invoked synchronously
• Retriers and Catchers allow for complex
error handling patterns
• Use “error names” with ErrorEquals for
conditional error handling (States.*)
• Control retry policies with IntervalSeconds,
MaxAttempts, BackoffRate
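The retry fields above might look like this in a task state, written here as a JavaScript object rather than a state machine definition file. The state names and the function ARN are hypothetical. The helper shows the delay schedule those fields imply, assuming each retry waits `IntervalSeconds` multiplied by `BackoffRate` once per previous retry.

```javascript
// Hypothetical task state with a Retrier and a Catcher.
const chargeCardState = {
  Type: "Task",
  Resource: "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
  Retry: [
    {
      ErrorEquals: ["States.Timeout", "States.TaskFailed"],
      IntervalSeconds: 2,
      MaxAttempts: 3,
      BackoffRate: 2.0,
    },
  ],
  Catch: [{ ErrorEquals: ["States.ALL"], Next: "HandleFailure" }],
  Next: "SendReceipt",
};

// Seconds waited before each retry: IntervalSeconds * BackoffRate^(retry index).
function retryDelays({ IntervalSeconds, MaxAttempts, BackoffRate }) {
  const delays = [];
  for (let retry = 0; retry < MaxAttempts; retry++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, retry));
  }
  return delays;
}

console.log(retryDelays(chargeCardState.Retry[0])); // [ 2, 4, 8 ]
```

Once the retries are exhausted, the Catcher moves the execution to the fallback state instead of failing the whole workflow.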
[Diagram: Complex Error Handling Pattern. Credit: Yan Cui]
AWS SDK Retries
• Automatic retries and exponential backoff
AWS SDK          | Maximum retry count | Connection timeout | Socket timeout
Python (Boto 3)  | depends on service  | 60 seconds         | 60 seconds
Node.js          | depends on service  | N/A                | 120 seconds
Java             | 3                   | 10 seconds         | 50 seconds
.NET             | 4                   | 100 seconds        | 300 seconds
Go               | 3                   | N/A                | N/A
Error Handling Patterns
Caching strategy
[Diagram: Client, API Gateway, Lambda, ElastiCache, and RDS; Lambda falls through to RDS on a cache miss]
Key Points:
• Create new RDS connections ONLY on misses
• Make sure TTLs are set appropriately
• Include the ability to invalidate cache
YOU STILL NEED TO SIZE YOUR DATABASE CLUSTERS APPROPRIATELY
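The three key points of the caching strategy can be sketched with an in-memory cache standing in for ElastiCache and a plain function standing in for the RDS query. All names here are hypothetical, and the injectable clock exists only to make expiry easy to demonstrate.

```javascript
// Minimal TTL cache; `now` is injectable so expiry can be simulated.
function createCache(ttlMs, now = () => Date.now()) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (entry.expiresAt <= now()) {
        entries.delete(key); // expired: treat as a miss
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
    invalidate(key) {
      entries.delete(key);
    },
  };
}

// Only touches the database loader on a cache miss.
function readThrough(cache, key, loadFromDatabase) {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const value = loadFromDatabase(key);
  cache.set(key, value);
  return value;
}

let clock = 0;
let dbReads = 0;
const userCache = createCache(1000, () => clock);
const loadUser = (key) => {
  dbReads += 1;
  return `row:${key}`;
};

readThrough(userCache, "user-1", loadUser); // miss: reads the database
readThrough(userCache, "user-1", loadUser); // hit: no database read
clock = 1500;
readThrough(userCache, "user-1", loadUser); // TTL expired: reads again
userCache.invalidate("user-1");
readThrough(userCache, "user-1", loadUser); // invalidated: reads again
console.log(dbReads);
```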
Buffer events for throttling and durability
[Diagram: Client, API Gateway, SQS Queue, Lambda (throttled), RDS, and SQS (DLQ). The client's synchronous request is acknowledged (ack) and then handled as an “asynchronous” request. A second Lambda is crossed out (x) with the note: Utilize Service Integrations. Callout: Limit the concurrency to match RDS throughput.]
Key Points:
• SQS adds durability
• Throttled Lambdas reduce downstream pressure
• Failed events are stored for further inspection/replay
The Circuit Breaker
[Diagram: Client, API Gateway, Lambda, and Stripe API, with ElastiCache or DynamoDB. Labels: Status Check, CLOSED, OPEN, HALF OPEN, Increment Failure Count.]
Key Points:
• Cache your cache with warm functions
• Use a reasonable failure count
• Understand idempotency!
“Everything fails all the time.” ~Werner Vogels
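The circuit breaker's three states can be sketched as a small class. This version keeps its state in memory and wraps a synchronous operation, both of which are simplifications: the pattern above keeps the status in a shared store such as ElastiCache or DynamoDB, and a real downstream call would be asynchronous. The class, option names, and thresholds are hypothetical.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  // Status check: CLOSED, OPEN, or HALF OPEN once the reset period has passed.
  get state() {
    if (this.openedAt === null) return "CLOSED";
    return this.now() - this.openedAt >= this.resetAfterMs ? "HALF OPEN" : "OPEN";
  }

  call(operation) {
    const state = this.state;
    if (state === "OPEN") {
      throw new Error("CircuitOpen"); // fail fast without calling downstream
    }
    try {
      const result = operation();
      this.failures = 0; // success closes the circuit
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1; // increment failure count
      if (state === "HALF OPEN" || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or reopen) the circuit
      }
      throw err;
    }
  }
}
```

While the circuit is open, callers fail fast instead of adding load to a struggling service. After the reset period, a single trial call decides whether the circuit closes or reopens.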
Event Processing Conditional Retry Pattern
[Diagram: EventBridge asynchronously invokes a Processing Lambda, whose onSuccess and onFailure destinations feed EventBridge. One rule routes permanent failures to SQS (DLQ). Another rule routes “retriable” failures to a Replay SQS queue (use delivery delay), which feeds a Replay Lambda (Throttled).]
Decoupling Our Services
Multicasting with SNS
[Diagram: Client sends an “asynchronous” request to SNS (Event Service) and receives an ack. Subscribers shown: HTTP, SMS, Lambda, SQS, Email, with an SQS (DLQ).]
Key Points:
• SNS has a “well-defined API”
• Decouples downstream processes
• Allows multiple subscribers with message filters
Multicasting with EventBridge
[Diagram: Client sends an asynchronous “PutEvents” request to Amazon EventBridge and receives an ack w/ event id. Targets shown: Lambda, SQS, Step Function, Event Bus, +16 others.]
Key Points:
• Allows multiple subscribers with RULES, PATTERNS and FILTERS
• Forward events to other accounts
• 24 hours of automated retries
Distribute & Throttle
[Diagram: Client, API Gateway, and Lambda publish to EventBridge and the client receives an ack. Three downstream services are shown (Order Service with RDS, SMS Alerting Service with the Twilio API, Billing Service with the Stripe API), each fed by its own SQS Queue and a Lambda with limited concurrency (25, 10, and 5 in the diagram). Rule patterns shown: "detail-type": [ "ORDER COMPLETE" ] and "total": [{ "numeric": [ ">", 0 ]}]]
Key Points:
• Filter events to selectively trigger services
• Manage throttling/quotas/failures per service
• Use Lambda Destinations with asynchronous events
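The two rule fragments shown for this pattern could be combined into a single event pattern like the one below. Placing "total" under "detail" is an assumption about the event shape. The matcher is a deliberately simplified illustration of how such a pattern filters events. It handles only exact matches and a single numeric comparison, and it is not EventBridge's actual matching logic.

```javascript
// Hypothetical rule: completed orders with a positive total.
const billingRule = {
  "detail-type": ["ORDER COMPLETE"],
  detail: { total: [{ numeric: [">", 0] }] },
};

// True if `actual` satisfies any of the allowed values or numeric conditions.
function matchesValue(allowed, actual) {
  return allowed.some((candidate) => {
    if (candidate !== null && typeof candidate === "object" && Array.isArray(candidate.numeric)) {
      const [operator, bound] = candidate.numeric;
      if (typeof actual !== "number") return false;
      if (operator === ">") return actual > bound;
      if (operator === ">=") return actual >= bound;
      if (operator === "<") return actual < bound;
      if (operator === "<=") return actual <= bound;
      if (operator === "=") return actual === bound;
      return false;
    }
    return candidate === actual;
  });
}

// Every key in the pattern must match; nested objects are matched recursively.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event == null ? undefined : event[key];
    if (Array.isArray(expected)) return matchesValue(expected, actual);
    return matchesPattern(expected, actual);
  });
}

const orderEvent = { "detail-type": "ORDER COMPLETE", detail: { total: 25 } };
console.log(matchesPattern(billingRule, orderEvent));
```

Filtering at the rule level means each service's queue only receives the events it cares about, so throttling and failures stay isolated per service.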
Key Takeaways
• Be prepared for failure – everything fails all the time!
• Utilize the built in retry mechanisms of the cloud
• Understand failure modes to protect against data loss
• Buffer and throttle events to distributed systems
• Embrace asynchronous processes to decouple components
Things I’m working on…
Blog: JeremyDaly.com
Podcast: ServerlessChats.com
Newsletter: Offbynone.io
DDBToolbox: DynamoDBToolbox.com
Lambda API: LambdaAPI.com
GitHub: github.com/jeremydaly
Twitter: @jeremy_daly
Thank You!
Jeremy Daly
jeremy@jeremydaly.com
@jeremy_daly

