How to Fail with Serverless
Jeremy Daly
CTO, AlertMe.news
@jeremy_daly

Jeremy Daly
• CTO at AlertMe.news
• Consult with companies building in the cloud
• 20+ year veteran of technology startups
• Started working with AWS in 2009 and started using
Lambda in 2015
• Blogger (jeremydaly.com), OSS contributor, speaker
• Publish the Off-by-none serverless newsletter
• Host of the Serverless Chats podcast

Agenda
• Distributed systems and serverless
• Writing code FOR the cloud
• Failure modes in the cloud
• Serverless patterns to deal with failure
• Decoupling our services

Distributed Systems are…
Systems whose components are located on
different networked computers, which communicate
and coordinate their actions by passing messages to
one another. ~Wikipedia
They’re also really hard!

EVERYTHING FAILS
ALL THE TIME
Werner Vogels
CTO, Amazon.com

8 Fallacies of Distributed Computing
🚫 The network is reliable
🚫 Latency is zero
🚫 Bandwidth is infinite
🚫 The network is secure
🚫 Topology doesn’t change
🚫 There is one administrator
🚫 Transport cost is zero
🚫 The network is homogeneous

Serverless applications are…
Distributed systems on steroids! 💪 💪 💪
• Smaller, more distributed compute units
• Stateless, requiring network access to state
• Uncoordinated, requires buses, queues, pub/sub, state machines
• Heavily reliant on other networked cloud services

What does it mean to be Serverless?
• No server management
• Flexible scaling
• Pay for value / never pay for idle
• Automated high availability
• LOTS of configuration and knowledge of cloud services
• Highly event-driven

Lots of services to communicate with!
“Serverless”: Lambda, Cognito, Kinesis, S3, DynamoDB, SQS, SNS, API Gateway, AppSync, IoT, Comprehend, EventBridge
Managed: ElastiCache, RDS, EMR, Amazon ES, Redshift, DocumentDB (MongoDB), Managed Streaming for Kafka
Non-Serverless: Fargate, Anything “on EC2”

Reliability or High Availability is…
A characteristic of a system, which aims to ensure an
agreed level of operational performance,
usually uptime, for a higher than normal period.
~Wikipedia

Resiliency is…
The ability of a software solution to absorb the impact
of a problem in one or more parts of a system, while
continuing to provide an acceptable service level to
the business. ~ IBM
IT’S NOT ABOUT PREVENTING FAILURE
IT’S UNDERSTANDING HOW TO GRACEFULLY DEAL WITH IT

Using Lambda for our business logic
• Ephemeral compute service
• Runs your code in response to events
• Automatically manages the runtime,
compute, and scaling
• Single concurrency model
• No sticky-sessions or guaranteed
lifespan

Traditional Error Handling & Retries
try {
  // Do something important
} catch (err) {
  // Do some error handling
  // Do some logging
  // Maybe retry the operation
}
What happens to the original event?
What happens if the function container crashes?
What happens if there is a network issue?
What happens if the function never runs?
👨‍✈️ Losing events is very bad!

Function Error Types
• Unhandled Exception ‼
• Function Timeouts ⏳
• Out-of-Memory Errors 🧠
• Throttling Errors 🚦

The cloud is better than you…
…at error handling
…at retrying failures
…at understanding network failures
…at mapping the network topology
…at handling failover and redundancy
So why not let the cloud do those things for you? 🤷

Fail up the stack
• Don’t swallow errors with try/catch – fail the function
• Return errors directly to the invoking service
• Configure built-in retry mechanisms to reprocess events
• Utilize dead letter queues to capture failed events
(sometimes 😉)
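A minimal sketch of "failing up the stack", assuming a Node.js Lambda-style handler. The `processOrder` function and the `orderId` field are hypothetical names used only for illustration; the point is that the handler has no try/catch that swallows the error, so a failure reaches the invoking service and its retry mechanisms.

```javascript
// Hypothetical business logic: validate and "process" an order event.
function processOrder(event) {
  if (!event || typeof event.orderId !== "string") {
    // Throwing marks the invocation as failed so the platform can retry it
    // or route the event to a DLQ / on-failure destination.
    throw new Error("Invalid order event: missing orderId");
  }
  return { orderId: event.orderId, status: "PROCESSED" };
}

// Lambda-style handler. There is deliberately no try/catch here: a thrown
// error rejects the returned promise and fails the whole invocation.
const handler = async (event) => {
  const result = processOrder(event);
  console.log(JSON.stringify({ level: "info", ...result }));
  return result;
};
```

If some errors really should be handled in code (the "sometimes" case), catching only those specific errors and rethrowing everything else keeps the rest of the failure path intact.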

Types of Lambda Functions
• The Lambdalith 😬
• The Fat Lambda 🤔
• The Single-Purpose Function 👍

The Mighty Lambdalith
• The entire application is in one
Lambda function
• Oftentimes these are “lift and
shift” Express.js or Flask apps
• Events are synchronous via API
Gateway or ALB
• Partial failures are handled “in the
code”

The Fat Lambda
• Several related methods are collocated in a single Lambda function
• Generally used to optimize the speed of synchronous operations
• Partial failures are still handled “in the code”
• Under the right circumstances, this can be useful

The Single-Purpose Function
• Tightly scoped function that handles a single discrete piece of
business logic
• Can be invoked synchronously or asynchronously
• Failures are generally “total failures” and are passed back to the
invoking service
• Can be reused as part of other “workflows”, can scale (or throttle)
independently, and can utilize the Principle of Least Privilege

Failure Modes in the Cloud
WARNING: Firehose of overly-technical content ahead 👩‍🚒

A quick word about retries…
• Retries are a vital part of distributed systems
• Most cloud services guarantee “at least once” delivery
• It is possible for the same event to be received more than once
• Retried operations should be idempotent

Idempotent means that…
An operation can be repeated multiple times and
always provide the same result, with no side effects
to other objects in the system. ~ Computer Hope
Idempotent operations:
• Update a database record
• Authenticate a user
• Check if a record exists and create if not
There are lots of strategies to ensure idempotency!
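One possible idempotency strategy, sketched under assumptions: remember the result for each event id and return it on replays. The in-memory `Map` stands in for a durable store, and `chargeCustomer`, `handleOnce`, and the event fields are hypothetical. A real system would need an atomic check-and-write (for example, a conditional database write), because this check-then-set is not safe across concurrent invocations.

```javascript
// In-memory stand-in for a durable store (an assumption for this sketch).
const processed = new Map();

// Hypothetical side effect that must not happen twice for the same event.
let chargesMade = 0;
function chargeCustomer(event) {
  chargesMade += 1;
  return { eventId: event.id, amount: event.amount, charged: true };
}

// Run the side effect at most once per event id; replays return the
// stored result instead of repeating the work.
function handleOnce(event) {
  if (processed.has(event.id)) {
    return processed.get(event.id);
  }
  const result = chargeCustomer(event);
  processed.set(event.id, result);
  return result;
}

const first = handleOnce({ id: "evt-1", amount: 25 });
const replay = handleOnce({ id: "evt-1", amount: 25 }); // duplicate delivery
console.log(first === replay, chargesMade); // prints: true 1
```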

What are Dead Letter Queues (DLQs)?
• Capture messages/events that fail to process or are skipped
• Allows for alarming, inspection, and potential replay
• Can be added to SQS queues, SNS subscriptions, Lambda functions

Lambda Invocation Types
• Synchronous – request/response model
• Asynchronous – set it and forget it
• Stream-based – push
• Poller-based – pull

Synchronous Lambda Retry Behavior
• Functions are invoked directly using request/response method
• Failures are returned to the invoker
• Retries are delegated to the invoking application
• Some AWS services automatically retry (e.g. Alexa & Cognito)
• Other services do not retry (e.g. API Gateway, ALB, Step Functions)
• API Gateway and ALB can return errors to the client for retry

Asynchronous Lambda Retry Behavior
• The Lambda Service accepts requests and adds them to a queue
• The invoker receives a 202 status code and disconnects
• The Lambda Service will attempt to reprocess failed events up to
2 times, configured using the MaximumRetryAttempts setting
• If the Lambda function is throttled, the event will be retried for up to
6 hours, configured using MaximumEventAgeInSeconds
• Failed and expired events can be sent to a Dead Letter Queue (DLQ)
or an on-failure destination
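The settings above could be assembled roughly as follows. This is a sketch, assuming the parameters are passed to Lambda's PutFunctionEventInvokeConfig API; the helper name `buildAsyncInvokeConfig` and the example ARNs are hypothetical, and the lower bound of 60 seconds is taken from the AWS API reference rather than from the slide.

```javascript
// Build parameters for Lambda's PutFunctionEventInvokeConfig API.
// The function name and destination ARN are supplied by the caller.
function buildAsyncInvokeConfig({ functionName, retries, maxAgeSeconds, onFailureArn }) {
  if (!Number.isInteger(retries) || retries < 0 || retries > 2) {
    throw new RangeError("MaximumRetryAttempts must be 0, 1, or 2");
  }
  if (!Number.isInteger(maxAgeSeconds) || maxAgeSeconds < 60 || maxAgeSeconds > 21600) {
    throw new RangeError("MaximumEventAgeInSeconds must be between 60 and 21600 (6 hours)");
  }
  return {
    FunctionName: functionName,
    MaximumRetryAttempts: retries,
    MaximumEventAgeInSeconds: maxAgeSeconds,
    // Failed and expired events are routed here instead of being dropped.
    DestinationConfig: { OnFailure: { Destination: onFailureArn } },
  };
}

const exampleConfig = buildAsyncInvokeConfig({
  functionName: "example-function", // placeholder name
  retries: 2,
  maxAgeSeconds: 3600,
  onFailureArn: "arn:aws:sqs:us-east-1:123456789012:example-failures", // placeholder ARN
});
console.log(JSON.stringify(exampleConfig, null, 2));
```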

Stream-based Lambda Retry Behavior
• Records are pushed synchronously to Lambda from Kinesis or
DynamoDB streams in batches (10k and 1k limits per batch)
• MaximumRetryAttempts: number of retry attempts for batches
before they can be skipped (up to 10,000)
• MaximumRecordAgeInSeconds: store records up to 7 days
• BisectBatchOnFunctionError: recursively split failed batches
(poison pill)
• Skipped records are sent to an On-failure Destination (SQS or SNS)
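To illustrate the idea behind BisectBatchOnFunctionError, here is a simplified local simulation, not the Lambda service's actual implementation. It assumes a failing batch is split in half and each half retried until the "poison pill" record is isolated and skipped; `processWithBisect`, `processBatch`, and the `poison` flag are hypothetical names.

```javascript
// Simulate bisecting: when a batch fails, split it in half and retry each
// half until the failing record is isolated and can be skipped.
function processWithBisect(batch, processBatch, skipped = []) {
  try {
    processBatch(batch);
  } catch (err) {
    if (batch.length === 1) {
      skipped.push(batch[0]); // would go to the on-failure destination
    } else {
      const mid = Math.ceil(batch.length / 2);
      processWithBisect(batch.slice(0, mid), processBatch, skipped);
      processWithBisect(batch.slice(mid), processBatch, skipped);
    }
  }
  return skipped;
}

// Hypothetical batch processor: fails the whole batch if any record is bad.
const handled = [];
function processBatch(records) {
  for (const record of records) {
    if (record.poison) throw new Error("Cannot process record " + record.id);
  }
  handled.push(...records.map((record) => record.id));
}

const batch = [1, 2, 3, 4, 5].map((id) => ({ id, poison: id === 3 }));
const skipped = processWithBisect(batch, processBatch);
console.log(skipped.map((record) => record.id), handled);
```

In a real function a batch may be partially processed before it fails, so records can be seen more than once as the batch is split, which is another reason to keep the processing idempotent.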

Poller-based Lambda Retry Behavior
• The Lambda Poller pulls records synchronously from SQS in
batches (up to 10)
• Errors fail the entire batch
• MaxReceiveCount: number of times messages can be returned to
the queue before being sent to the DLQ (up to 1,000)
• Polling frequency is tied to function concurrency
• VisibilityTimeout should be set to at least 6 times the timeout
configured on your consuming function
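As a rough sketch of the queue settings described above, the helper below derives the VisibilityTimeout from the function timeout and builds a redrive policy. It assumes the attribute names used by the SQS SetQueueAttributes API (string-valued `VisibilityTimeout` and a JSON-encoded `RedrivePolicy`); the helper name and the example ARN are hypothetical.

```javascript
// Derive SQS queue attributes for a queue consumed by a Lambda function.
function buildQueueAttributes({ functionTimeoutSeconds, dlqArn, maxReceiveCount }) {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1 || maxReceiveCount > 1000) {
    throw new RangeError("maxReceiveCount must be between 1 and 1000");
  }
  return {
    // At least 6 times the consuming function's timeout.
    VisibilityTimeout: String(functionTimeoutSeconds * 6),
    // Messages received more than maxReceiveCount times move to the DLQ.
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount }),
  };
}

const queueAttributes = buildQueueAttributes({
  functionTimeoutSeconds: 30,
  dlqArn: "arn:aws:sqs:us-east-1:123456789012:example-dlq", // placeholder ARN
  maxReceiveCount: 5,
});
console.log(queueAttributes);
```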

Lambda Destinations
• Only for asynchronous invocations
• Routing based on SUCCESS and/or FAILURE
• OnFailure should be favored over a standard DLQ
• Destinations can be an SQS queue, SNS topic,
Lambda function, or EventBridge event bus

Lambda Destinations (continued)
Destination-specific JSON format
• SQS/SNS: JSON object is passed as the Message
• Lambda: JSON is passed as the payload to the function
• EventBridge: JSON is passed as the Detail in the PutEvents call
• Source is “lambda”
• DetailType is “Lambda Function Invocation Result – Success/Failure”
• Resource fields contain the function and destination ARNs
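A consumer of an on-failure destination might pull out the original event for inspection or replay along these lines. This is a sketch: the sample record follows the invocation record shape as I understand it from the AWS Lambda documentation, all values are made up, and `summarizeFailure` is a hypothetical helper. For SQS/SNS this JSON would arrive as the message, and for EventBridge inside the event detail.

```javascript
// Sample invocation record (illustrative values only).
const sampleRecord = {
  version: "1.0",
  timestamp: "2020-01-01T00:00:00.000Z",
  requestContext: {
    requestId: "example-request-id",
    functionArn: "arn:aws:lambda:us-east-1:123456789012:function:example:$LATEST",
    condition: "RetriesExhausted",
    approximateInvokeCount: 3,
  },
  requestPayload: { orderId: "A1" }, // the original event
  responseContext: { statusCode: 200, executedVersion: "$LATEST", functionError: "Unhandled" },
  responsePayload: { errorType: "Error", errorMessage: "Downstream unavailable" },
};

// Extract what is needed to alarm on, inspect, or replay a failed event.
function summarizeFailure(record) {
  return {
    originalEvent: record.requestPayload,
    reason: record.requestContext.condition,
    attempts: record.requestContext.approximateInvokeCount,
    error: record.responsePayload ? record.responsePayload.errorMessage : undefined,
  };
}

console.log(summarizeFailure(sampleRecord));
```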

SQS Redrive Policies
• Only supports another SQS queue as the DLQ
• Messages are sent to the DLQ if the Maximum Receives value is
exceeded

SNS Redrive Policies
• Dead Letter Queues are attached to Subscriptions, not Topics
• Only supports SQS queues as the DLQ
• Client-side errors (e.g. Lambda doesn’t exist) do not retry
• Messages to SQS or Lambda are retried 100,015 times over 23 days
• Messages to SMTP, SMS, and Mobile retry 50 times over 6 hours
• HTTP endpoints support customer-defined retry policies
(number of retries, delays, and backoff strategy)

EventBridge Retry Behavior
• Will attempt to deliver events for up to 24 hours with backoff
• Lambda functions are invoked asynchronously
• Failed events are lost (this is very unlikely)
• Once events are accepted by the target service, failure modes of
those services are used

Step Functions
• State Machines: Orchestration workflows
• Lambdas are invoked synchronously
• Retriers and Catchers allow for complex
error handling patterns
• Use “error names” with ErrorEquals for
conditional error handling (States.*)
• Control retry policies with IntervalSeconds,
MaxAttempts, BackoffRate
[Diagram: Complex Error Handling Pattern. Credit: Yan Cui]
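The retry fields above fit together roughly like this. The task state is written as a JavaScript object in the shape of an Amazon States Language Task state; the ARN and the `HandleFailure` state name are placeholders. The `retryDelays` helper is a hypothetical illustration of how IntervalSeconds and BackoffRate combine, assuming the documented defaults of 1 second, 3 attempts, and a rate of 2.0.

```javascript
// A Task state with a Retrier and a Catcher (placeholder ARN and names).
const taskState = {
  Type: "Task",
  Resource: "arn:aws:lambda:us-east-1:123456789012:function:example",
  Retry: [
    { ErrorEquals: ["States.Timeout"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 },
  ],
  // Anything not retried successfully falls through to the catcher.
  Catch: [{ ErrorEquals: ["States.ALL"], Next: "HandleFailure" }],
  End: true,
};

// Seconds waited before each retry: the first wait is IntervalSeconds, and
// every later wait is multiplied by BackoffRate.
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2.0 }) {
  return Array.from({ length: MaxAttempts }, (_, i) => IntervalSeconds * Math.pow(BackoffRate, i));
}

console.log(retryDelays(taskState.Retry[0])); // prints: [ 2, 4, 8 ]
```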

AWS SDK Retries
• Automatic retries and exponential backoff
AWS SDK          Maximum retry count   Connection timeout   Socket timeout
Python (Boto 3)  depends on service    60 seconds           60 seconds
Node.js          depends on service    N/A                  120 seconds
Java             3                     10 seconds           50 seconds
.NET             4                     100 seconds          300 seconds
Go               3                     N/A                  N/A
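For calls that are not covered by an SDK's built-in behavior, retries with exponential backoff can be sketched as below. This is a generic illustration, not the AWS SDK's own algorithm: it assumes a "full jitter" strategy, and the function names, base delay, and cap are arbitrary choices.

```javascript
// Delay before retry number `attempt` (0-based): a random value between 0
// and min(capMs, baseMs * 2^attempt), i.e. exponential backoff with jitter.
function backoffDelayMs(attempt, baseMs = 100, capMs = 5000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation up to maxRetries times, then rethrow the last
// error so the failure still propagates up the stack.
async function withRetries(operation, maxRetries = 3) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

A call such as `withRetries(() => callDownstreamApi(payload))` (with `callDownstreamApi` being whatever async call needs protecting) would make at most four attempts in total. The retried operation should be idempotent.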

Caching strategy
[Diagram: Client → API Gateway → Lambda → ElastiCache; on a cache miss, Lambda → RDS]
Key Points:
• Create new RDS connections ONLY on misses
• Make sure TTLs are set appropriately
• Include the ability to invalidate cache
YOU STILL NEED TO SIZE YOUR DATABASE CLUSTERS APPROPRIATELY
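The key points can be illustrated with a small cache-aside sketch. It is a local simulation under assumptions: an in-memory `Map` stands in for ElastiCache, the synchronous `loadFromDatabase` callback stands in for an RDS query, and `createCache` is a hypothetical name.

```javascript
// Cache-aside with TTLs: only call the database loader on a miss.
function createCache(loadFromDatabase, ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) return hit.value; // cache hit
      const value = loadFromDatabase(key); // miss or expired: go to the database
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
    // Explicit invalidation, e.g. after the underlying row is updated.
    invalidate(key) {
      entries.delete(key);
    },
  };
}

let databaseReads = 0;
const userCache = createCache((id) => {
  databaseReads += 1;
  return { id, name: "user-" + id };
}, 60000);

userCache.get(42);
userCache.get(42); // served from cache
console.log(databaseReads); // prints: 1
```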

Buffer events for throttling and durability
[Diagram: Client → API Gateway → SQS Queue → Lambda (throttled) → RDS, with SQS (DLQ) attached. Labels: “Asynchronous” Request with ack; Synchronous Request. A second Lambda is marked with an x next to “Utilize Service Integrations”.]
Key Points:
• SQS adds durability
• Throttled Lambdas reduce downstream pressure
• Failed events are stored for further inspection/replay
Limit the concurrency to match RDS throughput

The Circuit Breaker
[Diagram: Client → API Gateway → Lambda → Stripe API. Lambda performs a Status Check against ElastiCache or DynamoDB and increments the failure count on errors. States: CLOSED, OPEN, HALF OPEN.]
Key Points:
• Cache your cache with warm functions
• Use a reasonable failure count
• Understand idempotency!
“Everything fails all the time.” ~ Werner Vogels
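A circuit breaker's state transitions can be sketched as follows. This is a simplified, single-process illustration: the state lives in local variables, whereas the diagram keeps it in ElastiCache or DynamoDB so that it is shared across function instances. The names `createCircuitBreaker`, `failureThreshold`, and `resetAfterMs` are hypothetical.

```javascript
// Minimal circuit breaker: CLOSED -> OPEN after repeated failures,
// OPEN -> HALF_OPEN after a cool-down, HALF_OPEN -> CLOSED on success.
function createCircuitBreaker({ failureThreshold = 3, resetAfterMs = 30000, now = Date.now } = {}) {
  let state = "CLOSED";
  let failures = 0;
  let openedAt = 0;
  return {
    getState: () => state,
    call(operation) {
      if (state === "OPEN") {
        // Fail fast without touching the downstream service.
        if (now() - openedAt < resetAfterMs) throw new Error("Circuit is OPEN");
        state = "HALF_OPEN"; // cool-down elapsed: allow one trial request
      }
      try {
        const result = operation();
        state = "CLOSED";
        failures = 0;
        return result;
      } catch (err) {
        failures += 1; // increment failure count
        if (state === "HALF_OPEN" || failures >= failureThreshold) {
          state = "OPEN";
          openedAt = now();
        }
        throw err;
      }
    },
  };
}
```

Because a request that fails while the circuit is closing or half open may be retried later, the protected operation should be idempotent.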

Event Processing Conditional Retry Pattern
[Diagram: EventBridge → Processing Lambda, invoked asynchronously, with onSuccess and onFailure destinations sending results to EventBridge. One rule routes permanent failures to SQS (DLQ); another rule routes “retriable” failures to Replay SQS (use delivery delay), which feeds Replay Lambda (throttled).]

Multicasting with SNS
[Diagram: Client → SNS (“Asynchronous” Request, ack); subscribers: HTTP, SMS, Lambda, SQS, Email; SQS (DLQ); label: Event Service]
Key Points:
• SNS has a “well-defined API”
• Decouples downstream processes
• Allows multiple subscribers with message filters

Multicasting with EventBridge
[Diagram: Client → Amazon EventBridge (asynchronous “PutEvents” request, ack w/ event id); targets: Lambda, SQS, Step Function, Event Bus, +16 others]
Key Points:
• Allows multiple subscribers with RULES, PATTERNS and FILTERS
• Forward events to other accounts
• 24 hours of automated retries

Distribute & Throttle
[Diagram: Client → API Gateway → Lambda (Order Service) → EventBridge, with an ack returned to the client. EventBridge rules such as "detail-type": [ "ORDER COMPLETE" ] and "total": [{ "numeric": [ ">", 0 ]}] route events to SQS queues, each drained by a Lambda with limited concurrency (25, 10, and 5 in the diagram). Downstream labels: SMS Alerting Service, Twilio API, Billing Service, Stripe API, RDS.]
Key Points:
• Filter events to selectively trigger services
• Manage throttling/quotas/failures per service
• Use Lambda Destinations with asynchronous events
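To show how rule patterns like the two in the diagram filter events, here is a deliberately tiny matcher. It is an illustration only: it supports exact-value lists and the numeric ">" operator, not the full EventBridge pattern language. Combining both filters into one `examplePattern` with `total` nested under `detail` is an assumption, and all names are hypothetical.

```javascript
// Hypothetical rule combining the two filters shown in the diagram.
const examplePattern = {
  "detail-type": ["ORDER COMPLETE"],
  detail: { total: [{ numeric: [">", 0] }] },
};

// A field matches if any candidate in the list matches the actual value.
function matchesValue(allowed, actual) {
  return allowed.some((candidate) => {
    if (candidate !== null && typeof candidate === "object" && Array.isArray(candidate.numeric)) {
      const [op, bound] = candidate.numeric;
      return op === ">" && typeof actual === "number" && actual > bound;
    }
    return candidate === actual;
  });
}

// Every key in the pattern must match; nested objects are matched recursively.
function matches(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event == null ? undefined : event[key];
    if (Array.isArray(expected)) return matchesValue(expected, actual);
    return matches(expected, actual);
  });
}

console.log(matches(examplePattern, { "detail-type": "ORDER COMPLETE", detail: { total: 19.99 } }));
```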

Key Takeaways
• Be prepared for failure – everything fails all the time!
• Utilize the built-in retry mechanisms of the cloud
• Understand failure modes to protect against data loss
• Buffer and throttle events to distributed systems
• Embrace asynchronous processes to decouple components

Things I’m working on…
Blog: JeremyDaly.com
Podcast: ServerlessChats.com
Newsletter: Offbynone.io
DDBToolbox: DynamoDBToolbox.com
Lambda API: LambdaAPI.com
GitHub: github.com/jeremydaly
Twitter: @jeremy_daly

Thank You!
Jeremy Daly
jeremy@jeremydaly.com
@jeremy_daly
