NORDICS
Building resilient serverless workloads:
Navigating through failures
JIMMY DAHLQVIST | 2024-09-17
Thank You!
@jimmydahlqvist
JIMMY DAHLQVIST
Head of AWS @ Sigma Technology Cloud
Founder of serverless-handbook.com
AWS Serverless Hero ☁️ User Group Leader ☁️ AWS Ambassador
Hello, I'm
@jimmydahlqvist
Resources
• https://serverless-handbook.com
• Architecture patterns
• Solutions
• Workshops
Agenda
• What is serverless and resiliency
• Architecting resilient systems – Good practices
• Summary
• What did you learn?
What is serverless?
• Automatic and flexible scaling
• No capacity planning
• High Availability
• Pay-for-use billing
What is resiliency?
The ability of a software solution to handle the impact of problems, and recover from turbulent conditions, when other parts of the system fail.
“Everything fails all the time”
Dr. Werner Vogels, CTO, Amazon.com
Understand AWS Services
• Everything has a limit
• Understand how services work under the hood
Resiliency testing (in prod-ish)
• Chaos Engineering
• AWS Fault Injection Service
• Start in QA
• Don’t forget about data
Web application
Do we need an
immediate response?
Storage-First
• Data-centric design
• Durability and availability
• Scalable System Design
• Asynchronous processing
Storage-First Things to consider
• Architectural complexity
• Eventual consistency
• Design for idempotency
• Risk of over-optimization
Queue Load Leveling
• System stability
• Handle unexpected spikes
• Protect downstream resources
Decoupling
Retries
• Selfish
• Exponential backoff
• Users can make it worse
Retries with backoff and jitter
No Jitter vs. With Jitter
Image: Amazon Architecture blog (https://tinyurl.com/y48t2v4h)
DLQ
Circuit breaker
Circuit breaker state diagram: Closed, Open, Half Open
Circuit breaker
• Avoid cascading failures
• Protect system resources
• Risk of early circuit break
• Good observability required
Put it all together
Notification Service
Payment Service
What we talked about
• Design for failure
• Buffer and store messages first
• Process asynchronously
• Level the load
• Retry on failures
• Break if integrations are not healthy
Quiz with me!
dahlqvistjimmy
https://serverless-handbook.com
https://jimmydqv.com
PLEASE FILL IN THE SESSION SURVEY
Editor's Notes
  • #1: Navigating failures! Building resilient serverless workloads! That is what we are going to talk about here today, failures! We are going to talk about how we can architect and build resilient serverless systems. We will not talk about preventing failures, instead we will talk about recovering from them, and some good architectures that can help you on your way. We will start with a serverless workload, and add to it to make it more resilient.
  • #3: Some of you might now be thinking: but hey, doesn't serverless on AWS come with built-in availability and resiliency? Isn't that one of the strengths of serverless, that it's resilient out of the box? And you are absolutely right: serverless services from AWS have availability and resiliency built into them.
  • #4: This was a short talk, then. So, thank you very much for listening and see you around..... Or is there more to resilient systems than that?
  • #5: Serverless services do come with high availability and resiliency built in. But that is at the service level: Lambda is highly available and resilient, Step Functions is highly available and resilient, EventBridge is highly available and resilient. If all components in our systems were serverless, that would be great. However, that is not the case. In 95% of all systems that I have designed or worked on, there have been components that are not serverless. It can be the need for a relational database, and yes, I do argue that not all data models and search requirements can be modeled for DynamoDB. It can be that we need to integrate with a 3rd party, and their API and connections come with quotas and throttling. It could be that we as developers are unaware of certain traits of AWS services that make our serverless services break. And sometimes our users are even our own worst enemy. When we are building our serverless systems, we must always remember that there are components involved that don't scale as fast, or to the same degree, as serverless components do.
  • #6: Hi! I'm Jimmy! I have worked with AWS and serverless since 2015, almost a decade now, and I have seen all kinds of strange things. I'm a true serverless enthusiast: the very first solution I built on AWS was serverless, and I have not looked back since. I have built serverless solutions for a variety of companies, from startups to large enterprises. I'm the founder of serverless-handbook.com, where you can find all kinds of serverless things that I have built, ranging from workshops to small architecture patterns. And I have my blog at jimmydqv.com. As a day-time job (and yes, I do have a daytime job, I know people have been questioning that) I work as Head of AWS at Sigma Technology Cloud; we are an advanced services partner with AWS and build all kinds of fun solutions. If you'd like to know more about us, visit our booth outside.... I'm an AWS Serverless Hero, an AWS Ambassador, and one of the leaders of the Scania user group.
  • #9: So, when I say serverless I mean services that come with automatic scaling, little to no capacity planning, built-in high availability, and a pay-for-use billing model. This is my definition of serverless, and I'm sure many of you can agree with that.
  • #10: So, what is resiliency? Sometimes it gets confusing, and people mix up resiliency with reliability. As I said in the beginning, resiliency is NOT about preventing failures, it's about recovering from them. It's about making sure our system maintains an acceptable level of service even when other parts of the system are not healthy. It's about dealing gracefully with failures. Reliability focuses on preventing the failure from happening in the first place, while resiliency is about recovering from it. Per the Oxford dictionary: reliability is "the quality of performing consistently well," whereas resilience is "the capacity to recover quickly from difficulties."
  • #11: This is by far one of my favorite quotes by Dr. Werner Vogels, because this is real life! When running large distributed systems, everything will eventually fail. We can have downstream services that are not responding as we expect; they can be having health problems. Or we can be throttled by a 3rd party, or even by our own services. We need to design our system not with the mindset "what happens if this fails" but instead "how can we keep running and recover WHEN this fails".
  • #12: It's important that the cracks that form when components in our system fail don't spread, that they don't take down our entire system. We need ways to handle and contain the cracks; that way we can isolate and protect the system as a whole. This matters when our serverless systems integrate with non-serverless components. In some cases it's obvious: your system interacts with an Amazon Aurora database. Other times it's not that clear: the system integrates with a 3rd-party API or does encryption using KMS. Both of these scenarios can lead to throttling that affects our system and starts forming cracks if not handled properly. How does our system handle an integration point that is not responding, especially under a period of high load? This can easily start creating cracks that bring our entire system to a halt, or mean we start losing data.
  • #13: When we build serverless systems we must remember that every API in AWS has a limit. We can store application properties in Systems Manager Parameter Store, and a few of them might be sensitive and encrypted with KMS. What can then happen is that we get throttled by a different service without realizing it: SSM might have a higher limit, but getting an encrypted value would be impacted by the KMS limit. If we then don't design our functions correctly, and call SSM in the Lambda handler on every invocation, we would quickly get throttled and build up a hefty bill. Instead we could load properties in the initialization phase. Understanding how AWS services work under the hood, to some extent, is extremely important, so our systems don't fail due to some unknown kink. For example, when consuming a Kinesis Data Stream with a Lambda function, if processing one item in a batch fails, the entire batch fails. The batch would then be sent to the Lambda function over and over again. TELL ASSA KINESIS STORY!!!!! What we can do in this case is to bisect batches on Lambda function failures. The failed batch will be split in half and each half sent to the function; bisecting continues until we only have the single failing item left.
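The init-phase loading described above can be sketched as follows. This is a minimal illustration, not real SDK code: `fetch_parameter` is a hypothetical stand-in for an SSM GetParameter call, with a counter so the quota impact is visible without an AWS account.

```python
# Sketch: cache configuration during the Lambda init phase instead of
# fetching it on every invocation. fetch_parameter stands in for a real
# SSM GetParameter call (which counts against SSM and, for SecureString
# values, KMS quotas).

CALL_COUNT = {"ssm": 0}

def fetch_parameter(name):
    # Hypothetical placeholder for the real SSM call.
    CALL_COUNT["ssm"] += 1
    return "value-of-" + name

# Init phase: runs once per execution environment, not once per request.
CONFIG = {"db_host": fetch_parameter("/app/db_host")}

def handler(event, context):
    # The handler reads the cached value; no SSM/KMS call per invocation.
    return {"db": CONFIG["db_host"], "ok": True}
```

Invoking `handler` any number of times leaves the SSM call count at one, which is the whole point of loading in the initialization phase.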
  • #14: Now, I bet everyone in this room runs a multi-environment system: you have your dev, test, pre-prod, and prod environments. Many would also argue that their QA, staging, pre-prod, or whatever you call it, has an identical setup to the prod environment. But now, let's make sure you consider data as well: the amount of data, and the difference in user-generated data. This is an important part when we consider our systems and when we plan for resiliency testing. Data is different. I have seen systems taken down on multiple occasions due to differences in data and even in integration points. TELL ASSA DATABASE STORY – BOOM!! With one client, we had an update that had been tested and prepared in all environments. But when deploying to prod, the database went haywire on us. We used Amazon Aurora Serverless, and the database suddenly scaled to max and then couldn't cope anymore. Our entire service was brought down, all due to an SQL query that, because of the amount of data in prod, consumed all database resources. Or you might have an integration with a 3rd party where the 3rd party's staging environment is different. I had a scenario where, in prod, the 3rd party had IP allow-listing in place, so when we extended our system and got some new IPs, suddenly only 1/3 of our calls were allowed. In staging, this was not in place. That was.... intermittent failures are always the most fun to debug. ### MM NAT Gateway Story ### A good way to practice and prepare for failures is through resiliency testing, chaos engineering. AWS offers a service around this topic, AWS Fault Injection Service, which you can use to simulate failures and see how your components and system handle them. So what I'm saying is: when you plan your resiliency testing, start in your QA or staging environment. But don't forget about prod, and do plan to run tests there as well.
  • #15: Now let's start off with a classic web application with an API: compute in Lambda and a database in DynamoDB. Now that is one scalable application. But maybe we actually need an SQL database; as mentioned in the beginning, this is still frequently used in many applications. Or we need to integrate with a 3rd party, and this could be an integration that is on-prem, or running on servers in a different cloud: a compute solution that doesn't scale as fast and flexibly as our serverless solution. With a lot of users we could quickly overwhelm the 3rd-party API, or any downstream service in our solution that doesn't scale as fast. This application is set up as a classic synchronous request-response, where our client expects an immediate response to the request. We wait for this entire process to happen; storing data directly in a database might be very fast, and the blocking isn't that long. But with more complex integrations, with chained calls and even 3rd-party integrations, the time quickly adds up, and if one of the components is down and not responding we need to fail the entire operation and leave any form of retry to the calling application.
  • #16: One question we need to ask when building our APIs: do our write operations really need an immediate response? Can we make this an asynchronous process? In a distributed system, does the calling application need to know that we have stored the data already, or can we just hand over the event and send a response back saying "Hey, I got the message and I will do something with it"? ### Buffer events With an asynchronous approach we can add a buffer between our calls and the storage of our data. This protects us and the downstream service: the downstream service will not be overwhelmed, and by that we protect our own system from failures as well. This can however create an eventual-consistency model, where a read after a write does not always give us the same data back.
  • #17: Let's return to our application, but focus only on the API part from now on. Let's get rid of our Lambda integration completely and instead integrate directly with SQS. This creates one of the most powerful patterns for building resilient serverless systems, and I use it all the time: Storage First! So instead of the API Gateway-to-Lambda integration, we move the Lambda function behind the queue, as a consumer.
  • #18: This takes us to the Storage-First architecture pattern. The idea behind this pattern is to safely store messages in durable storage, and then process them asynchronously. This way we can handle them at the pace we see fit, and we can re-process them if they fail. Basically, we add a buffer to our API. Tell MM Lea story!
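Re-processing stored messages, as described above, is only safe when handling is idempotent. A minimal sketch of that idea, under the assumption that each message carries a stable `id`; the in-memory set stands in for what would typically be e.g. a DynamoDB conditional write in a real system:

```python
# Sketch: idempotent processing for a storage-first pipeline. Duplicate
# deliveries (retries, replays) are detected via a processed-keys store
# and skipped, so the side effect happens exactly once per message id.

processed = set()   # stand-in for a durable dedup store
results = []        # stand-in for the real side effect (DB write, email, ...)

def process(message):
    key = message["id"]  # assumed producer-supplied idempotency key
    if key in processed:
        return "skipped"          # duplicate delivery: do nothing
    results.append(message["body"])  # perform the side effect once
    processed.add(key)
    return "done"
```

Delivering the same message twice then performs the side effect only once, which is what makes retries and replays harmless.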
  • #19: Latency: Prioritizing storage might increase the time it takes to process or access the data in real-time scenarios. Complexity: Designing with storage in mind may lead to intricate architectures, especially when integrating with diverse processing systems. Prerequisites: Requires robust and often expensive storage solutions to ensure data durability and high availability. Data Integrity: Ensuring data stored is accurate and consistent can pose challenges, especially in high ingestion systems. Potential for Over-Optimization: There's a risk of over-investing in storage without considering the balance of other architectural needs.
  • #20: If we circle back to the API solution again: not only do we use the Storage-First pattern in the current setup, we also have the possibility for other resilient solutions. I have already briefly mentioned this several times without putting a name on it: in this solution we also use the Queue Load Leveling pattern.
  • #21: Using the queue load leveling pattern we protect the downstream service, and by doing that ourselves, by only processing events at a pace we know the service can handle. Other benefits come with this pattern that might not be that obvious: it can help us control cost, as we can run on subscriptions with lower throughput that cost less, or we can down-scale our database as we don't need to run a huge instance to deal with peaks. The same goes if we don't process the queue with Lambda functions but instead use containers: we can set the scaling to fewer instances, or even build a better auto-scaling solution. Now! One consideration with this pattern: if our producers constantly create more requests than we can process, we can end up in a situation where we are always trailing behind. For that scenario we either need to scale up the consumers, which might lead to unwanted downstream consequences, or at some point evict and throw away messages. Which you choose, of course, comes with the standard architect answer: "It depends...."
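The load-leveling behavior in the note above can be sketched as a buffer that is drained in fixed-size batches. All names here are illustrative; in the real architecture SQS plays the queue role and the "downstream" would be the database or 3rd-party API:

```python
# Sketch: queue load leveling. Producers enqueue at any rate; the
# consumer drains at most batch_size items per cycle, so the downstream
# service never sees more than that pace regardless of traffic spikes.
from collections import deque

queue = deque()  # stand-in for a durable queue such as SQS

def enqueue(request):
    queue.append(request)  # the buffer absorbs the spike

def drain(batch_size, downstream):
    # Take at most batch_size items, a pace the downstream can handle.
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    for request in batch:
        downstream(request)
    return len(batch)
```

Enqueuing ten requests and draining with a batch size of three forwards only three per cycle; the remaining seven wait safely in the buffer.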
  • #22: So what if we have more than one service that is interested in the request? For an SQS queue we can only have one consumer; two consumers can't get the same message. In this case we need to create a fan-out, or multicast, system.
  • #23: So, what we can do in this solution is replace our queue with EventBridge, which can route the request, or message, to many different services: SQS queues, Step Functions, Lambda functions, other EventBridge buses, and many more. EventBridge is highly scalable, with high availability and resiliency, and a built-in retry mechanism for 24 hours. With the archive feature we can also replay messages in case they failed. And if there is a problem delivering a message to a target, we can set a DLQ to handle that scenario. We must however remember that the DLQ only comes into effect if there is a problem calling the target, such as lacking IAM permissions or similar. If the target itself has a problem and fails to process the message, it will not end up in the DLQ. Therefore each of our target services must implement resiliency using the patterns we have been talking about.
  • #24: Even with a storage-first approach we are of course not protected against failures. They will happen; remember, "Everything fails all the time". In the scenarios where our processing does fail, we need to retry. But retries are selfish, and what we don't want to do, in case it's a downstream service that fails or we are throttled by the database, is to just retry immediately. Instead we'd like to back off and give the service some breathing room. We would also like to apply exponential backoff: if our second call also fails, we back off a bit more. So the first retry we do after 1 second, then 2, then 4, and so on, until we either time out and give up, or have a success.
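The 1s/2s/4s schedule described above can be sketched like this. The function name, parameters, and the injectable `sleep` are assumptions made for testability; this is not an AWS API, just the bare backoff logic:

```python
# Sketch: retry with exponential backoff. The delay doubles per attempt
# (base * 2**attempt) up to a cap; sleep is injectable so the schedule
# can be inspected in tests without actually waiting.
import time

def call_with_backoff(operation, max_attempts=5, base=1.0, cap=30.0,
                      sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: give up (e.g. route to a DLQ)
            sleep(min(cap, base * 2 ** attempt))
```

An operation that fails twice and then succeeds produces exactly the 1-second and 2-second waits the note describes before returning its result.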
  • #25: In our retry scenario, there is a study conducted by AWS that shows that in a highly distributed system retries will happen at the same time. If all retries happen with the same backoff (1 second, 2 seconds, 4 seconds, and so on) they will eventually line up and happen at the same time. This can then lead to the downstream service crashing directly after becoming healthy, just due to the amount of work that has stacked up and now happens at the same time. It's like an electric grid: after a power failure, all appliances turn on at the same time, creating such a load on the grid that it goes out again, or we blow a fuse. Then we change the fuse, everything turns on at the same time, and the fuse blows again. Therefore we should also use some form of jitter in our backoff algorithm. This could be adding a random wait time to the backoff: first we wait 1 second plus a random number of hundreds of milliseconds, the second time we wait 2 seconds plus 2x a random number, and so on. By doing that, our services will not line up their retries. How we add the jitter, and how much, well, that depends on your system and implementation. USERS MAKE IT WORSE STORY. TV4 OR REINVENT!
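One common jitter variant, described as "full jitter" on the Amazon Architecture blog post the slides reference, replaces the fixed schedule with a random delay drawn from the whole backoff window. The function name is illustrative; a sketch:

```python
# Sketch: "full jitter" backoff. Instead of a fixed 1s/2s/4s schedule,
# each client sleeps a uniform random time in [0, ceiling), where the
# ceiling grows exponentially up to a cap. Randomizing the whole window
# keeps many clients from retrying in lockstep.
import random

def backoff_with_full_jitter(attempt, base=1.0, cap=30.0, rng=random.random):
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling  # uniform in [0, ceiling)
```

Exactly how much jitter to apply, and whether to jitter the full window or just part of it, depends on the system, as the note says.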
  • #26: In the cases where we do give up the processing, when we have hit the max number of retries, this is where the DLQ comes in. We route the messages to a DLQ where we can use different retry logic or even inspect the messages manually. The DLQ is also a good indicator that something might be wrong, and we can create alarms and alerts based on the number of messages in the DLQ. One message might not be a problem, but if the number of messages starts stacking up, it's a clear indicator that something is wrong. If we are using SQS as our message buffer we can connect a DLQ to it directly. We can also use Lambda function failure destinations and set an SQS queue as that destination: if the function exits with a failure, the message is sent to the destination. If we use Step Functions as our processor, we can send messages to an SQS queue when we reach our retry limit.
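The give-up path above can be sketched as follows. The in-memory `dlq` list is an illustrative stand-in for an SQS dead-letter queue, and the function names are assumptions, not any real API:

```python
# Sketch: after the retry budget is exhausted, park the message on a DLQ
# instead of dropping it. Messages are kept for inspection or redrive,
# and queue depth becomes a natural alarm signal.

dlq = []  # stand-in for a real dead-letter queue

def process_with_dlq(message, operation, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return operation(message)
        except Exception:
            continue  # retry (backoff omitted for brevity)
    dlq.append(message)  # give up: keep the message rather than lose it
    return None
```

An alarm would then fire on the DLQ depth, since even one parked message is worth a look and a growing count clearly signals trouble.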
  • #27: One more approach would be to use Step Functions' built-in retry with backoff. However, SQS can't invoke Step Functions, so what we can do is use EventBridge instead of SQS and rely on EventBridge's durability and its archive-and-replay mechanism. We add a DLQ where we send events when we give up the calls.
  • #28: Retries are all good, but there is no point in sending requests to an integration that is not healthy; it will just keep failing over and over again. So what we can do here is implement a circuit breaker. If you are not familiar with circuit breakers, it's a classic pattern: it makes sure we don't send requests to an API or integration that is not healthy and doesn't respond. This way we protect both the integration or API and ourselves from doing work we know will fail. Because everything fails all the time, right? So before we call the API we'll have a status check; if the API is healthy, we'll send the request. This is the closed state of the circuit breaker. Think of it as an electric circuit: when the circuit is closed, electricity can flow and the lights are on.
  • #29: So before we call the API we'll have a status check; if the API is healthy, we'll send the request. This is the closed state of the circuit breaker. Think of it as an electric circuit: when the circuit is closed, electricity can flow and the lights are on. As we make calls to the API we'll update the status; if we start to get error responses to our requests, we'll open the circuit and stop sending requests. This state is where storage-first shines: we can keep our messages in the storage queue until the integration is healthy again. But we can't just stop sending requests forever. So what we do is periodically place the circuit in a half-open state, send a few requests, and update our status with the health of those requests.
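The closed/open/half-open behavior described above can be sketched as a small state machine. This is an illustrative minimal implementation with an injectable clock for testing, not a library API; thresholds and cooldowns are assumed values:

```python
# Sketch: a minimal circuit breaker. Failures past a threshold open the
# circuit (calls fail fast); after a cooldown one probe is allowed
# through (half-open) to check whether the integration has recovered.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"  # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"  # a success closes the circuit again
        return result
```

While the circuit is open, a storage-first buffer is exactly where the unsent messages wait, as the note points out.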