Tips, webinars, and releases for a reliable 2025!
💡 How-tos and best practices
Every company needs to invest in the reliability of its systems. On the surface, the ROI of this investment seems like a straightforward calculation. After all, going from 98% to 99% availability cuts downtime from ~175 hours/year to ~88 hours/year, a 50% decrease that, according to recent average downtime costs, saves $72 million over the course of a year.
But as anyone with budgeting experience will tell you, it’s a little more complicated than that in practice.
In this blog post, we take a look at how to calculate the ROI of reliability efforts, including a deeper dive into computing the amount your company gains from reliability.
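To make the surface-level arithmetic concrete, here is a minimal Python sketch of the calculation. It assumes a 365-day year (8,760 hours), which puts 98% availability at about 175 hours of downtime and 99% at about 88. The cost per hour and the investment amount are made-up placeholders for illustration, not figures from the blog post, and the function names are our own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours(availability_pct: float) -> float:
    """Expected annual downtime, in hours, at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def downtime_savings(before_pct: float, after_pct: float, cost_per_hour: float) -> float:
    """Annual downtime cost avoided by raising availability from before_pct to after_pct."""
    hours_saved = downtime_hours(before_pct) - downtime_hours(after_pct)
    return hours_saved * cost_per_hour


def roi(gain: float, investment: float) -> float:
    """Simple ROI: net gain divided by the amount invested."""
    return (gain - investment) / investment


if __name__ == "__main__":
    # Hypothetical inputs: replace with your own downtime cost and spend.
    cost_per_hour = 100_000   # placeholder cost of one hour of downtime, in dollars
    investment = 1_000_000    # placeholder annual reliability spend, in dollars

    print(f"Downtime at 98%: {downtime_hours(98):.1f} hours/year")
    print(f"Downtime at 99%: {downtime_hours(99):.1f} hours/year")

    savings = downtime_savings(98, 99, cost_per_hour)
    print(f"Downtime cost avoided: ${savings:,.0f}")
    print(f"ROI: {roi(savings, investment):.2f}x")
```

This only covers the simple version of the math. Real budgets involve costs and gains that a single cost-per-hour number does not capture, which is where the "more complicated in practice" part comes in.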
•
You’re well on your way to becoming more reliable. You’ve added your services, found and fixed some Detected Risks, and run your first set of reliability tests. However, some of your tests returned as “Failed.”
Not to worry: this isn’t a reflection of you or your engineering skills, but rather an opportunity to learn more about how your systems work and, more importantly, how to make them more resilient.
In this blog post, we’ll review Gremlin’s built-in reliability tests: what each one does, what a failure means, and how to address failures to make your systems more resilient.
•
Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too—you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant.
There’s only one problem: cloud platforms are essentially black boxes. With no insight into how the platform is built, how it works, or how it handles failures, how can you design highly available workloads?
In this blog post, we’ll answer this question by examining some of the most popular AWS services, such as EC2, EKS, and ECS.
——
🚀 New releases!
Reliability testing is ongoing work, and tracking that work can be difficult in large organizations. Engineers run one-off experiments, scheduled Scenarios run in the background, and, for more mature teams, CI/CD workflows fire off automated tests on demand.
According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on in Gremlin—until now.
We’re excited to announce the launch of three new screens for managing activity in Gremlin: Now Running, What’s Scheduled, and What Ran.
Read the release blog to learn more about these new features!
——
🖥️ Office Hours
UPCOMING
DATE: February 13th TIME: 11am PT/2pm ET
Pop quiz - what are all of the dependencies your services rely on?
If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?
In this Office Hours session, we’ll demo Gremlin’s Dependency Discovery feature. We’ll cover how it works, how to set it up, and how you can use it to make your services more resilient to slow or failed dependencies.
•
ON-DEMAND
You’ve spent a lot of time and energy improving your service’s reliability, but how do you demonstrate it? Gremlin provides tools to help you analyze and understand your organization’s reliability posture, including reliability reports, trends, scores, and more.
In this Office Hours session, we look in-depth at each of Gremlin’s reports. You’ll learn what they represent, what you can take away from them, and how you can best apply what you have learned to your own services.
——