Tips, webinars, and releases for a reliable 2025!

Gremlin

The Reliability Management Platform for high-velocity engineering teams

Published Jan 23, 2025

💡 How-tos and best practices

Every company needs to invest in the reliability of its systems. On the surface, the ROI of this investment seems like a straightforward calculation. After all, going from 98% to 99% availability reduces downtime from ~173 hours/year to ~86 hours/year, a 50% decrease that, according to recent average downtime costs, saves $72 million over the course of a year.

But as anyone with budgeting experience will tell you, it’s a little more complicated than that in practice.

In this blog post, we take a look at how to calculate the ROI of reliability efforts, including a deeper dive into computing the amount your company gains from reliability.
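The availability arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the method from the blog post: the function names and the `cost_per_hour` value are hypothetical placeholders, and the year length is assumed to be 365 days. With that assumption, the same calculation gives roughly 175 and 88 hours of downtime per year, close to the rounded figures above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap, 365-day year


def downtime_hours(availability_pct: float,
                   hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Return the expected hours of downtime per year for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100 - availability_pct) / 100 * hours_per_year


def annual_savings(before_pct: float, after_pct: float,
                   cost_per_hour: float) -> float:
    """Return the yearly cost avoided by raising availability from before_pct to after_pct."""
    hours_saved = downtime_hours(before_pct) - downtime_hours(after_pct)
    return hours_saved * cost_per_hour


if __name__ == "__main__":
    # Hypothetical cost figure for illustration only; substitute your own.
    cost_per_hour = 10_000.0

    before = downtime_hours(98)
    after = downtime_hours(99)
    print(f"98% availability: {before:.1f} hours of downtime per year")
    print(f"99% availability: {after:.1f} hours of downtime per year")
    print(f"Reduction: {(before - after) / before:.0%}")
    print(f"Estimated savings: ${annual_savings(98, 99, cost_per_hour):,.0f}")
```

The savings figure scales linearly with whatever downtime cost you plug in, which is why the headline number is so sensitive to the cost assumption.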

•

How to fix the root cause of a failed reliability test

You’re well on your way to becoming more reliable. You’ve added your services, found and fixed some Detected Risks, and run your first set of reliability tests. However, some of your tests returned as “Failed.”

Not to worry: this isn’t a reflection of you or your engineering skills, but rather an opportunity to learn more about how your systems work and, more importantly, how to make them more resilient.

In this blog, we’ll review Gremlin’s built-in reliability tests and what it means if they fail. We’ll explain what each test does, what a failure means, and how to address those failures to make your systems more resilient.

•

Maximizing your reliability on AWS

Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too—you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant.

There’s only one problem: cloud platforms are essentially black boxes. With no insight into how the platform is built, how it works, or how it handles failures, how can you design highly available workloads?

In this blog post, we’ll answer this question by examining some of the most popular AWS services, such as EC2, EKS, and ECS.

——

🚀 New releases!

Manage your reliability work more easily with Gremlin’s newest features

Reliability testing is ongoing work, and tracking that work can be difficult in large organizations. Engineers run one-off experiments, scheduled Scenarios run in the background, and, for more mature teams, CI/CD workflows fire off automated tests on demand.

According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on in Gremlin—until now.

We’re excited to announce the launch of three new screens for managing activity in Gremlin: Now Running, What’s Scheduled, and What Ran.

Read the release blog to learn more about these new features!

——

🖥️ Office Hours

Upcoming!

How to find and test critical dependencies with Gremlin

DATE: February 13th TIME: 11am PT/2pm ET

Pop quiz: what are all of the dependencies your services rely on?

If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

In this Office Hours session, we’ll walk through Gremlin’s Dependency Discovery feature: how it works, how to set it up, and how you can use it to make your services more resilient to slow or failed dependencies.

•

How to demonstrate your reliability progress

ON-DEMAND

You’ve spent a lot of time and energy improving your service’s reliability, but how do you demonstrate it? Gremlin provides tools to help you analyze and understand your organization’s reliability posture, including reliability reports, trends, scores, and more.

In this Office Hours session, we look in-depth at each of Gremlin’s reports. You’ll learn what they represent, what you can take away from them, and how you can best apply what you have learned to your own services.


——


Gremlin Reliability Newsletter
