🌞 June Newsletter: Create a sunny forecast for your AI and cloud systems with these insights and best practices!
💡 How-tos and best practices
GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour-and-28-minute outage affected dozens of companies and spanned 80+ Google services and products.
But what was really illuminating was just how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because they relied on services or vendors that did use GCP.
Fortunately, there is something you can do about it. Check out these testing best practices teams should follow to minimize the impact of large-scale outages so they don’t catch you by surprise.
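To make that concrete, here’s a minimal sketch (not taken from the linked article) of one such practice: verifying that a service degrades gracefully when an upstream dependency is unreachable. The function, endpoint, and fallback names (fetch_recommendations, recs.example.com, FALLBACK_ITEMS) are hypothetical.

```python
# Sketch: test that the service serves a degraded-but-working response
# when a third-party dependency is down. All names here are hypothetical.
from unittest import TestCase, mock

import requests

FALLBACK_ITEMS = ["popular-item-1", "popular-item-2"]  # safe static default

def fetch_recommendations(user_id: str) -> list[str]:
    """Call a third-party recommendations API, falling back to a static list."""
    try:
        resp = requests.get(f"https://recs.example.com/users/{user_id}", timeout=2)
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Dependency is down or slow: degrade gracefully instead of failing.
        return FALLBACK_ITEMS

class DependencyOutageTest(TestCase):
    def test_returns_fallback_when_dependency_is_down(self):
        # Simulate the vendor outage by making every outbound call fail.
        with mock.patch("requests.get", side_effect=requests.ConnectionError):
            self.assertEqual(fetch_recommendations("user-123"), FALLBACK_ITEMS)
```

The point isn’t the fallback itself; it’s that you learn how your service behaves during an outage before the outage happens.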
——
AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations.
But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity that requires a shift in your approach.
Check out this blog post for high-level takeaways from a recent roundtable with Gremlin CEO and Founder Kolton Andrus, Nobl9 CTO Alex Nauda, and Mandi Walls from PagerDuty, and watch the full conversation below!
🎙 Webinar with Nobl9 & PagerDuty
AI applications are complex. In some ways, managing their reliability is no different from managing any other application; in others, it demands an entirely new way of looking at things, with differences both in the techniques required and in the impact of failures, which in some cases means costly restarts of long-running processes.
Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep the uptime of AI applications high and avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.
🖥️ Office Hours
Upcoming!
DATE: July 17th, TIME: 11am PT/2pm ET
In this Office Hours session, we’ll look at some best practices for making your AWS workloads more resilient.
We’ll explore the various ways workloads can fail on AWS, the options and tools AWS provides to help improve reliability, and even parts of the AWS Well-Architected Framework. We’ll wrap up by looking at how Gremlin makes it easy to test AWS workloads, find common AWS reliability risks, and adhere to the Well-Architected Framework using Test Suites.
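For a taste of the kind of check we mean, here’s a small sketch (an illustration, not Gremlin’s implementation or part of the session) that uses boto3 to confirm a tagged workload’s EC2 instances span more than one Availability Zone, one of the redundancy questions the Well-Architected reliability pillar asks. The tag key and value ("workload": "checkout") are assumptions.

```python
# Sketch: count running EC2 instances per Availability Zone for a tagged
# workload and flag single-AZ concentration. Tag names are hypothetical.
from collections import Counter

import boto3

def az_spread(tag_key: str = "workload", tag_value: str = "checkout") -> Counter:
    """Return a Counter of running instances per AZ for the tagged workload."""
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")
    azs: Counter = Counter()
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                azs[instance["Placement"]["AvailabilityZone"]] += 1
    return azs

if __name__ == "__main__":
    spread = az_spread()
    print(spread)
    if len(spread) < 2:
        print("Reliability risk: workload is concentrated in a single AZ.")
```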
——
ON-DEMAND
Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?
In this Office Hours session, we’ll show you how to verify the scalability and redundancy of your systems by testing them directly. We’ll use Fault Injection to simulate large-scale failures, use observability tools to monitor the state of our systems, and discuss ways to apply our findings to make our systems more resilient.
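If you want a feel for the idea before watching, here’s a rough, hand-rolled sketch of one way to approximate losing an Availability Zone on Kubernetes: cordon every node in a single zone so nothing new schedules there, then watch how your workloads behave. (Gremlin’s Fault Injection simulates this more realistically, for example by blackholing network traffic.) The zone label value "us-east-1a" is an assumption.

```python
# Sketch: mark every node in one zone unschedulable to approximate an AZ
# outage, then observe workload behavior. Remember to uncordon afterwards.
from kubernetes import client, config

ZONE_LABEL = "topology.kubernetes.io/zone"
TARGET_ZONE = "us-east-1a"  # hypothetical zone to "fail"

def cordon_zone(zone: str) -> list[str]:
    """Cordon every node in the given zone and return their names."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    nodes = v1.list_node(label_selector=f"{ZONE_LABEL}={zone}")
    cordoned = []
    for node in nodes.items:
        # Cordoning blocks new scheduling; drain/evict pods for a fuller test.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        cordoned.append(node.metadata.name)
    return cordoned

if __name__ == "__main__":
    names = cordon_zone(TARGET_ZONE)
    print(f"Cordoned {len(names)} nodes in {TARGET_ZONE}: {names}")
    # To restore: patch each node back with {"spec": {"unschedulable": False}}.
```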