🌞 June Newsletter: Create a sunny forecast for your AI and cloud systems with these insights and best practices!
💡 How-tos and best practices
GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour-and-28-minute outage affected dozens of companies and spanned 80+ Google services and products.
But what was really illuminating was just how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because they relied on services or vendors that did use GCP.
Fortunately, there is something you can do about it. Check out these testing best practices teams should follow to minimize the impact of large-scale outages so they don’t catch you by surprise.
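To make that concrete, here’s a minimal sketch (not taken from the linked article) of one such practice: verifying that a service degrades gracefully when an upstream dependency is unreachable. The function, endpoint, and fallback names (fetch_recommendations, recs.example.com, FALLBACK_ITEMS) are hypothetical.

```python
# Sketch: test that the service serves a degraded-but-working response
# when a third-party dependency is down. All names here are hypothetical.
from unittest import TestCase, mock

import requests

FALLBACK_ITEMS = ["popular-item-1", "popular-item-2"]  # safe static default

def fetch_recommendations(user_id: str) -> list[str]:
    """Call a third-party recommendations API, falling back to a static list."""
    try:
        resp = requests.get(f"https://recs.example.com/users/{user_id}", timeout=2)
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Dependency is down or slow: degrade gracefully instead of failing.
        return FALLBACK_ITEMS

class DependencyOutageTest(TestCase):
    def test_returns_fallback_when_dependency_is_down(self):
        # Simulate the vendor outage by making every outbound call fail.
        with mock.patch("requests.get", side_effect=requests.ConnectionError):
            self.assertEqual(fetch_recommendations("user-123"), FALLBACK_ITEMS)
```

The point isn’t the fallback itself; it’s that you learn how your service behaves during an outage before the outage happens.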
——
AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations.
But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity that requires a shift in your approach.
Check out this blog post for high-level takeaways from a recent roundtable with Gremlin CEO and Founder Kolton Andrus, Nobl9 CTO Alex Nauda, and Mandi Walls from PagerDuty, and watch the full conversation below!
🎙 Webinar with Nobl9 & PagerDuty
AI applications are complex. In some ways, managing their reliability is no different from managing any other application; in others, it demands an entirely new way of looking at things, with differences both in the techniques required and in the impact of failures, which in some cases means costly restarts of long-running processes.
Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep the uptime of AI applications high and avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.
🖥️ Office Hours
Upcoming!
DATE: July 17th, TIME: 11am PT/2pm ET
In this Office Hours session, we’ll look at some best practices for making your AWS workloads more resilient.
We’ll explore the various ways workloads can fail on AWS, the options and tools AWS provides to help improve reliability, and even parts of the AWS Well-Architected Framework. We’ll wrap up by looking at how Gremlin makes it easy to test AWS workloads, find common AWS reliability risks, and adhere to the Well-Architected Framework using Test Suites.
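For a taste of the kind of check we mean, here’s a small sketch (an illustration, not Gremlin’s implementation or part of the session) that uses boto3 to confirm a tagged workload’s EC2 instances span more than one Availability Zone, one of the redundancy questions the Well-Architected reliability pillar asks. The tag key and value ("workload": "checkout") are assumptions.

```python
# Sketch: count running EC2 instances per Availability Zone for a tagged
# workload and flag single-AZ concentration. Tag names are hypothetical.
from collections import Counter

import boto3

def az_spread(tag_key: str = "workload", tag_value: str = "checkout") -> Counter:
    """Return a Counter of running instances per AZ for the tagged workload."""
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")
    azs: Counter = Counter()
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                azs[instance["Placement"]["AvailabilityZone"]] += 1
    return azs

if __name__ == "__main__":
    spread = az_spread()
    print(spread)
    if len(spread) < 2:
        print("Reliability risk: workload is concentrated in a single AZ.")
```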
——
ON-DEMAND
Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?
In this Office Hours session, we’ll show you how to verify the scalability and redundancy of your systems by testing them directly. We’ll use Fault Injection to simulate large-scale failures, use observability tools to monitor the state of our systems, and discuss ways to apply our findings to make our systems more resilient.
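If you want a feel for the idea before watching, here’s a rough, hand-rolled sketch of one way to approximate losing an Availability Zone on Kubernetes: cordon every node in a single zone so nothing new schedules there, then watch how your workloads behave. (Gremlin’s Fault Injection simulates this more realistically, for example by blackholing network traffic.) The zone label value "us-east-1a" is an assumption.

```python
# Sketch: mark every node in one zone unschedulable to approximate an AZ
# outage, then observe workload behavior. Remember to uncordon afterwards.
from kubernetes import client, config

ZONE_LABEL = "topology.kubernetes.io/zone"
TARGET_ZONE = "us-east-1a"  # hypothetical zone to "fail"

def cordon_zone(zone: str) -> list[str]:
    """Cordon every node in the given zone and return their names."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    nodes = v1.list_node(label_selector=f"{ZONE_LABEL}={zone}")
    cordoned = []
    for node in nodes.items:
        # Cordoning blocks new scheduling; drain/evict pods for a fuller test.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        cordoned.append(node.metadata.name)
    return cordoned

if __name__ == "__main__":
    names = cordon_zone(TARGET_ZONE)
    print(f"Cordoned {len(names)} nodes in {TARGET_ZONE}: {names}")
    # To restore: patch each node back with {"spec": {"unschedulable": False}}.
```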