🌞 June Newsletter: Create a sunny forecast for your AI and cloud systems with these insights and best practices!

💡 How-tos and best practices

How to be prepared for cloud provider outages

GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour, 28-minute outage affected dozens of companies and spanned 80+ Google services and products.

But what was really illuminating was how far the outage spread because of hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because their own dependencies, or the vendors they relied on, did use GCP.

Fortunately, there is something you can do about it. Check out these testing best practices teams should follow to minimize the impact of large-scale outages so they don’t catch you by surprise.
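
One way to start surfacing hidden dependencies like these is to write down what each service and vendor relies on, then walk that map to see which services can reach a given provider. The sketch below is only an illustration: the service names, the vendor names, and the DEPENDENCIES map are all hypothetical, and a real inventory would come from your own architecture docs and vendor disclosures.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service or vendor lists what it directly depends on.
DEPENDENCIES = {
    "checkout-api": ["payments-vendor", "postgres-on-aws"],
    "payments-vendor": ["auth-vendor"],
    "auth-vendor": ["gcp"],
    "postgres-on-aws": ["aws"],
    "status-page": ["aws"],
}


def exposure_path(service, provider, deps):
    """Return the shortest dependency chain from service to provider, or None."""
    queue = deque([[service]])
    seen = {service}
    while queue:
        path = queue.popleft()
        # Look at the direct dependencies of the last hop in this chain.
        for dep in deps.get(path[-1], []):
            if dep == provider:
                return path + [dep]
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return None


if __name__ == "__main__":
    for name in ("checkout-api", "status-page"):
        print(name, "->", exposure_path(name, "gcp", DEPENDENCIES))
```

In this made-up map, checkout-api never touches GCP directly but still reaches it through two vendors, which is the kind of chain worth covering in your outage tests.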

——

Insights to keep AI applications reliable

AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations.

But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity that requires a shift in your approach.

Check out this blog post for high-level takeaways from a recent roundtable with Gremlin CEO and Founder Kolton Andrus, Nobl9 CTO Alex Nauda, and Mandi Walls from PagerDuty, then watch the full conversation below!


🎙 Webinar with Nobl9 & PagerDuty

Complex AI, Fragile Systems: Proven Strategies for Maximizing Uptime

AI applications are complex. In some ways, managing their reliability is no different from managing any other application. In other ways, it calls for an entirely different approach: the techniques differ, and so does the impact of failures, which in some cases means costly restarts of long-running processes.

Join Nobl9, Gremlin, and PagerDuty for a roundtable discussion about what engineers can do to keep the uptime of AI applications high and avoid or lessen the impact of incidents. We’ll cover how SLOs, resilience testing, and incident response come together to support AI reliability.

WATCH ON-DEMAND


🖥️ Office Hours

Upcoming!

How to ensure your AWS workloads are resilient

DATE: July 17th, TIME: 11am PT/2pm ET

In this Office Hours session, we’ll look at some best practices for making your AWS workloads more resilient.

We’ll explore various ways workloads can fail on AWS, options and tools AWS provides you to help improve reliability, and even parts of the AWS Well-Architected Framework. We’ll wrap up by looking at how Gremlin makes it easy to test AWS workloads, find common AWS reliability risks, and help you adhere to the Well-Architected Framework using Test Suites.

REGISTER HERE

——

How to test your systems for scalability and redundancy with fault injection

ON-DEMAND

Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?

In this Office Hours session, we’ll show you how to test the scalability and redundancy of your systems directly. We’ll use Fault Injection to simulate large-scale failures, use observability tools to monitor the state of our systems, and discuss ways of using our findings to make our systems more resilient.

WATCH NOW


