Why chaos engineering is essential for enterprise resilience

Chaos engineering is a valuable framework for tamping down the unexpected problems lying in wait across every back-end system, argues R Systems’ Rohan Gupta. (Image: Shutterstock)

As enterprises continue their pursuit of cloud-native technologies to support increasingly distributed, always-on digital experiences, system reliability has never been more important. From content streaming and gaming platforms to e-commerce and financial services apps, enterprises face immense pressure to ensure the seamless performance of their products and services.

Over the last few years, several high-profile incidents have shown how latency, disruption, or the complete inaccessibility of popular platforms can cause chaos for enterprises.

One notable example is Spotify. Earlier this year, the audio streaming giant experienced an outage that lasted several hours for some, sparking dissatisfaction worldwide. It was not only bad publicity, but also harmful to the business, especially since many users pay a premium to listen to music or podcasts without ads. Another example is CrowdStrike. In 2024, the global cybersecurity firm experienced a significant incident that grounded airlines and disrupted critical industries like healthcare and financial services, making national headlines. Meta also faced a widespread disruption across multiple platforms in early 2024, causing millions of users to lose their ability to post content and refresh their feeds – a nightmare for consumers, influencers, and businesses.

As enterprises develop new platforms and applications – or improve existing ones – to deliver better experiences for end-users, they introduce new levels of complexity. This complexity, in turn, can create new vulnerabilities that can halt experiences and operations, ultimately impacting their bottom line. In fact, research from Oxford Economics and Splunk shows that Global 2000 companies, on average, lose $200 million annually due to unexpected failures in their digital environments.

Bringing chaos to testing and monitoring

Due to the complexity of today’s applications and platforms – and the costly consequences of potential failures – testing and monitoring are critical for enterprises and their resilience. However, it’s clear traditional methods are falling short, as shown by the examples above. Such testing and monitoring procedures often overlook vital distributed systems supporting an enterprise’s platform or app, or fail to identify potential blind spots inside them.

To address these limitations, enterprises can turn to chaos engineering – the practice of testing an ecosystem by running custom, confined experiments that expose single points of failure and build resilience, so that businesses can have confidence in their systems.

Enterprises should treat chaos engineering as a routine practice, much as sports teams treat training before every game. Teams would never take part in a match without understanding their opponent or ensuring they are in the best possible position to win. They train under pressure, run through potential scenarios, and test their plays to find the weaknesses of their opponents. The same mindset applies to enterprise engineering teams preparing for potential chaos in their environments.

Using chaos engineering to build resilience

By purposely simulating disruptions like server outages, latency, or dropped connections, or by identifying bugs and poor code, enterprises can position themselves to perform at their best when these scenarios occur in real life. They can adopt proactive approaches to detecting vulnerabilities, instituting recovery strategies, building trust in systems and, in the end, improving their overall resilience.
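As a rough illustration of what simulating dropped connections and latency can look like, the sketch below wraps a service call in a small fault injector and measures how often a retrying client still succeeds. Every name here (FaultInjector, call_with_retry, run_experiment) is hypothetical and invented for this example, and the fault rates are arbitrary assumptions; real experiments would typically inject faults at the network or infrastructure level rather than in application code.

```python
import random
import time


class FaultInjector:
    """Hypothetical wrapper that injects dropped connections and latency."""

    def __init__(self, latency_s=0.0, latency_rate=0.0, drop_rate=0.0,
                 seed=None, sleep=time.sleep):
        self.latency_s = latency_s        # delay added when latency is injected
        self.latency_rate = latency_rate  # probability of adding the delay
        self.drop_rate = drop_rate        # probability of a dropped connection
        self._rng = random.Random(seed)   # seeded so experiments are repeatable
        self._sleep = sleep               # injectable so tests need not wait

    def call(self, func, *args, **kwargs):
        if self._rng.random() < self.drop_rate:
            raise ConnectionError("injected fault: dropped connection")
        if self._rng.random() < self.latency_rate:
            self._sleep(self.latency_s)
        return func(*args, **kwargs)


def call_with_retry(injector, func, attempts=3):
    """The behaviour under test: retry a call up to `attempts` (>= 1) times."""
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(func)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def run_experiment(injector, func, requests=1000, attempts=3):
    """Return the fraction of requests that succeed despite injected faults."""
    succeeded = 0
    for _ in range(requests):
        try:
            call_with_retry(injector, func, attempts)
            succeeded += 1
        except ConnectionError:
            pass
    return succeeded / requests


if __name__ == "__main__":
    injector = FaultInjector(drop_rate=0.3, seed=42)
    rate = run_experiment(injector, lambda: "ok")
    print(f"success rate with 30% dropped connections: {rate:.3f}")
```

The point of the experiment is the comparison against a stated expectation: if the team believes three retries should keep the success rate above a given threshold, the measured rate either confirms that or exposes a gap before real traffic does.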

In the case of CrowdStrike, the company’s engineering team could have run controlled chaos experiments on endpoint update mechanisms. This would have aimed to test rollback strategies and ensure corrupted updates would not propagate across critical systems, minimising the risk of grounding airlines or disrupting essential industries.
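To make the idea of testing rollback strategies more concrete, here is a minimal sketch of a staged rollout that stops and rolls back as soon as a health check fails. It is not a description of how any vendor actually ships updates: the function staged_rollout, its callbacks, and the 1%, 10%, 100% wave sizes are all assumptions chosen for illustration.

```python
def staged_rollout(hosts, apply_update, is_healthy, rollback,
                   stages_percent=(1, 10, 100)):
    """Apply an update in growing waves (hypothetical helper).

    After each wave, every updated host is health-checked. On the first
    failure, all updated hosts are rolled back and the rollout stops.
    """
    updated = []
    for percent in stages_percent:
        # Integer arithmetic avoids floating-point surprises in wave sizes.
        target = max(1, len(hosts) * percent // 100)
        for host in hosts[len(updated):target]:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(h) for h in updated):
            for h in reversed(updated):
                rollback(h)
            return {"status": "rolled_back", "touched": len(updated)}
    return {"status": "complete", "touched": len(updated)}


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(100)]
    versions = {h: "v1" for h in hosts}

    # Chaos experiment: deliberately ship a corrupted update.
    result = staged_rollout(
        hosts,
        apply_update=lambda h: versions.__setitem__(h, "v2-corrupt"),
        is_healthy=lambda h: versions[h] != "v2-corrupt",
        rollback=lambda h: versions.__setitem__(h, "v1"),
    )
    print(result)
```

Feeding a deliberately corrupted update through such a pipeline is the experiment: it checks both that the damage stays confined to the first wave and that the rollback path actually restores the previous version.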

For Meta, the company would have benefited from simulating high-traffic surges coupled with API rate-limit failures to detect bottlenecks that caused content refresh issues. This would have allowed teams to fine-tune auto-scaling and caching layers proactively without causing disruptions for countless users.
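A traffic-surge experiment of this kind can be approximated offline before it is run against real infrastructure. The sketch below replays evenly spaced requests against a simple token-bucket rate limiter in simulated time and counts how many are rejected. The TokenBucket class, the simulate function, and all the numbers are hypothetical and say nothing about how any particular platform's limits are configured.

```python
class TokenBucket:
    """Simple token-bucket rate limiter driven by simulated timestamps."""

    def __init__(self, rate_per_s, capacity):
        self.rate = rate_per_s          # tokens refilled per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.last = 0.0                 # time of the previous request

    def allow(self, now):
        # Refill for the time elapsed since the last request, up to capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(limiter, requests_per_second, seconds):
    """Replay evenly spaced requests; return (accepted, rejected) counts."""
    accepted = rejected = 0
    for i in range(requests_per_second * seconds):
        now = i / requests_per_second
        if limiter.allow(now):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


if __name__ == "__main__":
    # Assumed figures: limit of 100 requests/s with a burst of 100.
    print("baseline:", simulate(TokenBucket(100, 100), 50, 10))
    print("surge:   ", simulate(TokenBucket(100, 100), 500, 10))
```

Comparing the baseline run with the surge run shows how much traffic the limiter sheds under load, which is the kind of signal teams can use when deciding how aggressively auto-scaling and caching need to respond.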

AI is beginning to play a more significant role in these efforts. Today, enterprises are integrating AI into their chaos engineering pipelines to accelerate root cause analysis (RCA) during and after these experiments. This is similar to how sports teams are increasingly leveraging AI to provide insights into performance, potential plays, and opponent weaknesses.

For example, when teams purposely spike memory or bring down servers as part of a test, AI can automatically generate remediation steps by analysing the root cause and providing guidance directly to engineering teams. This reduces manual effort and minimises the time it takes for engineering teams to resolve key issues not only in the experiments, but also in live scenarios.

Additionally, chaos engineering can help improve scalability within the organisation. Enterprises are constantly seeking ways to grow and enhance their apps or platforms so that more end-users can benefit from them, which helps them remain competitive and generate more revenue. Yet if there are cracks in the systems that power those apps or platforms, it can be extremely difficult to scale and deliver value to both customers and the organisation.

AI could also change how enterprises prepare for this by enabling their engineering teams to automatically generate new experiments based on modifications to their architecture or services. Engineering teams could eventually be supported by AI agents that can identify potential points of failure by analysing architectural changes, referencing known vulnerabilities from industry examples, and developing new custom experiments and remediation approaches accordingly.

Furthermore, chaos engineering encourages a mindset of preparedness and resilience within engineering teams. Instead of simply reacting to failures, engineering teams learn to anticipate and handle them proactively. This cultural shift improves incident response and decreases unexpected issues in production.

Using good chaos engineering to avoid the bad and ugly chaos of system failures

As enterprises continue to develop or expand their digital apps and platforms, it is important to remember that chaos is unavoidable.

Chaos engineering helps enterprises prepare for disruptive events in a safe and controlled way. This enables them to build resilience across their entire organisation and operations, from their apps to their teams. AI is increasingly supporting these efforts by making it easier for teams to identify and anticipate potential issues. However, a strong engineering culture rooted in proactivity, not reactivity, remains critical.

Just as sports teams need to prepare, practise, and learn from their mistakes before a big game, enterprises must apply discipline and foresight to succeed in today’s complex digital landscape. In the end, this will help enterprises deliver great customer experiences and grow their businesses. If they fail to prepare for chaos, they risk facing serious consequences that could damage their reputations and long-term success.

Rohan Gupta is the VP for Cloud, Security, & DevOps at R Systems