Error Budgets Aren’t Dead—They Just Grew Up
There was a time when error budgets were the toast of the SRE world. People talked about them with a kind of reverence, the way you might talk about a clever hack that somehow solves a deeply human problem with a bit of math and operational discipline.
The premise was deceptively simple: set a service-level objective—a target level of reliability, like 99.9% uptime—and then treat the gap between that target and perfection as your “error budget.” That budget was yours to spend. You could use it to ship faster, take calculated risks, or try something bold. But burn through it, and it was time to pump the brakes, stabilize, and reassess.
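The arithmetic behind that budget is worth seeing once, concretely. Here's a minimal sketch in Python; the 30-day window and the function name are illustrative choices, not prescribed by any particular tool:

```python
# Minimal sketch: deriving an error budget from an SLO target.
# The 30-day window is a common convention, used here as an assumption.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)

# A 99.9% SLO over 30 days leaves roughly 43 minutes to "spend".
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

That 43 minutes is the whole game: ship aggressively while it lasts, slow down as it runs out.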
It was a brilliant mechanism—a kind of economic model for reliability. You could finally have the hard conversation about velocity vs. stability using data, not just instincts. Engineers had numbers to defend caution. Product managers had room to move fast—as long as the system held up. Everyone was working toward the same north star.
But then, like many great ideas, error budgets started to falter under the weight of real-world implementation. Somewhere along the way, what was meant to be a collaborative tool turned into a compliance mechanism. A blunt instrument. And that’s when you started hearing it: the quiet grumbling on Slack, the conference hallway banter, the tired eye-rolls when someone brought them up.
“Error budgets are dead.”
It’s easy to see where the sentiment comes from.
In many organizations, error budgets became flashpoints. The incentives weren’t aligned. Product teams still got judged by how much they shipped—not how stable things stayed. So when a service blew through its budget, SREs found themselves saying “no” while everyone else was saying “go.” What began as a shared metric turned into a battlefield.
And when the budget ran out? The policies were often rigid and absolute. No deploys. No exceptions. Never mind the nuance of whether that incident was a one-off, or whether the next release might actually fix the root cause. To make matters worse, some teams started gaming the system—tweaking SLO definitions, splitting services to isolate burn, manipulating time windows so the graphs looked better than reality. Not exactly what the creators had in mind.
Some teams realized their error budgets were practically unusable. A single outage could wipe out a month’s worth of allowance. Suddenly, you’re stuck in release limbo over something that, in context, may not have even been a user-facing issue.
And then there was the tooling. Calculating an error budget correctly requires solid observability infrastructure, good telemetry coverage, and a team that knows how to interpret the numbers. For a lot of companies, that stack just wasn’t there yet. They liked the idea—but they didn’t have the maturity to act on it.
Eventually, leadership grew skeptical. Error budgets felt like technical overhead. The numbers didn’t always align with business impact. And when it came time to make real decisions—product launches, executive timelines—they were easy to ignore.
So yeah, in some places, they died. Or more accurately, they faded. Still running in the background, still displayed on dashboards—but no longer influencing behavior. No longer part of the conversation.
But here’s the thing: the concept was never broken. The implementation was.
And when applied with the right mindset, error budgets still have tremendous value.
The key is to stop treating them as rules and start using them as tools. You don’t need to slam the brakes every time the needle tips into the red. But if you never look at that needle, what’s the point of having a dashboard?
When a team’s error budget starts to burn, it should prompt a conversation, not a lockdown. Do we understand what’s causing the burn? Is it systemic, or just a blip? Are we about to make things better or worse? Is this the moment to shift focus to resilience—or is it worth accepting the risk for something bigger on the horizon?
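One way to make "prompt a conversation, not a lockdown" operational is a burn-rate check: compare how fast the budget is being spent against the pace that would exactly exhaust it by the end of the window. A hedged sketch, with an illustrative threshold:

```python
# Sketch: a burn-rate check that triggers a review, not a deploy freeze.
# The 2x threshold is an illustrative assumption, not a standard.

def burn_rate(budget_spent_fraction: float, window_elapsed_fraction: float) -> float:
    """How fast the budget is burning relative to the window.
    1.0 means on pace to spend exactly the budget by window's end."""
    if window_elapsed_fraction == 0:
        return 0.0
    return budget_spent_fraction / window_elapsed_fraction

def review_needed(spent: float, elapsed: float, threshold: float = 2.0) -> bool:
    """Flag a conversation when burn is well ahead of pace."""
    return burn_rate(spent, elapsed) >= threshold

# Half the budget gone one-tenth of the way through the window:
print(review_needed(0.5, 0.1))  # → True
```

The output of that check is an agenda item, not a verdict: the team still decides whether the burn is systemic or a blip.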
Some teams have moved toward more nuanced approaches—tiered budgets, for example. Mission-critical services get tighter SLOs, while supporting tools have more leeway. That reflects reality. Not all downtime is created equal, and pretending otherwise only breeds frustration.
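A tiered setup can be as simple as a table mapping services to targets. The service names and numbers below are hypothetical, just to show how unevenly the resulting budgets land:

```python
# Illustrative tiered SLO table: tighter targets for critical paths,
# more leeway for supporting tools. All names and targets are made up.

TIERED_SLOS = {
    "checkout-api":   0.9995,  # mission-critical
    "search-service": 0.999,   # important
    "admin-portal":   0.995,   # internal tool
}

def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowance in minutes for a given SLO over the window."""
    return window_days * 24 * 60 * (1 - slo)

for service, slo in TIERED_SLOS.items():
    print(f"{service}: {budget_minutes(slo):.1f} min/30d")
```

The checkout path gets roughly 22 minutes a month; the admin portal gets over three hours. Same mechanism, very different tolerance.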
Others have tied error budgets to business outcomes. Downtime that affects user experience or revenue carries more weight than an internal admin tool going dark for five minutes. Suddenly, the metric isn’t just a percentage—it’s a proxy for impact.
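One plausible way to express that weighting, sketched under the assumption that each outage gets an impact score between 0 and 1 (the scoring itself is the hard, human part):

```python
# Sketch: weighting raw downtime by user/business impact before it
# counts against the budget. The weights are illustrative assumptions.

def weighted_burn_minutes(outage_minutes: float, impact_weight: float) -> float:
    """Scale raw downtime by how much it actually hurt users (0.0-1.0)."""
    return outage_minutes * impact_weight

# Five minutes of an internal tool barely registers; the same five
# minutes on a revenue path counts nearly in full.
internal = weighted_burn_minutes(5, 0.1)   # ~0.5 weighted minutes
revenue = weighted_burn_minutes(5, 1.0)    # 5.0 weighted minutes
print(internal, revenue)
```

The point isn't the multiplication; it's that the budget now moves in proportion to impact, so a burned budget actually means something to the business.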
And maybe most importantly, the successful teams treat error budgets as shared resources. Not engineering metrics. Not SRE toys. They’re visible to product managers, customer success, leadership. They’re not wielded as sticks, but used to make smarter decisions. Do we pause development for a sprint? Do we invest in fixing that flaky dependency? Do we need to rethink how we test before deploying?
At one media company, early attempts to enforce error budgets flopped. Breach the budget and your pipeline froze. The result? Frustrated product teams, shadow deploys, and a growing sense of distrust. So they reworked it. Now, breaching a budget triggers a review. The team gets together, unpacks what happened, and decides—together—what to do next. No blame. Just learning. And guess what? They’ve seen fewer breaches, not more.
So no, error budgets aren’t dead. But the naive version of them—the one that assumes a single metric can govern complex systems with no room for judgment—that version should be buried.
If you’re building—or rebuilding—an SRE practice, don’t start by throwing them out. Start by rethinking how they’re used. Let them be early warning signs. Prioritization tools. A way to surface toil. A way to align teams without locking them into rigid frameworks.
Just don’t let the number drive the culture. Numbers don’t understand context. People do.
And maybe that’s the real lesson here. Error budgets were never meant to control teams. They were meant to empower them—to give them the visibility and confidence to ship responsibly, knowing when to go fast and when to slow down.
So the next time someone says, “error budgets are dead,” pause before you agree. Maybe the better question is: are we using them to shut people down? Or to open up better conversations?
Because when done right, an error budget isn’t a traffic cop. It’s a mirror. And in a world where systems are more complex and deploys more frequent than ever, having something that nudges us to reflect—to ask, are we okay with this risk?—feels not just alive, but essential.