SRE Is Not About Kubernetes — It’s Culture On Call

The line “SRE is about technology, not culture” sounds tidy until you meet reality at 03:17 on a Sunday when a cascade turns your pager into a metronome. The tools you chose yesterday matter—observability, deployment pipelines, graceful degradation—but what actually decides how fast you recover and how much you learn is the culture you built last quarter. That’s the uncomfortable truth underneath most heroic incident write-ups: the tech may break; the culture decides whether you bounce or bruise.

I’ve sat in incident reviews where every graph in the world couldn’t make a single person admit, “I was confused by the dashboard, so I rolled back the wrong service.” And I’ve seen the opposite: someone says exactly that, nobody flinches, and the team leaves with three improvements and a lighter on-call. Same tooling. Different culture. Very different reliability.

Why “Tech-Only” Breaks When It Matters Most

Complex systems fail in complex ways. Your elegant architecture diagram didn’t include “new hire misreads a runbook during noisy failover,” yet that’s precisely the kind of thing that knocks you off balance. Safety research from high-risk domains has been brutally clear for decades: organizational culture often explains disasters better than the broken part you can point at. If space agencies can have flawless hardware and still suffer because signals were ignored and dissent felt risky, our microservices don’t stand a chance without psychological safety and shared learning.

SRE learned this lesson early. Blameless postmortems exist for a reason: not because we’re soft, but because fear is a lousy debugging tool. When people are worried about careers, they edit reality. You don’t just lose candor; you lose data. That erases the very signal you needed to make the system safer next time.

Modern research on software delivery echoes the same theme: elite performance correlates with healthy culture—psychological safety, user-centric focus, and stable priorities—not just the latest stack. Platform engineering can pave roads, but a paved road is still useless if your drivers are afraid to report potholes. Culture isn’t a “nice to have”; it’s the operating system that all of your reliability practices run on.

Two Views, One Production Environment (and They’re Arguing in Your Slack)

Let’s stage the debate the way it actually happens at work:

Tech-First Tim: “We don’t need culture talks; we need SLOs, error budgets, progressive rollouts, and a platform team. If we automate enough, humans stop being the problem.”

Culture-First Casey: “You can’t automate trust. If engineers don’t feel safe surfacing weak signals, your SLOs are theater and your error budgets merely decorative.”

They’re both right—and wrong—at the same time. Tim’s right that practices like SLOs and error budgets transform arguments into data-driven agreements instead of status contests. But Casey’s right that those practices only work if people will stick their necks out and say, “we burned the budget, we freeze now,” without fearing retribution. The contract is technical; the compliance is cultural.

I’ve watched a team write a perfectly sensible error-budget policy and then never enforce it because a senior leader quietly labeled freezes “career-limiting.” The metrics said one thing; the culture said another. Guess who won? Conversely, I’ve seen teams with modest tooling outperform richer peers because they learned loudly from every incident, celebrated the folks who raised uncomfortable truths, and actually made time for reliability work.

The Latest Trends: Culture Keeps Showing Up in the Data

If you needed more than battle scars to believe it, recent research keeps underlining the human side. The 2024 DORA findings emphasize that organizations perform better when they balance platform investments with user-centric design, stable priorities, and developer experience. Those words—experience, stability, priorities—are managerial and cultural choices wrapped around technical capability. When teams treat internal platforms like products, adopt feedback loops with their “customers” (engineers), and run to a clear reliability strategy, they ship faster with fewer bruises. When they don’t, new tools show up as new toil.

And incident practice? The reason SRE books bang on about blameless postmortems isn’t because engineers love meetings. It’s because the boldest reliability improvements often start with “here’s the part I misunderstood” or “our docs were designed for senior staff.” You can’t buy that honesty; you can only create an environment where it’s safe to bring it.

Practical Culture, Not Posters

Let’s get concrete. Culture is not the mural in the lobby. It’s what gets rewarded on Tuesday. It’s whether the person who pulled the wrong lever last night is thanked for a candid timeline today. It’s whether your error-budget breach leads to a clear reliability sprint—backed by leadership—or a stealth feature push with “just one more” exception. It’s the difference between a postmortem that says “operator error” and one that says “two dashboards contradicted each other; the runbook hid step 7; the on-call was solo; now here’s what we’ll change.”

Two anecdotes to make this real. First, the DNS classic: the platform did a dead-simple, high-confidence deploy…which black-holed internal DNS. The fix was trivial. The learning was not: the runbook assumed the on-call knew an internal acronym nobody explained. After a blameless debrief, they added a glossary to the runbook template, highlighted “first five minutes” triage, and ran tabletop drills. Outage frequency didn’t change—randomness is stubborn—but duration plummeted because the first ten minutes became muscle memory.

Second, the error-budget freeze that wasn’t. The team burned through three quarters’ worth of budget in six weeks. The policy said “freeze new features.” The VP said, “not this quarter.” What followed was a hospital-grade on-call rotation, brittle morale, and a spectacular user-visible incident four weeks later that forced a freeze anyway. The next quarter, leadership made the policy explicit, public, and sacrosanct. Weirdly enough, product velocity increased. Why? People stopped negotiating the laws of physics every sprint and planned within them.

Three (Plus) Approaches That Actually Change Reliability

Turn SLOs into a social contract, not a spreadsheet.

Great SLOs translate user experience into engineering choices. The trick is to co-author them across product, SRE, and engineering leaders with crystal-clear error-budget rules. When the budget is healthy, the team chooses where to “invest” it—risky migrations, bigger features, chaos experiments. When it’s exhausted, reliability work isn’t an optional chore; it’s the plan. The cultural move is more important than the math: leadership must defend the contract when it’s inconvenient. Otherwise, you’ve automated a broken promise.
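
To make the contract concrete, here is a minimal sketch of the arithmetic behind it, assuming a 30-day window and a 99.9% availability SLO; the thresholds, states, and messages are illustrative, not a policy engine.

```python
# Minimal error-budget arithmetic; numbers are illustrative, not a policy engine.
WINDOW_MINUTES = 30 * 24 * 60                        # 30-day rolling window
SLO_TARGET = 0.999                                   # assumed 99.9% availability SLO
BUDGET_MINUTES = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes of tolerated downtime

def budget_status(downtime_minutes: float) -> str:
    """Map observed downtime onto the three states of the social contract."""
    remaining = BUDGET_MINUTES - downtime_minutes
    if remaining <= 0:
        return "FROZEN: budget exhausted, reliability work is the plan"
    if remaining < 0.25 * BUDGET_MINUTES:
        return "CAUTION: slow the risky rollouts, spend what is left deliberately"
    return "HEALTHY: budget available for migrations, big features, chaos experiments"

print(f"Budget this window: {BUDGET_MINUTES:.1f} minutes")
print(budget_status(downtime_minutes=12.0))
```

The point of writing it down this way is that “freeze” stops being a negotiation and becomes a precomputed state the whole team can see.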

Institutionalize blameless learning like it’s your most reliable service.

Postmortems should feel like CSI for systems, not HR for humans. Write them quickly while memories are fresh. Make them narrative so context survives. Publish them widely inside the org so patterns emerge. Close the loop by tracking action items until they’re done; unfinished follow-ups are where incidents love to respawn. And treat great postmortems as promotable artifacts—career assets, not confessions. The moment people see candor rewarded, your signal-to-noise improves overnight.
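
If it helps, here is a small sketch of the “close the loop” part, assuming action items are exported from your tracker as plain records; the field names and IDs are made up for illustration.

```python
# Sketch: flag postmortem action items that are past due so follow-ups don't quietly die.
# The records would normally come from your ticket tracker; these fields are illustrative.
from datetime import date

action_items = [
    {"id": "PM-101", "summary": "Add glossary to runbook template", "due": date(2025, 1, 15), "done": True},
    {"id": "PM-102", "summary": "Reconcile the two conflicting dashboards", "due": date(2025, 2, 1), "done": False},
]

def overdue(items, today=None):
    """Return unfinished items whose due date has passed."""
    today = today or date.today()
    return [item for item in items if not item["done"] and item["due"] < today]

for item in overdue(action_items):
    print(f"{item['id']} is overdue: {item['summary']}")
```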

Run platform engineering as a product with an opinionated, kind UX.

Internal platforms fail when they ship features instead of outcomes. If paved roads are bumpy, teams will take the scenic route and reliability will pay the toll. Assign product managers to your platform, set reliability SLOs for the platform itself, and measure time-to-first-deploy, rollback ease, and incident ergonomics. If an engineer can’t find the right dashboard in 30 seconds during an outage, your platform has a usability bug. Fix it like you would a production defect.
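
As a rough illustration of what “measure it” can look like, here is a sketch that turns time-to-first-deploy into a number the platform can hold an SLO against; the sample data and the 24-hour target are assumptions, not a standard.

```python
# Sketch: turn "time-to-first-deploy" into a number the platform can hold an SLO against.
# The sample data (hours per newly onboarded service) and the 24-hour target are assumptions.
from statistics import quantiles

hours_to_first_deploy = [3.5, 7.0, 26.0, 4.2, 55.0, 9.1, 12.0, 6.3, 18.5, 2.9]

p90 = quantiles(hours_to_first_deploy, n=10)[-1]   # 90th percentile
TARGET_HOURS = 24                                  # the platform's own SLO target

print(f"p90 time-to-first-deploy: {p90:.1f}h (target: {TARGET_HOURS}h)")
if p90 > TARGET_HOURS:
    print("Usability bug in the paved road: treat it like a production defect.")
```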

Budget for toil reduction like it’s cloud spend—because it is.

Set explicit targets for toil so SREs don’t drown in reactive work. The more time engineers have for engineering, the more reliability you buy for free. Burn down tickets that computers should do, automate the midnight recurring fixes, and retire sacred scripts nobody wants to touch. When leadership protects this time, it compounds like interest.
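
A minimal sketch of treating toil like spend, assuming you already track hours somewhere; the 50% cap echoes the commonly cited SRE guideline, but the right target is a local decision.

```python
# Sketch: track the share of on-call time that went to toil versus project work.
# The 50% cap echoes the widely quoted SRE guideline; pick the number that fits your team.
TOIL_CAP = 0.50

def toil_report(toil_hours: float, engineering_hours: float) -> str:
    total = toil_hours + engineering_hours
    ratio = toil_hours / total if total else 0.0
    verdict = "over budget: schedule automation work" if ratio > TOIL_CAP else "within budget"
    return f"Toil is {ratio:.0%} of tracked time ({verdict})."

print(toil_report(toil_hours=22, engineering_hours=18))
```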

Drill deliberately.

Tabletops, game days, and disaster role-playing don’t just harden systems; they calibrate humans. Practice the awkward handoffs, the first five minutes of triage, the “call it” moment where you route traffic and accept partial degradation. Keep the tone curious and respectful. Then make the next drill slightly harder. Your mean time to composure will plummet.

The Human Nature Bit We Pretend Isn’t There

SRE lives at the intersection of math and people. We measure budgets, latencies, and error rates. We build backoffs, retries, and graceful degradations. And then we ask a teammate to acknowledge an alert at 03:17 and make trade-offs under uncertainty while sleepy and imperfect. Reliability is empathy codified into systems: pacing changes so future you has capacity, exposing health so others can help, telling the truth about incidents so strangers you’ll never meet won’t repeat your pain.

A good culture doesn’t eliminate incidents. It shortens them, enriches the lessons, and makes the next on-call slightly less lonely. True reliability comes from culture, not Kubernetes. The tech enables; the culture decides.

Closing With A Wink

If the cure for failure were “more YAML,” our jobs would be easy. But the reason SRE is a career and not a script is the same reason your systems are interesting: people are involved. Click less, learn more, blame never—and enjoy the weirdly satisfying moment when your team ships faster after agreeing to slow down sometimes. That, friends, is the culture doing work your cluster can’t.

References


  1. DORA, Accelerate State of DevOps Report 2024: https://dora.dev/research/2024/dora-report/
  2. Google SRE Book, “Postmortem Culture: Learning from Failure”: https://sre.google/sre-book/postmortem-culture/
  3. Google SRE Workbook, “Implementing SLOs”: https://sre.google/workbook/implementing-slos/
  4. Atlassian, “How to run a blameless postmortem”: https://www.atlassian.com/incident-management/postmortem/blameless
  5. Columbia Accident Investigation Board Report, Volume I: https://ehss.energy.gov/deprep/archive/documents/0308_caib_report_volume1.pdf


#SRE #SiteReliability #DEVOPS #ReliabilityEngineering #Postmortems #PsychologicalSafety #PlatformEngineering #ErrorBudgets #Leadership #OnCall #DevOpsCulture


