Why Most Kubernetes Platforms Fail at Scale (And How to Build One That Doesn't)

⎈ Dario Tranchitella

Capsule and Kamaji maintainer

Published May 29, 2025

“Running Kubernetes isn't the hard part. Running multiple clusters, for multiple teams, with multiple SLAs — is.”

Most platform teams still think the hard part is managing etcd (the key/value store that holds the state of Kubernetes itself), tuning ingress controllers (which route external traffic to deployed applications), or tweaking node pools. But that’s not what kills internal platforms.

The true failure modes of Kubernetes at scale are organizational, not technical:

No clear multi-tenancy strategy — every team is treated as special.
No governance model — platform teams become manual gatekeepers.
No enablement layer — developer experience degrades with each new team.

Throughout my career, I've worked with several organizations aiming to build "the" Kubernetes platform, the holy grail of engineering, powered by engineers' NIH (not-invented-here) bias. As a consequence, those organizations poured sizeable amounts of money into bespoke Kubernetes setups that imploded under their own complexity: not because they couldn’t scale pods, but because they couldn't scale trust and ownership.

If you’re serious about scaling Kubernetes, you need to stop thinking like an ops team and start thinking like a product company backed by a framework.

The framework: 3 Layers of Platform Scale

Tenancy – How do you isolate, onboard, and delegate with security as a first-class citizen?
Governance – What’s observable, auditable, and automatable?
Enablement – Can product teams self-serve without breaking things?

❌ Ignore these, and your platform will collapse under what I call the Jira ticket support nightmare, along with unbearable shadow infra.

✅ Design for these, and you’ll unlock real leverage — faster team delivery, lower infra friction, and strategic visibility at the C-level.
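
As a rough illustration of how the three layers can fit together, here is a minimal Python sketch. Everything in it is hypothetical (the `Tenant` and `Platform` classes, the `max_namespaces` quota, the audit log format) and none of it is taken from Capsule, Kamaji, or any real Kubernetes API. It assumes tenancy can be modelled as a quota-bounded tenant with delegated owners, governance as an audit trail of every decision, and enablement as a self-service request decided by policy rather than by a ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Tenancy: an isolated unit with delegated owners and a hard quota (hypothetical model)."""
    name: str
    owners: set[str]
    max_namespaces: int
    namespaces: set[str] = field(default_factory=set)


@dataclass
class Platform:
    """Hypothetical platform facade tying the three layers together."""
    tenants: dict[str, Tenant] = field(default_factory=dict)
    # Governance: one (user, tenant, namespace, allowed) record per decision.
    audit_log: list[tuple[str, str, str, bool]] = field(default_factory=list)

    def onboard(self, tenant: Tenant) -> None:
        """Register a tenant so its owners can start self-serving."""
        self.tenants[tenant.name] = tenant

    def request_namespace(self, user: str, tenant_name: str, namespace: str) -> bool:
        """Enablement: a self-service request decided by policy, not by a ticket."""
        tenant = self.tenants.get(tenant_name)
        allowed = (
            tenant is not None
            and user in tenant.owners
            and namespace not in tenant.namespaces
            and len(tenant.namespaces) < tenant.max_namespaces
        )
        if allowed:
            tenant.namespaces.add(namespace)
        # Record every decision, whether it was allowed or denied.
        self.audit_log.append((user, tenant_name, namespace, allowed))
        return allowed
```

With a tenant onboarded as `Tenant(name="payments", owners={"alice"}, max_namespaces=1)`, a first request from `alice` succeeds, a request from a non-owner is denied, and a second request from `alice` is denied because the quota is reached. All three decisions land in `audit_log`, so nobody on the platform team has to act as a manual gatekeeper.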

This is my new LinkedIn newsletter, where I discuss Kubernetes, multi-cluster management, and Platform Engineering. I won't focus on why IdP is the new hype and yada-yada. Expect direct insights from the forefront of advanced engineering research, built for production-grade, enterprise-level resiliency.

Next week: Why Cluster API Is Quietly Eating Platform Engineering (and Why You Should Care).

The Platform Brief

Joanna Wyganowska

Octopus Deploy | CI/CD | DevOps | GitOps | Argo

2mo

Congratulations on the inaugural post! ⎈ Dario

Jimmy Ray

Author of Policy as Code - Improving Cloud Native Security

2mo

Enablement is where a lot of failures occur. Whether building or buying, the product you are pushing will fail if you don't have a well-thought-out enablement plan and a team focused on that plan. Your tenants should not have to be SMEs of the tech you are using for your product.

Phoebe Goh

Evangelist @ NetApp | Translating enterprise technology into human | Co-host of The STEMINISTS podcast

2mo

Yes, the technology is only as important as the outcomes it serves - and I think the success of Kubernetes has also been its biggest failing in some ways, because engineers were so successful in building it at first, it was never treated as a product. Now some shops might have to go backwards before they go forwards :) Looking forward to checking out your newsletter!

Luca Ravazzolo

2mo

Looking forward to the full article ⎈ Dario Tranchitella

Spas Atanasov

2mo

🆂🅸🅼🅿🅻🅸🅲🅸🆃🆈 🅼🅰🆃🆃🅴🆁🆂
