Chapter 6: Simplicity Made Simple

✍️ By Poojitha A S. Adapted and simplified from the Google SRE Book and lessons from Google’s Display Ads, Borg, Omega, and platform-wide SRE efforts.


Why Simplicity Is a Superpower in SRE

“A complex system that works is invariably found to have evolved from a simple system that worked.” — Gall’s Law

In SRE, simplicity = reliability.

Simple systems break less, recover faster, and are easier to maintain, test, and debug.

Simplicity isn’t just about clean code. It’s end-to-end: system design, tools, deployment pipelines, architecture diagrams, even onboarding and documentation.


Measuring Complexity: Easier Said Than Done

You can measure code complexity with metrics like cyclomatic complexity, but systems? Much harder.
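To make the code-level half of that concrete, here is a minimal sketch of an approximate cyclomatic complexity counter built on Python’s standard ast module. The function name and the set of counted node types are assumptions chosen for illustration; real tools count more constructs and differ in the details.

```python
import ast

# Node types treated as one extra decision point each.
# This is a simplification: production tools count more constructs.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1  # a straight-line program has exactly one path
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit decision points
            complexity += len(node.values) - 1
    return complexity


SAMPLE = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
score = cyclomatic_complexity(SAMPLE)  # if + and + for + base path
```

Nothing this mechanical exists for a whole system, which is why the proxies that follow matter.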

Here are a few proxies SREs use:

Training time : How long before a new engineer can go on-call?

Explanation time : Can you whiteboard the system in 10 minutes?

Configuration chaos : Are there 10 ways to set a flag?

Number of unique deployed configurations : How many distinct combinations of binaries, versions, and flags are actually running?

Age of the system : The older it gets, the more fragile it becomes (Hyrum’s Law strikes again)

TLDR: Complexity grows unless someone fights it. That “someone” is often you.

Why SREs Are Simplicity Champions

Systems evolve. They grow feature by feature, team by team. Complexity creeps in through retries, new dependencies, and undocumented changes.

The result? A change in one service breaks another 10 steps downstream.

That’s where SREs come in. We don’t just support our systems, we understand the entire stack. We’re the connective tissue between services, teams, and environments.

Simplicity is everyone’s job. But SREs make it happen.


Case Study 1: When “Flexible” Becomes a Trap

A startup built core APIs using flexible key/value bags. Everything was “simple” : no structured contracts.

Result?

❌ Poor documentation

❌ Breaking changes in every release

❌ Compatibility nightmares

✅ Lesson learned: Structured data types (like Protobufs or Thrift) force thoughtful design and documentation early, leading to simpler outcomes end-to-end.
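A rough sketch of the difference, using a Python dataclass as a stand-in for a Protobuf or Thrift schema. The names create_user_from_bag, CreateUserRequest, and create_user are hypothetical and do not come from the original case study.

```python
from dataclasses import dataclass


def create_user_from_bag(params: dict) -> str:
    # Key/value bag: callers must guess the key names, and a typo is
    # silently treated as a missing value.
    name = params.get("name", "")
    email = params.get("email", "")
    return f"{name} <{email}>"


@dataclass(frozen=True)
class CreateUserRequest:
    # Structured contract: the fields are the documentation.
    name: str
    email: str

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("name is required")
        if "@" not in self.email:
            raise ValueError("email must contain '@'")


def create_user(request: CreateUserRequest) -> str:
    return f"{request.name} <{request.email}>"


# A misspelled key slips through the bag and yields an empty email.
bag_result = create_user_from_bag({"name": "Ada", "emial": "ada@example.com"})

# The same typo against the structured type fails immediately.
try:
    CreateUserRequest(name="Ada", emial="ada@example.com")
    typo_rejected = False
except TypeError:
    typo_rejected = True
```

The structured version takes more lines up front, but the contract is visible in one place and mistakes surface at the call site instead of several services downstream.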


Case Study 2: Rewriting Isn’t Always Simpler

Borg, Google’s internal container manager, grew complex. So the team began building Omega: a clean, principled replacement.

Reality check?

❌ Borg evolved faster than Omega

❌ Migration was near-impossible (thousands of services, millions of lines)

❌ Cost of dual-maintenance was too high

✅ What worked: Taking Omega’s ideas and feeding them back into Borg

✅ Bonus: Those same concepts helped launch Kubernetes

Don’t rewrite just to “start fresh.” Improve what you have. Make simplicity iterative.


Case Study 3: Taming the Display Ads Spiderweb

Ads SREs managed interconnected systems from DoubleClick, AdMob, AdSense, and more.

Problem:

  • Endless config permutations

  • Loops in query flows

  • Impossible-to-debug traffic paths

Solution:

✅ Unified standards

✅ One way to copy data, monitor, configure

✅ Gradual flag removal

✅ Consolidated servers

“System smell” is real. If you’re rewriting requests to pass through multiple engines, you have a design problem.


Case Study 4: Microservices at Scale Without Chaos

Google’s social SRE teams were overwhelmed by every team having its own stack.

They built a shared platform:

  • One set of CI/CD tools

  • Unified release + monitoring experience

  • Tiered SRE engagement (from light to deep)

✅ Services gained reliability

✅ Engineers switched teams easily

✅ SREs stopped being a bottleneck

Standardization isn’t just cleaner, it makes scale manageable.


Case Study 5: pDNS Loops Back on Itself

Google’s production DNS (pDNS) depended on Svelte for lookup. But Svelte used pDNS. 😬

Cold-starting the system? Impossible.

Fix:

✅ Local IP list for Svelte

✅ Whitelisted service access

✅ Removed the circular dependency

Design like your system might go cold one day. Because it might.
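Circular dependencies like this one can be caught mechanically if you keep a map of which service depends on which. Below is a minimal sketch using depth-first search. The dictionary format, the find_cycle name, and the "local_ip_list" node are assumptions for illustration, not a description of how Google models its dependencies.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None."""
    IN_PROGRESS, DONE = 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep) == IN_PROGRESS:
                # dep is already on the current path: we looped back
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for service in deps:
        if service not in state:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Before the fix: each system needs the other in order to start.
before = {"pdns": ["svelte"], "svelte": ["pdns"]}
# After the fix: Svelte reads a local IP list instead of calling pDNS.
after = {"pdns": ["svelte"], "svelte": ["local_ip_list"]}

cycle_before = find_cycle(before)
cycle_after = find_cycle(after)
```

Running a check like this in CI against a declared dependency map is one way to spot a cold-start trap before an outage does.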


Regaining Simplicity Is an Engineering Investment

Simplification usually means removing, not adding

🔁 Simplification often means replacing duplicate work with shared services

🏆 Celebrate it! Google literally gives “Zombie Code Slayer” badges for major code deletions


What You Can Do as an SRE

✅ Encourage system diagramming — before going on-call

✅ Review every design doc for complexity impact

✅ Track and reward simplification projects like feature launches

✅ Allocate 10% engineering time for simplicity work

✅ Create a rotating team with full-stack visibility

✅ Watch for:

  • Amplification: Error retries causing 10x RPCs

  • Cyclic dependencies: One cold start away from failure
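The amplification item is mostly arithmetic: when every layer in a call chain retries independently, the worst-case attempt counts multiply. A small sketch, assuming a hypothetical three-layer chain where the numbers are illustrative only:

```python
from typing import List


def worst_case_amplification(attempts_per_layer: List[int]) -> int:
    """Worst-case calls reaching the last layer for one user request.

    Each entry is the maximum number of attempts (first try plus retries)
    that one layer makes against the layer below it.
    """
    total = 1
    for attempts in attempts_per_layer:
        if attempts < 1:
            raise ValueError("each layer makes at least one attempt")
        total *= attempts
    return total


# Frontend, middle tier, and storage client each try up to 3 times.
stacked = worst_case_amplification([3, 3, 3])
# Retrying at only one layer keeps the worst case linear.
single_layer = worst_case_amplification([3, 1, 1])
```

With three layers at three attempts each, one failing user request can turn into 27 calls on the bottom service, right when it is least able to absorb them.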

TLDR

✅ Simplicity = reliability

✅ Complexity grows on its own. Simplicity requires effort.

✅ Rewrites aren’t always simpler. Improve what you’ve got.

✅ Celebrate code deletion as much as code creation.

✅ SREs must lead the push—no one else sees the system end-to-end


🎧 Want to Learn More?

Books

  • The Google SRE Book

  • Software Engineering at Google

Talks & Podcasts

  • Google Prodcast – Internal system design breakdowns

  • The Art of Software Simplicity – GOTO Conference talks


Credits

Based on Google SRE Book – Chapter 7: Simplicity. Case studies adapted from Display Ads, Borg, Omega, and production DNS efforts.

📬 New drops every Monday, Wednesday, and Friday

👉 Subscribe now — No fluff, just field-tested DevOps wisdom
