Chapter 6: Simplicity Made Simple
✍️ By Poojitha A S
Adapted and simplified from the Google SRE Book and lessons from Google’s Display Ads, Borg, Omega, and platform-wide SRE efforts
Why Simplicity Is a Superpower in SRE
“A complex system that works is invariably found to have evolved from a simple system that worked.” — Gall’s Law
In SRE, simplicity = reliability.
Simple systems break less, recover faster, and are easier to maintain, test, and debug.
Simplicity isn’t just about clean code. It’s end-to-end: system design, tools, deployment pipelines, architecture diagrams, even onboarding and documentation.
Measuring Complexity: Easier Said Than Done
You can measure code complexity with tools like cyclomatic complexity, but systems? Much harder.
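To make the code side concrete, here is a minimal sketch of a cyclomatic complexity counter built on Python's standard `ast` module. It is an approximation for illustration, not how any particular tool computes the metric: the set of "branch" nodes is an assumption, and `SAMPLE` is a made-up function.

```python
import ast

# Node types counted as decision points (a simplified, assumed set).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds one extra path per operator.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical function used only to exercise the counter.
SAMPLE = '''
def route(request):
    if request.retry and request.attempts < 3:
        return "retry"
    for backend in request.backends:
        if backend.healthy:
            return backend.name
    return "fail"
'''

print(cyclomatic_complexity(SAMPLE))  # 1 base + if + and + for + if = 5
```

A number like this is easy to get for a single function. Nothing comparable exists for "three services, two queues, and a config system", which is why system-level complexity needs proxies.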
Here are a few proxies SREs use:
✅ Training time : How long before a new engineer can go on-call?
✅ Explanation time : Can you whiteboard the system in 10 minutes?
✅ Configuration chaos : Are there 10 ways to set a flag?
✅ Number of unique configurations : How many distinct binaries, versions, and flag sets are actually deployed?
✅ Age of the system : The older it gets, the more fragile it becomes (Hyrum’s Law strikes again)
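The "how many configs are actually deployed" proxy is one of the few you can count directly. Below is a rough sketch, assuming you can export a fleet inventory as (binary, version, flags) records. The `Deployment` type and the sample data are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Deployment(NamedTuple):
    """One running task in a hypothetical fleet inventory."""
    binary: str
    version: str
    flags: frozenset[str]


def config_diversity(fleet: list[Deployment]) -> Counter:
    """Count how many tasks run each distinct (binary, version, flags) combo."""
    return Counter(fleet)


# Made-up inventory: four tasks of the same service.
fleet = [
    Deployment("frontend", "1.4", frozenset({"--cache=on"})),
    Deployment("frontend", "1.4", frozenset({"--cache=on"})),
    Deployment("frontend", "1.3", frozenset({"--cache=off"})),
    Deployment("frontend", "1.4", frozenset({"--cache=on", "--debug"})),
]

unique = config_diversity(fleet)
print(len(unique), "distinct configurations across", len(fleet), "tasks")
```

If the distinct count keeps climbing while the fleet size stays flat, that is a hint that one-off configurations are accumulating.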
TLDR: Complexity grows unless someone fights it. That “someone” is often you.
Why SREs Are Simplicity Champions
Systems evolve. They grow feature by feature, team by team. Complexity creeps in through retries, new dependencies, undocumented changes.
The result? A change in one service breaks another 10 steps downstream.
That’s where SREs come in. We don’t just support our systems, we understand the entire stack. We’re the connective tissue between services, teams, and environments.
Simplicity is everyone’s job. But SREs make it happen.
Case Study 1: When “Flexible” Becomes a Trap
A startup built core APIs using flexible key/value bags. Everything was “simple”: no structured contracts.
Result?
❌ Poor documentation
❌ Breaking changes in every release
❌ Compatibility nightmares
✅ Lesson learned: Structured data types (like Protobufs or Thrift) force thoughtful design and documentation early, leading to simpler outcomes end-to-end.
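Here is a small sketch of the difference. Python dataclasses stand in for Protobuf or Thrift messages, and `CreateUserRequest` and its fields are hypothetical. Note that dataclasses do not enforce field types at runtime. The point is only that a named, documented contract catches mistakes a key/value bag lets through.

```python
from dataclasses import dataclass


def create_user_untyped(params: dict) -> str:
    # Key/value bag: a misspelled key or a wrong type only surfaces at
    # runtime, inside the handler, if it surfaces at all.
    return f"{params['name']} <{params['email']}>"


@dataclass(frozen=True)
class CreateUserRequest:
    """Hypothetical structured contract: fields are named and documented."""
    name: str
    email: str
    is_admin: bool = False  # an optional field with a default stays compatible

    def __post_init__(self) -> None:
        # Validation lives with the contract, not scattered across callers.
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")


def create_user(req: CreateUserRequest) -> str:
    return f"{req.name} <{req.email}>"


print(create_user(CreateUserRequest(name="Ada", email="ada@example.com")))
```

With the structured version, a caller who passes a misspelled field name gets a `TypeError` at construction time instead of a `KeyError` somewhere downstream.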
Case Study 2: Rewriting Isn’t Always Simpler
Borg, Google’s internal container manager, grew complex. So the team began building Omega, a clean, principled replacement.
Reality check?
❌ Borg evolved faster than Omega
❌ Migration was near-impossible (thousands of services, millions of lines)
❌ Cost of dual-maintenance was too high
✅ What worked: Taking Omega’s ideas and feeding them back into Borg
✅ Bonus: Those same concepts helped launch Kubernetes
Don’t rewrite just to “start fresh.” Improve what you have. Make simplicity iterative.
Case Study 3: Taming the Display Ads Spiderweb
Ads SREs managed interconnected systems from DoubleClick, AdMob, AdSense, and more.
Problem:
Endless config permutations
Loops in query flows
Impossible-to-debug traffic paths
Solution:
✅ Unified standards
✅ One way to copy data, monitor, configure
✅ Gradual flag removal
✅ Consolidated servers
“System smell” is real. If you’re rewriting requests to pass through multiple engines, you have a design problem.
Case Study 4: Microservices at Scale Without Chaos
Google’s social SRE teams were overwhelmed by every team having its own stack.
They built a shared platform:
One set of CI/CD tools
Unified release + monitoring experience
Tiered SRE engagement (from light to deep)
✅ Services gained reliability
✅ Engineers switched teams easily
✅ SREs were no longer a bottleneck
Standardization isn’t just cleaner, it makes scale manageable.
Case Study 5: pDNS Loops Back on Itself
Google’s production DNS (pDNS) depended on Svelte for lookup. But Svelte used pDNS. 😬
Cold-starting the system? Impossible.
Fix:
✅ Local IP list for Svelte
✅ Whitelisted service access
✅ Removed the circular dependency
Design like your system might go cold one day. Because it might.
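One way to catch this class of problem before an outage is to check the startup dependency graph for cycles. The sketch below uses `graphlib` from the Python standard library (3.9+). The two graphs are a heavily simplified, assumed model of the pDNS/Svelte setup, not Google's actual topology.

```python
from graphlib import CycleError, TopologicalSorter


def cold_start_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which services can start from cold.

    `deps` maps each service to the services it needs at startup.
    Raises CycleError if no such order exists.
    """
    return list(TopologicalSorter(deps).static_order())


# Assumed model of the original setup: each service needs the other.
before = {"pdns": {"svelte"}, "svelte": {"pdns"}}

# Assumed model after the fix: Svelte is found via a local IP list instead.
after = {"pdns": {"svelte"}, "svelte": {"local_ip_list"}, "local_ip_list": set()}

try:
    cold_start_order(before)
except CycleError as err:
    print("cannot cold-start, cycle found:", err.args[1])

print(cold_start_order(after))  # ['local_ip_list', 'svelte', 'pdns']
```

A check like this could run in CI against a declared dependency manifest, assuming your services declare their startup dependencies somewhere machine-readable.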
Regaining Simplicity Is an Engineering Investment
Simplification usually means removing, not adding
🔁 Simplification often means replacing duplicate work with shared services
🏆 Celebrate it! Google literally gives “Zombie Code Slayer” badges for major code deletions
What You Can Do as an SRE
✅ Encourage system diagramming — before going on-call
✅ Review every design doc for complexity impact
✅ Track and reward simplification projects like feature launches
✅ Allocate 10% engineering time for simplicity work
✅ Create a rotating team with full-stack visibility
✅ Watch for:
Amplification: Error retries causing 10x RPCs
Cyclic dependencies: One cold start away from failure
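The amplification math is worth doing once by hand. The sketch below shows the worst case when every layer retries independently, plus a retry budget as one common mitigation. The function and class names are hypothetical, and the 10% ratio is an assumed default, not a figure from the case studies.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Bottom-layer calls triggered by one user request if every layer retries.

    Each layer makes up to `attempts_per_layer` attempts (1 call + retries),
    and every attempt fans out again at the layer below.
    """
    return attempts_per_layer ** layers


class RetryBudget:
    """Allow retries only while they stay within `ratio` of total requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Refuse the retry if it would push retries over the budget.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


print(worst_case_calls(attempts_per_layer=3, layers=3))  # 27
```

Three attempts at each of three layers turns one user request into up to 27 calls at the bottom of the stack. With a budget, retries are capped as a fraction of real traffic, so a struggling backend sees a bounded increase instead of a multiplied one.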
TLDR
✅ Simplicity = reliability
✅ Complexity grows on its own. Simplicity requires effort.
✅ Rewrites aren’t always simpler. Improve what you’ve got.
✅ Celebrate code deletion as much as code creation.
✅ SREs must lead the push—no one else sees the system end-to-end
🎧 Want to Learn More?
Books
The Google SRE Book
Software Engineering at Google
Talks & Podcasts
Google Prodcast – Internal system design breakdowns
The Art of Software Simplicity – GOTO Conference talks
Tools That Help
Structurizr – Diagram-as-code for systems
SonarQube – Detect complexity in code
Protocol Buffers – Design once, scale forever
Credits
Based on The Site Reliability Workbook (Google), Chapter 7: Simplicity. Case studies adapted from Display Ads, Borg, Omega, and production DNS efforts.