Chapter 2: SLOs Made Simple

✍️ By Poojitha A S. Adapted and simplified from Google’s The Site Reliability Workbook, with real-world examples from Evernote and The Home Depot and lessons you can apply today


Why SLOs Actually Matter

SLOs (Service Level Objectives) aren’t just impressive-sounding figures you slap on a dashboard. They’re shared agreements that set clear expectations for how systems should behave. Think of them as an early-warning system. When something drifts out of bounds, you’ll know before users hit the panic button.

They help answer:

  • How fast should we fix this?

  • Should we hold off on that release?

  • Is it time to focus on reliability instead of adding new stuff?

But here's the thing. SLOs aren’t just about numbers or uptime percentages. They’re about how teams collaborate, make decisions, and balance risk. So let’s cut the jargon and get to the heart of it.

“The whole point of SLOs is to support the notion of gradual improvement.”


How to Build SLOs

  1. Choose the right SLIs: These are your signals like latency, availability, error rates, and throughput.

  2. Set meaningful SLOs: Targets that reflect reality, such as 99.9% uptime or latency under 300 ms.

  3. Define your error budget: How much failure can you tolerate before it’s a problem?

  4. Write a response plan: What happens when you run out of budget?

  5. Get buy-in from everyone: Dev, Ops, and Product all need to care.

  6. Measure, review, adjust: Use real data and update when things change.
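
Steps 2 through 4 come down to simple arithmetic. Here is a minimal sketch in Python, assuming a request-based availability SLI. The function names, the 99.9% target, the request counts, and the 25% warning threshold are all illustrative assumptions, not values from the case studies.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window under the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("error budget is zero (100% target or no traffic)")
    return 1.0 - failed_requests / budget


def release_decision(remaining: float) -> str:
    """Hypothetical response plan keyed on the remaining budget fraction."""
    if remaining <= 0:
        return "freeze releases and focus on reliability"
    if remaining < 0.25:  # assumed warning threshold
        return "slow down and review risky changes"
    return "ship as planned"


if __name__ == "__main__":
    remaining = budget_remaining(0.999, 1_000_000, 400)
    print(f"{remaining:.0%} of budget left: {release_decision(remaining)}")
```

The point of writing the response plan as code is that nobody has to argue about it during an incident: the budget number decides.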

“If you’re not going to take action when an SLO is violated, don’t bother setting it.”

A Look at Evernote’s SLO Journey

The issue? Constant tension between Dev and Ops. One side wanted to ship fast. The other was tired of cleaning up the mess. Everyone felt the friction.

What changed? They moved to the cloud and introduced SLOs.

Here’s what that looked like:

  • Picked a high-impact SLO: 99.95% uptime

  • Used external tools like Pingdom to verify availability

  • Set a simple rule: if probes failed from two separate regions, it counted as downtime

  • Focused on user impact, not just backend metrics
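
The downtime rule above can be sketched in a few lines of Python. This is an illustration only: the probe record layout, the region names, the one-probe-per-minute interval, and the 30-day window are assumptions, not details of Evernote's or Pingdom's actual setup.

```python
from collections import defaultdict


def downtime_minutes(probes, min_failed_regions=2):
    """Count minutes in which probes failed from enough distinct regions.

    probes: iterable of (minute, region, ok) tuples, one per probe result.
    A minute counts as downtime only if at least `min_failed_regions`
    different regions reported a failure during it.
    """
    failed_regions = defaultdict(set)
    for minute, region, ok in probes:
        if not ok:
            failed_regions[minute].add(region)
    return sum(
        1 for regions in failed_regions.values()
        if len(regions) >= min_failed_regions
    )


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime allowance in minutes for an uptime target over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    probes = [
        (0, "us-east", False), (0, "eu-west", True),   # one region failing: not downtime
        (1, "us-east", False), (1, "eu-west", False),  # two regions failing: downtime
        (2, "us-east", True),  (2, "eu-west", True),
    ]
    used = downtime_minutes(probes)
    allowed = allowed_downtime_minutes(0.9995)
    print(f"used {used} of {allowed:.1f} allowed minutes")
```

At 99.95% over a 30-day window, the allowance works out to about 21.6 minutes. Requiring failures from two regions keeps a single flaky probe location from eating into it.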

What came out of it?

  • Shared goals for Dev and Ops

  • Monthly SLO reviews with Google’s Customer Reliability Engineering team

  • Less downtime, fewer finger-pointing meetings

“Perfect is the enemy of good. Start simple, and evolve your SLOs.”

Inside The Home Depot’s VALET Framework

Massive company, complex systems, and everyone was using different metrics. That made it hard to align.

To fix that, they created the VALET framework:

  • Volume: Can the system handle what’s coming?

  • Availability: Is it up and running?

  • Latency: Is it fast enough?

  • Errors: Is it giving the right responses?

  • Tickets: Does a request need manual intervention, such as a support ticket, to complete?
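
One way to picture VALET is as a single record per service, with one number per dimension, compared against a record of targets. The sketch below is hypothetical: the field names, units, and sample values are assumptions for illustration, not The Home Depot's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Valet:
    volume_rps: float    # Volume: requests per second the service handles
    availability: float  # Availability: fraction of the window it was up (0..1)
    latency_ms: float    # Latency: response time in milliseconds
    error_rate: float    # Errors: fraction of responses that failed (0..1)
    tickets: int         # Tickets: tickets raised against the service in the window


def violations(measured: Valet, target: Valet) -> list[str]:
    """Return the VALET dimensions where the measured value misses the target."""
    missed = []
    if measured.volume_rps < target.volume_rps:
        missed.append("volume")
    if measured.availability < target.availability:
        missed.append("availability")
    if measured.latency_ms > target.latency_ms:
        missed.append("latency")
    if measured.error_rate > target.error_rate:
        missed.append("errors")
    if measured.tickets > target.tickets:
        missed.append("tickets")
    return missed


if __name__ == "__main__":
    target = Valet(volume_rps=500, availability=0.9995, latency_ms=300,
                   error_rate=0.01, tickets=5)
    measured = Valet(volume_rps=800, availability=0.9990, latency_ms=350,
                     error_rate=0.002, tickets=3)
    print(violations(measured, target))
```

Keeping all five dimensions in one shape is what lets different teams report in the same terms.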

What started as a framework turned into a full culture shift:

  • Internal training and workshops

  • Custom dashboards and reports

  • Evangelism campaigns, even t-shirts

  • Chatbots and BigQuery pipelines to auto-report VALET stats

📉 What improved?

  • Root cause analysis got faster

  • Surprises dropped

  • Teams finally spoke the same language


🎧 Want to Go Deeper?

Podcasts to Bookmark:

  • Google SRE Prodcast: Engineers unpacking what keeps systems stable

  • Screaming in the Cloud: Where SLOs meet real-world team dynamics

Tools to Explore:

  • Nobl9: Great for visualizing error budgets

  • Pingdom and New Relic: Popular for tracking SLIs

Books Worth Reading:

  • The Art of Monitoring by James Turnbull

  • Reliable Machine Learning by Cathy Chen et al.

“The act of writing an SLO is more valuable than the number itself.”

TL;DR:

✅ Choose SLIs that reflect what your users actually care about

✅ Set SLOs that are ambitious but realistic

✅ Foster a culture of reliability, not finger-pointing

✅ Let the data guide you, but don’t forget the human side

Want a free SLO template in Notion and Excel? Drop an “SLO READY” in the comments.

Credits

  • Inspired by “Implementing SLOs” and “SLO Engineering Case Studies” from Google’s The Site Reliability Workbook (CC BY-NC-ND 4.0)

  • Real-world examples from Evernote and The Home Depot

  • Simplified and written by Poojitha A S for the DevOps Made Simple community


