Put some SRE
in your microservices
Hard-won lessons from the world of SRE
can be applied to the nascent world of microservices.
The many faces of
Theo Schlossnagle
@postwait
CEO Circonus
The nature of the problem
Software Sucks
Once you’ve run software at scale,
you have a deep understanding of
how it is all tied together with
loose string and hope.
All software will fail, but
good software
fails well
• Consider the phrase:
“Have you used X in anger?”
Never undervalue grace in failure.
Rule 𝛌1: Crash landings should be both
fast and controlled.
What it means to
fail quickly & safely
• The scope of failure should
collapse completely.
• The time to failure should be
measured in small multiples of
normal service time
• Nothing outside the scope of
failure should be impacted.
https://www.youtube.com/watch?v=5SL1A2d2e7M
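One way to read "time to failure measured in small multiples of normal service time" is as a hard deadline on every call to a dependency. The sketch below is an illustration only, not something from the talk: `NORMAL_SERVICE_TIME_S`, `DEADLINE_MULTIPLE`, `FailedFast`, and `call_with_deadline` are hypothetical names, and the numbers are assumptions.

```python
import concurrent.futures

NORMAL_SERVICE_TIME_S = 0.05  # assumed typical latency of the dependency
DEADLINE_MULTIPLE = 3         # assumed "small multiple" of that latency


class FailedFast(Exception):
    """Raised when a call is abandoned because it blew its deadline."""


def call_with_deadline(pool, fn, *args):
    """Run fn(*args) on the pool, giving up after a small multiple of
    the normal service time instead of waiting indefinitely."""
    deadline_s = NORMAL_SERVICE_TIME_S * DEADLINE_MULTIPLE
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the work has not started yet; work that
        # is already running keeps its worker thread until it returns.
        future.cancel()
        raise FailedFast(f"no answer within {deadline_s:.3f}s") from None
```

The caller fails quickly here, but the abandoned work still occupies a thread, so the scope of failure has not fully collapsed. Getting that part right needs cooperation from the callee (for example, socket timeouts or a cancellation signal).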
Autopsies: not just for medicine.
Rule 𝛌2: Post-mortems are
fundamental.
Pragmatic analysis is required to
understand failure’s
true nature
• Post-mortem analysis is critical
• Stack traces
• Forensic logs
• Images (cores, dumps, etc.)
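Stack traces and forensic logs only exist after the fact if the process was set up to leave them behind. A minimal Python sketch of that setup follows, using the standard `faulthandler` and `logging` modules; the function name `enable_forensics`, the logger name, and the file paths are hypothetical. Core images are an operating-system setting and are not covered here.

```python
import faulthandler
import logging
import sys


def enable_forensics(log_path, crash_path):
    """Arrange for evidence to survive a failure:
    - a timestamped forensic log written to log_path
    - Python stack traces of all threads on fatal signals, to crash_path
    - uncaught exceptions logged with their full traceback
    """
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log = logging.getLogger("forensics")
    log.setLevel(logging.INFO)
    log.addHandler(handler)

    # faulthandler needs the file to stay open for the process lifetime,
    # so the handle is returned to the caller rather than closed here.
    crash_file = open(crash_path, "a")
    faulthandler.enable(file=crash_file, all_threads=True)

    def log_uncaught(exc_type, exc, tb):
        log.critical("uncaught exception", exc_info=(exc_type, exc, tb))

    sys.excepthook = log_uncaught
    return log, crash_file
```

Calling `enable_forensics("svc.log", "svc.crash")` once at startup is the intended use.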
The difference between a shock and electrocution is real.
Rule 𝛌3: Use circuit breakers.
Circuit breakers are designed to
avoid
cascading failure
• it’s not all about you,
especially with microservices
• protect yourselves and others
• circuit breakers come in many types:
  • timing
  • queue depth
  • concurrency
http://melissaomarkham.com
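As an illustration of the timing flavor, here is a minimal breaker sketch. It is an assumption-laden example rather than the design from the talk: the class names, the thresholds, and the rule that a slow call counts as a failure are all choices made for the sketch, and it is not thread-safe.

```python
import time


class CircuitOpen(Exception):
    """Raised instead of calling a dependency whose breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, slow_after_s=0.5, cooldown_s=5.0,
                 clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.slow_after_s = slow_after_s  # calls slower than this count as failures
        self.cooldown_s = cooldown_s      # how long to stay open before a trial call
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Fail immediately: protects the caller from waiting and
                # the struggling dependency from more load.
                raise CircuitOpen("dependency unavailable; failing fast")
            # Cooldown elapsed: let this one call through as a trial.
        started = self.clock()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self.clock() - started > self.slow_after_s:
            self._record_failure()
        else:
            self.failures = 0
            self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Queue-depth and concurrency breakers follow the same shape: the check at the top of `call` would compare a pending-work count or an in-flight counter against a limit instead of looking at a timestamp.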
You cannot understand what you cannot measure.
Rule 𝛌4: Behavior is complex.
Understand it.
Don’t measure to assess availability;
measure to understand
Build robust models of behavior
Understand performance changes
Don’t use averages
Don’t use percentiles alone
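A small worked example of why averages mislead and why percentiles need company. The data, the bucket edges, and the `summarize` helper are made up for illustration; the percentile uses the nearest-rank definition.

```python
import math
import statistics
from collections import Counter


def summarize(latencies_ms,
              bucket_edges_ms=(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)):
    """Return the mean, two nearest-rank percentiles, and a histogram
    keyed by the upper edge of each bucket."""
    ordered = sorted(latencies_ms)

    def pct(p):
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    hist = Counter()
    for v in ordered:
        # First bucket edge that can hold v; overflow goes to +inf.
        edge = next((e for e in bucket_edges_ms if v <= e), float("inf"))
        hist[edge] += 1
    return {
        "mean": statistics.fmean(ordered),
        "p50": pct(50),
        "p99": pct(99),
        "histogram": dict(hist),
    }


# 90 fast requests and 10 very slow ones: a bimodal service.
summary = summarize([10] * 90 + [1000] * 10)
# summary["mean"] is 109.0, a latency that no request actually had.
# summary["p50"] is 10 and summary["p99"] is 1000: two points, no shape.
# summary["histogram"] is {10: 90, 1000: 10}: both modes are visible.
```

Histograms have a second advantage: counts from several nodes can be added bucket by bucket, whereas percentiles computed per node cannot be combined into a correct fleet-wide percentile.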
It’s easy to demand perfection; it’s also stupid.
Rule 𝛌5: Have a failure budget.
Avoiding failure is simply impossible;
expect and manage
failure
• use failure budgets
• set expectations reasonably
• define and reward successes on
improvement and competency,
not just uptime.
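The arithmetic behind a failure budget is short enough to sketch. The example assumes the budget is derived from a success-rate target (written `slo` below, a term the slides do not use) over some window of requests; the function name and the numbers are hypothetical.

```python
def failure_budget(slo, total_requests, failed_requests):
    """Return how many failures the target tolerates over the window,
    how many are left, and what fraction of the budget is spent."""
    allowed = total_requests * (1 - slo)
    if allowed == 0:
        # A 100% target leaves no budget: any failure overspends it.
        consumed = float("inf") if failed_requests else 0.0
    else:
        consumed = failed_requests / allowed
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": consumed,
    }


# A 99.9% target over one million requests tolerates about 1,000 failures;
# 250 failures so far means roughly a quarter of the budget is spent.
budget = failure_budget(slo=0.999, total_requests=1_000_000,
                        failed_requests=250)
```

The same arithmetic works in time: a 99.9% target over a 30-day window allows 0.001 × 43,200 minutes, or 43.2 minutes, of failure.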
Justice should be blind; operations should not.
Rule 𝛌6: Instrumentation &
Observability have no equals.
For every “I wonder what X is right now?”
in production,
you must have answers
DTrace
eBPF
Instrument code for observability
https://www.pinterest.com/pin/441775044670412234/
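DTrace and eBPF observe a running system from the outside; instrumenting the code itself is the complement. A minimal in-process sketch follows, in which `STATS` and `instrumented` are hypothetical names and the counters are not thread-safe. It keeps enough state that "how many lookups are in flight right now?" has an answer.

```python
import functools
import time
from collections import defaultdict

# Live counters per operation name, readable at any time from an admin
# endpoint or a debugger.
STATS = defaultdict(
    lambda: {"calls": 0, "errors": 0, "in_flight": 0, "slowest_s": 0.0})


def instrumented(name):
    """Decorator that records calls, errors, in-flight count, and the
    slowest observed duration for the wrapped function."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            stats = STATS[name]
            stats["calls"] += 1
            stats["in_flight"] += 1
            started = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                elapsed = time.perf_counter() - started
                stats["slowest_s"] = max(stats["slowest_s"], elapsed)
                stats["in_flight"] -= 1
        return inner
    return wrap
```

Decorating a handler with `@instrumented("lookup")` is enough to start populating `STATS["lookup"]`.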
Thank you.

Applying SRE techniques to micro service design
