Three Pillars with Zero Answers
A New Observability Scorecard
March 5, 2020
● Katia Bazzi
● Software Engineer @ LightStep
● katia@lightstep.com
● lightstep.com
Part I
A Critique
Observing microservices is hard
Google and Facebook solved this (right???)
They used Metrics, Logging, and Distributed Tracing…
So we should, too.
The Conventional Wisdom
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics!
Logging!
Tracing!
Three Pillars, Zero Answers: Rethinking Observability
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain variance
in timeseries data (aka “metrics”) …
… but cardinality
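To make the cardinality point concrete, here is a minimal sketch of how the number of distinct timeseries grows as tags are added. The tag names and value counts are hypothetical, and the product is an upper bound that assumes every combination of tag values actually occurs.

```python
from math import prod

def series_count(tag_cardinalities):
    """Upper bound on distinct timeseries for one metric:
    the product of the number of values each tag can take."""
    return prod(tag_cardinalities.values())

# Hypothetical tag set for a single latency metric.
tags = {"service": 50, "endpoint": 20, "status_code": 5}
baseline = series_count(tags)

# Adding one high-cardinality tag multiplies the series count.
with_customer = series_count({**tags, "customer_id": 10_000})

print(baseline, with_customer)
```

Under these assumed numbers, one extra tag takes the metric from thousands of series to tens of millions, which is why cost per tag value appears on the scorecard later.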
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
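The multiplication on this slide can be sketched as a small calculation. Every input below is a hypothetical placeholder (request rate, services touched per transaction, bytes logged per service, and price per GB are not from the talk); the point is only how the factors compound.

```python
def logging_cost_usd(tx_per_sec, services_per_tx, bytes_per_log,
                     usd_per_gb, weeks_retained):
    """Rough cost of keeping every log line for the retention window.

    Assumes each transaction emits one log record of bytes_per_log in
    every service it touches, and that usd_per_gb covers both network
    and storage for the whole window.
    """
    seconds = weeks_retained * 7 * 24 * 3600
    total_bytes = tx_per_sec * services_per_tx * bytes_per_log * seconds
    return total_bytes / 1e9 * usd_per_gb

# Hypothetical inputs: 100k tx/s, 20 services, 1 kB per record,
# $0.10 per GB, two weeks of retention.
cost = logging_cost_usd(100_000, 20, 1_000, 0.10, 2)
print(f"${cost:,.0f}")
```

With these assumed inputs the total comes to roughly $242,000 for a two-week window, and it scales linearly with each of the four factors.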
The Life of Transaction Data: Dapper
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                           100.00%
Buffered within app process    App                             0.10%
Flushed out of process         App                             0.10%
Centralized regionally         Regional network + storage      0.10%
Centralized globally           WAN + storage                   0.01%
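As a rough illustration of the retention figures in the table above, the sketch below applies them to a hypothetical batch of one million instrumented transactions. The percentages come from the slide; the batch size and the function name are assumptions made for the example.

```python
# Cumulative share of data retained after each stage, in parts per
# 10,000 (100.00% -> 10000, 0.10% -> 10, 0.01% -> 1), as on the slide.
STAGES = [
    ("Instrumentation executed", 10_000),
    ("Buffered within app process", 10),
    ("Flushed out of process", 10),
    ("Centralized regionally", 10),
    ("Centralized globally", 1),
]

def retained_counts(total_transactions):
    """Number of transactions still available at each stage."""
    return [(stage, total_transactions * share // 10_000)
            for stage, share in STAGES]

counts = retained_counts(1_000_000)  # hypothetical batch size
for stage, count in counts:
    print(f"{stage:30s} {count:>9,d}")
```

Of the assumed one million transactions, only 100 remain once data is centralized globally, which is the sampling trade-off the next slide scores.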
Fatal Flaws: A Review
                                          Logs   Metrics   Dist. Traces
TCO scales gracefully                      –        ✓           ✓
Accounts for all data (i.e., unsampled)    ✓        ✓           –
Immune to cardinality                      ✓        –           ✓
Data vs UI
Metrics
Logs
Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part II
A New Scorecard
for Observability
Mental Model: “Goals” and “Activities”
Goals: how our services perform in the
eyes of their consumers
Activities: what we (as operators) actually
do to further our goals
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that
a service’s consumers would care about.
… not an indicator of its inner workings
Quick Vocab Refresher: SLIs
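One common way to express an SLI is as the fraction of requests a consumer would consider "good". The sketch below assumes a hypothetical definition (successful and under a latency threshold); the record type, field names, and the 300 ms threshold are illustrative, not from the talk.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the consumer received a successful response

def good_request_sli(requests, threshold_ms):
    """Fraction of requests that succeeded within threshold_ms.

    Measured from the consumer's point of view: it says nothing about
    CPU, queue depth, or other inner workings of the service.
    """
    if not requests:
        raise ValueError("no requests observed")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)

# Hypothetical sample: two fast successes, one slow success, one failure.
sample = [Request(120, True), Request(80, True),
          Request(900, True), Request(50, False)]
sli = good_request_sli(sample, threshold_ms=300)
print(sli)
```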
Observability: Two Fundamental Goals
- Gradually improving an SLI
- Rapidly restoring an SLI
Reminder: “SLI” = “Service Level Indicator”
NOW!!!!
days, weeks, months…
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search
space for plausible explanations
Observability: Two Fundamental Activities
An interlude about stats frequency
Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra)
Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
Freshness (lag from real-time, in seconds)
Scorecard: Detection
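A small sketch of why "correct stats" calls out global percentiles: averaging per-host p99 values does not, in general, give the p99 of all requests. The two hosts and their latencies are hypothetical, and the nearest-rank percentile used here is one of several common definitions.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical traffic: a busy, fast host and a quiet, slow host.
host_a = [10] * 1000   # 1000 requests at 10 ms
host_b = [1000] * 10   # 10 requests at 1000 ms

# Averaging per-host percentiles weights both hosts equally.
averaged_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2

# The global percentile is computed over every request.
global_p99 = percentile(host_a + host_b, 99)

print(averaged_p99, global_p99)
```

In this assumed workload the averaged figure is 505 ms while the true global p99 is 10 ms, so an alert built on the average would be measuring something no consumer experiences.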
# of things your users
actually care about
# of microservices
# of failure modes
Must reduce
the search space!
Why “Refinement”?
The Refinement Process
Discover Variance
Explain Variance
Deploy
Fix
Histograms vs “p99”
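To illustrate the contrast in this slide title, the sketch below builds two hypothetical latency distributions that share the same p99 but have very different shapes. The bucket boundaries and sample values are assumptions chosen for the example.

```python
from bisect import bisect_left

def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[rank - 1]

def histogram(values, upper_bounds):
    """Count values per bucket; a value v lands in the first bucket
    whose upper bound is >= v (bounds must be sorted and cover v)."""
    counts = [0] * len(upper_bounds)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

BOUNDS_MS = [100, 250, 450, 1000]  # hypothetical bucket upper bounds

# Mostly fast, with a thin slow tail.
unimodal = [50] * 98 + [500] * 2
# Half of all requests are slow, plus the same thin tail.
bimodal = [50] * 50 + [400] * 48 + [500] * 2

print(percentile(unimodal, 99), histogram(unimodal, BOUNDS_MS))
print(percentile(bimodal, 99), histogram(bimodal, BOUNDS_MS))
```

Both series report a p99 of 500 ms, but only the histogram shows that nearly half of the second workload sits in the 250 to 450 ms bucket.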
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms; see the previous slide)
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
Wrapping up…
(first, a hint at my perspective)
A fun game!
Design your own observability system:
High-throughput
High-cardinality
Lengthy retention window
Unsampled
Choose three.
(“Observability Whack-a-Mole”)
The Life of Trace Data: Dapper (Review)
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                           100.00%
Buffered within app process    App                             0.10%
Flushed out of process         App                             0.10%
Centralized regionally         Regional network + storage      0.10%
Centralized globally           WAN + storage                   0.01%
The Life of Trace Data: Dapper Other Approaches
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                           100.00%
Buffered within app process    App                           100.00%
Flushed out of process         App                           100.00%
Centralized regionally         Regional network + storage    100.00%
Centralized globally           WAN + storage                 “fancy”
An Observability Scorecard
Detection
- Specificity: cardinality cost, stack coverage
- Fidelity: correct stats, high stats frequency
- Freshness: ≤ 5 seconds
Refinement
- Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons
- “Suppress the messengers”
● Automatic deployment and regression detection
● System and service diagrams
● Real-time and historical root cause analysis
● Correlations
● Custom alerting
● Easy Setup with no vendor lock-in
● No cardinality limitations, really
LightStep: Observability with context
Get Started Today
lightstep.com/play

