Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems
Austin Parker, Principal Developer Advocate at Lightstep
Who Am I?
Austin Parker
Principal Developer Advocate
@austinlparker
austin@lightstep.com
Part 1: A Critique
The Conventional Wisdom
● Observing microservices is hard
● Google and Facebook solved this (right???)
● They used Metrics, Logging, and Distributed Tracing…
● So we should, too.
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics
Logging
Tracing
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”)…
… but cardinality
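A minimal sketch of why cardinality bites, assuming the worst case where every combination of tag values actually occurs. The tag names and counts below are hypothetical, not taken from the talk.

```python
from math import prod


def series_count(tag_cardinalities):
    """Upper bound on distinct time series for one metric: the product of
    the number of distinct values each tag can take."""
    return prod(tag_cardinalities.values())


# Hypothetical tag sets for a single latency metric.
low = {"service": 20, "region": 5, "status": 4}
high = {**low, "customer_id": 10_000}

print(series_count(low))   # 400
print(series_count(high))  # 4000000
```

Adding one high-cardinality tag multiplies the series count, which is what drives the "$ per tag value" cost discussed later in the deck.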
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
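The multiplication above can be worked through with numbers. This is a rough sketch only: the traffic, line size, and price are made-up assumptions, and it assumes one log line per service per transaction.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def retained_log_gb(tx_per_sec, services_per_tx, bytes_per_line, retention_weeks):
    """Steady-state log volume in GB held during the retention window,
    assuming one log line per service touched by each transaction."""
    bytes_per_week = tx_per_sec * services_per_tx * bytes_per_line * SECONDS_PER_WEEK
    return bytes_per_week * retention_weeks / 1e9


def retained_log_cost(gb, dollars_per_gb):
    """Cost of carrying that volume at a flat (hypothetical) price per GB."""
    return gb * dollars_per_gb


# Hypothetical inputs: 1,000 tx/s, 20 services per transaction,
# 500-byte log lines, 4 weeks of retention, $0.10 per GB.
gb = retained_log_gb(1_000, 20, 500, 4)
print(f"{gb:,.0f} GB retained, about ${retained_log_cost(gb, 0.10):,.0f}")
```

Every factor is a multiplier, so doubling the number of microservices or the retention window doubles the bill.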
The Life of Transaction Data: Dapper
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                            100.00%
Buffered within app process    App                              0.10%
Flushed out of process         App                              0.10%
Centralized regionally         Regional network + storage       0.10%
Centralized globally           WAN + storage                    0.01%
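To make the retained percentages concrete, here is a small sketch that applies them to a batch of transactions. It reads each percentage as the share of the original transactions still present at that stage; the one-million figure is an arbitrary assumption.

```python
# (stage, fraction of original transactions retained), from the table above.
STAGES = [
    ("Instrumentation Executed", 1.0),
    ("Buffered within app process", 0.001),
    ("Flushed out of process", 0.001),
    ("Centralized regionally", 0.001),
    ("Centralized globally", 0.0001),
]


def retained_counts(total_transactions):
    """Number of transactions whose data survives at each stage."""
    return [(name, round(total_transactions * frac)) for name, frac in STAGES]


for name, count in retained_counts(1_000_000):
    print(f"{name:<30}{count:>10,}")
```

Out of a million transactions, only about a hundred reach global storage, which is why sampled traces cannot account for all data.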
Fatal Flaws: A Review
                                           Logs   Metrics   Dist. Traces
TCO scales gracefully                       –        ✓           ✓
Accounts for all data (i.e., unsampled)     ✓        ✓           –
Immune to cardinality                       ✓        –           ✓
Data vs UI
Metrics
Logs
Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part 2: A New Scorecard for Observability
Mental Model: Goals and Activities
● Goals: how our services perform in the eyes of their consumers
● Activities: what we (as operators) actually do to further our goals
Quick Vocab Refresher: SLIs
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that a service’s consumers would care about.
… not an indicator of its inner workings
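As an illustration, one common way to express an SLI is the fraction of requests a consumer would call "good". The sketch below assumes a latency-and-success definition; the `Request` type and the 300 ms threshold are hypothetical choices, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def latency_sli(requests, threshold_ms):
    """Fraction of requests that succeeded within threshold_ms,
    or None when there were no requests to judge."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


sample = [
    Request(120, True),
    Request(80, True),
    Request(900, True),   # too slow for the consumer
    Request(50, False),   # fast, but failed
]
print(latency_sli(sample, threshold_ms=300))  # 0.5
```

Note that the inputs are what the consumer experiences (latency, success), not internals such as CPU load or queue depth.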
Observability: 2 Fundamental Goals
Gradually improving an SLI (days, weeks, months…)
Rapidly restoring an SLI (NOW!!!!)
Reminder: “SLI” = “Service Level Indicator”
Observability: 2 Fundamental Activities
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search space for plausible explanations
An interlude about stats frequency
Scorecard: Detection
1. Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra)
2. Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
3. Freshness (lag from real-time, in seconds)
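A short sketch of why "correct stats" calls for a global percentile: averaging per-host p99 values is not the same as computing p99 over all requests. The two hosts and their latencies are made up, and the nearest-rank percentile used here is one of several reasonable definitions.

```python
def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: a busy healthy host and a quiet slow host.
host_a = [10] * 990 + [20] * 10
host_b = [500] * 10

avg_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
global_p99 = percentile(host_a + host_b, 99)

print(avg_of_p99s)  # 255.0
print(global_p99)   # 20
```

The averaged figure weights ten slow requests the same as a thousand fast ones, so it says little about what a typical consumer actually saw.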
Why “Refinement”?
# of things your users actually care about
# of microservices
# of failure modes
Must reduce the search space!
The Refinement Process
Discover Variance
Explain Variance
Deploy Fix
Histograms vs “p99”
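One way to read this slide: a single p99 number hides the shape of the distribution, while a histogram keeps it and can be merged across hosts. Below is a minimal sketch using fixed buckets; the bucket bounds and latencies are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

BOUNDS_MS = [10, 50, 100, 500, 1000]  # hypothetical bucket upper bounds


def to_histogram(latencies_ms):
    """Count per bucket index. Bucket i holds values <= BOUNDS_MS[i];
    index len(BOUNDS_MS) is the overflow bucket."""
    hist = Counter()
    for value in latencies_ms:
        hist[bisect_left(BOUNDS_MS, value)] += 1
    return hist


def merge(*histograms):
    """Histograms with identical bounds merge by adding bucket counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


host_a = to_histogram([8] * 95 + [400] * 5)
host_b = to_histogram([9] * 90 + [450] * 10)
merged = merge(host_a, host_b)

for index in sorted(merged):
    label = f"<= {BOUNDS_MS[index]} ms" if index < len(BOUNDS_MS) else "overflow"
    print(f"{label:<12}{merged[index]:>5}")
```

The merged histogram shows two distinct modes (fast requests and a slow cluster near 500 ms), which a lone percentile would collapse into one number.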
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms (see prev slide))
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
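To illustrate "explaining variance", the sketch below scores a tag by how much of the latency variance disappears once samples are grouped by that tag (between-group variance divided by total variance). This is a standard statistical measure swapped in for illustration, not a method the talk specifies; the sample records and tag names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def explained_variance(samples, tag):
    """Fraction of total latency variance explained by grouping on `tag`."""
    values = [s["latency_ms"] for s in samples]
    total = pvariance(values)
    if total == 0:
        return 0.0
    overall = fmean(values)
    groups = defaultdict(list)
    for s in samples:
        groups[s["tags"][tag]].append(s["latency_ms"])
    between = sum(
        len(g) * (fmean(g) - overall) ** 2 for g in groups.values()
    ) / len(values)
    return between / total


# Hypothetical samples: latency tracks the version, not the region.
samples = [
    {"latency_ms": 10, "tags": {"version": "v1", "region": "us"}},
    {"latency_ms": 10, "tags": {"version": "v1", "region": "eu"}},
    {"latency_ms": 100, "tags": {"version": "v2", "region": "us"}},
    {"latency_ms": 100, "tags": {"version": "v2", "region": "eu"}},
]

for tag in ("version", "region"):
    print(tag, explained_variance(samples, tag))
```

Ranking tags by this score is one way to shrink the search space: here `version` explains all of the variance and `region` explains none.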
Wrapping Up... (first, a hint at my perspective)
A fun game! (“Observability Whack-a-Mole”)
Design your own observability system:
❏ High-throughput
❏ High-cardinality
❏ Lengthy retention window
❏ Unsampled
Choose three
The Life of Trace Data: Dapper
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                            100.00%
Buffered within app process    App                              0.10%
Flushed out of process         App                              0.10%
Centralized regionally         Regional network + storage       0.10%
Centralized globally           WAN + storage                    0.01%
The Life of Trace Data:
Dapper Other Approaches
Stage                          Overhead affects…             Retained
Instrumentation Executed       App                            100.00%
Buffered within app process    App                            100.00%
Flushed out of process         App                            100.00%
Centralized regionally         Regional network + storage     100.00%
Centralized globally           WAN + storage                  “fancy”
An Observability Scorecard
Detection
- Specificity: cardinality cost, stack coverage
- Fidelity: correct stats, high stats frequency
- Freshness: ≤ 5 seconds
Refinement
- Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons
- “Suppress the messengers”
LightStep: Observability with context
Automatic deployment and regression detection
System and service diagrams
Real-time and historical root cause analysis
Correlations
Custom alerting
Easy Setup with no vendor lock-in
No cardinality limitations, really
Q&A
Get Started Today
go.lightstep.com/trial
Extra Slides
Ideal Measurement: Robust
Ideal Measurement: High-Dimensional
Ideal Refinement: Real-time
Must be able to test and eliminate hypotheses quickly
Actual data must be ≤10s fresh
UI / API latency must be very low
Ideal Refinement: Global
Ideal Refinement: Context-Rich
We can’t expect humans to know what’s normal
