Austin Parker, Principal Developer Advocate
Why Distributed Tracing is Essential for Performance and Reliability
or, How to Get Actual Business Value From Distributed Tracing!
Who Am I?
Austin Parker
Principal Developer Advocate
@austinlparker
austin@lightstep.com
What Changed?
More autonomy… but less visibility!
[Diagram comparing Control and Observe; labels: kustomize, "Team-by-team ok", "Must be org-wide!", "Distributed tracing!"]
What you can control
What you are responsible for
Stress (n): responsibility without control
Closing the gap between control and responsibility
Responsibility for delivering performance and reliability
Like many problems, the solution requires:
- Having the right data
- Setting the right goals
- Giving teams ownership
Distributed tracing is essential to closing this gap
Observability (əb-ˈzər-və-bi-lə-tē)
1. (n) the ability to navigate from effect to cause
2. (adj) related to supporting that ability (such as a tool or process)
"used an observability tool to understand what caused the change"
For example, being able to navigate from…
Spike in errors → misconfiguration
Increased latency → new customer behavior
User complaints → upstream service deployed
Getting Actual Business Value From Distributed Tracing
Developer velocity
Software performance
Managing costs
Fundamentals
Deploying tracing
Distributed Tracing
Distributed tracing, defined
Traces are a form of telemetry based on spans with structure
- Span = timed event describing work done by a single service
Tracing is a diagnostic tool that reveals…
… how a set of services coordinate to handle individual user requests
… from mobile or browser to backends to databases (end-to-end)
… including metadata like events (logs) and annotations (tags)
Provides a request-centric view of application performance
Relationships matter
Traces encode causal relationships between callers and callees
[Diagram: caller and callee spans linked by arrows labeled "calls" and "returns"]
Traces are the raw material, not the finished product
Distributed traces – basically just structs
Distributed tracing – the art and science of deriving value from traces
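To make "basically just structs" concrete, here is a minimal sketch of what a span record might look like. The field names and the sample services are illustrative assumptions, not the schema of any particular tracing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """A timed event describing work done by a single service (hypothetical schema)."""
    trace_id: str                 # shared by every span in one request
    span_id: str                  # unique within the trace
    parent_id: Optional[str]      # the caller's span_id; None for the root span
    service: str
    operation: str
    start_ms: float
    end_ms: float
    tags: Dict[str, str] = field(default_factory=dict)             # annotations
    events: List[Tuple[float, str]] = field(default_factory=list)  # timestamped logs

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def children_of(spans: List[Span], parent: Span) -> List[Span]:
    """Callees of `parent`: the caller/callee relationship is just a parent_id link."""
    return [s for s in spans if s.parent_id == parent.span_id]


# One request that enters at a gateway and calls two backends (made-up data).
trace = [
    Span("t1", "a", None, "gateway", "GET /checkout", 0.0, 120.0),
    Span("t1", "b", "a", "cart", "load_cart", 5.0, 40.0),
    Span("t1", "c", "a", "payments", "authorize", 45.0, 115.0, tags={"error": "false"}),
]
root = trace[0]
callees = [s.service for s in children_of(trace, root)]
print(root.service, "calls", callees)
```

The struct itself carries no insight; the value comes from what is derived from many of these records, which is the distinction the slide draws.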
Developer Velocity
Increasing developer velocity
- Make (common) tasks faster
- Reduce interruptions
- Improve communication
- Prioritize high impact work
Verify deployments
Root cause analysis
Better alerts
Understand dependencies
Define and track SLOs
Accelerate root cause analysis
More actionable alerts
“Are We All on the Same Page? Let’s Fix That”, Luis Mineiro, SREcon EMEA 2019
(Search for “same page usenix”)
Understanding dependencies… without tracing
[Figure: dependency graph of services A to E, annotated with isolated metrics: 8% error rate; avg. response size up 31%; request rate up 4%]
Understanding dependencies
Without tracing...
- Each connection in isolation
- “A talks to B”
- No way to narrow scope
- No way to meaningfully tie in other metrics
With tracing...
- End-to-end context
- Request graph
- Can refine based on any property of the request
- Metrics linked to current scope
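One way to picture the difference: with traces, the dependency map is derived from the requests themselves, so it can be narrowed by any property of the request. The sketch below assumes spans are plain dictionaries with hypothetical `span_id`, `parent_id`, `service`, and `tags` keys, and that the first span in each trace is the root.

```python
from collections import Counter
from typing import Callable, Dict, List

SpanDict = Dict[str, object]


def dependency_edges(
    traces: List[List[SpanDict]],
    keep: Callable[[List[SpanDict]], bool] = lambda trace: True,
) -> Counter:
    """Count caller -> callee service edges across the traces selected by `keep`."""
    edges: Counter = Counter()
    for trace in traces:
        if not keep(trace):
            continue
        by_id = {s["span_id"]: s for s in trace}
        for span in trace:
            parent = by_id.get(span["parent_id"])  # None for the root span
            if parent is not None and parent["service"] != span["service"]:
                edges[(parent["service"], span["service"])] += 1
    return edges


def span(span_id, parent_id, service, **tags):
    """Helper for building made-up sample spans."""
    return {"span_id": span_id, "parent_id": parent_id, "service": service, "tags": tags}


traces = [
    [span("1", None, "A", customer="free"), span("2", "1", "B"), span("3", "2", "D")],
    [span("1", None, "A", customer="paid"), span("2", "1", "C"), span("3", "2", "D")],
]

all_edges = dependency_edges(traces)
# Narrow the scope: only requests whose root span is tagged customer=paid.
paid_only = dependency_edges(
    traces, keep=lambda t: t[0]["tags"].get("customer") == "paid"
)
print(dict(all_edges))
print(dict(paid_only))
```

Counting edges per connection alone would show that both B and C call D; only the request-level view shows that paid traffic reaches D through C.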
Use traces and service dependencies
- Enhance training for new team members
- Facilitate operational review meetings
- Inform architectural design decisions
- Set SLOs for internal services
Use SLOs to…
- Measure reliability
- Set error budgets
- Hold teams accountable
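As a worked example of what setting an error budget means: an availability SLO implies how much failure is tolerable in a given window. The 99.9% target, the 30-day window, and the request counts below are illustrative assumptions, not figures from the talk.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability allowed by an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.9% over 30 days leaves roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999, 30)
# 250 failures out of 1,000,000 requests uses about a quarter of the budget.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A team that owns a service and its budget can decide for itself when to slow down releases, which is how SLOs connect measurement to accountability.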
Software Performance
Improving software performance
Performance means “performance as experienced by end users”
Tracing can help with…
- Better distribution of computation
- Focusing optimization where it matters
Defining the critical path
A (part of a) span is on the critical path if:
- reducing its duration speeds up the overall request
[Figure: a trace in which the request is "waiting for blue…"] Therefore, blue is on the critical path here
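The definition above can be turned into a small computation. This is a minimal sketch of one common approach (walk backwards from the end of each span and follow the child it was last waiting on), assuming well-formed timestamps with no clock skew; the `Span` class and the sample trace are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    start: float
    end: float
    children: List["Span"] = field(default_factory=list)


def critical_path(span: Span) -> List[str]:
    """Names of spans on the critical path, walking backwards from the span's end.

    At each step, the child that finished last before the cursor is the one the
    parent was waiting on; other work that overlaps it is off the path.
    """
    path = [span.name]
    cursor = span.end
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end <= cursor:
            path.extend(critical_path(child))
            cursor = child.start
    return path


# "auth" runs first; "green" and "blue" then run in parallel, and blue finishes last.
request = Span("request", 0, 100, children=[
    Span("auth", 0, 10),
    Span("green", 10, 40),
    Span("blue", 10, 95),
])
path = critical_path(request)
print(path)
```

Here shortening "green" would not speed up the request, because the request is still waiting for "blue"; that is why "green" does not appear in the result.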
Rebalancing fan-out
Amdahl’s Law
Given a choice between speeding up A and B…
1. A 50% improvement in B is better than a 50% improvement in A
2. No improvement in A will ever improve overall performance by >15%
Obvious… once you have the data :)
[Figure: timelines of spans A and B, comparing a speed-up of A OR a speed-up of B]
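The two claims can be checked with a few lines of arithmetic. The sketch assumes that A accounts for 15% of the request's critical-path latency (which is what the second claim implies) and that B accounts for the remaining 85%; the 85% figure is an assumption for illustration.

```python
def latency_reduction(fraction: float, improvement: float) -> float:
    """Share of total latency saved when a component that makes up `fraction`
    of the critical path has its own duration cut by `improvement` (both 0..1)."""
    if not (0.0 <= fraction <= 1.0 and 0.0 <= improvement <= 1.0):
        raise ValueError("fraction and improvement must be between 0 and 1")
    return fraction * improvement


def overall_speedup(fraction: float, improvement: float) -> float:
    """Amdahl-style speedup: old latency divided by new latency."""
    return 1.0 / (1.0 - latency_reduction(fraction, improvement))


A_SHARE, B_SHARE = 0.15, 0.85  # assumed split of the critical path

half_a = latency_reduction(A_SHARE, 0.5)   # halve A
half_b = latency_reduction(B_SHARE, 0.5)   # halve B
best_a = latency_reduction(A_SHARE, 1.0)   # make A free: the upper bound for A

print(f"halve A: {half_a:.1%}, halve B: {half_b:.1%}, eliminate A: {best_a:.1%}")
```

Under these assumptions, halving B saves several times more latency than halving A, and even eliminating A entirely cannot save more than A's 15% share.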
Managing Costs
Types of costs
Operational costs
- Developer time (failed deployments, oncall, meeting overhead)
Revenue and reputational costs
- Missed SLOs, failed conversions, unhappy users
Infrastructure costs
- Compute, network, storage, API usage
Monitoring costs
- Take aggregated logs as an example
Calculating logging costs
Initial Factors
- Aggregating and indexing logs per service:
  - Storage
  - Compute
  - Network
- Peak instance count
- Retention period
- Services involved in a request
Initial Values
Assuming 50 GB of log data a day, 14-day retention, high availability (no cold storage):
- 1 Primary (L Compute Optimized): $89
- 2 Data (XL Memory Optimized): $426
- 3 SSDs (General Purpose): $201
$716
Cloud spend @ 50 GB of logs per day (monthly)
$3,386
Total after setup, maintenance (monthly)
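The arithmetic behind these figures can be reproduced in a few lines. All dollar amounts and volumes come from the slides; the variable names, and treating the gap between the two totals as setup and maintenance overhead, are illustrative choices.

```python
DAILY_LOG_GB = 50
RETENTION_DAYS = 14

# Monthly cloud line items, in dollars, as listed on the slide.
monthly_cloud_costs = {
    "1 Primary (L Compute Optimized)": 89,
    "2 Data (XL Memory Optimized)": 426,
    "3 SSDs (General Purpose)": 201,
}
TOTAL_WITH_SETUP_AND_MAINTENANCE = 3386  # monthly total stated on the slide

retained_gb = DAILY_LOG_GB * RETENTION_DAYS
cloud_spend = sum(monthly_cloud_costs.values())
setup_and_maintenance = TOTAL_WITH_SETUP_AND_MAINTENANCE - cloud_spend
overhead_share = setup_and_maintenance / TOTAL_WITH_SETUP_AND_MAINTENANCE

print(f"Log data retained at steady state: {retained_gb} GB")
print(f"Cloud spend per month: ${cloud_spend}")
print(f"Setup and maintenance per month: ${setup_and_maintenance}")
print(f"Share of total that is not cloud spend: {overhead_share:.0%}")
```

The infrastructure bill is the smaller part of the stated total, which is why the cost discussion starts with developer time rather than with compute.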
Reducing logging spend with tracing
Annotate spans with logs! It’s as easy as:
span.addEvent("illegal base64 data at input byte 7")
Leverage traces to determine which logs to store
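A minimal sketch of both ideas, attaching log lines to spans and then using the span's outcome to decide which log lines are worth storing. The `Span` class is a hypothetical stand-in so the example runs without dependencies (the OpenTelemetry Python API exposes a comparable `add_event` method on its spans), and the retention policy with its 500 ms threshold is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a tracing library's span object."""
    name: str
    duration_ms: float
    error: bool = False
    events: List[str] = field(default_factory=list)

    def add_event(self, message: str) -> None:
        """Attach a log line to the span, mirroring the slide's span.addEvent(...)."""
        self.events.append(message)


def logs_to_store(spans: List[Span], slow_ms: float = 500.0) -> List[str]:
    """Keep log lines only from spans that failed or were slow (illustrative policy)."""
    kept: List[str] = []
    for span in spans:
        if span.error or span.duration_ms >= slow_ms:
            kept.extend(f"{span.name}: {event}" for event in span.events)
    return kept


ok = Span("decode_upload", duration_ms=12.0)
ok.add_event("decoded 4 KiB payload")

bad = Span("decode_upload", duration_ms=15.0, error=True)
bad.add_event("illegal base64 data at input byte 7")

stored = logs_to_store([ok, bad])
print(stored)
```

Only the log line from the failed request is kept; the routine line from the healthy request is dropped, which is where the storage savings would come from.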
Monthly logging spend: $3,571 vs. $712
Logging data is more valuable in context!
Deploying Tracing
On your tracing migration journey
Tracing is not an all-or-nothing endeavour
- How to deliver incremental value for the org
- How to use that value to inform next steps of the journey
Value to developers should be your (meta-)metric of success
Step 1 Start w/ customer-critical experiences
Look at the edge and build an MVP
- As close as you can (reasonably) get to users
- Often an API gateway or proxy
Map incoming operations → dependencies
- Identify next steps
- Build a case for others to adopt tracing
Step 2 Playbook for service owners
Establish conventions for tags, etc.
- What matters to your business?
- What would explain failures?
Instrument frameworks, libraries, shared services
- Accelerate adoption by reusing code
- Enforce conventions programmatically
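Enforcing conventions programmatically can be as simple as a check inside the shared instrumentation code. This is a sketch under stated assumptions: the required tag names are hypothetical examples, and a real list would come from the two questions above.

```python
from typing import Dict, List, Sequence

# Hypothetical conventions agreed on by the organization.
REQUIRED_TAGS = ("service.version", "customer.tier", "region")


def missing_tags(tags: Dict[str, str], required: Sequence[str] = REQUIRED_TAGS) -> List[str]:
    """Required tag keys that are absent or empty on a span."""
    return [key for key in required if not tags.get(key)]


def checked_tags(tags: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `tags`, or raise if a required convention is not met.

    A shared library or framework wrapper could call this wherever spans are created.
    """
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"span is missing required tags: {', '.join(missing)}")
    return dict(tags)


good = {"service.version": "1.4.2", "customer.tier": "paid", "region": "eu-west-1"}
incomplete = {"service.version": "1.4.2", "region": "eu-west-1"}

print(missing_tags(good))        # nothing missing
print(missing_tags(incomplete))  # the customer tier was left out
```

Whether a violation should raise, log a warning, or only be counted is a design choice; failing in tests or CI and warning in production is one reasonable option.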
Step 3 Integrate with existing workflows
Where do engineers work today?
- IDEs, testing frameworks, CI/CD
- Dashboards
- Notification and alerting
- …
Building observable services
Use open standards like OpenTelemetry for instrumenting service code.
OpenTelemetry provides a single set of APIs, SDKs, and tools for generating distributed traces and metrics from your services.
In summary, distributed tracing provides...
Faster RCA
Better alerts
Up-to-date dependency maps
Improved compute fan-out
Targeted optimization
Integrated telemetry
Improved developer velocity
Faster software performance
Better cost management
Distributed tracing puts application behavior in context
to help answer the primary question of observability:
“What caused that change?”
Get Started Today
go.lightstep.com/request-a-demo
Q&A
