Distributed Tracing at UBER Scale
Creating a treasure map
for your monitoring data
Yuri Shkuro, UBER Technologies
ABOUT ME
• Software Engineer on the
Observability team in NYC
• Working on the open source
distributed tracing system Jaeger
• Co-founded the OpenTracing
project
• Banking industry survivor
• Github: yurishkuro
• Twitter: @yurishkuro
Would You Like Some Tracing with
Your Monitoring?
What does it take to roll it out?
Why Distributed Tracing
• Distributed transaction monitoring
• Performance / latency optimization
• Root cause analysis
• Service dependency analysis
• Distributed context propagation (“baggage”)
JAEGER, Distributed Tracing
• Open Source
• OpenTracing inside
• In active development
• PRs are welcome
• Zipkin compatible
• github.com/uber/jaeger
Who Thinks Tracing is Awesome?
Why Doesn’t Everyone Do Tracing?
Tracing Instrumentation is
HARD
EXPENSIVE
BORING
Instrumentation
• Metrics and logging are not new
• Tracing is both new and harder
Context Propagation
[Diagram: services A, B, C, D, and E behind an edge service; every call between them carries {context}, where UniqueID → {context}. Headers: ..., Trace ID, ...]
Instrumentation
[Diagram: APPLICATION / MICROSERVICE. Inbound HTTP Request (Headers: ..., Trace ID, ...) → Instrumentation → Handler (Context, [Span]) → Client (Context, [Span]) → Instrumentation → Outbound HTTP Request]
Context Propagation
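The inbound and outbound halves of that diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header name `X-Trace-Id`, the `TraceContext` class, and the `extract`, `inject`, and `handle` helpers are hypothetical names, not the wire format or API of Jaeger or OpenTracing.

```python
import uuid
from dataclasses import dataclass

# Hypothetical header name, chosen for illustration only.
TRACE_HEADER = "X-Trace-Id"


@dataclass(frozen=True)
class TraceContext:
    """In-memory representation of the distributed context."""
    trace_id: str


def extract(headers):
    """Inbound side: read the context from request headers.

    If the header is missing, this service is the first hop and
    starts a new trace. A real implementation would also need
    case-insensitive header lookup.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return TraceContext(trace_id)


def inject(ctx, headers):
    """Outbound side: return a copy of the headers with the context encoded."""
    out = dict(headers)
    out[TRACE_HEADER] = ctx.trace_id
    return out


def handle(inbound_headers):
    """A handler that makes one outbound call and returns that call's headers."""
    ctx = extract(inbound_headers)
    return inject(ctx, {"Accept": "application/json"})


outbound = handle({"X-Trace-Id": "abc123"})
```

The part the sketch makes look easy is the hand-off of `ctx` between `extract` and `inject`; inside a real application those two calls are usually far apart.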
In-Process Context Propagation
Implicit, via Thread-Locals
but: thread pools, futures
Explicit
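A small Python sketch of the two options and of the thread-pool caveat. The helper names (`set_context`, `get_context`, `implicit_work`, `explicit_work`) are made up for this example; only `threading.local` and `ThreadPoolExecutor` come from the standard library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def set_context(ctx):
    """Store the context for the current thread only."""
    _local.ctx = ctx


def get_context():
    """Return the current thread's context, or None if none was set."""
    return getattr(_local, "ctx", None)


def implicit_work():
    # Relies on the thread-local; runs on a pool thread that never set it.
    return get_context()


def explicit_work(ctx):
    # Receives the context as an ordinary argument.
    return ctx


set_context({"trace_id": "abc123"})

with ThreadPoolExecutor(max_workers=1) as pool:
    lost = pool.submit(implicit_work).result()
    kept = pool.submit(explicit_work, get_context()).result()

print(lost)  # None: the worker thread has its own, empty thread-local
print(kept)  # {'trace_id': 'abc123'}
```

Implicit propagation needs no changes to application signatures but breaks silently at the pool boundary; explicit propagation survives it at the cost of threading an extra argument through the code.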
It’s Also the Frameworks
• Go: stdlib, gorilla, …
• Java: jaxrs2, okhttp, ApacheHttpClient, …
• Python: Flask, Django, Tornado, urllib2, …
• Node.js – who knows…
OpenTracing to the Rescue
No Help With In-Process Propagation
• Must be done manually
• UBER has 2000-3000 microservices
• Resources of the tracing team are limited
• Developers must instrument their code!
BITE MAKE ME!
How do we mobilize the org?
Traveling Salesman Problem
2017 edition
They Must Want Your Product
or Sticks and Carrots
Recap: Why Distributed Tracing
• Distributed transaction monitoring
• Performance / latency optimization
• Root cause analysis
• Service dependency analysis
• Distributed context propagation (“baggage”)
Service Dependency Analysis
• Explain to us what we just built
• Who are my dependencies
• Workflow analysis
• Where is all this traffic coming from?
• Service tiers
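One way to answer "who are my dependencies" from trace data is to count parent-to-child span edges that cross a service boundary. The sketch below assumes a simplified span record with `span_id`, `parent_id`, and `service` fields; that shape is an assumption for illustration, not Jaeger's data model.

```python
from collections import Counter


def dependency_edges(spans):
    """Count caller -> callee service edges found in a list of spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only count edges that cross from one service into another.
        if parent and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


spans = [
    {"span_id": 1, "parent_id": None, "service": "edge"},
    {"span_id": 2, "parent_id": 1, "service": "api"},
    {"span_id": 3, "parent_id": 2, "service": "api"},
    {"span_id": 4, "parent_id": 3, "service": "db"},
]
edges = dependency_edges(spans)
```

Reading the edge counts by callee shows where a service's traffic comes from; reading them by caller lists its dependencies.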
Baggage
• Tenancy, test or production
– Set at the top
– Used at the storage layer, prod or test DB
• Authentication tokens
– Signed user or service identity
– Checked at multiple levels
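The tenancy case can be sketched as follows: the flag is set once at the top, carried untouched through the middle, and read only at the storage layer. The function names, the `tenancy` key, and the database names are hypothetical, and the baggage is passed as a plain argument where a real system would carry it inside the trace context.

```python
def storage_layer(baggage, key):
    """Pick the database from baggage set far upstream."""
    db = "test_db" if baggage.get("tenancy") == "test" else "prod_db"
    return f"{db}:{key}"


def middle_service(baggage, key):
    """Does not inspect the baggage, only passes it along."""
    return storage_layer(baggage, key)


def edge_service(is_test_request, key):
    """Sets the tenancy once, at the top of the call chain."""
    baggage = {"tenancy": "test" if is_test_request else "production"}
    return middle_service(baggage, key)


print(edge_service(True, "rider/42"))   # test_db:rider/42
print(edge_service(False, "rider/42"))  # prod_db:rider/42
```

The services in the middle need no knowledge of tenancy at all, which is what makes baggage attractive to teams that would otherwise have to add a parameter to every API along the path.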
Sticks and Carrots
• Get other teams to build features on top
– Performance team
– Capacity & cost accounting
– Baggage
• More carrots
• Eventually they become sticks (peer pressure)
Each Organization is Different
Find what works best
How to Measure Adoption?
Measure everything
Does Service X Report Traces?
• Daily aggregation job
• Auto-book tickets
• Build a dashboard
• Pass/Fail: too easy to pass
Trace Quality Score
• Inspect traces
– See a caller, but no spans
• Join with other data
– Routing logs
• Auto-book tickets (carefully, not for everyone)
– With detailed report
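One reading of "see a caller, but no spans" is a check that every service named as a callee in a trace also reported at least one span in that trace. The sketch below scores services on that single check. The span shape (`service`, `calls`) and the scoring formula are assumptions for illustration, not the scoring actually used at Uber.

```python
from collections import defaultdict


def quality_by_service(traces):
    """Fraction of traces in which a called service also reported spans."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for spans in traces:
        reporting = {s["service"] for s in spans}
        called = {s["calls"] for s in spans if s.get("calls")}
        for service in called:
            total[service] += 1
            if service in reporting:
                passed[service] += 1
    return {service: passed[service] / total[service] for service in total}


traces = [
    # A calls B, B calls C, but C never reports a span.
    [{"service": "A", "calls": "B"}, {"service": "B", "calls": "C"}],
    # A calls B, and B reports.
    [{"service": "A", "calls": "B"}, {"service": "B", "calls": None}],
]
scores = quality_by_service(traces)
```

A fractional score like this separates a service that is merely uninstrumented from one that drops context on some code paths, which a plain pass/fail check cannot do.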
Trace Quality Metrics by Service
Thank You
• Jaeger
– https://github.com/uber/jaeger
– Blog: Evolving Distributed Tracing at UBER
– Blog: Take OpenTracing for a HotROD Ride
• OpenTracing: http://opentracing.io/
• We are hiring
• @yurishkuro

Editor's Notes

  • #2: Thank you all for staying till the very last time segment of the conference. I am sure everyone is tired, so I promise there will be no math or cat pictures in this talk.
  • #3: I am an engineer on the Observability team in NYC, here is a picture of me. It’s actually most of the NY office, both engineers and operations, on the High Line next to Hudson Yards. The Uber office is just two streets to the right. And in the direction we’re all facing, there are no buildings, so I leave it to your imagination how the picture was taken. I work on the distributed tracing system Jaeger that we recently open sourced. I was one of the co-founders of the OpenTracing project, which I will talk about later. And before joining Uber I spent way more time than any engineer should working in finance technology.
  • #4: Many speakers in the last three days mentioned how valuable tracing can be.
  • #9: The poll is biased. Show of hands: - How many people are using some metrics system in their organization? - And how many people have a fully functional distributed tracing solution? - That works across multiple programming languages
  • #11: These adjectives actually refer to the same thing – engineers must spend time adding this instrumentation. You could say, well it’s the same thing with other monitoring, like metrics and logging. But
  • #12: Engineers are used to metrics and logging. Tracing instrumentation is new, and harder to get right (or more boring). Why is that?
  • #13: Tracing instrumentation requires context propagation. It means the architecture must be capable of passing around certain metadata about the distributed transaction. What does it look like in a single node?
  • #14: Here we have an application or microservice that has a handler for inbound HTTP requests, and it makes some outbound calls to other services. The instrumentation on the inbound side is responsible for extracting the distributed context from the request and converting it to some in-memory representation. Then the service itself must make sure that the context object is available to the instrumentation on the right, to be encoded into the outbound request. In the case of HTTP, the context is typically passed as one or more HTTP headers. The obvious place to start is the infrastructure libraries. The difficult part here is actually the dotted line in the middle, passing the context inside the application: in-process propagation.
  • #15: Some languages like Java and Python support thread-local storage. We can use that to pass the context implicitly. Happy story for instrumentation, no changes to the application’s code. Other languages do not have thread locals. For example, Go has no way of even identifying a goroutine. Node.js has CLS, but it’s not performant enough. So we need to explicitly pass the context, which means changes to the application code. Google / Twitter experience - localized
  • #16: When I said earlier “start with the infra libraries” – that’s also not so simple in a polyglot environment. You are starting to see the problem? This is why we stood behind OpenTracing from the early days.
  • #17: OpenTracing is an open, vendor neutral standard. Frameworks can instrument themselves! Or at minimum, it can be done via open source plugins that everybody can reuse.
  • #18: But OpenTracing does not help with the need for explicit in-process context propagation in languages like Go and Node.js. It has to be done manually; there is no other way. Now consider that Uber has somewhere between 2,000 and 3,000 microservices, and only 5 people on the tracing team. It doesn’t scale: we cannot go and change the source code of half of those services. Inevitable conclusion – application owners must do that themselves.
  • #19: But they are busy shipping features. How do we incentivize them?
  • #20: I call it a traveling salesman problem, 2017 edition. Find the optimal way to spend your tracing team’s time to achieve the maximum level of adoption. It’s not a pure technology problem; there’s salesmanship involved.
  • #21: How do we make developers use ANY given tool? You need to give them something they want. A sticks and carrots approach can also work, because engineering organizations are not democracies.
  • #22: Remember we talked about these features? It turns out the last two