Metrics at scale @UBER
Mantas Klasavičius
About Me
Senior software engineer @ Uber
<metric_path> <value> <timestamp>
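The `<metric_path> <value> <timestamp>` triple is the line format of Graphite's plaintext protocol. A minimal sketch of building one such line follows; the function name and the sample metric path are hypothetical and only illustrate the format.

```python
import time
from typing import Optional


def format_graphite_line(metric_path: str, value: float,
                         timestamp: Optional[int] = None) -> str:
    """Build one newline-terminated '<metric_path> <value> <timestamp>' line."""
    if not metric_path or any(ch.isspace() for ch in metric_path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{metric_path} {value} {timestamp}\n"


# Hypothetical metric path, used only for illustration.
line = format_graphite_line("stats.counts.cn.api.requests", 42, 1500000000)
print(line, end="")
```

Each line carries exactly one datapoint, so a client that emits many metrics sends many such lines.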
UBER
6 continents
72 countries
425 cities
>5 million trips a day
>1000 engineers
7 years
UBER in Vilnius
3y ago
>20 engineers
4 Teams:
- Observability
- Databases
- Foundations
- DevExp
Hypergrowth defines us...
Growth of Services
Metrics
Metrics @UBER is a first class citizen
T0 Service
Handling ~500M telemetry timeseries
Writing ~3M values/sec and running ~1K queries/sec
50M minutes worth of data per sec
Growing >25% month over month
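To put the growth figure in perspective, here is a small worked calculation of what 25% month-over-month compounds to over a year. It assumes the rate stays at exactly 25% for twelve months, which is an illustration and not a claim from the talk.

```python
def compound_growth(monthly_rate: float, months: int) -> float:
    """Return the total growth factor after compounding monthly_rate for months."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return (1.0 + monthly_rate) ** months


# Assumption: a constant 25% monthly rate held for a full year.
yearly_factor = compound_growth(0.25, 12)
print(f"growth factor after 12 months: {yearly_factor:.2f}")
```

At that rate the load grows by more than 14x in a year, which is why capacity plans based on linear extrapolation fall behind quickly.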
Metrics Collection
Graphite ~2013
Metrics Collection
Graphite 2015
Metrics Collection
Considered choices
Netflix Atlas
Blueflood
Update Graphite
Metrics Collection
M3
Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.
High write throughput
Cassandra data model supports time series data-store - DTCS (DateTieredCompactionStrategy)
Cassandra's native TTL support
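The two Cassandra features named above, DTCS compaction and native TTL, are both set as table options. The sketch below builds a CQL statement that uses them. The keyspace, table, and column names are hypothetical and do not describe Uber's actual schema; only the `DateTieredCompactionStrategy` class name and the `default_time_to_live` option are standard Cassandra.

```python
def build_create_table_cql(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Return CQL for a simple time-series table using DTCS and a default TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n"
        "  metric_path text,\n"    # partition key: one partition per series
        "  ts timestamp,\n"        # clustering key: datapoints ordered by time
        "  value double,\n"
        "  PRIMARY KEY (metric_path, ts)\n"
        ") WITH CLUSTERING ORDER BY (ts DESC)\n"
        "  AND compaction = {'class': 'DateTieredCompactionStrategy'}\n"
        f"  AND default_time_to_live = {ttl_seconds};"
    )


# Hypothetical names; 172800 seconds is a two-day retention.
print(build_create_table_cql("metrics", "datapoints", 172800))
```

With a table-level TTL, expired datapoints are dropped by Cassandra itself rather than by a separate cleanup job.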
Metrics Collection
Cassandra - our use case
Separate clusters for different types of data
Clusters span multiple datacenters
Dynamically control to which cluster data is written
Forcibly deleting old data
https://github.com/m3db/m3db/
Metrics Collection
Metrics as free resource
*.application_1431728998581_0361.*
*.Connections.10_30_3_24.0x64d11081baa1837.*
*.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
*.check-<uid_or_uuid>.*
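The patterns above all embed a unique identifier (application ID, IP address, hex handle, UUID) in the metric path, so each new instance creates a new time series. A rough sketch of flagging such segments is shown below. The regular expressions and the six-digit cutoff are assumptions chosen to match these examples, not a rule taken from the talk.

```python
import re
from typing import List

# Heuristics for path segments that look like unbounded identifiers.
PATTERNS = [
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
               re.IGNORECASE),                # UUID
    re.compile(r"0x[0-9a-f]{8,}", re.IGNORECASE),  # long hex literal
    re.compile(r"^\d{1,3}(?:_\d{1,3}){3}$"),  # IPv4 written with underscores
    re.compile(r"\d{6,}"),                    # long digit run (IDs, timestamps)
]


def looks_unbounded(segment: str) -> bool:
    """Return True if a single path segment matches any identifier heuristic."""
    return any(p.search(segment) for p in PATTERNS)


def suspicious_segments(metric_path: str) -> List[str]:
    """Return the dot-separated segments of a path that look like identifiers."""
    return [s for s in metric_path.split(".") if looks_unbounded(s)]


# Hypothetical full path built around one of the examples above.
print(suspicious_segments("stats.Connections.10_30_3_24.0x64d11081baa1837.count"))
```

A check like this could run at ingestion time or in a periodic audit; which of the two is appropriate depends on how strict the platform wants to be.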
Metrics Collection
Cost accounting and metrics about metrics
Metrics Visualization
M3 - Querying
Metrics Visualization
Grafana
Observability: Past, Present, and Future
Metrics Visualization
aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn
| aggregate
| asPercent (fetch name:requests caller:cn | aggregate)
| anomalies
| sort max
| tail 10
M3QL - Query Like It’s Bash
tail(
sort(
anomalies(
asPercent(
sum(fillNulls(stats.counts.cn.*.requests.errors)),
sum(fillNulls(stats.counts.cn.*.requests))
), max
)
), 10
)
Metrics Visualization
Graphite Way vs. M3QL
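The difference between the two query styles is mostly one of composition: Graphite nests function calls inside out, while M3QL chains stages left to right like a shell pipeline. The sketch below shows that both orderings compute the same result on toy data. The helper names (`fill_nulls`, `sum_series`, `pipe`) are hypothetical stand-ins, not the M3QL or Graphite implementations.

```python
from functools import reduce
from typing import Callable, List, Optional

Series = List[Optional[float]]


def fill_nulls(series_list: List[Series]) -> List[List[float]]:
    """Replace missing datapoints with 0.0 in every series."""
    return [[0.0 if v is None else v for v in s] for s in series_list]


def sum_series(series_list: List[List[float]]) -> List[float]:
    """Sum all series pointwise into a single series."""
    return [sum(points) for points in zip(*series_list)]


def pipe(value, *stages: Callable):
    """Apply stages left to right, like 'fetch ... | fillNulls | sum'."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


errors = [[1.0, None, 3.0], [None, 2.0, 1.0]]

nested = sum_series(fill_nulls(errors))       # Graphite-style nesting
piped = pipe(errors, fill_nulls, sum_series)  # pipeline-style chaining
print(nested, piped)
```

The pipeline form reads in execution order, which makes long queries easier to extend by appending a stage.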
Alerting based on metrics
Query Based Alerting
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
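A query-based alert like the one above reduces to comparing the query result against two fixed limits. The following is a minimal sketch of that comparison. The `classify` function is hypothetical, and treating `warning_over` and `critical_over` as strict "greater than" limits is an assumption based on the parameter names.

```python
def classify(value: float, warning_over: float, critical_over: float) -> str:
    """Map a query result to 'ok', 'warning', or 'critical'."""
    if warning_over > critical_over:
        raise ValueError("warning_over must not exceed critical_over")
    if value > critical_over:
        return "critical"
    if value > warning_over:
        return "warning"
    return "ok"


# Same limits as the alert definition above; the input values are made up.
for observed in (0.05, 0.5, 12.0):
    print(observed, classify(observed, warning_over=0.1, critical_over=10.0))
```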
Alerting based on metrics
Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
• Labor-intensive: each threshold is hand-tuned and manually updated.
• Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
• Not sensitive enough: thresholds take a long time to catch slow degradations.
• Poor UX: configuring really good alerts requires specialized knowledge of the query language.
• No guidance: system doesn’t offer automated root cause exploration.
Alerting based on metrics
• Zero config: thresholds are set and maintained automatically.
• Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
• Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
• Integrated UX: works within our existing telemetry and alert configuration systems.
• Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
Intelligent Monitoring
Alerting based on metrics
The max lower threshold exceeds the min upper threshold
Dynamic Thresholds
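The talk does not spell out how the dynamic thresholds are computed. As a generic illustration of the idea, the sketch below derives a lower and upper bound from recent history using the median and the median absolute deviation (MAD). This is a common robust technique swapped in for illustration, not the algorithm used at Uber; the multiplier `k` and the sample data are assumptions.

```python
from statistics import median
from typing import Sequence, Tuple

# Scales MAD so it is comparable to a standard deviation for normal data.
MAD_TO_SIGMA = 1.4826


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) thresholds derived from recent history."""
    if not history:
        raise ValueError("history must not be empty")
    center = median(history)
    mad = median([abs(x - center) for x in history])
    spread = k * MAD_TO_SIGMA * mad
    return center - spread, center + spread


def is_anomalous(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Flag a value that falls outside the bounds computed from history."""
    lower, upper = dynamic_bounds(history, k)
    return value < lower or value > upper


recent = [10, 11, 9, 10, 12, 10, 11]  # made-up per-minute values
print(dynamic_bounds(recent), is_anomalous(20, recent), is_anomalous(11, recent))
```

Because the bounds are recomputed from a sliding window, they follow gradual growth without anyone editing a threshold by hand. Handling seasonality and rollouts would need more than this sketch provides.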
Alerting based on metrics
Outage Detection
< 1% outages missed.
6.5 out of 10 alerts are true issues.
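The two figures above correspond to the standard recall and precision measures: missing fewer than 1% of outages means recall above 0.99, and 6.5 true issues per 10 alerts means precision of 0.65. A small worked sketch follows; the counts are illustrative and chosen only to reproduce those ratios.

```python
def precision(true_alerts: int, false_alerts: int) -> float:
    """Fraction of fired alerts that pointed at a real issue."""
    total = true_alerts + false_alerts
    if total == 0:
        raise ValueError("no alerts fired")
    return true_alerts / total


def recall(detected_outages: int, missed_outages: int) -> float:
    """Fraction of real outages that were detected."""
    total = detected_outages + missed_outages
    if total == 0:
        raise ValueError("no outages occurred")
    return detected_outages / total


# Illustrative counts only: 65 true alerts out of 100, 99 outages caught out of 100.
print(precision(65, 35), recall(99, 1))
```

The two numbers pull against each other: tightening thresholds to reduce false alerts tends to raise the share of missed outages.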
Alerting based on metrics
F3
stats.foo
anomalies(stats.foo)
On-Call Dashboard
We are hiring!
mantas@uber.com

Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)