Pinot: Near Real-Time Analytics @ Uber
U B E R | Data
Quick Introduction
Xiang Fu
Sr Software Engineer II @ Uber
Streaming Analytics Team
Uber Scale

                               Messages                     Bytes
Apache Kafka                   Trillion per day             ~PB per day
Streaming Analytics Platform   Billions processed per day   100s of TB processed per day
Pinot                          100s of Billions             10s of TB
Agenda
● Pinot @ Uber

● Architecture

● Case Study

● Pinot Perf
Pinot @ Uber
Experimentation platform (Internal Dashboard)
A/B Tests

See progress of tests in real-time
UberEats (Realtime User Facing Product)
UberEats Restaurant Manager

“What is my revenue for the past 90 days?”
Many More…
• UberPool Analytics
• Mobile Analytics
...
Architecture
Pinot Workflow
Athena-X
Hive/Spark SQL/Oozie
● Projection, Filtering

● Window Aggregation 

● Join
Pinot Realtime: Self Service
● Projection, Filtering

● Window Aggregation 

● Join
Case Study
Pinot Data Model

Column Name    Column Type             Filtering   Compression     Indexing
RiderId        SingleValue/Dimension   Yes         Dictionary      Sorted
DriverId       SingleValue/Dimension   Yes         Dictionary      Inverted
TripId         SingleValue/Dimension   No          No Dictionary   No
PickUpPoints   MultiValue/Dimension    No          No Dictionary   No
TripFare       SingleValue/Metric      No          No Dictionary   No

Step 1: List Column Spec
Step 2: Analyze Query Pattern
Step 3: Decide Compression & Indexing Strategy
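
The three steps can be sketched in code: list the column spec, then derive the indexing and compression settings from it. This is a minimal sketch, not Uber's tooling. The tuple layout and the `build_index_config` helper are hypothetical, and the output keys (`sortedColumn`, `invertedIndexColumns`, `noDictionaryColumns`) are assumed to match the index settings of the Pinot table configuration in use, which should be checked against that version's documentation.

```python
# Column spec transcribed from the slide's table. The tuple layout is an
# assumption made for this sketch, not a Pinot data structure.
COLUMNS = [
    # (name, column type, filtered?, compression, indexing)
    ("RiderId", "SingleValue/Dimension", True, "Dictionary", "Sorted"),
    ("DriverId", "SingleValue/Dimension", True, "Dictionary", "Inverted"),
    ("TripId", "SingleValue/Dimension", False, "No Dictionary", None),
    ("PickUpPoints", "MultiValue/Dimension", False, "No Dictionary", None),
    ("TripFare", "SingleValue/Metric", False, "No Dictionary", None),
]


def build_index_config(columns):
    """Group columns by the indexing and compression choices in the table."""
    config = {
        "sortedColumn": [],
        "invertedIndexColumns": [],
        "noDictionaryColumns": [],
    }
    for name, _type, _filtered, compression, indexing in columns:
        if indexing == "Sorted":
            config["sortedColumn"].append(name)
        elif indexing == "Inverted":
            config["invertedIndexColumns"].append(name)
        if compression == "No Dictionary":
            config["noDictionaryColumns"].append(name)
    return config


index_config = build_index_config(COLUMNS)
print(index_config)
```

In this table, only the two columns that are filtered on carry an index, and the columns that are never filtered skip the dictionary.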
Pinot Data Ingestion

Realtime Ingestion:

Consumer Type         Scalability                     Consistency
High Level Consumer   Hard to scale beyond one node   Sacrifices consistency during failures
Low Level Consumer    Scalable beyond one node        Strong consistency guarantees even during failures

Segment Persistence: 500k msg or 6 hours

Offline Ingestion:
Oozie schedules a daily incremental backfill from Hive to Pinot
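
The segment persistence rule for realtime ingestion ("500k msg or 6 hours") can be read as a two-threshold check. The sketch below assumes the segment is persisted as soon as either threshold is reached, whichever comes first. The `ConsumingSegment` class and `should_persist` function are hypothetical names for illustration, not Pinot APIs.

```python
from dataclasses import dataclass

MAX_MESSAGES = 500_000          # "500k msg" from the slide
MAX_AGE_SECONDS = 6 * 60 * 60   # "6 hours" from the slide


@dataclass
class ConsumingSegment:
    """Hypothetical stand-in for an in-memory segment that is still consuming."""
    message_count: int
    age_seconds: float


def should_persist(segment: ConsumingSegment) -> bool:
    """Persist once either threshold is reached (assumed: whichever comes first)."""
    return (segment.message_count >= MAX_MESSAGES
            or segment.age_seconds >= MAX_AGE_SECONDS)


print(should_persist(ConsumingSegment(120_000, 3_600)))    # False
print(should_persist(ConsumingSegment(500_000, 3_600)))    # True
print(should_persist(ConsumingSegment(120_000, 21_600)))   # True
```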
Pinot Perf
Pinot Realtime Ingestion

Hardware: 4 base SKU boxes (24 cores, 128G RAM)

Consumer Type                  HLC    LLC
Peak Traffic (msg/sec/box)     20k    200k
Peak Traffic (bytes/sec/box)   4M     40M

Storage                  Kafka   Pinot
Total Data Volume (GB)   500     60
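
A few ratios follow directly from these figures. The arithmetic below is a sketch that reads "k" as 10^3 and "M" as 10^6 bytes, and assumes the Kafka and Pinot volumes describe the same data. Both are assumptions, since the slide does not say.

```python
# Figures copied from the table above.
hlc_msgs, llc_msgs = 20_000, 200_000           # peak msg/sec/box
hlc_bytes, llc_bytes = 4_000_000, 40_000_000   # peak bytes/sec/box, "M" read as 10**6
kafka_gb, pinot_gb = 500, 60                   # total data volume in GB

throughput_gain = llc_msgs / hlc_msgs          # LLC relative to HLC
avg_msg_bytes = hlc_bytes / hlc_msgs           # implied average message size
storage_ratio = kafka_gb / pinot_gb            # Kafka volume relative to Pinot volume

print(f"LLC/HLC throughput: {throughput_gain:.0f}x")          # 10x
print(f"Implied message size: {avg_msg_bytes:.0f} bytes")     # 200 bytes
print(f"Kafka/Pinot storage ratio: {storage_ratio:.1f}x")     # 8.3x
```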
Pinot/Druid Data Size

Raw Data:
500M Rows, 30 columns
Raw JSON: 391.9G

Three Storage Tiers in Pinot/Druid:
- Segments in Deep Storage (NFS or HDFS)
- Local Disk Cache
- Memory
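
The three tiers can be pictured as a lookup chain from fastest to slowest. The slide only lists the tiers, so the sketch below is a toy model under stated assumptions: the `TieredSegmentStore` class is hypothetical, plain dictionaries stand in for real storage, and the promote-on-read policy is an assumption rather than documented Pinot or Druid behavior.

```python
class TieredSegmentStore:
    """Toy model of the three tiers; dicts stand in for real storage."""

    def __init__(self, deep_storage):
        self.deep_storage = deep_storage  # stands in for NFS or HDFS
        self.local_disk = {}              # stands in for the local disk cache
        self.memory = {}                  # stands in for segments held in memory

    def load(self, segment_name):
        """Return (segment, tier it was found in), promoting it on the way."""
        if segment_name in self.memory:
            return self.memory[segment_name], "memory"
        if segment_name in self.local_disk:
            segment = self.local_disk[segment_name]
            self.memory[segment_name] = segment
            return segment, "local disk"
        segment = self.deep_storage[segment_name]  # raises KeyError if unknown
        self.local_disk[segment_name] = segment
        self.memory[segment_name] = segment
        return segment, "deep storage"


store = TieredSegmentStore({"trips_2016_01": b"segment bytes"})
print(store.load("trips_2016_01")[1])  # deep storage
print(store.load("trips_2016_01")[1])  # memory
```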
Pinot/Druid Query Performance

Max Duration:
select max(duration) from trips

Count All Grouped by City:
select count(*) from trips group by city_id top 10000

Count All in One Month:
select count(*) from trips where Month = '201601'

Count All in SF:
select count(*) from trips where city_id=1 group by Month

Unique Drivers in SF:
select distinctCountHLL(driver_uuid) from trips where city_id=1

Unique Drivers By Date:
select distinctCountHLL(driver_uuid) from trips group by Date
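
To make clear what each benchmark query computes, the sketch below evaluates the same six aggregations over a handful of in-memory rows. The sample rows are hypothetical, and only the column names come from the queries. Two simplifications are assumed: `most_common(n)` stands in for `top n`, and an exact set count stands in for `distinctCountHLL`, which returns a HyperLogLog estimate.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; only the column names come from the queries above.
trips = [
    {"city_id": 1, "Month": "201601", "Date": "20160105", "driver_uuid": "d1", "duration": 600},
    {"city_id": 1, "Month": "201601", "Date": "20160105", "driver_uuid": "d2", "duration": 900},
    {"city_id": 1, "Month": "201602", "Date": "20160210", "driver_uuid": "d1", "duration": 300},
    {"city_id": 2, "Month": "201601", "Date": "20160105", "driver_uuid": "d3", "duration": 1200},
]

# Max Duration
max_duration = max(t["duration"] for t in trips)

# Count All Grouped by City (most_common(n) plays the role of "top n")
count_by_city = Counter(t["city_id"] for t in trips).most_common(10000)

# Count All in One Month
count_in_month = sum(1 for t in trips if t["Month"] == "201601")

# Count All in SF, grouped by month (city_id = 1, as in the query)
sf_by_month = Counter(t["Month"] for t in trips if t["city_id"] == 1)

# Unique Drivers in SF: an exact count here, where distinctCountHLL would estimate
sf_drivers = len({t["driver_uuid"] for t in trips if t["city_id"] == 1})

# Unique Drivers By Date
drivers_by_date = defaultdict(set)
for t in trips:
    drivers_by_date[t["Date"]].add(t["driver_uuid"])
unique_by_date = {date: len(drivers) for date, drivers in drivers_by_date.items()}

print(max_duration, count_by_city, count_in_month)
print(dict(sf_by_month), sf_drivers, unique_by_date)
```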
Pinot/Druid Concurrent Query
Query: select count(*) from trips group by city_id
Guaranteed SLA for Site Facing Products

Aggregation on Rider trips:
select count(*) from trips where riderId = x and date > 20170225
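
The data model table marks RiderId as a sorted column, and this query shows why that choice suits a per-rider lookup. The sketch below is a simplified model of a sorted column, not Pinot's implementation: when rows are stored in riderId order, two binary searches locate the rider's contiguous row range, and only that range is scanned for the date predicate. The sample rows and the `count_trips` helper are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows, stored sorted by riderId as a sorted column would be.
rider_ids = [101, 101, 101, 205, 205, 309]
dates = [20170220, 20170226, 20170301, 20170224, 20170227, 20170226]


def count_trips(rider_id, after_date):
    """Count rows for one rider with date > after_date."""
    # Two binary searches find the contiguous row range for the rider.
    start = bisect_left(rider_ids, rider_id)
    end = bisect_right(rider_ids, rider_id)
    # Only that range is scanned for the date predicate.
    return sum(1 for row in range(start, end) if dates[row] > after_date)


print(count_trips(101, 20170225))  # 2
print(count_trips(205, 20170225))  # 1
print(count_trips(999, 20170225))  # 0
```

The cost of the lookup grows with the logarithm of the table size plus the number of rows for that rider, rather than with the total row count.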
Thank you

