STREAM PROCESSING
IN UBER
MARKETPLACE
~ 68 countries / 350+ cities
Transportation as reliable as running
water, everywhere, for everyone
Agenda
What's on the menu?
• Use Cases
• Problem Space
• Overall Architecture
• Choices & Tradeoffs
• Q & A
Use Case: Realtime OLAP
There is always need for quick exploration
How many open cars in the world, NOW?
Streaming Processing in Uber Marketplace for Kafka Summit 2016
How many UberXs were driving clients in SF in the past 10
minutes by hexagons?
Driving time and other metrics over time by hexagonal area
Use Case: Complex Event Processing
There are patterns in event streams
How many drivers cancel requests
more than 3 times in a row within a 10-
minute window?
Report riders requesting a pickup 100 miles
apart within a half hour window?
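As a rough illustration of the first question, the sketch below flags drivers with more than 3 consecutive cancels inside a 10-minute window. The event tuple layout, the event type names ("cancel", "accept"), and the function name are assumptions made for this example; it is not the CEP engine used in the talk.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # the 10-minute window from the slide
MIN_STREAK = 4            # "more than 3 times in a row"


def drivers_with_cancel_streaks(events):
    """Return the ids of drivers with MIN_STREAK consecutive cancels in one window.

    events: iterable of (timestamp_seconds, driver_id, event_type) tuples,
    ordered by time for each driver (hypothetical layout).
    """
    streaks = defaultdict(deque)  # driver_id -> timestamps of the current run of cancels
    flagged = set()
    for ts, driver_id, event_type in events:
        streak = streaks[driver_id]
        if event_type != "cancel":
            streak.clear()  # any other event breaks the run
            continue
        streak.append(ts)
        # Drop cancels that are older than the window relative to this one.
        while ts - streak[0] > WINDOW_SECONDS:
            streak.popleft()
        if len(streak) >= MIN_STREAK:
            flagged.add(driver_id)
    return flagged


events = [
    # d1: four cancels in a row within three minutes
    (0, "d1", "cancel"), (60, "d1", "cancel"), (120, "d1", "cancel"), (180, "d1", "cancel"),
    # d2: the run is broken by an accept, then only three cancels follow
    (0, "d2", "cancel"), (60, "d2", "accept"), (120, "d2", "cancel"),
    (180, "d2", "cancel"), (240, "d2", "cancel"),
    # d3: four cancels in a row, but spread over fifteen minutes
    (0, "d3", "cancel"), (300, "d3", "cancel"), (600, "d3", "cancel"), (900, "d3", "cancel"),
]
flagged = drivers_with_cancel_streaks(events)
print(flagged)  # {'d1'}
```

Keeping only the timestamps of the current run per driver means memory stays bounded by the streak length rather than by the window's full event volume.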
IF
This ->
Then that ->
• Sigma is similar - but for offline/batch applications
Complex Event Processing
Use Case: Supply Positioning
Clusters Of Supply & Demand
Predicted Health
Metrics
Actual Health Metrics
Monitor Marketplace Health
Challenges
OLAP of Geo-spatial Temporal Data
Reasonably Large Scale
Near Real Time
• Indexing, Lookup, Rendering
• Symmetric Neighbors
• Convex & Compact Regions
• Equal Areas
• Equal Shape
Hexagons
Scale
Geo Space x Vehicle Types x Time x Status
Granular Geo Areas
Over 10,000 hexagons in a city
Multiple Vehicle Types
7 vehicle types
Minute-level Time Buckets
1440 minutes in a day
Many Driver States
13 driver states
Many Cities
300 cities
Granular Data
1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion
possible combinations
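The arithmetic behind that figure can be checked directly. The dictionary keys below are labels chosen for this sketch; the numbers are the ones quoted on the slides.

```python
from math import prod

# Cardinality of each dimension, as quoted on the slides.
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations = prod(dimensions.values())
print(f"{combinations:,}")  # 393,120,000,000
```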
Unknown Query Patterns
Any combination of dimensions
Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
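A minimal sketch of what "any combination of dimensions" means for counting and Top N, using in-memory events. The field names and sample values are made up for illustration, and a real deployment would push this work into an OLAP store rather than a Python loop.

```python
from collections import Counter


def count_by(events, dims):
    """Count events grouped by an arbitrary list of dimension names."""
    return Counter(tuple(event[d] for d in dims) for event in events)


def top_n(events, dims, n):
    """Return the n most frequent dimension combinations with their counts."""
    return count_by(events, dims).most_common(n)


# Hypothetical events; field names are assumptions for this example.
events = [
    {"hexagon": "h1", "vehicle": "UberX", "status": "open"},
    {"hexagon": "h1", "vehicle": "UberX", "status": "on_trip"},
    {"hexagon": "h1", "vehicle": "UberBLACK", "status": "open"},
    {"hexagon": "h2", "vehicle": "UberX", "status": "open"},
    {"hexagon": "h1", "vehicle": "UberX", "status": "open"},
]

by_vehicle_status = count_by(events, ["vehicle", "status"])
busiest_hexagon = top_n(events, ["hexagon"], 1)
print(by_vehicle_status[("UberX", "open")])  # 3
print(busiest_hexagon)                       # [(('h1',), 4)]
```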
Large Data Volume
• Hundreds of thousands of events per second
• At least dozens of fields in each event
Multiple Topics
Rider States Driver States
Let's build a stream processing pipeline
Accurate Statistics
• E.g., can't overcount
Pipeline Template
Event Collection
Multiple Event Types with Different Volume
Hundreds of Thousands of Events Per Second
Events Should Be Available Under a Second
Events Should Rarely Get Lost
Multiple Consumers
Natural Choice: Apache Kafka
- Low latency and high throughput
- Persistent events
- Distributes a topic by partitions
- Groups consumers by consumer groups
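The last two points can be pictured with a small simulation. This is not Kafka client code: the CRC32 hash and the round-robin assignment are stand-ins chosen for the sketch, and the function names are hypothetical. It only shows the idea that a key maps to one partition, and that each consumer group divides all partitions among its members.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition with a stable hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Every event for the same driver lands in the same partition.
p1 = partition_for("driver-42", 8)
p2 = partition_for("driver-42", 8)

# Two independent consumer groups each see every partition of the topic.
olap_group = assign_partitions(4, ["olap-1", "olap-2"])
cep_group = assign_partitions(4, ["cep-1"])
print(olap_group)  # {'olap-1': [0, 2], 'olap-2': [1, 3]}
print(cep_group)   # {'cep-1': [0, 1, 2, 3]}
```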
Event Processing
Transformation
Event Transformation Example
(Lat, Long) -> (zipcode, hexagon, S2)
Pre-aggregation
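One way to read pre-aggregation, given the minute-level time buckets mentioned earlier, is to roll raw events up into per-minute counts before they reach the query layer. The sketch below assumes events carry an epoch timestamp in seconds plus already-transformed dimensions; the field names are hypothetical.

```python
from collections import Counter


def pre_aggregate(events):
    """Roll events up into counts per (minute bucket, hexagon, vehicle, status)."""
    counts = Counter()
    for event in events:
        minute_bucket = event["epoch_seconds"] // 60  # minute-level time bucket
        key = (minute_bucket, event["hexagon"], event["vehicle"], event["status"])
        counts[key] += 1
    return counts


# Hypothetical events: three in the first minute, one in the second.
events = [
    {"epoch_seconds": 5, "hexagon": "h1", "vehicle": "UberX", "status": "open"},
    {"epoch_seconds": 42, "hexagon": "h1", "vehicle": "UberX", "status": "open"},
    {"epoch_seconds": 59, "hexagon": "h2", "vehicle": "UberX", "status": "open"},
    {"epoch_seconds": 61, "hexagon": "h1", "vehicle": "UberX", "status": "open"},
]

rollup = pre_aggregate(events)
print(rollup[(0, "h1", "UberX", "open")])  # 2
print(len(rollup))                         # 3
```

Four raw events collapse to three rows here; at hundreds of thousands of events per second the reduction is what keeps downstream storage manageable.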
Joining Multiple Streams
Sessionization
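The slides do not define how sessions are cut. A common approach, assumed here, is gap-based sessionization: events for one key belong to the same session until the gap between consecutive events exceeds a threshold. The threshold and function name are illustrative.

```python
def sessionize(timestamps, max_gap_seconds):
    """Group timestamps into sessions split wherever the gap exceeds max_gap_seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)  # close enough to the previous event: same session
        else:
            sessions.append([ts])    # gap too large (or first event): start a new session
    return sessions


sessions = sessionize([0, 50, 400, 420, 1000], max_gap_seconds=100)
print(sessions)  # [[0, 50], [400, 420], [1000]]
```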
Multi-Staged Processing
Minimum Requirements
- State Management
- Checkpointing
- Automatic Resource Management
- Multi-staged processing
Apache Samza
Why Apache Samza?
- DAG on Kafka
- Excellent integration with Kafka
- Built-in checkpointing
- Built-in state management
- Excellent support from our data team
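To see why checkpointing and state management matter for accurate counts, here is a toy simulation that does not use Samza's API. It assumes the offset and the state snapshot are saved together, so that a restart replays from the last checkpoint without counting any message twice. The `store` dictionary, `run` function, and `crash_at` parameter are all hypothetical.

```python
import copy


def run(log, store, crash_at=None, checkpoint_every=2):
    """Count keys in log, resuming from the offset and state saved in store."""
    offset = store.get("offset", 0)
    counts = copy.deepcopy(store.get("counts", {}))
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError("simulated crash")
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # Save offset and state snapshot together.
            store["offset"] = offset
            store["counts"] = copy.deepcopy(counts)
    store["offset"] = offset
    store["counts"] = copy.deepcopy(counts)
    return counts


log = ["a", "b", "a", "c", "a"]
store = {}
try:
    run(log, store, crash_at=3)  # crashes after processing 3 messages, 2 checkpointed
except RuntimeError:
    pass
checkpointed_offset = store["offset"]
final_counts = run(log, store)   # resumes from offset 2 with the saved counts
print(checkpointed_offset)  # 2
print(final_counts)         # {'a': 3, 'b': 1, 'c': 1}
```

The third message is processed twice, once before the crash and once after the restart, but the in-memory count from the first attempt is discarded with the crash, so the final totals are not inflated.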
Samza Is Conceptually Simple
Slightly Expanded Version
Applications
Dashboard of Realtime Business Metrics
Ad-Hoc Queries
Visualization with Streaming
LocationUpdate where city = X
LocationUpdate where city = Y and vehicle = 'UberX'
100% 100% 100% 10% 5%
Visualization with Streaming
LocationUpdate where city = 'SF'
LocationUpdate where city = 'LA' and vehicle
10% 5% 100% 100%
Ad-hoc Exploration
A Few Trade-Offs
Lambda vs Kappa
We Use Lambda
- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but
- We may need to go way back due to changes in business requirements
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance
Processing by Event Time Is Not Always Easy
Leverage The Storage Layer
Dealing with Limitations of Samza
- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can't have an arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. In-house code-gen and deployment solution
Thank You
