Stream Processing & Analytics with Flink
Danny Yuan, Engineer @ Uber
@g9yuayon
Four Kinds of Analytics
- On demand aggregation and pattern detection
- Clustering
- Forecasting
- Pattern detection on geo-temporal data
Two Ingredients
- Geo/Spatial
- Time
Real-time aggregation and pattern matching
Complex Event Processing
How many cars entered and exited a user-defined area in the past 5 minutes?
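A question like this boils down to a count over a sliding time window, filtered by area. A minimal sketch in plain Python (not the Samza or Flink API), assuming events arrive in event-time order; the field names `ts`, `area`, and `kind` and the class name are hypothetical:

```python
from collections import deque


class SlidingAreaCounter:
    """Counts 'enter' and 'exit' events for one area over the last `window_s` seconds."""

    def __init__(self, area, window_s=300):
        self.area = area
        self.window_s = window_s
        self.events = deque()  # (ts, kind) tuples, oldest first

    def add(self, ts, area, kind):
        # Keep only events for the area of interest.
        if area == self.area and kind in ("enter", "exit"):
            self.events.append((ts, kind))

    def counts(self, now):
        # Evict events that are at or before the start of the window.
        cutoff = now - self.window_s
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        enters = sum(1 for _, kind in self.events if kind == "enter")
        return {"enter": enters, "exit": len(self.events) - enters}
```

A real deployment would test membership against a user-drawn polygon rather than an area name; the string comparison here only stands in for that check.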
Examples
It doesn’t have to be within a time or area
CEP with full historical context
Notify me if a partner has just completed her 100th trip in a given area.
Patterns in the future
How many first-time riders will be dropped off in a given area in the next 5 minutes?
Geo: user flexibility is important
It needs to be scalable
- Every hexagon
- Every driver/rider
CEP Pipeline Built on Samza
- No hard-coded CEP rules
- Applying CEP rules per individual entity: topic, driver, rider, cohort, etc.
- Flexible checkpointing and statement management
We need to evolve our architecture for other analytics
Clustering
Manually Created Cluster
Call for algorithmically created clusters
- Clustering based on key performance metrics
- Continuously measure the clusters
- Different clustering for different business needs
- Create clusters in minutes for all cities
- Foundation for other stream analytics
Home-grown Clustering Service
- All cities under 3 minutes
- Easily pluggable algorithms and measurements
- Historical geo-temporal data for clustering
- Real-time geo-temporal data for measurement
- Shared optimizations. To put things in perspective:
  - 70,000 hexagons in SF
  - A naive pairwise distance function requires at least 70,000 x 70,000 = 4.9 billion pairs!
- Incremental updates
- Compact data representation
- Memoization
- Avoid anything more complex than O(n log n)
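The slides do not say which optimizations were used. As one illustration of how the all-pairs cost can be avoided, the sketch below assumes that only adjacent hexagons need to be compared and that cells are addressed by axial coordinates `(q, r)`; both are assumptions, and the function names are hypothetical. With a hash lookup per neighbor, the work is linear in the number of cells, bounded by 3n adjacent pairs.

```python
# Axial-coordinate offsets of the six neighbours of a hexagonal cell (q, r).
NEIGHBOR_OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def neighbor_pairs(cells):
    """Return every unordered pair of adjacent cells exactly once."""
    cell_set = set(cells)
    pairs = set()
    for q, r in cell_set:
        for dq, dr in NEIGHBOR_OFFSETS:
            other = (q + dq, r + dr)
            if other in cell_set:
                # frozenset makes (a, b) and (b, a) the same pair.
                pairs.add(frozenset(((q, r), other)))
    return pairs


def naive_pair_count(n):
    """Number of ordered pairs an all-against-all distance function visits."""
    return n * n
```

For 70,000 cells the naive count is 4.9 billion, while the adjacency-only approach touches at most 210,000 pairs.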
Forecasting
- Every decision is based on forecasting
Forecasting
- Forecasting based on both historical data and stream input
Anomaly, or emerging demand?
Forecasting
- Spatially granular forecasting - down to every hexagon
Forecasting
- Temporally granular forecasting - down to every minute
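The deck does not describe the forecasting model. Purely as an illustration of combining historical data with stream input at per-hexagon, per-minute granularity, the sketch below scales a historical baseline by how far recent observations deviate from it. The scaling rule, the function name, and the dictionary layouts are all assumptions.

```python
def forecast(baseline, recent_actuals, hexagon, target_minute):
    """Scale the historical baseline by the recent observed-to-expected ratio.

    baseline: dict mapping (hexagon, minute) -> historical mean count
    recent_actuals: dict mapping minute -> count just observed on the stream
                    for this hexagon
    """
    expected = sum(baseline[(hexagon, minute)] for minute in recent_actuals)
    observed = sum(recent_actuals.values())
    # Fall back to the plain baseline when there is nothing to compare against.
    ratio = observed / expected if expected > 0 else 1.0
    return baseline[(hexagon, target_minute)] * ratio
```

A sustained ratio far from 1.0 is also where the "anomaly, or emerging demand?" question arises: the arithmetic alone cannot tell the two apart.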
Pattern Detection
- Similarity of different metrics across geolocation and time
- Metric outliers across geolocations and time
- Frequent occurrences of certain patterns
- Clustered behavior
- Anomalies
Common Requirements in Pattern Detection
- Not just traditional time series analysis
- Incorporating insights on marketplace data
- Requires both historical data and real-time input
- Spatially granular patterns - down to every hexagon
- Temporally granular patterns - down to every minute
Example: Anomaly Detection
- Simple time series analysis
- For a single geo area
- Can be noisy
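A simple single-area detector of this kind can be sketched as a trailing-window z-score test. This is a generic technique chosen for illustration, not necessarily the one shown in the talk; the window length and threshold are arbitrary assumptions.

```python
from statistics import mean, pstdev


def zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # A flat history gives no scale to measure deviation against.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

The noise problem shows up directly: once a spike enters the trailing window it inflates the standard deviation, so nearby points are judged against a distorted baseline.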
A More Realistic Anomaly Detection
What’s the right architecture to support the analytics use cases?
Shared abstraction: multi-dimensional geo-temporal data
- Time series by event time (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
- Flexible windowing - tumbling, sliding, conditionally triggered
  - e.g., event-based triggers
  - e.g., triggers of computation results
- Stateful processing, e.g., state per key
- Unified stream
  - Real-time streams: unbounded streams
  - Batch: bounded streams
  - s/lambda/kappa
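To make tumbling and sliding windows over event time concrete, here is a minimal plain-Python sketch of how a timestamp maps to windows, plus a per-key count as a small example of state per key. It does not use Flink; the function names are hypothetical, and timestamps are assumed to be integers in a single unit such as seconds.

```python
from collections import defaultdict


def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size`, started every
    `slide` units, that contains ts, in ascending order."""
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_per_key_and_window(events, size):
    """events: iterable of (event_ts, key). Keeps one counter per (key, window)."""
    counts = defaultdict(int)
    for ts, key in events:
        counts[(key, tumbling_window(ts, size))] += 1
    return dict(counts)
```

Because windows are derived from the event timestamp rather than arrival time, the same function gives the same answer for a live stream and for a replayed, bounded one.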
Shared abstraction: multi-dimensional geo-temporal data
- Ordering by event time
- Flexible windowing with watermark and triggers
- Exactly-once semantics
- Built-in state management and checkpointing
- Nice data flow APIs
Apache Flink
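The watermark idea can be sketched without any framework: track the largest event timestamp seen so far and treat anything older than that minus an allowed delay as late. The class below is a simplified illustration with hypothetical names, loosely modeled on a bounded-out-of-orderness strategy rather than a copy of Flink's implementation.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = largest event timestamp seen so far minus `max_delay`."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = None

    def current(self):
        """Current watermark, or None before any event has been observed."""
        return None if self.max_ts is None else self.max_ts - self.max_delay

    def observe(self, ts):
        """Record an event timestamp; return True if the event is late,
        i.e. strictly older than the watermark in effect when it arrived."""
        late = self.max_ts is not None and ts < self.max_ts - self.max_delay
        if self.max_ts is None or ts > self.max_ts:
            self.max_ts = ts
        return late
```

A window ending at time t can be evaluated once the watermark passes t; what to do with late events (drop, or re-fire the window) is a policy choice left out here.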
Mental Picture for Processing Geo-temporal Data
A Simple Example: simple prediction
Sources
    .fromKafka()
    .config(config)
    .cluster(aCluster)
    .topics(topicList)
.assignTimestampsAndWatermarks(…)
.keyBy(…)
.timeWindow(…)
.flatMap(…)
.keyBy(…)
.apply(statefulFn)
.addSink(…)
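The operator sequence above can be mimicked in plain Python to show what each stage contributes. This is a batch simulation over a list, not Flink code, and the prediction function is a stand-in: the slides do not show `statefulFn`, so an exponentially weighted moving average is assumed here. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def run_pipeline(events, window_s=60, alpha=0.5):
    """events: iterable of (event_ts, key, value).

    Returns the records that would reach the sink, as
    (key, window_start, window_total, prediction) tuples.
    """
    # keyBy + timeWindow: sum values per (key, tumbling event-time window).
    sums = defaultdict(float)
    for ts, key, value in events:
        window_start = ts - ts % window_s
        sums[(key, window_start)] += value

    # keyBy + stateful function: per-key EWMA, fed in window order.
    state = {}  # key -> last prediction
    sink = []
    ordered = sorted(sums.items(), key=lambda kv: (kv[0][1], kv[0][0]))
    for (key, window_start), total in ordered:
        previous = state.get(key)
        if previous is None:
            prediction = total
        else:
            prediction = alpha * total + (1 - alpha) * previous
        state[key] = prediction
        sink.append((key, window_start, total, prediction))
    return sink
```

In a streaming engine the `state` dictionary corresponds to keyed state that is checkpointed by the framework, and windows are emitted as the watermark advances instead of after all input is read.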
High Level Data Flow
Geotemporal API for efficiency
Geotemporal API for productivity
Forecasting as an example
Lessons Learned
- Make sure you have robust infrastructure support
- Scaling up matters, i.e., single-node optimization
- Ensure exactly-once semantics through proper data modeling
- Use an external state store to avoid excessive snapshotting
- Standardize monitoring and data validation
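One common reading of "exactly-once through data modeling" is to make writes idempotent, so that results replayed after a failure overwrite rather than duplicate. Whether this is what the talk meant is an assumption; the sketch below uses a hypothetical in-memory sink keyed by `(key, window_start)`.

```python
class IdempotentSink:
    """Stores one row per (key, window_start); rewriting a row is harmless."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, window_start, value):
        # The natural key makes a replayed write overwrite the earlier copy.
        self.rows[(key, window_start)] = value


def replay(sink, results):
    """Write a batch of (key, window_start, value) results to the sink."""
    for key, window_start, value in results:
        sink.upsert(key, window_start, value)
```

With an append-only table the same replay would double every count, which is why the choice of row key carries the delivery guarantee here.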
Choose a Stream Processing Platform
Thank You

Streaming Analytics in Uber