Now and The Future
Lyft Data Platform
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_
Improve people’s lives with the world’s best transportation
● 30.7M riders in 2018
● 1.9M drivers in 2018
● 1B+ cumulative rides
● 300+ markets in US & Canada
Data is at the core of decisions at Lyft
Automated decisions
- What’s the price for the ride?
- What driver to match?
- What’s the ETA?
Analyzing business performance
- How are key business metrics trending?
- How do predicted ETAs compare to actual?
Human business decisions
- Which opportunities to invest in?
- Which path to take (via experimentation)?
Data platform users
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, PMs/Execs
Analytics, Biz ops, Building apps, Experimentation
By numbers...
● Millions of BI queries per week, doubling quarterly
● 5X increase in productivity of ML models in 2018
● 20X scaling of support of maps to users through streaming platform in 2018
Data as a platform to accelerate the business and reduce risk...
Product Teams, Applied ML, Forecasting
ML Platform
Data Platform and Infra
Source: The AI Hierarchy of Needs, Monica Rogati (8/2017)
Guiding principles for the data platform team...
Innovative, Impactful, Reliable
● Think ahead to the future (e.g. streaming, machine learning, security and privacy, visualization, discovery, etc.).
● Provide a step change (vs. incremental) in capability.
● Move fast.
● Create a competitive advantage.
● Focus on impact: develop jointly with application verticals.
● Build an enterprise-grade platform.
● Have a clearly defined contract with applications (e.g. SLAs).
● Provide a serverless application for the product teams.
Use case #1
Unmet need for business metric observability
Business metric observability
What’s the health of the business?
Grafana
Operational observability
What’s the health of the service?
● Is the service up?
● Is it throwing errors?
● In near real-time (< 1 min)
Requirements for biz metric observability
Near real-time
- See results within 1-30 minutes
Be the source of truth
- Don’t widen the gap for reconciliation
Impact on business metrics
- Derive business metrics from raw data (aka ETL)
Project F2 architecture
[Architecture diagram. Labeled components: Operational Data stores (e.g. Dynamo), CDC, Apache Superset, Data Discovery app (Amundsen); online flow and offline flow]
Magic of CDC - Change Data Capture
Operational Data stores (e.g. Dynamo) → Analytical Data stores (e.g. Hive/Presto, BQ)
1. Tail the operational data stores
2. Persist the raw change log
3. Upsert the change log to table periodically (~30 min)
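The three CDC steps can be illustrated with a small, self-contained sketch. This is not Lyft's implementation: the `Change` record, its `revision` field, and the in-memory `table` are all hypothetical stand-ins for a raw change log and an analytical table, and it assumes revisions increase monotonically per key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One entry in the raw change log (hypothetical schema)."""
    key: str
    revision: int          # assumed to increase monotonically per key
    row: Optional[dict]    # None marks a delete


def upsert(table: Dict[str, dict], revisions: Dict[str, int],
           change_log: Iterable[Change]) -> None:
    """Apply a batch of changes in place, keeping the newest revision per key."""
    for change in change_log:
        # Skip stale or replayed entries so re-running a batch is harmless.
        if change.revision <= revisions.get(change.key, -1):
            continue
        revisions[change.key] = change.revision
        if change.row is None:
            table.pop(change.key, None)   # delete
        else:
            table[change.key] = change.row  # insert or update


table: Dict[str, dict] = {}
revisions: Dict[str, int] = {}
log = [
    Change("ride-1", 1, {"status": "requested"}),
    Change("ride-2", 1, {"status": "requested"}),
    Change("ride-1", 2, {"status": "completed"}),
    Change("ride-2", 2, None),                      # ride-2 deleted
    Change("ride-1", 1, {"status": "requested"}),   # replayed entry, ignored
]
upsert(table, revisions, log)
print(table)
```

Tracking the last applied revision per key is one way to make the periodic upsert idempotent, so a batch that is delivered twice does not roll a row back to an older state.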
Advantages of CDC
Near real-time
- See results within 30 minutes
Source of truth
- No need to reconcile
- Same data as operational DBs
Data Engineer Productivity
- No need to recreate ETL from events
- Easier primitives to build ETL on top of
Challenges of the architecture
● Measuring reliability
○ How to distinguish late arriving data from missing data?
○ How do you trace a single missing revision through all moving parts?
● Lots of moving parts
○ Tailer, tied to implementation of operational DB
○ Ingest pipeline
○ Kafka, Kinesis
○ Analytic Database
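One possible way to frame the late-versus-missing question is with an allowed-lateness threshold, sketched below. The slides do not say how this is measured at Lyft; the function name, the per-key revision numbering, and the 10-minute threshold are all assumptions made for illustration.

```python
from typing import Dict


def classify_gaps(arrivals: Dict[int, float], now: float,
                  allowed_lateness_s: float) -> Dict[int, str]:
    """Classify absent revisions for one key as 'late' or 'missing'.

    arrivals maps revision number -> arrival time in seconds (hypothetical).
    A gap is 'late' while its successor arrived recently enough that the
    absent revision may still show up, and 'missing' after that.
    """
    result: Dict[int, str] = {}
    if not arrivals:
        return result
    for rev in range(min(arrivals), max(arrivals)):
        if rev in arrivals:
            continue
        # The first later revision that did arrive tells us how long we have waited.
        successor = min(r for r in arrivals if r > rev)
        waited = now - arrivals[successor]
        result[rev] = "late" if waited <= allowed_lateness_s else "missing"
    return result


# Revisions 3 and 6 never arrived; revision 7 arrived only 10 seconds ago.
arrivals = {1: 0.0, 2: 10.0, 4: 20.0, 5: 1000.0, 7: 1790.0}
result = classify_gaps(arrivals, now=1800.0, allowed_lateness_s=600.0)
print(result)
```

The threshold trades alert latency against false alarms: a short window flags slow pipelines as data loss, while a long one delays the detection of real gaps.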
CDC + Streaming = Lots of business value
Use case #2
Data Science use cases - Driver app
Data Science use cases - Pricing
Requirements for streaming applications
In Streaming, just like in Batch
Quick and simple ways of cleaning data
Prototype in a language of choice (Python, R, SQL)
Streaming architecture
[Architecture diagram. Labeled components: Services (e.g. ETA, Pricing); Models + Applications (e.g. ETA, Pricing); Flyte]
Investments in Streaming
Dryft
Fully managed data processing engine, powering real-time features and events
- Needed for consistent feature generation
- Batch processing for bulk creation of features for training
- Stream processing for real-time creation of features for scoring
- Uses Flink SQL under the hood
Apache Beam
Open source unified, portable and extensible model for both batch and streaming use-cases
- Enables streaming use cases for teams using non-JVM languages
- Uses Flink under the hood
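The idea of consistent feature generation, with one definition serving both bulk training and real-time scoring, can be sketched in plain Python. This is not Dryft's API (the slides only say it uses Flink SQL under the hood); the event schema and function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (driver_id, ride_fare), a hypothetical schema


def update_mean(state: Tuple[int, float], fare: float) -> Tuple[int, float]:
    """The single feature definition: running (count, mean) of fares."""
    count, mean = state
    count += 1
    return count, mean + (fare - mean) / count


def batch_features(events: Iterable[Event]) -> Dict[str, float]:
    """Bulk creation of features for training: one value per driver."""
    state = defaultdict(lambda: (0, 0.0))
    for driver, fare in events:
        state[driver] = update_mean(state[driver], fare)
    return {driver: mean for driver, (_, mean) in state.items()}


def stream_features(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Real-time creation of features for scoring: emit after every event."""
    state = defaultdict(lambda: (0, 0.0))
    for driver, fare in events:
        state[driver] = update_mean(state[driver], fare)
        yield driver, state[driver][1]


events = [("d1", 10.0), ("d2", 8.0), ("d1", 20.0)]
offline = batch_features(events)
online = dict(stream_features(events))  # keeps the last emitted value per driver
print(offline, online)
```

Because both paths call the same `update_mean`, the value a model sees at scoring time matches the value it was trained on, which is the consistency property the slide calls out.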
Challenges of the architecture
● Things we find at scale
○ Intermittent AWS service errors
○ Can’t be naive about pub-sub consumption
● Integration
○ Things work in isolation, but …
○ Flink Kinesis Connector
■ Connectors that work at scale are hard
Sharing your batch and streaming compute will pay huge benefits
The whole shebang
Data Platform architecture
[Architecture diagram. Labeled components: Services (e.g. ETA, Pricing); Operational Data stores (e.g. Dynamo); Models + Applications (e.g. ETA, Pricing); Flyte; Data Discovery app (Amundsen); Apache Superset; BI/Data Viz; Custom apps (Marketplace Operations app, other custom apps)]
Kafka vs. Kinesis
Kinesis scaling limitations
• We require high throughput & high fan-out
• Default limit of 500 shards
• Resharding is expensive and slow
• Built a fan-out system to work around limitations
Kafka is better but …
• Has limitations around fan-in
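The slide does not describe how the fan-out system is built. As a rough illustration of the general idea only, the sketch below reads each record from the source once and copies it to every subscriber, so that adding consumers does not add readers on the source stream. The `FanOut` class and the in-process queues are hypothetical stand-ins for real downstream streams.

```python
import queue
from typing import Dict, Iterable


class FanOut:
    """Read each source record once and copy it to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, queue.Queue] = {}

    def subscribe(self, name: str) -> queue.Queue:
        """Register a consumer and return the queue it should read from."""
        q: queue.Queue = queue.Queue()
        self._subscribers[name] = q
        return q

    def publish(self, records: Iterable[bytes]) -> int:
        """Copy records to all subscribers; return how many source reads happened."""
        reads = 0
        for record in records:
            reads += 1
            for q in self._subscribers.values():
                q.put(record)
        return reads


fan_out = FanOut()
pricing = fan_out.subscribe("pricing")
eta = fan_out.subscribe("eta")
source_reads = fan_out.publish([b"ride-1", b"ride-2"])
print(source_reads, pricing.qsize(), eta.qsize())
```

The source is read twice in total regardless of the number of subscribers, which is the property that matters when per-shard read capacity is the bottleneck.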
Streaming engines
● Apache Flink vs. Apache Spark vs. Apache Beam
● 2 dimensions of comparison
○ APIs (the kinds of applications you can write)
○ Operations (the kinds of applications you can support)
● Apache Beam for multi-language support (Python and Go)
● Spark Streaming - operations were hard, no state evolution, cumulative latencies with multi-stage graphs.
● Know when to put all your eggs in the same basket (Spark), when not to.
Data Storage and processing
Interactive querying:
● Redshift
○ Historical but dying
● Druid
○ Interactive use-cases
● Presto (on S3)
○ Super handy interactive query engine
○ Lacking real-time ingestion support
● BigQuery
○ Interactive query engine (like Presto)
○ Expensive, but great streaming support!
ETL:
● Hive (on S3)
○ Mostly for ETL and ad hoc queries that are too large to run on Presto
● Spark
○ Some ETL, potential for all ETL to be in Spark
Future of Interactive querying
- Unified access layer, e.g. DAL, Genie, DALi Views
Future of ETL
- Easily schedule, with dependencies, a SQL query to be an ETL job
- Diagnose job failures with lineage and dashboards on data skew, etc.
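The "SQL query as an ETL job, scheduled with dependencies" idea can be sketched with the standard library's `graphlib` (Python 3.9+). The job names, SQL text, and the `JOBS` structure below are hypothetical and only illustrate dependency-ordered scheduling, not any tool Lyft uses.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Hypothetical jobs: output table -> (SQL text, upstream tables it depends on).
JOBS: Dict[str, Tuple[str, Set[str]]] = {
    "rides_clean": ("SELECT * FROM rides_raw WHERE fare IS NOT NULL", set()),
    "drivers_clean": ("SELECT * FROM drivers_raw", set()),
    "rides_per_driver": (
        "SELECT d.driver_id, COUNT(*) AS rides "
        "FROM rides_clean r JOIN drivers_clean d ON r.driver_id = d.driver_id "
        "GROUP BY d.driver_id",
        {"rides_clean", "drivers_clean"},
    ),
    "daily_report": ("SELECT * FROM rides_per_driver", {"rides_per_driver"}),
}


def schedule(jobs: Dict[str, Tuple[str, Set[str]]]) -> List[str]:
    """Return an execution order in which every job runs after its upstreams."""
    graph = {name: deps for name, (_, deps) in jobs.items()}
    # TopologicalSorter takes node -> predecessors and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


order = schedule(JOBS)
for name in order:
    sql, _ = JOBS[name]
    print(f"run {name}: {sql}")
```

The same dependency map doubles as coarse lineage: when a job fails, its downstream jobs are the entries whose dependency sets contain it.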
Workflow engines
● Airflow
○ Most ETL jobs
○ Python-heavy DAGs
○ Really good community to support
● Flyte
○ Focussed on ML workflows
○ Built-in provenance
○ Intermediate caching, discovery of previously computed artifacts
Conclusion
● We think about data as a platform and a competitive advantage.
● Our data and platform usage is growing really, really fast.
● We support Data Science, Ops, Analytics, Experimentation and other use cases.
● We have seen tremendous benefit from CDC data + Streaming frameworks to deliver business metric observability.
● We have learned and gained a lot in operational excellence by sharing our batch and stream compute frameworks.
● We are investing in Data Discovery, Streaming, and Machine Learning.
Attend Streaming at Lyft session tomorrow!
Attend Meetup at Level39 tonight!
Thank you
go.lyft.com/lyftdataplatform
May 2nd, 2019
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_
