The Lyft data platform: Now and in the future

Now and The Future
Lyft Data Platform
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_

Improve people’s lives with the world’s best transportation
● 30.7M riders in 2018
● 1.9M drivers in 2018
● 1B+ cumulative rides
● 300+ markets in US &
Canada

Data is at the core of decisions at Lyft
Automated decisions
- What’s the price for the ride?
- What driver to match?
- What’s the ETA?
Analyzing business performance
- How are key business metrics
trending?
- How do predicted ETAs compare
to actual?
Human business decisions
- Which opportunities to invest in?
- Which path to take (via
experimentation)?

Data platform users
4
Data Modelers Analysts Data
Scientists
General
Managers
Data Platform
Engineers ExperimentersPMs/Execs
Analytics Biz ops Building apps Experimentation

By numbers...
● Millions of BI queries per
week doubling quarterly
● 5X increase in productivity
of ML models in 2018
● 20X scaling of support of
maps to users through
streaming platform in 2018

Product Teams,
Applied ML, Forecasting
ML Platform
Data Platform and Infra
Source: The AI Hierarchy of Needs, Monica Rogati (8/2017)
Data as a platform to accelerate the business and reduce risk...

● Think ahead in the future (e.g. streaming, machine learning,
security and privacy, visualization, discovery, etc.).
● Provide a step change (vs incremental) in the capability.
● Move fast.
● Create a competitive advantage.
● Focus on impact: Develop jointly with application verticals.
● Build enterprise grade platform.
● Have a clearly deﬁned contract with applications (e.g. SLAs).
● Give a serverless application for the product teams.
Guiding principles for the data platform team...
Innovative
Impactful
Reliable

Unmet need for business metric observability
Business metric observability
What’s the health of the business?
Grafana
Operational observability
What’s the health of the service?
● Is the service up?
● Is it throwing errors?
● In near real-time (< 1 min)

Requirements for biz metric observability
See results within
1 - 30 minutes
Be the source of truthNear real-time
Impact on
business metrics
Derive business metrics
from raw data (aka ETL)
Don’t widen the gap for
reconciliation

11
Project F2 architecture
Data Discovery
app - Amundsen
Operational Data
stores (e.g.
Dynamo)
Apache Superset
CDC
Online flow Offline
flow

Magic of CDC - Change Data Capture
Operational Data
stores (e.g. Dynamo)
Analytical Data stores
(e.g. Hive/Presto, BQ)
1. Tail the operational
Data stores
2. Persist the
raw change log
3. Upsert the
change log to
table periodically
(~30 m)

Advantages of CDC
Data Engineer
Productivity
See results within
30 minutes
Near real-time
Source of truth
No need to reconcile
Same data as operational
DBs
No need to recreate ETL
from events
Easier primitives to build
ETL on top of

● Measuring reliability
○ How to distinguish late arriving data from missing data?
○ How do you trace a single missing revision through all moving parts?
● Lots of moving parts
○ Tailer, tied to implementation of operational DB
○ Ingest pipeline
○ Kafka, Kinesis
○ Analytic Database
Challenges of the architecture

CDC + Streaming =
Lots of business
value

Data Science use cases - Driver app

Data Science use cases - Pricing

Requirements for streaming applications
In Streaming, just like in Batch
Quick and simple ways of cleaning data
Prototype in a language of
choice (Python, R, SQL)
Quick and simple ways of cleaning data

20
Services (e.g.
ETA, Pricing)
Models +
Applications (e.g.
ETA, Pricing)
Flyte
Streaming architecture

Investments in Streaming
Dryft
Fully managed data processing
engine, powering real-time
features and events
- Needed for consistent feature
generation
- Batch processing for bulk
creation of features for training
- Stream processing for
real-time creation of features for
scoring
- Uses Flink SQL under the
hood
Apache Beam
Open source uniﬁed, portable
and extensible model for both
batch and streaming use-cases
- Enables streaming use cases
for teams using non JVM
languages
- Uses Flink under the hood

● Things we find at scale
○ Intermittent AWS service errors
○ Can’t be naive about pub-sub consumption
● Integration
○ Things work in isolation, but …
○ Flink Kinesis Connector
■ Connector that work at scale are hard
Challenges of the architecture

Sharing your batch
and streaming
compute will pay
huge beneﬁts

25
Data Platform architecture
Data Discovery
app - Amundsen
Services (e.g.
ETA, Pricing)
Operational Data
stores (e.g.
Dynamo)
Models +
Applications (e.g.
ETA, Pricing)
Apache Superset
BI/Data Viz
Marketplace
Operations app
...
Other custom
apps
Custom apps
Flyte

Kafka is better but ….
• Has limitations around fan-in
Kafka vs. Kinesis
Kinesis scaling limitations
• We require high throughput & high fan-out
• Default limit of 500 shards
• Resharding is expensive and slow
• Built a fan-out system to work around
limitations

● Apache Flink vs. Apache Spark vs. Apache Beam
● 2 dimensions of comparison
○ APIs (the kinds of applications you can write)
○ Operations (the kind of applications you can support)
● Apache Beam for multi-language support (Python and Go)
● Spark Streaming - operations were hard, no state evolution, cumulative
latencies with multi-stage graphs.
● Know when to put all your eggs in the same basket (Spark), when not to.
Streaming engines

Interactive querying:
● Redshift
○ Historical but dying
● Druid
○ Interactive use-cases
● Presto (on S3)
○ Super handy interactive query engine
○ Lacking real-time ingestion support
● BigQuery
○ Interactive query engine (like Presto)
○ Expensive, but great streaming support!
ETL:
● Hive (on S3)
○ Mostly for ETL and adhoc queries that are too large to run on Presto
● Spark
○ Some ETL, potential for all ETL to be in Spark
Data Storage and processing
Future of Interactive querying
Unified access layer
e.g. DAL, Genie, DALi Views
Future of ETL
- Easily schedule with dependencies, a
SQL query to be an ETL job
- Diagnose job failures with lineage and
dashboards on data skew, etc.

● Airflow
○ Most ETL jobs
○ Python heavy DAGs
○ Really good community to support
● Flyte
○ Focussed on ML workflows
○ Built in Provenance
○ Intermediate caching, discovery of previously computed artifacts
Workﬂow engines

● We think about data as a platform and a competitive advantage.
● Our data and platform usage is growing really really fast.
● We support Data Science, Ops, Analytics, Experimentation and other
use cases.
● We have seen tremendous benefit from CDC data + Streaming
frameworks to deliver business metric observability.
● We have learned and gained a lot in operational excellence by
sharing our batch and stream compute frameworks.
● We are investing in Data Discovery, Streaming, and Machine
Learning.
Conclusion

Attend Streaming at Lyft session tomorrow!

Attend Meetup at Level39 tonight!

Thank you
go.lyft.com/lyftdataplatformMay 2nd, 2019
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_

The Lyft data platform: Now and in the future

More Related Content

What's hot (20)

Similar to The Lyft data platform: Now and in the future (20)

More from markgrover (20)

Recently uploaded (20)

The Lyft data platform: Now and in the future