Near real-time anomaly detection at Lyft
Mark Grover | @mark_grover
Thomas Weise | @thweise
go.lyft.com/streaming-at-lyft
Agenda
● Data at Lyft
● 3 problems in streaming
● Conclusion
Data at Lyft
Lyft: Fastest ride-sharing company in the US
Data platform users
● Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
● Use cases: Analytics, Biz ops, Building apps, Experimentation
Data Platform architecture
[Architecture diagram: custom apps, services (e.g. ETA, pricing), operational data stores (e.g. Dynamo), models + applications (e.g. ETA, pricing), Flyte]
How can streaming help build better applications?
1. Engineer
Responsibility: build great products (alerting, business metrics)
Requirements: anomaly detection on business metrics
Anomaly Detection use cases
Security ops, payment fraud, customer service, accident detection
2. Data Scientist
Responsibility: extract knowledge and insights from data (to build better products)
Requirements:
● Prototype in a language of choice (Python, R, SQL)
● Quick and simple ways of “cleaning” data
Data Science use cases - Driver app
Data Science use cases - Pricing
Historical architecture
[Diagram: Model 1, Model 2, and Model 3 each poll a shared state store at t=60s; their results arrive at t=63s, t=65s, and t=66s]
New architecture - Flink
[Diagram: events flow through the state store to Model 1, Model 2, and Model 3 as they arrive at t=60s, t=63s, and t=68s; results follow at t=63s, t=68s, and t=74s]
Today’s focus on 3 streaming use cases
1. Anomaly Detection
2. Making Data Prep Easy
3. Support non-JVM Languages
1. Anomaly detection
What is the problem?
Security ops, payment fraud, customer service, accident detection, business metrics alerting
Anomaly detection architecture
[Architecture diagram: services (e.g. ETA, pricing) and operational data stores (e.g. Dynamo) feed the anomaly detection pipeline]
Impact
Business metric alerting, financial line items alerting
Challenges
● Barrier to entry is pretty high
○ It takes a long time to ingest data and tune alerts
2. Making Data Prep Easy
What is the problem?
● Data preparation - everyone needs it; examples:
○ Write raw data from stream to S3 for batch consumers
○ Filter, aggregate, … the usual ETL stuff
● Enable teams to focus on business problems, not on “getting data in”
● Data ingress is still surprisingly difficult
○ Really?
○ Give our users a service that shields them from infrastructure complexity
Dryft
A fully managed data processing engine powering real-time features and events
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine for both streaming and bulk data
○ Add automation to make it simple to launch and maintain feature generation programs at scale
https://www.slideshare.net/SeattleApacheFlinkMeetup/streaminglyft-greg-fee-seattle-apache-flink-meetup-104398613/#11
Dryft Program
Configuration file:
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
decl_ride_completed.sql:
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB (see the lookup sketch below)
[Diagram: S3 Source → SQL → Sink; Kinesis/Kafka Source → SQL → Sink]
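At scoring time, a service can read the latest feature value straight out of the DynamoDB sink. A minimal boto3 sketch, assuming a hypothetical user_features table keyed by user_id (the deck does not show Dryft's actual table or key names):

import boto3

# Hypothetical table/key names; Dryft's real schema is not shown in the deck.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
features = dynamodb.Table("user_features")

def fetch_n_total_rides(user_id):
    """Look up the latest n_total_rides feature for a user at model-scoring time."""
    resp = features.get_item(Key={"user_id": user_id})
    return int(resp.get("Item", {}).get("n_total_rides", 0))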
Bootstrapping
● Read historic data from S3
● Transition to reading real-time data (sketch below)
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
[Diagram: the S3 Source supplies events with timestamps < Target Time, the Kinesis/Kafka Source supplies events >= Target Time; both feed the Business Logic, which writes to the Sink]
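A minimal Beam Python sketch of the same cutover idea, using stub Create sources in place of the real S3 and Kinesis/Kafka readers and a hypothetical ts field for event time (the production job does this inside Flink, per the link above):

import apache_beam as beam

TARGET_TIME = 1_546_300_800  # hypothetical cutover timestamp (epoch seconds)

with beam.Pipeline() as p:
    # Stub sources standing in for the S3 (historic) and Kinesis/Kafka (live) readers.
    historic = p | "S3" >> beam.Create([{"ts": 1_546_300_000, "ride_id": "a"}])
    live = p | "Kinesis" >> beam.Create([{"ts": 1_546_300_900, "ride_id": "b"}])

    old = historic | "BeforeTarget" >> beam.Filter(lambda e: e["ts"] < TARGET_TIME)
    new = live | "FromTarget" >> beam.Filter(lambda e: e["ts"] >= TARGET_TIME)

    # Union both inputs and hand them to the same business logic / sink.
    merged = (old, new) | beam.Flatten()
    merged | "BusinessLogic" >> beam.Map(print)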
When to Dryft
• Feature generation was the original driver
• Declarative Streaming ETL
‒ Stream to Table / Stream
• SQL - Simplicity <> Power tradeoff
‒ Flink SQL supports UDFs (written in Java)
‒ A UDF could also do a service call, but..
When we need Programming
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/programming-model.html
Flink Streaming Options
• SQL - Dryft
• Java DataStream API - the usual starting point
‒ Sources, Sinks, Windowing, Implicit State Management
‒ Fluent style, high abstraction level
• ProcessFunction for advanced logic
‒ User code controlled state and timers
• Nice fit when Java is already established
‒ A forced language switch is a hard sell; time to value is longer and less predictable
‒ Initial Flink Deployments at Lyft
‒ But we do a lot of stuff in Python..
3. Support non-JVM languages
What is the problem?
● Flink APIs primarily target Java developers
○ Most of our teams that want to solve streaming use cases don’t work with Java
● Enable streaming native to the language ecosystem
○ Python is the primary option for ML
○ (Use cases not addressed by Dryft/Flink SQL)
Streaming Options for Python
• Jython != Python
‒ Flink Python API and a few more
• Jep (Java Embedded Python)
• KCL workers, Kafka consumers as standalone services
• Spark PySpark
‒ Not so much streaming, different semantics
‒ Different deployment story
• Faust
‒ Kafka Streams inspired
‒ No out of the box deployment story
Apache Beam
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python, and other language SDKs handle pipeline construction against the Beam model; Fn runners execute the pipelines on Apache Flink, Apache Spark, or Cloud Dataflow]
https://s.apache.org/apache-beam-project-overview
Beam Python Example
from apache_beam import CombinePerKey, Map, WindowInto
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows

def pipeline(root):
    input = root | ReadFromText("/path/to/text*") | Map(lambda line: ...)
    scores = (input
              | WindowInto(FixedWindows(120),
                           trigger=AfterWatermark(
                               early=AfterProcessingTime(60),
                               late=AfterCount(1)),
                           accumulation_mode=AccumulationMode.ACCUMULATING)
              | CombinePerKey(sum))
    scores | WriteToText("/path/to/outputs")

MyRunner().run(pipeline)
( What, Where, When, How )
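The snippet above covers windowing and triggers; for logic closer to Flink's ProcessFunction mentioned earlier (user-code-controlled state and timers), the Beam Python SDK also exposes per-key state and event-time timers. A rough sketch, assuming keyed integer input; names like BufferPerKey are illustrative only:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferPerKey(beam.DoFn):
    # Per-key buffer plus an event-time timer, managed by the runner (e.g. Flink).
    BUFFER = BagStateSpec("buffer", VarIntCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        key, value = element          # stateful DoFns require (key, value) input
        buffer.add(value)
        flush.set(timestamp + 60)     # fire when the watermark passes ts + 60s

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())      # emit the buffered sum, then reset
        buffer.clear()

# usage: keyed_pcoll | beam.ParDo(BufferPerKey())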
Python on Flink via Beam
• The Beam model and Flink go well together
‒ Flink Runner most advanced OSS option for Beam Java SDK
• Python SDK already available on Dataflow
• Beam Language Portability allows Python (and Go) SDK to work
with JVM-based runners
‒ Flink Runner is first to support portability
• Flink Deployment Story
‒ Extend to run Python via Beam on Flink
Python on Flink via Beam
[Architecture diagram: a Job Service with Artifact Staging submits to the Flink Job Manager in the cluster; the SDK Harness runs the Python UDFs and talks to the runner through the Fn services (Provision, Control, Data, Artifact Retrieval, State, Logging); runner dependencies are optional]
python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  --experiments=beam_fn_api \
  --runner=PortableRunner \
  --sdk_location=container \
  --job_endpoint=localhost:8099 \
  --streaming
https://s.apache.org/streaming-python-beam-flink
3.5 But, how do we deploy all this?
Deployment
[Diagram: a streaming application (Dryft, Java, Beam, ...) reads from a source and writes to a sink, supported by a stream/schema registry, deployment tooling, metrics & dashboards, alerts, and logging; running on Amazon EC2, Amazon S3, Wavefront, Salt (Config / Orca), and Docker]
Future of Deployment
• Flink embraces containerization
‒ Reactive vs. Active Flink Container Mode
(resources supplied externally vs. actively requested)
• Kubernetes Operator
‒ Resource Elasticity
‒ Improved Resource Utilization
‒ Auto-Scaling Support
‒ Automate (stateful) upgrade
Learnings
• Integration
‒ Things work well in isolation, but..
‒ Flink Kinesis Consumer
‣ Connectors that work reliably at scale are hard (not easy)
• Things we find at scale
‒ Intermittent AWS service errors (Kinesis, S3)
‣ Retry vs. topology reset
‒ S3 hotspotting with Flink checkpointing for large jobs (FLINK-9061)
‒ Naive pubsub consumption can lead to massive state buffering
‣ Align watermarks across source partitions
Conclusion
● Data at Lyft
● 3 problems in streaming
○ Anomaly Detection - Anodot
○ Easy data prep - Dryft
○ Non-JVM language support - Apache Beam
We are hiring!
lyft.com/careers
https://goo.gl/RsyLkS
go.lyft.com/streaming-at-lyft
Images from the Noun Project
Mark Grover | @mark_grover
Thomas Weise | @thweise