Near real-time anomaly detection at Lyft
Mark Grover | @mark_grover
Thomas Weise | @thweise
go.lyft.com/streaming-at-lyft
Agenda
● Data at Lyft
● 3 problems in streaming
● Conclusion
Data at Lyft
Lyft: Fastest ride-sharing company in the US
Data platform users
● Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
● Use cases: Analytics, Biz ops, Building apps, Experimentation
Data Platform architecture
[Architecture diagram: custom apps, services (e.g. ETA, pricing), operational data stores (e.g. Dynamo), models + applications (e.g. ETA, pricing), Flyte]
How can streaming help build better applications?
1. Engineer
Responsibility: build great products (alerting, business metrics)
Requirements: anomaly detection on business metrics
Anomaly Detection use cases
Security ops, payment fraud, customer service, accident detection
2. Data Scientist
Responsibility: extract knowledge and insights from data (to build better products)
Requirements:
● Prototype in a language of choice (Python, R, SQL)
● Quick and simple ways of “cleaning” data
Data Science use cases - Driver app
Data Science use cases - Pricing
Historical architecture
[Diagram: Model 1, Model 2, and Model 3 each poll a shared state store at t=60s; their results arrive at t=63s, t=65s, and t=66s]
New architecture - Flink
[Diagram: events flow through the state store to Model 1, Model 2, and Model 3 as they arrive at t=60s, t=63s, and t=68s; results follow at t=63s, t=68s, and t=74s]
Today’s focus on 3 streaming use cases
1. Anomaly Detection
2. Making Data Prep Easy
3. Support non-JVM Languages
1. Anomaly detection
What is the problem?
Security ops, payment fraud, customer service, accident detection, business metrics alerting
Anomaly detection architecture
[Architecture diagram: services (e.g. ETA, pricing) and operational data stores (e.g. Dynamo) feed the anomaly detection pipeline]
Impact
Business metric alerting, financial line items alerting
Challenges
● Barrier to entry is pretty high
○ It takes a long time to ingest data and tune alerts
2. Making Data Prep Easy
What is the problem?
● Data preparation - everyone needs it; examples:
○ Write raw data from stream to S3 for batch consumers
○ Filter, aggregate, … the usual ETL stuff
● Enable teams to focus on business problems, not on “getting data in”
● Data ingress is still surprisingly difficult
○ Really?
○ Give our users a service that shields them from infrastructure complexity
Dryft
A fully managed data processing engine powering real-time features and events
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine for both streaming and bulk data
○ Add automation to make it simple to launch and maintain feature generation programs at scale
https://www.slideshare.net/SeattleApacheFlinkMeetup/streaminglyft-greg-fee-seattle-apache-flink-meetup-104398613/#11
Dryft Program
Configuration file:
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
decl_ride_completed.sql:
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB (see the lookup sketch below)
[Diagram: S3 Source → SQL → Sink; Kinesis/Kafka Source → SQL → Sink]
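At scoring time, a service can read the latest feature value straight out of the DynamoDB sink. A minimal boto3 sketch, assuming a hypothetical user_features table keyed by user_id (the deck does not show Dryft's actual table or key names):

import boto3

# Hypothetical table/key names; Dryft's real schema is not shown in the deck.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
features = dynamodb.Table("user_features")

def fetch_n_total_rides(user_id):
    """Look up the latest n_total_rides feature for a user at model-scoring time."""
    resp = features.get_item(Key={"user_id": user_id})
    return int(resp.get("Item", {}).get("n_total_rides", 0))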
Bootstrapping
● Read historic data from S3
● Transition to reading real-time data (sketch below)
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
[Diagram: the S3 Source supplies events with timestamps < Target Time, the Kinesis/Kafka Source supplies events >= Target Time; both feed the Business Logic, which writes to the Sink]
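A minimal Beam Python sketch of the same cutover idea, using stub Create sources in place of the real S3 and Kinesis/Kafka readers and a hypothetical ts field for event time (the production job does this inside Flink, per the link above):

import apache_beam as beam

TARGET_TIME = 1_546_300_800  # hypothetical cutover timestamp (epoch seconds)

with beam.Pipeline() as p:
    # Stub sources standing in for the S3 (historic) and Kinesis/Kafka (live) readers.
    historic = p | "S3" >> beam.Create([{"ts": 1_546_300_000, "ride_id": "a"}])
    live = p | "Kinesis" >> beam.Create([{"ts": 1_546_300_900, "ride_id": "b"}])

    old = historic | "BeforeTarget" >> beam.Filter(lambda e: e["ts"] < TARGET_TIME)
    new = live | "FromTarget" >> beam.Filter(lambda e: e["ts"] >= TARGET_TIME)

    # Union both inputs and hand them to the same business logic / sink.
    merged = (old, new) | beam.Flatten()
    merged | "BusinessLogic" >> beam.Map(print)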
When to Dryft
• Feature generation was the original driver
• Declarative Streaming ETL
‒ Stream to Table / Stream
• SQL - Simplicity <> Power tradeoff
‒ Flink SQL supports UDFs (written in Java)
‒ A UDF could also do a service call, but..
When we need Programming
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/programming-model.html
Flink Streaming Options
• SQL - Dryft
• Java DataStream API - the usual starting point
‒ Sources, Sinks, Windowing, Implicit State Management
‒ Fluent style, high abstraction level
• ProcessFunction for advanced logic
‒ User code controlled state and timers
• Nice fit when Java is already established
‒ A forced language switch is a hard sell; time to value is longer and less predictable
‒ Initial Flink Deployments at Lyft
‒ But we do a lot of stuff in Python..
3. Support non-JVM languages
What is the problem?
● Flink APIs primarily target Java developers
○ Most of our teams that want to solve streaming use cases don’t work with Java
● Enable streaming native to the language ecosystem
○ Python is the primary option for ML
○ (Use cases not addressed by Dryft/Flink SQL)
Streaming Options for Python
• Jython != Python
‒ Flink Python API and a few more
• Jep (Java Embedded Python)
• KCL workers, Kafka consumers as standalone services
• Spark PySpark
‒ Not so much streaming, different semantics
‒ Different deployment story
• Faust
‒ Kafka Streams inspired
‒ No out of the box deployment story
Apache Beam
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python, and other language SDKs handle pipeline construction against the Beam model; Fn runners execute the pipelines on Apache Flink, Apache Spark, or Cloud Dataflow]
https://s.apache.org/apache-beam-project-overview
Beam Python Example
from apache_beam import CombinePerKey, Map, WindowInto
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows

def pipeline(root):
    input = root | ReadFromText("/path/to/text*") | Map(lambda line: ...)
    scores = (input
              | WindowInto(FixedWindows(120),
                           trigger=AfterWatermark(
                               early=AfterProcessingTime(60),
                               late=AfterCount(1)),
                           accumulation_mode=AccumulationMode.ACCUMULATING)
              | CombinePerKey(sum))
    scores | WriteToText("/path/to/outputs")

MyRunner().run(pipeline)
( What, Where, When, How )
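The snippet above covers windowing and triggers; for logic closer to Flink's ProcessFunction mentioned earlier (user-code-controlled state and timers), the Beam Python SDK also exposes per-key state and event-time timers. A rough sketch, assuming keyed integer input; names like BufferPerKey are illustrative only:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferPerKey(beam.DoFn):
    # Per-key buffer plus an event-time timer, managed by the runner (e.g. Flink).
    BUFFER = BagStateSpec("buffer", VarIntCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        key, value = element          # stateful DoFns require (key, value) input
        buffer.add(value)
        flush.set(timestamp + 60)     # fire when the watermark passes ts + 60s

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())      # emit the buffered sum, then reset
        buffer.clear()

# usage: keyed_pcoll | beam.ParDo(BufferPerKey())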
Python on Flink via Beam
• The Beam model and Flink go well together
‒ Flink Runner most advanced OSS option for Beam Java SDK
• Python SDK already available on Dataflow
• Beam Language Portability allows Python (and Go) SDK to work
with JVM-based runners
‒ Flink Runner is first to support portability
• Flink Deployment Story
‒ Extend to run Python via Beam on Flink
Python on Flink via Beam
[Architecture diagram: a Job Service with Artifact Staging submits to the Flink Job Manager in the cluster; the SDK Harness runs the Python UDFs and talks to the runner through the Fn services (Provision, Control, Data, Artifact Retrieval, State, Logging); runner dependencies are optional]
python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  --experiments=beam_fn_api \
  --runner=PortableRunner \
  --sdk_location=container \
  --job_endpoint=localhost:8099 \
  --streaming
https://s.apache.org/streaming-python-beam-flink
3.5 But, how do we deploy all this?
Deployment
[Diagram: a streaming application (Dryft, Java, Beam, ...) reads from a source and writes to a sink, supported by a stream/schema registry, deployment tooling, metrics & dashboards, alerts, and logging; running on Amazon EC2, Amazon S3, Wavefront, Salt (Config / Orca), and Docker]
Future of Deployment
• Flink embraces containerization
‒ Reactive vs. Active Flink Container Mode
(resources supplied externally vs. actively requested)
• Kubernetes Operator
‒ Resource Elasticity
‒ Improved Resource Utilization
‒ Auto-Scaling Support
‒ Automate (stateful) upgrade
Learnings
• Integration
‒ Things work well in isolation, but..
‒ Flink Kinesis Consumer
‣ Connectors that work reliably at scale are hard (not easy)
• Things we find at scale
‒ Intermittent AWS service errors (Kinesis, S3)
‣ Retry vs. topology reset
‒ S3 hotspotting with Flink checkpointing for large jobs (FLINK-9061)
‒ Naive pubsub consumption can lead to massive state buffering
‣ Align watermarks across source partitions
Conclusion
● Data at Lyft
● 3 problems in streaming
○ Anomaly Detection - Anodot
○ Easy data prep - Dryft
○ Non-JVM language support - Apache Beam
We are hiring!
lyft.com/careers
https://goo.gl/RsyLkS
go.lyft.com/streaming-at-lyft
Images from the Noun Project
Mark Grover | @mark_grover
Thomas Weise | @thweise