Bootstrapping State in Flink
DataWorks Summit 2018
Gregory Fee
What did the message queue say to Flink?
Sad at Work… Sad At Work… DataWorks!
About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning
Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - SPaaS
○ Use Flink as the processing engine
○ Add automation to make it super simple to launch and maintain feature generation programs at scale
Flink Overview
● Top-level Apache project
● High-throughput, low-latency streaming engine
● Event-time processing
● State management
● Fault-tolerance in the event of machine failure
● Supports exactly-once semantics
● Used by Alibaba, Netflix, Uber
What is Bootstrapping?
Bootstrapping is not Backfilling
● Using historic data to calculate historic results
● Typical uses:
○ Correct for missing data caused by a pipeline malfunction
○ Generate output for new business logic
● So what is bootstrapping?
Stateful Stream Programs
counts = stream
    .flatMap((x) -> x.split("\\s+"))
    .map((x) -> new KV(x, 1))
    .keyBy((x) -> x.key)
    .window(Time.days(7), Time.hours(1))
    .sum((x) -> x.value);
Counts of the words that appear in the stream over the last 7 days, updated every hour
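The state behind this query — every word occurrence still inside the sliding window — can be sketched without Flink. The class below is illustrative only (`WindowedWordCount` and `countsAt` are not Flink API); it shows why correct output needs 7 days of accumulated events.

```java
import java.util.*;

// Illustrative only -- not Flink API. Models the state a 7-day sliding
// window over words must hold: every (timestamp, word) still in range.
class WindowedWordCount {
    private final long windowMillis;
    private final List<Map.Entry<Long, String>> events = new ArrayList<>();

    WindowedWordCount(long windowMillis) { this.windowMillis = windowMillis; }

    // Record one word occurrence at the given event time.
    void add(long timestamp, String word) {
        events.add(Map.entry(timestamp, word));
    }

    // Counts over (now - windowMillis, now]: correct only once the
    // program has seen a full window's worth of events.
    Map<String, Long> countsAt(long now) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<Long, String> e : events) {
            if (e.getKey() > now - windowMillis && e.getKey() <= now) {
                counts.merge(e.getValue(), 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

A real job would evict expired events from state rather than filter on read, but the point stands: until the program holds 7 days of history, `countsAt` is wrong.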
The Waiting is the Hardest Part
A program with a 7-day window needs to process for 7 days before it has enough data to answer the query correctly.
Day 1: Launch Program
Day 3: Anger
Day 6: Bargaining
Day 8: Relief
What about forever?
Table table = tableEnv.sql(
    "SELECT user_lyft_id, COUNT(ride_id) " +
    "FROM event_ride_completed " +
    "GROUP BY user_lyft_id");
Counts of the number of rides each user has ever taken
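The SQL above implies unbounded keyed state: one counter per user, kept forever. A minimal plain-Java sketch of that state (names illustrative, not Flink API):

```java
import java.util.*;

// Illustrative sketch of the keyed state behind the "rides ever taken"
// query: one ever-growing counter per user_lyft_id.
class RideCounts {
    private final Map<String, Long> countsByUser = new HashMap<>();

    // One event_ride_completed record arriving on the stream.
    void onRideCompleted(String userLyftId) {
        countsByUser.merge(userLyftId, 1L, Long::sum);
    }

    long countFor(String userLyftId) {
        return countsByUser.getOrDefault(userLyftId, 0L);
    }
}
```

With no window to bound the state, no amount of waiting bootstraps this program from the stream alone — it needs the full history, which is the motivating case for bootstrapping.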
Bootstrapping
Read the historic data store to “bootstrap” the program with 7 days’ worth of data. Now your program returns results on day 1.
[Timeline: days -7 through -1 are read from the historic store; the program starts at day 1, and results are validated over days 1 through 7.]
Provisioning
● We want bootstrapping to be super fast == set parallelism high
○ Processing a week of data should take less than a week
● We want real-time processing to be super cheap == set parallelism low
○ Need to host thousands of feature generation programs
Keep in Mind
● Generality is desirable
○ There are potentially simpler ways of bootstrapping based on your application logic
○ A general solution is needed to scale to thousands of programs
● Production readiness is desirable
○ Observability, scalability, stability, and all those good things are considerations
● What works for Lyft might not be right for you
Use Stream Retention
• Use the retention policy on your stream technology to retain data for as long as you need
‒ Kinesis maximum retention is 7 days
‒ Kafka has no maximum, but it stores all data on disk: it is not built for petabytes of storage, and spending disk money on infrequently accessed data is suboptimal
• If this is feasible for you then you should do it
consumerConfig.put(
    ConsumerConfigConstants.STREAM_INITIAL_POSITION,
    "TRIM_HORIZON");
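The `TRIM_HORIZON` snippet above is Kinesis-specific; for Kafka the analogous knobs are topic retention and the consumer's starting offset. A sketch of the relevant settings (the property names are real Kafka configuration keys; where they are applied is noted in comments):

```properties
# Topic-level config: retain data indefinitely (beware the disk cost noted above)
retention.ms=-1
retention.bytes=-1

# Consumer config: start reading from the earliest retained offset
auto.offset.reset=earliest
```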
Kafka “Infinite Retention”
● Alter Kafka to allow for tiered storage
○ Write partitions that age out to secondary storage
○ Push data to S3/Glacier
● Advantages
○ Effectively infinite storage at a reasonable price
○ Use existing Kafka connectors to get data
● Disadvantages
○ Very different performance characteristics of underlying storage
○ No easy way to use different Flink configuration between bootstrapping and steady state
○ Does not exist today
● Apache Pulsar and Pravega ecosystems might be a viable alternative
Source Magic
● Write a source that reads from the secondary store until you are within the retention period of your stream
● Transition to reading from stream
● Advantages
○ Works with any stream provider
● Disadvantages
○ Writing a correct source to bridge between two sources and avoid duplication is hard
○ No easy way to use different Flink configuration between bootstrapping and steady state
[Diagram: a Discovery component hands a Reader the S3 archive files and then the Kafka partitions that feed the Business Logic.]
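The bridge reduces to a single handoff rule: take archive records strictly before the stream's retention horizon, and stream records from the horizon on, so no event is read twice. A plain-Java sketch of that rule (all names illustrative; records modeled as {timestamp, value} pairs):

```java
import java.util.*;

// Illustrative handoff rule for a bridging source: archive records
// strictly before the retention horizon, stream records from it onward.
class BridgingReader {
    static List<long[]> read(List<long[]> archive, List<long[]> stream, long horizon) {
        List<long[]> out = new ArrayList<>();
        for (long[] r : archive) {
            if (r[0] < horizon) out.add(r);   // the stream replays >= horizon
        }
        for (long[] r : stream) {
            if (r[0] >= horizon) out.add(r);  // archive copies below horizon are skipped
        }
        return out;
    }
}
```

The hard part the slide alludes to is doing this handoff correctly inside a checkpointed, parallel Flink source, not the rule itself.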
Application Level Attempt #1
1. Run the bootstrap program
a. Read historic data using a normal source
b. Process the data with selected business logic
c. Wait for all processing to complete
d. Trigger a savepoint and cancel the program
2. Run the steady state program
a. Start the program from the savepoint
b. Read stream data using a normal source
● Advantages
○ No modifications to streams or sources
○ Allows different Flink configuration between bootstrapping and steady state
● Disadvantages
○ Let’s find out
How Hard Can It Be?
● How do we make sure there is no repeated data?
S3 Source → Business Logic → Sink
Kinesis Source → Business Logic → Sink
Iteration #2
● How do we trigger a savepoint when bootstrap is complete?
S3 Source → [< target time] → Business Logic → Sink
Kinesis Source → [>= target time] → Business Logic → Sink
Iteration #3
● After the S3 data is read, push a record that is at (target time + 1)
● Termination detector looks for the low watermark to reach (target time + 1)
S3 Source + Termination → [< target time] → Business Logic → Termination Detector (“Sink”)
Kinesis Source → [>= target time] → Business Logic → Sink
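The detector's core is watermark bookkeeping: bootstrap is complete when the low watermark — the minimum across every source partition — reaches (target time + 1). A sketch of that logic outside Flink (names illustrative):

```java
import java.util.*;

// Illustrative termination detector: fires once every partition's
// watermark has reached target time + 1, i.e. the low watermark passed it.
class TerminationDetector {
    private final long targetTimePlusOne;
    private final int partitionCount;
    private final Map<Integer, Long> watermarkByPartition = new HashMap<>();

    TerminationDetector(int partitionCount, long targetTime) {
        this.partitionCount = partitionCount;
        this.targetTimePlusOne = targetTime + 1;
    }

    // Called as each source partition advances its watermark. Returns true
    // once the minimum across ALL partitions reaches target time + 1 --
    // the signal to trigger a savepoint and cancel the bootstrap job.
    boolean onWatermark(int partition, long watermark) {
        watermarkByPartition.merge(partition, watermark, Math::max);
        if (watermarkByPartition.size() < partitionCount) return false;
        long low = Collections.min(watermarkByPartition.values());
        return low >= targetTimePlusOne;
    }
}
```

This is why every S3 partition must emit the (target time + 1) record: one silent partition pins the low watermark and the detector never fires.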
What Did I Learn?
● Automating Flink from within Flink is possible but fragile
○ E.g. if you have multiple partitions reading S3, you need to make sure all of them process a message that pushes the watermark to (target time + 1)
● Savepoint restore matches operators by uid, so make sure uids are applied to your business logic
○ No support for setting uid on operators generated via SQL
Application Level Attempt #2
1. Run a highly provisioned job
a. Read from historic data store
b. Read from live stream
c. Union the above
d. Process the data with selected business logic
e. After all S3 data is processed, trigger a savepoint and cancel program
2. Run a lightly provisioned job
a. Exact same ‘shape’ of program as above, but with less parallelism
b. Restore from savepoint
Success?
● Advantages
○ Less fragile, works with SQL
● Disadvantages
○ Uses many resources or requires external automation
○ Live data is buffered until historic data completes
S3 Source [< target time] + Kinesis Source [>= target time] → union → Business Logic → Sink
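The buffering disadvantage falls out of watermark semantics: a union's watermark is the minimum of its inputs, so while the S3 source lags, windows fed by live data cannot fire. A small sketch of that arithmetic (names illustrative):

```java
import java.util.*;

// Illustrative watermark arithmetic for the unioned bootstrap job.
class UnionBuffering {
    // Downstream of a union, the watermark is the minimum of the inputs.
    static long unionWatermark(long historicWm, long liveWm) {
        return Math.min(historicWm, liveWm);
    }

    // Windows whose end time is <= the union watermark may fire; live
    // records in later windows stay buffered in operator state.
    static List<Long> firableWindowEnds(List<Long> windowEnds, long historicWm, long liveWm) {
        long wm = unionWatermark(historicWm, liveWm);
        List<Long> out = new ArrayList<>();
        for (long end : windowEnds) {
            if (end <= wm) out.add(end);
        }
        return out;
    }
}
```

So until the S3 source catches up, the job's state holds both the historic aggregates and the as-yet-unprocessed live records — one reason the high-provisioned phase is resource-hungry.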
Is it live?
● Running in Production at Lyft now
● Actively adding more feature generation programs
How Could We Make This Better?
● Kafka Infinite Retention
○ Repartition still necessary to get optimal bootstrap performance
● Programs as Sources
○ Allow sources to be built in a high-level programming model, e.g. Beam’s Splittable DoFn
● Dynamic Repartitioning + Adaptive Resource Management
○ Allow Flink parallelism to change without canceling the program
○ Allow Flink checkpointing policy to change without canceling the program
● Meta-messages
○ Allow the passing of metadata within the data stream; watermarks are one type of metadata
What about Batch Mode?
● Batch Mode can be more efficient than Streaming Mode
○ Offline data has different properties than stream data
● Method #1
○ Use batch mode to process historic data, make a savepoint at the end
○ Start a streaming mode program from the savepoint, process stream data
● Method #2
○ Modify Flink to understand a transition watermark; the batch runtime automatically transitions to the streaming runtime; requires a unified source
What Did We Learn?
● Many stream programs are stateful
● Faster-than-real-time bootstrapping using Flink is possible
● There are many opportunities for improvement
Q&A

Editor's Notes

  • #9: “A technique of loading a program into a computer by means of a few initial instructions that enable the introduction of the rest of the program from an input device.”
  • #18: Even this system has issues that we’ll talk more about later