SlideShare a Scribd company logo
Connector Tips & Tricks
Eron Wright, Dell EMC
@eronwright
@ 2019 Dell EMC
Who am I?
● Tech Staff at Dell EMC
● Contributor to Pravega stream storage system
○ Dynamically-sharded streams
○ Event-time tracking
○ Transaction support
● Maintainer of Flink connector for Pravega
Overview
Topics:
● Connector Basics
● Table Connectors
● Event Time
● State & Fault Tolerance
Connector Basics
Developing a Connector
● Applications take an explicit dependency on a connector
○ Not generally built-in to the Flink environment
○ Treated as a normal application dependency
○ Consider shading and relocating your connector’s dependencies
● Possible connector repositories:
○ Apache Flink repository
○ Apache Bahir (for Flink) repository
○ Your own repository
Types of Flink Connectors
● Streaming Connectors
○ Provide sources and/or sinks
○ Sources may be bounded or unbounded
● Batch Connectors
○ Not discussed here
● Table Connectors
○ Provide tables which act as sources, sinks, or both
○ Unifies the batch and streaming programming model
○ Typically relies on a streaming and/or batch connector under the hood
○ A table’s update mode determines how a table is converted to/from a stream
■ Append Mode, Retract Mode, Upsert Mode
Key Challenges
● How to parallelize your data source/sink
○ Subdivide the source data amongst operator subtasks, e.g. by partition
○ Support parallelism changes
● How to provide fault tolerance
○ Provide exactly-once semantics
○ Support coarse- and fine-grained recovery for failed tasks
○ Support Flink checkpoints and savepoints
● How to support historical and real-time processing
○ Facilitate correct program output
○ Support event time semantics
● Security considerations
○ Safeguarding secrets
Connector Lifecycle
● Construction
○ Instantiated in the driver program (i.e. main method); must be serializable
○ Use the builder pattern to provide a DSL for your connector
○ Avoid making connections if possible
● State Initialization
○ Separate configuration from state
● Run
○ Supports both unbounded and bounded sources
● Cancel / Stop
○ Supports graceful termination (w/ savepoint)
○ May advance the event time clock to the end-of-time (MAX_WATERMARK)
Connector Lifecycle (con’t)
● Advanced: Initialize/Finalize on Job Master
○ Exclusively for OutputFormat (e.g.. file-based sinks)
○ Implement InitializeOnMaster, FinalizeOnMaster, and CleanupWhenUnsuccessful
○ Support for Steaming API added in Flink 1.9; see FLINK-1722
User-Defined Data Types
● Connectors are typically agnostic to the record data type
○ Expects application to supply type information w/ serializer
● For sources:
○ Accept a DeserializationSchema<T>
○ Implement ResultTypeQueryable<T>
● For sinks:
○ Accept a SerializationSchema<T>
● First-class support for Avro, Parquet, JSON
○ Geared towards Flink Table API
Connector Metrics
● Flink exposes a metric system for gathering and reporting metrics
○ Reporters: Flink UI, JMX, InfluxDB, Prometheus, ...
● Use the metric API in your connector to expose relevant metric data
○ Types: counters, gauges, histograms, meters
● Metrics are tracked on a per-subtask basis
● More information:
○ Flink Documentation / Debugging & Monitoring / Metrics
Connector Security
● Credentials are typically passed as ordinary program parameters
○ Beware lack of isolation between jobs in a given cluster
● Flink does have first-class support for Kerberos credentials
○ Based on keytabs (in support of long-running jobs)
○ Expects connector to use a named JAAS context
○ See: Kerberos Authentication Setup and Configuration
Table API
Summary
● The Table API is evolving rapidly
○ For new connectors, focus on supporting the Blink planner
● Table sources and sinks are generally built upon the DataStream API
● Two configuration styles - typed DSL and string-based properties
● Table formats are connector-independent
○ E.g. CSV, JSON, Avro
● A catalog encapsulates a collection of tables, views, and functions
○ Provides convenience and interactivity
● More information:
○ Docs: User-Defined Sources & Sinks
Event Time Support
Key Considerations
● Connectors play an critical role in program correctness
○ Connector internals influence the order-of-observation (in event time) and hence the practicality of
watermark generation
○ Connectors exhibit different behavior in historical vs real-time processing
● Event time skew leads to excess buffering and hence inefficiency
● There’s an inherent trade-off between latency and complexity
Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC
Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC
Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC
Global Watermark Tracking
● Flink 1.9 has a facility for tracking a global aggregate value across sub-tasks
○ Ideal for establishing a global minimum watermark
○ See StreamingRuntimeContext#getGlobalAggregateManager
● Most useful in highly dynamic sources
○ Compensates for impact of resharding, rebalancing on event time
○ Increases latency
● See Kinesis connector’s JobManagerWatermarkTracker
Source Idleness
● Downstream tasks depend on arrival of watermarks from all sub-tasks
○ Beware stalling the pipeline
● A sub-task may remove itself from consideration by idling
○ i.e. “release the hold on the event time clock”
● A source should be idled mainly for semantic reasons
○
Sink Watermark Propagation
● Consider the possibility of watermark propagation across jobs
○ Propagate upstream watermarks along with output records
○ Job 1 → (external system) → Job 2
● Sink function does have access to current watermark
○ But only when processing an input record 😞
● Solution: event-time timers
○ Chain a ProcessFunction and corresponding SinkFunction, or develop a custom operator
Practical Suggestions
● Provide an API to assign timestamps and to generate watermarks
○ Strive to isolate system internals, e.g. apply the watermark generator on a per-partition basis
○ Aggregate the watermarks into a per-subtask or global watermark
● Strive to minimize event time ‘skew’ across subtasks
○ Strategy: prioritize oldest data and pause ingestion of partitions that are too far ahead
○ See FLINK-10886 for improvements to Kinesis, Kafka connectors
● Remember: the goal is not a total ordering of elements (in event time)
State & Fault Tolerance
Working with State
● Sources are typically stateful, e.g.
○ partition assignment to sub-tasks
○ position tracking
● Use managed operator state to track redistributable units of work
○ List state - a list of redistributable elements (e.g. partitions w/ current position index)
○ Union list state - a variation where each sub-task gets the complete list of elements
● Various interfaces:
○ CheckpointedFunction - most powerful
○ ListCheckpointed - limited but convenient
○ CheckpointListener - to observe checkpoint completion (e.g. for 2PC)
Exactly-Once Semantics
● Definition: evolution of state is based on a single observation of a given element
● Writes to external systems are ideally idempotent
● For sinks, Flink provides a few building blocks:
○ TwoPhaseCommitSinkFunction - base class providing a transaction-like API (but not storage)
○ GenericWriteAheadSink - implements a WAL using the state backend (see: CassandraSink)
○ CheckpointCommitter - stores information about completed checkpoints
● Savepoints present various complications
○ User may opt to resume from any prior checkpoint, not just the most recent checkpoint
○ The connector may be reconfigured w/ new inputs and/or outputs
Advanced: Externally-Induced Sources
● Flink is still in control of initiating the overall checkpoint, with a twist!
● It allows a source to control the checkpoint barrier insertion point
○ E.g. based on incoming data or external coordination
● Hooks into the checkpoint coordinator on the master
○ Flink → Hook → External System → Sub-task
● See:
○ ExternallyInducedSource
○ WithMasterCheckpointHook
Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC
Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC
Thank You!
● Feedback welcome (e.g. via the FF app)
● See me at the Speaker’s Lounge

More Related Content

PDF
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
PDF
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
PDF
Flink Connector Development Tips & Tricks
PDF
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
PDF
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
PDF
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
PDF
Marton Balassi – Stateful Stream Processing
PDF
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Flink Connector Development Tips & Tricks
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Marton Balassi – Stateful Stream Processing
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...

What's hot (20)

PDF
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
PPTX
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
PDF
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
PDF
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
PDF
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
PDF
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
PDF
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
PPTX
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
PDF
Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API
PDF
Flink Forward San Francisco 2019: The Trade Desk's Year in Flink - Jonathan ...
PDF
Pulsar connector on flink 1.14
PDF
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
PDF
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
PDF
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
PDF
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
PDF
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
PDF
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
PDF
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
PDF
Streaming your Lyft Ride Prices - Flink Forward SF 2019
PPTX
Apache Flink Berlin Meetup May 2016
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API
Flink Forward San Francisco 2019: The Trade Desk's Year in Flink - Jonathan ...
Pulsar connector on flink 1.14
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Apache Flink Berlin Meetup May 2016
Ad

Similar to Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC (20)

PDF
Stream processing with Apache Flink (Timo Walther - Ververica)
PDF
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
PDF
Flux architecture and Redux - theory, context and practice
PDF
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
PDF
Timing is Everything: Understanding Event-Time Processing in Flink SQL
PDF
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
PDF
Apache flink
PDF
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
PDF
Airflow Intro-1.pdf
PDF
Introduction to Flink Streaming
PDF
Network Statistics for OpenFlow
PDF
SDN Programming with Go
PPTX
Airflow 101
PPTX
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
PDF
Serverless Event Streaming with Pulsar Functions
PDF
Log Event Stream Processing In Flink Way
PDF
Coordination in distributed systems
PDF
When Streaming Needs Batch With Konstantin Knauf | Current 2022
PDF
Dynamic log processing with fluentd and konfigurator
PPSX
Microsoft Hekaton
Stream processing with Apache Flink (Timo Walther - Ververica)
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Flux architecture and Redux - theory, context and practice
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
Timing is Everything: Understanding Event-Time Processing in Flink SQL
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
Apache flink
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
Airflow Intro-1.pdf
Introduction to Flink Streaming
Network Statistics for OpenFlow
SDN Programming with Go
Airflow 101
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Serverless Event Streaming with Pulsar Functions
Log Event Stream Processing In Flink Way
Coordination in distributed systems
When Streaming Needs Batch With Konstantin Knauf | Current 2022
Dynamic log processing with fluentd and konfigurator
Microsoft Hekaton
Ad

More from Flink Forward (20)

PDF
Building a fully managed stream processing platform on Flink at scale for Lin...
PPTX
Evening out the uneven: dealing with skew in Flink
PPTX
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
PDF
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
PDF
Introducing the Apache Flink Kubernetes Operator
PPTX
Autoscaling Flink with Reactive Mode
PDF
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
PPTX
One sink to rule them all: Introducing the new Async Sink
PPTX
Tuning Apache Kafka Connectors for Flink.pptx
PDF
Flink powered stream processing platform at Pinterest
PPTX
Apache Flink in the Cloud-Native Era
PPTX
Where is my bottleneck? Performance troubleshooting in Flink
PPTX
Using the New Apache Flink Kubernetes Operator in a Production Deployment
PPTX
The Current State of Table API in 2022
PDF
Flink SQL on Pulsar made easy
PPTX
Dynamic Rule-based Real-time Market Data Alerts
PPTX
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
PPTX
Processing Semantically-Ordered Streams in Financial Services
PDF
Tame the small files problem and optimize data layout for streaming ingestion...
PDF
Batch Processing at Scale with Flink & Iceberg
Building a fully managed stream processing platform on Flink at scale for Lin...
Evening out the uneven: dealing with skew in Flink
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing the Apache Flink Kubernetes Operator
Autoscaling Flink with Reactive Mode
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
One sink to rule them all: Introducing the new Async Sink
Tuning Apache Kafka Connectors for Flink.pptx
Flink powered stream processing platform at Pinterest
Apache Flink in the Cloud-Native Era
Where is my bottleneck? Performance troubleshooting in Flink
Using the New Apache Flink Kubernetes Operator in a Production Deployment
The Current State of Table API in 2022
Flink SQL on Pulsar made easy
Dynamic Rule-based Real-time Market Data Alerts
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Processing Semantically-Ordered Streams in Financial Services
Tame the small files problem and optimize data layout for streaming ingestion...
Batch Processing at Scale with Flink & Iceberg

Recently uploaded (20)

PPT
Teaching material agriculture food technology
PDF
Encapsulation theory and applications.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
A Presentation on Artificial Intelligence
PPTX
Machine Learning_overview_presentation.pptx
PDF
Approach and Philosophy of On baking technology
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Machine learning based COVID-19 study performance prediction
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
Teaching material agriculture food technology
Encapsulation theory and applications.pdf
20250228 LYD VKU AI Blended-Learning.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
Spectral efficient network and resource selection model in 5G networks
A Presentation on Artificial Intelligence
Machine Learning_overview_presentation.pptx
Approach and Philosophy of On baking technology
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Programs and apps: productivity, graphics, security and other tools
gpt5_lecture_notes_comprehensive_20250812015547.pdf
A comparative analysis of optical character recognition models for extracting...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Accuracy of neural networks in brain wave diagnosis of schizophrenia
MIND Revenue Release Quarter 2 2025 Press Release
Machine learning based COVID-19 study performance prediction
Reach Out and Touch Someone: Haptics and Empathic Computing

Tips and Tricks for Developing Streaming and table Connectors - Wron Wright, Dell EMC

  • 1. Connector Tips & Tricks Eron Wright, Dell EMC @eronwright @ 2019 Dell EMC
  • 2. Who am I? ● Tech Staff at Dell EMC ● Contributor to Pravega stream storage system ○ Dynamically-sharded streams ○ Event-time tracking ○ Transaction support ● Maintainer of Flink connector for Pravega
  • 3. Overview Topics: ● Connector Basics ● Table Connectors ● Event Time ● State & Fault Tolerance
  • 5. Developing a Connector ● Applications take an explicit dependency on a connector ○ Not generally built-in to the Flink environment ○ Treated as a normal application dependency ○ Consider shading and relocating your connector’s dependencies ● Possible connector repositories: ○ Apache Flink repository ○ Apache Bahir (for Flink) repository ○ Your own repository
  • 6. Types of Flink Connectors ● Streaming Connectors ○ Provide sources and/or sinks ○ Sources may be bounded or unbounded ● Batch Connectors ○ Not discussed here ● Table Connectors ○ Provide tables which act as sources, sinks, or both ○ Unifies the batch and streaming programming model ○ Typically relies on a streaming and/or batch connector under the hood ○ A table’s update mode determines how a table is converted to/from a stream ■ Append Mode, Retract Mode, Upsert Mode
  • 7. Key Challenges ● How to parallelize your data source/sink ○ Subdivide the source data amongst operator subtasks, e.g. by partition ○ Support parallelism changes ● How to provide fault tolerance ○ Provide exactly-once semantics ○ Support coarse- and fine-grained recovery for failed tasks ○ Support Flink checkpoints and savepoints ● How to support historical and real-time processing ○ Facilitate correct program output ○ Support event time semantics ● Security considerations ○ Safeguarding secrets
  • 8. Connector Lifecycle ● Construction ○ Instantiated in the driver program (i.e. main method); must be serializable ○ Use the builder pattern to provide a DSL for your connector ○ Avoid making connections if possible ● State Initialization ○ Separate configuration from state ● Run ○ Supports both unbounded and bounded sources ● Cancel / Stop ○ Supports graceful termination (w/ savepoint) ○ May advance the event time clock to the end-of-time (MAX_WATERMARK)
  • 9. Connector Lifecycle (con’t) ● Advanced: Initialize/Finalize on Job Master ○ Exclusively for OutputFormat (e.g.. file-based sinks) ○ Implement InitializeOnMaster, FinalizeOnMaster, and CleanupWhenUnsuccessful ○ Support for Steaming API added in Flink 1.9; see FLINK-1722
  • 10. User-Defined Data Types ● Connectors are typically agnostic to the record data type ○ Expects application to supply type information w/ serializer ● For sources: ○ Accept a DeserializationSchema<T> ○ Implement ResultTypeQueryable<T> ● For sinks: ○ Accept a SerializationSchema<T> ● First-class support for Avro, Parquet, JSON ○ Geared towards Flink Table API
  • 11. Connector Metrics ● Flink exposes a metric system for gathering and reporting metrics ○ Reporters: Flink UI, JMX, InfluxDB, Prometheus, ... ● Use the metric API in your connector to expose relevant metric data ○ Types: counters, gauges, histograms, meters ● Metrics are tracked on a per-subtask basis ● More information: ○ Flink Documentation / Debugging & Monitoring / Metrics
  • 12. Connector Security ● Credentials are typically passed as ordinary program parameters ○ Beware lack of isolation between jobs in a given cluster ● Flink does have first-class support for Kerberos credentials ○ Based on keytabs (in support of long-running jobs) ○ Expects connector to use a named JAAS context ○ See: Kerberos Authentication Setup and Configuration
  • 14. Summary ● The Table API is evolving rapidly ○ For new connectors, focus on supporting the Blink planner ● Table sources and sinks are generally built upon the DataStream API ● Two configuration styles - typed DSL and string-based properties ● Table formats are connector-independent ○ E.g. CSV, JSON, Avro ● A catalog encapsulates a collection of tables, views, and functions ○ Provides convenience and interactivity ● More information: ○ Docs: User-Defined Sources & Sinks
  • 16. Key Considerations ● Connectors play an critical role in program correctness ○ Connector internals influence the order-of-observation (in event time) and hence the practicality of watermark generation ○ Connectors exhibit different behavior in historical vs real-time processing ● Event time skew leads to excess buffering and hence inefficiency ● There’s an inherent trade-off between latency and complexity
  • 20. Global Watermark Tracking ● Flink 1.9 has a facility for tracking a global aggregate value across sub-tasks ○ Ideal for establishing a global minimum watermark ○ See StreamingRuntimeContext#getGlobalAggregateManager ● Most useful in highly dynamic sources ○ Compensates for impact of resharding, rebalancing on event time ○ Increases latency ● See Kinesis connector’s JobManagerWatermarkTracker
  • 21. Source Idleness ● Downstream tasks depend on arrival of watermarks from all sub-tasks ○ Beware stalling the pipeline ● A sub-task may remove itself from consideration by idling ○ i.e. “release the hold on the event time clock” ● A source should be idled mainly for semantic reasons ○
  • 22. Sink Watermark Propagation ● Consider the possibility of watermark propagation across jobs ○ Propagate upstream watermarks along with output records ○ Job 1 → (external system) → Job 2 ● Sink function does have access to current watermark ○ But only when processing an input record 😞 ● Solution: event-time timers ○ Chain a ProcessFunction and corresponding SinkFunction, or develop a custom operator
  • 23. Practical Suggestions ● Provide an API to assign timestamps and to generate watermarks ○ Strive to isolate system internals, e.g. apply the watermark generator on a per-partition basis ○ Aggregate the watermarks into a per-subtask or global watermark ● Strive to minimize event time ‘skew’ across subtasks ○ Strategy: prioritize oldest data and pause ingestion of partitions that are too far ahead ○ See FLINK-10886 for improvements to Kinesis, Kafka connectors ● Remember: the goal is not a total ordering of elements (in event time)
  • 24. State & Fault Tolerance
  • 25. Working with State ● Sources are typically stateful, e.g. ○ partition assignment to sub-tasks ○ position tracking ● Use managed operator state to track redistributable units of work ○ List state - a list of redistributable elements (e.g. partitions w/ current position index) ○ Union list state - a variation where each sub-task gets the complete list of elements ● Various interfaces: ○ CheckpointedFunction - most powerful ○ ListCheckpointed - limited but convenient ○ CheckpointListener - to observe checkpoint completion (e.g. for 2PC)
  • 26. Exactly-Once Semantics ● Definition: evolution of state is based on a single observation of a given element ● Writes to external systems are ideally idempotent ● For sinks, Flink provides a few building blocks: ○ TwoPhaseCommitSinkFunction - base class providing a transaction-like API (but not storage) ○ GenericWriteAheadSink - implements a WAL using the state backend (see: CassandraSink) ○ CheckpointCommitter - stores information about completed checkpoints ● Savepoints present various complications ○ User may opt to resume from any prior checkpoint, not just the most recent checkpoint ○ The connector may be reconfigured w/ new inputs and/or outputs
  • 27. Advanced: Externally-Induced Sources ● Flink is still in control of initiating the overall checkpoint, with a twist! ● It allows a source to control the checkpoint barrier insertion point ○ E.g. based on incoming data or external coordination ● Hooks into the checkpoint coordinator on the master ○ Flink → Hook → External System → Sub-task ● See: ○ ExternallyInducedSource ○ WithMasterCheckpointHook
  • 30. Thank You! ● Feedback welcome (e.g. via the FF app) ● See me at the Speaker’s Lounge