Connector Tips & Tricks
Eron Wright, Dell EMC
@eronwright
© 2019 Dell EMC
Who am I?
● Tech Staff at Dell EMC
● Contributor to Pravega stream storage system
○ Dynamically-sharded streams
○ Event-time tracking
○ Transaction support
● Maintainer of Flink connector for Pravega
Overview
Topics:
● Connector Basics
● Table Connectors
● Event Time
● State & Fault Tolerance
Connector Basics
Developing a Connector
● Applications take an explicit dependency on a connector
○ Not generally built into the Flink environment
○ Treated as a normal application dependency
○ Consider shading and relocating your connector’s dependencies
● Possible connector repositories:
○ Apache Flink repository
○ Apache Bahir (for Flink) repository
○ Your own repository
Types of Flink Connectors
● Streaming Connectors
○ Provide sources and/or sinks
○ Sources may be bounded or unbounded
● Batch Connectors
○ Not discussed here
● Table Connectors
○ Provide tables which act as sources, sinks, or both
○ Unify the batch and streaming programming models
○ Typically rely on a streaming and/or batch connector under the hood
○ A table’s update mode determines how a table is converted to/from a stream
■ Append Mode, Retract Mode, Upsert Mode
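The difference between the update modes can be sketched in plain Java. This is only an illustration with hypothetical names (UpdateModes, retractMessages, upsertMessages) and no Flink dependency: it shows how one update to a row might be encoded in retract mode versus upsert mode. Append mode would carry only inserts and so has no encoding for updates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these names are not part of the Flink API.
final class UpdateModes {

    /** Retract mode: an update is a retraction of the old row followed by an add of the new row. */
    static List<String> retractMessages(String key, Long oldCount, long newCount) {
        List<String> out = new ArrayList<>();
        if (oldCount != null) {
            out.add("retract(" + key + "," + oldCount + ")");
        }
        out.add("add(" + key + "," + newCount + ")");
        return out;
    }

    /** Upsert mode: an update is a single message, resolved against the table's unique key. */
    static List<String> upsertMessages(String key, long newCount) {
        List<String> out = new ArrayList<>();
        out.add("upsert(" + key + "," + newCount + ")");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(retractMessages("user-1", 1L, 2L));
        System.out.println(upsertMessages("user-1", 2L));
    }
}
```

Upsert mode needs a unique key on the table but sends fewer messages per update; retract mode works without a key.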
Key Challenges
● How to parallelize your data source/sink
○ Subdivide the source data amongst operator subtasks, e.g. by partition
○ Support parallelism changes
● How to provide fault tolerance
○ Provide exactly-once semantics
○ Support coarse- and fine-grained recovery for failed tasks
○ Support Flink checkpoints and savepoints
● How to support historical and real-time processing
○ Facilitate correct program output
○ Support event time semantics
● Security considerations
○ Safeguarding secrets
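As a rough sketch of the parallelization point above, the following plain-Java snippet subdivides numbered partitions among subtasks by modulo. The class and method names are hypothetical, and a real connector may use a different assignment scheme; the point is only that ownership is recomputed from the current parallelism, so a parallelism change yields a new, complete subdivision.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a Flink class.
final class PartitionAssigner {

    /** Returns the partitions owned by one subtask: partition p belongs to subtask (p mod parallelism). */
    static List<Integer> partitionsFor(int subtaskIndex, int parallelism, int partitionCount) {
        List<Integer> owned = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            if (p % parallelism == subtaskIndex) {
                owned.add(p);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        // The same five partitions, subdivided for parallelism 2 and then for parallelism 3.
        for (int parallelism = 2; parallelism <= 3; parallelism++) {
            for (int subtask = 0; subtask < parallelism; subtask++) {
                System.out.println(parallelism + "/" + subtask + " -> " + partitionsFor(subtask, parallelism, 5));
            }
        }
    }
}
```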
Connector Lifecycle
● Construction
○ Instantiated in the driver program (i.e. main method); must be serializable
○ Use the builder pattern to provide a DSL for your connector
○ Avoid making connections if possible
● State Initialization
○ Separate configuration from state
● Run
○ Supports both unbounded and bounded sources
● Cancel / Stop
○ Supports graceful termination (w/ savepoint)
○ May advance the event time clock to the end-of-time (MAX_WATERMARK)
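The construction guidance above might look roughly like this in plain Java. ExampleSource and its builder are hypothetical names with no Flink dependency. The sketch assumes the object is built in the driver program, shipped via Java serialization, and only connects once open() is called on the task; the round trip through serialization stands in for job submission.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical connector source used for illustration; it is not a Flink class.
final class ExampleSource implements Serializable {
    private static final long serialVersionUID = 1L;

    // Configuration: set once in the driver program and serialized with the object.
    private final String endpoint;
    private final String stream;

    // Runtime resource: transient, so it is never serialized and must be created in open().
    private transient Object connection;

    private ExampleSource(String endpoint, String stream) {
        this.endpoint = endpoint;
        this.stream = stream;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Intended to run on the task after deserialization; the string stands in for a real client. */
    void open() {
        connection = "connection to " + endpoint + "/" + stream;
    }

    boolean isConnected() {
        return connection != null;
    }

    /** Serializes and deserializes the source, as a stand-in for shipping it to a task. */
    static ExampleSource roundTrip(ExampleSource source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(source);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ExampleSource) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("source is not serializable", e);
        }
    }

    static final class Builder {
        private String endpoint;
        private String stream;

        Builder withEndpoint(String endpoint) { this.endpoint = endpoint; return this; }
        Builder forStream(String stream) { this.stream = stream; return this; }

        ExampleSource build() {
            if (endpoint == null || stream == null) {
                throw new IllegalStateException("endpoint and stream are required");
            }
            return new ExampleSource(endpoint, stream);
        }
    }

    public static void main(String[] args) {
        ExampleSource source = ExampleSource.builder()
                .withEndpoint("tcp://localhost:9090")
                .forStream("events")
                .build();                     // no connection is made here
        ExampleSource onTask = roundTrip(source);
        onTask.open();                        // connect only once running on the task
        System.out.println(onTask.isConnected());
    }
}
```

Keeping the connection transient is what separates configuration from runtime state here: only the two configuration fields survive serialization.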
Connector Lifecycle (cont’d)
● Advanced: Initialize/Finalize on Job Master
○ Exclusively for OutputFormat (e.g. file-based sinks)
○ Implement InitializeOnMaster, FinalizeOnMaster, and CleanupWhenUnsuccessful
○ Support for Streaming API added in Flink 1.9; see FLINK-1722
User-Defined Data Types
● Connectors are typically agnostic to the record data type
○ Expect the application to supply type information along with a serializer
● For sources:
○ Accept a DeserializationSchema<T>
○ Implement ResultTypeQueryable<T>
● For sinks:
○ Accept a SerializationSchema<T>
● First-class support for Avro, Parquet, JSON
○ Geared towards Flink Table API
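To illustrate the idea of a type-agnostic connector, here is a minimal plain-Java sketch. RecordDecoder is a hypothetical stand-in loosely modeled on the role of a deserialization schema; it is not Flink's actual interface, and GenericSource is likewise invented for this example. The application supplies the decoder, and the source only reports the produced type and delegates decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins for illustration; these are not Flink's actual interfaces.
final class SchemaExample {

    /** Turns raw bytes into records and reports the record type it produces. */
    interface RecordDecoder<T> {
        T decode(byte[] raw);
        Class<T> producedType();
    }

    /** Supplied by the application, not by the connector. */
    static final class Utf8Decoder implements RecordDecoder<String> {
        @Override public String decode(byte[] raw) { return new String(raw, StandardCharsets.UTF_8); }
        @Override public Class<String> producedType() { return String.class; }
    }

    /** A source that never inspects payloads itself; it delegates to the supplied decoder. */
    static final class GenericSource<T> {
        private final RecordDecoder<T> decoder;

        GenericSource(RecordDecoder<T> decoder) { this.decoder = decoder; }

        List<T> read(List<byte[]> rawRecords) {
            List<T> out = new ArrayList<>();
            for (byte[] raw : rawRecords) {
                out.add(decoder.decode(raw));
            }
            return out;
        }

        Class<T> producedType() { return decoder.producedType(); }
    }

    public static void main(String[] args) {
        GenericSource<String> source = new GenericSource<>(new Utf8Decoder());
        List<byte[]> raw = new ArrayList<>();
        raw.add("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(source.read(raw) + " as " + source.producedType().getSimpleName());
    }
}
```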
Connector Metrics
● Flink exposes a metric system for gathering and reporting metrics
○ Reporters: Flink UI, JMX, InfluxDB, Prometheus, ...
● Use the metric API in your connector to expose relevant metric data
○ Types: counters, gauges, histograms, meters
● Metrics are tracked on a per-subtask basis
● More information:
○ Flink Documentation / Debugging & Monitoring / Metrics
Connector Security
● Credentials are typically passed as ordinary program parameters
○ Beware lack of isolation between jobs in a given cluster
● Flink does have first-class support for Kerberos credentials
○ Based on keytabs (in support of long-running jobs)
○ Expects connector to use a named JAAS context
○ See: Kerberos Authentication Setup and Configuration
Table API
Summary
● The Table API is evolving rapidly
○ For new connectors, focus on supporting the Blink planner
● Table sources and sinks are generally built upon the DataStream API
● Two configuration styles - typed DSL and string-based properties
● Table formats are connector-independent
○ E.g. CSV, JSON, Avro
● A catalog encapsulates a collection of tables, views, and functions
○ Provides convenience and interactivity
● More information:
○ Docs: User-Defined Sources & Sinks
Event Time Support
Key Considerations
● Connectors play a critical role in program correctness
○ Connector internals influence the order-of-observation (in event time) and hence the practicality of watermark generation
○ Connectors exhibit different behavior in historical vs real-time processing
● Event time skew leads to excess buffering and hence inefficiency
● There’s an inherent trade-off between latency and complexity
Tips and tricks for developing streaming and table connectors  - Eron Wright, Dell EMC
Global Watermark Tracking
● Flink 1.9 has a facility for tracking a global aggregate value across sub-tasks
○ Ideal for establishing a global minimum watermark
○ See StreamingRuntimeContext#getGlobalAggregateManager
● Most useful in highly dynamic sources
○ Compensates for impact of resharding, rebalancing on event time
○ Increases latency
● See Kinesis connector’s JobManagerWatermarkTracker
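The notion of a global minimum watermark can be sketched without any Flink classes. The tracker below is hypothetical and does not use the aggregate manager named above; it assumes each subtask periodically reports its local watermark to one shared place, and that no global value exists until every subtask has reported at least once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker for illustration; it does not use Flink's aggregate manager.
final class GlobalWatermarkTracker {
    private final int parallelism;
    private final Map<Integer, Long> watermarkBySubtask = new HashMap<>();

    GlobalWatermarkTracker(int parallelism) {
        this.parallelism = parallelism;
    }

    /**
     * Records a subtask's local watermark and returns the global minimum,
     * or Long.MIN_VALUE while some subtask has not reported yet.
     */
    synchronized long update(int subtaskIndex, long localWatermark) {
        // A subtask's watermark never moves backwards, so keep the larger value.
        watermarkBySubtask.merge(subtaskIndex, localWatermark, Long::max);
        if (watermarkBySubtask.size() < parallelism) {
            return Long.MIN_VALUE;
        }
        long min = Long.MAX_VALUE;
        for (long watermark : watermarkBySubtask.values()) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    public static void main(String[] args) {
        GlobalWatermarkTracker tracker = new GlobalWatermarkTracker(2);
        System.out.println(tracker.update(0, 100L)); // subtask 1 has not reported yet
        System.out.println(tracker.update(1, 50L));  // global minimum is now defined
    }
}
```

Waiting for the slowest subtask is where the added latency comes from.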
Source Idleness
● Downstream tasks depend on arrival of watermarks from all sub-tasks
○ Beware stalling the pipeline
● A sub-task may remove itself from consideration by idling
○ i.e. “release the hold on the event time clock”
● A source should be idled mainly for semantic reasons
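A simplified model of how idleness releases the hold on the event time clock: the hypothetical WatermarkCombiner below (plain Java, not Flink's internal implementation) takes the minimum watermark over active inputs only, and never lets the combined value move backwards when an idle input resumes behind the others.

```java
import java.util.Arrays;

// Hypothetical model for illustration; Flink's internal handling is more involved.
final class WatermarkCombiner {
    private final long[] watermarks;
    private final boolean[] idle;
    private long lastEmitted = Long.MIN_VALUE;

    WatermarkCombiner(int inputs) {
        watermarks = new long[inputs];
        Arrays.fill(watermarks, Long.MIN_VALUE);
        idle = new boolean[inputs];
    }

    /** A watermark from an input also marks that input as active again. */
    void onWatermark(int input, long watermark) {
        idle[input] = false;
        watermarks[input] = Math.max(watermarks[input], watermark);
    }

    /** An idle input no longer holds back the combined watermark. */
    void onIdle(int input) {
        idle[input] = true;
    }

    /** Minimum over active inputs, kept monotonic; unchanged while every input is idle. */
    long combined() {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (!idle[i]) {
                anyActive = true;
                min = Math.min(min, watermarks[i]);
            }
        }
        if (anyActive) {
            lastEmitted = Math.max(lastEmitted, min);
        }
        return lastEmitted;
    }

    public static void main(String[] args) {
        WatermarkCombiner combiner = new WatermarkCombiner(2);
        combiner.onWatermark(0, 100L);
        System.out.println(combiner.combined()); // still held back by input 1
        combiner.onIdle(1);
        System.out.println(combiner.combined()); // input 1 released its hold
    }
}
```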
Sink Watermark Propagation
● Consider the possibility of watermark propagation across jobs
○ Propagate upstream watermarks along with output records
○ Job 1 → (external system) → Job 2
● Sink function does have access to current watermark
○ But only when processing an input record 😞
● Solution: event-time timers
○ Chain a ProcessFunction and corresponding SinkFunction, or develop a custom operator
Practical Suggestions
● Provide an API to assign timestamps and to generate watermarks
○ Strive to isolate system internals, e.g. apply the watermark generator on a per-partition basis
○ Aggregate the watermarks into a per-subtask or global watermark
● Strive to minimize event time ‘skew’ across subtasks
○ Strategy: prioritize oldest data and pause ingestion of partitions that are too far ahead
○ See FLINK-10886 for improvements to Kinesis, Kafka connectors
● Remember: the goal is not a total ordering of elements (in event time)
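The skew-limiting strategy above could be sketched as follows. This is a hypothetical, plain-Java illustration, not the mechanism from FLINK-10886: given per-partition watermarks, the per-subtask watermark is their minimum, and any partition running further ahead of that minimum than an assumed threshold (maxSkewMillis) is selected for pausing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; names and threshold semantics are assumptions.
final class SkewControl {

    /** The per-subtask watermark is the minimum over its partitions (Long.MIN_VALUE if none). */
    static long subtaskWatermark(Map<String, Long> watermarkByPartition) {
        if (watermarkByPartition.isEmpty()) {
            return Long.MIN_VALUE;
        }
        long min = Long.MAX_VALUE;
        for (long watermark : watermarkByPartition.values()) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    /** Returns, in sorted order, the partitions that are more than maxSkewMillis ahead of the slowest one. */
    static List<String> partitionsToPause(Map<String, Long> watermarkByPartition, long maxSkewMillis) {
        long min = subtaskWatermark(watermarkByPartition);
        List<String> toPause = new ArrayList<>();
        for (Map.Entry<String, Long> entry : new TreeMap<>(watermarkByPartition).entrySet()) {
            if (entry.getValue() - min > maxSkewMillis) {
                toPause.add(entry.getKey());
            }
        }
        return toPause;
    }

    public static void main(String[] args) {
        Map<String, Long> watermarks = new TreeMap<>();
        watermarks.put("p0", 1_000L);
        watermarks.put("p1", 1_500L);
        watermarks.put("p2", 9_000L);
        System.out.println(subtaskWatermark(watermarks));
        System.out.println(partitionsToPause(watermarks, 5_000L));
    }
}
```

Pausing the partitions that are ahead lets the oldest data catch up, which bounds downstream buffering without requiring a total order.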
State & Fault Tolerance
Working with State
● Sources are typically stateful, e.g.
○ partition assignment to sub-tasks
○ position tracking
● Use managed operator state to track redistributable units of work
○ List state - a list of redistributable elements (e.g. partitions w/ current position index)
○ Union list state - a variation where each sub-task gets the complete list of elements
● Various interfaces:
○ CheckpointedFunction - most powerful
○ ListCheckpointed - limited but convenient
○ CheckpointListener - to observe checkpoint completion (e.g. for 2PC)
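The difference between list state and union list state on restore can be modeled in plain Java. The class below is hypothetical and uses a round-robin split purely as an assumption; Flink's actual split policy may differ. What matters is the contrast: a split gives each element to exactly one subtask, while a union hands every subtask the full list and leaves the filtering to the connector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration; the round-robin split is an assumption.
final class OperatorStateRedistribution {

    /** List state: checkpointed elements are divided among the new subtasks. */
    static <T> List<List<T>> splitRedistribute(List<T> checkpointed, int newParallelism) {
        List<List<T>> perSubtask = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            perSubtask.add(new ArrayList<T>());
        }
        for (int i = 0; i < checkpointed.size(); i++) {
            perSubtask.get(i % newParallelism).add(checkpointed.get(i));
        }
        return perSubtask;
    }

    /** Union list state: every subtask receives the complete list and picks what it owns. */
    static <T> List<List<T>> unionRedistribute(List<T> checkpointed, int newParallelism) {
        List<List<T>> perSubtask = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            perSubtask.add(new ArrayList<T>(checkpointed));
        }
        return perSubtask;
    }

    public static void main(String[] args) {
        // Each element pairs a partition with its current position index.
        List<String> state = Arrays.asList("p0@10", "p1@20", "p2@30");
        System.out.println(splitRedistribute(state, 2));
        System.out.println(unionRedistribute(state, 2));
    }
}
```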
Exactly-Once Semantics
● Definition: evolution of state is based on a single observation of a given element
● Writes to external systems are ideally idempotent
● For sinks, Flink provides a few building blocks:
○ TwoPhaseCommitSinkFunction - base class providing a transaction-like API (but not storage)
○ GenericWriteAheadSink - implements a WAL using the state backend (see: CassandraSink)
○ CheckpointCommitter - stores information about completed checkpoints
● Savepoints present various complications
○ User may opt to resume from any prior checkpoint, not just the most recent checkpoint
○ The connector may be reconfigured w/ new inputs and/or outputs
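A rough, in-memory model of the two-phase commit idea, in plain Java. TwoPhaseCommitSketch is a hypothetical class and does not extend or reproduce TwoPhaseCommitSinkFunction. It assumes writes are buffered in an open transaction, pre-committed when a checkpoint is taken, and made visible only when that checkpoint is reported complete; commits are keyed by checkpoint id so that replaying one after recovery is idempotent.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model for illustration; not Flink's base class.
final class TwoPhaseCommitSketch {
    private List<String> currentTxn = new ArrayList<>();
    // Pre-committed transactions, keyed by the checkpoint that closed them.
    private final TreeMap<Long, List<String>> pending = new TreeMap<>();
    // Stand-in for the external system: committed transactions by id.
    private final TreeMap<Long, List<String>> external = new TreeMap<>();

    void write(String record) {
        currentTxn.add(record);
    }

    /** Phase 1: on checkpoint, close the open transaction and start a new one. */
    void snapshot(long checkpointId) {
        pending.put(checkpointId, currentTxn);
        currentTxn = new ArrayList<>();
    }

    /** Phase 2: once the checkpoint is complete, commit everything up to and including it. */
    void notifyCheckpointComplete(long checkpointId) {
        Iterator<Map.Entry<Long, List<String>>> it =
                pending.headMap(checkpointId, true).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, List<String>> txn = it.next();
            commit(txn.getKey(), txn.getValue());
            it.remove();
        }
    }

    /** Idempotent: committing the same transaction id again has no further effect. */
    void commit(long txnId, List<String> records) {
        external.putIfAbsent(txnId, records);
    }

    /** Records visible in the external system, in transaction order. */
    List<String> committedRecords() {
        List<String> out = new ArrayList<>();
        for (List<String> txn : external.values()) {
            out.addAll(txn);
        }
        return out;
    }

    public static void main(String[] args) {
        TwoPhaseCommitSketch sink = new TwoPhaseCommitSketch();
        sink.write("a");
        sink.write("b");
        sink.snapshot(1L);
        sink.write("c");                       // belongs to the next transaction
        sink.notifyCheckpointComplete(1L);
        System.out.println(sink.committedRecords());
    }
}
```

The idempotent commit matters on recovery: a restored job may need to re-commit a transaction whose completion notification was lost.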
Advanced: Externally-Induced Sources
● Flink is still in control of initiating the overall checkpoint, with a twist!
● It allows a source to control the checkpoint barrier insertion point
○ E.g. based on incoming data or external coordination
● Hooks into the checkpoint coordinator on the master
○ Flink → Hook → External System → Sub-task
● See:
○ ExternallyInducedSource
○ WithMasterCheckpointHook
Thank You!
● Feedback welcome (e.g. via the FF app)
● See me at the Speaker’s Lounge
