Mason Chen | Apple
Multi Cluster Kafka Source
THIS IS NOT A CONTRIBUTION
Agenda
Motivation

FLIP 27 Kafka Source

Source Design

Example
Flink Kafka Pipeline
Manual Migration Steps
• Bring up new cluster

• Swap producer

• Wait for consumer to drain

• Change source uid and cluster

• Upgrade with non-restored state allowed

• Increase parallelism to work through lag

• Revert to steady state

• When can we remove the inactive cluster?
User Manual Migration Steps
• Change source uid

• Change bootstrap server

• Upgrade application with non-restored state allowed (--allowNonRestoredState)

• Change parallelism and resources to catch up with lag

• Revert to steady state when caught up
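The "change parallelism to catch up with lag" step can be reasoned about with simple arithmetic. The sketch below is illustrative only: the class name, the method, and all rates are hypothetical, and it assumes throughput scales linearly with parallelism, which real jobs only approximate.

```java
public class CatchUpSketch {

    /**
     * Estimated seconds needed to drain a backlog while new records keep arriving.
     * Assumes (hypothetically) that each subtask sustains perSubtaskRate records/s.
     */
    static double catchUpSeconds(double lagRecords, double produceRate,
                                 double perSubtaskRate, int parallelism) {
        // Only the capacity above the incoming rate reduces the backlog.
        double headroom = perSubtaskRate * parallelism - produceRate;
        if (headroom <= 0) {
            return Double.POSITIVE_INFINITY; // the job never catches up
        }
        return lagRecords / headroom;
    }

    public static void main(String[] args) {
        double lag = 3_600_000;      // hypothetical backlog built up during downtime
        double produceRate = 1_000;  // hypothetical records/s still being produced
        double perSubtask = 500;     // hypothetical records/s per source subtask
        for (int parallelism : new int[] {2, 4, 8}) {
            System.out.println("parallelism " + parallelism + ": "
                    + catchUpSeconds(lag, produceRate, perSubtask, parallelism) + " s");
        }
    }
}
```

With these made-up numbers, steady-state parallelism has no headroom at all, which is why the job needs a temporary scale-up and a later revert.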
Manual Migration Steps
Drawbacks
• Application downtime

• Need to increase system resources for catch-up

• User manual toil

• User could have 100+ jobs

• Multiple hours of team coordination
Scaling Multiple Kafka Clusters
• Hybrid cloud: on-prem, private cloud and public cloud providers

• Scalability

• Topic sharding

• Operability and Failover

• In-place upgrade is complex and error-prone
FLIP 27 Source
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/sources/
FLIP 27 Kafka Source
Kafka Metadata Service
• KafkaStream

• Logical abstraction to physical clusters and topics

• describeStreams(Collection<String> streamIds);

• Pluggable implementation

• File based configmap
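A minimal sketch of how such a metadata service could look. Only the name KafkaStream and the method describeStreams(Collection<String> streamIds) come from the slide; the return type, the ClusterTopics holder, the field names, and the in-memory implementation (standing in for a file-based one) are assumptions made for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KafkaMetadataServiceSketch {

    /** Hypothetical holder: one physical cluster and the topics a stream uses on it. */
    static final class ClusterTopics {
        final String bootstrapServers;
        final Set<String> topics;

        ClusterTopics(String bootstrapServers, Set<String> topics) {
            this.bootstrapServers = bootstrapServers;
            this.topics = Set.copyOf(topics);
        }
    }

    /** Logical stream id mapped to physical clusters, keyed by an assumed cluster id. */
    static final class KafkaStream {
        final String streamId;
        final Map<String, ClusterTopics> clusters;

        KafkaStream(String streamId, Map<String, ClusterTopics> clusters) {
            this.streamId = streamId;
            this.clusters = Map.copyOf(clusters);
        }
    }

    /** Pluggable lookup; the return type is an assumption, not taken from the slide. */
    interface KafkaMetadataService {
        Map<String, KafkaStream> describeStreams(Collection<String> streamIds);
    }

    /** In-memory stand-in for a file-based (configmap) implementation. */
    static final class InMemoryMetadataService implements KafkaMetadataService {
        private final Map<String, KafkaStream> streams;

        InMemoryMetadataService(Map<String, KafkaStream> streams) {
            this.streams = Map.copyOf(streams);
        }

        @Override
        public Map<String, KafkaStream> describeStreams(Collection<String> streamIds) {
            Map<String, KafkaStream> result = new LinkedHashMap<>();
            for (String id : streamIds) {
                KafkaStream stream = streams.get(id);
                if (stream != null) { // unknown ids are simply omitted in this sketch
                    result.put(id, stream);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        KafkaStream orders = new KafkaStream("orders", Map.of(
                "cluster-a", new ClusterTopics("broker-a:9092", Set.of("orders-v1"))));
        KafkaMetadataService service = new InMemoryMetadataService(Map.of("orders", orders));
        // prints [orders]
        System.out.println(service.describeStreams(List.of("orders", "missing")).keySet());
    }
}
```

Because the job only names a logical stream id, the physical clusters behind it can change without touching job code.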
Multi Cluster Kafka Source
Runtime
Extension of FLIP 27 Major Components
• Kafka Source components

• Polling, commit, checkpoint, split assignment

• Source Event RPC

• Enumerator Context Proxy

• Split assignment and wrapping cluster info

• Context thread pools
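One way to picture "split assignment and wrapping cluster info" is a split that carries a cluster id alongside the usual topic-partition data. The sketch below uses plain Java rather than Flink's interfaces; every class, field, and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterSplitSketch {

    /** Stand-in for a single-cluster Kafka partition split (hypothetical shape). */
    static final class PartitionSplit {
        final String topic;
        final int partition;
        final long startingOffset;

        PartitionSplit(String topic, int partition, long startingOffset) {
            this.topic = topic;
            this.partition = partition;
            this.startingOffset = startingOffset;
        }
    }

    /** Wrapper that adds the cluster id so a reader can route the split. */
    static final class ClusterSplit {
        final String clusterId;
        final PartitionSplit split;

        ClusterSplit(String clusterId, PartitionSplit split) {
            this.clusterId = clusterId;
            this.split = split;
        }

        /** The cluster id keeps ids unique when two clusters host the same topic-partition. */
        String splitId() {
            return clusterId + "/" + split.topic + "-" + split.partition;
        }
    }

    /** Groups wrapped splits by cluster, as a reader might before delegating to per-cluster sub-readers. */
    static Map<String, List<PartitionSplit>> groupByCluster(List<ClusterSplit> splits) {
        Map<String, List<PartitionSplit>> byCluster = new TreeMap<>();
        for (ClusterSplit s : splits) {
            byCluster.computeIfAbsent(s.clusterId, id -> new ArrayList<>()).add(s.split);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        List<ClusterSplit> assigned = List.of(
                new ClusterSplit("cluster-a", new PartitionSplit("orders", 0, 42L)),
                new ClusterSplit("cluster-b", new PartitionSplit("orders", 0, 7L)),
                new ClusterSplit("cluster-a", new PartitionSplit("orders", 1, 0L)));
        for (ClusterSplit s : assigned) {
            System.out.println(s.splitId());
        }
        System.out.println(groupByCluster(assigned).keySet());
    }
}
```

Without the cluster id, "orders-0" on the old and new cluster would be indistinguishable during a migration.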
Migration with Multi Cluster Kafka Source
• Initial metadata

• Bring up new cluster

• Add new cluster metadata

• Reconcile metadata

• Remove old cluster

• Reconcile metadata

• Remove old cluster
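The "reconcile metadata" step can be thought of as a set difference between the clusters the source is currently reading and the clusters the metadata service now reports. This is a sketch of that idea only, with hypothetical names; it does not reproduce the source's actual reconciliation logic.

```java
import java.util.Set;
import java.util.TreeSet;

public class MetadataReconcileSketch {

    /** Result of comparing active clusters with the latest metadata. */
    static final class Plan {
        final Set<String> toAdd;
        final Set<String> toRemove;
        final Set<String> unchanged;

        Plan(Set<String> toAdd, Set<String> toRemove, Set<String> unchanged) {
            this.toAdd = toAdd;
            this.toRemove = toRemove;
            this.unchanged = unchanged;
        }

        @Override
        public String toString() {
            return "add=" + toAdd + " remove=" + toRemove + " keep=" + unchanged;
        }
    }

    /** Computes which cluster readers to start, stop, and leave running. */
    static Plan reconcile(Set<String> active, Set<String> desired) {
        Set<String> toAdd = new TreeSet<>(desired);
        toAdd.removeAll(active);        // in metadata but not yet read
        Set<String> toRemove = new TreeSet<>(active);
        toRemove.removeAll(desired);    // still read but gone from metadata
        Set<String> unchanged = new TreeSet<>(active);
        unchanged.retainAll(desired);   // present on both sides
        return new Plan(toAdd, toRemove, unchanged);
    }

    public static void main(String[] args) {
        // The new cluster is added to the metadata: start reading it, keep the old one.
        System.out.println(reconcile(
                Set.of("old-cluster"), Set.of("old-cluster", "new-cluster")));
        // The old cluster is removed from the metadata: stop reading it.
        System.out.println(reconcile(
                Set.of("old-cluster", "new-cluster"), Set.of("new-cluster")));
    }
}
```

Each metadata change produces a plan that touches only the clusters that differ, so the job keeps running while the cluster set changes underneath it.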
User Cluster Migration Steps
Multi Cluster Kafka Source Benefits
• Migrations and failover automated transparently within source

• Simplify operations between compute and storage infra

• Hybrid Source compatible

• Can be leveraged for topic migration
Future Work
• Integrate with split level watermark alignment

• Optimizations to remove only affected readers

• FLIP-246 (https://cwiki.apache.org/confluence/display/FLINK/FLIP-246%3A+Multi+Cluster+Kafka+Source)
Q&A

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Flink Job Downtime