Kafka Connect
Oleg Kuznetsov
Big Data Engineer
Intro
What is Kafka?
Kafka Connect
〉Focuses on moving data into and out of Kafka topics
〉Kafka Connect is a standalone application, not a library
〉Runs in distributed mode
Source Connector
Topic => Topic
External storage => Topic
                     Storage                      Kafka
Entity               “Virtual” topic              Topic
Partition            Logical partition            Physical partition
                     - file name                  - file on disk
                     - table name
Offset in partition  Logical offset               Record number within partition
                     - line number in file        - offset
                     - ID value in table
External storage ≈ Kafka topic
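The mapping above can be sketched in plain Java. Kafka Connect describes a source position with two maps (a source partition and a source offset); the class name and the key names "filename" and "line" below are hypothetical choices for a file-based source, not names taken from the talk.

```java
import java.util.Map;

final class LogicalPosition {
    // Hypothetical key names; a connector is free to choose its own.
    static final String FILE_KEY = "filename";
    static final String LINE_KEY = "line";

    /** Logical partition: identifies one file in the external storage. */
    static Map<String, String> partitionFor(String fileName) {
        return Map.of(FILE_KEY, fileName);
    }

    /** Logical offset: number of the last line already read from that file. */
    static Map<String, Long> offsetFor(long lineNumber) {
        return Map.of(LINE_KEY, lineNumber);
    }

    public static void main(String[] args) {
        // One (partition, offset) pair per ingested record.
        System.out.println(partitionFor("orders.csv") + " -> " + offsetFor(42L));
    }
}
```

For a table-based source the same shape would presumably hold, with the table name as the partition and an ID value as the offset.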
Components
SourceConnector
- defines the level of parallelism
- distributes work among tasks
- starts on leader node
- triggers the rebalancing job
Rebalancing job is triggered by
- applying a new connector config (REST API)
- changes in the structure of ingested data (new tables, files, partitions, etc.)
SourceTask
- performs data ingestion
Architecture
Methods: SourceConnector
〉void start(Map<String, String> props)
〉List<Map<String, String>> taskConfigs(int maxTasks)
〉void stop()
FileSourceConnector
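A minimal sketch of how a file-based connector might split work in taskConfigs(int maxTasks), assuming a round-robin distribution of file names. The class name, the "files" config key, and the comma-separated encoding are hypothetical; the code uses plain Java collections rather than the Kafka Connect classes, so only the return shape mirrors the real method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FileAssignment {
    // Hypothetical config key under which each task receives its files.
    static final String FILES_KEY = "files";

    /**
     * Splits the files round-robin into at most maxTasks groups and returns
     * one config map per group, mirroring the shape of taskConfigs(int).
     */
    static List<Map<String, String>> taskConfigs(List<String> files, int maxTasks) {
        if (maxTasks < 1) {
            throw new IllegalArgumentException("maxTasks must be at least 1");
        }
        // Never create more tasks than there are files to read.
        int groups = Math.min(maxTasks, files.size());
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            buckets.get(i % groups).add(files.get(i));
        }
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Map<String, String> config = new HashMap<>();
            config.put(FILES_KEY, String.join(",", bucket));
            configs.add(config);
        }
        return configs;
    }

    public static void main(String[] args) {
        System.out.println(taskConfigs(List.of("a.log", "b.log", "c.log"), 2));
    }
}
```

Each returned map would be handed to one task's start(props), so the number of maps is the effective parallelism.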
FileSourceConnector (rebalancing)
Methods: SourceTask
〉void start(Map<String, String> props)
〉List<SourceRecord> poll()
〉void stop()
FileSourceTask
FileSourceTask (offset filtering)
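Offset filtering can be illustrated with a small sketch, assuming the logical offset is a 1-based line number and that the last committed line number has already been looked up when the task starts. The class and method names are hypothetical and the code works on an in-memory list instead of a real file.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetFilter {
    /** One line of a file together with its 1-based line number (the logical offset). */
    static final class Line {
        final long number;
        final String text;

        Line(long number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /**
     * Returns only the lines that come after lastCommittedLine.
     * Pass 0 when no offset has been stored yet, so that every line is returned.
     */
    static List<Line> linesAfter(List<String> fileLines, long lastCommittedLine) {
        List<Line> result = new ArrayList<>();
        for (int i = 0; i < fileLines.size(); i++) {
            long lineNumber = i + 1L;
            if (lineNumber > lastCommittedLine) {
                result.add(new Line(lineNumber, fileLines.get(i)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Line 1 was already ingested before the restart, so only lines 2 and 3 remain.
        for (Line line : linesAfter(List.of("first", "second", "third"), 1)) {
            System.out.println(line.number + ": " + line.text);
        }
    }
}
```

In a real task, each remaining line would become a record whose source offset carries its line number, which is what makes the next restart resumable.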
Sink Connector
Architecture
FileSinkConnector
Methods: SinkTask
〉void start(Map<String, String> props)
〉void put(Collection<SinkRecord> records)
〉void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets)
〉void stop()
Storing in put()
〉put() should return quickly (there is an internal timeout)
〉Only a limited number of records is passed to each put() call
〉Automatic offset management (by the consumer)
Storing in flush()
〉put() stores records in a temp file / in memory
〉flush() uploads an optimally sized batch to storage
〉Manual offset management (by uploading index files)
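The split between a cheap put() and a batching flush() can be sketched as follows. This is a simplified model in plain Java: records are strings, the "storage" is an in-memory list of batches, and the class name is hypothetical. It does not use the Kafka Connect SinkTask class.

```java
import java.util.ArrayList;
import java.util.List;

final class BufferingSink {
    // Records received since the last flush.
    private final List<String> buffer = new ArrayList<>();
    // Stands in for the external storage; each element is one uploaded batch.
    private final List<List<String>> uploaded = new ArrayList<>();

    /** Cheap: only appends to memory, so it returns quickly. */
    void put(List<String> records) {
        buffer.addAll(records);
    }

    /** Uploads everything buffered so far as a single batch, then clears the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        uploaded.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> uploadedBatches() {
        return uploaded;
    }

    public static void main(String[] args) {
        BufferingSink sink = new BufferingSink();
        sink.put(List.of("r1", "r2"));
        sink.put(List.of("r3"));
        sink.flush();
        System.out.println(sink.uploadedBatches());
    }
}
```

Several small put() calls end up as one larger upload, which is the point of deferring the write to flush().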
Resume reading using offsets
Run
Dockerfile
Starting connector
Facing reality
Global rebalancing
〉A single Kafka Connect JVM can host multiple connectors
〉Rebalancing one of them triggers rebalancing of all the others
Solution: run one connector per JVM
Writing offsets without sending source record
〉Ingesting a file that yields no records (e.g. it is empty)
Solutions:
1) send a marker SourceRecord carrying the offset
2) obtain offsetStorageWriter via reflection and write the offset directly
Controlling ingestion speed (backpressure)
〉Source
- no built-in control over the speed of writes to Kafka
- solution: sleep() in poll() + producer tuning
〉Sink
- no built-in control over the speed of writes to external storage
- solution: sleep() + throw new RetriableException in put()
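One way to decide how long to sleep() is to hold a target average rate. The sketch below is an assumption about how such throttling could be computed, not the approach from the talk; the class name, method name, and the records-per-second limit are all hypothetical.

```java
final class Throttle {
    /**
     * Milliseconds to sleep after handling `records` records in `elapsedMillis`,
     * so that the average rate stays at or below maxPerSecond.
     */
    static long pauseMillis(int records, int maxPerSecond, long elapsedMillis) {
        if (maxPerSecond < 1) {
            throw new IllegalArgumentException("maxPerSecond must be at least 1");
        }
        // Time the batch should have taken at the target rate.
        long requiredMillis = records * 1000L / maxPerSecond;
        return Math.max(0L, requiredMillis - elapsedMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        // 200 records were handled in 50 ms against a limit of 1000 records per second.
        long pause = pauseMillis(200, 1000, 50);
        Thread.sleep(pause);
        System.out.println("slept " + pause + " ms");
    }
}
```

A source task could apply such a pause at the end of poll(); a sink task could do the same in put() before deciding whether to retry.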
Exactly once delivery
〉not supported
〉Source
- data and offsets are stored separately => duplicates are possible
- technically feasible, but not implemented at the time of this talk (Kafka 3.3 later added exactly-once support for source connectors, KIP-618)
Solution:
- an extra deduplication process (for instance, Kafka Streams)
- a compacted data topic
〉Sink
- idempotence: upload an index file together with the data files + use consistent file naming
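Consistent file naming makes a repeated upload harmless: if the name depends only on where the data came from, a retry overwrites the same file instead of creating a duplicate. A minimal sketch, assuming a hypothetical naming scheme built from topic, partition, and the first offset in the file (the scheme, class name, and ".data" extension are illustrative):

```java
import java.util.Locale;

final class SinkFileNames {
    /**
     * Builds a deterministic file name from the Kafka coordinates of the first
     * record in the file. The offset is zero-padded so names sort in offset order.
     */
    static String dataFileName(String topic, int partition, long startOffset) {
        return String.format(Locale.ROOT, "%s+%d+%010d.data", topic, partition, startOffset);
    }

    public static void main(String[] args) {
        // The same inputs always produce the same name, so a retried upload overwrites.
        System.out.println(dataFileName("events", 3, 42L));
    }
}
```

An index file listing these names, uploaded last, would then mark the batch as complete.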
Conclusion
〉Simple and fast
〉Full control over how data is ingested
〉Mature
〉Clusterless
〉Many free connectors (Debezium, S3, FTP, Elasticsearch, etc.)
Questions?
