Change Data Capture
The road thus far...
Agenda
● What is CDC?
● Why should we CDC?
● High Level Architecture
● The Stack
○ Kafka Connect
○ Debezium
○ Schema Registry
○ Avro
● Open Issues
● Future Steps
● Demo
What is CDC?
● A set of software design patterns used to determine (and track) the data that has changed, so that action can be taken using the changed data.
● An approach to data integration based on the identification, capture, and delivery of changes made to enterprise data sources.
● Feeds the data into a central hub of data streams, where it can readily be combined with event streams and data from other databases in real time.
● A continuous data integration paradigm.
Why Should We Do It?
● To support a data lake driven architecture
○ Current architecture is obsolete
○ Not best practice (deletes)
○ Not incremental - overwrites everything
○ Freshness of data
○ Being done behind your backs!
● Same data, different representations & different needs
○ Data lake
○ Search index
○ Cache
● To support event-driven architecture & event sourcing
○ Review Created, Token Deleted, etc.
Change Data Capture
BREAKS DATABASE ENCAPSULATION
But hey, we’ve been doing it for the past 3 years...
Treat Your Data Models Like You Would Your APIs!
Or at least use the tools that enable you to do so.
High Level Architecture
THIS IS POWERFUL IN SO MANY
WAYS WE CAN ONLY IMAGINE!
Kafka Connect
● Kafka Connect, an open source component of
Apache Kafka, is a framework for connecting Kafka
with external systems such as databases, key-value
stores, search indexes, and file systems.
● Source Connector
○ A source connector ingests entire databases
and streams table updates to Kafka topics. It
can also collect metrics from all of your
application servers into Kafka topics, making
the data available for stream processing with
low latency.
● Sink Connector
○ A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch, or into batch systems.
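For a concrete feel, connectors are registered and inspected through the Connect worker’s REST API. A minimal Python sketch, assuming a worker on localhost:8083; the connector name, database coordinates, and table list are hypothetical, and required settings such as the database history topic are omitted for brevity.

```python
import requests  # assumes a Connect worker's REST API on localhost:8083

# Hypothetical Debezium MySQL source connector; hostname, credentials,
# and table list are placeholders, and required settings such as the
# database history topic are omitted for brevity.
connector = {
    "name": "inventory-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.name": "inventory",
        "table.whitelist": "inventory.reviews",
    },
}

# POST /connectors registers the connector with the Connect cluster.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()

# GET /connectors lists everything currently deployed.
print(requests.get("http://localhost:8083/connectors").json())
```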
Debezium
● An open source distributed platform for change data capture.
● Turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases.
● Debezium is built on top of Apache Kafka and provides Kafka Connect compatible connectors that monitor a specific database.
● Reads the binlog of the source database and produces a unified, structured event that describes each change.
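That unified row-change event looks roughly like the sketch below; all field values are illustrative, and the real message also carries a schema section alongside this payload.

```python
# Simplified shape of a Debezium row-change event payload.
change_event = {
    "before": {"id": 42, "rating": 3},   # row state before the change
    "after":  {"id": 42, "rating": 5},   # row state after the change
    "source": {                          # where the event came from
        "name": "inventory",             # logical server name
        "table": "reviews",
        "file": "mysql-bin.000003",      # binlog file and position
        "pos": 805,
    },
    "op": "u",        # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1540000000000,
}
```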
Schema Registry
● Schema Registry provides a serving layer for your
metadata.
● It provides a RESTful interface for storing and
retrieving Avro schemas.
● It stores a versioned history of all schemas, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility settings, with expanded Avro support.
● It provides serializers that plug into Kafka clients that
handle schema storage and retrieval for Kafka
messages that are sent in the Avro format.
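A minimal sketch of that REST interface, assuming Schema Registry on localhost:8081; the subject name and schema are hypothetical.

```python
import json
import requests  # assumes Schema Registry runs on localhost:8081

subject = "inventory.reviews-value"  # hypothetical subject name

# Register a new schema version under the subject. The schema itself
# is passed as an escaped JSON string.
schema = {
    "type": "record",
    "name": "Review",
    "fields": [{"name": "id", "type": "long"},
               {"name": "rating", "type": "int"}],
}
resp = requests.post(
    f"http://localhost:8081/subjects/{subject}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
print(resp.json())  # {'id': <global schema id>}

# Fetch the latest registered version back.
latest = requests.get(
    f"http://localhost:8081/subjects/{subject}/versions/latest").json()
print(latest["version"], latest["schema"])
```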
Apache Avro
● Avro is a binary data serialization framework developed within Apache’s Hadoop project.
● Just as a SQL database won’t accept data without a table being created first, you can’t create an Avro object without first providing a schema.
● Avro schemas are defined using JSON.
● An Avro object contains the schema and the data; the data without the schema is not a valid Avro object. That’s a big difference from, say, CSV or JSON.
● Schemas can evolve over time: Apache Avro has a concept of projection which makes schema evolution seamless to the end user.
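A minimal sketch of those ideas using Python’s fastavro (an assumption; any Avro library works the same way): the schema is plain JSON declared up front, and the serialized container embeds it alongside the data.

```python
import io
from fastavro import parse_schema, reader, writer  # assumes fastavro is installed

# Avro schemas are plain JSON; this Review record is a made-up example.
schema = parse_schema({
    "type": "record",
    "name": "Review",
    "doc": "Schema documentation is built in",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "rating", "type": "int"},
        {"name": "text", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
writer(buf, schema, [{"id": 42, "rating": 5, "text": "great"}])

# An Avro container embeds the writer schema alongside the data,
# so a reader needs nothing but the bytes.
buf.seek(0)
for record in reader(buf):
    print(record)
```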
Avro vs. JSON

Avro
● Schema evolution!
● Fast, compact & binary
● Support for primitive types as well as complex types
● Documentation of the schema is built in
● Supports compression codecs such as Google’s Snappy
● Readable using the Avro consumers that ship with the Schema Registry

JSON
● Can be read by pretty much any language
● No native schema support
● Objects can get quite big because of repeated keys
● No comments, metadata, or documentation
● Typing and parsing are the consumer’s responsibility: INT or LONG?
● Human-readable
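A quick, hypothetical way to see the repeated-keys overhead: serialize the same records both ways and compare sizes. Record names and counts are illustrative, and exact numbers depend on the codec.

```python
import io
import json
from fastavro import parse_schema, writer

# A tiny made-up record type, repeated 1000 times.
schema = parse_schema({
    "type": "record", "name": "Point",
    "fields": [{"name": "x", "type": "int"}, {"name": "y", "type": "int"}],
})
records = [{"x": i, "y": i * 2} for i in range(1000)]

avro_buf = io.BytesIO()
writer(avro_buf, schema, records)

json_bytes = json.dumps(records).encode()
# JSON repeats the keys "x" and "y" in every object; Avro writes the
# schema once and then just the values.
print(len(json_bytes), len(avro_buf.getvalue()))
```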
Future Steps
● Production - Orders/Unsubscribers?
● Secor - support the Schema Registry param
● Single Message Transformations
○ Debezium supports masking and column blacklisting
● Stream changes from MongoDB’s oplog
○ Register a new type of connector
● Database-to-database streaming
○ Sync MySQL -> Elasticsearch, as in filter & search
● All-time data and backfill
○ Consume the topic from the beginning (see the sketch below)
○ Use the data lake to replay all-time data
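A minimal backfill sketch with the confluent-kafka Python client, assuming a local broker; the topic and group id are placeholders. A fresh consumer group with auto.offset.reset=earliest replays the topic from the beginning.

```python
from confluent_kafka import Consumer  # assumes confluent-kafka is installed

# A fresh group.id plus auto.offset.reset=earliest makes the consumer
# start from the beginning of the topic, replaying every change event.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "backfill-demo",            # placeholder group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["inventory.reviews"])   # placeholder topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        print(msg.topic(), msg.offset(), msg.value())
finally:
    consumer.close()
```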
● Metorikku - support Avro serialization + Schema Registry
● Metrics, logging & alerting
● Debezium/Schema Registry cluster, scaling and deployment
○ One worker per table? Per database? Per several databases?
● Log compaction and retention per connector
● UI
○ Schema Registry
○ Connect
● Dive deeper into each technology
Open Issues
● Introducing new complexity to the system
● Kafka version 2.0
● Avro support in the Kafka gem and Go packages
● Binlog + performance = TBD
○ We could use Kafka Connect’s built-in JDBC connector, which does not use the binlog
● Aurora + binlog = 💩
Demo
Questions?
Thank You!
