Change Data Capture
The road thus far...
Agenda
● What is CDC?
● Why should we CDC?
● High Level Architecture
● The Stack
○ Kafka Connect
○ Debezium
○ Schema Registry
○ Avro
● Open Issues
● Future Steps
● Demo
What is CDC?
● A set of software design patterns used to determine (and track) the data that has changed, so that action can be taken using the changed data.
● An approach to data integration based on the identification, capture, and delivery of changes made to enterprise data sources.
● Feeds the data into a central hub of data streams, where it can readily be combined with event streams and data from other databases in real time.
● A continuous data integration paradigm.
Why Should We Do It?
● To support a data lake driven architecture
○ Current architecture is obsolete
○ Not best practice (deletes)
○ Not incremental - overwrites everything
○ Freshness of data
○ Being done behind your backs!
● Same data, different representations & different needs
○ Data lake
○ Search index
○ Cache
● To support event-driven architecture & event sourcing
○ Review Created, Token Deleted, etc.
Change Data Capture
BREAKS DATABASE ENCAPSULATION
But hey, we’ve been doing it for the past 3 years...
Treat Your Data Models Like You Would Your APIs!
Or at least use the tools that enable you to do so.
High Level Architecture
THIS IS POWERFUL IN SO MANY
WAYS WE CAN ONLY IMAGINE!
Kafka Connect
● Kafka Connect, an open source component of
Apache Kafka, is a framework for connecting Kafka
with external systems such as databases, key-value
stores, search indexes, and file systems.
● Source Connector
○ A source connector ingests entire databases
and streams table updates to Kafka topics. It
can also collect metrics from all of your
application servers into Kafka topics, making
the data available for stream processing with
low latency.
● Sink Connector
○ A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch, or into batch systems.
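For a concrete feel, connectors are registered and inspected through the Connect worker’s REST API. A minimal Python sketch, assuming a worker on localhost:8083; the connector name, database coordinates, and table list are hypothetical, and required settings such as the database history topic are omitted for brevity.

```python
import requests  # assumes a Connect worker's REST API on localhost:8083

# Hypothetical Debezium MySQL source connector; hostname, credentials,
# and table list are placeholders, and required settings such as the
# database history topic are omitted for brevity.
connector = {
    "name": "inventory-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.name": "inventory",
        "table.whitelist": "inventory.reviews",
    },
}

# POST /connectors registers the connector with the Connect cluster.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()

# GET /connectors lists everything currently deployed.
print(requests.get("http://localhost:8083/connectors").json())
```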
Debezium
● An open source distributed platform for change data capture.
● Turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases.
● Debezium is built on top of Apache Kafka and provides Kafka Connect compatible connectors that monitor a specific database.
● Reads the binlog of the source database and produces a unified, structured event that describes each change.
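That unified row-change event looks roughly like the sketch below; all field values are illustrative, and the real message also carries a schema section alongside this payload.

```python
# Simplified shape of a Debezium row-change event payload.
change_event = {
    "before": {"id": 42, "rating": 3},   # row state before the change
    "after":  {"id": 42, "rating": 5},   # row state after the change
    "source": {                          # where the event came from
        "name": "inventory",             # logical server name
        "table": "reviews",
        "file": "mysql-bin.000003",      # binlog file and position
        "pos": 805,
    },
    "op": "u",        # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1540000000000,
}
```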
Schema Registry
● Schema Registry provides a serving layer for your
metadata.
● It provides a RESTful interface for storing and
retrieving Avro schemas.
● It stores a versioned history of all schemas, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility settings, with expanded Avro support.
● It provides serializers that plug into Kafka clients that
handle schema storage and retrieval for Kafka
messages that are sent in the Avro format.
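A minimal sketch of that REST interface, assuming Schema Registry on localhost:8081; the subject name and schema are hypothetical.

```python
import json
import requests  # assumes Schema Registry runs on localhost:8081

subject = "inventory.reviews-value"  # hypothetical subject name

# Register a new schema version under the subject. The schema itself
# is passed as an escaped JSON string.
schema = {
    "type": "record",
    "name": "Review",
    "fields": [{"name": "id", "type": "long"},
               {"name": "rating", "type": "int"}],
}
resp = requests.post(
    f"http://localhost:8081/subjects/{subject}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
print(resp.json())  # {'id': <global schema id>}

# Fetch the latest registered version back.
latest = requests.get(
    f"http://localhost:8081/subjects/{subject}/versions/latest").json()
print(latest["version"], latest["schema"])
```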
Apache Avro
● Avro is a binary data serialization framework developed within Apache’s Hadoop project.
● Just as a SQL database won’t accept data without a table being created first, you can’t create an Avro object without first providing a schema.
● Avro schemas are defined using JSON.
● An Avro object contains the schema and the data; the data without the schema is not a valid Avro object. That’s a big difference from, say, CSV or JSON.
● Schemas can evolve over time: Apache Avro has a concept of projection which makes schema evolution seamless to the end user.
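A minimal sketch of those ideas using Python’s fastavro (an assumption; any Avro library works the same way): the schema is plain JSON declared up front, and the serialized container embeds it alongside the data.

```python
import io
from fastavro import parse_schema, reader, writer  # assumes fastavro is installed

# Avro schemas are plain JSON; this Review record is a made-up example.
schema = parse_schema({
    "type": "record",
    "name": "Review",
    "doc": "Schema documentation is built in",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "rating", "type": "int"},
        {"name": "text", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
writer(buf, schema, [{"id": 42, "rating": 5, "text": "great"}])

# An Avro container embeds the writer schema alongside the data,
# so a reader needs nothing but the bytes.
buf.seek(0)
for record in reader(buf):
    print(record)
```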
Avro vs. JSON

Avro
● Schema evolution!
● Fast, compact & binary
● Support for primitive types as well as complex types
● Documentation of the schema is built in
● Supports compression codecs such as Google’s Snappy
● Readable using the Avro consumers that ship with the Schema Registry

JSON
● Can be read by pretty much any language
● No native schema support
● Objects can get quite big because of repeated keys
● No comments, metadata, or documentation
● Typing and parsing are the consumer’s responsibility: INT or LONG?
● Human-readable
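A quick, hypothetical way to see the repeated-keys overhead: serialize the same records both ways and compare sizes. Record names and counts are illustrative, and exact numbers depend on the codec.

```python
import io
import json
from fastavro import parse_schema, writer

# A tiny made-up record type, repeated 1000 times.
schema = parse_schema({
    "type": "record", "name": "Point",
    "fields": [{"name": "x", "type": "int"}, {"name": "y", "type": "int"}],
})
records = [{"x": i, "y": i * 2} for i in range(1000)]

avro_buf = io.BytesIO()
writer(avro_buf, schema, records)

json_bytes = json.dumps(records).encode()
# JSON repeats the keys "x" and "y" in every object; Avro writes the
# schema once and then just the values.
print(len(json_bytes), len(avro_buf.getvalue()))
```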
Future Steps
● Production - Orders/Unsubscribers?
● Secor - support the Schema Registry param
● Single Message Transformations
○ Debezium supports masking and column blacklisting
● Stream changes from MongoDB’s oplog
○ Register a new type of connector
● Database-to-database streaming
○ Sync MySQL -> Elasticsearch, as in filter & search
● All-time data and backfill
○ Consume the topic from the beginning (see the sketch below)
○ Use the data lake to replay all-time data
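A minimal backfill sketch with the confluent-kafka Python client, assuming a local broker; the topic and group id are placeholders. A fresh consumer group with auto.offset.reset=earliest replays the topic from the beginning.

```python
from confluent_kafka import Consumer  # assumes confluent-kafka is installed

# A fresh group.id plus auto.offset.reset=earliest makes the consumer
# start from the beginning of the topic, replaying every change event.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "backfill-demo",            # placeholder group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["inventory.reviews"])   # placeholder topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        print(msg.topic(), msg.offset(), msg.value())
finally:
    consumer.close()
```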
● Metorikku - support Avro serialization + Schema Registry
● Metrics, logging & alerting
● Debezium/Schema Registry cluster, scaling and deployment
○ One worker per table? Per database? Per several databases?
● Log compaction and retention per connector
● UI
○ Schema Registry
○ Connect
● Dive deeper into each technology
Open Issues
● Introducing new complexity to the system
● Kafka version 2.0
● Avro support in the Kafka gem and Go packages
● Binlog + performance = TBD
○ We could use Kafka Connect’s built-in JDBC connector, which does not use the binlog
● Aurora + binlog = 💩
Demo
Questions?
Thank You!
