mypipe: Buffering and consuming
MySQL changes via Kafka
with
-=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
GitHub: mardambey
Twitter: @codewarrior
Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
Who is this guy?
● Linux and OpenBSD user and developer
since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed
systems, and doing Scala
GitHub: mardambey
Twitter: @codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging
rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the
construction of concurrent and distributed
applications
● Actors: universal primitives of concurrent
computation using message passing
● Schema repo / registry: holds versioned
Avro schemas
Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
Feature Overview (1/2)
● Emulates MySQL slave via binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically via a common schema for all tables
○ Specifically via per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ whitelist / blacklist support
● Console consumer, Dockerized dev env
Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation
serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder
implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables
(beta)
MySQL Binary Logging
● Foundation of MySQL replication
● Statement- or row-based
● Represents a journal / change log of data
● Allows applications to spy / tune in on
MySQL changes
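For reference, a minimal my.cnf sketch that turns on row-based binary logging; mypipe's exact requirements may differ, and the values below are purely illustrative:
[mysqld]
server-id        = 1
log-bin          = mysql-bin
binlog_format    = ROW    # row events carry the actual column data, not just the SQL text
binlog_row_image = FULL   # include every column in the before/after row images
expire_logs_days = 7      # keep binlogs long enough for consumers to catch up
The user mypipe connects as also needs the REPLICATION SLAVE and REPLICATION CLIENT privileges, like any other replica.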
MySQLBinaryLogConsumer
● Uses behavior from an abstract class
● Modular design; in this case, it uses config-based implementations
● Uses HOCON for ease and availability
case class MySQLBinaryLogConsumer(config: Config)
extends AbstractMySQLBinaryLogConsumer
with ConfigBasedConnectionSource
with ConfigBasedErrorHandlingBehaviour
with ConfigBasedEventSkippingBehaviour
with CacheableTableMapBehaviour
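A rough usage sketch; the config path below is hypothetical, but the "localhost" consumer block it points at is the one shown in the config slides later on:
import com.typesafe.config.ConfigFactory

// hypothetical wiring: hand the "localhost" consumer block from the HOCON config to the consumer
val config   = ConfigFactory.load().getConfig("mypipe.consumers.localhost")
val consumer = MySQLBinaryLogConsumer(config)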
AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala-friendly, case-class and Option-driven(*) API for speaking MySQL binlogs
(*) constant work in progress (=
TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
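For context, this is the kind of information_schema lookup involved; the exact query mypipe issues isn't shown here, so treat this as a minimal sketch:
// columns, their types, and whether they belong to the primary key (column_key = 'PRI')
val columnMetadataSql =
  """SELECT column_name, data_type, column_key
    |FROM information_schema.columns
    |WHERE table_schema = ? AND table_name = ?
    |ORDER BY ordinal_position""".stripMargin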
Mutations
case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey:
Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized and deserialized
from and to Avro
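Given those case classes, a downstream handler can pattern match on mutations; a minimal sketch, assuming a common Mutation supertype and ignoring error handling:
// illustrative only: react to enriched mutation events
def onMutation(m: Mutation): Unit = m match {
  case InsertMutation(ts, table, rows, txid) =>
    println(s"INSERT into ${table.db}.${table.name}: ${rows.size} row(s), tx $txid")
  case UpdateMutation(ts, table, rows, txid) =>
    rows.foreach { case (oldRow, newRow) =>
      println(s"UPDATE ${table.name}: before=${oldRow.columns} after=${newRow.columns}") }
  case DeleteMutation(ts, table, rows, txid) =>
    println(s"DELETE ${rows.size} row(s) from ${table.name}, tx $txid")
}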
Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDE
○ Recently started supporting Kafka serializers and
decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we can integrate with systems such as
Confluent Platform’s schema registry.
Kafka Message Format
-------------------
| MAGIC | 1 byte  |
|-----------------|
| MTYPE | 1 byte  |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA  | N bytes |
-------------------
● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte
○ indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes
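A minimal Scala sketch of writing this envelope with java.nio.ByteBuffer; the slide only says SCMID is “N bytes”, so the 2-byte schema ID below is purely an assumption for illustration:
import java.nio.ByteBuffer

def encode(magic: Byte, mutationType: Byte, schemaId: Short, avroBytes: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(1 + 1 + 2 + avroBytes.length)
  buf.put(magic)          // MAGIC: protocol version
  buf.put(mutationType)   // MTYPE: 0x1 insert, 0x2 update, 0x3 delete
  buf.putShort(schemaId)  // SCMID: Avro schema ID (width assumed here)
  buf.put(avroBytes)      // DATA: Avro-serialized mutation
  buf.array()
}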
Generic Message Format
3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Group data by type into Avro maps
{
"name": "old_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "new_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "old_strings",
"type": {"type": "map", "values": "string"}
},
{
"name": "new_strings",
"type": {"type": "map", "values": "string"}
} ...
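For example, an update to a hypothetical users(id INT, username VARCHAR) row gets bucketed by column type roughly like this (an illustrative JSON view of the resulting Avro record):
{
  "old_integers": { "id": 42 },
  "new_integers": { "id": 42 },
  "old_strings":  { "username": "alice" },
  "new_strings":  { "username": "bob" }
}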
Specific Message Format
Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema
{
"name": "UserInsert",
"fields": [
{
"name": "id",
"type": ["null", "int"]
},
{
"name": "username",
"type": ["null", "string"]
},
{
"name": "login_date",
"type": ["null", "long"]
},...
]
},
ALTER TABLE support
● ALTER TABLE queries are intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh schema if
needed
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
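For instance, adding a nullable field with a default is the kind of change that evolves cleanly; a hypothetical addition to the UserInsert schema shown earlier:
{
  "name": "email",
  "type": ["null", "string"],
  "default": null
}
Dropping a field that has no default, or changing a field's type incompatibly, would leave mypipe unable to decode older messages.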
Pipes
● Join consumers to producers
● Use configurable, time-based checkpointing and flushing
○ File based, MySQL based, ZK based, Kafka based
schema-repo-client = "mypipe.avro.schema.SchemaRepo"
consumers {
localhost {
# database "host:port:user:pass" array
source = "localhost:3306:mypipe:mypipe"
}
}
producers {
stdout {
class = "mypipe.kafka.producer.stdout.StdoutProducer"
}
kafka-generic {
class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
}
}
pipes {
stdout {
consumers = ["localhost"]
producer { stdout {} }
binlog-position-repo {
#class="mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository"
class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
config {
file-prefix = "stdout-00" # required if binlog-position-repo is specified
data-dir = "/tmp/mypipe/data"
}
}
}
kafka-generic {
enabled = true
consumers = ["localhost"]
producer {
kafka-generic {
metadata-brokers = "localhost:9092"
}
}
}
Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once Confluent Schema Registry is integrated
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
System Diagram
[Diagram: multiple pipes (Pipe 1, Pipe 2, … Pipe N). In each pipe, a MySQL BinaryLog Consumer (or a Select Consumer reading directly from MySQL) feeds a Kafka Producer, which registers schemas with the Schema Registry and publishes to per-table Kafka topics (db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2). Event Consumers read those topics and feed Hadoop, Cassandra, Dashboards, and Users.]
Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID)
support
● Publish to Maven
● More tests, we have a good amount, but you
can’t have enough!
Fin!
That’s all folks (=
Thanks!
Questions?
https://github.com/mardambey/mypipe