© 2020, Altinity LTD© 2019, Altinity LTD
Introductions
● www.altinity.com: software and services provider for ClickHouse; major committer and community sponsor in the US and Western Europe
● Robert Hodges (CEO): 30+ years of DBMS plus virtualization & security
● Mikhail Filimonov (Engineer): Kafka Engine maintainer and ClickHouse committer
What’s Kafka? (And why use it with ClickHouse)
Kafka is messaging on steroids
[Diagram labels: Kafka Broker; Topic: Readings; Partitions; Replicas; Producers; Consumers in a Consumer Group]
ClickHouse is not a slouch either
Understands SQL
Runs on bare metal to cloud
Shared nothing architecture
Uses column storage
Parallel and vectorized execution
Scales to many petabytes
Is open source (Apache 2.0)
And it’s really fast!
Reasons to use Kafka with ClickHouse
● Many data sources
● High throughput
● Low latency
● Message replay
[Diagram labels: Your Apps, Kafka, ClickHouse]
Reading data from Kafka
Standard flow from Kafka to ClickHouse
● Topic: contains messages
● Kafka Table Engine: encapsulates topic within ClickHouse
● Materialized View: fetches rows
● MergeTree Table: stores rows
Create inbound Kafka topic
kafka-topics \
  --bootstrap-server kafka-headless:9092 \
  --topic readings \
  --create --partitions 6 \
  --replication-factor 3
Create target table
CREATE TABLE readings (
    readings_id Int32 Codec(DoubleDelta, LZ4),
    time DateTime Codec(DoubleDelta, LZ4),
    date ALIAS toDate(time),
    temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
-- MergeTree needs a sorting key; the slide omits it, so this one is an assumption.
ORDER BY (readings_id, time);
Create Kafka Engine table
CREATE TABLE readings_queue (
    readings_id Int32,
    time DateTime,
    temperature Decimal(5,2)
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka-headless.kafka:9092',
    kafka_topic_list = 'readings',
    kafka_group_name = 'readings_consumer_group1',
    kafka_num_consumers = '1',
    kafka_format = 'CSV';
Create materialized view to transfer data
CREATE MATERIALIZED VIEW readings_queue_mv
TO readings
AS
SELECT readings_id, time, temperature
FROM readings_queue;
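Once messages are published to the readings topic, a quick way to confirm that the materialized view is moving rows is to query the target table. This is a minimal sketch; the column aliases are illustrative, and rows may take a few seconds to appear because they arrive in batches.

```sql
-- Check that rows consumed from Kafka have landed in the MergeTree table.
SELECT
    count() AS rows_loaded,
    min(time) AS first_time,
    max(time) AS last_time
FROM readings;
```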
Writing data to Kafka
Standard flow from ClickHouse to Kafka
● INSERT: writes rows into the Kafka engine table
● Kafka Table Engine: encapsulates topic within ClickHouse
● Topic: contains messages
Create outbound Kafka topic
kafka-topics \
  --bootstrap-server kafka-headless:9092 \
  --topic events \
  --create --partitions 6 \
  --replication-factor 3
Create Kafka Engine table
CREATE TABLE events (
    time DateTime,
    severity String,
    content String
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka-headless.kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'events_consumer_group1',
    -- The slide says 'CSV', but the consumer output on the next slide is JSON,
    -- which is what JSONEachRow produces.
    kafka_format = 'JSONEachRow';
Insert data to write into Kafka
-- (In clickhouse-client)
INSERT INTO events VALUES
(now(), 'ERROR', 'Oh no!');
# (In another window)
kafka-console-consumer --bootstrap-server \
  kafka-headless:9092 --topic events
{"time":"2020-01-19 05:07:10","severity":"ERROR","content":"Oh no!"}
Kafka Tips and Tricks
Kafka table engine internals
[Diagram: inside the ClickHouse server, the Kafka table engine (readings_queue) uses librdkafka to talk to the Kafka broker that holds the topic readings]
Settings: kafka_broker_list, kafka_topic_list, ..., kafka_num_consumers = 1
Config.xml:
<!-- Global config -->
<kafka>
  <debug>cgrp</debug>
  ...
</kafka>
<!-- Topic config -->
<kafka_readings>
  <retry_backoff_ms>250</retry_backoff_ms>
</kafka_readings>
Overall best practices
● Use ClickHouse version 19.16.10 or newer
● For HA you should have at least min.insync.replicas+1 brokers.
○ Typical scenario: 3 brokers, replication factor = 3, min.insync.replicas = 2
● To consume a topic in parallel you need enough partitions: consumers beyond the number of partitions will do nothing. For example, try 2*num_of_consumers partitions.
● If you need the ‘coordinates’ of consumed messages, use virtual columns:
○ _topic, _partition, _timestamp, _key, _offset
○ Just select them in the MV, without declaring them in the Engine=Kafka table
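The virtual-column tip can be sketched as follows. The table readings_with_coords, the view name, and the kafka_* column names are hypothetical; only the virtual columns _topic, _partition and _offset and the readings_queue table come from the slides. The view selects the virtual columns directly, without the Kafka engine table declaring them.

```sql
-- Hypothetical target table that also records where each row came from.
CREATE TABLE readings_with_coords (
    readings_id Int32,
    time DateTime,
    temperature Decimal(5,2),
    kafka_topic String,
    kafka_partition UInt64,
    kafka_offset UInt64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (readings_id, time);

-- The virtual columns are read straight from the Kafka engine table.
CREATE MATERIALIZED VIEW readings_with_coords_mv
TO readings_with_coords
AS
SELECT
    readings_id,
    time,
    temperature,
    _topic AS kafka_topic,
    _partition AS kafka_partition,
    _offset AS kafka_offset
FROM readings_queue;
```

Storing the partition and offset alongside each row makes it easier to spot duplicates or gaps after a consumer restart.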
Overall best practices
● When you have many Kafka tables - increase background_schedule_pool_size (monitor BackgroundSchedulePoolTask)
● If consuming performance is too low - don’t raise kafka_num_consumers (keep it at 1); instead create a separate table with Engine=Kafka and an MV streaming data to the same target.
● To set rdkafka options - add them to the <kafka> section in config.xml or, preferably, use a separate file in config.d/
○ https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
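Two of these tips can be sketched in SQL. The first query reads the BackgroundSchedulePoolTask metric from system.metrics. The second part adds another consumer table and view feeding the same readings target; the names readings_queue2 and readings_queue2_mv are hypothetical, and reusing the consumer group name is an assumption made here so that Kafka divides the topic's partitions between the two tables.

```sql
-- Watch scheduler pressure when running many Kafka tables.
SELECT metric, value
FROM system.metrics
WHERE metric = 'BackgroundSchedulePoolTask';

-- A second consumer table (hypothetical name) in the same consumer group,
-- so the topic's partitions are split between readings_queue and this table.
CREATE TABLE readings_queue2 (
    readings_id Int32,
    time DateTime,
    temperature Decimal(5,2)
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka-headless.kafka:9092',
    kafka_topic_list = 'readings',
    kafka_group_name = 'readings_consumer_group1',
    kafka_num_consumers = 1,
    kafka_format = 'CSV';

-- Stream rows from the second consumer into the same target table.
CREATE MATERIALIZED VIEW readings_queue2_mv
TO readings
AS
SELECT readings_id, time, temperature
FROM readings_queue2;
```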
ClickHouse Clusters and Kafka
● Best practice - every ClickHouse server consumes some partitions and flushes rows to a local ReplicatedMergeTree table.
● Flushing to a Distributed table is also possible
○ Useful if you need to shard the data in ClickHouse according to some sharding key
● Chains of materialized views are possible but can be less reliable
○ Inserts are not atomic, so on failure you can get a ‘dirty’ state
○ Atomic MV chains are planned for the first half of 2020
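The Distributed variant might look like the sketch below. The cluster name my_cluster, the database default, and the object names readings_distributed and readings_queue_dist_mv are all assumptions; the sketch also assumes a local readings table already exists on every shard (a ReplicatedMergeTree table in the best-practice setup) and uses readings_id as the sharding key purely for illustration.

```sql
-- Distributed table that routes rows to the local readings tables by sharding key.
-- 'my_cluster' and 'default' are placeholders for your cluster and database.
CREATE TABLE readings_distributed AS readings
ENGINE = Distributed('my_cluster', 'default', 'readings', readings_id);

-- Flush consumed rows to the Distributed table instead of the local one.
CREATE MATERIALIZED VIEW readings_queue_dist_mv
TO readings_distributed
AS
SELECT readings_id, time, temperature
FROM readings_queue;
```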
Rewind / fast-forward / replay
● Step 1: Detach the Kafka tables in ClickHouse
● Step 2: kafka-consumer-groups.sh --bootstrap-server kafka:9092 --topic topic:0,1,2 --group id1 --reset-offsets --to-latest --execute
○ More samples: https://gist.github.com/filimonov/1646259d18b911d7a1e8745d6411c0cc
● Step 3: Attach the Kafka tables back
See also configuration settings:
<kafka>
  <auto_offset_reset>smallest</auto_offset_reset>
</kafka>
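On the ClickHouse side, steps 1 and 3 are plain DETACH and ATTACH statements. The sketch below assumes the readings_queue table from the earlier slides; repeat it for every Kafka engine table in the consumer group.

```sql
-- Step 1: stop consuming so the consumer group has no active members.
DETACH TABLE readings_queue;

-- Step 2 runs outside ClickHouse (kafka-consumer-groups.sh --reset-offsets ...).

-- Step 3: resume consuming from the new offsets.
ATTACH TABLE readings_queue;
```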
How batching from Kafka stream works
Important settings: kafka_max_block_size, stream_poll_timeout_ms, stream_flush_interval_ms
1. Batch poll (time limit: stream_poll_timeout_ms, 500 ms; message limit: kafka_max_block_size, 65536)
2. Parse messages. If we have enough data (row limit: kafka_max_block_size, 65536) or reach the time limit (stream_flush_interval_ms, 7500 ms) - flush it to the target MV; if not - repeat step 1.
3. The commit happens after writing data to the MV (commit after write = at-least-once)
4. On any error during that process the Kafka client is restarted (leading to a rebalance - it leaves the group and rejoins in a few seconds)
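To see where these knobs live, the following sketch first reads the two stream-level settings from system.settings, then sets kafka_max_block_size on a Kafka engine table. The table name, the consumer group name and the value 8192 are hypothetical choices for illustration, not recommendations.

```sql
-- Current values of the stream-level settings named above.
SELECT name, value
FROM system.settings
WHERE name IN ('stream_poll_timeout_ms', 'stream_flush_interval_ms');

-- Hypothetical Kafka engine table with a smaller batch size per poll/flush.
CREATE TABLE readings_queue_small_batches (
    readings_id Int32,
    time DateTime,
    temperature Decimal(5,2)
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka-headless.kafka:9092',
    kafka_topic_list = 'readings',
    kafka_group_name = 'readings_consumer_group_small',
    kafka_format = 'CSV',
    kafka_max_block_size = 8192;
```

A smaller block size flushes more often, which lowers latency at the cost of creating more, smaller parts in the target table.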
Alternatives to the ClickHouse Kafka Engine
Loading data via a client application
[Diagram: Kafka feeds ClickHouse through a Java connector or a home-built client]
Other approaches to consider
● If you like the Java stack and already use something from it, you can stream a Kafka topic to ClickHouse via JDBC
○ Apache NiFi
○ Apache Storm
○ Kafka Streams
● A new entrant, not tested: https://github.com/housepower/clickhouse_sinker
Kafka Feature Roadmap and Wrap-up
Roadmap
● 2020 near-term Kafka improvements
○ Eliminate duplicates due to topic rebalancing
○ Filling key for inserts (to allow partitioning), also timestamps
○ Better error processing
○ Exactly once semantics
○ AVRO format
○ Introspection - system.kafka, metrics & events
● Long-term Kafka work
○ Fix performance issues including efficient consumer support
○ Support for other messaging systems (need to decide which ones)
○ Give us your thoughts!
File issues on GitHub or contact Altinity directly if you have feature requests
Thank you!
Special Offer: Contact us for a 1-hour consultation
Presenters: rhodges@altinity.com, mfilimonov@altinity.com
Visit us at: https://www.altinity.com
Free Consultation: https://blog.altinity.com/offer

Fast Insight from Fast Data: Integrating ClickHouse and Apache Kafka
