21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage system
Dive into ClickHouse
Storage System
Alexander Sapin, Software Engineer
ClickHouse Storages
External Table Engines:
› File on FS
› S3
› HDFS
› MySQL
› ...
Internal Table Engines:
› Memory
› Log
› StripeLog
› MergeTree family
MergeTree Engines Family
Advantages:
› Inserts are atomic
› Selects are blazing fast
› Primary and secondary indexes
› Data modification!
› Inserts and selects don’t block each other
Features:
› Infrequent INSERTs required (work in progress)
› Background operations on records with the same keys
› Primary key is NOT unique
Write
Create Table and Fill Some Data
Create Table:
CREATE TABLE mt (
    EventDate Date,
    OrderID Int32,
    BannerID UInt64,
    GoalNum Int8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID);
Fill Data (twice):
INSERT INTO mt SELECT toDate('2018-09-26'),
    number, number + 10000, number % 128 FROM numbers(1000000);
INSERT INTO mt SELECT toDate('2018-10-15'),
    number, number + 10000, number % 128 FROM numbers(1000000, 1000000);
Table on Disk
metadata:
$ ls /var/lib/clickhouse/metadata/default/
mt.sql
data:
$ ls /var/lib/clickhouse/data/default/mt
201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1
detached format_version.txt
Contents:
› Format file format_version.txt
› Directories with parts
› Directory for detached parts
Parts: Details
› A part is a portion of the data stored in primary key (PK) order
› Contains an interval of insert numbers
› Created for each INSERT
› Cannot be changed (immutable)
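The part directory names listed earlier, such as 201810_1_4_1, appear to encode the partition ID, the first and last insert numbers the part covers, and a merge level. A minimal Python sketch, assuming that four-field layout; `PartName`, `parse_part_name`, and `covers` are hypothetical helper names for illustration, not ClickHouse APIs:

```python
from typing import NamedTuple


class PartName(NamedTuple):
    partition_id: str  # e.g. "201810", the value of toYYYYMM(EventDate)
    min_block: int     # first insert number covered by the part
    max_block: int     # last insert number covered by the part
    level: int         # assumed meaning: number of merge rounds behind the part


def parse_part_name(name: str) -> PartName:
    # Split from the right so only the last three fields must be numeric.
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return PartName(partition_id, int(min_block), int(max_block), int(level))


def covers(outer: PartName, inner: PartName) -> bool:
    # A merged part covers a source part when it is in the same partition
    # and spans the source part's interval of insert numbers.
    return (outer.partition_id == inner.partition_id
            and outer.min_block <= inner.min_block
            and inner.max_block <= outer.max_block)


merged = parse_part_name("201810_1_4_1")
```

Read this way, 201810_1_4_1 would be the result of merging the October inserts 1 through 4, while the 201809 parts are untouched because they belong to another partition.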
Parts: Why?
PK ordering is needed for efficient OLAP queries
› In our case (OrderID, BannerID)
But the data comes in order of time
› By EventDate
Re-sorting all the data is expensive
› ClickHouse handles hundreds of terabytes
Solution: Store data in a set of ordered parts!
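The idea of a set of ordered parts can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: parts live in memory as tuples, and the names `parts`, `insert`, and `scan` are invented for the example.

```python
import heapq

# Toy model: each part is an immutable tuple of rows sorted by the primary key
# (OrderID, BannerID). Row layout here is (OrderID, BannerID, EventDate).
parts = []


def pk(row):
    return (row[0], row[1])


def insert(batch):
    # Only the new batch is sorted; existing parts are never re-sorted.
    parts.append(tuple(sorted(batch, key=pk)))


def scan():
    # Readers get one PK-ordered stream by lazily merging the sorted parts.
    return list(heapq.merge(*parts, key=pk))


insert([(3, 10003, "2018-09-26"), (1, 10001, "2018-09-26")])
insert([(2, 10002, "2018-10-15"), (0, 10000, "2018-10-15")])
```

Data arrives in time order, but each insert costs only a sort of its own batch, which is the point of the design.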
Parts: Main Idea
[Figure: parts on disk drawn with primary key on one axis and insert number on the other; an existing part covers inserts M to N, and a new batch at N+1 is written as a second part on disk.]
Parts: Atomic insert
[Figure: the same axes (primary key, insert number); an INSERT adds a new part at N+1 next to the existing part covering M to N.]
Read
Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
› primary.idx – primary key on disk
› GoalNum.bin – compressed column
› GoalNum.mrk – marks for column
› partition.dat – partition id
› ... – a lot of other useful files
Index
› Row-oriented
› Sparse (one entry per 8192 rows)
› Stored in memory
› Uncompressed
primary.idx:
      OrderID   BannerID
0.    0         10000
1.    8192      18192
2.    16384     26384
...
N.    1998848   2008848
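How a sparse index of this shape could be derived from sorted rows is sketched below in Python. The function name `build_sparse_index` is hypothetical, and the two million rows are an assumption chosen so that the last entry matches the slide; real parts hold whatever rows were inserted.

```python
GRANULE_ROWS = 8192  # one index entry per 8192 rows, as stated above


def build_sparse_index(sorted_rows, granule_rows=GRANULE_ROWS):
    # Keep only the primary-key tuple of the first row of every granule.
    return [sorted_rows[i] for i in range(0, len(sorted_rows), granule_rows)]


# Rows shaped like the example data: BannerID = OrderID + 10000.
rows = [(order_id, order_id + 10000) for order_id in range(2_000_000)]
index = build_sparse_index(rows)
```

Two million rows shrink to 245 index entries here, which is why such an index can be kept uncompressed in memory.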
Columns
› Each column in separate file
› Compressed by blocks
› Checksums for each block
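[Figure: OrderID.bin as a sequence of compressed blocks at file offsets 0, 65795, 131614, 197433, each with a checksum; one uncompressed block is shown holding the numbers 14324, 17214, 17215, 17216, 17217.]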
How to Use Index?
Problem:
› The index is sparse and contains rows
› Columns contain compressed blocks
How do we match index rows with column data?
Solution: Marks
› Mark – a pair of offsets: one into the compressed file, one into the uncompressed block
› Stored in column_name.mrk files
› One mark per index row
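[Figure: OrderID.mrk holds offset pairs such as (0, 32768), (65795, 0), and (65795, 32768); the first number is the offset of a compressed block in OrderID.bin (blocks start at 0, 65795, 131614, 197433), the second is the position inside the uncompressed block. A label of 8192 also appears, matching the index granularity.]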
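The two-offset mark can be sketched in Python as follows. This is an illustration under stated assumptions, not the on-disk format: zlib stands in for the real codec, each compressed block gets a 4-byte length prefix invented for the example, and `write_column` and `read_granule` are hypothetical names.

```python
import struct
import zlib

GRANULE_ROWS = 8192
VALUE = struct.Struct("<i")  # Int32, 4 bytes per value
BLOCK_BYTES = 65536          # assumed uncompressed block size: two granules


def write_column(values):
    """Return (bin_bytes, marks) for one Int32 column."""
    raw = b"".join(VALUE.pack(v) for v in values)
    bin_bytes, marks = bytearray(), []
    for block_start in range(0, len(raw), BLOCK_BYTES):
        block = raw[block_start:block_start + BLOCK_BYTES]
        compressed_offset = len(bin_bytes)
        # One mark per granule that starts inside this block.
        for in_block in range(0, len(block), GRANULE_ROWS * VALUE.size):
            marks.append((compressed_offset, in_block))
        payload = zlib.compress(block)
        # Length prefix so a reader knows where the compressed block ends.
        bin_bytes += struct.pack("<I", len(payload)) + payload
    return bytes(bin_bytes), marks


def read_granule(bin_bytes, mark):
    compressed_offset, in_block = mark
    (size,) = struct.unpack_from("<I", bin_bytes, compressed_offset)
    start = compressed_offset + 4
    block = zlib.decompress(bin_bytes[start:start + size])
    chunk = block[in_block:in_block + GRANULE_ROWS * VALUE.size]
    return [v for (v,) in VALUE.iter_unpack(chunk)]


bin_bytes, marks = write_column(range(40_000))
```

With 4-byte values, a granule of 8192 rows occupies 32768 uncompressed bytes, so the second mark of each block has an in-block offset of 32768, the same number that appears in the figure.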
Put it all together
Algorithm:
› Determine the required index rows
› Find the corresponding marks
› Distribute granules (stripes of marks) among threads
› Read the required granules
Properties:
› Granules are read concurrently
› Threads can steal tasks
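The distribution step can be approximated in Python with a shared task queue. This is a simplification: a single queue that idle threads pull from stands in for the task stealing mentioned above, and `read_granules_concurrently` is a hypothetical name.

```python
import queue
import threading


def read_granules_concurrently(granules, read_granule, num_threads=2):
    """Call read_granule(g) for every granule; idle threads take more work."""
    tasks = queue.Queue()
    for g in granules:
        tasks.put(g)
    results, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                g = tasks.get_nowait()  # a fast thread simply takes the next task
            except queue.Empty:
                return
            value = read_granule(g)
            with lock:
                results[g] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Return results in granule order regardless of which thread read them.
    return [results[g] for g in granules]


first_rows = read_granules_concurrently([0, 1, 2], lambda g: g * 8192)
```

Because no granule is bound to a thread in advance, a thread that finishes early keeps working instead of waiting for a slower one.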
Read Data from Disk
SELECT any(EventDate), max(GoalNum) FROM mt
WHERE OrderID BETWEEN 6123 AND 17345
[Figure: animation over several slides. KeyCondition is evaluated against primary.idx (OrderID, BannerID entries 0/10000, 8192/18192, 16384/26384, ..., 1998848/2008848); the .mrk and .bin files of EventDate (2 bytes per value), OrderID (4 bytes), and GoalNum (1 byte) are then read by thread 1 and thread 2.]
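Which granules a range condition such as `OrderID BETWEEN 6123 AND 17345` has to touch can be worked out from the first key of each granule. A minimal Python sketch, assuming an index with one entry per 8192 rows as described earlier; `granules_for_range` is a hypothetical helper, not ClickHouse's KeyCondition:

```python
import bisect

# First OrderID of each granule: 0, 8192, 16384, ..., 1998848.
first_keys = [i * 8192 for i in range(245)]


def granules_for_range(first_keys, low, high):
    """Return indexes of granules that may hold keys in [low, high]."""
    # The granule before the first entry >= low may still contain `low`;
    # bisect_left keeps this conservative because keys are not unique.
    start = max(bisect.bisect_left(first_keys, low) - 1, 0)
    # Granules whose first key is above `high` cannot match.
    stop = bisect.bisect_right(first_keys, high)
    return list(range(start, stop))


selected = granules_for_range(first_keys, 6123, 17345)
```

For this query the sketch selects granules 0, 1, and 2, so about 24576 rows are read instead of the whole part; rows outside the range inside those granules still have to be filtered out afterwards.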
Compaction
Problem: Number of Files in Parts
Test example: Almost OK
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 | wc -l
14
Production: Too many
$ ssh -A root@mtgiga075-2.metrika.yandex.net
# ls /var/lib/clickhouse/data/merge/visits_v2/202002_4462_4462_0 | wc -l
1556
Solution: Merges
[Figure: parts [M,N] and [N+1] drawn by primary key against insert number; a background merge combines them into a single part [M,N+1].]
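A merge of two sorted parts into one can be sketched in Python as below. The tuple layout `(min_insert, max_insert, level, rows)` and the rule that the level grows by one are assumptions made for the illustration; `merge_parts` is a hypothetical name.

```python
import heapq


def merge_parts(left, right):
    """Merge two PK-sorted parts into one new sorted part.

    Each part is (min_insert, max_insert, level, rows); inputs are not modified.
    """
    lo = min(left[0], right[0])
    hi = max(left[1], right[1])
    level = max(left[2], right[2]) + 1  # assumed: one more merge round
    # Both inputs are already sorted, so a single linear pass is enough.
    rows = tuple(heapq.merge(left[3], right[3]))
    return (lo, hi, level, rows)


part_m_n = (1, 3, 1, ((0, 10000), (5, 10005)))
part_n1 = (4, 4, 0, ((2, 10002),))
merged = merge_parts(part_m_n, part_n1)
```

The merged part covers the union of the two insert-number intervals, and the source parts stay untouched until they are no longer needed.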
Properties of Merge
› Each part participates in a single successful merge
› Source parts become inactive
› Additional logic can run during the merge
Things to do while merging
Replace/update records
› ReplacingMergeTree – replace
› SummingMergeTree – sum
› CollapsingMergeTree – fold
› VersionedCollapsingMergeTree – fold rows + versioning
Pre-aggregate data
› AggregatingMergeTree – merge aggregate function states
Metrics rollup
› GraphiteMergeTree – rollup in graphite fashion
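The extra logic applied to rows with the same key during a merge can be sketched in Python. This is a toy model only: rows are `(key, value)` pairs, `merge_with_fold`, `replacing`, and `summing` are invented names, and the real engines have more rules (for example version columns) than shown here.

```python
import heapq
from itertools import groupby


def merge_with_fold(parts, fold):
    """Merge PK-sorted parts of (key, value) rows, folding rows that share a key."""
    merged = heapq.merge(*parts, key=lambda row: row[0])
    out = []
    for key, group in groupby(merged, key=lambda row: row[0]):
        out.append((key, fold([value for _, value in group])))
    return out


def replacing(values):
    # Keep the value from the newest part; parts are passed oldest first and
    # heapq.merge preserves that order among rows with equal keys.
    return values[-1]


summing = sum

old_part = [(1, 10), (2, 20)]
new_part = [(2, 5), (3, 7)]
```

Here `merge_with_fold([old_part, new_part], summing)` plays the role of a SummingMergeTree-style merge, and passing `replacing` instead mimics a ReplacingMergeTree-style merge.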
Modify
Partitioning
ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY ...
› Logical entities (not stored on disk)
› A table can be partitioned by any expression (default: by month)
› Parts from different partitions are never merged
› MinMax index by partition columns
› Easy manipulation of partitions:
ALTER TABLE mt DROP PARTITION 201810
ALTER TABLE mt DETACH/ATTACH PARTITION 201809
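Why dropping a partition is cheap can be shown with a small Python sketch: every part belongs to exactly one partition, so the operation reduces to selecting whole parts. The helpers `to_yyyymm` and `drop_partition` are hypothetical and work on part names only, assuming the name starts with the partition ID.

```python
from datetime import date


def to_yyyymm(d: date) -> int:
    # Same arithmetic as the partition expression toYYYYMM(EventDate).
    return d.year * 100 + d.month


def drop_partition(part_names, partition_id):
    # Whole parts are removed; no rows inside other parts are rewritten.
    prefix = f"{partition_id}_"
    return [name for name in part_names if not name.startswith(prefix)]


part_names = ["201809_2_2_0", "201809_3_3_0", "201810_1_4_1"]
remaining = drop_partition(part_names, to_yyyymm(date(2018, 10, 15)))
```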
Mutations
ALTER TABLE mt DELETE WHERE OrderID < 1205;
ALTER TABLE mt UPDATE GoalNum = 3 WHERE BannerID = 235433;
Features
› NOT designed for regular usage
› Rewrite all touched parts on disk
› Run in the background
› Original parts become inactive
[Figure: parts drawn by primary key against insert number (M, N, N+1); over several slides a mutation is applied across the existing parts.]
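Since parts are immutable, a mutation has to produce rewritten copies of the parts it touches. A toy Python sketch of a DELETE under that assumption; `apply_delete` is a hypothetical name, parts are plain tuples of rows, and the background scheduling is left out.

```python
def apply_delete(active_parts, should_delete):
    """Return (new_active, inactive) after a toy ALTER ... DELETE WHERE."""
    new_active, inactive = [], []
    for part in active_parts:
        kept = tuple(row for row in part if not should_delete(row))
        if len(kept) == len(part):
            new_active.append(part)      # untouched parts stay as they are
        else:
            inactive.append(part)        # the original part is retired, not edited
            if kept:
                new_active.append(kept)  # a rewritten copy takes its place
    return new_active, inactive


# Rows are (OrderID, GoalNum); delete everything with OrderID < 1205.
parts = [((1000, 1), (1300, 2)), ((1500, 3),)]
active, inactive = apply_delete(parts, lambda row: row[0] < 1205)
```

Even a one-row change rewrites the whole part that contains it, which is why mutations are not meant for regular use.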
All together
Parts Lifetime
[Figure: animation over several slides of parts along the insert-number axis while a SELECT is running; new parts appear through INSERTs and are later combined by a Merge.]
Summary: MergeTree Consists of
Block
Column
Part
Partition
Table
Things to remember
Control total number of parts
› Rate of INSERTs
Merging runs in the background
› Even when there are no queries!
› With additional fold logic
Index is sparse
› Must fit into memory
› Determines order of data on disk
› Using the index is always beneficial
Partitions are logical entities
› Easy manipulation of portions of data
› Cannot improve SELECT performance
Thank you
Q&A
84 / 84

More Related Content

PDF
21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution
PDF
Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...
PDF
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
PPTX
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
PDF
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
PPTX
Replication and replica sets
PDF
MariaDB and Clickhouse Percona Live 2019 talk
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution
Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
Replication and replica sets
MariaDB and Clickhouse Percona Live 2019 talk
Efficient Data Storage for Analytics with Apache Parquet 2.0

What's hot (20)

PDF
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
PDF
Deep dive into PostgreSQL statistics.
PDF
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
PDF
Tiered storage intro. By Robert Hodges, Altinity CEO
PDF
ClickHouse materialized views - a secret weapon for high performance analytic...
PDF
ClickHouse Materialized Views: The Magic Continues
PDF
Creating Beautiful Dashboards with Grafana and ClickHouse
PDF
OSDC 2012 | Scaling with MongoDB by Ross Lawley
PDF
New features in ProxySQL 2.0 (updated to 2.0.9) by Rene Cannao (ProxySQL)
PDF
ClickHouse Features for Advanced Users, by Aleksei Milovidov
PDF
PostgreSQL Meetup Berlin at Zalando HQ
PDF
Your first ClickHouse data warehouse
PDF
Advanced Apache Cassandra Operations with JMX
PDF
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
PDF
Extra performance out of thin air
PDF
Cassandra introduction apache con 2014 budapest
PDF
Unified Data Platform, by Pauline Yeung of Cisco Systems
PDF
Troubleshooting PostgreSQL Streaming Replication
PDF
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
PPTX
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
Deep dive into PostgreSQL statistics.
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
Tiered storage intro. By Robert Hodges, Altinity CEO
ClickHouse materialized views - a secret weapon for high performance analytic...
ClickHouse Materialized Views: The Magic Continues
Creating Beautiful Dashboards with Grafana and ClickHouse
OSDC 2012 | Scaling with MongoDB by Ross Lawley
New features in ProxySQL 2.0 (updated to 2.0.9) by Rene Cannao (ProxySQL)
ClickHouse Features for Advanced Users, by Aleksei Milovidov
PostgreSQL Meetup Berlin at Zalando HQ
Your first ClickHouse data warehouse
Advanced Apache Cassandra Operations with JMX
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
Extra performance out of thin air
Cassandra introduction apache con 2014 budapest
Unified Data Platform, by Pauline Yeung of Cisco Systems
Troubleshooting PostgreSQL Streaming Replication
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
Ad

Similar to 21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage system (20)

PDF
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
PDF
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
PDF
ClickHouse Deep Dive, by Aleksei Milovidov
PPTX
High Performance, High Reliability Data Loading on ClickHouse
PPTX
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
PDF
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
PDF
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
PDF
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
PDF
Cache on Delivery
PPTX
Data Caching Evolution - the SafePeak deck from webcast 2014-04-24
PDF
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
PDF
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
PDF
2013 feb 20_thug_h_catalog
PDF
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
PDF
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
PDF
My first 90 days with ClickHouse.pdf
PDF
Dok Talks #133 - My First 90 days with Clickhouse
PPTX
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
PDF
Facebook hadoop-summit
 
PPTX
Building an Analytic Extension to MySQL with ClickHouse and Open Source
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
ClickHouse Deep Dive, by Aleksei Milovidov
High Performance, High Reliability Data Loading on ClickHouse
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Cache on Delivery
Data Caching Evolution - the SafePeak deck from webcast 2014-04-24
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
2013 feb 20_thug_h_catalog
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
My first 90 days with ClickHouse.pdf
Dok Talks #133 - My First 90 days with Clickhouse
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
Facebook hadoop-summit
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source
Ad

More from Athens Big Data (20)

PDF
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
PDF
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
PDF
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
PDF
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
PDF
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
PDF
19th Athens Big Data Meetup - 1st Talk - NLP understanding
PDF
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
PDF
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
PDF
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
PDF
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
PDF
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
PDF
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
PDF
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
PDF
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
PDF
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
PDF
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
PDF
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
PDF
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
PDF
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
PDF
7th Athens Big Data Meetup - 2nd Talk - Amazon Redshift Vs Google BigQuery
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
19th Athens Big Data Meetup - 1st Talk - NLP understanding
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
7th Athens Big Data Meetup - 2nd Talk - Amazon Redshift Vs Google BigQuery

Recently uploaded (20)

PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Big Data Technologies - Introduction.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Machine learning based COVID-19 study performance prediction
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Cloud computing and distributed systems.
PPT
Teaching material agriculture food technology
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
Empathic Computing: Creating Shared Understanding
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Big Data Technologies - Introduction.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Machine learning based COVID-19 study performance prediction
Digital-Transformation-Roadmap-for-Companies.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Programs and apps: productivity, graphics, security and other tools
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
sap open course for s4hana steps from ECC to s4
Cloud computing and distributed systems.
Teaching material agriculture food technology
NewMind AI Weekly Chronicles - August'25 Week I
Network Security Unit 5.pdf for BCA BBA.
Encapsulation_ Review paper, used for researhc scholars
20250228 LYD VKU AI Blended-Learning.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Electronic commerce courselecture one. Pdf
Empathic Computing: Creating Shared Understanding

21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage system

  • 2. Dive into ClickHouse Storage System Alexander Sapin, Software Engineer
  • 3. ClickHouse Storages External Table Engines: › File on FS › S3 › HDFS › MySQL › ... Internal Table Engines: › Memory › Log › StripeLog › MergeTree family 0 / 84
  • 4. ClickHouse Storages External Table Engines: › File on FS › S3 › HDFS › MySQL › ... Internal Table Engines: › Memory › Log › StripeLog › MergeTree family 1 / 84
  • 5. MergeTree Engines Family Advantages: › Inserts are atomic › Selects are blazing fast › Primary and secondary indexes › Data modification! › Inserts and selects don’t block each other 2 / 84
  • 6. MergeTree Engines Family Advantages: › Inserts are atomic › Selects are blazing fast › Primary and secondary indexes › Data modification! › Inserts and selects don’t block each other Features: › Infrequent INSERTs required 3 / 84
  • 7. MergeTree Engines Family Advantages: › Inserts are atomic › Selects are blazing fast › Primary and secondary indexes › Data modification! › Inserts and selects don’t block each other Features: › Infrequent INSERTs required (work in progress) 4 / 84
  • 8. MergeTree Engines Family Advantages: › Inserts are atomic › Selects are blazing fast › Primary and secondary indexes › Data modification! › Inserts and selects don’t block each other Features: › Infrequent INSERTs required (work in progress) › Background operations on records with the same keys › Primary key is NOT unique 5 / 84
  • 10. Create Table and Fill Some Data Create Table: CREATE TABLE mt ( EventDate Date, OrderID Int32, BannerID UInt64, GoalNum Int8 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID) 7 / 84
  • 11. Create Table and Fill Some Data Create Table: CREATE TABLE mt ( EventDate Date, OrderID Int32, BannerID UInt64, GoalNum Int8 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID) Fill Data (twice): INSERT INTO mt SELECT toDate('2018-09-26'), number, number + 10000, number % 128 from numbers(1000000); INSERT INTO mt SELECT toDate('2018-10-15'), number, number + 10000, number % 128 from numbers(1000000, 1000000); 8 / 84
  • 12. Table on Disk metadata: $ ls /var/lib/clickhouse/metadata/default/ mt.sql 9 / 84
  • 13. Table on Disk metadata: $ ls /var/lib/clickhouse/metadata/default/ mt.sql data: $ ls /var/lib/clickhouse/data/default/mt 201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1 detached format_version.txt 10 / 84
  • 14. Table on Disk metadata: $ ls /var/lib/clickhouse/metadata/default/ mt.sql data: $ ls /var/lib/clickhouse/data/default/mt 201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1 detached format_version.txt Contents: › Format file format_version.txt › Directories with parts › Directory for detached parts 11 / 84
  • 15. Parts: Details › Part of the data in PK order › Contain interval of insert numbers › Created for each INSERT › Cannot be changed (immutable) 12 / 84
  • 16. Parts: Why? PK ordering is needed for efficient OLAP queries › In our case (OrderID, BannerID) 13 / 84
  • 17. Parts: Why? PK ordering is needed for efficient OLAP queries › In our case (OrderID, BannerID) But the data comes in order of time › By EventDate 14 / 84
  • 18. Parts: Why? PK ordering is needed for efficient OLAP queries › In our case (OrderID, BannerID) But the data comes in order of time › By EventDate Re-sorting all the data is expensive › ClickHouse handle hundreds of terabytes 15 / 84
  • 19. Parts: Why? PK ordering is needed for efficient OLAP queries › In our case (OrderID, BannerID) But the data comes in order of time › By EventDate Re-sorting all the data is expensive › ClickHouse handle hundreds of terabytes Solution: Store data in a set of ordered parts! 16 / 84
  • 20. Parts: Main Idea M N Part on disk Primary key Insert number 17 / 84
  • 21. Parts: Main Idea Part on disk M N N+1 New batch Primary key Insert number 18 / 84
  • 22. Parts: Main Idea M N N+1 Part on disk Primary key Part on disk Insert number 19 / 84
  • 23. Parts: Atomic insert M N N+1 Part on disk Primary key Part on disk Insert number INSERT 20 / 84
  • 24. Parts: Atomic insert M N N+1 Part on disk Primary key Part on disk Insert number INSERT 21 / 84
  • 25. Parts: Atomic insert M N N+1 Part on disk Primary key Part on disk Insert number 22 / 84
  • 26. Read
  • 27. Parts: Data $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 GoalNum.bin GoalNum.mrk BannerID.bin ... primary.idx checksums.txt count.txt columns.txt partition.dat minmax_EventDate.idx 24 / 84
  • 28. Parts: Data $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 GoalNum.bin GoalNum.mrk BannerID.bin ... primary.idx checksums.txt count.txt columns.txt partition.dat minmax_EventDate.idx Contents: › primary.idx – primary key on disk 25 / 84
  • 29. Parts: Data $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 GoalNum.bin GoalNum.mrk BannerID.bin ... primary.idx checksums.txt count.txt columns.txt partition.dat minmax_EventDate.idx Contents: › primary.idx – primary key on disk › GoalNum.bin – compressed column › GoalNum.mrk – marks for column 26 / 84
  • 30. Parts: Data $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 GoalNum.bin GoalNum.mrk BannerID.bin ... primary.idx checksums.txt count.txt columns.txt partition.dat minmax_EventDate.idx Contents: › primary.idx – primary key on disk › GoalNum.bin – compressed column › GoalNum.mrk – marks for column › partition.dat – partition id 27 / 84
  • 31. Parts: Data $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 GoalNum.bin GoalNum.mrk BannerID.bin ... primary.idx checksums.txt count.txt columns.txt partition.dat minmax_EventDate.idx Contents: › primary.idx – primary key on disk › GoalNum.bin – compressed column › GoalNum.mrk – marks for column › partition.dat – partition id › ... – a lot of other useful files 28 / 84
  • 32. Index › Row-oriented › Sparse (each 8192 row) › Stored in memory › Uncompressed 0 10000 8192 16384 18192 1998848 2008848 primary.idx OrderID BannerID 26384 0. 1. 2. N. 29 / 84
  • 33. Columns › Each column in separate file › Compressed by blocks › Checksums for each block OrderID.bin 0 65795 131614 197433 Uncompressed block 14324 17214 17215 17216 17217 Checksums 30 / 84
  • 34. How to Use Index?
    Problem:
    › Index is sparse and contains rows
    › Columns contain compressed blocks
    How to match the index with the columns?
  • 36. Solution: Marks
    › Mark – offset in compressed file and offset in uncompressed block
    › Stored in column_name.mrk files
    › One for each index row
    [Figure: OrderID.mrk marks pointing into OrderID.bin; compressed block offsets 0, 65795, 131614, 197433; offsets inside the uncompressed block 0 and 32768]
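How a mark is derived can be sketched as follows. The compressed block offsets are taken from the OrderID.bin figure. Everything else is an assumption made for illustration: each compressed block is assumed to hold exactly 65536 uncompressed bytes (two granules of 8192 Int32 values), which is consistent with the offsets 0 and 32768 in the figure but is not stated on the slides, and build_marks is a hypothetical helper.

```python
GRANULE_ROWS = 8192   # one mark per index row
VALUE_SIZE = 4        # OrderID is Int32
BLOCK_BYTES = 65536   # ASSUMPTION: uncompressed bytes held by each compressed block


def build_marks(compressed_block_offsets, total_rows):
    """Return one (offset in compressed file, offset in uncompressed block) per granule."""
    marks = []
    for first_row in range(0, total_rows, GRANULE_ROWS):
        byte_pos = first_row * VALUE_SIZE  # position in the uncompressed column
        block_number, offset_in_block = divmod(byte_pos, BLOCK_BYTES)
        marks.append((compressed_block_offsets[block_number], offset_in_block))
    return marks


# Compressed block offsets from the OrderID.bin figure.
block_offsets = [0, 65795, 131614, 197433]
marks = build_marks(block_offsets, total_rows=8 * GRANULE_ROWS)

for mark_number, mark in enumerate(marks):
    print(mark_number, mark)
```

To read granule k, a reader seeks to the first offset of mark k in the .bin file, decompresses that block, and skips the second offset inside the decompressed bytes.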
  • 37. Put it all together
    Algorithm:
    › Determine required index rows
    › Find corresponding marks
    › Distribute granules (stripes of marks) among threads
    › Read required granules
    Properties:
    › Granules are read concurrently
    › Threads can steal tasks
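The first and third steps of the algorithm can be sketched in Python for a condition such as OrderID BETWEEN 6123 AND 17345. This is a simplified illustration under stated assumptions: the index is reduced to the first OrderID of each granule, the function names select_granules and distribute are hypothetical, and work stealing between threads is left out.

```python
from bisect import bisect_left, bisect_right

# First OrderID of each granule, i.e. the first column of primary.idx.
index_keys = [0, 8192, 16384, 24576, 32768]


def select_granules(index_keys, low, high):
    """Return the numbers of all granules that may contain keys in [low, high]."""
    # The granule before the first index key >= low may still hold `low`.
    first = max(bisect_left(index_keys, low) - 1, 0)
    # Every granule whose first key is <= high may hold matching rows.
    last = bisect_right(index_keys, high)
    return list(range(first, last))


def distribute(granules, n_threads):
    """Split granules into contiguous stripes, at most one stripe per thread."""
    if not granules:
        return []
    chunk = -(-len(granules) // n_threads)  # ceiling division
    return [granules[i:i + chunk] for i in range(0, len(granules), chunk)]


granules = select_granules(index_keys, 6123, 17345)
print(granules)                 # granules to read
print(distribute(granules, 2))  # stripes for thread 1 and thread 2
```

Because the index is sparse, the selection is conservative: whole granules are read and rows outside the range are filtered out afterwards.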
  • 38. Read Data from Disk
    SELECT any(EventDate), max(GoalNum) FROM mt
    WHERE OrderID BETWEEN 6123 AND 17345
    [Figure: KeyCondition is checked against primary.idx (OrderID, BannerID: (0, 10000), (8192, 18192), (16384, 26384), ..., (1998848, 2008848)); the columns EventDate (2 bytes per value), OrderID (4 bytes) and GoalNum (1 byte) are located through their .mrk files and read from their .bin files by thread 1 and thread 2]
  • 51. Problem: Number of Files in Parts
    Test example: Almost OK
    $ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 | wc -l
    14
    Production: Too many
    $ ssh -A root@mtgiga075-2.metrika.yandex.net
    # ls /var/lib/clickhouse/data/merge/visits_v2/202002_4462_4462_0 | wc -l
    1556
  • 52. Solution: Merges [Figure: parts [M,N] and [N+1], drawn over the axes "Primary key" and "Insert number" (ticks M, N, N+1)]
  • 53. Solution: Merges [Figure: a background merge combines parts [M,N] and [N+1]]
  • 54. Solution: Merges [Figure: a single part [M,N+1] covers the whole insert range]
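A merge of two sorted parts can be sketched in Python. This is an illustration only: it assumes the part directory naming partition_minblock_maxblock_level (as in 201810_1_4_1), that the merged part gets a level one higher than its sources, and that partition ids contain no underscore. The slides do not spell out these conventions, and the helper names are hypothetical.

```python
import heapq


def parse_part_name(name):
    """Split 'partition_minblock_maxblock_level' into its fields (assumed naming)."""
    partition, min_block, max_block, level = name.split("_")
    return partition, int(min_block), int(max_block), int(level)


def merge_parts(left, right):
    """Merge two (name, sorted_rows) parts into one new sorted part."""
    (left_name, left_rows), (right_name, right_rows) = left, right
    partition, l_min, l_max, l_level = parse_part_name(left_name)
    r_partition, r_min, r_max, r_level = parse_part_name(right_name)
    if partition != r_partition:
        raise ValueError("parts from different partitions are never merged")
    rows = list(heapq.merge(left_rows, right_rows))  # inputs are sorted by key
    name = (f"{partition}_{min(l_min, r_min)}_{max(l_max, r_max)}"
            f"_{max(l_level, r_level) + 1}")
    return name, rows


# Two inserts into the same partition, each sorted by (OrderID, BannerID).
part_a = ("201810_1_1_0", [(0, 10000), (2, 10002)])
part_b = ("201810_2_2_0", [(0, 10000), (1, 10001)])
merged_name, merged_rows = merge_parts(part_a, part_b)
print(merged_name)
print(merged_rows)
```

The merged part keeps both rows with key (0, 10000): a plain MergeTree merge preserves order but does not deduplicate, which is why the primary key is not unique.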
  • 55. Properties of Merge
    › Each part participates in a single successful merge
    › Source parts become inactive
    › Additional logic during merge
  • 56. Things to do while merging
    Replace/update records
    › ReplacingMergeTree – replace
    › SummingMergeTree – sum
    › CollapsingMergeTree – fold
    › VersionedCollapsingMergeTree – fold rows + versioning
    Pre-aggregate data
    › AggregatingMergeTree – merge aggregate function states
    Metrics rollup
    › GraphiteMergeTree – rollup in graphite fashion
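The extra logic applied during a merge can be sketched for the two simplest engines. This is a rough Python analogy, not the engines' real behavior in every detail: rows are (key, value) pairs already sorted by key, "last" simply means last in the merged order, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def replacing_merge(sorted_rows):
    """Keep only the last row for each key (ReplacingMergeTree-like)."""
    return [list(group)[-1]
            for _, group in groupby(sorted_rows, key=itemgetter(0))]


def summing_merge(sorted_rows):
    """Sum the value column for each key (SummingMergeTree-like)."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_rows, key=itemgetter(0))]


rows = [(1, 10), (1, 20), (2, 5), (3, 7), (3, 1)]  # (key, value), sorted by key
print(replacing_merge(rows))
print(summing_merge(rows))
```

Since this logic only runs when parts are merged, rows with the same key that still sit in different parts remain separate until a later merge brings them together.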
  • 60. Partitioning
    ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY ...
    › Logical entities (not stored on disk)
    › Table can be partitioned by any expression (default: by month)
    › Parts from different partitions are never merged
    › MinMax index by partition columns
    › Easy manipulation of partitions:
    ALTER TABLE mt DROP PARTITION 201810
    ALTER TABLE mt DETACH/ATTACH PARTITION 201809
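The role of the partition expression can be sketched in Python. This is an analogy under stated assumptions: to_yyyymm is a stand-in for toYYYYMM(), rows are reduced to (EventDate, OrderID) pairs, and a dictionary entry stands for the set of parts that belong to one partition.

```python
from collections import defaultdict
from datetime import date


def to_yyyymm(d):
    """Python stand-in for the toYYYYMM() partition expression."""
    return d.year * 100 + d.month


def partition_rows(rows):
    """Group (EventDate, OrderID) rows by partition id."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[to_yyyymm(row[0])].append(row)
    return dict(partitions)


rows = [
    (date(2018, 9, 26), 1),
    (date(2018, 10, 15), 2),
    (date(2018, 10, 16), 3),
]
partitions = partition_rows(rows)

# ALTER TABLE mt DROP PARTITION 201810 discards everything under that id at once.
partitions.pop(201810)
print(sorted(partitions))
```

Dropping a partition touches no individual rows, which is what makes manipulating whole partitions cheap compared with deleting rows one by one.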
  • 61. Mutations
    ALTER TABLE mt DELETE WHERE OrderID < 1205;
    ALTER TABLE mt UPDATE GoalNum = 3 WHERE BannerID = 235433;
    Features
    › NOT designed for regular usage
    › Rewrite all touched parts on disk
    › Work in background
    › Original parts become inactive
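Why mutations are expensive can be sketched in Python: a part is never edited in place, so every touched part is written again in full. This is a simplified illustration with a hypothetical helper mutate_part; rows are plain dictionaries and a part is a list of rows.

```python
def mutate_part(rows, delete_where=None, update=None):
    """Return a rewritten copy of a part; the original list is left untouched."""
    new_rows = []
    for row in rows:
        if delete_where is not None and delete_where(row):
            continue               # ALTER TABLE ... DELETE WHERE ...
        if update is not None:
            row = update(row)      # ALTER TABLE ... UPDATE ... WHERE ...
        new_rows.append(row)
    return new_rows


old_part = [
    {"OrderID": 1000, "BannerID": 11000, "GoalNum": 1},
    {"OrderID": 1205, "BannerID": 235433, "GoalNum": 7},
    {"OrderID": 2000, "BannerID": 12000, "GoalNum": 9},
]

# ALTER TABLE mt DELETE WHERE OrderID < 1205
after_delete = mutate_part(old_part, delete_where=lambda r: r["OrderID"] < 1205)

# ALTER TABLE mt UPDATE GoalNum = 3 WHERE BannerID = 235433
after_update = mutate_part(
    after_delete,
    update=lambda r: {**r, "GoalNum": 3} if r["BannerID"] == 235433 else r,
)
print(after_update)
print(len(old_part))  # the original part still exists until it is cleaned up
```

Even a mutation that changes a single row copies every surviving row of the part, which matches the warning that mutations are not designed for regular usage.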
  • 63. Mutations [Figure: a mutation applied to parts, drawn over the axes "Primary key" and "Insert number" (ticks M, N, N+1)]
  • 78. Summary: MergeTree Consists of [Figure: building blocks labeled Block, Column, Part, Partition, Table]
  • 86. Things to remember
    Control total number of parts
    › Rate of INSERTs
    Merging runs in the background
    › Even when there are no queries!
    › With additional fold logic
    Index is sparse
    › Must fit into memory
    › Determines order of data on disk
    › Using the index is always beneficial
    Partitions are logical entities
    › Easy manipulation of portions of data
    › Cannot improve SELECT performance