21st Athens Big Data Meetup - 2nd Talk - Dive into ClickHouse storage system

Dive into ClickHouse
Storage System
Alexander Sapin, Software Engineer

ClickHouse Storages
External Table Engines:
› File on FS
› S3
› HDFS
› MySQL
› ...
Internal Table Engines:
› Memory
› Log
› StripeLog
› MergeTree family

MergeTree Engines Family
Advantages:
› Inserts are atomic
› Selects are blazing fast
› Primary and secondary indexes
› Data modification!
› Inserts and selects don’t block each other
Features:
› Infrequent INSERTs required (work in progress)
› Background operations on records with the same keys
› Primary key is NOT unique

Create Table and Fill Some Data
Create Table:
CREATE TABLE mt (
    EventDate Date,
    OrderID Int32,
    BannerID UInt64,
    GoalNum Int8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID);
Fill Data (twice):
INSERT INTO mt SELECT toDate('2018-09-26'),
    number, number + 10000, number % 128 FROM numbers(1000000);
INSERT INTO mt SELECT toDate('2018-10-15'),
    number, number + 10000, number % 128 FROM numbers(1000000, 1000000);

Table on Disk
metadata:
$ ls /var/lib/clickhouse/metadata/default/
mt.sql
data:
$ ls /var/lib/clickhouse/data/default/mt
201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1
detached format_version.txt
Contents:
› Format file format_version.txt
› Directories with parts
› Directory for detached parts

Parts: Details
› A portion of the data, sorted in PK order
› Contains an interval of insert numbers
› Created for each INSERT
› Cannot be changed (immutable)
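
The part directory names listed earlier (for example 201810_1_4_1) can be read as partition ID, first insert number, last insert number and merge level. A minimal Python sketch of that reading follows. The PartName type and its field names are illustrative labels, not ClickHouse identifiers, and names that carry an extra mutation suffix are not handled.

```python
from typing import NamedTuple


class PartName(NamedTuple):
    partition_id: str
    min_block: int   # first insert number covered by the part
    max_block: int   # last insert number covered by the part
    level: int       # 0 for a freshly inserted part, higher after merges


def parse_part_name(name: str) -> PartName:
    # Split from the right so that a partition ID containing "_" stays intact.
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return PartName(partition_id, int(min_block), int(max_block), int(level))


listing = ["201809_2_2_0", "201809_3_3_0", "201810_1_1_0",
           "201810_4_4_0", "201810_1_4_1"]
parsed = [parse_part_name(name) for name in listing]
merged = parse_part_name("201810_1_4_1")
```

Read this way, 201810_1_4_1 covers insert numbers 1 to 4 of partition 201810 at level 1, which is consistent with it being the result of merging 201810_1_1_0 and 201810_4_4_0.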

Parts: Why?
PK ordering is needed for efficient OLAP queries
› In our case (OrderID, BannerID)
But the data comes in order of time
› By EventDate
Re-sorting all the data is expensive
› ClickHouse handles hundreds of terabytes
Solution: Store data in a set of ordered parts!
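
The idea of keeping a set of ordered parts instead of one globally sorted file can be sketched in a few lines of Python. This is an in-memory toy model under stated assumptions: the PartSet class and its attributes are hypothetical names, and rows are reduced to the (OrderID, BannerID) key from the example table.

```python
from typing import List, Tuple

Row = Tuple[int, int]   # (OrderID, BannerID), the primary key of the example table


class PartSet:
    """Toy model: every insert becomes its own sorted, immutable part."""

    def __init__(self) -> None:
        self.parts: List[Tuple[int, Tuple[Row, ...]]] = []
        self.next_insert_number = 1

    def insert(self, batch: List[Row]) -> None:
        # Only the new batch is sorted; existing parts are never rewritten.
        part = tuple(sorted(batch))
        self.parts.append((self.next_insert_number, part))
        self.next_insert_number += 1


table = PartSet()
table.insert([(5, 10005), (1, 10001), (3, 10003)])
table.insert([(4, 10004), (2, 10002)])
```

Each insert costs a sort of the incoming batch only, regardless of how much data is already stored.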

Parts: Main Idea
[Figure (animation): parts on disk drawn over two axes, primary key and insert number. One part on disk covers insert numbers M to N; a new batch N+1 arrives and becomes a second part on disk.]

Parts: Atomic insert
[Figure (animation): the same primary key / insert number axes with the parts for M..N and N+1 on disk, and an INSERT shown next to them while it is in progress.]
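
The slides do not say how atomicity is achieved. One common technique, shown here only as an assumption and not as a description of ClickHouse internals, is to write every file of the new part into a temporary directory and then publish it with a single rename. The tmp_insert_ prefix and both function names are hypothetical.

```python
import os
import tempfile


def write_part_atomically(table_dir, part_name, files):
    """Write all files of a part into a temporary directory, then publish it with one rename."""
    tmp_dir = os.path.join(table_dir, "tmp_insert_" + part_name)   # hypothetical prefix
    final_dir = os.path.join(table_dir, part_name)
    os.mkdir(tmp_dir)
    for file_name, payload in files.items():
        with open(os.path.join(tmp_dir, file_name), "wb") as f:
            f.write(payload)
    # A rename within one POSIX file system is atomic: readers see no part or the whole part.
    os.rename(tmp_dir, final_dir)
    return final_dir


def visible_parts(table_dir):
    # Readers ignore directories that are still being written.
    return sorted(name for name in os.listdir(table_dir) if not name.startswith("tmp_"))


with tempfile.TemporaryDirectory() as table_dir:
    write_part_atomically(table_dir, "201810_5_5_0", {"count.txt": b"1000000"})
    parts_after_insert = visible_parts(table_dir)
```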

Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
› primary.idx – primary key on disk
› GoalNum.bin – compressed column
› GoalNum.mrk – marks for column
› partition.dat – partition id
› ... – a lot of other useful files

Index
› Row-oriented
› Sparse (one entry for every 8192 rows)
› Stored in memory
› Uncompressed
primary.idx
Mark  OrderID  BannerID
0.    0        10000
1.    8192     18192
2.    16384    26384
...
N.    1998848  2008848
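
A sparse index of this kind can be sketched in Python by keeping the key of every 8192nd row of the sorted data. The sketch treats the 2,000,000 example keys as one sorted run, which is a simplification, and build_sparse_index is an illustrative name.

```python
INDEX_GRANULARITY = 8192   # one index entry per 8192 rows, as on the slide


def build_sparse_index(sorted_rows, granularity=INDEX_GRANULARITY):
    """Keep only the first row of every granule."""
    return [sorted_rows[i] for i in range(0, len(sorted_rows), granularity)]


# The example INSERTs generate OrderID = n and BannerID = n + 10000.
order_ids = range(2_000_000)   # a range is an indexable, already sorted sequence
index = [(order_id, order_id + 10000) for order_id in build_sparse_index(order_ids)]
```

The first entries are (0, 10000), (8192, 18192) and (16384, 26384), and the last one is (1998848, 2008848), matching the values in the table above. Two million rows need only 245 index entries, which is why the index can be kept in memory uncompressed.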

Columns
› Each column in separate file
› Compressed by blocks
› Checksums for each block
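[Figure: OrderID.bin drawn as a sequence of compressed blocks starting at file offsets 0, 65795, 131614 and 197433. Each block has its own checksum and decompresses into an uncompressed block of column values.]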

How to Use Index?
Problem:
› Index is sparse and contains rows
› Columns contain compressed blocks
How to match the index with the columns?

Solution: Marks
› Mark – a pair of offsets: into the compressed file and into the uncompressed block
› Stored in column_name.mrk files
› One for each index row
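[Figure: OrderID.mrk next to OrderID.bin. Compressed blocks start at file offsets 0, 65795, 131614 and 197433. Each mark is a pair such as (65795, 0) or (65795, 32768): the offset of the compressed block in the file and the offset inside the uncompressed block, with one mark per 8192 rows.]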
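
The relation between marks and compressed blocks can be sketched in Python. The sketch makes several assumptions that are not from the slides: zlib stands in for the real codec, each block is prefixed with its compressed size, and a block is closed once it holds at least 65536 uncompressed bytes. All function names are illustrative.

```python
import struct
import zlib

GRANULE_ROWS = 8192   # rows between two marks, as on the slides
BLOCK_BYTES = 65536   # assumed minimum size of an uncompressed block


def _frame(block):
    # Prefix the compressed block with its size so a reader can find its end.
    compressed = zlib.compress(bytes(block))
    return struct.pack("<I", len(compressed)) + compressed


def write_column(values):
    """Return (file bytes, marks) for a column of Int32 values."""
    data, marks, block = bytearray(), [], bytearray()
    for start in range(0, len(values), GRANULE_ROWS):
        if len(block) >= BLOCK_BYTES:      # close a full block before the next granule
            data += _frame(block)
            block = bytearray()
        # Mark = (offset of the block in the file, offset inside the uncompressed block).
        marks.append((len(data), len(block)))
        granule = values[start:start + GRANULE_ROWS]
        block += struct.pack(f"<{len(granule)}i", *granule)
    if block:
        data += _frame(block)
    return bytes(data), marks


def read_granule(data, mark):
    block_offset, offset_in_block = mark
    (size,) = struct.unpack_from("<I", data, block_offset)
    raw = zlib.decompress(data[block_offset + 4:block_offset + 4 + size])
    chunk = raw[offset_in_block:offset_in_block + GRANULE_ROWS * 4]
    return list(struct.unpack(f"<{len(chunk) // 4}i", chunk))


values = list(range(40000))
data, marks = write_column(values)
```

With 4-byte values a granule takes 32768 uncompressed bytes, so two granules share one block and the marks come in pairs of the form (offset, 0) and (offset, 32768), the same pattern as in the figure.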

Put it all together
Algorithm:
› Determine required index rows
› Find corresponding marks
› Distribute granules (stripes of marks) among threads
› Read required granules
Properties:
› Read granules concurrently
› Threads can steal tasks
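
The distribution step can be sketched in Python as follows. This is a simplified model, not ClickHouse's scheduler: each thread starts with a contiguous stripe of granules and takes work from another thread's queue once its own is empty. The function and parameter names are hypothetical.

```python
import threading
from collections import deque


def read_granules_concurrently(granules, read_granule, num_threads=2):
    # Give each thread a contiguous stripe of granules to start with.
    queues = [deque() for _ in range(num_threads)]
    stripe = -(-len(granules) // num_threads)   # ceiling division
    for i, granule in enumerate(granules):
        queues[i // stripe].append(granule)
    lock = threading.Lock()
    results = {}

    def next_task(me):
        with lock:
            if queues[me]:
                return queues[me].popleft()
            for other in queues:                # own queue is empty: steal from the tail
                if other:
                    return other.pop()
        return None

    def worker(me):
        while (granule := next_task(me)) is not None:
            results[granule] = read_granule(granule)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[granule] for granule in granules]


first_rows = read_granules_concurrently(list(range(10)), lambda g: g * 8192, num_threads=3)
```

Stealing keeps all threads busy when some granules take longer to read than others.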

Read Data from Disk
SELECT any(EventDate), max(GoalNum) FROM mt
WHERE OrderID BETWEEN 6123 AND 17345
[Figure (animation): KeyCondition is evaluated against primary.idx (OrderID, BannerID entries 0/10000, 8192/18192, 16384/26384, ..., 1998848/2008848) to pick the required granules; thread 1 and thread 2 then read them through the .mrk and .bin files of EventDate (2 bytes per value), OrderID (4 bytes) and GoalNum (1 byte).]
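
For the condition OrderID BETWEEN 6123 AND 17345, the choice of granules can be sketched in Python as a binary search over the first key column of the sparse index. The helper name granules_for_range is illustrative, and only the OrderID column of the index is modelled.

```python
from bisect import bisect_left, bisect_right


def granules_for_range(index_keys, low, high):
    """Return the half-open range [first, last) of granules that may hold keys in [low, high]."""
    # The granule before the first index key >= low may still contain `low`
    # (the primary key is not unique, so this stays conservative).
    first = max(bisect_left(index_keys, low) - 1, 0)
    # Granules whose first key is greater than `high` cannot match.
    last = bisect_right(index_keys, high)
    return first, last


index_order_ids = list(range(0, 2_000_000, 8192))   # OrderID of every 8192nd row
first, last = granules_for_range(index_order_ids, 6123, 17345)
```

The result is granules 0, 1 and 2, that is rows 0 to 24575: three granules are read to answer a query that matches 11223 rows, instead of scanning all two million.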

Problem: Number of Files in Parts
Test example: Almost OK
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 | wc -l
14
Production: Too many
$ ssh -A root@mtgiga075-2.metrika.yandex.net
# ls /var/lib/clickhouse/data/merge/visits_v2/202002_4462_4462_0 | wc -l
1556

Solution: Merges
[Figure (animation): on the primary key / insert number axes, part [M,N] and part [N+1] are combined by a background merge into a single part [M,N+1].]

Properties of Merge
› Each part participates in a single successful merge
› Source parts become inactive
› Additional logic during merge
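
Because every part is already sorted by primary key, a merge is a k-way merge of sorted runs rather than a full re-sort. A minimal Python sketch follows; the Part dataclass is a hypothetical stand-in for a part directory, and the level rule (highest source level plus one) is an assumption chosen to match the directory listing shown earlier.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Part:
    min_block: int           # first insert number covered
    max_block: int           # last insert number covered
    level: int               # 0 for a freshly inserted part
    rows: Tuple[tuple, ...]  # rows sorted by primary key


def merge_parts(parts: List[Part]) -> Part:
    parts = sorted(parts, key=lambda p: p.min_block)
    # heapq.merge streams the sorted inputs without sorting them again.
    rows = tuple(heapq.merge(*(p.rows for p in parts)))
    return Part(parts[0].min_block, parts[-1].max_block,
                max(p.level for p in parts) + 1, rows)


a = Part(1, 1, 0, ((1, "a"), (5, "b")))
b = Part(4, 4, 0, ((2, "c"), (5, "a")))
merged_part = merge_parts([b, a])
```

The source parts are left untouched: the merge produces a new immutable part, after which the sources can be marked inactive.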

Things to do while merging
Replace/update records
› ReplacingMergeTree – replace
› SummingMergeTree – sum
› CollapsingMergeTree – fold
› VersionedCollapsingMergeTree – fold rows + versioning
Pre-aggregate data
› AggregatingMergeTree – merge aggregate function states
Metrics rollup
› GraphiteMergeTree – rollup in graphite fashion
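
The extra logic applied while merging can be sketched in Python for two of the engines listed above. This is a simplified model with (key, value) rows: merge_replacing keeps the most recently inserted row per key and merge_summing adds up the values per key. Both function names are illustrative, and version columns and multi-column sums are left out.

```python
import heapq
from itertools import groupby


def _merge_by_key(parts):
    # Tag rows with the index of their part (oldest first) so that,
    # within one key, rows from newer parts come last.
    tagged = ([(key, n, value) for key, value in part] for n, part in enumerate(parts))
    return heapq.merge(*tagged, key=lambda row: (row[0], row[1]))


def merge_replacing(*parts):
    """Keep only the most recently inserted row for each key."""
    return [(key, list(group)[-1][2])
            for key, group in groupby(_merge_by_key(parts), key=lambda row: row[0])]


def merge_summing(*parts):
    """Replace all rows with the same key by one row holding the sum."""
    return [(key, sum(value for _, _, value in group))
            for key, group in groupby(_merge_by_key(parts), key=lambda row: row[0])]


old_part = [(1, 10), (2, 20)]
new_part = [(2, 25), (3, 30)]
replaced = merge_replacing(old_part, new_part)
summed = merge_summing(old_part, new_part)
```

Rows are folded only when their parts are merged, so until a merge happens both versions of key 2 remain visible to queries.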

Partitioning
ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY ...
› Logical entities (not stored on disk)
› Table can be partitioned by any expression (default: by month)
› Parts from different partitions are never merged
› MinMax index by partition columns
› Easy manipulation of partitions:
ALTER TABLE mt DROP PARTITION 201810
ALTER TABLE mt DETACH/ATTACH PARTITION 201809
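
The partition expression and the MinMax check can be sketched in Python. to_yyyymm mirrors what toYYYYMM computes for a date, while part_may_match is a hypothetical helper showing how a per-part minimum and maximum of EventDate lets whole parts be skipped.

```python
from datetime import date


def to_yyyymm(d: date) -> int:
    # 2018-09-26 -> 201809, the partition ID seen in the part directory names.
    return d.year * 100 + d.month


def part_may_match(part_min: date, part_max: date,
                   query_from: date, query_to: date) -> bool:
    """A part can be skipped when its [min, max] date range misses the query range."""
    return part_min <= query_to and query_from <= part_max


september = to_yyyymm(date(2018, 9, 26))
october = to_yyyymm(date(2018, 10, 15))
skip_check = part_may_match(date(2018, 9, 26), date(2018, 9, 26),
                            date(2018, 10, 1), date(2018, 10, 31))
```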

Mutations
ALTER TABLE mt DELETE WHERE OrderID < 1205
ALTER TABLE mt UPDATE GoalNum = 3 WHERE BannerID = 235433;
Features
› NOT designed for regular usage
› Overwrite all touched parts on disk
› Work in the background
› Original parts become inactive
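
Since parts are immutable, a mutation cannot edit rows in place. A rough Python sketch of the DELETE case, under stated assumptions: parts are modelled as a dict of row tuples, apply_delete is a hypothetical name, and the _mut suffix is invented for illustration and is not the real naming scheme.

```python
def apply_delete(parts, should_delete):
    """Rewrite every part that contains a matching row; leave the others alone."""
    active, inactive = {}, {}
    for name, rows in parts.items():
        kept = tuple(row for row in rows if not should_delete(row))
        if len(kept) == len(rows):
            active[name] = rows              # untouched part stays as it is
        else:
            inactive[name] = rows            # original part becomes inactive
            active[name + "_mut"] = kept     # hypothetical name for the rewritten part
    return active, inactive


# Rows are (OrderID, GoalNum); the predicate mirrors DELETE WHERE OrderID < 1205.
parts = {"p1": ((1000, 1), (1300, 2)), "p2": ((1500, 3),)}
active, inactive = apply_delete(parts, lambda row: row[0] < 1205)
```

Even a single matching row forces the whole part to be written again, which is why mutations are not meant for regular use.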

[Figure (animation): parts for insert numbers M..N and N+1 on the primary key / insert number axes, with a Mutation applied to them.]

Parts Lifetime
[Figure (animation): parts along the insert number axis over the lifetime of a SELECT, with concurrent INSERTs and a Merge.]

Summary: MergeTree Consists of
Block
Column
Part
Partition
Table

Things to remember
Control total number of parts
› Rate of INSERTs
Merging runs in the background
› Even when there are no queries!
› With additional fold logic
Index is sparse
› Must fit into memory
› Determines order of data on disk
› Using the index is always beneficial
Partitions are logical entities
› Easy manipulation of portions of data
› Cannot improve SELECT performance
