A Fast Intro to Fast Query with ClickHouse
Robert Hodges, Altinity CEO
Altinity Background
● Premier provider of software and services for ClickHouse
● Incorporated in UK with distributed team in US/Canada/Europe
● Main US/Europe sponsor of ClickHouse community
● Offerings:
○ Enterprise support for ClickHouse and ecosystem projects
○ Software (Kubernetes, cluster manager, tools & utilities)
○ POCs/Training
The shape of data has changed
Business insights are hidden in massive pools of automatically collected information
Applications that rule the digital era have a common success factor
The ability to discover and apply business-critical insights from petabyte datasets in real time
Let’s consider a concrete example
Web properties track clickstreams to:
● Calculate clickthrough/buy rates
● Guide ad placement
● Optimize eCommerce services
Constraints:
● Run on commodity hardware
● Simple to operate
● Fast interactive query
● Avoid encumbering licenses
Existing analytic databases do not meet requirements fully
● Cloud-native data warehouses cannot operate on-prem, limiting range of solutions
● Legacy SQL databases are expensive to run, scale poorly on commodity hardware, and adapt slowly
● Hadoop/Spark ecosystem solutions are resource intensive with slow response and complex pipelines
● Specialized solutions limit query domain and are complex/resource-inefficient for general use
ClickHouse fills the gaps and does much more besides
Understands SQL
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is open source (Apache 2.0)
Is WAY fast!
[Diagram: a table with columns Id, a, b, c, d, illustrating column storage]
What does “WAY fast” mean?
SELECT Dest d, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime GROUP BY d HAVING c > 100000
ORDER BY ad DESC limit 5
┌─d───┬───────c─┬─────────────────ad─┐
│ EWR │ 3660570 │ 17.637564095209218 │
│ SFO │ 4056003 │ 16.029478528492213 │
│ JFK │ 2198078 │  15.33669824273752 │
│ LGA │ 3133582 │ 14.533851994299177 │
│ ORD │ 9108159 │ 14.431460737565077 │
└─────┴─────────┴────────────────────┘
5 rows in set. Elapsed: 1.182 sec. Processed 173.82 million rows, 2.78 GB (147.02 million rows/s., 2.35 GB/s.)
(Amazon m5d.2xlarge: Xeon(R) Platinum 8175M, 8vCPU, 30GB RAM, NVMe SSD)
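The throughput figures in the summary line follow directly from the row count, byte count, and elapsed time. A small sketch of that arithmetic, using only the numbers printed on the slide (the variable names are illustrative):

```python
# Figures copied from the clickhouse-client summary line on the slide.
rows_processed = 173.82e6   # "Processed 173.82 million rows"
bytes_processed = 2.78e9    # "2.78 GB"
elapsed_sec = 1.182         # "Elapsed: 1.182 sec."

rows_per_sec = rows_processed / elapsed_sec
bytes_per_sec = bytes_processed / elapsed_sec
bytes_per_row = bytes_processed / rows_processed

print(f"{rows_per_sec / 1e6:.1f} million rows/s")
print(f"{bytes_per_sec / 1e9:.2f} GB/s")
print(f"{bytes_per_row:.1f} bytes read per row")
```

The recomputed row rate differs slightly from the reported 147.02 million rows/s, presumably because the displayed inputs are rounded. Roughly 16 bytes read per row is consistent with a column store that reads only the columns the query references.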
What are the main ClickHouse use patterns?
● Fast, scalable data warehouse for online services (SaaS and in-house apps)
● Built-in data warehouse for installed analytic applications
● Exploration -- throw in a bunch of data and go crazy!
Getting started is easy with the Docker image
$ docker run -d --name ch-s yandex/clickhouse-server
$ docker exec -it ch-s clickhouse client
...
11e99303c78e :) select version()
SELECT version()
┌─version()─┐
│ 19.3.3    │
└───────────┘
1 rows in set. Elapsed: 0.001 sec.
Or install recommended Altinity stable version packages
$ sudo apt -y install clickhouse-client=18.16.1 \
    clickhouse-server=18.16.1 \
    clickhouse-common-static=18.16.1
...
$ sudo systemctl start clickhouse-server
...
11e99303c78e :) select version()
SELECT version()
┌─version()─┐
│ 18.16.1   │
└───────────┘
1 rows in set. Elapsed: 0.001 sec.
Examples of table creation and data insertion
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree() PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:29:55', 18.0),
(15, 'TEMP', '2018-01-01', '2018-01-01 23:30:56', 18.7)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:31:53', 18.1),
(2, 'TEMP', '2018-01-01', '2018-01-01 23:31:55', 7.9)
Loading data from CSV files
cat > sdata.csv <<END
DevId,Type,MDate,MDatetime,Value
59,"TEMP","2018-02-01","2018-02-01 01:10:13",19.5
59,"TEMP","2018-02-01","2018-02-01 02:10:01",18.8
59,"TEMP","2018-02-01","2018-02-01 03:09:58",18.6
59,"TEMP","2018-02-01","2018-02-01 04:10:05",15.1
59,"TEMP","2018-02-01","2018-02-01 05:10:31",12.2
59,"TEMP","2018-02-01","2018-02-01 06:10:02",11.8
59,"TEMP","2018-02-01","2018-02-01 07:09:55",10.9
END
cat sdata.csv | clickhouse-client --database foo \
    --query='INSERT INTO sdata FORMAT CSVWithNames'
Select results can be surprising!
SELECT *
FROM sdata
WHERE
DevId < 20
Result right after INSERT:
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │    18 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │  18.7 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│     2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │   7.9 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │  18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
Result somewhat later:
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│     2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │   7.9 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │    18 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │  18.7 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │  18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
Time for some research into table engines
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
● ENGINE: how to manage data and handle queries
● PARTITION BY: how to break table into parts
● ORDER BY: how to index and sort data in each part
MergeTree writes parts quickly and merges them in the background
/var/lib/clickhouse/data/default/sdata
201801_1_1_0/
201801_2_2_0/
Multiple parts after initial insertion ( => very fast writes)
201801_1_2_1/
Single part after merge ( => very fast reads)
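The directory names above encode how parts relate to each other. A minimal sketch, assuming the usual MergeTree naming scheme of partition, minimum block number, maximum block number, and merge level; the Part class and merge function are illustrative helpers, not ClickHouse code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    partition: str   # e.g. "201801", from PARTITION BY toYYYYMM(MDate)
    min_block: int   # first insert block covered by the part
    max_block: int   # last insert block covered by the part
    level: int       # number of merge generations behind the part

    @classmethod
    def parse(cls, name):
        partition, min_block, max_block, level = name.rstrip("/").split("_")
        return cls(partition, int(min_block), int(max_block), int(level))

    @property
    def name(self):
        return f"{self.partition}_{self.min_block}_{self.max_block}_{self.level}"


def merge(parts):
    """Name the part that results from merging parts of one partition."""
    if len({p.partition for p in parts}) != 1:
        raise ValueError("parts from different partitions are not merged")
    return Part(
        partition=parts[0].partition,
        min_block=min(p.min_block for p in parts),
        max_block=max(p.max_block for p in parts),
        level=max(p.level for p in parts) + 1,
    )


merged = merge([Part.parse("201801_1_1_0/"), Part.parse("201801_2_2_0/")])
print(merged.name)  # 201801_1_2_1
```

Each INSERT creates its own small part, which is why writes are cheap; the merged part covers blocks 1 through 2 at level 1, matching the directory shown on the slide.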
Rows are indexed and sorted inside each part
/var/lib/clickhouse/data/default/sdata
[Diagram: parts 201801_1_2_1/ and 201802_1_1_0/. Each part holds a primary.idx file keyed on (DevId, MDatetime), plus one .mrk file and one .bin file per column (DevId, Type, MDate, MDatetime...). Sample (DevId, MDatetime) entries shown in the diagram:]
... ...
956 2018-01-01 15:22:37
575 2018-01-01 23:31:53
1300 2018-01-02 05:14:47
... ...
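The index inside a part is sparse: it stores one key per block of rows (a granule) rather than one key per row. A rough sketch of how such an index narrows a lookup, assuming rows sorted by (DevId, MDatetime); the data, the tiny granule size, and the function names are invented for illustration:

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # rows per granule; tiny for the demo (ClickHouse defaults to 8192)

# Rows of one part, already sorted by the ORDER BY key (DevId, MDatetime).
rows = sorted(
    (dev_id, f"2018-01-01 00:{minute:02d}:00")
    for dev_id in (2, 15, 59, 77)
    for minute in range(5)
)

# Sparse index: the key of the first row of every granule (the role of primary.idx).
index = [rows[i] for i in range(0, len(rows), GRANULE)]
dev_ids = [key[0] for key in index]


def granules_for_dev(dev_id):
    """Return the granule numbers that may contain rows for dev_id."""
    # The granule just before the first match may hold dev_id at its tail.
    first = max(bisect_left(dev_ids, dev_id) - 1, 0)
    last = bisect_right(dev_ids, dev_id)
    return range(first, last)


def select_dev(dev_id):
    """Scan only the candidate granules instead of the whole part."""
    hits = []
    for g in granules_for_dev(dev_id):
        for row in rows[g * GRANULE:(g + 1) * GRANULE]:
            if row[0] == dev_id:
                hits.append(row)
    return hits


print(list(granules_for_dev(15)), len(select_dev(15)))
```

In the real engine the .mrk files map each granule number to an offset in the column's .bin file, so only the selected granules are read from disk.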
Now we can follow how a query works on a single server
SELECT DevId, Type, avg(Value)
FROM sdata
WHERE MDate = '2018-01-01'
GROUP BY DevId, Type
Identify parts to search => Query in parallel => Aggregate results => Result Set
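The three stages of the single-server query can be sketched in a few lines. This is a simplified model, not the engine's implementation: the parts dictionary reuses rows from the earlier INSERT and CSV slides (MDatetime omitted), and the function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Parts keyed by part name; rows are (DevId, Type, MDate, Value).
parts = {
    "201801_1_1_0": [(15, "TEMP", "2018-01-01", 18.0), (15, "TEMP", "2018-01-01", 18.7)],
    "201801_2_2_0": [(2, "TEMP", "2018-01-01", 7.9), (15, "TEMP", "2018-01-01", 18.1)],
    "201802_1_1_0": [(59, "TEMP", "2018-02-01", 19.5), (59, "TEMP", "2018-02-01", 18.8)],
}


def identify_parts(mdate):
    """Step 1: keep only parts whose partition (YYYYMM) can hold mdate."""
    wanted = mdate[:4] + mdate[5:7]
    return [name for name in parts if name.split("_")[0] == wanted]


def scan_part(name, mdate):
    """Step 2: filter one part and build partial (sum, count) states per group."""
    state = defaultdict(lambda: [0.0, 0])
    for dev_id, type_, row_date, value in parts[name]:
        if row_date == mdate:
            state[(dev_id, type_)][0] += value
            state[(dev_id, type_)][1] += 1
    return state


def query(mdate):
    names = identify_parts(mdate)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: scan_part(n, mdate), names))
    # Step 3: merge the partial states, then finalize avg = sum / count.
    total = defaultdict(lambda: [0.0, 0])
    for state in partials:
        for key, (s, c) in state.items():
            total[key][0] += s
            total[key][1] += c
    return {key: s / c for key, (s, c) in sorted(total.items())}


result = query("2018-01-01")
print(result)
```

The February part is never opened because its partition cannot contain the requested date, and the two January parts are scanned independently before their partial states are combined.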
ClickHouse Distributed engine spreads queries across shards
SELECT ...
FROM
sdata_dist
[Diagram: three ClickHouse servers, each holding sdata_dist (Distributed) and sdata (MergeTree). The query on sdata_dist is sent to the sdata table on every server and the results are combined into one Result Set.]
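A toy model of the scatter-gather pattern may help here. It assumes rows are routed by DevId modulo the shard count (an invented sharding key) and that each shard returns a partial aggregate state which the initiating server merges; all names are illustrative.

```python
N_SHARDS = 3

# Each list stands in for one server's local sdata table.
shards = [[] for _ in range(N_SHARDS)]


def insert(rows):
    """Route each (DevId, Value) row to a shard; DevId % N_SHARDS is an assumed key."""
    for row in rows:
        dev_id = row[0]
        shards[dev_id % N_SHARDS].append(row)


def local_partial(shard_rows):
    """What one shard returns: a partial (sum, count) state, not a final avg."""
    values = [value for _, value in shard_rows]
    return sum(values), len(values)


def distributed_avg():
    """The initiator fans the query out and merges the partial states."""
    partials = [local_partial(rows) for rows in shards]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


insert([(15, 18.0), (15, 18.7), (15, 18.1), (2, 7.9)])
print(round(distributed_avg(), 3))
```

Merging (sum, count) states gives the true average of about 15.675 over all four rows. Averaging the two non-empty shards' own averages would give about 13.08, which is why shards return intermediate states rather than final values.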
ReplicatedMergeTree engine spreads data over shards and replicas
SELECT ...
FROM
sdata_dist
[Diagram: six ClickHouse servers, each holding sdata_dist and sdata (ReplicatedMergeTree engine), coordinated by three ZooKeeper nodes. The query on sdata_dist produces one Result Set.]
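With replicas in place, a distributed query needs only one live copy of each shard. A minimal sketch of that choice, assuming the six servers form three shards with two replicas each (the diagram does not state the split) and using invented host names:

```python
import random

# Hypothetical layout for the six servers: 3 shards x 2 replicas.
cluster = {
    "shard1": ["ch-1a", "ch-1b"],
    "shard2": ["ch-2a", "ch-2b"],
    "shard3": ["ch-3a", "ch-3b"],
}


def pick_replicas(cluster, down=frozenset(), rng=random):
    """Choose one live replica per shard to serve a distributed query."""
    chosen = {}
    for shard, replicas in cluster.items():
        live = [r for r in replicas if r not in down]
        if not live:
            raise RuntimeError(f"no live replica for {shard}")
        chosen[shard] = rng.choice(live)
    return chosen


plan = pick_replicas(cluster, down={"ch-2a"})
print(plan)
```

Random choice among live replicas is only one possible policy. The point of the sketch is that losing one server leaves every shard queryable as long as another replica of that shard is up.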
With basic engine knowledge you can now tune queries
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
Scans 355 table parts in parallel; does not use index
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
WHERE toYear(FlightDate) =
toYear(toDate('2016-01-01'))
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
Faster: scans 12 parts (3% of data) because FlightDate is partition key
Hint: clickhouse-server.log has the query plan
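The 12-of-355 figure can be reproduced with a little arithmetic on partition ids. The sketch below assumes monthly partitions with one part each, over a date range picked only because it yields 355 partitions; neither assumption is stated on the slide.

```python
def monthly_partitions(first, last):
    """Yield YYYYMM partition ids from first to last, inclusive."""
    year, month = divmod(first, 100)
    while year * 100 + month <= last:
        yield year * 100 + month
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Assumed range, chosen only so that the count matches the slide's 355 parts.
partitions = list(monthly_partitions(198710, 201704))

# toYear(FlightDate) = 2016 can only match the twelve partitions of 2016.
selected = [p for p in partitions if p // 100 == 2016]

print(len(partitions), len(selected), f"{len(selected) / len(partitions):.1%}")
# 355 12 3.4%
```

Skipping the other 343 partitions happens before any column data is read, which is where most of the speedup comes from.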
You can also optimize joins
SELECT
Dest d, Name n, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime
JOIN airports ON (airports.IATA = ontime.Dest)
GROUP BY d, n HAVING c > 100000 ORDER BY ad DESC
Joins on data before GROUP BY, increasing the amount to scan
SELECT dest, Name n, c AS flights, ad FROM (
SELECT Dest dest, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime
GROUP BY dest HAVING c > 100000
ORDER BY ad DESC
) LEFT JOIN airports ON airports.IATA = dest
Faster: subquery minimizes data scanned in parallel; joins on GROUP BY results
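The difference between the two join strategies comes down to how many rows pass through the join. A small sketch with invented miniature tables (the airport names and delay values are made up) that counts join lookups for each strategy:

```python
from collections import defaultdict

# Hypothetical miniature tables; column roles follow the slide's queries.
airports = {"EWR": "Newark Liberty", "SFO": "San Francisco Intl"}
ontime = [("EWR", 20.0), ("EWR", 10.0), ("SFO", 30.0), ("SFO", 5.0), ("EWR", 30.0)]


def join_then_group():
    """First query shape: join every fact row, then aggregate."""
    lookups = 0
    groups = defaultdict(list)
    for dest, delay in ontime:
        lookups += 1                      # one join lookup per fact row
        name = airports.get(dest)
        if name is None:                  # inner join drops unmatched rows
            continue
        groups[(dest, name)].append(delay)
    return {k: (len(v), sum(v) / len(v)) for k, v in groups.items()}, lookups


def group_then_join():
    """Second query shape: aggregate in a subquery, then join the few results."""
    groups = defaultdict(list)
    for dest, delay in ontime:
        groups[dest].append(delay)
    lookups = 0
    result = {}
    for dest, v in groups.items():
        lookups += 1                      # one join lookup per group
        result[(dest, airports.get(dest))] = (len(v), sum(v) / len(v))
    return result, lookups


r1, n1 = join_then_group()
r2, n2 = group_then_join()
print(r1 == r2, n1, n2)
```

Both shapes return the same aggregates here, but the second performs one lookup per destination instead of one per flight. On a table with hundreds of millions of rows and a few hundred destinations, that gap is what the slide's "Faster" label refers to.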
ClickHouse has a wealth of features to help queries go fast
Dictionaries
Materialized Views
Arrays
Specialized functions and SQL extensions
Lots more table engines
...And a nice set of supporting ecosystem tools
Client libraries: JDBC, ODBC, Python, Golang, ...
Kafka table engine to ingest from Kafka queues
Visualization tools: Grafana, Tableau, Tabix, Superset
Data science stack integration: Pandas, Jupyter Notebooks
Kubernetes ClickHouse operator
Where to get more information
● ClickHouse Docs: https://clickhouse.yandex/docs/en/
● Altinity Blog: https://www.altinity.com/blog
● Meetups and conference presentations
○ 2 April -- Madrid, Spain ClickHouse Meetup
○ 28-30 May -- Austin, TX Percona Live 2019
○ San Francisco ClickHouse Meetup
Questions?
Thank you!
Contacts: info@altinity.com
Visit us at: https://www.altinity.com
Read Our Blog: https://www.altinity.com/blog
