ClickHouse
Enabling interactive data exploration
Alexander Kuzmenkov, ClickHouse developer at Yandex
What is ClickHouse?
● RDBMS for analytics
○ SQL
○ FOSS
● Distributed
○ Cross-datacenter replication
○ Tolerant to a single-datacenter failure
● Linear scaling
○ 100s of servers
○ 100G-1000G records daily
● Blazing fast
○ interactive data exploration
These slides have a lot of links!
Download at https://clck.ru/MMWHA
History
● Yandex.Metrica (think Google Analytics)
○ 30B rows daily, 3PB total, 700 machines
○ Processing speed up to 2TB/s
● Used a MyISAM solution with pre-aggregation
○ Didn’t scale for the growing load
○ Could only build pre-defined reports
● Started developing own columnar DB in 2009
● Enables “double real time” reporting
○ Add new events in real time
○ Build custom reports in real time
● Became the main engine in 2012
● Open-sourced in 2016
Adoption
● Yandex-wide
○ business analytics
● Thousands of companies worldwide
○ analyzing 1M DNS queries per second
(Cloudflare)
○ geospatial processing (Carto)
○ real-time ad analytics platform (LifeStreet)
○ storage performance analysis and
monitoring (Infinidat)
○ AdTech, FinTech, sensors, logs, event
streams, time series, etc.
Unusual applications:
● Blockchain transaction history
○ blockchair.com
○ bloxy.info
● CERN LHCb experiment
● Bioinformatics
When to use
● stream of well-structured, immutable
events
○ (mostly) fixed schema
○ append-only
○ heavy-weight ALTER UPDATE/DELETE
as a “GDPR escape hatch”
● flexible real-time reporting
○ queries finish in seconds, not hours
○ no preprocessing needed
○ enables ad-hoc experiments
When NOT to use
● OLTP
○ no transactions
○ single INSERT is atomic, but no
cross-query atomicity
○ very heavy UPDATEs
● key/value storage
○ sparse indexes
○ not suitable for point reads
● document/blob storage
○ optimized for records < 100 kB
● highly normalized data
○ optimized for star schema
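
The "sparse indexes" point can be illustrated with a minimal sketch. It is a toy model, not ClickHouse's actual index format: the granule size of 4 and the function names are assumptions made for illustration. The index keeps only the first key of each block of sorted rows, so even a single-key lookup has to scan a whole block.

```python
import bisect

GRANULE = 4  # hypothetical block size, kept tiny so the effect is visible


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every block of `granule` rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def point_lookup(sorted_keys, index, key, granule=GRANULE):
    """Return (found, rows_scanned) for a lookup of one key."""
    # Locate the last block whose first key is <= key.
    block = bisect.bisect_right(index, key) - 1
    if block < 0:
        return False, 0  # key is smaller than every stored key
    start = block * granule
    scanned = 0
    # The index cannot point at a single row, so scan the block.
    for candidate in sorted_keys[start:start + granule]:
        scanned += 1
        if candidate == key:
            return True, scanned
    return False, scanned


keys = list(range(0, 100, 2))      # 50 sorted keys: 0, 2, ..., 98
index = build_sparse_index(keys)   # one index entry per 4 rows
print(len(keys), len(index), point_lookup(keys, index, 14))
```

The index stays small enough to hold in memory, which suits range scans over sorted data, but every point read pays for a full block.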
How fast?
Query 1  Query 2  Query 3  Query 4  Setup
  0.005     0.01     0.10    0.188  BrytlytDB 2.1 & 5-node IBM Minsky cluster
  0.051     0.15     0.05    0.794  kdb+/q & 4 Intel Xeon Phi 7210 CPUs
  0.241     0.83     1.21    1.781  ClickHouse, 3 x c5d.9xlarge cluster
  0.762     2.47     4.13    6.041  BrytlytDB 1.0 & 2-node p2.16xlarge cluster
  1.034     3.06     5.35   12.748  ClickHouse, Intel Core i5 4670K
  1.56      1.25     2.25    2.97   Redshift, 6-node ds2.8xlarge cluster
  2         2.00     1.00    3      BigQuery
  2.362     3.56     4.02   20.412  Spark 2.4 & 21 x m3.xlarge HDFS cluster
 14.389    32.15    33.45   67.312  Vertica, Intel Core i5 4670K
1.1 Billion Taxi Rides benchmark: 1.1G records, 51 columns, 500 GB uncompressed CSV. Query times are in seconds.
More on performance at our site
Why so fast?
Read less data
● Data locality
○ columnar storage
○ sorted by PK
● Compression
Process data faster
● Parallelism
○ multithreading
○ distributed queries
● Efficient computation
○ >40 different GROUP BY algorithms
○ vectorized query execution with SIMD
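
A rough sketch of why "sorted by PK" and "Compression" work together: a column whose values are sorted contains long runs of equal bytes, which a general-purpose compressor shrinks far better than the same values in arrival order. The column contents are synthetic, and zlib is a stand-in chosen because it ships with Python; it is not the codec ClickHouse uses.

```python
import random
import zlib


def compressed_size(values):
    """Bytes needed for a column of small integers after zlib compression."""
    return len(zlib.compress(bytes(values)))


random.seed(0)  # fixed seed so the run is repeatable
# A synthetic low-cardinality column: 100,000 values drawn from 100 distinct ones.
column = [random.randrange(100) for _ in range(100_000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))
print("arrival order:", unsorted_size, "bytes; sorted:", sorted_size, "bytes")
```

Storing each column separately has a similar effect even without sorting, because values of one type and one meaning sit next to each other on disk.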
Deployment
● Repos for major Linux distros
● Docker containers
● One self-contained binary
○ works everywhere
● ZooKeeper if you need replication
● Test it on your laptop
○ runs with minimal resources
● Stable release every two weeks
○ Thousands of scenarios tested for
each change
● No data migration on update
○ Just run a new server
Data ingestion
● INSERTs
● Many data formats
○ CSV, JSON, Parquet, CapnProto,
ORC, ...
● Batching for optimal performance
○ Buffer table engine
○ Kafka table engine
○ Third-party solutions: chproxy,
kittenhouse, etc.
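
The batching advice can be sketched on the client side as follows. This is a minimal illustration, not how the Buffer engine or chproxy work internally: the `BatchingInserter` class and its `max_rows` parameter are hypothetical, and the `flush` callback stands for whatever actually sends one multi-row INSERT.

```python
class BatchingInserter:
    """Collect rows in memory and hand them over in large batches."""

    def __init__(self, flush, max_rows=1000):
        self._flush = flush        # callable that receives a list of rows
        self._max_rows = max_rows  # hypothetical batch size threshold
        self._buffer = []

    def add(self, row):
        """Buffer one row and flush when the batch is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self._max_rows:
            self.flush()

    def flush(self):
        """Send whatever is buffered; does nothing if the buffer is empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# Usage: record the batches instead of sending them anywhere.
batches = []
inserter = BatchingInserter(batches.append, max_rows=3)
for i in range(7):
    inserter.add((i, f"event-{i}"))
inserter.flush()  # send the final partial batch
print([len(batch) for batch in batches])
```

A production version would probably also flush on a timer so that a slow stream does not hold rows back indefinitely.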
Analyzing data
● A rich SQL dialect
○ strong typing
○ higher order functions like arrayMap
○ a variety of aggregate functions:
quantiles, cardinality estimators, etc.
○ sampling
○ Nested type for key/value records
○ LowCardinality type for dictionary
encoding
● BI tools support
○ Tableau
○ Apache Superset
○ Holistics
○ others via ODBC/SQLAlchemy/...
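
The "sampling" item can be pictured with a small sketch of deterministic, key-based sampling: a row is kept when a hash of its key falls below a threshold, so the same keys are selected on every run and an aggregate can be scaled back up. This is an illustration of the general idea only; the MD5-based hash and the function names are assumptions, not ClickHouse's implementation.

```python
import hashlib


def in_sample(key, fraction):
    """Decide deterministically whether `key` belongs to the sample."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # value in [0, 2**32)
    return bucket < fraction * 2**32


def estimate_count(rows, key_of, fraction):
    """Count only the sampled rows, then scale up to the full data set."""
    sampled = sum(1 for row in rows if in_sample(key_of(row), fraction))
    return sampled / fraction


# Synthetic events: 10,000 users with two events each.
rows = [(user_id, event) for user_id in range(10_000) for event in range(2)]
estimate = estimate_count(rows, key_of=lambda row: row[0], fraction=0.1)
print("true count:", len(rows), "estimate from a 10% sample:", estimate)
```

Because the decision depends only on the key, all events of a sampled user are kept together, which keeps per-user metrics consistent within the sample.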
clickhouse-local
$ clickhouse local \
    --file ~/hits_v1.tsv \
    --structure 'WatchID UInt64, JavaEnable UInt8, ...' \
    --query 'SELECT UserID, SearchPhrase, count() FROM table GROUP BY UserID, SearchPhrase'
Read 8873898 rows, 7.88 GiB in 5.208 sec., 1704038 rows/sec., 1.51 GiB/sec.
UserID SearchPhrase count()
8410854169855355129 пальные кость играть терхи 3
The full power of the ClickHouse engine over a data file.
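
The same invocation could be driven from a script. The sketch below only reuses the three flags shown on the slide (`--file`, `--structure`, `--query`); the helper names are hypothetical, and it assumes a `clickhouse` binary is available on PATH.

```python
import os
import shutil
import subprocess


def build_command(path, structure, query):
    """Assemble the argument list for `clickhouse local` as on the slide."""
    return [
        "clickhouse", "local",
        "--file", os.path.expanduser(path),  # the shell is not expanding ~ here
        "--structure", structure,
        "--query", query,
    ]


def run_local(path, structure, query):
    """Run the query over a data file and return the textual result."""
    if shutil.which("clickhouse") is None:
        raise RuntimeError("clickhouse binary not found on PATH")
    result = subprocess.run(
        build_command(path, structure, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


command = build_command(
    "/tmp/example.tsv",                      # hypothetical file
    "UserID UInt64, SearchPhrase String",    # hypothetical two-column layout
    "SELECT UserID, count() FROM table GROUP BY UserID",
)
print(command)
```

Passing the arguments as a list avoids shell quoting problems in the structure and query strings.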
Interfaces
Connect to ClickHouse
● Native binary protocol
○ Drivers for Python, Go, C++, ...
● RESTful HTTP
● ODBC
● JDBC
● MySQL wire protocol
● PostgreSQL
○ clickhouse_fdw
○ pg2ch (logical replication)
Connect from ClickHouse
● File
● HDFS
● URL
● MySQL
● ODBC
● External dictionaries
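
As a sketch of the HTTP interface listed above, a query can be sent as the body of a POST request using only the Python standard library. The host name and the helper names are assumptions; 8123 is the default HTTP port of a ClickHouse server.

```python
import urllib.request


def build_request(query, host="localhost", port=8123):
    """Prepare a POST request that carries the SQL text in its body."""
    url = f"http://{host}:{port}/"
    return urllib.request.Request(url, data=query.encode("utf-8"), method="POST")


def run_query(query, host="localhost", port=8123):
    """Send the query to a running server and return the response text."""
    request = build_request(query, host, port)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


request = build_request("SELECT 1")
print(request.get_method(), request.full_url)
```

`run_query` needs a reachable server, so the usage lines above only build the request without sending it.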
Resources
● Check the docs at our site
● View the talks at our YouTube channel
● Create issues on GitHub
● Ask on Stack Overflow
● Email us at clickhouse-feedback@yandex-team.ru
● Join English and Russian chats in Telegram
● Get commercial support from Altinity and others
● … and more
Thank you!
Questions?
● https://clickhouse.tech
● https://github.com/ClickHouse/ClickHouse
● clickhouse-feedback@yandex-team.ru

21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration with ClickHouse
