© 2022 Altinity, Inc.
ClickHouse -If Combinators for Fun and Profit
Robert Hodges, Altinity
SF Bay Area ClickHouse Meetup
May 4, 2022
Let’s make some introductions
Robert Hodges
Database geek with 30+ years of experience with DBMS. Day job: Altinity CEO
Altinity Engineering
Database geeks with centuries of experience in DBMS and applications
ClickHouse support and services, including Altinity.Cloud
Authors of the Altinity Kubernetes Operator for ClickHouse and other open source projects
What are -If Combinators?
ClickHouse Aggregation Functions
● count()
● min(value)
● max(value)
● sum(value)
● avg(value)
● uniq(value)
● uniqExact(value)
● …
-If versions of same functions
● countIf(condition)
● minIf(value, condition)
● maxIf(value, condition)
● sumIf(value, condition)
● avgIf(value, condition)
● uniqIf(value, condition)
● uniqExactIf(value, condition)
● …
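A minimal sketch of what the -If suffix does, assuming only the built-in `numbers()` table function (it is not part of the original slides): each -If aggregate takes one extra argument, the condition, and aggregates only the rows where that condition is true.

```sql
-- numbers(10) yields the integers 0 through 9 in a column named number.
SELECT
    count() AS all_rows,                          -- 10
    countIf(number % 2 = 0) AS even_rows,         -- 5  (0, 2, 4, 6, 8)
    sumIf(number, number >= 5) AS sum_5_to_9,     -- 35 (5 + 6 + 7 + 8 + 9)
    maxIf(number, number < 3) AS max_below_3      -- 2
FROM numbers(10);
```

The plain and -If versions can sit side by side in the same SELECT, which is what the later slides rely on.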
Finding dirty data using a histogram
-- NYC Taxi data: fare amounts
WITH histogram(5)(fare_amount) AS hist
SELECT
arrayJoin(hist) AS hist1,
bar(hist1.3, 0, 1000, 20) AS bar
FROM tripdata
┌─hist1──────────────────────────────────────────────────┬─bar──────────────────┐
│ (-21474808,-12079576.375,1) │ │
│ (-12079576.375,-1342166.6614073187,163862993.25) │ ████████████████████ │
│ (-1342166.6614073187,177236.82575785994,983177956.125) │ ████████████████████ │
│ (177236.82575785994,502005.5770089285,163863003.875) │ ████████████████████ │
│ (502005.5770089285,825998.625,8.75) │ ▏ │
└────────────────────────────────────────────────────────┴──────────────────────┘
5 rows in set. Elapsed: 6.853 sec. Processed 1.31 billion rows, 5.24 GB
(191.28 million rows/s., 765.11 MB/s.)
Finding dirty data with countIf()
-- NYC Taxi data: fare amounts
SELECT
countIf(fare_amount < 0) AS fare_less_than_zero,
countIf((fare_amount >= 0) AND (fare_amount < 100)) AS fare_0_to_100,
countIf((fare_amount >= 100) AND (fare_amount < 1000)) AS fare_100_to_1000,
countIf(fare_amount > 1000) AS fare_1000_or_greater
FROM tripdata
┌─fare_less_than_zero─┬─fare_0_to_100─┬─fare_100_to_1000─┬─fare_1000_or_greater─┐
│ 128932 │ 1310237048 │ 537609 │ 373 │
└─────────────────────┴───────────────┴──────────────────┴──────────────────────┘
1 rows in set. Elapsed: 0.898 sec. Processed 1.31 billion rows, 5.24 GB
(1.46 billion rows/s., 5.84 GB/s.)
Simpler output and faster too!
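One thing worth checking with hand-written buckets is that they cover every row. The last bucket on the slide tests `fare_amount > 1000`, so a fare of exactly 1000 would land in no bucket. The four counts shown add up to 1,310,903,962, one short of the 1,310,903,963 rides reported by the avgIf() query later in this deck. That would be consistent with a single uncounted row, though the slides do not confirm it. The sketch below uses hypothetical inline values (not the tripdata table) and a `>= 1000` final bucket to show a simple coverage check.

```sql
-- Hypothetical fares; arrayJoin turns the array into one row per element.
SELECT
    count() AS total,                                          -- 7
    countIf(fare < 0) AS below_zero,                           -- 1
    countIf((fare >= 0) AND (fare < 100)) AS from_0_to_100,    -- 3
    countIf((fare >= 100) AND (fare < 1000)) AS from_100_to_1000, -- 1
    countIf(fare >= 1000) AS from_1000_up,                     -- 2
    -- 1 when the buckets account for every row, 0 otherwise
    total = below_zero + from_0_to_100 + from_100_to_1000 + from_1000_up
        AS buckets_cover_all_rows
FROM
(
    SELECT arrayJoin([-5.0, 0.0, 12.5, 99.99, 100.0, 1000.0, 2500.0]) AS fare
);
```

Changing the last condition back to `fare > 1000` makes `buckets_cover_all_rows` return 0 for this data, because the row with a fare of exactly 1000 is no longer counted.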
Searching for GitHub event counts using GROUP BY
-- GitHub event data by year.
SELECT
toYear(created_at) AS year, event_type, count()
FROM github_events
WHERE event_type IN ('PullRequestEvent', 'WatchEvent')
GROUP BY year, event_type
ORDER BY year ASC, event_type ASC
┌─year─┬─event_type───────┬──count()─┐
│ 2011 │ PullRequestEvent │ 476818 │
│ 2011 │ WatchEvent │ 1831742 │
│ 2012 │ PullRequestEvent │ 1807044 │
│ 2012 │ WatchEvent │ 4048676 │
Pivoting event data using countIf()
-- GitHub event data by year.
SELECT
toYear(created_at) AS year,
countIf(event_type = 'PullRequestEvent') AS PRs,
countIf(event_type = 'WatchEvent') AS Stars
FROM github_events
WHERE event_type IN ('PullRequestEvent', 'WatchEvent')
GROUP BY year
┌─year─┬──────PRs─┬────Stars─┐
│ 2011 │ 476818 │ 1831742 │
│ 2012 │ 1807044 │ 4048676 │
│ 2013 │ 3103759 │ 7432800 │
│ 2014 │ 5923843 │ 11952935 │
Way slower if you
omit the IN clause
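The pivot can be tried without the github_events table. The sketch below uses a handful of hypothetical inline rows, including an event type that should be ignored. The WHERE clause does not change the counts, since each countIf() already skips other event types. Presumably it speeds things up because fewer rows reach the aggregation. The ORDER BY is added here because GROUP BY alone does not guarantee row order.

```sql
-- Hypothetical rows standing in for github_events: (year, event_type).
SELECT
    year,
    countIf(event_type = 'PullRequestEvent') AS PRs,
    countIf(event_type = 'WatchEvent') AS Stars
FROM
(
    SELECT e.1 AS year, e.2 AS event_type
    FROM
    (
        SELECT arrayJoin([
            (2011, 'PullRequestEvent'),
            (2011, 'WatchEvent'),
            (2011, 'WatchEvent'),
            (2012, 'PullRequestEvent'),
            (2012, 'IssuesEvent')
        ]) AS e
    )
)
WHERE event_type IN ('PullRequestEvent', 'WatchEvent')
GROUP BY year
ORDER BY year ASC;
-- 2011: PRs = 1, Stars = 2
-- 2012: PRs = 1, Stars = 0
```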
Cleaning up multiple inputs on the fly with avgIf()
SELECT
count() AS rides,
avgIf(passenger_count, passenger_count < 10)
AS passengers_clean,
avgIf(fare_amount, (fare_amount >= 0) AND (fare_amount < 500))
AS fares_clean
FROM tripdata
┌──────rides─┬──passengers_clean─┬────────fares_clean─┐
│ 1310903963 │ 1.681498872092554 │ 11.424149328643022 │
└────────────┴───────────────────┴────────────────────┘
1 rows in set. Elapsed: 1.092 sec. Processed 1.31 billion rows, 6.55 GB
(1.20 billion rows/s., 6.00 GB/s.)
Multiple conditions
in a single scan!
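To see why per-aggregate conditions differ from a single WHERE clause, here is a small sketch on hypothetical inline rows (the column names and values are made up for illustration). One row has a bad passenger count but a valid fare, and another has a valid passenger count but a negative fare. With -If, each average drops only the row that is bad for that column.

```sql
-- Hypothetical rides: (passengers, fare).
SELECT
    count() AS rides,                                              -- 4
    avgIf(passengers, passengers < 10) AS passengers_clean,        -- 2  (1, 2, 3)
    avgIf(fare, (fare >= 0) AND (fare < 500)) AS fares_clean       -- 20 (10, 20, 30)
FROM
(
    SELECT r.1 AS passengers, r.2 AS fare
    FROM
    (
        SELECT arrayJoin([
            (1, 10.0),
            (2, 20.0),
            (200, 30.0),   -- bad passenger count, valid fare
            (3, -5.0)      -- valid passenger count, bad fare
        ]) AS r
    )
);
```

Moving both conditions into a WHERE clause would discard the last two rows entirely, giving averages of 1.5 and 15 for this data, and `count()` would no longer report all rides.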
Finding out more tricks for using -If combinators
● ClickHouse docs - Very limited
● Altinity Knowledge Base - Examples using system tables and clever tricks
○ Example: Simple aggregate functions & combinators
● Tests in ClickHouse codebase – Examples of what’s possible but not why
cd ~/git/ClickHouse/src/tests
grep -r countIf .
grep -r avgIf .
grep -r weightedSumIf .
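Another place to look is the server itself. As a sketch, assuming a reasonably recent ClickHouse version where these system tables and columns are present, the following queries list the available combinator suffixes and a sample of the aggregate functions they can be attached to.

```sql
-- Combinator suffixes known to this server (If, Array, State, Merge, ...).
SELECT name
FROM system.aggregate_function_combinators
ORDER BY name ASC;

-- Aggregate functions whose name contains 'avg'; the -If suffix
-- should be applicable to each of them.
SELECT name
FROM system.functions
WHERE is_aggregate AND (name ILIKE '%avg%')
ORDER BY name ASC;
```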
Thank you!
Questions?
https://altinity.com
rhodges at altinity.com
Altinity.Cloud
Software and support for
ClickHouse
We’re hiring!
