Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is HyperLogLog and Why You Will Love It
Burak Yücesoy
What is COUNT(DISTINCT)?
• The number of unique elements (cardinality) in a given data set
• Useful for finding things like…
  • The number of unique users who visited your web page
  • The number of unique products in your inventory
logins
 username |    date
----------+------------
 Alice    | 2018-10-02
 Bob      | 2018-10-03
 Alice    | 2018-10-05
 Eve      | 2018-10-07
 Bob      | 2018-10-07
 Bob      | 2018-10-08
• Number of logins: 6
• Number of unique users who logged in: 3
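The two numbers for the logins table can be reproduced with a few lines of Python. This is only a sketch: the `logins` list mirrors the table, and the set-based count stands in for what an exact COUNT(DISTINCT) has to do, namely remember every distinct value it has seen.

```python
# Rows of the logins table from the slide: (username, date).
logins = [
    ("Alice", "2018-10-02"),
    ("Bob", "2018-10-03"),
    ("Alice", "2018-10-05"),
    ("Eve", "2018-10-07"),
    ("Bob", "2018-10-07"),
    ("Bob", "2018-10-08"),
]

# COUNT(*): every row counts.
number_of_logins = len(logins)

# COUNT(DISTINCT username): an exact answer keeps every distinct value around.
unique_users = {username for username, _ in logins}
number_of_unique_users = len(unique_users)

print(number_of_logins, number_of_unique_users)
```

The set grows with the number of distinct values, which is where the memory cost of an exact distinct count comes from.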
Problems with Traditional COUNT(DISTINCT)
• Slow
• High memory footprint
• Cannot work with appended/streaming data
There is a better way!
HyperLogLog (HLL) is a faster alternative to COUNT(DISTINCT) with a low memory footprint:
• Approximation algorithm
• Estimates the cardinality (i.e. COUNT(DISTINCT)) of given data
• Mathematically provable error bounds
• It can estimate cardinalities well beyond 10^9 with a 1% error rate using only 6 KB of memory
Is it OK to approximate?
It depends...
• Counting the number of unique felonies associated with a person: not OK
• Counting the number of unique visits to my web page: OK
HLL
• Very fast
• Low memory footprint
• Can work with streaming data
• Can merge estimations of two separate datasets efficiently
How does HLL work?
Steps:
1. Hash all elements
   a. Ensures a uniform data distribution
   b. Lets all data types be treated the same way
2. Observe rare bit patterns
3. Stochastic averaging
How does HLL work? - Observing rare bit patterns
Alice: hash 645403841, binary 0010...001
  Number of leading zeros: 2
  Maximum number of leading zeros: 2
Bob: hash 1492309842, binary 0101...010
  Number of leading zeros: 1
  Maximum number of leading zeros: 2
...
Maximum number of leading zeros: 7
Cardinality estimation: 2^7
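A minimal sketch of the "observe rare bit patterns" step, using the two hash values from the slides. The 32-bit width is an assumption (both example hashes fit in 32 bits), and `leading_zeros` is a helper name made up for this illustration.

```python
HASH_BITS = 32  # assumption: hash values are 32 bits wide


def leading_zeros(hash_value: int, width: int = HASH_BITS) -> int:
    """Number of zero bits before the first 1 bit in a fixed-width value."""
    return width - hash_value.bit_length()


# Hash values taken from the slides.
hashes = {"Alice": 645403841, "Bob": 1492309842}

max_zeros = 0
for name, h in hashes.items():
    zeros = leading_zeros(h)
    max_zeros = max(max_zeros, zeros)
    print(name, format(h, "032b"), zeros)

# A hash with at least k leading zeros shows up about once in 2**k values,
# so the largest k seen gives a (very rough) cardinality estimate of 2**k.
estimate = 2 ** max_zeros
print(max_zeros, estimate)
```

With only these two elements the maximum is 2; a maximum of 7, as on the last slide of this sequence, would give an estimate of 2**7 = 128.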
How does HLL work? Stochastic Averaging
Measuring the same thing repeatedly and taking the average.
The data is split into partitions, and each partition keeps its own maximum number of leading zeros:
 partition   | max. leading zeros | per-partition estimate
-------------+--------------------+------------------------
 Partition 1 | 7                  | 2^7
 Partition 2 | 5                  | 2^5
 Partition 3 | 12                 | 2^12
Estimation: 228.968...
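The 228.968... figure can be reproduced as the number of partitions times the harmonic mean of the per-partition estimates. The sketch below assumes exactly that formula; the published HyperLogLog algorithm additionally applies a bias-correction constant and range corrections, which are left out here because the slide's number comes out without them.

```python
def estimate(max_zeros_per_partition):
    """Partitions times the harmonic mean of 2**M over all partitions."""
    m = len(max_zeros_per_partition)
    harmonic_mean = m / sum(2.0 ** -zeros for zeros in max_zeros_per_partition)
    return m * harmonic_mean


# Maximum leading-zero counts of the three partitions on the slide.
result = estimate([7, 5, 12])
print(result)
```

The harmonic mean is dominated by the small values, so a single partition with an unusually large count (12 here) does not blow up the estimate the way an arithmetic mean would.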
How does HLL work? Stochastic Averaging
01000101...010
• The first m bits decide the partition number
• The remaining bits are used to count leading zeros
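The bit split can be sketched as follows. The 32-bit hash width and the choice of 3 index bits (8 partitions) are assumptions picked to match the 8-partition examples later in the deck; `split_hash` is a hypothetical helper name.

```python
HASH_BITS = 32   # assumption: 32-bit hash values
INDEX_BITS = 3   # assumption: first 3 bits select one of 2**3 = 8 partitions


def split_hash(hash_value):
    """Return (zero-based partition index, leading zeros of the remaining bits)."""
    rest_bits = HASH_BITS - INDEX_BITS
    partition = hash_value >> rest_bits             # first INDEX_BITS bits
    rest = hash_value & ((1 << rest_bits) - 1)      # remaining bits
    zeros = rest_bits - rest.bit_length()
    return partition, zeros


partition, zeros = split_hash(645403841)  # Alice's hash from the slides
print(partition, zeros)
```

Alice's hash starts with the bits 001, so it lands in zero-based partition 1 (the slides' "Partition 2"), and the remaining bits start with two zeros.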
Error rate of HLL
• Typical error rate: 1.04 / sqrt(number of partitions)
• Memory needed: number of partitions * log(log(max. value in hash space)) bits
• Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
• Memory vs. accuracy tradeoff
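To get a feel for the tradeoff, the two formulas can be evaluated directly. The 64-bit hash space and the partition counts below are assumptions chosen for illustration; 8192 partitions is one configuration that lands on roughly 6 KB with an error of about 1%.

```python
import math


def typical_error(partitions):
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def memory_bytes(partitions, hash_bits=64):
    """Approximate memory: each partition stores one leading-zero count."""
    # A count bounded by hash_bits needs about log2(hash_bits) bits,
    # i.e. log(log(size of the hash space)).
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition / 8


for partitions in (1024, 8192, 65536):
    error_percent = round(typical_error(partitions) * 100, 2)
    print(partitions, error_percent, memory_bytes(partitions))
```

Quadrupling the number of partitions halves the error, which is the memory vs. accuracy tradeoff in concrete terms.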
Why does HLL work?
It turns out that a combination of lots of bad observations is a good observation.
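Some interesting examples
The same element, Alice, inserted over and over:
 element | hash      | binary
---------+-----------+----------------
 Alice   | 645403841 | 00100110...001
 ...     | ...       | ...
 partition   | max. leading zeros | per-partition estimate
-------------+--------------------+------------------------
 Partition 1 | 0                  | 2^0
 Partition 2 | 2                  | 2^2
 ...         | ...                | ...
 Partition 8 | 0                  | 2^0
Harmonic mean: 1.103...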
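Some interesting examples
A single element, Charlie, whose hash is all zeros:
 element | hash | binary
---------+------+----------------
 Charlie | 0    | 00000000...000
 ...     | ...  | ...
 partition   | max. leading zeros | per-partition estimate
-------------+--------------------+------------------------
 Partition 1 | 29                 | 2^29
 Partition 2 | 0                  | 2^0
 ...         | ...                | ...
 Partition 8 | 0                  | 2^0
Harmonic mean: 1.142...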
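Both harmonic mean figures (1.103... and 1.142...) can be checked with a short sketch. It assumes 8 partitions in which every partition not shown on the slides holds 0; `harmonic_mean_of_estimates` is a made-up helper name.

```python
def harmonic_mean_of_estimates(max_zeros_per_partition):
    """Harmonic mean of 2**M over all partitions."""
    m = len(max_zeros_per_partition)
    return m / sum(2.0 ** -zeros for zeros in max_zeros_per_partition)


# "Alice" inserted many times: the hash never changes, so only
# Partition 2 ever records anything (2 leading zeros).
alice = harmonic_mean_of_estimates([0, 2, 0, 0, 0, 0, 0, 0])

# "Charlie" hashes to 0: Partition 1 records 29 leading zeros.
charlie = harmonic_mean_of_estimates([29, 0, 0, 0, 0, 0, 0, 0])

print(alice, charlie)
```

Duplicates cannot inflate the result, and even an extreme outlier such as 2**29 in one partition barely moves the harmonic mean.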
HLL in PostgreSQL
● https://github.com/citusdata/postgresql-hll
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
• Use hll_hash_bigint to hash elements.
  • There are similar functions for other common data types.
• Use hll_add_agg to aggregate hashed elements into an hll data structure.
• Use hll_cardinality to materialize an hll data structure into an actual distinct count.
Real Time Dashboard with HyperLogLog
What is Rollup?
Precomputed aggregates for a period of time and a set of dimensions.
-- Raw events
CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
);
-- 5-minute rollup of the events table
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
);
What is Rollup?
INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT(DISTINCT device_id) AS device_distinct_count,
    COUNT(DISTINCT session_id) AS session_distinct_count,
    -- Round the event timestamp down to the start of its 5-minute (300 s) window.
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;
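The timestamp expression in the rollup query rounds each event time down to the start of its 5-minute window. The same arithmetic, sketched in Python with a hypothetical event time:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5 minutes, as in the rollup query


def five_minute_bucket(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute window."""
    epoch_seconds = int(ts.timestamp())
    bucket_start = epoch_seconds - epoch_seconds % BUCKET_SECONDS
    return datetime.fromtimestamp(bucket_start, tz=timezone.utc)


# Hypothetical event time, chosen only for illustration.
event_time = datetime(2018, 10, 24, 14, 37, 42, tzinfo=timezone.utc)
bucket = five_minute_bucket(event_time)
print(bucket.isoformat())
```

Every event between 14:35:00 and 14:39:59 ends up in the same rollup row.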
Benefits of Rollup Tables
• Fast, indexed lookups of aggregates
• Avoid expensive repeated computations
• Rollups are compact (they use less space) and can be kept over longer periods
• Rollups can be further aggregated
What if I want to get the aggregation result for a 1-hour period?
SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    -- Caution: distinct counts are not additive. A device seen in several
    -- 5-minute windows is counted once per window here.
    SUM(device_distinct_count) AS device_distinct_count,
    SUM(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;
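The catch in this query is the SUM over distinct counts: plain event counts add up across windows, distinct counts do not. A tiny illustration with made-up device ids:

```python
# Hypothetical device ids seen in two consecutive 5-minute windows.
window_1 = {101, 102, 103}
window_2 = {102, 103, 104}

# What summing the per-window distinct counts produces.
summed = len(window_1) + len(window_2)

# The true number of distinct devices over both windows.
actual = len(window_1 | window_2)

print(summed, actual)
```

Devices 102 and 103 appear in both windows, so the sum counts them twice. Getting the right answer from plain numbers would require going back to the raw events.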
Rollup Table with HLL
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count hll,   -- was bigint
    session_distinct_count hll,  -- was bigint
    minute timestamp
);
Rollup Table with HLL
INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
    hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;
What if I want to get the aggregation result for a 1-hour period?
SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    -- The results are still hll values; wrap them in hll_cardinality to get numbers.
    hll_union_agg(device_distinct_count) AS device_distinct_count,
    hll_union_agg(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;
How to Merge COUNT(DISTINCT) with HLL
Each interval keeps one maximum leading-zero count per partition as its intermediate result:
 interval   | Partition 1 | Partition 2 | Partition 3 | intermediate result
------------+-------------+-------------+-------------+---------------------
 Interval 1 | 7           | 5           | 12          | HLL(7, 5, 12)
 Interval 2 | 11          | 7           | 8           | HLL(11, 7, 8)
hll_union_agg merges them by keeping the larger value for each partition:
 partition   | Interval 1 | Interval 2 | Interval 1 + Interval 2 | per-partition estimate
-------------+------------+------------+-------------------------+------------------------
 Partition 1 | 7          | 11         | 11                      | 2^11
 Partition 2 | 5          | 7          | 7                       | 2^7
 Partition 3 | 12         | 8          | 12                      | 2^12
HLL(7, 5, 12) + HLL(11, 7, 8) = HLL(11, 7, 12)
Estimation: 1053.257
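A minimal sketch of the merge, using the partition values from the slides. The `union` and `estimate` helpers are made-up names that only mimic what hll_union_agg and hll_cardinality do conceptually; the estimate again omits the bias correction of the published algorithm.

```python
def union(sketch_a, sketch_b):
    """Merge two sketches by keeping the larger count for each partition."""
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]


def estimate(sketch):
    """Partitions times the harmonic mean of 2**M (no bias correction)."""
    m = len(sketch)
    return m * m / sum(2.0 ** -zeros for zeros in sketch)


interval_1 = [7, 5, 12]
interval_2 = [11, 7, 8]

merged = union(interval_1, interval_2)
print(merged, estimate(merged))
```

The merged sketch is exactly what would have been built had both intervals been processed as one data set, so the estimate comes out at roughly 1053 with no overcounting.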
Rollup Table with HLL
• What if ...
• Without hll, you would have to maintain 2^n - 1 rollup tables to cover all combinations of n columns (multiply this by the number of time intervals).
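Where the count of rollup tables comes from: every non-empty subset of the dimension columns needs its own precomputed distinct counts. A short enumeration, assuming the four dimension columns of the example rollup table:

```python
from itertools import combinations

# Dimension columns of the example rollup table.
dimensions = ["customer_id", "event_type", "country", "browser"]

# Every non-empty subset of the dimensions is a separate grouping.
groupings = [
    combo
    for size in range(1, len(dimensions) + 1)
    for combo in combinations(dimensions, size)
]

print(len(groupings))
for combo in groupings:
    print(", ".join(combo))
```

With hll columns, one rollup table at the finest grouping is enough, because coarser groupings can be derived from it by merging.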
What Happens in a Distributed Scenario?
postgresql-hll in a distributed environment
1. Separate the data into shards.
   events_001  events_002  events_003
2. Put the shards on separate nodes.
   Coordinator
   Worker Node 1: events_001
   Worker Node 2: events_002
   Worker Node 3: events_003
3. For each shard, calculate the hll (but do not materialize it).
   Shard 1: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12
   Intermediate result: HLL(7, 5, 12)
4. Pull the intermediate results to a single node.
   Worker Node 1 (events_001): HLL(6, 4, 11)
   Worker Node 2 (events_002): HLL(10, 6, 7)
   Worker Node 3 (events_003): HLL(7, 12, 5)
   Coordinator: receives all three
5. Merge the separate hll data structures and materialize the result.
   HLL(11, 7, 8) + HLL(7, 5, 12) + HLL(8, 13, 6) = HLL(11, 13, 12)
   Per-partition estimates: 2^11, 2^13, 2^12
   Estimation: 10532.571...
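The five steps can be simulated in a few lines. Everything here is an assumption made for illustration: the SHA-256-based 32-bit hash is a stand-in (not the hash postgresql-hll uses), 16 partitions are chosen arbitrarily, and the shards are just overlapping ranges of integers.

```python
import hashlib

HASH_BITS = 32               # assumption: 32-bit hash values
INDEX_BITS = 4               # assumption: 2**4 = 16 partitions
PARTITIONS = 1 << INDEX_BITS


def hash32(value):
    """Stand-in hash: first 4 bytes of SHA-256 as an unsigned integer."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_sketch(values):
    """Step 3: per-partition maximum leading-zero counts for one shard."""
    registers = [0] * PARTITIONS
    rest_bits = HASH_BITS - INDEX_BITS
    for value in values:
        h = hash32(value)
        partition = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        zeros = rest_bits - rest.bit_length()
        registers[partition] = max(registers[partition], zeros)
    return registers


def merge(sketches):
    """Step 5: keep the largest count per partition across all sketches."""
    return [max(column) for column in zip(*sketches)]


# Steps 1-2: three shards with overlapping device ids.
shards = [range(0, 400), range(300, 700), range(600, 1000)]

per_shard = [build_sketch(shard) for shard in shards]   # on the workers
merged = merge(per_shard)                               # on the coordinator
single_node = build_sketch(range(0, 1000))              # all data in one place

print(merged == single_node)
```

Only the small per-shard sketches travel over the network, and merging them gives the same sketch as processing all the data on one node, overlaps included.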
Thanks & Questions
Burak Yücesoy
burak@citusdata.com
@byucesoy
www.citusdata.com
@citusdata

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference Europe 2018 | Burak Yucesoy

  • 1. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is HyperLogLog and Why You Will Love It Burak Yücesoy
  • 2. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 • Number of unique elements (cardinality) in given data • Useful to find things like… • Number of unique users visited your web page • Number of unique products in your inventory What is COUNT(DISTINCT)? 2
  • 3. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is COUNT(DISTINCT)? 3 logins username | date ----------+----------- Alice | 2018-10-02 Bob | 2018-10-03 Alice | 2018-10-05 Eve | 2018-10-07 Bob | 2018-10-07 Bob | 2018-10-08 • Number of logins: 6 • Number of unique users who log in: 3
  • 4. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 • Slow • High memory footprint • Cannot work with appended/streaming data Problems with Traditional COUNT(DISTINCT) 4
  • 5. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 HyperLogLog(HLL) is faster alternative to COUNT(DISTINCT) with low memory footprint; • Approximation algorithm • Estimates cardinality (i.e. COUNT(DISTINCT) ) of given data • Mathematically provable error bounds • It can estimate cardinalities well beyond 109 with 1% error rate using only 6 KB of memory There is better way! 5
  • 6. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 It depends... Is it OK to approximate? 6
  • 7. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Is it OK to approximate? 7 • Count # of unique felonies associated to a person; Not OK • Count # of unique visits to my web page; OK
  • 8. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 HLL • Very fast • Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently 8
  • 9. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? Steps; 1. Hash all elements a. Ensures uniform data distribution b. Can treat all data types same 2. Observing rare bit patterns 3. Stochastic averaging 9
  • 10. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? - Observing rare bit patterns hash Alice 645403841 binary 0guatda.com/cmx.p010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2 10
  • 11. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? - Observing rare bit patterns hash Bob 1492309842 binary 0guatda.com/cmx.p101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2 11
  • 12. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? - Observing rare bit patterns ... Maximum number of leading zeros: 7 Cardinality Estimation: 27 12
  • 13. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? Stochastic Averaging Measuring same thing repeatedly and taking average. 13
  • 14. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 201814
  • 15. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 201815
  • 16. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? Stochastic Averaging Data Partition 1 Partition 3 Partition 2 7 5 12 228.968... Estimation 27 25 212 16
  • 17. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 How does HLL work? Stochastic Averaging 01000guatda.com/cmx.p101...010 First m bits to decide partition number Remaining bits to count leading zeros 17
  • 18. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Error rate of HLL • Typical Error Rate: 1.04 / sqrt(number of partitions) • Memory need is number of partitions * log(log(max. value in hash space)) bit • Can estimate cardinalities well beyond 109 with 1% error rate while using a memory of only 6 kilobytes • Memory vs accuracy tradeoff 18
  • 19. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Why does HLL work? It turns out, combination of lots of bad observation is a good observation 19
  • 20. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Some interesting examples Alice Alice Alice … … … Alice Partition 1 Partition 8 Partition 2 0 2 0 1.103... Harmonic Mean 20 22 20 hash Alice 645403841 binary 00100guatda.com/cmx.p110...001 ... ... ... 20
  • 21. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Some interesting examples Charlie Partition 1 Partition 8 Partition 2 29 0 0 1.142... Harmonic Mean 229 20 20 hash Charlie 0 binary 00000guatda.com/cmx.p000...000 ... ... ... 21
  • 22. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 HLL in PostgreSQL ● https://github.com/citusdata/postgresql-hll 22
  • 23. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 postgresql-hll uses a data structure, also called hll to keep maximum number of leading zeros of each partition. • Use hll_hash_bigint to hash elements. • There are some other functions for other common data types. • Use hll_add_agg to aggregate hashed elements into hll data structure. • Use hll_cardinality to materialize hll data structure to actual distinct count. HLL in PostgreSQL 23
  • 24. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Real Time Dashboard with HyperLogLog 24
  • 25. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 Precomputed aggregates for period of time and set of dimensions; What is Rollup? 25
  • 26. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 26
  • 27. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 27
  • 28. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 28
  • 29. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 29
  • 30. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 30 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 30
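The rollup INSERT above floors each event timestamp to a 5-minute boundary and aggregates per group. A minimal Python sketch of the same idea follows, for illustration only. The function names `bucket_start` and `rollup` and the sample events are hypothetical, and exact sets stand in for COUNT(DISTINCT).

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets, matching the "/ 300 ... * 300" arithmetic in the SQL


def bucket_start(ts):
    """Floor a timezone-aware timestamp to the start of its 5-minute bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def rollup(events):
    """Group events by (customer, type, country, browser, bucket) and aggregate."""
    groups = defaultdict(lambda: {"count": 0, "devices": set(), "sessions": set()})
    for e in events:
        key = (e["customer_id"], e["event_type"], e["country"], e["browser"],
               bucket_start(e["timestamp"]))
        g = groups[key]
        g["count"] += 1                    # COUNT(*)
        g["devices"].add(e["device_id"])   # COUNT(DISTINCT device_id)
        g["sessions"].add(e["session_id"])  # COUNT(DISTINCT session_id)
    return {k: (g["count"], len(g["devices"]), len(g["sessions"]))
            for k, g in groups.items()}


def _event(minute, device_id, session_id):
    # Hypothetical sample row: one customer, one event type, one country, one browser.
    return {"customer_id": 1, "event_type": "click", "country": "TR",
            "browser": "firefox", "device_id": device_id, "session_id": session_id,
            "timestamp": datetime(2018, 10, 2, 12, minute, tzinfo=timezone.utc)}


events = [_event(1, 10, 100), _event(3, 10, 101), _event(7, 11, 102)]
result = rollup(events)
```

Here the events at 12:01 and 12:03 fall into the 12:00 bucket and the event at 12:07 into the 12:05 bucket, so `result` holds two rollup rows.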
  • 34. Benefits of Rollup Tables
        • Fast, indexed lookups of aggregates
        • Avoid expensive repeated computations
        • Rollups are compact (they use less space) and can be kept over longer periods
        • Rollups can be further aggregated
  • 35. What if I want to get aggregation result for 1 hour period?
        SELECT
            customer_id,
            event_type,
            country,
            browser,
            SUM(event_count) AS event_count,
            SUM(device_distinct_count) AS device_distinct_count,
            SUM(session_distinct_count) AS session_distinct_count,
            date_trunc('hour', minute) AS hour
        FROM rollup_events_5min
        GROUP BY customer_id, event_type, country, browser, hour
        (SUM is correct for event_count, but summing distinct counts double-counts any device or session that appears in more than one 5-minute interval.)
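The problem with summing per-interval distinct counts can be seen with two small sets. This is an illustrative sketch; the device names and interval contents are made up.

```python
# Devices seen in two consecutive 5-minute intervals (hypothetical data).
interval_1 = {"dev-a", "dev-b", "dev-c"}
interval_2 = {"dev-b", "dev-c", "dev-d"}

# What SUM(device_distinct_count) over the rollup rows would report.
summed = len(interval_1) + len(interval_2)

# What COUNT(DISTINCT device_id) over the raw events would report.
exact = len(interval_1 | interval_2)

print(summed, exact)
```

The two numbers differ because dev-b and dev-c are counted once per interval in the sum. A plain integer count carries no information about which elements it counted, so it cannot be merged correctly.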
  • 38. Rollup Table with HLL
        With hll columns:
        CREATE TABLE rollup_events_5min (
            customer_id bigint,
            event_type varchar,
            country varchar,
            browser varchar,
            event_count bigint,
            device_distinct_count hll,
            session_distinct_count hll,
            minute timestamp
        );
        Original, with bigint counts:
        CREATE TABLE rollup_events_5min (
            customer_id bigint,
            event_type varchar,
            country varchar,
            browser varchar,
            event_count bigint,
            device_distinct_count bigint,
            session_distinct_count bigint,
            minute timestamp
        );
  • 39. Rollup Table with HLL
        INSERT INTO rollup_events_5min
        SELECT
            customer_id,
            event_type,
            country,
            browser,
            COUNT(*) AS event_count,
            hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
            hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
            date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
        FROM events
        WHERE timestamp >= $1 AND timestamp <= $2
        GROUP BY customer_id, event_type, country, browser, minute
  • 40. What if I want to get aggregation result for 1 hour period?
        SELECT
            customer_id,
            event_type,
            country,
            browser,
            SUM(event_count) AS event_count,
            hll_union_agg(device_distinct_count) AS device_distinct_count,
            hll_union_agg(session_distinct_count) AS session_distinct_count,
            date_trunc('hour', minute) AS hour
        FROM rollup_events_5min
        GROUP BY customer_id, event_type, country, browser, hour
  • 41. How to Merge COUNT(DISTINCT) with HLL
        Interval 1: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12
        Intermediate result: HLL(7, 5, 12)
  • 42. How to Merge COUNT(DISTINCT) with HLL
        Interval 2: Partition 1 = 11, Partition 2 = 7, Partition 3 = 8
        Intermediate result: HLL(11, 7, 8)
  • 43. How to Merge COUNT(DISTINCT) with HLL
        hll_union_agg of HLL(7, 5, 12) and HLL(11, 7, 8) gives HLL(11, 7, 12)
        Merged partition values 11, 7, 12 correspond to 2^11, 2^7, 2^12
        Estimation: 1053.255
  • 44. How to Merge COUNT(DISTINCT) with HLL
        Interval 1 + Interval 2:
        Interval 1 Partition 1 (7) + Interval 2 Partition 1 (11): 11
        Interval 1 Partition 2 (5) + Interval 2 Partition 2 (7): 7
        Interval 1 Partition 3 (12) + Interval 2 Partition 3 (8): 12
        Estimation: 1053.255
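The merge shown on these slides is a partition-wise maximum, followed by an estimate from the merged values. The sketch below reproduces the slide's toy numbers under one assumption: the slide figure matches the raw form m² / Σ 2^(-M_j), without the bias-correction constant and range corrections that the full HyperLogLog algorithm applies. The function names `hll_union` and `raw_estimate` are hypothetical and are not the postgresql-hll API.

```python
def hll_union(*sketches):
    """Merge sketches by taking the maximum of each partition's value."""
    return tuple(max(values) for values in zip(*sketches))


def raw_estimate(registers):
    """Uncorrected estimate: m times the harmonic mean of 2**M_j over m partitions."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


interval_1 = (7, 5, 12)
interval_2 = (11, 7, 8)

merged = hll_union(interval_1, interval_2)
estimate = raw_estimate(merged)
print(merged, round(estimate, 2))
```

Because the maximum is associative and commutative, the merged result does not depend on the order or grouping of the inputs, which is what lets 5-minute rollups be combined into hourly ones. With the slide's three partitions the raw estimate comes out at roughly 1053.26.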
  • 45. Rollup Table with HLL
        • What if ...
        • Without hll, you would have to maintain 2^n - 1 rollup tables to cover all combinations in n columns (multiply this with number of time intervals).
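The 2^n - 1 figure is the number of non-empty subsets of the n grouping columns. A short sketch that enumerates them for the four dimension columns of the rollup table (the helper name `column_subsets` is hypothetical):

```python
from itertools import combinations


def column_subsets(columns):
    """Return every non-empty combination of grouping columns."""
    return [combo
            for size in range(1, len(columns) + 1)
            for combo in combinations(columns, size)]


columns = ["customer_id", "event_type", "country", "browser"]
subsets = column_subsets(columns)
print(len(subsets))  # 2**4 - 1
```

With exact distinct counts, each of these groupings would need its own precomputed table, since counts for (customer_id, country) cannot be derived from counts for (customer_id, country, browser). With hll columns, the finest-grained rollup can be merged down to any coarser grouping.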
  • 46. What Happens in Distributed Scenario?
  • 47. postgresql-hll in distributed environment
        1. Separate data into shards.
        Shards: events_001, events_002, events_003
  • 48. postgresql-hll in distributed environment
        2. Put shards into separate nodes.
        Coordinator
        Worker Node 1: events_001
        Worker Node 2: events_002
        Worker Node 3: events_003
  • 49. postgresql-hll in distributed environment
        3. For each shard, calculate hll (but do not materialize).
        Shard 1: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12
        Intermediate result: HLL(7, 5, 12)
  • 50. postgresql-hll in distributed environment
        4. Pull intermediate results to a single node.
        Worker Node 1 (events_001), Worker Node 2 (events_002) and Worker Node 3 (events_003) send their results to the Coordinator.
        Intermediate results: HLL(6, 4, 11), HLL(10, 6, 7), HLL(7, 12, 5)
  • 51. postgresql-hll in distributed environment
        5. Merge separate hll data structures and materialize them
        Inputs: HLL(11, 7, 8), HLL(7, 5, 12), HLL(8, 13, 6)
        Merged: HLL(11, 13, 12), corresponding to 2^11, 2^13, 2^12
        Estimation: 10532.571...
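The five distributed steps can be sketched end to end with a small, self-contained HyperLogLog in Python. This is an illustration under stated assumptions, not the postgresql-hll or Citus implementation: it hashes with SHA-256, uses 1024 registers, and skips the small-range and large-range corrections. The shard contents and all function names are hypothetical.

```python
import hashlib

P = 10       # number of index bits (assumed for this sketch)
M = 1 << P   # 1024 registers


def hash64(value):
    """Map a value to a uniformly distributed 64-bit integer."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(values):
    """Build HLL registers for one shard (the per-shard intermediate result)."""
    registers = [0] * M
    for v in values:
        h = hash64(v)
        index = h >> (64 - P)                    # first P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1 bit
        registers[index] = max(registers[index], rank)
    return registers


def merge(*sketches):
    """Coordinator step: register-wise maximum over the shard sketches."""
    return [max(regs) for regs in zip(*sketches)]


def estimate(registers):
    """Raw HLL estimate with the standard bias constant for m >= 128."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)


# Three overlapping shards of device ids (hypothetical data, 30000 distinct ids).
events_001 = range(0, 12000)
events_002 = range(8000, 22000)
events_003 = range(20000, 30000)

merged = merge(sketch(events_001), sketch(events_002), sketch(events_003))
single_node = sketch(range(30000))
print(merged == single_node, round(estimate(merged)))
```

The merged registers are identical to the registers built on a single node over all the data, so distributing the computation adds no error beyond the usual HLL approximation error. Only the fixed-size register arrays travel to the coordinator, not the raw ids.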
  • 52. Thanks & Questions
        Burak Yücesoy
        burak@citusdata.com
        @byucesoy
        www.citusdata.com
        @citusdata