What is HyperLogLog and Why You Will Love It | PostgreSQL Conference Europe 2018 | Burak Yucesoy

Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is HyperLogLog and
Why You Will Love It
Burak Yücesoy

What is COUNT(DISTINCT)?
• Number of unique elements (cardinality) in given data
• Useful to find things like…
  • Number of unique users who visited your web page
  • Number of unique products in your inventory

What is COUNT(DISTINCT)?
logins
 username | date
----------+------------
 Alice    | 2018-10-02
 Bob      | 2018-10-03
 Alice    | 2018-10-05
 Eve      | 2018-10-07
 Bob      | 2018-10-07
 Bob      | 2018-10-08
• Number of logins: 6
• Number of unique users who log in: 3

Problems with Traditional COUNT(DISTINCT)
• Slow
• High memory footprint
• Cannot work with appended/streaming data

There is a better way!
HyperLogLog (HLL) is a faster alternative to COUNT(DISTINCT) with a low memory footprint:
• Approximation algorithm
• Estimates the cardinality (i.e. COUNT(DISTINCT)) of given data
• Mathematically provable error bounds
• It can estimate cardinalities well beyond 10^9 with a 1% error rate using only 6 KB of memory

Is it OK to approximate?
It depends...
• Count # of unique felonies associated with a person: not OK
• Count # of unique visits to my web page: OK

HLL
• Very fast
• Low memory footprint
• Can work with streaming data
• Can merge estimations of two separate datasets efficiently

How does HLL work?
Steps:
1. Hash all elements
   a. Ensures uniform data distribution
   b. Can treat all data types the same
2. Observe rare bit patterns
3. Stochastic averaging

How does HLL work? - Observing rare bit patterns
Element: Alice
Hash: 645403841
Binary: 0010...001
Number of leading zeros: 2
Maximum number of leading zeros: 2

Element: Bob
Hash: 1492309842
Number of leading zeros: 1

...
Cardinality estimation: 2^7
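The "rare bit pattern" observation can be sketched in a few lines of Python. This is only an illustration under assumptions: the 32-bit width is inferred from the hash values shown on the slides, and `hash32` is a hypothetical stand-in built on SHA-1, not the hash function postgresql-hll uses.

```python
import hashlib

HASH_BITS = 32  # assumed hash width, inferred from the values on the slides


def hash32(element: str) -> int:
    """Hypothetical stand-in hash: first 4 bytes of SHA-1 as an unsigned int."""
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:4], "big")


def leading_zeros(value: int, bits: int = HASH_BITS) -> int:
    """Number of leading zero bits of `value` viewed as a `bits`-wide word."""
    return bits - value.bit_length()


def naive_estimate(elements) -> int:
    """Estimate cardinality as 2 ** (maximum number of leading zeros seen)."""
    max_zeros = max(leading_zeros(hash32(e)) for e in elements)
    return 2 ** max_zeros


print(leading_zeros(645403841))   # Alice's hash from the slide: 2
print(leading_zeros(1492309842))  # Bob's hash from the slide: 1
```

Because only the maximum is kept, inserting the same element again never changes the estimate, which is what makes the approach suitable for distinct counting.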

How does HLL work? - Stochastic averaging
Measuring the same thing repeatedly and taking the average.

[Figure: data split into three partitions, each keeping its maximum number of leading zeros]
Partition 1: 7 → 2^7
Partition 2: 5 → 2^5
Partition 3: 12 → 2^12
Estimation: 228.968...

• First m bits decide the partition number
• Remaining bits are used to count leading zeros
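The partitioning step can be sketched as follows. The 32-bit hash width, the choice of 3 partition bits, and the helper names are assumptions for illustration. The estimate here is the raw "number of partitions times harmonic mean" value, which reproduces the 228.968... figure from the three-partition example; the bias-correction constant of the full algorithm is left out.

```python
import hashlib

HASH_BITS = 32      # assumed hash width
PARTITION_BITS = 3  # assumed: first 3 bits select one of 2 ** 3 = 8 partitions


def hash32(element: str) -> int:
    """Hypothetical stand-in hash: first 4 bytes of SHA-1 as an unsigned int."""
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:4], "big")


def split_hash(value: int):
    """Return (partition index, leading zeros of the remaining bits)."""
    rest_bits = HASH_BITS - PARTITION_BITS
    partition = value >> rest_bits
    rest = value & ((1 << rest_bits) - 1)
    return partition, rest_bits - rest.bit_length()


def build_registers(elements):
    """Keep the maximum number of leading zeros seen in each partition."""
    registers = [0] * (2 ** PARTITION_BITS)
    for element in elements:
        partition, zeros = split_hash(hash32(element))
        registers[partition] = max(registers[partition], zeros)
    return registers


def raw_estimate(registers):
    """Number of partitions times the harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


print(split_hash(645403841))        # Alice's hash: (1, 2), i.e. second partition
print(raw_estimate([7, 5, 12]))     # about 228.969
```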

Error rate of HLL
• Typical error rate: 1.04 / sqrt(number of partitions)
• Memory needed: number of partitions * log(log(max. value in hash space)) bits
• Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
• Memory vs. accuracy tradeoff
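The two formulas above are easy to evaluate directly. In the sketch below, 8192 partitions and a 64-bit hash space are assumptions chosen because they land close to the quoted figures; the slides do not say which parameters were used.

```python
import math


def typical_error(partitions: int) -> float:
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def memory_bytes(partitions: int, hash_bits: int) -> float:
    """partitions * log2(log2(max hash value)) bits, converted to bytes."""
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition / 8


# Assumed parameters: 2 ** 13 partitions, 64-bit hashes.
print(round(typical_error(8192) * 100, 2))  # error in percent
print(memory_bytes(8192, 64))               # bytes
```

With these assumed parameters the error is a little above 1% and the memory is exactly 6 KB. Quadrupling the number of partitions halves the error, which is the memory vs. accuracy tradeoff.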

Why does HLL work?
It turns out that a combination of lots of bad observations is a good observation.

Some interesting examples
[Figure: the same element, Alice, inserted many times into 8 partitions]
Element: Alice, hash: 645403841
Partition 1: 0 → 2^0
Partition 2: 2 → 2^2
…
Partition 8: 0 → 2^0
Harmonic mean: 1.103...

Some interesting examples
[Figure: a single element, Charlie, whose hash is 0]
Element: Charlie, hash: 0
Partition 1: 29 → 2^29
Partition 2: 0 → 2^0
…
Partition 8: 0 → 2^0
Harmonic mean: 1.142...
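The two examples can be checked numerically. The sketch below assumes 8 partitions, as the figures suggest, and compares the harmonic mean with the arithmetic mean to show why one extreme partition does not wreck the result.

```python
def harmonic_mean_of_estimates(registers):
    """Harmonic mean of the per-partition estimates 2 ** register."""
    return len(registers) / sum(2.0 ** -r for r in registers)


def arithmetic_mean_of_estimates(registers):
    """Arithmetic mean of the same per-partition estimates, for comparison."""
    return sum(2.0 ** r for r in registers) / len(registers)


alice = [0, 2, 0, 0, 0, 0, 0, 0]     # one partition saw 2 leading zeros
charlie = [29, 0, 0, 0, 0, 0, 0, 0]  # hash 0: 29 leading zeros in one partition

print(harmonic_mean_of_estimates(alice))      # about 1.103
print(harmonic_mean_of_estimates(charlie))    # about 1.143
print(arithmetic_mean_of_estimates(charlie))  # tens of millions
```

Even with a register of 29, the harmonic mean stays close to 1, while an arithmetic mean would be dominated by the single outlier.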

HLL in PostgreSQL
• https://github.com/citusdata/postgresql-hll

HLL in PostgreSQL
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
• Use hll_hash_bigint to hash elements.
  • There are similar functions for other common data types.
• Use hll_add_agg to aggregate hashed elements into an hll data structure.
• Use hll_cardinality to materialize an hll data structure into an actual distinct count.

Real Time Dashboard with HyperLogLog

What is Rollup?
Precomputed aggregates for a period of time and a set of dimensions.

What is Rollup?
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
);
CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
);

What is Rollup?
INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT(DISTINCT device_id) AS device_distinct_count,
    COUNT(DISTINCT session_id) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;
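The date_trunc expression in the query floors each timestamp to the start of its 5-minute bucket. The same arithmetic, written out in Python as a small sketch (the function name `floor_to_bucket` is hypothetical):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to_bucket(ts: datetime, bucket_seconds: int = 300) -> datetime:
    """Round `ts` down to the start of its bucket, counted from the epoch."""
    bucket = timedelta(seconds=bucket_seconds)
    whole_buckets = (ts - EPOCH) // bucket  # integer number of full buckets
    return EPOCH + whole_buckets * bucket


print(floor_to_bucket(datetime(2018, 10, 2, 12, 7, 30)))  # 2018-10-02 12:05:00
```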

Benefits of Rollup Tables
• Fast & indexed lookups of aggregates
• Avoid expensive repeated computations
• Rollups are compact (use less space) and can be kept over longer periods
• Rollups can be further aggregated

What if I want to get aggregation results for a 1 hour period?
SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    -- Problem: distinct counts of separate intervals cannot simply be summed;
    -- a device seen in several 5-minute buckets is counted several times.
    SUM(device_distinct_count) AS device_distinct_count,
    SUM(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;
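A tiny example with made-up device ids shows why adding up per-interval distinct counts overstates the hourly figure. The bucket labels and device names below are purely illustrative.

```python
# Hypothetical devices seen in three 5-minute buckets.
buckets = {
    "12:00": {"device_a", "device_b"},
    "12:05": {"device_b", "device_c"},
    "12:10": {"device_a"},
}

# Summing per-bucket distinct counts counts repeat visitors more than once.
summed = sum(len(devices) for devices in buckets.values())

# The true distinct count needs the underlying sets, not just their sizes.
actual = len(set().union(*buckets.values()))

print(summed, actual)  # 5 3
```

Exact merging requires keeping every id per bucket, which is the memory cost a rollup table is meant to avoid.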

Rollup Table with HLL
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll,
    minute timestamp
);

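INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
    hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;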

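SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    hll_union_agg(device_distinct_count) AS device_distinct_count,
    hll_union_agg(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;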

How to Merge COUNT(DISTINCT) with HLL
[Figure: Interval 1 split into three partitions]
Interval 1, Partition 1: 7
Interval 1, Partition 2: 5
Interval 1, Partition 3: 12
Intermediate result: HLL(7, 5, 12)

[Figure: Interval 2 split into three partitions]
Interval 2, Partition 1: 11
Interval 2, Partition 2: 7
Interval 2, Partition 3: 8
Intermediate result: HLL(11, 7, 8)

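[Figure: hll_union_agg keeps the larger value of each partition]
HLL(7, 5, 12) + HLL(11, 7, 8) → HLL(11, 7, 12)
Partition 1: 11 → 2^11
Partition 2: 7 → 2^7
Partition 3: 12 → 2^12
Estimation: 1053.257...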

[Figure: merging Interval 1 and Interval 2 partition by partition]
Interval 1 Partition 1 (7) + Interval 2 Partition 1 (11) → 11
Interval 1 Partition 2 (5) + Interval 2 Partition 2 (7) → 7
Interval 1 Partition 3 (12) + Interval 2 Partition 3 (8) → 12
Estimation: 1053.257...
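The merge shown in the figures amounts to a per-partition maximum. The following sketch reproduces the numbers for the two intervals; `merge` and `raw_estimate` are hypothetical helper names, and the estimate is the raw "partitions times harmonic mean" value without the bias correction used by full implementations.

```python
def merge(a, b):
    """Union of two HLL sketches: keep the larger register of each partition."""
    return [max(x, y) for x, y in zip(a, b)]


def raw_estimate(registers):
    """Number of partitions times the harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


interval_1 = [7, 5, 12]
interval_2 = [11, 7, 8]

merged = merge(interval_1, interval_2)
print(merged)                # [11, 7, 12]
print(raw_estimate(merged))  # about 1053.257
```

Since a register only stores a maximum, merging two sketches gives the same registers as building one sketch over both intervals, so nothing is double counted.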

• What if ...
• Without hll, you would have to maintain 2^n - 1 rollup tables to cover all combinations of n columns (multiply this by the number of time intervals).
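The 2^n - 1 figure is the number of non-empty subsets of the dimension columns. As a quick check, this sketch enumerates them for the four dimensions of the example rollup table:

```python
from itertools import combinations

columns = ["customer_id", "event_type", "country", "browser"]

# Every non-empty combination of columns would need its own rollup table.
subsets = [
    combo
    for size in range(1, len(columns) + 1)
    for combo in combinations(columns, size)
]

print(len(subsets))  # 15, i.e. 2 ** 4 - 1
```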

What Happens in Distributed Scenario?

postgresql-hll in distributed environment
1. Separate data into shards.
events_001 events_002 events_003

2. Put shards into separate nodes.
Coordinator
Worker Node 1: events_001
Worker Node 2: events_002
Worker Node 3: events_003

3. For each shard, calculate hll (but do not materialize).
Shard 1, Partition 1: 7
Shard 1, Partition 2: 5
Shard 1, Partition 3: 12
Intermediate result: HLL(7, 5, 12)

4. Pull intermediate results to a single node.
Worker Node 1 (events_001): HLL(6, 4, 11)
Worker Node 2 (events_002): HLL(10, 6, 7)
Worker Node 3 (events_003): HLL(7, 12, 5)
→ Coordinator

5. Merge separate hll data structures and materialize them.
HLL(11, 7, 8) + HLL(7, 5, 12) + HLL(8, 13, 6) → HLL(11, 13, 12)
Partition 1: 11 → 2^11
Partition 2: 13 → 2^13
Partition 3: 12 → 2^12
Estimation: 10532.571...
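The five steps can be simulated end to end in plain Python. Everything below is an illustrative sketch: the shard contents are made up, `hash32` is a hypothetical SHA-1 based stand-in for the real hash function, and the 32-bit width and 3 partition bits are assumptions. It checks that merging per-shard sketches gives the same registers as sketching all the data on one node.

```python
import hashlib
from functools import reduce

HASH_BITS = 32      # assumed hash width
PARTITION_BITS = 3  # assumed: 2 ** 3 = 8 partitions


def hash32(element) -> int:
    """Hypothetical stand-in hash: first 4 bytes of SHA-1 as an unsigned int."""
    digest = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def build_registers(elements):
    """Step 3: per-partition maximum of leading zeros (not materialized)."""
    rest_bits = HASH_BITS - PARTITION_BITS
    registers = [0] * (2 ** PARTITION_BITS)
    for element in elements:
        value = hash32(element)
        partition = value >> rest_bits
        rest = value & ((1 << rest_bits) - 1)
        zeros = rest_bits - rest.bit_length()
        registers[partition] = max(registers[partition], zeros)
    return registers


def merge(a, b):
    """Step 5: keep the larger register of each partition."""
    return [max(x, y) for x, y in zip(a, b)]


def raw_estimate(registers):
    """Materialize: partitions times the harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


# Steps 1-2: made-up shards with overlapping device ids.
shards = [range(0, 1000), range(500, 1500), range(1200, 2000)]
per_shard = [build_registers(shard) for shard in shards]  # step 3, on workers
merged = reduce(merge, per_shard)                         # steps 4-5, on coordinator
single_node = build_registers(set().union(*shards))
print(merged == single_node)  # True

# The numbers from the slide:
slide_result = reduce(merge, [[11, 7, 8], [7, 5, 12], [8, 13, 6]])
print(slide_result, raw_estimate(slide_result))  # [11, 13, 12], about 10532.571
```

Only the small register arrays travel to the coordinator, not the raw rows, which is why the intermediate results are cheap to pull to a single node.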
