HyperLogLog and friends

Simon Lia-Jonassen
FAST FHL 2020

 Counting distinct items
 Counting most frequent items
 Computing quantiles
 Computing joins
 …
 Memory-hungry
 May not parallelize well

Users visiting by day   Thursday   Friday
Site A                  7.0M       7.5M
Site B                  4.6M       4.4M
[Figure: sets labeled A Thu, A Fri, B Thu, B Fri]
 Were there 7.1M or 23.5M users in total?
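The question above comes down to distinct counts not being additive. A minimal sketch with made-up user IDs (the `visits` data is purely hypothetical and tiny on purpose):

```python
# Hypothetical toy data: the set of user IDs seen per (site, day).
visits = {
    ("A", "Thu"): {"u1", "u2", "u3"},
    ("A", "Fri"): {"u2", "u3", "u4"},
    ("B", "Thu"): {"u1", "u5"},
    ("B", "Fri"): {"u1", "u2"},
}

# Adding up per-cell counts counts returning users several times.
sum_of_counts = sum(len(users) for users in visits.values())

# The true total needs the union of the underlying sets.
distinct_total = len(set().union(*visits.values()))

print(sum_of_counts)   # 10
print(distinct_total)  # 5
```

Keeping the full sets around is exactly the memory-hungry part that sketches try to avoid.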

 Streamable
 Sub-linear in size
 Approximate with a predictable error
 Mergeable / additive
 Highly parallelizable
https://yahooeng.tumblr.com/post/135390948446/data-sketches

 Exact detection
 Use a list, hash set, or dictionary-sized bit set
 Requires linear space
 Parallelizing requires partitioning or a shared dictionary

 Bloom filter
 Uses an m-bit vector (sub-linear) and k mutually independent hash functions
 Update: set the k bits; Query: check whether all k bits are set
 False-positive probability is known
 Can merge multiple filters
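A minimal Bloom filter sketch of the bullets above. The class and method names are hypothetical, and deriving the k hash functions by salting BLAKE2b is an assumption made for illustration, not something the slides prescribe.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter with m bits and k salted hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as an m-bit vector

    def _positions(self, item: str):
        # Assumption: k "independent" hashes are simulated with k different salts.
        for i in range(self.k):
            digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Filters with the same m and k merge with a bitwise OR.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share m and k")
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged


def expected_fpp(m: int, k: int, n: int) -> float:
    """Standard approximation of the false-positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


a = BloomFilter(m=1024, k=3)
b = BloomFilter(m=1024, k=3)
a.add("alice")
b.add("bob")
both = a.merge(b)
print(both.might_contain("alice"), both.might_contain("bob"))  # True True
```

Inserted items are always reported as present; only absent items can be misreported, at a rate that `expected_fpp` approximates.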

 Linear Counting
 Hash each item into a bitmap of b < n bits (assuming we know n)
 Update – set the item's bit to 1
 Query – estimate from the number of unset bits
http://dblab.kaist.ac.kr/Prof/pdf/ACM90_TODS_v15n2.pdf
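A small sketch of Linear Counting, assuming the usual estimator n ≈ -b · ln(unset / b). The function name `linear_count` and the use of BLAKE2b as the hash are illustrative assumptions.

```python
import hashlib
import math


def linear_count(items, b: int) -> float:
    """Estimate the number of distinct items with a b-slot bitmap."""
    bitmap = bytearray(b)  # one byte per bit, for readability rather than compactness
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        bitmap[int.from_bytes(digest, "big") % b] = 1  # Update: set the bit
    unset = b - sum(bitmap)
    if unset == 0:
        raise ValueError("bitmap is full; choose a larger b")
    # Query: the fraction of unset bits determines the estimate.
    return -b * math.log(unset / b)


# 50 000 stream elements, 10 000 of them distinct, bitmap smaller than n.
estimate = linear_count((i % 10_000 for i in range(50_000)), b=4096)
print(round(estimate))
```

The bitmap is still proportional to n, just with a small constant, which is why the later sketches go further.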

 Let’s have a look at hash(x) using just 4 bits
Share of hash values with at least k leading 0's:
 1 – 50% (1 in 2)
 2 – 25% (1 in 4)
 3 – 12.5% (1 in 8)
 4 – 6.25% (1 in 16)
 We expect to need about 8 random distinct elements before one of them hashes to 3 or more leading 0's.
 So if the maximum number of leading 0's seen so far is k, we estimate about 2^k distinct elements.
 What if we hit 0000 early?
0000 0100 1000 1100
0001 0101 1001 1101
0010 0110 1010 1110
0011 0111 1011 1111
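The table and the 2^k rule can be checked with a few lines. This is only a sketch: the helper names are hypothetical and BLAKE2b stands in for whatever hash function is actually used.

```python
import hashlib


def leading_zeros(value: int, width: int) -> int:
    """Number of leading 0 bits of `value` viewed as a `width`-bit word."""
    return width - value.bit_length()


# Reproduce the 4-bit table: share of the 16 values with at least k leading 0's.
shares = {k: sum(leading_zeros(v, 4) >= k for v in range(16)) / 16 for k in range(1, 5)}
print(shares)  # {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625}


def estimate_from_max_zeros(items, width: int = 32) -> int:
    """Single-counter estimate: 2^k for the largest run of leading 0's seen."""
    k_max = 0
    for item in items:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") >> (64 - width)  # keep `width` bits
        k_max = max(k_max, leading_zeros(h, width))
    return 2 ** k_max


# Always a power of two, and a single lucky hash can inflate it a lot.
print(estimate_from_max_zeros(range(1000)))
```

One unlucky early hit such as 0000 fixes the estimate at its maximum, which is the weakness the next slides address.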

 What if we hit 0000 early?
 We could use many independent hash functions.
 LogLog
 Use m different buckets
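 log m is the number of hash bits used to select the bucket
 log log H is the maximum number of bits needed per counter
 Estimate using 2^k_avg, where k_avg averages the per-bucket maxima, scaled by m and a bias-correction constant
 Standard error is 1.30/sqrt(m)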
https://blog.devartis.com/hyperloglogs-a-probabilistic-way-of-obtaining-a-sets-cardinality-3b5e6a982a12
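A compact LogLog sketch under stated assumptions: a 64-bit BLAKE2b hash, m = 2^log_m buckets, and the asymptotic bias constant 0.39701 from the LogLog paper. Function and variable names are illustrative only.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic bias-correction constant (LogLog paper)


def hash64(item) -> int:
    return int.from_bytes(hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big")


def loglog_estimate(items, log_m: int = 10) -> float:
    m = 1 << log_m
    rest_bits = 64 - log_m
    buckets = [0] * m
    for item in items:
        h = hash64(item)
        j = h >> rest_bits                     # first log_m bits select the bucket
        w = h & ((1 << rest_bits) - 1)         # remaining bits feed the counter
        rank = rest_bits - w.bit_length() + 1  # leading 0's + 1 (position of first 1-bit)
        buckets[j] = max(buckets[j], rank)
    k_avg = sum(buckets) / m                   # arithmetic mean of the bucket maxima
    return ALPHA_INF * m * 2 ** k_avg


est = loglog_estimate(range(100_000), log_m=10)
print(round(est))  # expected standard error is about 1.30 / sqrt(1024), roughly 4%
```

Splitting one hash into a bucket index and a counter value replaces the many independent hash functions suggested above.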

 HyperLogLog
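 Uses the harmonic mean of the per-bucket estimates (instead of LogLog's arithmetic mean of the exponents), plus range corrections
 Resulting in a standard error of 1.04/sqrt(m)
 Requires only about 64% of the memory (about 36% less) to match LogLog's accuracy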
 Variants:
 HyperLogLog++ (Google)
 Improves memory usage and estimation accuracy for small cardinalities
 Java-HLL
 Uses different representations for empty, explicit, sparse and full estimator sets.
Live Demo: http://content.research.neustar.biz/blog/hll.html
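A minimal HyperLogLog sketch following the original HLL paper's estimator. It assumes a 64-bit BLAKE2b hash (so no large-range correction is applied) and log_m >= 7 so that the closed-form alpha applies. The class name and API are hypothetical and do not match any of the libraries listed here.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, log_m: int = 12) -> None:
        self.log_m = log_m
        self.m = 1 << log_m
        self.registers = [0] * self.m

    def add(self, item) -> None:
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        rest_bits = 64 - self.log_m
        j = h >> rest_bits                     # bucket index
        w = h & ((1 << rest_bits) - 1)         # remaining bits
        rank = rest_bits - w.bit_length() + 1  # leading 0's + 1
        if rank > self.registers[j]:
            self.registers[j] = rank

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)       # bias constant, valid for m >= 128
        # Harmonic mean of 2^register across buckets, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)     # small range: fall back to linear counting
        return raw


hll = HyperLogLog(log_m=12)
for i in range(300_000):
    hll.add(i % 100_000)                       # 100 000 distinct values, each seen 3 times
estimate = hll.count()
print(round(estimate))  # expected standard error is about 1.04 / sqrt(4096), roughly 1.6%
```

The harmonic mean dampens the effect of a few buckets with unusually large ranks, which is where the improvement over LogLog comes from.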

 Unions:
 Can merge any number of estimators.
 Intersections:
 Inclusion-exclusion principle.
 Accuracy is tricky: the error scales with the size of the union, so small intersections of large sets are unreliable.
https://cloud.google.com/blog/products/data-analytics/using-hll-speed-count-distinct-massive-datasets
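Unions and inclusion-exclusion can be sketched on top of plain register lists. This is an illustrative toy with the same assumptions as a basic HyperLogLog (64-bit BLAKE2b hash, 4096 registers); all function names are hypothetical.

```python
import hashlib
import math

LOG_M = 12
M = 1 << LOG_M
REST_BITS = 64 - LOG_M


def sketch(items) -> list:
    """Build HLL registers for a stream of items."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big")
        j, w = h >> REST_BITS, h & ((1 << REST_BITS) - 1)
        regs[j] = max(regs[j], REST_BITS - w.bit_length() + 1)
    return regs


def count(regs) -> float:
    raw = 0.7213 / (1 + 1.079 / M) * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    return M * math.log(M / zeros) if raw <= 2.5 * M and zeros else raw


def union(a, b) -> list:
    # Merging is lossless: take the register-wise maximum.
    return [max(x, y) for x, y in zip(a, b)]


def intersection_size(a, b) -> float:
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|; the errors add up.
    return count(a) + count(b) - count(union(a, b))


a = sketch(range(0, 60_000))
b = sketch(range(40_000, 100_000))
print(round(count(union(a, b))))         # true union size: 100 000
print(round(intersection_size(a, b)))    # true intersection size: 20 000
```

The merged registers are identical to those of a sketch built over both streams at once, while the intersection inherits the absolute error of three separate estimates.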

 A brief comparison between StreamLib (S-) and Java-HLL (J-HLL) methods
 See http://s-j.github.io/hyperloglog/ (Feb 2014) for more numbers and details
 3 765 844 tokens
 2 074 012 unque keys - Sets.newHashSet(): 1195 ms
 (S- parameters were picked for 1% error with 10 mil keys)
method                               % error       size      time
S-LinearCounting                        0.17  137 073 B  1 217 ms
S-LogLog (logm=14)                      1.35   16 384 B    963 ms
S-HLL (logm=13)                         1.81    5 472 B  1 000 ms
S-HLL++ (logm=13)                      -0.81    5 473 B    863 ms
J-HLL (logm=12 regw=5 Full Auto)       -0.76    2 563 B    500 ms
J-HLL (logm=10 regw=5 Sparse Auto)     -2.27      643 B    570 ms

 Applications:
 Stream processing
 Distributed processing
 Batch processing
 Frameworks:
 Postgres, Hadoop, Presto, Redis, Druid …
 Kusto – dcount, dcount_hll, …
 Griffin – CardinalityEstimation
 An interesting open question:
 What about user retention?
https://clevertap.com/blog/cohort-analysis-user-retention/
https://yahooeng.tumblr.com/post/135390948446/data-sketches

THANK YOU!
Code:
 https://github.com/microsoft/CardinalityEstimation
 https://github.com/aggregateknowledge/java-hll
 https://github.com/addthis/stream-lib
 https://datasketches.apache.org/
Blog posts:
 http://s-j.github.io/hyperloglog/
 https://blog.devartis.com/hyperloglogs-a-probabilistic-way-of-obtaining-a-sets-cardinality-3b5e6a982a12
 https://medium.com/@vinodhinic/hyperloglog-probabilistic-algorithm-330ecbbc686c
 https://odino.org/my-favorite-data-structure-hyperloglog/
Papers:
 LinCnt: http://dblab.kaist.ac.kr/Prof/pdf/ACM90_TODS_v15n2.pdf
 LogLog: https://link.springer.com/chapter/10.1007/978-3-540-39658-1_55
 HLL: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
 HLL++: https://stefanheule.com/papers/edbt13-hyperloglog.pdf
