tech talk @ ferret
Andrii Gakhov
PROBABILISTIC DATA STRUCTURES
ALL YOU WANTED TO KNOW BUT WERE AFRAID TO ASK
PART 2: CARDINALITY
CARDINALITY
Agenda:
▸ Linear Counting
▸ LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
THE PROBLEM
• To determine the number of distinct elements, also called the cardinality, of a large set of elements where duplicates are present.
Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets.
LINEAR COUNTING
LINEAR COUNTING: ALGORITHM
• A linear counter is a bit map (hash table) of size m, with all bits set to 0 at the beginning.
• The algorithm consists of a few steps:
  • for every element, compute its hash value and set the corresponding bit to 1
  • calculate the fraction V of empty bits in the structure (divide the number of empty bits by the bit map size m)
  • estimate the cardinality as n ≈ -m ln V
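The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: the class name `LinearCounter` is made up for the example, and SHA-256 from the standard library stands in for whatever hash function is actually used.

```python
import hashlib
import math


class LinearCounter:
    """Minimal linear counter: a bit map of size m plus the -m ln V estimator."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m  # all bits start at 0

    def _index(self, item):
        # Assumption: SHA-256 is used only as a convenient uniform hash.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Duplicates hit the same bit, so they do not change the state.
        self.bits[self._index(item)] = 1

    def estimate(self):
        empty = self.bits.count(0)
        if empty == 0:
            raise ValueError("bit map is full; the estimate is undefined")
        v = empty / self.m  # fraction of empty bits
        return -self.m * math.log(v)


counter = LinearCounter(8192)
for i in range(1000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")  # repeated element
print(round(counter.estimate()))
```

The estimate degrades as the bit map fills up, so m has to be chosen with the expected cardinality in mind.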
LINEAR COUNTING: EXAMPLE
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
• Consider a linear counter with 16 bits (m = 16)
• Consider MurmurHash3 as the hash function h (to get the bit index, we take the hash value modulo 16)
• Stream of 10 elements: “bernau”, “bernau”, “bernau”, “berlin”, “kiev”, “kiev”, “new york”, “germany”, “ukraine”, “europe” (NOTE: the real cardinality is n = 7)
h(“bernau”) = 4, h(“berlin”) = 4, h(“kiev”) = 6, h(“new york”) = 6,
h(“germany”) = 14, h(“ukraine”) = 7, h(“europe”) = 9
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  1  0  1  1  0  1  0  0  0  0  1  0
LINEAR COUNTING: EXAMPLE
number of empty bits: 11
m = 16
V = 11 / 16 = 0.6875
index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
bit:    0  0  0  0  1  0  1  1  0  1  0  0  0  0  1  0
• The cardinality estimate is
n ≈ -16 * ln(0.6875) ≈ 5.995
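The arithmetic of this example can be replayed directly. In the sketch below the bit indices are copied from the slide rather than recomputed with MurmurHash3, so the `indices` table is an assumption taken at face value.

```python
import math

m = 16
# Bit indices exactly as listed on the slide (not recomputed here).
indices = {
    "bernau": 4, "berlin": 4, "kiev": 6, "new york": 6,
    "germany": 14, "ukraine": 7, "europe": 9,
}
stream = ["bernau", "bernau", "bernau", "berlin", "kiev", "kiev",
          "new york", "germany", "ukraine", "europe"]

bits = [0] * m
for item in stream:
    bits[indices[item]] = 1

empty = bits.count(0)            # 11
v = empty / m                    # 0.6875
estimate = -m * math.log(v)      # about 5.995
print(empty, v, round(estimate, 3))
```

Two pairs of elements collide (“bernau”/“berlin” on bit 4, “kiev”/“new york” on bit 6), so only 5 bits are set for 7 distinct elements and the estimate lands slightly below the true value.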
LINEAR COUNTING: READ MORE
• http://dblab.kaist.ac.kr/Prof/pdf/Whang1990(linear).pdf
• http://www.codeproject.com/Articles/569718/CardinalityplusEstimationplusinplusLinearplusTimep
HYPERLOGLOG
HYPERLOGLOG: INTUITION
• The cardinality of a multiset of uniformly distributed numbers can be estimated from the maximum number of leading zeros in the binary representations of the numbers. If that maximum is k, then the number of distinct elements in the set is roughly $2^k$.
rank = number of leading zeros + 1, e.g. rank(f) = 3 for
f = 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0   (2 leading zeros, first 1 at position 3)
$P(\text{rank} = 1) = 1/2$: probability that a binary representation starts with 1
$P(\text{rank} = 2) = 1/2^2$: probability that a binary representation starts with 01
…
$P(\text{rank} = k) = 1/2^k$
• Therefore, among $2^k$ binary representations we expect to find about one representation with rank = k
• If we remember the maximal rank we have seen and it is equal to k, then we can use $2^k$ as an approximation of the number of elements
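A small sketch of the rank function makes the probabilities concrete. The helper name `rank` and the explicit `width` parameter are choices made for this example. Enumerating every 8-bit value shows that exactly a fraction 1/2^k of them has rank k.

```python
from collections import Counter


def rank(value, width):
    """Number of leading zeros in a width-bit value, plus one."""
    if value == 0:
        return width + 1  # no 1-bit at all
    return width - value.bit_length() + 1


f = 0b0010001000100000  # the 16-bit pattern shown on the slide
print(rank(f, 16))

WIDTH = 8
counts = Counter(rank(v, WIDTH) for v in range(2 ** WIDTH))
fractions = {k: counts[k] / 2 ** WIDTH for k in range(1, WIDTH + 1)}
print(fractions)
```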
HYPERLOGLOG
• proposed by Flajolet et al., 2007
• an extension of the Flajolet–Martin algorithm (1985)
• HyperLogLog is described by 2 parameters:
  • p: the number of hash bits used to select a bucket for averaging ($m = 2^p$ is the number of buckets/substreams)
  • h: a hash function that produces uniformly distributed hash values
• The HyperLogLog algorithm is able to estimate cardinalities of $> 10^9$ with a typical error rate of 2%, using 1.5 kB of memory (Flajolet, P. et al., 2007).
HYPERLOGLOG: ALGORITHM
• HyperLogLog uses randomization to approximate the cardinality of a multiset. This randomization is achieved by using a hash function h.
• Observe the maximum number of leading zeros over all hash values:
  • If the bit pattern $0^{L-1}1$ is observed at the beginning of a hash value (so rank = L), then a good estimate of the size of the multiset is $2^L$.
HYPERLOGLOG: ALGORITHM
• Stochastic averaging is used to reduce the large variability:
  • The input stream of data elements S is divided into m substreams $S_i$ using the first p bits of the hash values ($m = 2^p$).
  • In each substream, the rank (after the initial p bits that are used for substreaming) is measured independently.
  • These numbers are kept in an array of registers M, where M[i] stores the maximum rank seen so far for the substream with index i.
• The cardinality estimate is computed as the normalized, bias-corrected harmonic mean of the estimates on the substreams:
$$DV_{HLL} = \text{const}(m) \cdot m^2 \cdot \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$
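The register update and the raw estimator can be sketched as follows. This is an illustrative sketch under stated assumptions, not a production implementation: the class name is made up, the first 64 bits of SHA-256 stand in for the hash function h, and const(m) is taken as 0.7213 / (1 + 1.079 / m), the constant given by Flajolet et al. (2007) for m ≥ 128. No small-range or large-range correction is applied.

```python
import hashlib


class HyperLogLog:
    """Raw HyperLogLog estimator with m = 2**p registers (no corrections)."""

    HASH_BITS = 64  # assumption: 64-bit hash values

    def __init__(self, p):
        if not 7 <= p <= 16:
            raise ValueError("this sketch assumes 7 <= p <= 16 (m >= 128)")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash(item):
        # Assumption: SHA-256 truncated to 64 bits as a uniform hash.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash(item)
        rest_bits = self.HASH_BITS - self.p
        j = h >> rest_bits                   # first p bits: register index
        w = h & ((1 << rest_bits) - 1)       # remaining bits
        r = rest_bits - w.bit_length() + 1   # rank = leading zeros + 1
        if r > self.registers[j]:
            self.registers[j] = r

    def estimate(self):
        const_m = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)  # sum over all m registers
        return const_m * self.m * self.m / z


hll = HyperLogLog(12)
for i in range(100_000):
    hll.add(f"item-{i}")
print(round(hll.estimate()))
```

Empty registers contribute 2^0 = 1 to the sum, which is why an empty sketch returns roughly 0.7m instead of 0.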
HYPERLOGLOG: EXAMPLE
• Consider a hash function h with L = 8 bits
• Let’s use p = 3 bits to define a bucket (then $m = 2^3 = 8$ buckets).
index:  0 1 2 3 4 5 6 7
M:      0 0 0 0 0 0 0 0
• Index elements “berlin” and “ferret”:
h(“berlin”) = 0110111   h(“ferret”) = 1100011
• Define buckets and calculate values to store (use the first p = 3 bits for buckets and the remaining L - p = 5 bits for ranks):
  • bucket(“berlin”) = 011 = 3   value(“berlin”) = rank(0111) = 2
  • bucket(“ferret”) = 110 = 6   value(“ferret”) = rank(0011) = 3
index:  0 1 2 3 4 5 6 7
M:      0 0 0 2 0 0 3 0
HYPERLOGLOG: EXAMPLE
• Index element “kharkov”:
  • h(“kharkov”) = 1100001
  • bucket(“kharkov”) = 110 = 6   value(“kharkov”) = rank(0001) = 4
  • M[6] = max(M[6], 4) = max(3, 4) = 4
index:  0 1 2 3 4 5 6 7
M:      0 0 0 2 0 0 4 0
• Estimate the cardinality by the HLL formula (C ≈ 0.66), with the sum taken over the two non-empty registers:
$$DV_{HLL} \approx 0.66 \cdot 8^2 / (2^{-2} + 2^{-4}) = 0.66 \cdot 204.8 \approx 135 \neq 3$$
NOTE: For small cardinalities HLL has a strong bias!
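The example can be replayed in a few lines. The hash strings below are copied as printed on the slide (7 bits are shown for each), and C = 0.66 is the constant quoted there. The helper name `bucket_and_rank` is made up for this sketch. The slide's arithmetic sums only the two non-empty registers. For comparison, the sketch also evaluates the sum over all m = 8 registers, where each empty register contributes 2^0 = 1.

```python
def bucket_and_rank(bits, p):
    """Split a hash bit string into (bucket index, rank of the remaining bits)."""
    bucket = int(bits[:p], 2)
    rest = bits[p:]
    first_one = rest.find("1")
    rank = len(rest) + 1 if first_one == -1 else first_one + 1
    return bucket, rank


# Hash values as printed on the slide (assumption: taken at face value).
hashes = {"berlin": "0110111", "ferret": "1100011", "kharkov": "1100001"}
p = 3
m = 2 ** p
M = [0] * m
for bits in hashes.values():
    j, r = bucket_and_rank(bits, p)
    M[j] = max(M[j], r)

C = 0.66
non_empty_sum = sum(2.0 ** -r for r in M if r > 0)  # 2^-2 + 2^-4
all_registers_sum = sum(2.0 ** -r for r in M)       # adds 6 empty registers

slide_estimate = C * m * m / non_empty_sum          # about 135
full_estimate = C * m * m / all_registers_sum       # about 6.7
print(M, round(slide_estimate, 1), round(full_estimate, 1))
```

Either way the result is well above the true cardinality of 3, which is the bias the note on the slide refers to.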
HYPERLOGLOG: PROPERTIES
• Memory requirement doesn't grow linearly with L (unlike MinCount or Linear Counting). For a hash function of L bits and precision p, the required memory is:
$$\lceil \log_2 (L + 1 - p) \rceil \cdot 2^p \text{ bits}$$
  • the original HyperLogLog uses 32-bit hash codes, which requires $5 \cdot 2^p$ bits
• It’s not necessary to calculate the full hash code for the element
  • the first p bits and the number of leading zeros of the remaining bits are enough
• There is no evidence that any of the popular hash functions (MD5, SHA-1, SHA-256, Murmur3) performs significantly better than the others.
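The memory formula is easy to evaluate for concrete parameters. The function name `hll_memory_bits` is chosen for this sketch, and the parameter combinations are just examples.

```python
import math


def hll_memory_bits(hash_bits, p):
    """Bits needed for 2**p registers, each wide enough to hold the largest rank."""
    register_bits = math.ceil(math.log2(hash_bits + 1 - p))
    return register_bits * 2 ** p


for hash_bits, p in [(32, 12), (32, 14), (64, 14)]:
    bits = hll_memory_bits(hash_bits, p)
    print(hash_bits, p, bits, "bits =", bits // 8, "bytes")
```

With 32-bit hashes every register needs 5 bits for the usual range of p. With 64-bit hashes and p = 14 the registers need 6 bits each, which gives 12288 bytes, consistent with the 12 KB per key quoted for Redis later in the deck.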
HYPERLOGLOG: PROPERTIES
• The standard error can be estimated as:
$$\sigma = \frac{1.04}{\sqrt{2^p}}$$
so, if we use 16 bits (p = 16) for bucket indices, we get a standard error of 0.40625%
• The algorithm has a large error for small cardinalities.
  • For instance, for n = 0 the algorithm always returns roughly 0.7m
  • To achieve better estimates for small cardinalities, use Linear Counting below a threshold of 5m/2
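Both points can be sketched in code. The function names are made up for this example. The fallback follows the rule described by Flajolet et al. (2007): when the raw estimate is at most 5m/2 and at least one register is still empty, treat the registers as a linear counter and return m ln(m / V), where V here is the number of empty registers.

```python
import math


def standard_error(p):
    """Relative standard error of HyperLogLog with m = 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def corrected_estimate(raw_estimate, registers):
    """Apply the small-range correction: linear counting below 5m/2."""
    m = len(registers)
    empty = registers.count(0)
    if raw_estimate <= 2.5 * m and empty > 0:
        return m * math.log(m / empty)  # same as -m ln(empty / m)
    return raw_estimate


print(f"{standard_error(16):.5%}")
print(f"{standard_error(14):.5%}")
# Registers of an 8-bucket sketch holding three elements, raw estimate about 6.7.
print(round(corrected_estimate(6.69, [0, 0, 0, 2, 0, 0, 4, 0]), 2))
```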
HYPERLOGLOG: APPLICATIONS
• PFCOUNT in Redis returns the approximated cardinality computed by the HyperLogLog data structure (http://antirez.com/news/75)
• The Redis implementation uses 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach $2^{64}$ items
HYPERLOGLOG: READ MORE
• http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf
• http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
• https://stefanheule.com/papers/edbt13-hyperloglog.pdf
• https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
• https://hal.archives-ouvertes.fr/file/index/docid/465313/filename/sliding_HyperLogLog.pdf
• http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
HYPERLOGLOG++
HYPERLOGLOG++
• proposed by Stefan Heule et al., 2013 for Google PowerDrill
• an improved version of HyperLogLog (Flajolet et al., 2007)
• HyperLogLog++ is described by 2 parameters:
  • p: the number of hash bits used to select a bucket for averaging ($m = 2^p$ is the number of buckets/substreams)
  • h: a hash function that produces uniformly distributed hash values
• The HyperLogLog++ algorithm is able to estimate cardinalities of ~ $7.9 \cdot 10^9$ with a typical error rate of 1.625%, using 2.56 KB of memory (Micha Gorelick and Ian Ozsvald, High Performance Python, 2014).
HYPERLOGLOG++: IMPROVEMENTS
• use a 64-bit hash function
  • an algorithm that only uses the hash code of the input values is limited by the number of bits of the hash codes when it comes to accurately estimating large cardinalities
  • in particular, a hash function of L bits can distinguish at most $2^L$ different values, and as the cardinality n approaches $2^L$, hash collisions become more and more likely and accurate estimation becomes impossible
  • with 64-bit hashes, collisions become a problem only if the cardinality approaches $2^{64} \approx 1.8 \cdot 10^{19}$
• bias correction
  • the original algorithm overestimates the real cardinality for small sets, and most of that error is due to bias, which can be estimated and corrected
• storage efficiency
  • uses different encoding strategies for hash values, variable-length encoding for integers, difference encoding
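The effect of the hash width can be made tangible with a back-of-the-envelope calculation. The sketch below is not from the HyperLogLog++ paper. It uses the standard balls-into-bins approximation: hashing n distinct items uniformly into 2^L values yields about 2^L · (1 − e^(−n / 2^L)) distinct hash values. The function name is made up for the example.

```python
import math


def expected_distinct_hashes(n, hash_bits):
    """Approximate number of distinct hash values for n distinct inputs."""
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -space * math.expm1(-n / space)


n = 10 ** 9
for hash_bits in (32, 64):
    ratio = expected_distinct_hashes(n, hash_bits) / n
    print(hash_bits, f"{ratio:.6f}")
```

Under this approximation, a billion distinct items hashed to 32 bits collapse to roughly 89% as many distinct hash values, so any estimator working on those hashes undercounts. With 64 bits the loss is negligible at this scale.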
HYPERLOGLOG++ VS HYPERLOGLOG
• accuracy is significantly better for a large range of cardinalities and equally good on the rest
• sparse representation allows for a more adaptive use of memory
  • if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory
• For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms.
• 64-bit hash codes allow the algorithm to estimate cardinalities well beyond 1 billion
HYPERLOGLOG++: APPLICATIONS
• the cardinality metric in Elasticsearch is based on the HyperLogLog++ algorithm for big cardinalities (adaptive counting)
• Apache DataFu, a collection of libraries for working with large-scale data in Hadoop, has an implementation of the HyperLogLog++ algorithm
HYPERLOGLOG++: READ MORE
• http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
• https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
▸ @gakhov
▸ linkedin.com/in/gakhov
▸ www.datacrucis.com
THANK YOU
