Probabilistic data structures in real life

PROBABILISTIC DATA
STRUCTURES IN REAL LIFE
Valentin Bazarevsky

WHO ARE THEY?
Bloom Filter
LogLog Family
MinHash

BUSINESS CASE:
ESTIMATE YOUR AUDIENCE

SEGMENT BUILDER
15 TB of transactional data
4h SLA

POSSIBLE SOLUTIONS
Brute force (15 TB of transactional data)
Sampling (1% of users => 1.2 MB / b.o.)
Magic tool (?!)
Estimator
HyperLogLog can estimate the cardinality of sets with > 1 000 000 000
unique elements with 1% error, and requires only 4 KB of memory
50 000 000 basic operations

OOPS…
Supports only Unions
But we need Intersection, Subtraction and NOT operators
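
A minimal sketch of why union is the one operation HyperLogLog gives for free: two sketches merge by taking the register-wise maximum. The class name, the SHA-1 based hash and p = 12 (4096 registers, which at one byte each is 4 KB) are assumptions made for illustration; the slides do not state the actual parameters or library used.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an assumption; any well-mixed hash works).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # rank = 1-based position of the first 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: keep the register-wise maximum."""
        assert self.p == other.p
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


a, b = HyperLogLog(), HyperLogLog()
for i in range(0, 60000):
    a.add(i)
for i in range(40000, 100000):
    b.add(i)

union_estimate = a.merge(b).estimate()   # true union size is 100 000
print(round(union_estimate))
```

Nothing comparable exists for intersection or subtraction: the registers only remember maxima, so there is no register-wise rule that recovers the elements two sets share.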

HYPERLOGLOG INTUITION
00101010101010001111010101101 => a[2] = 0
10010101010100101010101001011 => a[9] = 1
00000101010100101010101110101 => a[0] = 1
01010101010100100101010101010 => a[5] = 1
01010000000000000000000000010 => a[5] = 23
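
The five lines above can be replayed in a few lines of code, assuming (as the numbers suggest) that the first 4 bits of each hash select the register and the register keeps the largest count of leading zeros seen in the remaining bits. The function name and the `index_bits` parameter are hypothetical.

```python
def update_registers(hashes, index_bits=4):
    """Replay HyperLogLog register updates on hashes given as bit strings."""
    registers = {}
    for bits in hashes:
        index = int(bits[:index_bits], 2)        # leading bits pick the register
        rest = bits[index_bits:]                 # remaining bits
        leading_zeros = len(rest) - len(rest.lstrip("0"))
        # a register only ever grows: keep the maximum seen so far
        registers[index] = max(registers.get(index, 0), leading_zeros)
    return registers


slide_hashes = [
    "00101010101010001111010101101",
    "10010101010100101010101001011",
    "00000101010100101010101110101",
    "01010101010100100101010101010",
    "01010000000000000000000000010",
]

registers = update_registers(slide_hashes)
print(registers)  # {2: 0, 9: 1, 0: 1, 5: 23}
```

A long run of leading zeros is rare, so seeing 23 of them in register 5 is evidence that many distinct hashes have passed through it.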

MINHASH
Store only the x (8192) smallest hashes of each set
Jaccard Distance
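
A minimal sketch of the bottom-k variant of MinHash, assuming a 64-bit hash derived from SHA-1 and k = 8192 (the x on the slide). The function names are hypothetical. The estimator returns the Jaccard similarity |A ∩ B| / |A ∪ B|; the Jaccard distance is one minus that value.

```python
import hashlib


def h64(item):
    """64-bit hash of an item (SHA-1 based; an assumption for illustration)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_k(items, k=8192):
    """Keep only the k smallest distinct hash values of a set."""
    return sorted({h64(x) for x in items})[:k]


def jaccard_estimate(sketch_a, sketch_b, k=8192):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # The k smallest hashes of the merged sketches form the sketch of A ∪ B.
    union_sketch = sorted(a | b)[:k]
    if not union_sketch:
        return 0.0
    # A hash in the union sketch belongs to A ∩ B iff both sketches hold it.
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)


sketch_a = bottom_k(range(0, 60000))
sketch_b = bottom_k(range(30000, 90000))

similarity = jaccard_estimate(sketch_a, sketch_b)   # true value is 1/3
distance = 1 - similarity
print(round(similarity, 3), round(distance, 3))
```

One way to turn this into an intersection size, which the slides do not spell out, is to multiply the estimated similarity by a union cardinality obtained from HyperLogLog.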

UNION OF INTERSECTIONS
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A - B - C = A - (B ∪ C)
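
Rewrites like these push unions inward, so a query becomes intersections and subtractions of unions, and unions are what the sketches handle cheaply. A quick check on small exact sets, with arbitrary example values:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 7}

# Distributive law: union over intersection
left = A | (B & C)
right = (A | B) & (A | C)
assert left == right

# Chained subtraction collapses into one subtraction of a union
chained = A - B - C
collapsed = A - (B | C)
assert chained == collapsed

print(sorted(left), sorted(chained))  # [1, 2, 3, 4, 6] [1, 2]
```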

I WANT EVERYONE EXCEPT…
A and not B
Not A and Not B

CORNER CASES
|(A ∪ not(B)) ∩ C| = |C| - |B ∩ C| + |A ∩ B ∩ C|
|A ∪ not(B)| = |Everything| - |B| + |A ∩ B|
|A ∩ not(B)| => |A| - |A ∩ B|
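
The NOT identities only need cardinalities of plain sets and of their intersections, plus the size of the whole audience. A small check on exact sets, where `everything` and the example values are made up for illustration:

```python
everything = set(range(1, 21))   # the whole audience, 20 users
A = {1, 2, 3, 4, 5, 6}
B = {4, 5, 6, 7, 8}


def complement(s):
    """NOT s, taken relative to the whole audience."""
    return everything - s


a_or_not_b = A | complement(B)
a_and_not_b = A & complement(B)
neither = complement(A) & complement(B)

# |A ∪ not(B)| = |Everything| - |B| + |A ∩ B|
assert len(a_or_not_b) == len(everything) - len(B) + len(A & B)
# |A ∩ not(B)| = |A| - |A ∩ B|
assert len(a_and_not_b) == len(A) - len(A & B)
# Not A and Not B = Everything - (A ∪ B)   (De Morgan)
assert neither == everything - (A | B)

print(len(a_or_not_b), len(a_and_not_b), len(neither))  # 18 3 12
```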

ERROR RATE
Median = 5%
75th percentile = 8%
