Sketch algorithms

1
Intro to Sketch Algorithms
19/10/2021

2
- Did this IP visit me before?
- How many unique IPs have we seen this month?
- How many times did I see this IP?
- What is the median transaction value? The top 1% value?
- What are the most common collections of available fonts?
Large Stream of Events

3
Can’t store all unique values in memory
Fixed memory

4
If we are willing to accept an arbitrarily low chance
of false positives, we can solve this problem with
Bloom Filters.
Did I see this value before?

5
Hash each value and turn on the bit for that hash
bucket.
Repeat with k different hash functions. To query,
check whether the bits for all k hash functions are set.
Some false positives, no false negatives.
Bloom Filter
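
A minimal Python sketch of the idea, assuming salted SHA-256 as the family of k hash functions. The class name, the default sizes, and the one-byte-per-bit layout are illustrative choices, not part of the slides.

```python
import hashlib


class BloomFilter:
    """Fixed-memory set membership: some false positives, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _indexes(self, value):
        # Assumption: salting SHA-256 with the hash number gives k hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for idx in self._indexes(value):
            self.bits[idx] = 1

    def might_contain(self, value):
        # True can be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indexes(value))
```

Usage: call add(ip) for every event, then might_contain(ip) to ask "did this IP visit me before?".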

6
If we hash all values, and calculate the minimum of
all hashes, what is the expected minimum value?
Cardinality estimation

7
Let hash(x) : X => [0,1] be uniformly pseudo-random.
E[min(hash(x))] = 1/(k+1), where k is the number of
distinct elements.
The minimum is an unbiased estimator of 1/(k+1), so k
can be estimated as 1/min - 1.
If we repeat with several different hash functions,
we can average the estimates of 1/(k+1) before inverting.
Cardinality estimation
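
A rough Python sketch of this estimator, assuming salted SHA-256 scaled to [0,1] as the hash family. The function names and the default of 256 hash functions are illustrative. It averages the per-hash minima and only then inverts.

```python
import hashlib


def unit_hash(value, seed):
    """Map a value to a pseudo-random float in [0, 1] (assumed hash family)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, num_hashes=256):
    # Keep one running minimum per hash function: fixed memory.
    minima = [1.0] * num_hashes
    for value in stream:
        for seed in range(num_hashes):
            h = unit_hash(value, seed)
            if h < minima[seed]:
                minima[seed] = h
    mean_min = sum(minima) / num_hashes  # estimates 1/(k+1)
    return 1.0 / mean_min - 1.0
```

Duplicates do not change the minima, so repeated values are counted once.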

8
Like a counting Bloom filter.
Hash the value and increment a counter at the hashed
index.
Use multiple hash functions, each with its own
table (column), and return the minimum of all estimates.
Produces a biased estimate: estimate >= actual.
How many times did we see this value?
count–min sketch
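
A minimal Python sketch of the structure, assuming salted SHA-256 as the hash family. The class name and the default width and depth are illustrative.

```python
import hashlib


class CountMinSketch:
    """Approximate counts in fixed memory; never underestimates."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # One counter table per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, value, table):
        # Assumption: salting SHA-256 with the table number gives distinct hashes.
        digest = hashlib.sha256(f"{table}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for t in range(self.depth):
            self.tables[t][self._index(value, t)] += count

    def estimate(self, value):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[t][self._index(value, t)] for t in range(self.depth))
```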

9
Naive - Sample and calculate on sample
Remedian - Calculate median of medians (of
medians…)
Median estimation
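
A small Python sketch of the median-of-medians idea. The class name, the buffer size of 11, and the weighted-median read-out for partially filled buffers are assumptions of this sketch. Levels are created on demand, so memory grows with the logarithm of the stream length.

```python
import statistics


class Remedian:
    def __init__(self, buffer_size=11):
        # An odd size makes each buffer median an actual stream value.
        self.buffer_size = buffer_size
        self.levels = [[]]  # levels[i] holds medians of full level i-1 buffers

    def add(self, x):
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([])
            self.levels[level].append(x)
            if len(self.levels[level]) < self.buffer_size:
                return
            # Buffer is full: collapse it to its median and push that one level up.
            x = statistics.median(self.levels[level])
            self.levels[level].clear()
            level += 1

    def estimate(self):
        # Weighted median: a value at level i stands for buffer_size**i inputs.
        weighted = sorted(
            (value, self.buffer_size ** i)
            for i, buf in enumerate(self.levels)
            for value in buf
        )
        half = sum(w for _, w in weighted) / 2
        running = 0
        for value, w in weighted:
            running += w
            if running >= half:
                return value
        raise ValueError("no data")
```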

10
Naive - sample and calculate quantile on sample
Sample and keep top K
Manku - maintain eps-approximate counts and
quantiles: keep counts of values in intervals, and
keep the intervals balanced.
Biased quantile estimators

11
An exact answer provably requires at least linear space, Ω(N).
Even finding the single most common value does.
Relax to the K-heavy-hitters problem: find all values with
frequency at least 1/K.
Approximate K-heavy-hitters: return all values with frequency
more than 1/K, and return no value with frequency below
1/K - epsilon.
What are the top K most frequent
values?

12
Initialize an empty Map m from elements to counters
def add(a)
  if m.contains(a) m(a) += 1
  else if m.size < k m(a) = 1
  else
    decrease all counters in m by 1
    remove any elements with count=0
Frequent algorithm
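
The pseudocode above translates almost line by line into Python. The function name and the choice to return the final counter map are illustrative. The surviving counters are lower bounds on the true counts, not exact frequencies.

```python
def frequent(stream, k):
    """Keep at most k counters; heavy hitters survive the decrements."""
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            # Decrease all counters by 1 and remove those that reach 0.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```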

14
Sampling K elements from a stream of N
Algorithm             Extra memory  Accurate results                     Materialized result
Shuffle and take      N elements    Yes                                  Yes
Reservoir             K elements    Yes                                  Yes
Indices reservoir     K indices     Yes                                  No
Independent sample    O(1)          Length not guaranteed                No
Accurate independent  O(1)          Slight correlation between elements  No
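
As an illustration of the reservoir row, here is a minimal Python sketch of the standard reservoir sampling algorithm (often called Algorithm R). The function name and the optional rng parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample k elements from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Keep the n-th element with probability k/n, evicting a random slot.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Extra memory is K elements regardless of N, and the sample is materialized at every point in the stream.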

15
variance = E[(x - E[x])^2]
         = E[x^2 - 2xE[x] + E[x]^2]
         = E[x^2] - 2E[x]E[x] + E[x]^2
         = E[x^2] - E[x]^2
stdev = sqrt(variance)
So a running count, sum, and sum of squares are enough to compute it in one pass.
STDEV streaming - accurate algorithm
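
A minimal Python sketch of the one-pass computation based on E[x^2] - E[x]^2. The class and field names are illustrative. It computes the population variance, and the subtraction can lose precision when the mean is large relative to the spread; Welford's algorithm is a common, more stable alternative.

```python
import math


class StreamingStdev:
    def __init__(self):
        self.n = 0           # number of values seen
        self.total = 0.0     # running sum of x
        self.total_sq = 0.0  # running sum of x^2

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def variance(self):
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean  # E[x^2] - E[x]^2

    def stdev(self):
        # Clamp at 0 to guard against tiny negative values from rounding.
        return math.sqrt(max(self.variance(), 0.0))
```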
