Probabilistic data structures

Yoav Chernobroda
CTO

Types of big data analytics computation
Offline:
• Batch processing
• Periodically over large data sets
• Scales through map-reduce
• One or more passes over all data
Near-line:
• Event based streaming
• Incremental computation
• Asynchronous processing
• Relaxed latency requirements
Online (real-time):
• Stream oriented
• Immediate response
• Strict latency requirements
• Scales through stateless instances

Why should you care?
Statistical mean:
• Easy to calculate over a data set of any size.
• Works well over streaming and partitioned data (map-reduce)
• Does not require moving data
Statistical median:
• Standard algorithms require sorting the data in place or using Quickselect.
• Data is moved around.
• Does not scale over partitioned data
• Can’t work over streaming data
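A minimal sketch of why the mean is easy on partitioned data: each partition can be reduced to a small (sum, count) pair, and those pairs merge exactly. No equally small summary yields an exact median. The partition contents below are made up for illustration.

```python
from functools import reduce

def partial(values):
    """Map step: summarize one partition as a (sum, count) pair."""
    return (sum(values), len(values))

def merge(a, b):
    """Reduce step: combine two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical partitions, e.g. one list per mapper.
partitions = [[1, 2, 3], [10, 20], [4]]

total, count = reduce(merge, map(partial, partitions))
mean = total / count  # identical to the mean over all values at once
```

The same merge step also works incrementally on a stream: keep a running (sum, count) and update it per event.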

Data sampling
Types of problems:
1. Train a model with a data set that is larger than can be processed in
memory
2. Interview question – how would you log a sample of p% of users from a
stream of events arriving from billions of mobile devices, identified by a
device ID?
3. How would you sample k elements from a very large stream of events?

Hashmod technique
1. Determine the percentage of the data that would fit in memory (e.g. 5%)
2. Set R = hash(device ID) mod 100 (prefer MurmurHash over the standard Java or Python hash)
3. If R < 5 then add to sample, else skip
A very simple and scalable technique for sampling a desired
percentage based on an ID that repeats in the stream (e.g.
mobile device ID)
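The steps above can be sketched as follows. This is an illustration, not the author's code: it uses SHA-256 from the standard library as a stable stand-in for MurmurHash (which needs a third-party package), and the device IDs are invented. Python's built-in hash() is avoided because string hashes are salted per process, so the same device would not be sampled consistently across runs.

```python
import hashlib

def in_sample(device_id: str, percent: int = 5) -> bool:
    """Return True if this device falls into the percent% sample."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    r = int.from_bytes(digest[:8], "big") % 100  # R = hash(device ID) mod 100
    return r < percent

# Hypothetical device IDs, for illustration only.
events = [f"device-{i}" for i in range(10_000)]
sampled = [d for d in events if in_sample(d)]
```

Because the decision depends only on the ID, every event from a sampled device is kept and every event from an unsampled device is dropped, on any machine and without shared state.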

Reservoir sampling
Problem: suppose you need to choose k elements at random from
a very large stream.
Intuition for k = 1:
• Keep the first element
• For each subsequent element i > 1:
• Select a random number between 1 and i
• If the selected number is i, keep element i in place of the one kept so far
Now let’s extend this to selecting k elements from a stream whose length n is not known in advance:
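// fill the reservoir R with the first k elements of the stream S
for i = 1 to k
    R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
    j := random(1, i)    // uniform integer in 1..i, inclusive
    if j <= k
        R[j] := S[i]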
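A runnable version of the pseudocode above, as a minimal sketch. The function name and the optional rng parameter are choices made here for illustration; the logic follows the slide, with indices shifted because Python lists start at 0.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k elements chosen uniformly at random from an iterable."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements fill the reservoir.
            reservoir.append(item)
        else:
            # Element i replaces a random slot with probability k / i.
            j = rng.randint(1, i)  # uniform integer in 1..i, inclusive
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
```

The stream is consumed once and only k items are ever held in memory, so the input can be a generator of unknown length.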

Map-reduce sampling
Percentage sampling → use hashmod technique
k elements sampling:
1. Hash each entry in the data set
2. Within each mapper, sort by hash value
3. Send the top k elements from each mapper to the reducer
4. The reducer emits the top k elements overall
[Diagram: four mappers each send their top k to a single reducer, which emits the final top k]
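The four steps can be simulated in a single process as below. This is a sketch under assumptions: SHA-256 stands in for whatever hash the real job would use, the partitions are invented, and a heap-based selection replaces a full sort inside each mapper since only the top k are needed.

```python
import hashlib
import heapq

def h(entry: str) -> int:
    """Stable 64-bit hash of an entry (SHA-256 is an assumed choice)."""
    return int.from_bytes(hashlib.sha256(entry.encode("utf-8")).digest()[:8], "big")

def mapper(partition, k):
    """Steps 1-3: hash each entry and keep the k entries with the largest hash."""
    return heapq.nlargest(k, ((h(e), e) for e in partition))

def reducer(mapper_outputs, k):
    """Step 4: merge the per-mapper candidates and emit the overall top k."""
    merged = [pair for out in mapper_outputs for pair in out]
    return [e for _, e in heapq.nlargest(k, merged)]

# Hypothetical data set split across four mappers.
partitions = [[f"event-{p}-{i}" for i in range(1000)] for p in range(4)]
sample = reducer([mapper(p, 5) for p in partitions], 5)
```

Since the hash values behave like random keys, the k entries with the largest hashes form a random sample, and each mapper only ships k candidates to the reducer.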

Set membership
Is the needle in the haystack?
Exact answer:
• Yes, it definitely is
• No, it definitely is not
Approximate answer:
• Probably yes
• No, it definitely is not

Bloom filters
• An array of m bits
• k uniformly distributed hash functions over the range m
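A minimal Bloom filter sketch built from those two ingredients. Adding an item sets the k bits its hashes point to; a lookup answers "probably yes" only if all k bits are set. Deriving the k hash functions by salting SHA-256 with the function index is an assumption made here for brevity, not something the slides prescribe.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, all zero

    def _positions(self, item: str):
        # k bit positions, one per (salted) hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means "definitely not"; True means "probably yes".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(m=10_000, k=7)
for word in ["needle", "hay", "straw"]:
    bf.add(word)
```

Bits are never cleared, which is why a plain Bloom filter cannot produce false negatives and also cannot support deletion.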

Bloom filter sizing
$k = \frac{m}{n} \ln 2$
$m = -\frac{n \ln p}{(\ln 2)^2}$
n: estimated number of elements
p: allowed false-positive probability
m: required bit array length
k: number of hash functions
Examples:
Items (n)    False-positive rate (p)    Size (KB)    Hash functions (k)
1,000,000    1%                         1,200        7
1,000,000    2%                         1,000        6
1,000,000    5%                         780          4
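The sizing formulas can be evaluated directly. The helper below is a small sketch (the name bloom_size is made up); it rounds m up to a whole bit and k to the nearest integer, and converts bits to kilobytes so the results can be compared with the table.

```python
import math

def bloom_size(n: int, p: float):
    """Return (m, k): bit-array length and hash count for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

for p in (0.01, 0.02, 0.05):
    m, k = bloom_size(1_000_000, p)
    print(f"p={p:.0%}  m={m:,} bits  (~{m / 8 / 1000:,.0f} KB)  k={k}")
```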

Cuckoo filter
• In practice, often a better choice than a Bloom filter
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Uses less space at low target false-positive rates
• Cuckoo hashing: resolves collisions by moving an existing entry to its alternate location
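A compact sketch of how a cuckoo filter can support add, lookup and remove. All parameters here are assumptions for illustration: 8-bit fingerprints, 4 slots per bucket, SHA-256 as the hash, and a power-of-two bucket count so that the alternate bucket can be computed by XOR from the current bucket and the fingerprint alone.

```python
import hashlib
import random

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # Power-of-two bucket count keeps the XOR trick self-inverse.
        assert num_buckets & (num_buckets - 1) == 0
        self.n = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        return self._hash(b"fp:" + item.encode("utf-8")) % 255 + 1  # 1..255

    def _index(self, item: str) -> int:
        return self._hash(item.encode("utf-8")) % self.n

    def _alt(self, index: int, fp: int) -> int:
        # Alternate bucket depends only on the current bucket and fingerprint.
        return (index ^ self._hash(bytes([fp]))) % self.n

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def __contains__(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

    def remove(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.add("needle")
```

Storing fingerprints instead of setting shared bits is what makes removal possible; only items that were actually added should be removed, otherwise a colliding fingerprint may be deleted by mistake.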

Frequency estimation – Count-Min Sketch
• Memory-efficient data structure
• Estimates frequency-related
properties of a data stream
• Frequency of a particular element
• Top K frequent elements
• Trade-offs:
• Doesn’t care about relatively rare
events
• Accurately estimates frequent
values
• Works well when items have
very different frequencies
Insertion:
• Get a value from the stream
• Use a separate hash function for each row. Increment the
counter in the cell that the row’s hash function maps the value to
Query:
• Given a value v, take the minimum of the counters that the hash
functions map v to. The result never underestimates the true count.
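Insertion and query as described above can be sketched like this. The table dimensions are arbitrary defaults, and salting SHA-256 with the row number to get one hash function per row is an assumption made for a self-contained example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2000, depth: int = 5):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, value: str):
        # One (row, column) cell per row, chosen by that row's hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{value}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        for row, col in self._cells(value):
            self.table[row][col] += count

    def estimate(self, value: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(value))

cms = CountMinSketch()
for word in ["a"] * 5 + ["b"] * 2:
    cms.add(word)
```

Memory use is fixed at width × depth counters regardless of how many distinct values the stream contains.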

Top K elements in a stream
Problem: find the top K most frequent elements
in a large stream of items
Constraint: no space for a full table
of counters
Approach:
• Maintain a Count-Min Sketch and a heap of K elements, initially empty
• For each element e in the stream:
• Add e to the Count-Min Sketch
• If freq(e) > k * n (the estimated frequency of e passes the threshold), add e to the heap
• Remove elements that no longer pass the threshold from the heap
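One way to combine a Count-Min Sketch with a bounded candidate set is sketched below. It is a simplified variant, not the exact flow on the slide: instead of a frequency threshold it keeps the K elements with the largest estimated counts, and a small dict of at most K candidates stands in for the heap to keep the update logic short. Sketch dimensions and the SHA-256 row hashes are assumptions.

```python
import hashlib
import heapq

def top_k(stream, k, width=2000, depth=5):
    """Return up to k (element, estimated count) pairs, most frequent first."""
    table = [[0] * width for _ in range(depth)]  # Count-Min Sketch counters

    def add_and_estimate(e: str) -> int:
        counts = []
        for row in range(depth):
            d = hashlib.sha256(f"{row}:{e}".encode("utf-8")).digest()
            col = int.from_bytes(d[:8], "big") % width
            table[row][col] += 1
            counts.append(table[row][col])
        return min(counts)

    candidates = {}  # element -> latest estimate, never more than k entries
    for e in stream:
        est = add_and_estimate(e)
        if e in candidates or len(candidates) < k:
            candidates[e] = est
        else:
            weakest = min(candidates, key=candidates.get)
            if est > candidates[weakest]:
                del candidates[weakest]
                candidates[e] = est
    return heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1])

# Hypothetical stream: three frequent elements among many rare ones.
stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + [f"x{i}" for i in range(100)]
result = top_k(stream, 3)
```

Memory stays bounded by the sketch size plus K candidates, which is the point of the constraint above.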

Finding quantiles in a stream – T-digest
Example 1: given a large
stream, find its 0.25, 0.50
(median) and 0.75 quantiles
Example 2: anomaly detection –
dynamically identify the 99.95th
percentile and alert on values
that deviate from it
T-digest data structure
https://github.com/tdunning/t-digest
Published in 2013 by Ted Dunning (MapR)
Smart representation of the cumulative
distribution function of the stream
Attempts to identify the ‘interesting’ spots
(centroids) of the data stream
Sublinear space requirements

Counting distinct values - HyperLogLog
Efficiently count distinct values in
a stream.
Example: how many unique
visitors visited a site within a
given period?
Example: given a large stream,
how many distinct elements does it
contain?
Example: efficiently parallelize the
calculation over a very large
partitioned data set
Solution? HyperLogLog data structure
Very efficient in terms of space:
1B distinct values → 2% error → 1.5 KB
Supported operations:
• hll1.insert(e): adds element e to the count
• hll1.distinct(): returns the count of distinct values
• hll1.union(hll2): returns hll3, which is a merge of hll1 and hll2
• hll1.intersect(hll2)? Not available directly; estimate the size as
hll1.distinct() + hll2.distinct() - hll1.union(hll2).distinct() (*)
(*) Unfortunately it’s not possible to get a new hll of the
intersection.
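A minimal HyperLogLog sketch exposing the operations listed above. It is illustrative rather than production-grade: it uses 2^12 registers, SHA-256 truncated to 64 bits as the hash, the standard bias constant for large register counts, and only the small-range (linear counting) correction. The visitor IDs are invented.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def insert(self, e: str) -> None:
        x = int.from_bytes(hashlib.sha256(e.encode("utf-8")).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def distinct(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting for small sets
        return raw

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        assert self.p == other.p
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

# Hypothetical visitor IDs: two sets of 50,000 that share 25,000 visitors.
hll1, hll2 = HyperLogLog(), HyperLogLog()
for i in range(50_000):
    hll1.insert(f"visitor-{i}")
for i in range(25_000, 75_000):
    hll2.insert(f"visitor-{i}")

# Intersection size by inclusion-exclusion, as in the formula above.
overlap = hll1.distinct() + hll2.distinct() - hll1.union(hll2).distinct()
```

Because union is just a register-wise maximum, each partition can build its own sketch and the results can be merged afterwards. The inclusion-exclusion estimate inherits the errors of all three counts, so it becomes unreliable when the overlap is small relative to the sets.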

Further reading
Coursera – Mining massive data sets
Mining massive data sets (ebook)
http://infolab.stanford.edu/~ullman/mmds/book.pdf
Practical machine learning – a new look at anomaly detection
http://info.mapr.com/rs/mapr/images/Practical_Machine_Learning_Anomaly_Detection.pdf
A collection of links of streaming algorithms
https://gist.github.com/debasishg/8172796
http://www.mythings.com/about/careers/
