Probabilistic Data Structures
Yoav Chernobroda
CTO
Types of big data analytics computation
Offline
• Batch processing
• Periodically over large data sets
• Scales through map-reduce
• One or more passes over all data
Near-line
• Event-based streaming
• Incremental computation
• Asynchronous processing
• Relaxed latency requirements
Online (real-time)
• Stream oriented
• Immediate response
• Strict latency requirements
• Scales through stateless instances
Why should you care?
Statistical mean:
• Easy to calculate over a data set of any size
• Works well over streaming and partitioned data (map-reduce)
• Does not require moving data
Statistical median:
• Standard algorithms need to sort the data in place or use Quickselect
• Data is moved around
• Does not scale over partitioned data
• Can't work over streaming data
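The contrast between the two statistics can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the helper names `partial_mean` and `combine_means` and the sample partitions are assumptions made for the example. Each partition reports only a (sum, count) pair, which is enough to recover the exact mean, whereas combining per-partition medians does not give the true median.

```python
from statistics import median

def partial_mean(partition):
    """Per-partition aggregate: a (sum, count) pair is all the mean needs."""
    return (sum(partition), len(partition))

def combine_means(partials):
    """Merge the per-partition aggregates into the exact global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical data split over three partitions.
partitions = [[1, 2, 3], [10, 20], [100]]
all_values = [x for part in partitions for x in part]

global_mean = combine_means([partial_mean(p) for p in partitions])

# The same trick fails for the median: the median of the
# per-partition medians is generally not the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

Here the combined mean equals the mean of all six values, while the median of medians (15) differs from the true median (6.5), which is why the median calls for either moving the data or an approximation.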
Exact vs. Approximation
Data sampling
Types of problems:
1. Train a model with a data set that is larger than can be processed in memory.
2. Interview question: how would you log a sample of p% of users from a stream of events arriving from billions of mobile devices, each identified by a device ID?
3. How would you sample k elements from a very large stream of events?
Hashmod technique
1. Determine the percentage of the data that would fit in memory (e.g. 5%)
2. Set R = hash(device ID) mod 100 (prefer MurmurHash over the standard Java or Python hash)
3. If R < 5, add the event to the sample; otherwise skip it
A very simple and scalable technique for sampling a desired percentage of the data, based on an ID that repeats in the stream (e.g. mobile device ID). Because the decision depends only on the ID, the events of a given device are either all sampled or all skipped.
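The three steps can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `in_sample` is hypothetical, and MD5 from the standard library stands in for MurmurHash (which would need a third-party package) purely to get a stable, well-distributed hash. Python's built-in `hash()` is avoided because string hashing is randomized per process by default.

```python
import hashlib

def in_sample(device_id: str, percent: int = 5) -> bool:
    """Return True if this device ID falls into the sampled percentage.

    Assumption: MD5 is used only as a stable, uniform hash,
    standing in for the MurmurHash the slide recommends.
    """
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # R = hash(ID) mod 100
    return bucket < percent                           # keep if R < percent

# Usage: filter a stream of (device_id, payload) events.
events = [("device-1", "open"), ("device-2", "click"), ("device-1", "close")]
sampled = [e for e in events if in_sample(e[0])]
```

The function is deterministic, so every worker in a distributed system makes the same decision for the same device without any coordination.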
Reservoir sampling
Problem: suppose you need to randomly choose k elements from
a very large stream.
Intuition for k = 1:
• Keep the first element
• For each subsequent element i > 1:
  • Select a random number between 1 and i
  • If the selected number is i, replace the kept element with element i
Now let's extend this to selecting k elements from a stream S of n elements, where n need not be known in advance:
// fill the reservoir with the first k elements
for i := 1 to k
    R[i] := S[i]
// replace elements with gradually decreasing probability
for i := k+1 to n
    j := random(1, i)   // uniform integer in [1, i]
    if j <= k
        R[j] := S[i]
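The pseudocode above translates almost line for line into Python. The following is a sketch rather than a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter (added so runs can be made reproducible) are assumptions for the example. It works on any iterable, so the stream length never has to be known.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick k elements uniformly at random from a stream in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Element i replaces a random slot with probability k / i.
            j = rng.randint(1, i)  # uniform integer in [1, i], both inclusive
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

# Usage: sample 10 values from a stream of one million integers.
sample = reservoir_sample(range(1_000_000), 10)
```

If the stream holds fewer than k elements, the function simply returns all of them.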
Map-reduce sampling
Percentage sampling: use the hashmod technique
k elements sampling:
1. Hash each entry in the data set
2. Within each mapper, sort by hash value
3. Each mapper sends its top k elements to the reducer
4. The reducer emits the top k elements
[Diagram: four mappers, each sending its top k to a single reducer, which emits the overall top k]
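The four steps can be simulated in a single process. This is a sketch only: plain Python lists stand in for the mapper partitions, the names `hash_key`, `mapper_top_k` and `reducer_top_k` are hypothetical, and SHA-1 is an assumed choice of hash. The point it illustrates is that the top k of the whole data set is always contained in the union of the per-mapper top k lists, so each mapper needs to ship only k entries.

```python
import hashlib
import heapq

def hash_key(entry: str) -> int:
    """Stable pseudo-random key for an entry (assumption: SHA-1, first 8 bytes)."""
    return int.from_bytes(hashlib.sha1(entry.encode("utf-8")).digest()[:8], "big")

def mapper_top_k(partition, k):
    """Steps 1-3: hash every entry and keep the k entries with the largest hash."""
    return heapq.nlargest(k, ((hash_key(e), e) for e in partition))

def reducer_top_k(mapper_outputs, k):
    """Step 4: merge the per-mapper lists and emit the overall top k entries."""
    merged = [pair for output in mapper_outputs for pair in output]
    return [entry for _, entry in heapq.nlargest(k, merged)]

# Usage: four partitions of 100 distinct entries each, sample k = 5.
partitions = [[f"entry-{p}-{i}" for i in range(100)] for p in range(4)]
sample = reducer_top_k([mapper_top_k(p, 5) for p in partitions], 5)
```

Because the hash values behave like random keys, the k entries with the largest hashes form a uniform sample of the distinct entries.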
Set membership
Is the needle in the haystack?
Exact answer:
• Yes, it definitely is
• No, it definitely is not
Approximate answer:
• Probably yes
• No, it definitely is not
Bloom filters
• A bit array of m bits
• k uniformly distributed hash functions, each mapping an element to one of the m positions
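A Bloom filter built from these two ingredients can be sketched as below. This is a minimal illustration, not a production implementation: the class and method names are hypothetical, and the k hash functions are simulated by prefixing the item with the function index before hashing with SHA-256 (an assumption; real implementations typically use faster hashes).

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit array, all zeros

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting with the index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the k bits the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "probably yes"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage
seen = BloomFilter(m=10_000, k=7)
seen.add("device-123")
```

An added item always has all of its bits set, so there are no false negatives; a false positive happens only when other items happen to have set all k bits of an item that was never added.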
Bloom filter sizing
$k = \frac{m}{n} \ln 2$
$m = -\frac{n \ln p}{(\ln 2)^2}$
n: estimated number of elements
p: allowed false-positive probability
m: required bit array length (in bits)
k: number of hash functions
Examples:
Items (n)    False-positive rate (p)    Size (KB)    Hash functions (k)
1,000,000    1%                         1,200        7
1,000,000    2%                         1,000        6
1,000,000    5%                         780          4
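The two sizing formulas are easy to evaluate directly. The sketch below uses a hypothetical helper `bloom_size` that rounds m up to a whole bit and k to the nearest integer (the rounding choices are assumptions, not something the slide specifies).

```python
import math

def bloom_size(n: int, p: float):
    """Return (m, k): bit-array length and number of hash functions
    for n expected elements and false-positive probability p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Usage: reproduce the rows of the examples table.
for p in (0.01, 0.02, 0.05):
    m, k = bloom_size(1_000_000, p)
    print(f"p={p:.0%}: {m / 8 / 1000:,.0f} KB, k={k}")
```

For one million items at p = 1% this gives roughly 9.6 million bits (about 1,200 KB) and k = 7, in line with the examples above.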
Real world example
Cuckoo filter
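• Often a practical improvement over the Bloom filter
• Supports adding and removing items dynamically (a standard Bloom filter cannot remove items)
• Provides higher lookup performance
• Uses less space at low target false-positive rates
• Cuckoo hashing: resolves collisions by moving an existing entry to its alternate location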
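To make the "move the existing entry" idea concrete, here is a simplified cuckoo filter sketch. Everything about it is an assumption made for illustration: the class and method names, the 8-bit fingerprints, buckets of four slots, SHA-256 as the hash, and a power-of-two bucket count so that the alternate index can be computed with XOR. It also ignores one detail real implementations handle: when an insert fails after too many kicks, the last evicted fingerprint is dropped.

```python
import hashlib
import random

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        assert num_buckets & (num_buckets - 1) == 0, "must be a power of two"
        self.n = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _fingerprint(self, item: str) -> int:
        return _h(b"fp:" + item.encode("utf-8")) % 255 + 1  # 1..255

    def _indexes(self, item: str):
        fp = self._fingerprint(item)
        i1 = _h(item.encode("utf-8")) % self.n
        return fp, i1, self._alt(i1, fp)

    def _alt(self, index: int, fp: int) -> int:
        # XOR with the fingerprint's hash: applying it twice returns to index.
        return (index ^ _h(fp.to_bytes(1, "big"))) % self.n

    def insert(self, item: str) -> bool:
        fp, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Evict a random resident and move it to its alternate bucket.
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item: str) -> bool:
        fp, i1, i2 = self._indexes(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: str) -> bool:
        fp, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deletion is possible because the filter stores fingerprints rather than shared bits: removing one copy of a fingerprint does not disturb other items.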
Frequency estimation – Count-Min Sketch
• Memory-efficient data structure
• Estimates frequency-related properties of a data stream:
  • Frequency of a particular element
  • Top K frequent elements
• Trade-offs:
  • Doesn't care about relatively rare events
  • Accurately estimates frequent values
• Works well when item frequencies are skewed (items have different probabilities)
Structure: a table of counters with one row per hash function.
Insertion:
• Get a value from the stream
• Use a separate hash function for each row. In each row, increment the counter in the cell the hash function points to
Query:
Given value v, take the minimum of the counters its hash functions point to, one per row. Collisions can only inflate a counter, so the estimate never undercounts.
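The insertion and query rules can be sketched as a small class. This is an illustrative sketch under assumptions: the class name, the default `width` and `depth`, and the use of SHA-256 salted with the row number as the per-row hash function are all choices made for the example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 1000, depth: int = 5) -> None:
        self.width = width    # counters per row
        self.depth = depth    # number of rows (one hash function each)
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row: int, value: str) -> int:
        # A separate hash function per row, simulated by salting with the row.
        digest = hashlib.sha256(f"{row}:{value}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        # Insertion: increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._cell(row, value)] += count

    def estimate(self, value: str) -> int:
        # Query: the minimum over rows is the least-inflated counter.
        return min(self.table[row][self._cell(row, value)]
                   for row in range(self.depth))

# Usage
cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
```

Memory use is fixed at width × depth counters regardless of how many distinct values appear in the stream.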
Top K elements in a stream
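Find the top K frequent elements from a large stream of items.
Constraint: no space for a full table of counters.
Setup:
• Insert elements into a Count-Min Sketch
• Maintain a heap of K elements, initially empty
For each element e in the stream:
1. Add e to the Count-Min Sketch
2. If freq(e) > k * n, add e to the heap (freq(e) is the sketch's estimate; k and n are not defined on the slide, presumably a threshold fraction and the number of elements seen so far)
3. Remove heap elements that no longer exceed the threshold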
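One way to read this flow is as a heavy-hitters loop, sketched below. Several things here are assumptions rather than the author's method: the threshold is interpreted as a fraction `phi` of the n elements seen so far, the function name `heavy_hitters` and the compact `_Sketch` class are hypothetical, and stale heap entries are discarded lazily.

```python
import hashlib
import heapq

class _Sketch:
    """Compact Count-Min Sketch (assumed parameters: SHA-256, 5 rows)."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, value):
        digest = hashlib.sha256(f"{row}:{value}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value):
        for row in range(self.depth):
            self.table[row][self._cell(row, value)] += 1

    def estimate(self, value):
        return min(self.table[r][self._cell(r, value)] for r in range(self.depth))

def heavy_hitters(stream, phi):
    """Return {element: estimated count} for elements above phi * n."""
    sketch = _Sketch()
    best = {}   # element -> latest estimate above the threshold
    heap = []   # min-heap of (estimate, element); may hold stale entries
    n = 0
    for e in stream:
        n += 1
        sketch.add(e)
        est = sketch.estimate(e)
        if est > phi * n:
            best[e] = est
            heapq.heappush(heap, (est, e))
        # Clean the heap: pop entries that are stale or below the threshold.
        while heap and (heap[0][0] <= phi * n or best.get(heap[0][1]) != heap[0][0]):
            old_est, old_e = heapq.heappop(heap)
            if best.get(old_e) == old_est:
                del best[old_e]
    return best

# Usage: elements that make up more than 20% of the stream.
stream = ["a"] * 60 + ["b"] * 30 + [f"x{i}" for i in range(10)]
frequent = heavy_hitters(stream, phi=0.2)
```

The heap root is always the smallest estimate, so checking only the root is enough to evict everything that has fallen below the growing threshold.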
Finding quantiles in a stream – T-digest
Example 1: given a large stream, find its 0.25, 0.50 (median) and 0.75 quantiles
Example 2: anomaly detection – dynamically identify the 99.95th percentile and alert on values that deviate from it
T-digest data structure
https://github.com/tdunning/t-digest
• Published in 2013 by Ted Dunning (MapR)
• Compact representation of the cumulative distribution function of the stream
• Attempts to identify the 'interesting' spots (centroids) of the data stream
• Sublinear space requirements
Counting distinct values - HyperLogLog
Efficiently count distinct values in a stream.
Example: how many unique visitors visited a site within a given period?
Example: given a large stream, how many distinct elements does it contain?
Example: efficiently parallelize the calculation over a very large partitioned data set
Solution: the HyperLogLog data structure
Very efficient in terms of space: 1B distinct values, 2% error, 1.5 KB
Supported operations:
• hll1.insert(e): adds element e to the count
• hll1.distinct(): returns the estimated count of distinct values
• hll1.union(hll2): returns hll3, which is a merge of hll1 and hll2
• hll1.intersect(hll2)?: estimate it as hll1.distinct() + hll2.distinct() - hll1.union(hll2).distinct() (*)
(*) Unfortunately, it's not possible to get a new hll of the intersection.
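The three supported operations can be sketched in Python as follows. This is a simplified illustration, not a tuned implementation: the class layout, the choice of 2^11 registers, SHA-256 truncated to 64 bits as the hash, and the small-range correction via linear counting are assumptions for the example. The method names follow the operations listed above.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p: int = 11) -> None:
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers (2048 for p = 11)
        self.registers = [0] * self.m

    def insert(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def distinct(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # small-range: linear counting
        return raw

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # Merging is a register-wise maximum, which is why HLL parallelizes well.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

# Usage: two partitions with overlapping visitors.
hll1, hll2 = HyperLogLog(), HyperLogLog()
for i in range(6000):
    hll1.insert(f"visitor-{i}")
for i in range(4000, 10000):
    hll2.insert(f"visitor-{i}")
hll3 = hll1.union(hll2)
overlap = hll1.distinct() + hll2.distinct() - hll3.distinct()
```

The union of two sketches has exactly the registers that a single sketch fed with both streams would have, whereas the intersection is only an arithmetic estimate whose error grows when the overlap is small relative to the two sets.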
Further reading
Coursera – Mining massive data sets
Mining massive data sets (ebook)
http://infolab.stanford.edu/~ullman/mmds/book.pdf
Practical machine learning – a new look at anomaly detection
http://info.mapr.com/rs/mapr/images/Practical_Machine_Learning_Anomaly_Detection.pdf
A collection of links of streaming algorithms
https://gist.github.com/debasishg/8172796
http://www.mythings.com/about/careers/
