1
Intro to Sketch Algorithms
19/10/2021
2
- Did this IP visit me before?
- How many unique IPs have we seen this month?
- How many times did I see this IP?
- What is the median transaction value? The top 1% value?
- What are the most common collections of available fonts?
Large Stream of Events
3
Can’t store all unique values in memory
Fixed memory
4
If we are willing to accept an arbitrarily low chance of false positives, we can solve this problem with Bloom filters.
Did I see this value before?
5
Hash each value and turn on the bit for that hash bucket.
Repeat with k different hash functions; to query, check whether the bits for all k hash functions are set.
Some false positives, no false negatives.
Bloom Filter
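The slide's recipe can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice to derive the k hash functions by salting SHA-256 with an index are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Assumption: simulate k hash functions by salting SHA-256 with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        # Turn on one bit per hash function.
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # True may be a false positive; False is never wrong.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )
```

A value that was added always answers True; a value that was never added usually answers False, with a false-positive rate that drops as `num_bits` grows relative to the number of stored values.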
6
If we hash all values, and calculate the minimum of
all hashes, what is the expected minimum value?
Cardinality estimation
7
Let hash(x) : X => [0,1] be uniformly pseudo-random.
E[min(hash(x))] = 1/(k+1), where k is the number of distinct elements.
The observed minimum is an unbiased estimator of 1/(k+1), so k can be estimated as 1/min - 1.
If we repeat with several different hash functions, we can average the minima to reduce the variance.
Cardinality estimation
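A rough sketch of this estimator, assuming the hash functions are simulated by salting SHA-256 with a seed (the names `unit_hash`, `estimate_cardinality` and `num_hashes` are made up for this example). It averages the per-hash minima first and inverts once at the end, which is one reasonable reading of "average the estimations".

```python
import hashlib


def unit_hash(value: str, seed: int) -> float:
    """Map a value pseudo-randomly and (approximately) uniformly into [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, num_hashes: int = 256) -> float:
    # One running minimum per hash function: memory is O(num_hashes),
    # independent of the stream length.
    minima = [1.0] * num_hashes
    for value in stream:
        for seed in range(num_hashes):
            h = unit_hash(value, seed)
            if h < minima[seed]:
                minima[seed] = h
    mean_min = sum(minima) / num_hashes
    # E[min] = 1/(k+1), so invert the averaged minimum to recover k.
    return 1.0 / mean_min - 1.0
```

Duplicates hash to the same value, so they never change a minimum; only the number of distinct elements affects the result.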
8
Related to counting Bloom filters.
Hash the value and increment a counter at the hashed index.
Use multiple hash functions, each with a separate table (column); return the min of all estimates.
Produces a biased estimate: estimate >= actual.
How many times did we see this value?
count–min sketch
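As an illustration, a count-min sketch could look like the following. The class name, the `width` and `depth` parameters, and the salted SHA-256 hashing are assumptions for this sketch rather than anything prescribed by the slides.

```python
import hashlib


class CountMinSketch:
    """depth tables of width counters, one hash function per table."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, value: str, row: int) -> int:
        # Assumption: one hash function per table, simulated by salting.
        digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.tables[row][self._index(value, row)] += count

    def estimate(self, value: str) -> int:
        # Collisions only ever inflate a counter, so the minimum over
        # tables is the tightest available upper bound on the true count.
        return min(
            self.tables[row][self._index(value, row)]
            for row in range(self.depth)
        )
```

Because counters are only incremented, the estimate can never fall below the true count, which matches the one-sided bias noted above.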
9
Naive - sample and calculate the median on the sample
Remedian - calculate the median of medians (of medians…)
Median estimation
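The remedian idea can be sketched as a stack of small buffers: when a buffer fills, its median is pushed one level up and the buffer is emptied. The class below is a simplified illustration under stated assumptions: the names `Remedian` and `buffer_size` are made up here, and the final estimate simply takes the median of the highest non-empty buffer rather than weighting the leftovers in lower buffers.

```python
import statistics


class Remedian:
    """Median of medians (of medians...) with fixed-size buffers."""

    def __init__(self, buffer_size: int = 11) -> None:
        self.buffer_size = buffer_size
        self.levels = [[]]

    def add(self, x: float) -> None:
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([])
            buf = self.levels[level]
            buf.append(x)
            if len(buf) < self.buffer_size:
                return
            # Buffer is full: promote its median and start over at this level.
            x = statistics.median(buf)
            buf.clear()
            level += 1

    def estimate(self) -> float:
        # Simplification (assumption): ignore leftovers in lower buffers.
        for buf in reversed(self.levels):
            if buf:
                return statistics.median(buf)
        raise ValueError("no data seen yet")
```

Memory grows only with the number of levels, which is logarithmic in the stream length for a fixed `buffer_size`.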
10
Naive - sample and calculate quantile on sample
Sample and keep to K
Manku - maintain eps-approximate counts and quantiles: keep counts of values in intervals, and keep the intervals balanced.
Biased quantile estimators
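The naive approach is easy to sketch. The function below is only an illustration under assumptions: the name `sampled_quantile`, the `sample_rate` parameter, and the nearest-rank quantile rule are choices made for this example. Note that the sample grows with the stream, so memory is not strictly fixed; this is not the Manku-style summary.

```python
import math
import random


def sampled_quantile(stream, q: float, sample_rate: float, seed: int = 0) -> float:
    """Estimate the q-quantile from an independent random sample of the stream."""
    rng = random.Random(seed)
    # Keep each element independently with probability sample_rate.
    sample = [x for x in stream if rng.random() < sample_rate]
    if not sample:
        raise ValueError("sample is empty; increase sample_rate")
    sample.sort()
    # Nearest-rank rule: the smallest value with at least q of the sample below or at it.
    rank = max(1, math.ceil(q * len(sample)))
    return sample[rank - 1]
```

For example, `sampled_quantile(transaction_values, 0.99, 0.01)` would estimate the top 1% threshold from roughly 1% of the events, where `transaction_values` is a hypothetical iterable of numbers.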
11
An exact answer provably requires Ω(N) space (at least linear).
Even finding the single most common value does.
Relax to the K-heavy-hitters problem: find all values with frequency at least 1/K.
Approximate K heavy hitters: return all values with frequency more than 1/K, and return no value with frequency below 1/K - epsilon.
What are the top K most frequent values?
12
m = {}  # empty map from elements to counters

def add(a, k):
    # k is the maximum number of counters kept in m
    if a in m:
        m[a] += 1
    elif len(m) < k:
        m[a] = 1
    else:
        # decrease all counters in m by 1 and remove any that reach 0
        for key in list(m):
            m[key] -= 1
            if m[key] == 0:
                del m[key]
Frequent algorithm
13
THANK YOU
14
Sampling K elements from a stream of N

Algorithm             Extra memory  Accurate results                     Materialized result
Shuffle and take      N elements    Yes                                  Yes
Reservoir             K elements    Yes                                  Yes
Indices reservoir     K indices     Yes                                  No
Independent sample    O(1)          Length not guaranteed                No
Accurate independent  O(1)          Slight correlation between elements  No
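The "Reservoir" row can be illustrated with the classic replace-at-random scheme. This is a minimal sketch; the function name `reservoir_sample` and the optional `seed` parameter are assumptions added for reproducibility in this example.

```python
import random


def reservoir_sample(stream, k: int, seed=None) -> list:
    """Return k elements chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Element i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Only the K sampled elements are held in memory, and the sample is materialized at every point in the stream, which is what the table's "K elements / Yes / Yes" row describes.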
15
variance = E[(x - E[x])^2]
         = E[x^2 - 2xE[x] + E[x]^2]
         = E[x^2] - 2E[x]E[x] + E[x]^2
         = E[x^2] - E[x]^2
stdev = sqrt(variance)
STDEV streaming - accurate algorithm
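The identity above means a stream only needs three running values: the count, the sum, and the sum of squares. The class below is a small sketch of that idea (the name `StreamingStdev` is made up for this example) and computes the population standard deviation. One caveat worth knowing: subtracting two large, nearly equal floating-point numbers can lose precision, and Welford's online algorithm is a commonly used alternative when that matters.

```python
import math


class StreamingStdev:
    """Population standard deviation from running count, sum and sum of squares."""

    def __init__(self) -> None:
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def variance(self) -> float:
        if self.n == 0:
            raise ValueError("no data seen yet")
        mean = self.total / self.n
        # E[x^2] - E[x]^2; clamp at 0 to absorb tiny negative rounding errors.
        return max(self.total_sq / self.n - mean * mean, 0.0)

    def stdev(self) -> float:
        return math.sqrt(self.variance())
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, E[x^2] is 29, and the variance is 29 - 25 = 4, giving a standard deviation of 2.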

