Count-min sketch to Infinity:
Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct
Count Problems in .NET
presented by: Steve Lorello - Developer Advocate @Redis
Agenda
● What are Probabilistic Data Structures?
● Set Membership problems
● Bloom Filters
● Counting problems
● Count-Min-Sketch
● Distinct Count problems
● HyperLogLog
● Using Probabilistic Data Structures with Redis
What are Probabilistic Data Structures?
● Class of specialized data structures
● Tackle specific problems
● Use probability to give approximate answers
Probabilistic Data Structures Examples
Name               Problem Solved                        Optimization
Bloom Filter       Presence                              Space, Insertion Time, Lookup Time
Quotient Filter    Presence                              Space, Insertion Time, Lookup Time
Skip List          Ordering and Searching                Insertion Time, Search Time
HyperLogLog        Set Cardinality                       Space, Insertion Time, Lookup Time
Count-min-sketch   Counting occurrences on large sets    Space, Insertion Time, Lookup Time
Cuckoo Filter      Presence                              Space, Insertion Time, Lookup Time
Top-K              Keep track of top records             Space, Insertion Time, Lookup Time
Set Membership
Set Membership Problems
● Has a given element been inserted?
● e.g. Unique username for registration
Presence Problem Naive Approach 1
● Store User Info in table ‘users’ and Query
Check username -> Query username
SELECT 1
FROM users
WHERE username = 'selected_username'
Summary
Access Type                               Disk
Lookup Time                               O(n)
Extra Space (beyond storing user info)    O(1)
Presence Problem Naive Approach 2
● Store User Info in table ‘users’
● Index username
Check username -> Query username
SELECT 1
FROM users
WHERE username = 'selected_username'
Summary
Access Type                               Disk
Lookup Time                               O(log(n))
Extra Space (beyond storing user info)    O(n)
Presence Problem Naive Approach 3
● Store usernames in Redis cache
Check username -> SISMEMBER
SADD usernames selected_username
SISMEMBER usernames selected_username
Summary
Access Type                               Memory
Lookup Time                               O(1)
Extra Space (beyond storing user info)    O(n)
Bloom Filters
Bloom Filter
● Specialized ‘Probabilistic’ Data Structure for presence checks
● Can say if element has definitely not been added
● Can say if element has probably been added
● Uses a fixed number K of hash functions
● Represented as a 1D array of bits
● All operations O(1) complexity
● Space complexity O(n) - bits
Insert:
For i = 1 -> K:
    FILTER[H[ i ](key)] = 1
Query:
For i = 1 -> K:
    If FILTER[H[ i ](key)] == 0:
        Return False
Return True
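The insert and query loops above can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used in the talk or in Redis: the class name `BloomFilter`, the method names, and the choice to derive the K positions from one SHA-256 digest by double hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Hypothetical bit-array Bloom filter with k hash positions per key."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bits, all zero

    def _positions(self, key: str) -> list:
        # Assumption: simulate k hash functions with double hashing
        # (h1 + i * h2) over two halves of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        # Insert: set every bit the key hashes to.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # Query: a single unset bit proves the key was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


usernames = BloomFilter(num_bits=1024, num_hashes=3)
usernames.add("razzle")
print(usernames.might_contain("razzle"))  # True: added keys are always found
```

A `False` answer from `might_contain` is definitive, while a `True` answer only means "probably added", which matches the two guarantees listed on the Bloom Filter slide.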
Complexities
Type      Worst Case
Space     O(n) - BITS
Insert    O(1)
Lookup    O(1)
Delete    Not Available
Example Initial State
Bloom Filter k = 3
bit   0 1 2 3 4 5 6 7 8 9
state 0 0 0 0 0 0 0 0 0 0
Example Insert username ‘razzle’
Bloom Filter k = 3
● H1(razzle) = 2
● H2(razzle) = 5
● H3(razzle) = 8
bit   0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
Example Query username ‘fizzle’
Bloom Filter k = 3
bit   0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
● H1(fizzle) = 8 - bit 8 is set: maybe?
● H3(fizzle) = 2 - bit 2 is set: maybe?
● H2(fizzle) = 4 - bit 4 is not set: definitely not.
False Positives and Optimal K
● This algorithm will never give you false negatives, but it can report false positives
● You can minimize the false positive rate by choosing K well
● Let c = number of bits in the filter / number of records inserted (bits per record)
● Optimal K = c * ln(2)
● This results in a false positive rate of about 0.6185^c, which shrinks quickly as c grows (roughly 0.8% at c = 10)
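To make the formulas above concrete, here is a small Python sketch that computes K and the resulting false positive rate for a given number of bits per record. The function names are hypothetical, and the rate uses the standard approximation (1 - e^(-K/c))^K, which is where the 0.6185^c figure comes from when K = c * ln(2).

```python
import math


def optimal_hash_count(bits_per_record: float) -> int:
    """K = c * ln(2), rounded to a whole number of hash functions."""
    return max(1, round(bits_per_record * math.log(2)))


def false_positive_rate(bits_per_record: float, num_hashes: int) -> float:
    """Standard approximation: (1 - e^(-K/c))^K."""
    return (1.0 - math.exp(-num_hashes / bits_per_record)) ** num_hashes


c = 10  # ten bits of filter per inserted record
k = optimal_hash_count(c)
print(k)                                   # 7
print(round(false_positive_rate(c, k), 4)) # 0.0082
print(round(0.6185 ** c, 4))               # 0.0082
```

At ten bits per record the filter needs about 7 hash functions and reports a false positive a little under 1% of the time.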
Counting Problems
What’s a Counting Problem?
● How many times does an individual item occur in a stream?
● Easy to do on small to mid-size streams of data
● e.g. Counting Views on YouTube
● Nearly impossible to scale to enormous data sets
Naive Approach: Hash Table
● Hash Table of Counters
● Lookup name in Hash table, instead of storing record,
store an integer
● On insert, increment the integer
● On query, check the integer
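A minimal Python sketch of this hash-table-of-counters approach, using the standard library `Counter` (a dictionary of integers). The video titles are just sample data.

```python
from collections import Counter

views = Counter()  # hash table mapping name -> integer count

# On insert, increment the integer stored for that name.
for video in ["Gangnam Style", "Baby Shark", "Gangnam Style"]:
    views[video] += 1

# On query, check the integer; names never seen report 0.
print(views["Gangnam Style"])  # 2
print(views["Baby Shark"])     # 1
print(views["never viewed"])   # 0
```

The counts are exact, but the table holds one entry per distinct name, so its memory grows linearly with the number of unique records.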
Pros
● Straightforward
● Guaranteed accuracy (if storing whole object)
Cons
● O(n) Space Complexity in the best case
● O(n) worst case time complexity
● Scales poorly (think billions of unique records)
● If relying on only a single hash - very vulnerable to collisions and overcounts
Naive Approach Relational DB
● Issue a query to a traditional relational database for a count of the records that match some condition
SELECT COUNT(*) FROM views
WHERE name = 'Gangnam Style'
Linear Time Complexity O(n)
Linear Space Complexity O(n)
What’s the problem with a Billion Unique Records?
● Each unique record needs its own space in a Hash Table or row in a RDBMS
(perhaps several rows across multiple tables)
● Taxing on memory for Hash Table (one billion counters, before counting keys and table overhead)
○ 8 bit integer? 1GB
○ 16 bit? 2GB
○ 32 bit? 4GB
○ 64 bit? 8GB
● Maintaining such large data structures in a typical program’s memory isn’t
feasible
● In a relational database, it’s stored on disk
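The memory figures above follow from simple arithmetic: one billion counters times the counter width in bytes. A short Python check, where the helper name `counter_memory_gb` is made up for this example and a gigabyte is taken as 10^9 bytes:

```python
def counter_memory_gb(records: int, counter_bits: int) -> float:
    """Memory for the counters alone, ignoring keys and table overhead."""
    return records * (counter_bits // 8) / 1e9


for bits in (8, 16, 32, 64):
    print(f"{bits}-bit counters: {counter_memory_gb(1_000_000_000, bits):.0f} GB")
```

Real hash tables need more than this, since they also store the keys and leave empty slots to keep collisions rare.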
Count-min Sketch
Count-Min Sketch
● Specialized data structure for keeping count on very large streams of data
● Similar to Bloom filter in Concept - multi-hashed record
● 2D array of counters
● Sublinear Space Complexity
● Constant Time complexity
● Never undercounts, sometimes overcounts
Increment:
For i = 1 -> k:
    Table[ i ][ H[ i ](key) ] += 1
Query:
minimum = infinity
For i = 1 -> k:
    minimum = min(minimum, Table[ i ][ H[ i ](key) ])
return minimum
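The increment and query loops above might look like this in Python. It is a sketch under stated assumptions rather than the talk's or Redis's implementation: the class name `CountMinSketch`, its method names, and the use of one salted BLAKE2b hash per row in place of H1..Hk are choices made for the example.

```python
import hashlib


class CountMinSketch:
    """Hypothetical count-min sketch: depth rows (one per hash) by width columns."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row: int, key: str) -> int:
        # Assumption: a per-row salt turns one hash algorithm into
        # `depth` different hash functions.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def increment(self, key: str, amount: int = 1) -> None:
        # Bump one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += amount

    def query(self, key: str) -> int:
        # Collisions only ever add to a counter, so the smallest
        # counter is the closest estimate and never an undercount.
        return min(
            self.table[row][self._column(row, key)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=1000, depth=3)
sketch.increment("Gangnam Style")
sketch.increment("Gangnam Style")
sketch.increment("Baby Shark")
print(sketch.query("Gangnam Style"))  # at least 2
```

Memory is fixed at width x depth counters no matter how many distinct keys are counted, which is where the sublinear space comes from.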
Video Views Sketch 10 x 3
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 0 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 0 0 0 0 0 0
H3 0 0 0 0 0 0 0 0 0 0
Increment Gangnam Style
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 1 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 0 0 0 0 0
H3 0 0 0 0 0 0 1 0 0 0
Increment Baby Shark
● H1(Baby Shark) = 0
● H2(Baby Shark) = 5
● H3(Baby Shark) = 6
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
Query Gangnam Style
Count Min Sketch
position 0 1 2 3 4 5 6 7 8 9
H1 2 0 0 0 0 0 0 0 0 0
H2 0 0 0 0 1 1 0 0 0 0
H3 0 0 0 0 0 0 2 0 0 0
● H1(Gangnam Style) = 0
● H2(Gangnam Style) = 4
● H3(Gangnam Style) = 6
● MIN (2, 1, 2) = 1
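The walkthrough above can be replayed in a few lines of Python. To keep it deterministic, the column positions are copied straight from the slides instead of being computed by real hash functions; the names `POSITIONS`, `increment`, and `query` are made up for this sketch.

```python
# Column chosen by H1, H2, H3 for each video, as given in the walkthrough.
POSITIONS = {
    "Gangnam Style": (0, 4, 6),
    "Baby Shark": (0, 5, 6),
}

# 3 rows (one per hash function) by 10 columns, all counters zero.
table = [[0] * 10 for _ in range(3)]


def increment(name: str) -> None:
    for row, col in enumerate(POSITIONS[name]):
        table[row][col] += 1


def query(name: str) -> int:
    return min(table[row][col] for row, col in enumerate(POSITIONS[name]))


increment("Gangnam Style")
increment("Baby Shark")

print(table[0][0], table[1][4], table[2][6])  # 2 1 2
print(query("Gangnam Style"))                 # 1
```

Both videos collide in rows H1 and H3, so those counters read 2, but row H2 keeps them apart and the minimum recovers the true count of 1.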
Complexities
Type         Worst Case
Space        Sublinear
Increment    O(1)
Query        O(1)
Delete       Not Available
CMS Pros
● Extremely Fast - O(1)
● Super compact - sublinear
● Impossible to undercount
CMS Cons
● Incidence of overcounting - all results are approximations
When to Use a Count Min Sketch?
● Counting many unique instances
● When Approximation is fine
● When counts are likely to be skewed (think YouTube video views)
Set Cardinality
Set Cardinality
● Counting distinct elements inserted into set
● Easier on smaller data sets
● For exact counts - must preserve all unique elements
● Scales very poorly
HyperLogLog
HyperLogLog
● Probabilistic Data Structure to Count Distinct Elements
● Space Complexity is about O(1)
● Time Complexity O(1)
● Can handle billions of elements with less than 2 kB of memory
● Can scale up as needed - HLL is effectively constant even
though you may want to increase its size for enormous
cardinalities.
HyperLogLog Walkthrough
● Initialize an array of registers of size 2^P (where P is some constant, usually
around 16-18)
● When an Item is inserted
○ Hash the Item
○ Determine the register to update: i, taken from the leftmost P bits of the item’s hash
○ Let r be the index of the rightmost 1 in the binary representation of the hash, then set registers[i] = max(registers[i], r), so a register only ever grows
● When Querying
○ Compute harmonic mean of the registers that have been set
○ Multiply by a constant determined by size of P
Example: Insert Username ‘bar’ P = 16
H(bar) = 3103595182
● 1011 1000 1111 1101 0001 1010 1010 1110
● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index
● Index of rightmost 1 = 1 (counting from 0 at the rightmost bit)
● registers[47357] = 1
Get Cardinality
● Calculate harmonic mean of only set registers.
● Only 1 set register: 47357 -> 1
● ceiling(0.673 * 1 * 1/(2^1)) = ceiling(0.3365) = 1
● Cardinality = 1
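For readers who want to experiment, here is a compact Python sketch of HyperLogLog. It is not the Redis implementation, and it deliberately differs from the simplified arithmetic on the slides in a few labeled ways: it defaults to P = 14 to keep the register list small, it ranks the rightmost 1 counting from 1 rather than 0, and it uses the standard estimator over all registers with a linear-counting correction for small counts. The class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Hypothetical HyperLogLog with 2^precision registers."""

    def __init__(self, precision: int = 14) -> None:
        self.p = precision
        self.m = 1 << precision           # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form applies for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Assumption: take a 64-bit hash from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                # leftmost p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits give the rank
        if rest == 0:
            rank = 64 - self.p + 1
        else:
            rank = (rest & -rest).bit_length()    # 1-based position of the rightmost 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        # Raw estimate: alpha * m^2 / sum(2^-register) over all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting on empty registers).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
print(hll.count())  # an estimate close to 10000
```

Adding the same items again leaves every register unchanged, so duplicates never inflate the estimate.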
Probabilistic Data Structures with Redis-Stack
What Is Redis Stack?
● Grouping of modules for Redis, including RedisBloom
● Adds functionality to Redis, e.g. probabilistic data structures
Demo
Steve Lorello
Developer Advocate
@Redis
@slorello
github.com/slorello89
slorello.com
Resources
Redis
https://redis.io
RedisBloom
https://redisbloom.io
Source Code For Demo:
https://github.com/slorello89/probablistic-data-structures-blazor
C# Implementation Bloom Filter, HyperLogLog, and Count-Min Sketch:
https://github.com/TheAlgorithms/C-Sharp/tree/master/DataStructures/Probabilistic
Come Check Us Out!
Redis University:
https://university.redis.com
Discord:
https://discord.com/invite/redis
Baby Name Freq hash table
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 0 0
H(Sophia) = 8
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 0 1 0
H(Liam) = 4
0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 0 0 0 1 0
Baby Name existence table k = 3
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
H1(Liam) = 0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 0 1 0 0 0
H1(Susan) = 0 H2(Susan) = 5 H3(Susan) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
Does Tom Exist? H1(Tom) = 1 H2(Tom) = 4 H3(Tom) = 5
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
One of Tom's bits (position 1) is 0, so no, Tom does not exist!
Does Liam Exist? H1(Liam) = 0 H2(Liam) = 4 H3(Liam) = 6
0 1 2 3 4 5 6 7 8 9
1 0 0 0 1 1 1 0 0 0
All of Liam's bits are 1, so we report YES
2-Choice Hashing
● Use two Hash Functions instead of one
● Store @ index with Lowest Load (smallest linked list)
● Expected longest chain drops from about log(n)/log(log(n)) in a traditional chained hash table to about log(log(n)) with high probability, so lookups are nearly constant
● Most of the benefit comes from the second hash; additional hashes improve the bound only by a constant factor
● Still O(n) space complexity
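A rough Python sketch of 2-choice hashing with chained buckets. The class name `TwoChoiceHashTable`, its methods, and the trick of taking both hash values from two halves of one SHA-256 digest are assumptions for illustration only.

```python
import hashlib


class TwoChoiceHashTable:
    """Hypothetical chained hash table that picks the emptier of two buckets."""

    def __init__(self, num_buckets: int) -> None:
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, key: str) -> tuple:
        # Assumption: two hash functions taken from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        n = len(self.buckets)
        first = int.from_bytes(digest[:8], "big") % n
        second = int.from_bytes(digest[8:16], "big") % n
        return first, second

    def put(self, key: str, value) -> None:
        first, second = self._candidates(key)
        # Update in place if the key already lives in either bucket.
        for index in (first, second):
            for entry in self.buckets[index]:
                if entry[0] == key:
                    entry[1] = value
                    return
        # Otherwise store at the index with the lowest load.
        if len(self.buckets[first]) <= len(self.buckets[second]):
            target = first
        else:
            target = second
        self.buckets[target].append([key, value])

    def get(self, key: str, default=None):
        # A lookup has to check both candidate buckets.
        for index in self._candidates(key):
            for stored_key, value in self.buckets[index]:
                if stored_key == key:
                    return value
        return default


names = TwoChoiceHashTable(num_buckets=16)
names.put("Liam", 2)
names.put("Sophia", 1)
print(names.get("Liam"))  # 2
print(names.get("Tom"))   # None
```

Each lookup scans two short chains instead of one, which is the price paid for keeping the longest chain small.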
Hash Tables
Hash Table
● Ubiquitous data structure for storing associated data. E.g. Map, Dictionary, Dict
● Set of Keys associated with array of values
● Run hash function on key to find position in array to store value
Source: Wikipedia
Hash Collisions
● Hash functions can produce the same output for different keys, which creates a collision
● Collisions are resolved either by probing sequentially for the next free slot or by chaining entries in a linked list
Hash Table Complexity - with chain hashing
Type      Amortized    Worst Case
Space     O(n)         O(n)
Insert    O(1)         O(n)
Lookup    O(1)         O(n)
Delete    O(1)         O(n)
