Probabilistic Data Structures
KYLE J. DAVIS
TECHNICAL MARKETING MANAGER
REDIS LABS
Who We Are
Open source. The leading in-memory database platform, supporting any high performance operational, analytics or hybrid use case.
The open source home and commercial provider of Redis Enterprise technology, platform, products & services.
Stack Overflow Survey: The Most Loved Databases
Redis 64.8%
PostgreSQL 60.8%
MongoDB 55%
SQL Server 54.2%
Cassandra 49.9%
MySQL 49.6%
SQLite 47.2%
Oracle 36.9%
% of devs who expressed interest in continuing to develop with a language/tech
Redis Top Differentiators
1. Performance: NoSQL Benchmark
2. Simplicity: Redis Data Structures
3. Extensibility: Redis Modules
Lists
Hashes
Bitmaps
Strings
Bit field
Streams
Hyperloglog
Sorted Sets
Sets
Geospatial Indexes
Simplicity: Data Structures - Redis’ Building Blocks
Lists
[ A → B → C → D → E ]
Hashes
{ A: “foo”, B: “bar”, C: “baz” }
Bitmaps
0011010101100111001010
Strings
“I'm a Plain Text String!”
Bit field
{23334}{112345569}{766538}
Key
“Retrieve the e-mail address of the user with the highest bid in an auction that started on July 24th at 11:00pm PST”
ZREVRANGE 07242015_2300 0 0
Streams
{id1=time1.seq1(A:“xyz”, B:“cdf”),
id2=time2.seq2(D:“abc”)}
Hyperloglog
00110101 11001110
Sorted Sets
{ A: 0.1, B: 0.3, C: 100 }
Sets
{ A , B , C , D , E }
Geospatial Indexes
{ A: (51.5, 0.12), B: (32.1, 34.7) }
Extensibility: Modules Extend Redis Infinitely
• Add-ons that use a Redis API to seamlessly support additional use cases and data structures.
• Enjoy Redis’ simplicity, super high performance, infinite scalability and high availability.
• Any C/C++/Rust program can become a Module and run on Redis.
• Leverage existing data structures or introduce new ones.
• Can be used by anyone; Redis Enterprise Modules are tested and certified by Redis Labs.
• Turn Redis into a Multi-Model database
Probabla-what-its?
Data Structures: Deterministic
• You know how it will work.
• Data in = data out.
• Data is stored or it isn’t.
• Structure size >= data size
• Examples:
–Hash map (1953)
–Linked lists (1955)
–Heaps (1964)
–…
Data Structures: Probabilistic
• Behaves differently in different contexts
• Data in, maybe data out.
• Provides a fuzzy view of data
• Structure size can be less than data size.
• Examples:
–Bloom Filters (1970/1998)
–Count Min Sketch (2005)
–HyperLogLog (2007)
–Cuckoo Filter (2014)
–…
…BUT WHY?!
Sometimes speed is more important than correctness
Sometimes compactness is more important than correctness
Sometimes you only need certain data guarantees
You can use both!
You will not leave tonight knowing everything about probabilistic data structures. But…
Step 0: The hashing function
• Input: Anything, of any length
• Output: A (very) large number
• Properties: Any change in the input results in a completely different output, but a given input always produces the same output. One way: practically impossible to reverse computationally.
• Cryptographic (SHA family, RIPEMD, etc.)
–Slower to compute
–Very low collision probability
• Non-cryptographic (Murmur, SpookyHash, xxHash, FNV, etc.)
–Fast to compute
–Low collision probability
–Smaller result size
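To make the two hash families concrete, here is a minimal Python sketch. It pairs SHA-256 from the standard library (cryptographic) with a hand-written 64-bit FNV-1a (non-cryptographic). The function names `sha256_int` and `fnv1a_64` are illustrative and not part of the talk; the FNV constants are the published 64-bit offset basis and prime.

```python
import hashlib

# Published 64-bit FNV-1a parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Non-cryptographic: cheap to compute, 64-bit result."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                      # mix in one input byte
        h = (h * FNV64_PRIME) & MASK64  # multiply, keep the low 64 bits
    return h


def sha256_int(data: bytes) -> int:
    """Cryptographic: more work per byte, 256-bit result."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


# Same input, same output; a one-letter change gives an unrelated number.
for text in (b"redis", b"redit"):
    print(text, hex(fnv1a_64(text)), hex(sha256_int(text))[:18] + "...")
```

The structures in the following steps only need a hash that is fast and spreads values evenly, which is why the non-cryptographic family is usually enough for them.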
Step 1: Bloom Filters
• Filter is a weird term for it - think storage, not filtering
• Items are hashed, and the resulting positions are set in a bit field.
• Answers are “maybe” or “no”: an item is possibly present or definitely absent.
• Demo
–http://llimllib.github.io/bloomfilter-tutorial/
–Not precisely how it’s done normally, but nice and visual
• Bit flipping.
• Put items in and query status
–Simplest form: Never fills, just gets bad (the false positive rate climbs).
–More complex: Fills to a pre-determined error rate, then “grows”
• Growing
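A Bloom filter fits in a few lines of Python. The sketch below is the simplest, fixed-size form (it does not grow), and it is not how ReBloom is implemented. The class name `BloomFilter`, the sizing formulas (the textbook ones), and the trick of deriving all bit positions from two halves of one SHA-256 digest are assumptions made for illustration.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: answers 'maybe present' or 'definitely absent'."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Textbook sizing: bits m = -n ln(p) / (ln 2)^2, hashes k = (m / n) ln 2.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # flip the bit on

    def might_contain(self, item: str) -> bool:
        # Any unset bit proves the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Usage: after `bf = BloomFilter(capacity=1000)` and `bf.add("kyle")`, `bf.might_contain("kyle")` is always true, while a name that was never added usually, but not always, returns false.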
Step 1: Bloom Filter Usage (General)
- Username search (speed, guarantees)
- Fraud Mitigation (speed, guarantees)
- Akamai – One hit wonder problem (speed, compactness, guarantees)
- Databases – Disk lookups for non-existent data (speed, guarantees)
- Chrome – Is a URL malicious? (speed, guarantees, combined)
- Bitcoin – Transaction privacy in Simplified Payment Verification (compactness, combined)
- Venti – Only storing unique data in archival storage (speed, guarantees)
- Exim – as part of a rate limiter (speed, compactness, guarantees)
- Medium – Content freshness (speed, guarantees)
Step 1: Bloom Filter Redis Usage
• Provided by ReBloom Module
• BF.ADD [filter name] [item]
• BF.EXISTS [filter name] [item]
• Other commands for edge cases and administration: BF.RESERVE, BF.MADD, BF.MEXISTS, BF.SCANDUMP, BF.LOADCHUNK
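From application code, the two main commands are thin wrappers around a generic "send this command" call. The sketch below assumes a client object exposing `execute_command(*args)`, as the redis-py client does, and a server with the ReBloom module loaded. The helper names and the `RecordingClient` stand-in are hypothetical; the stand-in exists only so the example runs without a server.

```python
from typing import Any, List, Tuple


def remember_username(client: Any, filter_name: str, username: str) -> bool:
    """BF.ADD replies 1 when the item is new to the filter, 0 if it may already be there."""
    return bool(client.execute_command("BF.ADD", filter_name, username))


def username_maybe_taken(client: Any, filter_name: str, username: str) -> bool:
    """BF.EXISTS replies 1 for 'maybe', 0 for 'definitely not'."""
    return bool(client.execute_command("BF.EXISTS", filter_name, username))


class RecordingClient:
    """Hypothetical stand-in for a real client: records commands, always replies 1."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    def execute_command(self, *args: Any) -> int:
        self.calls.append(args)
        return 1


client = RecordingClient()
remember_username(client, "usernames", "kyle")
username_maybe_taken(client, "usernames", "kyle")
print(client.calls)
```

With a real connection, the same helpers would be called with that client instead of `RecordingClient()`.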
Step 2: HyperLogLog
• Funny name again. Estimates the cardinality (number of unique items) of a set.
• Part of the “sketch” family of data types
• Bit flipping and count
• Add, Count or Merge
–Merge is really useful
• 12 KB for the Redis implementation
• Standard error
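The 12 KB size and the standard error are linked by simple arithmetic. The sketch below assumes 16,384 registers of 6 bits each, which is how the Redis documentation describes its dense HyperLogLog encoding, and the usual 1.04 / sqrt(m) approximation for the standard error. Neither number appears on the slide, so treat both as assumptions.

```python
import math

REGISTERS = 2 ** 14        # assumed register count for Redis' dense encoding
BITS_PER_REGISTER = 6      # assumed width of each register

# Memory for the registers alone, ignoring any small header.
size_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Usual HyperLogLog approximation: error shrinks with the square root of m.
standard_error = 1.04 / math.sqrt(REGISTERS)

print(size_bytes, "bytes")             # 12288 bytes, i.e. 12 KB
print(f"{standard_error:.4%} standard error")
```

Under these assumptions the estimate is typically within about 0.8% of the true count, no matter how many items were added.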
Step 2a: How does HyperLogLog work?
Items are hashed. Part of the hash picks a bucket; in the rest of the hash’s binary form, find the position of the first 1 (i.e. the length of the first run of 0s). Each bucket remembers the largest position it has seen. The count is estimated by combining the maxima from all buckets (a harmonic mean, scaled by a constant) rather than trusting any single bucket.
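The mechanics can be sketched in plain Python. This is a teaching version of the original HyperLogLog estimator with a small-range correction, not Redis’ implementation. The class name, the default of 4,096 buckets, and the use of SHA-1 as the hash are assumptions for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Teaching sketch: 2**p buckets, each holding the largest 'first 1' position seen."""

    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for the alpha constant used here")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        bucket = h >> (64 - self.p)                      # top p bits pick the bucket
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # 1-based position of the first 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Harmonic mean of 2**register across buckets, scaled by alpha * m.
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Few items: counting empty buckets is more accurate.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)

    def merge(self, other: "HyperLogLog") -> None:
        if other.p != self.p:
            raise ValueError("can only merge sketches with the same p")
        # The union of two sets is the bucket-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
```

Adding the same item twice never changes a register, which is why duplicates do not inflate the count, and merging is just a bucket-wise maximum, which is why it is cheap.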
Step 2: HyperLogLog Uses (General)
• Facebook Likes (speed, compactness, guarantees)
• Reddit Unique Reads (speed, compactness, guarantees)
• Network Attack Mitigation (speed, compactness, guarantees, combined)
• Neustar (Advertising Platforms) Group Intersections (compactness, guarantees, combined)
Step 2: HyperLogLog Redis Usage
• Built into Redis
• PFADD [hll name] [element… ]
• PFCOUNT [hll name(s)…]
• PFMERGE [dest] [source…]
Step 3: Count Min Sketch
• Frequency Estimation (counting)
• “Sketch” family
• Increment, Query, Merge (with weights!)
• Hash items with multiple functions; each function picks one counter (bucket) in its own row.
–Grid of counters: buckets wide, hash functions deep
–To query, take the minimum across rows
• Initialize with an error rate and a probability to dial in requirements
–0.01% error rate at probability of 0.01% = 40 KB
• Overestimates are possible, especially for items with small counts (underestimates are not)
Initial           B1   B2   B3   B4
Hash 1             0    0    0    0
Hash 2             0    0    0    0
Hash 3             0    0    0    0

‘foo’ INCRBY 1    B1   B2   B3   B4
Hash 1 = 3         0    1    0    1
Hash 2 = 5         0    1    0    0
Hash 3 = 1         0    0    0    0

‘bar’ INCRBY 99   B1   B2   B3   B4
Hash 1 = 11        0    1    0    1
Hash 2 = 5         0  100    0    0
Hash 3 = 8        99    0    0   99

Query `baz` MIN (5,1,0) = 0
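The increment-and-take-the-minimum routine can be sketched in Python as follows. This is a minimal illustration, not the Redis module’s code. The class name, the way each row gets its own hash (SHA-256 over the row number plus the item), and the fixed width and depth are assumptions.

```python
import hashlib
from typing import Iterator, List


class CountMinSketch:
    """Grid of counters: one row per hash function, `width` buckets per row."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows: List[List[int]] = [[0] * width for _ in range(depth)]

    def _columns(self, item: str) -> Iterator[int]:
        # One bucket per row; prefixing the row number yields a different hash per row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def incr_by(self, item: str, amount: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.rows[row][col] += amount

    def query(self, item: str) -> int:
        # Collisions only ever add to a counter, so the smallest one is the best estimate.
        return min(self.rows[row][col] for row, col in enumerate(self._columns(item)))

    def merge(self, other: "CountMinSketch", weight: int = 1) -> None:
        if (other.width, other.depth) != (self.width, self.depth):
            raise ValueError("sketches must have the same dimensions")
        for mine, theirs in zip(self.rows, other.rows):
            for col, value in enumerate(theirs):
                mine[col] += weight * value


cms = CountMinSketch(width=2000, depth=5)
cms.incr_by("foo", 1)
cms.incr_by("bar", 99)
print(cms.query("foo"), cms.query("bar"), cms.query("baz"))
```

An item that was never added can still report a non-zero count if it collides with other items in every row, which is where the overestimates come from.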
Step 3: Count Min Sketch Uses
• Network flows (speed, compactness, guarantees)
• Anomaly Detection (speed, guarantees, combined)
• Outliers (guarantees, combined)
• Power Saving Analytics in IoT Devices (speed, combined)
Step 3: Count Min Sketch Redis Usage
• Provided by Count Min Sketch Module
• CMS.INCRBY [sketch name] [item] [amount to increment] […]
• CMS.QUERY [sketch name] [item] [item…]
• CMS.MERGE [dest] [sketch name] [sketch name…] [WEIGHTS weight weight…]
• CMS.INITBYDIM, CMS.INITBYERR
Cuckoo Filters
CC BY-SA 2.0 / Ltshears
Step 4: Cuckoo Filter
• Same usage patterns as Bloom filters
• Can delete and count items
• Larger than Bloom filters
• Hash x2, fingerprint x1: the two hashes give two candidate buckets; place the fingerprint in one that has room.
–If both are full, kick an existing fingerprint out to its other bucket.
• Lookup does the same hash/fingerprint routine and looks for the fingerprint in either bucket.
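The insert, kick-out, lookup and delete steps can be sketched in Python. This is an illustrative version, not the module’s implementation. The class name, the 16-bit fingerprint, four slots per bucket, the kick limit, and the XOR rule for finding a fingerprint’s other bucket are assumptions chosen to keep the example short.

```python
import hashlib
import random
from typing import List, Tuple


class CuckooFilter:
    def __init__(self, num_buckets: int = 1024, bucket_size: int = 4,
                 max_kicks: int = 500, seed: int = 0) -> None:
        if num_buckets < 1 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two for the XOR trick")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets: List[List[int]] = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def _fingerprint_and_index(self, item: str) -> Tuple[int, int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        fingerprint = int.from_bytes(digest[:2], "big")            # 16-bit fingerprint
        index = int.from_bytes(digest[2:10], "big") % self.num_buckets
        return fingerprint, index

    def _alt_index(self, index: int, fingerprint: int) -> int:
        # The other bucket depends only on the current bucket and the fingerprint,
        # so an evicted fingerprint can be relocated without the original item.
        fp_hash = int.from_bytes(hashlib.sha256(fingerprint.to_bytes(2, "big")).digest()[:8], "big")
        return (index ^ fp_hash) % self.num_buckets

    def add(self, item: str) -> bool:
        fp, i1 = self._fingerprint_and_index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp  # kick a resident out
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # treated as full; the last evicted fingerprint is dropped

    def contains(self, item: str) -> bool:
        fp, i1 = self._fingerprint_and_index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: str) -> bool:
        fp, i1 = self._fingerprint_and_index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deleting works because the fingerprint itself is stored, which a plain Bloom filter’s shared bits cannot offer. Only delete items that were actually added, otherwise another item’s matching fingerprint may be removed.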
Step 4: Cuckoo vs Bloom
• Slower to insert
• Faster to look up
• Great for times when you don’t have a:
–Good Cardinality Estimate
–Tight storage budget
• Only viable option when you need deletes with probabilistic presence detection
• CF.ADD, CF.INSERT, CF.DEL, CF.EXISTS + a few options
Other probabilistic data structures?
Questions?
kyle@redislabs.com / mike@redislabs.com

Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
