Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
What’s in this Presentation

• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile : Hash Storage
Hash Functions
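int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes()); int v2 = hash("goo".getBytes());

int hash(byte[] value) { // a simple hash
    int h = 0;
    for (byte b : value) { h = (h << 5) ^ (h >>> 27) ^ b; } // rotate left by 5 bits, then xor in the byte
    return Math.floorMod(h, PRIME); // PRIME: a prime table size; floorMod keeps the result non-negative
}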
Hash Functions

• Goal : v1 has many bit differences from v2
• Desirable Properties:
 • Uniform Distribution - few collisions
 • Very Fast Computation
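
The "many bit differences" goal can be measured directly: xor the two hash values and count the set bits. The sketch below is only an illustration. The class and method names are made up for this example, and rotXorHash is the simple hash from the earlier slide, written with an unsigned shift and without the final modulo.

```java
import java.nio.charset.StandardCharsets;

public class BitDifference {
    // The simple rotate-and-xor hash from the earlier slide, without the final modulo.
    static int rotXorHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            h = (h << 5) ^ (h >>> 27) ^ b;
        }
        return h;
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }

    public static void main(String[] args) {
        int v1 = rotXorHash("foo".getBytes(StandardCharsets.US_ASCII));
        int v2 = rotXorHash("goo".getBytes(StandardCharsets.US_ASCII));
        System.out.println(bitDifference(v1, v2)); // prints 1
    }
}
```

"foo" and "goo" differ in a single input bit, and because this hash only rotates and xors, the two outputs also differ in a single bit. A hash that mixes well would be expected to flip roughly half of the 32 output bits.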
Hash Applications
Goal: O(1) access
   • Hash Table
   • Hash Set
   • Bloom Filter
Popular Hash Functions
• FNV Hash
• DJB Hash
• Jenkins Hash
• Murmur2
• New (Promising?): CrapWow
• Awesome & Slow: SHA-1, MD5 etc.
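
As a concrete example of one of the listed functions, here is a minimal sketch of 32-bit FNV-1a. The slides do not say which FNV variant was benchmarked, so the choice of FNV-1a is an assumption, and the class name is made up. The offset basis and prime are the standard published 32-bit FNV constants.

```java
import java.nio.charset.StandardCharsets;

public class Fnv1a {
    private static final int OFFSET_BASIS = 0x811c9dc5; // standard 32-bit FNV offset basis
    private static final int PRIME = 0x01000193;        // standard 32-bit FNV prime

    // 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff); // treat the byte as unsigned
            h *= PRIME;      // int multiplication wraps modulo 2^32
        }
        return h;
    }

    public static void main(String[] args) {
        int h = fnv1a32("a".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Integer.toHexString(h)); // prints e40c292c
    }
}
```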
Evaluating Hash Functions
• Hash Function “Zoo”
• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
• Performance (MM ops/s): [bar chart "Hash Performance by Key Length"; y-axis 0 to 45 MM ops/s; key lengths 256, 64, 8; series include Jenkins and Murmur2]
A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
 • a heap or tree: O(lg N) insert/delete
 • a hash: O(1) insert/delete
• What if we don’t have room for K*N
  bytes?
Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate bit vector of size: M = r * N,
  where N is expected number of entries
• Use multiple hash functions of key to
  determine which bits to set
• Premise: if hash functions are well-distributed,
  collisions are few and accuracy is high
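
The bullets above can be sketched as a small in-memory filter. This is not the g414-hash implementation. The class name is made up, a java.util.BitSet stands in for the bit vector (so M is limited to int range), and the k bit positions are derived from two base hashes by double hashing, a technique swapped in here to keep the example short. FNV-1a and Arrays.hashCode are stand-in hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // bit vector size M = r * N
    private final int k; // number of bit positions set per key

    public SimpleBloomFilter(int expectedEntries, int bitsPerKey, int numHashes) {
        this.m = Math.multiplyExact(expectedEntries, bitsPerKey);
        this.k = numHashes;
        this.bits = new BitSet(m);
    }

    // Stand-in hash 1: 32-bit FNV-1a.
    private static int fnv1a32(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) { h ^= (b & 0xff); h *= 0x01000193; }
        return h;
    }

    // Assumption: the i-th position is h1 + i * h2 (double hashing), reduced modulo M.
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(byte[] key) {
        int h1 = fnv1a32(key);
        int h2 = Arrays.hashCode(key) | 1; // stand-in hash 2, forced odd so the step is never zero
        for (int i = 0; i < k; i++) { bits.set(position(h1, h2, i)); }
    }

    // false means definitely absent; true means present or a false positive.
    public boolean mightContain(byte[] key) {
        int h1 = fnv1a32(key);
        int h2 = Arrays.hashCode(key) | 1;
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(h1, h2, i))) { return false; }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter f = new SimpleBloomFilter(1000, 8, 6);
        f.add("foo".getBytes(StandardCharsets.UTF_8));
        System.out.println(f.mightContain("foo".getBytes(StandardCharsets.UTF_8))); // prints true
    }
}
```

A key that was added always reports true, because its bits are never cleared. A key that was never added can still report true if all of its positions happen to be set by other keys, which is the false positive the filter trades for space.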
Bloom Filter
Tuning Bloom Filters
Let r = M bits / N keys (r: num bits/key)
Let k = 0.7 * r      (k: num hashes to use)
Let p = 0.6185 ** r (p: probability of false positives)

Working backwards, we can use desired false
positive rate p to tune the data structure space
consumption:

r = 8, p = 2.1e-2      r = 16, p = 4.5e-4
r = 24, p = 9.8e-6     r = 32, p = 2.1e-7
r = 40, p = 4.5e-9     r = 48, p = 9.6e-11
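
"Working backwards" means solving p = 0.6185 ** r for r, which gives r = ln(p) / ln(0.6185), rounded up to a whole number of bits. A small sketch of that arithmetic follows. The class and method names are made up, and the constants 0.6185 and 0.7 are taken directly from the slide.

```java
public class BloomTuning {
    static final double BASE = 0.6185; // from the slide: p = 0.6185 ** r

    // False positive probability for r bits per key.
    static double falsePositiveRate(int bitsPerKey) {
        return Math.pow(BASE, bitsPerKey);
    }

    // Smallest whole r such that 0.6185 ** r <= p.
    static int bitsPerKeyFor(double p) {
        return (int) Math.ceil(Math.log(p) / Math.log(BASE));
    }

    // From the slide: k = 0.7 * r, rounded to the nearest whole hash count.
    static int numHashes(int bitsPerKey) {
        return (int) Math.round(0.7 * bitsPerKey);
    }

    public static void main(String[] args) {
        int r = bitsPerKeyFor(0.01); // target: 1% false positives
        System.out.println(r + " bits/key, " + numHashes(r) + " hashes"); // prints 10 bits/key, 7 hashes
        System.out.println(falsePositiveRate(r));
    }
}
```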
Bloom Filter Performance
  100MM entries, 8bits/key :    833k ops/s
  100MM entries, 32bits/key :   256k ops/s
  1BN entries, 8bits/key :      714k ops/s
  1BN entries, 32bits/key :     185k ops/s

  Hypothesis : difference between 100MM and
  1BN is due to locality of memory access in
  smaller bit vector
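
For context on the hypothesis, the bit vector sizes behind the four measurements follow from M = r * N. The sketch below only does that arithmetic and does not test the hypothesis itself; the class and method names are made up.

```java
public class BloomFootprint {
    // Size in bytes of a bit vector with M = bitsPerKey * entries bits.
    static long bitVectorBytes(long entries, int bitsPerKey) {
        return entries * bitsPerKey / 8;
    }

    public static void main(String[] args) {
        long[] entryCounts = { 100_000_000L, 1_000_000_000L };
        int[] bitsPerKey = { 8, 32 };
        for (long n : entryCounts) {
            for (int r : bitsPerKey) {
                System.out.println(n + " entries, " + r + " bits/key: "
                        + bitVectorBytes(n, r) + " bytes");
            }
        }
    }
}
```

The vectors range from 100 MB (100MM entries at 8 bits/key) to 4 GB (1BN entries at 32 bits/key). Separately, with k = 0.7 * r, going from 8 to 32 bits per key raises the number of bit probes per operation from about 6 to about 22, which is one plausible reason the 32-bit rows are slower at both sizes.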
Hash-Oriented Storage
•   HashFile : 64-bit clone of djb’s constant db
    “CDB”

•   Plain ol’ Key/Value storage:
     add(byte[] k, byte[] v), byte[] lookup(byte[] k)

•   Constant aka “Immutable” Data Store
     create(), add(k, v) ... , build() ... before lookup(k)

•   Use properties of hash table to achieve
    O(1) disk seeks per lookup
HashFile Structure
• Header (fixed width): table pointers,
  contains offsets of hash tables and count of
  elements per table
• Body (variable width): contains
  concatenation of all keys and values (with
  data lengths)
• Footer (fixed width): hash “tables”
  containing long hash values of keys
  alongside long offsets into body
HashFile Diagram
    HEADER                    BODY                      FOOTER
p1s3p2s4p3s2p4s1   k1v1k2v2k3v3k4v4k5v5k6v6k7v7   hk7o7hk3o3hk4o4hk1o1




• Create: initialize empty header, start appending
  keys/values while recording offsets and hash values
  of keys

• Build: take list of hash values and offsets and turn
  them into hash tables, backfill header with values

• Lookup: compute hash(key), compute offset into
  table (hash modulo size of table), use table to find
  offset into body, return the value from body
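
The create / build / lookup cycle can be sketched in memory as follows. This is a simplified illustration, not the HashFile on-disk format. It assumes a single hash table instead of several tables addressed through a header, linear probing for collisions, 64-bit FNV-1a as the key hash, and a record layout of two 4-byte lengths followed by key and value. All names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MiniHashFile {
    private final ByteArrayOutputStream bodyOut = new ByteArrayOutputStream();
    private final List<long[]> entries = new ArrayList<>(); // {hash, body offset} per key
    private byte[] body;  // set by build()
    private long[] table; // pairs of (hash, offset + 1); offset + 1 == 0 marks an empty slot
    private int slots;

    // Stand-in key hash: 64-bit FNV-1a.
    private static long hash(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) { h ^= (b & 0xff); h *= 0x100000001b3L; }
        return h;
    }

    // Create phase: append a length-prefixed record and remember its hash and offset.
    public void add(byte[] k, byte[] v) {
        entries.add(new long[] { hash(k), bodyOut.size() });
        byte[] lengths = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        bodyOut.write(lengths, 0, lengths.length);
        bodyOut.write(k, 0, k.length);
        bodyOut.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into a hash table.
    public void build() {
        body = bodyOut.toByteArray();
        slots = Math.max(1, entries.size() * 2); // at most half full, so probing always ends
        table = new long[slots * 2];
        for (long[] e : entries) {
            int s = (int) Long.remainderUnsigned(e[0], slots); // hash modulo table size
            while (table[2 * s + 1] != 0) { s = (s + 1) % slots; }
            table[2 * s] = e[0];
            table[2 * s + 1] = e[1] + 1;
        }
    }

    // Lookup (only valid after build()): probe the table, then follow the offset into the body.
    public byte[] lookup(byte[] k) {
        long h = hash(k);
        ByteBuffer buf = ByteBuffer.wrap(body);
        int s = (int) Long.remainderUnsigned(h, slots);
        while (table[2 * s + 1] != 0) {
            if (table[2 * s] == h) {
                int off = (int) (table[2 * s + 1] - 1);
                int kLen = buf.getInt(off);
                int vLen = buf.getInt(off + 4);
                int keyStart = off + 8;
                // Compare the stored key to rule out a hash collision.
                if (Arrays.equals(Arrays.copyOfRange(body, keyStart, keyStart + kLen), k)) {
                    return Arrays.copyOfRange(body, keyStart + kLen, keyStart + kLen + vLen);
                }
            }
            s = (s + 1) % slots;
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        MiniHashFile f = new MiniHashFile();
        f.add("k1".getBytes(StandardCharsets.UTF_8), "v1".getBytes(StandardCharsets.UTF_8));
        f.add("k2".getBytes(StandardCharsets.UTF_8), "v2".getBytes(StandardCharsets.UTF_8));
        f.build();
        byte[] v = f.lookup("k2".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(v, StandardCharsets.UTF_8)); // prints v2
    }
}
```

In this layout a lookup reads one table slot and one body record when there is no collision, which mirrors the idea of a fixed number of disk seeks regardless of how many entries the file holds.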
HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks independent of number
  of entries
• X25E SSD, 1BN entries with 8-byte keys and
  values (41GB): 650μs lookup w/ cold cache,
  up to 700x faster as filesystem cache warms,
  0.9μs when in-memory
• With 100MM entries (4GB), cold cache is
  ~600μs (from locality), 0.6μs warm
Conclusions

• Be aware of different Hash Functions and
  their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast,
  large-scale set membership
• HashFile provides excellent performance in
  cases where a static K/V store suffices
Future Work
• Implement cWow hash in Java
• Extend HashFile with configurable hash,
  pointer, and key/value lengths to conserve
  space (reduce 24 bytes-per-KV overhead)
• Implement a read-write (non-constant)
  version of HashFile
• Bloom Filter that spills to SSD
Thank You!
...Any questions? :)
References
• GitHub Project: g414-hash (hash
  function, bloom filter, HashFile
  implementations)
• Wikipedia: Hash Function, Bloom Filter
• Non-Cryptographic Hash Function Zoo
• DJB CDB, sg-cdb (java implementation)
