ENGINEERING FAST INDEXES
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people
Our recent work: Roaring Bitmaps
http://roaringbitmap.org/
Used by
Apache Spark,
Netflix Atlas,
LinkedIn Pinot,
Apache Lucene,
Whoosh,
Metamarkets' Druid,
eBay's Apache Kylin
Further reading:
Frame of Reference and Roaring Bitmaps (at Elastic, the
company behind Elasticsearch)
Set data structures
We focus on sets of integers: S = {1, 2, 3, 1000}. Ubiquitous in
databases and search engines.
tests: x ∈ S?
intersections: S₂ ∩ S₁
unions: S₂ ∪ S₁
differences: S₂ ∖ S₁
Jaccard Index (Tanimoto similarity) ∣S₁ ∩ S₂∣/∣S₁ ∪ S₂∣
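As a minimal sketch of the operations listed above, Python's built-in set type can stand in for the two sets. The variable names and the contents of the second set are illustrative and do not come from the talk.

```python
# s1 matches the example set above; s2 is made up for this sketch.
s1 = {1, 2, 3, 1000}
s2 = {2, 3, 4}

def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b|; taken to be 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

membership = 1000 in s1       # membership test
intersection = s1 & s2        # values present in both sets
union = s1 | s2               # values present in either set
difference = s1 - s2          # values in s1 that are not in s2
similarity = jaccard(s1, s2)  # two shared values out of five distinct values
```

A hash-based set like this one is convenient, but the rest of the talk explains why it is often not the fastest or smallest choice.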
"Ordered" Set
iterate
in sorted order,
in reverse order,
skippable iterators (jump to first value ≥ x)
Rank: how many elements of the set are smaller than k?
Select: find the kth smallest value
Min/max: find the minimal and maximal values
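The ordered-set operations can be sketched on a sorted array with Python's standard bisect module. The helper names (rank, select, skip_to) are hypothetical, and counting select from zero is an assumption of this sketch.

```python
from bisect import bisect_left

values = [1, 2, 3, 1000]  # a sorted array standing in for the set

def rank(sorted_values, k):
    """Number of elements strictly smaller than k."""
    return bisect_left(sorted_values, k)

def select(sorted_values, k):
    """k-th smallest value, counting from zero (an assumption of this sketch)."""
    return sorted_values[k]

def skip_to(sorted_values, x):
    """Index of the first value >= x, or len(sorted_values) if there is none."""
    return bisect_left(sorted_values, x)

smallest, largest = values[0], values[-1]  # min and max of a non-empty sorted array
```

For example, skip_to(values, 4) lands on the index of 1000, which is how a skippable iterator would jump over the gap.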
Let us make some assumptions...
Many sets containing more than a few integers
Integers span a wide range (e.g., [0, 100000))
Mostly immutable (read often, write rarely)
How do we implement integer sets?
Assume sets are mostly immutable.
sorted arrays ( std::vector<uint32_t> )
hash sets ( java.util.HashSet<Integer> ,
 std::unordered_set<uint32_t> )
…
bitsets ( java.util.BitSet )
compressed bitsets
What is a bitset???
Efficient way to represent a set of integers.
E.g., 0, 1, 3, 4 becomes  0b11011 or "27".
Also called a "bitmap" or a "bit array".
Add and contains on bitset
Most processors work on 64‑bit words.
Given index x, the corresponding word index is x / 64 and the
within‑word bit index is x % 64.
add(x) {
  array[x / 64] |= (1ULL << (x % 64))  // 64-bit constant: a plain 1 would be a 32-bit int in C
}
contains(x) {
  return (array[x / 64] & (1ULL << (x % 64))) != 0  // true if the bit is set
}
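A runnable sketch of add and contains follows. It uses a Python list of integers as the array of 64-bit words; the class name Bitset and the universe_size parameter are hypothetical. Python integers are unbounded, so the 64-bit word width is a convention here rather than something the language enforces.

```python
WORD_BITS = 64

class Bitset:
    """Toy uncompressed bitset over the range [0, universe_size)."""

    def __init__(self, universe_size):
        # One word per 64 possible values, rounded up.
        self.words = [0] * ((universe_size + WORD_BITS - 1) // WORD_BITS)

    def add(self, x):
        # Word index is x // 64, bit index within the word is x % 64.
        self.words[x // WORD_BITS] |= 1 << (x % WORD_BITS)

    def contains(self, x):
        return ((self.words[x // WORD_BITS] >> (x % WORD_BITS)) & 1) == 1

bits = Bitset(128)
for value in (0, 1, 3, 4, 100):
    bits.add(value)
```

After these calls the first word holds 0b11011 (27), matching the earlier example for 0, 1, 3, 4, and the value 100 sets bit 36 of the second word.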
How fast can you set bits in a bitset?
Very fast! Roughly three instructions (on x64)...
index = x / 64 -> a single shift
mask = 1 << ( x % 64) -> a single shift
array[ index ] |= mask -> a logical OR to memory
(Or can use the x64 bts instruction.)
On recent x64 processors, you can set one bit every ≈ 1.65 cycles (in cache)
Recall : Modern processors are superscalar (more than one
instruction per cycle)
Bit‑level parallelism
Bitsets are efficient: intersections
Intersection between {0, 1, 3} and {1, 3}
can be computed as an AND operation between
0b1011 and 0b1010.
Result is 0b1010 or {1, 3}.
Enables branchless processing.
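The same intersection can be checked with a short sketch that packs each set into a single integer. The helpers to_bits and to_set are hypothetical names introduced for illustration. Union and difference follow the same pattern with OR and AND NOT.

```python
def to_bits(values):
    """Pack small non-negative integers into one integer used as a bitset."""
    word = 0
    for v in values:
        word |= 1 << v
    return word

def to_set(word):
    """Return the positions of the 1 bits as a Python set."""
    return {i for i in range(word.bit_length()) if (word >> i) & 1}

a = to_bits({0, 1, 3})  # 0b1011
b = to_bits({1, 3})     # 0b1010

intersection = a & b    # AND keeps bits set in both
union = a | b           # OR keeps bits set in either
difference = a & ~b     # AND NOT keeps bits set in a but not in b
```

None of the three operations inspects individual values, which is why there is no data-dependent branch.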
Bitsets are efficient: in practice
for i in [0...n]
out[i] = A[i] & B[i]
Recent x64 processors can do this at a speed of ≈ 0.5 cycles per
pair of input 64‑bit words (in cache) for n = 1024.
memcpy runs at ≈ 0.3 cycles.
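The loop above can be written as a runnable sketch over lists of 64-bit words. The function name intersect_words and the input bit patterns are made up for illustration. Interpreted Python will not come close to the cycle counts quoted here; the sketch only shows the shape of the computation.

```python
def intersect_words(a, b):
    """Word-by-word AND of two equal-length lists of 64-bit words."""
    if len(a) != len(b):
        raise ValueError("bitsets must have the same number of words")
    return [x & y for x, y in zip(a, b)]

n = 1024  # same word count as in the measurement above
a = [0xFFFF0000FFFF0000] * n  # arbitrary pattern chosen for this sketch
b = [0x00FFFF0000FFFF00] * n  # arbitrary pattern chosen for this sketch
out = intersect_words(a, b)
```

Each output word depends only on the two input words at the same index, so a compiled version of this loop vectorizes easily.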
Bitsets can be inefficient
Relatively wasteful to represent {1, 32000, 64000} with a bitset.
Would use about 8000 bytes (roughly 1000 64‑bit words) to store 3 numbers.
So we use compression...
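The cost of the sparse example can be worked out directly. This sketch assumes an uncompressed bitset made of 64-bit words that covers every value up to the maximum, and compares it with a sorted array of 32-bit integers.

```python
values = [1, 32000, 64000]

# Uncompressed bitset: one bit per possible value up to the maximum, in 64-bit words.
bitset_words = max(values) // 64 + 1
bitset_bytes = bitset_words * 8

# Sorted array of 32-bit integers: four bytes per stored value.
array_bytes = len(values) * 4
```

Under these assumptions the bitset needs 1001 words (8008 bytes) against 12 bytes for the array.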
Memory usage example
dataset : census1881_srt
format                          bits per value
hash sets                       200
arrays                          32
bitsets                         900
compressed bitsets (Roaring)    2
https://github.com/RoaringBitmap/CBitmapCompetition
Performance example (unions)
dataset : census1881_srt
format                          CPU cycles per value
hash sets                       200
arrays                          6
bitsets                         30
compressed bitsets (Roaring)    1
https://github.com/RoaringBitmap/CBitmapCompetition
What is happening? (Bitsets)
Bitsets are often best... except if data is
very sparse (lots of 0s). Then you spend a
lot of time scanning zeros.
Large memory usage
Bad performance
Threshold? ~1:100
Hash sets are not always fast
Hash sets have great one‑value look‑up. But
they have poor data locality and non‑trivial overhead...
h1 <- some hash set
h2 <- some hash set
...
for(x in h1) {
insert x in h2 // "sure" to hit a new cache line!!!!
}
Want to kill Swift?
Swift is Apple's new language. Try this:
var d = Set<Int>()
for i in 1...size {
d.insert(i)
}
//
var z = Set<Int>()
for i in d {
z.insert(i)
}
This blows up! Quadratic‑time.
Same problem with Rust.
What is happening? (Arrays)
Arrays are your friends. Reliable. Simple. Economical.
But... binary search is branchy and has bad locality...
while (low <= high) {
int middleIndex = (low + high) >>> 1;
int middleValue = array.get(middleIndex);
if (middleValue < ikey) {
low = middleIndex + 1;
} else if (middleValue > ikey) {
high = middleIndex - 1;
} else {
return middleIndex;
}
}
return -(low + 1);
Performance: value lookups (x ∈ S)
dataset : weather_sept_85
format                              CPU cycles per query
hash sets ( std::unordered_set )    50
arrays                              900
bitsets                             4
compressed bitsets (Roaring)        80
How do you compress bitsets?
We have long runs of 0s or 1s.
Use run‑length encoding (RLE)
Example: 000000001111111100 can be coded as
00000000 − 11111111 − 00
or
<8><0><8><1><2><0>
using the format < number of repetitions >< value being repeated >
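Run-length encoding in the < number of repetitions >< value being repeated > format can be sketched with itertools.groupby from the standard library. The function names are hypothetical. Real formats such as WAH or EWAH work on whole machine words rather than single bits, so this is only the basic idea.

```python
from itertools import groupby

def rle_encode(bits):
    """Return (repetitions, value) pairs for a string of '0' and '1' characters."""
    return [(len(list(group)), value) for value, group in groupby(bits)]

def rle_decode(pairs):
    """Rebuild the bit string from (repetitions, value) pairs."""
    return "".join(value * count for count, value in pairs)

encoded = rle_encode("000000001111111100")
```

The 18-bit example splits into three runs: eight 0s, eight 1s, and two 0s.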
RLE‑compressed bitsets
Oracle's BBC
WAH (FastBit)
EWAH (Git + Apache Hive)
Concise (Druid)
…
Further reading:
http://githubengineering.com/counting-objects/
Hybrid Model
Decompose 32‑bit space into
16‑bit spaces (chunks).
Given value x, its chunk index is x ÷ 2¹⁶ (16 most significant bits).
For each chunk, use best container to store the 16 least significant bits:
a sorted array ({1,20,144})
a bitset (0b10000101011)
a sequence of sorted runs ([0,10],[15,20])
That's Roaring!
Prior work: O'Neil's RIDBit + BitMagic
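The two-level layout can be illustrated with a toy sketch. This is not the Roaring implementation: the class name TinyRoaring is hypothetical, run containers are left out, and the cutoff of 4096 values for switching from an array to a bitset is an assumption of this sketch that the talk does not state.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # assumed cutoff for switching from array to bitset; not stated in the talk

class TinyRoaring:
    """Toy two-level set of 32-bit integers (hypothetical, no run containers)."""

    def __init__(self):
        # chunk index (16 most significant bits) -> sorted list, or int used as a bitset
        self.chunks = {}

    def add(self, x):
        high, low = x >> 16, x & 0xFFFF
        container = self.chunks.setdefault(high, [])
        if isinstance(container, int):
            self.chunks[high] = container | (1 << low)
            return
        i = bisect_left(container, low)
        if i < len(container) and container[i] == low:
            return  # already present
        container.insert(i, low)
        if len(container) > ARRAY_LIMIT:
            # Too dense for an array: convert the chunk to a bitset.
            bitset = 0
            for v in container:
                bitset |= 1 << v
            self.chunks[high] = bitset

    def contains(self, x):
        high, low = x >> 16, x & 0xFFFF
        container = self.chunks.get(high)
        if container is None:
            return False
        if isinstance(container, int):
            return ((container >> low) & 1) == 1
        i = bisect_left(container, low)
        return i < len(container) and container[i] == low

r = TinyRoaring()
for v in (1, 20, 144, 70000):
    r.add(v)
for v in range(131072, 131072 + 5000):  # a dense stretch that lands in chunk 2
    r.add(v)
```

Here chunk 0 stays a small sorted array, chunk 1 holds the single low value 4464 (70000 minus 65536), and chunk 2 is converted to a bitset once it grows past the assumed cutoff.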
Roaring
All containers fit in 8 kB (several fit in L1 cache)
Attempts to select the best container as you build the bitmaps
Calling  runOptimize will scan (quickly!) non‑run containers
and try to convert them to run containers
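The kind of decision such a conversion involves can be sketched as follows. This is an illustration, not the library's runOptimize code: the function names are hypothetical, and the cost model (2 bytes per array value, 4 bytes per run) is an assumption of this sketch.

```python
def to_runs(sorted_values):
    """Group a sorted list of distinct integers into inclusive [start, end] runs."""
    runs = []
    for v in sorted_values:
        if runs and v == runs[-1][1] + 1:
            runs[-1][1] = v      # extend the current run
        else:
            runs.append([v, v])  # start a new run
    return runs

def prefer_runs(sorted_values):
    """Assumed cost model: 2 bytes per array value, 4 bytes per run."""
    return 4 * len(to_runs(sorted_values)) < 2 * len(sorted_values)

values = list(range(0, 11)) + list(range(15, 21))
runs = to_runs(values)
```

With this model, the 17 values above collapse into two runs and are worth converting, while a scattered array such as [1, 20, 144] is not.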
Performance: union (weather_sept_85)
format     CPU cycles per value
bitsets    0.6
WAH        4
EWAH       2
Concise    5
Roaring    0.6
What helps us...
All modern processors have fast population‑count functions
( popcnt ) to count the number of 1s in a word.
Cheap to keep track of the number of values stored in a bitset!
Choice between array, run and bitset covers many use cases!
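Counting the values in a bitset amounts to summing the population counts of its words. Python does not expose the popcnt instruction directly, so this sketch counts the 1 characters in the binary representation instead (on Python 3.10 and later, int.bit_count() does the same job). The function names are hypothetical.

```python
def popcount(word):
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")

def cardinality(words):
    """Number of values stored in a bitset given as a list of 64-bit words."""
    return sum(popcount(w) for w in words)

words = [0b11011, 0, 1 << 63]
count = cardinality(words)
```

The three words hold four, zero, and one set bits, so the bitset stores five values.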
Go try it out!
Java, Go, C, C++, C#, Rust, Python... (soon: Swift)
http://roaringbitmap.org
Documented interoperable serialized format.
Free. Well‑tested. Benchmarked.
Peer reviewed
Consistently faster and smaller compressed bitmaps with Roaring. Softw., Pract. Exper. (2016)
Better bitmap performance with Roaring bitmaps. Softw., Pract. Exper. (2016)
Optimizing Druid with Roaring bitmaps. IDEAS 2016
Wide community (dozens of contributors).
