ENGINEERING FAST INDEXES (DEEP DIVE)
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people
Roaring : Hybrid Model
A collection of containers...
array: sorted arrays ({1,20,144}) of packed 16‑bit integers
bitset: bitsets spanning 65536 bits or 1024 64‑bit words
run: sequences of runs ([0,10],[15,20])
2
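To make the trade-off between the three container types concrete, here is a rough sketch that estimates how many bytes each representation needs for one chunk of 65536 possible values. The class and method names are hypothetical, and the byte counts (2 bytes per array value, a fixed 8192-byte bitset, 4 bytes per run) are simplifying assumptions that ignore any header.

```java
// Hypothetical helper for illustration; not part of any Roaring library.
class ContainerSizes {
    // Sorted array of packed 16-bit integers: 2 bytes per value.
    static int arrayBytes(int cardinality) {
        return 2 * cardinality;
    }

    // Bitset spanning 65536 bits = 1024 64-bit words = 8192 bytes.
    static int bitsetBytes() {
        return 1024 * 8;
    }

    // Assumption: each run is stored as two 16-bit integers, so 4 bytes per run.
    static int runBytes(int numberOfRuns) {
        return 4 * numberOfRuns;
    }

    // Counts maximal runs of consecutive values in a sorted array of distinct values.
    static int countRuns(char[] sorted) {
        int runs = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0 || sorted[i] != sorted[i - 1] + 1) {
                runs++;
            }
        }
        return runs;
    }

    // Names the smallest representation under the assumptions above.
    static String cheapest(char[] sorted) {
        int a = arrayBytes(sorted.length);
        int b = bitsetBytes();
        int r = runBytes(countRuns(sorted));
        if (r < a && r < b) {
            return "run";
        }
        return (a <= b) ? "array" : "bitset";
    }
}
```

Under these assumptions a sparse set stays an array, a long stretch of consecutive values becomes a run, and a dense set without long runs becomes a bitset.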
Keeping track
E.g., a bitset with few 1s needs to be converted back to an array.
→ we need to keep track of the cardinality!
In Roaring, we do it automagically
3
Setting/Flipping/Clearing bits while keeping track
Important : avoid mispredicted branches
Pure C/Java:
q = p / 64;
ow = w[q];
nw = ow | (1ULL << (p % 64));          // 1L in Java
cardinality += (ow ^ nw) >> (p % 64);  // EXTRA (>>> in Java)
w[q] = nw;
4
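The branch-free update above can be written as a small self-contained Java class. This is a minimal sketch: the class name `TrackedBitset` and its fields are made up for illustration, and it relies on Java masking the shift count of a `long` shift to its low 6 bits, so `1L << p` behaves like `1L << (p % 64)`.

```java
// Illustrative sketch of a 65536-bit bitset that tracks its cardinality.
class TrackedBitset {
    final long[] w = new long[1024]; // 1024 64-bit words
    int cardinality = 0;

    // Sets bit p (0 <= p < 65536) without branching on the old bit value.
    void set(int p) {
        int q = p >>> 6;                          // q = p / 64
        long ow = w[q];
        long nw = ow | (1L << p);                 // shift count is taken mod 64
        cardinality += (int) ((ow ^ nw) >>> p);   // 1 if the bit changed, else 0
        w[q] = nw;
    }

    // Clears bit p, again without a branch.
    void clear(int p) {
        int q = p >>> 6;
        long ow = w[q];
        long nw = ow & ~(1L << p);
        cardinality -= (int) ((ow ^ nw) >>> p);   // 1 if the bit changed, else 0
        w[q] = nw;
    }
}
```

Because `ow ^ nw` can only differ in the one bit being touched, the unsigned shift yields exactly 0 or 1, which is why the cardinality update needs no comparison.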
In x64 assembly with BMI instructions:
shrx %[6], %[p], %[q]        // q = p / 64 (%[6] holds the shift count 6)
mov (%[w],%[q],8), %[ow]     // ow = w[q]
bts %[p], %[ow]              // ow |= (1 << (p % 64)), old bit goes to the carry flag
sbb $-1, %[cardinality]      // cardinality += 1 - carry
mov %[ow], (%[w],%[q],8)     // w[q] = ow
sbb is the extra work
5
For each operation
union
intersection
difference
...
Must specialize by container type:
          array   bitset   run
array       ?       ?       ?
bitset      ?       ?       ?
run         ?       ?       ?
6
High‑level API or Sipping Straw?
7
Bitset vs. Bitset...
Intersection:
First compute the cardinality of the result.
If low, use an array for the result (slow), otherwise generate
a bitset (fast).
Union: Always generate a bitset (fast).
(Unless the cardinality is very high: then maybe create a run!)
We generally keep track of the cardinality of the result.
8
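The intersection rule on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the class name is hypothetical, the threshold of 4096 values is assumed, and the result is returned as either a `char[]` (sorted 16-bit values) or a `long[]` (bitset).

```java
// Illustrative sketch: bitset-vs-bitset intersection choosing the output type.
class BitsetIntersection {
    static final int ARRAY_MAX = 4096; // assumed array/bitset threshold

    // First pass: cardinality of the result, one popcount per word pair.
    static int andCardinality(long[] a, long[] b) {
        int c = 0;
        for (int k = 0; k < a.length; k++) {
            c += Long.bitCount(a[k] & b[k]);
        }
        return c;
    }

    // Returns a long[] bitset when the result is large, else a sorted char[].
    static Object intersect(long[] a, long[] b) {
        int card = andCardinality(a, b);
        if (card > ARRAY_MAX) {
            long[] out = new long[a.length]; // fast path: word-by-word AND
            for (int k = 0; k < a.length; k++) {
                out[k] = a[k] & b[k];
            }
            return out;
        }
        char[] out = new char[card];         // slow path: extract set bits
        int pos = 0;
        for (int k = 0; k < a.length; k++) {
            long word = a[k] & b[k];
            while (word != 0) {
                out[pos++] = (char) (k * 64 + Long.numberOfTrailingZeros(word));
                word &= word - 1;            // clear the lowest set bit
            }
        }
        return out;
    }
}
```

Computing the cardinality first costs an extra pass, but it lets the array be allocated at exactly the right size.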
Cardinality of the result
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
  c += Long.bitCount(A[k] & B[k]);
}
We have 1024 calls to Long.bitCount.
This counts the number of 1s in a 64‑bit word.
9
Population count in Java
// Hacker's Delight
int bitCount(long i) {
  // HD, Figure 5-14
  i = i - ((i >>> 1) & 0x5555555555555555L);
  i = (i & 0x3333333333333333L)
      + ((i >>> 2) & 0x3333333333333333L);
  i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  i = i + (i >>> 32);
  return (int)i & 0x7f;
}
Sounds expensive?
10
Population count in C
How do you think the C compiler clang compiles this code?
#include <stdint.h>
int count(uint64_t x) {
  int v = 0;
  while (x != 0) {
    x &= x - 1;
    v++;
  }
  return v;
}
11
Compile with -O1 -march=native on a recent x64 machine:
popcnt rax, rdi
12
Why care for popcnt?
popcnt: throughput of 1 instruction per cycle (recent Intel CPUs)
Really fast.
13
Population count in Java?
// Hacker's Delight
int bitCount(long i) {
  // HD, Figure 5-14
  i = i - ((i >>> 1) & 0x5555555555555555L);
  i = (i & 0x3333333333333333L)
      + ((i >>> 2) & 0x3333333333333333L);
  i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  i = i + (i >>> 32);
  return (int)i & 0x7f;
}
14
Population count in Java!
Also compiles to popcnt if the hardware supports it
$ java -XX:+PrintFlagsFinal | grep UsePopCountInstruction
bool UsePopCountInstruction = true
But only if you call it as Long.bitCount
15
Java intrinsics
Long.bitCount, Integer.bitCount
Integer.reverseBytes, Long.reverseBytes
Integer.numberOfLeadingZeros, Long.numberOfLeadingZeros
Integer.numberOfTrailingZeros, Long.numberOfTrailingZeros
System.arraycopy
...
16
Cardinality of the intersection
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
  c += Long.bitCount(A[k] & B[k]);
}
A bit over 2 cycles per pair of 64‑bit words:
load A, load B
bitwise AND
popcnt
17
Take away
Bitset vs. Bitset operations are fast,
even if you need to track the cardinality,
even in Java.
E.g., popcnt overhead might be negligible compared to other costs
like cache misses.
18
Array vs. Array intersection
Always output an array. Use galloping, O(m log n), if the sizes
differ a lot.
int intersect(A, B) {
  if (A.length * 25 < B.length) {
    return galloping(A, B);
  } else if (B.length * 25 < A.length) {
    return galloping(B, A);
  } else {
    return boring_intersection(A, B);
  }
}
19
Galloping intersection
You have two arrays, a small one and a large one...
while (k1 < largeSet.length && k2 < smallSet.length) {
  if (largeSet[k1] < smallSet[k2]) {
    find k1 by binary search such that
    largeSet[k1] >= smallSet[k2] (stop if there is no such k1)
  }
  if (smallSet[k2] < largeSet[k1]) {
    ++k2;
  } else {
    // got a match! (smallSet[k2] == largeSet[k1])
    output smallSet[k2]; ++k1; ++k2;
  }
}
If the small set is tiny, runs in O(log(size of big set))
20
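A runnable version of the galloping idea might look like the sketch below. It is one possible reading of the slide, not the library's code: the names `Galloping` and `advanceUntil` are chosen here for illustration, and the search doubles its step before falling back to a binary search, which is one common way to locate the next candidate position.

```java
import java.util.Arrays;

// Illustrative sketch of a galloping intersection of two sorted arrays.
class Galloping {
    // Smallest index >= from with large[index] >= target, or large.length if none.
    static int advanceUntil(char[] large, int from, int target) {
        int n = large.length;
        if (from >= n || large[from] >= target) {
            return from;
        }
        int lower = from;   // invariant: large[lower] < target
        int span = 1;
        while (from + span < n && large[from + span] < target) {
            lower = from + span;
            span *= 2;      // gallop: double the step each time
        }
        int upper = Math.min(from + span, n); // n, or large[upper] >= target
        while (lower + 1 < upper) {           // binary search in (lower, upper)
            int mid = (lower + upper) >>> 1;
            if (large[mid] < target) {
                lower = mid;
            } else {
                upper = mid;
            }
        }
        return upper;
    }

    static char[] intersect(char[] small, char[] large) {
        char[] out = new char[small.length];
        int pos = 0;
        int k1 = 0;
        for (int k2 = 0; k2 < small.length && k1 < large.length; k2++) {
            k1 = advanceUntil(large, k1, small[k2]);
            if (k1 < large.length && large[k1] == small[k2]) {
                out[pos++] = small[k2]; // got a match
            }
        }
        return Arrays.copyOf(out, pos);
    }
}
```

Each value of the small array triggers one logarithmic search in the large array, which is where the O(m log n) bound comes from.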
Array vs. Array union
Union: If the sum of the cardinalities is large, go for a bitset. Revert to an
array if we got it wrong.
union (A,B) {
  total = A.length + B.length;
  if (total > DEFAULT_MAX_SIZE) { // bitmap?
    create empty bitmap C and add both A and B to it
    if (C.cardinality <= DEFAULT_MAX_SIZE) {
      convert C to array
    } else if (C is full) {
      convert C to run
    } else {
      C is fine as a bitmap
    }
  } else {
    merge the two arrays and output an array
  }
}
21
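The union heuristic can be sketched in Java as below. Several things here are assumptions made for illustration: the class name, the value 4096 for `DEFAULT_MAX_SIZE`, and the choice to return either a `char[]` or a `long[]`. The conversion of a full bitmap to a run container is left out to keep the example short.

```java
import java.util.Arrays;

// Illustrative sketch of array-vs-array union (run conversion omitted).
class ArrayUnion {
    static final int DEFAULT_MAX_SIZE = 4096; // assumed threshold

    // Returns a sorted char[] or, if the result is large, a long[] bitset.
    static Object union(char[] a, char[] b) {
        if (a.length + b.length > DEFAULT_MAX_SIZE) {
            long[] bits = new long[1024];
            for (char v : a) bits[v >>> 6] |= 1L << v;
            for (char v : b) bits[v >>> 6] |= 1L << v;
            int card = 0;
            for (long word : bits) card += Long.bitCount(word);
            if (card > DEFAULT_MAX_SIZE) {
                return bits; // the bitmap guess was right
            }
            // Guessed wrong: convert the bitmap back to a sorted array.
            char[] out = new char[card];
            int pos = 0;
            for (int k = 0; k < bits.length; k++) {
                long word = bits[k];
                while (word != 0) {
                    out[pos++] = (char) (k * 64 + Long.numberOfTrailingZeros(word));
                    word &= word - 1;
                }
            }
            return out;
        }
        return merge(a, b);
    }

    // Plain two-pointer merge that drops duplicates across the two inputs.
    static char[] merge(char[] a, char[] b) {
        char[] out = new char[a.length + b.length];
        int i = 0, j = 0, pos = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[pos++] = a[i++];
            } else if (b[j] < a[i]) {
                out[pos++] = b[j++];
            } else {
                out[pos++] = a[i++];
                j++;
            }
        }
        while (i < a.length) out[pos++] = a[i++];
        while (j < b.length) out[pos++] = b[j++];
        return Arrays.copyOf(out, pos);
    }
}
```

The sum of the input sizes is only an upper bound on the result size, which is why the bitmap path has to re-check the true cardinality afterwards.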
Array vs. Bitmap (Intersection)...
Intersection: Always an array.
Branchy (3 to 16 cycles per array value):
answer = new array
for value in array {
  if value in bitset {
    append value to answer
  }
}
22
Branchless (3 cycles per array value):
answer = new array
pos = 0
for value in array {
  answer[pos] = value
  pos += bit_value(bitset, value)
}
23
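The branchless loop translates almost line for line into Java. In this sketch the class name and the helper `bitValue` are hypothetical stand-ins for the `bit_value` function on the slide, and the bitset is assumed to be a `long[]` of 1024 words.

```java
import java.util.Arrays;

// Illustrative sketch of the branchless array-vs-bitset intersection.
class ArrayBitsetIntersection {
    // Returns 1 if bit `value` is set in the bitset, else 0, without a branch.
    static int bitValue(long[] bitset, int value) {
        return (int) ((bitset[value >>> 6] >>> value) & 1L);
    }

    static char[] intersect(char[] array, long[] bitset) {
        char[] answer = new char[array.length];
        int pos = 0;
        for (char value : array) {
            answer[pos] = value;             // always write the candidate
            pos += bitValue(bitset, value);  // keep it only if the bit is set
        }
        return Arrays.copyOf(answer, pos);
    }
}
```

A rejected value is simply overwritten by the next candidate, so the loop body has no data-dependent branch for the processor to mispredict.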
Array vs. Bitmap (Union)...
Always a bitset. Very fast. Few cycles per value in array.
answer = clone the bitset
for value in array { // branchless
  set bit in answer at index value
}
Without tracking the cardinality ≈ 1.65 cycles per value
Tracking the cardinality ≈ 2.2 cycles per value
24
Parallelization is not just multicore + distributed
In practice, all commodity processors support Single instruction,
multiple data (SIMD) instructions.
Raspberry Pi
Your phone
Your PC
Working with words x × larger has the potential of multiplying the
performance by x.
No lock needed.
Purely deterministic/testable.
25
SIMD is not too hard conceptually
Instead of working with x + y you do
(x_1, x_2, x_3, x_4) + (y_1, y_2, y_3, y_4).
Alas: it is messy in actual code.
26
With SIMD small words help!
With scalar code, working on 16‑bit integers is not 2 × faster than
32‑bit integers.
But with SIMD instructions, going from 64‑bit integers to 16‑bit
integers can mean a 4 × gain.
Roaring uses arrays of 16‑bit integers.
27
Bitsets are vectorizable
Logical ORs, ANDs, ANDNOTs, XORs can be computed fast with
Single instruction, multiple data (SIMD) instructions.
Intel Cannonlake (late 2017), AVX‑512
Operate on 64 bytes with ONE instruction
→ Several 512‑bit ops/cycle
Java 9's Hotspot can use AVX 512
ARM v8‑A to get Scalable Vector Extension...
up to 2048 bits!!!
28
Java supports advanced SIMD instructions
$ java -XX:+PrintFlagsFinal -version |grep "AVX"
intx UseAVX = 2
29
Vectorization matters!
for (size_t i = 0; i < len; i++) {
  a[i] |= b[i];
}
using scalar : 1.5 cycles per byte
with AVX2 : 0.43 cycles per byte (3.5 × better)
With AVX‑512, the performance gap exceeds 5 ×
Can also vectorize OR, AND, ANDNOT, XOR + population count
(AVX2‑Harley‑Seal)
30
Vectorization beats popcnt
int count = 0;
for (size_t i = 0; i < len; i++) {
  count += popcount(a[i]);
}
using fast scalar (popcnt): 1 cycle per input byte
using AVX2 Harley‑Seal: 0.5 cycles per input byte
even greater gain with AVX‑512
31
Sorted arrays
sorted arrays are vectorizable:
array union
array difference
array symmetric difference
array intersection
sorted arrays can be compressed with SIMD
32
Bitsets are vectorizable... sadly...
Java's HotSpot is limited in what it can autovectorize:
1. Copying arrays
2. String.indexOf
3. ...
And it seems that Unsafe effectively disables autovectorization!
33
There is hope yet for Java
One big reason, today, for binding closely to hardware is to
process wider data flows in SIMD modes. (And IMO this is a
long‑term trend towards right‑sizing data channel widths, as
hardware grows wider in various ways.) AVX bindings are where
we are experimenting, today
(John Rose, Oracle)
34
Fun things you can do with SIMD: Masked VByte
Consider the ubiquitous VByte format:
Use 1 byte to store all integers in [0, 2^7)
Use 2 bytes to store all integers in [2^7, 2^14)
...
Decoding can become a bottleneck. Google developed Varint‑GB.
What if you are stuck with the conventional format? (E.g., Lucene,
LEB128, Protocol Buffers...)
35
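For reference, a scalar encoder and decoder for this kind of format can be sketched as follows. This is plain Java, not the SIMD Masked VByte decoder. It assumes the LEB128-style convention (low 7 bits first, high bit set on every byte except the last of each integer), and other VByte variants flip that convention. The class name is made up for illustration.

```java
import java.io.ByteArrayOutputStream;

// Illustrative scalar VByte codec (LEB128-style byte layout assumed).
class VByte {
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {        // more than 7 bits left
                out.write((v & 0x7F) | 0x80); // emit 7 bits, flag "more follows"
                v >>>= 7;
            }
            out.write(v);                     // last byte, high bit clear
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] in, int count) {
        int[] out = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            int shift = 0;
            byte b;
            do {
                b = in[pos++];
                v |= (b & 0x7F) << shift;
                shift += 7;
            } while (b < 0);                  // high bit set: more bytes follow
            out[i] = v;
        }
        return out;
    }
}
```

The data-dependent `do`/`while` in the decoder is the part that tends to become a bottleneck, since the number of bytes per integer is only known after each byte is inspected.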
Masked VByte
Joint work with J. Plaisance (Indeed.com) and N. Kurz.
http://maskedvbyte.org/
36
Go try it out!
Fully vectorized Roaring implementation (C/C++):
https://github.com/RoaringBitmap/CRoaring
Wrappers in Python, Go, Rust...
37

Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by Daniel Lemire
