Optimizing array-based 
data structures 
to the limit 
Roman Leventov 
Higher Frequency Trading Ltd. 
leventov@ya.ru 
August 28, 2014
Overview 
Indexing 
Encoding of distinct entry states 
Object data 
Primitive data 
Layout of tuples of primitives
Benchmarking environments 
1. AMD K10 (2007), 
L1 cache: 128 KB, L2: 512 KB, L3: 6 MB 
2. Intel Sandy Bridge (2011), 
L1: 64 KB, L2: 256 KB, L3: 20 MB 
3. Intel Haswell (2013), 
L1: 64 KB, L2: 256 KB, L3: 3 MB 
64-bit Java 1.8.0-b129–8u20 
JMH ??–0.9.8 
If not specified, measurements are in CPU clock 
cycles per operation or loop iteration.
Section 1 
Indexing
Indexing 
Simple 
int e = a[i]; 
vs. 
Unsafe 
long off = ((long) i) << INT_SCALE_SHIFT;
int e = U.getInt(a, INT_BASE + off);
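For reference, a minimal sketch of how the U handle and the INT_BASE / INT_SCALE_SHIFT constants can be obtained on HotSpot (the acquisition code is an assumption; only the constant names come from the slides):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class UnsafeAccess {
    static final Unsafe U;
    static final long INT_BASE;        // offset of element 0 in an int[]
    static final int INT_SCALE_SHIFT;  // log2 of the int element size (= 2)
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
            INT_BASE = U.arrayBaseOffset(int[].class);
            INT_SCALE_SHIFT =
                31 - Integer.numberOfLeadingZeros(U.arrayIndexScale(int[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    private UnsafeAccess() {}
}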
Why unsafe indexing at all?
HotSpot JIT doesn't eliminate bounds checks
as perfectly as you probably think.
Why simple indexing at all?
In performance-critical code 
Simple 
; cmp r8d, ebx
; jae <IOOBE location>
mov r11, [r9 + r8*4 + 16]
Unsafe
mov r10, r8
shl r10, 2
mov r11, [r9 + r10 + 16]
%r9 holds a; %r8 holds i
16 is INT_BASE: object header (12 bytes) + array length field (4 bytes)
Iteration over parallel arrays 
Indexing case #1 
@Benchmark
public int _2_simple(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    for (int i = xs.length; i-- > 0;)
        dummy ^= xs[i] + ys[i];
    return dummy;
}
Bounds checks are fully eliminated!
Iteration over parallel arrays 
Indexing case #1 
@Benchmark
public int _2_unsafe(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    long off = xs.length * INT_SCALE;
    while ((off -= INT_SCALE) >= 0)
        dummy ^= U.getInt(xs, INT_BASE + off) +
                 U.getInt(ys, INT_BASE + off);
    return dummy;
}
Iteration over parallel arrays 
Indexing case #1 
# of arrays        1      2      3      4
SB   Simple     0.78    1.3    2.2    3.4
     Unsafe      1.6    1.8    2.5    3.2
HW   Simple      1.2    2.1    3.3    4.9
     Unsafe      2.1    2.6    3.2    4.3
K10  Simple      1.6    5.8   13.1   19.5
     Unsafe      2.9    6.4   11.8   17.1
Unsafe indexing is slower with a single array or 2-3 parallel arrays
because of an extra instruction in the tight loop. A JIT compiler fault?
Binary heap 
Indexing case #2
Binary heap 
Indexing case #2 
int leftChildI = parentI * 2 + 1;
int rightChildI = leftChildI + 1;

long leftChildOff = parentOff * 2 + INT_SCALE;
long rightChildOff = leftChildOff + INT_SCALE;
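For context, a minimal max-heap sift-down sketch using the simple-indexing form of this arithmetic (an illustration, not the benchmarked code); the unsafe variant substitutes the offset expressions above:

static void siftDown(int[] heap, int size, int parentI) {
    int parent = heap[parentI];
    while (true) {
        int childI = parentI * 2 + 1;                // left child
        if (childI >= size) break;                   // parentI is a leaf
        int rightI = childI + 1;
        if (rightI < size && heap[rightI] > heap[childI])
            childI = rightI;                         // pick the larger child
        if (heap[childI] <= parent) break;           // heap property restored
        heap[parentI] = heap[childI];                // move the child up
        parentI = childI;
    }
    heap[parentI] = parent;
}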
Binary heap sort 
Indexing case #2 
The heapsort version with unsafe indexing is faster
by 12–13% on a 4 KB array and by 7–10% on a 4 MB array.
With simple indexing, the lower bounds checks
are eliminated, but the upper ones mostly aren't.
Linear hash 
Indexing case #3 
def any_lhash_op(key[, ...]):
    i = hash(key) % table_size
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + 1) % table_size
First access is random, then sequential. 
Table size is a power of 2, therefore bitwise 
masking & (table_size - 1) is used 
instead of modulo.
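The same lookup written out in Java, as a minimal sketch: the Object[] key table, the null-means-empty convention, and the mix() spreader are assumptions, and the probe relies on the table never being completely full:

static int indexOf(Object[] keys, Object key) {
    int mask = keys.length - 1;               // keys.length is a power of 2
    int i = mix(key.hashCode()) & mask;       // & mask instead of % table_size
    while (true) {
        Object k = keys[i];
        if (k == null) return -1;             // empty slot: key is absent
        if (k.equals(key)) return i;          // full slot holding the key
        i = (i + 1) & mask;                   // next slot, wrapping around
    }
}

static int mix(int h) { return h ^ (h >>> 16); }  // placeholder hash spreader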
Quadratic hash 
Indexing case #3 
def any_qhash_op(key[, ...]):
    i = hash(key) % table_size
    step = 0
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        step += 1
        i = (i + step) % table_size
Random, then local, then non-local access.
A two-way modification of this algorithm is tested,
in which the table size isn't a power of 2: one integer
division per op.
Double hash 
Indexing case #3 
def any_dhash_op(key[, ...]):
    i = hash(key) % table_size
    step = hash2(key)
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + step) % table_size
Random access.
The table size isn't a power of 2: one or two
(on collisions) integer divisions per op.
Composite hash benchmark 
Indexing case #3 
load factor        0.3          0.6          0.9
L.  SB          1.9 ± 1.0    1.7 ± 1.0    2.1 ± 1.1
    HW          5.5 ± 1.3    4.9 ± 0.7    4.3 ± 0.9
    K10        10.3 ± 0.5    8.2 ± 0.2    1.6 ± 0.7
Q.  SB          0.2 ± 1.9    2.0 ± 1.8    0.9 ± 1.9
    HW          2.3 ± 2.3    2.7 ± 1.4    0.3 ± 1.5
    K10         1.6 ± 0.5    0.5 ± 0.2    5.6 ± 0.3
D.  SB         11.5 ± 2.5   15.2 ± 1.1   23.7 ± 1.3
    HW          9.9 ± 2.3   13.5 ± 1.2   26.2 ± 1.0
    K10         4.3 ± 0.2    9.4 ± 0.1   17.6 ± 0.4
Relative difference of unsafe indexing time vs. simple indexing,
in percent (L. = linear, Q. = quadratic, D. = double hashing; mean ± error).
Indexing: bottom line 
Unsafe indexing is worth considering in the hottest
methods. Try to avoid it where you can, but: measure, don't guess.
Not investigated:
- Performance of unsafe indexing on 32-bit VMs and CPUs; all results should be rechecked there.
- Interference of unsafe indexing with loop unrolling and vectorization.
Section 2 
Encoding of distinct 
entry states
Use-cases of entry states 
Full state + data, or empty state: 
- Open hash table implementations (taken/empty slots)
- Nullable non-object data in the subject domain
- Lists or queues with half-lazy in-place filtering
Collections of tuples of primitive/object and 
boolean (or binary state).
Object data 
Obvious: null in slots of empty state, domain 
objects in full slots. 
But what if domain objects are nullable 
themselves?
What if nullable Object data? 
Special empty object:
static final Object EMPTY_SLOT = new Object();
Domain nulls are stored as is.

Masking domain nulls:
static final Object NULL_MASK = new Object();
...
Object maskedData = data != null ? data : NULL_MASK;
null marks empty slots.
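A minimal sketch of the masking option (the helper names are assumptions):

static final Object NULL_MASK = new Object();

// Before storing into a full slot: hide a domain null behind the sentinel.
static Object mask(Object data) {
    return data != null ? data : NULL_MASK;
}

// After loading from a full slot: restore the domain null.
static Object unmask(Object stored) {
    return stored != NULL_MASK ? stored : null;
}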
What if nullable Object data? 
The rule: null should be the value that is stored in memory
and compared against more frequently than the special object.
Often the same choice is right for both goals.
Why store nulls 
Nullable Object data + states 
Don't forget the amortized costs of storing Objects
rather than nulls: at least one extra dereference
and check per location during garbage collection.
Also, a freshly allocated array is already filled with
nulls, so it needs no extra initialization pass.
Why compare to null 
Nullable Object data + states 
Explicit null checks are almost always free: they are
merged with the VM-generated ones (which throw NPE).
In the remaining cases a comparison to null is still
cheaper than a comparison to the special object, because
- null doesn't have to be read from anywhere in advance
- checks against zero are well supported on x86
And what if nullable Object data? 
In hash tables, the domain null (there is at most one!)
should be masked and empty slots should be filled with
nulls. But the implementation is harder than with a
special empty object.
Got it right: java.util.IdentityHashMap. 
Got it wrong: almost all other open hash 
implementations.
Primitive data 
No natural way to express nullability. There isn't even
a natural word for it :)
Arrays of boxed primitives
Separate byte state 
Primitive data + states 
boolean[] or byte[] and data arrays in parallel: 
if (used[i])
    doSomething(data[i]);
The easiest to implement.
Separate bit state 
Primitive data + states 
Hand-written bit set and data arrays in parallel: 
long word = bitWords[i >> 6];
if ((word & (1L << i)) != 0)
    doSomething(data[i]);
Advantages of separate bit state 
Primitive data + states 
Almost no additional memory is used.
Sequential state checks often don't require
memory reads (until the current word is exhausted).
Iteration can employ the very cheap
numberOfLeading(Trailing)Zeros intrinsics (see the sketch below).
Intel: Haswell+. AMD: leading zeros since K10, trailing zeros since Piledriver.
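A minimal iteration sketch, assuming the layout from the previous slide (bitWords[i >> 6] holds the state bit of data[i], and data has at least 64 * bitWords.length elements):

import java.util.function.LongConsumer;

static void forEachFull(long[] bitWords, long[] data, LongConsumer action) {
    for (int w = 0; w < bitWords.length; w++) {
        long word = bitWords[w];
        while (word != 0) {
            int bit = Long.numberOfTrailingZeros(word);  // cheap intrinsic
            action.accept(data[(w << 6) + bit]);         // index = w * 64 + bit
            word &= word - 1;                            // clear the lowest set bit
        }
    }
}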
Disadvantages of separate bit state 
Primitive data + states 
Only for a binary state.
On purely random access there is no advantage over byte
states except memory usage; you just perform extra
work to extract bits.
Relatively tricky to implement
(java.util.BitSet is not an option).
Special value as a state 
Primitive data + states 
long d = data[i];
if (d != EMPTY)
    doSomething(d);
Suitable only when there is a full state and one or 
several empty states.
Advantages of special values 
Primitive data + states 
Zero memory overhead.
All of an entry's data can reside in a single memory
location:
- fewer memory reads are required
- cache-friendly
- atomic updates are possible
Special value management 
Primitive data + states 
When the data domain is bounded, special values
are a clear winner for encoding states: just pick
a constant outside the data domain, preferably 0,
as the special value. (For example, if the data is known
to be strictly positive, 0 can mark an empty slot.)
Special value management 
Primitive data + states 
However, if the data domain is unbounded,
special values as states have a number of disadvantages:
- The special value has to be stored within the data structure and read on each query.
- Comparison to a non-constant is slower, especially compared to a comparison with zero.
- On a collision with real data, the special value has to be replaced, which is impossible without locking if the data structure has to be thread-safe, or if it is offline in any sense.
- The implementation becomes more complicated.
Zero value as a state 
Primitive data + states 
An attempt to resolve one of the problems of dynamic
special values: data is still compared to zero, and
when zero is passed in as data, it is masked
with another value:
if (data == zeroMask) changeZeroMask();
data = data == 0 ? zeroMask : data;
...
long d = data[i];
if (d != 0)
    doSomething(d);
But now the data has to be masked and unmasked all the
time, and the implementation gets even more complicated.
Byte along state 
Primitive data + states 
Like the separate byte state, but more memory-local.
On the other hand:
- Only unsafe access (see Section 1)
- Tedious to implement
- Memory IO that crosses cache-line boundaries, which
  1) carries a penalty on many CPUs, 2) is not atomic,
  so out-of-thin-air values can appear unless the data
  structure is synchronized or IO is performed only
  via CAS ops (Nitsan Wakart).
Benchmarking LHash queries, 
random queries 
Primitive data + states 
All the hash data fits in L1:
- Load factors 0.3–0.6: byte states typically win
- Load factor 0.9: bit states win, sometimes special values
Big hashes (don't fit the caches):
- Successful queries: special values win
- Unsuccessful queries, including insertions: bit states win
- Byte-along states outperform plain byte states
Zero states (with replacement) are never an option.
Benchmarking LHash queries, 
iteration 
Primitive data + states 
Internal iteration (forEach): special values win.
External iteration (iterators): byte states win.
But on Haswell and K10, of course, bit states beat
them all.
Byte-along states and zero states with replacement
always lose.
Encoding of distinct entry states:
bottom line 
Object[] arrays: more nulls. 
Primitive arrays: special values as states, when 
applicable. Bit states for iteration on Haswell+ and 
K10+.
Section 3 
Layout of tuples of primitives
Layout of tuples of primitives 
When random access is needed, we always strive 
for memory locality.
Two fields of the same length 
Layout of tuples of primitives 
byte+byte, char+short, long+double
(longBitsToDouble() is a no-op).
For tuples of up to 8 bytes, pack each tuple into a single
element of a wider primitive array, e.g. long[] for int+int
tuples (a packing sketch follows the list).
- Guarantees the tuple lies on a single cache line.
- Allows getting closer to the Java array size limit.
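A minimal packing sketch for the int+int case (helper names are assumptions):

// Each int+int tuple is stored as one element of a long[].
static long pack(int key, int value) {
    return ((long) key << 32) | (value & 0xFFFFFFFFL);
}

static int key(long entry)   { return (int) (entry >>> 32); }
static int value(long entry) { return (int) entry; }

// Usage: long[] table = new long[n]; table[i] = pack(k, v);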
One field is two times longer than 
another 
Layout of tuples of primitives 
byte+short, int+double, ...
If cross-cache-line-boundary IO is not an option,
use a layout that keeps each tuple within a single cache line.
Requires accessing the individual fields via Unsafe.
One field is 4-8 times longer than 
another 
Layout of tuples of primitives 
If cross cache line boundary IO is not an option, 
the only reasonable approach is: 
k1, long, 8 bytes
k2, long, 8 bytes
k3, long, 8 bytes
v1, short, 2 bytes
v2, short, 2 bytes
v3, short, 2 bytes
2-byte gap
k4, long, 8 bytes
...
One field is 4-8 times longer than 
another 
Layout of tuples of primitives 
Fields of the same tuple will still end up on different
cache lines with some probability.
Indexing: 
long kOff = (i / 3) * 32L + (i % 3) * 8;
long vOff = (i / 3) * 32L + 24 + (i % 3) * 2;
Integer division :(
Integer division by a small constant
— Maybe this will help? (see Hacker’s Delight) 
long quot = (i * 0x55555556L) >> 32;
long rem = i - quot * 3;
long kOff = quot * 32 + rem * 8;
long vOff = quot * 32 + 24 + rem * 2;
— No, it won't, because we need to obtain
the remainder as well as the quotient.
The End
