Optimizing array-based 
data structures 
to the limit 
Roman Leventov 
Higher Frequency Trading Ltd. 
leventov@ya.ru 
August 28, 2014
Overview 
Indexing 
Encoding of distinct entry states 
Object data 
Primitive data 
Layout of tuples of primitives
Benchmarking environments 
1. AMD K10 (2007), 
L1 cache: 128 KB, L2: 512 KB, L3: 6 MB 
2. Intel Sandy Bridge (2011), 
L1: 64 KB, L2: 256 KB, L3: 20 MB 
3. Intel Haswell (2013), 
L1: 64 KB, L2: 256 KB, L3: 3 MB 
64-bit Java 1.8.0-b129–8u20 
JMH ??–0.9.8 
If not specified, measurements are in CPU clock 
cycles per operation or loop iteration.
Section 1 
Indexing
Indexing 
Simple 
int e = a[i]; 
vs. 
Unsafe 
long off = ((long) i) << INT_SCALE_SHIFT;
int e = U.getInt(a, INT_BASE + off);
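For reference, a minimal sketch of how the U handle and the INT_BASE / INT_SCALE_SHIFT constants can be obtained on HotSpot (the acquisition code is an assumption; only the constant names come from the slides):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class UnsafeAccess {
    static final Unsafe U;
    static final long INT_BASE;        // offset of element 0 in an int[]
    static final int INT_SCALE_SHIFT;  // log2 of the int element size (= 2)
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
            INT_BASE = U.arrayBaseOffset(int[].class);
            INT_SCALE_SHIFT =
                31 - Integer.numberOfLeadingZeros(U.arrayIndexScale(int[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    private UnsafeAccess() {}
}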
Why unsafe indexing at all?
HotSpot JIT doesn't eliminate bounds checks
as perfectly as you probably think.
Why simple indexing at all?
In performance-critical code 
Simple 
; cmp r8d, ebx
; jae <IOOBE location>
mov r11, [r9 + r8*4 + 16]
Unsafe
mov r10, r8
shl r10, 2
mov r11, [r9 + r10 + 16]
%r9 holds a; %r8 holds i
16 is INT_BASE: object header (12 bytes) + array length field (4 bytes)
Iteration over parallel arrays 
Indexing case #1 
@Benchmark
public int _2_simple(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    for (int i = xs.length; i-- > 0;)
        dummy ^= xs[i] + ys[i];
    return dummy;
}
Bounds checks are fully eliminated!
Iteration over parallel arrays 
Indexing case #1 
@Benchmark
public int _2_unsafe(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    long off = xs.length * INT_SCALE;
    while ((off -= INT_SCALE) >= 0)
        dummy ^= U.getInt(xs, INT_BASE + off) +
                 U.getInt(ys, INT_BASE + off);
    return dummy;
}
Iteration over parallel arrays 
Indexing case #1 
# of arrays        1      2      3      4
SB   Simple     0.78    1.3    2.2    3.4
     Unsafe      1.6    1.8    2.5    3.2
HW   Simple      1.2    2.1    3.3    4.9
     Unsafe      2.1    2.6    3.2    4.3
K10  Simple      1.6    5.8   13.1   19.5
     Unsafe      2.9    6.4   11.8   17.1
Unsafe indexing is slower with a single array or 2-3 parallel arrays
because of an extra instruction in the tight loop. A JIT compiler fault?
Binary heap 
Indexing case #2
Binary heap 
Indexing case #2 
int leftChildI = parentI * 2 + 1;
int rightChildI = leftChildI + 1;

long leftChildOff = parentOff * 2 + INT_SCALE;
long rightChildOff = leftChildOff + INT_SCALE;
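For context, a minimal max-heap sift-down sketch using the simple-indexing form of this arithmetic (an illustration, not the benchmarked code); the unsafe variant substitutes the offset expressions above:

static void siftDown(int[] heap, int size, int parentI) {
    int parent = heap[parentI];
    while (true) {
        int childI = parentI * 2 + 1;                // left child
        if (childI >= size) break;                   // parentI is a leaf
        int rightI = childI + 1;
        if (rightI < size && heap[rightI] > heap[childI])
            childI = rightI;                         // pick the larger child
        if (heap[childI] <= parent) break;           // heap property restored
        heap[parentI] = heap[childI];                // move the child up
        parentI = childI;
    }
    heap[parentI] = parent;
}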
Binary heap sort 
Indexing case #2 
The heapsort version with unsafe indexing is faster
by 12–13% on a 4 KB array and by 7–10% on a 4 MB array.
With simple indexing, the lower bounds checks
are eliminated, but the upper ones mostly aren't.
Linear hash 
Indexing case #3 
def any_lhash_op(key[, ...]):
    i = hash(key) % table_size
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + 1) % table_size
First access is random, then sequential. 
Table size is a power of 2, therefore bitwise 
masking & (table_size - 1) is used 
instead of modulo.
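The same lookup written out in Java, as a minimal sketch: the Object[] key table, the null-means-empty convention, and the mix() spreader are assumptions, and the probe relies on the table never being completely full:

static int indexOf(Object[] keys, Object key) {
    int mask = keys.length - 1;               // keys.length is a power of 2
    int i = mix(key.hashCode()) & mask;       // & mask instead of % table_size
    while (true) {
        Object k = keys[i];
        if (k == null) return -1;             // empty slot: key is absent
        if (k.equals(key)) return i;          // full slot holding the key
        i = (i + 1) & mask;                   // next slot, wrapping around
    }
}

static int mix(int h) { return h ^ (h >>> 16); }  // placeholder hash spreader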
Quadratic hash 
Indexing case #3 
def any_qhash_op(key[, ...]):
    i = hash(key) % table_size
    step = 0
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        step += 1
        i = (i + step) % table_size
Random, then local, then non-local access.
A two-way modification of this algorithm is tested,
in which the table size isn't a power of 2: one integer
division per op.
Double hash 
Indexing case #3 
def any_dhash_op(key[, ...]):
    i = hash(key) % table_size
    step = hash2(key)
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + step) % table_size
Random access.
The table size isn't a power of 2: one or two
(on collisions) integer divisions per op.
Composite hash benchmark 
Indexing case #3 
load factor        0.3          0.6          0.9
L.  SB          1.9 ± 1.0    1.7 ± 1.0    2.1 ± 1.1
    HW          5.5 ± 1.3    4.9 ± 0.7    4.3 ± 0.9
    K10        10.3 ± 0.5    8.2 ± 0.2    1.6 ± 0.7
Q.  SB          0.2 ± 1.9    2.0 ± 1.8    0.9 ± 1.9
    HW          2.3 ± 2.3    2.7 ± 1.4    0.3 ± 1.5
    K10         1.6 ± 0.5    0.5 ± 0.2    5.6 ± 0.3
D.  SB         11.5 ± 2.5   15.2 ± 1.1   23.7 ± 1.3
    HW          9.9 ± 2.3   13.5 ± 1.2   26.2 ± 1.0
    K10         4.3 ± 0.2    9.4 ± 0.1   17.6 ± 0.4
Relative difference of unsafe indexing time vs. simple indexing,
in percent (L. = linear, Q. = quadratic, D. = double hashing; mean ± error).
Indexing: bottom line 
Unsafe indexing is worth considering in the hottest
methods. Try to avoid it where you can, but: measure, don't guess.
Not investigated:
- Performance of unsafe indexing on 32-bit VMs and CPUs; all results should be rechecked there.
- Interference of unsafe indexing with loop unrolling and vectorization.
Section 2 
Encoding of distinct 
entry states
Use-cases of entry states 
Full state + data, or empty state: 
- Open hash table implementations (taken/empty slots)
- Nullable non-object data in the subject domain
- Lists or queues with half-lazy in-place filtering
Collections of tuples of primitive/object and 
boolean (or binary state).
Object data 
Obvious: null in slots of empty state, domain 
objects in full slots. 
But what if domain objects are nullable 
themselves?
What if nullable Object data? 
Special empty object:
static final Object EMPTY_SLOT = new Object();
Domain nulls are stored as is.

Masking domain nulls:
static final Object NULL_MASK = new Object();
...
Object maskedData = data != null ? data : NULL_MASK;
null marks empty slots.
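A minimal sketch of the masking option (the helper names are assumptions):

static final Object NULL_MASK = new Object();

// Before storing into a full slot: hide a domain null behind the sentinel.
static Object mask(Object data) {
    return data != null ? data : NULL_MASK;
}

// After loading from a full slot: restore the domain null.
static Object unmask(Object stored) {
    return stored != NULL_MASK ? stored : null;
}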
What if nullable Object data? 
The rule: null should be the value that is stored in memory
and compared against more frequently than the special object.
Often the same choice is right for both goals.
Why store nulls 
Nullable Object data + states 
Don't forget the amortized costs of storing Objects
rather than nulls: at least one extra dereference
and check per location during garbage collection.
Also, a freshly allocated array is already filled with
nulls, so it needs no extra initialization pass.
Why compare to null 
Nullable Object data + states 
Explicit null checks are almost always free: they are
merged with the VM-generated ones (which throw NPE).
In the remaining cases a comparison to null is still
cheaper than a comparison to the special object, because
- null doesn't have to be read from anywhere in advance
- checks against zero are well supported on x86
And what if nullable Object data? 
In hash tables, the domain null (there is at most one!)
should be masked and empty slots should be filled with
nulls. But the implementation is harder than with a
special empty object.
Got it right: java.util.IdentityHashMap. 
Got it wrong: almost all other open hash 
implementations.
Primitive data 
No natural way to express nullability. There isn't even
a natural word for it :)
Arrays of boxed primitives
Separate byte state 
Primitive data + states 
boolean[] or byte[] and data arrays in parallel: 
if (used[i])
    doSomething(data[i]);
The easiest to implement.
Separate bit state 
Primitive data + states 
Hand-written bit set and data arrays in parallel: 
long word = bitWords[i >> 6];
if ((word & (1L << i)) != 0)
    doSomething(data[i]);
Advantages of separate bit state 
Primitive data + states 
Almost no additional memory is used.
Sequential state checks often don't require
memory reads (until the current word is exhausted).
Iteration can employ the very cheap
numberOfLeading(Trailing)Zeros intrinsics (see the sketch below).
Intel: Haswell+. AMD: leading zeros since K10, trailing zeros since Piledriver.
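A minimal iteration sketch, assuming the layout from the previous slide (bitWords[i >> 6] holds the state bit of data[i], and data has at least 64 * bitWords.length elements):

import java.util.function.LongConsumer;

static void forEachFull(long[] bitWords, long[] data, LongConsumer action) {
    for (int w = 0; w < bitWords.length; w++) {
        long word = bitWords[w];
        while (word != 0) {
            int bit = Long.numberOfTrailingZeros(word);  // cheap intrinsic
            action.accept(data[(w << 6) + bit]);         // index = w * 64 + bit
            word &= word - 1;                            // clear the lowest set bit
        }
    }
}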
Disadvantages of separate bit state 
Primitive data + states 
Only for a binary state.
On purely random access there is no advantage over byte
states except memory usage; you just perform extra
work to extract bits.
Relatively tricky to implement
(java.util.BitSet is not an option).
Special value as a state 
Primitive data + states 
long d = data[i];
if (d != EMPTY)
    doSomething(d);
Suitable only when there is a full state and one or 
several empty states.
Advantages of special values 
Primitive data + states 
Zero memory overhead.
All of an entry's data can reside in a single memory
location:
- fewer memory reads are required
- cache-friendly
- atomic updates are possible
Special value management 
Primitive data + states 
When the data domain is bounded, special values
are a clear winner for encoding states: just pick
a constant outside the data domain, preferably 0,
as the special value. (For example, if the data is known
to be strictly positive, 0 can mark an empty slot.)
Special value management 
Primitive data + states 
However, if the data domain is unbounded,
special values as states have a number of disadvantages:
- The special value has to be stored within the data structure and read on each query.
- Comparison to a non-constant is slower, especially compared to a comparison with zero.
- On a collision with real data, the special value has to be replaced, which is impossible without locking if the data structure has to be thread-safe, or if it is offline in any sense.
- The implementation becomes more complicated.
Zero value as a state 
Primitive data + states 
An attempt to resolve one of the problems of dynamic
special values: data is still compared to zero, and
when zero is passed in as data, it is masked
with another value:
if (data == zeroMask) changeZeroMask();
data = data == 0 ? zeroMask : data;
...
long d = data[i];
if (d != 0)
    doSomething(d);
But now the data has to be masked and unmasked all the
time, and the implementation gets even more complicated.
Byte along state 
Primitive data + states 
Like the separate byte state, but more memory-local.
On the other hand:
- Only unsafe access (see Section 1)
- Tedious to implement
- Memory IO that crosses cache-line boundaries, which
  1) carries a penalty on many CPUs, 2) is not atomic,
  so out-of-thin-air values can appear unless the data
  structure is synchronized or IO is performed only
  via CAS ops (Nitsan Wakart).
Benchmarking LHash queries, 
random queries 
Primitive data + states 
All the hash data fits in L1:
- Load factors 0.3–0.6: byte states typically win
- Load factor 0.9: bit states win, sometimes special values
Big hashes (don't fit the caches):
- Successful queries: special values win
- Unsuccessful queries, including insertions: bit states win
- Byte-along states outperform plain byte states
Zero states (with replacement) are never an option.
Benchmarking LHash queries, 
iteration 
Primitive data + states 
Internal iteration (forEach): special values win.
External iteration (iterators): byte states win.
But on Haswell and K10, of course, bit states beat
them all.
Byte-along states and zero states with replacement
always lose.
Encoding of distinct entry states:
bottom line 
Object[] arrays: more nulls. 
Primitive arrays: special values as states, when 
applicable. Bit states for iteration on Haswell+ and 
K10+.
Section 3 
Layout of tuples of primitives
Layout of tuples of primitives 
When random access is needed, we always strive 
for memory locality.
Two fields of the same length 
Layout of tuples of primitives 
byte+byte, char+short, long+double
(longBitsToDouble() is a no-op).
For tuples of up to 8 bytes, pack each tuple into a single
element of a wider primitive array, e.g. long[] for int+int
tuples (a packing sketch follows the list).
- Guarantees the tuple lies on a single cache line.
- Allows getting closer to the Java array size limit.
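A minimal packing sketch for the int+int case (helper names are assumptions):

// Each int+int tuple is stored as one element of a long[].
static long pack(int key, int value) {
    return ((long) key << 32) | (value & 0xFFFFFFFFL);
}

static int key(long entry)   { return (int) (entry >>> 32); }
static int value(long entry) { return (int) entry; }

// Usage: long[] table = new long[n]; table[i] = pack(k, v);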
One field is two times longer than 
another 
Layout of tuples of primitives 
byte+short, int+double, ...
If cross-cache-line-boundary IO is not an option,
use a layout that keeps each tuple within a single cache line.
Requires accessing the individual fields via Unsafe.
One field is 4-8 times longer than 
another 
Layout of tuples of primitives 
If cross cache line boundary IO is not an option, 
the only reasonable approach is: 
k1, long, 8 bytes
k2, long, 8 bytes
k3, long, 8 bytes
v1, short, 2 bytes
v2, short, 2 bytes
v3, short, 2 bytes
2-byte gap
k4, long, 8 bytes
...
One field is 4-8 times longer than 
another 
Layout of tuples of primitives 
Fields of the same tuple will still end up on different
cache lines with some probability.
Indexing: 
long kOff = (i / 3) * 32L + (i % 3) * 8;
long vOff = (i / 3) * 32L + 24 + (i % 3) * 2;
Integer division :(
Integer division by a small constant
— Maybe this will help? (see Hacker’s Delight) 
long quot = (i * 0x55555556L) >> 32;
long rem = i - quot * 3;
long kOff = quot * 32 + rem * 8;
long vOff = quot * 32 + 24 + rem * 2;
— No, it won't, because we need to obtain
the remainder as well as the quotient.
The End
