Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019

Enabling Low-level Intrinsics in Burst

Who this talk is for
— You’re interested in CPU performance
— You’re considering porting engine systems to HPC#
— You just go to technical talks because it’s cool

Talk Contents
— Quick introduction to SIMD topics
— Options for SIMD programming in HPC# today
— The case for intrinsics and typeless SIMD
— Case studies for intrinsics
— Q & A

What is SIMD?
— Single Instruction, Multiple Data
– Doing more than one thing at a time
— Available on essentially all hardware today in some form
– Capabilities vary, but a few families exist
— ARM Neon
— x86/64 SSE and AVX

SIMD Analogy: Chopping Veggies
[Figure: chopping analogy. Labels: Input Data, Instruction, Output Data, “Preprocessed!”]

Why is SIMD important?
— It’s more efficient to do more with fewer instructions
— There is dedicated hardware for this stuff
— Often the only way you can get the max cache bandwidth

Dedicated Hardware: Skylake Example

Cache Bandwidth
— L1 Caches can deliver N bits every cycle
– N typically much larger than 64
– 128 or 256 bits per cycle common in most CPUs today
— Without using SIMD instructions, only get a fraction of this
– Important part of not leaving performance on the table

Cache Bandwidth
— This matters when you are processing in cache
– Which is what we hope to do most of the time
— Processing floats (4 bytes, 32 bits), 128-bit cache b/w
– Each load wastes 75% of bandwidth from the cache
— Better: Process 4 floats at a time
– Full cache utilization

The vector fallacy
— SIMD and mathematical vectors are mostly unrelated
— Lots of confusion around this issue
— False: Using a vector math library automatically makes your code SIMD
— True: Working with arrays of data can lead to opportunities for SIMD (but not always)
— It’s especially problematic with 3-component vectors, as we’ll see

Quick smell test for float SIMD-ness (x86)
— xxxps instructions feeding into each other?
– It’s probably SIMD code
— xxxss instructions?
– Scalar code
— Occasional xxxps instructions with infrastructure?
– Mix of SIMD and scalar, red flag!

Example: 3D dot products
public static float DotExample(float3 a, float3 b)
{
    return math.dot(a, b);
}
mulps xmm0, xmmword ptr [rdx]   ; 4-wide mul, only 3 lanes valid
movshdup xmm1, xmm0             ; shuffle overhead
addss xmm1, xmm0                ; scalar addition
movhlps xmm0, xmm0              ; shuffle overhead
addss xmm0, xmm1                ; scalar addition
1 SIMD op, 2 infrastructure, 2 scalar ops => 1 dot product

Example: 3D dot products
— But wait a minute, what is a dot product?
— For 3D:
– a.x * b.x + a.y * b.y + a.z * b.z
— What if we go back to basics and base our code on this?

Example: 3D dot products (back to basics)
public static float DotExample1(
    float ax, float ay, float az,
    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}
mulss xmm0, xmm3                    ; mul
mulss xmm1, dword ptr [rsp + 40]    ; mul
addss xmm0, xmm1                    ; add
mulss xmm2, dword ptr [rsp + 48]    ; mul
addss xmm0, xmm2                    ; add
0 SIMD ops, 0 infrastructure, 5 scalar ops => 1 dot product

Example: 3D dot products (SIMD)
public static float4 DotExample4(
    float4 ax, float4 ay, float4 az,
    float4 bx, float4 by, float4 bz)
{
    return ax * bx + ay * by + az * bz;
}
mulps xmm2, xmmword ptr [r9]     ; 4-wide mul
mulps xmm1, xmmword ptr [rcx]    ; 4-wide mul
addps xmm1, xmm2                 ; 4-wide add
mulps xmm0, xmmword ptr [rax]    ; 4-wide mul
addps xmm0, xmm1                 ; 4-wide add
5 SIMD ops, 0 infrastructure, 0 scalar ops => 4 dot products

SIMD mindset
— Important not to think in terms of abstractions
— Don’t think about float4 as “a 4-D vector”
— Better: 4 floats is the width of the vector unit on this CPU
— Get used to the idea of 128 or 256 bit blocks of data
– Divide into whatever size is convenient
— What fits in 128 bits?
– 16 bytes
– 8 shorts
– 4 floats or ints
– 2 doubles or longs

SIMD mindset, contd.
— Try to find opportunities to compute independent values
– Like the 4 independent dot products we just saw
— Fight the urge to think of vectors as horizontal values
– Horizontal operations often go against the grain of SIMD instructions
— Typically scalar code without abstractions vectorizes well
– float3, float2 etc can be convenient but often get in the way

So how do you get SIMD code?
— In HPC# we’ve had two options so far:
– LLVM auto-vectorization
– Unity.Mathematics explicit SIMD

LLVM’s auto vectorizer
— Simple mode: Write scalar code, get SIMD code out
— For simple loops, LLVM is often able to generate SIMD code
— Checklist to look at before expecting SIMD:
– Data ranges must not alias
– Data must be contiguous in memory (for wide loads)
– Data types must be integer or float with fast-math
– Branches are kept to a minimum
– There is no cross-element interference
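The last checklist item is easiest to see side by side. The sketch below is plain C# with managed arrays standing in for NativeArray, and the class and method names are hypothetical. The first loop computes each output from the input at the same index only. The second carries a value from one iteration to the next, which is the kind of cross-element interference the checklist warns about. Whether a particular loop actually vectorizes still has to be confirmed in the Burst inspector.

```csharp
public static class LoopShapes
{
    // Independent elements: output[i] depends only on input[i].
    // Assumes output is at least as long as input and does not overlap it.
    public static void Scale(float[] input, float[] output, float factor)
    {
        for (int i = 0; i < input.Length; ++i)
        {
            output[i] = input[i] * factor;
        }
    }

    // Loop-carried dependency: output[i] needs the sum produced by
    // iteration i - 1, so the iterations cannot simply run side by side.
    public static void RunningSum(int[] input, int[] output)
    {
        int sum = 0;
        for (int i = 0; i < input.Length; ++i)
        {
            sum += input[i];
            output[i] = sum;
        }
    }
}
```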

LLVM’s auto vectorizer
— Pros
– Simpler code to read/write (at face value)
– Often gives you a speedup where you didn’t expect one
— Cons
– Need to learn a bunch of rules to get SIMD code from loops
– No way to tell when you’ve stopped getting SIMD
– (We’re looking at ways to make this a compile error if desired)
– Hard to reinterpret data types
– Often surprising what will not vectorize

Example of successful vectorization
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i];
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_7:
vpmaxsd ymm1, ymm0, ymmword ptr [r10 + 4*rdx]
vpmaxsd ymm2, ymm0, ymmword ptr [r10 + 4*rdx + 32]
vmovdqu ymmword ptr [rcx + 4*rdx], ymm1
vmovdqu ymmword ptr [rcx + 4*rdx + 32], ymm2
add rdx, 32
cmp rax, rdx
jne .LBB0_7

Example of unsuccessful vectorization
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i] * 2;
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_2:
mov edx, dword ptr [r10 + 4*rax]
lea ecx, [rdx + rdx]
test edx, edx
cmovs ecx, r8d
mov dword ptr [r11 + 4*rax], ecx
inc rax
cmp r9, rax
jne .LBB0_2

Explicit SIMD with Unity.Mathematics
— Use e.g. float4, int4 vertically (as in dot product example)
— Maps directly to LLVM vector types, you will get vector code
— Checklist:
– Avoid branches, use select/mask idioms
– Use native arrays, with ReinterpretLoad/Store as needed
– Handle end-of-array cases manually
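The three checklist items can be sketched together in one small routine. This is an illustration only, assuming the Unity.Collections and Unity.Mathematics packages are available; the class and method names are hypothetical, and the output array is assumed to be at least as long as the input. It replaces a branch with a compare plus math.select, moves four floats at a time with ReinterpretLoad and ReinterpretStore, and finishes the leftover elements with a scalar loop.

```csharp
using Unity.Collections;
using Unity.Mathematics;

public static class ClampDemo
{
    // Writes max(x, 0) for every element of input into output.
    public static void ClampNegativesToZero(NativeArray<float> input, NativeArray<float> output)
    {
        int count = input.Length;
        int vectorEnd = count & ~3; // largest multiple of 4 not exceeding count

        for (int i = 0; i < vectorEnd; i += 4)
        {
            // Treat four consecutive floats as one float4.
            float4 v = input.ReinterpretLoad<float4>(i);

            // Mask idiom: compare once, then pick per lane without branching.
            bool4 negative = v < 0f;
            float4 clamped = math.select(v, float4.zero, negative);

            output.ReinterpretStore(i, clamped);
        }

        // End-of-array case: the last count % 4 elements are handled one by one.
        for (int i = vectorEnd; i < count; ++i)
        {
            output[i] = input[i] < 0f ? 0f : input[i];
        }
    }
}
```

Padding the arrays to a multiple of four is an alternative to the scalar tail loop when the data layout is under your control.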

Explicit Unity.Mathematics SIMD Example
static public IntersectResult Intersect2(NativeArray<PlanePacket4> cullingPlanePackets, AABB a)
{
    // …
    int4 outCounts = 0;
    int4 inCounts = 0;
    for (int i = 0; i < cullingPlanePackets.Length; i++) {
        var p = cullingPlanePackets[i];
        float4 distances = dot4(p.Xs, p.Ys, p.Zs, mx, my, mz) + p.Distances;
        float4 radii = dot4(ex, ey, ez, math.abs(p.Xs), math.abs(p.Ys), math.abs(p.Zs));
        outCounts += (int4) (distances + radii <= 0);
        inCounts += (int4) (distances > radii);
    }
    int inCount = math.csum(inCounts);
    int outCount = math.csum(outCounts);
    if (outCount != 0)
        return IntersectResult.Out;
    else
        return (inCount == 4 * cullingPlanePackets.Length) ? IntersectResult.In : IntersectResult.Partial;
}
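The slide does not show the dot4 helper. A plausible definition, assuming it follows the same vertical pattern as DotExample4 earlier, is sketched below. The containing class name is hypothetical and the body is a guess at the helper, not the talk's actual code. It assumes the Unity.Mathematics package.

```csharp
using Unity.Mathematics;

public static class PlaneMath
{
    // Four independent 3D dot products, one per lane.
    // Lane k of the result is xs[k]*mx[k] + ys[k]*my[k] + zs[k]*mz[k].
    public static float4 dot4(
        float4 xs, float4 ys, float4 zs,
        float4 mx, float4 my, float4 mz)
    {
        return xs * mx + ys * my + zs * mz;
    }
}
```

Under that reading, each PlanePacket4 holds four planes in structure-of-arrays form, so one loop iteration tests the box against four planes at once.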

The need for typeless SIMD
— In the engine space it’s frequently useful to reinterpret data
— Want control over instruction selection for particular HW
— Want to leverage tricks that compilers don’t use

Data reinterpretation
— Work with float bits using integer operations
— Example: Converting small integers to floats
ushort x = ...;
uint y = x | 0x4b000000u;   // 0x4b000000 is the bit pattern of 8388608.0f (2^23)
float f = math.asfloat(y) - 8388608.0f;
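To see why the trick works outside of Burst, here is a minimal plain C# sketch. It assumes a runtime that provides BitConverter.Int32BitsToSingle (.NET Core 2.0 or later), and the class and method names are hypothetical.

```csharp
using System;

public static class SmallIntToFloat
{
    // 0x4b000000 is the bit pattern of 8388608.0f (2^23). At that magnitude
    // one mantissa step equals exactly 1.0, so OR-ing a 16-bit integer into
    // the low mantissa bits produces the float 2^23 + x with no rounding.
    public static float Convert(ushort x)
    {
        int bits = x | 0x4b000000;
        return BitConverter.Int32BitsToSingle(bits) - 8388608.0f;
    }
}
```

Subtracting 2^23 leaves x as a float. The same idea extends to any unsigned integer below 2^23, since that is how many mantissa bits are available.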

Instruction selection
— Often useful to base core engine loops around specific h/w
— Example: x86 pmulhrsw
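For readers who have not met pmulhrsw: it multiplies packed signed 16-bit integers, rounds, and keeps a scaled 16-bit result, which makes it a natural fit for Q15 fixed-point math. The scalar sketch below models a single lane of that behaviour in plain C#. The class and method names are hypothetical and this is not a Burst API.

```csharp
public static class FixedPoint
{
    // One 16-bit lane of pmulhrsw: widen to 32 bits, multiply, shift right
    // by 14, round by adding 1, then drop the lowest bit.
    public static short MulHighRoundScale(short a, short b)
    {
        int product = a * b;                      // operands are promoted to int
        int rounded = (product >> 14) + 1;
        return unchecked((short)(rounded >> 1));  // keep the low 16 bits
    }
}
```

With Q15 inputs, 16384 represents 0.5, so MulHighRoundScale(16384, 16384) yields 8192, which represents 0.25. The hardware instruction does this for 8 lanes of a 128-bit register at once.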

Leveraging data tricks
— Many tricks are not in the repertoire of most compilers
— Example: Quickly generating mask from sign of float data
float x = ...;
uint mask = (uint)(math.asint(x) >> 31);   // arithmetic shift: all ones if the sign bit is set, else 0
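A plain C# sketch of the same trick, plus one way such a mask might be used, follows. It assumes a runtime with BitConverter.SingleToInt32Bits and Int32BitsToSingle (.NET Core 2.0 or later). The class and method names are hypothetical.

```csharp
using System;

public static class SignMask
{
    // All ones (0xFFFFFFFF) when the sign bit of x is set, otherwise zero.
    // The shift is arithmetic because the operand is a signed int.
    public static uint FromSign(float x)
    {
        return unchecked((uint)(BitConverter.SingleToInt32Bits(x) >> 31));
    }

    // Branch-free use of the mask: clear every bit of inputs whose sign bit
    // is set, which turns them into +0.0f, and pass other inputs through.
    public static float ClampNegativeToZero(float x)
    {
        uint bits = unchecked((uint)BitConverter.SingleToInt32Bits(x));
        uint kept = bits & ~FromSign(x);
        return BitConverter.Int32BitsToSingle(unchecked((int)kept));
    }
}
```

The SIMD version of this idea applies the same shift and bitwise operations to every lane of a register at once.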

What we’re working on
— Typeless SIMD library of intrinsics
— Start with x86, with ARM to come
— Good C# integration with debugging considerations

Typeless?
— Types are mostly an annoyance for real world SIMD
— Often need to reinterpret float/int
— Often need to deal with masks, which are unclearly typed
— Canonical example: comparisons
– _mm_cmpeq_ps – returns a mask of all ones when equal
– So… is that a float? Or an int?

Do what the hardware does
— The hardware just has registers, not types (obviously)
— That’s what we expose in our intrinsics API
— m128 – 128 bit SIMD register
— m256 – 256 bit SIMD register
— Instructions determine how the register contents are interpreted
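The idea that the instruction, not the register, decides the interpretation can be modelled in plain C# with an explicit-layout struct. Reg128 and Reg128Ops below are hypothetical stand-ins written for this illustration; they are not the Burst m128 type or its API.

```csharp
using System.Runtime.InteropServices;

// Hypothetical typeless 128-bit value: the same 16 bytes are visible as
// four floats or four unsigned integers, depending on which field is read.
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct Reg128
{
    [FieldOffset(0)]  public float Float0;
    [FieldOffset(4)]  public float Float1;
    [FieldOffset(8)]  public float Float2;
    [FieldOffset(12)] public float Float3;

    [FieldOffset(0)]  public uint UInt0;
    [FieldOffset(4)]  public uint UInt1;
    [FieldOffset(8)]  public uint UInt2;
    [FieldOffset(12)] public uint UInt3;
}

public static class Reg128Ops
{
    // Reads the inputs as floats, writes an all-ones or all-zero mask per lane.
    public static Reg128 CmpEqPs(Reg128 a, Reg128 b)
    {
        Reg128 dst = default(Reg128);
        dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0u;
        dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0u;
        dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0u;
        dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0u;
        return dst;
    }

    // Pure bitwise AND; it does not care whether the bits came from floats.
    public static Reg128 AndPs(Reg128 a, Reg128 b)
    {
        Reg128 dst = default(Reg128);
        dst.UInt0 = a.UInt0 & b.UInt0;
        dst.UInt1 = a.UInt1 & b.UInt1;
        dst.UInt2 = a.UInt2 & b.UInt2;
        dst.UInt3 = a.UInt3 & b.UInt3;
        return dst;
    }
}
```

Feeding the mask from CmpEqPs into AndPs keeps the float lanes that compared equal and zeroes the rest, without any conversion between a float type and an integer type.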

API Usage Example
using static Burst.Compiler.IL.x86;
// …
m128 a, b = …;
m128 mask = cmpeq_ps(a, b);
// …

API Extract
// _mm_cmpeq_ps
/// <summary> Compare packed single-precision (32-bit)
/// floating-point elements in "a" and "b" for equality,
/// and store the results in "dst". </summary>
[X86InstructionFamily(InstructionFamily.SSE)]
[DebuggerStepThrough]
public static m128 cmpeq_ps(m128 a, m128 b)
{
    m128 dst = default(m128);
    dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0;
    // …
    return dst;
}
C# Reference Implementation

A more complete example
For each door:
open = 0
For each player position:
if player in range and correct team:
open = 1
store open state for door

A more complete example
— Basic N vs M test
— N doors, M players
public struct Door
{
    public float3 Pos;
    public float RadiusSquared;
    public int Team;
}
public struct DoorTestPos
{
    public float3 Pos;
    public int Team;
}

Reference version
[BurstCompile]
public struct DoorTest_Reference : IJob
{
    public NativeArray<Door> Doors;
    public NativeArray<DoorTestPos> TestPos;
    public NativeArray<int> DoorOpenStates;
    public void Execute() {
        for (int j = 0; j < Doors.Length; ++j) {
            bool shouldOpen = false;
            for (int i = 0; i < TestPos.Length; ++i) {
                float3 delta = TestPos[i].Pos - Doors[j].Pos;
                float dsq = math.csum(delta * delta);
                if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team) {
                    shouldOpen = true;
                    break;
                }
            }
            DoorOpenStates[j] = shouldOpen ? 1 : 0;
        }
    }
}

Reference disassembly
.LBB0_6:
vmovsd xmm2, qword ptr [rsi - 12]
vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
vsubps xmm2, xmm2, xmm0
vmulps xmm2, xmm2, xmm2
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vaddss xmm3, xmm3, xmm4
vucomiss xmm2, xmm1
jae .LBB0_10 ; not inside radius?
mov ebx, dword ptr [rdx]
cmp ebx, dword ptr [rsi]
je .LBB0_8 ; break out of loop
.LBB0_10:
inc rdi
add rsi, 16
cmp rdi, rax
jl .LBB0_6

Let’s lose the branches
public void Execute() {
    for (int j = 0; j < Doors.Length; ++j) {
        bool shouldOpen = false;
        for (int i = 0; i < TestPos.Length; ++i) {
            float3 delta = TestPos[i].Pos - Doors[j].Pos;
            float dsq = math.csum(delta * delta);
            bool inRadius = dsq < Doors[j].RadiusSquared;
            bool teamMatches = Doors[j].Team == TestPos[i].Team;
            // Accumulate the result without branching; there is no early exit.
            shouldOpen |= inRadius & teamMatches;
        }
        DoorOpenStates[j] = shouldOpen ? 1 : 0;
    }
}

Branch-free disassembly
.LBB0_4:
vmovsd xmm2, qword ptr [rdi - 12]
vinsertps xmm2, xmm2, dword ptr [rdi - 4], 32
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vucomiss xmm2, xmm1
setb al
cmp ebp, dword ptr [rdi]
sete dl
and dl, al
movzx eax, dl
or esi, eax
add rdi, 16
dec rbx
jne .LBB0_4

Explicit SIMD with Unity Mathematics
public struct DoorGroup
{
    public float4 Xs;
    public float4 Ys;
    public float4 Zs;
    public float4 RadiiSquared;
    public int4 Teams;
}
public NativeArray<DoorGroup> Doors;
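Getting from one struct per door to groups of four means transposing the data and deciding what to do when the door count is not a multiple of four. The sketch below shows one way to do that in plain C#, using flat arrays instead of float4 so it runs without Unity. All type and method names are hypothetical. Lanes 4g to 4g + 3 correspond to group g. Padding lanes get a negative squared radius, so a squared distance can never pass the range test, and a team of -1, on the assumption that real team ids are non-negative.

```csharp
public struct ScalarDoor
{
    public float X, Y, Z;
    public float RadiusSquared;
    public int Team;
}

// Structure-of-arrays storage; index 4 * g + k is lane k of group g.
public sealed class PackedDoors
{
    public float[] Xs, Ys, Zs, RadiiSquared;
    public int[] Teams;
    public int GroupCount;
}

public static class DoorPacking
{
    public static PackedDoors Pack(ScalarDoor[] doors)
    {
        int groupCount = (doors.Length + 3) / 4; // round up to whole groups
        int laneCount = groupCount * 4;
        var packed = new PackedDoors
        {
            Xs = new float[laneCount],
            Ys = new float[laneCount],
            Zs = new float[laneCount],
            RadiiSquared = new float[laneCount],
            Teams = new int[laneCount],
            GroupCount = groupCount
        };

        for (int i = 0; i < laneCount; ++i)
        {
            if (i < doors.Length)
            {
                packed.Xs[i] = doors[i].X;
                packed.Ys[i] = doors[i].Y;
                packed.Zs[i] = doors[i].Z;
                packed.RadiiSquared[i] = doors[i].RadiusSquared;
                packed.Teams[i] = doors[i].Team;
            }
            else
            {
                // Padding lane: position stays at zero, but it can never open.
                packed.RadiiSquared[i] = -1f;
                packed.Teams[i] = -1;
            }
        }
        return packed;
    }
}
```

Doing this conversion once, when doors are created or changed, keeps the per-frame test loop free of any gather or shuffle work.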

Explicit SIMD with Unity Mathematics
bool4 openMask = false;
float4 xdeltas = TestPos[i].X - Doors[j].Xs;
float4 ydeltas = TestPos[i].Y - Doors[j].Ys;
float4 zdeltas = TestPos[i].Z - Doors[j].Zs;
float4 xdsq = xdeltas * xdeltas;
float4 ydsq = ydeltas * ydeltas;
float4 zdsq = zdeltas * zdeltas;
float4 dsq = xdsq + ydsq + zdsq;
bool4 rangeMask = dsq < Doors[j].RadiiSquared;
bool4 teamMask = TestPos[i].Team == Doors[j].Teams;
openMask |= teamMask & rangeMask;
}
DoorOpenStates[j] = math.select(new int4(0), new int4(1), openMask);
}

Explicit Math version disassembly
.LBB0_2:
vbroadcastss xmm0, dword ptr [rdx - 12]
vaddps xmm0, xmm0, xmm3
vcmpltps xmm0, xmm0, xmm7
vpcmpeqd xmm2, xmm1, xmmword ptr [rdx]
vpand xmm0, xmm2, xmm0
vpsrld xmm0, xmm0, 31
vpor xmm6, xmm6, xmm0
add rdx, 28
dec rsi
jne .LBB0_2

Explicit SIMD with Burst Intrinsics
public struct Door4
{
    public m128 Xs;
    public m128 Ys;
    public m128 Zs;
    public m128 RadiiSquared;
    public m128 Teams;
}

Explicit SIMD with Burst Intrinsics
m128 openMask = new m128(0u); // start with no lanes set; or_ps below accumulates matches
m128 tx = new m128(TestPos[i].X);
m128 ty = new m128(TestPos[i].Y);
m128 tz = new m128(TestPos[i].Z);
m128 tt = new m128(TestPos[i].Team);
m128 xdeltas = sub_ps(Doors[j].Xs, tx);
m128 ydeltas = sub_ps(Doors[j].Ys, ty);
m128 zdeltas = sub_ps(Doors[j].Zs, tz);
m128 xdsq = mul_ps(xdeltas, xdeltas);
m128 ydsq = mul_ps(ydeltas, ydeltas);
m128 zdsq = mul_ps(zdeltas, zdeltas);
m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));
m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);
rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
openMask = or_ps(openMask, rangeMask);
}
DoorOpenStates.ReinterpretStore(j * 4, openMask);
}

Explicit SIMD Disassembly
.LBB1_3:
vbroadcastss xmm4, dword ptr [rax - 12]
vpbroadcastd xmm7, dword ptr [rax]
vpcmpeqd xmm7, xmm3, xmm7
vcmpleps xmm4, xmm4, xmm2
vpand xmm4, xmm7, xmm4
vpor xmm0, xmm4, xmm0
inc rsi
add rax, 16
cmp rsi, rdx
jl .LBB1_3

Guidelines for SIMD with Burst
— Become familiar with the Burst inspector
— Eliminate branches (typically a good idea)
— Prefer wider batches of input data
— Use Unity.Mathematics vertically (as in this example)
— SIMD intrinsics give you the fewest surprises but require the most effort

What about System.Numerics?
— We might consider supporting this API at a later stage
— We want complete control and easy porting of C++ intrinsics code to HPC#
— Similar to the approach we took with HLSL code for Math

Summary
— Intrinsics are coming
— Be careful with abstractions
— Adopt a SIMD mindset with Unity.Mathematics today
— Independent values are your friends
— Get familiar with the Burst inspector
— Go forth and compute more things quickly!

Thank you!
— Q & A
— Forum feedback welcome
— Twitter: @deplinenoise
