Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Enabling Low-level Intrinsics in Burst
Who this talk is for
3
— You’re interested in CPU performance
— You’re considering porting engine systems to HPC#
— You just go to technical talks because it’s cool
Talk Contents
4
— Quick introduction to SIMD topics
— Options for SIMD programming in HPC# today
— The case for intrinsics and typeless SIMD
— Case studies for intrinsics
— Q & A
SIMD: Some background
5
What is SIMD?
6
— Single Instruction, Multiple Data
– Doing more than one thing at a time
— Available on essentially all hardware today in some form
– Capabilities vary, but a few families exist
— ARM Neon
— x86/64 SSE and AVX
SIMD Analogy: Chopping Veggies
7
[Figure: vegetable-chopping analogy. Labels: Input Data, Instruction, Output Data, Preprocessed!]
Why is SIMD important?
8
— It’s more efficient to do more with fewer instructions
— There is dedicated hardware for this stuff
— Often the only way you can get the max cache bandwidth
Dedicated Hardware: Skylake Example
9
Cache Bandwidth
10
— L1 Caches can deliver N bits every cycle
– N typically much larger than 64
– 128 or 256 bits per cycle common in most CPUs today
— Without using SIMD instructions, only get a fraction of this
– Important part of not leaving performance on the table
Cache Bandwidth
11
— This matters when you are processing in cache
– Which is what we hope to do most of the time
— Processing floats (4 bytes, 32 bits) one at a time, with 128-bit cache bandwidth
– Each load wastes 75% of bandwidth from the cache
— Better: Process 4 floats at a time
– Full cache utilization
The vector fallacy
12
— SIMD and mathematical vectors are mostly unrelated
— Lots of confusion around this issue
— False: Using a vector math library is somehow SIMD
— True: Working with arrays of data can lead to opportunities for SIMD (but not always)
— It’s especially problematic with 3-component vectors, as we’ll see
Quick smell test for float SIMD-ness (x86)
13
— xxxps instructions feeding into each other?
– It’s probably SIMD code
— xxxss instructions?
– Scalar code
— Occasional xxxps instructions with infrastructure?
– Mix of SIMD and scalar, red flag!
Example: 3D dot products
14
public static float DotExample(float3 a, float3 b)
{
    return math.dot(a, b);
}
mulps    xmm0, xmmword ptr [rdx]   ; 4-wide mul, only 3 lanes valid
movshdup xmm1, xmm0                ; shuffle overhead
addss    xmm1, xmm0                ; scalar addition
movhlps  xmm0, xmm0                ; shuffle overhead
addss    xmm0, xmm1                ; scalar addition
1 SIMD op, 2 infrastructure, 2 scalar ops => 1 dot product
Example: 3D dot products
15
— But wait a minute, what is a dot product?
— For 3D:
– a.x * b.x + a.y * b.y + a.z * b.z
— What if we go back to basics and base our code on this?
Example: 3D dot products (back to basics)
16
public static float DotExample1(
    float ax, float ay, float az,
    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}
mulss xmm0, xmm3                   ; mul
mulss xmm1, dword ptr [rsp + 40]   ; mul
addss xmm0, xmm1                   ; add
mulss xmm2, dword ptr [rsp + 48]   ; mul
addss xmm0, xmm2                   ; add
0 SIMD ops, 0 infrastructure, 5 scalar ops => 1 dot product
Example: 3D dot products (SIMD)
17
public static float4 DotExample4(
    float4 ax, float4 ay, float4 az,
    float4 bx, float4 by, float4 bz)
{
    return ax * bx + ay * by + az * bz;
}
mulps xmm2, xmmword ptr [r9]       ; 4-wide mul
mulps xmm1, xmmword ptr [rcx]      ; 4-wide mul
addps xmm1, xmm2                   ; 4-wide add
mulps xmm0, xmmword ptr [rax]      ; 4-wide mul
addps xmm0, xmm1                   ; 4-wide add
5 SIMD ops, 0 infrastructure, 0 scalar ops => 4 dot products
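The same idea can be tried outside Unity. The sketch below is an illustration only: it assumes System.Numerics.Vector4 as a stand-in for float4 (it is not Unity.Mathematics), and the names DotSketch and Dot4 are made up for this example. Each Vector4 holds one component of four different vectors, so every lane computes its own dot product.

```csharp
using System;
using System.Numerics;

public static class DotSketch
{
    // Each argument holds one component (x, y or z) of four different vectors,
    // so lane k of the result is the dot product of a_k and b_k.
    public static Vector4 Dot4(Vector4 ax, Vector4 ay, Vector4 az,
                               Vector4 bx, Vector4 by, Vector4 bz)
    {
        return ax * bx + ay * by + az * bz;
    }

    public static void Main()
    {
        // Four vectors a0..a3 stored component-wise:
        // a0 = (1,0,0), a1 = (0,1,0), a2 = (0,0,1), a3 = (1,2,3).
        var ax = new Vector4(1, 0, 0, 1);
        var ay = new Vector4(0, 1, 0, 2);
        var az = new Vector4(0, 0, 1, 3);

        // Four vectors b0..b3, all equal to (4,5,6).
        var bx = new Vector4(4);
        var by = new Vector4(5);
        var bz = new Vector4(6);

        // The lanes hold 4, 5, 6 and 32.
        Console.WriteLine(Dot4(ax, ay, az, bx, by, bz));
    }
}
```

No shuffles or horizontal adds are needed because the data is already laid out per component.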
SIMD mindset
18
— Important not to think in terms of abstractions
— Don’t think about float4 as “a 4-D vector”
— Better: 4 floats is the width of the vector unit on this CPU
— Get used to the idea of 128 or 256 bit blocks of data
– Divide into whatever size is convenient
— What fits in 128 bits?
– 16 bytes
– 8 shorts
– 4 floats or ints
– 2 doubles or longs
SIMD mindset, contd.
19
— Try to find opportunities to compute independent values
– Like the 4 independent dot products we just saw
— Fight the urge to think of vectors as horizontal values
– Horizontal operations often go against the grain of SIMD instructions
— Typically scalar code without abstractions vectorizes well
– float3, float2 etc. can be convenient but often get in the way
So how do you get SIMD code?
20
— In HPC# we’ve had two options so far:
– LLVM auto-vectorization
– Unity.Mathematics explicit SIMD
LLVM’s auto vectorizer
21
— Simple mode: Write scalar code, get SIMD code out
— For simple loops, LLVM is often able to generate SIMD code
— Checklist to look at before expecting SIMD:
– Data ranges must not alias
– Data must be contiguous in memory (for wide loads)
– Data types must be integer or float with fast-math
– Branches are kept to a minimum
– There is no cross-element interference
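To make the last checklist item concrete, here is a small sketch in plain C# arrays (the names LoopShapes, ClampNegatives and PrefixSum are invented for this illustration). The first loop computes each output from its own input only; the second carries a running value from one iteration to the next, which is the kind of cross-element interference the checklist warns about. Whether a particular compiler vectorizes either loop is not guaranteed, so the generated code still needs to be inspected.

```csharp
using System;

public static class LoopShapes
{
    // Independent elements: output[i] depends only on input[i].
    public static void ClampNegatives(int[] input, int[] output)
    {
        for (int i = 0; i < input.Length; ++i)
            output[i] = Math.Max(input[i], 0);
    }

    // Cross-element interference: each result depends on the previous one.
    public static void PrefixSum(int[] input, int[] output)
    {
        int running = 0;
        for (int i = 0; i < input.Length; ++i)
        {
            running += input[i];
            output[i] = running;
        }
    }

    public static void Main()
    {
        int[] input = { 3, -1, 4, -1, 5 };
        int[] clamped = new int[input.Length];
        int[] sums = new int[input.Length];

        ClampNegatives(input, clamped);
        PrefixSum(input, sums);

        Console.WriteLine(string.Join(", ", clamped)); // 3, 0, 4, 0, 5
        Console.WriteLine(string.Join(", ", sums));    // 3, 2, 6, 5, 10
    }
}
```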
LLVM’s auto vectorizer
22
— Pros
– Simpler code to read/write (at face value)
– Often gives you a speedup where you didn’t expect one
— Cons
– Need to learn a bunch of rules to get SIMD code from loops
– No way to tell when you’ve stopped getting SIMD
– (We’re looking at ways to make this a compile error if desired)
– Hard to reinterpret data types
– Often surprising what will not vectorize
Example of successful vectorization
23
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i];
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_7:
vpmaxsd ymm1, ymm0, ymmword ptr [r10 + 4*rdx]
vpmaxsd ymm2, ymm0, ymmword ptr [r10 + 4*rdx + 32]
vpmaxsd ymm3, ymm0, ymmword ptr [r10 + 4*rdx + 64]
vpmaxsd ymm4, ymm0, ymmword ptr [r10 + 4*rdx + 96]
vmovdqu ymmword ptr [rcx + 4*rdx], ymm1
vmovdqu ymmword ptr [rcx + 4*rdx + 32], ymm2
vmovdqu ymmword ptr [rcx + 4*rdx + 64], ymm3
vmovdqu ymmword ptr [rcx + 4*rdx + 96], ymm4
add rdx, 32
cmp rax, rdx
jne .LBB0_7
Example of unsuccessful vectorization
24
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i] * 2;
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_2:
mov edx, dword ptr [r10 + 4*rax]
lea ecx, [rdx + rdx]
test edx, edx
cmovs ecx, r8d
mov dword ptr [r11 + 4*rax], ecx
inc rax
cmp r9, rax
jne .LBB0_2
Explicit SIMD with Unity.Mathematics
25
— Use e.g. float4, int4 vertically (as in dot product example)
— Maps directly to LLVM vector types, you will get vector code
— Checklist:
– Avoid branches, use select/mask idioms
– Use native arrays, with ReinterpretLoad/Store as needed
– Handle end-of-array cases manually
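As a rough illustration of the checklist, the loop from the unsuccessful vectorization example can be written with an explicit mask and select plus a manual tail. This sketch assumes System.Numerics.Vector<int> as a stand-in so it runs outside Unity; it is not the Unity.Mathematics or NativeArray API, and the names SelectSketch and DoubleOrZero are hypothetical.

```csharp
using System;
using System.Numerics;

public static class SelectSketch
{
    // Computes output[i] = input[i] >= 0 ? input[i] * 2 : 0.
    public static void DoubleOrZero(int[] input, int[] output)
    {
        int width = Vector<int>.Count; // lanes per vector on this machine
        int i = 0;

        // Main loop: whole vectors, no per-element branch.
        for (; i + width <= input.Length; i += width)
        {
            var v = new Vector<int>(input, i);
            // Mask lanes: all ones where v >= 0, all zeros elsewhere.
            Vector<int> mask = Vector.GreaterThanOrEqual(v, Vector<int>.Zero);
            // Select v + v where the mask is set, 0 elsewhere.
            Vector<int> result = Vector.ConditionalSelect(mask, v + v, Vector<int>.Zero);
            result.CopyTo(output, i);
        }

        // End-of-array case: the remaining (fewer than width) elements.
        for (; i < input.Length; ++i)
            output[i] = input[i] >= 0 ? input[i] * 2 : 0;
    }

    public static void Main()
    {
        int[] input = { 1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11 };
        int[] output = new int[input.Length];
        DoubleOrZero(input, output);
        Console.WriteLine(string.Join(", ", output)); // 2, 0, 6, 0, 10, 0, 14, 0, 18, 0, 22
    }
}
```

Both sides of the select are computed for every lane; the mask only decides which one is kept.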
Explicit Unity.Mathematics SIMD Example
26
static public IntersectResult Intersect2(NativeArray<PlanePacket4> cullingPlanePackets, AABB a)
{
    // …
    int4 outCounts = 0;
    int4 inCounts = 0;
    for (int i = 0; i < cullingPlanePackets.Length; i++) {
        var p = cullingPlanePackets[i];
        float4 distances = dot4(p.Xs, p.Ys, p.Zs, mx, my, mz) + p.Distances;
        float4 radii = dot4(ex, ey, ez, math.abs(p.Xs), math.abs(p.Ys), math.abs(p.Zs));
        outCounts += (int4) (distances + radii <= 0);
        inCounts += (int4) (distances > radii);
    }
    int inCount = math.csum(inCounts);
    int outCount = math.csum(outCounts);
    if (outCount != 0)
        return IntersectResult.Out;
    else
        return (inCount == 4 * cullingPlanePackets.Length) ? IntersectResult.In : IntersectResult.Partial;
}
The Case For Intrinsics
27
The need for typeless SIMD
28
— In the engine space it’s frequently useful to reinterpret data
— Want control over instruction selection for particular HW
— Want to leverage tricks that compilers don’t use
Data reinterpretation
29
— Work with float bits using integer operations
— Example: Converting small integers to floats
ushort x = ...;
uint y = x | 0x4b000000;
float f = as_float(y) - 8388608.0f;
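The trick works because 0x4b000000 is the bit pattern of 8388608.0f (2^23): OR-ing a small integer into the low mantissa bits gives the float 2^23 + x, and subtracting 2^23 leaves x. A minimal runnable sketch in plain C#, assuming BitConverter.Int32BitsToSingle in place of the slide's as_float (the names IntToFloatSketch and ToFloat are made up here):

```csharp
using System;

public static class IntToFloatSketch
{
    // Valid for 0 <= x < 2^23, which covers every ushort.
    public static float ToFloat(ushort x)
    {
        uint bits = x | 0x4b000000u; // exponent of 2^23, x in the mantissa
        return BitConverter.Int32BitsToSingle(unchecked((int)bits)) - 8388608.0f;
    }

    public static void Main()
    {
        // Check the trick against the ordinary conversion for every ushort.
        for (int x = 0; x <= ushort.MaxValue; ++x)
        {
            if (ToFloat((ushort)x) != (float)x)
                throw new Exception("mismatch at " + x);
        }
        Console.WriteLine(ToFloat(1234)); // 1234
    }
}
```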
Instruction selection
30
— Often useful to base core engine loops around specific hardware instructions
— Example: x86 pmulhrsw
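For readers who have not met pmulhrsw: per 16-bit lane it multiplies two signed values, rounds, and keeps the high bits, which amounts to a rounded multiply of two Q15 fixed-point numbers. The scalar sketch below models one lane in plain C#. The name MulHighRoundScale and the Q15 framing are additions for illustration; the slides only name the instruction.

```csharp
using System;

public static class PmulhrswSketch
{
    // One lane of pmulhrsw: ((a * b >> 14) + 1) >> 1, truncated to 16 bits.
    public static short MulHighRoundScale(short a, short b)
    {
        int product = a * b;                      // full 32-bit signed product
        int rounded = (product >> 14) + 1;        // keep the high bits, add the rounding bit
        return unchecked((short)(rounded >> 1));  // drop the rounding bit, keep 16 bits
    }

    public static void Main()
    {
        short half = 16384; // 0.5 in Q15 (value / 32768)
        Console.WriteLine(MulHighRoundScale(half, half)); // 8192, i.e. 0.25 in Q15
    }
}
```

A compiler is unlikely to pick this instruction from generic multiply-and-shift code, which is the kind of case where choosing the instruction by hand pays off.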
Leveraging data tricks
31
— Many tricks are not in the repertoire of most compilers
— Example: Quickly generating mask from sign of float data
float x = ...;
uint mask = as_int(x) >> 31;
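The slide's snippet is shorthand. One plausible reading, assumed here, is an arithmetic shift of the float's bits viewed as a signed integer, which copies the sign bit into all 32 bits and yields an all-ones or all-zeros mask. The sketch uses BitConverter.SingleToInt32Bits in place of as_int; SignMask and Select are hypothetical helper names.

```csharp
using System;

public static class SignMaskSketch
{
    // All ones if the sign bit of x is set, all zeros otherwise.
    public static uint SignMask(float x)
    {
        int bits = BitConverter.SingleToInt32Bits(x);
        return unchecked((uint)(bits >> 31)); // arithmetic shift replicates the sign bit
    }

    // Branch-free select: picks ifSet where mask bits are 1, ifClear elsewhere.
    public static uint Select(uint mask, uint ifSet, uint ifClear)
    {
        return (ifSet & mask) | (ifClear & ~mask);
    }

    public static void Main()
    {
        Console.WriteLine(SignMask(-1.5f).ToString("X8")); // FFFFFFFF
        Console.WriteLine(SignMask(2.0f).ToString("X8"));  // 00000000
        Console.WriteLine(Select(SignMask(-1.5f), 10u, 20u)); // 10
    }
}
```

Note that the mask follows the sign bit, so -0.0f also produces all ones.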
Burst Intrinsics
32
What we’re working on
33
— Typeless SIMD library of intrinsics
— Start with x86, with ARM to come
— Good C# integration with debugging considerations
Typeless?
34
— Types are mostly an annoyance for real world SIMD
— Often need to reinterpret float/int
— Often need to deal with masks, which are unclearly typed
— Canonical example: comparisons
– _mm_cmpeq_ps – returns a mask of all ones when equal
– So… is that a float? Or an int?
Do what the hardware does
35
— The hardware just has registers, not types (obviously)
— That’s what we expose in our intrinsics API
— m128 – 128 bit SIMD register
— m256 – 256 bit SIMD register
— Instructions determine how the register contents are interpreted
API Usage Example
36
using static Burst.Compiler.IL.x86;
// …
m128 a, b = …;
m128 mask = cmpeq_ps(a, b);
// …
API Extract
37
// _mm_cmpeq_ps
/// <summary> Compare packed single-precision (32-bit)
/// floating-point elements in "a" and "b" for equality,
/// and store the results in "dst". </summary>
[X86InstructionFamily(InstructionFamily.SSE)]
[DebuggerStepThrough]
public static m128 cmpeq_ps(m128 a, m128 b)
{
    m128 dst = default(m128);
    dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0;
    dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0;
    dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0;
    dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0;
    return dst;
}
C# Reference Implementation
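The reference implementation reads the same register as floats and writes it as uints. One way such a typeless register could be modelled in plain C# is an explicit-layout struct with overlapping fields. This is only a sketch under that assumption: Reg128 and CmpEqPs are hypothetical names, and nothing here is taken from the actual Burst m128 implementation.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a typeless 128-bit register: the same 16 bytes
// can be read as four floats or as four uints.
[StructLayout(LayoutKind.Explicit)]
public struct Reg128
{
    [FieldOffset(0)]  public float Float0;
    [FieldOffset(4)]  public float Float1;
    [FieldOffset(8)]  public float Float2;
    [FieldOffset(12)] public float Float3;

    [FieldOffset(0)]  public uint UInt0;
    [FieldOffset(4)]  public uint UInt1;
    [FieldOffset(8)]  public uint UInt2;
    [FieldOffset(12)] public uint UInt3;
}

public static class TypelessSketch
{
    // Same per-lane rule as the cmpeq_ps reference: all ones when equal.
    public static Reg128 CmpEqPs(Reg128 a, Reg128 b)
    {
        Reg128 dst = default(Reg128);
        dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0u;
        dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0u;
        dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0u;
        dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0u;
        return dst;
    }

    public static void Main()
    {
        var a = new Reg128 { Float0 = 1f, Float1 = 2f, Float2 = 3f, Float3 = 4f };
        var b = new Reg128 { Float0 = 1f, Float1 = 0f, Float2 = 3f, Float3 = 0f };
        Reg128 mask = CmpEqPs(a, b);

        // Read as integers the lanes are masks; read as floats a set lane is NaN.
        Console.WriteLine(mask.UInt0 == 0xFFFFFFFFu); // True
        Console.WriteLine(mask.UInt1 == 0u);          // True
        Console.WriteLine(float.IsNaN(mask.Float0));  // True
    }
}
```

The last line shows why a float type is a poor fit for a comparison result: an all-ones lane is a NaN when read as a float.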
A more complete example
38
A more complete example
39
For each door:
    open = 0
    For each player position:
        if player in range and correct team:
            open = 1
    store open state for door
A more complete example
40
— Basic N vs M test
— N doors, M players
public struct Door
{
    public float3 Pos;
    public float RadiusSquared;
    public int Team;
}
public struct DoorTestPos
{
    public float3 Pos;
    public int Team;
}
Reference version
41
[BurstCompile]
public struct DoorTest_Reference : IJob
{
    public NativeArray<Door> Doors;
    public NativeArray<DoorTestPos> TestPos;
    public NativeArray<int> DoorOpenStates;
    public void Execute() {
        for (int j = 0; j < Doors.Length; ++j) {
            bool shouldOpen = false;
            for (int i = 0; i < TestPos.Length; ++i) {
                float3 delta = TestPos[i].Pos - Doors[j].Pos;
                float dsq = math.csum(delta * delta);
                if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team) {
                    shouldOpen = true;
                    break;
                }
            }
            DoorOpenStates[j] = shouldOpen ? 1 : 0;
        }
    }
}
Reference disassembly
42
.LBB0_6:
vmovsd xmm2, qword ptr [rsi - 12]
vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
vsubps xmm2, xmm2, xmm0
vmulps xmm2, xmm2, xmm2
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vaddss xmm3, xmm3, xmm4
vaddss xmm2, xmm2, xmm3
vucomiss xmm2, xmm1
jae .LBB0_10 ; not inside radius?
mov ebx, dword ptr [rdx]
cmp ebx, dword ptr [rsi]
je .LBB0_8 ; break out of loop
.LBB0_10:
inc rdi
add rsi, 16
cmp rdi, rax
jl .LBB0_6
Let’s lose the branches
43
public void Execute() {
    for (int j = 0; j < Doors.Length; ++j) {
        bool shouldOpen = false;
        for (int i = 0; i < TestPos.Length; ++i) {
            float3 delta = TestPos[i].Pos - Doors[j].Pos;
            float dsq = math.csum(delta * delta);
            bool inRadius = dsq < Doors[j].RadiusSquared;
            bool teamMatches = Doors[j].Team == TestPos[i].Team;
            // Non-short-circuit & and |= keep the inner loop free of branches.
            shouldOpen |= inRadius & teamMatches;
        }
        DoorOpenStates[j] = shouldOpen ? 1 : 0;
    }
}
Branch-free disassembly
44
.LBB0_4:
vmovsd xmm2, qword ptr [rdi - 12]
vinsertps xmm2, xmm2, dword ptr [rdi - 4], 32
vsubps xmm2, xmm2, xmm0
vmulps xmm2, xmm2, xmm2
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vaddss xmm3, xmm3, xmm4
vaddss xmm2, xmm2, xmm3
vucomiss xmm2, xmm1
setb al
cmp ebp, dword ptr [rdi]
sete dl
and dl, al
movzx eax, dl
or esi, eax
add rdi, 16
dec rbx
jne .LBB0_4
Explicit SIMD with Unity Mathematics
45
public struct DoorGroup
{
    public float4 Xs;
    public float4 Ys;
    public float4 Zs;
    public float4 RadiiSquared;
    public int4 Teams;
}
public NativeArray<DoorGroup> Doors;
Explicit SIMD with Unity Mathematics
46
for (int j = 0; j < Doors.Length; ++j) {
    bool4 openMask = false;
    for (int i = 0; i < TestPos.Length; ++i) {
        float4 xdeltas = TestPos[i].X - Doors[j].Xs;
        float4 ydeltas = TestPos[i].Y - Doors[j].Ys;
        float4 zdeltas = TestPos[i].Z - Doors[j].Zs;
        float4 xdsq = xdeltas * xdeltas;
        float4 ydsq = ydeltas * ydeltas;
        float4 zdsq = zdeltas * zdeltas;
        float4 dsq = xdsq + ydsq + zdsq;
        bool4 rangeMask = dsq < Doors[j].RadiiSquared;
        bool4 teamMask = TestPos[i].Team == Doors[j].Teams;
        openMask |= teamMask & rangeMask;
    }
    DoorOpenStates[j] = math.select(new int4(0), new int4(1), openMask);
}
Explicit Math version disassembly
47
.LBB0_2:
vbroadcastss xmm0, dword ptr [rdx - 12]
vsubps xmm0, xmm0, xmm11
vbroadcastss xmm2, dword ptr [rdx - 8]
vsubps xmm2, xmm2, xmm4
vbroadcastss xmm3, dword ptr [rdx - 4]
vsubps xmm3, xmm3, xmm5
vmulps xmm0, xmm0, xmm0
vmulps xmm2, xmm2, xmm2
vmulps xmm3, xmm3, xmm3
vaddps xmm0, xmm0, xmm3
vaddps xmm0, xmm2, xmm0
vcmpltps xmm0, xmm0, xmm7
vpcmpeqd xmm2, xmm1, xmmword ptr [rdx]
vpand xmm0, xmm2, xmm0
vpsrld xmm0, xmm0, 31
vpor xmm6, xmm6, xmm0
add rdx, 28
dec rsi
jne .LBB0_2
Explicit SIMD with Burst Intrinsics
48
public struct Door4
{
    public m128 Xs;
    public m128 Ys;
    public m128 Zs;
    public m128 RadiiSquared;
    public m128 Teams;
}
Explicit SIMD with Burst Intrinsics
49
for (int j = 0; j < Doors.Length; ++j) {
    // Start with all lanes clear; matches are OR-ed in below.
    // (Starting from ~0u would leave every door permanently open.)
    m128 openMask = new m128(0u);
    for (int i = 0; i < TestPos.Length; ++i) {
        m128 tx = new m128(TestPos[i].X);
        m128 ty = new m128(TestPos[i].Y);
        m128 tz = new m128(TestPos[i].Z);
        m128 tt = new m128(TestPos[i].Team);
        m128 xdeltas = sub_ps(Doors[j].Xs, tx);
        m128 ydeltas = sub_ps(Doors[j].Ys, ty);
        m128 zdeltas = sub_ps(Doors[j].Zs, tz);
        m128 xdsq = mul_ps(xdeltas, xdeltas);
        m128 ydsq = mul_ps(ydeltas, ydeltas);
        m128 zdsq = mul_ps(zdeltas, zdeltas);
        m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));
        m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);
        rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
        openMask = or_ps(openMask, rangeMask);
    }
    DoorOpenStates.ReinterpretStore(j * 4, openMask);
}
Explicit SIMD Disassembly
50
.LBB1_3:
vbroadcastss xmm4, dword ptr [rax - 12]
vbroadcastss xmm5, dword ptr [rax - 8]
vbroadcastss xmm6, dword ptr [rax - 4]
vpbroadcastd xmm7, dword ptr [rax]
vpcmpeqd xmm7, xmm3, xmm7
vsubps xmm4, xmm1, xmm4
vsubps xmm5, xmm1, xmm5
vsubps xmm6, xmm1, xmm6
vmulps xmm4, xmm4, xmm4
vmulps xmm5, xmm5, xmm5
vaddps xmm4, xmm5, xmm4
vmulps xmm5, xmm6, xmm6
vaddps xmm4, xmm5, xmm4
vcmpleps xmm4, xmm4, xmm2
vpand xmm4, xmm7, xmm4
vpor xmm0, xmm4, xmm0
inc rsi
add rax, 16
cmp rsi, rdx
jl .LBB1_3
Guidelines for SIMD with Burst
51
— Become familiar with the Burst inspector
— Eliminate branches (typically a good idea)
— Prefer wider batches of input data
— Use Unity.Mathematics vertically (as in this example)
— SIMD intrinsics give you the fewest surprises, but require the most effort
What about System.Numerics?
52
— We might consider supporting this API at a later stage
— We want complete control and easy porting of C++ intrinsic code to HPC#
— Similar to the approach we took with HLSL code for Math
Summary
53
— Intrinsics are coming
— Be careful with abstractions
— Adopt a SIMD mindset with Unity.Mathematics today
— Independent values are your friends
— Get familiar with the Burst inspector
— Go forth and compute more things quickly!
Thank you!
54
— Q & A
— Forum feedback welcome
— Twitter: @deplinenoise

More Related Content

PPTX
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
PDF
Siggraph2016 - The Devil is in the Details: idTech 666
PPSX
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
PPTX
Optimizing the Graphics Pipeline with Compute, GDC 2016
PDF
Modern OpenGL Usage: Using Vertex Buffer Objects Well
PPTX
Decima Engine: Visibility in Horizon Zero Dawn
PPTX
Optimizing unity games (Google IO 2014)
PPT
OpenGL 3.2 and More
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Siggraph2016 - The Devil is in the Details: idTech 666
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Optimizing the Graphics Pipeline with Compute, GDC 2016
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Decima Engine: Visibility in Horizon Zero Dawn
Optimizing unity games (Google IO 2014)
OpenGL 3.2 and More

What's hot (20)

PDF
OpenGL 4.4 - Scene Rendering Techniques
PPTX
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
PPT
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
PDF
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
PPTX
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
PPT
Crysis Next-Gen Effects (GDC 2008)
PPTX
Hable John Uncharted2 Hdr Lighting
PPSX
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
PPTX
DirectX 11 Rendering in Battlefield 3
PDF
Graphics Gems from CryENGINE 3 (Siggraph 2013)
PPTX
A Certain Slant of Light - Past, Present and Future Challenges of Global Illu...
PPT
Secrets of CryENGINE 3 Graphics Technology
PPTX
Frostbite on Mobile
ODP
Inter-process communication of Android
PPT
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
PPTX
Triangle Visibility buffer
PPTX
Stochastic Screen-Space Reflections
PDF
Advanced Scenegraph Rendering Pipeline
PDF
Getting started with Burst – Unite Copenhagen 2019
PDF
Penner pre-integrated skin rendering (siggraph 2011 advances in real-time r...
OpenGL 4.4 - Scene Rendering Techniques
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Crysis Next-Gen Effects (GDC 2008)
Hable John Uncharted2 Hdr Lighting
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
DirectX 11 Rendering in Battlefield 3
Graphics Gems from CryENGINE 3 (Siggraph 2013)
A Certain Slant of Light - Past, Present and Future Challenges of Global Illu...
Secrets of CryENGINE 3 Graphics Technology
Frostbite on Mobile
Inter-process communication of Android
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Triangle Visibility buffer
Stochastic Screen-Space Reflections
Advanced Scenegraph Rendering Pipeline
Getting started with Burst – Unite Copenhagen 2019
Penner pre-integrated skin rendering (siggraph 2011 advances in real-time r...
Ad

Similar to Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 (20)

PPTX
SIMD.pptx
PPTX
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
PDF
Designing C++ portable SIMD support
PPTX
JVM Memory Model - Yoav Abrahami, Wix
PPT
8871077.ppt
PPTX
Medical Image Processing Strategies for multi-core CPUs
PPTX
lec2 - Modern Processors - SIMD.pptx
PPT
Happy To Use SIMD
PDF
Simd programming introduction
PPTX
SIMD Processing Using Compiler Intrinsics
PPTX
Java Jit. Compilation and optimization by Andrey Kovalenko
PDF
Joel Falcou, Boost.SIMD
PDF
Peddle the Pedal to the Metal
PDF
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
PDF
100 bugs in Open Source C/C++ projects
PPTX
Graphics processing uni computer archiecture
PDF
Data Analytics and Simulation in Parallel with MATLAB*
PPTX
Java on arm theory, applications, and workloads [dev5048]
PPT
12 virtualmachine
PDF
100 bugs in Open Source C/C++ projects
SIMD.pptx
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Designing C++ portable SIMD support
JVM Memory Model - Yoav Abrahami, Wix
8871077.ppt
Medical Image Processing Strategies for multi-core CPUs
lec2 - Modern Processors - SIMD.pptx
Happy To Use SIMD
Simd programming introduction
SIMD Processing Using Compiler Intrinsics
Java Jit. Compilation and optimization by Andrey Kovalenko
Joel Falcou, Boost.SIMD
Peddle the Pedal to the Metal
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
100 bugs in Open Source C/C++ projects
Graphics processing uni computer archiecture
Data Analytics and Simulation in Parallel with MATLAB*
Java on arm theory, applications, and workloads [dev5048]
12 virtualmachine
100 bugs in Open Source C/C++ projects
Ad

More from Unity Technologies (20)

PDF
Build Immersive Worlds in Virtual Reality
PDF
Augmenting reality: Bring digital objects into the real world
PDF
Let’s get real: An introduction to AR, VR, MR, XR and more
PDF
Using synthetic data for computer vision model training
PDF
The Tipping Point: How Virtual Experiences Are Transforming Global Industries
PDF
Unity Roadmap 2020: Live games
PDF
Unity Roadmap 2020: Core Engine & Creator Tools
PDF
How ABB shapes the future of industry with Microsoft HoloLens and Unity - Uni...
PPTX
Unity XR platform has a new architecture – Unite Copenhagen 2019
PDF
Turn Revit Models into real-time 3D experiences
PDF
How Daimler uses mobile mixed realities for training and sales - Unite Copenh...
PDF
How Volvo embraced real-time 3D and shook up the auto industry- Unite Copenha...
PDF
QA your code: The new Unity Test Framework – Unite Copenhagen 2019
PDF
Engineering.com webinar: Real-time 3D and digital twins: The power of a virtu...
PDF
Supplying scalable VR training applications with Innoactive - Unite Copenhage...
PDF
XR and real-time 3D in automotive digital marketing strategies | Visionaries ...
PDF
Real-time CG animation in Unity: unpacking the Sherman project - Unite Copenh...
PDF
Creating next-gen VR and MR experiences using Varjo VR-1 and XR-1 - Unite Cop...
PDF
What's ahead for film and animation with Unity 2020 - Unite Copenhagen 2019
PDF
How to Improve Visual Rendering Quality in VR - Unite Copenhagen 2019
Build Immersive Worlds in Virtual Reality
Augmenting reality: Bring digital objects into the real world
Let’s get real: An introduction to AR, VR, MR, XR and more
Using synthetic data for computer vision model training
The Tipping Point: How Virtual Experiences Are Transforming Global Industries
Unity Roadmap 2020: Live games
Unity Roadmap 2020: Core Engine & Creator Tools
How ABB shapes the future of industry with Microsoft HoloLens and Unity - Uni...
Unity XR platform has a new architecture – Unite Copenhagen 2019
Turn Revit Models into real-time 3D experiences
How Daimler uses mobile mixed realities for training and sales - Unite Copenh...
How Volvo embraced real-time 3D and shook up the auto industry- Unite Copenha...
QA your code: The new Unity Test Framework – Unite Copenhagen 2019
Engineering.com webinar: Real-time 3D and digital twins: The power of a virtu...
Supplying scalable VR training applications with Innoactive - Unite Copenhage...
XR and real-time 3D in automotive digital marketing strategies | Visionaries ...
Real-time CG animation in Unity: unpacking the Sherman project - Unite Copenh...
Creating next-gen VR and MR experiences using Varjo VR-1 and XR-1 - Unite Cop...
What's ahead for film and animation with Unity 2020 - Unite Copenhagen 2019
How to Improve Visual Rendering Quality in VR - Unite Copenhagen 2019

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Empathic Computing: Creating Shared Understanding
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Modernizing your data center with Dell and AMD
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Electronic commerce courselecture one. Pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
A Presentation on Artificial Intelligence
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Empathic Computing: Creating Shared Understanding
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Review of recent advances in non-invasive hemoglobin estimation
Modernizing your data center with Dell and AMD
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Electronic commerce courselecture one. Pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
NewMind AI Weekly Chronicles - August'25 Week I
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Digital-Transformation-Roadmap-for-Companies.pptx
Unlocking AI with Model Context Protocol (MCP)
A Presentation on Artificial Intelligence
20250228 LYD VKU AI Blended-Learning.pptx

Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019

  • 3. — You’re interested in CPU performance — You’re considering porting engine systems to HPC# — You just go to technical talks because it’s cool Who this talk is for 3
  • 4. — Quick introduction to SIMD topics — Options for SIMD programming in HPC# today — The case for intrinsics and typeless SIMD — Case studies for intrinsics — Q & A Talk Contents 4
  • 6. What is SIMD? 6 — Single Instruction, Multiple Data – Doing more than one thing at a time — Available on essentially all hardware today in some form – Capabilities vary, but a few families exist — ARM Neon — x86/64 SSE and AVX
  • 7. SIMD Analogy: Chopping Veggies 7 Input Data Output Data Instruction Preprocessed!
  • 8. Why is SIMD important? 8 — It’s more efficient to do more with less instructions — There is dedicated hardware for this stuff — Often the only way you can get the max cache bandwidth
  • 10. Cache Bandwidth 10 — L1 Caches can deliver N bits every cycle – N typically much larger than 64 – 128 or 256 bits per cycle common in most CPUs today — Without using SIMD instructions, only get a fraction of this – Important part of not leaving performance on the table
  • 11. Cache Bandwidth 11 — This matters when you are processing in cache – Which is what we hope to do most of the time — Processing floats (4 bytes, 32 bits), 128-bit cache b/w – Each load wastes 75% of bandwidth from the cache — Better: Process 4 floats at a time – Full cache utilization
  • 12. The vector fallacy 12 — SIMD and mathematical vectors are mostly unrelated — Lots of confusion around this issue — False: Using a vector math library is somehow SIMD — True: Working with arrays of data can lead to opportunities for SIMD (but not always) — It’s especially problematic with 3-component vectors as we’ll see
  • 13. Quick smell test for float SIMD-ness (x86) 13 — xxxps instructions feeding into each other? – It’s probably SIMD code — xxxss instructions? – Scalar code — Occasional xxxps instructions with infrastructure? – Mix of SIMD and scalar, red flag!
  • 14. Example: 3D dot products 14 public static float DotExample(float3 a, float3 b) { return math.dot(a, b); } mulps xmm0, xmmword ptr [rdx] movshdup xmm1, xmm0 addss xmm1, xmm0 movhlps xmm0, xmm0 addss xmm0, xmm1 1 SIMD op, 2 infrastructure, 2 scalar ops => 1 dot product 4-wide mul, only 3 lanes valid shuffle overhead scalar addition shuffle overhead scalar addition
  • 15. Example: 3D dot products 15 — But wait a minute, what is a dot product? — For 3D: – a.x * b.x + a.y * b.y + a.z * b.z — What if we go back to basics and base our code on this?
  • 16. Example: 3D dot products (back to basics) 16 public static float DotExample1( float ax, float ay, float az float bx, float by, float bz) { return ax * bx + ay * by + az * bz; } mulss xmm0, xmm3 mulss xmm1, dword ptr [rsp + 40] addss xmm0, xmm1 mulss xmm2, dword ptr [rsp + 48] addss xmm0, xmm2 0 SIMD ops, 0 infrastructure, 5 scalar ops => 1 dot product mul mul add mul add
  • 17. Example: 3D dot products (SIMD) 17 public static float4 DotExample4( float4 ax, float4 ay, float4 az float4 bx, float4 by, float4 bz) { return ax * bx + ay * by + az * bz; } mulps xmm2, xmmword ptr [r9] mulps xmm1, xmmword ptr [rcx] addps xmm1, xmm2 mulps xmm0, xmmword ptr [rax] addps xmm0, xmm1 5 SIMD ops, 0 infrastructure, 0 scalar ops => 4 dot products 4-wide mul 4-wide mul 4-wide add 4-wide mul 4-wide add
  • 18. SIMD mindset 18 — Important not to think in terms of abstractions — Don’t think about float4 as “a 4-D vector” — Better: 4 floats is the width of the vector unit on this CPU — Get used to the idea of 128 or 256 bit blocks of data – Divide into whatever size is convenient — What fits in 128 bits? – 16 bytes – 8 shorts – 4 floats or ints – 2 doubles or longs
  • 19. SIMD mindset, contd. 19 — Try to find opportunities to compute independent values – Like the 4 independent dot products we just saw — Fight the urge to think of vectors as horizontal values – Horizontal operations often go against the grain of SIMD instructions — Typically scalar code without abstractions vectorizes well – float3, float2 etc can be convenient but often get in the way
  • 20. So how do you get SIMD code? 20 — In HPC# we’ve had two options so far: – LLVM auto-vectorization – Unity.Mathematics explicit SIMD
  • 21. LLVM’s auto vectorizer 21 — Simple mode: Write scalar code, get SIMD code out — For simple loops, LLVM is often able to generate SIMD code — Checklist to look at before expecting SIMD: – Data ranges must not alias – Data must be contiguous in memory (for wide loads) – Data types must be integer or float with fast-math – Branches are kept to a minimum – There is no cross-element interference
  • 22. LLVM’s auto vectorizer 22 — Pros – Simpler code to read/write (at face value) – Often gives you a speedup where you didn’t expect one — Cons – Need to learn a bunch of rules to get SIMD code from loops – No way to tell when you’ve stopped getting SIMD – (We’re looking at ways to make this a compile error if desired) – Hard to reinterpret data types – Often surprising what will not vectorize
  • 23. Example of successful vectorization 23 [BurstCompile] public struct VectorizeDemo : IJob { public NativeArray<int> Inputs; public NativeArray<int> Outputs; public void Execute() { for (int i = 0; i < Inputs.Length; ++i) { if (Inputs[i] >= 0) { Outputs[i] = Inputs[i]; } else { Outputs[i] = 0; } } } } .LBB0_7: vpmaxsd ymm1, ymm0, ymmword ptr [r10 + 4*rdx] vpmaxsd ymm2, ymm0, ymmword ptr [r10 + 4*rdx + 32] vpmaxsd ymm3, ymm0, ymmword ptr [r10 + 4*rdx + 64] vpmaxsd ymm4, ymm0, ymmword ptr [r10 + 4*rdx + 96] vmovdqu ymmword ptr [rcx + 4*rdx], ymm1 vmovdqu ymmword ptr [rcx + 4*rdx + 32], ymm2 vmovdqu ymmword ptr [rcx + 4*rdx + 64], ymm3 vmovdqu ymmword ptr [rcx + 4*rdx + 96], ymm4 add rdx, 32 cmp rax, rdx jne .LBB0_7
  • 24. Example of unsuccessful vectorization 24 [BurstCompile] public struct VectorizeDemo : IJob { public NativeArray<int> Inputs; public NativeArray<int> Outputs; public void Execute() { for (int i = 0; i < Inputs.Length; ++i) { if (Inputs[i] >= 0) { Outputs[i] = Inputs[i] * 2; } else { Outputs[i] = 0; } } } } .LBB0_2: mov edx, dword ptr [r10 + 4*rax] lea ecx, [rdx + rdx] test edx, edx cmovs ecx, r8d mov dword ptr [r11 + 4*rax], ecx inc rax cmp r9, rax jne .LBB0_2
  • 25. Explicit SIMD with Unity.Mathematics 25 — Use e.g. float4, int4 vertically (as in dot product example) — Maps directly to LLVM vector types, you will get vector code — Checklist: – Avoid branches, use select/mask idioms – Use native arrays, with ReinterpretLoad/Store as needed – Handle end-of-array cases manually
  • 26. Explicit Unity.Mathematics SIMD Example

      static public IntersectResult Intersect2(NativeArray<PlanePacket4> cullingPlanePackets, AABB a)
      {
          // …
          int4 outCounts = 0;
          int4 inCounts = 0;

          for (int i = 0; i < cullingPlanePackets.Length; i++)
          {
              var p = cullingPlanePackets[i];
              float4 distances = dot4(p.Xs, p.Ys, p.Zs, mx, my, mz) + p.Distances;
              float4 radii = dot4(ex, ey, ez, math.abs(p.Xs), math.abs(p.Ys), math.abs(p.Zs));

              outCounts += (int4) (distances + radii <= 0);
              inCounts += (int4) (distances > radii);
          }

          int inCount = math.csum(inCounts);
          int outCount = math.csum(outCounts);

          if (outCount != 0)
              return IntersectResult.Out;
          else
              return (inCount == 4 * cullingPlanePackets.Length) ? IntersectResult.In : IntersectResult.Partial;
      }
  • 27. The Case For Intrinsics
  • 28. The need for typeless SIMD
    — In the engine space it’s frequently useful to reinterpret data
    — Want control over instruction selection for particular HW
    — Want to leverage tricks that compilers don’t use
  • 29. Data reinterpretation
    — Work with float bits using integer operations
    — Example: Converting small integers to floats

      ushort x = ...;
      uint y = x | 0x4b000000;
      float f = as_float(y) - 8388608.0f;
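
A runnable version of this trick in plain C# might look like the sketch below. It assumes `BitConverter.Int32BitsToSingle` (available in .NET Core 2.0 and later) in place of the slide's `as_float`, and the names `SmallIntToFloatSketch` and `ToFloat` are hypothetical. The constant `0x4b000000` is the bit pattern of 8388608.0f (2^23), so OR-ing a 16-bit integer into the low mantissa bits yields the float 2^23 + x, and subtracting 2^23 leaves x.

```csharp
using System;

public static class SmallIntToFloatSketch
{
    // Converts a 16-bit unsigned integer to float without an int-to-float
    // conversion instruction, using only an OR, a bit reinterpretation and a subtract.
    public static float ToFloat(ushort x)
    {
        // 0x4b000000 is the IEEE 754 bit pattern of 2^23 = 8388608.0f.
        // At that magnitude one mantissa step equals exactly 1.0, so placing x
        // in the low mantissa bits produces the float value 2^23 + x.
        int bits = x | 0x4b000000;

        // Reinterpret the integer bits as a float (the slide's as_float).
        float biased = BitConverter.Int32BitsToSingle(bits);

        // Remove the 2^23 bias; the result is exactly x.
        return biased - 8388608.0f;
    }
}
```

The trick works for any integer below 2^23, which is why it applies to `ushort` inputs without further checks.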
  • 30. Instruction selection
    — Often useful to base core engine loops around specific h/w
    — Example: x86 pmulhrsw
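
For readers unfamiliar with pmulhrsw, the following scalar sketch shows what the instruction computes for one 16-bit lane: a signed multiply, rounded and scaled back down by 15 bits, which amounts to a rounded Q15 fixed-point multiply. The names `MulHrsSketch` and `MulHighRoundScale` are hypothetical and this is not code from the talk; the real instruction applies the same operation to every 16-bit lane of a register at once.

```csharp
public static class MulHrsSketch
{
    // Scalar model of one 16-bit lane of x86 pmulhrsw:
    // multiply, keep 18 significant bits, round, and return the middle 16 bits.
    public static short MulHighRoundScale(short a, short b)
    {
        // Full 32-bit signed product of the two 16-bit inputs.
        int product = a * b;

        // Shift right by 14, add 1 for rounding, then shift right by 1 more.
        int rounded = ((product >> 14) + 1) >> 1;

        // Keep the low 16 bits, as the hardware does.
        return unchecked((short)rounded);
    }
}
```

Treating the inputs as Q15 fractions, 16384 represents 0.5, and `MulHighRoundScale(16384, 16384)` gives 8192, which represents 0.25.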
  • 31. Leveraging data tricks
    — Many tricks are not in the repertoire of most compilers
    — Example: Quickly generating mask from sign of float data

      float x = ...;
      uint mask = as_int(x) >> 31;
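
In plain C# the sign-mask trick can be sketched as follows, assuming `BitConverter.SingleToInt32Bits` (.NET Core 2.0 and later) stands in for the slide's `as_int`. The names `SignMaskSketch` and `SignMask` are hypothetical. The key detail is that the shift is performed on a signed integer, so it is an arithmetic shift that copies the sign bit into all 32 positions.

```csharp
using System;

public static class SignMaskSketch
{
    // Returns 0xFFFFFFFF when the sign bit of x is set, and 0 otherwise.
    public static uint SignMask(float x)
    {
        // Reinterpret the float bits as a signed 32-bit integer.
        int bits = BitConverter.SingleToInt32Bits(x);

        // Arithmetic shift right by 31 replicates the sign bit across the word.
        return unchecked((uint)(bits >> 31));
    }
}
```

Because the test looks only at the sign bit, negative zero also produces an all-ones mask.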
  • 33. What we’re working on
    — Typeless SIMD library of intrinsics
    — Start with x86, with ARM to come
    — Good C# integration with debugging considerations
  • 34. Typeless?
    — Types are mostly an annoyance for real world SIMD
    — Often need to reinterpret float/int
    — Often need to deal with masks, which are unclearly typed
    — Canonical example: comparisons
      – _mm_cmpeq_ps – returns a mask of all ones when equal
      – So… is that a float? Or an int?
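
One way to see why the mask has no natural type is a scalar sketch of how such a mask is typically consumed: a float comparison produces an integer bit pattern, which is then AND-ed with the bits of float values to select between them. This is an illustration in plain C#, not the Burst API; `MaskSelectSketch`, `CmpEqMask` and `Select` are hypothetical names, and a signed `int` mask is used here so that `BitConverter.SingleToInt32Bits` and `BitConverter.Int32BitsToSingle` can do the reinterpretation.

```csharp
using System;

public static class MaskSelectSketch
{
    // Scalar model of one lane of a packed float equality compare:
    // all bits set when equal, all bits clear otherwise.
    public static int CmpEqMask(float a, float b)
    {
        return a == b ? -1 : 0;
    }

    // Branch-free select: returns ifSet where the mask is all ones and
    // ifClear where it is all zeros, using only bitwise operations.
    public static float Select(int mask, float ifSet, float ifClear)
    {
        int setBits = BitConverter.SingleToInt32Bits(ifSet);
        int clearBits = BitConverter.SingleToInt32Bits(ifClear);

        // The mask came from a float compare, is an integer pattern, and is
        // applied to float data; a single static type does not fit it well.
        int resultBits = (setBits & mask) | (clearBits & ~mask);

        return BitConverter.Int32BitsToSingle(resultBits);
    }
}
```

For example, `Select(CmpEqMask(1f, 1f), 2f, 3f)` yields 2, while `Select(CmpEqMask(1f, 2f), 2f, 3f)` yields 3.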
  • 35. Do what the hardware does
    — The hardware just has registers, not types (obviously)
    — That’s what we expose in our intrinsics API
    — m128 – 128 bit SIMD register
    — m256 – 256 bit SIMD register
    — Instructions determine how the register contents are interpreted
  • 36. API Usage Example

      using static Burst.Compiler.IL.x86;
      // …
      m128 a, b = …;
      m128 mask = cmpeq_ps(a, b);
      // …
  • 37. API Extract
    — C# Reference Implementation

      // _mm_cmpeq_ps
      /// <summary> Compare packed single-precision (32-bit)
      /// floating-point elements in "a" and "b" for equality,
      /// and store the results in "dst". </summary>
      [X86InstructionFamily(InstructionFamily.SSE)]
      [DebuggerStepThrough]
      public static m128 cmpeq_ps(m128 a, m128 b)
      {
          m128 dst = default(m128);
          dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0;
          dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0;
          dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0;
          dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0;
          return dst;
      }
  • 38. A more complete example
  • 39. A more complete example

      For each door:
          open = 0
          For each player position:
              if player in range and correct team:
                  open = 1
          store open state for door
  • 40. A more complete example
    — Basic N vs M test
    — N doors, M players

      public struct Door
      {
          public float3 Pos;
          public float RadiusSquared;
          public int Team;
      }

      public struct DoorTestPos
      {
          public float3 Pos;
          public int Team;
      }
  • 41. Reference version

      [BurstCompile]
      public struct DoorTest_Reference : IJob
      {
          public NativeArray<Door> Doors;
          public NativeArray<DoorTestPos> TestPos;
          public NativeArray<int> DoorOpenStates;

          public void Execute()
          {
              for (int j = 0; j < Doors.Length; ++j)
              {
                  bool shouldOpen = false;
                  for (int i = 0; i < TestPos.Length; ++i)
                  {
                      float3 delta = TestPos[i].Pos - Doors[j].Pos;
                      float dsq = math.csum(delta * delta);
                      if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team)
                      {
                          shouldOpen = true;
                          break;
                      }
                  }
                  DoorOpenStates[j] = shouldOpen ? 1 : 0;
              }
          }
      }
  • 42. Reference disassembly

      .LBB0_6:
          vmovsd xmm2, qword ptr [rsi - 12]
          vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
          vsubps xmm2, xmm2, xmm0
          vmulps xmm2, xmm2, xmm2
          vmovshdup xmm3, xmm2
          vpermilpd xmm4, xmm2, 1
          vaddss xmm3, xmm3, xmm4
          vaddss xmm2, xmm2, xmm3
          vucomiss xmm2, xmm1
          jae .LBB0_10          ; not inside radius?
          mov ebx, dword ptr [rdx]
          cmp ebx, dword ptr [rsi]
          je .LBB0_8            ; break out of loop
      .LBB0_10:
          inc rdi
          add rsi, 16
          cmp rdi, rax
          jl .LBB0_6
  • 43. Let’s lose the branches

      public void Execute()
      {
          for (int j = 0; j < Doors.Length; ++j)
          {
              bool shouldOpen = false;
              for (int i = 0; i < TestPos.Length; ++i)
              {
                  float3 delta = TestPos[i].Pos - Doors[j].Pos;
                  float dsq = math.csum(delta * delta);
                  bool inRadius = dsq < Doors[j].RadiusSquared;
                  bool teamMatches = Doors[j].Team == TestPos[i].Team;
                  // Non-short-circuit & and |= keep the loop body free of branches.
                  shouldOpen |= (inRadius & teamMatches) ? true : false;
              }
              DoorOpenStates[j] = shouldOpen ? 1 : 0;
          }
      }
  • 44. Branch-free disassembly

      .LBB0_4:
          vmovsd xmm2, qword ptr [rdi - 12]
          vinsertps xmm2, xmm2, dword ptr [rdi - 4], 32
          vsubps xmm2, xmm2, xmm0
          vmulps xmm2, xmm2, xmm2
          vmovshdup xmm3, xmm2
          vpermilpd xmm4, xmm2, 1
          vaddss xmm3, xmm3, xmm4
          vaddss xmm2, xmm2, xmm3
          vucomiss xmm2, xmm1
          setb al
          cmp ebp, dword ptr [rdi]
          sete dl
          and dl, al
          movzx eax, dl
          or esi, eax
          add rdi, 16
          dec rbx
          jne .LBB0_4
  • 45. Explicit SIMD with Unity Mathematics

      public struct DoorGroup
      {
          public float4 Xs;
          public float4 Ys;
          public float4 Zs;
          public float4 RadiiSquared;
          public int4 Teams;
      }

      public NativeArray<DoorGroup> Doors;
  • 46. Explicit SIMD with Unity Mathematics

      for (int j = 0; j < Doors.Length; ++j)
      {
          bool4 openMask = false;
          for (int i = 0; i < TestPos.Length; ++i)
          {
              float4 xdeltas = TestPos[i].X - Doors[j].Xs;
              float4 ydeltas = TestPos[i].Y - Doors[j].Ys;
              float4 zdeltas = TestPos[i].Z - Doors[j].Zs;
              float4 xdsq = xdeltas * xdeltas;
              float4 ydsq = ydeltas * ydeltas;
              float4 zdsq = zdeltas * zdeltas;
              float4 dsq = xdsq + ydsq + zdsq;
              bool4 rangeMask = dsq < Doors[j].RadiiSquared;
              bool4 teamMask = TestPos[i].Team == Doors[j].Teams;
              openMask |= teamMask & rangeMask;
          }
          DoorOpenStates[j] = math.select(new int4(0), new int4(1), openMask);
      }
  • 47. Explicit Math version disassembly

      .LBB0_2:
          vbroadcastss xmm0, dword ptr [rdx - 12]
          vsubps xmm0, xmm0, xmm11
          vbroadcastss xmm2, dword ptr [rdx - 8]
          vsubps xmm2, xmm2, xmm4
          vbroadcastss xmm3, dword ptr [rdx - 4]
          vsubps xmm3, xmm3, xmm5
          vmulps xmm0, xmm0, xmm0
          vmulps xmm2, xmm2, xmm2
          vmulps xmm3, xmm3, xmm3
          vaddps xmm0, xmm0, xmm3
          vaddps xmm0, xmm2, xmm0
          vcmpltps xmm0, xmm0, xmm7
          vpcmpeqd xmm2, xmm1, xmmword ptr [rdx]
          vpand xmm0, xmm2, xmm0
          vpsrld xmm0, xmm0, 31
          vpor xmm6, xmm6, xmm0
          add rdx, 28
          dec rsi
          jne .LBB0_2
  • 48. Explicit SIMD with Burst Intrinsics

      public struct Door4
      {
          public m128 Xs;
          public m128 Ys;
          public m128 Zs;
          public m128 RadiiSquared;
          public m128 Teams;
      }
  • 49. Explicit SIMD with Burst Intrinsics

      for (int j = 0; j < Doors.Length; ++j)
      {
          // Start with every lane clear; an all-ones start would make the OR below a no-op.
          m128 openMask = new m128(0u);
          for (int i = 0; i < TestPos.Length; ++i)
          {
              // Broadcast one test position and team across all four lanes.
              m128 tx = new m128(TestPos[i].X);
              m128 ty = new m128(TestPos[i].Y);
              m128 tz = new m128(TestPos[i].Z);
              m128 tt = new m128(TestPos[i].Team);

              m128 xdeltas = sub_ps(Doors[j].Xs, tx);
              m128 ydeltas = sub_ps(Doors[j].Ys, ty);
              m128 zdeltas = sub_ps(Doors[j].Zs, tz);

              m128 xdsq = mul_ps(xdeltas, xdeltas);
              m128 ydsq = mul_ps(ydeltas, ydeltas);
              m128 zdsq = mul_ps(zdeltas, zdeltas);
              m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));

              // Float compare and integer compare both yield masks in the same register type.
              m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);
              rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
              openMask = or_ps(openMask, rangeMask);
          }
          DoorOpenStates.ReinterpretStore(j * 4, openMask);
      }
  • 50. Explicit SIMD Disassembly

      .LBB1_3:
          vbroadcastss xmm4, dword ptr [rax - 12]
          vbroadcastss xmm5, dword ptr [rax - 8]
          vbroadcastss xmm6, dword ptr [rax - 4]
          vpbroadcastd xmm7, dword ptr [rax]
          vpcmpeqd xmm7, xmm3, xmm7
          vsubps xmm4, xmm1, xmm4
          vsubps xmm5, xmm1, xmm5
          vsubps xmm6, xmm1, xmm6
          vmulps xmm4, xmm4, xmm4
          vmulps xmm5, xmm5, xmm5
          vaddps xmm4, xmm5, xmm4
          vmulps xmm5, xmm6, xmm6
          vaddps xmm4, xmm5, xmm4
          vcmpleps xmm4, xmm4, xmm2
          vpand xmm4, xmm7, xmm4
          vpor xmm0, xmm4, xmm0
          inc rsi
          add rax, 16
          cmp rsi, rdx
          jl .LBB1_3
  • 51. Guidelines for SIMD with Burst
    — Become familiar with the Burst inspector
    — Eliminate branches (typically a good idea)
    — Prefer wider batches of input data
    — Use Unity.Mathematics vertically (as in this example)
    — SIMD intrinsics give you the fewest surprises, but require the most effort
  • 52. What about System.Numerics?
    — We might consider supporting this API at a later stage
    — We want complete control and easy porting of C++ intrinsic code to HPC#
    — Similar to the approach we took with HLSL code for Math
  • 53. Summary
    — Intrinsics are coming
    — Be careful with abstractions
    — Adopt a SIMD mindset with Unity.Mathematics today
    — Independent values are your friends
    — Get familiar with the Burst inspector
    — Go forth and compute more things quickly!
  • 54. Thank you!
    — Q & A
    — Forum feedback welcome
    — Twitter: @deplinenoise