Egor Bogatov - .NET Core intrinsics and other micro-optimizations

Intrinsics and other micro-optimizations
Egor Bogatov
Engineer at Microsoft

Agenda
Useful micro-optimizations
Pitfalls for external contributors
Intrinsics & SIMD with examples
.NET Core 3.0-x features

Prefer the Span API where possible
// Substring: every call allocates a new string
var str = "EGOR 3.14 1234 7/3/2018";
string name = str.Substring(0, 4);
float pi = float.Parse(str.Substring(5, 4));
int number = int.Parse(str.Substring(10, 4));
DateTime date = DateTime.Parse(str.Substring(15, 8));
Allocated on heap: 168 bytes

// Span: slices are views over the original string
var str = "EGOR 3.14 1234 7/3/2018".AsSpan();
var name = str.Slice(0, 4);
float pi = float.Parse(str.Slice(5, 4));
int number = int.Parse(str.Slice(10, 4));
DateTime date = DateTime.Parse(str.Slice(15, 8));
Allocated on heap: 0 bytes

Allocating a temp array
char[] buffer =
    new char[count];

Span<char> span =
    new char[count];

Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    new char[count];

Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    ArrayPool<char>.Shared.Rent(count);

Allocating a temp array - final pattern
char[] pool = null;
Span<char> span =
    count <= 512 ?
    stackalloc char[512] :
    (pool = ArrayPool<char>.Shared.Rent(count));
// ... work with span ...
if (pool != null)
    ArrayPool<char>.Shared.Return(pool);

Allocating a temp array – without ArrayPool
Span<char> span = count <= 512 ?
    stackalloc char[512] :
    new char[count];
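
The temp-array pattern above can be sketched as a complete method. This is only an illustration: the `TempBufferDemo` class and the `ToUpperViaTempBuffer` helper are hypothetical names, and the 512-character threshold is simply the one used on the slides. The `try/finally` is an addition so that the rented array is returned even if the work throws.

```csharp
using System;
using System.Buffers;

static class TempBufferDemo
{
    // Hypothetical helper: upper-cases text through a temporary buffer.
    public static string ToUpperViaTempBuffer(string text)
    {
        int count = text.Length;
        char[] pool = null;

        // Small inputs live on the stack, large ones come from the shared pool.
        Span<char> span = count <= 512
            ? stackalloc char[512]
            : (pool = ArrayPool<char>.Shared.Rent(count));
        try
        {
            // Both sources may be larger than needed, so trim to count.
            Span<char> buffer = span.Slice(0, count);
            for (int i = 0; i < count; i++)
                buffer[i] = char.ToUpperInvariant(text[i]);

            // Span<char>.ToString() creates a string from the characters.
            return buffer.ToString();
        }
        finally
        {
            if (pool != null)
                ArrayPool<char>.Shared.Return(pool);
        }
    }
}
```

Note that `Rent` may hand back an array longer than requested, which is why the span is sliced before use.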

Optimizing .NET Core: pitfalls

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source is ICollection<TSource> collectionoft)              // ~ 3 ns
        return collectionoft.Count;
    if (source is IIListProvider<TSource> listProv)                // ~ 3 ns
        return listProv.GetCount(onlyIfCheap: false);
    if (source is ICollection collection)                          // ~ 3 ns
        return collection.Count;
    if (source is IReadOnlyCollection<TSource> rocollectionoft)    // ~ 30 ns
        return rocollectionoft.Count;
    int count = 0;                                                 // ~ 10-… ns
    using (IEnumerator<TSource> e = source.GetEnumerator())
        while (e.MoveNext())
            count++;
    return count;
}

Casts are not cheap
object value = new List<string> { };
var t0 = (List<string>)value;
var t1 = (ICollection<string>)value;
var t2 = (IList)value;
var t3 = (IEnumerable<string>)value;

// Covariant interfaces:
public interface IEnumerable<out T>
public interface IReadOnlyCollection<out T>

// Covariance in action:
IEnumerable<object> a = new List<string> {..}

Cast to covariant interface – different runtimes
return ((IReadOnlyCollection<string>)_smallArray).Count;

      Method |     Runtime |    Mean | Scaled |
------------:|------------:|--------:|-------:|
CastAndCount |    .NET 4.7 | 78.1 ns |    6.7 |
CastAndCount | .NET Core 3 | 42.9 ns |    3.7 |
CastAndCount |      CoreRT | 11.6 ns |    1.0 |
CastAndCount |        Mono |  6.7 ns |    0.6 |

Bounds check
public static double SumSqrt(double[] array)
{
    double result = 0;
    for (int i = 0; i < array.Length; i++)
    {
        result += Math.Sqrt(array[i]);
    }
    return result;
}

Bounds check
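public static double SumSqrt(double[] array)
{
    double result = 0;
    for (int i = 0; i < array.Length; i++)
    {
        // Conceptually, every array access carries this check:
        if (i >= array.Length)
            throw new IndexOutOfRangeException();
        result += Math.Sqrt(array[i]);
    }
    return result;
}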

Bounds check eliminated!
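public static double SumSqrt(double[] array)
{
    double result = 0;
    for (int i = 0; i < array.Length; i++)
    {
        // The loop condition already proves i < array.Length,
        // so the JIT drops the check.
        result += Math.Sqrt(array[i]);
    }
    return result;
}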

Bounds check: tricks
public static void Test1(char[] array)
{
    array[0] = 'F';
    array[1] = 'a';
    array[2] = 'l';
    array[3] = 's';
    array[4] = 'e';
    array[5] = '.';
}

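// Access the highest index first
// (method name assumed; the slide shows only the body)
public static void Test2(char[] array)
{
    array[5] = '.';
    array[0] = 'F';
    array[1] = 'a';
    array[2] = 'l';
    array[3] = 's';
    array[4] = 'e';
}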

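// Check the length once up front
// (method name assumed; the slide shows only the body)
public static void Test3(char[] array)
{
    if (array.Length > 5)
    {
        array[0] = 'F';
        array[1] = 'a';
        array[2] = 'l';
        array[3] = 's';
        array[4] = 'e';
        array[5] = '.';
    }
}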

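// The same check as an unsigned comparison
// (method name assumed; the slide shows only the body)
public static void Test4(char[] array)
{
    if ((uint)array.Length > 5)
    {
        array[0] = 'F';
        array[1] = 'a';
        array[2] = 'l';
        array[3] = 's';
        array[4] = 'e';
        array[5] = '.';
    }
}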

Bounds check: tricks – CoreCLR sources:
// Boolean.cs
public bool TryFormat(Span<char> destination, out int charsWritten)
{
    if (m_value)
    {
        if ((uint)destination.Length > 3)
        {
            destination[0] = 'T';
            destination[1] = 'r';
            destination[2] = 'u';
            destination[3] = 'e';
            charsWritten = 4;
            return true;
        }
    }
    // ... (rest of the method is not shown on the slide)

Intrinsics
• Recognize patterns
• Replace methods (usually marked with [Intrinsic])
• System.Runtime.Intrinsics

Pattern recognition:
private static uint Rotl(uint value, int shift)
{
    return (value << shift) | (value >> (32 - shift));
}
Generated code:
mov eax,dword ptr [rcx+8]
mov ecx,dword ptr [rcx+0Ch]
rol eax,cl
ret

Method replacement:
[Intrinsic]
public static double Round(double a)
{
    double flrTempVal = Floor(a + 0.5);
    if ((a == (Floor(a) + 0.5)) && (FMod(flrTempVal, 2.0) != 0))
        flrTempVal -= 1.0;
    return copysign(flrTempVal, a);
}
Generated code:
cmp dword ptr [rcx+48h] …
jne M00_L00
vroundsd xmm0,xmm0,mmword ptr …
ret
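
As a small illustration of the first two bullets, the rotate pattern from the slide can be placed next to `System.Numerics.BitOperations.RotateLeft`, which is available starting with .NET Core 3.0. The `RotateDemo` class name is made up for this sketch, and whether either form compiles down to a single `rol` depends on the runtime and JIT version.

```csharp
using System;
using System.Numerics;

static class RotateDemo
{
    // Hand-written pattern, as on the slide.
    public static uint Rotl(uint value, int shift)
    {
        return (value << shift) | (value >> (32 - shift));
    }

    // Library helper that expresses the same operation directly.
    public static uint RotlBuiltIn(uint value, int shift)
    {
        return BitOperations.RotateLeft(value, shift);
    }
}
```

Both methods should agree for every shift from 0 to 31, because C# masks the shift count of a 32-bit operand to its low five bits.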

SIMD
Vector4 result =
    new Vector4(1f, 2f, 3f, 4f) +
    new Vector4(5f, 6f, 7f, 8f);

vmovups xmm0,xmmword ptr [rdx]
vmovups xmm1,xmmword ptr [rdx+16]
vaddps xmm0,xmm0,xmm1

[Diagram: scalar code performs four separate additions (X1 + X2 = X, Y1 + Y2 = Y, Z1 + Z2 = Z, W1 + W2 = W); SIMD adds (X1, Y1, Z1, W1) + (X2, Y2, Z2, W2) = (X, Y, Z, W) in one instruction.]

Meet System.Runtime.Intrinsics
var v1 = new Vector4(1, 2, 3, 4);
var v2 = new Vector4(5, 6, 7, 8);

// Plain C#:
var result = new Vector4(v1.X + v2.X, v1.Y + v2.Y, ...);

// The same with SSE intrinsics (unsafe context):
var left = Sse.LoadVector128(&v1.X); // Vector128<float>
var right = Sse.LoadVector128(&v2.X);
var sum = Sse.Add(left, right);
Sse.Store(&result.X, sum);
var mulPi = Sse.Multiply(sum, Vector128.Create(3.14f)); // Sse.SetAllVector128 in the 3.0 previews

System.Runtime.Intrinsics
• System.Runtime.Intrinsics
    Vector64<T>
    Vector128<T>
    Vector256<T>
• System.Runtime.Intrinsics.X86
    Sse (Sse, Sse2…Sse42)
    Avx, Avx2
    Fma
    …
• System.Runtime.Intrinsics.Arm.Arm64
    Simd
    …

System.Runtime.Intrinsics
public class Sse2 : Sse
{
    public static bool IsSupported => true;

    /// <summary>
    /// __m128i _mm_add_epi8 (__m128i a, __m128i b)
    /// PADDB xmm, xmm/m128
    /// </summary>
    public static Vector128<byte> Add(Vector128<byte> left, Vector128<byte> right);

    /// <summary>
    /// __m128i _mm_add_epi8 (__m128i a, __m128i b)
    /// PADDB xmm, xmm/m128
    /// </summary>
    public static Vector128<sbyte> Add(Vector128<sbyte> left, Vector128<sbyte> right);
    // ...
}

S.R.I.: Documentation
/// <summary>
/// __m128d _mm_add_pd (__m128d a, __m128d b)
/// ADDPD xmm, xmm/m128
/// </summary>
public static Vector128<double> Add(
    Vector128<double> left,
    Vector128<double> right);

S.R.I.: Usage pattern
if (Arm.Simd.IsSupported)
    DoWorkUsingNeon();
else if (x86.Avx2.IsSupported)
    DoWorkUsingAvx2();
else if (x86.Sse2.IsSupported)
    DoWorkUsingSse2();
else
    DoWorkSlowly();

JIT: each IsSupported check is a constant at JIT time, so only the branch
that matches the current CPU is left in the generated code.
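
A minimal, runnable sketch of this usage pattern, limited to an SSE2 path with a scalar fallback. The `SumDemo` class and its `Sum` method are hypothetical names, and the code needs unsafe blocks enabled in the project.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SumDemo
{
    public static unsafe int Sum(int[] values)
    {
        int i = 0;
        int total = 0;

        if (Sse2.IsSupported && values.Length >= 4)
        {
            // Accelerated path: add four ints per iteration.
            Vector128<int> acc = Vector128<int>.Zero;
            fixed (int* p = values)
            {
                for (; i <= values.Length - 4; i += 4)
                    acc = Sse2.Add(acc, Sse2.LoadVector128(p + i));
            }
            // Fold the four lanes into the scalar total.
            for (int lane = 0; lane < 4; lane++)
                total += acc.GetElement(lane);
        }

        // Fallback path, also used for the leftover tail elements.
        for (; i < values.Length; i++)
            total += values[i];

        return total;
    }
}
```

On hardware without SSE2 the first branch is never taken and the method degrades to the plain loop, which is the role `DoWorkSlowly` plays on the slide.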

IsSorted(int[]) – simple implementation
bool IsSorted(int[] array)
{
    if (array.Length < 2)
        return true;
    for (int i = 0; i < array.Length - 1; i++)
    {
        if (array[i] > array[i + 1])
            return false;
    }
    return true;
}

IsSorted(int[]) – optimized with SSE41
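unsafe bool IsSorted_Sse41(int[] array)
{
    int i = 0;
    fixed (int* ptr = array)
    {
        // Each iteration compares four adjacent pairs: (i, i+1) … (i+3, i+4).
        for (; i < array.Length - 4; i += 4)
        {
            var curr = Sse2.LoadVector128(ptr + i);
            var next = Sse2.LoadVector128(ptr + i + 1);
            var mask = Sse2.CompareGreaterThan(curr, next);
            if (!Sse41.TestZ(mask, mask)) // TestAllZeros in the 3.0 previews
                return false;
        }
    }
    // Scalar tail: pairs that did not fill a whole vector.
    for (; i < array.Length - 1; i++)
    {
        if (array[i] > array[i + 1])
            return false;
    }
    return true;
}

[Diagram: curr and next are two overlapping four-int windows of the array, shifted by one element; CompareGreaterThan produces a per-lane mask such as 0 1 0 0, which _mm_test_all_zeros checks.]

          Method |     Mean |
---------------- |---------:|
        IsSorted | 35.07 us |
 IsSorted_unsafe | 21.19 us |
  IsSorted_Sse41 | 13.79 us |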

Reverse<T>(T[] array), level: student
“1 2 3 4 5 6” => “6 5 4 3 2 1”
void Reverse<T>(T[] array)
{
    for (int i = 0; i < array.Length / 2; i++)
    {
        T tmp = array[i];
        array[i] = array[array.Length - i - 1];
        array[array.Length - i - 1] = tmp;
    }
}

Reverse<T>(T[] array), level: CoreCLR developer
// No bounds/covariance checks
void Reverse<T>(T[] array)
{
    ref T p = ref Unsafe.As<byte, T>(ref array.GetRawSzArrayData());
    int i = 0;
    int j = array.Length - 1;
    while (i < j)
    {
        T temp = Unsafe.Add(ref p, i);
        Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
        Unsafe.Add(ref p, j) = temp;
        i++;
        j--;
    }
}

Reverse<T>(T[] array), level: SSE-maniac
int* leftPtr = ptr + i;
int* rightPtr = ptr + len - vectorSize - i;
var left = Sse2.LoadVector128(leftPtr);
var right = Sse2.LoadVector128(rightPtr);
var reversedLeft = Sse2.Shuffle(left, 0x1b); // 0x1b = _MM_SHUFFLE(0,1,2,3)
var reversedRight = Sse2.Shuffle(right, 0x1b);
Sse2.Store(rightPtr, reversedLeft);
Sse2.Store(leftPtr, reversedRight);
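
The fragment above shows only the body of the loop. One way it could be embedded in a complete method for `int[]` is sketched below. The `ReverseDemo` class, the loop bounds, and the scalar handling of the middle elements are assumptions made for illustration, not the code from the talk. Unsafe blocks must be enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ReverseDemo
{
    public static unsafe void Reverse(int[] array)
    {
        int len = array.Length;
        int i = 0; // number of elements already swapped at each end

        if (Sse2.IsSupported)
        {
            const int vectorSize = 4; // ints per Vector128<int>
            fixed (int* ptr = array)
            {
                // Continue while two non-overlapping vectors still fit.
                for (; len - 2 * i >= 2 * vectorSize; i += vectorSize)
                {
                    int* leftPtr = ptr + i;
                    int* rightPtr = ptr + len - vectorSize - i;
                    var left = Sse2.LoadVector128(leftPtr);
                    var right = Sse2.LoadVector128(rightPtr);
                    // 0x1b reverses the four lanes inside each vector.
                    Sse2.Store(rightPtr, Sse2.Shuffle(left, 0x1b));
                    Sse2.Store(leftPtr, Sse2.Shuffle(right, 0x1b));
                }
            }
        }

        // Scalar swap for whatever is left in the middle.
        for (int j = len - 1 - i; i < j; i++, j--)
        {
            int tmp = array[i];
            array[i] = array[j];
            array[j] = tmp;
        }
    }
}
```

Without SSE2 the vector loop is skipped and the scalar loop reverses the whole array.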

LINQ vs SIMD
int max = arrayOfInts.Max();
bool equal = Enumerable.SequenceEqual(arrayOfFloats1, arrayOfFloats2);
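
To illustrate what a SIMD counterpart of the LINQ `Max()` call might look like, here is a sketch based on the portable `System.Numerics.Vector<T>` type. The `MaxDemo` class is a hypothetical name, and no performance figures are claimed for it.

```csharp
using System;
using System.Numerics;

static class MaxDemo
{
    public static int Max(int[] values)
    {
        if (values.Length == 0)
            throw new InvalidOperationException("Sequence contains no elements");

        int i = 0;
        int max = int.MinValue;
        int width = Vector<int>.Count; // lanes per vector on this machine

        if (Vector.IsHardwareAccelerated && values.Length >= width)
        {
            // Keep a running per-lane maximum.
            var best = new Vector<int>(values, 0);
            for (i = width; i <= values.Length - width; i += width)
                best = Vector.Max(best, new Vector<int>(values, i));

            // Reduce the lanes to a single value.
            for (int lane = 0; lane < width; lane++)
                max = Math.Max(max, best[lane]);
        }

        // Scalar tail, or the whole array when SIMD is unavailable.
        for (; i < values.Length; i++)
            max = Math.Max(max, values[i]);

        return max;
    }
}
```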

Be careful with floats and intrinsics
Fma.MultiplyAdd(x, y, z); // x*y+z
Sse3.HorizontalAdd(x, x);

a (39.33427f) * b (245.2255f) + c (150.424f) =
    fmadd:     9796.190
    fmul,fadd: 9796.189
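
The difference above comes from rounding: a fused multiply-add rounds once, while a separate multiply and add round twice. A small sketch for trying this out, using `MathF.FusedMultiplyAdd` (available from .NET Core 3.0); the `FmaDemo` class and `Compute` method are hypothetical names:

```csharp
using System;

static class FmaDemo
{
    public static (float Fused, float Separate) Compute(float a, float b, float c)
    {
        // One rounding step at the very end.
        float fused = MathF.FusedMultiplyAdd(a, b, c);

        // Two rounding steps: after the multiply and after the add.
        float product = a * b;
        float separate = product + c;

        return (fused, separate);
    }
}
```

Calling `FmaDemo.Compute(39.33427f, 245.2255f, 150.424f)` and printing both values with the `"R"` format shows whether they differ on a given machine. The exact digits depend on the runtime and hardware, so none are quoted here.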

61453.ToString("X"): "F00D"
public static int CountHexDigits(ulong value)
{
    int digits = 1;
    if (value > 0xFFFFFFFF)
    {
        digits += 8;
        value >>= 0x20;
    }
    if (value > 0xFFFF)
    {
        digits += 4;
        value >>= 0x10;
    }
    if (value > 0xFF)
    {
        digits += 2;
        value >>= 0x8;
    }
    if (value > 0xF)
        digits++;
    return digits;
}

// The same result with a single LZCNT instruction
// (Lzcnt.LeadingZeroCount in the 3.0 previews):
return (67 - (int)Lzcnt.X64.LeadingZeroCount(value | 1)) >> 2;

0xF00D = 0000 0000 … 0000 0000 1111 0000 0000 1101
Lzcnt.X64.LeadingZeroCount(0xF00D): 48, so (67 - 48) >> 2 = 4 hex digits
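
For readers who want to check the formula without depending on LZCNT hardware, the sketch below uses `System.Numerics.BitOperations.LeadingZeroCount`, which has a software fallback. The `HexDigitsDemo` class name is made up for this example.

```csharp
using System;
using System.Numerics;

static class HexDigitsDemo
{
    public static int CountHexDigits(ulong value)
    {
        // value | 1 makes zero count as one digit.
        // 64 - lzcnt is the number of significant bits; adding 3 before
        // dividing by 4 rounds up to whole hex digits, hence 67.
        return (67 - BitOperations.LeadingZeroCount(value | 1)) >> 2;
    }
}
```

For example, `0xF00D` has 48 leading zero bits as a 64-bit value, so the method returns `(67 - 48) >> 2`, which is 4.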

public static unsafe Matrix4x4 operator *(Matrix4x4 value1, Matrix4x4 value2)
{
    // OLD
    m.M11 = value1.M11 * value2.M11 + value1.M12 * value2.M21 + value1.M13 * value2.M31 + value1.M14 * value2.M41;

    // NEW
    var row = Sse.LoadVector128(&value1.M11);
    Sse.Store(&value1.M11,
        Sse.Add(Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0x00), Sse.LoadVector128(&value2.M11)),
                        Sse.Multiply(Sse.Shuffle(row, row, 0x55), Sse.LoadVector128(&value2.M21))),
                Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0xAA), Sse.LoadVector128(&value2.M31)),
                        Sse.Multiply(Sse.Shuffle(row, row, 0xFF), Sse.LoadVector128(&value2.M41)))));
    // ... (remaining rows are not shown on the slide)

Better Matrix4x4 layout:
// Current layout
public struct Matrix4x4
{
    public float M11;
    public float M12;
    public float M13;
    //... 16 float fields
}

// SIMD-friendly layout
public struct Matrix4x4
{
    public Vector128<float> Row1;
    //... one Vector128<float> per row
}

AVX problems
var v1 = Avx.LoadVector256(&m1.M11);
var v2 = Avx.LoadVector256(&m2.M11);
var v3 = Avx.Add(v1, v2);
SSE <-> AVX transition penalty

Alignment
// Prologue: iterate until data is aligned
for (…)
// Main loop: 100% optimized SIMD operations
for (…) LoadAlignedVector256(i)
// Epilogue: do regular `for` for the rest
for (…)
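
The three-phase outline above could look roughly like this for summing a `float[]` with AVX. It is a sketch under assumptions: the `AlignedSum` class is hypothetical, alignment is to 32 bytes because `LoadAlignedVector256` requires it, and unsafe blocks must be enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AlignedSum
{
    public static unsafe float Sum(float[] data)
    {
        float total = 0;
        int i = 0;

        if (Avx.IsSupported)
        {
            fixed (float* p = data)
            {
                // Prologue: scalar steps until p + i is 32-byte aligned.
                while (i < data.Length && ((ulong)(p + i) & 31) != 0)
                    total += data[i++];

                // Main loop: aligned 256-bit loads, eight floats at a time.
                Vector256<float> acc = Vector256<float>.Zero;
                for (; i <= data.Length - 8; i += 8)
                    acc = Avx.Add(acc, Avx.LoadAlignedVector256(p + i));

                // Fold the eight lanes into the scalar total.
                for (int lane = 0; lane < 8; lane++)
                    total += acc.GetElement(lane);
            }
        }

        // Epilogue: regular loop for the rest (or everything without AVX).
        for (; i < data.Length; i++)
            total += data[i];

        return total;
    }
}
```

Because the additions happen in a different order than in a plain loop, results for arbitrary floats may differ in the last digits from a sequential sum.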

Objects on stack (escape analysis)
public string DoSomething()
{
    var builder = new StringBuilder();
    builder.Append(…);
    builder.Append(…);
    return builder.ToString();
    // builder never escapes the method
}
For Java folks: we have user-defined value-types ;-)

Objects on stack – merged!

Tiered JIT Compilation – enabled by default
• COMPlus_TieredCompilation=1
• COMPlus_TieredCompilation_Tier1CallCountThreshold=30
• Cold methods with hot loops problem
• [MethodImpl(MethodImplOptions.AggressiveOptimization)]
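
As an illustration of the last two bullets: a method that is called rarely but spends its time in a loop can be marked so that it skips the quick first tier. The `HotLoopDemo` class and `SumTo` method are made-up names, and whether the attribute helps in a given case needs to be measured.

```csharp
using System;
using System.Runtime.CompilerServices;

static class HotLoopDemo
{
    // Called once, but the loop inside is hot: ask the JIT for fully
    // optimized code up front instead of waiting for tier-1 promotion.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static long SumTo(int n)
    {
        long total = 0;
        for (int i = 1; i <= n; i++)
            total += i;
        return total;
    }
}
```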

Loop unrolling (auto-vectorization)
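for (uint i = 0; i < 256; ++i)
{
    total += array[i];
}

// Unrolled by 4:
for (uint i = 0; i < 64; ++i)
{
    total += array[i * 4 + 0];
    total += array[i * 4 + 1];
    total += array[i * 4 + 2];
    total += array[i * 4 + 3];
}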

And don’t forget - C# has other backends!
• .NET 4.x CLR
• CoreRT
• Mono
    • JIT
    • AOT
    • LLVM (AOT/JIT)
    • Interpreter
• IL2CPP
• Burst

Micro-optimizations are for
• BCL and Runtime
    • Because you expect it to be fast
• Game Dev – 16ms per frame
    • Don’t be CPU-bound
• High-load related libs and apps
• Image/Video processing, DL/ML frameworks
• Silly benchmarks (Go vs C#, Java vs C#)

Egor Bogatov
EgorBo
Thanks!