Intrinsics and other micro-optimizations
Egor Bogatov
Engineer at Microsoft
Agenda
Useful micro-optimizations
Pitfalls for external contributors
Intrinsics & SIMD with examples
.NET Core 3.0-x features
Prefer Spans API where possible
// Substring version. Allocated on heap: 168 bytes
var str = "EGOR 3.14 1234 7/3/2018";
string name = str.Substring(0, 4);
float pi = float.Parse(str.Substring(5, 4));
int number = int.Parse(str.Substring(10, 4));
DateTime date = DateTime.Parse(str.Substring(15, 8));
// Span version. Allocated on heap: 0 bytes
var str = "EGOR 3.14 1234 7/3/2018".AsSpan();
var name = str.Slice(0, 4);
float pi = float.Parse(str.Slice(5, 4));
int number = int.Parse(str.Slice(10, 4));
DateTime date = DateTime.Parse(str.Slice(15, 8));
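The span-based parsing above can be wrapped into a small helper. This is only a sketch: the class name `SpanParsingSketch`, the method `Parse`, and the choice of `CultureInfo.InvariantCulture` are assumptions made for the example, not something the slides prescribe.

```csharp
using System;
using System.Globalization;

public static class SpanParsingSketch
{
    // Parses the fixed-width record from the slide ("EGOR 3.14 1234 7/3/2018")
    // by slicing a ReadOnlySpan<char> instead of calling Substring.
    public static (string Name, float Pi, int Number, DateTime Date) Parse(string record)
    {
        ReadOnlySpan<char> span = record.AsSpan();

        // ToString() is the one deliberate allocation here; keep the slice as a
        // span instead if the caller does not need a string.
        string name = span.Slice(0, 4).ToString();

        // The span-based Parse overloads read the characters in place.
        float pi = float.Parse(span.Slice(5, 4), NumberStyles.Float, CultureInfo.InvariantCulture);
        int number = int.Parse(span.Slice(10, 4), NumberStyles.Integer, CultureInfo.InvariantCulture);
        DateTime date = DateTime.Parse(span.Slice(15, 8), CultureInfo.InvariantCulture);

        return (name, pi, number, date);
    }
}
```

Passing an explicit culture keeps the result independent of the machine's regional settings, which matters for both the decimal point and the month/day order.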
Allocating a temp array
char[] buffer =
new char[count];
Allocating a temp array
Span<char> span =
new char[count];
Allocating a temp array
Span<char> span =
count <= 512 ?
stackalloc char[512] :
new char[count];
Allocating a temp array
Span<char> span =
count <= 512 ?
stackalloc char[512] :
ArrayPool<char>.Shared.Rent(count);
Allocating a temp array - final pattern
char[] pool = null;
Span<char> span =
count <= 512 ?
stackalloc char[512] :
(pool = ArrayPool<char>.Shared.Rent(count));
// ... work with span.Slice(0, count): both buffers may be larger than count ...
if (pool != null)
ArrayPool<char>.Shared.Return(pool);
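A complete method built around this pattern might look like the sketch below. The class `TempBufferSketch`, the method `Reverse`, and the `try`/`finally` wrapper are assumptions added for illustration; the 512-element threshold comes from the slides.

```csharp
using System;
using System.Buffers;

public static class TempBufferSketch
{
    private const int StackLimit = 512; // threshold used on the slides

    // Returns the input reversed, using a temporary char buffer that lives on
    // the stack for small inputs and comes from the shared pool otherwise.
    public static string Reverse(string input)
    {
        int count = input.Length;
        char[] pool = null;
        Span<char> buffer = count <= StackLimit
            ? stackalloc char[StackLimit]
            : (pool = ArrayPool<char>.Shared.Rent(count));
        try
        {
            // Rent may hand back a larger array and the stack buffer has a
            // fixed size, so only the first 'count' elements are used.
            Span<char> span = buffer.Slice(0, count);
            input.AsSpan().CopyTo(span);
            span.Reverse();
            return span.ToString();
        }
        finally
        {
            // Return the rented array even if the work above throws.
            if (pool != null)
                ArrayPool<char>.Shared.Return(pool);
        }
    }
}
```

The `finally` block is a design choice rather than a requirement: a rented array that is never returned is simply collected by the GC, but the pool then loses the benefit of reuse.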
Allocating a temp array – without ArrayPool
Span<char> span = count <= 512 ?
stackalloc char[512] :
new char[count];
Optimizing .NET Core: pitfalls
public static int Count<TSource>(this IEnumerable<TSource> source)
{
if (source is ICollection<TSource> collectionoft)
return collectionoft.Count;
if (source is IIListProvider<TSource> listProv)
return listProv.GetCount(onlyIfCheap: false);
if (source is ICollection collection)
return collection.Count;
if (source is IReadOnlyCollection<TSource> rocollectionoft)
return rocollectionoft.Count;
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
while (e.MoveNext())
count++;
return count;
}
Approximate cost of the five paths above, top to bottom: ~3 ns, ~3 ns, ~3 ns, ~30 ns, ~10-… ns
Casts are not cheap
object value = new List<string> { };
var t0 = (List<string>)value;
var t1 = (ICollection<string>)value;
var t2 = (IList)value;
var t3 = (IEnumerable<string>)value;
// Covariant interfaces:
public interface IEnumerable<out T>
public interface IReadOnlyCollection<out T>
IEnumerable<object> a = new List<string> {..}
Cast to covariant interface – different runtimes
// Body of the CastAndCount benchmark:
return ((IReadOnlyCollection<string>)_smallArray).Count;
Method | Runtime | Mean | Scaled |
-------------------:|----------------:|------------:|---------:|
CastAndCount | .NET 4.7 | 78.1 ns | 6.7 |
CastAndCount | .NET Core 3 | 42.9 ns | 3.7 |
CastAndCount | CoreRT | 11.6 ns | 1.0 |
CastAndCount | Mono | 6.7 ns | 0.6 |
.NET Core: bounds check
Bounds check
public static double SumSqrt(double[] array)
{
double result = 0;
for (int i = 0; i < array.Length; i++)
{
result += Math.Sqrt(array[i]);
}
return result;
}
Bounds check
public static double SumSqrt(double[] array)
{
double result = 0;
for (int i = 0; i < array.Length; i++)
{
if (i >= array.Length) // implicit check the JIT inserts for array[i]
throw new IndexOutOfRangeException();
result += Math.Sqrt(array[i]);
}
return result;
}
Bounds check eliminated!
public static double SumSqrt(double[] array)
{
double result = 0;
for (int i = 0; i < array.Length; i++)
{
result += Math.Sqrt(array[i]);
}
return result;
}
Bounds check: tricks
public static void Test1(char[] array)
{
array[0] = 'F';
array[1] = 'a';
array[2] = 'l';
array[3] = 's';
array[4] = 'e';
array[5] = '.';
}
Bounds check: tricks
public static void Test1(char[] array)
{
array[5] = '.';
array[0] = 'F';
array[1] = 'a';
array[2] = 'l';
array[3] = 's';
array[4] = 'e';
}
Bounds check: tricks
public static void Test1(char[] array)
{
if (array.Length > 5)
{
array[0] = 'F';
array[1] = 'a';
array[2] = 'l';
array[3] = 's';
array[4] = 'e';
array[5] = '.';
}
}
Bounds check: tricks
public static void Test1(char[] array)
{
if ((uint)array.Length > 5)
{
array[0] = 'F';
array[1] = 'a';
array[2] = 'l';
array[3] = 's';
array[4] = 'e';
array[5] = '.';
}
}
Bounds check: tricks – CoreCLR sources:
// Boolean.cs
public bool TryFormat(Span<char> destination, out int charsWritten)
{
if (m_value)
{
if ((uint)destination.Length > 3)
{
destination[0] = 'T';
destination[1] = 'r';
destination[2] = 'u';
destination[3] = 'e';
charsWritten = 4;
return true;
}
}
// ... (remainder of the method is not shown on the slide)
}
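For reference, a self-contained variant of the same trick that also covers the `false` case could look like this. It is a sketch, not the CoreCLR source: `BoolFormatSketch` is a hypothetical name, and taking the value as a parameter (instead of reading `m_value`) is an assumption made so that the example stands alone.

```csharp
using System;

public static class BoolFormatSketch
{
    // Writes "True" or "False" into destination. A single unsigned length
    // comparison up front is meant to let the JIT drop the per-index checks.
    public static bool TryFormat(bool value, Span<char> destination, out int charsWritten)
    {
        if (value)
        {
            if ((uint)destination.Length > 3)
            {
                destination[0] = 'T';
                destination[1] = 'r';
                destination[2] = 'u';
                destination[3] = 'e';
                charsWritten = 4;
                return true;
            }
        }
        else
        {
            if ((uint)destination.Length > 4)
            {
                destination[0] = 'F';
                destination[1] = 'a';
                destination[2] = 'l';
                destination[3] = 's';
                destination[4] = 'e';
                charsWritten = 5;
                return true;
            }
        }

        // Destination too small: report failure without writing anything.
        charsWritten = 0;
        return false;
    }
}
```

Whether the bounds checks are actually removed depends on the runtime version, so inspecting the generated assembly is the only reliable confirmation.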
.NET Core: Intrinsics & SIMD
Intrinsics
• Recognize patterns
• Replace methods (usually marked with [Intrinsic])
• System.Runtime.Intrinsics
// Pattern recognition: the rotate idiom is compiled to a single rol
private static uint Rotl(uint value, int shift)
{
return (value << shift) | (value >> (32 - shift));
}
mov eax,dword ptr [rcx+8]
mov ecx,dword ptr [rcx+0Ch]
rol eax,cl
ret
// Method replacement: the managed body of Round is swapped for vroundsd
[Intrinsic]
public static double Round(double a)
{
double flrTempVal = Floor(a + 0.5);
if ((a == (Floor(a) + 0.5)) && (FMod(flrTempVal, 2.0) != 0))
flrTempVal -= 1.0;
return copysign(flrTempVal, a);
}
cmp dword ptr [rcx+48h] …
jne M00_L00
vroundsd xmm0,xmm0,mmword ptr …
ret
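The rotate idiom from this slide can be compared against the library helper `System.Numerics.BitOperations.RotateLeft`, which ships with .NET Core 3.0 and later. The class `RotateSketch` and its method names are hypothetical, and the slides do not mention `BitOperations`; it is brought in here only as a point of comparison.

```csharp
using System.Numerics;

public static class RotateSketch
{
    // Hand-written idiom from the slide; the JIT is expected to recognize it.
    public static uint RotlManual(uint value, int shift)
    {
        return (value << shift) | (value >> (32 - shift));
    }

    // Library helper with the same result, so the intent is explicit in source.
    public static uint RotlBuiltIn(uint value, int shift)
    {
        return BitOperations.RotateLeft(value, shift);
    }
}
```

Both forms agree for every shift because C# masks the shift count of a 32-bit operand to its low five bits.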
SIMD
Vector4 result =
new Vector4(1f, 2f, 3f, 4f) +
new Vector4(5f, 6f, 7f, 8f);
vmovups xmm0,xmmword ptr [rdx]
vmovups xmm1,xmmword ptr [rdx+16]
vaddps xmm0,xmm0,xmm1
(Diagram: scalar code performs four separate additions, X1 + X2 = X, Y1 + Y2 = Y, Z1 + Z2 = Z, W1 + W2 = W; SIMD adds all four lanes X, Y, Z, W with a single instruction.)
Meet System.Runtime.Intrinsics
var v1 = new Vector4(1, 2, 3, 4);
var v2 = new Vector4(5, 6, 7, 8);
// Scalar equivalent: var result = new Vector4(v1.X + v2.X, v1.Y + v2.Y, ...);
Vector4 result = default;
var left = Sse.LoadVector128(&v1.X); // Vector128<float>
var right = Sse.LoadVector128(&v2.X);
var sum = Sse.Add(left, right);
Sse.Store(&result.X, sum);
var mulPi = Sse.Multiply(sum, Vector128.Create(3.14f)); // Sse.SetAllVector128 in early previews
System.Runtime.Intrinsics
• System.Runtime.Intrinsics
Vector64<T>
Vector128<T>
Vector256<T>
• System.Runtime.Intrinsics.X86
Sse (Sse, Sse2…Sse42)
Avx, Avx2
Fma
…
• System.Runtime.Intrinsics.Arm.Arm64
Simd
…
System.Runtime.Intrinsics
public class Sse2 : Sse
{
public static bool IsSupported => true;
/// <summary>
/// __m128i _mm_add_epi8 (__m128i a, __m128i b)
/// PADDB xmm, xmm/m128
/// </summary>
public static Vector128<byte> Add(Vector128<byte> left, Vector128<byte> right);
/// <summary>
/// __m128i _mm_add_epi8 (__m128i a, __m128i b)
/// PADDB xmm, xmm/m128
/// </summary>
public static Vector128<sbyte> Add(Vector128<sbyte> left, Vector128<sbyte> right);
// ... (remaining overloads are not shown on the slide)
}
S.R.I.: Documentation
/// <summary>
/// __m128d _mm_add_pd (__m128d a, __m128d b)
/// ADDPD xmm, xmm/m128
/// </summary>
public static Vector128<double> Add(
Vector128<double> left,
Vector128<double> right);
S.R.I.: Usage pattern
if (Arm.Simd.IsSupported)
DoWorkUsingNeon();
else if (Avx2.IsSupported)
DoWorkUsingAvx2();
else if (Sse2.IsSupported)
DoWorkUsingSse2();
else
DoWorkSlowly();
JIT: IsSupported is a JIT-time constant, so the branches for unsupported instruction sets are removed from the generated code.
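A concrete instance of this dispatch, reduced to one accelerated path plus the scalar fallback, might look like the sketch below. `SumSketch` and its methods are hypothetical names. Reading vectors through `MemoryMarshal.Read` instead of pointers is a choice made here to keep the example free of `unsafe` code; the slides themselves use pointer loads.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SumSketch
{
    // Picks the SSE2 path when the CPU supports it, otherwise the plain loop.
    public static int Sum(ReadOnlySpan<int> values)
    {
        return Sse2.IsSupported ? SumSse2(values) : SumScalar(values);
    }

    private static int SumScalar(ReadOnlySpan<int> values)
    {
        int total = 0;
        foreach (int v in values)
            total += v;
        return total;
    }

    private static int SumSse2(ReadOnlySpan<int> values)
    {
        Vector128<int> acc = Vector128<int>.Zero;
        int i = 0;

        // Main loop: four ints per iteration, read as one unaligned 16-byte block.
        for (; i <= values.Length - 4; i += 4)
        {
            Vector128<int> block =
                MemoryMarshal.Read<Vector128<int>>(MemoryMarshal.AsBytes(values.Slice(i, 4)));
            acc = Sse2.Add(acc, block);
        }

        // Fold the four lanes, then add the elements the vector loop skipped.
        int total = 0;
        for (int lane = 0; lane < 4; lane++)
            total += acc.GetElement(lane);
        for (; i < values.Length; i++)
            total += values[i];
        return total;
    }
}
```

Integer addition wraps the same way in both paths, so the two implementations return identical results even on overflow.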
IsSorted(int[]) – simple implementation
bool IsSorted(int[] array)
{
if (array.Length < 2)
return true;
for (int i = 0; i < array.Length - 1; i++)
{
if (array[i] > array[i + 1])
return false;
}
return true;
}
IsSorted(int[]) – optimized with SSE41
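unsafe bool IsSorted_Sse41(int[] array)
{
if (array.Length < 2)
return true;
int i = 0;
fixed (int* ptr = array)
{
// Each iteration compares four adjacent pairs: (i, i+1) ... (i+3, i+4)
for (; i < array.Length - 4; i += 4)
{
var curr = Sse2.LoadVector128(ptr + i);
var next = Sse2.LoadVector128(ptr + i + 1);
var mask = Sse2.CompareGreaterThan(curr, next);
if (!Sse41.TestZ(mask, mask)) // named TestAllZeros in early previews
return false;
}
}
// Scalar tail: pairs the vector loop did not cover
for (; i < array.Length - 1; i++)
{
if (array[i] > array[i + 1])
return false;
}
return true;
}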
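(Diagram: two overlapping four-element windows over i0 i1 i2 i3 i4 i5 are compared lane by lane; the resulting mask, e.g. 0 1 0 0, is tested with _mm_test_all_zeros.)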
Method | Mean |
---------------- |---------:|
IsSorted | 35.07 us |
IsSorted_unsafe | 21.19 us |
IsSorted_Sse41 | 13.79 us |
Reverse<T>(T[] array), level: student
void Reverse<T>(T[] array)
{
for (int i = 0; i < array.Length / 2; i++)
{
T tmp = array[i];
array[i] = array[array.Length - i - 1];
array[array.Length - i - 1] = tmp;
}
}
“1 2 3 4 5 6” => “6 5 4 3 2 1”
Reverse<T>(T[] array), level: CoreCLR developer
void Reverse<T>(T[] array)
{
ref T p = ref Unsafe.As<byte, T>(ref array.GetRawSzArrayData());
int i = 0;
int j = array.Length - 1;
while (i < j)
{
T temp = Unsafe.Add(ref p, i);
Unsafe.Add(ref p, i) = Unsafe.Add(ref p, j);
Unsafe.Add(ref p, j) = temp;
i++;
j--;
}
}
No bounds/covariance checks
Reverse<T>(T[] array), level: SSE-maniac
int* leftPtr = ptr + i;
int* rightPtr = ptr + len - vectorSize - i;
var left = Sse2.LoadVector128(leftPtr);
var right = Sse2.LoadVector128(rightPtr);
var reversedLeft = Sse2.Shuffle(left, 0x1b); //0x1b =_MM_SHUFFLE(0,1,2,3)
var reversedRight = Sse2.Shuffle(right, 0x1b);
Sse2.Store(rightPtr, reversedLeft);
Sse2.Store(leftPtr, reversedRight);
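The fragment above omits the surrounding loop and the handling of the middle elements. One way to fill that in for `int[]` is sketched below. The loop bounds, the scalar fallback, and the name `ReverseSketch` are assumptions, while `ptr`, `len`, `vectorSize` and the `0x1b` shuffle follow the slide. It needs `AllowUnsafeBlocks` enabled in the project.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ReverseSketch
{
    public static unsafe void Reverse(int[] array)
    {
        const int vectorSize = 4; // ints per Vector128<int>
        int len = array.Length;
        int i = 0;

        if (Sse2.IsSupported)
        {
            fixed (int* ptr = array)
            {
                // Swap mirrored 4-int blocks while the left block [i, i+4)
                // and the right block [len-4-i, len-i) do not overlap.
                for (; 2 * i + 2 * vectorSize <= len; i += vectorSize)
                {
                    int* leftPtr = ptr + i;
                    int* rightPtr = ptr + len - vectorSize - i;
                    Vector128<int> left = Sse2.LoadVector128(leftPtr);
                    Vector128<int> right = Sse2.LoadVector128(rightPtr);
                    // 0x1b = _MM_SHUFFLE(0,1,2,3): reverses the four lanes
                    Sse2.Store(rightPtr, Sse2.Shuffle(left, 0x1b));
                    Sse2.Store(leftPtr, Sse2.Shuffle(right, 0x1b));
                }
            }
        }

        // Scalar pass over the remaining middle part (or the whole array
        // when SSE2 is not available).
        int lo = i;
        int hi = len - 1 - i;
        while (lo < hi)
        {
            int tmp = array[lo];
            array[lo] = array[hi];
            array[hi] = tmp;
            lo++;
            hi--;
        }
    }
}
```

Both vectors are loaded before either store, so the swap stays correct even when the two blocks are adjacent.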
LINQ vs SIMD
int max = arrayOfInts.Max();
bool equal = Enumerable.SequenceEqual(arrayOfFloats1, arrayOfFloats2);
Be careful with floats and intrinsics
Fma.MultiplyAdd(x, y, z); // x*y+z
Sse3.HorizontalAdd(x, x);
a (39.33427f) * b (245.2255f) + c (150.424f) =
fmadd: 9796.190
fmul,fadd: 9796.189
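The difference comes from rounding: a fused multiply-add rounds once, while a separate multiply and add round twice. A scalar way to observe this is sketched below using `MathF.FusedMultiplyAdd` (available from .NET Core 3.0) rather than the `Fma` class from the slide. `FmaSketch` and `Compare` are hypothetical names.

```csharp
using System;

public static class FmaSketch
{
    // Computes a * b + c twice: fused (one rounding step) and as a separate
    // multiply followed by an add (two rounding steps).
    public static (float Fused, float Separate) Compare(float a, float b, float c)
    {
        float fused = MathF.FusedMultiplyAdd(a, b, c);

        // The explicit cast forces the product to single precision before the add.
        float product = (float)(a * b);
        float separate = product + c;

        return (fused, separate);
    }
}
```

Calling `FmaSketch.Compare(39.33427f, 245.2255f, 150.424f)` with the slide's inputs lets the two results be printed side by side. They may differ in the last digit, so code that switches between the two forms should not compare the results for exact equality.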
61453.ToString("X"): "F00D"
public static int CountHexDigits(ulong value)
{
int digits = 1;
if (value > 0xFFFFFFFF)
{
digits += 8;
value >>= 0x20;
}
if (value > 0xFFFF)
{
digits += 4;
value >>= 0x10;
}
if (value > 0xFF)
{
digits += 2;
value >>= 0x8;
}
if (value > 0xF)
digits++;
return digits;
}
// The same result with a single LZCNT instruction:
return (67-(int)Lzcnt.LeadingZeroCount(value | 1)) >> 2;
0xF00D = 0000 0000 … 0000 0000 1111 0000 0000 1101
Lzcnt.LeadingZeroCount(0xF00D): 48
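The arithmetic behind the one-liner: `value | 1` makes zero count as one digit, `64 - lzcnt` is the number of significant bits, and `(bits + 3) >> 2` rounds up to whole hex digits, which simplifies to `(67 - lzcnt) >> 2`. The sketch below places both versions side by side so they can be checked against each other. It uses the portable `System.Numerics.BitOperations.LeadingZeroCount` instead of the `Lzcnt` class, which is a substitution made for this example; the class and method names are hypothetical.

```csharp
using System.Numerics;

public static class HexDigitsSketch
{
    // Branch-based version from the slide.
    public static int CountHexDigitsBranchy(ulong value)
    {
        int digits = 1;
        if (value > 0xFFFFFFFF) { digits += 8; value >>= 32; }
        if (value > 0xFFFF) { digits += 4; value >>= 16; }
        if (value > 0xFF) { digits += 2; value >>= 8; }
        if (value > 0xF) digits++;
        return digits;
    }

    // Leading-zero-count version: (64 - lzcnt + 3) / 4 hex digits.
    public static int CountHexDigitsLzcnt(ulong value)
    {
        return (67 - BitOperations.LeadingZeroCount(value | 1)) >> 2;
    }
}
```

For `0xF00D` the leading-zero count is 48, so the result is `(67 - 48) >> 2 = 4` digits.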
public static unsafe Matrix4x4 operator *(Matrix4x4 value1, Matrix4x4 value2)
{
// OLD
m.M11 = value1.M11 * value2.M11 + value1.M12 * value2.M21 + value1.M13 * value2.M31 + value1.M14 * value2.M41;
m.M12 = value1.M11 * value2.M12 + value1.M12 * value2.M22 + value1.M13 * value2.M32 + value1.M14 * value2.M42;
m.M13 = value1.M11 * value2.M13 + value1.M12 * value2.M23 + value1.M13 * value2.M33 + value1.M14 * value2.M43;
m.M14 = value1.M11 * value2.M14 + value1.M12 * value2.M24 + value1.M13 * value2.M34 + value1.M14 * value2.M44;
// NEW
var row = Sse.LoadVector128(&value1.M11);
Sse.Store(&value1.M11,
Sse.Add(Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0x00), Sse.LoadVector128(&value2.M11)),
Sse.Multiply(Sse.Shuffle(row, row, 0x55), Sse.LoadVector128(&value2.M21))),
Sse.Add(Sse.Multiply(Sse.Shuffle(row, row, 0xAA), Sse.LoadVector128(&value2.M31)),
Sse.Multiply(Sse.Shuffle(row, row, 0xFF), Sse.LoadVector128(&value2.M41)))));
Better Matrix4x4 layout:
public struct Matrix4x4
{
public float M11;
public float M12;
public float M13;
//... 16 float fields
}
public struct Matrix4x4
{
public Vector128<float> Row1;
public Vector128<float> Row2;
public Vector128<float> Row3;
public Vector128<float> Row4;
}
AVX problems
var v1 = Avx.LoadVector256(&m1.M11);
var v2 = Avx.LoadVector256(&m2.M11);
var v3 = Avx.Add(v1, v2);
SSE <-> AVX
Alignment
// Prologue: iterate until data is aligned
for (…)
// Main loop: 100% optimized SIMD operations
for (…) LoadAlignedVector256(i)
// Epilogue: do regular `for` for the rest
for (…)
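The three-phase structure can be made concrete with a float sum. This is a sketch under assumptions: `AlignedSumSketch` is a hypothetical name, the 32-byte alignment test on the pinned pointer is one possible way to write the prologue, and the project needs `AllowUnsafeBlocks`.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AlignedSumSketch
{
    public static unsafe float Sum(float[] data)
    {
        float total = 0f;
        int n = data.Length;
        int i = 0;

        fixed (float* ptr = data)
        {
            if (Avx.IsSupported)
            {
                // Prologue: scalar steps until ptr + i sits on a 32-byte boundary.
                while (i < n && ((ulong)(ptr + i) & 31) != 0)
                    total += ptr[i++];

                // Main loop: aligned 256-bit loads, eight floats per iteration.
                Vector256<float> acc = Vector256<float>.Zero;
                for (; i <= n - 8; i += 8)
                    acc = Avx.Add(acc, Avx.LoadAlignedVector256(ptr + i));

                // Fold the eight lanes into the scalar total.
                for (int lane = 0; lane < 8; lane++)
                    total += acc.GetElement(lane);
            }

            // Epilogue: whatever is left (everything when AVX is unavailable).
            for (; i < n; i++)
                total += ptr[i];
        }

        return total;
    }
}
```

The array stays pinned for the whole method, so the alignment established in the prologue remains valid during the main loop. Because floating-point addition is not associative, the vectorized sum can differ slightly from a purely sequential one for general inputs.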
.NET Core: future
Objects on stack (escape analysis)
public string DoSomething()
{
var builder = new StringBuilder();
builder.Append(…);
builder.Append(…);
return builder.ToString();
// builder never escapes the method
}
For Java folks: we have user-defined value-types ;-)
Objects on stack – merged!
Tiered JIT Compilation – enabled by default
• COMPlus_TieredCompilation=1
• COMPlus_TieredCompilation_Tier1CallCountThreshold=30
• Cold methods with hot loops problem
• [MethodImpl(MethodImplOptions.AggressiveOptimization)]
Loop unrolling (auto-vectorization)
for (uint i = 0; i < 256; ++i)
{
total += array[i];
}
// Unrolled by 4: 64 iterations instead of 256
for (uint i = 0; i < 256; i += 4)
{
total += array[i + 0];
total += array[i + 1];
total += array[i + 2];
total += array[i + 3];
}
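Until the JIT does this on its own, the unrolling can be written by hand. The sketch below generalizes the idea to arrays of any length by adding a remainder loop; `UnrollSketch` and `SumUnrolled` are hypothetical names.

```csharp
public static class UnrollSketch
{
    // Sums the array four elements per iteration, then handles the leftovers.
    public static int SumUnrolled(int[] array)
    {
        int total = 0;
        int i = 0;

        // Unrolled body: one loop condition check per four additions.
        for (; i <= array.Length - 4; i += 4)
        {
            total += array[i + 0];
            total += array[i + 1];
            total += array[i + 2];
            total += array[i + 3];
        }

        // Remainder: at most three elements.
        for (; i < array.Length; i++)
            total += array[i];

        return total;
    }
}
```

Whether this beats the simple loop depends on the runtime and on bounds-check elimination, so it is worth benchmarking before adopting it.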
And don’t forget - C# has other backends!
• .NET 4.x CLR
• CoreRT
• Mono
• JIT
• AOT
• LLVM (AOT/JIT)
• Interpreter
• IL2CPP
• Burst
Micro-optimizations are for
• BCL and Runtime
• Because you expect it to be fast
• Game Dev – 16ms per frame
• Don’t be CPU-bound :-)
• High-load related libs and apps
• Image/Video processing, DL/ML frameworks
• Silly benchmarks (Go vs C#, Java vs C#)
Egor Bogatov
EgorBo
Thanks!