Let’s talk about microbenchmarking
Andrey Akinshin, JetBrains
DotNext 2016 Helsinki
1/29
BenchmarkDotNet
2/29
Today’s environment
• BenchmarkDotNet v0.10.1
• Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10
• .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0
• Source code: https://git.io/v1RRX
3/29
Today’s environment
• BenchmarkDotNet v0.10.1
• Haswell Core i7-4702MQ CPU 2.20GHz, Windows 10
• .NET Framework 4.6.2 + clrjit/compatjit-v4.6.1586.0
• Source code: https://git.io/v1RRX
Other environments:
C# compiler old csc / Roslyn
CLR CLR2 / CLR4 / CoreCLR / Mono
OS Windows / Linux / MacOS / FreeBSD
JIT LegacyJIT-x86 / LegacyJIT-x64 / RyuJIT-x64
GC MS (different CLRs) / Mono (Boehm/Sgen)
Toolchain JIT / NGen / .NET Native
Hardware ∞ different configurations
. . . . . .
And don’t forget about multiple versions
3/29
Count of iterations
A bad benchmark
// Resolution (Stopwatch) = 466 ns
// Latency (Stopwatch) = 18 ns
var sw = Stopwatch.StartNew();
Foo(); // 100 ns
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
4/29
Count of iterations
A bad benchmark
// Resolution (Stopwatch) = 466 ns
// Latency (Stopwatch) = 18 ns
var sw = Stopwatch.StartNew();
Foo(); // 100 ns
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
A better benchmark
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++) // (N * 100 + eps) ns
Foo(); // 100 ns
sw.Stop();
var total = sw.ElapsedTicks / (double)Stopwatch.Frequency; // total seconds
WriteLine(total / N); // seconds per call
4/29
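This loop-and-divide pattern is essentially what BenchmarkDotNet automates (warmup, many iterations, overhead subtraction, statistics). A minimal sketch of the same idea as a BenchmarkDotNet benchmark; the Foo body below is a stand-in workload, not code from the slides:

using System;
using BenchmarkDotNet.Attributes;

public class FooBenchmarks
{
    // Stand-in workload; replace with the real Foo() under test.
    [Benchmark]
    public double Foo()
    {
        double sum = 0;
        for (int i = 0; i < 64; i++)
            sum += Math.Sqrt(i);
        return sum; // returning the result keeps the JIT from eliminating the work
    }
}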
Several launches
Run 01 : 529.8674 ns/op
Run 02 : 532.7541 ns/op
Run 03 : 558.7448 ns/op
Run 04 : 555.6647 ns/op
Run 05 : 539.6401 ns/op
Run 06 : 539.3494 ns/op
Run 07 : 564.3222 ns/op
Run 08 : 551.9544 ns/op
Run 09 : 550.1608 ns/op
Run 10 : 533.0634 ns/op
5/29
Several launches
6/29
A simple case
Central limit theorem to the rescue!
7/29
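A small aggregation sketch (not from the slides): given enough independent launches, the sample mean is approximately normally distributed, so a confidence interval is easy to report for the runs listed above:

using System;
using System.Linq;

class RunStats
{
    static void Main()
    {
        double[] runs = { 529.8674, 532.7541, 558.7448, 555.6647, 539.6401,
                          539.3494, 564.3222, 551.9544, 550.1608, 533.0634 };
        double mean = runs.Average();
        double variance = runs.Sum(x => (x - mean) * (x - mean)) / (runs.Length - 1);
        double stdError = Math.Sqrt(variance) / Math.Sqrt(runs.Length);
        // ~95% interval via the normal approximation; with only 10 runs a
        // Student-t quantile would give a slightly wider interval.
        Console.WriteLine($"{mean:F1} ns/op ± {1.96 * stdError:F1} ns/op");
    }
}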
Complicated cases
8/29
Latencies
Event Latency
1 CPU cycle 0.3 ns
Level 1 cache access 0.9 ns
Level 2 cache access 2.8 ns
Level 3 cache access 12.9 ns
Main memory access 120 ns
Solid-state disk I/O 50-150 µs
Rotational disk I/O 1-10 ms
Internet: SF to NYC 40 ms
Internet: SF to UK 81 ms
Internet: SF to Australia 183 ms
OS virtualization reboot 4 sec
Hardware virtualization reboot 40 sec
Physical system reboot 5 min
© Systems Performance: Enterprise and the Cloud
9/29
Sum of elements
const int N = 1024;
int[,] a = new int[N, N];
[Benchmark]
public double SumIJ()
{
var sum = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += a[i, j];
return sum;
}
[Benchmark]
public double SumJI()
{
var sum = 0;
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += a[i, j];
return sum;
}
10/29
Sum of elements
const int N = 1024;
int[,] a = new int[N, N];
[Benchmark]
public double SumIJ()
{
var sum = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += a[i, j];
return sum;
}
[Benchmark]
public double SumJI()
{
var sum = 0;
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += a[i, j];
return sum;
}
CPU cache effect:
SumIJ SumJI
LegacyJIT-x86 ≈1.3ms ≈4.0ms
10/29
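The gap is pure memory layout: an int[N, N] is stored row-major, so a[i, j] sits at offset (i * N + j) * sizeof(int). SumIJ walks memory 4 bytes at a time, while SumJI jumps 4 KB per access. A deliberately naïve console sketch (not a proper benchmark, and not from the slides) that reproduces the effect with a flat array and an explicit stride:

using System;
using System.Diagnostics;

class StrideDemo
{
    const int N = 1024;

    // stride = 1 mimics SumIJ (sequential access), stride = N mimics SumJI (4 KB jumps)
    static long Sum(int[] flat, int stride)
    {
        long sum = 0;
        for (int block = 0; block < stride; block++)
            for (int k = block; k < N * N; k += stride)
                sum += flat[k];
        return sum;
    }

    static void Main()
    {
        var flat = new int[N * N]; // same layout as int[N, N]: a[i, j] == flat[i * N + j]
        foreach (int stride in new[] { 1, N })
        {
            var sw = Stopwatch.StartNew();
            long s = Sum(flat, stride);
            sw.Stop();
            Console.WriteLine($"stride {stride,4}: {sw.Elapsed.TotalMilliseconds:F1} ms (sum = {s})");
        }
    }
}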
Cache-sensitive benchmarks
Let’s run a benchmark several times:
int[] x = new int[128 * 1024 * 1024];
for (int iter = 0; iter < 5; iter++)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < x.Length; i += 16)
x[i]++;
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
}
11/29
Cache-sensitive benchmarks
Let’s run a benchmark several times:
int[] x = new int[128 * 1024 * 1024];
for (int iter = 0; iter < 5; iter++)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < x.Length; i += 16)
x[i]++;
sw.Stop();
WriteLine(sw.ElapsedMilliseconds);
}
176 // not warmed
81 // still not warmed
62 // the steady state
62 // the steady state
62 // the steady state
Warmup is not only about .NET
11/29
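A hand-rolled harness can copy this idea by discarding the first iterations; BenchmarkDotNet does it automatically with dedicated warmup iterations. A sketch with hypothetical WarmupCount/TargetCount constants:

const int WarmupCount = 3, TargetCount = 5;
int[] x = new int[128 * 1024 * 1024];
for (int iter = 0; iter < WarmupCount + TargetCount; iter++)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < x.Length; i += 16)
        x[i]++;
    sw.Stop();
    if (iter >= WarmupCount) // early iterations warm caches, JIT, page tables, ...
        WriteLine(sw.ElapsedMilliseconds);
}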
Branch prediction
const int N = 32767;
int[] sorted, unsorted; // random numbers [0..255]
private static int Sum(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
return sum;
}
[Benchmark]
public int Sorted()
{
return Sum(sorted);
}
[Benchmark]
public int Unsorted()
{
return Sum(unsorted);
}
12/29
Branch prediction
const int N = 32767;
int[] sorted, unsorted; // random numbers [0..255]
private static int Sum(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
return sum;
}
[Benchmark]
public int Sorted()
{
return Sum(sorted);
}
[Benchmark]
public int Unsorted()
{
return Sum(unsorted);
}
Sorted Unsorted
LegacyJIT-x86 ≈20µs ≈139µs
12/29
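One way to confirm that the branch predictor is the culprit (a sketch, not from the slides): make the loop body branch-free, after which sorted and unsorted input should cost roughly the same:

// Branch-free variant: mask is all ones when data[i] >= 128, zero otherwise.
private static int SumBranchless(int[] data)
{
    int sum = 0;
    for (int i = 0; i < N; i++)
    {
        int mask = ~((data[i] - 128) >> 31); // values are in [0..255]
        sum += data[i] & mask;
    }
    return sum;
}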
Isolation
A bad benchmark
var sw1 = Stopwatch.StartNew();
Foo();
sw1.Stop();
var sw2 = Stopwatch.StartNew();
Bar();
sw2.Stop();
13/29
Isolation
A bad benchmark
var sw1 = Stopwatch.StartNew();
Foo();
sw1.Stop();
var sw2 = Stopwatch.StartNew();
Bar();
sw2.Stop();
In the general case, you should run each benchmark in its own process. Remember about:
• Interface method dispatch
• Garbage collector and autotuning
• Conditional jitting
13/29
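With the default toolchain BenchmarkDotNet already does this: it generates a separate project and process per benchmark. A minimal entry-point sketch, where FooBarBenchmarks is a hypothetical class holding the two [Benchmark] methods:

using BenchmarkDotNet.Running;

public class Program
{
    public static void Main()
    {
        // Each benchmark runs in its own generated process, so Foo() and Bar()
        // cannot skew each other's JIT, GC or call-site state.
        BenchmarkRunner.Run<FooBarBenchmarks>();
    }
}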
Interface method dispatch
private interface IInc {
double Inc(double x);
}
private class Foo : IInc {
public double Inc(double x) => x + 1;
}
private class Bar : IInc {
public double Inc(double x) => x + 1;
}
private double Run(IInc inc) {
double sum = 0;
for (int i = 0; i < 1001; i++)
sum += inc.Inc(0);
return sum;
}
// Which method is faster?
[Benchmark]
public double FooFoo() {
var foo1 = new Foo();
var foo2 = new Foo();
return Run(foo1) + Run(foo2);
}
[Benchmark]
public double FooBar() {
var foo = new Foo();
var bar = new Bar();
return Run(foo) + Run(bar);
}
14/29
Interface method dispatch
private interface IInc {
double Inc(double x);
}
private class Foo : IInc {
public double Inc(double x) => x + 1;
}
private class Bar : IInc {
public double Inc(double x) => x + 1;
}
private double Run(IInc inc) {
double sum = 0;
for (int i = 0; i < 1001; i++)
sum += inc.Inc(0);
return sum;
}
// Which method is faster?
[Benchmark]
public double FooFoo() {
var foo1 = new Foo();
var foo2 = new Foo();
return Run(foo1) + Run(foo2);
}
[Benchmark]
public double FooBar() {
var foo = new Foo();
var bar = new Bar();
return Run(foo) + Run(bar);
}
FooFoo FooBar
LegacyJIT-x64 ≈5.4µs ≈7.1µs
14/29
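The slowdown lives at the interface call site inside Run: while only Foo ever arrives there, dispatch stays monomorphic and cheap; once Bar shows up, the same shared call site becomes polymorphic and later calls can stay on the slower path. A hypothetical extra benchmark without interface dispatch can serve as a baseline (assuming Inc is public):

// Hypothetical baseline: a direct, non-virtual call the JIT can inline.
private double RunDirect(Foo foo)
{
    double sum = 0;
    for (int i = 0; i < 1001; i++)
        sum += foo.Inc(0);
    return sum;
}

[Benchmark]
public double Direct() => RunDirect(new Foo()) + RunDirect(new Foo());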
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
; LegacyJIT-x64 : Inlining succeeded
mov ecx,23h
ret
15/29
Tricky inlining
[Benchmark]
int Calc() => WithoutStarg(0x11) + WithStarg(0x12);
int WithoutStarg(int value) => value;
int WithStarg(int value) {
if (value < 0)
value = -value;
return value;
}
LegacyJIT-x86 LegacyJIT-x64 RyuJIT-x64
≈1.7ns 0 ≈1.7ns
; LegacyJIT-x64 : Inlining succeeded
mov ecx,23h
ret
// RyuJIT-x64 : Inlining failed
// Inline expansion aborted due to opcode
// [06] OP_starg.s in method
// Program:WithStarg(int):int:this
15/29
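RyuJIT of this era gives up on inlining any method whose IL stores to a parameter (starg). A possible workaround sketch, assuming the same semantics are wanted: copy the argument into a local, so no starg is emitted and that particular obstacle disappears:

// Hypothetical rewrite of WithStarg: the parameter is never reassigned,
// so the IL uses stloc instead of starg and the inliner no longer bails out here.
int WithLocal(int value)
{
    int v = value;
    if (v < 0)
        v = -v;
    return v;
}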
SIMD
struct MyVector // Copy-pasted from System.Numerics.Vector4
{
public float X, Y, Z, W;
public MyVector(float x, float y, float z, float w)
{
X = x; Y = y; Z = z; W = w;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static MyVector operator *(MyVector left, MyVector right)
{
return new MyVector(left.X * right.X, left.Y * right.Y,
left.Z * right.Z, left.W * right.W);
}
}
Vector4 vector1, vector2, vector3;
MyVector myVector1, myVector2, myVector3;
[Benchmark] void MyMul() => myVector3 = myVector1 * myVector2;
[Benchmark] void BclMul() => vector3 = vector1 * vector2;
16/29
SIMD
struct MyVector // Copy-pasted from System.Numerics.Vector4
{
public float X, Y, Z, W;
public MyVector(float x, float y, float z, float w)
{
X = x; Y = y; Z = z; W = w;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static MyVector operator *(MyVector left, MyVector right)
{
return new MyVector(left.X * right.X, left.Y * right.Y,
left.Z * right.Z, left.W * right.W);
}
}
Vector4 vector1, vector2, vector3;
MyVector myVector1, myVector2, myVector3;
[Benchmark] void MyMul() => myVector3 = myVector1 * myVector2;
[Benchmark] void BclMul() => vector3 = vector1 * vector2;
LegacyJIT-x64 RyuJIT-x64
MyMul ≈12.9ns ≈2.5ns
BclMul ≈12.9ns ≈0.2ns
16/29
How so?
LegacyJIT-x64 RyuJIT-x64
MyMul ≈12.9ns ≈2.5ns
BclMul ≈12.9ns ≈0.2ns
; LegacyJIT-x64
; MyMul, BclMul: Naïve SSE
; ...
movss xmm3,dword ptr [rsp+40h]
mulss xmm3,dword ptr [rsp+30h]
movss xmm2,dword ptr [rsp+44h]
mulss xmm2,dword ptr [rsp+34h]
movss xmm1,dword ptr [rsp+48h]
mulss xmm1,dword ptr [rsp+38h]
movss xmm0,dword ptr [rsp+4Ch]
mulss xmm0,dword ptr [rsp+3Ch]
xor eax,eax
mov qword ptr [rsp],rax
mov qword ptr [rsp+8],rax
lea rax,[rsp]
movss dword ptr [rax],xmm3
movss dword ptr [rax+4],xmm2
; ...
; RyuJIT-x64
; MyMul: Naïve AVX
; ...
vmulss xmm0,xmm0,xmm4
vmulss xmm1,xmm1,xmm5
vmulss xmm2,xmm2,xmm6
vmulss xmm3,xmm3,xmm7
; ...
; BclMul: Smart AVX intrinsic
vmovupd xmm0,xmmword ptr [rcx+8]
vmovupd xmm1,xmmword ptr [rcx+18h]
vmulps xmm0,xmm0,xmm1
vmovupd xmmword ptr [rcx+28h],xmm0
17/29
Let’s calculate some square roots
[Benchmark]
double Sqrt13() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13);
VS
[Benchmark]
double Sqrt14() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13) + Math.Sqrt(14);
18/29
Let’s calculate some square roots
[Benchmark]
double Sqrt13() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13);
VS
[Benchmark]
double Sqrt14() =>
Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + /* ... */
+ Math.Sqrt(13) + Math.Sqrt(14);
RyuJIT-x64∗
Sqrt13 ≈91ns
Sqrt14 0 ns
∗May change in future versions; see github.com/dotnet/coreclr/issues/987
18/29
How so?
RyuJIT-x64, Sqrt13
vsqrtsd xmm0,xmm0,mmword ptr [7FF94F9E4D28h]
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D30h]
vaddsd xmm0,xmm0,xmm1
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D38h]
vaddsd xmm0,xmm0,xmm1
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D40h]
vaddsd xmm0,xmm0,xmm1
; A lot of vsqrtsd and vaddsd instructions
; ...
vsqrtsd xmm1,xmm0,mmword ptr [7FF94F9E4D88h]
vaddsd xmm0,xmm0,xmm1
ret
RyuJIT-x64, Sqrt14
vmovsd xmm0,qword ptr [7FF94F9C4C80h] ; Const
ret
19/29
How so?
Big expression tree
* stmtExpr void (top level) (IL 0x000... ???)
| /--* mathFN double sqrt
| | --* dconst double 13.000000000000000
| /--* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 12.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 11.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 10.000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 9.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 8.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 7.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 6.0000000000000000
| | --* + double
| | | /--* mathFN double sqrt
| | | | --* dconst double 5.0000000000000000
// ...
20/29
How so?
Constant folding in action
N001 [000001] dconst 1.0000000000000000 => $c0 {DblCns[1.000000]}
N002 [000002] mathFN => $c0 {DblCns[1.000000]}
N003 [000003] dconst 2.0000000000000000 => $c1 {DblCns[2.000000]}
N004 [000004] mathFN => $c2 {DblCns[1.414214]}
N005 [000005] + => $c3 {DblCns[2.414214]}
N006 [000006] dconst 3.0000000000000000 => $c4 {DblCns[3.000000]}
N007 [000007] mathFN => $c5 {DblCns[1.732051]}
N008 [000008] + => $c6 {DblCns[4.146264]}
N009 [000009] dconst 4.0000000000000000 => $c7 {DblCns[4.000000]}
N010 [000010] mathFN => $c1 {DblCns[2.000000]}
N011 [000011] + => $c8 {DblCns[6.146264]}
N012 [000012] dconst 5.0000000000000000 => $c9 {DblCns[5.000000]}
N013 [000013] mathFN => $ca {DblCns[2.236068]}
N014 [000014] + => $cb {DblCns[8.382332]}
N015 [000015] dconst 6.0000000000000000 => $cc {DblCns[6.000000]}
N016 [000016] mathFN => $cd {DblCns[2.449490]}
N017 [000017] + => $ce {DblCns[10.831822]}
N018 [000018] dconst 7.0000000000000000 => $cf {DblCns[7.000000]}
N019 [000019] mathFN => $d0 {DblCns[2.645751]}
N020 [000020] + => $d1 {DblCns[13.477573]}
...
21/29
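The same folding works against benchmark authors: when every input is a literal, the JIT can compute the whole answer at compile time and the benchmark measures an empty method. A common countermeasure, sketched here with hypothetical a1..a14 fields: feed the values from non-constant state so the expression cannot fold:

// Values come from instance fields, not literals, so the JIT cannot
// pre-compute the sum; the sqrt calls have to happen at run time.
private double a1 = 1, a2 = 2, a3 = 3, /* ... */ a13 = 13, a14 = 14;

[Benchmark]
public double Sqrt14NoFolding() =>
    Math.Sqrt(a1) + Math.Sqrt(a2) + Math.Sqrt(a3) + /* ... */
    + Math.Sqrt(a13) + Math.Sqrt(a14);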
Concurrency
22/29
True sharing
23/29
False sharing
24/29
False sharing in action
// It's an extremely naïve benchmark
// Don't try this at home
int[] x = new int[1024];
void Inc(int p) {
for (int i = 0; i < 10000001; i++)
x[p]++;
}
void Run(int step) {
var sw = Stopwatch.StartNew();
Task.WaitAll(
Task.Factory.StartNew(() => Inc(0 * step)),
Task.Factory.StartNew(() => Inc(1 * step)),
Task.Factory.StartNew(() => Inc(2 * step)),
Task.Factory.StartNew(() => Inc(3 * step)));
WriteLine(sw.ElapsedMilliseconds);
}
25/29
False sharing in action
// It's an extremely naïve benchmark
// Don't try this at home
int[] x = new int[1024];
void Inc(int p) {
for (int i = 0; i < 10000001; i++)
x[p]++;
}
void Run(int step) {
var sw = Stopwatch.StartNew();
Task.WaitAll(
Task.Factory.StartNew(() => Inc(0 * step)),
Task.Factory.StartNew(() => Inc(1 * step)),
Task.Factory.StartNew(() => Inc(2 * step)),
Task.Factory.StartNew(() => Inc(3 * step)));
WriteLine(sw.ElapsedMilliseconds);
}
Run(1) Run(256)
≈400ms ≈150ms
25/29
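Run(256) is faster because a 256-int step (1 KB) puts every counter on its own cache line. A hedged sketch of making that explicit instead of relying on array strides; 64 bytes is assumed as the cache-line size:

using System.Runtime.InteropServices;

// Each counter occupies a full (assumed) 64-byte slot, so the written
// fields end up 64 bytes apart and never share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public int Value;
}

PaddedCounter[] counters = new PaddedCounter[4];

void Inc(int p)
{
    for (int i = 0; i < 10000001; i++)
        counters[p].Value++;
}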
Conclusion: Benchmarking is hard
Anon et al., “A Measure of Transaction Processing Power”
There are lies, damn lies and then there are performance
measures.
26/29
Some good books
27/29
Questions?
Andrey Akinshin
http://aakinshin.net
https://github.com/AndreyAkinshin
https://twitter.com/andrey_akinshin
andrey.akinshin@gmail.com
28/29