High Performance Managed Languages

Martin Thompson - @mjpt777

Really, what is your preferred
platform for building HFT
applications?

Why would you build
low-latency applications on a
GC’ed platform?

Agenda
1. Some Context
2. Runtime Optimisation
3. Garbage Collection
4. Algorithms & Design

Let’s be clear
A Managed Runtime is not
always a good choice…

1 CPU Cycle : < 1ns
http://www.agner.org/optimize/instruction_tables.pdf

Time per operation to sum the
values in an array of integers?

Access Pattern Benchmark
Benchmark              Score     Error   Units
==============================================
sequential             0.832  ±  0.006   ns/op

Really???
Less than 1ns per operation?

What if the access pattern
is different?

Access Pattern Benchmark
Benchmark              Score     Error   Units
==============================================
sequential             0.832  ±  0.006   ns/op
randomPage             2.703  ±  0.025   ns/op
dependentRandomPage    7.102  ±  0.326   ns/op
randomHeap            19.896  ±  3.110   ns/op
dependentRandomHeap   89.516  ±  4.573   ns/op

Data Dependent Loads
aka “Pointer Chasing”!!!
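
The difference between independent and data dependent loads can be sketched as follows. This is not the benchmark behind the numbers above; the class and method names are hypothetical, and a real measurement would need a harness such as JMH. In `sumDependent` the address of each load is the value returned by the previous load, so the loads cannot overlap.

```java
final class AccessPatterns {
    // Independent loads: every address is known up front, so the
    // hardware can prefetch and overlap the memory accesses.
    static long sumSequential(int[] values) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Data dependent loads ("pointer chasing"): the next index is the
    // value just loaded, so each load must wait for the previous one.
    static long sumDependent(int[] next, int steps) {
        long sum = 0;
        int index = 0;
        for (int i = 0; i < steps; i++) {
            index = next[index];
            sum += index;
        }
        return sum;
    }

    // Builds a permutation that forms one single cycle over all slots
    // (Sattolo's algorithm), so a walk of `size` steps visits every slot once.
    static int[] randomCycle(int size, long seed) {
        int[] next = new int[size];
        for (int i = 0; i < size; i++) {
            next[i] = i;
        }
        java.util.Random random = new java.util.Random(seed);
        for (int i = size - 1; i > 0; i--) {
            int j = random.nextInt(i); // 0 <= j < i
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    public static void main(String[] args) {
        int[] next = randomCycle(1 << 20, 42L);
        System.out.println(sumDependent(next, next.length));
    }
}
```

Both methods touch the same number of elements; only the access pattern differs.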

Performance 101
1. Memory is transported in cache lines
2. Memory is managed in OS pages
3. Memory is pre-fetched on predictable access patterns

Runtime JIT
1. Profile guided optimisations
2. Bets can be taken and later revoked

Branches
void foo()
{
    // code
    if (condition)
    {
        // code
    }
    // code
}

[Figure: the code of foo() divided into Block A, Block B and Block C, shown next to the block layout Block A, Block C, Block B]

Subtle Branches
int result = (i > 7) ? a : b;
CMOV vs Branch Prediction?
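
A minimal sketch of the two shapes this ternary can take. Whether the JIT emits a conditional move or a compare-and-jump for the ternary depends on the runtime and the observed profile; the mask version is a hand-written branch-free alternative, not what any particular JIT generates. Class and method names are hypothetical.

```java
final class SubtleBranches {
    // Ternary form: may compile to a conditional move or to a branch.
    static int selectWithTernary(int i, int a, int b) {
        return (i > 7) ? a : b;
    }

    // Branch-free form: mask is all ones when i > 7, otherwise zero.
    // Assumption: 7 - i does not overflow, i.e. i > Integer.MIN_VALUE + 7.
    static int selectWithMask(int i, int a, int b) {
        int mask = (7 - i) >> 31; // arithmetic shift copies the sign bit
        return (a & mask) | (b & ~mask);
    }

    public static void main(String[] args) {
        System.out.println(selectWithTernary(8, 1, 2) == selectWithMask(8, 1, 2));
    }
}
```

A predictable condition usually favours the branch; an unpredictable one usually favours the branch-free form.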

Method/Function Inlining
void foo()
{
    // code
    bar();
    // code
}

[Figure: foo() shown as Block A, the call to bar(), and Block B, next to the same sequence laid out with bar() inlined]
i-cache
& code bloat?

“Inlining is THE optimisation.”
- Cliff Click
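
One reason inlining matters so much: once the callee body is visible in the caller, further optimisations can be applied across the former call boundary. A minimal sketch with hypothetical names; the second method is written by hand to show what becomes visible after inlining, and is not output from a real JIT.

```java
final class InliningSketch {
    // Small callee: a typical inlining candidate.
    static int scale(int value, int factor) {
        return value * factor;
    }

    // Caller as written in the source.
    static long sumScaled(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += scale(v, 2);
        }
        return sum;
    }

    // Hand-written view after inlining: the constant factor is now
    // visible in the loop, so the multiply can become a shift.
    static long sumScaledInlinedByHand(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v << 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        System.out.println(sumScaled(values) == sumScaledInlinedByHand(values));
    }
}
```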

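Loops
void foo(int[] array, int length)
{
    for (int i = 0; i < length; i++)
    {
        bar(Integer.bitCount(array[i]));
    }
}

Loops
void foo(int[] array, int length)
{
    // conceptual view of 4x unrolling; remainder iterations omitted
    for (int i = 0; i < length; i += 4)
    {
        bar(Integer.bitCount(array[i]));
        bar(Integer.bitCount(array[i + 1]));
        bar(Integer.bitCount(array[i + 2]));
        bar(Integer.bitCount(array[i + 3]));
    }
}

Loops
void foo(int[] array, int length)
{
    for (int i = 0; i < length; i++)
    {
        bar(Integer.bitCount(array[i]));
    }
}
Intrinsics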
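
A runnable sketch of the same loop, with the unrolling done by hand including the remainder iterations. A JIT normally does this itself; the manual version is only here to show the shape. `Integer.bitCount` is a real JDK method that HotSpot treats as an intrinsic, typically a single POPCNT instruction on x86 when the CPU supports it. The class and method names are hypothetical.

```java
final class LoopSketch {
    // Straightforward loop: one bit count per iteration.
    static int countBits(int[] array, int length) {
        int total = 0;
        for (int i = 0; i < length; i++) {
            total += Integer.bitCount(array[i]);
        }
        return total;
    }

    // Manually unrolled 4x, followed by a loop for the leftover elements.
    static int countBitsUnrolled(int[] array, int length) {
        int total = 0;
        int i = 0;
        for (int limit = length - 3; i < limit; i += 4) {
            total += Integer.bitCount(array[i]);
            total += Integer.bitCount(array[i + 1]);
            total += Integer.bitCount(array[i + 2]);
            total += Integer.bitCount(array[i + 3]);
        }
        for (; i < length; i++) {
            total += Integer.bitCount(array[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] array = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(countBits(array, array.length));
        System.out.println(countBitsUnrolled(array, array.length));
    }
}
```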

Subtype Polymorphism
void draw(Shape[] shapes)
{
    for (int i = 0; i < shapes.length; i++)
    {
        shapes[i].draw();
    }
}

void bar(Shape shape)
{
    bar(shape.isVisible());
}
Class Hierarchy Analysis
& Inline Caching
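
A minimal sketch of the kind of call site these techniques target. The `Shape`, `Circle` and `Square` types here are hypothetical stand-ins (the slide does not define them), and `draw` takes a `StringBuilder` only so the result can be checked. If only one implementation of `Shape` has been loaded, class hierarchy analysis lets the JIT treat `shape.draw(...)` as a direct call; if a call site sees only one or two receiver types, an inline cache can guard on the type and inline the target. Actual behaviour depends on the JVM and its profile data.

```java
final class PolymorphismSketch {
    interface Shape {
        void draw(StringBuilder canvas);
    }

    static final class Circle implements Shape {
        public void draw(StringBuilder canvas) {
            canvas.append('o');
        }
    }

    static final class Square implements Shape {
        public void draw(StringBuilder canvas) {
            canvas.append('#');
        }
    }

    // Virtual call site: the target depends on the runtime type of each element.
    static String drawAll(Shape[] shapes) {
        StringBuilder canvas = new StringBuilder();
        for (int i = 0; i < shapes.length; i++) {
            shapes[i].draw(canvas);
        }
        return canvas.toString();
    }

    public static void main(String[] args) {
        Shape[] shapes = {new Circle(), new Square(), new Circle()};
        System.out.println(drawAll(shapes));
    }
}
```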

Generational Garbage Collection
“Only the good die young”
- Billy Joel

Generational Garbage Collection
Young/New Generation: Eden (TLAB, TLAB) | Survivor 0 | Survivor 1 | Virtual
Old Generation: Tenured | Virtual

Modern Hardware (Intel Sandy Bridge EP)
[Figure: two sockets, each with cores C1..Cn, per-core L1 and L2 caches, an L3 cache, a memory controller (MC) with attached DRAM, and PCI-e 3 (40X) IO; the sockets are linked by QPI]
Registers/Buffers                <1ns
L1                  ~4 cycles    ~1ns
L2                  ~12 cycles   ~3ns
L3                  ~40 cycles   ~15ns
L3 (dirty hit)      ~75 cycles   ~25ns
DRAM                             ~65ns
QPI                              ~40ns
* Assumption: 3GHz Processor

Broadwell EX – 24 cores & 60MB L3 Cache

Thread Local Allocation Buffers
[Figure: Eden containing two TLABs]
• Affords locality of reference
• Avoids false sharing
• Can have NUMA aware allocation
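
Allocation inside a thread-local buffer is commonly implemented as a pointer bump: because the buffer belongs to one thread, no lock or atomic operation is needed. The sketch below models that idea over plain offsets. It is an illustration with hypothetical names and an assumed 8-byte alignment, not the HotSpot implementation.

```java
final class BumpAllocator {
    private final int capacity; // size of this thread's buffer in bytes
    private int top;            // offset of the next free byte

    BumpAllocator(int capacity) {
        this.capacity = capacity;
    }

    // Returns the offset of the new block, or -1 when it does not fit
    // (a real runtime would then request a fresh buffer from Eden).
    int allocate(int sizeInBytes) {
        if (sizeInBytes <= 0 || sizeInBytes > capacity) {
            return -1;
        }
        int aligned = (sizeInBytes + 7) & ~7; // round up to 8 bytes (assumed)
        if (aligned > capacity - top) {
            return -1;
        }
        int offset = top;
        top += aligned; // the "bump"
        return offset;
    }

    public static void main(String[] args) {
        BumpAllocator tlab = new BumpAllocator(32);
        System.out.println(tlab.allocate(10));
        System.out.println(tlab.allocate(8));
    }
}
```

Objects allocated one after another by the same thread end up adjacent in memory, which is where the locality of reference comes from.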

Object Survival
[Figure: young generation (TLAB, TLAB, Virtual)]
• Aging Policies
• Compacting Copy
• NUMA Interleave
• Fast Parallel Scavenging
• Only the survivors require work

Object Promotion
[Figure: TLABs in the young generation and the Old Generation (Tenured, Virtual)]
• Concurrent Collection
• String Deduplication

Compacting Collections – Depth first copy

Compacting Collections
OS Pages and
cache lines?

G1 – Concurrent Compaction?
[Figure: G1 heap divided into regions labelled Eden (E), Survivor (S), Old (O), Humongous (H) and Unused]

Azul Zing C4
True Concurrent Compacting
Collector

GC vs Manual Memory Management
Not easy to pick a clear winner…

Managed GC
• GC Implementation
• Card Marking
• Read/Write Barriers
• Object Headers
• Background Overhead on CPU and Memory
Native
• Malloc Implementation
• Arena/pool contention
• Bin Wastage
• Fragmentation
• Debugging Effort
• Inter-thread costs

What is most important to
performance?

“If I had more time, I would
have written a shorter letter.”
- Blaise Pascal

• Avoiding duplicate work
• Avoiding cache misses
• Avoiding contention
• Strength reduction
• Amortising expensive operations
• Mechanical Sympathy
• Choice of algorithms & data structures
• API design

In a large codebase it is really
difficult to do everything well

It also takes some “uncommon”
disciplines such as:
profiling, telemetry, modelling…

The story of Aeron
https://github.com/real-logic/Aeron

Aeron is an interesting lesson in
“time to performance”

Lots of others exist, such as the
C# Roslyn compiler

Time spent on
Mechanical Sympathy
vs
Debugging Pointers
???

GC
Immutable Data & Concurrency

Remember
Assembly vs Compiled
Languages?

What about
footprint, startup, warm up, etc.
???

Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777
“Any intelligent fool can make things bigger, more
complex, and more violent.
It takes a touch of genius, and a lot of courage, to move
in the opposite direction.”
- Albert Einstein
Questions?
