Striving for ultimate Low Latency

Mateusz Pusz
September 18, 2017
STRIVING FOR ULTIMATE
LOW LATENCY
INTRODUCTION TO DEVELOPMENT OF LOW LATENCY SYSTEMS

| Striving for ultimate Low Latency
LATENCY VS THROUGHPUT
2

Latency is the time required to perform some action or to produce
some result. Measured in units of time like hours, minutes,
seconds, nanoseconds or clock periods.
2

Latency is the time required to perform some action or to produce
some result. Measured in units of time like hours, minutes,
seconds, nanoseconds or clock periods.
Throughput is the number of such actions executed or results
produced per unit of time. Measured in units of whatever is being
produced per unit of time.
2

WHAT DO WE MEAN BY LOW LATENCY?
3

Low Latency allows human-unnoticeable delays between an input
being processed and the corresponding output providing real time
characteristics.
3

Low Latency allows human-unnoticeable delays between an input
being processed and the corresponding output providing real time
characteristics.
Especially important for internet connections utilizing services such as
trading, online gaming and VoIP.
3

WHY DO WE STRIVE FOR LOW LATENCY?
4

• In VoIP substantial delays between input from conversation participants may impair their
communication
4

communication
• In online gaming a player with a high latency internet connection may show slow responses in spite of
superior tactics or the appropriate reaction time
4

communication
• In online gaming a player with a high latency internet connection may show slow responses in spite of
superior tactics or the appropriate reaction time
• Within capital markets the proliferation of algorithmic trading requires firms to react to market events
faster than the competition to increase profitability of trades
4

A program trading platform that uses powerful computers to
transact a large number of orders at very fast speeds
-- Investopedia
HIGH-FREQUENCY TRADING (HFT)
5

A program trading platform that uses powerful computers to
transact a large number of orders at very fast speeds
-- Investopedia
• Using complex algorithms to analyze multiple markets and execute orders based on market conditions
• Buying and selling of securities many times over a period of time (o en hundreds of times an hour)
• Done to profit from time-sensitive opportunities that arise during trading hours
• Implies high turnover of capital (i.e. one's entire capital or more in a single day)
• Typically, the traders with the fastest execution speeds are more profitable
HIGH-FREQUENCY TRADING (HFT)
5

MARKET DATA PROCESSING
6

1-10us 100-1000ns
HOW FAST DO WE DO?
ALL SOFTWARE APPROACH ALL HARDWARE APPROACH
7

1-10us 100-1000ns
• Average human eye blink takes 350 000us (1/3s)
• Millions of orders can be traded that time
HOW FAST DO WE DO?
ALL SOFTWARE APPROACH ALL HARDWARE APPROACH
7

WHAT IF SOMETHING GOES WRONG?
8

• In 2012 was the largest trader in
U.S. equities
• Market share
– 17.3% on NYSE
– 16.9% on NASDAQ
• Had approximately $365 million in cash
and equivalents
• Average daily trading volume
– 3.3 billion trades
– trading over 21 billion dollars
KNIGHT CAPITAL
8

• In 2012 was the largest trader in
U.S. equities
• Market share
– 17.3% on NYSE
– 16.9% on NASDAQ
• Had approximately $365 million in cash
and equivalents
• Average daily trading volume
– 3.3 billion trades
– trading over 21 billion dollars
• pre-tax loss of $440 million in 45 minutes
-- LinkedIn
KNIGHT CAPITAL
9

• Low Latency network
• Modern hardware
• BIOS profiling
• Kernel profiling
• OS profiling
C++ OFTEN NOT THE MOST IMPORTANT PART OF THE SYSTEM
10

• Don't sleep
• Don't context switch
• Prefer single-threaded scheduling
• Disable locking and thread support
• Disable power management
• Disable C-states
• Disable interrupt coalescing
SPIN, PIN, AND DROP-IN
SPIN
11

• Don't sleep
• Assign CPU a inity
• Assign interrupt a inity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU
SPIN PIN
11

• Don't sleep
• Assign CPU a inity
• Assign interrupt a inity
• Assign memory to NUMA nodes
• Consider the physical location of NICs
• Isolate cores from general OS use
• Use a system with a single physical CPU
• Choose NIC vendors based on performance and availability of drop-in kernel bypass libraries
• Use the kernel bypass library
SPIN PIN
DROP-IN
11

• Typically only a small part of code is really important (fast path)
CHARACTERISTICS OF LOW LATENCY SOFTWARE
13

• That code is not executed o en
• When it is executed it has to
– start and finish as soon as possible
– have predictable and reproducible performance (low jitter)
13

• Multithreading increases latency
– it is about low latency and not throughput
– concurrency (even on di erent cores) trashes CPU caches above L1, share memory bus, shares IO,
shares network
13

• Multithreading increases latency
– it is about low latency and not throughput
– concurrency (even on di erent cores) trashes CPU caches above L1, share memory bus, shares IO,
shares network
• Mistakes are really costly
– good error checking and recovery is mandatory
– one second is 4 billion CPU instructions (a lot can happen that time)
13

HOW TO DEVELOP SOFTWARE THAT HAVE PREDICTABLE
PERFORMANCE?
14

It turns out that the more important question here is...
HOW TO DEVELOP SOFTWARE THAT HAVE PREDICTABLE
PERFORMANCE?
14

HOW NOT TO DEVELOP SOFTWARE THAT HAVE PREDICTABLE
PERFORMANCE?
15

• In Low Latency system we care a lot about
WCET (Worst Case Execution Time)
• In order to limit WCET we should limit the
usage of specific C++ language features
• This is not only the task for developers but
also for code architects
HOW NOT TO DEVELOP SOFTWARE THAT HAVE PREDICTABLE
PERFORMANCE?
16

1 C++ tools that trade performance for usability (e.g. std::shared_ptr<T>, std::function<>)
2 Throwing exceptions on likely code path
3 Dynamic polymorphism
4 Multiple inheritance
5 RTTI
6 Dynamic memory allocations
THINGS TO AVOID ON THE FAST PATH
17

template<class T>
class shared_ptr;
• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr
owning that object is either destroyed or assigned another pointer via operator= or reset()
• Support user provided deleter
std::shared_ptr<T>
18

template<class T>
class shared_ptr;
• Smart pointer that retains shared ownership of an object through a pointer
• Several shared_ptr objects may own the same object
• The shared object is destroyed and its memory deallocated when the last remaining shared_ptr
owning that object is either destroyed or assigned another pointer via operator= or reset()
• Support user provided deleter
std::shared_ptr<T>
Too o en overused by C++ programmers
18

void foo()
{
std::unique_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
void foo()
{
std::shared_ptr<int> ptr{new int{1}};
// some code using 'ptr'
}
QUESTION: WHAT IS THE DIFFERENCE HERE?
19

• Shared state
– performance + memory footprint
• Mandatory synchronization
– performance
• Type Erasure
– performance
• std::weak_ptr<T> support
– memory footprint
• Aliasing constructor
– memory footprint
KEY std::shared_ptr<T> ISSUES
20

MORE INFO ON CODE::DIVE 2016
21

C++ EXCEPTIONS
22

• Code generated by nearly all C++ compilers does not introduce significant runtime overhead for C++
exceptions
C++ EXCEPTIONS
22

exceptions
• ... if they are not thrown
C++ EXCEPTIONS
22

exceptions
• Throwing an exception can take significant and not deterministic time
C++ EXCEPTIONS
22

exceptions
• Advantages of C++ exceptions usage
– (if not thrown) actually can improve application performance
– cannot be ignored!
– simplify interfaces
– make source code of likely path easier to reason about
C++ EXCEPTIONS
22

exceptions
• Advantages of C++ exceptions usage
– (if not thrown) actually can improve application performance
– cannot be ignored!
– simplify interfaces
– make source code of likely path easier to reason about
C++ EXCEPTIONS
Not using C++ exceptions is not an excuse to write not exception-safe code!
22

class base {
virtual void setup() = 0;
virtual void run() = 0;
virtual void cleanup() = 0;
public:
virtual ~base() = default;
void process()
{
setup();
run();
cleanup();
}
};
class derived : public base {
void setup() override { /* ... */ }
void run() override { /* ... */ }
void cleanup() override { /* ... */ }
};
POLYMORPHISM
DYNAMIC
23

class base {
public:
void process()
{
setup();
run();
cleanup();
}
};
};
• Additional pointer stored in an object
• Extra indirection (pointer dereference)
• O en not possible to devirtualize
• Not inlined
• Instruction cache miss
POLYMORPHISM
DYNAMIC
23

class base {
public:
void process()
{
setup();
run();
cleanup();
}
};
};
template<class Derived>
class base {
public:
void process()
{
static_cast<Derived*>(this)->setup();
static_cast<Derived*>(this)->run();
static_cast<Derived*>(this)->cleanup();
}
};
class derived : public base<derived> {
friend class base<derived>;
void setup() { /* ... */ }
void run() { /* ... */ }
void cleanup() { /* ... */ }
};
POLYMORPHISM
DYNAMIC STATIC
24

• this pointer adjustments needed to call
member function (for not empty base classes)
MULTIPLE INHERITANCE
25

• Virtual inheritance as an answer
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members
DIAMOND OF DREAD
26

• Virtual inheritance as an answer
• virtual in C++ means "determined at runtime"
• Extra indirection to access data members
Always consider composition before inheritance!
DIAMOND OF DREAD
26

class base {
public:
virtual void foo() = 0;
};
public:
void foo() override;
void boo();
};
RUNTIME TYPE IDENTIFICATION (RTTI)
27

class base {
public:
};
public:
void boo();
};
void foo(base& b)
{
derived* d = dynamic_cast<derived*>(&b);
if(d) {
d->boo();
}
}
27

class base {
public:
};
public:
void boo();
};
void foo(base& b)
{
if(d) {
d->boo();
}
}
O en the sign of a smelly design
27

class base {
public:
};
public:
void boo();
};
void foo(base& b)
{
if(d) {
d->boo();
}
}
• Traversing an inheritance tree
• Comparisons
28

class base {
public:
};
public:
void boo();
};
void foo(base& b)
{
if(d) {
d->boo();
}
}
• Traversing an inheritance tree
• Comparisons
void foo(base& b)
{
if(typeid(b) == typeid(derived)) {
derived* d = static_cast<derived*>(&b);
d->boo();
}
}
• Only one comparison of std::type_info
• O en only one runtime pointer compare
28

• General purpose operation
• Nondeterministic execution performance
• Causes memory fragmentation
• Memory leaks possible if not properly handled
• May fail (error handling is needed)
DYNAMIC MEMORY ALLOCATIONS
29

• Address specific needs (functionality and hardware constrains)
• Typically low number of dynamic memory allocations
• Data structures needed to manage big chunks of memory
CUSTOM ALLOCATORS TO THE RESCUE
30

template<typename T> struct pool_allocator {
T* allocate(std::size_t n);
void deallocate(T* p, std::size_t n);
};
using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator>;
30

template<typename T> struct pool_allocator {
T* allocate(std::size_t n);
void deallocate(T* p, std::size_t n);
};
using pool_string = std::basic_string<char, std::char_traits<char>, pool_allocator>;
Preallocation makes the allocator jitter more stable, helps in keeping related
data together and avoiding long term fragmentation.
30

Prevent dynamic memory allocation for the (common) case of
dealing with small objects
SMALL OBJECT OPTIMIZATION (SOO / SSO / SBO)
31

Prevent dynamic memory allocation for the (common) case of
dealing with small objects
class sso_string {
char* data_ = u_.sso_;
size_t size_ = 0;
union {
char sso_[16] = "";
size_t capacity_;
} u_;
public:
size_t capacity() const { return data_ == u_.sso_ ? sizeof(u_.sso_) - 1 : u_.capacity_; }
// ...
};
SMALL OBJECT OPTIMIZATION (SOO / SSO / SBO)
31

template<std::size_t MaxSize>
class inplace_string {
std::array<value_type, MaxSize + 1> chars_;
public:
// string-like interface
};
NO DYNAMIC ALLOCATION
32

public:
};
struct db_contact {
inplace_string<7> symbol;
inplace_string<15> name;
inplace_string<15> surname;
inplace_string<23> company;
};
32

public:
};
struct db_contact {
inplace_string<7> symbol;
inplace_string<15> name;
inplace_string<15> surname;
inplace_string<23> company;
};
No dynamic memory allocations or pointer indirections guaranteed with the
cost of possibly bigger memory usage
32

HOW TO DEVELOP SYSTEM WITH LOW-LATENCY CONSTRAINTS
33

• Keep the number of threads close (less or equal) to the number available physical CPU cores
33

• Separate IO threads from business logic thread (unless business logic is extremely lightweight)
33

• Use fixed size lock free queues / busy spins to pass the data between threads
33

• Use optimal algorithms/data structures, data locality principle
33

• Precompute, use compile time instead of runtime whenever possible
33

• The simpler the code, the faster it is likely to be
33

• Do not try to be smarter than the compiler
33

• Know the language, tools, and libraries
33

• Know your hardware!
33

• Bypass the kernel (100% user space code)
33

• Bypass the kernel (100% user space code)
• Measure performance… ALWAYS
33

THE MOST IMPORTANT RECOMMENDATION
34

Always measure your performance!
THE MOST IMPORTANT RECOMMENDATION
34

• Always measure Release version
cmake -DCMAKE_BUILD_TYPE=Release
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo
HOW TO MEASURE THE PERFORMANCE OF YOUR PROGRAMS
35

• Prefer hardware based black box performance meassurements
35

• In case that is not possible or you want to debug specific performance issue use profiler
• To gather meaningful stack traces preserve frame pointer
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO
"${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -fno-omit-frame-pointer")
• Familiarize yourself with linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
35

• In case that is not possible or you want to debug specific performance issue use profiler
• To gather meaningful stack traces preserve frame pointer
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO
"${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -fno-omit-frame-pointer")
• Familiarize yourself with linux perf tools (xperf on Windows) and flame graphs
• Use tools like Intel VTune
• Verify output assembly code
35

Flame Graph Search
test_bu..
x..
d..
s..
gener..
S..
_..
unary_..
execute_command_internal
x..
bash
shell_expand_word_list
red..
execute_builtin
__xst..
__dup2
main
_.. do_redirection_internal
do_redirections
do_f..
ext4_f..
__GI___li..
__..
expand_word_list_internal
unary_..
expan..tra..
path..
execute_command
two_ar..
cleanup_redi..
execute_command
do_sy..
trac..
__GI___libc_..
posixt..
execute_builtin_or_function
_..
[unknown]
tra..
expand_word_internal
_..
tracesys
ex..
_..
vfs_write
c..
test_co..
e..
u..
v..
SY..
i..
execute_simple_command
execute_command_internal
p.. gl..
s..sys_write
__GI___l..
expand_words
d..
__libc_start_main
d..
do_sync..
generi..
e..
reader_loop
sy..
execute_while_command
do_redirecti..
sys_o..
g..
__gene..
__xsta..
tracesys
c..
_..
execute_while_or_until
FLAMEGRAPH
36

Striving for ultimate Low Latency

Striving for ultimate Low Latency

More Related Content

Similar to Striving for ultimate Low Latency (20)

More from Mateusz Pusz (14)

Recently uploaded (20)

Striving for ultimate Low Latency