Description
When we start a multi-threaded application (e.g. any web workload), it seems to me that we pay some penalty for accessing common memory locations from different threads. Consider this method:
```csharp
void DoWork(IDoWork work) => work?.Do();
```
If, on startup, we call it from multiple threads (e.g. while processing incoming requests), we will most likely end up accessing the same 3 memory locations from multiple threads:
- call counting cell in the `callCountingStub` for `DoWork` (and `Do`)
- BB counter in case of PGO (`DoWork` has a branch)
- class probe
So we are basically going to do a lot of cache thrashing, and it's especially painful on NUMA systems.
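To make the cost concrete, here is a minimal, hypothetical micro-benchmark sketch (not runtime code; the names, iteration counts, and 64-byte line size are my own assumptions) comparing one shared cell written by every thread against per-thread cells padded onto separate cache lines. Even plain racy stores, like the ones the runtime's counters use, force the cache line to bounce between cores:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

class CounterContentionSketch
{
    const int Iterations = 10_000_000;

    // Pad each cell to a full (assumed) 64-byte cache line.
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    struct PaddedCell
    {
        [FieldOffset(0)] public int Value;
    }

    static PaddedCell sharedCell;        // one cell written by every thread
    static PaddedCell[] perThreadCells;  // one padded cell per thread

    static void Main()
    {
        int threads = Environment.ProcessorCount;
        perThreadCells = new PaddedCell[threads];

        // All threads hammer the same cell: the line ping-pongs between cores.
        var sw = Stopwatch.StartNew();
        Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < Iterations; i++)
                Volatile.Write(ref sharedCell.Value, Volatile.Read(ref sharedCell.Value) - 1);
        });
        Console.WriteLine($"shared cell:      {sw.ElapsedMilliseconds} ms");

        // Each thread stays on its own line: no cross-core traffic for the counter.
        sw.Restart();
        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < Iterations; i++)
                Volatile.Write(ref perThreadCells[t].Value, Volatile.Read(ref perThreadCells[t].Value) - 1);
        });
        Console.WriteLine($"per-thread cells: {sw.ElapsedMilliseconds} ms");
    }
}
```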
We should consider/experiment with adding some quick random-based checks on top of all 3, something like:

```
if (rand & 1)
    dec [callCountingCell]
```
It should help a bit: it increases the chance that a given memory location is touched from just one core and reduces the amount of cache thrashing in general.
On x86 we can rely on `rdtsc` for that (and `cntvct_el0` on Arm) to access perf counters.
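A rough C# sketch of the idea (the real change would live in the JIT-emitted stub/probe code, not in managed code, and the helper name here is hypothetical). `Stopwatch.GetTimestamp()` stands in for `rdtsc`/`cntvct_el0` as a cheap, "random enough" source of bits:

```csharp
using System.Diagnostics;
using System.Threading;

static class ProbabilisticCounting
{
    // Hypothetical helper: decrement the shared cell only about half the time,
    // so each core writes the contended cache line roughly 2x less often.
    // The count becomes approximate, which is acceptable for a heuristic signal.
    public static void CountCall(ref int callCountingCell)
    {
        // Cheap pseudo-random gate; a real implementation would read the
        // hardware counter directly (rdtsc on x86, cntvct_el0 on Arm).
        if ((Stopwatch.GetTimestamp() & 1) != 0)
        {
            // Racy, non-atomic decrement, matching how the existing cells are updated.
            Volatile.Write(ref callCountingCell, Volatile.Read(ref callCountingCell) - 1);
        }
    }
}
```

One presumable consequence: skipping decrements effectively scales the threshold (gating on one bit roughly doubles the number of calls needed to reach zero), so the existing thresholds or counter seeds would likely need adjusting to compensate.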
One might say that it's not that important because we have low call counting thresholds, but we need to take into account the fact that we start promoting methods to tier 1 only if we didn't encounter new tier 0 compilations in the last 100 ms.
category:proposal
theme:profile-feedback