Make call-counting, class probes, block counters cache-friendly

When a multi-threaded application starts (e.g. any web workload), it seems to me that we pay a penalty for accessing the same memory locations from different threads. Consider this method:
```csharp
void DoWork(IDoWork work) => work?.Do();
```
If we call it from multiple threads during startup (e.g. while processing incoming requests), we will most likely end up writing to the same three memory locations from all of those threads:
1) the call-counting cell in the callCountingStub for `DoWork` (and for `Do`)
2) the basic-block (BB) counters added for PGO (`DoWork` has a branch: the null check in `work?.Do()`)
3) the class probe at the interface call `work.Do()`

As a result, the cache lines holding these locations keep bouncing between cores (cache thrashing), which is especially painful across NUMA nodes.
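To make the access pattern concrete, here is a small C model of what an instrumented `DoWork` writes on every call. It is in C rather than C# because the probes are machine code emitted by the runtime. All names (`call_counting_cell`, `bb_count_null`, `bb_count_call`, `class_probe_slot`, `simulate_startup`) are hypothetical. The class probe is reduced to a single slot, whereas the runtime's class probes record a sample of observed receiver types. Treat it as a sketch of the sharing pattern, not of the runtime's real data layout.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state an instrumented tier0 DoWork touches.
   Atomics are used only so the totals are exact in this model; this says
   nothing about how the runtime actually emits its probes. */
typedef struct { int32_t type_id; } Work;    /* stand-in for an IDoWork object */

static _Atomic int32_t call_counting_cell;   /* 1) call-counting cell            */
static _Atomic int64_t bb_count_null;        /* 2) BB counter: work == NULL      */
static _Atomic int64_t bb_count_call;        /* 2) BB counter: work != NULL      */
static _Atomic int32_t class_probe_slot;     /* 3) class probe, one slot only    */

static void DoWork(const Work *work) {
    /* Every call, on every thread, writes to the same few shared variables. */
    atomic_fetch_sub_explicit(&call_counting_cell, 1, memory_order_relaxed);
    if (work == NULL) {
        atomic_fetch_add_explicit(&bb_count_null, 1, memory_order_relaxed);
        return;
    }
    atomic_fetch_add_explicit(&bb_count_call, 1, memory_order_relaxed);
    atomic_store_explicit(&class_probe_slot, work->type_id, memory_order_relaxed);
    /* work->Do() would run here */
}

enum { MAX_THREADS = 16 };

typedef struct { int32_t type_id; int calls; } WorkerArgs;

static void *worker(void *p) {
    const WorkerArgs *a = p;
    Work w = { a->type_id };
    for (int i = 0; i < a->calls; i++)
        DoWork(i % 4 == 0 ? NULL : &w);   /* every 4th "request" has no work item */
    return NULL;
}

/* Runs `threads` workers that each call DoWork `calls` times.
   Returns 0 if every thread was started, -1 otherwise. */
static int simulate_startup(int threads, int calls, int32_t initial_count) {
    assert(threads > 0 && threads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];
    WorkerArgs args[MAX_THREADS];
    atomic_store(&call_counting_cell, initial_count);
    atomic_store(&bb_count_null, 0);
    atomic_store(&bb_count_call, 0);
    atomic_store(&class_probe_slot, -1);

    int created = 0;
    for (; created < threads; created++) {
        args[created] = (WorkerArgs){ .type_id = created % 2, .calls = calls };
        if (pthread_create(&tids[created], NULL, worker, &args[created]) != 0)
            break;
    }
    for (int t = 0; t < created; t++)
        pthread_join(tids[t], NULL);
    return created == threads ? 0 : -1;
}
```

With 8 threads making 1000 calls each, all 8000 calls write to the same four variables, so on real hardware the cache lines holding them would migrate between the cores running those threads.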

We should consider experimenting with a quick random check guarding all three updates, something like:

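```asm
rdtsc                                 ; edx:eax = timestamp counter (clobbers both)
test  eax, 1                          ; use the low bit as a cheap coin flip
jz    SKIP                            ; skip the update roughly half the time
dec   dword ptr [callCountingCell]    ; cell width assumed to be 32 bits
SKIP:
```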
This should help somewhat: each shared location is written only about half as often, which makes it more likely that a given location is touched by just one core at a time and reduces cache thrashing overall.
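The sampled update can be modelled in C as follows. This sketch swaps in a per-thread xorshift64 generator for `rdtsc`/`cntvct_el0` so its behaviour is deterministic; `sampled_dec` and `count_writes` are hypothetical helpers. One consequence worth keeping in mind: if half the decrements are skipped, the cell reaches zero after roughly twice as many calls, unless each sampled write subtracts 2 (the `step` parameter).

```c
#include <assert.h>
#include <stdint.h>

/* Per-thread xorshift64 state, used here as a stand-in for rdtsc/cntvct_el0.
   The random source must be thread-local, otherwise it becomes a new shared
   hot spot. Every thread starts from the same seed here for simplicity. */
static _Thread_local uint64_t rng_state = 0x9E3779B97F4A7C15ull;

static inline uint64_t xorshift64(void) {
    uint64_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return rng_state = x;
}

/* Hypothetical sampled decrement: writes the shared cell only when the coin
   flip succeeds (about half the time) and subtracts `step` when it does.
   step == 1 roughly doubles the calls needed to reach zero;
   step == 2 keeps the expected total decrement equal to the number of calls.
   Returns 1 if the cell was written. */
static inline int sampled_dec(int64_t *cell, int64_t step) {
    if (xorshift64() >> 63) {   /* top bit is set about half the time */
        *cell -= step;
        return 1;
    }
    return 0;
}

/* Simulates `calls` invocations and returns how many actually wrote the cell. */
static int64_t count_writes(int64_t calls, int64_t step, int64_t *cell) {
    assert(step > 0);
    int64_t writes = 0;
    for (int64_t i = 0; i < calls; i++)
        writes += sampled_dec(cell, step);
    return writes;
}
```

For 100 000 calls with `step == 2`, about half the calls write the cell while the total decrement stays close to 100 000. For the PGO block counters, sampling every block with the same probability scales all counts by about the same factor, so relative block weights are preserved in expectation.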
As the source of randomness, we can read a cheap hardware counter: `rdtsc` on x86 and `cntvct_el0` on Arm64.
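A minimal C sketch of reading those counters with GCC or Clang follows. It assumes `__rdtsc()` from `<x86intrin.h>` on x86 and an `mrs` read of `cntvct_el0` on Arm64. It falls back to C11 `timespec_get` elsewhere, which is much slower and only keeps the sketch portable. `read_hw_counter` and `coin_flip` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Reads a cheap hardware counter (GCC/Clang syntax). */
static inline uint64_t read_hw_counter(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();                                   /* x86 time-stamp counter */
#elif defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));   /* Arm64 virtual count register */
    return v;
#else
    struct timespec ts;                                 /* portable fallback, much slower */
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical coin flip: the low bit of the counter. */
static inline int coin_flip(void) {
    return (int)(read_hw_counter() & 1);
}
```

The generic timer behind `cntvct_el0` ticks far slower than the CPU clock on many Arm64 systems, so back-to-back reads in a tight loop can return the same value and the low bit is then a poor coin flip. Neither read is free, so the cost of the check should be measured against the cost of the contended write it avoids.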

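One might argue that this is not that important because the call-counting thresholds are low. However, we only start promoting methods to tier1 once no new tier0 compilations have been encountered in the last 100 ms, so during a busy startup the instrumented code, and the shared locations it writes, can stay hot for much longer than the thresholds alone suggest.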


category:proposal
theme:profile-feedback

Make call-counting, class probes, block counters cache-friendly #72387