
Make call-counting, class probes, block counters cache-friendly #72387

@EgorBo

When we start a multithreaded application (e.g. any web workload), it seems to me that we pay a penalty for accessing a few common memory locations from different threads. Consider this method:

void DoWork(IDoWork work) => work?.Do();

If we call it from multiple threads at startup (e.g. while processing incoming requests), we will most likely end up accessing the same 3 memory locations from multiple threads:

  1. The call-counting cell in the callCountingStub for DoWork (and Do)
  2. The basic-block (BB) counter in case of PGO (DoWork has a branch)
  3. The class probe for the Do call site

So we are basically going to do a lot of cache thrashing, which is especially painful on NUMA systems.

We should consider (or at least experiment with) adding a quick random-based check on top of all 3, something like:

if (rand & 1)
    dec [callCountingCell]

This should help slightly: it raises the chance that a given memory location is written from just one core and reduces cache thrashing in general.
On x86 we can rely on rdtsc as the cheap source of randomness (and cntvct_el0 on arm64); both simply read a hardware timestamp counter.
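
To make the idea concrete, here is a minimal C++ model of the sampled decrement. It is a sketch, not the runtime's actual stub code: `CallCountingCell`, `XorShift32` and `SampledCountCall` are hypothetical names, and a per-thread xorshift generator stands in for rdtsc / cntvct_el0 so the example stays portable. It also assumes we want to keep the effective threshold unchanged, so every sampled write subtracts 2 instead of 1; the pseudocode above would instead roughly double the number of calls needed.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the shared call-counting cell of one method.
struct CallCountingCell {
    std::atomic<int32_t> remaining;
    explicit CallCountingCell(int32_t threshold) : remaining(threshold) {}
};

// Cheap pseudo-random source owned by a single thread (xorshift32).
// Assumption: the issue proposes rdtsc / cntvct_el0 here instead.
struct XorShift32 {
    uint32_t state;
    explicit XorShift32(uint32_t seed) : state(seed ? seed : 1u) {}
    uint32_t next() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }
};

// Models one call of the counted method.
// Roughly half of the calls skip the shared write entirely; the rest
// subtract 2 so the expected number of calls to reach zero is unchanged.
// Returns true once the counter has reached zero or below.
inline bool SampledCountCall(CallCountingCell& cell, XorShift32& rng) {
    if ((rng.next() & 1u) == 0) {
        return false;  // no write to the shared cache line on this call
    }
    int32_t before = cell.remaining.fetch_sub(2, std::memory_order_relaxed);
    return before - 2 <= 0;
}
```

Each thread would own its own `XorShift32` (for example as a `thread_local`), so the only shared write left is the `fetch_sub`, and it happens on about half of the calls. Whether that actually reduces cache-line ping-pong enough to matter is exactly what the experiment would need to measure.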

One might say that this is not that important because the call-counting thresholds are low, but we also need to take into account that we start promoting methods to tier1 only if we have not encountered new tier0 compilations in the last 100ms.

category:proposal
theme:profile-feedback

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)