Using eBPF Off-CPU Sampling to See What Your DBs are Really Waiting For by Tanel Poder

A ScyllaDB Community
Tanel Põder
Computer Performance Nerd

Tanel Põder
A long-time computer performance nerd
■ I've been promoting a DB session- or OS thread-level performance diagnosis approach for decades now
■ P99CONF is the best performance conference these days! :-)
■ I'm originally from Estonia (a small country), and my company logo uses our flag's colors!
■ I still end up researching & testing out modern tech even in my free time (CXL.mem is my latest interest)

Method > Data Sources > Tools
Every system is a bunch of threads. Measure where they spend most of their time and do it less!
/proc/PID/task/TID, perf, ftrace, ...
"top" for wallclock time ... and much more!
eBPF

0x.tools
● /proc sampling
● works without eBPF
● even on very old Linux versions
● eBPF!
● see anything you want!
● PoC prototype with bcc
● work-in-progress
Extended Linux Thread State Sampling method

/proc sampling example (psn)
the fact of sampling: a thread seen in "active state"
sample attributes: (many) dimensions in a "fact table"
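The /proc sampling idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual psn implementation: the helper names (parse_stat, sample_once, summarize) are hypothetical, and treating only the R and D states as "active" is an assumption made for this sketch. Each sample is a "fact" (a thread was seen active) and its attributes (here just comm and state) are the dimensions to group by.

```python
import glob
from collections import Counter

# Assumption for this sketch: "active" means running (R) or uninterruptible sleep (D).
ACTIVE_STATES = {"R", "D"}


def parse_stat(line):
    """Extract (tid, comm, state) from one /proc/PID/task/TID/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    tid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return tid, comm, state


def sample_once(stat_glob="/proc/[0-9]*/task/[0-9]*/stat"):
    """Take one sample: one (comm, state) fact per thread seen in an active state."""
    facts = []
    for path in glob.glob(stat_glob):
        try:
            with open(path) as f:
                _tid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the thread exited between listing and reading
        if state in ACTIVE_STATES:
            facts.append((comm, state))
    return facts


def summarize(rounds):
    """Group facts by their dimensions; avg_threads = samples / sampling rounds."""
    if not rounds:
        return []
    counts = Counter(fact for facts in rounds for fact in facts)
    return [(n, n / len(rounds), comm, state)
            for (comm, state), n in counts.most_common()]
```

Calling sample_once() in a loop with a short sleep and passing the collected list to summarize() gives rows of (samples, average active threads, comm, state). More dimensions (syscall, username, kernel stack) would be added the same way, by reading more /proc files per thread.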

For systematic performance & troubleshooting work, I want to:
● See the full system activity (“active threads”)
● Not only system-wide utilization averages
● Not only on-CPU thread stacks, but all thread states (and offcpu stacks)
● With ability to drill down into each thread’s activity
● See what each thread of interest is doing, for whom and why (context)
● I/O & function call latencies tied to each thread & its context at the time
● All this without tracing & postprocessing every event for every thread!
Detailed full system activity without tracing every event?

eBPF example (xtop with bcc)
Each dimension attribute is linked to the same point in time! (*except oncpu)
"stacktiles" show the value of a stack_id

Extended Task State Array (very basic) example

How does it work?!
Two decoupled layers
● eBPF populating & maintaining the array
● Keep only the latest state change for each thread
● “Tracking, not tracing!”
● Sampling program independent from population
● Python/BCC, C, Rust/libbpf, eBPF iterators, etc...
● Multiple concurrent samplers allowed
● Different sampling frequencies allowed
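The two decoupled layers can be imitated in plain Python to show why "tracking, not tracing" keeps the data volume bounded. This is a toy model under stated assumptions, not the eBPF code: TaskState, TaskStateArray and sample are hypothetical names, and a dict stands in for the BPF map.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Hypothetical per-thread entry; a real entry would hold many more fields."""
    syscall_id: int = -1  # -1 means "not in a system call"


class TaskStateArray:
    """Layer 1: holds only the latest known state of each thread."""

    def __init__(self):
        self.slots = {}  # tid -> TaskState

    def sys_enter(self, tid, syscall_id):
        # Overwrite the thread's slot instead of appending an event to a log.
        self.slots.setdefault(tid, TaskState()).syscall_id = syscall_id

    def sys_exit(self, tid):
        self.slots.setdefault(tid, TaskState()).syscall_id = -1


def sample(tsa, keep=lambda tid, state: state.syscall_id != -1):
    """Layer 2: an independent sampler applying its own filter rule."""
    return {tid: st.syscall_id for tid, st in tsa.slots.items() if keep(tid, st)}


tsa = TaskStateArray()
# Thread 10 makes many short system calls between two samples ...
for _ in range(1000):
    tsa.sys_enter(10, 0)
    tsa.sys_exit(10)
# ... while thread 42 enters a long one (74 is fsync on x86_64).
tsa.sys_enter(42, 74)

in_syscall = sample(tsa)                             # default filter rule
everything = sample(tsa, keep=lambda tid, st: True)  # a second sampler, different rule
```

After 2,001 state changes the array still holds only two entries, one per thread. The first sampler sees just thread 42 waiting in fsync, and the second sampler reads the same array with a different filter, without the populating side knowing about either of them.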

Populating the extended task state array
[Figure: timeline (Time) for tid 10, tid 11 and tid 42 above the array slots 10, 11, 42, ... N; events from tid 10 update slot 10]
BPF_HASH(tsa, ...);
TRACEPOINT_PROBE(raw_syscalls, sys_enter)
{
    ...
    t->syscall_id = args->id;
    tsa.update(&tid, t);
    ...
}
TRACEPOINT_PROBE(raw_syscalls, sys_exit)
{
    ...
    t->syscall_id = -1;
    ...
}

[Figure: the same timeline and array slots 10, 11, 42, ... N; events from tid 11 update slot 11]

[Figure: the same timeline and array slots 10, 11, 42, ... N; events from tid 42 update slot 42]
We are not tracing: no logging or appending all events ...
We track: overwrite the task's current action in the extended task state array

Sampling the extended task state array
[Figure: timeline (Time) for tid 10, tid 11 and tid 42 above the array slots 10, 11, 42, ... N; each sample reads slots 10, 11, 42, ... N]
A separate, independent program samples the state array to userspace, using its own sampling frequency and filter rules:
tsa = b.get_table("tsa")  # b is the loaded bcc BPF object
for x in tsa.items():
    ...

Sampling the extended task state array
[Figure: the same timeline and array; each sample reads slots 10, 11, 42, ... N]
The sampler(s) can be eBPF client programs (bcc, libbpf) using the bpf() syscall, or a BPF task iterator with a perf_event queue

Always-on output logging (for time travel and advanced analytics)
$ ./xcapture-bpf -h
usage: xcapture-bpf [-h] [-x] [-d report_seconds] [-f SAMPLE_HZ] [-g csv-columns]
                    [-G append-csv-columns] [-n] [-N] [-c] [-V] [-o OUTPUT_DIR] [-l]
Always-on profiling of Linux thread activity using eBPF.
options:
  -h, --help            show this help message and exit
  -x, --xtop            Run in aggregated top-thread-activity (xtop) mode
  -d report_seconds     xtop report printing interval (default: 5s)
  -f SAMPLE_HZ, --sample-hz SAMPLE_HZ
                        xtop sampling frequency in Hz (default: 20)
  -g csv-columns, --group-by csv-columns
                        Full column list what to group by
  -G append-csv-columns, --append-group-by append-csv-columns
                        List of additional columns to default cols what to group by
  -n, --nerd-mode       Print out relevant stack traces as wide output lines
  -N, --giant-nerd-mode
                        Print out relevant stack traces as stacktiles
  -c, --clear-screen    Clear screen before printing next output
  -V, --version         Show the program version and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory path where to write the output CSV files
  -l, --list            list all available columns for display and grouping

Always-on output logging (for time travel and advanced analytics)
$ ls -l
total 236
-rw-r--r-- 1 root root 19080 Jul 12 17:30 stacks_2024-07-12.16.csv
-rw-r--r-- 1 root root 41061 Jul 12 17:00 threads_2024-07-12.16.csv
-rw-r--r-- 1 root root 162132 Jul 12 17:33 threads_2024-07-12.17.csv
$ grep -E "TIMESTAMP|mysql" threads_2024-07-12.17.csv | head
TIMESTAMP,ST,TID,PID,USERNAME,COMM,SYSCALL,CMDLINE,OFFCPU_U,OFFCPU_K,ONCPU_U,ONCPU_K,WAKER_TID,SCH
2024-07-12 17:14:16.798,R,1894,1836,mysql,ib_log_fl_notif,-,,-,-,14409,12280,0,___-
2024-07-12 17:22:44.575,D,1895,1836,mysql,ib_log_flush,fsync,/usr/sbin/mysqld,9692,24360,-,-,0,____
2024-07-12 17:22:48.778,D,1895,1836,mysql,ib_log_flush,fsync,/usr/sbin/mysqld,9692,24360,-,-,353,_-__
$ grep 9692 stacks_2024-07-12.16.csv
ustack 9692 ->71051cceabb4->std::thread::_State_impl->log_flusher->log_flush_low->Log_file_handle::fsync->os_file_flush_func->os_file_fsync_posix
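Because the logged samples are plain CSV, "time travel" analysis can be done with any tool that can group and count rows. The sketch below is one possible approach, not part of 0x.tools: group_samples is a hypothetical helper, and the embedded rows are the three sample rows from the grep output (the last header name is truncated on the slide and is kept as-is).

```python
import csv
import io
from collections import Counter

# Rows copied from the grep output; a real run would open the threads_*.csv file instead.
THREADS_CSV = """\
TIMESTAMP,ST,TID,PID,USERNAME,COMM,SYSCALL,CMDLINE,OFFCPU_U,OFFCPU_K,ONCPU_U,ONCPU_K,WAKER_TID,SCH
2024-07-12 17:14:16.798,R,1894,1836,mysql,ib_log_fl_notif,-,,-,-,14409,12280,0,___-
2024-07-12 17:22:44.575,D,1895,1836,mysql,ib_log_flush,fsync,/usr/sbin/mysqld,9692,24360,-,-,0,____
2024-07-12 17:22:48.778,D,1895,1836,mysql,ib_log_flush,fsync,/usr/sbin/mysqld,9692,24360,-,-,353,_-__
"""


def group_samples(csv_file, columns):
    """Count samples per distinct combination of the given columns."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[tuple(row[c] for c in columns)] += 1
    return counts


# What were the threads doing? Group by state, thread name and system call.
by_activity = group_samples(io.StringIO(THREADS_CSV), ["ST", "COMM", "SYSCALL"])

# Where were they waiting? Group by thread name and off-CPU stack ids.
offcpu_stacks = group_samples(io.StringIO(THREADS_CSV), ["COMM", "OFFCPU_U", "OFFCPU_K"])
```

The stack ids that dominate offcpu_stacks (9692 here) are the ones worth looking up in the matching stacks file, as the grep for 9692 does.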

Path to "IPC wait chains"?
$ sudo ./xcapture-bpf
Client – Server interaction
RDBMS commit "log file sync"

Things not yet implemented, but possible (it's eBPF, after all!)
Many components are already successfully implemented in other (eBPF) tools
● IPC wait chains (more research needed)
● RPC / trace_id / distributed tracing context propagation
● Sample & estimate I/O latencies for each captured thread that's off CPU
● Use these samples for analyzing various latencies across any "dimension"
● Read common SQL DB context (SQL text/hash, exec phase DB wait events)
● Read interpreted language/VM state (via perf.map or direct)

0x.tools future plans and hopes: xcapture-bpf v3.0
● Still just a method, datasource and a couple of tools, not a product or platform
● Production-grade, always on, focus on compiled binaries & perf.map capable runtimes
● Use BTF, CO-RE and libbpf instead of bcc
● Use BPF task iterators for sampling kernel-maintained task fields (no field duplication)
● Use BPF_MAP_TYPE_TASK_STORAGE for all the additional (extended context) structures
● Use bpf_get_stack (not bpf_get_stackid): flexible, no need for large stack-maps in kernel memory
● Use BlazeSym as the build-id aware symbolizer (OSS by Meta, written in Rust)
● Feed output to common metrics/monitoring/visualization tools (which metric type?!)
● Contribute/integrate with OpenTelemetry agent (if/when the time is right)?
Modern libbpf dev help is appreciated!

● 0x.tools
● tanelpoder.com
● tanel@tanelpoder.com
● @tanelpoder
Thank You!
