BPF

BPF stands for Berkeley Packet Filter. It was originally a framework providing flexible packet filtering for tcpdump and similar tools. It uses its own bytecode; however, Linux eBPF is now able to convert that bytecode to real CPU instructions (so expressions are evaluated at native CPU speed).

Today it has become a universal framework for tracing the kernel.

Please see https://www.brendangregg.com/ebpf.html for a quick overview.

BCC versus bpftrace:

Explained at https://github.com/iovisor/bcc/pull/4173/files
Quote: bpftrace is at github.com/iovisor/bpftrace, and is a high-level front-end for BPF tracing, which uses libraries from bcc. bpftrace is ideal for ad hoc instrumentation with powerful custom one-liners and short scripts, whereas bcc is ideal for complex tools and daemons.
BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and lua. It is suited for many tasks, including performance analysis and network traffic control.

bpftrace programs generally look a lot like awk and have the extension .bt
BCC tools are mostly written in Python, but the whole structure looks quite clumsy to me: there are long multi-line strings of BPF code followed by a lot of Python glue...

Unfortunately there is some tension: new examples for bpftrace are no longer accepted and users are pointed to bcc, as can be seen at:

https://github.com/bpftrace/bpftrace/issues/2616
KVM example not merged:
- https://github.com/bpftrace/bpftrace/pull/2015/files

Installing BPF tools

Example is for Proxmox VE 8.2.4, kernel 6.8.12-1-pve, which is more-or-less Debian 12:

Disclaimer: BPF uses deep internals of the Linux kernel (there is no official stable API). So it is quite normal that an existing script will break with the current kernel...

There are actually two sets of examples that can be installed with:

apt-get install bpfcc-tools
apt-get install bpftrace

The first command installs example scripts as /usr/sbin/*-bpfcc. The latter command installs /usr/sbin/*.bt

Let's try the first example, which globally snoops newly executed processes (must be run as root):

$ /usr/sbin/execsnoop.bt

Attaching 3 probes...
TIME(ms)   PID   ARGS
4093       9736  nft -j -f -
4093       9736  nft -j -f -
4093       9736  nft -j -f -
4101       9737  nft -j -f -
4101       9737  nft -j -f -
4101       9737  nft -j -f -
6766       9740  iptables-save
6782       9741  iptables-restore -T filter -n
6797       9742  ip6tables-save

When you look inside, you can see that the source code is very short:

#!/usr/bin/env bpftrace
/*
 * execsnoop.bt   Trace new processes via exec() syscalls.
 *                For Linux, uses bpftrace and eBPF.
 * ...
 */
BEGIN
{
        printf("%-10s %-5s %s\n", "TIME(ms)", "PID", "ARGS");
}

tracepoint:syscalls:sys_enter_exec*
{
        printf("%-10u %-5d ", elapsed / 1e6, pid);
        join(args->argv);
}

You can list the tracepoints used with:

$ bpftrace -l 'tracepoint:syscalls:sys_enter_exec*'

tracepoint:syscalls:sys_enter_execve
tracepoint:syscalls:sys_enter_execveat

NOTE: The message says Attaching 3 probes... because bpftrace counts the BEGIN block as a probe in addition to the two matched tracepoints.

When it is broken

Here is an example from the same Proxmox VE environment:

$ /usr/sbin/swapin.bt

/usr/sbin/swapin.bt:1-16: WARNING: swap_readpage is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 2 probes...
cannot attach kprobe, probe entry may not exist
ERROR: Error attaching probe: 'kprobe:swap_readpage'

WARNING!

Although it looks obvious that kprobe:swap_readpage was replaced with kprobe:swap_read_page, that is NOT the case,
as can be found on: https://lore.kernel.org/all/[email protected]/T/
- git web url: https://git.proxmox.com/?p=mirror_ubuntu-kernels.git;a=patch;h=c9bdf768dd9319d2d80a334646e2c8116af9e430
It was actually replaced by kprobe:swap_read_folio, as can be verified in the BCC version of swapin.py: https://github.com/iovisor/bcc/commit/36b40102cc94f537cdbcea35c5cc766a53643c1d

So the proper patch is this:

diff -u broken/swapin.bt swapin.bt
--- broken/swapin.bt    2024-08-10 09:58:50.396672650 +0200
+++ swapin.bt   2024-08-12 16:35:47.276154899 +0200
@@ -13,7 +13,7 @@
  * 26-Jan-2019  Brendan Gregg   Created this.
  */
 
-kprobe:swap_readpage
+kprobe:swap_read_folio
 {
         @[comm, pid] = count();
 }
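Before patching a script, it can help to check which of the candidate function names the running kernel knows at all. Below is a minimal Python sketch that scans /proc/kallsyms-style text for the first matching symbol. The helper name pick_symbol and the inline sample text are illustrative assumptions. A symbol that appears in kallsyms may still be untraceable (inlined or marked notrace), so bpftrace -l remains the real check.

```python
def pick_symbol(candidates, kallsyms_text):
    """Return the first candidate present as a symbol name, else None.

    Each /proc/kallsyms line looks like: "<address> <type> <name> [module]".
    """
    present = set()
    for line in kallsyms_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            present.add(parts[2])
    for name in candidates:
        if name in present:
            return name
    return None


# Hypothetical sample; on a real host pass the contents of /proc/kallsyms.
sample = (
    "0000000000000000 T swap_read_folio\n"
    "0000000000000000 t some_module_fn\t[some_module]\n"
)
print(pick_symbol(["swap_readpage", "swap_read_folio"], sample))
```

On a live system the second argument would be open("/proc/kallsyms").read(); the candidate order expresses which name you prefer when several exist.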

First verify with vmstat -y 1 that the si (swap in) column is non-zero. Then run the patched version of swapin.bt:

$ ./swapin.bt

@[pve-firewall, 1617]: 2
16:52:49
@[pvestatd, 1625]: 24
16:52:50
@[pvedaemon worke, 1647]: 1
@[pveproxy worker, 1658]: 1
@[server, 1515]: 1

When it is really broken

Let's try a more advanced example:

$ /usr/sbin/biosnoop.bt

/usr/sbin/biosnoop.bt:21-23: WARNING: blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:23-24: WARNING: __blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:30-32: WARNING: blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:32-33: WARNING: __blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 6 probes...
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_done, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_done, skipping.
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_start, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_start, skipping.
TIME(ms)     DISK    COMM             PID    LAT(ms)
 Ctrl-C

In such a case there is no obvious replacement candidate and we have to resort to the standard option, e.g., using Google:

https://github.com/iovisor/bcc/issues/4261
There is a reference to the breaking kernel commit: https://github.com/torvalds/linux/commit/450b7879e34517c3ebc3a35a53806fe40e60fac2
What is even worse, a global function was replaced with a static one, so it is no longer reachable...
A proper patch was submitted here: https://www.spinics.net/lists/linux-block/msg90434.html
- and https://www.spinics.net/lists/linux-block/msg100873.html
but I am not sure when (and if) it reaches the standard kernel...
This commit should contain the fixed version
- (python): https://github.com/iovisor/bcc/commit/ed1c98047e0c68126bfb04c8c3ea2455f3b7674d

It seems that the Proxmox kernel has these new tracepoints (-v reveals the tracepoint arguments):

$ uname -r

6.8.12-1-pve # <<< kernel version

# replaces kprobe:blk_account_io_start, kprobe:__blk_account_io_start
$ bpftrace -l 'tracepoint:*block_io_start*' -v

tracepoint:block:block_io_start
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

# replaces kprobe:blk_account_io_done, kprobe:__blk_account_io_done

$ bpftrace -l 'tracepoint:*block_io_done' -v

tracepoint:block:block_io_done
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

There is an open pull request that should fix it:

https://github.com/bpftrace/bpftrace/pull/2875/files

Custom script - I/O sector alignment

Goal: write a script that checks I/O sector alignment.

We will use tracepoint:block:block_io_start for that.

First inspect the data of tracepoint:block:block_io_start:

$ bpftrace -lv tracepoint:block:block_io_start
tracepoint:block:block_io_start
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

Also:

$ cat /sys/kernel/debug/tracing/events/block/block_io_start/format
name: block_io_start
ID: 1257
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:dev_t dev;	offset:8;	size:4;	signed:0;
	field:sector_t sector;	offset:16;	size:8;	signed:0;
	field:unsigned int nr_sector;	offset:24;	size:4;	signed:0;
	field:unsigned int bytes;	offset:28;	size:4;	signed:0;
	field:char rwbs[8];	offset:32;	size:8;	signed:0;
	field:char comm[16];	offset:40;	size:16;	signed:0;
	field:__data_loc char[] cmd;	offset:56;	size:4;	signed:0;

print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm
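The field list in the format file is regular enough to parse mechanically. Here is a minimal Python sketch that turns the field: lines into name/offset/size records; the function name parse_format_fields and the regular expression are my own assumptions about the layout shown above, not an official parser.

```python
import re

# Matches lines such as:
#   field:char rwbs[8];  offset:32;  size:8;  signed:0;
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format_fields(text):
    """Return a list of dicts describing each field in a tracefs format file."""
    fields = []
    for m in FIELD_RE.finditer(text):
        decl = m.group("decl").strip()
        # The name is the last token of the declaration, without "[N]".
        name = decl.split()[-1].split("[")[0]
        fields.append({
            "name": name,
            "decl": decl,
            "offset": int(m.group("offset")),
            "size": int(m.group("size")),
            "signed": m.group("signed") == "1",
        })
    return fields


# Two lines copied from the format output above.
sample = (
    "\tfield:dev_t dev;\toffset:8;\tsize:4;\tsigned:0;\n"
    "\tfield:sector_t sector;\toffset:16;\tsize:8;\tsigned:0;\n"
)
for f in parse_format_fields(sample):
    print(f["name"], f["offset"], f["size"])
```

This only covers the field list; the print fmt expression itself is C and is not interpreted here.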

I did not find a quick way to reuse that print fmt without rewriting it in bpftrace...

Quick one-liner test:

$ bpftrace -e 'tracepoint:block:block_io_start { printf("sector=%x nr=%u bytes=%u comm=%s\n", args->sector,args->nr_sector, args->bytes, args->comm); }'

sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=212e3288 nr=8 bytes=4096 comm=jbd2/dm-1-8
sector=210e2aa0 nr=0 bytes=0 comm=jbd2/dm-1-8
sector=212e32a0 nr=8 bytes=4096 comm=kworker/0:7
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200a00 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=lvs
sector=200a00 nr=256 bytes=131072 comm=lvs

Hmm, but why is bytes sometimes 0?
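With output in this shape, the alignment check itself is simple arithmetic. The following Python sketch post-processes lines from the one-liner above; it assumes the block layer's 512-byte sector unit and a 4 KiB alignment target (adjust align_bytes for your storage), and the function names are hypothetical.

```python
import re

SECTOR_SIZE = 512  # block-layer sector unit in bytes

# sector is printed with %x in the one-liner, so it is parsed as hex.
LINE_RE = re.compile(r"sector=(?P<sector>[0-9a-f]+) nr=(?P<nr>\d+) bytes=(?P<bytes>\d+)")


def is_aligned(sector, nr_sector, align_bytes=4096):
    """True if both the start sector and the length are multiples of align_bytes."""
    per = align_bytes // SECTOR_SIZE
    return sector % per == 0 and nr_sector % per == 0


def misaligned(lines, align_bytes=4096):
    """Return the input lines whose I/O is not aligned to align_bytes."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        nr = int(m.group("nr"))
        if nr == 0:
            continue  # no data sectors, nothing to align
        if not is_aligned(int(m.group("sector"), 16), nr, align_bytes):
            out.append(line)
    return out


sample = [
    "sector=212e3288 nr=8 bytes=4096 comm=jbd2/dm-1-8",
    "sector=200800 nr=256 bytes=131072 comm=vgs",
]
print(misaligned(sample))
```

Requests with nr=0 are skipped on purpose, since they transfer no sectors.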

Another problem:

unable to get meaningful major,minor from dev:

tracepoint:block:block_io_start
{
   // from include/linux/kdev_t.h
   $major = ( (args->dev) & 0xfff00) >> 8;
   $minor = ( (args->dev) & 0xff) | (( (args->dev) >> 12) & 0xfff00);
   //$major = (args->dev) >> 8;
   //$minor = (args->dev) & 0xff;
   printf("sz=%d dev=0x%llx dev=%u,%u ", sizeof(args->dev), args->dev, $major, $minor);
   printf("sector=%x nr=%u bytes=%u comm=%s rwbs=%s\n",
         args->sector,args->nr_sector, args->bytes, args->comm,args->rwbs);
}

Getting:

sz=4 dev=0x800000 dev=0,2048 sector=318c910 nr=32 bytes=16384 comm=kworker/u4:0 rwbs=W
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N

Being so desperate, I also tried perf (under Fedora, install it with dnf install perf instead of linux-perf):

$ apt-get install linux-perf
$ perf list | fgrep block_rq
$ perf trace -e 'block:block_io_start'

     0.000 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
  1216.080 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 556762200, nr_sector: 8, bytes: 4096, rwbs: "WSM", comm: "jbd2/dm-1-8", cmd: "")
  1216.626 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 554663064, rwbs: "FWFSM", comm: "jbd2/dm-1-8", cmd: "")
  1227.444 kworker/0:7-ev/1050 block:block_io_start(dev: 8388608, sector: 556762264, nr_sector: 8, bytes: 4096, rwbs: "WFSM", comm: "kworker/0:7", cmd: "")
  2047.993 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
  4095.984 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
  4800.087 kworker/u4:22-/230 block:block_io_start(dev: 8388608, sector: 78316928, nr_sector: 16, bytes: 8192, rwbs: "W", comm: "kworker/u4:22", cmd: "")
  6143.982 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
  6152.624 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099200, nr_sector: 256, bytes: 131072, rwbs: "R", comm: "vgs", cmd: "")
  6153.935 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099712, nr_sector: 256, bytes: 131072, rwbs: "R", comm:

$ perf trace -e 'block:block_rq_complete'

     0.000 :0/0 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
  2049.028 pve-firewall/1613 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
  2278.587 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
  2356.410 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
  3676.119 :0/0 block:block_rq_complete(dev: 8388608, sector: 556755032, nr_sector: 264, rwbs: "WSM", cmd: "")
  3693.873 :0/0 block:block_rq_complete(dev: 8388608, sector: -1, rwbs: "FF", cmd: "")
  3694.072 :0/0 block:block_rq_complete(dev: 8388608, sector: 554656096, rwbs: "WSM", cmd: "")
  3694.316 kworker/u4:22-/230 block:block_rq_complete(dev: 8388608, sector: 556755296, nr_sector: 8, rwbs: "WSM", cmd: "")

Find kernel calls from process

Here is an example from Fedora, installing Windows Server 2022 in libvirt + KVM:

cd /usr/share/bcc/tools
./syscount -p PID_OF_QEMU

...

Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[17:16:35]
SYSCALL                   COUNT
clock_gettime           1021651
ioctl                    753043
futex                    256742
gettimeofday             192409
ppoll                     74836
write                      7517
read                       3994
poll                       2385
recvmsg                     960
writev                      671

Or:

cd /usr/share/bcc/tools
./ext4slower -p PID_OF_QEMU

...
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
17:20:22 worker         5315   W 686080  9947504    12.87 win2k22.qcow2
17:20:22 worker         5315   W 753152  9948176    14.61 win2k22.qcow2
17:20:22 worker         5315   W 88576   9949524    11.28 win2k22.qcow2
17:20:22 worker         5315   R 131072  5088416    19.31 en-us_windows_server_2022_updated_july_2
17:20:24 worker         5315   S 0       0         488.64 win2k22.qcow2
17:20:25 worker         5315   S 0       0         877.78 win2k22.qcow2
17:20:26 worker         5315   R 131072  5100576    13.92 en-us_windows_server_2022_updated_july_2
17:20:26 worker         5315   R 131072  5106976    11.32 en-us_windows_server_2022_updated_july_2
17:20:28 worker         5315   R 4096    3260976    13.75 win2k22.qcow2
...

Now I have to understand what that means...

Flame graphs: quick start

Here is my quick start on making flame graphs, following Brendan Gregg's guides.

Tested Host environment:

Proxmox VE 8.2.4
old dual-core Opteron X2 (Generation 2), MSI-7250 MB
8 GB RAM
AHCI, 4TB disk Seagate IronWolf

NOTE: The host CPU does not support HAP (hardware-assisted paging), so any access to a new page is costly (it causes a trap and the host must set up so-called "shadow pages").

Guest VM booted:

Windows Server 2022, BIOS, 1 CPU, 4GB RAM, 35GB disk Virtio BLK on ext4/raw, disabled Network (to stop Windows update affecting tests)

Install:

apt-get install bpftrace bpfcc-tools git-core
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/brendangregg/FlameGraph.git
mkdir -p ~/bpf

CPU Flame graph

How to get a CPU flame graph

from https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

I start the VM (id=108 in this example) in Proxmox and run the following while the VM is booting:

cd ~/bpf

id=108
ulimit -n 65536
profile-bpfcc -F 99 -adf --stack-storage-size $(( 2 * 1024 * 1024 )) 120 > boot-$id.profile-folded
~/projects/FlameGraph/flamegraph.pl boot-$id.profile-folded > /usr/share/pve-manager/images/cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/cpu-$id.svg"

Then point your browser to https://PROXMOX_IP:8006/pve2/images/cpu-108.svg.

Here is an example graph:

CPU flame chart booting VMID

Click for SVG graph

NOTE: kvm_mmu_page_fault dominates this chart.

NOTE: GitHub blocks JavaScript in SVG images, so the graph is not interactive there. To make it interactive, publish it somewhere that serves it with the correct Content-Type.
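To put a number on "dominates", the folded file can be summarized directly. Each folded line is a semicolon-separated stack followed by a space and a sample count. The sketch below is a minimal Python helper (frame_share is a hypothetical name) that computes the fraction of samples whose stack contains a given frame; it assumes the -a option of profile appends _[k] to kernel frames and strips that suffix before comparing.

```python
def frame_share(folded_lines, frame):
    """Fraction of samples (0.0 to 1.0) whose stack contains `frame`."""
    total = 0
    hits = 0
    for line in folded_lines:
        stack, _, count = line.strip().rpartition(" ")
        if not count.isdigit():
            continue  # skip blank or malformed lines
        n = int(count)
        total += n
        # Drop the kernel annotation so "foo_[k]" matches "foo".
        frames = [f[:-4] if f.endswith("_[k]") else f for f in stack.split(";")]
        if frame in frames:
            hits += n
    return hits / total if total else 0.0


# Hypothetical folded lines, not taken from the real profile.
sample = [
    "kvm;ioctl;kvm_mmu_page_fault_[k];some_helper_[k] 3",
    "kvm;ioctl;other_path_[k] 1",
]
print(frame_share(sample, "kvm_mmu_page_fault"))
```

On the real data you would pass open("boot-108.profile-folded") as folded_lines.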

Off-CPU Flame graph

Quick Off-CPU chart, following:

https://www.brendangregg.com/offcpuanalysis.html

cd ~/bpf

id=108

ulimit -n 65536
/usr/sbin/offcputime-bpfcc -df 120 > off-cpu-$id.stacks
~/projects/FlameGraph/flamegraph.pl --color=io \
       	--title="Off-CPU Time Flame Graph" \
       	--countname=us < off-cpu-$id.stacks > /usr/share/pve-manager/images/off-cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/off-cpu-$id.svg"

Here is an example graph:

Off-CPU flame chart booting VMID

Click for SVG graph

NOTE: GitHub blocks JavaScript in SVG images, so the graph is not interactive there. To make it interactive, publish it somewhere that serves it with the correct Content-Type.
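The off-CPU stacks file can also be summarized without rendering it. In the folded output of offcputime, the first element of each line is the process name and the trailing number is the blocked time in microseconds (matching --countname=us above). The following Python sketch totals that time per process; offcpu_by_comm is a hypothetical helper name and the sample lines are made up.

```python
from collections import defaultdict


def offcpu_by_comm(folded_lines):
    """Sum off-CPU microseconds per process name from folded stacks."""
    totals = defaultdict(int)
    for line in folded_lines:
        # Split on the last space: process names may contain spaces.
        stack, _, value = line.strip().rpartition(" ")
        if not value.isdigit() or not stack:
            continue
        comm = stack.split(";", 1)[0]
        totals[comm] += int(value)
    return dict(totals)


# Hypothetical folded lines, not taken from a real capture.
sample = [
    "kvm;schedule;futex_wait 100",
    "kvm;schedule;io_schedule 50",
    "pveproxy worker;schedule 7",
]
for comm, us in sorted(offcpu_by_comm(sample).items(), key=lambda kv: -kv[1]):
    print(comm, us)
```

On the real data you would pass open("off-cpu-108.stacks") as folded_lines.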

Resources

Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
