
Tracing with BPF


Often we need to measure the performance of software somewhere in the stack, such as the Linux block layer. BPF (Berkeley Packet Filter) is a handy tool for exactly this scenario, allowing for detailed analysis of the software. While it was originally designed for networking, as the name implies, it is now present and usable across all major parts of the Linux kernel, especially since the introduction of extended BPF (eBPF).

At its core, BPF lets us write sandboxed programs that observe kernel activity without modifying the kernel source code or writing and loading custom kernel modules. With this we can build applications that insert monitoring probes into the kernel without sacrificing security. Besides being beneficial for performance profiling and for learning what actually goes on in the system, this is also useful for debugging applications. In this post we will walk through an example of coding a BPF script for tracing NVMe commands to a ZNS device (zoned namespace SSD, see here for more: Zoned Storage Docs).

In order to use BPF, the kernel must have support for it and the required libraries have to be installed. See here for detailed install instructions. For more details about BPF, consult its webpage.

bpftrace Script

For coding tracing scripts, we will use bpftrace. It has an extensive guide covering all of its language constructs, available here. We will cover only certain aspects of it; however, quickly glancing over the guide will give a general impression of how to code with BPF.

Firstly, bpftrace scripts have the .bt extension and can be run with sudo bpftrace ./[script_path].bt (running always requires root privileges). People familiar with C will have an easy time getting used to bpftrace, as the constructs are all very similar. This example walks through the process of how we coded a script for tracing several ZNS commands, available here.
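
As a minimal sketch of the workflow (the probe and map used here are just an example), the following could be saved as hello.bt and run with sudo bpftrace ./hello.bt:

// hello.bt: count syscalls per process until Ctrl-C is hit
BEGIN {
    printf("Tracing... hit Ctrl-C to stop\n");
}

tracepoint:raw_syscalls:sys_enter {
    // comm is a bpftrace builtin holding the current process name
    @syscalls[comm] = count();
}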

kprobes

A kprobe is what we will utilize for identifying ZNS commands. These probes are triggered whenever a kernel function is called. Therefore, when we insert a kprobe for a particular kernel function, it will fire every time the function is called, and from there we can extract the function's arguments, take timing measurements if we want to trace call duration, and analyze its call stack.
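
For example (a minimal sketch), counting how often a kernel function fires takes only a one-liner:

sudo bpftrace -e 'kprobe:nvme_setup_cmd { @calls = count(); }'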

In order to see the available functions that we can attach to, we can run:

user@stosys:~/src/BPF-Stosys-scripts$ sudo bpftrace -l | grep nvme_setup
kprobe:__traceiter_nvme_setup_cmd
kprobe:nvme_setup_cmd
kprobe:nvme_setup_discard
kprobe:nvme_setup_host_mem
kprobe:nvme_setup_io_queues
kprobe:nvme_setup_zone_mgmt_send
tracepoint:nvme:nvme_setup_cmd

From here we can see that we can use nvme_setup_cmd to insert our kprobe, which will fire every time an NVMe command is set up. Now we need to know what this function looks like (what arguments it takes), so that we can disassemble it and extract the information we need. We can do this by looking at the kernel source code, where we find the following (nvme source code, core.c:912):

blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
{
     struct nvme_command *cmd = nvme_req(req)->cmd;

showing us that the function has two arguments. Next, we can retrieve the information stored behind the nvme_command *cmd pointer. To do this, we create a probe that stores the command in a local variable (these are referred to as scratch variables in bpftrace, are prefixed with $, and are only valid in the local scope of the probe).

// k: indicates it is a kprobe
k:nvme_setup_cmd {
    // arg1 is the second argument (the struct request *); the command
    // pointer is stored directly behind the struct request
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
}

We can see the notation is similar to C. The function call had two arguments, where the second one is of type struct request *; arg1 is the primitive bpftrace provides to access the second argument (arguments are numbered from zero). We go to the address of the struct request and add its size, which puts us at the driver's per-request data, whose first field is the pointer to the struct nvme_command; we dereference it, cast it, and store it. Next, we'll retrieve some information from the command's fields.
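
The reason this pointer arithmetic works is that the block layer places the driver's per-request data directly behind the struct request, and for NVMe the first field of that data is the pointer to the command. Abridged from the kernel source (drivers/nvme/host/nvme.h and include/linux/blk-mq.h; fields vary slightly across kernel versions):

struct nvme_request {
        struct nvme_command     *cmd;
        union nvme_result       result;
        /* ... */
};

static inline struct nvme_request *nvme_req(struct request *req)
{
        return blk_mq_rq_to_pdu(req);   /* == (void *)(req + 1) */
}

So *(arg1 + sizeof(struct request)) reads exactly the cmd pointer that nvme_setup_cmd itself uses.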

Dissecting NVMe Commands

Now, we want to dissect the command, which we can do by identifying the structure of struct nvme_command:

struct nvme_command {
	union {
		struct nvme_common_command common;
		struct nvme_rw_command rw;
		struct nvme_identify identify;
		struct nvme_features features;
		struct nvme_create_cq create_cq;
		struct nvme_create_sq create_sq;
		struct nvme_delete_queue delete_queue;
		struct nvme_download_firmware dlfw;
		struct nvme_format_cmd format;
		struct nvme_dsm_cmd dsm;
		struct nvme_write_zeroes_cmd write_zeroes;
		struct nvme_zone_mgmt_send_cmd zms;
		struct nvme_zone_mgmt_recv_cmd zmr;
		struct nvme_abort_cmd abort;
		struct nvme_get_log_page_command get_log_page;
		struct nvmf_common_command fabrics;
		struct nvmf_connect_command connect;
		struct nvmf_property_set_command prop_set;
		struct nvmf_property_get_command prop_get;
		struct nvmf_auth_common_command auth_common;
		struct nvmf_auth_send_command auth_send;
		struct nvmf_auth_receive_command auth_receive;
		struct nvme_dbbuf dbbuf;
		struct nvme_directive_cmd directive;
	};
};

From here we can extend our code to retrieve some more information from the command. This is, again, very similar to member access in C.

BEGIN {
    // Our zone size in 512B sectors
    @ZONE_MASK = ~(4194304 - 1);
}

k:nvme_setup_cmd {
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;

    $secnum = $nvme_cmd->rw.slba;
    // Bitwise And to get zone starting LBA with zone MASK
    $zlbas = ($secnum & @ZONE_MASK);
}

Notice several additions. Firstly, we define global variables at the beginning of the script in a BEGIN block. Next, we get the opcode of the command. This indicates the type of operation and can be identified with the kernel's nvme_opcode enum:

enum nvme_opcode {
    nvme_cmd_flush          = 0x00,
    nvme_cmd_write          = 0x01,
    nvme_cmd_read           = 0x02,
    nvme_cmd_write_uncor    = 0x04,
    nvme_cmd_compare        = 0x05,
    nvme_cmd_write_zeroes   = 0x08,
    nvme_cmd_dsm            = 0x09,
    nvme_cmd_verify         = 0x0c,
    nvme_cmd_resv_register  = 0x0d,
    nvme_cmd_resv_report    = 0x0e,
    nvme_cmd_resv_acquire   = 0x11,
    nvme_cmd_resv_release   = 0x15,
    nvme_cmd_zone_mgmt_send = 0x79,
    nvme_cmd_zone_mgmt_recv = 0x7a,
    nvme_cmd_zone_append    = 0x7d,
};

Secondly, we now identify the sector number that the request targets, given in the sector addressing of the device. Lastly, and this is particular to ZNS devices, we want to identify which zone the address maps into. For this we can simply take the zone mask and calculate the starting LBA of that zone. The zone mask is the zone size (in sectors) minus 1, with all bits flipped by a bitwise NOT. Applying a bitwise AND of the zone mask and the address then yields the zone's LBAS (starting LBA). This works because flipping the bits of zone size - 1 produces a mask that clears exactly the low bits encoding the offset within a zone, leaving only the bits that identify the zone's start.
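
As a quick sanity check of the arithmetic with the zone size used above (4194304 sectors = 0x400000):

@ZONE_MASK = ~(4194304 - 1)   // = ~0x3FFFFF, i.e. the low 22 bits cleared
slba = 4299776                // = 0x419C00, 105472 sectors into the second zone
slba & @ZONE_MASK = 4194304   // = 0x400000, the starting LBA of that zone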

Filtering Commands per Device

It may also be beneficial to only collect commands that target a particular device. For this, bpftrace provides filters, which are checked whenever a probe is triggered. We apply one by adding the filter condition after the probe declaration; here we match the device name "nvme0n2":

BEGIN {
    // Our zone size in 512B sectors
    @ZONE_MASK = ~(4194304 - 1);
}
k:nvme_setup_cmd / ((struct request *)arg1)->q->disk->disk_name == "nvme0n2" / {
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;

    $secnum = $nvme_cmd->rw.slba;
    // Bitwise And to get zone starting LBA with zone MASK
    $zlbas = ($secnum & @ZONE_MASK);
}

Collecting Statistics

With this we can already collect statistics on the commands, such as how many read/write commands are issued, how much data is being read/written, etc. For this we use global maps, which bpftrace uses to store data and which are printed as output when the script finishes. These are accessed with an @ before the variable name. Adding a simple counter for read operations looks as follows:

BEGIN {
    // Our zone size in 512B sectors
    @ZONE_MASK = ~(4194304 - 1);
}
k:nvme_setup_cmd / ((struct request *)arg1)->q->disk->disk_name == "nvme0n2" / {
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;

    $secnum = $nvme_cmd->rw.slba;
    // Bitwise And to get zone starting LBA with zone MASK
    $zlbas = ($secnum & @ZONE_MASK);

    if($opcode == nvme_cmd_read) {
        // Count the commands per zone
        @z_rw_ctr_map[$zlbas]++;
    }
}

This stores a counter for each zone, keyed by the zone's starting LBA.
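
We could similarly track how much data is written per zone. A sketch of an extra branch for the same probe body (assuming the rw.length field, which for NVMe read/write commands holds the zero-based number of logical blocks; the map name is ours):

if($opcode == nvme_cmd_write) {
    // rw.length is zero-based, so +1 gives the number of logical blocks
    @z_write_blocks_map[$zlbas] += $nvme_cmd->rw.length + 1;
}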

NVMe in Passthrough

When the NVMe device is used in passthrough mode, operations such as zone resets are not identifiable in this way, requiring us to look further into the command arguments. For this, we can use the request operation flags to identify zone reset commands. These flags can similarly be used to track other zone management commands (e.g., opening/closing zones).

enum req_opf {
    /* read sectors from the device */
    REQ_OP_READ     = 0,
    /* write sectors to the device */
    REQ_OP_WRITE        = 1,
    /* flush the volatile write cache */
    REQ_OP_FLUSH        = 2,
    /* discard sectors */
    REQ_OP_DISCARD      = 3,
    /* securely erase sectors */
    REQ_OP_SECURE_ERASE = 5,
    /* write the zero filled sector many times */
    REQ_OP_WRITE_ZEROES = 9,
    /* Open a zone */
    REQ_OP_ZONE_OPEN    = 10,
    /* Close a zone */
    REQ_OP_ZONE_CLOSE   = 11,
    /* Transition a zone to full */
    REQ_OP_ZONE_FINISH  = 12,
    /* write data at the current zone write pointer */
    REQ_OP_ZONE_APPEND  = 13,
    /* reset a zone write pointer */
    REQ_OP_ZONE_RESET   = 15,
    /* reset all the zone present on the device */
    REQ_OP_ZONE_RESET_ALL   = 17,

    /* Driver private requests */
    REQ_OP_DRV_IN       = 34,
    REQ_OP_DRV_OUT      = 35,

    REQ_OP_LAST,
};

Passthrough mode sets the request operation to REQ_OP_DRV_OUT. To identify the actual command operation we can add the following:

BEGIN {
    // Our zone size in 512B sectors
    @ZONE_MASK = ~(4194304 - 1);
}
k:nvme_setup_cmd / ((struct request *)arg1)->q->disk->disk_name == "nvme0n2" / {
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;

    $secnum = $nvme_cmd->rw.slba;
    // Bitwise And to get zone starting LBA with zone MASK
    $zlbas = ($secnum & @ZONE_MASK);

    if($opcode == nvme_cmd_read) {
        // Count the commands per zone
        @z_rw_ctr_map[$zlbas]++;
    }

    $cmd = (((struct request *)arg1)->cmd_flags & REQ_OP_MASK);
    if((($cmd == REQ_OP_DRV_OUT && $opcode == nvme_cmd_zone_mgmt_send) && $nvme_cmd->zms.zsa == NVME_ZONE_RESET)) {
        @z_reset_ctr_map[$zlbas]++;
    }
}

Here we check whether the request operation indicates passthrough mode and a ZNS management command is being issued; we then read the zone send action (zsa) field to identify the operation. To obtain $cmd we take the cmd_flags field of the struct request * argument and mask out the operation bits with REQ_OP_MASK. We can retrieve the layout of the structure with:

user@stosys:~/src/BPF-Stosys-scripts$ sudo bpftrace -lv "struct request"
struct request {
        struct request_queue *q;
        struct blk_mq_ctx *mq_ctx;
        struct blk_mq_hw_ctx *mq_hctx;
        unsigned int cmd_flags;
        req_flags_t rq_flags;
        int tag;
        int internal_tag;
        unsigned int timeout;
        unsigned int __data_len;
        sector_t __sector;
        struct bio *bio;
        struct bio *biotail;
        union {
                struct list_head queuelist;
                struct request *rq_next;
        };
        struct block_device *part;
        u64 alloc_time_ns;
        u64 start_time_ns;
        u64 io_start_time_ns;
        short unsigned int wbt_flags;
        short unsigned int stats_sectors;
        short unsigned int nr_phys_segments;
        short unsigned int nr_integrity_segments;
        struct bio_crypt_ctx *crypt_ctx;
        struct blk_crypto_keyslot *crypt_keyslot;
        short unsigned int write_hint;
        short unsigned int ioprio;
        enum mq_rq_state state;
        atomic_t ref;
        long unsigned int deadline;
        union {
                struct hlist_node hash;
                struct llist_node ipi_list;
        };
        union {
                struct rb_node rb_node;
                struct bio_vec special_vec;
                void *completion_data;
        };
        union {
                struct {
                        struct io_cq *icq;
                        void *priv[2];
                } elv;
                struct {
                        unsigned int seq;
                        struct list_head list;
                        rq_end_io_fn *saved_end_io;
                } flush;
        };
        union {
                struct __call_single_data csd;
                u64 fifo_time;
        };
        rq_end_io_fn *end_io;
        void *end_io_data;
};

This works for most structs we want to inspect (strangely, struct nvme_command is not available there). If a struct is not listed, we can check the kernel source code instead. With the request operation we can now count the number of zone resets being issued, even when the ZNS device is in passthrough mode.

Tracing Operation Latency

Another key aspect is tracing the latency of commands. We will cover how to measure the latency of zone reset commands. For this we simply add an additional kprobe, k:nvme_complete_rq, on the completion of NVMe requests, and add code to keep track of time. Conveniently, bpftrace has primitives for reading time.

BEGIN {
    // Our zone size in 512B sectors
    @ZONE_MASK = ~(4194304 - 1);
}
k:nvme_setup_cmd / ((struct request *)arg1)->q->disk->disk_name == "nvme0n2" / {
    $nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;

    $secnum = $nvme_cmd->rw.slba;
    // Bitwise And to get zone starting LBA with zone MASK
    $zlbas = ($secnum & @ZONE_MASK);

    if($opcode == nvme_cmd_read) {
        // Count the commands per zone
        @z_rw_ctr_map[$zlbas]++;
    }

    $cmd = (((struct request *)arg1)->cmd_flags & REQ_OP_MASK);
    if($cmd == REQ_OP_ZONE_RESET || (($cmd == REQ_OP_DRV_OUT && $opcode == nvme_cmd_zone_mgmt_send) && $nvme_cmd->zms.zsa == NVME_ZONE_RESET)) {
        @z_reset_ctr_map[$zlbas]++;

        // Store the starting lba and time in a map indexed by the ID of the command 
        $cmdid = ((struct request *)arg1)->tag;
        @reset_z_track_map[$cmdid] = $zlbas;
        @reset_lat_track_map[$cmdid] = nsecs; // nsecs is the bpftrace primitive to get the time
    }
}

k:nvme_complete_rq / ((struct request *)arg0)->q->disk->disk_name == "nvme0n2" / {
    $nvme_cmd = (struct nvme_command *)*(arg0+sizeof(struct request));
    $opcode = (uint8)$nvme_cmd->rw.opcode;
    $cmd = (((struct request *)arg0)->cmd_flags & REQ_OP_MASK);

    if($cmd == REQ_OP_ZONE_RESET || (($cmd == REQ_OP_DRV_OUT && $opcode == nvme_cmd_zone_mgmt_send) && $nvme_cmd->zms.zsa == NVME_ZONE_RESET)) {
        $cmdid = ((struct request *)arg0)->tag;
        $zlbas = @reset_z_track_map[$cmdid];
        @z_reset_lat_map[$zlbas, @z_reset_ctr_map[$zlbas]] = nsecs - @reset_lat_track_map[$cmdid];
    }
}

// Let's cleanup maps we don't need
END {
    clear(@ZONE_MASK);
    clear(@reset_z_track_map);
    clear(@reset_lat_track_map);
}

Notice that the arguments for nvme_complete_rq are different: the struct request * is now the first argument (arg0). We have added a probe that is triggered on the completion of an NVMe command. We initially record the LBAS and start time of a command, and upon completion match them back via the command's tag. We need to keep track of the LBAS because the completion may no longer carry that information. After this we do some cleanup to remove the maps we no longer need, so that bpftrace won't print them when the script is done. This is achieved with the END block.
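
If we only care about the latency distribution rather than individual values, bpftrace's built-in hist() aggregation is a lighter-weight alternative. A sketch of the completion probe body under the same setup (the map name is ours):

$cmdid = ((struct request *)arg0)->tag;
@z_reset_lat_hist = hist(nsecs - @reset_lat_track_map[$cmdid]);
delete(@reset_lat_track_map[$cmdid]);

bpftrace then prints a power-of-two histogram of the reset latencies when the script ends.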

Getting Call Stacks

It is also beneficial to see the call stack of operations. For this we can run bpftrace and tell it to provide the stack for a particular probe. We can also run short BPF programs directly on the command line, without a script.

user@stosys:~$ sudo bpftrace -e 'kprobe:nvme_setup_cmd / ((struct request *)arg1)->q->disk->disk_name == "nvme0n2" / { @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    nvme_setup_cmd+1
    blk_mq_dispatch_rq_list+369
    __blk_mq_do_dispatch_sched+186
    blk_mq_do_dispatch_sched+64
    __blk_mq_sched_dispatch_requests+259
    blk_mq_sched_dispatch_requests+53
    __blk_mq_run_hw_queue+59
    blk_mq_run_work_fn+31
    process_one_work+543
    worker_thread+80
    kthread+238
    ret_from_fork+34
]: 89

This shows the call stack for the nvme_setup_cmd call. It requires a command to be issued to the device while BPF is running (e.g., simply issuing a zone append operation with echo "test" | sudo nvme zns zone-append /dev/nvme0n2 -z 4096).

Tracing in user-space

bpftrace is not limited to the kernel; it has excellent support for applications as well. You can even break right into your currently running Firefox (tested on Pop!_OS). Just try sudo bpftrace -e 'uprobe:/usr/lib/firefox/libnspr4.so:PR_Write { printf("Firefox can be traced\n"); }'. Tracing in user-space is a bit more limited than in kernel-space. Important to know is that there are two types of tracing: dynamic (uprobe, uretprobe) and static (usdt).

Excellent sources explaining how user-space tracing works

Dynamic tracing

Dynamic tracing works on most executables that do not have their symbols stripped. It does add some overhead, and you have no guarantee at all that it is portable or stable (sometimes functions are inlined). Furthermore, ALWAYS use absolute paths. To see what is available for tracing, run:

bpftrace -l '*:_PATH_TO_EXE_:*'

This will display all the points you can trace in the executable. However, you can often also trace its shared libraries, for example with db_bench: db_bench is linked against librocksdb, which allows something like:

ldd _PATH_TO_DB_BENCH # remember the path to librocksdb used
bpftrace -l '*:_PATH_TO_LIBROCKSDB_:*'

You can trace on entry of functions with uprobe and exit of functions with uretprobe.
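
For instance (a minimal sketch; the libc path and the pid are placeholders that differ per system), timing every malloc call of one process with a uprobe/uretprobe pair:

sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /pid == 1234/ { @start[tid] = nsecs; }
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /pid == 1234/ {
    @malloc_lat = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'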

Warning!

Parsing args in bpftrace is not fun. Many arguments cannot be parsed correctly. If an arg is a struct, copy the struct into the bpftrace script as a stub. Floats cannot be parsed at all, but can be converted to unsigned longs. See for example https://github.com/nicktehrany/bpf-stosys-scripts/blob/master/fio/fio_reset_latency_spdk_nvme.bt for a proper hack.
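
As a hypothetical sketch of the struct-stub trick (the struct, function name, and path are made up for illustration):

// Stub copied into the .bt script, mirroring the application's own definition
struct my_args {
    int id;
    unsigned long len;
};

// _PATH_TO_APP_ and process_request are placeholders
uprobe:_PATH_TO_APP_:process_request {
    $a = (struct my_args *)arg0;
    printf("id=%d len=%lu\n", $a->id, $a->len);
}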

Static tracing

Static tracing is when an application explicitly declares its tracepoints; this approach originates from DTrace and is generally referred to as USDT. Static tracepoints ensure the code can be traced regardless of, e.g., inlining. They also work with stripped binaries (e.g., compiled with g++ -s), provide more stable tracepoints (chosen by the developers themselves), and add close to zero overhead when not in use. This is possible by patching the binary when tracing starts (nops get rewritten into trace logic). Many applications allow enabling USDT with ./configure --with_usdt (e.g., SPDK) or something similar. You can then easily spot the available calls with one of the following:

readelf -n PATH_TO_EXE # look for the `.note.stapsdt` section
tplist-bpfcc -l _PATH_TO_EXE
tplist -l PATH_TO_EXE
grep -r DTRACE_PROBE  ./*

To activate one of the calls it is as simple as:

bpftrace -e 'usdt:_PATH_TO_EXE:_CALL_NAME_ { printf("Static tracing"); }'

Static tracepoints do not come with implicit entry and exit points; both need to be added by the developer. However, you can still get the stack; use ustack for this, not kstack. The symbols will probably be unreadable, in which case you may need to compile with debug symbols.
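
For example (a sketch reusing the USDT probe of the test program shown below), collecting user-space stacks at a static tracepoint:

sudo bpftrace -e 'usdt:./test:first_probe { @[ustack] = count(); }'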

To add USDT probes to your own code, include sys/sdt.h and use one of the DTRACE_PROBE calls. Here is an example program:

// test.cc
#include <iostream>
#include <unordered_map>
#include <random>
#include <sys/sdt.h>

std::unordered_map<int, std::string> glob;
std::mt19937_64 rand_mt;

void calculate() {
    size_t i = 0;
    while (i++ < 1000) {
        size_t index = rand_mt() % 20'000'000;
        if (index < 10'000'000) {
            DTRACE_PROBE(test, first_probe);
            std::cout << glob[index] << "\n";
            DTRACE_PROBE(test, first_probe_done);
        }
    }
}

void loop() {
    size_t i = 0;
    while (i++ < 1'000'000) {
        calculate();
    }
}

int main(int argc, char **argv) {
    for (size_t i = 0; i < 10'000'000; i++) {
        glob.insert({static_cast<int>(i), std::to_string(i)});
    }
    loop();
    return 0;
}

And this is an example bpftrace script for the code:

// example.bt
uprobe:./test:calculate {
    // Keyed by thread id, so this works as long as the same thread finishes the task
    @start[tid] = nsecs;
}
uretprobe:./test:calculate {
    $cmdtime = nsecs - @start[tid];
    @avg_calc["calc avg"] = avg($cmdtime);
}
usdt:./test:first_probe {
    @startprint[tid] = nsecs;
}
usdt:./test:first_probe_done {
    $cmdtime = nsecs - @startprint[tid];
    @avg_calc_print["calc print avg"] = avg($cmdtime);
}

Try compiling with g++ test.cc -o test and with g++ test.cc -s -o test, and see what happens when you run ./test in one terminal and sudo bpftrace example.bt in the other.

Profiling BPF Scripts

If we use large BPF tracing scripts during performance-sensitive profiling, we must remember that BPF itself introduces some overhead. The larger the script, the more overhead it introduces. For instance, keeping a single counter for a kfunc invocation is a small overhead compared to logging its function arguments, especially if the function is called thousands of times. To our benefit, we can profile BPF scripts with tools such as bpftool; however, it has to be built from the kernel source tree.

An alternative approach is to use the kernel's statistics collection for BPF. We can enable it by running:

echo 1 > /proc/sys/kernel/bpf_stats_enabled
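
Writing this file requires root. The same switch is also exposed as a sysctl (kernel.bpf_stats_enabled, available since kernel 5.1):

sudo sysctl -w kernel.bpf_stats_enabled=1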

With this we can get statistics during the execution of a BPF script via its open fds. First, while the BPF script is running, we need to get its pid. For this we can run the following:

user@stosys:~/src$ ps aux | grep bpftrace
root       24609  4.5  0.4 288856 98124 pts/9    S+   18:43   0:00 bpftrace ./inode_writeback_count.bt
user       24686  0.0  0.0   6608  2456 pts/4    S+   18:44   0:00 grep bpftrace

where we can see root running a bpftrace script with pid 24609. Next, we can locate the collected stats for this pid by running:

root@stosys:/home/user/src# ll /proc/24609/fdinfo
total 0
dr-xr-xr-x 2 root root 0 Nov 21 18:44 ./
dr-xr-xr-x 9 root root 0 Nov 21 18:44 ../
-r--r--r-- 1 root root 0 Nov 21 18:44 0
-r--r--r-- 1 root root 0 Nov 21 18:44 1
-r--r--r-- 1 root root 0 Nov 21 18:44 10
-r--r--r-- 1 root root 0 Nov 21 18:44 11
-r--r--r-- 1 root root 0 Nov 21 18:44 12
-r--r--r-- 1 root root 0 Nov 21 18:44 13
-r--r--r-- 1 root root 0 Nov 21 18:44 14
-r--r--r-- 1 root root 0 Nov 21 18:44 15
-r--r--r-- 1 root root 0 Nov 21 18:44 16
-r--r--r-- 1 root root 0 Nov 21 18:44 17
-r--r--r-- 1 root root 0 Nov 21 18:44 2
-r--r--r-- 1 root root 0 Nov 21 18:44 3
-r--r--r-- 1 root root 0 Nov 21 18:44 4
-r--r--r-- 1 root root 0 Nov 21 18:44 5
-r--r--r-- 1 root root 0 Nov 21 18:44 6
-r--r--r-- 1 root root 0 Nov 21 18:44 7
-r--r--r-- 1 root root 0 Nov 21 18:44 8
-r--r--r-- 1 root root 0 Nov 21 18:44 9

This lists the open fds of that particular process. The collected BPF statistics are under the last open fd, 17 in this case. We can then cat its fdinfo with:

root@stosys:/home/user/src# cat /proc/24609/fdinfo/17
pos:    0
flags:  02000002
mnt_id: 15
ino:    1053
prog_type:      2
prog_jited:     1
prog_tag:       03b3f5f931493a53
memlock:        4096
prog_id:        126
run_time_ns:    2379004
run_cnt:        16049
recursion_misses:       0
verified_insns: 21

The fields of interest are run_time_ns (the total runtime in nanoseconds) and run_cnt (the number of times the program was triggered). The script run in this scenario keeps a simple counter of a VFS function call, and its tracing output is identical to the run_cnt value. If we want the overhead of a single trigger, we can divide run_time_ns by run_cnt, which here comes out to roughly 150 ns per invocation.