bpftrace principles and usage


bpftrace is a tracing tool built on eBPF (Extended Berkeley Packet Filter), used for dynamic tracing and performance analysis on Linux. This article covers its concepts, principles, and usage.

Concepts and Principles

  • eBPF (Extended Berkeley Packet Filter): eBPF is a virtual machine technology that allows safe, programmable code snippets to be run in the kernel to perform in-depth tracing and monitoring of the system. eBPF provides a flexible and efficient way to extend the functionality of the kernel and allow user-space applications to interact with the kernel.
  • bpftrace language: bpftrace provides a high-level scripting language with an awk-like syntax for writing tracing scripts. A script is compiled to eBPF bytecode and run by the in-kernel eBPF virtual machine, where it can capture and analyze a wide range of system events and metrics.
  • Dynamic loading and execution: A key feature of bpftrace is that it can dynamically load and execute scripts at runtime without the need to recompile the kernel or application. This makes it ideal for real-time system performance analysis and troubleshooting.

bpftrace installation

Install bpftrace with your distribution's package manager (apt, dnf, yum, etc.). On Ubuntu:
sudo apt install bpftrace

bpftrace syntax structure

The syntax of bpftrace is modeled on awk:
probes /filter/ { action }

probes: the events to attach to, such as tracepoint, kprobe, kretprobe, and uprobe. Two special probes, BEGIN and END, run at the start and end of the script.
filter: an optional condition evaluated when the event fires. For example, /pid == 3245/ means the action runs only when the event occurs in the process with PID 3245.
action: the operations to perform. For example, { printf("close\n"); } prints "close".
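As a small sketch of how the three parts fit together, the shell snippet below writes a one-probe script to a file. The file name syntax_demo.bt and the process name "bash" are placeholders chosen for illustration and do not come from the original article.

```shell
# Write a minimal bpftrace script (syntax_demo.bt is a hypothetical file name).
cat > syntax_demo.bt <<'EOF'
// probe:  entry to the openat() system call
// filter: only when the calling process is named "bash" (placeholder)
// action: print the PID and the path being opened
tracepoint:syscalls:sys_enter_openat
/comm == "bash"/
{
    printf("%d %s\n", pid, str(args->filename));
}
EOF
```

The script can then be run with root privileges, for example with sudo bpftrace syntax_demo.bt, and stopped with Ctrl-C.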

probes

Examples:

bpftrace -e 'BEGIN { printf("hello\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_accept { printf("accept\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_accept4 { printf("accept4\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_connect { printf("connect\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_read { printf("read\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_write { printf("write\n"); }'
bpftrace -e 'tracepoint:syscalls:sys_enter_close { printf("close\n"); }'

bpftrace variables

built-in variables

Commonly used variables in the bpftrace script are as follows:

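uid: user ID.
tid: thread ID.
pid: process ID.
cpu: ID of the current CPU.
cgroup: cgroup ID of the current process.
probe: full name of the current probe.
comm: process name.
nsecs: timestamp in nanoseconds.
kstack: kernel stack trace.
curtask: address of the current process's task_struct.
args: the probe's arguments as a struct, accessed by name (args->name); mainly used with tracepoints, while kprobes typically use arg0, arg1, and so on.
arg0: the first argument of a kprobe; not available in tracepoints.
arg1: the second argument of a kprobe; not available in tracepoints.
arg2: the third argument of a kprobe; not available in tracepoints.
retval: the function's return value in a kretprobe.
args->ret: the return value in a syscall exit tracepoint, such as tracepoint:syscalls:sys_exit_read.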
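To make the built-in variables more concrete, here is a hedged sketch that prints several of them each time a read system call returns successfully. The file name builtin_vars.bt is hypothetical, and the choice of the sys_exit_read tracepoint is only an illustration.

```shell
# Write a script that prints several built-in variables (hypothetical file name).
cat > builtin_vars.bt <<'EOF'
// On every successful read(), print who made the call and how many bytes
// were returned. args->ret is the syscall return value on exit tracepoints.
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    printf("%-16s uid=%d pid=%d tid=%d cpu=%d ret=%d\n",
           comm, uid, pid, tid, cpu, args->ret);
}
EOF
```

Running sudo bpftrace builtin_vars.bt produces one line per read, so on a busy system it is worth adding a filter such as a specific pid.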

Custom variables

Custom variables are defined and referenced with a '$' prefix, for example: $idx = 0;

Map variable

Map variables are storage structures used to pass data from the kernel to user space. They are defined with an '@' prefix:
@path[tid] = nsecs;
@path[pid, $fd] = nsecs;
By default, bpftrace prints the map variables it received from the kernel when it exits.
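A common way to combine the two kinds of variables is to store a timestamp in a map on entry and compute a duration into a '$' variable on exit. The sketch below applies that pattern to the read system call; the file name read_latency.bt and the map names @start and @last_read_us are made up for this example.

```shell
# Write a read() latency script (file and map names are hypothetical).
cat > read_latency.bt <<'EOF'
// Entry: remember when this thread entered read().
tracepoint:syscalls:sys_enter_read
{
    @start[tid] = nsecs;
}

// Exit: only for threads whose entry was recorded.
tracepoint:syscalls:sys_exit_read
/@start[tid]/
{
    // Scratch variable: elapsed time in microseconds.
    $dur_us = (nsecs - @start[tid]) / 1000;
    // Map variable: latest read() duration per process name.
    @last_read_us[comm] = $dur_us;
    // Remove the per-thread entry so the map does not grow.
    delete(@start[tid]);
}
EOF
```

When the script is stopped with Ctrl-C, bpftrace prints @last_read_us along with any entries still left in @start.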

built-in functions

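exit(): quit the bpftrace program.
str(char *): convert a pointer to a string.
system(format[, arguments ...]): run a shell command (requires running bpftrace with --unsafe).
join(char *str[]): print an array of strings joined by spaces, for example to output args->argv.
ksym(addr): resolve an address to a kernel symbol name.
kaddr(char *name): resolve a kernel symbol name to its address.
print(@m [, top [, div]]): print a map, optionally limited to the top n entries, with values optionally divided by div.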
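The following sketch uses several of these functions together: it prints the command line of each new process with join(), then after ten seconds prints the five busiest callers with print() and quits with exit(). The interval:s:10 probe is a timer probe not covered in the text above, and the names exec_top.bt and @execs are hypothetical.

```shell
# Write a script that uses join(), print() and exit() (hypothetical names).
cat > exec_top.bt <<'EOF'
BEGIN
{
    printf("Tracing execve for 10 seconds...\n");
}

tracepoint:syscalls:sys_enter_execve
{
    // Count execve calls per calling process name.
    @execs[comm] = count();
    // Print the new command line, arguments separated by spaces.
    join(args->argv);
}

// Timer probe: fires once 10 seconds have elapsed.
interval:s:10
{
    // Show only the 5 entries with the highest counts.
    print(@execs, 5);
    // Empty the map so it is not printed a second time on exit.
    clear(@execs);
    exit();
}
EOF
```

Run it with sudo bpftrace exec_top.bt and start a few commands in another terminal to see output.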

bpftrace also has built-in map functions for aggregating data into map variables and managing them.

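count(): count the number of times it is called.
sum(int n): sum the values of n.
avg(int n): average the values of n.
min(int n): record the minimum value of n.
max(int n): record the maximum value of n.
hist(int n): histogram of n with power-of-2 buckets.
lhist(int n, int min, int max, int step): histogram of n with linear buckets.
delete(@m[key]): delete the entry for key from the map.
clear(@m): delete all entries in the map.
zero(@m): set all values in the map to 0.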
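As an illustration of the map functions, this sketch aggregates the return value of read in several ways at once. The file name map_funcs.bt, the map names, the lhist() range of 0 to 4096 in steps of 512, and the 5-second interval are all arbitrary choices for the example.

```shell
# Write a script exercising the map functions (all names are hypothetical).
cat > map_funcs.bt <<'EOF'
// Aggregate the number of bytes returned by successful read() calls.
tracepoint:syscalls:sys_exit_read
/args->ret >= 0/
{
    @calls[comm] = count();
    @bytes[comm] = sum(args->ret);
    @avg_bytes[comm] = avg(args->ret);
    @max_bytes[comm] = max(args->ret);
    // Power-of-2 buckets.
    @size_pow2 = hist(args->ret);
    // Linear buckets: 0 to 4096 in steps of 512.
    @size_linear = lhist(args->ret, 0, 4096, 512);
}

// Every 5 seconds, print the per-process call counts and start over.
interval:s:5
{
    print(@calls);
    clear(@calls);
}
EOF
```

The remaining maps are printed when the script is stopped, which makes it easy to compare the hist() and lhist() views of the same data.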

bpftrace usage examples

Note: bpftrace needs access to kernel resources, so it typically must be run with superuser (sudo) privileges.

File system

Count the number of calls to read:
bpftrace -e 't:syscalls:sys_enter_read {@[probe]=count(); }'

Show a histogram of the count argument (the requested size in bytes) passed to the read system call:
bpftrace -e 't:syscalls:sys_enter_read {@=hist(args->count);}'
Trace the openat system call, printing the name of the calling process and the name of the file being opened:
bpftrace -e 't:syscalls:sys_enter_openat { printf("%s --> %s\n",comm,str(args->filename)); }'
Run from a script file. Create vfs.bt (for example with vim vfs.bt) with the following content:

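#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

// vfs_open(): arg0 is a struct path *; print the name of the file being opened.
kprobe:vfs_open
/comm == "cat"/
{
	printf("vfs_open: %s, name: %s\n", comm, str(((struct path *)arg0)->dentry->d_name.name));
}

// vfs_write(): arg0 is a struct file *, arg1 the buffer, arg2 the byte count.
// The buffer is not necessarily NUL-terminated, so str(arg1) may print extra bytes.
kprobe:vfs_write
/comm == "cat"/
{
	$file = str(((struct file *)arg0)->f_path.dentry->d_name.name);
	printf("vfs_write: %s, count: %d, buf: %s\n", $file, arg2, str(arg1));
}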


Run the script:
bpftrace vfs.bt

Disk

Count block I/O events by tracepoint:
bpftrace -e 't:block:* { @[probe] = count(); }'
Show the distribution of block I/O request sizes in bytes:
bpftrace -e 't:block:block_rq_issue { @bytes = hist(args->bytes); }'
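Beyond counts and sizes, the same block tracepoints can be paired to estimate how long each request takes. The sketch below is one possible approach, assuming that the dev and sector fields identify a request on both block_rq_issue and block_rq_complete on your kernel; the file name bio_latency.bt and the map names are hypothetical.

```shell
# Write a block I/O latency script (hypothetical names; field names assumed).
cat > bio_latency.bt <<'EOF'
// Request issued to the device: record the time, keyed by device and sector.
tracepoint:block:block_rq_issue
{
    @start[args->dev, args->sector] = nsecs;
}

// Request completed: compute the elapsed time for the matching request.
tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    // Latency in microseconds, as a power-of-2 histogram.
    @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}

// Drop any requests still in flight so only the histogram is printed.
END
{
    clear(@start);
}
EOF
```

Run it with sudo bpftrace bio_latency.bt while generating some disk activity, then press Ctrl-C to print the histogram.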

Process

Print the name and command-line arguments of each newly started process:
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'
Count context switches by kernel stack:
bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'

Memory

Sum kernel slab allocation sizes by kernel stack:
bpftrace -e 't:kmem:kmem_cache_alloc { @bytes[kstack] = sum(args->bytes_alloc); }'
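Sum requested malloc sizes by user stack and process name (the libc path shown is for x86_64 Ubuntu; adjust it for your system):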
bpftrace -e 'u:/lib/x86_64-linux-gnu/libc.so.6:malloc {@[ustack, comm] = sum(arg0); }'


Origin blog.csdn.net/m0_68678128/article/details/134822479