Exploring eBPF: the Linux kernel's secret weapon

In 2022 the Linux kernel released versions 5.16-5.19, 6.0, and 6.1, and each version introduced a large number of new eBPF features. This article gives a brief introduction to these features; for more detail, follow the corresponding links. Overall, eBPF remains one of the most active subsystems in the kernel, and its feature set is still developing rapidly. In a sense, eBPF is quickly evolving into a complete kernel-mode programmable interface.

Advanced eBPF: an overview of new kernel features

  1. BPF kfuncs
  2. Bloom filter map: 5.16
  3. Compile Once – Run Everywhere: 5.17
  4. bpf_loop() helper function: 5.17
  5. BPF_LINK_TYPE_KPROBE_MULTI: 5.18
  6. Dynamic pointers and typed pointers: 5.19
  7. USDT: 5.19
  8. bpf panic: 6.1
  9. BPF memory allocator, linked lists: 6.1
  10. User ring buffer: 6.1

1. Overview of eBPF

1.1 What is eBPF

eBPF is a register-based virtual machine with a custom 64-bit RISC instruction set, capable of running just-in-time-compiled "BPF programs" natively inside the Linux kernel with access to a subset of kernel functionality and memory. It is a complete virtual-machine implementation, not to be confused with the Kernel-based Virtual Machine (KVM), a module that lets Linux act as a hypervisor for other virtual machines. Because eBPF is part of the mainline kernel, it does not require third-party modules the way other frameworks (LTTng or SystemTap) do, and it is enabled by default on almost all Linux distributions. Readers familiar with DTrace may find the DTrace/bpftrace comparison useful.

Running a full virtual machine inside the kernel is primarily a matter of convenience and safety. Everything an eBPF program does could also be done by an ordinary kernel module, but direct kernel programming is dangerous: it can cause system lockups, memory corruption, and process crashes, leading to security holes and other surprises, especially on production machines (where eBPF is often used to inspect live systems). Running fast, JIT-compiled native kernel code through a safe virtual machine is therefore very valuable for security monitoring and sandboxing, network filtering, program tracing, performance profiling, and debugging. Some simple examples can be found in this excellent eBPF reference.

By design, the eBPF virtual machine and its programs are intentionally not Turing-complete: no loops are allowed (support for bounded loops is in progress), so every eBPF program is guaranteed to terminate; all memory accesses are bounded and type-checked (including registers; a MOV instruction can change a register's type); null dereferences are forbidden; a program may contain at most BPF_MAXINSNS instructions (default 4096); the "main" function takes exactly one argument (the context); and so on. When an eBPF program is loaded into the kernel, the verifier parses its instructions into a directed acyclic graph. These restrictions make correctness easy and fast to verify.

The main differences between classic BPF (cBPF) and eBPF are as follows:

  1. eBPF allows code snippets to be written in C and compiled into eBPF bytecode by LLVM;
  2. cBPF only implements SOCKET_FILTER, while eBPF also supports KPROBE, PERF, and more;
  3. cBPF uses sockets for interaction between user space and the kernel, while eBPF defines a new dedicated system call for loading BPF code segments and creating and reading BPF maps, which is more general;
  4. the BPF map mechanism temporarily stores data produced by the BPF code in the kernel as key-value pairs.

In short, eBPF can be understood as a virtual-machine mechanism implemented by the kernel: C-like code is compiled into bytecode (explained in detail later) and attached to a hook in the kernel. When the hook is triggered, the kernel runs the bytecode in the virtual machine's "sandbox", which makes many features convenient to implement while the sandbox protects the safety of the kernel.

1.2 Evolution of eBPF

The original Berkeley Packet Filter (BPF) was designed to capture and filter network packets matching specific rules. Filters are programs that run on a register-based virtual machine.

Running user-specified programs in the kernel proved to be a useful design, but some aspects of the original BPF design did not hold up well. The virtual machine's instruction set architecture (ISA) fell behind the times: processors moved to 64-bit registers and gained new instructions for multi-core systems, such as the atomic XADD, while the small subset of RISC instructions BPF provides no longer matched what existing processors offer.

Therefore, in designing eBPF, Alexei Starovoitov brought the virtual machine closer to contemporary processors, making eBPF instructions map more directly onto the hardware ISA and thereby improving performance. One of the biggest changes was the move to 64-bit registers and the increase in register count from 2 to 10. Since modern architectures have well over 10 registers, parameters can be passed to functions through eBPF virtual-machine registers just as on native hardware. In addition, the new BPF_CALL instruction makes calling kernel functions convenient.

Mapping eBPF onto native instructions facilitates just-in-time compilation and improves performance. The eBPF patches in the 3.15 kernel made eBPF on x86-64 up to 4 times faster than the old classic BPF (cBPF) for network filtering, and around 1.5 times faster in most cases. Many architectures (x86-64, SPARC, PowerPC, ARM, arm64, MIPS, and s390) already support just-in-time (JIT) compilation.

1.3 eBPF environment setup

Compile and run the code in samples/bpf of the kernel source:

  1. Download the kernel source code and unpack it.
  2. If you see "/bin/sh: scripts/mod/modpost: No such file or directory", run make scripts first.
  3. make M=samples/bpf needs a .config file; make sure the required options are present in it.
  4. If you hit the error "libcrypt1.so.1 not found", execute the following (https://www.mail-archive.com/[email protected]/msg1818037.html):
$ cd /tmp
$ apt -y download libcrypt1
$ dpkg-deb -x libcrypt1_1%3a4.4.25-2_amd64.deb  .
$ cp -av lib/x86_64-linux-gnu/* /lib/x86_64-linux-gnu/
$ apt -y --fix-broken install

  5. Once the compilation succeeds, the executables in samples/bpf can be run.

Compile and run code you develop yourself

1. Download the Linux source code, compile the kernel and upgrade to it

git clone https://github.com/torvalds/linux.git
cd linux/
git checkout -b v5.0 v5.0

Set up the configuration file:

cp -a /boot/config-4.14.81.bm.15-amd64 ./.config

echo '
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_KPROBES=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_INFO_BTF=y
' >> ./.config

You may need to add the Debian sid source to install dwarves:

apt install dwarves
make oldconfig
apt install libssl-dev
make
make modules_install
make install
reboot

After rebooting:

uname -a
Linux n231-238-061 5.0.0 #1 SMP Mon Dec 13 05:38:52 UTC 2021 x86_64 GNU/Linux

Compile the BPF hello-world example

Switch to the helloworld directory of https://github.com/bpftools/linux-observability-with-bpf

sed -i 's;/kernel-src;/root/linux;' Makefile
make

Compilation fails with an error (screenshot omitted). Per http://www.helight.info/blog/2021/build-kernel-ebpf-sample/, the solution is:

cp /root/linux/include/uapi/linux/bpf.h /usr/include/linux/bpf.h

Executing ./monitor-exec then fails with:

./monitor-exec: error while loading shared libraries: libbpf.so: cannot open shared object file: No such file or directory

Solution

cd /root/linux/tools/lib/bpf/
make
make install

Add the line /usr/local/lib64 to /etc/ld.so.conf, then run sudo ldconfig to regenerate the dynamic-linker cache.

~/linux/tools/lib/bpf# ldconfig -v 2>/dev/null | grep libbpf
    libbpf.so.0 -> libbpf.so.0.5.0
    libbpf.so -> libbpf.so

With that, the program finally runs.

You may need to install: apt-get install gcc-multilib g++-multilib

https://github.com/sirfz/tesserocr/issues/130

Install bpftrace

(1) Add the Debian sid source: https://github.com/iovisor/bcc/blob/master/INSTALL.md#debian---source

deb http://deb.debian.org/debian sid main contrib non-free
deb-src http://deb.debian.org/debian sid main contrib non-free

(2) apt install bpftrace (https://github.com/iovisor/bpftrace/blob/master/INSTALL.md)

1.4 What can be done with eBPF?

An eBPF program is attached to a specified kernel code path; when that code path executes, the attached eBPF program runs. Given its origins, eBPF is particularly well suited to writing network programs: attached to a network socket, a program can filter traffic, classify traffic, and perform the actions of a network classifier. An eBPF program can even modify the settings of an established socket. The XDP project runs eBPF programs at the lowest layer of the network stack to process received packets with high performance.

[Figure: the functions supported by eBPF]

BPF's network processing can be divided into tc/BPF and XDP/BPF; their main differences are as follows (refer to this document):

XDP's hook runs earlier than tc's, so its performance is higher: the tc hook takes a sk_buff structure as its argument, while XDP takes an xdp_md structure. The sk_buff carries far more data than xdp_md, which has a performance cost, and a packet must travel further up the stack before the tc hook's handler is triggered. Because the XDP hook sits before the network stack, the xdp_buff (i.e. xdp_md) used by XDP has no access to sk_buff metadata.

struct xdp_buff {
	/* Linux 5.8 */
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	struct xdp_rxq_info *rxq;
	struct xdp_txq_info *txq;
	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom */
};

struct xdp_rxq_info {
	struct net_device *dev;
	u32 queue_index;
	u32 reg_state;
	struct xdp_mem_info mem;
} ____cacheline_aligned; /* perf critical, avoid false-sharing */

struct xdp_txq_info {
	struct net_device *dev;
};

data points to the actual start of the packet within the page, and data_end points to its end. Since XDP allows headroom (see below), data_hard_start points to the start of the headroom in the page; when a packet is encapsulated, data is moved toward data_hard_start via bpf_xdp_adjust_head().

The same BPF helper can also be used for decapsulation, in which case data is moved away from data_hard_start. data_meta initially points to the same location as data, but bpf_xdp_adjust_meta() can move it toward data_hard_start, making room for user metadata; this metadata is invisible to the kernel network stack but can be read by tc BPF programs (tc transfers it from XDP into the skb).

Conversely, the user-metadata area can be removed or shrunk by moving data_meta away from data_hard_start with the same helper. data_meta can also be used purely to pass state between tail calls, similar to the skb->cb[] control block accessible to tc BPF programs. The packet pointers in struct xdp_buff obey the relationship: data_hard_start <= data_meta <= data < data_end. The rxq field points to additional per-receive-queue metadata populated during ring setup; from it, BPF programs can retrieve queue_index and other data about the network device (ifindex, etc.).

tc can manage packets better: tc's BPF input context is a sk_buff, unlike the xdp_buff used by XDP, and each has advantages and disadvantages. When the kernel's network stack receives a packet after the XDP layer, it allocates a buffer and parses the packet, storing its metadata; this metadata lives in the sk_buff.

That structure is exposed in the BPF input context, so a tc BPF program at the tc ingress layer can use the metadata the network stack has already parsed from the packet. With the sk_buff, tc can use this metadata directly: a BPF program attached to a tc hook can read or write the skb's mark, pkt_type, protocol, priority, queue_mapping, napi_id, cb[] array, hash, tc_classid, tc_index, vlan metadata, and so on, as well as the user metadata passed up from XDP. The struct __sk_buff used by tc BPF is defined in the linux/bpf.h header. The drawback of xdp_buff is that none of the sk_buff data is available to it: XDP sees only the raw packet data, plus the user metadata it passes along.

XDP can modify packets faster: the sk_buff holds a lot of protocol-related state (such as GSO information), so switching protocols by simply rewriting packet bytes is difficult, because the network stack works primarily from the packet metadata rather than re-reading the packet contents each time, and the BPF helper functions must keep the sk_buff internals consistent. xdp_buff has no such problem: XDP runs before the kernel allocates a sk_buff, so any modification of the packet is easy to make (though harder to manage).

tc/eBPF and XDP can complement each other: when packets must be modified and data managed in more complex ways, the limitations of each program type can be offset by running both. An XDP program at ingress can modify the complete packet and pass user metadata from XDP BPF to tc BPF; tc can then use that metadata together with the sk_buff fields to manage the packet.

tc/eBPF can act on both ingress and egress, while XDP can only act on ingress: a tc BPF program can be triggered on both the ingress and egress network data paths.

tc/BPF requires no changes to hardware drivers, while XDP usually relies on native driver support for its higher performance. tc BPF programs still run early in the kernel network data path (after GRO, but before protocol processing and the traditional iptables firewall, e.g. the iptables PREROUTING or nftables ingress hooks). On egress, tc BPF programs run before the packet is handed to the driver, i.e. after the traditional iptables firewall (e.g. iptables POSTROUTING) but before the kernel's GSO engine. One special case: with an offloaded tc BPF program (usually provided via a SmartNIC), offloaded tc/eBPF approaches the performance of offloaded XDP.

[Figure: the positions of TC and XDP in the network stack; XDP processes packets before TC]

Another type of filtering performed by the kernel limits which system calls a process may use; this is implemented via seccomp BPF.

eBPF can also be used to locate kernel problems and perform performance analysis by attaching programs to tracepoints, kprobes, and perf events. Because eBPF can access kernel data structures, developers can write and test such code without recompiling the kernel. For busy engineers, this makes it convenient to debug a live production system. In addition, user-space programs can be debugged through statically defined trace points (this is how BCC debugs user programs such as MySQL).

Using eBPF has two major advantages: fast and secure. In order to use eBPF well, it is necessary to understand how it works.

1.5 The eBPF verifier

Running user-supplied code in the kernel presents security and stability risks, so an eBPF program is verified extensively before it is loaded. First, a depth-first search over the program's control flow ensures that the program always terminates and cannot lock up the kernel in a loop. Unreachable instructions are strictly prohibited; any program containing them fails to load.

The second stage involves simulating the execution of an eBPF program (one instruction at a time) using the verifier. Before and after each instruction is executed, the state of the virtual machine needs to be checked to ensure that the state of the registers and the stack are valid. Out-of-bounds (code) jumps, and access to out-of-bounds data are strictly prohibited.

The verifier does not need to examine every path of the program: if it can determine that the current state of the program is a subset of a state it has already checked, then, since all previously checked paths must be valid (otherwise the program would already have failed to load), the current path must be valid too, and the verifier can "prune" the branch and skip simulating it.

The verifier has a "safe mode" that prohibits pointer arithmetic. Safe mode is enabled when a user without the CAP_SYS_ADMIN capability loads an eBPF program, ensuring that kernel addresses are not leaked to unprivileged users and that pointers are not written to memory. If safe mode is not enabled, pointer arithmetic is allowed only after checks are performed; for example, all pointer accesses are checked for type, alignment, and bounds violations.

Registers containing uninitialized content cannot be read; attempting to do so causes the load to fail. The contents of registers R0-R5 are marked unreadable across function calls, and reads of uninitialized registers can be detected by storing a special value and testing for it. Similar checks are made for reads of variables on the stack, and the verifier ensures that no instruction writes to the read-only frame-pointer register.

Finally, the verifier uses the eBPF program type (see below) to restrict which kernel functions can be called from the program and which data structures can be accessed. Some program types, for example, are allowed to access network packet data directly.

1.6 The bpf() system call

Programs are loaded using the bpf() system call and the BPF_PROG_LOAD command. The prototype of this system call is:

int bpf(int cmd, union bpf_attr *attr, unsigned int size);

bpf_attr allows data to be passed between the kernel and user space; its exact layout depends on the cmd parameter.

cmd can be as follows:

       BPF_MAP_CREATE
              Create a map and return a file descriptor that refers to the
              map.  The close-on-exec file descriptor flag (see fcntl(2)) is
              automatically enabled for the new file descriptor.

       BPF_MAP_LOOKUP_ELEM
              Look up an element by key in a specified map and return its
              value.

       BPF_MAP_UPDATE_ELEM
              Create or update an element (key/value pair) in a specified
              map.

       BPF_MAP_DELETE_ELEM
              Look up and delete an element by key in a specified map.

       BPF_MAP_GET_NEXT_KEY
              Look up an element by key in a specified map and return the
              key of the next element.

       BPF_PROG_LOAD
              Verify and load an eBPF program, returning a new file descrip‐
              tor associated with the program.  The close-on-exec file
              descriptor flag (see fcntl(2)) is automatically enabled for
              the new file descriptor.

The size parameter gives the byte length of the bpf_attr union object. Besides loading programs with BPF_PROG_LOAD, bpf() commands can create and modify eBPF maps, the key/value data structures used for communication between eBPF programs and kernel or user space. Further commands allow attaching eBPF programs to a control-group directory or socket file descriptor, iterating over all maps and programs, and pinning eBPF objects to files so that they are not destroyed when the process that loaded them exits (the latter is used by tc classifiers/actions so that eBPF programs can persist without keeping the loading process alive). For the complete set of commands, refer to the bpf() help documentation.

While many different commands exist, they can broadly be grouped into commands that interact with eBPF programs, commands that interact with eBPF maps, and commands that interact with both (programs and maps are collectively referred to as objects).

1.7 eBPF program type

The type of program loaded with BPF_PROG_LOAD determines four things: where the program can be attached, which in-kernel helper functions the verifier allows it to call, whether it can access network packet data directly, and the type of the context object passed as its first argument. A program type essentially defines an API; new program types have even been created purely to distinguish different lists of callable functions (for example, BPF_PROG_TYPE_CGROUP_SKB versus BPF_PROG_TYPE_SOCKET_FILTER).

The eBPF program types supported by the current kernel are:

  • BPF_PROG_TYPE_SOCKET_FILTER: a network packet filter
  • BPF_PROG_TYPE_KPROBE: determine whether a kprobe should fire or not
  • BPF_PROG_TYPE_SCHED_CLS: a network traffic-control classifier
  • BPF_PROG_TYPE_SCHED_ACT: a network traffic-control action
  • BPF_PROG_TYPE_TRACEPOINT: determine whether a tracepoint should fire or not
  • BPF_PROG_TYPE_XDP: a network packet filter run from the device-driver receive path
  • BPF_PROG_TYPE_PERF_EVENT: determine whether a perf event handler should fire or not
  • BPF_PROG_TYPE_CGROUP_SKB: a network packet filter for control groups
  • BPF_PROG_TYPE_CGROUP_SOCK: a network packet filter for control groups that is allowed to modify socket options
  • BPF_PROG_TYPE_LWT_*: a network packet filter for lightweight tunnels
  • BPF_PROG_TYPE_SOCK_OPS: a program for setting socket parameters
  • BPF_PROG_TYPE_SK_SKB: a network packet filter for forwarding packets between sockets
  • BPF_PROG_TYPE_CGROUP_DEVICE: determine if a device operation should be permitted or not

As new program types are added, kernel developers also find the need to add new data structures.

1.8 eBPF data structure

The main data structure eBPF uses is the eBPF map, a general-purpose data structure for passing data within the kernel, or between kernel and user space. As the name "map" suggests, data is stored and retrieved by key.

Maps are created and managed using the bpf() system call. When a map is successfully created, the file descriptor associated with the map is returned. The map is destroyed when the corresponding file descriptor is closed. Each map defines 4 values: the type, the maximum number of elements, the byte size of the value, and the byte size of the key. eBPF provides different map types, and different types of maps provide different features.

  • BPF_MAP_TYPE_HASH: a hash table
  • BPF_MAP_TYPE_ARRAY: an array map, optimized for fast lookup speeds, often used for counters
  • BPF_MAP_TYPE_PROG_ARRAY: an array of file descriptors corresponding to eBPF programs; used to implement jump tables and sub-programs to handle specific packet protocols
  • BPF_MAP_TYPE_PERCPU_ARRAY: a per-CPU array, used to implement histograms of latency
  • BPF_MAP_TYPE_PERF_EVENT_ARRAY: stores pointers to struct perf_event, used to read and store perf event counters
  • BPF_MAP_TYPE_CGROUP_ARRAY: stores pointers to control groups
  • BPF_MAP_TYPE_PERCPU_HASH: a per-CPU hash table
  • BPF_MAP_TYPE_LRU_HASH: a hash table that only retains the most recently used items
  • BPF_MAP_TYPE_LRU_PERCPU_HASH: a per-CPU hash table that only retains the most recently used items
  • BPF_MAP_TYPE_LPM_TRIE: a longest-prefix match trie, good for matching IP addresses to a range
  • BPF_MAP_TYPE_STACK_TRACE: stores stack traces
  • BPF_MAP_TYPE_ARRAY_OF_MAPS: a map-in-map data structure
  • BPF_MAP_TYPE_HASH_OF_MAPS: a map-in-map data structure
  • BPF_MAP_TYPE_DEVICE_MAP: for storing and looking up network device references
  • BPF_MAP_TYPE_SOCKET_MAP: stores and looks up sockets and allows socket redirection with BPF helper functions

All maps can be accessed via eBPF or in userspace programs using the bpf_map_lookup_elem() and bpf_map_update_elem() functions. Certain map types, such as socket maps, use other eBPF helper functions that perform special tasks. More details about eBPF can be found in the official help documentation.

Note: before Linux 4.4, bpf() required the caller to have the CAP_SYS_ADMIN capability. Starting with Linux 4.4, unprivileged users can create restricted programs of type BPF_PROG_TYPE_SOCKET_FILTER together with the corresponding maps. Such programs may not store kernel pointers in maps and are limited to the following helper functions: get_random, get_smp_processor_id, tail_call, and ktime_get_ns. Unprivileged access can be disabled via the sysctl /proc/sys/kernel/unprivileged_bpf_disabled.

eBPF objects (maps and programs) can be shared between processes. For example, after a fork, the child process inherits the file descriptors referencing eBPF objects. File descriptors referencing eBPF objects can also be transferred over UNIX domain sockets, or duplicated with dup(2) and similar calls. An eBPF object is not released until all file descriptors referencing it are closed.

eBPF programs can be written in a restricted C dialect and compiled to eBPF bytecode with the clang compiler. The restricted dialect disallows many features, such as loops, global variables, floating point, and passing structures as function arguments. See the samples/bpf/*_kern.c files in the kernel source for examples.

The kernel's just-in-time (JIT) compiler translates eBPF bytecode into machine code to improve performance. Before Linux 4.15, the JIT is disabled by default; it can be enabled via /proc/sys/net/core/bpf_jit_enable:

  • 0 disable the JIT
  • 1 normal compilation
  • 2 debug mode

Starting with Linux 4.15, the kernel can be built with the CONFIG_BPF_JIT_ALWAYS_ON option, in which case the JIT compiler is always enabled and bpf_jit_enable is pinned to 1.

The following architectures support eBPF's JIT compiler:

  • x86-64 (since Linux 3.18; cBPF since Linux 3.0)
  • ARM32 (since Linux 3.18; cBPF since Linux 3.4)
  • SPARC 32 (since Linux 3.18; cBPF since Linux 3.5)
  • ARM-64 (since Linux 3.18)
  • s390 (since Linux 4.1; cBPF since Linux 3.7)
  • PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1)
  • SPARC 64 (since Linux 4.12)
  • x86-32 (since Linux 4.18)
  • MIPS 64 (since Linux 4.18; cBPF since Linux 3.16)
  • RISC-V (since Linux 5.1)

1.9 eBPF helper functions

You can refer to the official documentation for the helper functions provided by the libbpf library.

The official documentation gives the existing eBPF helper functions. More examples can be found in the samples/bpf/ and tools/testing/selftests/bpf/ directories of the kernel source code.

The official documentation adds the following caveat:

Since eBPF develops faster than its documentation, newly introduced program or map types may not be described there yet; the most accurate descriptions are found in the kernel source tree:

  • include/uapi/linux/bpf.h: the main BPF header file, containing the complete list of helper functions as well as descriptions of the flags, structures, and constants they use
  • net/core/filter.c: most of the network-related helper functions, and the list of program types that may use them
  • kernel/trace/bpf_trace.c: most of the tracing-related helper functions
  • kernel/bpf/verifier.c: the functions used to verify that specific helpers access eBPF maps validly
  • kernel/bpf/: other files in this directory contain further helpers (cgroups, sockmaps, etc.)

1.10 How to write eBPF programs

Historically, BPF programs were converted to bytecode with the kernel's bpf_asm assembler. Fortunately, the LLVM Clang compiler has an eBPF backend, so programs written in C can be compiled to bytecode, and the resulting object files can be loaded directly with the bpf() system call and the BPF_PROG_LOAD command.

An eBPF program can be written in C and compiled with Clang's -target bpf option. There are many eBPF examples in the kernel's samples/bpf/ directory; most of their filenames end in _kern.c. The object file (eBPF bytecode) produced by Clang must be loaded by a program running on the machine (usually the matching file ending in _user.c). To simplify writing eBPF programs, the kernel provides the libbpf library, whose helper functions load, create, and manage eBPF objects.

For example, the general flow of an eBPF program and a user program using libbpf is:

  1. The user program reads the eBPF bytecode and passes it to bpf_load_program().
  2. When the eBPF program runs in the kernel, it calls bpf_map_lookup_elem() to find an element in a map and stores a new value there.
  3. The user program calls bpf_map_lookup_elem() to read the data that the kernel-side eBPF program saved.

However, most of the example code has one major drawback: you must compile your eBPF program inside the kernel source tree. Fortunately, the BCC project addresses this: it provides a complete toolchain for writing and loading eBPF programs without linking against the kernel source tree.

2. eBPF framework

Before diving in, here are the eBPF terms that will help you follow along:

  1. eBPF bytecode: hook code written in C is compiled by clang into binary bytecode, loaded into the kernel by a user program, and run in the in-kernel "virtual machine" when the hook is triggered.
  2. JIT: just-in-time compilation, which compiles bytecode into native machine code to improve running speed, similar to the concept in Java.
  3. Maps: the hook code can save statistics into a key-value map in order to communicate and exchange data with user-space programs.
There are many detailed explanations of the eBPF mechanism online, so they are not repeated here. The figure below (omitted) covers everything involved in using or writing eBPF; the rest of this section walks through it.

foo_kern.c, the hook implementation, is mainly responsible for:

  • Declaring the maps it uses
  • Declaring the hook attach point and its handler function

It is compiled to bytecode via LLVM/clang:

  • Compile command: clang --target=bpf
  • The Android platform has integrated eBPF compilation, which will be covered later

foo_user.c, the user-space side, is mainly responsible for:

  • Loading the bytecode compiled from foo_kern.c into the kernel
  • Reading the information in the map and presenting the output to the user

When the kernel receives an eBPF load request, it first verifies the bytecode and then JIT-compiles it into machine code; when the hook event fires, the hook function is invoked. The kernel verifies loaded bytecode to ensure system safety. The main verification rules are as follows:

  • a. Check that a GPL-compatible license is declared and that the kernel version is supported.
  • b. Function-call rules:

Calls between BPF functions are allowed.

Only the BPF helper functions permitted by the kernel may be called; see the linux/bpf.h file for details.

Functions other than the above, and dynamic linking, are not allowed.

  • c. Control-flow rules:

Loops are not allowed, to prevent the kernel from getting stuck in an infinite loop (bounded loops were later permitted, starting with kernel 5.3).

Unreachable branch code is not allowed.

  • d. The stack size is limited to MAX_BPF_STACK.
  • e. The number of instructions is limited by BPF_COMPLEXITY_LIMIT_INSNS.

The hook mount points mainly include:

[figure: hook mount points]

In addition, there are many samples under samples/bpf in the kernel source tree, which are worth reading if you are interested.

3. The use of eBPF on the Android platform

After the somewhat dry explanation above, everyone should have a basic understanding of eBPF. Next, let's practice with a small performance-monitoring example on the Android platform. The goal of this example is to count the number of system calls made by each application in the system over a period of time.

3.1 Compiling eBPF support in the Android system

At present, the Android build system has integrated eBPF, and eBPF bytecode can easily be compiled in the Android source tree via an Android.bp module.

Android.bp example:

[figure: Android.bp module definition]

The relevant build code is in soong's bpf.go. Google has little documentation about soong, but at least the code is fairly clear.

[figure: bpf.go build rule]

The $ccCmd here is generally clang, so the compile command is essentially clang --target=bpf, no different from normal BPF compilation.

3.2 eBPF hook code implementation

After solving the compilation problem, the next step is to implement the hook code. We are going to hook tracepoints, so first we need to find the definitions of the tracepoints we need, sys_enter and sys_exit, in the include/trace/events/syscalls.h file:

  • The trace parameters of sys_enter are an id and an array of 6 arguments.
  • The trace parameters of sys_exit are two long integers, id and ret.

After finding the hook, the next step is to write the hook processing code:

[figure: eBPF hook code]

Define a map to save the system-call statistics. When a map is declared with DEFINE_BPF_MAP, macro functions for lookup, update and delete are generated as well. For example, the following functions are generated in this case:

bpf_pid_syscall_map_lookup_elem
bpf_pid_syscall_map_update_elem
bpf_pid_syscall_map_delete_elem
  • Define the parameter type of the callback function; refer to the tracepoint definition above.
  • Specify the tracepoint event to monitor.
  • Use the bpf_trace_printk function to print debug information; the output goes directly to ftrace.
  • Look up the specified key in the map.
  • Update the value for the specified key.

3.3 Load the hook code

We only need to push the compiled *.o file to the phone's system/etc/bpf directory and reboot; the system will automatically load our hook file, and after a successful load the map and prog files we defined will appear under the /sys/fs/bpf directory.

The system loading code is in system/bpf/bpfloader, and the code is very simple.

The main operations are as follows:

1) In the early-init phase, write 1 to the following two nodes:

– /proc/sys/net/core/bpf_jit_enable

Enables the eBPF JIT; when the kernel is built with BPF_JIT_ALWAYS_ON, it defaults to 1

– /proc/sys/net/core/bpf_jit_kallsyms

Allows privileged users to read kernel symbols through the kallsyms node

2) Start the bpfloader service

– Read the *.o files in the system/etc/bpf directory and call the loadProg function in libbpf_android.so to load them into the kernel.

– Generate the corresponding /sys/fs/bpf/ node.

– Set property bpf.progs_loaded to 1

The sysfs nodes come in two types, map nodes and prog nodes, prefixed map_ and prog_ respectively.

Below is the node information on an Android Q device.

[figure: /sys/fs/bpf node listing on Android Q]

You can debug dynamic loading with the following command:

[figure: dynamic-loading debug command]

3.4 User Space Program Implementation

Next, we need to write a display program in user space; essentially it reads the BPF map from user mode through system calls.

[figure: user-space statistics program]

  • 1) The eBPF statistics only start once bpf_attach_tracepoint is called. bpf_attach_tracepoint is a function from BCC; Android packages part of BCC into libbpf and ships it as a system library.
  • 2) Obtain the fd of the map; bpf_obj_get directly invokes the bpf system call.
  • 3) Wrap the fd in a BpfMap; Android defines many convenience functions in BpfMap.h.
  • 4) The map-traversal callback function. Its return value must be android::netdutils::status::ok (changed in newer versions of Android).

3.5 View the running results

Execute mm directly in the directory, push the compiled bpf.o to the /system/etc/bpf directory, push the statistics program to the /system/bin directory, reboot, and check the results.

[figure: statistics output]

The first column is the pid and the second is the number of system calls. This concludes the introduction to using eBPF on the Android platform to count each pid's system calls over a period of time.

There are still many technical details that have not been studied in depth, but as this is only a preliminary exploration I will stop here and look into them further when time permits. The research time has been fairly short, so please correct me if there are any mistakes.

4. Overview of seccomp

The following is from the official Linux documentation:

4.1 History

The first version of seccomp was merged into Linux 2.6.12 in 2005. The feature was enabled by writing 1 to /proc/PID/seccomp. Once enabled, the process could only use four system calls: read(), write(), exit() and sigreturn(); calling any other system call resulted in SIGKILL. The idea and patch came from Andrea Arcangeli as a way to safely run other people's code. However, the idea never really took off at the time.

In 2007, kernel 2.6.23 changed the way seccomp is enabled: a prctl() operation mode was added (with the PR_SET_SECCOMP and SECCOMP_MODE_STRICT parameters) and the /proc interface was removed. The behavior of the PR_GET_SECCOMP operation is interesting: if the process is not in seccomp mode it returns 0, otherwise it causes a SIGKILL (because prctl() is not an allowed system call). It's proof that kernel developers do have a sense of humor, Kerrisk said.

Things were quiet in the seccomp world for the next five years or so, until seccomp mode 2 (or "seccomp filter mode") was added in Linux 3.5 in 2012: a second mode, SECCOMP_MODE_FILTER. Using this mode, a process can specify which system calls are allowed: through a mini BPF program, a process can restrict entire system calls or specific argument values. There are already many tools that use seccomp filtering, including the Chrome/Chromium browsers, OpenSSH, vsftpd, and Firefox OS. Seccomp is also used heavily in containers.

In the 3.8 kernel in 2013, a "Seccomp" field was added to /proc/PID/status. By reading this field, a process can determine its seccomp mode (0 is disabled, 1 is strict, 2 is filter). Kerrisk pointed out that a process may need to obtain the file descriptor for this file from elsewhere to ensure that opening it does not result in a SIGKILL.

The seccomp() system call was added in version 3.17 in 2014 (so that seccomp no longer overloads the prctl() system call). The seccomp() system call provides a superset of the existing functionality. It also adds the ability to synchronize all threads of a process to the same filter set, helping to ensure that even threads created before a filter is installed are still affected by it.

4.2 BPF

The filtering mode of seccomp allows developers to write BPF programs that decide, based on the system call number and the argument values passed, whether a given system call may execute. Only arguments passed by value can be examined (the BPF virtual machine does not dereference pointer arguments).

Filters can be installed using seccomp() or prctl(). The BPF program must first be constructed and then installed into the kernel; after that, the filter code is triggered every time a system call is executed. An installed filter cannot be removed, since installing a filter is effectively a declaration that any subsequently executed code cannot be trusted.

The BPF language almost predates Linux, Kerrisk noted. It first appeared in 1992 and was used in the tcpdump program to filter network packets. However, since the number of packets can be large, the cost of passing them all to user space for filtering is quite high. BPF provides filtering at the kernel level, so that user space only needs to process the packets it is interested in.

The seccomp filter developers discovered that other kinds of functionality could be implemented with BPF, and so BPF evolved to allow filtering of system calls. A small in-kernel virtual machine interprets a simple set of BPF instructions.

BPF allows branching, but only forward branches, so loops cannot occur, which guarantees that programs terminate. BPF programs are limited to 4096 instructions, and validity checks are completed at load time. In addition, the verifier ensures that the program always exits via a return instruction telling the kernel what action should be taken for the system call.

BPF continues to be extended: eBPF has been added to the kernel and can filter tracepoints (Linux 3.18) and raw sockets (3.19), and eBPF support for perf events was merged in version 4.1.

BPF has an accumulator register, a data area (which, for seccomp, contains information about the system call), and an implicit program counter. All instructions are 64 bits long: 16 bits for the opcode, two 8-bit fields for jump destinations, and a 32-bit field holding a value whose interpretation depends on the opcode.

The basic instructions used by BPF are: load, store, jump, arithmetic and logical operations, and return. BPF supports conditional and unconditional jump instructions; the latter use the 32-bit field as their offset. Conditional jumps use the two jump-destination fields in the instruction, each containing a jump offset (for the true and false outcomes respectively).

With two jump destinations, BPF can keep the conditional jump set small (for example, there is "jump if equal" but no "jump if not equal"); if you need a comparison in the opposite sense, you can simply interchange the two offsets. The destination is an offset, with 0 meaning "no jump" (execute the next instruction). Since the fields are 8-bit values, at most 255 instructions can be jumped over. As mentioned before, negative offsets are not allowed, which precludes loops.

The BPF data area used by seccomp (struct seccomp_data) has several fields describing the system call in progress: the system call number, the architecture, the instruction pointer, and the system call arguments. It is a read-only buffer and cannot be modified by the program.

4.3 Writing filters

BPF programs can be written using constants and macros, for example:

BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch)))

The above creates a load (BPF_LD) word (BPF_W) operation, using the value in the instruction as an offset into the data area (BPF_ABS). The value is the offset of the architecture field within the data area, so the end result is an instruction that loads the architecture (one of the AUDIT_ARCH_* values from linux/audit.h) into the accumulator. The next instruction is:

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K ,AUDIT_ARCH_X86_64 , 1, 0)

The above creates a jump-if-equal instruction (BPF_JMP | BPF_JEQ) that compares the value in the instruction (BPF_K) with the value in the accumulator. If the architecture is x86-64, the jump skips the next instruction (the number of instructions to jump is "1"); otherwise execution continues with the next instruction (the jump-if-false offset is "0").

A BPF program should validate the architecture first, to ensure the system call numbers are the ones the program expects, since a BPF program may run on a different architecture than the one it was written for.

Once the filter is installed, it runs on every system call, with some performance cost. Every path through the program must end with a return instruction; otherwise, the verifier rejects it with EINVAL. The return value is 32 bits: the upper 16 bits specify the action for the kernel, and the remaining bits carry data associated with that action.

A program can return 5 actions: SECCOMP_RET_ALLOW means the system call is allowed to run; SECCOMP_RET_KILL means the process is terminated immediately, as if it had been killed by SIGSYS (the signal cannot be caught by the process); SECCOMP_RET_ERRNO tells the kernel to fail the system call with the errno value carried in the return data, without executing it; SECCOMP_RET_TRACE tells the kernel to try to notify a ptrace() tracer, giving it the chance to intervene; and SECCOMP_RET_TRAP tells the kernel to send a real SIGSYS signal immediately, which the process can catch if it expects it.

BPF programs can be installed using seccomp() (since Linux 3.17) or prctl(); in both cases, a pointer to a struct sock_fprog is passed, containing the instruction count and a pointer to the program. For the call to succeed, the caller must either have the CAP_SYS_ADMIN capability or have set the process's PR_SET_NO_NEW_PRIVS attribute (which makes execve() ignore set-UID, set-GID, and file capabilities when executing a new program).

If the filtered process calls prctl() or seccomp() again, more filters can be installed. All filters are then run (in the reverse order of installation), and the value with the highest precedence is returned (KILL has the highest precedence, ALLOW the lowest). If the filter allows calls to fork(), clone(), and execve(), the filters are preserved across those calls.

The two main uses of seccomp filters are sandboxing and failure-mode testing. The former restricts programs, especially ones that handle untrusted input, and usually uses a whitelist of system calls. For failure-mode testing, seccomp can inject various unexpected errors into a program to help find bugs.

There are many tools and resources that simplify the development of seccomp filters and BPF. libseccomp provides a set of high-level APIs for creating filters, and the libseccomp project provides plenty of documentation, such as for seccomp_init().

Finally, the kernel has a just-in-time (JIT) compiler for converting BPF bytecode to machine code, which can improve performance by a factor of 2-3. The JIT compiler is disabled by default and can be enabled by writing 1 to the following file:

/proc/sys/net/core/bpf_jit_enable

4.4 XDP

Overview

XDP is a packet processor integrated into the Linux network path, offering safety, programmability, and high performance. The BPF program is executed as soon as the network driver receives a packet. Because XDP can process packets before they enter the protocol stack, it achieves high performance and can be used in DDoS defense, firewalls, load balancing and other areas.

XDP data structure

The data structure used by XDP programs is xdp_buff, not sk_buff; xdp_buff can be regarded as a lightweight sk_buff. The difference between the two: sk_buff contains the packet's metadata and is tied to the upper kernel layers, while xdp_buff is created much earlier and does not depend on them, so XDP can acquire and process packets faster.

The xdp_buff data structure is defined as follows:

// /linux/include/net/xdp.h
struct xdp_rxq_info {
	struct net_device *dev;
	u32 queue_index;
	u32 reg_state;
	struct xdp_mem_info mem;
} ____cacheline_aligned; /* perf critical, avoid false-sharing */

struct xdp_buff {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	unsigned long handle;
	struct xdp_rxq_info *rxq;
};

The sk_buff data structure is defined as follows:

// /include/linux/skbuff.h
struct sk_buff {
	union {
		struct {
			/* These two members must be first. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				/* Some protocols might use this space to store information,
				 * while device pointer would be NULL.
				 * UDP receive path is one user.
				 */
				unsigned long		dev_scratch;
			};
		};
		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
		struct list_head	list;
	};

	union {
		struct sock		*sk;
		int			ip_defrag_offset;
	};

	union {
		ktime_t		tstamp;
		u64		skb_mstamp_ns; /* earliest departure time */
	};
	/*
	 * This is the control buffer. It is free to use for every
	 * layer. Please put your private variables there. If you
	 * want to keep them across layers you have to do a skb_clone()
	 * first. This is owned by whoever has the skb queued ATM.
	 */
	char			cb[48] __aligned(8);

	union {
		struct {
			unsigned long	_skb_refdst;
			void		(*destructor)(struct sk_buff *skb);
		};
		struct list_head	tcp_tsorted_anchor;
	};

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	unsigned long		 _nfct;
#endif
	unsigned int		len,
				data_len;
	__u16			mac_len,
				hdr_len;

	/* Following fields are _not_ copied in __copy_skb_header()
	 * Note that queue_mapping is here mostly to fill a hole.
	 */
	__u16			queue_mapping;

/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK	(1 << 7)
#else
#define CLONED_MASK	1
#endif
#define CLONED_OFFSET()		offsetof(struct sk_buff, __cloned_offset)

	__u8			__cloned_offset[0];
	__u8			cloned:1,
				nohdr:1,
				fclone:2,
				peeked:1,
				head_frag:1,
				xmit_more:1,
				pfmemalloc:1;
#ifdef CONFIG_SKB_EXTENSIONS
	__u8			active_extensions;
#endif
	/* fields enclosed in headers_start/headers_end are copied
	 * using a single memcpy() in __copy_skb_header()
	 */
	/* private: */
	__u32			headers_start[0];
	/* public: */

/* if you move pkt_type around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_TYPE_MAX	(7 << 5)
#else
#define PKT_TYPE_MAX	7
#endif
#define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)

	__u8			__pkt_type_offset[0];
	__u8			pkt_type:3;
	__u8			ignore_df:1;
	__u8			nf_trace:1;
	__u8			ip_summed:2;
	__u8			ooo_okay:1;

	__u8			l4_hash:1;
	__u8			sw_hash:1;
	__u8			wifi_acked_valid:1;
	__u8			wifi_acked:1;
	__u8			no_fcs:1;
	/* Indicates the inner headers are valid in the skbuff. */
	__u8			encapsulation:1;
	__u8			encap_hdr_csum:1;
	__u8			csum_valid:1;

#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_VLAN_PRESENT_BIT	7
#else
#define PKT_VLAN_PRESENT_BIT	0
#endif
#define PKT_VLAN_PRESENT_OFFSET()	offsetof(struct sk_buff, __pkt_vlan_present_offset)
	__u8			__pkt_vlan_present_offset[0];
	__u8			vlan_present:1;
	__u8			csum_complete_sw:1;
	__u8			csum_level:2;
	__u8			csum_not_inet:1;
	__u8			dst_pending_confirm:1;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2;
#endif

	__u8			ipvs_property:1;
	__u8			inner_protocol_type:1;
	__u8			remcsum_offload:1;
#ifdef CONFIG_NET_SWITCHDEV
	__u8			offload_fwd_mark:1;
	__u8			offload_l3_fwd_mark:1;
#endif
#ifdef CONFIG_NET_CLS_ACT
	__u8			tc_skip_classify:1;
	__u8			tc_at_ingress:1;
	__u8			tc_redirected:1;
	__u8			tc_from_ingress:1;
#endif
#ifdef CONFIG_TLS_DEVICE
	__u8			decrypted:1;
#endif

#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#endif

	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	__u32			priority;
	int			skb_iif;
	__u32			hash;
	__be16			vlan_proto;
	__u16			vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
	union {
		unsigned int	napi_id;
		unsigned int	sender_cpu;
	};
#endif
#ifdef CONFIG_NETWORK_SECMARK
	__u32		secmark;
#endif

	union {
		__u32		mark;
		__u32		reserved_tailroom;
	};

	union {
		__be16		inner_protocol;
		__u8		inner_ipproto;
	};

	__u16			inner_transport_header;
	__u16			inner_network_header;
	__u16			inner_mac_header;

	__be16			protocol;
	__u16			transport_header;
	__u16			network_header;
	__u16			mac_header;

	/* private: */
	__u32			headers_end[0];
	/* public: */

	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	refcount_t		users;

#ifdef CONFIG_SKB_EXTENSIONS
	/* only useable after checking ->active_extensions != 0 */
	struct skb_ext		*extensions;
#endif
};

4.5 Relationship between XDP and eBPF

XDP programs are controlled through the bpf() system call and are loaded with the program type BPF_PROG_TYPE_XDP.

XDP mode of operation

XDP supports 3 working modes, with native mode used by default:

  • Native XDP: in native mode, the XDP BPF program runs in the early receive path (RX queue) of the network driver, so NIC driver support is required to use this mode.
  • Offloaded XDP: in offloaded mode, the XDP BPF program processes packets directly on the NIC (Network Interface Controller) without using the host CPU; its performance is higher than native mode.
  • Generic XDP: generic mode is mainly provided for developers to test with. For NICs or drivers that cannot support native or offloaded mode, the kernel provides this general-purpose mode, which runs in the protocol stack and requires no driver modification. Native or offloaded mode is recommended in production environments.

XDP operation result codes

  • XDP_DROP: drop the packet; this happens at the earliest RX stage in the driver.
  • XDP_PASS: pass the packet up to the protocol stack, in one of two forms: 1. Normal reception: allocate an sk_buff, push the received packet up the stack, and possibly steer it to another CPU for processing; this allows raw interfaces to user space, and can happen before or after packet modification. 2. GRO (generic receive offload): large packets are received and packets from the same connection are merged; after processing, GRO passes the packet into the "normal receive" flow.
  • XDP_TX: forward the packet, sending it back out of the same NIC it arrived on. This can happen before or after packet modification.
  • XDP_REDIRECT: redirect the packet; like XDP_TX, but sends the packet out through another NIC or into a BPF cpumap.
  • XDP_ABORTED: indicates that the eBPF program hit an error, causing the packet to be dropped. Programs should not use this as a normal return code.

XDP and the iproute2 loader

The ip command provided by the iproute2 tool can act as an XDP loader, loading an XDP program that has been compiled into an ELF file.

  • Write the XDP program xdp_filter.c, which drops all TCP packets. The program takes an xdp_md structure pointer as input, the BPF counterpart of the driver's xdp_buff. The entry function of the program is named filter, and the section of the compiled ELF file is named mysection.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#define SEC(NAME) __attribute__((section(NAME), used))

SEC("mysection")
int filter(struct xdp_md *ctx) {
    int ipsize = 0;
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    ipsize = sizeof(*eth);
    ip = data + ipsize;

    ipsize += sizeof(struct iphdr);
    if (data + ipsize > data_end) {
        return XDP_DROP;
    }

    if (ip->protocol == IPPROTO_TCP) {
        return XDP_DROP;
    }

    return XDP_PASS;
}
  • Compile the XDP program to an ELF file
clang -O2 -target bpf -c xdp_filter.c -o xdp_filter.o
  • Use the ip command to load the XDP program, with the mysection section as the program's entry point
sudo ip link set dev ens33 xdp obj xdp_filter.o sec mysection

If loading completes without errors, you can view the result with the following command:

$ sudo ip a show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric/id:56 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:2f:a8:41 brd ff:ff:ff:ff:ff:ff
    inet 192.168.136.140/24 brd 192.168.136.255 scope global dynamic noprefixroute ens33
       valid_lft 1629sec preferred_lft 1629sec
    inet6 fe80::d411:ff0d:f428:ce2a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

Here, xdpgeneric/id:56 indicates that the generic driver (xdpgeneric) is in use and that the XDP program id is 56.

  • Verify the connection-blocking effect
  1. Use nc -l 8888 to listen on TCP port 8888, and nc xxxxx 8888 to connect and send data. If the target host receives no data, the TCP connection has been blocked successfully.
  2. Use nc -kul 9999 to listen on UDP port 9999, and nc -u xxxxx 9999 to connect and send data. The target host receives the data normally, showing that UDP traffic is unaffected.
  • Unload the XDP program
$ sudo ip link set dev ens33 xdp off

After unloading the program, connect to port 8888 and send data again; communication is back to normal.

XDP and BCC

Write the C code xdp_bcc.c, which drops packets whose TCP destination port is 9999:

#define KBUILD_MODNAME "program"
#include <linux/bpf.h>
#include <linux/if_ether.h>  /* for struct ethhdr */
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>

int filter(struct xdp_md *ctx) {
    int ipsize = 0;
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    ipsize = sizeof(*eth);
    ip = data + ipsize;

    ipsize += sizeof(struct iphdr);
    if (data + ipsize > data_end) {
        return XDP_DROP;
    }

    if (ip->protocol == IPPROTO_TCP) {
        struct tcphdr *tcp = (void *)ip + sizeof(*ip);
        ipsize += sizeof(struct tcphdr);
        if (data + ipsize > data_end) {
            return XDP_DROP;
        }

        if (tcp->dest == ntohs(9999)) {
            bpf_trace_printk("drop tcp dest port 9999\n");
            return XDP_DROP;
        }
    }

    return XDP_PASS;
}

Similar to loading with the ip command, here we write a Python loader that compiles the XDP program and injects it into the kernel.

#!/usr/bin/python

from bcc import BPF
import time

device = "ens33"
b = BPF(src_file="xdp_bcc.c")
fn = b.load_func("filter", BPF.XDP)
b.attach_xdp(device, fn, 0)

try:
  b.trace_print()
except KeyboardInterrupt:
  pass

b.remove_xdp(device, 0)

Verify the effect: using nc, it is no longer possible to communicate with port 9999 on the target host.

$ sudo python xdp_bcc.py 

<idle>-0       [003] ..s. 22870.984559: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22871.987644: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22872.988840: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22873.997261: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22875.000567: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22876.002998: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22878.005414: 0: drop tcp dest port 9999
<idle>-0       [003] ..s. 22882.018119: 0: drop tcp dest port 9999

4.5 Rings

There are 4 different types of rings: FILL, COMPLETION, RX and TX. All rings are single-producer/single-consumer, so user-space programs need to synchronize explicitly if multiple processes or threads read and write the same ring.

UMEM uses 2 of these rings: FILL and COMPLETION. Each socket associated with a UMEM must have an RX queue, a TX queue, or both. If 4 sockets are configured (each using both TX and RX), there will be 1 FILL ring, 1 COMPLETION ring, 4 TX rings and 4 RX rings.

The rings are head (producer) / tail (consumer) based. A producer writes data at the ring index indicated by the producer member of struct xdp_ring and then increments the producer index; a consumer reads data at the ring index indicated by the consumer member and then increments the consumer index.

The rings are configured and created through the *_RING setsockopt system calls and mapped into user space using mmap() with the appropriate offsets.

The ring size must be a power of 2.

4.6 UMEM Fill Ring

The FILL ring is used to pass UMEM frames from user space to kernel space by passing UMEM addresses into the ring. For example, if the UMEM is 64k and each chunk is 4k, the UMEM contains 16 chunks and the addresses passed are in the range 0 to 64k.

Frames passed to the kernel are used in the ingress path (RX rings).

The user application also produces UMEM addresses into this ring. Note that the kernel masks incoming addresses when the application runs in aligned-chunk mode: with a 2k chunk size, the log2(2048) least-significant bits of the address are masked out, so 2048, 2050 and 3000 all refer to the same chunk. If the user application runs in unaligned-chunk mode, the address passed in is left unchanged.

4.7 UMEM Completion Ring

The COMPLETION ring is used to pass UMEM frames from kernel space back to user space; like the FILL ring, it carries UMEM addresses.

Frames whose transmission has completed are returned here so that user space can reuse them.

The user application consumes the UMEM addresses from this ring.

4.8 RX Ring

The RX ring sits on the receive side of the socket. Each entry in the ring is a descriptor of type struct xdp_desc, containing the UMEM offset (address) of the frame and the length of the data.

If no frames have been passed to the kernel via the FILL ring, no descriptors will appear in the RX ring.

The user program consumes the xdp_desc descriptors from this ring.

4.9 TX Ring

The TX ring is used to send frames: fill in an xdp_desc descriptor (address and length) and pass it into the ring.

To actually start the transfer, sendmsg() must be called; this restriction may be relaxed in the future.

The user program produces the xdp_desc descriptors for the TX ring.

4.10 XSKMAP / BPF_MAP_TYPE_XSKMAP

On the XDP side, a BPF map of type BPF_MAP_TYPE_XSKMAP will be used, combined with bpf_redirect_map() to pass the ingress frame to the socket.

The user application will insert sockets into the map through the bpf() system call.

Note that if an XDP program tries to redirect a frame to a socket that does not match the queue configuration and netdev, the frame will be dropped. That is, if an AF_XDP socket is bound to a netdev named eth0 on queue 17, only an XDP program attached to eth0 and acting on queue 17 can deliver data to that socket. See samples/bpf/ for examples.

4.11 Configuration flags and socket options

XDP_COPY and XDP_ZERO_COPY bind flags

When binding a socket, the kernel first tries zero-copy mode. If zero-copy is not supported, it falls back to copy mode, i.e. copying all packets to user space. To force a specific mode, the following flags can be used: if XDP_COPY is passed to the bind call, the kernel forces copy mode, and if copy mode cannot be used the bind call fails with an error. Conversely, XDP_ZERO_COPY forces the socket into zero-copy mode, or the call fails.

XDP_SHARED_UMEM bind flag

This flag means that multiple sockets can be bound to the same UMEM, but only on the same netdev and queue id. In this mode each socket has its own RX and TX rings, but the UMEM has only one FILL ring and one COMPLETION ring. To use this mode, create and bind the first socket in the normal way, then create a second socket with an RX and a TX ring (or either one) but no FILL or COMPLETION ring (those are shared with the first socket). In the bind call, set the XDP_SHARED_UMEM option and provide the fd of the initial socket in sxdp_shared_umem_fd, and so on for further sockets.

So when a packet is received, which socket should it go to? The answer is determined by the XDP program: insert all the sockets into an XSKMAP and redirect each packet to the socket at the chosen index. The following simple example distributes packets in a round-robin manner:

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_SOCKS 16

/* XSKMAP holding up to MAX_SOCKS AF_XDP sockets */
struct {
     __uint(type, BPF_MAP_TYPE_XSKMAP);
     __uint(max_entries, MAX_SOCKS);
     __uint(key_size, sizeof(int));
     __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");

static unsigned int rr;

SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
     /* round-robin over the sockets; drop if the slot is empty */
     rr = (rr + 1) & (MAX_SOCKS - 1);

     return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}

Note that since there is only one FILL ring and one COMPLETION ring, and these are single-producer/single-consumer rings, you must ensure that multiple processors or threads do not use them concurrently; libbpf provides no synchronization for this.

This mode is used by libbpf when multiple sockets are bound to the same UMEM. Note, however, that the XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag must be passed to the xsk_socket__create call, and you must then load your own XDP program (because libbpf has no built-in logic for routing traffic among the sockets).

XDP_USE_NEED_WAKEUP bind flag

This option enables a flag called need_wakeup on the FILL ring and the TX ring, for which user space is the producer. When this option is passed in the bind call, the kernel sets the need_wakeup flag whenever it needs to be explicitly woken up by a system call to continue processing packets.

If the flag is set on the FILL ring, the application needs to call poll() to be able to continue receiving packets on the RX ring. This happens, for example, when the kernel detects that there are no more buffers in the FILL ring and none left in the NIC's RX HW ring; interrupts are then turned off and the NIC cannot receive any packets (there is no buffer to put them in). Because need_wakeup is set, user space knows to put buffers on the FILL ring and then call poll(), after which the kernel driver can add those buffers to the HW ring and resume receiving packets.

If the flag is set on the TX ring, the application needs to explicitly notify the kernel to send the packets placed on the TX ring, either by calling poll() or by calling sendto().

An example can be found in samples/bpf/xdpsock_user.c . An example of using libbpf helper functions on the TX path is as follows:

if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
   sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

It is recommended to enable this mode: it reduces the number of system calls on the TX path and can therefore improve performance, whether the application and driver run on the same core or on different cores.

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts

These socket options set the number of descriptors for the RX, TX, FILL and COMPLETION rings respectively (at least the descriptor count for the RX or TX ring must be set). If both RX and TX are set, the application can both receive and send traffic; if only one is set, the corresponding resources are saved. If a UMEM is to be bound to the socket, both the FILL ring and the COMPLETION ring must be set. If the XDP_SHARED_UMEM flag is used, there is no need to create a separate UMEM for sockets other than the first one; all sockets use the shared UMEM. Note that the rings are single-producer/single-consumer structures, so multiple processes cannot access the same ring concurrently. See the XDP_SHARED_UMEM section.

When using libbpf, you can create Rx-only and Tx-only sockets by setting NULL for the rx and tx parameters of the xsk_socket__create function.

If a Tx-only socket is created, it is recommended not to place any frames on the FILL ring; otherwise the driver may believe it needs to receive data (which is not the case), which hurts performance.

XDP_UMEM_REG setsockopt

This socket option registers a UMEM with a socket; the area it describes contains the buffers that will hold packets. The call takes a pointer to the beginning of the region and the size of the region. In addition, there is a chunk-size parameter into which the UMEM is divided (currently only 2K or 4K are supported). If the UMEM area is 128K with a chunk size of 2K, the UMEM can hold at most 128K / 2K = 64 frames, and the maximum packet size is 2K.

There is also an option to set per-buffer headroom in the UMEM. If it is set to N bytes, packet data starts at the Nth byte of the buffer, reserving the first N bytes for the application. The last option is the flags field, which is described separately under each UMEM flag.

XDP_STATISTICS getsockopt

Gets drop statistics for a socket, useful for debugging. The supported statistics are:

struct xdp_statistics {
       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};

XDP_OPTIONS getsockopt

Get the options of an XDP socket. Currently only XDP_OPTIONS_ZEROCOPY is supported, which is used to check whether zero copy is used.

From these characteristics of AF_XDP we can see its limitations: XDP cannot redirect different traffic to many AF_XDP sockets, because each AF_XDP socket must be bound to a single RX/TX queue of the physical interface. Most physical and virtual NICs support only one RX/TX queue per interface, so once an AF_XDP socket is bound to that interface, subsequent bind operations will fail. Only some NICs support multiple RX/TX queues, usually just 2/4/8, which does not scale to the hundreds of containers found in the cloud.

See the AF_XDP official documentation and this paper for more details.

5. TC

In addition to XDP, BPF can also be used in the kernel's tc (traffic control) layer of the network data path. The differences between XDP and tc were covered above.

  • ingress hook:__netif_receive_skb_core() -> sch_handle_ingress()
  • egress hook:__dev_queue_xmit() -> sch_handle_egress()


BPF programs running at the tc layer use the cls_bpf classifier ("cls" is short for classifier). In tc, the BPF attachment point is described as a "classifier", which is a somewhat misleading term since it understates what cls_bpf supports: a fully programmable packet processor that can not only read the skb metadata and packet data, but also arbitrarily modify them, and finally terminate tc processing by returning a verdict action (see below). cls_bpf can thus be regarded as a self-contained entity that manages and executes tc BPF programs.

cls_bpf can hold one or more tc BPF programs. Traditionally, tc separates classifiers from action modules; each classifier can have one or more actions attached, which are executed once the classifier matches. Using this model to process complex packets in the modern software data path runs into scalability problems. Since a tc BPF program attached to cls_bpf is fully self-contained, it effectively fuses the parsing and action steps into a single unit. Thanks to cls_bpf's direct-action mode, the program only needs to return the tc action verdict and immediately end processing, enabling a scalable programmable packet-processing pipeline in the network data path while avoiding the linear iteration over actions. cls_bpf is the only "classifier" module in the tc layer capable of such a fast path.

Like XDP BPF programs, tc BPF programs can be atomically updated at runtime via cls_bpf without interrupting any network traffic or restarting services.

The tc ingress and egress hooks that cls_bpf attaches to are both managed by a pseudo qdisc called sch_clsact. Because it can manage both the ingress and egress tc hooks, it is a superset of (and a drop-in replacement for) the ingress qdisc. One note about the tc egress hook in __dev_queue_xmit(): it does not run under the kernel's qdisc root lock. Thus, both the tc ingress and egress hooks run lock-free in the fast path, with preemption disabled and inside an RCU read-side section.

Usually, qdiscs such as sch_mq, sch_fq, sch_fq_codel or sch_htb are attached to the network device on egress. Some of them are classful qdiscs (containing subclasses) and therefore need a packet classification mechanism to decide where to enqueue the packet. This is handled by calling tcf_classify(), which in turn invokes the tc classifiers, if any. cls_bpf can also be attached and used in this scenario: some operations under the qdisc root lock can suffer from lock contention. The egress hook of the sch_clsact qdisc runs at an earlier point, outside the scope of this lock, and thus operates completely independently of regular egress qdiscs. So for cases like sch_htb, the sch_clsact qdisc can do the heavy packet classification work via tc BPF outside the qdisc root lock, setting skb->mark or skb->priority in these tc BPF programs so that sch_htb only needs a simple mapping, with no expensive classification work under the root lock. This reduces lock contention.

Offloaded tc BPF programs are supported in the context of sch_clsact combined with cls_bpf, where a previously loaded BPF program is jit-compiled by the SmartNIC driver to run natively on the NIC. Only cls_bpf programs running in direct-action mode support offloading. cls_bpf supports offloading only a single program (multiple programs cannot be offloaded), and only ingress supports offloaded BPF programs.

A cls_bpf instance can hold multiple tc BPF programs; in that case, the TC_ACT_UNSPEC return code continues execution with the next tc BPF program in the list. The disadvantage is that several programs will then parse the same packet multiple times, degrading performance.

5.1 return code

tc's ingress and egress hooks share the same set of action return codes used as the tc BPF program's verdict, defined in the linux/pkt_cls.h system header file:

#define TC_ACT_UNSPEC         (-1)
#define TC_ACT_OK               0
#define TC_ACT_SHOT             2
#define TC_ACT_STOLEN           4
#define TC_ACT_REDIRECT         7

There are also some other TC_ACT_* action codes in that header file that can be used by the two hooks, but they share the semantics of the codes above. That is, from the perspective of tc BPF, TC_ACT_OK and TC_ACT_RECLASSIFY have identical semantics, as do the three opcodes TC_ACT_STOLEN, TC_ACT_QUEUED and TC_ACT_TRAP. Therefore, for these cases we only describe TC_ACT_OK and TC_ACT_STOLEN.

Starting with TC_ACT_UNSPEC, which means "unspecified action", used in three scenarios: i) when an offloaded tc BPF program runs at the cls_bpf location of the tc ingress hook, the offloaded program returns TC_ACT_UNSPEC; ii) to continue with the next tc BPF program in cls_bpf in the multi-program case, possibly in combination with an offloaded tc BPF program from i) followed by a non-offloaded program; iii) TC_ACT_UNSPEC can also be used in the single-program case to tell the kernel to keep processing the skb without further side effects. TC_ACT_UNSPEC is similar to TC_ACT_OK: both pass the skb up to the higher layers of the network stack on ingress, or down to the network device driver for transmission on egress. The only difference is that TC_ACT_OK sets skb->tc_index based on the classid set by the tc BPF program, while TC_ACT_UNSPEC sets it from skb->tc_classid in the BPF context.

TC_ACT_SHOT tells the kernel to drop the packet: on ingress the higher layers of the network stack will never see the skb, and on egress the packet is never transmitted. TC_ACT_SHOT and TC_ACT_STOLEN are similar in nature, with subtle differences: TC_ACT_SHOT signals to the kernel that the skb was released via kfree_skb() and returns NET_XMIT_DROP to the caller immediately, while TC_ACT_STOLEN releases the skb via consume_skb() and returns NET_XMIT_SUCCESS to the upper layers, pretending the transmission succeeded. The kernel's packet-drop monitoring records kfree_skb() calls, so it will not see any packets dropped via TC_ACT_STOLEN, since semantically those skbs were consumed or queued rather than dropped.

Finally, the TC_ACT_REDIRECT action allows a tc BPF program to redirect the skb to the ingress or egress path of the same or a different device through the bpf_redirect() helper function. By steering packets into the ingress or egress direction of other devices, BPF's packet-forwarding capability can be exploited to the fullest. This method requires no changes to the target network device, nor does another cls_bpf instance need to run on it.

5.2 Loading tc BPF program

Suppose there is a tc BPF program named prog.o. It can be attached to a network device through the tc command; unlike XDP, no driver support is needed to attach the BPF program to the device. Below, a network device named em1 is used, and the program is attached to em1's ingress packet path.

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o

The first step is to set up a clsact qdisc. As mentioned above, clsact is a pseudo qdisc, similar to the ingress qdisc, which only holds classifiers and actions but provides no actual queueing; it is required for attaching the bpf classifier. clsact provides two special hooks, called ingress and egress, to which classifiers can be attached. Both hooks sit at central receive and transmit points of the network data path through which every packet on the device passes. The ingress hook is called from the kernel's __netif_receive_skb_core() -> sch_handle_ingress(), and the egress hook from __dev_queue_xmit() -> sch_handle_egress().

The operation of attaching the program to the egress hook is:

# tc filter add dev em1 egress bpf da obj prog.o

The clsact qdisc processes packets from the ingress and egress directions in a lock-free manner, and can be attached to a queue-free virtual device, such as a veth device connected to a container.

In the tc filter command, bpf is used in da (direct-action) mode. Using and specifying da mode is recommended: it essentially means the bpf classifier no longer needs to call into external tc action modules; all packet modification, forwarding or other actions can be performed by the attached BPF program itself, making processing faster.

At this point, the bpf program is attached and will be executed whenever a packet traverses the device. As with XDP, if a non-default section name is used, it can be specified at load time; below, the section named foobar is specified:

# tc filter add dev em1 egress bpf da obj prog.o sec foobar

The BPF loader of iproute2 allows the same command-line syntax to be used across program types.

Attached programs can be listed using the following commands:

# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

# tc filter show dev em1 egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714

The output prog.o:[ingress] indicates that the section named ingress was loaded from the file prog.o, and that bpf runs in direct-action mode. The program id and tag are appended in both cases; the tag is a hash over the instruction stream, which can be correlated with the object file or with perf reports containing stack traces, while the id is a system-wide unique identifier for the BPF program. bpftool can be used to view or dump the attached BPF programs.

Multiple BPF programs can be attached to tc, and tc also provides other classifiers that can be chained together. However, a single BPF program is usually fully sufficient, since all packet operations can be implemented in one program using da (direct-action) mode, meaning the BPF program itself returns the tc action verdict, such as TC_ACT_OK or TC_ACT_SHOT. This is the recommended approach for best performance and flexibility.

In the show command above, pref 49152 and handle 0x1 are displayed next to the BPF-related output. Both are generated automatically if not explicitly provided on the command line. pref denotes a priority number: when multiple classifiers are attached, they are executed in ascending priority order. handle is an identifier that matters when multiple instances of the same classifier are loaded under the same pref. Since a single program is enough in the BPF case, pref and handle can usually be ignored.

Only when the attached BPF program will later be replaced atomically is it recommended to specify pref and handle explicitly at initial load time, so that they do not have to be queried when performing the replace operation later. Creation then looks like this:

# tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

# tc filter show dev em1 ingress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f

For atomic replacement, the following command updates the program on the existing ingress hook with the BPF program from the foobar section of the file prog.o:

# tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

Finally, to remove all programs attached to ingress and egress, use the following command:

# tc filter del dev em1 ingress
# tc filter del dev em1 egress

In order to remove the entire clsact qdisc on the network device, that is, remove all programs attached to the ingress and egress hooks, the following command can be used:

# tc qdisc del dev em1 clsact

tc BPF programs can also be offloaded if the NIC and driver also support offloading like XDP BPF programs. Netronome's nfp supports both types of BPF offload.

# tc qdisc add dev em1 clsact
# tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
Error: TC offload is disabled on net device.
We have an error talking to the kernel

If the above error occurs, tc hardware offload must first be enabled via ethtool's hw-tc-offload option:

# ethtool -K em1 hw-tc-offload on
# tc qdisc add dev em1 clsact
# tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
# tc filter show dev em1 ingress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b

The in_hw flag indicates that the program has been offloaded to the NIC. Note that tc and XDP BPF programs cannot be offloaded at the same time; one of the two must be chosen.


Copyright statement: This article is an original article of the WeChat public account "Deep Linux", which follows the CC 4.0 BY-SA copyright agreement. For reprinting, please attach the original source link and this statement.
Original link: https://mp.weixin.qq.com/s?__biz=Mzg4NDQ0OTI4Ng==&mid=2247485153&idx=1&sn=95375d07e6c14912038b1920f049b339&chksm=cfb94f88f8cec69e5903bd97f6ac71abecdedde81ab604a9fc6c41e7fa07fdf6a0ead0c3718e#rd
