The road to BPF

Preface

BPF is one of the top subsystems in the kernel, and it is very sophisticated. There are few books on it, and they all look at BPF from an application perspective. I want to write a series of articles that examines BPF from the perspective of a security researcher, to help more people learn and research it.

The book "Linux Kernel Observation Technology" gets started by using the existing wrapper functions in the kernel source tree as examples. The layers of wrapping bring in many dependencies at compile time, the code is complex, and the bottom layer cannot be seen at a glance, which is not very beginner-friendly.

Let's make one thing clear first: all BPF-related functions in user space are ultimately wrappers around the bpf system call, so we can skip these wrapper functions completely and write the bpf system calls by hand.

The best learning material is always the man page. I translated the part of the manual about the bpf system call, as follows.

System call declaration

  • bpf – Execute commands on extended BPF maps or programs
#include <linux/bpf.h>
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
  • This function is not actually defined in linux/bpf.h and needs to be defined manually. It is just a wrapper around the system call:
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

Description

The bpf() system call performs a range of operations related to extended BPF (eBPF). eBPF is similar to the original ("classic") BPF (cBPF) used to filter network packets. For both cBPF and eBPF programs, the kernel statically analyzes the program before loading it, to ensure that it is safe.

eBPF extends cBPF in several ways, including the ability to call a fixed set of in-kernel helper functions (via the BPF_CALL opcode extension provided by eBPF) and to access shared data structures such as eBPF maps.

eBPF design architecture

An eBPF map is a generic data structure for storing different types of data. Keys and values are treated as opaque binary data, so the user only needs to specify the size of the key and of the value when creating a map. In other words, the key or value of a map can hold data of any type.

A user process can create multiple maps (whose key/value pairs are opaque bytes of data) and access them through file descriptors. Different eBPF programs can access the same map in parallel. What is stored in a map is up to the user process and the eBPF program.

There is one special map type, called a program array. This type of map stores file descriptors that refer to other eBPF programs. When a lookup in this map is performed, the program flow is redirected in place to the beginning of another eBPF program and does not return to the calling program. Nesting is limited to 32 levels, so infinite loops ("matryoshka dolls") cannot be created. The program file descriptors stored in the map can be modified at run time, so program functionality can be altered according to specific requirements. All programs referenced in a program-array map must have been loaded into the kernel beforehand via bpf(). If a map lookup fails, the current program continues its execution.

Generally, eBPF programs are loaded by a user process and automatically unloaded when the process exits. In some cases, for example tc-bpf(8), the program continues to live inside the kernel even after the process that loaded it exits. In that case, the tc subsystem holds a reference to the eBPF program after the file descriptor has been closed by the process. Thus, whether a specific program stays alive inside the kernel depends on how it is further attached to a kernel subsystem after it was loaded via bpf().

Each eBPF program is a set of instructions that is safe to run until completion. An in-kernel verifier statically determines that the eBPF program terminates and is safe to execute. During verification, the kernel increments the reference count of every map that the eBPF program uses, so the attached maps cannot be removed until the program is unloaded.

eBPF programs can be attached to different events. These events can be the arrival of network packets, tracing events, classification events by network queueing disciplines, and other types that may be added in the future. A new event triggers execution of the eBPF program, which may store information about the event in eBPF maps. Beyond storing data, eBPF programs may call a fixed set of in-kernel helper functions.

The same eBPF program can be attached to multiple events, and different eBPF programs can access the same map. The schematic diagram is as follows:

tracing     tracing    tracing    packet      packet     packet
event A     event B    event C    on eth0     on eth1    on eth2
|             |         |          |           |          ^
|             |         |          |           v          |
--> tracing <--     tracing      socket    tc ingress   tc egress
     prog_1          prog_2      prog_3    classifier    action
     |  |              |           |         prog_4      prog_5
  |---  -----|  |------|          map_3        |           |
map_1       map_2                              --| map_4 |--

System call parameters

The operation performed by the bpf() system call is determined by the cmd argument. Each operation takes an accompanying argument, provided via attr, which is a pointer to a union of type bpf_attr. The size argument is the size of the data pointed to by attr.

cmd can be one of the following values:

  • BPF_MAP_CREATE: Create a map and return a file descriptor that refers to it. The close-on-exec flag is automatically set on the new file descriptor.
  • BPF_MAP_LOOKUP_ELEM: Look up an element by key in the specified map and return its value.
  • BPF_MAP_UPDATE_ELEM: Create or update an element in the specified map.
  • BPF_MAP_DELETE_ELEM: Look up and delete an element by key in the specified map.
  • BPF_MAP_GET_NEXT_KEY: Look up an element by key in the specified map and return the key of the next element.
  • BPF_PROG_LOAD: Verify and load an eBPF program, returning a new file descriptor associated with the program. The close-on-exec flag is automatically set on this file descriptor as well.

The bpf_attr union consists of various anonymous structures that are used by different bpf() commands:

union bpf_attr {
   struct {    /* Used by BPF_MAP_CREATE */
       __u32         map_type;    /* type of the map */
       __u32         key_size;    /* size of key in bytes */
       __u32         value_size;  /* size of value in bytes */
       __u32         max_entries; /* maximum number of entries in a map */
   };

   struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY */
       __u32         map_fd;
       __aligned_u64 key;
       union {
           __aligned_u64 value;
           __aligned_u64 next_key;
       };
       __u64         flags;
   };

   struct {    /* Used by BPF_PROG_LOAD */
       __u32         prog_type;
       __u32         insn_cnt;
       __aligned_u64 insns;      /* 'const struct bpf_insn *' */
       __aligned_u64 license;    /* 'const char *' */
       __u32         log_level;  /* verbosity level of the verifier */
       __u32         log_size;   /* size of user buffer */
       __aligned_u64 log_buf;    /* user supplied 'char *' buffer */
       __u32         kern_version;
                                 /* checked when prog_type=kprobe  (since Linux 4.1) */
   };
} __attribute__((aligned(8)));
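
As a minimal sketch of how this union tends to be used: because every command reads a different anonymous structure out of the same memory, a common precaution is to zero the whole union before filling in the fields for one command. The helper name init_map_create_attr below is hypothetical and only serves the illustration; it is not part of any library.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: prepare attr for a BPF_MAP_CREATE call. */
static void init_map_create_attr(union bpf_attr *attr, uint32_t map_type,
                                 uint32_t key_size, uint32_t value_size,
                                 uint32_t max_entries)
{
    /* Zero every byte first, including fields this command does not use. */
    memset(attr, 0, sizeof(*attr));
    attr->map_type    = map_type;
    attr->key_size    = key_size;
    attr->value_size  = value_size;
    attr->max_entries = max_entries;
}
```

Since the anonymous structures overlap, map_type, map_fd and prog_type all live at offset 0 of the union; which interpretation applies depends only on the cmd passed to bpf().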

eBPF maps

Maps are a generic data structure for storing different types of data. They allow data to be shared between eBPF kernel programs, and also between user processes and the kernel.

Each map has the following attributes:

  • type
  • maximum number of elements
  • key size in bytes
  • value size in bytes

The following wrapper functions show how the various bpf system calls are used to access maps. The functions select the operation through the cmd argument.

BPF_MAP_CREATE

The BPF_MAP_CREATE command creates a new map and returns a file descriptor that refers to it.

int bpf_create_map(enum bpf_map_type map_type,
    unsigned int key_size,
    unsigned int value_size,
    unsigned int max_entries)
{
    union bpf_attr attr = {    //fill in the object that attr points to
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); //make the system call
}

The type of the new map is specified by map_type, and its attributes by key_size, value_size, and max_entries. On success, the file descriptor is returned; on failure, -1 is returned.

The key_size and value_size attributes are used by the verifier during program loading to check that the program calls the bpf_map_*_elem() helper functions with a correctly initialized key, and to check that the program does not access the map element value beyond the specified value_size.

For example, when a map is created with a key_size of 8 and the eBPF program calls bpf_map_lookup_elem(map_fd, fp - 4), the program will be rejected, because the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects to read 8 bytes from the location pointed to by key, but fp - 4 (where fp is the top of the stack) as a starting address would cause an out-of-bounds access to the stack.

Similarly, suppose a map is created with a value_size of 1 and the eBPF program contains:

value = bpf_map_lookup_elem(...);
*(u32 *) value = 1;

Such a program will be rejected, because it accesses the value pointer beyond the 1-byte limit specified by value_size.
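
The two rejections above come down to simple interval checks. The following is a toy model of that arithmetic, not the verifier's actual code; the function names are hypothetical, and the 512-byte stack size is my assumption about the eBPF stack limit.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_STACK_SIZE 512  /* assumed eBPF stack size in bytes */

/* Does a read of `size` bytes starting at fp + off stay inside [fp - 512, fp)? */
static bool toy_stack_access_ok(int off, uint32_t size)
{
    return off < 0 && off >= -TOY_STACK_SIZE && (int64_t)off + size <= 0;
}

/* Does an access of `size` bytes at offset `off` stay inside a map value? */
static bool toy_value_access_ok(uint32_t off, uint32_t size, uint32_t value_size)
{
    return (uint64_t)off + size <= value_size;
}
```

With key_size 8, toy_stack_access_ok(-4, 8) is false because the last four bytes would lie above fp, while toy_stack_access_ok(-8, 8) is true. With value_size 1, the 4-byte store from the example corresponds to toy_value_access_ok(0, 4, 1), which is false.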

Currently, the following values are supported for map_type:

enum bpf_map_type {
                      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
                      BPF_MAP_TYPE_HASH,
                      BPF_MAP_TYPE_ARRAY,
                      BPF_MAP_TYPE_PROG_ARRAY,
                      BPF_MAP_TYPE_PERF_EVENT_ARRAY,
                      BPF_MAP_TYPE_PERCPU_HASH,
                      BPF_MAP_TYPE_PERCPU_ARRAY,
                      BPF_MAP_TYPE_STACK_TRACE,
                      BPF_MAP_TYPE_CGROUP_ARRAY,
                      BPF_MAP_TYPE_LRU_HASH,
                      BPF_MAP_TYPE_LRU_PERCPU_HASH,
                      BPF_MAP_TYPE_LPM_TRIE,
                      BPF_MAP_TYPE_ARRAY_OF_MAPS,
                      BPF_MAP_TYPE_HASH_OF_MAPS,
                      BPF_MAP_TYPE_DEVMAP,
                      BPF_MAP_TYPE_SOCKMAP,
                      BPF_MAP_TYPE_CPUMAP,
                      BPF_MAP_TYPE_XSKMAP,
                      BPF_MAP_TYPE_SOCKHASH,
                      BPF_MAP_TYPE_CGROUP_STORAGE,
                      BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
                      BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
                      BPF_MAP_TYPE_QUEUE,
                      BPF_MAP_TYPE_STACK,
                      /* See /usr/include/linux/bpf.h for the full list. */
                  };
  • map_type selects one of the map implementations available in the kernel. For all map types, eBPF programs access maps with the same bpf_map_lookup_elem() and bpf_map_update_elem() helper functions.

BPF_MAP_LOOKUP_ELEM

The BPF_MAP_LOOKUP_ELEM command looks up an element by key in the map referred to by fd.

int bpf_lookup_elem(int fd, const void* key, void* value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
    };

    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

If an element is found, the operation returns 0 and stores the element's value in value, which must point to a buffer of value_size bytes.

If no element is found, the operation returns -1 and sets errno to ENOENT.

BPF_MAP_UPDATE_ELEM

The BPF_MAP_UPDATE_ELEM command creates or updates an element with the given key/value in the map referred to by fd.

int bpf_update_elem(int fd, const void* key, const void* value, uint64_t flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
        .flags = flags,
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

The flags argument should be specified as one of the following:

  • BPF_ANY: Create a new element or update an existing one.
  • BPF_NOEXIST: Create a new element only if it does not already exist.
  • BPF_EXIST: Update an existing element.

On success, the operation returns 0. On error, -1 is returned and errno is set to EINVAL, EPERM, ENOMEM, or E2BIG.

  • E2BIG indicates that the number of elements in the map has reached the max_entries limit specified at map creation time.
  • EEXIST indicates that flags specifies BPF_NOEXIST but an element with key already exists.
  • ENOENT indicates that flags specifies BPF_EXIST but no element with key exists.
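
To make the flag semantics concrete, here is a small decision function that mirrors the rules listed above. It is a toy model written for this article, not kernel code; toy_update_check is a hypothetical name, and the order in which the checks are made is my own choice.

```c
#include <errno.h>
#include <linux/bpf.h>   /* BPF_ANY, BPF_NOEXIST, BPF_EXIST */
#include <stdbool.h>
#include <stdint.h>

/* Returns 0 if the update would go ahead, otherwise the errno value
 * described in the text. `used` is the current number of elements. */
static int toy_update_check(bool key_present, uint32_t used,
                            uint32_t max_entries, uint64_t flags)
{
    if (flags != BPF_ANY && flags != BPF_NOEXIST && flags != BPF_EXIST)
        return EINVAL;                 /* unknown flag value */
    if (flags == BPF_NOEXIST && key_present)
        return EEXIST;                 /* must not exist, but does */
    if (flags == BPF_EXIST && !key_present)
        return ENOENT;                 /* must exist, but does not */
    if (!key_present && used >= max_entries)
        return E2BIG;                  /* no room for a new element */
    return 0;
}
```

For instance, inserting a new key with BPF_ANY into a full map yields E2BIG, while overwriting an existing key in the same full map is still allowed.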

BPF_MAP_DELETE_ELEM

The BPF_MAP_DELETE_ELEM command deletes the element whose key is key from the map referred to by fd.

int bpf_delete_elem(int fd, const void* key)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
    };

    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}

On success, 0 is returned. If the element is not found, -1 is returned and errno is set to ENOENT.

BPF_MAP_GET_NEXT_KEY

The BPF_MAP_GET_NEXT_KEY command looks up an element by key in the map referred to by fd and sets the next_key pointer to the key of the next element.

int bpf_get_next_key(int fd, const void* key, void* next_key)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .next_key = ptr_to_u64(next_key),
    };

    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

If key is found, the operation returns 0 and sets the next_key pointer to the key of the next element. If key is not found, the operation returns 0 and sets the next_key pointer to the key of the first element in the map. If key is the last element, -1 is returned and errno is set to ENOENT. Other possible errno values are ENOMEM, EFAULT, EPERM, and EINVAL. This method can be used to iterate over all elements in the map.
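
The iteration described above can be sketched as a loop that feeds each returned key back in as the next starting point. This is only a sketch under assumptions: count_u32_keys is a hypothetical helper, it assumes 4-byte keys, and it starts the walk with a NULL key, which to my knowledge returns the first key on Linux 4.12 and later.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count the keys of a map with 4-byte keys.
 * Returns the number of keys, or -1 on error (errno is set). */
static int count_u32_keys(int map_fd)
{
    uint32_t key = 0, next_key = 0;
    int have_key = 0, count = 0;

    for (;;) {
        union bpf_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        /* NULL key on the first round: ask for the first key (assumption). */
        attr.key = have_key ? (uint64_t)(uintptr_t)&key : 0;
        attr.next_key = (uint64_t)(uintptr_t)&next_key;

        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) < 0)
            return errno == ENOENT ? count : -1;  /* ENOENT: walk finished */

        key = next_key;   /* continue from the key we just received */
        have_key = 1;
        count++;
    }
}
```

A real dump routine would call BPF_MAP_LOOKUP_ELEM on each key inside the loop; elements inserted or deleted concurrently may or may not be seen by such a walk.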

close(map_fd)

Deletes the map referred to by the file descriptor map_fd. When the user process that created a map exits, all of its maps are deleted automatically.

Types of eBPF mappings

The following mapping types are supported

BPF_MAP_TYPE_HASH

Hash-table maps have the following characteristics:

  • Maps are created and destroyed by user-space programs. Both user-space programs and eBPF programs can perform lookup, update, and delete operations.
  • The kernel takes care of allocating and freeing key/value pairs.
  • When the max_entries limit is reached, the map_update_elem() helper fails to insert new elements (this ensures that eBPF programs cannot exhaust memory).
  • map_update_elem() replaces existing elements atomically.

Hash-table maps are optimized for lookup speed.

BPF_MAP_TYPE_ARRAY

Array maps have the following characteristics:

  • Optimized for the fastest possible lookup. In the future, the verifier or JIT compiler may recognize lookup() operations that use a constant key and optimize them into a constant pointer. Since pointers and value_size are constant for the lifetime of the eBPF program, it is also possible to optimize a non-constant key into direct pointer arithmetic (similar to indexing into a C array). In other words, array_map_lookup_elem() may be inlined by the verifier or JIT compiler while preserving concurrent access to the map from user space.
  • All array elements are pre-allocated and zero-initialized at initialization time.
  • The key is the array index and must be exactly 4 bytes.
  • map_delete_elem() fails with the error EINVAL, because elements of an array cannot be deleted.
  • map_update_elem() replaces elements in a non-atomic fashion. For atomic updates, a hash-table map should be used instead. There is, however, one special case that also works with arrays: the atomic built-in __sync_fetch_and_add() can be used on 32-bit and 64-bit atomic counters. For example, if the value represents a single counter, it can be applied to the whole value; if a structure contains multiple counters, it can be applied to individual counters. This is quite useful for aggregating and accounting events.
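
The counter case from the last bullet can be sketched in plain C. The code below runs in user space purely to illustrate the built-in; inside an eBPF program the same calls would be applied to the pointer returned by bpf_map_lookup_elem(). The struct and function names are hypothetical.

```c
#include <stdint.h>

/* A value holding several counters, as an array-map element might. */
struct toy_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Account one event of `len` bytes; each counter is bumped atomically
 * on its own, without replacing the whole value. */
static void toy_account(struct toy_stats *s, uint64_t len)
{
    __sync_fetch_and_add(&s->packets, 1);
    __sync_fetch_and_add(&s->bytes, len);
}
```

Note that the two counters are updated by two separate atomic operations, so a concurrent reader may briefly observe one counter updated and the other not yet.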

Array maps have the following uses:

  • As "global" eBPF variables: an array with a single element whose key is 0 and whose value is a collection of global variables. eBPF programs can use these variables to keep state between events.
  • Aggregating tracing events into a fixed set of buckets.
  • Accounting of network events, such as the number and size of packets.

BPF_MAP_TYPE_PROG_ARRAY

A program-array map is a special kind of array map whose values contain only file descriptors that refer to other eBPF programs. Therefore, both key_size and value_size must be exactly four bytes (the array index is 4 bytes and the file descriptor is 4 bytes). This map is used in conjunction with the bpf_tail_call() helper.

This means that an eBPF program with a program-array map can call, from the kernel side, void bpf_tail_call(void *context, void *prog_map, unsigned int index); and thereby replace its own program flow with that of the program at the given slot of the program array. The program array can be regarded as a jump table to other eBPF programs. The invoked program reuses the same stack. Once a jump into the new program has been performed, execution never returns to the old program.

If no eBPF program is found at the given index of the program array (because the slot holds no valid program file descriptor, the index is out of bounds, or the limit of 32 nested calls has been reached), execution continues with the current eBPF program. This part (the code following the jump instruction) can be used for default error handling.

Program-array maps are useful in tracing or networking, for example to handle individual system calls or protocols in their own subprograms (the original eBPF program acts as a dispatcher that calls the matching subprogram for each case). This approach can bring performance benefits, and also makes it possible to overcome the instruction limit of a single eBPF program. In dynamic environments, a user-space daemon might atomically replace individual subprograms with newer versions at run time to alter the overall program behavior, for instance when global policies change.

Loading eBPF programs

The BPF_PROG_LOAD command loads an eBPF program into the kernel and returns a file descriptor associated with the eBPF program.

char bpf_log_buf[LOG_BUF_SIZE];

int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn* insns, int insn_cnt, const char* license)
{
    union bpf_attr attr = {
        .prog_type = type,
        .insns = ptr_to_u64(insns),
        .insn_cnt = insn_cnt,
        .license = ptr_to_u64(license),
        .log_buf = ptr_to_u64(bpf_log_buf),
        .log_size = LOG_BUF_SIZE,
        .log_level = 1,
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}

prog_type is one of the following available program types:

enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid program type */
    BPF_PROG_TYPE_SOCKET_FILTER,
    BPF_PROG_TYPE_KPROBE,
    BPF_PROG_TYPE_SCHED_CLS,
    BPF_PROG_TYPE_SCHED_ACT,
    BPF_PROG_TYPE_TRACEPOINT,
    BPF_PROG_TYPE_XDP,
    BPF_PROG_TYPE_PERF_EVENT,
    BPF_PROG_TYPE_CGROUP_SKB,
    BPF_PROG_TYPE_CGROUP_SOCK,
    BPF_PROG_TYPE_LWT_IN,
    BPF_PROG_TYPE_LWT_OUT,
    BPF_PROG_TYPE_LWT_XMIT,
    BPF_PROG_TYPE_SOCK_OPS,
    BPF_PROG_TYPE_SK_SKB,
    BPF_PROG_TYPE_CGROUP_DEVICE,
    BPF_PROG_TYPE_SK_MSG,
    BPF_PROG_TYPE_RAW_TRACEPOINT,
    BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
    BPF_PROG_TYPE_LWT_SEG6LOCAL,
    BPF_PROG_TYPE_LIRC_MODE2,
    BPF_PROG_TYPE_SK_REUSEPORT,
    BPF_PROG_TYPE_FLOW_DISSECTOR,
    /* See /usr/include/linux/bpf.h for the full list. */
};

The details of the eBPF program types are covered later. The remaining fields of bpf_attr are set as follows:

  • insns is an array of struct bpf_insn instructions.
  • insn_cnt is the number of instructions in insns.
  • license is a license string, which must be GPL compatible in order to call helper functions marked gpl_only.
  • log_buf is a pointer to a caller-allocated buffer in which the in-kernel verifier can store its verification log. The log is a multi-line string, and its purpose is to let the program author understand why the verifier considers the program unsafe (comparable to a compiler's output). The output format may change as the verifier evolves.
  • log_size is the size of the buffer pointed to by log_buf. If the buffer is not large enough to hold all verifier messages, -1 is returned and errno is set to ENOSPC.
  • log_level is the verbosity level of the verifier log. A value of 0 means that the verifier will not provide a log; in that case, log_buf must be a null pointer and log_size must be 0.
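
The rule that the three log fields must agree with each other is easy to get wrong, so here is a minimal sketch of a helper that keeps them consistent. toy_set_verifier_log is a hypothetical name used only for this illustration; it fills in fields of an attr that is later passed to BPF_PROG_LOAD.

```c
#include <linux/bpf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Configure the verifier log fields of attr consistently:
 * either all three are set, or all three are zero. */
static void toy_set_verifier_log(union bpf_attr *attr, char *buf,
                                 uint32_t size, uint32_t level)
{
    if (level == 0 || buf == NULL || size == 0) {
        /* Logging disabled: null buffer and zero size go together. */
        attr->log_level = 0;
        attr->log_buf   = 0;
        attr->log_size  = 0;
    } else {
        attr->log_level = level;
        attr->log_buf   = (uint64_t)(uintptr_t)buf;
        attr->log_size  = size;
    }
}
```

If a load fails with ENOSPC, one reasonable reaction is to retry with a larger buffer, since the log is often the only explanation of why the verifier rejected the program.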

Calling close() on the returned file descriptor unloads the eBPF program.

Maps are accessible from eBPF programs and are used to exchange data between eBPF programs, and between eBPF programs and user-space programs. For example, eBPF programs can process various events (such as kprobes or packets) and store their data in a map, and user-space programs can then fetch the data from the map. Conversely, user-space programs can use a map as a configuration mechanism, populating it with values that the eBPF program checks, so that the program changes its behavior on the fly according to those values.

eBPF program types

The eBPF program type determines which kernel helper functions the program may call. The program type also determines the format of the program input (context), struct bpf_context, which is the data passed to the eBPF program as its first argument.

For example, a tracing program does not have exactly the same set of helper functions as a socket filter program (though they may have some helpers in common). Similarly, the input (context) of a tracing program is a set of register values, while for a socket filter it is a network packet.

The set of functions available to eBPF programs of a given type may grow in the future.

The following program types are supported

  • BPF_PROG_TYPE_SOCKET_FILTER. Currently, the following set of functions is available to BPF_PROG_TYPE_SOCKET_FILTER:
    • bpf_map_lookup_elem(map_fd, void *key): Look up key in map_fd.
    • bpf_map_update_elem(map_fd, void *key, void *value): Update key or value.
    • bpf_map_delete_elem(map_fd, void *key): Delete key from map_fd.
    • The bpf_context argument is a pointer to a struct __sk_buff (network packet buffer).
  • BPF_PROG_TYPE_KPROBE
  • BPF_PROG_TYPE_SCHED_CLS
  • BPF_PROG_TYPE_SCHED_ACT

Events

Once a program is loaded, it can be attached to an event. Different kernel subsystems have different ways of doing so.

Since Linux 3.19, the following call attaches the program prog_fd to the socket sockfd, which was created by an earlier call to socket():

setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));

Since Linux 4.1, the following call attaches the eBPF program referred to by prog_fd to a perf event file descriptor, event_fd, which was created by an earlier call to perf_event_open():

ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
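
For the socket case, the attach call can be wrapped so that failures are reported instead of silently ignored. This is only a sketch: attach_socket_filter is a hypothetical helper, and the fallback value 50 for SO_ATTACH_BPF is my assumption for architectures that use the generic socket option numbers.

```c
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50  /* assumed value from the generic socket headers */
#endif

/* Attach a loaded eBPF program to a socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
static int attach_socket_filter(int sockfd, int prog_fd)
{
    if (setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) < 0) {
        perror("setsockopt(SO_ATTACH_BPF)");
        return -1;
    }
    return 0;
}
```

Here prog_fd is expected to be a descriptor returned by BPF_PROG_LOAD for a BPF_PROG_TYPE_SOCKET_FILTER program; passing anything else makes the call fail.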

Return value

For a successful call, the return value depends on the operation

  • BPF_MAP_CREATE: Returns the file descriptor associated with the eBPF mapping
  • BPF_PROG_LOAD: Returns the file descriptor associated with the eBPF program
  • Other commands: 0

If an error occurs, -1 is returned, and errno is set to the cause of the error.

Other related notes

Before Linux 4.4, all bpf() commands required the caller to have the CAP_SYS_ADMIN capability. Since Linux 4.4, unprivileged users may create restricted programs of type BPF_PROG_TYPE_SOCKET_FILTER and the associated maps. However, they cannot store kernel pointers in the maps, and they are currently limited to the following helper functions:

  • get_random()
  • get_smp_processor_id()
  • tail_call()
  • ktime_get_ns()

Unprivileged access can be blocked by writing 1 to /proc/sys/kernel/unprivileged_bpf_disabled.

eBPF objects (maps and programs) can be shared between processes. For example, after fork(), the child process inherits the same file descriptors referring to the eBPF objects. In addition, file descriptors referring to eBPF objects can be passed over UNIX domain sockets, and they can be duplicated in the usual way with dup(2) and similar calls. An eBPF object is destroyed only after all file descriptors referring to it have been closed.

eBPF programs can be written in a restricted C dialect and then compiled into eBPF bytecode. Many features are removed from this restricted C, such as loops, global variables, variadic functions, floating-point numbers, and passing structures as function arguments. Some example eBPF programs can be found in the samples/bpf/*_kern.c files in the kernel source tree.

For better performance, the kernel contains a just-in-time (JIT) compiler that translates eBPF bytecode into native machine instructions. In kernels before Linux 4.15, the JIT compiler is disabled by default, but its behavior can be controlled by writing one of the following integer strings to /proc/sys/net/core/bpf_jit_enable:

  • 0: Disable the JIT compiler (default).
  • 1: Normal compilation.
  • 2: Debug mode. The generated instructions are dumped to the kernel log in hexadecimal. They can then be disassembled with the program tools/net/bpf_jit_disasm.c from the kernel source tree.

Since Linux 4.15, the kernel can be configured with the CONFIG_BPF_JIT_ALWAYS_ON option. In that case, the JIT compiler is always enabled, and bpf_jit_enable is initialized to 1 and cannot be changed. (This kernel configuration option was added to mitigate potential attacks against the BPF interpreter.)
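
A quick way to see which of these settings is active on a given machine is to read the file back. The helper below is a small sketch written for this article (read_sysctl_int is a hypothetical name); it returns -1 when the file is missing or unreadable, which can happen inside containers or on kernels built without the JIT.

```c
#include <stdio.h>

/* Read a single integer from a /proc/sys file.
 * Returns the value, or -1 if the file cannot be read or parsed. */
static int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_sysctl_int("/proc/sys/net/core/bpf_jit_enable") then yields 0, 1 or 2 as described above, or -1 if the setting is not available.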

The eBPF JIT compiler is currently available for the following architectures:

  • x86-64 (since Linux 3.18; cBPF since Linux 3.0);
  • ARM32 (since Linux 3.18; cBPF since Linux 3.4);
  • SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
  • ARM-64 (since Linux 3.18);
  • s390 (since Linux 4.1; cBPF since Linux 3.7);
  • PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
  • SPARC 64 (since Linux 4.12);
  • x86-32 (since Linux 4.18);
  • MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
  • riscv (since Linux 5.1).

Code samples

To show the usage of BPF as directly as possible, I will neither include bpf_helpers.h from the kernel source tree nor use a loader.

Using an array map

//gcc ./bpf.c -o bpf
#include <stdio.h>
#include <stdlib.h>         //for exit()
#include <stdint.h>         //for uint64_t and other standard types
#include <errno.h>          //for error handling
#include <unistd.h>         //for syscall()
#include <linux/bpf.h>      //located at /usr/include/linux/bpf.h; constants and structure definitions for the bpf system call
#include <sys/syscall.h>    //for __NR_bpf

//pointer-to-integer conversion; going through uintptr_t avoids warnings
#define ptr_to_u64(x) ((uint64_t)(uintptr_t)(x))

//wrapper around the system call; __NR_bpf is the system call number of bpf, and every BPF-related operation talks to the kernel through it
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

//create a map; parameters: map type, key size in bytes, value size in bytes, maximum number of entries
int bpf_create_map(enum bpf_map_type map_type, unsigned int key_size, unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr = {    //fill in the object that attr points to
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); //make the system call
}

//update a key/value pair in the map
int bpf_update_elem(int fd, const void* key, const void* value, uint64_t flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
        .flags = flags,
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

//look up the value for the key that key points to, and write it to the memory that value points to
int bpf_lookup_elem(int fd, const void* key, void* value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
    };

    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

int main(void){
    //first create an array map; keys and values are both 4 bytes, at most 0x100 entries
    int map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, 4, 0x100);
    if(map_fd<0){
        perror("BPF create map error");
        exit(-1);
    }
    printf("BPF_map_fd: %d\n", map_fd);

    //fill the array map following the rule key -> key+1
    for(int idx=0; idx<0x20; idx+=1){
        int value = idx+1;
        //remember: the elements of an array map are pre-allocated, already exist and cannot be deleted,
        //so the flag must be either BPF_ANY or BPF_EXIST, meaning "update an existing value"
        if(bpf_update_elem(map_fd, &idx, &value, BPF_EXIST)<0){
            perror("BPF update error");
            exit(-1);
        }
    }

    //read the key
    int key;
    if(scanf("%d", &key)!=1){
        fprintf(stderr, "invalid key\n");
        exit(-1);
    }

    //try to look up the corresponding value in the array map
    int value;
    if(bpf_lookup_elem(map_fd, &key, &value)<0){
        perror("BPF lookup error");
        exit(-1);
    }
    printf("key: %d => value: %d\n", key, value);

    return 0;
}

Using a hash map

//gcc ./bpf.c -o bpf
#include <stdio.h>
#include <stdlib.h>         //for exit()
#include <stdint.h>         //for uint64_t and other standard types
#include <errno.h>          //for error handling
#include <unistd.h>         //for syscall()
#include <linux/bpf.h>      //located at /usr/include/linux/bpf.h; constants and structure definitions for the bpf system call
#include <sys/syscall.h>    //for __NR_bpf

//pointer-to-integer conversion; going through uintptr_t avoids warnings
#define ptr_to_u64(x) ((uint64_t)(uintptr_t)(x))

//wrapper around the system call; __NR_bpf is the system call number of bpf, and every BPF-related operation talks to the kernel through it
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

//create a map; parameters: map type, key size in bytes, value size in bytes, maximum number of entries
int bpf_create_map(enum bpf_map_type map_type, unsigned int key_size, unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr = {    //fill in the object that attr points to
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); //make the system call
}

//update a key/value pair in the map
int bpf_update_elem(int fd, const void* key, const void* value, uint64_t flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
        .flags = flags,
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

//look up the value for the key that key points to, and write it to the memory that value points to
int bpf_lookup_elem(int fd, const void* key, void* value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
    };

    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

//string table
char *strtab[] = {
    "The",
    "Dog",
    "DDDDog"
};

int main(void){
    //create a hash map; the key is a 4-byte int and the value is a char* pointer,
    //so the sizes are sizeof(int) and sizeof(char*); at most 0x100 entries
    int map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(char*), 0x100);
    if(map_fd<0){
        perror("BPF create map error");
        exit(-1);
    }
    printf("BPF_map_fd: %d\n", map_fd);

    //initialize the hash map from strtab
    for(int idx=0; idx<3; idx+=1){
        char *value = strtab[idx];
        //elements of a hash map do not exist beforehand, so BPF_NOEXIST or BPF_ANY can be used
        if(bpf_update_elem(map_fd, &idx, &value, BPF_NOEXIST)<0){
            perror("BPF update error");
            exit(-1);
        }
    }

    //read the key
    int key;
    if(scanf("%d", &key)!=1){
        fprintf(stderr, "invalid key\n");
        exit(-1);
    }

    //look up the corresponding value and treat it as a char*
    //(the stored pointer is only meaningful inside this process)
    char *value;
    if(bpf_lookup_elem(map_fd, &key, &value)<0){
        perror("BPF lookup error");
        exit(-1);
    }
    printf("key: %d => value: %s\n", key, value);

    return 0;
}

Loading a BPF program

Loading a BPF program requires knowing how to write BPF assembly. Let's set BPF assembly aside for now, use a fixed piece of assembled code directly, and simply load it into the kernel.

//gcc ./bpf.c -o bpf
#include <stdio.h>
#include <stdlib.h>         // for exit()
#include <stdint.h>         // for uint64_t and other standard types
#include <errno.h>          // for error handling
#include <unistd.h>         // for syscall()
#include <linux/bpf.h>      // located at /usr/include/linux/bpf.h; contains constants and structure definitions for the BPF system call
#include <sys/syscall.h>    // for __NR_bpf

// Type conversion to reduce warnings; optional
#define ptr_to_u64(x) ((uint64_t)x)

// Wrapper for the system call; __NR_bpf is the system call number of bpf, and every BPF operation talks to the kernel through this system call
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

// Buffer that holds the log output of the BPF verifier
#define LOG_BUF_SIZE 0x1000
char bpf_log_buf[LOG_BUF_SIZE];

// Load a sequence of BPF instructions into the kernel through the system call
int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn* insns, int insn_cnt, const char* license)
{
    union bpf_attr attr = {
        .prog_type = type,        // program type
        .insns = ptr_to_u64(insns),    // pointer to the instruction array
        .insn_cnt = insn_cnt,    // number of instructions
        .license = ptr_to_u64(license),    // pointer to the license string
        .log_buf = ptr_to_u64(bpf_log_buf),    // log output buffer
        .log_size = LOG_BUF_SIZE,    // size of the log buffer
        .log_level = 2,    // log level
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}

// A BPF program is simply an array of bpf_insn; one struct bpf_insn represents one BPF instruction
struct bpf_insn bpf_prog[] = {
    { 0xb7, 0, 0, 0, 0x2 }, // initialize one struct bpf_insn; meaning: mov r0, 0x2;
    { 0x95, 0, 0, 0, 0x0 }, // initialize one struct bpf_insn; meaning: exit;
};

int main(void){
    // Load a BPF program
    int prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, bpf_prog, sizeof(bpf_prog)/sizeof(bpf_prog[0]), "GPL");
    if(prog_fd<0){
        perror("BPF load prog");
        exit(-1);
    }
    printf("prog_fd: %d\n", prog_fd);
    printf("%s\n", bpf_log_buf);    // print the verifier log
}

Run result

Classic BPF assembly

Linux Socket Filtering aka Berkeley Packet Filter (BPF) — The Linux Kernel documentation

The original BPF is also called classic BPF (cBPF). The relationship between cBPF and eBPF is similar to that between i386 and amd64. cBPF is used mainly for socket filtering. The tool tools/bpf/bpf_asm in the kernel source tree can be used to write cBPF programs.

The basic elements of cBPF architecture are as follows

Element  Description
A        32-bit wide accumulator
X        32-bit wide X register
M[]      16 x 32-bit wide miscellaneous registers, also known as the scratch memory store, addressable in the range 0~15; similar to a small memory int32_t M[16];

A cBPF instruction is 64 bits (8 bytes) wide and is defined in the header file <linux/filter.h> as shown below. A program is assembled into an array of 4-tuples, each containing the values code, jt, jf and k. jt and jf are the jump offsets taken when the condition tested by code is true or false, and k is a generic value whose meaning depends on code.

struct sock_filter {    /* Filter block */
        __u16   code;   /* 16-bit wide opcode */
        __u8    jt;     /* 8-bit wide jump offset if the condition is true */
        __u8    jf;     /* 8-bit wide jump offset if the condition is false */
        __u32   k;      /* Miscellaneous argument */
};

For socket filtering, pass the pointer to the struct sock_filter array to the kernel via setsockopt(2). Example:

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
/* ... */

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
        { 0x28,  0,  0, 0x0000000c },
        { 0x15,  0,  8, 0x000086dd },
        { 0x30,  0,  0, 0x00000014 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0, 17, 0x00000011 },
        { 0x28,  0,  0, 0x00000036 },
        { 0x15, 14,  0, 0x00000016 },
        { 0x28,  0,  0, 0x00000038 },
        { 0x15, 12, 13, 0x00000016 },
        { 0x15,  0, 12, 0x00000800 },
        { 0x30,  0,  0, 0x00000017 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0,  8, 0x00000011 },
        { 0x28,  0,  0, 0x00000014 },
        { 0x45,  6,  0, 0x00001fff },
        { 0xb1,  0,  0, 0x0000000e },
        { 0x48,  0,  0, 0x0000000e },
        { 0x15,  2,  0, 0x00000016 },
        { 0x48,  0,  0, 0x00000010 },
        { 0x15,  0,  1, 0x00000016 },
        { 0x06,  0,  0, 0x0000ffff },
        { 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
        .len = ARRAY_SIZE(code),
        .filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));    // create the socket
if (sock < 0)
        /* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); // attach the BPF program to the socket
if (ret < 0)
        /* ... bail out ... */

/* ... */
close(sock);

Due to limited performance, cBPF later developed into eBPF, with new instructions and architecture. The original BPF instructions will be automatically translated into new eBPF instructions.

eBPF virtual machine

The eBPF virtual machine is a RISC register machine. It has 11 64-bit registers, a program counter (PC), and a fixed-size 512-byte stack. Ten registers (r0~r9) are general-purpose and can be read and written, one (r10) is a read-only stack pointer (SP), and the program counter is implicit: we can only jump to a fixed offset relative to the PC. The registers are always 64 bits wide (even on a 32-bit physical machine) and support 32-bit subregister access (writing a 32-bit subregister automatically sets the upper 32 bits of the register to 0).

  • r0: Holds the return value of function calls and the exit value of the current program
  • r1~r5: Function call arguments; when the program starts running, r1 contains a pointer to the context argument
  • r6~r9: Preserved across kernel function calls
  • r10: Read-only stack pointer pointing to the 512-byte stack

The program type (prog_type) provided when loading a BPF program determines which subset of functions in the kernel can be called, and also determines the context parameter provided through r1 when the program is started. The meaning of the return value saved in r0 is also determined by the program type.

For both eBPF-to-eBPF and eBPF-to-kernel calls, each function call takes at most 5 arguments, which are passed in registers r1~r5. When passing arguments, registers r1~r5 may hold only constants or pointers to the stack, not pointers to arbitrary memory. All memory accesses must first load the data onto the eBPF stack before it can be used. These restrictions simplify the memory model and help the eBPF verifier perform its correctness checks.

BPF can call the kernel helper functions provided by the core kernel (excluding extensions provided by modules), much like system calls. These helper functions are defined in the kernel through the BPF_CALL_* macros. The bpf.h file provides declarations of all kernel helper functions that BPF can access.

Take bpf_trace_printk as an example. This function is defined in the kernel through BPF_CALL_5, followed by 5 pairs of type and parameter name. Defining the parameter types is important for eBPF, because every time an eBPF program is loaded, the eBPF verifier ensures that the data types held in the registers match the parameter types of the called function.

BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, u64, arg2, u64, arg3)
{
    ...
}

This design keeps the virtual machine instructions as close as possible to native instruction sets (x86, ARM), so that JIT-compiled code can be simpler and more efficient, and all registers map one-to-one to hardware registers. For example, the x86_64 JIT compiler can map them as follows:

R0 - rax
R1 - rdi
R2 - rsi
R3 - rdx
R4 - rcx
R5 - r8
R6 - rbx
R7 - r13
R8 - r14
R9 - r15
R10 - rbp

eBPF instruction encoding

Each eBPF instruction is a fixed 8 bytes. There are about 100 instructions, divided into 8 classes. The virtual machine supports reading and writing 1 to 8 bytes from general memory (maps, the stack, contexts such as packets, ...), forward and backward conditional and unconditional jumps, arithmetic and logic operations (ALU instructions), and function calls.

An eBPF program is a sequence of 64-bit instructions. All eBPF instructions have the same basic format:

  • 8bit opcode
  • 4bit destination register
  • 4bit source register
  • 16bit offset
  • 32bit immediate
msb (most significant bit)                 lsb (least significant bit)
+------------------------+----------------+----+----+--------+
|immediate               |offset          |src |dst |opcode  |
+------------------------+----------------+----+----+--------+
|       32               |    16          | 4  | 4  |    8   |

Most instructions do not use every field; unused fields should be set to 0.

The lowest 3 bits of the opcode indicate the instruction class, which groups related opcodes together.

LD/LDX/ST/STX opcodes have the following structure

msb      lsb
+---+--+---+
|mde|sz|cls|
+---+--+---+
| 3 |2 | 3 |

The sz field gives the size of the target memory area, and the mde field is the memory access mode. uBPF supports only the generic MEM access mode.

ALU/ALU64/JMP opcode structure

msb      lsb
+----+-+---+
|op  |s|cls|
+----+-+---+
| 4  |1| 3 |

If s is 0, the source operand is imm; if s is 1, the source operand is src. The op part specifies which ALU or branch operation to perform.

bpf.h uses struct bpf_insn to describe an eBPF instruction, and its definition is consistent with the layout above. An eBPF program can therefore be described as an array of struct bpf_insn.

struct bpf_insn {
    __u8    code;        /* opcode */
    __u8    dst_reg:4;    /* dest register */
    __u8    src_reg:4;    /* source register */
    __s16    off;        /* signed offset */
    __s32    imm;        /* signed immediate constant */
};

ALU instructions: 64-bit

These opcodes operate on 64-bit operands.

opcode mnemonic pseudocode
0x07 add dst, imm dst += imm
0x0f add dst, src dst += src
0x17 sub dst, imm dst -= imm
0x1f sub dst, src dst -= src
0x27 mul dst, imm dst *= imm
0x2f mul dst, src dst *= src
0x37 div dst, imm dst /= imm
0x3f div dst, src dst /= src
0x47 or dst, imm dst |= imm
0x4f or dst, src dst |= src
0x57 and dst, imm dst &= imm
0x5f and dst, src dst &= src
0x67 lsh dst, imm dst <<= imm
0x6f lsh dst, src dst <<= src
0x77 rsh dst, imm dst >>= imm (logical)
0x7f rsh dst, src dst >>= src (logical)
0x87 neg dst dst = -dst
0x97 mod dst, imm dst %= imm
0x9f mod dst, src dst %= src
0xa7 xor dst, imm dst ^= imm
0xaf xor dst, src dst ^= src
0xb7 mov dst, imm dst = imm
0xbf mov dst, src dst = src
0xc7 arsh dst, imm dst >>= imm (arithmetic)
0xcf arsh dst, src dst >>= src (arithmetic)

ALU instructions:32-bit

These opcodes use only the lower 32 bits of their operands and set the upper 32 bits of the destination register to 0.

opcode mnemonic pseudocode
0x04 add32 dst, imm dst += imm
0x0c add32 dst, src dst += src
0x14 sub32 dst, imm dst -= imm
0x1c sub32 dst, src dst -= src
0x24 mul32 dst, imm dst *= imm
0x2c mul32 dst, src dst *= src
0x34 div32 dst, imm dst /= imm
0x3c div32 dst, src dst /= src
0x44 or32 dst, imm dst |= imm
0x4c or32 dst, src dst |= src
0x54 and32 dst, imm dst &= imm
0x5c and32 dst, src dst &= src
0x64 lsh32 dst, imm dst <<= imm
0x6c lsh32 dst, src dst <<= src
0x74 rsh32 dst, imm dst >>= imm (logical)
0x7c rsh32 dst, src dst >>= src (logical)
0x84 neg32 dst dst = -dst
0x94 mod32 dst, imm dst %= imm
0x9c mod32 dst, src dst %= src
0xa4 xor32 dst, imm dst ^= imm
0xac xor32 dst, src dst ^= src
0xb4 mov32 dst, imm dst = imm
0xbc mov32 dst, src dst = src
0xc4 arsh32 dst, imm dst >>= imm (arithmetic)
0xcc arsh32 dst, src dst >>= src (arithmetic)

byte swap instructions

opcode mnemonic pseudocode
0xd4 (imm == 16) le16 dst dst = htole16(dst)
0xd4 (imm == 32) le32 dst dst = htole32(dst)
0xd4 (imm == 64) le64 dst dst = htole64(dst)
0xdc (imm == 16) be16 dst dst = htobe16(dst)
0xdc (imm == 32) be32 dst dst = htobe32(dst)
0xdc (imm == 64) be64 dst dst = htobe64(dst)

memory instructions

opcode mnemonic pseudocode
0x18 lddw dst, imm dst = imm
0x20 ldabsw src, dst, imm See kernel documentation
0x28 ldabsh src, dst, imm
0x30 ldabsb src, dst, imm
0x38 ldabsdw src, dst, imm
0x40 ldindw src, dst, imm
0x48 ldindh src, dst, imm
0x50 ldindb src, dst, imm
0x58 ldinddw src, dst, imm
0x61 ldxw dst, [src+off] dst = *(uint32_t *) (src + off)
0x69 ldxh dst, [src+off] dst = *(uint16_t *) (src + off)
0x71 ldxb dst, [src+off] dst = *(uint8_t *) (src + off)
0x79 ldxdw dst, [src+off] dst = *(uint64_t *) (src + off)
0x62 stw [dst+off], imm *(uint32_t *) (dst + off) = imm
0x6a sth [dst+off], imm *(uint16_t *) (dst + off) = imm
0x72 stb [dst+off], imm *(uint8_t *) (dst + off) = imm
0x7a stdw [dst+off], imm *(uint64_t *) (dst + off) = imm
0x63 stxw [dst+off], src *(uint32_t *) (dst + off) = src
0x6b stxh [dst+off], src *(uint16_t *) (dst + off) = src
0x73 stxb [dst+off], src *(uint8_t *) (dst + off) = src
0x7b stxdw [dst+off], src *(uint64_t *) (dst + off) = src

branch instruction

opcode mnemonic pseudocode
0x05 ja +off PC += off
0x15 jeq dst, imm, +off PC += off if dst == imm
0x1d jeq dst, src, +off PC += off if dst == src
0x25 jgt dst, imm, +off PC += off if dst > imm
0x2d jgt dst, src, +off PC += off if dst > src
0x35 jge dst, imm, +off PC += off if dst >= imm
0x3d jge dst, src, +off PC += off if dst >= src
0xa5 jlt dst, imm, +off PC += off if dst < imm
0xad jlt dst, src, +off PC += off if dst < src
0xb5 jle dst, imm, +off PC += off if dst <= imm
0xbd jle dst, src, +off PC += off if dst <= src
0x45 jset dst, imm, +off PC += off if dst & imm
0x4d jset dst, src, +off PC += off if dst & src
0x55 jne dst, imm, +off PC += off if dst != imm
0x5d jne dst, src, +off PC += off if dst != src
0x65 jsgt dst, imm, +off PC += off if dst > imm (signed)
0x6d jsgt dst, src, +off PC += off if dst > src (signed)
0x75 jsge dst, imm, +off PC += off if dst >= imm (signed)
0x7d jsge dst, src, +off PC += off if dst >= src (signed)
0xc5 jslt dst, imm, +off PC += off if dst < imm (signed)
0xcd jslt dst, src, +off PC += off if dst < src (signed)
0xd5 jsle dst, imm, +off PC += off if dst <= imm (signed)
0xdd jsle dst, src, +off PC += off if dst <= src (signed)
0x85 call imm Function call
0x95 exit return r0

https://github.com/iovisor/bpf-docs/blob/master/eBPF.md

Writing eBPF programs in assembly

According to the above table, we can directly write eBPF bytecode

struct bpf_insn bpf_prog[] = {
    { 0xb7, 0, 0, 0, 0x123 },   // mov r0, 0x123
    { 0xb7, 1, 0, 0, 0x456 },   // mov r1, 0x456
    { 0x0F, 0, 1, 0, 0 },       // add r0, r1
    { 0x95, 0, 0, 0, 0x0 },     // exit 
};

Use the method mentioned in the previous chapter to load the BPF program. The log output by the verifier is as follows, indicating that the program has been accepted.

Writing raw bytecode is very unintuitive. We can wrap the initialization of struct bpf_insn in macros to make it easier to write. If anything is unclear, refer to the instruction encoding above.

First, define the instruction class cls, which indicates which category the instruction belongs to.

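#define BPF_CLASS(code) ((code) & 0x07) // the instruction class is the low 3 bits of the opcode
#define BPF_ALU64    0x07    /* class of ALU instructions that operate on 64-bit operands */
#define    BPF_JMP        0x05  // class of jump instructions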

Next, define the op part of the opcode. This part indicates which opcode it is, that is, what the instruction is supposed to do.

#define BPF_OP(code)    ((code) & 0xf0)  // the operation is the high 4 bits of the opcode
#define BPF_MOV        0xb0    /* move a value into a register */
#define    BPF_ADD        0x00     // addition
#define BPF_EXIT    0x90    /* return from the function */

For the opcodes of ALU and JMP instructions, there is still 1 bit of s that needs to be defined, indicating the source of the operation.

#define BPF_SRC(code)   ((code) & 0x08)    // occupies a single bit, the 4th bit
#define        BPF_K        0x00    // the source operand is an immediate, whose value is in imm
#define        BPF_X        0x08    // the source operand is a register, named by the src field

The next step is to define the registers, using an enumeration to encode r0~r10 as 0~10

enum {
    BPF_REG_0 = 0,
    BPF_REG_1,
    BPF_REG_2,
    BPF_REG_3,
    BPF_REG_4,
    BPF_REG_5,
    BPF_REG_6,
    BPF_REG_7,
    BPF_REG_8,
    BPF_REG_9,
    BPF_REG_10,
    __MAX_BPF_REG,
};

Once the basic elements are in place, they can be combined into macros that represent instructions.

/*
    Assign to a register: mov DST, IMM
    Opcode: BPF_ALU64 | BPF_MOV selects the move operation; BPF_K means the source is the immediate IMM
*/
#define BPF_MOV64_IMM(DST, IMM)                    \
    ((struct bpf_insn) {                    \
        .code  = BPF_ALU64 | BPF_MOV | BPF_K,        \
        .dst_reg = DST,                    \
        .src_reg = 0,                    \
        .off   = 0,                    \
        .imm   = IMM })


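/*
    ALU operation between two registers: OP DST, SRC
    OP can be add, sub, mul, div, ...; DST and SRC name the registers
    Opcode: BPF_ALU64 | BPF_OP(OP) selects the ALU64 operation; BPF_X means the source operand is a register
*/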
#define BPF_ALU64_REG(OP, DST, SRC)                \
    ((struct bpf_insn) {                    \
        .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,    \
        .dst_reg = DST,                    \
        .src_reg = SRC,                    \
        .off   = 0,                    \
        .imm   = 0 })

/*
    Exit instruction: exit
    Opcode: BPF_JMP | BPF_EXIT selects the exit instruction within the jump class
*/
#define BPF_EXIT_INSN()                        \
    ((struct bpf_insn) {                    \
        .code  = BPF_JMP | BPF_EXIT,            \
        .dst_reg = 0,                    \
        .src_reg = 0,                    \
        .off   = 0,                    \
        .imm   = 0 })

Borrowing the above macro definition, we can rewrite this eBPF program without confusing constants, and the effect will be the same as before

    struct bpf_insn bpf_prog[] = {
        BPF_MOV64_IMM(BPF_REG_0, 0x123),                 //{ 0xb7, 0, 0, 0, 0x123 },  mov r0, 0x123
        BPF_MOV64_IMM(BPF_REG_1, 0x456),                 //{ 0xb7, 1, 0, 0, 0x456 },  mov r1, 0x456
        BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),    //{ 0x0F, 0, 1, 0, 0 }, add r0, r1
        BPF_EXIT_INSN()                                  //{ 0x95, 0, 0, 0, 0x0 } exit 
    };

In fact, #include <linux/bpf.h> already contains the definitions of constants such as the instruction opcodes, and samples/bpf/bpf_insn.h in the kernel source tree contains the instruction macros defined above, along with a more comprehensive set. We only need to put this file in the same directory as our source code and then #include "./bpf_insn.h" to use these macros to define eBPF bytecode directly.

Writing eBPF instructions in C

It is still the same program, but this time written in C. Since a stock gcc typically cannot target BPF, clang (LLVM) is used to compile it. -target bpf selects eBPF bytecode as the output. -c compiles only to an object file, because an eBPF program has no entry point and cannot be linked into an executable. Conversion process: C---llvm--->eBPF---JIT--->native instructions

//clang -target bpf -c ./prog.c -o ./prog.o
unsigned long prog(void){
    unsigned long a=0x123;
    unsigned long b=0x456;
    return a+b;
}

The compiled object file is in ELF format, and you can see the resulting bytecode with readelf.

objdump does not support disassembling eBPF, but llvm-objdump can disassemble the bytecode. r10 is the stack pointer, and *(u32 *)(r10-4) = r1 writes a local variable to the stack. The overall structure is similar to the hand-written assembly above.

To execute the eBPF bytecode, first extract the .text section from the ELF object file. You can use llvm-objcopy to do this.

How to extract a specified section from an ELF file: linux - How to extract only the raw contents of an ELF section? - Stack Overflow

After that, write a loader that reads the bytecode from prog.text into a buffer and then issues the bpf system call with the BPF_PROG_LOAD command, injecting the bytecode into the kernel. The loader code is as follows. It is largely the same as before; if anything is unclear, refer to the earlier sections.

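//gcc ./loader.c -o loader
#include <stdio.h>
#include <stdlib.h>         // for exit() and atoi()
#include <stdint.h>         // for uint64_t and other standard types
#include <errno.h>          // for error handling
#include <fcntl.h>          // for open()
#include <unistd.h>         // for read(), close() and syscall()
#include <linux/bpf.h>      // located at /usr/include/linux/bpf.h; contains constants and structure definitions for the BPF system call
#include <sys/syscall.h>    // for __NR_bpf

// Type conversion to reduce warnings; optional
#define ptr_to_u64(x) ((uint64_t)x)

// Wrapper for the system call; __NR_bpf is the system call number of bpf, and every BPF operation talks to the kernel through this system call
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

// Buffer that holds the log output of the BPF verifier
#define LOG_BUF_SIZE 0x1000
char bpf_log_buf[LOG_BUF_SIZE];

// Load a sequence of BPF instructions into the kernel through the system call
int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn* insns, int insn_cnt, const char* license)
{
    union bpf_attr attr = {
        .prog_type = type,        // program type
        .insns = ptr_to_u64(insns),    // pointer to the instruction array
        .insn_cnt = insn_cnt,    // number of instructions
        .license = ptr_to_u64(license),    // pointer to the license string
        .log_buf = ptr_to_u64(bpf_log_buf),    // log output buffer
        .log_size = LOG_BUF_SIZE,    // size of the log buffer
        .log_level = 2,    // log level
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}

// A BPF program is simply an array of bpf_insn; one struct bpf_insn represents one BPF instruction
struct bpf_insn bpf_prog[0x100];

int main(int argc, char **argv){
    // Usage: loader <file containing the bytecode> <bytecode length in bytes>
    if(argc<3){
        fprintf(stderr, "usage: %s <file> <length>\n", argv[0]);
        exit(-1);
    }

    // Read the file contents into the bpf_prog array, without overflowing it
    int text_len = atoi(argv[2]);
    if(text_len<=0 || (size_t)text_len>sizeof(bpf_prog)){
        fprintf(stderr, "bad length\n");
        exit(-1);
    }
    int file = open(argv[1], O_RDONLY);
    if(file<0){
        perror("open prog fail");
        exit(-1);
    }
    if(read(file, (void *)bpf_prog, text_len)<0){
        perror("read prog fail");
        exit(-1);
    }
    close(file);

    // Load the program
    int prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, bpf_prog, text_len/sizeof(bpf_prog[0]), "GPL");
    if(prog_fd<0){
        perror("BPF load prog");
        exit(-1);
    }
    printf("prog_fd: %d\n", prog_fd);
    printf("%s\n", bpf_log_buf);    // print the verifier log
}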

clang compiles the program into 9 instructions of 8 bytes each, 72 bytes in total. Using the command ./loader ./prog.text 72, the execution result is as follows

 

 


Origin blog.csdn.net/yangzex/article/details/131953996