Linux Performance Optimization (9) - Kernel Bypass

1. Linux kernel protocol stack performance bottlenecks

On the x86 architecture, packets are traditionally received via CPU interrupts: the network card driver receives a packet and notifies the CPU with an interrupt, and the CPU then copies the data and hands it to the kernel protocol stack. Under heavy traffic this generates a very large number of interrupts and drives CPU load up.
(1) Thread and process switching caused by hardware interrupts
Hardware interrupt requests preempt lower-priority soft interrupts. Frequently arriving hard and soft interrupts mean frequent thread switching, which brings a series of CPU performance losses: mode switches, context switches, scheduler load, cache misses, synchronization of data shared across per-core caches, and lock contention.
(2) Memory copies
The network card driver runs in kernel mode. After the driver receives a packet, the packet is processed by the kernel protocol stack and then copied into the application buffer in user space. Copying data from kernel space to user space is expensive and can account for more than 50% of total packet-processing time.
(3) CPU drifting on multi-processor platforms
A single packet may be interrupted on CPU0, processed in kernel mode on CPU1, and processed in user mode on CPU2. Processing that spans multiple physical cores causes a large number of CPU cache misses and destroys locality. On NUMA architectures it also causes cross-node memory accesses, which significantly hurts CPU performance.
(4) Cache invalidation
Traditional servers use paged virtual memory with a default page size of 4 KB. On machines with large amounts of memory this produces a huge number of page-table entries, and because TLB space is limited the TLB entries are replaced constantly, resulting in a large number of TLB misses (see the huge-page sketch below).
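
As a hedged illustration of how huge pages reduce TLB pressure (the same idea DPDK relies on), the sketch below maps a single 2 MB huge page with mmap. It assumes huge pages have already been reserved on the system (for example via /proc/sys/vm/nr_hugepages) and uses only standard Linux flags.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Minimal sketch: back a buffer with one 2 MB huge page instead of 4 KB pages,
 * so a single TLB entry covers 2 MB of address space. Assumes huge pages have
 * been reserved beforehand (e.g. echo 64 > /proc/sys/vm/nr_hugepages). */
int main(void)
{
    size_t len = 2 * 1024 * 1024;                 /* one 2 MB huge page */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* no huge pages reserved? */
        return 1;
    }
    memset(buf, 0, len);                          /* touch the page */
    printf("huge-page buffer at %p\n", buf);
    munmap(buf, len);
    return 0;
}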

2. Kernel Bypass

1. Introduction to Kernel Bypass

Kernel bypass is a technique that bypasses the Linux kernel (the TCP/IP protocol stack). Instead of using the kernel's subsystems, it implements the same functionality in its own user-space code and accesses and controls device memory directly from user space, avoiding copying data from the device into the kernel and then from the kernel into user space.
The mainstream kernel-bypass implementations today are DPDK and Solarflare.

2. Advantages of Kernel Bypass

Kernel bypass is designed from the ground up for high performance and low latency, so its biggest advantages are exactly those: higher throughput and lower latency than the kernel path.

3. Disadvantages of Kernel Bypass

The disadvantages of Kernel Bypass technology are as follows:
(1) It changes how the operating system works and is difficult to integrate with the existing operating system.
(2) Because network data no longer passes through the kernel network protocol stack, applications have to re-implement functionality the operating system previously provided.
(3) Because the operating system loses control of the network hardware, the network management and deployment tools it provides are no longer usable.
(4) The security provided by the operating system kernel is lost. This matters especially in container scenarios, where resource abstraction and isolation are mainly provided by the kernel.
(5) One or more CPU cores must be dedicated exclusively to processing network packets.

3. DPDK

1. Introduction to DPDK

DPDK (Data Plane Development Kit) is a data-plane development toolkit originally provided by Intel. It offers library functions and driver support for efficient user-space packet processing on Intel Architecture processors, focusing on high-performance packet processing for network applications.
A DPDK application runs in user space and uses the data-plane libraries DPDK provides to send and receive packets, bypassing the Linux kernel protocol stack. The Linux kernel treats a DPDK application as an ordinary user-mode process; it is compiled, linked and loaded like any ordinary program. After a DPDK program starts there is a single main thread, which then creates worker threads and binds them to designated CPU cores.
DPDK is an open-source development platform and set of interfaces for fast data-plane packet forwarding, supporting the x86, ARM and PowerPC hardware platforms.
Intel began opening up DPDK in 2010, officially releasing the source packages under the BSD open-source license in September 2010, and in April 2014 an independent open-source community was formally established at www.dpdk.org to support developers.
DPDK provides an efficient user-space packet-processing library. Through an environment abstraction layer, a kernel-bypass protocol path, interrupt-free polled packet I/O, optimized memory/buffer/queue management, and NIC-based multi-queue and flow-classification load balancing, it achieves high-performance packet forwarding on x86 processors. Users can develop all kinds of high-speed forwarding applications in Linux user space, and DPDK is also suitable for integration into commercial data-plane acceleration solutions.
DPDK rebinds the NIC driver, separating the control plane and the data plane of packet processing. The driver no longer raises a hard interrupt to notify the CPU when a packet arrives; instead packets bypass the Linux kernel protocol stack and are placed into memory using zero-copy techniques, and the application reads them through the interfaces DPDK provides.
The DPDK packet-processing model saves the CPU time spent on interrupts and memory copies and exposes simple, efficient packet-processing APIs to the application layer, making network application development more convenient. However, because the NIC driver must be rebound, DPDK currently works only with certain NICs, mostly those based on Intel network chips; the list of supported NICs is at https://core.dpdk.org/supported/, the most commonly used being the Intel 82599 (fiber ports) and Intel X540 (copper ports). DPDK can improve packet-processing performance by up to ten times, achieving more than 80 Mpps (million packets per second) of throughput on a single Intel Xeon processor and roughly double that in a dual-processor configuration.
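
To make the polling model concrete, here is a minimal, hedged sketch of a DPDK receive loop using the public rte_eal / rte_ethdev APIs. It assumes port 0 has already been configured and started (rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start), which is omitted for brevity, and it does no real work on the packets.

#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal sketch of DPDK's poll-mode receive path. Assumes port 0 is already
 * configured and started; TX and error handling are omitted. */
int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0) {        /* init EAL: hugepages, lcores, PCI */
        fprintf(stderr, "EAL init failed\n");
        return -1;
    }

    struct rte_mbuf *bufs[BURST_SIZE];
    uint16_t port_id = 0;

    for (;;) {
        /* Poll the NIC RX queue directly from user space: no interrupts,
         * no kernel protocol stack; packets land in hugepage-backed mbufs. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... inspect or forward the packet here ... */
            rte_pktmbuf_free(bufs[i]);         /* return the mbuf to its mempool */
        }
    }
    return 0;
}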

2. DPDK principle

(Figure: DPDK packet-processing principle)

3. DPDK architecture

(Figure: DPDK architecture)
Inside the Linux kernel, DPDK has two modules: KNI and IGB_UIO. In user space, DPDK consists of multiple libraries, mainly the core libraries (Core Libraries), platform-related modules (Platform), the poll-mode NIC driver modules (PMD - Natives & Virtual), the QoS library, and the packet classification and forwarding algorithms (Classify). Users build applications on top of these DPDK libraries.

4. UIO

In the traditional receive path, the NIC first notifies the Linux kernel protocol stack via an interrupt. The kernel protocol stack checks the packet's validity, determines whether its destination is a local socket, and if so copies the packet and delivers it to that user-space socket for processing.
To run the NIC driver (the PMD driver) in user space and achieve kernel bypass, Linux provides the UIO (Userspace I/O) mechanism. With UIO, interrupts are observed through read() and the NIC device is accessed through mmap().
UIO is a user-space I/O technique; it is the foundation that lets DPDK bypass the kernel protocol stack and support a user-space PMD driver. DPDK loads the igb_uio.ko and kni.ko modules into the Linux kernel. IGB_UIO uses UIO to intercept interrupts and reset the interrupt-callback behaviour, so the rest of the kernel protocol-stack processing is bypassed, and during initialization IGB_UIO maps the NIC's hardware registers into user space.
The UIO mechanism exposes a file interface to user space: when a UIO device uioX is registered, the device file /dev/uioX appears. Reading that file blocks until an interrupt arrives (returning the interrupt count), while mmap()ing it maps the device's memory, such as the NIC's registers, into the process.
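
As a hedged sketch of the generic UIO usage pattern (not DPDK's igb_uio specifics), the code below opens a hypothetical /dev/uio0, mmap()s the device's first memory region, and blocks on read() until an interrupt arrives; region N is selected by passing an offset of N times the page size.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Generic UIO usage sketch; /dev/uio0 and the 4 KB region size are placeholders. */
int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    /* Map memory region 0 of the device (e.g. the NIC's BAR0 registers).
     * For region N, the mmap offset is N * getpagesize(). */
    size_t map_len = 4096;
    volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* Block until the device raises an interrupt; the kernel returns the
     * cumulative interrupt count as a 32-bit integer. */
    uint32_t irq_count;
    if (read(fd, &irq_count, sizeof(irq_count)) == sizeof(irq_count))
        printf("interrupt #%u, first register = 0x%x\n", irq_count, regs[0]);

    munmap((void *)regs, map_len);
    close(fd);
    return 0;
}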

5. DPDK features

(1) Polling: avoids the overhead of interrupt handling and the associated context switches during packet processing.
(2) User-mode driver: avoids unnecessary memory copies and system calls, and makes rapid iteration and optimization easier.
(3) Affinity and exclusivity: a specific task can be pinned to a specific core, avoiding frequent thread switching between cores and keeping cache hit rates high (see the affinity sketch after this list).
(4) Reduced memory-access overhead: huge pages (HUGEPAGE) reduce TLB misses, and interleaved access across multiple memory channels increases effective memory bandwidth.
(5) Software tuning: cache-line alignment, data prefetching, and batching of operations on multiple packets.
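
Feature (3) can be illustrated with plain POSIX calls. The sketch below pins the calling thread to core 2 (an arbitrary example core) with pthread_setaffinity_np, which is the same idea DPDK's EAL applies when it binds lcore threads to physical cores.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one CPU core so its cache state stays warm.
 * Core 2 is an arbitrary example; DPDK's EAL does the equivalent per lcore. */
int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);

    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed\n");
        return 1;
    }
    printf("now running only on core %d\n", sched_getcpu());
    return 0;
}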

4. XDP

1. Introduction to XDP

XDP (eXpress Data Path) is not a kernel-bypass scheme; it takes the opposite approach. It is a kernel code-injection technique based on eBPF that can filter or process network packets before they reach the kernel protocol stack. XDP places its packet-processing hook inside the Linux kernel, in front of the network protocol stack, so packets can be handled without going through the full protocol-stack path while the operating system keeps control of the network hardware.
XDP is an SDN-related technology developed in recent years; it is fully integrated into the Linux kernel and still evolving.

2. XDP network data processing flow

The Linux kernel network stack decides what to do with a packet (drop it, forward it, deliver it) according to rules such as those set by the iptables firewall. XDP can instead drop packets as soon as they arrive at the NIC, which makes it well suited to handling high-speed traffic.

3. XDP composition

When a packet arrives at the NIC, the XDP program runs before the kernel network stack allocates an sk_buff to hold the packet contents. It reads the processing rules that the user-space control plane has written into BPF maps and acts on the packet accordingly: drop it, send it back out of the same NIC, forward it to another NIC or to an upper-layer virtual NIC (and from there directly into a container or virtual machine), pass it on to the kernel protocol stack to be delivered to user programs through the normal layer-by-layer processing, or hand it directly to an application through an AF_XDP socket.
XDP driver hook: the mount point of the XDP program in the NIC driver; whenever the NIC receives a packet, the XDP program is executed. The XDP program can parse the packet layer by layer, filter it according to rules, encapsulate or decapsulate it, or modify fields and forward it.
eBPF virtual machine: the XDP program is written by the user in restricted C, compiled by the clang front end into BPF bytecode, and loaded into the kernel to run on the eBPF virtual machine, which JIT-compiles the bytecode into native machine instructions. The eBPF VM supports dynamically loading and unloading XDP programs.
BPF maps: key-value stores that serve as the communication channel between user-space programs and the kernel-space XDP program, similar to the shared memory used in inter-process communication. A user-space program can pre-define rules in a BPF map that the XDP program matches packets against for filtering, and the XDP program can store packet statistics in a map for the user-space program to read.
eBPF program verifier: before the XDP bytecode is loaded into the kernel, the verifier performs safety checks on it, such as whether it contains loops, whether the program exceeds the size limit, whether its memory accesses are out of bounds, and whether it contains unreachable instructions. The code is statically analysed before loading to ensure it cannot crash or corrupt the running kernel.
A minimal XDP program illustrating these components is sketched below.
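
Here is a hedged, minimal example of these pieces working together (assuming libbpf-style BTF map definitions and clang as the compiler): an XDP program that counts every packet in a BPF array map and then lets it continue up the kernel stack.

// count_pass.bpf.c -- compile with: clang -O2 -g -target bpf -c count_pass.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One-slot array map shared with user space (libbpf BTF-defined map). */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int xdp_count_pass(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);
    if (count)
        __sync_fetch_and_add(count, 1);   /* per-packet counter read by user space */

    return XDP_PASS;                      /* hand the packet to the kernel stack */
}

char LICENSE[] SEC("license") = "GPL";

A user-space loader (for example libbpf, or iproute2's ip link set dev eth0 xdp obj count_pass.bpf.o sec xdp) attaches the program to a NIC and periodically reads pkt_count, which is the BPF-map communication pattern described above.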

4. XDP advantages

The advantages of XDP are as follows:
(1) XDP is integrated with the Linux kernel network protocol stack; the kernel keeps control of the network hardware, the security the kernel provides is preserved, and existing network configuration and management tools keep working.
(2) Any NIC with a Linux driver can use XDP; the driver only needs to be updated to add the XDP execution hook. DPDK, by contrast, requires special hardware support.
(3) XDP can selectively reuse kernel network-stack functionality such as the routing table and the TCP stack, accelerating critical traffic while keeping the existing configuration interfaces.
(4) When interacting with programs that use the kernel network stack, packets do not have to be re-injected from user space into kernel space.
(5) It is transparent to applications running on the host.
(6) It supports dynamic reprogramming without service interruption, so XDP programs can be hot-upgraded with no network downtime.
(7) No CPU cores need to be dedicated to packet processing: low traffic means low CPU usage, which is more efficient and saves power.

5. XDP usage scenarios

The usage scenarios of XDP include the following:
(1) DDoS defense
(2) Firewall
(3) Load balancing based on XDP_TX
(4) Network statistics
(5) Complex network sampling
(6) High-speed trading platform

5. Solarflare

1. Introduction to Solarflare

Solarflare NICs support the OpenOnload network acceleration stack. Its kernel-bypass approach is to implement the network protocol stack in user space and use LD_PRELOAD to intercept the target program's network system calls, relying on the EF_VI library for low-level access to the NIC.
Solarflare offers a three-level kernel-bypass solution. Onload requires no code changes: after the NIC driver is installed, the program is simply launched with the onload library, which is easy to use. TCPDirect is faster than Onload but requires code changes. ef_vi skips the protocol stack entirely and reads a specific RX queue of the NIC directly.
With their high performance and low latency, Solarflare products hold roughly 90% of the hardware market for high-frequency stock and futures trading. The Solarflare X2522 Plus low-latency fiber NIC is widely used in high-frequency trading of securities and futures, at a price above 20,000 RMB. Solarflare's latest products are the X2552, X2541 and X2562 ultra-low-latency NICs.

2. Onload

Onload is Solarflare's classic kernel-bypass network protocol stack. It is a transparent bypass that exposes a socket-compatible programming interface: users do not need to modify their code, they simply preload libonload.so before the program starts.
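
To illustrate that transparency, the sketch below is an ordinary POSIX UDP sender with nothing Onload-specific in it; under an assumed launch such as onload ./udp_send (or with libonload.so set in LD_PRELOAD), these same socket and sendto calls would be served by the user-space stack instead of the kernel. The destination 192.0.2.1:12345 is a placeholder.

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Plain POSIX UDP sender. With Onload, preloading libonload.so intercepts
 * socket()/sendto() and handles them in user space; the code is unchanged. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(12345);
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder peer */

    const char msg[] = "hello over UDP";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}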

3. ef_vi

ef_vi is a low-level, layer-2 API that can send and receive raw Ethernet frames but provides no support for upper-layer protocols. Its biggest feature is zero copy: the user pre-allocates receive buffers and hands them to ef_vi; when the NIC receives a packet it writes directly into one of those buffers, the user learns which buffer was filled via the event-queue poll interface and processes the data, and then returns the buffer to ef_vi for reuse.
The ef_vi API is more complex to use and lacks upper-layer protocol support, but it delivers the best performance.

4. TCPDirect

TCPDirect implements an upper-layer protocol stack on top of ef_vi and provides a socket-like API, the "zocket", which lets users read and write TCP/UDP payload data while inheriting ef_vi's zero-copy behaviour. TCPDirect requires huge pages.
