Explore the core technology of network communication: hand-write a TCP/IP user-mode protocol stack and watch performance soar!

1. Introduction to DPDK

DPDK (Data Plane Development Kit) is an open-source data plane development kit that provides a set of C libraries and drivers for rapid development of high-performance data plane applications. DPDK processes network packets in user space, avoiding the performance loss caused by frequent switching between kernel mode and user mode in the traditional path.

DPDK supports a variety of hardware platforms and operating systems and delivers excellent performance in many scenarios. In industries such as cloud computing, telecommunications, finance, and online gaming, DPDK is widely used for high-speed network data processing, virtualized network functions, and SDN.


2. The main purpose of DPDK

The main purpose of DPDK is to provide a fast, efficient, and flexible data plane development framework that helps developers build high-performance network applications with ease. Specifically, DPDK addresses the following issues:

  • 1. Improve packet-processing performance: by processing packets in user space and applying a range of optimizations, DPDK raises packet-processing throughput to millions of packets per second (pps).
  • 2. Support multiple hardware platforms and operating systems: DPDK runs on many common platforms and operating systems and integrates with other open-source projects (such as Open vSwitch and OpenStack).
  • 3. Provide easy-to-use APIs: DPDK offers a set of straightforward C APIs that let developers write high-performance network applications with little friction.
  • 4. Support a wide range of scenarios: thanks to its performance and flexibility, DPDK is widely used in cloud computing, telecommunications, finance, online gaming, and other industries, covering high-speed packet processing, virtualized network functions, SDN, and more.

3. Working environment

DPDK's environment abstraction layer hides the details of the underlying environment from applications and libraries, so it can be ported to any processor. As for operating systems, it supports Linux and FreeBSD.

As for the job market, as the number of users grows, more and more companies are adopting DPDK, and learning it well can be a path into strong Internet companies and even the large vendors. Learning DPDK is different from business-oriented Java development: it places far more weight on your understanding of low-level technology, your grasp of computer fundamentals, and your academic background.

To sum it up in one sentence: it has little to do with business logic, but a great deal to do with low-level technology.

4. Working principle

DPDK uses polling instead of interrupts to process packets. When a packet arrives, the network card driver taken over by DPDK does not notify the CPU through an interrupt; instead, the packet is written directly into memory and handed to the application through the interfaces DPDK provides, saving a great deal of CPU interrupt-handling time and memory-copy time.

How DPDK works (a minimal code sketch follows the list):

  1. Initialization: when the application starts, DPDK initializes the required hardware devices, threads, memory pools and other resources, and sets up an independent runtime environment for each CPU core.
  2. Receiving data: when a packet arrives at the NIC, DPDK buffers it in a ring buffer and makes it visible to the application as newly arrived data.
  3. Packet classification: the application classifies packets according to predefined rules (for traffic monitoring, load balancing, security inspection, etc.) and places them into the corresponding queues.
  4. Packet processing: DPDK takes pending packets out of the queue and, using the user-space driver, performs packet parsing, protocol conversion, business-logic processing and other operations.
  5. Packet transmission: once processing is complete, DPDK repackages the result into a network frame and sends it back to the network through the NIC.
  6. Resource reclamation: when the application no longer needs certain resources, DPDK releases them back to the memory pool or the hardware device for other threads or processes to use.
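
To make the cycle concrete, here is a minimal sketch of steps 2-6 using the standard rte_eth_rx_burst/rte_eth_tx_burst API. EAL, port and mempool initialization (step 1) are omitted for brevity; see DPDK's basicfwd/l2fwd samples for the full setup.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Per-core receive -> process -> transmit cycle (steps 2-6 above).
 * Assumes rte_eal_init() and port/queue setup have already been done. */
static void rx_tx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Step 2: poll the NIC receive ring, no interrupt involved. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Steps 3-4: classify and process each packet in user space,
         * e.g. parse rte_pktmbuf_mtod(bufs[i], struct rte_ether_hdr *). */
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* application-specific handling */
        }

        /* Step 5: hand processed packets back to the NIC transmit ring. */
        uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);

        /* Step 6: mbufs the NIC did not accept go back to the mempool. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```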

4.1 Technical Principle and Architecture

Because NFV relies on software forwarding and software switching, the internal forwarding capability of a single server is the main performance bottleneck of an NFV system. In typical high-speed forwarding NFV applications, packets are received from the NIC and then delivered to virtualized user-mode applications (VNFs) for processing. The whole path passes through CPU interrupt handling, virtualized I/O and address-mapping translation, the virtual switching layer, the network protocol stack, kernel context switches, memory copies and many other time-consuming CPU operations and I/O steps.

The industry typically combines techniques such as eliminating massive numbers of interrupts, bypassing the kernel protocol stack, reducing memory copies, sharing work across CPU cores, and Intel VT to raise the packet-processing performance of the server data plane, but these techniques are hard for ordinary users to master. What is needed is a comprehensive performance-optimization solution that also offers a good environment for development and business integration, and DPDK has become the typical representative of such a solution.

DPDK is an open-source data plane development kit that provides an efficient packet-processing library in user space. By bypassing the kernel protocol stack through the environment abstraction layer, sending and receiving packets in interrupt-free polling mode, optimizing memory, buffer and queue management, and balancing load with NIC multi-queue and flow classification, it achieves high-performance packet forwarding on the x86 processor architecture. Users can develop all kinds of high-speed forwarding applications in Linux user space, and DPDK is also well suited for integration with commercial data plane acceleration solutions.

Intel began open-sourcing DPDK in 2010 and officially released the source code under the BSD license in September of that year. Contributors in the open-source community have driven rapid innovation and evolution, and DPDK has since become a key technology for SDN and NFV.

4.2 Main features

Disadvantages of packet transmission based on the OS kernel:

  1. Interrupt handling. When large numbers of packets arrive, the NIC generates frequent hardware interrupt requests. These interrupts preempt lower-priority soft interrupts and system calls, imposing a high performance overhead.
  2. Memory copies. Normally a packet travels from the NIC to the application as follows: the NIC DMAs the data into a kernel buffer, which is then copied from kernel space into user space. In the Linux kernel protocol stack this copying alone can account for as much as 57.1% of the total packet-processing time.
  3. Context switches. Hardware and soft interrupts arriving at any moment can preempt system calls, producing heavy context-switch overhead. In multi-threaded server designs, scheduling between threads adds further context switches, and lock contention is an equally serious cost.
  4. Loss of locality. Mainstream processors have many cores, so the handling of a single packet may span several of them: the interrupt may land on cpu0, kernel-mode processing on cpu1 and user-mode processing on cpu2. Crossing cores easily invalidates CPU caches and destroys locality; on a NUMA architecture it can also cause cross-node memory accesses, hurting performance badly.
  5. Memory management. Traditional server memory pages are 4 KB. To speed up address translation and avoid misses one could add more entries to the translation cache, but that hurts the CPU's lookup efficiency.

Techniques that can address these issues:

  1. Separate the control plane from the data plane. Move packet processing, memory management, processor scheduling and similar tasks into user space, leaving the kernel responsible only for a small set of control instructions. This eliminates the interrupts, context switches, system calls and scheduling overhead described above.
  2. Use multi-core programming rather than multi-threading, and set CPU affinity to bind threads one-to-one to cores, removing scheduling switches between them.
  3. On NUMA systems, have each CPU core use memory on its own NUMA node as far as possible to avoid cross-node accesses.
  4. Use huge-page memory instead of ordinary pages to reduce cache and TLB misses.
  5. Use lock-free techniques to resolve resource contention.

The structure of DPDK is shown in the figure below, and the relevant technical principles are summarized as follows:


In the figure, DPDK has two kernel-mode (Linux kernel) modules at the bottom: KNI and IGB_UIO. KNI gives users access to the Linux kernel protocol stack and to traditional Linux network tools (such as ethtool and ifconfig). IGB_UIO (igb_uio.ko, together with kni.ko) uses UIO technology to map the NIC's hardware registers into user space during initialization.

As the figure shows, DPDK's user-mode upper layer consists of many libraries, mainly the core component libraries (Core Libraries), platform-related modules (Platform), NIC poll-mode driver modules (PMD - Natives & Virtual), the QoS library, packet-forwarding classification algorithms (Classify) and other categories. User applications build on these libraries for their own development; a brief introduction follows.

UIO (Linux Userspace I/O) provides driver support in user space; that is, the NIC driver runs in user space, eliminating repeated copies of packets between kernel and application buffers. As shown in the figure, DPDK bypasses the Linux kernel's network driver modules and delivers packets from the network hardware straight to user space, without frequent memory copies and system calls. According to official figures, DPDK's bare packet bounce takes about 80 clock cycles per packet, while the traditional Linux kernel protocol stack needs 2,000-4,000 cycles per packet. DPDK can therefore significantly improve the packet-capture efficiency of virtualized network devices.

Comparison of Linux kernel packet processing without and with DPDK

How UIO technology works:


UIO splits a device driver into two parts: a user-space driver and a kernel-space driver. The kernel-space driver is responsible only for allocating device resources, registering the UIO device, and a small amount of interrupt handling; most of the driver's work is done by the user-space part. The UIO driver is registered with the kernel through the API the UIO framework provides; after registration, a map file containing information such as the device's physical addresses is generated, through which the device's memory space can be operated on directly. UIO thus lets applications drive the device from user space, avoiding repeated copies between kernel buffers and application buffers and improving data-processing efficiency.
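
As an illustration of the UIO mechanism (not DPDK's actual igb_uio code), the sketch below maps a device's BAR0 into user space through the /dev/uioX character device. The device path and region size are assumptions; real code would read the size from /sys/class/uio/uio0/maps/map0/size.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);        /* UIO char device (assumed path) */
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    /* For UIO, mapping N is selected by offset = N * page_size. */
    size_t len = 0x20000;                      /* illustrative BAR0 size */
    void *m = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, 0 * getpagesize());
    if (m == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *bar0 = m;               /* device registers, now in user space */
    printf("register 0x0 = 0x%x\n", bar0[0]);  /* direct register read, no syscall */

    munmap(m, len);
    close(fd);
    return 0;
}
```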

Simply put, DPDK makes it faster to develop high-speed packet-processing applications: thanks to kernel bypass, applications built with it process packets faster. It takes a fast path instead of the normal network-stack path with its context switches, and packets are delivered directly to user space as raw packets. The figures below show the difference between Linux kernel packet processing and DPDK packet processing.

Linux kernel packet processing

DPDK packet processing

Comparison between the slow path and the fast path

core component library

The runtime environment built by this module sits on Linux and is initialized through the environment abstraction layer (EAL): huge-page memory allocation; memory, buffer and queue allocation with lock-free operation; CPU affinity binding; and so on. The EAL also shields the application from I/O operations involving the OS kernel and the underlying NIC (I/O bypasses the kernel and its protocol stack), provides a set of calling interfaces for DPDK applications, and uses UIO or VFIO technology to map PCI device addresses into user space, which makes them easy for applications to access and avoids the processing delays introduced by the network protocol stack and kernel switching. In addition, the core components include a memory pool suited to packet processing, buffer allocation management, memory copying, timers, ring-buffer management and more.
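
A minimal EAL bring-up sketch is shown below. The calls (rte_eal_init, rte_eal_cleanup) are standard DPDK APIs, but the command-line options in the comment are only a typical example and vary by deployment and DPDK version.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>

int main(int argc, char *argv[])
{
    /* Typical invocation: ./app -l 0-3 -n 4 --socket-mem 1024
     * -l pins EAL threads to cores 0-3, --socket-mem reserves
     * hugepage memory per NUMA socket. */
    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "rte_eal_init failed\n");
        return -1;
    }

    printf("EAL up, main lcore = %u\n", rte_lcore_id());

    /* ... create mempools, configure ports, launch lcores ... */

    rte_eal_cleanup();   /* present in recent DPDK releases */
    return 0;
}
```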

DPDK mainly has six core components:

  • 1. Environment Abstraction Layer (EAL): provides a unified interface for the other DPDK components and for applications, shielding platform-specific details. Its functions include DPDK loading and startup; support for multi-core and multi-thread execution models; CPU core affinity handling; atomic-operation and lock interfaces; a clock reference; PCI bus access; trace and debug interfaces; CPU feature detection; and interrupt and alarm interfaces.
  • 2. Heap memory management component (Malloc lib): provides interfaces through which applications allocate memory backed by huge pages. Using these interfaces reduces TLB misses when large blocks of memory are needed.
  • 3. Ring buffer management component (Ring lib): provides a lock-free, multi-producer multi-consumer FIFO queue API, the Ring, for applications and other components. The Ring is modeled on the Linux kernel's kfifo lock-free queue; enqueue and dequeue require no locks, and multiple producers and consumers can operate on the queue concurrently.
  • 4. Memory pool management component (Mempool lib): provides interfaces for allocating memory pools for applications and other components. A memory pool is a container of fixed-size memory blocks used to store objects of the same type, such as packet cache blocks, and is uniquely identified by its name. It consists of a ring buffer plus a set of per-core local cache queues; each core allocates blocks from its own cache queue and replenishes it from the ring when needed (a short sketch using the mempool and mbuf APIs follows this list).
  • 5. Network packet buffer management component (Mbuf lib): provides interfaces for applications to create and release the buffer blocks (mbufs) that hold packet data; the mbufs themselves live in memory pools. Two kinds of mbuf are provided: one for general information and one for packet data.
  • 6. Timer component (Timer lib): provides interfaces for asynchronous periodic (or one-shot) execution, letting a function be run asynchronously at a specified time, much like a libc timer. Here, however, the application must call rte_timer_manage periodically in its main loop for timers to fire. The timer component takes its time reference from the time interfaces provided by the EAL.
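
The following is a short sketch of the Mempool and Mbuf libraries from the list above, assuming the EAL is already initialized; the pool size and cache size are illustrative values.

```c
#include <stdio.h>
#include <rte_errno.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_MBUFS   8191   /* pool size; (2^k - 1) is the usual recommendation */
#define MBUF_CACHE 250    /* per-lcore cache that cuts contention on the ring */

static struct rte_mempool *create_pkt_pool(void)
{
    struct rte_mempool *mp = rte_pktmbuf_pool_create(
        "pkt_pool", NB_MBUFS, MBUF_CACHE,
        0,                             /* private data size              */
        RTE_MBUF_DEFAULT_BUF_SIZE,     /* data room per mbuf             */
        rte_socket_id());              /* allocate on the local NUMA node */
    if (mp == NULL)
        printf("pool create failed: %s\n", rte_strerror(rte_errno));
    return mp;
}

static void use_pool(struct rte_mempool *mp)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(mp);  /* served from the per-core cache */
    if (m != NULL)
        rte_pktmbuf_free(m);                     /* returned to the pool */
}
```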

In addition to the above six core components, DPDK also provides the following functions:

  • 1) Ethernet poll-mode driver (PMD) architecture: moves the Ethernet driver from the kernel into the application layer and replaces the kernel's asynchronous interrupt mechanism with synchronous polling, improving packet receive and transmit efficiency.
  • 2) Packet-forwarding algorithm support: the Hash and LPM libraries provide the building blocks for packet-forwarding algorithms.
  • 3) Network protocol definitions and related macros: definitions derived from the FreeBSD IP protocol stack, such as the TCP, UDP and SCTP header definitions.
  • 4) Packet QoS scheduling library: supports QoS features such as random early detection, traffic shaping, strict priority, and weighted random round-robin priority scheduling.
  • 5) Kernel Network Interface library (KNI): provides a communication channel between DPDK applications and the kernel protocol stack, similar to the ordinary Linux TUN/TAP interface but more efficient. Each physical port can be virtualized into multiple KNI interfaces.


Platform related modules

The platform-related modules mainly include KNI, power management and the IVSHMEM interface. The KNI module passes packets from user space to the kernel protocol stack through kni.ko, so that user processes can handle those packets with the traditional socket interface. Power management provides APIs with which an application can dynamically adjust the processor frequency or put the processor into different sleep states according to the packet receive rate. The IVSHMEM module offers a zero-copy shared-memory mechanism between virtual machines, or between a virtual machine and the host: when a DPDK program runs, the IVSHMEM module calls the core library APIs, maps several huge pages into an IVSHMEM device pool, and passes the parameters to QEMU, achieving zero-copy memory sharing between virtual machines.

Userspace Polling Mode (PMD)

Traditional interrupt mode: in a traditional Linux system, when a network device detects an incoming frame it uses DMA (direct memory access) to write the frame into a pre-allocated kernel buffer and then updates the corresponding receive descriptor ring. It then raises an interrupt to announce the frame's arrival. The Linux system responds, updates the descriptor ring again, and hands the received frame to the kernel network stack for processing. After the network stack has finished, the data is copied to the corresponding socket and thus into user space, where the application can use it. The receive path of a data frame is shown in the figure:


Data frame reception process

On the transmit side, once the user program has finished with the data, it writes it to the socket through a system call; the data is copied from user space into a kernel buffer and handed to the network stack. The stack encapsulates the data and calls the NIC driver; the driver updates the transmit descriptor ring and then notifies the NIC that a frame is ready to send. The NIC copies the frame from the kernel buffer into its own buffer and puts it on the link. Once transmission is complete, the NIC signals success through an interrupt, and the kernel releases the frame's buffer.

The data is sent as shown in the figure:


The sending process of the data frame

Because Linux notifies the CPU of packet arrivals through interrupts, as network traffic grows the system spends more and more time handling interrupts. At rates around 10G the system can be overwhelmed by interrupts and waste a large share of its CPU resources.

Poll-mode driver in DPDK user space: the user-space driver lets applications access the NIC without going through the Linux kernel. The NIC DMAs packets into a pre-allocated buffer that lives in user space; the application polls continuously and reads and processes the packets in place, without interrupts and without copying packets from the kernel up to the application layer.

Compared with the traditional interrupt-driven approach of the Linux system, Intel DPDK therefore avoids the performance cost of interrupt handling, context switching, system calls and data copying, and greatly improves packet-processing performance. At the same time, because DPDK drivers are developed in user space, the risk is much lower than with traditional kernel drivers: kernel code runs with high privileges, where a small bug can crash the system and development requires great care and extensive testing, whereas application-level code is safer and far easier to debug.

Polling mode driver module

The PMD APIs send and receive NIC packets in polling mode, avoiding the response latency of the interrupt-driven approach used in conventional packet processing and greatly improving NIC throughput. The module supports both physical and virtualized network interfaces: from Intel NICs at first to the wider industry ecosystem (Cisco, Broadcom, Mellanox, Chelsio, and so on), as well as virtualized interfaces based on KVM, VMware, Xen and others.

DPDK also defines a large number of APIs to abstract data plane forwarding applications, such as ACL, QoS, traffic classification, and load balancing. Moreover, in addition to the Ethernet interface, DPDK is also defining software and hardware acceleration interfaces (Extensions) for encryption and decryption.

large page technology

The processor's memory management consists of two concepts: physical memory and virtual memory. In the Linux operating system, the entire physical memory is managed by frames, and the virtual memory is managed by pages.

The memory management unit (MMU) translates virtual addresses into physical addresses. The information it needs is stored in a data structure called the page table, and a page-table walk is an expensive operation. To avoid walking the page table on every access, Intel processors keep a cache of recent translations called the TLB (Translation Lookaside Buffer), which stores virtual-to-physical mappings. Before any virtual address is translated, the processor first checks whether the TLB holds a valid mapping; only on a TLB miss does it walk the page table, and those walks hurt performance, so TLB misses should be minimized. In the default x86 hardware configuration the page size is 4 KB, but larger page sizes, such as 2 MB or 1 GB, are also supported. With huge pages enabled, a single TLB entry covers a much larger memory region, which greatly reduces TLB misses. Early Linux did not expose the huge-page capability of x86 hardware; only from kernel 2.6.33 onward can applications use it, via the Linux huge-page filesystem (hugetlbfs).

DPDK uses huge-page memory: all memory is allocated from huge pages, the memory pool (mempool) is managed on top of it, and an mbuf of fixed size is pre-allocated for each packet.
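
The sketch below shows the bare mechanism, asking the kernel for one 2 MB page with mmap(MAP_HUGETLB). DPDK itself reserves huge pages through hugetlbfs and the EAL; this is only an illustration and requires huge pages to be reserved beforehand (for example via /proc/sys/vm/nr_hugepages).

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
    void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap MAP_HUGETLB (are huge pages reserved?)");
        return 1;
    }

    memset(p, 0, HUGE_2MB);   /* touch it: one TLB entry now covers 2 MB */
    munmap(p, HUGE_2MB);
    return 0;
}
```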

Memory management in DPDK is shown in the figure. At the bottom is contiguous physical memory made up of 2 MB huge pages; above it are memory segments, and above those, memory zones. The objects we allocate, the basic units, live inside memory zones, which hold ring queues, memory pools, LPM routing tables, and other performance-critical structures.


polling technique

In the traditional NIC receive/transmit path, the NIC hardware raises an interrupt to the CPU after receiving or sending a packet, to tell the application software that a packet needs handling. On an x86 processor, one interrupt requires saving the processor's status registers to the stack, running the interrupt service routine, and finally restoring the saved registers from the stack; the whole procedure takes at least 300 processor clock cycles. For high-performance packet processing, the overhead of frequent interrupts severely reduces application performance.

To cut interrupt overhead, DPDK processes network packets by polling. When a packet arrives, the NIC writes it directly into the processor cache (with DDIO, Direct Data I/O) or into memory (without DDIO) and sets a packet-arrival flag. The application software periodically polls this flag to detect whether new packets are waiting. No interrupts occur anywhere in the process, so the application's packet-processing capability is greatly improved.
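
To show the idea of polling an arrival flag, here is a deliberately simplified, hypothetical receive descriptor: the field layout is invented, but the pattern — the NIC DMAs the packet and sets a "done" bit that software spins on — is what a real poll-mode driver does.

```c
#include <stdint.h>

struct rx_desc {                 /* hypothetical descriptor layout */
    uint64_t buf_addr;           /* DMA address of the packet buffer */
    uint16_t length;             /* bytes written by the NIC         */
    volatile uint8_t status;     /* NIC sets DESC_DONE after the DMA */
};

#define DESC_DONE 0x01

/* Returns 1 if a packet was ready at ring[*next], 0 otherwise. */
static int poll_one(struct rx_desc *ring, uint32_t *next, uint32_t ring_size)
{
    struct rx_desc *d = &ring[*next];

    if (!(d->status & DESC_DONE))        /* nothing arrived yet, keep spinning */
        return 0;

    /* d->length bytes are ready at d->buf_addr: hand them to the application,
     * then recycle the descriptor for the NIC. */
    d->status = 0;
    *next = (*next + 1) % ring_size;
    return 1;
}
```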

CPU affinity technology

Modern operating systems schedule tasks by time-sharing: on any core of a multi-core processor, processes and threads execute alternately. Each switch requires saving the processor's status registers on the stack and restoring the state of the incoming task, which is real overhead for the system. Pinning a thread to one core eliminates that switching cost. In addition, when a process or thread migrates to another core, the data cached for it must be refilled, lowering cache utilization.

CPU affinity binds a process or thread to one or more specific cores so that it is never migrated elsewhere, guaranteeing the performance of dedicated programs.

DPDK uses the Linux pthread library to bind its threads to specific CPUs, after which each thread uses resources as independently as possible for its data processing.

On a multi-core machine, every CPU core has its own cache holding the data the running thread uses. If a thread is not bound to a core, Linux may schedule it onto other CPUs, and its cache hit rate drops. With CPU affinity, once a thread is bound to a CPU it always runs there; the operating system never moves it, which saves scheduling overhead and improves execution efficiency.
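
A minimal sketch of the underlying mechanism: pinning the calling thread to one core with the glibc affinity API (the same facility DPDK's EAL relies on for its lcores); the core number here is arbitrary.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);

    /* Bind the current thread; the scheduler will no longer migrate it. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(2) == 0)   /* keep this thread on core 2 */
        printf("pinned, now running on CPU %d\n", sched_getcpu());
    return 0;
}
```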

Multi-core polling modes: there are two, the IO-exclusive mode and the pipeline mode. In the IO-exclusive mode, each core independently carries out the whole receive-process-transmit cycle, and the cores are independent of each other; its advantage is that a problem on one core does not affect packet I/O on the others. The pipeline mode uses several cooperating cores, with receiving, processing and transmitting handled by different cores; it suits stream-oriented processing, and its advantage is that packets can be handled in arrival order, while its drawback is that if the core responsible for one stage (say, receiving) stalls, transmission and reception are interrupted.

In the IO-exclusive multi-core polling mode, each NIC is assigned to exactly one logical core. Each logical core sets up one transmit queue and one receive queue for the NIC it owns and independently completes the receive-process-transmit cycle; the cores do not depend on each other. Packet processing thus proceeds on several logical cores in parallel, while each NIC's queues are served by a single core. When a packet enters the NIC's hardware buffer, the user-space NIC driver learns of it by polling, pulls the packet out of the hardware buffer, and places it in the receive queue owned by that logical core; the core takes the packet from the receive queue, processes it, places the result in its transmit queue, and the driver then takes it out and hands it to the NIC, which finally sends it onto the network.
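
A sketch of how such IO-exclusive workers might be launched with DPDK's lcore API, assuming ports and queues are already configured and that lcore_io_loop is the per-core receive-process-transmit loop shown earlier. RTE_LCORE_FOREACH_WORKER is the current macro name (older releases call it RTE_LCORE_FOREACH_SLAVE).

```c
#include <stdint.h>
#include <rte_eal.h>
#include <rte_lcore.h>

extern int lcore_io_loop(void *arg);   /* per-core rx/process/tx loop */

static void launch_io_exclusive_workers(void)
{
    uintptr_t port = 0;
    unsigned int lcore;

    /* Each worker lcore exclusively owns one port (or one rx/tx queue pair). */
    RTE_LCORE_FOREACH_WORKER(lcore) {
        rte_eal_remote_launch(lcore_io_loop, (void *)port, lcore);
        port++;
    }

    rte_eal_mp_wait_lcore();           /* block until the workers return */
}
```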


Application of DPDK in AWCloud

The DPDK (Data Plane Development Kit) data plane development tool set provides library functions and driver support for efficient user-space packet processing on Intel Architecture (IA) processors. Unlike Linux, which is designed for general-purpose use, it concentrates on high-performance packet handling in network applications. A DPDK application runs in user space and uses DPDK's own data plane libraries to send and receive packets, bypassing the Linux kernel protocol stack. To accelerate data handling, users can build a custom protocol stack in user space tailored to their application. Compared with traditional kernel-based networking, moving packet processing from the kernel to the user layer is a major breakthrough for DPDK.

The DPDK feature is used to accelerate the rate at which cloud hosts and physical hosts process network packets. With huge-page memory, CPU affinity, and related techniques, it bypasses the system's cumbersome packet-processing path and improves network performance.

The cloud platform uses DPDK technology to meet network performance optimization, as shown in the following figure:


Memory pool and lock-free ring cache management

In addition, Intel DPDK makes its libraries and APIs lock-free, for example the lock-free queues, which prevents deadlocks in multi-threaded programs. Data structures such as buffers are also cache-aligned; without cache alignment, a memory access may touch memory and the cache one extra time.

Allocation and release of buffer blocks from the memory pool are managed by a producer-consumer lock-free cache queue, which avoids locking overhead on the queue and speeds up buffer allocation and release.

Lock-free ring queue production process


Lock-free circular queue consumption process


As the figures show, the producer stores objects into the queue in the same direction the consumer takes them out, both clockwise. When the buffer area requests memory blocks from the memory pool, or the application releases memory blocks, the producer pointer of the lock-free ring moves clockwise and the addresses of the blocks are stored into the queue; this is the production process of the buffer queue. When the application requests a memory block from the buffer, the consumer pointer of the lock-free ring moves clockwise and the block's address is taken out of the queue and handed to the application; this is the consumption process of the buffer queue.

Producing n objects: the producer head pointer first moves n positions clockwise to obtain a new head; n objects are then stored one by one starting from the position of the producer tail pointer; finally the producer tail pointer moves n positions clockwise to its new value.

Consuming n objects: the consumer head pointer first moves n positions clockwise to obtain a new head; n objects are then read one by one starting from the consumer tail pointer; finally the consumer tail pointer moves n positions clockwise to its new value.
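
A minimal sketch of the producer and consumer operations described above, using DPDK's rte_ring API (the head/tail pointer movement is handled inside the library); the ring name and size are illustrative.

```c
#include <stdio.h>
#include <rte_lcore.h>
#include <rte_ring.h>

static void ring_demo(void)
{
    /* Size must be a power of two; flags = 0 gives a multi-producer,
     * multi-consumer lock-free ring. */
    struct rte_ring *r = rte_ring_create("demo_ring", 1024,
                                         rte_socket_id(), 0);
    if (r == NULL)
        return;

    int value = 42;
    void *obj = NULL;

    /* Producer side: move the producer head/tail and store the object. */
    if (rte_ring_enqueue(r, &value) != 0)
        printf("ring full\n");

    /* Consumer side: move the consumer head/tail and fetch the object. */
    if (rte_ring_dequeue(r, &obj) == 0)
        printf("dequeued %d\n", *(int *)obj);

    rte_ring_free(r);
}
```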

Network storage optimization


4.3 The breakthrough of DPDK

Traditional Linux kernel network data flow:

Hardware interrupt ---> fetch the packet and dispatch it to a kernel thread ---> soft interrupt ---> kernel thread processes the packet in the protocol stack ---> notify the user layer when done
User layer receives the packet --> network layer ---> logic layer ---> business layer

dpdk network data flow:

Hardware interrupt ---> abandon the interrupt flow
User layer fetches the packet through device mapping ---> user-space protocol stack ---> logic layer ---> business layer

4.4 KNI components

KNI is the component the DPDK platform provides for re-injecting data into the kernel protocol stack. Its purpose is to reuse the mature, stable protocol handling that the traditional kernel stack already implements. DPDK bypasses the kernel stack and hands packets directly to user space, but user space has no complete protocol stack of its own, and asking developers to implement a full, independent stack there would be very complex. The DPDK platform therefore provides the KNI component: developers can implement special-purpose protocol handling in user space and, through KNI's re-injection into the kernel stack, leave the common, general protocols to the traditional kernel stack.

KNI communication mechanism

The KNI component creates KNI virtual interface devices and passes packets through them to achieve communication between user space and the kernel protocol stack. When the NIC receives a packet, the application pulls it into user space through the user-space driver; the KNI component pushes the packets that need kernel handling to the KNI virtual interface, which hands them to the kernel protocol stack. After processing, any response packet is handed back through the KNI virtual interface to the application. Packets sent to the kernel stack and packets received from it are handled by two different logical cores, so the application is never blocked while the kernel stack is sending or receiving.

A KNI interface is in fact a virtual device that defines four queues: the receive queue (rx_q), the transmit queue (tx_q), the allocated-block queue (alloc_q) and the to-be-freed queue (free_q). The receive queue stores packets that the user-space program sends to the KNI virtual device; the transmit queue stores packets that the kernel protocol stack sends to it. The allocated-block queue holds memory blocks already reserved, from which the kernel stack takes buffers when it sends packets. The to-be-freed queue records memory blocks that are no longer needed after the KNI device has received packets from the user-space program, so they can be released back to memory. When the user-space program receives a packet from the NIC, it forwards it to the KNI virtual device, which passes it to the kernel stack for protocol processing. On the transmit side, the kernel stack first encapsulates the payload, then sends the packet to the KNI virtual device, which delivers it back to the user-space program.
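
The sketch below shows the two directions of that exchange with the classic rte_kni API (note that librte_kni has been deprecated and removed in recent DPDK releases). The KNI context is assumed to have been created with rte_kni_alloc() and the port already started.

```c
#include <rte_ethdev.h>
#include <rte_kni.h>
#include <rte_mbuf.h>

#define BURST 32

static void kni_forward_once(struct rte_kni *kni, uint16_t port_id)
{
    struct rte_mbuf *pkts[BURST];

    /* User space -> kernel: packets taken from the NIC are pushed into the
     * KNI device (its rx_q) so the kernel protocol stack can process them. */
    uint16_t n = rte_eth_rx_burst(port_id, 0, pkts, BURST);
    unsigned int to_kernel = rte_kni_tx_burst(kni, pkts, n);
    for (unsigned int i = to_kernel; i < n; i++)
        rte_pktmbuf_free(pkts[i]);

    /* Kernel -> user space: replies from the kernel stack come back on the
     * KNI device (its tx_q) and are transmitted out of the physical port. */
    unsigned int from_kernel = rte_kni_rx_burst(kni, pkts, BURST);
    uint16_t sent = rte_eth_tx_burst(port_id, 0, pkts, from_kernel);
    for (unsigned int i = sent; i < from_kernel; i++)
        rte_pktmbuf_free(pkts[i]);

    /* Serve ifconfig/ethtool-style requests arriving from the kernel side. */
    rte_kni_handle_request(kni);
}
```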

4.5 DPDK core optimization

DPDK's UIO driver masks the interrupts issued by the hardware and then uses active polling in user mode; this model is called PMD (Poll Mode Driver).

UIO bypasses the kernel, and active polling replaces hardware interrupts, so DPDK can send and receive packets entirely in user mode. This brings zero copies and no system calls, and the synchronous processing reduces the cache misses caused by context switches.

A core running a PMD sits at 100% CPU utilization.

When the network is idle, the CPU spins for long periods, which creates a power-consumption problem, so DPDK also offers an interrupt-mode variant, Interrupt DPDK.

interrupt DPDK:


High-performance code implementation of DPDK

1. **Use HugePages to reduce TLB misses:** Linux uses 4 KB pages by default. The smaller the page, the more entries are needed to describe a large address space: the page table grows, and so does its memory footprint. The CPU's TLB (Translation Lookaside Buffer) is expensive, so it usually holds only a few hundred to a few thousand entries. If a process wants to use 64 GB of memory, 64 GB / 4 KB gives roughly 16 million pages, whose page-table entries take about 16,000,000 × 4 B ≈ 64 MB. With 2 MB huge pages, 64 GB / 2 MB is only 32,768 pages, a completely different order of magnitude. DPDK uses HugePages, supporting 2 MB and 1 GB page sizes on x86-64, which shrinks the number of page-table entries geometrically and thereby reduces TLB misses. DPDK also provides basic libraries such as the memory pool (Mempool), Mbuf, lock-free ring (Ring) and Bitmap. In our experience, frequent allocation and release on the data plane must go through the memory pool rather than rte_malloc directly; DPDK's general-purpose allocator is very simple, not as good as ptmalloc.

2. **SNA (Shared-nothing Architecture):** decentralize the software architecture and avoid global sharing as far as possible, because global state brings contention and destroys horizontal scalability. On NUMA systems, do not use memory remotely across nodes.

3. **SIMD (Single Instruction Multiple Data):** from the earliest MMX/SSE to the latest AVX2, SIMD capability keeps growing. DPDK processes packets in batches and then uses vector programming to handle a whole batch in one pass; memcpy, for instance, uses SIMD for speed. SIMD is most common in game back-ends, but any workload with similar batch-processing characteristics can use it to improve performance (a minimal AVX2 copy sketch follows this list).

4. **Do not use slow APIs:** "slow" has to be redefined here. gettimeofday, for example, no longer needs to trap into the kernel on 64-bit systems thanks to the vDSO; it is a pure memory access and can be called tens of millions of times per second. Yet at 10GE our per-second processing demands also reach tens of millions of packets, so even gettimeofday counts as a slow API. DPDK provides the Cycles interfaces instead, such as rte_get_tsc_cycles, implemented on top of HPET or the TSC (a TSC timing sketch also follows this list).
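
Two short sketches for points 3 and 4 above. The first copies a 64-byte cache line with AVX2 intrinsics, the kind of vectorized operation rte_memcpy and the vector PMDs build on (compile with -mavx2; this is an illustration, not DPDK's actual code). The second times a burst with the TSC-based cycle counter instead of gettimeofday.

```c
#include <immintrin.h>

/* Vectorized 64-byte copy: two 32-byte AVX2 loads and stores. */
static inline void copy64_avx2(void *dst, const void *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)src + 1);
    _mm256_storeu_si256((__m256i *)dst, lo);
    _mm256_storeu_si256((__m256i *)dst + 1, hi);
}
```

```c
#include <inttypes.h>
#include <stdio.h>
#include <rte_cycles.h>

/* Measure a burst in TSC cycles via rte_get_tsc_cycles/rte_get_tsc_hz. */
static void time_burst(void)
{
    uint64_t hz = rte_get_tsc_hz();          /* TSC ticks per second */
    uint64_t start = rte_get_tsc_cycles();

    /* ... process one burst of packets here ... */

    uint64_t cycles = rte_get_tsc_cycles() - start;
    printf("burst took %" PRIu64 " cycles (%.0f ns)\n",
           cycles, (double)cycles * 1e9 / hz);
}
```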

5. DPDK video tutorial

[DPDK high-performance storage] dpdk starts from the tcpip protocol stack, and prepares the linux environment to start together

[DPDK high-performance storage] C/C++ development is a good technical direction, dpdk network development

[DPDK high-performance storage] The underlying principle of dpdk allows you to open another technical direction

[DPDK high-performance storage] The past and present of DPDK, the future direction of C/C++ programmers

[DPDK high-performance storage] The virtualization of dpdk, the story of vhost and virtio, and the realization principle of qemu

[DPDK high-performance storage] The stories of nff-go and dpdk, the process analysis of golang calling c

[DPDK high-performance storage] 5 misunderstandings of dpdk, use code to solve, start from dpdk handwritten protocol stack

[DPDK high-performance storage] 10 technical issues about dpdk/spdk development

[DPDK high-performance storage] fio's iops test, hand-written a spdk engine for fio, (bring your own linux environment)

[DPDK high-performance storage] vpp source code process analysis, dynamic library loading, plugin, node, feature process

[DPDK high-performance storage] How does SPDK achieve high performance, in-depth working principle of NVMe

[DPDK high-performance storage] The storage framework spdk opens a storage door for the technology stack

[DPDK high-performance storage] Understand 6 questions and start the road to high-performance development of dpdk/spdk

6. Handwritten TCP/IP user mode protocol stack (pure C language)


(1) Basic knowledge of DPDK

  • 1. dpdk environment construction and multi-queue network card
  • 2. dpdk network card binding and arp
  • 3. Realization of dpdk sending process
  • 4. Debugging of dpdk sending process
  • 5. Implementation of dpdk-arp
  • 6. ARP debugging process
  • 7. Implementation of dpdk-icmp
  • 8. Dpdk-icmp process debugging and checksum implementation
  • 9. Implementation of arp-table

(2) Implementation of udp/tcp of the protocol stack

  • 1. Arp request implementation
  • 2. ARP debugging process
  • 3. Protocol stack architecture design optimization
  • 4. Design of udp system api for udp implementation
  • 5. Ring queue of sbuf and rbuf implemented by udp
  • 6. The sending process and concurrency decoupling realized by udp
  • 7. Architecture design and debugging of udp implementation
  • 8. Design of dpdk tcp process architecture for tcp three-way handshake
  • 9. Implementation of the 11 TCP states in the dpdk tcp three-way handshake
  • 10. Dpdk code debugging of tcp three-way handshake

(3) Implementation of tcp of the protocol stack

  • 1. Implementation of ack and seqnum confirmation code and sliding window for tcp data transmission
  • 2. ack and seqnum code implementation of tcp data transmission and sliding window
  • 3. The realization of bind and listen of tcp protocol api implementation
  • 4. The implementation of accept in tcp protocol api implementation
  • 5. Implementation of send and recv of tcp protocol api implementation
  • 6. The implementation of close in tcp protocol api implementation
  • 7. Segmentation fault and logic flow of tcp protocol stack debugging
  • 8. Ringbuffer memory error in tcp protocol stack debugging.
  • 9. The principle of dpdk kni and kni startup
  • 10. Reconstruct the process of network protocol distribution

(4) Component functions of the protocol stack

  • 1. kni packet capture debugging tcpdump
  • 2. dpdk kni mempool errors and memory leaks
  • 3. Mathematical theory of entropy-based ddos detection
  • 4. Implementation of dpdk ddos entropy calculation code
  • 5. Debugging of dpdk ddos attack detection accuracy
  • 6. ddos attack testing tool hping3
  • 7. The principle and application of dpdk cuckoo hash

(5) Concurrent implementation of tcp protocol stack

  • 1. Design of tcp concurrent connection
  • 2. Implementation of tcp concurrent epoll
  • 3. Callback and concurrency test of tcp concurrent protocol stack and epoll
  • 4. bpf and bpftrace system, network mount implementation
  • 5. Mount monitoring of bpf and bpftrace application ntyco

(6) DPDK network basic components

  • 1. Source code analysis and explanation of mempool and mbuf
  • 2. Source code analysis of dpdk-ringbuffer
  • 3. dpdk-igb_uio source code analysis
  • 4. Analysis of dpdk-kni source code
  • 5. The realization of rcu and mutual exclusion lock, spin lock, read-write lock

Problems this solves

  • 1. Studying books and online material hard, but with no real project to apply it to
  • 2. No suitable network project to put on a resume
  • 3. Having a C foundation and pure interest

Hand-write 3,000 lines of code and master the core technology of network communication.

7. DPDK market development

Most early applications of DPDK were in the telecommunications field. As CSPs adopt network virtualization to cut operating costs and speed up the rollout of new services, they virtualize use cases that demand high throughput and/or low latency, such as routers, firewalls, radio access networks (RAN) and evolved packet cores (EPC). Vendors of virtualization platforms, VNFs and applications have leveraged DPDK in their products to meet the CSPs' performance goals. As CSPs explore new edge-hosted applications such as video caching, surveillance, augmented reality (AR), assisted driving, retail and industrial IoT, DPDK remains a critical technology for hitting aggressive performance targets.

Just as DPDK was first applied to demanding packet-processing functions in telecom, it is increasingly used in enterprises and clouds. In 2018, for example, VMware introduced a DPDK-based edge for its NSX-T data center software-defined infrastructure. That NSX-T release targets applications requiring high packet throughput with variable packet sizes, and servers with high-speed NICs carrying up to 100 Gbps of north-south traffic. North-south flows typically vary in packet size and processing requirements even though they account for less than 20% of total traffic. In this use case, analysis by Intel and VMware showed a five-fold performance increase when using DPDK with small (64-byte) packets.

Meanwhile, several companies use DPDK for financial applications, where low latency is a huge competitive advantage. In high-frequency trading (HFT), for instance, latency directly affects a trader's efficiency, algorithmic strategies and ability to outperform competitors. InformationWeek estimates that for a large brokerage, a millisecond is worth $100 million a year. DPDK is a key technology that solution providers in this market rely on.

A final word:

As a low-level technology that is growing ever more popular on the Internet, DPDK is welcomed and used by more and more strong Internet companies. Many programmers, however, have only seen or heard of it and never studied or used it in depth. If you are interested in dpdk, or have friends planning to become dpdk engineers, take a look at the [dpdk/network protocol stack/vpp/OvS/DDos/SDN/NFV/virtualization/high performance expert road] systematic course.


Copyright statement: This article is an original article written by Zhihu blogger "Playing with the Linux Kernel". It follows the CC 4.0 BY-SA copyright agreement. For reprinting, please attach the original source link and this statement.
Original link: https://zhuanlan.zhihu.com/p/638255695
