DPDK analysis

1. The situation and trend of network IO

From everyday use we can all feel that network speeds keep increasing, and NIC technology has evolved through 1GE/10GE/25GE/40GE/100GE. It follows that the network IO capability of a single machine must keep pace with this development.

  1. The traditional telecom field
    Devices at the IP layer and below, such as routers, switches, firewalls, and base stations, have all used hardware solutions: some based on dedicated network processors (NP), some on FPGA, some on ASIC. But the drawbacks of hardware are obvious: bugs are hard to fix, debugging and maintenance are difficult, and network technology keeps evolving, e.g. the innovation of mobile technologies such as 2G/3G/4G/5G, so business logic implemented in hardware is too painful to iterate quickly. The challenge in this traditional field is the urgent need for a high-performance network IO development framework with a software architecture.
  2. The development of the cloud
    The emergence of private clouds, sharing hardware through network function virtualization (NFV), has become a trend. NFV is defined as implementing various traditional or new network functions on standard servers and standard switches. It urgently needs a high-performance network IO development framework based on commodity systems and standard servers.
  3. The surge in single-machine performance
    NICs have developed from 1G to 100G, and CPUs from single-core to multi-core to multi-socket, so the capability of a single server has reached new heights through this rapid expansion. But software development has not kept up, and the processing capacity of a single machine cannot match the hardware. How do we develop high-throughput services that keep pace with the times, with a single machine handling millions of concurrent connections? Even businesses that do not need high QPS and are mainly CPU-intensive are affected: big data analysis, artificial intelligence, and similar applications need to move large volumes of data between distributed servers to complete their jobs. This point should concern us Internet back-end developers the most.

2. Linux + x86 network IO bottleneck

A few years ago I wrote the article "Network Card Working Principle and Tuning under High Concurrency", which described the packet send/receive flow in Linux. From experience, for an application running on a C1 (8-core) machine, every 10,000 packets processed per second consumes about 1% of soft-interrupt CPU, which puts the single-machine ceiling at about 1 million PPS (Packets Per Second). That matches TGW (the Netfilter version) at 1 million PPS; AliLVS, heavily optimized, reaches only 1.5 million PPS, and the servers they use are fairly well configured. Suppose we want to saturate a 10GE NIC with 64-byte packets: that needs 20 million PPS (note: the actual ceiling of a 10GE NIC is 14.88 million PPS, because the minimum frame occupies 84 bytes on the wire; see "Bandwidth, Packets Per Second, and Other Network Performance Metrics"), and 100G means 200 million PPS, i.e. the processing time per packet cannot exceed 50 nanoseconds. Meanwhile a single Cache Miss, whether in the TLB, the data cache, or the instruction cache, takes about 65 nanoseconds to read back from memory, and cross-Node communication on a NUMA system costs about 40 nanoseconds more. So even with no business logic added, purely sending and receiving packets is already this hard. We must control the Cache hit rate, and we must understand the computer architecture to avoid cross-Node communication.
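
Where the 14.88 million PPS ceiling comes from: the smallest Ethernet frame is 64 bytes, but on the wire each frame also carries a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-frame gap, i.e. 84 bytes = 672 bits per packet:

    10,000,000,000 bit/s ÷ 672 bit/packet ≈ 14,880,952 packet/s
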
These numbers should give a direct feel for how big the challenge is; between the ideal and reality, we need to find a balance. The problems are these:

  • The traditional way of sending and receiving packets must use hard interrupts for notification. Each hard interrupt costs about 100 microseconds, not counting the Cache Misses caused by terminating the running context.
  • Data must be copied between kernel mode and user mode, which brings heavy CPU consumption and global lock contention.
  • Both sending and receiving packets incur the overhead of system calls.
  • The kernel works across all cores and must stay globally consistent; even with Lock-Free techniques, the performance loss from bus locking and memory barriers cannot be avoided.
  • The path from the NIC to the business process is too long, and parts of it are actually unnecessary, such as the netfilter framework, which adds overhead and is prone to Cache Misses.

3. The basic principles of DPDK

From the previous analysis we know the IO path, the kernel's bottlenecks, and the uncontrollable factors as data flows through the kernel. These are all in the kernel: the kernel itself is the cause of the bottleneck, and solving the problem means getting around it. So the mainstream solution is kernel bypass: route the NIC's IO around the kernel and send and receive packets directly in user mode, removing the kernel bottleneck.
The Linux community also provides a bypass mechanism, Netmap. Its official figure is 14 million PPS on a 10G NIC, but Netmap is not widely used, for several reasons:

  • Netmap needs driver support, i.e. NIC vendors have to buy into the solution.
  • Netmap still relies on the interrupt notification mechanism, so it does not completely remove the bottleneck.
  • Netmap is more like a handful of system calls that let user mode send and receive packets directly; the functionality is too primitive, it has not grown into a network development framework one can depend on, and the community is not mature.

So let's look at DPDK, which has been in development for more than ten years. From Intel leading the work to the participation of major vendors such as Huawei, Cisco, and AWS, the core players are all in this circle; the community is complete and the ecosystem has formed a closed loop. In the early days the applications were mainly at layer 3 and below in the traditional telecom field; for example, Huawei, China Telecom, and China Mobile were early adopters, with switches, routers, and gateways as the main scenarios. But with the demands of upper-layer services and the improvement of DPDK, higher-layer applications are gradually appearing.

The DPDK bypass principle:

[Figure: DPDK bypass data path, quoted from "Flow Bifurcation on Intel® Ethernet Controller X710/XL710" by Jingjing Wu]

On the left is the original path: data flows from the NIC -> driver -> protocol stack -> Socket interface -> business logic.

On the right is the DPDK way, bypassing the kernel based on UIO (Userspace I/O): data flows from the NIC -> DPDK polling mode -> DPDK base libraries -> business logic.

The benefit of user mode is that it is easy to use, develop, and maintain, with good flexibility. And a crash does not affect the kernel's operation, making the whole system more robust.

CPU architectures supported by DPDK: x86, ARM, PowerPC (PPC)

The list of NICs supported by DPDK: core.dpdk.org/supported/ ; our mainstream choices are the Intel 82599 (fiber port) and the Intel X540 (copper port).

4. UIO, the cornerstone of DPDK

To let drivers run in user mode, Linux provides the UIO mechanism. With UIO, interrupts can be sensed through read(), and communication with the NIC happens through mmap().

The UIO principle:

[Figure: UIO architecture]
There are several steps to developing a user-mode driver (a minimal sketch follows the list):

  • Develop a small UIO module that runs in the kernel, because hard interrupts can only be handled in the kernel
  • Read interrupts via /dev/uioX
  • Share memory with the device through mmap
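
A minimal sketch of the user-space side (not DPDK's actual UIO driver), assuming a device is already bound to a UIO driver and exposed as /dev/uio0 with its register BAR as map 0; the 4096 mapping size is an assumption, the real size is read from /sys/class/uio/uio0/maps/map0/size:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/uio0", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* Map the device's register region (map 0) into user space */
	volatile uint8_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				      MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); return 1; }

	/* Block until the kernel-side UIO module reports an interrupt;
	   the 4-byte value read is the running interrupt count */
	uint32_t irq_count;
	if (read(fd, &irq_count, sizeof(irq_count)) == sizeof(irq_count))
		printf("interrupt #%u, register byte 0 = 0x%02x\n",
		       irq_count, regs[0]);

	munmap((void *)regs, 4096);
	close(fd);
	return 0;
}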

5. DPDK core optimization: PMD

DPDK's UIO driver masks the interrupts the hardware would issue and instead adopts active polling in user mode. This mode is called PMD (Poll Mode Driver).

UIO bypasses the kernel, and active polling removes hard interrupts, so DPDK can send and receive packets entirely in user mode. This brings Zero Copy and freedom from system calls, and the synchronous processing reduces the Cache Misses caused by context switches.
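
As a rough illustration (a sketch, not DPDK sample code verbatim), a PMD-style receive loop might look like this, assuming rte_eal_init() has run and port 0 / queue 0 were set up with rte_eth_dev_configure() and rte_eth_rx_queue_setup():

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void poll_loop(void)
{
	struct rte_mbuf *bufs[BURST_SIZE];

	for (;;) {	/* busy-poll: this core stays at 100% CPU */
		const uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
							bufs, BURST_SIZE);
		for (uint16_t i = 0; i < nb_rx; i++) {
			/* ... process bufs[i] ... */
			rte_pktmbuf_free(bufs[i]);
		}
	}
}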

A core running PMD stays at 100% user-mode CPU.
[Figure: a PMD core at 100% user-mode CPU]
When the network is idle, the CPU spins uselessly for long stretches, which causes energy-consumption problems. Therefore DPDK introduced the Interrupt DPDK mode.

Interrupt DPDK:
[Figure: Interrupt DPDK, quoted from "Towards Low Latency Interrupt Mode DPDK" by David Su/Yunhong Jiang/Wei Wang]

Its principle is very similar to NAPI: when there are no packets to process, it goes to sleep and switches to interrupt notification. It can also share a CPU core with other processes, with the DPDK process getting higher scheduling priority.

6. DPDK high-performance code implementation

1. Use HugePage to reduce TLB Miss

By default, Linux uses 4KB pages. The smaller the page, the more pages a given amount of memory needs, the larger the page tables, and the more memory the page tables themselves consume. TLB (Translation Lookaside Buffer) entries are expensive for the CPU, so it can generally hold only hundreds to about a thousand of them. If a process wants to use 64GB of memory, that is 64GB/4KB = 16 million pages, and at 4B per page-table entry that is 16 million × 4B ≈ 64MB of page-table entries. With 2MB HugePages, only 64GB/2MB = 32768 pages are needed; the counts are not even in the same league.

DPDK uses HugePages, supporting 2MB and 1GB page sizes on x86-64, which shrinks the number of page-table entries geometrically and thereby reduces TLB Misses. It also provides base libraries such as Mempool, MBuf, Ring, and Bitmap. In our practice, for frequent memory allocation and release on the data plane (Data Plane), a memory pool must be used; rte_malloc should not be called directly, since DPDK's memory allocator is very simple, not as good as ptmalloc.
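
A minimal sketch of creating a hugepage-backed packet-buffer pool (the pool name and sizes are illustrative; assumes rte_eal_init() has already run):

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static struct rte_mempool *pool;

static int create_pool(void)
{
	pool = rte_pktmbuf_pool_create("rx_pool",
				       8192,	/* number of mbufs */
				       256,	/* per-core cache size */
				       0,	/* private data size */
				       RTE_MBUF_DEFAULT_BUF_SIZE,
				       rte_socket_id());	/* local NUMA node */
	return pool != NULL ? 0 : -1;
}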

2. SNA (Shared-Nothing Architecture)

Decentralize the software architecture and avoid global sharing as much as possible, since it brings global contention and forfeits the ability to scale horizontally. Under a NUMA system, do not use memory remotely across Nodes.

3. SIMD (Single Instruction Multiple Data)

From the earliest MMX/SSE to the more recent AVX2, SIMD capability has kept growing. DPDK receives multiple packets in batches, then uses vector programming to process the whole batch in one pass. memcpy, for example, uses SIMD for speed.
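
A toy sketch of the batch idea with SSE2 intrinsics (not DPDK code): XOR a mask over four 32-bit words per instruction, with a scalar tail for the leftovers:

#include <emmintrin.h>	/* SSE2 */
#include <stddef.h>
#include <stdint.h>

static void xor_mask(uint32_t *data, size_t n, uint32_t mask)
{
	__m128i m = _mm_set1_epi32((int)mask);
	size_t i = 0;

	for (; i + 4 <= n; i += 4) {	/* 4 lanes per instruction */
		__m128i v = _mm_loadu_si128((__m128i *)(data + i));
		_mm_storeu_si128((__m128i *)(data + i), _mm_xor_si128(v, m));
	}
	for (; i < n; i++)	/* scalar tail */
		data[i] ^= mask;
}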

SIMD is more common in game back-ends, but if other businesses have similar batch-processing scenarios, it is worth checking whether SIMD can help when you need more performance.

4. Do not use slow APIs

We need to redefine what a "slow API" is here. Take gettimeofday: on 64-bit systems, thanks to vDSO, it no longer needs to trap into kernel mode and is just a plain memory access, reaching tens of millions of calls per second. But do not forget that at 10GE, packets also arrive at tens of millions per second. So even gettimeofday counts as a slow API. DPDK provides a Cycles interface instead, e.g. rte_get_tsc_cycles, implemented on top of HPET or the TSC.

On x86-64, the RDTSC instruction reads the timestamp counter directly from the CPU; it takes no inputs and returns the 64-bit value split across two registers (EDX holds the high 32 bits, EAX the low). A common implementation:

static inline uint64_t
rte_rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__ (
		     "rdtsc" : "=a"(lo), "=d"(hi)
		     );

	return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

The logic here is correct, but it is not pushed to the extreme: it still needs a shift and an OR to combine the result. Let's see how DPDK implements it:

static inline uint64_t
rte_rdtsc(void)
{
	union {
		uint64_t tsc_64;
		struct {
			uint32_t lo_32;
			uint32_t hi_32;
		};
	} tsc;

	asm volatile("rdtsc" :
		     "=a" (tsc.lo_32),
		     "=d" (tsc.hi_32));
	return tsc.tsc_64;
}

It cleverly uses a C union so that the two 32-bit halves and the 64-bit result share the same memory: the registers are assigned directly into place, eliminating the extra operations. But there are a few problems to face when using the TSC (a sketch of one mitigation follows the list):

  • CPU affinity, to solve the inaccuracy from the thread bouncing between cores
  • Memory barriers, to solve the inaccuracy from out-of-order execution
  • Disable frequency scaling and Intel Turbo Boost to pin the CPU frequency, solving the inaccuracy caused by frequency changes
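
A sketch of the memory-barrier point, assuming an x86-64 Intel CPU: LFENCE waits for earlier instructions to complete before RDTSC executes, preventing the read from being hoisted by out-of-order execution:

#include <stdint.h>

static inline uint64_t rdtsc_ordered(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}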

5. Compilation and execution optimizations

Branch prediction

Modern CPUs improve parallelism through pipelining and superscalar execution. To exploit that parallelism further, they perform branch prediction: when a branch is encountered, the CPU guesses which side will be taken and processes that side's code in advance, fetching instructions and register values early, then discards the work if the prediction fails. When writing business code we often know very well which way a branch will go, so we can intervene manually to generate more compact code and raise the branch predictor's success rate.

#pragma once

#if !__GLIBC_PREREQ(2, 3)
#    if !defined(__builtin_expect)
#        define __builtin_expect(x, expected_value) (x)
#    endif
#endif

#if !defined(likely)
#define likely(x) (__builtin_expect(!!(x), 1))
#endif

#if !defined(unlikely)
#define unlikely(x) (__builtin_expect(!!(x), 0))
#endif
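
A usage sketch: mark the error path unlikely so the compiler lays out the hot path straight through (struct pkt and process() are hypothetical stand-ins):

struct pkt { unsigned len; /* ... */ };
extern int process(struct pkt *p);

static int handle_packet(struct pkt *p)
{
	if (unlikely(p == NULL || p->len == 0))
		return -1;	/* cold error path */
	return process(p);	/* hot path */
}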

CPU Cache prefetch

The cost of a Cache Miss is very high: about 65 nanoseconds to read back from memory. The mitigation is to actively push the data about to be accessed into the CPU Cache. A typical scenario is linked-list traversal: the next node lives at an effectively random memory address, so the CPU cannot prefetch it automatically. But while processing the current node, we can push the next node into the Cache with a CPU prefetch instruction.
API documentation: doc.dpdk.org/api/rte__pr...

static inline void rte_prefetch0(const volatile void *p)
{
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}

#if !defined(prefetch)
#define prefetch(x) __builtin_prefetch(x)
#endif

…and more variants (rte_prefetch1, rte_prefetch2, etc.)
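
A sketch of the linked-list pattern described above (struct node and do_work() are hypothetical): while processing the current node, prefetch the next one so it is already on its way into cache:

#include <rte_prefetch.h>

struct node {
	struct node *next;
	/* payload ... */
};

extern void do_work(struct node *n);

static void walk(struct node *head)
{
	for (struct node *n = head; n != NULL; n = n->next) {
		if (n->next != NULL)
			rte_prefetch0(n->next);	/* pull next node toward L1 */
		do_work(n);
	}
}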

Memory alignment
Memory alignment has 2 benefits:

  • Avoid structure members straddling a Cache Line, which would require two reads merged into a register, hurting performance. Sort structure members from largest to smallest and force alignment. See "Data alignment: Straighten up and fly right".

#define __rte_packed __attribute__((__packed__))

  • Avoid false sharing when multiple threads write in parallel, which causes Cache Misses; align structures to the Cache Line (a sketch follows the macros below).

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

#ifndef aligned
#define aligned(a) __attribute__((__aligned__(a)))
#endif
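
A sketch using the two macros above: per-core counters aligned and padded to a full cache line, so writes from different cores never touch the same line (the core count is an assumption):

#include <stdint.h>

#define MAX_CORES 16	/* assumed core count */

struct per_core_stats {
	uint64_t rx_packets;
	uint64_t tx_bytes;
} aligned(CACHE_LINE_SIZE);	/* sizeof rounds up to one cache line */

static struct per_core_stats stats[MAX_CORES];	/* one line per core */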

Constant optimization

Complete constant-related operations at compile time. For example, C++11 introduced constexpr; and you can use GCC's __builtin_constant_p to determine whether a value is a compile-time constant and, if so, compute the result during compilation. Example: network/host byte-order conversion.

#define rte_bswap32(x) ((uint32_t)(__builtin_constant_p(x) ?		\
				   rte_constant_bswap32(x) :		\
				   rte_arch_bswap32(x)))

The implementation of rte_constant_bswap32:

#define RTE_STATIC_BSWAP32(v) \
	((((uint32_t)(v) & UINT32_C(0x000000ff)) << 24) | \
	 (((uint32_t)(v) & UINT32_C(0x0000ff00)) <<  8) | \
	 (((uint32_t)(v) & UINT32_C(0x00ff0000)) >>  8) | \
	 (((uint32_t)(v) & UINT32_C(0xff000000)) >> 24))
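
A usage sketch: with a literal argument, __builtin_constant_p(x) is true and the compiler folds RTE_STATIC_BSWAP32 down to a constant at compile time; with a runtime value it falls through to rte_arch_bswap32, i.e. the bswap instruction shown in the next subsection (v is a hypothetical runtime input):

uint32_t a = rte_bswap32(0x12345678);	/* folded to 0x78563412 at compile time */
uint32_t b = rte_bswap32(v);		/* compiles to a bswap instruction */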

Use CPU instructions

Modern CPUs provide many instructions that directly implement common functions, such as endianness conversion; x86 supports this directly with the bswap instruction.

static inline uint64_t rte_arch_bswap64(uint64_t _x)
{
	register uint64_t x = _x;
	asm volatile ("bswap %[x]"
		      : [x] "+r" (x)
		      );
	return x;
}

This is also how GLIBC implements it: constant folding first, then CPU-instruction optimization, and finally a plain-code fallback. These are top programmers, after all, and their command of the language, the compiler, and the implementation is on another level; understand the existing wheel before building your own.

Google's open-source cpu_features library can detect which features the current CPU supports, enabling optimizations specialized for a particular CPU. High-performance programming never ends; one's understanding of hardware, kernel, compiler, and language must go deep and keep pace with the times.

7. The DPDK ecosystem

For Internet back-end development, the capabilities the DPDK framework itself provides are fairly bare. For example, to use DPDK you must implement ARP, the IP layer, and other basics yourself, which makes it hard to get started. If you want higher-level services, you also need user-mode transport protocol support. Using DPDK directly is not recommended.

The application-layer project with the most complete ecosystem and strongest community (backed by first-tier vendors) is FD.io (The Fast Data Project), with VPP, open-sourced by Cisco, at its core. It has fairly complete protocol support: ARP, VLAN, Multipath, IPv4/v6, MPLS, and so on. For user-mode transport (UDP/TCP) there is TLDK. From project positioning to community support, it is a relatively reliable framework.

Tencent Cloud's open-source F-Stack is also worth attention: development is simpler, and it directly provides a POSIX interface.

Seastar is also very powerful and flexible; you can switch between kernel mode and DPDK at will, and it has its own transport protocol, the Seastar Native TCP/IP Stack. However, we have not yet seen large-scale projects using Seastar, and there may be many pitfalls to fill in.

Our GBN Gateway project needs to support L3/IP-layer access as a WAN gateway, at 20GE per machine, and is being developed on DPDK.


Original link: blog.csdn.net/weixin_52622200/article/details/113704378