Hands-on experience from a front-line engineer: RDMA, easy to use yet hard to use well?

Unstoppable RDMA

Server network bandwidth keeps climbing. Once it passes the 10 Gbps mark, the operating system's overhead for handling network IO becomes increasingly hard to ignore. In some network-IO-intensive services, the operating system itself becomes the bottleneck of network communication, which not only increases call latency (especially at the long tail) but also limits the overall throughput of the service.

Compared with how fast network bandwidth has grown, the stagnation of CPU performance is the main cause of the problem above. To fundamentally eliminate the inefficiency of having the CPU drive network transmission, we have to lean more heavily on dedicated hardware. High-performance RDMA networking is therefore unstoppable.

RDMA (Remote Direct Memory Access) can be understood, simply, as the network card bypassing the CPU entirely to exchange memory data between two servers. As a network transmission technology implemented in hardware, it greatly improves transmission efficiency and helps network-IO-intensive services (such as distributed storage and distributed databases) achieve lower latency and higher throughput.

Concretely, using RDMA requires a network card that supports it, along with the corresponding driver. As shown in the figure below, once the application has allocated its resources, it hands the memory address and length of the data to be sent directly to the network card. The card pulls the data from memory, encapsulates the packets in hardware, and sends them to the receiver. When the receiving card gets an RDMA message, it decapsulates it in hardware and places the payload directly at the memory location the application specified in advance.

**Because the entire IO path involves no CPU, no operating system kernel, no system calls, no interrupts, and no memory copies, RDMA network transmission can reach extremely high performance.** In extreme benchmarks, RDMA latency can reach the 1 µs level, and throughput can even reach 200 Gbps.

How RDMA transmission works

**Note that using RDMA requires cooperation from the application's code (RDMA programming).** Unlike traditional TCP transmission, RDMA does not come with a socket-style API; it is driven through the verbs API (via libibverbs). To avoid the overhead of an intermediate layer, the verbs API adopts semantics close to the hardware implementation, so its usage differs enormously from the socket API. For most developers, adapting an existing application to RDMA, or writing a new RDMA-native application, is therefore not easy.

Where does the difficulty of RDMA programming lie?

As shown in the figure, in the socket API, the main interfaces used to send and receive data are as follows:

Socket API
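For readers without the figure, the interfaces in question are roughly the following POSIX declarations, listed here only as a reference sketch:

```cpp
#include <cstddef>       // size_t
#include <sys/types.h>   // ssize_t

struct epoll_event;      // defined in <sys/epoll.h>

// write(): copies data from the application's buffer into the kernel socket buffer.
ssize_t write(int fd, const void *buf, size_t count);

// read(): copies data from the kernel socket buffer into the application's buffer.
ssize_t read(int fd, void *buf, size_t count);

// epoll_wait(): reports readiness events (EPOLLIN / EPOLLOUT) on the monitored fds.
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```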

Here, the fd in write and read is a file descriptor that identifies a connection. Data the application wants to send is copied into the kernel buffer by write, while read copies data out of the kernel buffer. In most applications fd is set to non-blocking: if the kernel buffer is full, write returns immediately, and if the kernel buffer is empty, so does read. To learn about kernel-buffer state changes as soon as they happen, the application relies on the epoll mechanism to monitor EPOLLIN and EPOLLOUT events; once epoll_wait returns because of one of these events, the next write or read can be issued. That is the basic usage of the socket API.

For comparison, the main verbs API interfaces used to send and receive data are as follows:

Verbs API
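Again for reference, these functions are declared in <infiniband/verbs.h> (libibverbs); the signatures below are reproduced as a sketch:

```cpp
// Declarations from <infiniband/verbs.h> (libibverbs), shown for comparison.
struct ibv_qp;        // queue pair: identifies a connection, analogous to fd
struct ibv_cq;        // completion queue: where finished work requests are reported
struct ibv_send_wr;   // send work request: memory address + length of data to send
struct ibv_recv_wr;   // receive work request: memory address + length of the receive buffer
struct ibv_wc;        // work completion: one completed send/recv request

// Submit a send request to the NIC (roughly corresponds to write).
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr);

// Submit a receive request to the NIC (roughly corresponds to read).
int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr);

// Harvest completions from the completion queue (roughly corresponds to epoll_wait).
int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);
```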

Here, ibv_ is the prefix of functions and structures in the libibverbs library. ibv_post_send roughly corresponds to a send operation, and ibv_post_recv to a receive operation. The qp (queue pair) in these calls plays the role of the fd in the socket API: it identifies a connection. The wr (work request) structure carries the memory address (a virtual address of the process) and length of the data to be sent or received. ibv_poll_cq serves as the event-detection mechanism, similar to epoll_wait.
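To make this concrete, here is a minimal sketch of posting a single send request. The queue pair `qp`, the registered memory region `mr`, and the helper name `post_one_send` are all assumptions for illustration, not part of the original article:

```cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

// Minimal sketch: post one send work request for an already-registered buffer.
// `qp` (a connected queue pair), `mr` (the memory region covering `buf`),
// `buf` and `len` are assumed to have been set up elsewhere.
bool post_one_send(ibv_qp* qp, ibv_mr* mr, char* buf, size_t len) {
    ibv_sge sge;
    std::memset(&sge, 0, sizeof(sge));
    sge.addr   = reinterpret_cast<uint64_t>(buf);    // virtual address of the data
    sge.length = static_cast<uint32_t>(len);         // length of the data
    sge.lkey   = mr->lkey;                           // local key obtained at registration

    ibv_send_wr wr;
    std::memset(&wr, 0, sizeof(wr));
    wr.wr_id      = reinterpret_cast<uint64_t>(buf); // cookie echoed back on completion
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;               // request a completion notification

    ibv_send_wr* bad_wr = nullptr;
    // A zero return only means the request was handed to the NIC,
    // not that the data has actually left the machine (see below).
    return ibv_post_send(qp, &wr, &bad_wr) == 0;
}
```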

At first glance, RDMA programming looks simple: just swap in the functions above. In reality, the correspondences are approximate, not equivalent. The key difference is that the socket API calls are synchronous operations, while the RDMA calls are asynchronous (note that asynchronous and non-blocking are two different concepts).

Concretely, when ibv_post_send returns success, it only means the send request was submitted to the network card; it does not guarantee the data has actually been sent. If the application writes to the send buffer right away, the data that goes out on the wire may well be corrupted. The socket API, by contrast, is synchronous: when write returns success, the data has already been copied into the kernel buffer, and even though it may not have been transmitted yet, the application is free to do whatever it likes with the send buffer.

On the other hand, the events returned by ibv_poll_cq differ from those returned by epoll_wait: the former indicates that a send or receive request previously submitted to the network card has completed, while the latter indicates that the kernel buffer is readable (new data has arrived) or writable (there is room to send more). These semantic changes affect how upper-layer applications manage memory and structure their API calls.
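Continuing the hypothetical setup above, a minimal sketch of harvesting the completion before the send buffer is considered reusable might look like this (busy polling is used purely to keep the sketch short):

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Minimal sketch: wait until a previously posted send has completed.
// Only after its completion arrives may the send buffer be modified or
// returned to a pool.  `cq` is the completion queue associated with the
// queue pair (part of the hypothetical setup above).
void wait_for_send_completion(ibv_cq* cq, uint64_t expected_wr_id) {
    ibv_wc wc;
    for (;;) {
        int n = ibv_poll_cq(cq, 1, &wc);   // non-blocking: 0 means nothing has completed yet
        if (n < 0) {
            return;                        // polling error; real code would report and recover
        }
        if (n == 0) {
            continue;                      // busy polling, purely to keep the sketch short
        }
        if (wc.status != IBV_WC_SUCCESS) {
            return;                        // the request failed; the connection usually must be rebuilt
        }
        if (wc.wr_id == expected_wr_id) {
            return;                        // this send has completed; its buffer is safe to reuse
        }
        // completions belonging to other requests would be dispatched here
    }
}
```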

Besides the synchronous-versus-asynchronous difference, RDMA programming has another key requirement: all memory involved in sending and receiving must be registered.

Memory registration, simply put, pins the mapping between a memory region's virtual addresses and physical addresses and registers it with the network card hardware. This is necessary because the addresses carried in send and receive requests are virtual addresses; only after registration can the card translate them into physical addresses and access the memory directly without the CPU. Registration (and deregistration) is a very slow operation, so in practice an application usually builds a memory pool: register once, reuse many times, and avoid calling the registration functions frequently.
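A minimal sketch of registration and deregistration, assuming a protection domain `pd` created during connection setup (the helper names here are illustrative only):

```cpp
#include <infiniband/verbs.h>
#include <cstdlib>

// Minimal sketch: register a buffer so the NIC can DMA to/from it directly.
// `pd` (protection domain) is assumed to have been created during setup.
ibv_mr* register_buffer(ibv_pd* pd, size_t len) {
    void* buf = std::malloc(len);
    if (buf == nullptr) return nullptr;

    // Pin the pages and hand the virtual-to-physical mapping to the NIC.
    // This call is expensive, which is why real applications register a
    // large region once and carve a memory pool out of it.
    ibv_mr* mr = ibv_reg_mr(pd, buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    if (mr == nullptr) {
        std::free(buf);
        return nullptr;
    }
    return mr;  // mr->lkey / mr->rkey are what later work requests refer to
}

void release_buffer(ibv_mr* mr) {
    void* buf = mr->addr;
    ibv_dereg_mr(mr);   // also slow; avoid doing this on the hot path
    std::free(buf);
}
```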

RDMA programming involves many other details that ordinary network programming never worries about (flow control, falling back to TCP, interrupt-free polling mode, and so on), which I will not go through one by one here. All in all, RDMA programming is not easy. So how can developers adopt a high-performance network technology like RDMA quickly?

Meeting the challenge: using RDMA in brpc

The comparison between the socket API and the verbs API above is mainly meant to show how complex RDMA programming itself is. In real production environments, few services call the socket API directly for network transmission; most use it indirectly through an RPC framework. A complete RPC framework has to provide an end-to-end transmission solution, including data serialization, error handling, multi-threading, and more. brpc is a C++ RPC framework open-sourced by Baidu that suits scenarios with high performance requirements better than gRPC. Besides traditional TCP transmission, brpc can also use RDMA to push past the performance limits imposed by the operating system itself. For implementation details, interested readers can refer to the source code on GitHub (https://github.com/apache/incubator-brpc/tree/rdma).

brpc client uses RDMA
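The client-side figure amounts, roughly, to the following sketch (based on the rdma branch; the address and timeout are placeholders, and everything apart from use_rdma is ordinary brpc client setup):

```cpp
#include <brpc/channel.h>

// Sketch of a brpc client channel with RDMA enabled.
int main() {
    brpc::ChannelOptions options;
    options.use_rdma = true;           // switch the transport from TCP to RDMA
    options.timeout_ms = 100;          // placeholder value

    brpc::Channel channel;
    if (channel.Init("example.rdma.server:8002", &options) != 0) {  // placeholder address
        return -1;
    }
    // ... construct a stub on the channel and issue RPC calls as usual ...
    return 0;
}
```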

brpc server side uses RDMA
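Likewise, a rough sketch of the server-side figure (the port is a placeholder; everything apart from use_rdma is ordinary brpc server setup):

```cpp
#include <brpc/server.h>

// Sketch of a brpc server with RDMA enabled.
int main() {
    brpc::Server server;
    // ... AddService(...) with the application's own service implementation ...

    brpc::ServerOptions options;
    options.use_rdma = true;           // switch the transport from TCP to RDMA
    if (server.Start(8002, &options) != 0) {  // placeholder port
        return -1;
    }
    server.RunUntilAskedToQuit();
    return 0;
}
```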

The above shows how the client and server enable RDMA in brpc: when creating the channel and the server, set the use_rdma option to true (the default is false, i.e. TCP transport).

Yes, just these two lines of code. If your application is already built on brpc, migrating from TCP to RDMA takes only a few minutes. Beyond this quick start, brpc also exposes some runtime flags for more advanced tuning, such as the memory pool size, the qp/cq sizes, and polling instead of interrupts.

The performance benefit brpc gets from RDMA is illustrated below with the echo benchmark (found in the rdma_performance directory of the GitHub code). In a 25G network environment, for messages under 2 KB, enabling RDMA raises the maximum server-side QPS by more than 50% and cuts the average latency at 200k QPS by more than 50%.

Maximum QPS on the server side under Echo benchmark (25G network)

Average latency of Echo benchmark at 200k QPS (25G network)

The RDMA gains on the echo benchmark are only a reference point; real workloads differ greatly from echo. For some services the benefit of RDMA will be smaller than the numbers above, because the network accounts for only part of their overall cost. For others the benefit is even larger, because RDMA also removes the interference of kernel work with the business logic. Here are two examples of brpc applications:

  • In Baidu's distributed block storage service, switching from TCP to RDMA reduced the average latency of the 4 KB fio test by about 30% (RDMA only optimizes network IO; storage IO is unaffected).

  • In Baidu's distributed in-memory KV service, switching from TCP to RDMA reduced the average latency of a single 30-key query at 200k QPS by 89%, and the 99th-percentile latency by 96%.

RDMA requires infrastructure support

RDMA is an emerging high-performance network technology that matters greatly for network-IO-intensive services running inside data centers where both ends of the communication are under the same operator's control, such as HPC, machine learning, storage, and databases. We encourage developers of such services to follow RDMA and to try building their applications on brpc so they can migrate to RDMA smoothly. That said, RDMA is not yet as universally applicable as TCP, and a few infrastructure-level constraints deserve attention:

  • RDMA requires network card hardware support. Common 10 Gigabit network cards generally do not support this technology.

  • Using RDMA reliably also depends on support from the underlying physical network.

Baidu Smart Cloud has accumulated deep experience with RDMA infrastructure. Thanks to advanced hardware and strong engineering, users can obtain the full performance benefit of RDMA through physical machines or containers in 25G or even 100G network environments, while leaving the complex, business-agnostic physical-network configuration work (such as lossless-network PFC and explicit congestion notification, ECN) to Baidu Smart Cloud's technical support staff. Developers who need high-performance computing, high-performance storage, and similar services are welcome to consult Baidu Smart Cloud.

The author of this article, Li Zhaogeng, is a senior R&D engineer in Baidu's System Department. He has long focused on high-performance networking and is responsible for Baidu's RDMA R&D, covering the full stack from upper-layer business use down to the underlying hardware support.

This article is published by the Baidu Developer Center, a knowledge-sharing platform dedicated to building a welcoming technical community where developers share knowledge and exchange ideas.
