Experience in RDMA Performance Optimization

1. Overview of RDMA

First, let's introduce some core concepts of RDMA. I don't plan to walk through the API and its calling conventions; instead, we will focus on the hardware execution model and the principles behind these basic concepts. Understanding these principles is the key to writing high-performance RDMA programs.

Memory Region

The RDMA network card (hereinafter RNIC) reads and writes system memory through DMA. Since DMA can only access physical addresses, the RNIC needs to keep a mapping from the virtual addresses of the target memory area to their physical addresses. This mapping table is stored in the Memory Translation Table (MTT) of the RNIC. At the same time, since RDMA accesses are DMA-based and, on most current cards, do not support page faults, we also need to ensure that the target memory area is page-locked, to prevent the operating system from swapping out those pages.

To sum up, when we use RDMA to access a piece of memory, that memory must first be page-locked, and then its virtual-to-physical mapping must be handed to the RNIC for subsequent address lookups. This process is called Memory Registration, and the registered memory is a Memory Region (MR). When registering the memory, we also specify its access permissions; the RNIC stores this permission information in Memory Protection Tables (MPT) to verify permissions when user requests arrive.
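
To make this concrete, here is a minimal sketch of Memory Registration with libibverbs (the function name is mine; `pd` is assumed to come from ibv_alloc_pd(), and error handling is trimmed):

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = NULL;
    /* Page-aligned allocation; ibv_reg_mr() pins (page-locks) the range
     * and installs its virtual-to-physical mapping into the MTT. */
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;

    /* The access flags become the permissions recorded in the MPT. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```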

The MTT and MPT reside in system memory but are cached in the RNIC's SRAM. When the RNIC receives a READ/WRITE request, it first looks in the SRAM cache for the physical address corresponding to the requested target address and for that address's access permissions. On a cache hit, it proceeds directly with the DMA operation; on a miss, it has to issue requests over PCIe to walk the MTT and MPT in memory, which brings considerable extra overhead. If your application performs a large volume of fine-grained memory accesses, the impact of MTT/MPT misses in RNIC SRAM can be fatal to performance.

Registering a Memory Region is a time-consuming operation, but in most cases we only need to do it once, or a few times, at the very beginning. There are now also access modes based on on-demand paging that do not require registering an MR, such as AWS's EFA protocol. I will not expand on that here, because it is really a topic about Unified Memory; later I may introduce it together with GPU UVM, since their core principles are essentially the same.

RDMA Verbs

Users send instructions to the RNIC through the RDMA Verbs API. Verbs are divided into Memory Verbs and Message Verbs. Memory Verbs mainly include READ, WRITE, and some ATOMIC operations, while Message Verbs mainly include SEND and RECV. Memory Verbs truly bypass both the remote CPU and the kernel, so their performance is better. Message Verbs require the participation of the responder's CPU; they are more flexible, but their performance is generally worse than that of Memory Verbs.

Queue Pair

RDMA hosts communicate through Queue Pairs (QPs). A QP consists of a Send Queue (SQ) and a Receive Queue (RQ), with a corresponding Send Completion Queue (SCQ) and Receive Completion Queue (RCQ). When the user issues a request, it is encapsulated as a Work Queue Element (WQE) and posted to the SQ, and the RNIC then executes the WQE. When the WQE completes, a Completion Queue Element (CQE) is placed in the corresponding SCQ; the user can then poll this CQE from the SCQ and check its status to confirm whether the WQE completed successfully. It is worth pointing out that different QPs can share a CQ to reduce SRAM consumption.
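
A minimal sketch of this request/completion loop with libibverbs (assuming the QP, CQ, MR, and the remote address/rkey have already been set up and exchanged out of band; error handling is trimmed):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one RDMA WRITE WQE to the SQ, then poll the CQ for its CQE. */
int write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *mr, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 42,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,  /* ask for a CQE on completion */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr;
    if (ibv_post_send(qp, &wr, &bad_wr))  /* the WQE enters the SQ */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)  /* spin until the CQE arrives */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```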

Next, we focus on the knowledge behind QP.

First of all, after we create a QP, the system needs to save its state: QP metadata, congestion control state, and so on. Excluding the WQEs, MTT, and MPT, a QP corresponds to roughly 375 B of state data. This was a heavy storage burden back when RNIC SRAM was small, so earlier RDMA work studied QP sharing, where different processing threads share QPs to reduce the metadata storage pressure, at the cost of some performance [1]. Newer RNICs have larger SRAM; Mellanox's CX4 and CX5 series NICs carry about 2 MB, so on new cards people now pay less attention to the storage overhead of QPs, unless you want to create thousands or tens of thousands of them.

Secondly, an RNIC contains multiple Processing Units (PUs) [2]. Since the requests in a QP are processed in order, and to avoid cross-PU synchronization, we generally assume that each QP is handled by a single PU. Therefore, we can create multiple QPs in one thread to speed up data processing and keep the PUs from becoming the performance bottleneck of an RDMA program [3].
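
For instance, a single thread might spread its requests over several QPs, as in the sketch below (NUM_QPS is a hypothetical fan-out; all QPs share one CQ, which, as noted earlier, also saves SRAM; `pd` and `cq` are assumed to exist):

```c
#include <infiniband/verbs.h>

#define NUM_QPS 4  /* hypothetical: one QP per PU we hope to keep busy */

/* Create NUM_QPS reliable-connection QPs that all share one CQ. */
int create_qps(struct ibv_pd *pd, struct ibv_cq *cq,
               struct ibv_qp *qps[NUM_QPS])
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,          /* shared CQ */
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,  /* reliable connection; supports READ/WRITE */
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    for (int i = 0; i < NUM_QPS; i++)
        if (!(qps[i] = ibv_create_qp(pd, &attr)))
            return -1;
    return 0;
}
```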

2. RDMA performance optimization

RDMA performance optimization is complex if you want it to be, and simple if you look at it the right way. The simple part is that, from the perspective of performance optimization, there are not many designs and choices to make at the software level, because the upper limit of performance is set by the hardware. So, to push performance as close as possible to that hardware limit, the core principle is to access data in the most hardware-friendly way. There are no particularly complicated algorithms involved; when you want high performance, you simply need to understand the hardware better. Following the three core concepts introduced above, we go through the optimization experience one by one.

2.1 Focus on the performance overhead of address translation

We mentioned earlier that when a requested data address misses the MTT/MPT cache in RNIC SRAM, the RNIC has to walk the MTT and MPT in host memory over PCIe, which is time-consuming. This overhead is especially visible when we need high fan-out, fine-grained data access. There are two main ways to optimize this problem:

  1. Large Page: In both the MTT and the operating system's page table, virtual-to-physical mapping entries are kept at page granularity, i.e., one page corresponds to one MTT entry or Page Table Entry (PTE). Using large pages effectively shrinks the MTT, which in turn raises the MTT cache hit rate in the RNIC (see the sketch after this list).
  2. Contiguous Memory + PA-MR [4, 5]: The new generation of CX network cards lets users issue accesses by physical address. To avoid maintaining a heavy page table, we can request a large block of physically contiguous memory through Linux's CMA API; the MTT then holds only one entry, which guarantees a 100% cache hit rate. This does carry security risks, however, because PA-MR bypasses the access permission checks, so be careful when using it.
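
As an illustration of the first point, here is a sketch of backing an MR with 2 MB huge pages on Linux (it assumes huge pages have been reserved, e.g. via vm.nr_hugepages; one MTT entry then covers 2 MB instead of 4 KB):

```c
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Register an MR backed by anonymous 2 MB huge pages.
 * len should be a multiple of the huge page size. */
struct ibv_mr *register_hugepage_mr(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}
```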

Of course, there are other optimization methods as well. In our recent work, we proposed a new method to improve the performance of address translation; I will introduce it after the work is open-sourced.

2.2 Focus on the execution model of RNIC PU/QP

One QP corresponding to one PU is a simple model of how the RNIC executes. Under this model, we need multiple QPs to fully exploit the parallel processing capability of the multiple PUs. At the same time, we should structure our operations to reduce synchronization between PUs, because cross-PU synchronization greatly damages performance.

2.3 RDMA Verbs

For the choice of RDMA Verbs, my personal experience is to prefer READ/WRITE. In cases that require CPU intervention or batch-processing logic, you can try SEND/RECV. There is a good deal of prior work that builds message-passing semantics on top of READ/WRITE [1, 6, 7], which is well worth studying.

At the same time, by setting the corresponding flag, a READ/WRITE WQE can be marked SIGNALED or not; an unsignaled WQE does not generate a CQE when it completes. A common optimization technique is: when you need to post K consecutive READ/WRITE requests on one QP, mark only the last request SIGNALED and leave the others unsignaled. Since a QP executes its WQEs in order, the completion of the last WQE implies that the earlier WQEs have also been executed. Whether they executed successfully, of course, still needs to be confirmed in an application-specific way.
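
A sketch of this batching trick (the helper and its parameters are illustrative; sges[], the remote addresses, and rkey are assumed to be prepared by the caller):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Chain k RDMA WRITE WQEs and post them in one ibv_post_send() call;
 * only the last WQE is SIGNALED, so only it produces a CQE. */
int post_write_batch(struct ibv_qp *qp, struct ibv_sge *sges, int k,
                     uint64_t remote_addr, uint32_t rkey, size_t stride)
{
    struct ibv_send_wr wrs[64] = {{0}};  /* assumes k <= 64 */

    for (int i = 0; i < k; i++) {
        wrs[i].wr_id      = i;
        wrs[i].sg_list    = &sges[i];
        wrs[i].num_sge    = 1;
        wrs[i].opcode     = IBV_WR_RDMA_WRITE;
        /* Only the last WQE asks for a CQE. */
        wrs[i].send_flags = (i == k - 1) ? IBV_SEND_SIGNALED : 0;
        wrs[i].wr.rdma.remote_addr = remote_addr + (uint64_t)i * stride;
        wrs[i].wr.rdma.rkey        = rkey;
        wrs[i].next = (i == k - 1) ? NULL : &wrs[i + 1];
    }

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, wrs, &bad_wr);
}
```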

3. RNIC+X

The most classic way to use an RNIC is naturally RNIC + system memory, that is, accessing host memory directly through the RNIC. However, with the development of GP-GPUs and NVM, accessing GPU memory or NVM directly through the RNIC has become relatively mature and popular. RDMA + GPU can greatly accelerate GPU-to-GPU communication, and RDMA + NVM can greatly expand memory capacity and reduce the demand for network communication. This topic involves both the hardware and the operating system's virtual memory mechanism, and it takes considerable space to explain clearly, so we will cover it in the next article.

4. Summary

This article introduced some basic concepts of RDMA and the principles behind them, and, based on these concepts and principles, common RDMA performance optimization techniques. In the next article, we will introduce RNIC + X, including RNIC + GPU and RNIC + NVM; interested readers can stay tuned~
