[RDMA] Tips and tricks to optimize your RDMA code

RDMA is used in many places, mainly because of the high performance it makes possible. In this post, I provide tips and tricks for optimizing RDMA code in several respects.

General tips

Avoid using control operations in the data path

Data operations stay in the context they were called in (i.e. they don't perform a context switch) and are written in an optimized way. Control operations (all of the create/destroy/query/modify operations), in contrast, are very expensive because:

  • Most of the time, they perform a context switch
  • Sometimes they allocate or free dynamic memory
  • Sometimes they involve accessing the RDMA device

As a general rule of thumb, one should avoid calling control operations in the data path, or at least minimize their use there.

The following verbs are considered as data operations:

  • ibv_post_send()
  • ibv_post_recv()
  • ibv_post_srq_recv()
  • ibv_poll_cq()
  • ibv_req_notify_cq()

When posting multiple WRs, post them in a list in one call

When posting several Work Requests with one of the ibv_post_*() verbs, posting them as a linked list in one call, instead of making several calls with one Work Request each, provides better performance since it allows the low-level driver to perform optimizations.

When using Work Completion events, acknowledge several events in one call

When handling Work Completions using events, acknowledging several events in one call, instead of making one call per event, provides better performance since fewer mutex lock operations are performed.

Avoid using many scatter/gather entries

Using several scatter/gather entries in a Work Request (either a Send Request or a Receive Request) means that the RDMA device has to read those entries and then read the memory they refer to. Using one scatter/gather entry provides better performance than using several.

Avoid using Fence

A Send Request with the fence flag set is blocked until all prior RDMA Read and Atomic Send Requests have completed. This decreases the bandwidth.

Avoid using atomic operations

Atomic operations perform a read-modify-write in an atomic way. This usually decreases performance, since it usually involves locking access to the memory (implementation dependent).

Read multiple Work Completions at once

ibv_poll_cq() allows reading multiple completions at once. If the number of Work Completions returned is less than the number that one tried to read, the CQ is empty and there is no need to check whether it holds more Work Completions.

Set processor affinity for a certain task or process

When working with a Symmetric MultiProcessing (SMP) machine, binding processes to specific CPU(s)/core(s) may provide better utilization of the CPU(s)/core(s) and thus better performance. Running as many processes as there are CPU(s)/core(s) in the machine and binding one process to each of them may be a good practice. This can be done with the "taskset" utility.
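
Affinity can also be set from inside the program. The following is a minimal sketch for Linux with glibc, using sched_setaffinity(); the helper names pin_to_cpu() and first_allowed_cpu() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for sched_setaffinity() and the CPU_* macros */
#endif
#include <sched.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error
 * (errno is set by sched_setaffinity()). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* Return the lowest-numbered CPU the caller is currently allowed to run on,
 * or -1 on error. Useful in containers where CPU 0 may not be available. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Pinning should happen before the thread starts posting and polling, so that the buffers it touches first are likely to be allocated close to the CPU it runs on.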

Work with local NUMA node

When working on a Non-Uniform Memory Access (NUMA) machine, binding the processes to CPU(s)/core(s) that belong to the RDMA device's local NUMA node may provide better performance because of faster CPU access. Spreading the processes across all of the local CPU(s)/core(s) may be a good practice.

Work with cache-line aligned buffers

Working with cache-line aligned buffers (in S/G list, Send Request, Receive Request and data) will improve performance compared to working with unaligned memory buffers; it will decrease the number of CPU cycles and number of memory accesses.
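
A minimal sketch of allocating such a buffer with C11 aligned_alloc(). The 64-byte cache-line size is an assumption (it is common on x86-64 but not universal), and alloc_cacheline_aligned() is a hypothetical helper name.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache-line size; check the target CPU */

/* Allocate a zeroed buffer that starts on a cache-line boundary and whose
 * length is rounded up to a whole number of cache lines.
 * Returns NULL on failure; release with free(). */
void *alloc_cacheline_aligned(size_t len)
{
    size_t rounded = (len + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    void *buf;

    if (rounded == 0)
        rounded = CACHE_LINE;
    buf = aligned_alloc(CACHE_LINE, rounded);   /* size is a multiple of alignment */
    if (buf != NULL)
        memset(buf, 0, rounded);
    return buf;
}
```

The returned pointer can then be registered with ibv_reg_mr() and used as the address in a scatter/gather entry.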

Avoid getting into retransmission flows

Retransmission is a performance killer. There are 2 major reasons for retransmission in RDMA:

  • Transport retransmission - the remote QP isn't in a state that can process incoming messages, i.e. it didn't reach at least the RTR state, or it moved to the Error state
  • RNR retransmission - there is a message that should consume a Receive Request on the responder side, but there isn't any Receive Request in the Receive Queue

There are RDMA devices that provide counters to indicate that retry flows occurred, but not all of them.

Setting QP.retry_cnt and QP.rnr_retry to zero will cause a failure (i.e. Completion with error) when the QP enters those flows.

However, if retry flows can't be avoided, use as low a delay as possible between retransmissions.

Improving the Bandwidth

Find the best MTU for the RDMA device

The MTU value specifies the maximum packet payload size (i.e. excluding the packet headers) that can be sent. As a rule of thumb, since the packet header sizes are the same for all MTU values, using the maximum available MTU decreases the price paid per packet: the payload makes up a larger share of the total bandwidth used. However, there are RDMA devices that provide the best performance with MTU values lower than the maximum supported value. One should perform some testing to find the best MTU for the specific device in use.
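
The header-overhead argument can be made concrete with a little arithmetic. The sketch below computes the payload share of the bytes on the wire for each InfiniBand MTU size. The per-packet overhead is a parameter; a value of 26 bytes (LRH 8 + BTH 12 + ICRC 4 + VCRC 2) is an assumption for a plain InfiniBand packet without a GRH or extended headers, and other transports such as RoCE have different overheads. The function names are hypothetical.

```c
#include <stdio.h>

/* Fraction of the bytes on the wire that is payload when every packet
 * carries a full MTU-sized payload plus `hdr_bytes` of headers and CRCs. */
double payload_fraction(unsigned mtu, unsigned hdr_bytes)
{
    return (double)mtu / (double)(mtu + hdr_bytes);
}

/* Print the payload fraction for the InfiniBand MTU sizes. */
void print_mtu_table(unsigned hdr_bytes)
{
    static const unsigned mtus[] = { 256, 512, 1024, 2048, 4096 };

    for (size_t i = 0; i < sizeof(mtus) / sizeof(mtus[0]); i++)
        printf("MTU %4u: %.2f%% payload\n", mtus[i],
               100.0 * payload_fraction(mtus[i], hdr_bytes));
}
```

Under the 26-byte assumption the payload share grows from roughly 91% at an MTU of 256 to over 99% at 4096. This only covers wire overhead; it says nothing about how a particular device behaves at each MTU, which is why measuring is still needed.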

Use big messages

Sending a few big messages is more efficient than sending a lot of small messages. At the application level, one should collect data and send it as big messages over RDMA.

Work with multiple outstanding Send Requests

Working with multiple outstanding Send Requests and keeping the Send Queue always full (i.e. for every polled Work Completion post a new Send Request) will keep the RDMA device busy and prevent it from being idle.

Configure the Queue Pair to allow several RDMA Reads and Atomic in parallel

If one uses RDMA Read or Atomic operations, it is advisable to configure the QP to allow several RDMA Read and Atomic operations in flight, since this provides higher bandwidth.

Work with selective signaling in the Send Queue

Working with selective signaling in the Send Queue means that not every Send Request produces a Work Completion when it ends, which reduces the number of Work Completions that have to be handled.

Reducing the latency

Read Work Completions by polling

In order to read the Work Completions as soon as they are added to the Completion Queue, polling provides the best results (rather than working with Work Completion events).

Send small messages as inline

On RDMA devices that support sending data inline, sending small messages inline provides better latency since it eliminates the need for the RDMA device to perform an extra read (over the PCIe bus) to fetch the message payload.

Use low values in QP's timeout and min_rnr_timer

Using low values for the QP's timeout and min_rnr_timer means that if something goes wrong and a retry is required (whether because the remote QP doesn't answer or because it has no outstanding Receive Request), the wait before a retransmission will be short.
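
To get a feel for what a given timeout value costs: the QP's timeout attribute is an exponent, and the wait before a retransmission is 4.096 usec * 2^timeout, with 0 meaning an infinite wait. The sketch below converts the attribute value to nanoseconds. The helper name ack_timeout_ns() is hypothetical, and min_rnr_timer is not covered because it uses an encoded table rather than this formula.

```c
#include <stdint.h>

/* Local ACK timeout in nanoseconds for a given ibv_qp_attr.timeout value:
 * 4.096 usec * 2^timeout. The field is 5 bits wide; 0 means "wait forever"
 * and is returned here as 0, as is any out-of-range value. */
uint64_t ack_timeout_ns(uint8_t timeout)
{
    if (timeout == 0 || timeout > 31)
        return 0;
    return (uint64_t)4096 << timeout;
}
```

For example, a timeout value of 14 gives about 67 ms per attempt, while a value of 1 gives about 8 usec, so each step up doubles the wait.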

If immediate data is used, use RDMA Write with immediate instead of Send with immediate

When sending a message that has only immediate data, RDMA Write with immediate provides better performance than Send with immediate, since the latter causes the outstanding posted Receive Request to be read (on the responder side) and not only consumed.

Reducing memory consumption

Use Shared Receive Queue (SRQ)

Instead of posting many Receive Requests to each Queue Pair, using an SRQ can reduce the total number of outstanding Receive Requests and thus the total memory consumed.

Register physically contiguous memory

Registering physically contiguous memory, such as huge pages, can allow the low-level driver(s) to perform optimizations since fewer memory address translations are required (compared to a buffer made of 4KB memory pages).

Reduce the size of the used Queues to the minimum

Creating the various Queues (Queue Pairs, Shared Receive Queues, Completion Queues) may consume a lot of memory. One should set their sizes to the minimum required by the application.

Reducing CPU consumption

Work with Work Completion events

Reading the Work Completions using events eliminates the need to poll the CQ constantly, since the RDMA device sends an event when a Work Completion is added to the CQ.

Work with solicited events in Responder side

When reading the Work Completions on the Responder side, the solicited event can be a good way for the Requestor to hint that now is a good time to read the completions. This reduces the total number of handled Work Completions.

Share the same CQ with several Queues

Using the same CQ with several Queues and reducing the total number of CQs will eliminate the need to check several CQs in order to understand if an outstanding Work Request was completed. This can be done by sharing the same CQ with multiple Send Queues, multiple Receive Queues or with a mix of them.

Increase the scalability

Use collective algorithms

Using collective algorithms will reduce the total number of messages that cross the wire and will decrease the total number of messages and resources that each node in a cluster uses. There are RDMA devices that provide special collective offload operations that help reduce CPU utilization.

Use Unreliable Datagram (UD) QP

If every node needs to be able to receive a message from, or send a message to, any other node in the subnet, using a connected QP (either Reliable or Unreliable) may be a bad solution since many QPs will be created in every node. Using a UD QP is better since it can send messages to, and receive messages from, any other UD QP in the subnet.

Memory pool: hwchiu-blog-source/ceph-with-rdma.md at 1c4555a2533efe1768658abf8b522cccd24fccf4 · hwchiu/hwchiu-blog-source · GitHub

Origin blog.csdn.net/bandaoyu/article/details/120713020