Understanding RDMA based on SoftRoCE

RDMA (Remote Direct Memory Access) transfers memory directly between machines. It grew out of InfiniBand (IB) technology: the network card hardware does the work, and the kernel stays out of the data path. IB needs dedicated hardware and is mostly found in HPC. RoCE implements the RDMA protocol on ordinary Ethernet cards. RoCEv1 is a Layer 2 encapsulation directly on top of the MAC layer, so it only works inside a local area network. RoCEv2 is the UDP-based version, and it is what you need if the traffic has to pass through a router.
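
To make the v1/v2 difference concrete, here is a minimal C sketch that classifies a raw Ethernet frame. It assumes the commonly documented values (EtherType 0x8915 for RoCEv1, UDP destination port 4791 for RoCEv2). The names classify_frame and roce_kind are hypothetical, made up for this illustration, and VLAN tags and IPv6 are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RoCEv1 has its own EtherType; RoCEv2 rides on UDP over IP. */
#define ETHERTYPE_IPV4   0x0800
#define ETHERTYPE_ROCEV1 0x8915
#define IP_PROTO_UDP     17
#define ROCEV2_UDP_PORT  4791

enum roce_kind { NOT_ROCE = 0, ROCE_V1 = 1, ROCE_V2 = 2 };

/* Read a big-endian 16-bit field from a byte buffer. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Classify an untagged Ethernet frame (hypothetical helper, not part of
 * any RDMA library). Only plain Ethernet + IPv4 is understood.
 */
static enum roce_kind classify_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 14)
        return NOT_ROCE;              /* shorter than an Ethernet header */

    uint16_t ethertype = be16(frame + 12);
    if (ethertype == ETHERTYPE_ROCEV1)
        return ROCE_V1;               /* IB headers sit directly on the MAC */
    if (ethertype != ETHERTYPE_IPV4 || len < 14 + 20)
        return NOT_ROCE;

    const uint8_t *ip = frame + 14;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;  /* IPv4 header length, bytes */
    if (ihl < 20 || len < 14 + ihl + 8)
        return NOT_ROCE;              /* malformed, or no room for UDP */
    if (ip[9] != IP_PROTO_UDP)
        return NOT_ROCE;

    const uint8_t *udp = ip + ihl;
    return be16(udp + 2) == ROCEV2_UDP_PORT ? ROCE_V2 : NOT_ROCE;
}
```

The same two numbers are handy as capture filters when you want to check whether RoCE traffic is really on the wire.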

So why is RDMA fast? An ordinary network card has to receive complete packets and hand them to the kernel, while a card that supports RoCE reads and writes memory directly, without going through the kernel. Say we have 1 GB of memory to copy to the other side: that is a big parcel. Normally we use socket programming, which means going through the kernel's post office. It has many restrictions, for example a limit on message size. Like any state-run outfit, its internal procedures are complicated, so sending a parcel is laborious and slow, and it eats a lot of resources. RoCE is a private courier. You don't have to go to the post office. You tell it where your memory is and which address on the other side it should go to, and then, however large the memory is, it gets moved quietly in the background like a mouse moving house, while your CPU does something else. Reading remote memory works the same way. You can also choose whether or not you want a signed receipt.

This is different from DPDK. DPDK only skips the post office: you still wrap the packets yourself and hand them to the network card. With RoCE you don't even have to worry about the packaging, so it is more attentive than a courier company. RDMA can therefore be seen as a message mechanism packaged on the network card, at a higher level. After hearing so many benefits, can't you wait to try it? A network card that supports RoCE costs a few hundred dollars on Taobao, and that seems to be for a 10G card. Fortunately there is SoftRoCE, which uses software on an ordinary network card to do what the hardware would otherwise do. You can try it on a virtual machine. How amazing. The emphasis is on trying it out, though: the measured performance is not high... In fact, this technology has been out for many years, but it has always been...

For the SoftRoCE installation steps, see https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home. The other READMEs are out of date.

rxe-dev is actually a complete kernel tree with the rxe driver and a header file added. Do not use the code on master, it is not new enough. Use the v18 branch, which builds a 4.7 kernel. Note that the clone and compile process is slow...
On CentOS 7 you need to install bc, ncurses-devel and openssl-devel first. After compiling and installing, there will be a 4.7 kernel entry in the grub boot menu. Boot into it, run rxe-cfg start, then rxe-cfg add <eth>, and it is ready to use. The test commands are mainly in ib-utils and rdma-utils: rping, rdma_server/client, qperf and ibv_rc_pingpong can all be tried.

RDMA mainly has recv/send. With this mechanism the two sides have to pair up: a send here needs a recv there. IB verbs, that is, these send/recv operations, are executed in order. If nobody receives on the other side, everything you send afterwards is wasted. In the same way, when you want to receive, the other side has to send, otherwise you just hang there waiting. A recv always has to wait. A send can optionally not wait for a completion, which is called un-signaled. It is like sending a letter without asking for a receipt. There is one special case, though: if you send a whole batch of messages un-signaled, things stall. Why? Because a signaled send is needed to trigger the batch (I saw someone mention this problem and have not verified it). This design is rather silly, even if it improves efficiency... In my own test programs, setting up un-signaled sends is simple: set init_attr.sq_sig_all = 0 and do not set IBV_SEND_SIGNALED in the send flags (send_attr). The signaling itself kept failing for me, though...
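
One common way to live with un-signaled sends is to signal every Nth work request, so that completions still arrive from time to time and send queue slots can be reclaimed. The sketch below only shows that bookkeeping, on my assumption that this is what the "needs a signaled send" remark refers to. SKETCH_SEND_SIGNALED is a placeholder standing in for IBV_SEND_SIGNALED, and send_flags_for and expected_completions are hypothetical helpers, not part of libibverbs.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for IBV_SEND_SIGNALED so the sketch builds without libibverbs. */
#define SKETCH_SEND_SIGNALED 0x2u

/*
 * Pick the send flags for the seq-th work request (seq counts from 0).
 * Every `interval`-th request is signaled, the others are un-signaled.
 * Assumption: `interval` is chosen no larger than the send queue depth.
 */
static unsigned int send_flags_for(uint64_t seq, uint32_t interval)
{
    assert(interval > 0);
    return ((seq + 1) % interval == 0) ? SKETCH_SEND_SIGNALED : 0u;
}

/* Number of send completions to wait for after posting `total` requests. */
static uint64_t expected_completions(uint64_t total, uint32_t interval)
{
    assert(interval > 0);
    return total / interval;
}
```

With an interval of 4, requests 3, 7, 11 and so on carry the signaled flag, and ten posted sends should yield two completions. In a real program the returned value would be passed as the flags of each posted send.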

Now for read/write. These two access remote memory directly, without the other side taking part. First of all, in my tests both had to be posted with the IBV_SEND_SIGNALED flag, otherwise nothing came back. Reading 1 MB of memory and writing it back took about 0.5 s. I recommend printing the time elapsed between each rdma_xxx call and the previous one, so it is easy to tell which side is not responding or is responding slowly. At the same time, capture packets and check that the UDP traffic in each direction looks right. read/write seemed rather dumb in my tests: I could not read and write the same remote address, only two different addresses, and I could not add an offset to the address either... I am not sure whether something was wrong with my test... read/write also needs the address and key of the remote memory. Fetch them first with the send/recv described above, and then you can read and write.
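
The address and key exchange can be as small as one fixed-size message. Here is a minimal sketch of such a message, assuming a made-up 16-byte big-endian layout (address, rkey, length). struct remote_desc, desc_pack, desc_unpack and desc_target are all hypothetical names for this illustration, not librdmacm or libibverbs API. As far as I know, the verbs model lets the remote address of a read/write point anywhere inside the registered region, so desc_target also shows offset arithmetic with a bounds check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 8-byte address, 4-byte rkey, 4-byte length. */
#define REMOTE_DESC_WIRE_SIZE 16

struct remote_desc {
    uint64_t addr;    /* start of the peer's registered buffer */
    uint32_t rkey;    /* remote key from the peer's registration */
    uint32_t length;  /* size of the registered buffer in bytes */
};

/* Store the low `nbytes` bytes of v in big-endian order. */
static void put_be(uint8_t *out, uint64_t v, int nbytes)
{
    for (int i = nbytes - 1; i >= 0; i--) {
        out[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* Load an `nbytes`-byte big-endian integer. */
static uint64_t get_be(const uint8_t *in, int nbytes)
{
    uint64_t v = 0;
    for (int i = 0; i < nbytes; i++)
        v = (v << 8) | in[i];
    return v;
}

/* Serialize the descriptor into the buffer that will be sent to the peer. */
static void desc_pack(const struct remote_desc *d,
                      uint8_t out[REMOTE_DESC_WIRE_SIZE])
{
    assert(d != NULL && out != NULL);
    put_be(out, d->addr, 8);
    put_be(out + 8, d->rkey, 4);
    put_be(out + 12, d->length, 4);
}

/* Parse a descriptor out of a received buffer. */
static void desc_unpack(const uint8_t in[REMOTE_DESC_WIRE_SIZE],
                        struct remote_desc *d)
{
    assert(in != NULL && d != NULL);
    d->addr = get_be(in, 8);
    d->rkey = (uint32_t)get_be(in + 8, 4);
    d->length = (uint32_t)get_be(in + 12, 4);
}

/*
 * Compute the remote address for an access of `len` bytes at `offset`.
 * Returns 0 on success, -1 if the access would leave the region.
 */
static int desc_target(const struct remote_desc *d, uint64_t offset,
                       uint32_t len, uint64_t *target)
{
    assert(d != NULL && target != NULL);
    if (offset > d->length || len > d->length - offset)
        return -1;
    *target = d->addr + offset;
    return 0;
}
```

The server would pack its descriptor and send it once. The client unpacks it and uses the computed target together with the rkey for each read or write. A fixed byte order keeps the message valid between machines with different endianness.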

There is a small question here. The send/recv mechanism looks like a master-slave arrangement: for example, only the client sends requests to the server, and the server sits in recv the whole time. If the two sides are peers, is the only option to open one more channel for messages in the other direction? And since rdma_get_recv_comp() is a blocking call, doesn't that require two threads?

A receive must be posted (rdma_post_recv in rdma_server.c) before rdma_accept. This order feels unscientific, since normally you establish a connection first and only then send and receive. The server also works without it, but then the first request is delayed by about 0.5 s.

The basic code follows rdma_server.c and rdma_client.c.
Two documents are recommended. If Google does not turn them up, try Yahoo:
   RDMA Read and Write with IB Verbs
   Introduction to RDMA Programming
  
In theory RDMA works at a higher level of encapsulation, and offloading to hardware saves CPU and keeps latency low. But the programming model is different from what we are used to, and we have to fight all kinds of weird behavior. From a performance point of view, if the network cards were not expensive, almost all socket communication could be ported to this more efficient transport.

RDMA also supports multicast and unreliable transport modes (for audio and video)...

From a management point of view, using RoCE on VMs requires a matching way to monitor it, and debugging becomes more challenging, for example how to capture packets.

Anyway, if your system needs better network transfer efficiency and a freer CPU, RoCE is worth a look.

