[RDMA] ibv functions and related issues | IBV_SEND_INLINE

API query URL:

https://docs.oracle.com/cd/E88353_01/html/E37842/ibv-modify-qp-3.html

https://www.rdmamojo.com/2013/01/26/ibv_post_send/

 

ibv_post_send() 

The function prototype is

int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);
where struct ibv_send_wr is defined as:

https://www.rdmamojo.com/2013/01/26/ibv_post_send/

struct ibv_send_wr {
    uint64_t        wr_id;
    struct ibv_send_wr     *next;
    struct ibv_sge           *sg_list;
    int            num_sge;
    enum ibv_wr_opcode    opcode;
    int            send_flags;
    uint32_t        imm_data;
    union {
        struct {
            uint64_t    remote_addr;
            uint32_t    rkey;
        } rdma;
        struct {
            uint64_t    remote_addr;
            uint64_t    compare_add;
            uint64_t    swap;
            uint32_t    rkey;
        } atomic;
        struct {
            struct ibv_ah  *ah;
            uint32_t    remote_qpn;
            uint32_t    remote_qkey;
        } ud;
    } wr;
};

The opcode parameter in the ibv_send_wr structure determines the type of data transmission, for example:

 

IBV_WR_SEND - With this opcode, the contents of the local memory buffers described by sg_list are sent to the remote QP. The sender does not know where the data will be written on the remote node; the receiver decides, because it must post a receive WR (ibv_post_recv) whose buffers determine where the incoming data is placed.

IBV_WR_RDMA_WRITE - With this opcode, the contents of the local memory buffers described by sg_list are written to a contiguous block of the remote QP's virtual address space (which does not mean the remote memory is physically contiguous). The remote QP does not need to post a receive WR; in a real RDMA write the remote CPU is not involved at all, since the local side directly manipulates the peer's memory using the address and key obtained during the initial handshake.
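As a quick illustration, here is a minimal sketch (not from the original post; the helper name, the remote_buf struct and the handshake values are made up for illustration) of how the WR differs between the two opcodes, assuming sge describes a registered local buffer:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hypothetical handshake data exchanged out of band. */
struct remote_buf { uint64_t addr; uint32_t rkey; };

/* Post either a SEND or an RDMA WRITE of the buffer described by 'sge'. */
static int post_one(struct ibv_qp *qp, struct ibv_sge *sge,
                    int use_rdma_write, const struct remote_buf *rem)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (use_rdma_write) {
        /* RDMA WRITE: the local end supplies the remote virtual address and
         * rkey; the remote CPU does not post a receive and is not involved. */
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = rem->addr;
        wr.wr.rdma.rkey        = rem->rkey;
    } else {
        /* SEND: the remote side decides where the data lands by posting
         * a receive WR in advance; no remote address is needed here. */
        wr.opcode = IBV_WR_SEND;
    }

    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success, errno value on failure */
}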
 

ibv_send_flags send_flags

https://www.rdmamojo.com/2013/01/26/ibv_post_send/

 enum ibv_send_flags {
 	IBV_SEND_FENCE		= 1 << 0,
 	IBV_SEND_SIGNALED	= 1 << 1,
 	IBV_SEND_SOLICITED	= 1 << 2,
 	IBV_SEND_INLINE		= 1 << 3,
 	IBV_SEND_IP_CSUM	= 1 << 4
 };

send_flags describes the properties of the WR. It is 0, or the bitwise OR of one or more of the following flags:

IBV_SEND_FENCE - Sets the fence indicator for this WR. Processing of this WR will not start until all previously posted RDMA Read and Atomic WRs have completed. Valid only for QPs whose transport service type is IBV_QPT_RC.

IBV_SEND_SIGNALED - Sets the completion notification indicator for this WR. If the QP was created with sq_sig_all=0, a Work Completion (WC) will be generated when processing of this WR ends. If the QP was created with sq_sig_all=1, this flag has no effect.

IBV_SEND_SOLICITED - Sets the solicited event indicator for this WR. When the message carried by this WR is consumed by the remote QP, a solicited event is generated for it, and if a user on the remote side is waiting for a solicited event, it will be woken up. Relevant only for the Send-with-immediate and RDMA Write-with-immediate opcodes.

IBV_SEND_INLINE - The memory buffers specified in sg_list are placed inline in the Send Request. This means the low-level driver (i.e., the CPU) reads the data rather than the RDMA device, so the L_Key is not checked; in fact those memory buffers do not even need to be registered, and they can be reused as soon as ibv_post_send() returns. Valid only for Send and RDMA Write opcodes.

(In the code discussed here there is no key exchange, so a real RDMA Write cannot be used and the CPU still reads the data. Since the CPU reads it, the memory buffer does not need to be registered. This flag applies only to send and RDMA write operations.)
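These flags can be combined, and multiple WRs can be chained through the next pointer shown in the struct above. Below is a minimal, hypothetical sketch (helper name and parameters invented for illustration, RC QP and registered buffers assumed) that posts an RDMA READ followed by a fenced, signaled SEND, so the send does not start until the read has completed:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int post_read_then_fenced_send(struct ibv_qp *qp,
                                      struct ibv_sge *read_sge,
                                      struct ibv_sge *send_sge,
                                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr read_wr, send_wr, *bad_wr = NULL;

    memset(&read_wr, 0, sizeof(read_wr));
    read_wr.wr_id               = 10;
    read_wr.opcode              = IBV_WR_RDMA_READ;
    read_wr.sg_list             = read_sge;
    read_wr.num_sge             = 1;
    read_wr.wr.rdma.remote_addr = remote_addr;
    read_wr.wr.rdma.rkey        = rkey;
    read_wr.next                = &send_wr;          /* chain the two WRs */

    memset(&send_wr, 0, sizeof(send_wr));
    send_wr.wr_id      = 11;
    send_wr.opcode     = IBV_WR_SEND;
    send_wr.sg_list    = send_sge;
    send_wr.num_sge    = 1;
    /* FENCE: do not start this send until the preceding RDMA READ finishes;
     * SIGNALED: generate a work completion so we know when it is done. */
    send_wr.send_flags = IBV_SEND_FENCE | IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &read_wr, &bad_wr);
}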

The flag IBV_SEND_INLINE describes to the low-level driver how the data will be read from the memory buffer (either the driver reads the data and gives it to the HCA, or the HCA fetches the data from memory using the gather list).

The data on the wire for inline messages looks exactly the same as data fetched by the HCA.

The receive side has no equivalent feature that can work with a memory buffer that is not registered.

https://general.openfabrics.narkive.com/dWKbPu7p/ofa-general-query-on-ibv-post-recv-and-ibv-send-inline

 

 ibv_sge

struct ibv_sge describes a scatter/gather entry. The memory buffer that the entry describes must stay registered until every posted Work Request that uses it is no longer considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list is undefined, so if some entries overlap the same memory address, the content of that address is undefined.

struct ibv_sge {
	uint64_t		addr;
	uint32_t		length;
	uint32_t		lkey;
};

Here is the full description of struct ibv_sge:

addr - The address of the buffer to read from or write to
length - The length of the buffer in bytes. The value 0 is special and means 2^31 bytes (not zero bytes, as one might imagine)
lkey - The local key of the Memory Region that this memory buffer was registered with
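For context, a minimal sketch (helper name and access flags are arbitrary choices, not from the original text) of how a scatter/gather entry is typically filled from a registered Memory Region:

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register a buffer and describe it with an ibv_sge. */
static struct ibv_mr *make_sge(struct ibv_pd *pd, size_t len, struct ibv_sge *sge)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* The MR must stay registered while any WR using this SGE is outstanding. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    sge->addr   = (uint64_t)(uintptr_t)buf;  /* start of the buffer */
    sge->length = (uint32_t)len;             /* length in bytes     */
    sge->lkey   = mr->lkey;                  /* local key of the MR */
    return mr;
}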

 

Sending inline'd data is an implementation extension that isn't defined in any RDMA specification: it allows the data itself to be carried in the Work Request posted to the RDMA device (instead of being referenced through scatter/gather entries). The memory that holds the message doesn't have to be registered.

There isn't any verb that specifies the maximum message size that can be sent inline in a QP. Only some RDMA devices support it. In some RDMA devices, creating a QP will set max_inline_data to the size of message that can be sent inline given the requested number of scatter/gather elements of the Send Queue.

In others, one should explicitly specify, before the QP is created, the message size to be sent inline. For those devices it is advised to try to create the QP with the required message size and keep decreasing it if QP creation fails.
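Following that advice, a hedged sketch of creating an RC QP with a requested inline size and shrinking the request when creation fails; the helper name and the halving policy are just one possible choice, and cq/req_num mirror the snippets later in this post:

#include <string.h>
#include <infiniband/verbs.h>

/* Try to create an RC QP that supports 'want_inline' bytes of inline data,
 * halving the request until creation succeeds or the size reaches zero. */
static struct ibv_qp *create_qp_with_inline(struct ibv_pd *pd, struct ibv_cq *cq,
                                            uint32_t req_num, uint32_t want_inline)
{
    struct ibv_qp_init_attr attr;
    struct ibv_qp *qp = NULL;

    while (1) {
        memset(&attr, 0, sizeof(attr));
        attr.send_cq             = cq;
        attr.recv_cq             = cq;
        attr.qp_type             = IBV_QPT_RC;
        attr.cap.max_send_wr     = req_num;
        attr.cap.max_recv_wr     = req_num;
        attr.cap.max_send_sge    = 1;
        attr.cap.max_recv_sge    = 1;
        attr.cap.max_inline_data = want_inline;   /* requested inline size */

        qp = ibv_create_qp(pd, &attr);
        if (qp || want_inline == 0)
            break;
        want_inline /= 2;                         /* retry with a smaller size */
    }
    /* On success, attr.cap.max_inline_data holds the value actually granted. */
    return qp;
}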

While a WR is considered outstanding:

  • If the WR sends data, the content of the local memory buffers shouldn't be changed, since one doesn't know when the RDMA device will stop reading from them (inline data is the one exception)
  • If the WR reads data, the content of the local memory buffers shouldn't be read, since one doesn't know when the RDMA device will stop writing new content to them

============================

The ibv_post_send() work queue overflow problem

This problem occurs when there is no room left in the send queue at post time. One cause is posting too fast; the other is never processing the WCs (work completions) of the posted sends.

A simple test scenario was set up for inline sends. According to the documentation, an inline send allows the send buffer to be reused without waiting for a WC.

    struct ibv_send_wr send_wr = {
        .sg_list    = &sge,         /* sge describes the payload buffer; with
                                       IBV_SEND_INLINE no lkey/MR registration
                                       is needed                              */
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_INLINE,
    };

In practice, after a certain number of posts the sends stop failing with errno 12 (ENOMEM - cannot allocate memory). That number is exactly the req_num used when the QP was initialized:

	struct ibv_qp_init_attr qp_init_attr = {
			.send_cq = verbs.cq,
			.recv_cq = verbs.cq,
			.cap = {
				.max_send_wr = req_num,
				.max_recv_wr = req_num,
				.max_send_sge = 1,
				.max_recv_sge = 1,
			},
			.qp_type = IBV_QPT_RC,
	};

Reference https://www.rdmamojo.com/2013/01/26/ibv_post_send/

Many people ask this question. Everyone assumes that inline sends don't need WCs at all. In fact, inline only means the data is copied into the WR (work request) and handed to the hardware, so the NIC does not have to DMA it from the buffer. If no WCs are ever generated and polled, the head pointer of the send queue is never advanced, so the verbs driver believes the hardware is still working on those entries and there is no room left to post new sends.

The suggestion: even when sending inline, you still have to poll the CQ every so often, so that the queue pointers get updated.

So inline sending only reduces how often you have to handle WCs; it does not remove the need altogether.

What about my case, then, where the SQ and RQ share the same CQ and the RQ side is polled frequently? Judging from the test results, that alone is not enough.

When initializing the QP, add: .sq_sig_all = 1,
or, every certain number of posts, set the IBV_SEND_SIGNALED flag on a send_wr.

When processing the resulting WCs, if the opcode is IBV_WC_SEND (the completion of such an inline send), it can simply be ignored.
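Putting this advice together, here is a hedged sketch (helper name and the signalling interval of 64 are arbitrary; the sge, QP and CQ are assumed to be set up as in the snippets above) that posts inline sends, marks every Nth one as signaled, and polls the CQ so the send queue entries can be recycled:

#include <string.h>
#include <stdio.h>
#include <infiniband/verbs.h>

#define SIGNAL_EVERY 64   /* arbitrary interval; keep it well below max_send_wr */

/* Post one inline send; every SIGNAL_EVERY-th WR is signaled and the CQ is
 * drained so the driver can reuse the consumed send queue entries. */
static int post_inline_send(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_sge *sge, uint64_t seq)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = seq;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_INLINE;
    if (seq % SIGNAL_EVERY == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;   /* ask for a WC every so often */

    int ret = ibv_post_send(qp, &wr, &bad_wr);
    if (ret)
        return ret;                            /* e.g. ENOMEM: send queue is full */

    /* Drain any completions that are ready; IBV_WC_SEND completions of
     * inline sends carry no data and can simply be discarded. */
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS)
            fprintf(stderr, "WC error: %s\n", ibv_wc_status_str(wc.status));
    }
    return 0;
}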

 

 

 

==========================================

5. Why should IBV_SEND_INLINE be set?

int send_flags describes the attributes of the WR; its value is 0 or the bitwise OR of one or more flags.

IBV_SEND_FENCE - Set the fence indicator for this WR: processing of this WR is blocked until all previously posted RDMA Read and Atomic WRs have completed. Valid only for QPs with transport service type IBV_QPT_RC.

IBV_SEND_SIGNALED - Set the completion notification indicator for this WR: if the QP was created with sq_sig_all=0, a work completion is generated when processing of this WR finishes. If the QP was created with sq_sig_all=1, the flag has no effect.

IBV_SEND_SOLICITED - Set the solicited event indicator for this WR: when the message in this WR is consumed by the remote QP, a solicited event is created, and if a user on the remote side is waiting for a solicited event, it is woken up. Relevant only to the Send-with-immediate and RDMA Write-with-immediate opcodes.

IBV_SEND_INLINE - The memory buffers specified in sg_list are placed inline in the send request. The low-level driver (i.e., the CPU) reads the data instead of the RDMA device, so the L_Key is not checked; these memory buffers do not even need to be registered and can be reused as soon as ibv_post_send() returns. Valid only for the Send and RDMA Write opcodes. Since this code involves no key exchange, RDMA Write cannot be used, so the CPU still reads the data; and since the CPU reads it, there is no need to register the memory buffer. This flag can only be used for send and RDMA write operations.

6. What is the relationship between the opcode and the corresponding QP transmission service type?

7. What is the difference between libibverbs and librdmacm?

In infiniband/verbs.h, ibv_post_send() and ibv_post_recv() are defined; they post a WR to the SQ and RQ respectively. What the operation does is determined by the opcode in the WR. ibv_post_send() takes a struct ibv_send_wr, which has an opcode field (SEND, WRITE, READ, and so on). ibv_post_recv() takes a struct ibv_recv_wr, which has no opcode: receiving is the only possible action, so there is no need for an operation code. Sending, on the other hand, comes in three categories, as shown below.
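To make the contrast concrete, a minimal sketch (helper name invented for illustration) of posting a receive buffer with ibv_post_recv(): the receive WR has no opcode, only the SGE describing where incoming data should land, and that buffer must belong to a registered MR:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a single receive buffer described by 'sge' to the QP's receive queue.
 * Unlike ibv_send_wr, ibv_recv_wr has no opcode: receiving is the only action. */
static int post_one_recv(struct ibv_qp *qp, struct ibv_sge *sge, uint64_t wr_id)
{
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = wr_id;
    wr.sg_list = sge;     /* buffer must belong to a registered MR (lkey set) */
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);
}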

In rdma/rdma_verbs.h, there are rdma_post_send(), rdma_post_recv(), rdma_post_read(), and rdma_post_write().

rdma_post_send(): posts a WR to the QP's SQ; an MR is required.

rdma_post_recv(): posts a WR to the QP's RQ; an MR is required.

rdma_post_read(): posts a WR to the QP's SQ to perform an RDMA READ; it needs the remote address and rkey, the local buffer address and length, and an MR.

rdma_post_write(): posts a WR to the QP's SQ to perform an RDMA WRITE; it needs the remote address and rkey to write to, the local address and length of the data to send, and an MR.

So the four communication functions in rdma/rdma_verbs.h map onto the two functions in infiniband/verbs.h: ibv_post_send() corresponds to rdma_post_send(), rdma_post_read() and rdma_post_write(), while ibv_post_recv() corresponds to rdma_post_recv().
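For example, an RDMA WRITE posted through librdmacm and the roughly equivalent WR built by hand with libibverbs might look like the sketch below (helper names are invented; the rdma_post_write() signature is the one declared in rdma/rdma_verbs.h, so double-check it against your headers):

#include <stdint.h>
#include <string.h>
#include <rdma/rdma_verbs.h>   /* librdmacm convenience wrappers */

/* The librdmacm one-liner: post an RDMA WRITE of 'len' bytes at 'buf'. */
static int write_with_rdmacm(struct rdma_cm_id *id, void *buf, size_t len,
                             struct ibv_mr *mr, uint64_t raddr, uint32_t rkey)
{
    return rdma_post_write(id, NULL /* context */, buf, len, mr,
                           IBV_SEND_SIGNALED, raddr, rkey);
}

/* Roughly what that wrapper builds underneath with plain libibverbs. */
static int write_with_ibverbs(struct ibv_qp *qp, void *buf, size_t len,
                              struct ibv_mr *mr, uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}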

Original link: https://blog.csdn.net/upupday19/article/details/79379539

 

IBV_SEND_INLINE

 

The flag IBV_SEND_INLINE describes to the underlying driver how the data will be read from the (application) memory buffer. There are two ways:

1. the driver reads the data and hands it to the HCA, or
2. the HCA fetches the data from memory itself using the gather list.

 

zhang Jackie wrote:
> Hi, I am writing a program that calls *ibv_post_send* to RDMA write data to another node (with the *IBV_SEND_SIGNALED* option).
> IBV_SEND_INLINE can be used together with it. With *IBV_SEND_INLINE*, the buffer can be reused immediately. If the packet is small enough, using IBV_SEND_INLINE is very good.

A: IBV_SEND_INLINE means that the driver (the CPU) copies the data by itself (into the driver), so the data buffer can be reused/freed right away.

> When I send large data and don't use the IBV_SEND_INLINE option, I must wait a while before I can use the same buffer to post data again.
> In fact, some errors are reported if there is no wait. So I want to know how long I should wait? I think waiting a fixed amount of time is not the correct way to deal with this situation.

A: If you don't use IBV_SEND_INLINE, you must wait for the corresponding completion of that WR. You are using IBV_SEND_SIGNALED, which means every WR ends with a completion, so in order to use the data buffer safely you must wait for this WR's completion. (i.e., poll the CQ?)

https://lists.openfabrics.org/pipermail/ewg/2008-April/006271.html

original:

Hi.

zhang Jackie wrote:
> hi,
> I am writing a program calling *ibv_post_send* to RDMA write data to 
> other node with *IBV_SEND_SIGNALED* option. 
> IBV_SEND_INLINE can be used with IBV_SEND_INLINE .*IBV_SEND_INLINE* is 
> to say that the buffer can be used immediately. If the packet is 
> small enough ,then it is very good to use  IBV_SEND_INLINE.

A: IBV_SEND_INLINE means that the driver copy the data by itself and the 
data buffers can be reused/freed.


> while I send big data and dont use IBV_SEND_INLINE option ,So I must 
> wait a while before I can use this same buffer to post data again.In 
> fact some errors will be reported if there is no wait. SO I want to 
> know how long should I wait? And I think explicit wait is not the 
> correct solution to deal with such condition.

A: If you don't use IBV_SEND_INLINE, you must wait for the corresponding 
completion of this WR. You are using IBV_SEND_SIGNALED Which means that 
every WR will ends up with a completion, so in order to safely use the 
data buffer, you must wait for the completion of this WR.

https://lists.openfabrics.org/pipermail/ewg/2008-April/006271.html

 

 


Origin blog.csdn.net/bandaoyu/article/details/113253052