RDMA Principle

What is DMA

DMA stands for Direct Memory Access. It allows a peripheral to read and write main memory directly, without the CPU carrying out the transfer itself. Let's first look at how things worked before DMA:

Data path between I/O devices and memory without DMA controller

Suppose the I/O device is an ordinary network card. To fetch the data it needs to send from memory, assemble it into packets, and put them on the physical link, the card must signal its request to the CPU over the bus. The CPU then copies the data from the memory buffer into its own internal registers, and from there into the device's storage space. If the amount of data is large, the CPU spends a long time busy moving bytes and cannot devote itself to other tasks.

The CPU's main job is computation, not copying data, so this wastes its computing power. To take this burden off the CPU and let it do more meaningful work, the DMA mechanism was later designed:

Data path between I/O devices and memory with DMA controller

As the figure shows, a DMA controller now sits on the bus; it is a device dedicated to reading and writing memory. With it, when the network card wants to copy data out of memory, the entire transfer is carried out by the DMA controller, apart from a few necessary control commands. The procedure is the same as a CPU copy, except that this time the data travels over the bus into the DMA controller's internal registers and then into the I/O device's storage space. The CPU only needs to attend to the start and the completion of the transfer and is free to do other things the rest of the time.
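
As a conceptual sketch (plain Python with hypothetical names, not any real driver API), the division of labor can be modeled like this: the CPU only fills in a transfer descriptor and handles the completion notification, while the DMA engine moves the bytes:

```python
# Toy model of a DMA transfer: the CPU programs a descriptor and handles
# the completion interrupt; the DMA engine performs the actual copy.

class DmaDescriptor:
    """Hypothetical descriptor: source offset, destination offset, length."""
    def __init__(self, src_off, dst_off, length):
        self.src_off = src_off
        self.dst_off = dst_off
        self.length = length

class DmaEngine:
    def __init__(self):
        self.bytes_moved = 0

    def run(self, desc, src_mem, dst_mem):
        # The engine copies memory over the bus; the CPU is free meanwhile.
        dst_mem[desc.dst_off:desc.dst_off + desc.length] = \
            src_mem[desc.src_off:desc.src_off + desc.length]
        self.bytes_moved += desc.length
        return "completion interrupt"  # CPU is notified only at the end

def send_with_dma(engine, payload):
    ram = bytearray(payload)                  # data sitting in main memory
    nic_buf = bytearray(len(payload))         # NIC's internal storage space
    desc = DmaDescriptor(0, 0, len(payload))  # CPU programs the engine
    irq = engine.run(desc, ram, nic_buf)      # engine does the copy
    return nic_buf, irq
```

In this model the CPU's involvement is limited to building the descriptor and reacting to the final notification, which is exactly the "start and completion only" role described above.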

The DMA controller is usually integrated with the I/O device; in other words, a network card contains both the module responsible for sending and receiving data and a DMA module.

What is RDMA

RDMA (Remote Direct Memory Access) means remote direct memory access. With RDMA, a local node can access the memory of a remote node "directly": remote memory can be read and written much like local memory, bypassing the complex TCP/IP protocol stack of traditional Ethernet. The process is invisible to the remote side, and most of the read/write work is done by hardware rather than software.

To understand this process intuitively, look at the following two diagrams (the arrows are only illustrative and do not represent actual logical or physical relationships):

In a traditional network, "node A sends a message to node B" really means "move a piece of data from node A's memory to node B's memory over the network link", and every stage of this process, sending and receiving alike, requires the CPU's command and control: driving the network card, handling interrupts, encapsulating and parsing packets, and so on.

In the figure above, the data in the user space of the left node's memory must first be copied by the CPU into a kernel-space buffer before the network card can access it. Along the way the data passes through the software TCP/IP protocol stack, which adds the headers and checksums of each layer, such as the TCP header and the IP header. The network card then copies the data from the kernel into its internal buffer via DMA, processes it, and sends it to the peer over the physical link.

When the peer receives the data, it performs the reverse process: the network card copies the data from its internal storage into a kernel-space buffer via DMA, the CPU parses it through the TCP/IP protocol stack, and the payload is finally copied out to user space.

So even with DMA technology, this process still depends heavily on the CPU.

After using RDMA technology, this process can be simply expressed as the following schematic diagram:

Again, a piece of data in local memory is copied into the peer's memory. With RDMA, the CPUs on both ends barely participate in the data transfer (they are involved only in the control plane). The local network card copies the data from user-space memory into its internal storage directly via DMA, the hardware assembles the headers of each layer, and the packet is sent to the peer's network card over the physical link. On receipt, the peer's RDMA network card strips the headers and checksums of each layer and copies the payload directly into user-space memory via DMA.
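
The difference between the two paths can be summarized as a copy count. The hop lists below are a simplified Python model of the figures above (the hop names are illustrative):

```python
# Per-message copies on each path, as (who performs it, source, destination).
SOCKET_PATH = [
    ("CPU", "user buffer", "kernel socket buffer"),   # write() copies in
    ("DMA", "kernel socket buffer", "NIC buffer"),    # sender NIC DMA
    ("DMA", "NIC buffer", "kernel socket buffer"),    # receiver NIC DMA
    ("CPU", "kernel socket buffer", "user buffer"),   # read() copies out
]
RDMA_PATH = [
    ("DMA", "user buffer", "NIC buffer"),  # local NIC reads user memory
    ("DMA", "NIC buffer", "user buffer"),  # remote NIC writes user memory
]

def cpu_copies(path):
    """Count the copies that consume CPU cycles."""
    return sum(1 for who, _, _ in path if who == "CPU")
```

With this model, `cpu_copies(SOCKET_PATH)` is 2 and `cpu_copies(RDMA_PATH)` is 0, which is the essence of why RDMA frees the CPUs on both ends.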

Advantages of RDMA

RDMA is mainly used in high-performance computing (HPC) and large data centers, and the hardware is much more expensive than ordinary Ethernet cards (for example, Mellanox's ConnectX-5 100Gb PCIe network card sells for more than 4,000 yuan). Because of these usage scenarios and prices, RDMA has remained distant from ordinary developers and consumers; at present it is mainly deployed by large Internet companies.

Why can RDMA technology be applied in the above scenarios? This involves its following characteristics:

  • Zero-copy: data does not need to be copied back and forth between user space and kernel space.

Operating systems such as Linux divide memory into user space and kernel space, so in traditional Socket communication the CPU has to copy data back and forth in memory several times. With RDMA, the registered memory region at the remote end can be accessed directly.

  • Kernel bypass: the I/O (data) path can bypass the kernel; data is prepared in user space and the hardware is notified to send or receive it, avoiding the overhead of system calls and context switches.

The figure above (original image [1]) illustrates the meaning of "zero-copy" and "kernel bypass" well. The upper and lower halves show one send-receive exchange over Sockets and over RDMA respectively, and the left and right sides are the two nodes. It is clear that the Socket path involves an extra copy in software, while RDMA bypasses the kernel and also eliminates that memory copy: data moves directly between user space and the hardware.

  • CPU offload: memory can be read and written without involving the CPU of the remote node (provided, of course, that you hold the "key" to the remote memory region in question); in effect, packet encapsulation and parsing are done in hardware. In traditional Ethernet communication, the CPUs on both sides must participate in parsing every layer of each message. When the data volume is large and the interaction frequent, this is substantial overhead, and those CPU cycles could have been spent on more valuable work.
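
The "key" mentioned above corresponds to the remote key (rkey) handed out when a memory region is registered; the remote NIC checks it in hardware, so the remote CPU never runs. A toy Python model of that check (class and function names are illustrative, not a real API):

```python
class MemoryRegion:
    """Toy model of a registered memory region protected by a remote key."""
    def __init__(self, buf, rkey):
        self.buf = buf
        self.rkey = rkey

def rdma_write(mr, offset, data, rkey):
    # In real hardware the NIC validates the key; a mismatch causes a
    # remote access error instead of touching memory.
    if rkey != mr.rkey:
        raise PermissionError("remote access error: bad rkey")
    mr.buf[offset:offset + len(data)] = data
```

A write presenting the right key lands directly in the region; a wrong key is rejected before any memory is touched, which is how RDMA keeps "direct" access from meaning "unprotected" access.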

The two performance metrics that come up most often in communications are "bandwidth" and "latency". Simply put, bandwidth is the amount of data that can be transmitted per unit of time, and latency is the time it takes for data to travel from the local end until it is received by the peer. Thanks to the characteristics above, RDMA achieves both higher bandwidth and lower latency than traditional Ethernet, so it shines in bandwidth-sensitive scenarios, such as exchanging massive amounts of data, and in latency-sensitive ones, such as synchronizing data among many compute nodes.
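
A common first-order model ties the two metrics together: the time to deliver one message is the startup latency plus size divided by bandwidth, which shows why small messages are latency-bound and large ones bandwidth-bound. A minimal sketch (the 100Gb/s and 2µs figures are illustrative, not measurements):

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: one-way time = startup latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

BW = 12.5e9   # 100 Gb/s expressed in bytes per second
LAT = 2e-6    # assumed 2 microseconds of startup latency

small = transfer_time(64, LAT, BW)             # dominated by latency
large = transfer_time(1_000_000_000, LAT, BW)  # dominated by bandwidth
```

For the 64-byte message virtually all of the time is the 2µs latency term; for the 1GB message the size/bandwidth term (about 80ms at 100Gb/s) dwarfs it.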

Protocols

RDMA itself refers to a technology. At the level of concrete protocols it includes InfiniBand (IB), RDMA over Converged Ethernet (RoCE), and the Internet Wide Area RDMA Protocol (iWARP). All three conform to the RDMA standard and share the same upper-layer interface, differing at the lower layers.

The figure above [2] gives a very clear comparison of the protocol stacks of these common RDMA technologies.

Infiniband

The IB protocol, proposed by the IBTA (InfiniBand Trade Association) in 2000, is the undisputed core. It specifies a complete set of protocols from the link layer up to the transport layer (not the transport layer of the traditional OSI seven-layer model, but one above it), and it is not compatible with existing Ethernet: besides IB-capable network cards, an enterprise that wants to deploy it must also purchase matching switching equipment.

RoCE

As its full English name shows, RoCE is a protocol built on the Ethernet link layer. Version 1 keeps the IB specification at the network layer, while version 2 uses UDP+IP as the network layer so that its packets can be routed. RoCE can be seen as a "low-cost" IB: it encapsulates IB messages in Ethernet packets for sending and receiving. Since RoCE v2 can run over ordinary Ethernet switching equipment, it is now widely used in enterprises, though in the same scenario its performance falls somewhat short of IB.
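
To make the encapsulation concrete, here is a rough per-packet overhead calculation for RoCE v2. The header sizes below are the author's reading of the public specifications (Ethernet preamble and FCS ignored, no VLAN tag or extended transport headers), so treat them as an approximation:

```python
# Approximate per-packet header bytes for a RoCE v2 RDMA payload.
ROCE_V2_HEADERS = {
    "Ethernet": 14,  # destination MAC + source MAC + EtherType
    "IPv4": 20,      # IP header without options
    "UDP": 8,        # destination port 4791 identifies RoCE v2
    "IB BTH": 12,    # InfiniBand Base Transport Header, reused by RoCE
    "ICRC": 4,       # invariant CRC covering the IB payload
}

def goodput_fraction(mtu_payload):
    """Fraction of each packet that is actual payload."""
    overhead = sum(ROCE_V2_HEADERS.values())
    return mtu_payload / (mtu_payload + overhead)
```

With a 4096-byte payload per packet, the payload fraction works out to roughly 98.6%, which is why large-MTU configurations are popular for RDMA traffic.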

iWARP

The iWARP protocol was proposed by the IETF and is built on TCP. Because TCP is a connection-oriented, reliable protocol, iWARP offers better reliability than RoCE v2 and IB in lossy network scenarios (roughly, network environments where packet loss is common) and has a clear advantage in large-scale deployments. However, maintaining a large number of TCP connections consumes a great deal of memory, and TCP's complex flow-control mechanisms cause performance problems, so in raw performance iWARP lags behind the UDP-based RoCE v2 and behind IB.

Note that although software implementations of RoCE and iWARP exist, all of the protocols above require dedicated hardware (network card) support for serious commercial use.

iWARP was not developed directly from InfiniBand, but it inherits some of InfiniBand's design ideas. The relationship among the three protocols is shown in the figure below:


Players

Standards/Ecosystem Organizations

When it comes to the IB protocol, two major organizations must be mentioned: IBTA and OFA.

IBTA[3]

Founded in 1999, the IBTA is responsible for formulating and maintaining the InfiniBand protocol standard. It is independent of any single vendor, unites the industry by sponsoring technical events and promoting resource sharing, and actively promotes IB and RoCE through online communication, marketing, and offline events.

The IBTA conducts protocol-compliance and interoperability testing and certification for commercial IB and RoCE equipment. It is led by a committee of large IT vendors whose main members include Broadcom, HPE, IBM, Intel, Mellanox, and Microsoft; Huawei is also an IBTA member.

OFA[4]

A non-profit organization founded in 2004, the OFA is responsible for developing, testing, certifying, supporting, and distributing a vendor-independent, open-source, cross-platform InfiniBand protocol stack; it began supporting RoCE in 2010. It maintains the OFED (OpenFabrics Enterprise Distribution) software stack used by RDMA/kernel-bypass applications, ensuring its compatibility and ease of use with mainstream software and hardware. OFED includes drivers, kernel components, middleware, and APIs.

The two organizations cooperate: the IBTA mainly develops, maintains, and enhances the InfiniBand protocol standard, while the OFA develops and maintains the InfiniBand protocol stack and the upper-layer application APIs.

Development Communities

Linux Community

The RDMA subsystem of the Linux kernel is quite active: protocol details are frequently discussed, the framework is modified often, and vendors including Huawei and Mellanox regularly update their driver code.

Mailing list: http://vger.kernel.org/vger-lists.html#linux-rdma

The code lives in the kernel's drivers/infiniband/ directory and includes the framework core as well as each vendor's drivers.

Code repository: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/

RDMA Community

For upper-layer users, IB provides a set of Socket-like interfaces, libibverbs, through which all three protocols mentioned above can be used. With the protocol documents, API documentation, and sample programs, it is easy to write a demo. In this column, the "RDMA community" refers specifically to its user-space community, whose organization on GitHub is named linux-rdma.
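
The typical verbs workflow, register memory, create a queue pair, post work requests, poll a completion queue, can be sketched as a toy Python model. The class and method names merely echo the real libibverbs calls (such as ibv_post_send and ibv_poll_cq) and are not the actual API:

```python
# Toy model of the libibverbs send path. In real RDMA the NIC hardware,
# not software, drains the send queue and generates completions.

class ToyQueuePair:
    def __init__(self):
        self.send_queue = []   # posted work requests
        self.completions = []  # completion queue

    def post_send(self, opcode, local_buf, remote_buf):
        """Queue a work request; returns immediately, like ibv_post_send."""
        self.send_queue.append((opcode, local_buf, remote_buf))

    def process(self):
        """Stand-in for the NIC hardware executing the work requests."""
        while self.send_queue:
            opcode, local, remote = self.send_queue.pop(0)
            if opcode == "RDMA_WRITE":
                remote[:len(local)] = local  # lands in remote user memory
            self.completions.append((opcode, "success"))

    def poll_cq(self):
        """Fetch one completion if available, like ibv_poll_cq."""
        return self.completions.pop(0) if self.completions else None
```

The asynchronous shape is the point: posting a request and harvesting its completion are separate steps, and the data movement in between is the hardware's job.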

It mainly contains two sub-repositories:

  • rdma-core

The user-space core code, APIs, documentation, and each vendor's user-space drivers.

  • perftest

A powerful tool for testing RDMA performance.

Code repository: https://github.com/linux-rdma/

UCX[5]

UCX is a communication framework for data processing and high-performance computing built on top of technologies such as RDMA, which is one of its underlying cores. It can be thought of as middleware between the application and the RDMA API, wrapping the latter in an interface that is easier for upper-layer users to develop against.

The author is not deeply familiar with it and only knows that some companies in the industry are developing applications based on UCX.

Code repository: https://github.com/openucx/ucx

Hardware Manufacturers

Many manufacturers design and produce IB-related hardware, including Mellanox, Huawei, Intel (which acquired QLogic's IB technology), Broadcom, Marvell, Fujitsu, and others. Rather than covering them all, we will briefly mention Mellanox and Huawei.

  • Mellanox

The leader in the IB field, Mellanox is present in protocol standardization, software and hardware development, and ecosystem building, and has the greatest say in the community and in standards work. Its latest generation of network cards is the ConnectX-6 series, supporting 200Gb/s.

  • Huawei

The Kunpeng 920 chip launched at the beginning of last year already supports RoCE at 100Gb/s, which puts Huawei in a technically leading position in China. However, there is still a long way to go to match Mellanox in software, hardware, and influence; hopefully Huawei can catch up with the leader soon.

Users

Microsoft, IBM, and, domestically, Alibaba and JD.com are all using RDMA, and many large IT companies are doing preliminary development and testing. In data centers and high-performance computing, it is the general trend for RDMA to replace traditional networking. The author has little contact with the market and cannot provide more detailed application information.

Original Author: Savir

 

 


Origin blog.csdn.net/youzhangjing_/article/details/132174782