RDMA technical analysis

 

1 What is RDMA
1.1 DMA in the traditional sense


--- Direct memory access (DMA) is a working mode in which I/O transfers are carried out entirely by hardware. In this mode, the DMA controller takes over control of the bus from the CPU, and data moves directly between memory and the I/O device without passing through the CPU. While a DMA transfer is in progress, the DMA controller issues address and control signals to memory, updates the address, counts the words transferred, and reports the end of the transfer to the CPU via an interrupt. DMA is generally used for high-speed transfer of blocks of data.


---Purpose of DMA: to reduce CPU overhead when transferring large amounts of data. Method: a dedicated DMA controller (DMAC) generates the memory addresses and controls the memory-access process. Advantages: the operation is implemented by hardware circuits, so transfers are fast; the CPU intervenes only at initialization and completion, so the CPU and the peripheral work in parallel and overall efficiency is high.

---A DMA block transfer proceeds in three stages: pre-processing before the transfer, the transfer itself, and post-processing. DMA control flow: 1. Pre-processing: the CPU executes I/O instructions to initialize and start the DMAC. 2. Data transfer: the DMAC takes control of the bus and performs the transfer. 3. Post-processing: when the transfer ends, the DMAC raises an interrupt to report completion; the CPU responds and runs the interrupt service routine to finish the DMA operation.
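The three-stage flow above can be sketched in miniature. This is a toy model, not a real driver API; the class and callback names are illustrative. The point is the division of labor: the CPU only programs the controller and handles the completion "interrupt", while the DMAC moves every word itself.

```python
class DMAController:
    """Toy DMA controller: moves words between regions of one memory array."""

    def __init__(self, memory):
        self.memory = memory

    def transfer(self, src, dst, count, on_complete):
        # Stage 2: the DMAC drives the bus, updating the address and
        # decrementing the word count, with no CPU work per word.
        for i in range(count):
            self.memory[dst + i] = self.memory[src + i]
        # Stage 3: signal completion, modelling the interrupt to the CPU.
        on_complete(count)


memory = list(range(16)) + [0] * 16
interrupts = []

# Stage 1: the CPU programs source, destination, and length, then is free
# to do other work while the "hardware" runs.
dmac = DMAController(memory)
dmac.transfer(src=0, dst=16, count=16, on_complete=interrupts.append)

assert memory[16:32] == list(range(16))   # the block arrived intact
assert interrupts == [16]                 # one "interrupt" reported 16 words
```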

1.2 RDMA


---RDMA (Remote Direct Memory Access) transfers data across the network directly into a computer's memory, with little involvement of the host processor. An ordinary NIC can offload checksum computation in hardware and, with software improvements, reduce the number of copies on the send side, but it cannot reduce copies on the receive side, and those copies consume many processor cycles. A conventional NIC works as follows: received packets are first buffered by the system; after the packets are processed, the data is assigned to the corresponding TCP connection; the receiving system then associates that TCP data with the owning application and copies it from the system buffer to the target address. Ethernet already meets the throughput requirements of high-performance applications, offering both high bandwidth and a cost advantage; the main problem in pairing Ethernet with high-performance applications is application-level throughput. Normally, the host CPU must continuously process Ethernet traffic in software, so CPU speed limits the achievable data rate, and continuously handling this traffic degrades CPU performance. The problem is exacerbated with multi-port gigabit or single-port 10-gigabit Ethernet.
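The receive-side copy cost described above can be made concrete with a minimal sketch (all names are illustrative, not a real socket API): each packet is copied once into a "system" buffer and a second time into the application's buffer. That second copy, repeated per packet, is exactly the CPU work RDMA is designed to eliminate.

```python
def receive_conventional(packets, app_buffer):
    """Two-copy receive path: wire -> system buffer -> application buffer."""
    offset = 0
    for pkt in packets:
        system_buffer = bytes(pkt)  # copy 1: NIC data lands in a kernel buffer
        app_buffer[offset:offset + len(system_buffer)] = system_buffer  # copy 2
        offset += len(system_buffer)
    return offset


packets = [b"hello ", b"rdma ", b"world"]
app_buffer = bytearray(64)
n = receive_conventional(packets, app_buffer)

assert app_buffer[:n] == b"hello rdma world"
```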


---Two factors mainly limit network speed: the communication intensity of the application, and the efficiency with which the host CPU moves data between the kernel and application memory. Reaching a given performance level therefore requires extra host CPU resources, efficient software, and effective load management. Traditional TCP/IP consumes substantial server resources during data transfer, which makes it hard to realize Ethernet's advantages of low investment and low operating cost. To fully exploit 10-gigabit Ethernet, the application-performance problem must be solved: the system cannot keep processing Ethernet traffic in software, and host CPU resources must be freed for application processing. The key is to eliminate unnecessary data movement through the host CPU and to reduce inter-system latency. In general, this must be addressed at three levels: protocol, software, and hardware.


---RDMA stands for Remote Direct Memory Access. As shown in Figure 2, RDMA transfers data across the network directly into a computer's memory, moving data quickly from one system into the memory of a remote system without involving either operating system, so it requires little of the computers' processing power. It eliminates external memory copies and context switches, freeing up bus bandwidth and CPU cycles and thereby improving application performance. By contrast, the conventional approach requires the system to parse incoming data first and then store it in the correct location.

 


--- When an application issues an RDMA read or write request, no data copying is performed. With no kernel involvement, the request is posted from the application in user space to the local NIC (network card), and then carried over the network to the remote NIC. Completion can be handled entirely in user space (by polling a user-level completion queue), or through the kernel if the application sleeps until the request completes. RDMA operations let an application read from or write to the memory of a remote application; the remote virtual memory address used for the operation is carried in the RDMA message. The remote application needs to do nothing beyond registering the relevant memory buffers with its local NIC. The CPU of the remote node does not participate in incoming RDMA operations at all, so they impose no load on it.
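The user-space polling path mentioned above can be modelled with a few lines. This is a hypothetical sketch of a completion queue, not the real verbs API (libibverbs exposes this idea through calls such as `ibv_poll_cq`): the application repeatedly polls without any system call, and only sees an entry once the "NIC" has posted a completion.

```python
from collections import deque


class CompletionQueue:
    """Toy user-level completion queue, polled without entering the kernel."""

    def __init__(self):
        self._entries = deque()

    def post_completion(self, wr_id, status="success"):
        # In hardware, the NIC appends an entry when a work request finishes.
        self._entries.append({"wr_id": wr_id, "status": status})

    def poll(self):
        """Non-blocking poll: returns one completion entry, or None."""
        return self._entries.popleft() if self._entries else None


cq = CompletionQueue()
assert cq.poll() is None          # nothing finished yet; no syscall, no sleep

cq.post_completion(wr_id=7)       # the "NIC" reports work request 7 done
wc = cq.poll()
assert wc == {"wr_id": 7, "status": "success"}
assert cq.poll() is None          # the queue is drained
```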


---RDMA allows one computer to access another computer's memory directly, without the time-consuming trip through the processor that such data normally takes via the operating system and other software layers. As connection speeds exceed a server's processing power and memory bandwidth, the memory bottleneck worsens. RDMA reduces latency by reducing demand on bandwidth and processor overhead. It achieves this by implementing a reliable transport protocol in the NIC hardware and by supporting zero-copy networking and kernel bypass. Zero-copy networking lets the NIC transfer data directly to and from application memory, eliminating the copy between application memory and kernel buffers. Kernel bypass lets applications issue commands to the NIC without making kernel calls. RDMA requests travel from user space to the local NIC and over the network to the remote NIC without kernel involvement, which reduces the number of context switches between kernel space and user space when handling network traffic. RDMA is faster than current methods: connecting computers and storage systems over a commodity network with RDMA gives a faster hardware path, allowing many inexpensive servers to be combined into a more powerful database system without purchasing expensive machines. For systems where space and power consumption matter, an RNIC needs only a fraction of the power that a conventional NIC plus the host processor would consume to sustain full gigabit Ethernet transfers.
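The zero-copy idea has a small user-space analogue in Python's `memoryview`: a view hands out access to an existing buffer without duplicating its bytes, just as zero-copy networking lets the NIC read and write application memory in place instead of copying through kernel buffers. A minimal sketch:

```python
# A buffer standing in for application memory that the "NIC" writes into.
buf = bytearray(b"payload-from-nic")

view = memoryview(buf)
slice_no_copy = view[0:7]     # no bytes are copied when taking this slice

buf[0:7] = b"PAYLOAD"         # mutate the underlying buffer in place

# The view observes the change because it shares the same memory;
# a copied slice (bytes(buf)[0:7]) would still hold the old contents.
assert bytes(slice_no_copy) == b"PAYLOAD"
```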

2 Working principle of RDMA

---RDMA is a NIC technology that lets one computer place data directly into another computer's memory. By minimizing processing overhead and bandwidth demand, RDMA reduces latency. It reaches this goal in two ways: implementing a reliable transport protocol in the NIC hardware, and supporting kernel-bypass, zero-copy networking. Kernel bypass lets an application issue commands to the NIC without making kernel calls. When an application issues an RDMA read/write request, the system performs no data copy, which reduces the number of kernel/user-space context switches when handling network traffic. An RDMA request completes either entirely in user space, or through the kernel when the application chooses to sleep until the completion signal arrives. The remote virtual memory address used by an RDMA read or write is carried in the RDMA message; all the remote application has to do is register the corresponding memory buffer with its local NIC. The remote node's CPU provides no service during the RDMA operation, so the operation imposes no load on it. Through the use of a type value (a key), an application can protect its memory even while a remote application accesses it at arbitrary offsets. The application issuing an RDMA operation must supply the correct type value for the remote memory it wants to access; the remote application obtains this value when it registers the memory with its local NIC. The issuing application must also know the remote memory address and the type value of that region. The remote application communicates this information, including the starting virtual address, the memory size, and the region's type value, to the issuing application via a send operation before any RDMA operation on that region can be performed.
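The register-then-advertise handshake described above can be sketched as a toy model (class and method names are hypothetical; in real verbs APIs the key is called an rkey): the remote side registers a buffer and receives a key, and the "NIC" honours a remote read only when the initiator presents the matching key for that region, without involving the remote CPU.

```python
import secrets


class RemoteNode:
    """Toy model of NIC-side memory registration and key-checked RDMA reads."""

    def __init__(self):
        self._regions = {}   # key -> (buffer, base virtual address)

    def register_memory(self, buffer, base_addr):
        key = secrets.token_hex(4)           # stands in for the rkey
        self._regions[key] = (buffer, base_addr)
        # The application advertises (base_addr, len(buffer), key) to its
        # peer via a send operation before any RDMA on the region.
        return key

    def rdma_read(self, key, addr, length):
        # The "NIC" validates the key itself; the remote CPU never runs.
        if key not in self._regions:
            raise PermissionError("invalid remote key")
        buffer, base = self._regions[key]
        return bytes(buffer[addr - base : addr - base + length])


node = RemoteNode()
data = bytearray(b"remote memory contents")
rkey = node.register_memory(data, base_addr=0x1000)

assert node.rdma_read(rkey, 0x1000, 6) == b"remote"   # correct key: allowed

rejected = False
try:
    node.rdma_read("deadbeef", 0x1000, 6)             # wrong key: refused
except PermissionError:
    rejected = True
assert rejected
```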

 

---As shown in Figure 3, the service model consists of three layers: the network interconnect layer, the SCSI protocol layer, and the SCSI application layer.

 

---SCSI RDMA defines two RDMA operations: RDMA write and RDMA read. Both read and write are defined relative to the SCSI initiator.

 

---Zero-copy networking technology and its implementation

 

---Zero-copy networking is realized by implementing a reliable transport protocol in the NIC hardware and by supporting zero-copy transfers and kernel bypass. Request completion can be handled entirely in user space (by polling a user-level completion queue), or through the kernel when the application prefers to sleep until the request completes. The software part is responsible for implementing the protocol functions.

 

---APIs (application programming interfaces) include MPI (Message Passing Interface), used for low-latency messaging in high-performance computing, and DAPL (Direct Access Provider Library). The latter has two parts, kDAPL and uDAPL, for kernel and user (application) use respectively. Linux supports kDAPL, and other operating systems may support it in the future.
   
3 Applications of RDMA in data transfer
3.1 Applications of RDMA

---One advantage of RDMA is that it can be built on conventional network hardware, using the TCP/IP and Ethernet standards of the Internet. RDMA will be used to connect small servers into clusters capable of handling large databases that today require high-end servers with a dozen or more processors. Combined with TOE (TCP offload engines) and 10-gigabit Ethernet, RDMA is a very attractive technology, and it is rapidly becoming a basic feature of high-speed clusters and server-area networks.

---InfiniBand networks and networks implementing the Virtual Interface Architecture support RDMA, and RDMA over TCP/IP for NICs with transport offload engines is under development. Protocols that use RDMA for high performance include the Sockets Direct Protocol, the SCSI RDMA Protocol (SRP), and the Direct Access File System (DAFS). Communication libraries that use RDMA include the Direct Access Provider Library (DAPL), the Message Passing Interface (MPI), and the Virtual Interface Provider Library (VIPL). Clusters running distributed applications are one area where RDMA excels: using RDMA through DAPL or VIPL beneath the database software running on the cluster yields higher performance and better scalability with the same number of nodes. Scientific-computing applications that use MPI on clusters gain low latency, low overhead, and high throughput from RDMA-capable interconnects, which translates into large performance improvements. Other early RDMA applications include remote file-server access through DAFS and storage-device access through SRP. RDMA is rapidly becoming a fundamental feature of high-speed cluster systems and storage-area networks. iWARP/RDMA is one of the basic building blocks; another is iSER, the iSCSI extension for RDMA, which takes full advantage of RDMA's capabilities.

 


 

3.2 Applications in NAS and SAN

 

---Traditional direct-attached storage (DAS, Direct Attached Storage) is a server-centric storage architecture. It suffers from insurmountable drawbacks, including limited capacity, limited connection distance, and difficulty of sharing and management, and can no longer meet the application needs of the network era.

---The arrival of the network era brought great changes to storage technology. Network-attached storage (NAS, Network Attached Storage) and storage-area networks (SAN, Storage Area Network) can provide applications on the network with rich, fast, and convenient storage resources, while also allowing storage to be shared and centrally managed; they have become the ideal storage-management and application models today. NAS, however, has some problems that are hard to solve, such as limited transfer capability, limited scalability, limited backup capability, and no effective support for database services. DAFS integrates the advantages of RDMA with the storage capability of NAS: all read and write operations are executed directly through the RDMA driver, reducing the system load imposed by network file protocols. Future NAS systems will adopt DAFS to improve performance and to compete strongly with SAN systems on both performance and price.

3.3 InfiniBand

---InfiniBand's four major strengths are a standards-based protocol, 10 Gb/s performance, RDMA, and transport offload. InfiniBand networks and networks using the Virtual Interface Architecture support RDMA, as does RDMA over TCP/IP using NICs with transport offload engines. Servers that support InfiniBand use a host channel adapter (HCA), which translates the protocol onto the server's internal PCI-X or PCI Express bus. The HCA provides RDMA capability, sometimes also called kernel bypass.

4 Other topics

---Network storage technology is evolving rapidly, and the newest technologies, such as storage virtualization and grid storage, will bring further convenience. Storage virtualization abstracts the underlying storage devices for unified management, hiding the hardware particulars of storage devices from the server layer while preserving only their uniform logical properties, thereby enabling centralized, unified, and convenient management of the storage system. Unified virtual storage consolidates storage resources of all kinds from different vendors, such as FC-SAN, NAS, IP-SAN, and DAS, into a common storage pool that is managed, monitored, and used uniformly. The essence of virtual storage is resource sharing, so unified virtual storage has two tasks: increasing the amount of storage that can be shared, and providing better service on top of existing storage through effective mechanisms. From a system viewpoint, there are three approaches to storage virtualization: host-based, storage-device-based, and network-based. Unified virtual storage can only be realized by starting from this essence, and a single storage image may be the direction in which virtual storage develops.

---Because RDMA is a relatively recent storage technology, the information released or reported by the various vendors is neither detailed nor complete. On the basis of the standards, and drawing on these reports and on articles by several experts, I have put this article together for reference. Corrections and criticism are welcome.

 

http://blog.chinaunix.net/uid-140205-id-2849341.html
