RDMA vs. TCP/IP: A Technology and Performance Comparison Based on SmartX Distributed Storage

Author: SmartX Financial Team

Background

In the previous article, "The Independent R&D Journey of Distributed Block Storage ZBS | Architecture," we briefly introduced the architectural principles of SmartX distributed block storage ZBS. In this article, we give readers an in-depth look at one of the most important technologies in ZBS storage: RDMA.

ZBS currently uses RDMA technology at two levels: the storage access network and the internal storage data synchronization network. To make the concepts easier to follow and the storage performance comparison more targeted, we cover them in two separate articles. This article focuses on RDMA (remote direct memory access) technology and discusses it in detail in the context of ZBS internal storage data synchronization (which was the first place where ZBS implemented its RDMA support).

ZBS Internal Storage Data Synchronization

One of the most important differences between a distributed storage system and centralized storage is the architecture. To guarantee data consistency and reliability, a distributed architecture must rely on the network for data synchronization. Take a ZBS storage cluster of three nodes (A/B/C) as an example, with data protected by two replicas (each piece of data is stored in two copies placed on different physical nodes). Assume the data is stored on nodes A and B. When the data is modified, ZBS distributed storage must complete the modification on both nodes A and B before returning an acknowledgement. During this process, nodes A and B synchronize the modification over the storage network.

This example should make clear that data synchronization efficiency has a significant impact on distributed storage performance. It is one of the important directions for distributed storage performance optimization and is the focus of this article; a minimal sketch of this replicated write path follows Figure 1.

Figure 1: Distributed storage data synchronization network
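
To make the write path concrete, the sketch below shows the acknowledgement logic in C. It is purely illustrative and not ZBS source code; the `write_replica` helper and the node names are hypothetical stand-ins for shipping the modified block over the storage network and waiting for the replica's acknowledgement.

```c
/* Illustrative two-replica write sketch (not ZBS source code). */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: ship the modified block to one replica over the
 * storage network and block until that node acknowledges the write. */
static bool write_replica(const char *node, const void *data, size_t len)
{
    (void)data; /* payload contents are irrelevant to this sketch */
    printf("replicating %zu bytes to node %s\n", len, node);
    return true; /* pretend the replica acknowledged successfully */
}

/* The client-facing write completes only after BOTH copies are durable,
 * which is why storage-network latency shows up directly in write latency. */
static bool replicated_write(const void *data, size_t len)
{
    return write_replica("A", data, len) && write_replica("B", data, len);
}

int main(void)
{
    char block[4096] = {0};
    if (replicated_write(block, sizeof(block)))
        printf("write acknowledged to the client\n");
    return 0;
}
```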

Under today's typical workload requirements, ZBS storage networks are usually built with 10GbE Ethernet switches and server NICs, using standard TCP/IP as the network transport protocol. For high-bandwidth, low-latency workloads, however, such a configuration clearly becomes a performance bottleneck for internal storage data synchronization. At the same time, to take advantage of the stronger I/O performance of new high-speed storage media (such as NVMe drives), combining RDMA with 25GbE or faster networks is a better choice for meeting the higher storage performance requirements of the business side.

TCP/IP

Being software defined and built on general-purpose, standard hardware is one of the important features that distinguishes distributed storage from traditional storage. For many years, ZBS has used the TCP/IP network protocol stack for internal storage communication. The advantage is maximum compatibility with existing Ethernet while meeting the workload requirements of the vast majority of customers. However, TCP/IP networking is gradually becoming unable to keep up with the demands of higher-performance computing, mainly because of the following two limitations:

  • Latency introduced by TCP/IP protocol stack processing

When the TCP protocol stack receives or sends packets, the kernel performs multiple context switches, and each switch adds transmission latency. In addition, data is copied multiple times along the transfer path and the CPU must handle protocol encapsulation, so protocol stack processing alone contributes latency on the order of tens of microseconds (see the socket sketch after Figure 2).

  • Higher server CPU consumption caused by TCP protocol stack processing

Besides latency, TCP/IP networking requires the host CPU to participate in multiple protocol stack memory copies. The larger the distributed storage network and the higher the required network bandwidth, the heavier the CPU load when sending and receiving data, and the more CPU resources are consumed (which is particularly unfriendly to hyper-converged architectures).

Figure 2: TCP/IP Socket communication
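
As a concrete reference for the two limitations above, here is a minimal TCP loopback example in C (illustrative only, not ZBS code). Every `send()` and `recv()` is a system call that switches into the kernel and copies data between user buffers and kernel socket buffers, which is exactly the per-transfer overhead described above.

```c
/* Minimal TCP loopback example (illustrative, not ZBS code).
 * It highlights where the kernel is involved on every transfer. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Server side: listening socket bound to an ephemeral loopback port. */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
                                .sin_port = 0 };
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(srv, (struct sockaddr *)&addr, &len); /* learn chosen port */
    listen(srv, 1);

    if (fork() == 0) {                      /* child process acts as client */
        int cli = socket(AF_INET, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&addr, sizeof(addr));
        char msg[] = "block data";
        /* send(): context switch into the kernel plus a user-to-kernel copy
         * into the socket buffer; the kernel TCP/IP stack then segments,
         * checksums and transmits the data. */
        send(cli, msg, sizeof(msg), 0);
        close(cli);
        _exit(0);
    }

    int conn = accept(srv, NULL, NULL);
    char buf[64];
    /* recv(): another context switch plus a kernel-to-user copy. */
    ssize_t n = recv(conn, buf, sizeof(buf), 0);
    printf("received %zd bytes: %s\n", n, buf);
    close(conn);
    close(srv);
    return 0;
}
```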

RDMA

RDMA stands for Remote Direct Memory Access. DMA (direct memory access) is the technique by which a device reads and writes host memory directly, without going through the CPU.

Figure 3: DMA

RDMA provides a new and efficient way to reduce TCP/IP network transmission latency and CPU resource consumption. With direct memory access, data moves quickly from one system into the memory of a remote system without passing through the kernel network protocol stack and without time-consuming CPU processing, ultimately achieving high bandwidth, low latency, and low CPU usage.

There are currently three ways to implement RDMA:

Figure 4: RDMA implementation options (image source: SNIA)

InfiniBand (IB) is a full-stack architecture that provides RDMA, covering the programming interface, layer 2 through layer 4 protocols, NICs, and switches as a complete RDMA solution. The InfiniBand programming interface is also the de facto standard for RDMA programming; both RoCE and iWARP are programmed against the InfiniBand interface.

RoCE (RDMA over Converged Ethernet) and iWARP (often expanded as Internet Wide Area RDMA Protocol, which is not accurate; the RDMA Consortium has explicitly stated that iWARP is not an acronym) both encapsulate the InfiniBand programming interface for transport over Ethernet. RoCE has two versions: RoCEv1 defines its own network and transport layer protocols, so it is not routable (it is more of a transitional protocol and sees little use); RoCEv2 runs on top of UDP/IP and is routable. iWARP is built on top of TCP.

Unlike RoCE, which is inherited from InfiniBand, iWARP did not evolve directly from InfiniBand. InfiniBand and RoCE are both based on the InfiniBand Architecture Specification, commonly called the "IB specification," while iWARP follows its own set of protocol standards designed by the IETF. Although it follows a different standard, iWARP's design was heavily influenced by InfiniBand, and today all three use the same programming interface (Verbs*). At the conceptual level, the three protocols do not differ.

* In RDMA terminology, a Verb is an abstraction of a network adapter capability. Each Verb is a function that represents one RDMA operation, such as sending or receiving data, or creating and destroying RDMA objects.
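
To give a feel for what programming against the Verbs interface looks like, here is a minimal resource-setup sketch in C using libibverbs. It is a sketch under assumptions, not ZBS code: it assumes a standard rdma-core/libibverbs environment and an RDMA-capable NIC, and it stops short of connection establishment (the queue-pair INIT→RTR→RTS transitions and the out-of-band exchange of QP number, GID, and rkey are omitted).

```c
/* Minimal libibverbs setup sketch (assumes rdma-core and an RDMA NIC;
 * illustrative only, not ZBS source code).
 * Build: gcc rdma_sketch.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096

int main(void)
{
    /* 1. Open the first RDMA-capable device found. */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);

    /* 2. Protection domain: groups resources allowed to work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* 3. Register a memory region: pins the buffer so the NIC can DMA
     *    into and out of it directly, bypassing kernel copies. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* 4. Completion queue: the NIC reports finished work requests here;
     *    the application polls completions instead of copying payloads. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* 5. Reliable-connected queue pair: the RDMA counterpart of a socket. */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

    printf("QP number %u, rkey 0x%x (exchanged with the peer out of band)\n",
           qp->qp_num, mr->rkey);

    /* Connection setup (INIT -> RTR -> RTS and peer address exchange)
     * is omitted in this sketch. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

Native InfiniBand, RoCE, and iWARP NICs all expose this same Verbs API, which is why an application written this way does not care which of the three transports carries the traffic underneath.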

RDMA requires ecosystem support from device vendors (NICs and switches). The protocol support of mainstream network vendors is as follows:

InfiniBand is closed from the protocol down to the hardware and software. Although its performance is the best, its cost is also the highest, because the entire stack has to be replaced, including NICs, cables, and switches. This is not friendly to a distributed storage scenario built on general-purpose, standardized hardware, so it was the first option ruled out when ZBS made its choice.

Between RoCE and iWARP: although RoCE is constrained by the UDP protocol itself in data retransmission and congestion control and therefore requires a lossless network environment, after weighing factors such as ecosystem, protocol efficiency, and complexity, SmartX is more optimistic about the future of RoCE, and under extreme performance requirements RoCE also has more potential than iWARP. The ZBS internal storage data synchronization network currently uses the RoCEv2 RDMA technology route.

Figure 5: ZBS RDMA RoCEv2

Performance Validation Data

To make the comparison between RDMA and TCP/IP as direct as possible, the test environments were kept strictly identical, including hardware configuration, system version, and related software versions. The only variable was whether the RDMA capability for internal storage data synchronization was enabled or disabled, and the performance of the test cluster was measured in both states.

Environment Information

The storage cluster consists of 3 nodes running SMTX OS 5.0 with a tiered storage structure. All storage nodes have identical hardware configurations; the node environment information is as follows:

Performance Data

Using the same test environment and methodology, performance was measured with RDMA and with TCP/IP. To better observe cross-node read/write I/O performance (ZBS distributed storage has data locality by default, which clearly benefits read I/O), this test was run on the Data Channel plane (the internal ZBS RPC channel used to send and receive data requests between nodes). The test only evaluates the network performance difference; read and write I/O is not persisted to disk.

Performance Comparison Data

Test Conclusions

The benchmark data above shows that, with the same hardware, software, and test methodology, using RDMA as the internal storage data synchronization protocol yields better I/O performance: higher 4K random IOPS, lower latency, and, in the 256K sequential read/write scenario, higher data throughput by fully utilizing the available network bandwidth (25GbE).

Summary

Through the theoretical introduction and objective performance test data in this article, we hope readers now have a more complete understanding of the RDMA protocol. RDMA's optimization of data communication across the network is already applied in many enterprise scenarios. Distributed storage is one important scenario: with RDMA, it improves the efficiency of internal data synchronization and in turn delivers better storage performance for business applications with more demanding workloads.

References:

1. RDMA over Converged Ethernet. Wikipedia.

https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

2. How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics. SNIA.

https://www.snia.org/sites/default/files/ESF/How_Ethernet_RDMA_Protocols_Support_NVMe_over_Fabrics_Final.pdf

