Practice and exploration of cloud service capabilities based on RDMA

01

   Background   

      As systems built on big data and large models deliver ever more commercial value, the number of players in machine learning keeps growing and data volumes keep expanding. To solve the problem of efficiently synchronizing massive amounts of data between servers, RDMA (Remote Direct Memory Access) technology has gradually come into the view of network engineers. Why has RDMA become a hot technology for accelerating machine-learning networks? This article answers that question and shares practical details along the way.
02

   Pain points of traditional TCP communication   

      TCP (Transmission Control Protocol) is a core protocol in the Internet protocol suite, providing reliable, ordered, error-checked byte-stream delivery. It originated from ARPANET (Advanced Research Projects Agency Network), a network project run by the U.S. Department of Defense in the 1960s. At that time physical link bandwidth was only a few Mbps (megabits per second); today it has reached tens or even hundreds of Gbps (gigabits per second), and the TCP software stack designed for that era is no longer well suited to today's high-speed networks.

      Data center servers widely use the Linux operating system, and the process of sending and receiving messages is shown in the figure below.


Figure 1 Linux kernel sending and receiving message process

      The receive path is time-consuming mainly because of the task and stack switching triggered by interrupts, the two memory copies, and the long code path of the standard kernel protocol stack. This architecture can no longer meet the latency and bandwidth requirements of workloads such as high-performance computing and deep neural networks. If network bandwidth and latency do not improve, the CPU (Central Processing Unit) and GPU (Graphics Processing Unit) sit idle waiting for data, and upgrading or scaling out compute no longer translates into faster or more efficient business results.
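To make that overhead concrete, the sketch below (a minimal illustration, not OPPO code) implements a conventional TCP echo over loopback in Python. Every `recv()`/`sendall()` is a system call that copies the payload between kernel socket buffers and user memory, which is exactly the per-message cost the kernel path imposes and RDMA removes.

```python
import socket
import threading

def echo_server(sock):
    """Accept one connection and echo bytes back.

    Each recv()/sendall() crosses the user/kernel boundary: the kernel
    copies data between the socket buffer and the user-space buffer
    every time, plus interrupt-driven wakeups on the receive side."""
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)   # copy: kernel socket buffer -> user buffer
            if not data:
                break
            conn.sendall(data)       # copy: user buffer -> kernel socket buffer

def tcp_echo_roundtrip(payload: bytes) -> bytes:
    """Send payload through a loopback TCP echo server and return the echo."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))    # ephemeral port
    server.listen(1)
    t = threading.Thread(target=echo_server, args=(server,), daemon=True)
    t.start()
    with socket.create_connection(server.getsockname()) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            buf = client.recv(4096)
            if not buf:
                break
            chunks.append(buf)
    server.close()
    return b"".join(chunks)
```

Each round trip above involves at least four user/kernel copies; RDMA bypasses them all by letting the NIC DMA directly into registered application memory.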

     RDMA technology is committed to providing a lossless, ultra-low-latency, ultra-high-throughput network, freeing up CPU/GPU cycles, and it solves the bandwidth and latency problems of deep-neural-network and AI (Artificial Intelligence) training well. RDMA networking is therefore the preferred technology for large-model training platforms today.

03

   Introduction to RDMA technology   

3.1 RDMA technology introduction

      RDMA extends the capabilities of the network card: it can copy memory data between two communicating hosts without involving the CPU. RDMA has three implementation specifications, namely IB (InfiniBand), iWARP (Internet Wide Area RDMA Protocol), and RoCE (RDMA over Converged Ethernet). All three support the RDMA Verbs primitives and data types defined by the IBTA (InfiniBand Trade Association), so they present a unified service programming interface and allow services to switch between them seamlessly.

      IB requires dedicated IB network cards and IB switches. Its performance is excellent, but the cards and switches are expensive and compatibility is poor. The iWARP stack needs only general-purpose Ethernet switches plus Ethernet cards that support iWARP; its messages ride on TCP connections, which consume kernel resources, and its market acceptance is lower than RoCE's. The RoCE stack needs only general-purpose Ethernet NICs and switches, and uses PFC (Priority-based Flow Control) and ECN (Explicit Congestion Notification) congestion control to achieve lossless transmission. It offers good overall performance and compatibility at an affordable price, and it is widely accepted in the market. Users can choose the appropriate products for their scenarios and actual needs.

3.2 RoCE technical description 

     The latest version of RoCE is RoCEv2; the specification designates the IANA-assigned UDP (User Datagram Protocol) destination port 4791 to identify RoCEv2 packets. RoCEv1 supports only Layer-2 MAC (Media Access Control) communication, while RoCEv2 supports Layer-3 IP access, making packets routable and freeing services from the limitation of Layer-2 LAN deployment. RoCEv2 is also called RRoCE (Routable RoCE) and redesigns the congestion-control algorithm.

      RoCEv2 implements the network-layer functions of the original IB with UDP and IP header information, so Ethernet forwarding can load-balance on the UDP source port to improve throughput. RoCEv2 supports both IPv4 and IPv6.
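To see how little RoCE awareness the fabric needs, the sketch below (illustrative only; it models just the UDP header, not the full IB transport headers) classifies a packet as RoCEv2 by its destination port and picks an ECMP path by hashing the flow tuple, where the per-flow UDP source port is the varying input.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned destination port for RoCEv2

def build_udp_header(sport: int, dport: int, length: int) -> bytes:
    """Pack a UDP header (checksum left as 0 for brevity)."""
    return struct.pack("!HHHH", sport, dport, length, 0)

def is_rocev2(udp_header: bytes) -> bool:
    """Classify a packet as RoCEv2 purely by its UDP destination port."""
    _, dport, _, _ = struct.unpack("!HHHH", udp_header[:8])
    return dport == ROCEV2_UDP_DPORT

def ecmp_path(src_ip: str, dst_ip: str, sport: int, dport: int,
              n_paths: int) -> int:
    """Pick an equal-cost path by hashing the flow tuple.

    Because the RoCEv2 destination port is fixed at 4791, varying the
    UDP source port per flow is what spreads traffic across paths."""
    return hash((src_ip, dst_ip, sport, dport)) % n_paths
```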

Figure 2 RoCEv2 message description

3.3 RoCE performance

  • TCP vs RoCE
      Since our services need an RDMA-accelerated RPC (Remote Procedure Call) capability, we chose the Apache community's bRPC (better Remote Procedure Call) for latency and bandwidth evaluation. The results for the TCP and RoCEv2 scenarios are shown in Figure 3: RDMA clearly outperforms TCP in both bandwidth and latency, with bandwidth nearly doubling.

Figure 3 Comparison of TCP vs RoCE bandwidth and delay

  • RoCE vs IB

      With the recent popularity of the large-model application ChatGPT, more and more vendors have begun researching large-model training, pushing up demand for RoCE networks. We explored replacing IB with RoCE to provide a sensible acceleration solution for GPU training, and tested and compared the performance of the two. RoCE (100 Gbits/s maximum per card) reaches 92.8 Gbits/s with a single queue and 196 Gbits/s with 16 queues (two 100G cards combined into a bonded interface). IB (200 Gbits/s maximum per card) reaches 185 Gbits/s with a single queue and 196 Gbits/s with 16 queues. Latency is comparable, with a minimum of about 2 us. The RoCE solution is more cost-effective and better suited to large-scale deployment.

Figure 4 RoCE vs IB bandwidth comparison

04

   Building RDMA capabilities on OPPO Cloud   

4.1 RDMA resource scheduling platform

      In the cloud-native era, RDMA elastic scaling was planned from the start: users can request RDMA cards on demand to accelerate their workloads. The main system components are shown in Figure 5.

Figure 5 Resource scheduling architecture

      The RDMA cloudification work is described below along four dimensions: the infrastructure virtualization layer, the lossless-network congestion-control algorithm, resource management and scheduling, and the intelligent operations system.

  • Infrastructure virtualization layer

      A physical NIC is virtualized into several mutually independent sub-cards, and users can request one or more RDMA cards as needed. The network virtualization component reports the number of available RDMA sub-cards on each node to the system scheduler, which allocates and schedules them according to task requests, so containerized workloads can use RDMA.
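As an illustration of the bookkeeping involved (a toy model, not the actual scheduler code), the pool below tracks free sub-cards per node, reports capacity the way a node agent would, and binds sub-cards to a workload:

```python
class RdmaSubcardPool:
    """Toy model of per-node RDMA sub-card accounting.

    A real scheduler would learn capacities from the network
    virtualization component; here they are passed in directly."""

    def __init__(self, capacity_per_node):
        self.free = dict(capacity_per_node)   # node -> free sub-cards
        self.bindings = {}                    # pod  -> (node, count)

    def report(self):
        """Snapshot of free sub-cards, as a node agent would report it."""
        return dict(self.free)

    def allocate(self, pod, count):
        """Bind `count` sub-cards on the first node with enough capacity."""
        for node, avail in self.free.items():
            if avail >= count:
                self.free[node] -= count
                self.bindings[pod] = (node, count)
                return node
        raise RuntimeError(f"no node has {count} free RDMA sub-cards")

    def release(self, pod):
        """Return a pod's sub-cards to its node's free pool."""
        node, count = self.bindings.pop(pod)
        self.free[node] += count
```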

  • Lossless-network congestion-control algorithm

      RDMA traffic is marked on both servers and switches, and congestion is controlled with PFC and ECN. On servers, RDMA flows are identified by the DSCP (Differentiated Services Code Point) field; switches use DSCP for traffic classification, congestion management, and deadlock detection. The underlying network is built as a non-blocking multi-stage CLOS fabric, guaranteeing a sufficiently high acceleration ratio.
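On the server side, tagging a flow with a DSCP value is a single socket option: DSCP occupies the upper six bits of the IP TOS/Traffic Class byte. A minimal sketch (the value 26, i.e. AF31, is illustrative rather than our production code point):

```python
import socket

def mark_dscp(sock: socket.socket, dscp: int) -> None:
    """Set the DSCP code point on a socket.

    DSCP lives in the top 6 bits of the 8-bit TOS/Traffic Class
    field, so it is shifted left past the 2 ECN bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mark_dscp(sock, 26)   # switches classify and queue on this marking
```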

  • Resource management and scheduling

      Available NIC types (RoCE, IB, etc.) and their resources are managed digitally; based on users' service and resource requests, the system flexibly schedules compute instances onto the matching compute nodes.

  • Intelligent operations system

      RDMA NICs and switches are monitored so that resource utilization and system health are known in real time, enabling timely intervention for capacity expansion, fault handling, and other emergencies.

   4.2 The ORPC service   

4.2.1 ORPC design

      Although vendors provide general-purpose development interfaces, those interfaces target specific RDMA scenarios such as HPC and GPU training and cannot meet the needs of ordinary RPC services. RDMA involves dedicated low-level hardware, communication protocols, the specialized Verbs interface, and hard-to-read C code, so it is genuinely difficult for application teams to build RDMA applications on their own. ORPC lets services migrate to the RoCE network transparently: users focus on business development without worrying about kernel-driver adaptation, performance tuning, or software-version compatibility, achieving truly efficient migration.

      The OPPO Cloud platform developed and integrated ORPC (Oppo Remote Procedure Call), which natively supports RoCE and helps users switch to RDMA programming seamlessly. ORPC provides both TCP and RDMA communication and is compatible with both modes; users choose how to connect as needed, as shown in Figure 6. Its service compatibility is high: ORPC enables TCP and RDMA simultaneously, and each client chooses TCP or RDMA according to its own situation. ORPC can also flexibly adapt to the specific versions of middleware dependencies a service uses, such as Protocol Buffers and gflags.
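Conceptually, the per-client choice can be modeled as below (a hypothetical sketch, not the real ORPC API): TCP is the default, and RDMA is used only when requested and the named device exists, mirroring the behavior described here and in the parameter list later.

```python
def choose_transport(use_rdma, available_devices, device):
    """Model of a dual-mode channel's transport decision.

    The server enables both transports; each client picks one.
    TCP is the default, and requesting RDMA with a missing device
    is an error rather than a silent fallback."""
    if not use_rdma:
        return "tcp"                 # default mode
    if device not in available_devices:
        raise ValueError(f"RDMA device {device!r} not found")
    return f"rdma:{device}"
```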

Figure 6 ORPC communication scenarios

      ORPC calls the Verbs interface directly. We did not adopt the community's UCX (Unified Communication X) framework because the extra C-library call introduces a dependency on a third-party product, and its measured performance and stability did not meet expectations. The third-party RDMA_CM library also has poor compatibility with virtualized container networks; it runs unstably and can crash the system.

      ORPC supports C++ first; a Golang version of ORPC is planned later.

4.2.2 ORPC in practice

Our measurements show that ORPC improves bandwidth significantly in both single-flow and multi-flow scenarios; a 100G NIC reaches roughly 80-90 Gbits/s. We validated RoCE acceleration on our inference and training workloads and found that it significantly improves inference performance, with clear gains. Other services troubled by latency or bandwidth can also try RoCE acceleration and should see some benefit.

Figure 7 ORPC bandwidth comparison

  • Service adaptation

      By default the server works in both RDMA mode and TCP session mode; users can set the working mode via parameters. In RDMA mode, the RDMA device must be specified: if it is omitted or the port index is wrong, the program aborts. ORPC's session-setup interface needs only the essential connection parameters, and the application logic requires no changes.

1) Required server parameters:

-rdma_device     the RDMA device to use

-rdma_gid_index  the GID index of the RDMA device

-use_rdma        true for RDMA; default is TCP

-port            the listening port

2) Client parameters:

-rdma_device     the RDMA device to use

-rdma_gid_index  the GID index of the RDMA device

-use_rdma        true for RDMA; default is TCP
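For illustration, the flag semantics above can be mirrored with a small parser (a hypothetical Python stand-in; the real ORPC binaries use gflags in C++), including the rule that RDMA mode requires a device:

```python
import argparse

def parse_orpc_flags(argv):
    """Parse ORPC-style flags as described above (illustrative only)."""
    p = argparse.ArgumentParser(prog="orpc_server")
    p.add_argument("-rdma_device", default=None,
                   help="RDMA device to use")
    p.add_argument("-rdma_gid_index", type=int, default=0,
                   help="GID index of the RDMA device")
    p.add_argument("-use_rdma", type=lambda s: s == "true", default=False,
                   help="true for RDMA; default is TCP")
    p.add_argument("-port", type=int, default=8000,
                   help="listening port")
    args = p.parse_args(argv)
    # Mirror the documented failure mode: RDMA without a device is an error.
    if args.use_rdma and args.rdma_device is None:
        p.error("-rdma_device is required when -use_rdma=true")
    return args
```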

  • Auto-generated RoCE YAML file
apiVersion: v1
kind: Pod
metadata:
  name: if-roce-test231
  namespace: nethouse-test
spec:
  nodeName: 1x.x.x.231
  containers:
    - name: if-roce-test231
      image: hub.x.y.z/inference/inference:rdma-1.0.0
      resources:
        limits:
          devices.csp.io/rdma: "1"
        requests:
          devices.csp.io/rdma: "1"
      volumeMounts:
        - mountPath: /gx-infer
          name: gx-infer
  volumes:
    - name: gx-infer
      hostPath:
        path: /home/service/var/data
        type: Directory
... ...
  • Container run command
#docker run --net container:7a9bc59dd57afe8e91504ecefcdf720097fb919fc76a9aab7ba13ac265b93799 --privileged --device=/dev/infiniband/rdma_cm --device=/dev/infiniband/uverbs1 --name orpc ... hub.x.y.z/orpc-rdma/orpc-rdma-depy:v1.0
05

   RDMA application prospects and outlook   

      RDMA technology, represented by RoCE, is showing exciting application potential. ORPC is just one application, and its practical gains have been endorsed by deep-learning inference and other workloads. We believe that in the near future, RoCE-based large-model training platforms will become more and more mature, and NVMe over RoCE will also land step by step.

About the author

Junwei Wang  

Senior backend engineer at OPPO

Mainly responsible for the design and implementation of cloud computing network architecture, with a long-term focus on practicing and innovating new network technologies.

END
About AndesBrain

AndesBrain

OPPO AndesBrain is a pan-terminal intelligent cloud serving individuals, families, and developers, committed to "making devices more intelligent". It provides device-cloud collaborative data storage and intelligent computing services, serving as the "digital intelligence brain" for the converged Internet of Everything.

This article is shared from the WeChat official account AndesBrain (OPPO_tech).
