New ideas for bandwidth optimization: RoCE network card aggregation for 2x bandwidth

This article is shared from the Huawei Cloud Community post "Bond aggregation of two RoCE network cards to achieve bandwidth X2", by tsjsdbd.

We know that an operating system can combine two physical network cards into one "logical network card" for purposes such as active-standby failover or increased bandwidth. But do RoCE network cards support bonding the way ordinary network cards do? The answer is yes: RoCE NICs can also be bonded, although with more constraints than ordinary NICs.

In this article we walk through the whole process and note what to watch out for along the way. Discussion and corrections are welcome.

1. RoCE network card link aggregation (LAG)

According to the Mellanox documentation (https://mellanox.my.site.com/mellanoxcommunity/s/article/How-to-Configure-RoCE-over-LAG-ConnectX-4-ConnectX-5-ConnectX-6), RoCE NIC bonding supports only three modes:

  • Mode 1 (active-backup)
  • Mode 2 (balance-xor, load balancing)
  • Mode 4 (802.3ad link aggregation)

Compared with the seven modes (0 through 6) available to ordinary network cards, this is quite restrictive. Fortunately, the "increased bandwidth" mode we want is still available.
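
For reference, here is a minimal sketch of the Linux bonding mode numbers and how to check which mode a bond is running (the sysfs path is standard; bond0 is the interface we create in the next section):

# Linux bonding modes (RoCE LAG supports only 1, 2 and 4):
#   0 balance-rr    1 active-backup   2 balance-xor   3 broadcast
#   4 802.3ad       5 balance-tlb     6 balance-alb
cat /sys/class/net/bond0/bonding/mode   # prints e.g. "802.3ad 4"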

2. The server performs dual network card aggregation (bond)

Different operating systems configure bonding with different commands. Here I am working on Ubuntu 22.04, using the built-in netplan tool. The procedure is as follows:

Edit the netplan configuration:

vi /etc/netplan/00-installer-config.yaml
network:
  ethernets:
    ens3f0np0:
      dhcp4: no
    ens3f1np1:
      dhcp4: no
  version: 2
  renderer: networkd
  bonds:
    bond0:
      interfaces: [ens3f0np0, ens3f1np1]
      parameters:
        mode: 802.3ad
        mii-monitor-interval: 1
        lacp-rate: fast
        transmit-hash-policy: layer3+4
      addresses: [10.10.2.20/24]

Execute:

netplan apply

After that, you will see a network interface named "bond0".
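
A quick way to confirm that LACP actually negotiated (standard kernel interfaces; names as configured above):

cat /proc/net/bonding/bond0   # should report "IEEE 802.3ad Dynamic link aggregation",
                              # the fast LACP rate, layer3+4 hashing, and both slaves
ip -br link show bond0        # the bond interface should be UP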

Two parameters in this bond configuration deserve attention:

(1) The bond mode: mode 4, i.e. 802.3ad (link aggregation).

(2) transmit-hash-policy, the load-balancing policy, which takes one of three values:

(Table: the three transmit-hash-policy values: layer2, layer2+3, layer3+4)

Since RDMA traffic here is point-to-point, the source and destination IP and MAC addresses never change, so a layer2 or layer2+3 hash would pin all traffic to one physical link. We therefore choose layer3+4: when sending messages the source port still varies, which lets the hash spread traffic across both links.
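
To see whether traffic really spreads across both slaves, you can watch the per-slave transmit counters while a test runs (standard sysfs statistics; interface names as above):

watch -n 1 "cat /sys/class/net/ens3f0np0/statistics/tx_bytes /sys/class/net/ens3f1np1/statistics/tx_bytes"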

For reference, the equivalent CentOS operations:

Create a new bond interface:

nmcli con add type bond ifname tsjbond0 bond.options "mode=2,miimon=100,updelay=100,downdelay=100"

Add the two slave NICs:

nmcli con add type ethernet con-name bond-slave-enp80s0f0 ifname enp80s0f0 master tsjbond0
nmcli con add type ethernet con-name bond-slave-enp80s0f1 ifname enp80s0f1 master tsjbond0

Activate the slave connections:

nmcli con up bond-slave-enp80s0f0
nmcli con up bond-slave-enp80s0f1

Modify the bond interface configuration (add the IP address):

vi /etc/sysconfig/network-scripts/ifcfg-bond-tsjbond0
IPADDR=29.28.195.228
NETMASK=255.255.240.0

Modify the two slave NIC configuration files (shown for enp80s0f0; repeat for enp80s0f1):

vi /etc/sysconfig/network-scripts/ifcfg-enp80s0f0
DEVICE=enp80s0f0
TYPE=Ethernet
ONBOOT=yes
MASTER=tsjbond0
SLAVE=yes
BOOTPROTO=none

Re-activate the slaves and the bond:

ifup bond-slave-enp80s0f0
ifup bond-slave-enp80s0f1
ifdown bond-tsjbond0
ifup bond-tsjbond0
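
As a sanity check on CentOS (assuming the connection names used above):

nmcli con show --active          # the bond and both slave connections should be listed
cat /proc/net/bonding/tsjbond0   # same kernel-level view as on Ubuntu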

3. The server enables PFC flow control for the new network card.

First, execute the following command to set the MTU:

ifconfig bond0 mtu 4200

Then enable the PFC flow-control policy on queue 4:

mlnx_qos -i ens3f0np0 --pfc=0,0,0,0,1,0,0,0 --trust=dscp
mlnx_qos -i ens3f1np1 --pfc=0,0,0,0,1,0,0,0 --trust=dscp
cma_roce_mode -d mlx5_bond_0 -p 1 -m 2
echo 128 > /sys/class/infiniband/mlx5_bond_0/tc/1/traffic_class

Note that the first two commands enable PFC on each slave NIC under the bond; PFC has to be configured per physical port.

The device name mlx5_bond_0 can be found with the ibdev2netdev command.
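
For example, the output looks roughly like this (illustrative; your device and port names may differ):

ibdev2netdev
# mlx5_bond_0 port 1 ==> bond0 (Up)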

The last echo 128 command forces the traffic class of packets sent by the NIC to 128. Traffic class 128 corresponds to DSCP 32 (128 >> 2), which under the usual dscp2prio mapping lands in send queue 4 (32 / 8). This step is optional: setting NCCL_IB_TC=128 achieves the same effect. For details, see the article "Why NCCL_IB_TC=128 must be set for AI training on Huawei Cloud".
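
A quick sanity check of that arithmetic, plus a way to inspect the live QoS tables (running mlnx_qos without set options simply prints the current configuration):

echo $(( 128 >> 2 ))         # traffic class 128 -> DSCP 32
echo $(( (128 >> 2) / 8 ))   # DSCP 32 -> priority queue 4
mlnx_qos -i ens3f0np0        # shows the current PFC state and dscp2prio mapping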

4. The switch performs dual network port aggregation (LACP)

Different switches use different commands to enable LACP. The switch here is a Huawei CE9860. Configure it as follows:

Create the Eth-Trunk interface:

interface Eth-Trunk1
port link-type trunk
mode lacp-static

Then enter each member port and add it to the trunk:

interface GigabitEthernet0/0/1
eth-trunk 1
 
interface GigabitEthernet0/0/2
eth-trunk 1
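
To check that both member ports joined the trunk, something like this should work in the Huawei CLI (exact output varies by model):

display eth-trunk 1   # lists member ports and their LACP status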

That is the general idea of the CLI operation. In addition, the LACP load-balancing policy is set by modifying the default load-balance profile:

eth-trunk hash-mode ?
  INTEGER<1-9> Different hash mode provide different load distribution result for egress traffic flows from a trunk, the default is 1
  For Eth-Trunk, mode 1 is suggested
  For SMAC change, mode 1/2/6/7 is suggested
  For SIP change, mode 1/5/7/9 is suggested
  For DIP change, mode 5/6 is suggested
  For DMAC&SMAC change, mode 9 is suggested
  For SMAC+SIP change, mode 5/6 is suggested

The default value is 1.
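
Changing it would look roughly like this (a sketch based on the help output above; the hash mode number is an illustrative choice, not a recommendation):

system-view
load-balance profile default
 eth-trunk hash-mode 5   # e.g. mode 5, one of the suggested modes for SIP variation
commit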

5. The switch enables PFC flow control for the corresponding port.

Execute on the switch:

qos buffer headroom-pool size 20164 cells slot 1
interface 400GE x/x/x
trust dscp
dcb pfc enable mode manual
dcb pfc buffer 4 xoff dynamic 4 hdrm 3000 cells
commit

The commands above not only turn on PFC but also set the buffer sizes for the port. The specific parameter values are up to your environment.

6. RDMA traffic bandwidth test

Here is the bandwidth test we commonly use.

First, start the server side:

ib_write_bw -s 8388608 -F --run_infinitely -x 3 -q 8 --report_gbits

Then start the client, pointing it at the server:

ib_write_bw -s 8388608 -F --run_infinitely -x 3 10.10.2.20 -q 8 --report_gbits

The -x parameter selects GID index 3, which corresponds to the RoCE v2 protocol.

The --run_infinitely flag keeps the test running until you stop it.

-q sets the number of QPs (queue pairs) used in parallel. It corresponds to NCCL's NCCL_IB_QPS_PER_CONNECTION; you can try larger values and observe the effect, as in the sketch below.
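
For NCCL-based training jobs, roughly equivalent settings are (these NCCL environment variables are real knobs; the values simply mirror the flags above):

export NCCL_IB_GID_INDEX=3            # like -x 3: select the RoCE v2 GID
export NCCL_IB_TC=128                 # matches the traffic_class=128 set earlier
export NCCL_IB_QPS_PER_CONNECTION=8   # like -q 8: multiple QPs per connection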

An example of the result is as follows:

(Screenshot: example ib_write_bw bandwidth output)

7. Server-side statistics

Query the number of packets in queue 4:

watch -n 2 "ethtool -S ens3f0np0 | grep prio4"

(Screenshot: ethtool output showing the prio4 counters)

These counters only ever increase and are awkward to reset; they apparently survive even a server reboot.

The only way I found to clear the statistics (should you need to) is to unload and reload the driver modules:

rmmod mlx5_ib
rmmod mlx5_core
modprobe mlx5_core   # note: this briefly takes the NIC offline; load mlx5_ib again if it is not pulled in automatically

Query the network card temperature:

mget_temp -d mlx5_bond_0

The reported temperature is typically around 62-63 °C.

8. Summary

This article is just a record of my own operations, shared for discussion; it is not necessarily best practice, so take from it what you find useful.

After all, here is what the official Mellanox article (https://mellanox.my.site.com/mellanoxcommunity/s/article/How-to-Configure-RoCE-over-LAG-ConnectX-4-ConnectX-5-ConnectX-6) itself says:

(Screenshot: the relevant passage from the Mellanox article)
