The Future Is Here: Tencent's AI Computing Network

Author: Goose Factory Network, publishing in the Cloud + Community

"Goose Factory Network" is operated by the Network Platform Department of the Technology Engineering Business Group of Shenzhen Tencent Computer System Co., Ltd. We hope to exchange and learn the latest network and server industry dynamic information with like-minded partners in the industry, and share Tencent's network and server field. , practical dry goods at planning, operation, R&D, service and other levels, and look forward to growing together with you.

There is no doubt that artificial intelligence has been the hottest research direction in the IT industry in recent years. Especially since AlphaGo's landmark victory in 2016, technology giants at home and abroad have kept increasing their investment in artificial intelligence. Today, the main branches of AI, such as image recognition and speech recognition, all rely on machine learning, using powerful computing platforms to analyze massive amounts of data. To satisfy this fast-growing computing demand, high-performance computing (HPC) clusters are needed to further raise computing power.

An HPC cluster is a distributed system that organizes multiple computing nodes for collaborative computing. It generally uses RDMA (Remote Direct Memory Access) technologies such as iWARP, RoCE, or InfiniBand to exchange data quickly between the memories of its computing nodes. As shown in Figure 1, an RDMA NIC can fetch data from the sending node's address space and transfer it directly into the receiving node's address space; the whole exchange bypasses the operating system kernel, greatly reducing processing delay on the server side. At the same time, since the network is part of the HPC cluster, any transmission blockage wastes computing resources. To squeeze the most computing power out of the cluster, the network is usually required to deliver RDMA traffic within 10us. For a network supporting HPC, latency is therefore the primary indicator affecting cluster computing performance.
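
To make the data path concrete, here is a minimal sketch, purely illustrative, of what an RDMA WRITE work request carries. Real RDMA code uses the C verbs API (or bindings such as pyverbs); the field names below follow verbs conventions but this is a model, not a real binding.

```python
from dataclasses import dataclass

@dataclass
class RdmaWriteWR:
    """Illustrative model of an RDMA WRITE work request (not a real verbs type)."""
    laddr: int   # local buffer address (in a registered memory region)
    lkey: int    # local memory key returned by registration
    raddr: int   # remote virtual address to write into
    rkey: int    # remote key granting the NIC access to that region
    length: int  # bytes to transfer

def post_write(send_queue: list, wr: RdmaWriteWR) -> None:
    """Queue the work request; the NIC, not the kernel, executes it.

    In a real stack this step is ibv_post_send(): the CPU only builds
    the descriptor, and the NIC moves the payload directly between the
    two address spaces.
    """
    send_queue.append(wr)

# Example: ask the NIC to copy 1 MB into the receiver's memory.
sq: list = []
post_write(sq, RdmaWriteWR(laddr=0x7F000000, lkey=0x11,
                           raddr=0x55000000, rkey=0x22,
                           length=1 << 20))
```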

Figure 1: RDMA interconnect architecture

In actual deployment, the main factors affecting network latency are:

  1. Hardware delay. Device forwarding time, the number of forwarding hops, and fiber distance all affect network latency. The optimizations are to use a two-level fat-tree to cut the number of forwarding tiers, to upgrade the link rate so that data is serialized at a higher baud rate, and to deploy low-latency switches (as low as 0.3us per hop); a rough latency-budget sketch follows this list.
  2. Network packet loss. When network congestion overflows a buffer and packets are dropped, the server side must retransmit the whole data segment, causing severe delay. Common mitigations include: enlarging switch buffers and network bandwidth to better absorb congestion, optimizing application-layer algorithms to avoid incast scenarios and reduce congestion points, and deploying flow-control technologies that notify the source to slow down and eliminate congestion.
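
To see why hop count, link rate, and fiber length all matter, here is a rough one-way latency budget. The numbers are illustrative assumptions (light propagates in fiber at roughly 5 ns per meter; per-hop delay as cited above), not measurements of any particular fabric.

```python
# Rough one-way latency budget for a message crossing a fabric.
FIBER_NS_PER_M = 5.0  # light in fiber travels ~5 ns per meter

def path_latency_us(hops: int, per_hop_us: float, fiber_m: float,
                    msg_bytes: int, link_gbps: float) -> float:
    switching = hops * per_hop_us                         # forwarding
    propagation = fiber_m * FIBER_NS_PER_M / 1000.0       # fiber, in us
    serialization = msg_bytes * 8 / (link_gbps * 1000.0)  # on the wire, in us
    return switching + propagation + serialization

# Two-level fat-tree (3 switch hops), 100 m of fiber, 4 KB message at 100G:
print(path_latency_us(hops=3, per_hop_us=0.3, fiber_m=100,
                      msg_bytes=4096, link_gbps=100))  # ~1.73 us
```

Even this idealized path consumes a noticeable fraction of the 10us budget, which is why any extra hops, or a single retransmission after loss, dominates so quickly.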

The hardware environment of a data center network is relatively fixed, and upgrading hardware yields only limited latency reduction, so most of the time latency is lowered by reducing network congestion. For HPC networks, the industry therefore concentrates on research into "lossless networks". The more mature solutions today fall into two camps: lossy networks with flow-control protocols, and purpose-built lossless networks.

Commonly used network solutions in the industry

Lossy Networking and Flow Control Protocols

Ethernet uses "best effort" forwarding: each network element does its best to deliver data to the downstream element without regard for the other side's forwarding capacity, which can congest the downstream element and drop packets. Ethernet is therefore a lossy network that does not guarantee reliable transmission. Data centers commonly carry traffic over the reliable TCP protocol, but Ethernet RDMA packets are mostly UDP packets, so buffer-management and flow-control technologies must be deployed to reduce packet loss on the network side.
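
Concretely, RoCEv2 encapsulates RDMA in UDP with the IANA-assigned destination port 4791, which is how switches and NICs typically recognize it and steer it into a protected queue. A minimal classification sketch (the queue number is an assumption; it is a common convention, not a standard):

```python
ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def classify_priority(ip_proto: int, udp_dport: int) -> int:
    """Map a packet to an egress queue.

    Queue 3 for RDMA is a common but site-specific convention,
    assumed here for illustration.
    """
    if ip_proto == 17 and udp_dport == ROCEV2_UDP_PORT:  # 17 = UDP
        return 3   # lossless queue, protected by PFC/ECN
    return 0       # default best-effort queue

print(classify_priority(17, 4791))  # -> 3
```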

PFC (Priority Flow Control) is a queue-based backpressure protocol: a congested network element sends Pause frames to tell the upstream element to slow down, preventing buffer overflow and packet loss. In a single-hop scenario PFC can quickly and effectively throttle the server rate to keep the network lossless, but in a multi-tier network it can suffer from head-of-line blocking (as shown in Figure 2), unfair rate reduction, and PFC storms; worse, an abnormal server injecting PFC frames into the network may paralyze the entire network. To enable PFC in a data center, Pause frames must therefore be strictly monitored and managed to keep the network reliable.

Figure 2: Head-of-line blocking in PFC
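
For reference, a PFC frame (IEEE 802.1Qbb) is a MAC control frame: EtherType 0x8808, opcode 0x0101, a class-enable vector, and eight per-priority pause timers. The sketch below packs one such frame; pausing priority 3 for the RDMA queue is an assumed convention.

```python
import struct

PFC_DMAC = bytes.fromhex("0180c2000001")  # reserved MAC control address
ETHERTYPE_MAC_CTRL = 0x8808
PFC_OPCODE = 0x0101                       # 0x0001 would be classic global PAUSE

def build_pfc_frame(smac: bytes, pause_quanta: list[int]) -> bytes:
    """Build a PFC frame pausing every priority with a nonzero quanta."""
    assert len(pause_quanta) == 8
    enable_vector = 0
    for prio, quanta in enumerate(pause_quanta):
        if quanta:
            enable_vector |= 1 << prio
    return (PFC_DMAC + smac
            + struct.pack("!HHH", ETHERTYPE_MAC_CTRL, PFC_OPCODE,
                          enable_vector)
            + struct.pack("!8H", *pause_quanta))

# Pause priority 3 (assumed RDMA queue) for the maximum 0xFFFF quanta:
frame = build_pfc_frame(bytes(6), [0, 0, 0, 0xFFFF, 0, 0, 0, 0])
```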

ECN (Explicit Congestion Notification) is an IP-based end-to-end flow control mechanism.

Figure 3: ECN rate-reduction process

As shown in Figure 3, when a switch detects that a port's buffer occupancy has crossed a threshold, it sets the ECN field of packets as it forwards them; the destination NIC then generates notification packets based on those marked packets, telling the source NIC precisely which flow to slow down. ECN avoids head-of-line blocking and can reduce rates accurately at per-flow granularity, but because the backpressure packets are generated on the NIC side, its response cycle is long, so it is usually used as a complement to PFC to reduce the number of PFC events in the network. As shown in Figure 4, ECN should be given a lower trigger threshold so that it finishes slowing traffic before PFC takes effect.

Figure 4: Trigger thresholds of PFC and ECN
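
The ordering in Figure 4 can be written down directly: the switch marks packets CE (Congestion Experienced) once the queue passes the ECN threshold, and only asserts PFC at a higher XOFF threshold, so most congestion is resolved end-to-end before any link-level pausing. The threshold values below are illustrative assumptions.

```python
# Queue thresholds in KB; the values are illustrative, not recommendations.
ECN_MARK_KB = 200   # mark packets CE above this queue depth
PFC_XOFF_KB = 600   # send PFC Pause above this queue depth
assert ECN_MARK_KB < PFC_XOFF_KB  # ECN must trigger first (Figure 4)

def on_enqueue(queue_kb: int, ip_ecn: int) -> tuple[int, bool]:
    """Return (possibly rewritten ECN bits, whether to send PFC Pause).

    IP ECN codepoints (RFC 3168): 00 Not-ECT, 01/10 ECT, 11 CE.
    """
    send_pause = queue_kb >= PFC_XOFF_KB
    if queue_kb >= ECN_MARK_KB and ip_ecn in (0b01, 0b10):
        ip_ecn = 0b11  # mark Congestion Experienced
    return ip_ecn, send_pause

print(on_enqueue(300, 0b10))  # -> (3, False): marked CE, no Pause yet
```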

Beyond the mainstream approaches of large buffers, PFC, and ECN, the industry has also proposed solutions such as hashing on RDMA header fields, elephant-flow shaping, DRILL (a hashing algorithm based on queue length), and HULL (which trades bandwidth for buffer headroom). Most of these, however, require support from NICs and switching chips, making large-scale deployment difficult.

Purpose-Built Lossless Networks

Figure 5: IB flow control mechanism

InfiniBand (IB) is an interconnect architecture designed for high-performance computing and storage. It defines a complete layer 1 to layer 7 protocol stack and features low latency and lossless forwarding. As shown in Figure 5, an IB network uses credit-based flow control: the two ends of a link negotiate an initial credit for each queue at link initialization, indicating how many packets may be sent to the peer; the receiver refreshes each queue's credits in real time according to its own forwarding capacity, and the sender stops sending once its credits are exhausted. Because network elements and NICs must be authorized before sending packets, an IB network never stays congested for long; it is a lossless network that guarantees reliable transmission. IB provides 15 service queues to separate traffic, and traffic in different queues does not block each other at the head of the line. In addition, IB switches use cut-through forwarding with a per-hop delay of about 0.3us, far lower than Ethernet switches.
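
The credit mechanism can be sketched as a small simulation. This is an illustrative toy model of the idea, not the IB wire protocol: the receiver grants credits per queue according to its free buffers, the sender consumes one credit per packet, and transmission stops at zero credits, so the receiver's buffer can never overflow.

```python
class CreditLink:
    """Toy model of IB-style credit-based flow control on one queue."""

    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers  # initial credit from link init
        self.in_flight = 0

    def send(self) -> bool:
        if self.credits == 0:
            return False          # sender must stop: no authorization
        self.credits -= 1
        self.in_flight += 1
        return True

    def receiver_drains(self, n: int) -> None:
        # As the receiver forwards packets, it returns credits upstream.
        drained = min(n, self.in_flight)
        self.in_flight -= drained
        self.credits += drained

link = CreditLink(receiver_buffers=4)
sent = sum(link.send() for _ in range(6))  # only 4 of 6 sends succeed
link.receiver_drains(2)                    # 2 credits come back
print(sent, link.credits)                  # -> 4 2
```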

IB is therefore an excellent choice for small HPC and storage networks, but it is incompatible with Ethernet and comes in a narrow range of product forms, which makes it hard to integrate into Tencent's production network.

Tencent AI Computing Network

Tencent's AI computing network is part of the production network. Besides communicating with other network modules, it must also connect to back-end systems such as network management and security, so only an Ethernet solution compatible with the existing network could be chosen. The architecture of the computing network has gone through several iterations as business needs grew, from HPC v1.0, which initially supported 80 40G nodes, to today's HPC v3.0, which supports 2,000 100G nodes.

The computing nodes in the network form a resource pool shared by every department in the company, which exposes the network to concurrent congestion from multiple services. For a network carrying a single service, congestion can be avoided through application-layer scheduling, but when multiple services share the network, concurrent multi-service congestion is inevitable. Even with queue protection, flow-control mechanisms, and other means of reducing packet loss, the resulting server slowdowns still cost the cluster computing power. At the same time, PFC's defects make it unsuitable to enable across a multi-tier network, so its effective scope must be limited. Our design principles were therefore:

  1. Physically isolate services: use high-density devices at the access layer, concentrate each department's nodes under one access device as far as possible, and limit the number of clusters that span devices;
  2. Enable PFC only on access devices to guarantee fast backpressure, and enable ECN network-wide to protect cross-device clusters;
  3. For the small number of cross-device clusters, provide ample network bandwidth to reduce congestion, and use large-buffer switches to absorb ECN's long backpressure cycle (a sketch of this placement policy follows the list).
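
A hypothetical sketch of how such a placement policy could be expressed when generating per-device configuration. The role names and threshold values are assumptions for illustration, not Tencent's actual tooling.

```python
# Hypothetical config generator for the PFC/ECN placement policy above:
# PFC only at the access tier, ECN everywhere.

def flow_control_profile(role: str) -> dict:
    assert role in ("access", "aggregation")
    return {
        "ecn": {"enabled": True,              # whole network
                "mark_threshold_kb": 200},    # assumed value
        "pfc": {"enabled": role == "access",  # access devices only
                "priorities": [3]},           # assumed RDMA queue
    }

for role in ("access", "aggregation"):
    print(role, flow_control_profile(role))
```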

Considering the requirements of high-density access, large buffers, and end-to-end backpressure, the HPC v3.0 architecture selects chassis switches built on Broadcom DUNE series chips as the access devices.

Figure 6: HPC v3.0 architecture

As shown in Figure 6, HPC v3.0 is a two-level CLOS architecture. Both the aggregation devices (LC) and the access devices (LA) are chassis switches with Broadcom DUNE chips, and each LA can connect up to 72 40G/100G servers. Most applications today run clusters of 10 to 20 nodes, and future gains in node performance and algorithm optimization will further limit cluster growth, so 72 ports are enough for the computing needs of a single service. The DUNE line cards carry 4GB of buffer, enough to absorb milliseconds of congested traffic, and support a VoQ-based end-to-end flow control scheme (Figure 7) that, with the help of PFC, can throttle individual servers under the same chassis precisely. Although the forwarding delay of a chassis switch (4us) is higher than that of a fixed-box switch (1.3us), this does not hurt cluster performance once the reduction in delay from multi-hop forwarding, packet loss, and congestion is taken into account.

Figure 7: DUNE chip end-to-end flow control
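
A quick comparison using the figures above makes the trade-off concrete. It is illustrative only, and deliberately ignores congestion and loss, which is exactly what the large buffers and VoQ backpressure are there to prevent.

```python
# Per-hop delays quoted in the text: chassis 4us, fixed-box 1.3us.
CHASSIS_US, BOX_US = 4.0, 1.3

# A cluster confined to one 72-port LA crosses a single chassis hop:
intra_la = 1 * CHASSIS_US        # 4.0 us

# The same cluster on fixed-box access needs leaf-spine-leaf (3 hops):
box_three_hops = 3 * BOX_US      # 3.9 us

print(intra_la, box_three_hops)
# Comparable raw delay, but the chassis path has no inter-switch links
# to congest and keeps PFC backpressure inside one box.
```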

In terms of cost, although the per-port cost of a chassis switch is higher than that of a fixed-box switch, a single LA already satisfies most computing needs, demand for cross-LA clusters is limited, and fewer interconnect modules are required, so the overall cost is lower than the traditional scheme of fixed-box access with a 1:1 oversubscription ratio.

Summary

For a long time the network was not the bottleneck of data center performance, and "big bandwidth" network designs satisfied the needs of business applications. In recent years, however, the rapid development of server technology has sharply improved data center computing and storage capability, while RDMA technologies such as RoCE and NVMe over Fabrics have shifted the performance bottleneck of the data center to the network side. Especially for new RDMA-based applications such as HPC, distributed storage, GPU clouds, and hyper-converged architectures, network latency has become the main factor restricting performance. It is foreseeable that data center design goals will gradually shift from bandwidth-driven to latency-driven, and low-latency networking will be a long-term goal of our exploration.


Q&A

What impact will AI have on our lives?

Related Reading

CCTV-Tencent released a report: 90% of respondents believe that AI is not far from themselves

AAAI Exclusive | Tencent AI Lab Live Presentation Paper: HodgeRank Maximizing Crowdsourced Matchmaking Ranking Aggregation Information

Blockbuster Exclusive | Tencent AI Lab AAAI18 Live Presentation Paper: Training L1 Norm Constrained Model with Stochastic Quadrant Negative Descent Algorithm



**Original link:** https://cloud.tencent.com/developer/article/1037628
