Inspur Information: Open Networks Offer a New Option for Large-Model Networking

In recent years, large models have brought unforeseen gains in productivity. Driven by large models, many technologies, including the network, have developed beyond our original expectations.

In the past, computing power, algorithms, and data were regarded as the three engines driving artificial intelligence, and the role of the network connecting them was largely overlooked. Today, training a large model often requires thousands of GPUs, so the volume of communication between servers has grown enormously, making network bandwidth and latency one of the biggest bottlenecks of GPU cluster systems in the data center.

Suddenly, players across the Internet industry were racing to find a breakthrough. At the 2023 Open Computing China Community Technology Summit (OCP China Day 2023), held recently in Beijing, we learned how "open network" technologies are changing and supporting the development of large models.

Open Networks: A New Choice for Large-Model Networks

Open technology promotes and accelerates innovation by sharing IT infrastructure products, specifications, and intellectual property, effectively supporting the growing demand for IT infrastructure across industries. Open technologies have already achieved great success in computing: the data centers of leading cloud service providers are built on open computing technologies, accelerating cloud business innovation.

As more and more services are digitized, network traffic in data centers has surged, driving greater demand for network bandwidth. To achieve flexible expansion of network resources and agile operations and maintenance (O&M), the need to decouple the network has become increasingly urgent. An open network decouples network software from hardware by separating network devices from the software that runs on them, creating a more flexible, agile, and programmable network architecture. Overall, an open network can reduce total cost of ownership by about one third, cut the time to launch new services by 50%, and double overall O&M efficiency.

According to Li Pengchong, general manager of Inspur Information's Network R&D Department, open networks accelerate innovation and iteration by separating network hardware from software, create a more flexible, agile, and programmable network architecture, and provide a new choice for AIGC large-model networks.

First, the rapid evolution of large models places higher demands on network bandwidth, which in turn requires rapid innovation in network hardware: as soon as new chips are released, switches and network devices that can keep up must follow immediately.

Second, because large-model traffic is end to end, the network card and the switch must work together, and together they need to solve two core problems. The first is end-to-end flow control: a good algorithm is needed to resolve network congestion. The second is load balancing of network flows, as sketched below.
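To make the load-balancing problem concrete, the following is a minimal illustrative sketch (not Inspur's implementation) of how conventional flow-based ECMP hashing can leave links unbalanced: each flow is pinned to a single uplink by a hash of its 5-tuple, so a handful of long-lived training flows can easily collide on the same link.

```python
import hashlib

# Sketch: classic ECMP picks one uplink per flow by hashing the 5-tuple,
# so a few long-lived "elephant" flows (typical of large-model training
# traffic) can all land on the same link and overload it.
def ecmp_pick_link(five_tuple, num_links):
    """Hash (src_ip, dst_ip, src_port, dst_port, proto) to a link index."""
    digest = hashlib.md5(str(five_tuple).encode()).hexdigest()
    return int(digest, 16) % num_links

flows = [
    ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"),  # RoCEv2 uses UDP port 4791
    ("10.0.0.2", "10.0.1.2", 4791, 4791, "UDP"),
    ("10.0.0.3", "10.0.1.3", 4791, 4791, "UDP"),
    ("10.0.0.4", "10.0.1.4", 4791, 4791, "UDP"),
]
links = [ecmp_pick_link(f, num_links=4) for f in flows]
print(links)  # with only a few flows, collisions are likely -> unbalanced links
```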

Third, to meet the varied hardware requirements of MaaS (Model as a Service), an open network can support the construction of an elastic network, enable rapid allocation and partitioning of network resources, and at the same time guarantee security isolation between tenants.
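As a rough illustration of resource partitioning and tenant isolation (not a description of Inspur's product), the sketch below maps each tenant to its own virtual network segment, for example one VXLAN VNI per tenant; all names and numbers are hypothetical.

```python
# Sketch: carve out an isolated virtual network segment per tenant
# (e.g. one VXLAN VNI each) so resources can be allocated quickly
# while tenant traffic stays separated. Names are hypothetical.
class TenantNetworkAllocator:
    def __init__(self, first_vni=10000):
        self.next_vni = first_vni
        self.tenants = {}

    def allocate(self, tenant, gpu_nodes):
        """Assign the tenant a dedicated VNI and record its node set."""
        vni = self.next_vni
        self.next_vni += 1
        self.tenants[tenant] = {"vni": vni, "nodes": set(gpu_nodes)}
        return vni

    def same_segment(self, tenant_a, tenant_b):
        """Tenants on different VNIs cannot see each other's traffic."""
        return self.tenants[tenant_a]["vni"] == self.tenants[tenant_b]["vni"]

alloc = TenantNetworkAllocator()
alloc.allocate("team-a", ["gpu-01", "gpu-02"])
alloc.allocate("team-b", ["gpu-03"])
print(alloc.same_segment("team-a", "team-b"))  # False: isolated segments
```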

Traditional closed networks, whose innovation cycles are measured in years, largely cannot keep pace with the needs of large-model networks. As a result, most of the truly mature large-model networks on the market are built on open network products and an open design philosophy.

Inspur Information builds a high-performance lossless Ethernet solution

As is well known, large-model training depends on key technologies spanning computing power, algorithms, storage, and network transmission. Computing power and storage are advancing rapidly, which calls for network solutions with greater bandwidth and lower latency.

Building on years of work on open networks, Inspur Information has created a high-performance lossless Ethernet solution for large-model training using 400G high-performance switches and smart NICs. The switch fabric in this solution supports a per-packet forwarding mode, while the smart NICs implement receiver-driven proactive flow notification and out-of-order packet reordering. This design overcomes the unbalanced link loads of traditional ECMP routing, avoids congestion inside the network, and greatly reduces forwarding latency while delivering 400G of bandwidth, fully meeting the acceleration needs of large-model training.
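For illustration only, the sketch below shows the receiver-side reordering idea in miniature: when a message's packets are sprayed across several paths they can arrive out of order, and the receiving NIC must restore ordering before delivering data upward. The class and sequence numbers are hypothetical, not Inspur's design.

```python
# Sketch: when packets of one message are sprayed across multiple paths,
# the receiver holds out-of-order arrivals and releases them in sequence.
class ReorderBuffer:
    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.pending = {}   # out-of-order packets held back, keyed by seq

    def receive(self, seq, payload):
        """Return the payloads that become deliverable, in order."""
        delivered = []
        self.pending[seq] = payload
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

rb = ReorderBuffer()
for seq, data in [(1, "B"), (0, "A"), (3, "D"), (2, "C")]:  # arrival order varies per path
    print(rb.receive(seq, data))   # [], ['A', 'B'], [], ['C', 'D']
```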

In addition, for accelerated workloads such as distributed storage and hyper-converged infrastructure, Inspur Information provides an end-to-end RoCE solution. Its intelligent network scheduling control plane, based on UXOS programmable INT technology, collects traffic characteristics and congestion status from network devices in real time for each customer scenario and, through algorithms on the intelligent scheduling platform, automatically adjusts RoCE configuration parameters such as PFC, ECN, and DCQCN on switches and NICs, enabling rapid deployment and optimal configuration of customer service networks. It has also accumulated a large library of typical parameter configurations for accelerated workloads, helping customer services go online with ease.
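The following is a minimal sketch of what such telemetry-driven tuning could look like, assuming hypothetical `poll_queue_depth` and `apply_ecn_threshold` stand-ins for the real INT telemetry and device-configuration interfaces; it illustrates the control-loop idea only, not Inspur's implementation.

```python
import random
import time

# Hypothetical stand-in for INT telemetry: current queue depth of a port in KB.
def poll_queue_depth(port):
    return random.randint(0, 200)

# Hypothetical stand-in for pushing an ECN marking threshold to the device.
def apply_ecn_threshold(port, kmin_kb):
    print(f"port {port}: set ECN Kmin to {kmin_kb} KB")

def tune(ports, step=10, low=20, high=400):
    """Nudge each port's ECN marking threshold based on observed queue depth."""
    kmin = {p: 100 for p in ports}          # per-port ECN Kmin (KB)
    for _ in range(3):                      # a few control iterations
        for port in ports:
            depth = poll_queue_depth(port)
            if depth > kmin[port]:          # queue building up: mark earlier
                kmin[port] = max(low, kmin[port] - step)
            else:                           # headroom left: relax marking
                kmin[port] = min(high, kmin[port] + step)
            apply_ecn_threshold(port, kmin[port])
        time.sleep(0.1)

tune(ports=["Ethernet0", "Ethernet4"])
```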

The continuous evolution of open networks drives innovation in data center network technology

At present, the "war of a hundred models" is intensifying, and more industry-specific large models are pouring onto the battlefield, opening up more possibilities for the rapid evolution of large models; accordingly, the requirements on open networks will keep evolving as well. According to Chen Xiang, deputy general manager of Inspur Information's Network R&D Department, there are three likely directions for future improvements to open networks.

The first is better end-to-end flow-control algorithms. RDMA networks today mostly use the DCQCN algorithm. With the arrival of the large-model era, DCQCN can no longer fully cover the real flow-control requirements of large models, so better flow-control algorithms are needed.
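For context, the sketch below captures the sender-side core of the standard DCQCN algorithm in simplified form: the sending rate is cut whenever a congestion notification packet (CNP) arrives and is recovered only gradually afterwards. Constants and timer behavior are simplified for illustration.

```python
# Simplified sketch of DCQCN sender-side rate control: multiplicative
# decrease on congestion notification, gradual recovery otherwise.
class DcqcnRate:
    def __init__(self, line_rate_gbps=400.0, g=1 / 256):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # congestion estimate
        self.g = g

    def on_cnp(self):
        """Congestion notification received: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer(self):
        """No recent congestion: decay alpha and recover toward the target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rc + self.rt) / 2   # recovery phase (simplified)

r = DcqcnRate()
r.on_cnp()
print(round(r.rc, 1))        # rate drops after congestion feedback
for _ in range(5):
    r.on_timer()
print(round(r.rc, 1))        # rate recovers once congestion subsides
```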

The second is multi-path selection in the network, which has long been a topic in need of further improvement.

The third is the transport layer of today's networks, which is still based on the InfiniBand transport layer designed decades ago. To fully exploit the network's capabilities, the entire transport layer may need to be redesigned.

Overall, the diversification of computing and applications and growing technological complexity are driving a new round of transformation in the data center. Open-source and open communities have become an important force for continuous innovation, working through global collaboration to tackle major issues such as infrastructure iteration and sustainable development. Open network vendors represented by Inspur Information are building a complete industrial ecosystem spanning hardware, software, and management, continuously expanding the influence of the open network ecosystem and providing a new choice for AIGC networking.
