One that can do the work of 12! Switches are starting to play high-density too

The recent NVIDIA GTC conference has wrapped up, and many of the new products it brought seem to bridge the real and virtual worlds, building a more efficient environment for the further development of AI.

Among NVIDIA's many new releases, networking products are clearly becoming more and more abundant, and the era of "3U" interconnection (CPU, GPU, and DPU) has officially begun.

A few years ago, the DPU was just a product line NVIDIA was exploring in the data center field; it has since grown into a broader portfolio, one piece of which is the Spectrum-4 end-to-end Ethernet platform. The platform consists of three main parts: the Spectrum-4 switch, which accelerates the entire cloud network architecture; the ConnectX-7 SmartNIC, which accelerates network performance inside server nodes; and the BlueField-3 DPU, a programmable infrastructure processor for software-defined networking, storage, and security in the data center.

Together, these three products form Spectrum-4, an end-to-end 400Gbps hyperscale network acceleration platform.

 

The Spectrum-4 switch: one that does the work of 12

 

NVIDIA Spectrum-4 is the switch at the heart of the world's first 400Gbps end-to-end networking platform. It has 64 ports, each with 800Gbps of bandwidth that can be split into two 400Gbps links, providing a total of 128 400Gbps switching links. Overall switching bandwidth is 51.2Tbps, the packet forwarding rate reaches 37.6 billion packets per second, and it offers 12.8Tbps of line-rate encryption, putting it far ahead of comparable products in performance.
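As a quick sanity check, these headline figures are internally consistent. Here is a back-of-the-envelope sketch using only the numbers quoted above:

```python
# Back-of-the-envelope check of the Spectrum-4 figures quoted above.
ports = 64
port_bw_gbps = 800                    # each physical port runs at 800Gbps

links_400g = ports * 2                # each port splits into two 400Gbps links
total_tbps = ports * port_bw_gbps / 1000

print(links_400g)   # 128 400Gbps switching links
print(total_tbps)   # 51.2Tbps aggregate switching bandwidth
```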


The chip behind this performance packs 100 billion transistors and is built on a custom 4N process. It can optimize cloud platforms for AI and storage workloads while providing low-latency network interaction. Spectrum-4 switches achieve nanosecond-level timing precision, five to six orders of magnitude better than the millisecond-level accuracy of ordinary data centers.
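The "five to six orders of magnitude" claim is easy to verify: one millisecond contains a million nanoseconds, i.e. exactly six orders of magnitude. A one-line check:

```python
# 1 millisecond = 1e-3 s, 1 nanosecond = 1e-9 s.
ns_per_ms = 1_000_000               # nanoseconds in one millisecond
orders_of_magnitude = len(str(ns_per_ms)) - 1
print(orders_of_magnitude)          # 6: ms-level vs ns-level timing precision
```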


According to Cui Yan, a network expert at NVIDIA, the Spectrum-4 switch quadruples bandwidth over the previous generation, triples line-rate encryption throughput, replaces up to 12 switches with one, and cuts energy consumption by 40%, allowing the entire network to be deployed quickly with the lowest total cost of ownership and the highest ROI. Its stronger performance and forwarding efficiency also help Spectrum-4 switches better avoid network congestion.

In addition, NVIDIA Spectrum-4 has two signature tricks, the first being adaptive routing. With static hashing, a data flow can only take a single path; when traffic is heavy, that one link becomes congested and latency rises.

Adaptive routing load-balances flows across multiple switch links. When a link is predicted to become congested, part of its traffic is shifted to other links, which greatly reduces the congestion and tail latency caused by sudden traffic bursts and improves the efficiency of the entire network.
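To make the difference concrete, here is a toy sketch (purely illustrative: the link count, flow sizes, and hash function are invented, and this is not NVIDIA's actual algorithm). Static hashing can pin several "elephant" flows onto the same link, while a simple least-loaded adaptive policy spreads them out:

```python
LINKS = 4
# (source port, flow size): three elephant flows whose ports all hash to link 0.
flows = [(4000, 90), (4004, 80), (4008, 70),
         (4001, 10), (4002, 10), (4003, 10), (4005, 10), (4006, 10)]

# Static hashing: each flow is pinned to the link its header hashes to.
static_load = [0] * LINKS
for port, size in flows:
    static_load[port % LINKS] += size

# Adaptive routing (simplified): send each flow to the least-loaded link.
adaptive_load = [0] * LINKS
for _, size in flows:
    lightest = adaptive_load.index(min(adaptive_load))
    adaptive_load[lightest] += size

print(static_load)    # [240, 20, 20, 10] -> link 0 is badly congested
print(adaptive_load)  # [90, 80, 70, 50]  -> load is spread across links
```

The hottest link determines the congestion and tail latency the application sees, so lowering that maximum is exactly the win the paragraph above describes.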

The second trick is efficient, large-scale acceleration of the Omniverse architecture, where Spectrum-4's ultra-high density is the key advantage. Port configuration and deployment that used to consume a great deal of time and energy across 12 switches is now handled by a single Spectrum-4, greatly simplifying deployment, while the 400Gbps ports increase the number of data center nodes that can be connected, significantly improving efficiency.

On the management side, what used to be 12 devices is now one, which makes troubleshooting easier, occupies less rack space, and saves substantial energy and space costs every year.

 

Smart NIC and DPU bring a new leap

 

The second product NVIDIA launched is the ConnectX-7 SmartNIC, which accelerates software-defined networking without consuming CPU resources, delivers enhanced storage performance, supports inline encryption and decryption, and provides precise timing for data center applications.


The third product is the familiar BlueField-3 DPU. Compared with the previous generation it also brings huge changes: 4 times the Arm computing power, 5 times the memory, 2 times the network bandwidth, 4 times the host bandwidth, and further gains in security, storage, and more.

NVIDIA has updated the DOCA SDK development platform in step, letting more developers build their own software-defined networking, storage, and security applications on the BlueField-3 DPU. DOCA will also offer more services, and users can directly run container-based services on the network.

 

OVX: Putting the Compute Factory in a Box

 

At this NVIDIA GTC conference, the OVX computing system drew the most attention. It is NVIDIA's workhorse for data center AI acceleration and an important part of its larger data center strategy.

Simply put, OVX is a high-compute-density server built for Omniverse digital twins; stacked into clusters, it delivers even more formidable performance.


Each Omniverse OVX computing system consists of eight NVIDIA A40 GPUs, three NVIDIA ConnectX-6 Dx 200Gbps NICs, dual Intel Ice Lake 8362 CPUs, 1TB of system memory, and 16TB of NVMe storage. Connected with Spectrum switches, OVX systems scale from a single POD of 8 OVX servers to a SuperPOD of 32, and even larger simulation workloads can be met by deploying multiple SuperPODs.
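Using only the figures quoted above (and assuming a SuperPOD is assembled from full 8-server PODs, which the article implies but does not state), the scale works out as follows:

```python
gpus_per_server = 8          # A40 GPUs per OVX server
servers_per_pod = 8
servers_per_superpod = 32

pods_per_superpod = servers_per_superpod // servers_per_pod
gpus_per_superpod = servers_per_superpod * gpus_per_server

print(pods_per_superpod)     # 4 PODs make up one SuperPOD
print(gpus_per_superpod)     # 256 A40 GPUs in a single SuperPOD
```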


Meng Qing, director of network marketing at NVIDIA, said the OVX server is a box much like DGX: a standardized product that provides the best support for Omniverse. An OVX SuperPOD is a supercomputing cluster connected through the Spectrum platform. With that performance, it can help designers build more accurate digital buildings and more realistic simulation environments, handle the increasingly complex computing needs of industry, and make self-driving cars and robots more intelligent.

 

H100 CNX, a most unusual compute card

 

NVIDIA brought many heavyweight products this time, but the most innovative is the H100 CNX. It is a converged accelerator that effectively wires the network straight into the GPU: an H100 GPU is paired with a ConnectX-7 400Gb/s InfiniBand/Ethernet SmartNIC and exchanges data with it over RDMA at 50GB/s, achieving higher I/O performance.

What? Didn't follow? Then let's start from the beginning!

Why is the CPU called the Central Processing Unit? Because its position is central: almost every internal or external device has to communicate with the CPU, receive its commands, and then go off to do its own work. The "pipe" for this communication is the bus, which today is mainly PCIe.

However, as computing power keeps climbing, under the bombardment of heavy data movers such as GPUs, NVMe drives, and the network, bus bandwidth has gradually started to fall short, and system latency has become a frequent problem. In a traditional server, the GPU exchanges a large amount of data with the CPU; that data generally sits in memory, and only after the CPU issues instructions is it handed off to the NIC. Such a data path has many intermediate hops, and once data volume surges, congestion follows.


The H100 CNX approach is to put the GPU and the ConnectX-7 network chip on one board, interconnected at an ultra-high 400Gbps. The CPU only needs to supply a few instructions; the bulk data bypasses the CPU and memory entirely. A problem that one card can solve by itself should never "trouble" the CPU that manages the whole machine.
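A toy bottleneck model makes the point (the per-segment bandwidths below are round, invented numbers, not measured specs; only the 50GB/s GPU-to-NIC figure comes from the text above). A chained data path runs at the speed of its slowest segment, so removing the bounce through host memory raises the floor:

```python
# Hypothetical per-segment bandwidths in GB/s (illustrative only).
traditional_path = {
    "GPU -> PCIe": 64,
    "CPU/system memory copy": 40,      # the detour through host memory
    "PCIe -> NIC": 64,
}
cnx_path = {
    "GPU -> NIC (on-board RDMA)": 50,  # the 50GB/s link quoted above
}

def bottleneck(path):
    """Throughput of a chained path is limited by its slowest segment."""
    return min(path.values())

print(bottleneck(traditional_path))    # capped by the memory bounce
print(bottleneck(cnx_path))            # the direct link is the only segment
```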

Another advantage of the H100 CNX is compatibility. It uses a standard PCIe form factor and fits mainstream servers, so server vendors do not need to spend heavy R&D resources on it, and it can be applied far more widely.

 

In the digital age, safety comes first

 

Most data in enterprise applications carries critical business, so how can security be ensured?

According to Cui Yan, NVIDIA has always taken network security seriously. Spectrum-4 switches, ConnectX-7, and BlueField-3 all enforce security authentication in their underlying firmware; with immutable firmware and boot verification procedures, these devices are protected from illegal modification.

In addition, as mentioned above, these devices provide a variety of encryption, decryption, and acceleration functions, including encrypting customer application data in transit. BlueField-3 also enables stronger zero-trust security, isolating the application domain from the infrastructure domain so that both the client's applications and the infrastructure-side data remain secure.

NVIDIA also has many ecosystem partners building distributed firewalls and security mechanisms on the BlueField-3 DPU, which better defends hosts and servers against network attacks.

In addition, the ConnectX-7 SmartNIC can provide very precise time synchronization for data center applications and time-sensitive infrastructure.

 

DPU has become an industry hot spot; NVIDIA stays ahead

 

Now that, after a wave of acquisitions, many chip makers and cloud service providers have begun developing DPU-related products, what does NVIDIA make of this?

According to Meng Qing, when NVIDIA put forward the DPU back in 2020, the concept took off instantly, and many peers and startups have since launched similar products and roadmaps. That in itself is indirect proof that NVIDIA read the data center's direction correctly at the time.

Huang Renxun has said many times that NVIDIA provides a full-stack computing platform, including its world-renowned GPUs and industry-leading DPUs. Notably, NVIDIA's third-generation DPU is about to ship to customers, ahead of its peers.

On the R&D side, NVIDIA keeps increasing its investment and is keenly aware of how much developers and their ecosystems matter. From CUDA's growth to nearly 3 million developers worldwide, to a DOCA developer community that has attracted a large following in just one year, the investment continues. Growing together with developers, customers, and partners is one of NVIDIA's secrets to staying ahead of the curve.

Looking ahead, NVIDIA sees five main directions: 1. Million-X, million-fold computing acceleration. 2. Transformers enhancing AI. 3. Data centers evolving into AI factories. 4. Exponentially growing demand for robotic systems. 5. Digital twins for the next AI era. NVIDIA will continue to improve itself and work with partners, developers, and customers along these five directions to advance the industry.

 

The need for computing power will never end

 

"Turn the data center into an AI factory": this is one of the visions NVIDIA founder and CEO Huang Renxun laid out at the GTC conference. Meanwhile, companies in many fields, including meteorological research, geological exploration, drug and vaccine development, and virus analysis, are building their own data factories with AI, placing ever higher demands on computing power.

The author once interviewed a doctoral supervisor at a well-known university, who said: "In biological research, no matter how much supercomputing power you have, it will all get used, because every small improvement in precision multiplies the amount of computation dozens of times. The demand for computing power is truly never-ending."

Today, data centers are springing up across the country, running around the clock and consuming enormous amounts of energy, yet demand still outstrips supply. That is the "terrifying" side of the digital age.


Enterprise users have therefore been eagerly awaiting new technologies that bring higher computing density and lower power consumption, and this series of NVIDIA products was born for exactly that. The Spectrum-4 switch that stands in for 12, the highly intelligent ConnectX-7 NIC, the BlueField-3 DPU with 4 times the computing power, the OVX server with its 8 GPUs, and the efficient converged H100 CNX accelerator will bring new changes to the data centers of the digital age, and will inevitably set off a new wave of the metaverse.


Origin my.oschina.net/u/5547601/blog/5513896