AI training performance improved by 30%, Alibaba Cloud releases GPU computing bare metal instance ebmgn7ex

The rapid rise of technologies such as ChatGPT and AIGC (AI-generated content) has let ordinary users feel how much the application of artificial intelligence is changing the user experience. The rapid rollout of these applications depends on the support of the underlying infrastructure. Training AI models typically demands high computing power, high throughput, and low latency, all of which greatly speed up training and model iteration.

Recently, Alibaba Cloud released ebmgn7ex, its latest generation of GPU computing bare metal instance family for AI training scenarios. Compared with ebmgn7e, the previous-generation bare metal computing instance equipped with A100 GPUs, ebmgn7ex increases bandwidth by 150% and reduces latency by 50%. Overall performance in AI training scenarios improves by about 30%, and cost performance improves by about 20% to 30%.

This instance is mainly intended for artificial intelligence scenarios such as autonomous driving, AI image recognition, speech recognition, semantic recognition, and automatic control. It is also well suited to high-performance computing scenarios, such as simulation applications in petroleum, meteorology, geology, industrial simulation, machinery, hydrology, and other industries and research fields, as well as predictive computing in the economic and financial fields.

According to Alibaba Cloud's elastic computing product experts, the ebmgn7ex instance uses CIPU, Alibaba Cloud's self-developed cloud infrastructure processor, and its bandwidth has been upgraded to 160G, which meets the training requirements of most models. The GPUs are interconnected through RDMA over the TCP overlay network, and the instance supports GPU Direct (GPU pass-through technology) with latency as low as 8 microseconds, making multi-machine AI training more efficient and more flexible. Based on these capabilities, users can quickly and flexibly build multi-machine GPU computing clusters.
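The percentage figures and the absolute figures quoted in this article can be cross-checked with simple arithmetic. The sketch below is only an illustration: the function and variable names are made up for this example, and it assumes that the 150% bandwidth increase refers to the 160G figure and that the 50% latency reduction refers to the 8-microsecond minimum, which the article does not state explicitly. The implied previous-generation values are therefore estimates, not published specifications.

```python
def baseline_from_increase(new_value: float, percent_increase: float) -> float:
    """Return the old value, given the new value and the percentage increase."""
    return new_value / (1 + percent_increase / 100)


def baseline_from_reduction(new_value: float, percent_reduction: float) -> float:
    """Return the old value, given the new value and the percentage reduction."""
    return new_value / (1 - percent_reduction / 100)


# Figures quoted in the article for ebmgn7ex.
new_bandwidth_g = 160   # bandwidth "upgraded to 160G"
new_latency_us = 8      # minimum latency in microseconds

# Implied previous-generation values (assumption: the percentages
# apply to exactly these two figures).
implied_prev_bandwidth_g = baseline_from_increase(new_bandwidth_g, 150)
implied_prev_latency_us = baseline_from_reduction(new_latency_us, 50)

print(implied_prev_bandwidth_g)  # 64.0
print(implied_prev_latency_us)   # 16.0
```

Under these assumptions, a 150% increase means the new bandwidth is 2.5 times the old one, so the previous generation would have had about 64G, and halving the latency would imply a previous minimum of about 16 microseconds.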


The traditional RDMA network has low latency but is difficult to scale, which greatly limits its usage scenarios. Alibaba Cloud's self-developed eRDMA network combines low latency with support for large-scale networking, so that gn7ex instances can be deployed in all Alibaba Cloud Availability Zones (AZs) and any number of clusters can be built quickly in major regions, helping enterprises quickly deploy artificial intelligence models.


Origin blog.csdn.net/bjchenxu/article/details/129851235