Looking at the development trend of future computing chips from the rise of large models such as ChatGPT

This article is reprinted from: "Looking at the development trend of future computing chips from the rise of large models such as ChatGPT"
https://aijishu.com/a/1060000000403993



The popularity of ChatGPT directly triggered the boom in large models, and also put NVIDIA GPUs in short supply.

From a development perspective, however, the GPU is not the most efficient computing platform for large models.

Why do large models such as GPT not exceed one trillion parameters? The core reason is that on the current GPU platform, performance and cost have hit a ceiling.

To continue supporting even larger models with more than one trillion parameters, performance must increase by orders of magnitude and the cost per unit of computing power must fall by orders of magnitude. This will inevitably require an AI computing platform with a new architecture.

This article is intended as a starting point, and we look forward to more discussion in the industry.


1. Overview

Why do large models "invariably" stay at hundreds of billions of parameters and fail to break through one trillion? The main reason is that under the current architecture:

  • The performance growth of a single GPU (Scale Up) is limited; to increase performance, the only option is to increase the size of the computing cluster (Scale Out);
  • In a computing cluster with tens of thousands of GPUs, east-west traffic grows rapidly (roughly quadratically with the number of nodes), is limited by the cluster's network bandwidth, and in turn constrains the computing performance of the cluster nodes;
  • Constrained by Amdahl's law, the degree of parallelism cannot be expanded indefinitely, so the approach of simply growing the cluster has also hit a bottleneck;
  • Moreover, at such a cluster size, the cost becomes unaffordable.

In general, to raise the upper limit of computing power by an order of magnitude, we need to start from the following aspects:

  • First, performance improvement is not just a matter of a single chip; it is a system-level engineering effort.
    It therefore requires collaborative optimization of the entire system, from chip hardware and software to complete machines and data centers.

  • Second, expand the cluster size, also known as Scale Out.
    Scaling out requires stronger interconnect within the cluster, which means higher bandwidth and a more efficient high-performance network. At the same time, the cost of a single computing node also needs to come down.

  • Finally, Scale Up: increase the performance of a single node. This is the most fundamental way to increase computing power.

    Under the constraints of power consumption, process, cost and other factors, performance gains can only come from tapping the potential of the software and hardware architecture and the microarchitecture implementation.


2. Collaborative optimization of the whole system


Computing power is not just a matter of chip performance at the micro level; at the macro level it is a complex, large-scale piece of system engineering.

Across the entire system, from process technology to software and from chips to data centers, the individual fields that make up the computing power stack have each reached a relatively stable and mature stage.

The development of large AI models still demands a major leap in computing power, which requires not only continuous incremental optimization within each field, but also cross-domain collaborative optimization and innovation across fields:

  • Semiconductor process and packaging: more advanced process nodes, 3D integration, Chiplet packaging, etc.
  • Chip implementation (microarchitecture): innovative designs such as integrated storage and computing (compute-in-memory), DSA architectures, and various new types of memory.
  • System architecture: for example, the open and streamlined RISC-V; heterogeneous computing gradually evolving toward hyper-heterogeneous computing; and hardware-software co-design to manage complex computing, etc.
  • System software, frameworks, and libraries: foundations such as the OS, hypervisor, and containers, plus the various computing frameworks and libraries that need continuous optimization and open-source development.
  • Business applications (algorithms): optimization of business-scenario algorithms and of algorithm parallelism; system flexibility and programmability design; system control and management, system scalability, etc.
  • Hardware, including servers and switches: board-level integration of multiple functional chips, customized boards and servers, and optimization of server power delivery and heat dissipation.
  • Data center infrastructure: green data centers, liquid cooling, PUE optimization, etc.
  • Data center operation and management: operation and management of hyperscale data centers, cross-data-center scheduling, etc.

3. Scale Out: Increase cluster size


N nodes connected pairwise require a total of N*(N-1)/2 connections.
According to this formula, if the cluster has only one node, there is no east-west internal traffic;
as the number of nodes in the cluster grows, the number of internal interactions rises rapidly, and with it the interaction traffic inside the cluster.

According to statistics, east-west network traffic currently accounts for more than 85% of the traffic in large data centers; a large AI model training cluster typically has well over 1,000 nodes, and its east-west share is estimated to exceed 90%.
In theory, if the traffic on every connection were equal and the mainstream NIC bandwidth is 200 Gbps, then even if all traffic were east-west, each pair of nodes could only get roughly 200/1000 = 0.2 Gbps.
So on the one hand north-south traffic is squeezed to a minimum, and on the other hand the east-west bandwidth available per connection keeps shrinking as the cluster grows, which makes the network bandwidth bottleneck even more pronounced.
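
As a quick sanity check of the arithmetic above, here is a minimal Python sketch. It assumes, as in the text, a 1,000-node cluster with 200 Gbps NICs and each node's bandwidth split evenly across all of its peers:

```python
def full_mesh_connections(n: int) -> int:
    """Number of pairwise connections among n nodes: N*(N-1)/2."""
    return n * (n - 1) // 2


def per_pair_bandwidth_gbps(nic_gbps: float, n: int) -> float:
    """A node's NIC bandwidth split evenly across its n-1 peers."""
    return nic_gbps / (n - 1)


if __name__ == "__main__":
    n, nic = 1000, 200.0  # figures quoted in the article
    print(f"{n} nodes -> {full_mesh_connections(n):,} pairwise connections")
    print(f"per-pair bandwidth ~= {per_pair_bandwidth_gbps(nic, n):.2f} Gbps")
    # 1000 nodes -> 499,500 pairwise connections
    # per-pair bandwidth ~= 0.20 Gbps
```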

At the same time, because of Amdahl's law, overall computing power does not scale linearly with the number of nodes; instead, as the cluster grows, the marginal gain in overall computing power gradually tapers off.
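
A minimal sketch of this effect; the 95% parallel fraction below is an illustrative assumption, not a figure from the article. Even when 95% of the work parallelizes perfectly, the speedup saturates at 20x no matter how many nodes are added:

```python
def amdahl_speedup(parallel_fraction: float, n_nodes: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_nodes)


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction (illustrative only)
    for n in (10, 100, 1000, 10000):
        print(f"{n:>6} nodes -> speedup {amdahl_speedup(p, n):5.1f}x")
    # 10 -> 6.9x, 100 -> 16.8x, 1000 -> 19.6x, 10000 -> 20.0x
    # (the limit is 1 / (1 - 0.95) = 20x)
```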


To increase the computing power of the cluster through Scale Out:

  • The first is to quickly increase network bandwidth.


  • Second, better high-performance networking support is needed.
    Optimizing high-performance network features such as congestion control, multi-path load balancing (ECMP), out-of-order delivery, high scalability, rapid failure recovery, and incast handling yields a more capable high-performance network.


  • Third, a faster data path from the AI compute engine to the network.
    In the traditional architecture, the GPU hangs off the CPU as an accelerator card; the data path from GPU to network is long, and the CPU must take part in controlling each transfer.
    For example, a RoCE high-performance network interface can be integrated with the GPU so that data is transmitted directly, bypassing the CPU and the DPU/NIC.
  • Finally, improve the degree of parallelism at the algorithm and software levels and reduce the coupling between parallel tasks as much as possible.

4. Scale Up: Increase single-chip performance

East-west traffic is essentially internal "overhead". Improving performance through Scale Out puts enormous pressure on the network and has a hard upper limit; it treats the symptoms rather than the root cause.

To truly increase computing power at scale, the most fundamental and effective approach is to improve the performance of the individual computing node and the individual computing chip.

To improve single-chip performance:

  • The first is to increase the scale of the chip.
    Grow the design scale of a single chip through process advances, 3D integration and chiplet packaging. Mainstream large chips today carry around 50 billion transistors; Intel plans to increase the transistor count of a single chip to 1 trillion by 2030, a 20-fold increase.
  • Second, improve the performance efficiency per unit of transistor resources.
    There are six main processor types: CPU, coprocessor, GPU, FPGA, DSA and ASIC. The CPU is the most general but has the lowest performance efficiency, while the ASIC is the most specialized and has the highest.
    For the computing engines, choose ASICs, or engines as close to ASIC as possible, and raise the share of the system's overall computation handled by such processors.
  • Third, improve general-purpose flexibility.
    Performance and flexibility are at odds. Why can't a chip use 100% ASIC-level computing engines? Because a pure ASIC does not make business sense:
    chips need to be deployed at wide scale to amortize R&D costs, which requires paying attention to the chip's general-purpose flexibility.


At present, driven by various large computing power scenarios such as AI, heterogeneous computing has become the mainstream of computing architecture.
In the future, with the further development of scenarios with higher computing power requirements such as large models, the computing architecture needs to move further from heterogeneous computing to hyper-heterogeneous computing:

  • The first stage: single-CPU serial computing;
  • The second stage: homogeneous parallel computing with multiple CPUs;
  • The third stage: heterogeneous parallel computing with CPU+GPU;
  • The fourth stage: heterogeneous parallel computing with CPU+DSA;
  • The fifth stage: hyper-heterogeneous parallel computing that fuses multiple heterogeneous engines.

5. Generality analysis of large computing power chips

So far, it is hard to call Google's TPU a success: even though Google can carry out full-stack optimization from the chip up through the framework and even the AI applications, the TPU has still not achieved adoption at larger scale, and it has even dragged down the development of Google's upper-layer AI business. The reason is actually simple:

  • When the upper-layer business logic and algorithms are iterating rapidly, it is difficult to freeze them into circuits for acceleration.
  • Although Google invented the Transformer, it was constrained by its underlying TPU: the upper-layer teams had to worry about compatibility with the chip and could not devote themselves fully to model development;
  • AI model development is still in its "alchemy" stage; whoever can trial-and-error and iterate fastest is most likely to succeed.

As a result, Google has fallen behind in the development of large AI models.
OpenAI carried no such baggage. It could choose the optimal computing platform (general-purpose GPUs plus CUDA) and focus entirely on developing its own models, taking the lead with high-quality large AI models such as ChatGPT and GPT-4 and thereby ushering in the explosive era of the AGI revolution.

Conclusion: today, with AI algorithms evolving rapidly, versatility matters more than raw performance.
That is why NVIDIA integrates both CUDA cores and Tensor Cores into its GPUs, balancing versatility with performance, making the GPU the best AI computing platform currently available.


6. Related trend cases

1. Intel Habana Gaudi


Gaudi is a typical Tensor accelerator.

From the first-generation Gaudi's 16nm process to the second-generation 7nm process, Gaudi2 takes training and inference performance to a whole new level.
It increases the number of AI custom Tensor processor cores from 8 to 24, adds support for FP8, and integrates a media processing engine for processing compressed media to offload the host subsystem.

Gaudi2's on-package memory triples to 96 GB of HBM2e, with 2.45 TB/s of bandwidth.



Gaudi achieves very high cluster scale-out capability through 24 integrated 100 Gbps RDMA high-performance network interfaces.
The actual cluster architecture can be designed flexibly according to specific needs.

Compared with traditional accelerators such as GPUs and TPUs, Gaudi's biggest highlight is its integrated ultra-high-bandwidth, high-performance networking.
This improves the efficiency of east-west traffic between cluster nodes and makes larger-scale cluster designs possible.


2. Graphcore IPU


Graphcore's IPU processor has 1216 tiles (each tile containing a core and its local memory), an exchange fabric (an on-chip interconnect), IPU-Link interfaces for connecting to other IPUs, and a PCIe interface for connecting to the host.

Architecturally, the Graphcore IPU is a product similar to the NVIDIA GPU: a relatively general computing architecture that fits the requirements of AI computing well.
However, it lacks the kind of coprocessing optimization provided by Tensor Cores, which puts it at a performance disadvantage, and it has not yet built a development framework and ecosystem as powerful and rich as NVIDIA's CUDA.


3. Tesla DOJO


The Tesla Dojo chip and its entire cluster system depart significantly from traditional design concepts.
The design rests on POD-level super-scalability and co-design across the entire system stack.
Every node in the Dojo system is fully symmetric, forming a complete UMA architecture at the POD level.
In other words, Dojo's scalability spans chips, tiles and cabinets, all the way up to the POD level.

DOJO is Tesla's chip, cluster and end-to-end solution dedicated to data-center AI training.
Its scalability lets AI engineers focus on model development and training itself, while paying far less attention to hardware-related details such as model partitioning and inter-node communication.

DOJO is also a relatively general computing architecture: each core is a CPU plus an AI coprocessor, multiple cores form a chip, and chips are then organized up to the POD level. At the macro level, it is close to the overall approach of the NVIDIA GPU.


4. Tenstorrent Grayskull & Wormhole

Tesla's Dojo and Tenstorrent's AI chips are both associated with Jim Keller, who led silicon development at Tesla before joining Tenstorrent, and their architectural design concepts have much in common.



The basic architectural unit is the Tensix core, which is built around a large compute engine; a single dense math unit delivers the vast majority of its roughly 3 TOPS of compute.



Tenstorrent's Grayskull accelerator chip implements a 12x10 array of Tensix cores with a peak performance of 368 INT8 TOPS.
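
A quick arithmetic cross-check of these figures, using only the numbers quoted above:

```python
# Rough cross-check of the Grayskull figures: a 12 x 10 grid of Tensix
# cores versus the stated 368 INT8 TOPS peak.
rows, cols = 12, 10
peak_int8_tops = 368
cores = rows * cols                      # 120 Tensix cores
tops_per_core = peak_int8_tops / cores   # ~3.07 TOPS per core
print(f"{cores} Tensix cores, ~{tops_per_core:.2f} INT8 TOPS per core")
# Consistent with the roughly 3 TOPS per Tensix core mentioned above.
```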



Tenstorrent's first-generation chip is code-named Grayskull, and its second-generation chip is code-named Wormhole. The macro architecture of the two is similar.

Using Wormhole modules, Tenstorrent designed Nebula, a 4U server containing 32 Wormhole chips.



A full 48U rack then behaves like a 2D grid, with each Wormhole server connected to its peers in neighboring servers, forming one large, uniform mesh network.

Through this mesh-style interconnect, Tenstorrent achieves extreme cluster scalability; the overall idea is similar to Tesla's DOJO.

