CPU and GPU complexity compared, and the importance of compute-memory and computing power interconnects

 

LLM | AMD | Intel | NVIDIA

GLM | ARM | AIGC | Chiplet

With the rapid development of deep learning, high-performance computing, NLP, AIGC, GLM, and AGI, large models have advanced quickly; the hottest technology in the tech world in 2023 is undoubtedly the large model. According to the "China Artificial Intelligence Large Model Map Research Report" released by the New Generation Artificial Intelligence Development Research Center of the Ministry of Science and Technology, China has released 79 large models with more than 1 billion parameters, close to a "battle of a hundred models". Fourteen provinces, regions, and municipalities in China are actively engaged in large-model R&D, with Beijing hosting 38 projects and Guangdong 20.

At present, natural language processing is the most active field of large-model R&D, followed by multimodal models, while relatively few large models target computer vision and intelligent speech. Universities, research institutes, and enterprises have all joined large-model development, but collaboration between academia and industry still needs to be strengthened.

As the parameters of large AI models grow from billions to trillions, more and more attention is being paid to the ultra-large-scale computing power and network connectivity required to support large-model training. Training a large model requires a huge GPU server cluster for compute, with massive amounts of data exchanged over the network. Yet even if each individual GPU is powerful, the computing power of the whole cluster drops sharply when the network cannot keep up. A large cluster therefore does not automatically mean large computing power; on the contrary, the larger the GPU cluster, the greater the additional communication overhead. This article compares the complexity of CPUs and GPUs, discusses combining multiple kinds of computing power (CPU+GPU), and examines the importance of compute-memory interconnect and computing power interconnect.

Multiple computing power: CPU+GPU

The popularity of ChatGPT has put intelligent computing centers back in the spotlight, and GPUs have become the object of competition among major companies. GPUs are not only the core of intelligent computing centers but are also widely used in supercomputing. According to the latest TOP500 list (released in late May 2023):

  • The number of systems using accelerators or coprocessors rose from 179 on the previous list to 185, of which 150 use NVIDIA Volta (e.g., V100) or Ampere (e.g., A100) GPUs.

  • Seven of the top 10 systems use GPUs; among the top 5, only the second-ranked system does not.


 

Of course, the CPU remains indispensable. Among the top 10 systems, AMD EPYC family processors take 4 spots, Intel Xeon family processors and IBM POWER9 take 2 each, and Arm takes 1: the Fujitsu A64FX, ranked second.

General-purpose computing power and intelligent computing power complement each other to meet diverse computing needs. Take MareNostrum 5, being deployed by the European High Performance Computing Joint Undertaking (EuroHPC JU), as an example: its general-purpose partition based on 4th Gen Intel Xeon Scalable processors was scheduled to open in June 2023, while the "next-generation" general-purpose partition based on the NVIDIA Grace CPU and the accelerated partition combining 4th Gen Intel Xeon Scalable processors with NVIDIA Hopper GPUs (H100) were planned to enter service in the second half of 2023.

1. GPU: big chips and chiplets

Despite NVIDIA's dominance of the GPU market, AMD and Intel have not given up the fight. Among the top 10 of the latest TOP500 list, 4 systems are based on AMD EPYC family processors: 2 are paired with AMD Instinct MI250X and 2 with NVIDIA A100, and the MI250X systems rank first and third.

When it comes to AI applications, however, NVIDIA GPUs hold a clear advantage. The NVIDIA H100 Tensor Core GPU unveiled at GTC 2022 further cemented that leadership. The H100 is based on the NVIDIA Hopper architecture, is built on TSMC's N4 process, packs up to 80 billion transistors, and improves compute, memory, and connectivity across the board:

  • 132 SMs (streaming multiprocessors) and 4th-generation Tensor Cores double the performance;

  • a 50MB L2 cache, larger than the previous generation's, plus memory upgraded to HBM3, form a new memory subsystem;

  • the 4th-generation NVLink delivers 900GB/s of total bandwidth and adds support for NVLink Network, while PCIe is upgraded to 5.0.

In January 2023, Intel launched the Intel Data Center GPU Max series, code-named Ponte Vecchio, alongside the 4th Gen Intel Xeon Scalable processors and the Intel Xeon CPU Max series. The Data Center GPU Max series uses Intel's Foveros and EMIB technologies to integrate 47 chiplets into a single product, totaling more than 100 billion transistors, with up to 408MB of L2 cache and 128GB of HBM2e memory, fully embodying the chiplet concept.

 


2. CPU: performance core and energy efficiency core

The trend in general-purpose computing power is diversification to meet the needs of different application scenarios. In phones, PCs, and other client products, the proven big-core/little-core architecture has gradually become mainstream, and the idea has spread to the server CPU market. Servers, however, operate as clusters, so there is no urgent need to combine big and little cores inside the same CPU. Mainstream vendors instead meet different customers' needs with CPUs that are all big cores or all little cores: big cores emphasize single-core performance and suit scale-up, while little cores emphasize core density and suit scale-out.

As the inventor of big.LITTLE, Arm introduced the concept of heterogeneous cores, and it has been present in the server CPU market for some time. Arm's Neoverse platform is divided into three series:

  • The Neoverse V series is used to build high-performance CPUs with the highest per-core performance, suitable for workloads such as high-performance computing and AI/ML acceleration;

  • Neoverse N series focuses on scale-out performance, providing an optimized balanced CPU design to provide ideal performance per watt, suitable for scale-out cloud, enterprise network, smart network card DPU, custom ASIC accelerator, 5G infrastructure, and edge scenarios with limited power and space;

  • the Neoverse E series is designed for high data throughput at minimal power consumption, targeting network data-plane processors for 5G deployments and low-power gateways.

These series are designed to meet the needs of different fields and applications.

 

In large-scale cloud computing centers, intelligent computing centers, and supercomputing centers and other application scenarios, the practice of large and small core architectures in the data center market can be summarized into two series: the V series focuses on the vertical expansion of single-core performance (Scale-up), while the N series focuses on the horizontal expansion of multi-core performance (Scale-out).

In these scenarios, representative V-series products include the 64-core AWS Graviton3 (believed to be based on V1) and the 72-core NVIDIA Grace CPU (V2). Besides Alibaba Cloud's 128-core Yitian 710 (believed to be N2), N-series designs are also widely used in DPUs. The recently announced AmpereOne uses the A1 core developed in-house by Ampere Computing and offers up to 192 cores, which is closer in style to the Neoverse N series.

 

Intel also announced a similar plan at a conference for investors:

  • 5th Generation Intel Xeon Scalable processors (codenamed Emerald Rapids) in Q4 2023

  • The next generation, code-named Granite Rapids, launches in 2024. These processors will continue the current performance core (P-Core) route. In addition, the CPU code-named Sierra Forest launched in the first half of 2024 will be the first-generation energy-efficient core (E-Core) Xeon processor with 144 cores.

The 5th Gen Intel Xeon Scalable processors share a platform with the 4th generation for easy migration, while Granite Rapids and Sierra Forest will move to the Intel 3 process. The combination of P-cores and E-cores has already been proven on Intel's client CPUs; an important difference between the two is hyper-threading support. An E-core carries only one thread, focuses on energy efficiency, and suits cloud-native applications that favor higher physical core density.

 

AMD's strategy is much the same.

  • In November 2022, AMD released the 4th Gen EPYC processor code-named Genoa, using 5nm Zen 4 cores with up to 96 cores.

  • In mid-2023, AMD will follow with a "cloud-native" processor code-named Bergamo, rumored to have as many as 128 cores, achieving higher core density by shrinking the cores and caches.

Although there is a difference in the number of cores between the two routes of performance cores and energy efficiency cores, increasing the number of cores is a consensus. As the number of CPU cores continues to grow, the requirements for memory bandwidth are getting higher and higher, and it is not enough to just upgrade to DDR5 memory. AMD's fourth-generation EPYC processor (Genoa) has expanded the number of DDR channels per CPU from 8 to 12, and Ampere Computing has similar plans.

 

However, a CPU with more than 100 cores has exceeded the actual needs of some enterprise users, and each CPU's 12 memory channels in a dual-socket configuration also bring new challenges to server motherboard design. Under the influence of various factors, it remains to be seen whether the share of single-socket servers in the data center market will increase significantly.

AMD's 4th Gen EPYC processors have 12 DDR5 memory channels, but whether in a single-socket (2 DIMMs per channel, 2DPC) or dual-socket (1DPC) configuration, the board carries at most 24 memory slots, fewer than the 32 slots of a dual-socket 2DPC configuration built around 8-channel CPUs. In other words, the number of memory channels per CPU goes up, but the number of memory slots in a dual-socket configuration goes down.
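A quick tally (sockets × channels × DIMMs per channel) shows where these slot counts come from:

\[
1 \times 12 \times 2 = 24, \qquad 2 \times 12 \times 1 = 24, \qquad \text{versus} \qquad 2 \times 8 \times 2 = 32
\]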

3. GPU vs. CPU: which chip is more complex?

In terms of massively parallel computing, both high-end GPUs (such as NVIDIA's A100 or AMD's Radeon Instinct MI100) and high-end CPUs (such as Intel's Xeon series or AMD's EPYC series) have their own complexities. GPU has a large number of CUDA cores or stream processors, and focuses on parallel computing to support high-performance parallel tasks. The high-end CPU has more cores, higher frequency and complex hyper-threading technology to provide powerful general-purpose computing capabilities. Let's take a look at it from the dimensions of application scenarios, number of transistors, and architecture design.


 

1. Application scenarios

GPU has a large number of computing cores, dedicated memory and high-speed data transmission channels. Its design focuses on meeting the needs of graphics rendering and computing-intensive applications, emphasizing large-scale parallel computing, memory access and graphics data stream processing. The core concept of the GPU is parallel processing. By having more processing units, it can perform a large number of parallel tasks at the same time. This makes GPUs excellent at parallelizable workloads such as graphics rendering, scientific computing, and deep learning. In contrast, CPUs focus on general-purpose computing and a broad range of applications, often featuring multiple processing cores, cache hierarchies, and complex instruction set architectures.

2. The number of transistors

Top-of-the-line GPUs typically have more transistors because they require a large number of parallel processing units to support high-performance computing. For example, NVIDIA's A100 GPU has about 54 billion transistors, while AMD's EPYC 7742 CPU contains about 39 billion transistors. The difference in transistor count reflects the importance and focus of GPUs on parallel computing.

3. Architecture design

CPUs are generally considered more complex from an architectural and design perspective. CPUs need to handle a variety of different types of tasks and need to be optimized to perform them as fast as possible. In order to achieve this goal, the CPU uses many complex techniques, such as pipelining, out-of-order execution, branch prediction, and hyperthreading. These technologies are designed to improve the performance and efficiency of the CPU.

In contrast, a top-of-the-line GPU may be larger in hardware scale (such as the number of transistors), but may be relatively simplified in architecture and design. The design of the GPU focuses on massively parallel computing and graphics data flow processing, so its architecture is more focused and optimized for these specific tasks.

1) GPU architecture

GPUs have some key architectural properties.

First, it has a large number of parallel processing units (cores), each of which can execute instructions simultaneously; NVIDIA's Turing architecture, for example, has thousands of such units, known as CUDA cores. Second, the GPU adopts a hierarchical memory architecture, including global, shared, local, and constant memory, with the faster tiers used to cache data and reduce accesses to global memory. The GPU also relies on hardware for thread scheduling and execution to stay efficient; in NVIDIA GPUs, threads are scheduled and executed in warps of 32. There are special function units as well, such as texture and rasterization units designed specifically for graphics rendering, and the latest GPUs add units for deep learning and AI, such as Tensor Cores and RT Cores. Finally, the GPU is built around streaming multiprocessors and a SIMD (Single Instruction, Multiple Data) style of execution, so one instruction can operate on many data elements in parallel.
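To make the SIMT/warp idea concrete, here is a minimal CUDA sketch; the kernel, its name, and the sizes are illustrative only and not taken from any product discussed above. A single instruction stream is launched across roughly a million threads, which the hardware groups into warps of 32 and schedules onto the streaming multiprocessors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the same instruction stream runs across many
// threads, which the hardware schedules in warps of 32.
__global__ void scale(const float* in, float* out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = k * in[i];                         // data-parallel work
    }
}

int main() {
    const int n = 1 << 20;                          // ~1M elements
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));      // unified memory for brevity
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    int block = 256;                                // 8 warps per thread block
    int grid = (n + block - 1) / block;             // enough blocks to cover n
    scale<<<grid, block>>>(in, out, 2.0f, n);       // thousands of threads in flight
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A CPU would typically walk the same array in a loop (perhaps with SIMD intrinsics); the GPU instead relies on massive thread-level parallelism, using the surplus of in-flight warps to hide memory latency.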

Specific GPU architectures vary by manufacturer and product line; for example, NVIDIA's Turing and Ampere architectures differ in key ways from AMD's RDNA architecture. Nevertheless, all GPU architectures follow the basic idea of parallel processing.

2) CPU architecture

The architectural design of a CPU (Central Processing Unit) involves multiple fields, including hardware design, microarchitecture, and instruction set design.

The instruction set architecture (ISA) is the foundation of the CPU, which defines the operations the CPU can perform and how those operations are coded. Common ISAs include x86 (Intel and AMD), ARM, and RISC-V. Modern CPUs use pipelining to break down instructions into stages to increase instruction throughput. CPUs also contain caches and memory hierarchies to reduce the latency of accessing memory. Out-of-order execution and register renaming are the key optimization methods of modern CPUs, which can improve the parallel execution ability of instructions and solve the data hazard problem. Branch prediction is an optimization technique used to predict the result of a conditional jump instruction to avoid stalls caused by waiting for the result of the jump. Modern CPUs usually have multiple processing cores, each of which can execute instructions independently, and some CPUs also support multi-threading technology, which can improve the utilization of cores.

CPU architecture design is an extremely complex process that needs to consider multiple factors such as performance, energy consumption, area, cost, and reliability.

4. Chiplet and chip layout

AMD and Intel differ in how they implement chiplets for CPUs. Starting with the 2nd Gen EPYC, code-named Rome, AMD separated I/O functions such as the DDR memory controllers, Infinity Fabric, and PCIe controllers from the CCDs and concentrated them in an independent I/O die (IOD) that acts as a switch. The IOD continues to use the mature 14nm process, while the share of CCD area occupied by the 8 cores and L3 cache rose from 56% to 86%, gaining more benefit from the 7nm process. Manufacturing the IOD and CCDs separately and combining them as needed brings many advantages:

1. Independent optimization

Each part can use the process that suits its cost profile, according to the different requirements of I/O, compute, and storage (SRAM). For example, the 4th Gen EPYC processor code-named Genoa pairs 5nm CCDs with a 6nm IOD.

2. High flexibility

One IOD can be paired with different numbers of CCDs to yield CPUs with different core counts. For example, the 2nd Gen EPYC processor code-named Rome supports up to 8 CCDs but can also be configured with 6, 4, or 2, easily covering 8 to 64 cores. Think of a CCD as an 8-core CPU and the IOD as the north bridge or MCH (memory controller hub) of older servers: the 2nd Gen EPYC is then equivalent to a miniaturized eight-socket server. Building a 64-core CPU this way is much simpler than putting 64 cores on a single die, with advantages in yield and flexibility.

3. Scaling up is easier

By increasing the number of CCDs, with the support of IOD, more CPU cores can be easily obtained. For example, the fourth-generation EPYC processor expands the number of cores to 96 by using 12 CCDs. This chiplet implementation brings many advantages to AMD and is more competitive in terms of process selection, flexibility and scalability.

 

AMD 4th generation EPYC processor, 12 CCDs surround 1 IOD

The second to fourth-generation EPYC processors adopt a star topology, connecting multiple smaller-scale CCDs with the IOD as the center. The advantage of this architecture is that it can flexibly increase the number of PCIe and memory controllers and reduce costs. However, the downside is that any core is farther away from other resources, potentially limiting bandwidth and increasing latency.

In the past, AMD has made EPYC processors perform well in multi-core performance by virtue of its process advantages and higher core count. However, as competitors such as Intel and Arm improve their manufacturing processes and improve the performance of large cores, AMD's advantage in core numbers may weaken, making it difficult to sustain its multi-core performance advantage. At the same time, the multi-core CPUs of other manufacturers adopt a grid layout, which reduces the access distance between the core and other resources through fast interconnection, and more effectively controls the delay.

5. Arm new upgrades: NVIDIA Grace and AmpereOne

Arm has been hoping to gain a foothold in the server market. Companies such as Amazon, Qualcomm, and Huawei have launched server CPUs based on the Arm instruction set. Arm's position in the server CPU market is gradually strengthening as products such as Amazon's Graviton and Ampere Altra gain a firm foothold in the market. At the same time, with the rise of heterogeneous computing, Arm's influence in high-performance computing and AI/ML computing power infrastructure is also expanding.

NVIDIA Grace, a data center-specific CPU based on the Arm Neoverse architecture launched by Nvidia, has 72 cores. The Grace CPU superchip consists of two Grace chips connected together via NVLink-C2C, providing 144 cores and 1TB of LPDDR5X memory in a single socket. In addition, NVIDIA also announced that Grace can be connected to Hopper GPU through NVLink-C2C to form a Grace Hopper super chip.

NVIDIA Grace is an important product based on Arm Neoverse V2 IP. Its transistor count has not been announced, but the AWS Graviton3 and Alibaba Cloud Yitian 710 offer reference points: Graviton3, based on Neoverse V1, is estimated at about 55 billion transistors for 64 cores and 8-channel DDR5 memory; Yitian 710, based on Neoverse N2, at about 60 billion transistors for 128 cores, 8-channel DDR5, and 96 lanes of PCIe 5.0. Judging from renderings of the NVIDIA Grace Hopper Superchip, the Grace die is close in area to the Hopper die, which is known to have 80 billion transistors. It is therefore reasonable to speculate that the 72-core Grace die has more transistors than Graviton3 or Yitian 710, consistent with Grace being based on Neoverse V2 (supporting the Armv9 instruction set and SVE2).

The interconnection solution for Arm Neoverse V2 is CMN-700, which is called SCF (Scalable Coherency Fabric) in NVIDIA Grace. Nvidia claims that Grace's grid supports expansion of more than 72 CPU cores. In fact, 80 CPU cores can be counted in the accompanying picture of Nvidia's white paper. Each core has 1MB L2 cache, and the entire CPU has up to 117MB L3 cache (1.625MB per core on average), which is significantly higher than other Arm processors of the same level.

 

Grid layout for NVIDIA Grace CPUs

The SCF inside the NVIDIA Grace die provides 3.2TB/s of bisection bandwidth, connecting the CPU cores, memory controllers, and system I/O controllers such as NVLink. The nodes in the mesh are called CSNs; typically each CSN connects 2 cores and 2 SCCs (SCF cache slices). As the schematic shows, however, the 4 CSNs at the corners of the mesh connect 2 cores and 1 SCC, while the 4 CSNs in the middle of two sides connect 1 core and 2 SCCs. Overall there should be 80 cores and 76 SCCs in Grace's mesh, of which 8 cores may be disabled for reasons such as yield. The 4 cores and 8 SCCs "missing" from the periphery of the mesh make room for connections to NVLink, NVLink-C2C, PCIe, and the LPDDR5X memory controllers.

NVIDIA Grace supports many of Arm's management features, such as Server Base System Architecture (SBSA), Server Base Boot Requirements (SBBR), Memory Partitioning and Monitoring (MPAM), and Performance Monitoring Unit (PMU). Through Arm's memory partitioning and monitoring functions, the problem of performance degradation during CPU access to cache due to shared resource competition can be solved. High-priority tasks can first occupy the L3 cache, or pre-allocate resources according to virtual machines to achieve performance isolation between services.

NVIDIA Grace CPU Superchip

 

As the representative of the latest and strongest version of the Arm architecture core (Neoverse V2), NVIDIA Grace has attracted great attention from the industry, especially considering that it will benefit from NVIDIA's powerful GPGPU technology. People finally had the opportunity to witness Grace in person at GTC 2023, but it will take some time to verify the actual market performance. Expectations are high for Grace's performance in areas such as supercomputing and machine learning.

The actual Grace super chip shown in the GTC2023 speech

6. Two types of chiplets in grid architecture

AmpereOne uses the now-popular chiplet approach, no surprise given its up to 192 cores and 384MB of L2 cache. The general speculation is that its design resembles AWS Graviton3: the CPU cores and cache sit on a single die, with the DDR controller dies on either side and the PCIe controller die below. Separating the CPU cores and cache from the controllers handling external I/O onto different dies is the mainstream way to implement chiplets for server CPUs.

IOD centered AMD 2nd generation EPYC processor, and core die centered AWS Graviton3 processor

As mentioned earlier, the AMD EPYC family adopts a star topology, placing the I/O in one IOD with the CPU-core-and-cache dies (CCDs) around it. The mesh architecture dictates the opposite: its CPU cores and cache must sit in the center, with the I/O scattered around the periphery. So when a mesh design is split into chiplets, the layout is reversed: the central die is larger and the surrounding dies are smaller.

Compared with the EPYC family's approach, the mesh architecture is more tightly integrated, and its inherently monolithic structure is not well suited to splitting. In a mesh, the utilization of intersections (nodes) must be considered; too many vacant nodes waste resources, so shrinking the mesh may be more effective.

Take the 1st Gen Intel Xeon Scalable processor as an example: to cover core counts (CC) from 4 to 28, three die configurations (die chops) were offered. The 6×6 XCC (eXtreme Core Count) supports up to 28 cores, the 6×4 HCC (High Core Count) up to 18 cores, and the 4×4 LCC (Low Core Count) up to 10 cores.

Seen this way, it is understandable that AmpereOne does not cover 128 cores and below, unless additional die configurations are added, which depends on the company's scale and shipment volumes and must be justified by mass production. The 4th Gen Intel Xeon Scalable processors offer two die configurations.

Among them, MCC (Medium Core Count) mainly serves 32 cores and below, fewer than the 40 cores of the 3rd Gen Intel Xeon Scalable processor code-named Ice Lake, so its mesh is 7×7, one column smaller than the latter's 7×8, accommodating up to 34 cores and their caches. The 36- to 60-core range must be covered by XCC, the chiplet version mentioned above, which cuts the mesh architecture into 4 roughly equal parts, a rather unique approach.

The XCC version of the 4th Gen Intel Xeon Scalable processor consists of two die types that are mirror images of each other, arranged in a large 2×2 matrix, so the whole package is highly symmetrical both vertically and horizontally, whereas the previous three generations and the MCC version of the same generation are not as symmetrical.

 

 

Intel describes the 4th Gen Intel Xeon Scalable processor (XCC version), stitched together from 4 dies, as a quasi-monolithic die. The monolithic case is easier to understand, and that is how common mesh architectures are built. In the 4th Gen Xeon Scalable processor, DDR memory controllers sit on the left and right of the outer ring, PCIe controllers and integrated accelerators (DSA/QAT/DLB/IAA) on the top and bottom, and UPI at the four corners: a typical mesh-architecture layout.

 

EMIB Connectivity for 4th Generation Intel Xeon Scalable Processors

Interconnect

Chiplets and CXL

"Counting from the east and saving from the west" is the foundation and prelude of "counting from the east and calculating from the west", not a subset. It involves the relationship between data, storage, and computation. Data is usually generated in the densely populated east and stored in the sparsely populated west. The main difficulty lies in how to complete data transmission at a lower cost. Computing requires frequent access to data, and in the case of cross-regional, network bandwidth and delay become insurmountable obstacles.

Compared with data transfer and computation, storage does not consume much energy but takes up a lot of space, and core area is always a scarce resource. Just as the core districts of a major city will not be used to build hyperscale data centers, the core area of a CPU can spare only limited silicon for memory.

Realizing "counting from the east to the west" is not a one-off thing, and the ultra-large-scale data center is gradually alienated from the core city, and the farther the better, the better. A layered storage system has also been established around the CPU. Although the cache and memory belong to volatile memory (memory), usually the data in the intermediate state has higher requirements for access latency, so it needs to be closer to the core. If it is data that needs to be stored for a long time, it doesn't matter if it is far away from the core, and data with low access frequency can be stored in a remote location (west storage).

The memory closest to the CPU cores is the cache hierarchy, every level of which is based on SRAM. Even within the cache there is a distinction between near and far: in modern CPUs, the L1 and L2 caches are already part of each core, and the main footprint to consider is the L3 cache.

1. The area law of SRAM

It is now 2023 and manufacturing processes are moving to 3nm. TSMC disclosed that the SRAM cell area of its N3 process is 0.0199 square microns, only about 5% smaller than the 0.021 square microns of the N5 process. More troubling, because of yield and cost issues, N3 is not expected to become TSMC's mainstream node; customers care more about the second-generation 3nm process, N3E, whose SRAM cell area is 0.021 square microns, exactly the same as N5.

In terms of cost, an N3 wafer is said to cost 20,000 US dollars versus 16,000 US dollars for N5, which makes N3 SRAM about 25% more expensive than N5 SRAM. For reference, the SRAM cell area of the Intel 7 process (formerly 10nm) is 0.0312 square microns, while that of Intel 4 (formerly 7nm) is 0.024 square microns, comparable to TSMC's N5 and N3E.
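As a rough illustration of what these cell sizes mean (ideal cell area only, ignoring the peripheral circuitry and array overhead that real caches carry), one square millimeter of N5/N3E-class SRAM cells works out to:

\[
\frac{1\ \mathrm{mm^2}}{0.021\ \mu\mathrm{m^2/bit}} \approx 4.8\times 10^{7}\ \mathrm{bits} \approx 6\ \mathrm{MB}
\]

which is why every additional tens of megabytes of L3 cache translates directly into tens of square millimeters of expensive leading-edge silicon.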

Although the quotations of semiconductor manufacturers are commercial secrets, the price of SRAM is getting higher and higher, and the density is difficult to increase. Therefore, it is a reasonable choice to manufacture SRAM separately and combine it with advanced packaging technology to achieve high bandwidth and low latency.

 

2. Stack up and climb over the memory wall

Currently, AMD's architecture lags in memory performance. The reasons include: the average memory bandwidth per core is relatively low because of the large core count; the distance between core and memory is relatively long, causing higher latency; and cross-CCD bandwidth is too small. To compensate for this disadvantage in memory access, AMD needs a larger L3 cache.

However, from Zen 2 through Zen 4, AMD's L3 cache per CCD has stayed at 32MB and has not kept pace. To work around the lag in SRAM scaling, AMD chose to expand SRAM independently of the CPU die. On the EPYC 7003X series processors code-named Milan-X, AMD applied the first generation of 3D V-Cache. These processors use Zen 3 cores, and each cache die (L3 cache die, L3D for short) has a capacity of 64MB and an area of about 41mm², manufactured on a 7nm process.

The cache die is vertically connected to the (back side of the) CCD using hybrid bonding and TSVs (through-silicon vias). The stack consists of 4 components: the CCD at the bottom, the L3D in the middle of the upper layer, and support structures on either side of the upper layer; blank silicon balances the structure vertically and conducts heat from the CCX (Core Complex) below up to the heat spreader.

 

Layout of a Zen 4 CCD; note the area occupied by the L3 cache

When designing the Zen 3 core, AMD reserved the necessary logic and TSV circuitry in advance, which increased CCD area by about 4%. The L3D stacks directly above the L2/L3 cache region of the CCD. This matches the CCD layout, in which the cache sits in the middle of the bidirectional ring bus with the CPU cores on both sides; and since the power density of the (L3) cache is lower than that of the CPU cores, it helps keep heat in the stacked cache region under control.

 

3D V-Cache structure diagram

The L3 cache of the Zen 3 architecture consists of 8 slices of 4MB each, while the L3D is designed as 8 slices of 8MB each. There are 1024 TSV connections between each corresponding pair of slices, 8192 in total. AMD claims the extra L3 cache adds only 4 cycles of latency.

With the introduction of Zen 4 processors came the second generation of 3D V-Cache. Its bandwidth rose from 2TB/s to 2.5TB/s while the capacity remains 64MB; the process is still 7nm, but the area shrank to 36mm². The reduction comes mainly from the TSV region: AMD claims the area of that region was cut by 50% without shrinking the minimum TSV pitch. The EPYC products code-named Genoa-X were expected to launch in mid-2023.

Increasing SRAM capacity can greatly improve the cache hit rate and reduce the impact of memory latency on performance. AMD's 3D V-Cache achieves a huge increase in cache capacity at a relatively reasonable cost (adding twice the L3 capacity already in the CCD), and the resulting performance gain is obvious.

 

AMD EPYC 7003X processor with 3D V-Cache

3. The Rise of HBM: From GPU to CPU

HBM (High Bandwidth Memory) is a technology jointly released by AMD and SK Hynix in 2014. It uses TSV technology to stack multiple DRAM chips together to greatly increase capacity and data transfer rate. Subsequently, companies such as Samsung, Micron, NVIDIA, and Synopsys also actively participated in this technical route. The standardization organization JEDEC also included HBM2 in the standard (JESD235A) and successively launched HBM2e (JESD235B) and HBM3 (JESD235C). Mainly due to the stacked package and huge bit width (1024 bits in a single package), HBM provides far more bandwidth and capacity than other common memory forms (such as DDR DRAM, LPDDR, GDDR, etc.).

 

A typical implementation is to connect to the core of the processor through a 2.5D package, and it is widely used in products such as CPUs and GPUs. In the early days, some people regarded HBM as an L4 cache, and this view is also reasonable from the perspective of TB/s-level bandwidth. From a capacity standpoint, HBM is much larger than SRAM or eDRAM.

HBM can therefore serve both as (part of) a cache and as high-performance memory. AMD was an early adopter: the AMD Instinct MI250X compute card integrates 2 compute dies and 8 HBM2e stacks in a single package, for a total capacity of 128GB and bandwidth of 3276.8GB/s. NVIDIA mainly applies HBM to its professional cards: the 2016 Tesla P100 HBM version carried 16GB of HBM2, the subsequent V100 32GB of HBM2, and the currently popular A100 and H100 also come in HBM versions, the former offering up to 80GB of HBM2e at about 2TB/s and the latter upgraded to HBM3 at about 3.9TB/s. Huawei's Ascend 910 processor also integrates 4 HBM stacks.
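The MI250X figure above can be reconstructed from HBM's 1024-bit-per-stack interface, assuming the common HBM2e data rate of 3.2Gb/s per pin:

\[
\frac{1024\ \mathrm{bits} \times 3.2\ \mathrm{Gb/s}}{8} = 409.6\ \mathrm{GB/s\ per\ stack}, \qquad 8 \times 409.6\ \mathrm{GB/s} = 3276.8\ \mathrm{GB/s}
\]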

For compute cards, smart NICs, and high-speed FPGAs, HBM has matured as a substitute for GDDR. CPUs have also begun to integrate HBM. The most prominent case is Fugaku, the former TOP500 champion, built on the A64FX processor developed by Fujitsu. Based on the Armv8.2-A architecture and a 7nm process, the A64FX integrates 4 HBM2 stacks per package, for a total capacity of 32GB and bandwidth of 1TB/s.

 

Fujitsu A64FX CPU

4. Downward development: support from the base layer

The Intel Data Center GPU Max series introduces the concept of the base tile, which can be regarded as a base die. Compared with an interposer, the base tile can be thought of as a base layer. On the surface, the base layer and a silicon interposer serve similar purposes, both carrying compute dies and high-speed I/O (such as HBM), but in fact the base layer does more.

A silicon interposer mainly uses mature lithography, deposition, and related processes (around the 65nm level) to form high-density electrical connections on silicon. The base layer goes a step further: since multiple layers of patterns are being processed anyway, why not integrate functions such as logic circuits as well?

 

Intel Data Center Max GPU

At ISSCC 2022, Intel presented the chiplet architecture of the Intel Data Center GPU Max. The base tile has an area of 640mm² and uses the Intel 7 process, currently Intel's advanced node for mainstream processors. Why use an advanced process on a "base" die? Because Intel integrates the SerDes for high-speed I/O into the base tile, similar in spirit to AMD's IOD. This high-speed I/O includes the HBM PHY, Xe Link PHY, PCIe 5.0, cache, and so on. Such circuits benefit little from 5nm-class and newer nodes, so decoupling them from the compute dies and fabricating them on a suitable process is quite a reasonable choice.

 

Chiplet Architecture for Intel Data Center Max GPUs

The Intel Data Center GPU Max series uses Foveros packaging to stack 8 compute tiles and 4 RAMBO tiles on top of each base tile. The compute tiles are manufactured on TSMC's N5 process, each carrying 4MB of L1 cache. RAMBO stands for "Random Access Memory, Bandwidth Optimized". The standalone RAMBO tiles are built on the Intel 7 process; each has four 3.75MB banks for 15MB, so a group of 4 RAMBO tiles provides 60MB of L3 cache in total. In addition, the base tile itself contains a 144MB RAMBO and the L3 cache switch fabric.

Thus, in the Intel Data Center GPU Max, the base tile uses this cache switch fabric to organize the 144MB of cache in the base layer together with the 60MB provided by the 4 RAMBO tiles serving the 8 compute tiles, providing 204MB of L2/L3 cache per group. The whole package contains two such groups, 408MB of L2/L3 cache in total. Each group of processing units connects to the other 7 groups through the Xe Link tile, which is manufactured on TSMC's N7 process.

 

Logical architecture of Xe HPC

As mentioned earlier, moving I/O onto its own die is the general trend, and bringing shared cache and I/O closer together is another. The Intel Data Center GPU Max integrates cache and the various high-speed I/O PHYs in the same die, a culmination of both trends. The HBM, the Xe Link tiles, and the adjacent base tile within the same package are connected through EMIB (the orange parts in the exploded view).

 

Intel Data Center Max GPU exploded view

According to data Intel released at Hot Chips, the total L2 cache bandwidth of the Data Center GPU Max reaches 13TB/s. Since two groups of base and compute tiles are packaged together, doubling the bandwidth, a single base tile with its 4 RAMBO tiles delivers 6.5TB/s, still far higher than the current L2 and L3 cache bandwidth of Xeon and EPYC. AMD had already demonstrated the performance of 3D packaging with its fingernail-sized 3D V-Cache, let alone the much larger RAMBO and base tiles of Intel's Data Center GPU Max.

Recall that one weakness of 3D V-Cache is heat dissipation. Integrating the cache into the base die has an advantage here: the high-power compute dies sit on the upper layer of the package, which favors heat dissipation. Looking further, in mesh-based processor architectures the L3 cache is not just a few large blocks (slices) but is divided into dozens or even hundreds of units attached to mesh nodes. A base die can fully cover the processor die vertically, so its SRAM can be divided into an equal number of units and connected to the processor's mesh nodes.

In today's mature 3D packaging, bump pitches in the 30-50 micron range are enough for hundreds to thousands of connections per square millimeter, which meets the bandwidth needs of current mesh nodes. Higher-density connections are also feasible: 10-micron and even sub-micron technologies keep advancing, but their priority scenarios are highly customized in-stack hybrid bonding such as HBM and 3D NAND, which may not match chiplets' need for flexibility.
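The "hundreds to thousands of connections per square millimeter" claim follows directly from the pitch, assuming a simple square grid of bumps:

\[
\left(\frac{1000\ \mu\mathrm{m}}{50\ \mu\mathrm{m}}\right)^2 = 400\ \mathrm{/mm^2}, \qquad \left(\frac{1000\ \mu\mathrm{m}}{30\ \mu\mathrm{m}}\right)^2 \approx 1100\ \mathrm{/mm^2}
\]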

5. Standardization: Chiplet and UCIe

To realize this vision, in March 2022 core players in the general-purpose processor market, including Intel, AMD, and Arm, jointly released a new interconnect standard, UCIe (Universal Chiplet Interconnect Express), aimed at industry-wide standardization of chiplets. The launch of this standard should help promote the development and adoption of chiplet technology.

Since the leader of the standard already has a close relationship with PCIe and CXL (Compute Express Link), UCIe places great emphasis on the collaboration with PCIe/CXL, providing the function of mapping PCIe and CXL protocols at the local end of the protocol layer. The collaboration with CXL shows that the goal of UCIe is not only to solve the interconnection problem in chip manufacturing, but to hope that the interaction between chips and devices, and between devices and devices is seamless.

In the UCIe 1.0 standard, there are two levels of applications: Chiplet (inside the package) and Rack space (outside the package). This means that UCIe can realize the interconnection between chiplets inside the chip, and also realize the interconnection between chips and devices outside the package. This flexibility enables UCIe to adapt to different application scenarios.

 

 

Rack connections planned by UCIe were handed over to CXL

1. CXL: Decoupling and Expansion of Memory

Compared with PCIe, the most important value of CXL is to reduce the access delay of each subsystem memory (theoretically, the delay of PCIe is 100 nanoseconds and that of CXL is 10 nanoseconds). This is critical for high-volume data exchange between devices, such as when a GPU accesses system memory. This improvement mainly stems from two aspects:

First of all, PCIe did not consider cache coherence issues at the beginning of its design. When performing cross-device DMA read and write data through PCIe, the memory data may have changed during the operation delay, so an additional verification process is required, which increases the instruction complexity and delay. CXL solves the problem of cache consistency through the CXL.cache and CXL.memory protocols, simplifying operations and reducing delays.

Second, PCIe was originally optimized for large traffic and large data blocks (512 bytes, 1KB, 2KB, 4KB), with the aim of reducing per-instruction overhead. CXL, by contrast, is optimized for 64-byte transfers and has lower operational latency for such fixed-size blocks. In other words, PCIe's protocol characteristics better suit block storage devices typified by NVMe SSDs, while CXL better suits compute devices that need byte-level addressability.

In addition to fully releasing the computing power of heterogeneous computing, CXL also makes the vision of memory pooling see the hope of standardization. The purpose of the CXL Type 3 device is memory buffering, and the protocol of CXL.io and CXL.memory is used to realize the expansion of remote memory. After expansion, the bandwidth and capacity of the system memory are the superposition of local memory and CXL memory modules.

In CXL 1.0/1.1, which is generally supported by the new generation of CPUs, the CXL memory module first realizes host-level memory expansion, trying to break through the development bottleneck of traditional CPU memory controllers. The reason for this is that the number of CPU cores is growing much faster than the number of memory channels.

Over the past decade, CPU core counts have grown from 8-12 to 60 or even 96, while the number of memory channels per socket has only grown from 4 to 8 or 12. Memory itself has gone through three major generations in that time, with per-channel bandwidth up roughly 1.5-2x and storage density up roughly 4x. The trend is clear: the memory channels, capacity, and bandwidth available to each CPU core are all shrinking. This is one form of the memory wall: cores cannot get enough data to run at full capacity, so overall computing efficiency drops.
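Using the endpoints cited above, the decline in memory resources per core is easy to quantify:

\[
\frac{4\ \mathrm{channels}}{8\ \mathrm{cores}} = 0.5\ \mathrm{channels/core} \quad\longrightarrow\quad \frac{12\ \mathrm{channels}}{96\ \mathrm{cores}} = 0.125\ \mathrm{channels/core}
\]

a 4x drop in channels per core, while per-channel bandwidth grew only about 1.5-2x over the same period.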

 

2. UCIe and heterogeneous computing power 

With the advent of the AI era, heterogeneous computing has become the norm. In principle, as long as power density allows, UCIe can handle the high-density integration of these heterogeneous compute units. Beyond integration, standardized chiplets also bring flexibility in function and cost: units that are not needed simply do not have to be packaged in, whereas on traditional processors unused units become "dark silicon", a waste of cost. A typical example is DSA: several accelerators in Intel's 4th Gen Xeon Scalable processors can be enabled by users for a fee, but even if users do not pay, those DSAs have already been manufactured.

 

UCIe includes protocol layer (Protocol Layer), adaptation layer (Adapter Layer) and physical layer (Physical Layer). The protocol layer supports PCIe 6.0, CXL 2.0 and CXL 3.0 and also supports user-defined protocols. According to different packaging levels, UCIe also provides different Package modules. A lower power consumption and better performance Die-to-Die interconnect interface can be achieved by replacing the PHY and data packets of PCIe/CXL with the adaptation layer and PHY of UCIe.

 

UCIe considers two different levels of packages: Standard Package and Advanced Package. The two packages differ by orders of magnitude in bump pitch, transmission distance, and power consumption. For example, for advanced packaging, the bump pitch (Bump Pitch) is 25-55 μm, which represents the characteristics of 2.5D packaging technology using silicon interposer. Taking Intel's EMIB as an example, the current bump pitch is about 50 μm, and it will evolve to 25 μm or even 10 μm in the future. TSMC's InFO, CoWoS, etc. also have similar specifications and evolution trends. The standard packaging (2D) specifications correspond to the most widely used organic carrier boards.

 

There are also substantial differences in the signal density of different packages. The standard package module corresponds to 16 pairs of data lines (TX, RX), while the advanced package module contains 64 pairs of data lines. Every 32 data pins also provide 2 additional pins for Lane repair. If greater bandwidth is required, more modules can be expanded and the frequency of these modules can be set independently.

 

It is worth noting that UCIe's deep integration with high-speed PCIe makes it better suited to high-performance applications. SoC (system on a chip) is a broad concept, and UCIe is oriented toward macro, system-level integration. Traditional low-cost, high-density SoCs may need to integrate large numbers of transceivers, sensors, block storage devices, and so on; IP vendors for edge inference and video-stream processing, for example, are very active, and such IP may need more flexible ways to be commercialized. Since UCIe does not address the integration of relatively low-speed devices, there is still room to standardize low-speed, low-cost interfaces.

Computing power interconnect

From inside to outside, from small to large 

As the "East Data, West Computing" project advances, subdivided scenarios such as "East Data, West Rendering" and "East Data, West Training" have emerged. Video rendering and AI/ML training are essentially offline or batch workloads that can run on top of "East Data, West Storage": once the raw material or historical data has been transferred to a data center in the western region, the computation completes within that region, with little interaction with eastern data centers, so it is not affected by cross-region latency. In other words, the business logic of "East Data, West Rendering" and "East Data, West Training" holds because computing and storage remain coupled locally, avoiding the challenge of separating storage and compute across regions.

Inside a server, there is a similar but different relationship between CPUs and GPUs. For the current popular large models, there are high requirements on computing performance and memory capacity. However, there is a "mismatch" phenomenon between the CPU and the GPU: the AI ​​computing power of the GPU is significantly higher than that of the CPU, but the direct memory (video memory) capacity usually does not exceed 100GB, which is an order of magnitude less than the TB-level memory capacity of the CPU. Fortunately, the distance between the CPU and GPU can be shortened and the bandwidth can be increased. By eliminating interconnect bottlenecks, unnecessary data movement can be greatly reduced and GPU utilization improved.

1. CPU for GPU

The cores of the NVIDIA Grace CPU are based on the Arm Neoverse V2 architecture, and its interconnect fabric, SCF (Scalable Coherency Fabric), can be regarded as a customized version of Arm's CMN-700 mesh. In terms of external I/O, however, Grace is very different from other Arm and x86 servers, which reflects NVIDIA's main intention in developing this CPU: serving GPUs that need high-speed access to large amounts of memory.

On the memory side, the Grace CPU has 16 LPDDR5X memory controllers corresponding to 8 co-packaged LPDDR5X packages, with a total capacity of 512GB, of which 480GB is usable after ECC overhead. It can be inferred that one memory controller and its corresponding LPDDR5X die are devoted to ECC. According to NVIDIA's official material, the memory bandwidth quoted alongside the 512GB capacity is 546GB/s, while the figure quoted alongside 480GB (with ECC) is about 500GB/s, so the actual usable memory bandwidth should be around 512GB/s.
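These capacity and bandwidth figures are consistent with reserving 1 of the 16 controller/die pairs for ECC, leaving 15/16 of the raw numbers usable:

\[
512\ \mathrm{GB} \times \tfrac{15}{16} = 480\ \mathrm{GB}, \qquad 546\ \mathrm{GB/s} \times \tfrac{15}{16} \approx 512\ \mathrm{GB/s}
\]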

The PCIe controller is essential. The practice of the Arm CPU is to multiplex some PCIe channels with CCIX, but such CCIX interconnection bandwidth is relatively weak, not as good as Intel's QPI/UPI dedicated to inter-CPU interconnection.

 

The NVIDIA Grace CPU provides 68 PCIe 5.0 lanes, of which two x16 links can be used as coherent NVLink (cNVLINK). The real chip-to-chip (CPU/GPU) interconnect, however, is NVLink-C2C, an interface separate from the cNVLINK/PCIe lanes, with bandwidth up to 900GB/s.

The C2C in NVLink-C2C stands for chip-to-chip. According to NVIDIA's description in its ISSCC 2023 paper, NVLink-C2C consists of 10 groups of connections (9 pairs of data signals and 1 pair of clocks per group), uses NRZ modulation at an operating frequency of 20GHz, and provides a total bandwidth of 900GB/s. The reach is 30mm within a package and 60mm on a PCB. The NVIDIA Grace CPU Superchip uses NVLink-C2C to join two Grace CPUs into a 144-core module, while the NVIDIA Grace Hopper Superchip uses it to interconnect the Grace CPU with the Hopper GPU.

The 900GB/s bandwidth of NVLink-C2C is a striking figure. For reference, Intel's 4th Gen Xeon Scalable processors (code-named Sapphire Rapids) use 3 or 4 sets of x24 UPI 2.0 links (at 16GT/s) between processors, for a total bandwidth of close to 200GB/s; AMD's 4th Gen EPYC uses the GMI3 interface between CCD and IOD at 36GB/s, while the Infinity Fabric between CPUs is equivalent to 16 lanes of PCIe 5.0 at 32GB/s per link. Between two EPYC 9004 sockets, either 3 or 4 Infinity Fabric links can be used, with 4 links totaling 128GB/s.
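For the UPI figure, the arithmetic works out as follows if each of the 24 lanes is taken to carry 16Gb/s (a simplification that ignores protocol overhead):

\[
24 \times 16\ \mathrm{Gb/s} = 384\ \mathrm{Gb/s} = 48\ \mathrm{GB/s\ per\ link}, \qquad 4 \times 48\ \mathrm{GB/s} \approx 192\ \mathrm{GB/s}
\]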

With this huge bandwidth, two Grace CPUs can be coupled far more tightly than a traditional multi-processor system, enough to rival most organic-substrate (2D) packaging solutions. Exceeding this bandwidth requires a silicon interposer (2.5D packaging). For example, the UltraFusion architecture of the Apple M1 Ultra uses a silicon interposer to connect two M1 Max dies; Apple claims UltraFusion carries more than 10,000 simultaneous signals and delivers up to 2.5TB/s of low-latency inter-processor bandwidth. Intel's EMIB is also a 2.5D packaging technology, and the interconnect bandwidth between its dies should likewise reach the TB/s level.

Another important application of NVLink-C2C is the GH200 Grace Hopper Superchip, which interconnects a Grace CPU with a Hopper GPU. Grace Hopper was a pioneering programmer, widely credited with popularizing the term "bug", and NVIDIA named this generation of CPU and GPU Grace and Hopper after her. The naming carries a clear message: from the earliest planning, the two were meant to be tightly integrated.

 

NVIDIA Grace Hopper super chip key specifications

Data exchange efficiency (bandwidth, latency) between CPU and GPU is especially important in the era of very large machine learning models. NVIDIA equips Hopper GPU with large-capacity high-speed video memory, fully turns on 6 groups of video memory controllers, with a capacity of 96GB and a bandwidth of 3TB/s.

In comparison, the discrete GPU card H100 is configured with 80GB of video memory, while the dual-card combination of the H100 NVL is 188GB. The Grace CPU is equipped with 480GB of LPDDR5X memory, with a bandwidth of slightly more than 500GB/s. While Grace's memory bandwidth is on par with competitors using DDR5 memory, the interconnect between the CPU and GPU is the deciding factor. A typical x86 CPU can only communicate with the GPU through PCIe, while NVLink-C2C has far more bandwidth than PCIe and has the advantage of cache coherency.

Through NVLink-C2C, the Hopper GPU can access CPU memory far more smoothly than the H100 PCIe or H100 SXM can. High-bandwidth direct addressing also translates into a capacity advantage, enabling the Hopper GPU to address 576GB of local memory.

The CPU has memory capacity the GPU cannot match, and the interconnect (PCIe) between GPU and CPU has been the bottleneck. The bandwidth and energy-efficiency advantages of NVLink-C2C are among the core advantages of the GH200 Grace Hopper Superchip over x86+GPU solutions. NVLink-C2C consumes only 1.3 picojoules per bit transferred, roughly one-fifth of a PCIe 5.0 link by one measure, and NVIDIA cites up to a 25-fold difference in overall energy efficiency. Note that the comparison is not entirely fair, because PCIe is board-to-board communication and the transmission distance of NVLink-C2C is fundamentally different.
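To put the 1.3pJ/bit figure in perspective, a rough estimate (ignoring protocol and controller overhead) of the power needed to keep the full 900GB/s link busy is:

\[
900\ \mathrm{GB/s} \times 8\ \mathrm{bit/B} \times 1.3\ \mathrm{pJ/bit} \approx 9.4\ \mathrm{W}
\]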

 

NVLink was originally designed for high-speed data exchange between GPUs. With the help of NVSwitch, multiple GPUs inside a server can be connected together, forming a pooled memory space whose capacity is the sum of their video memory.

2. GPU interconnection of NVLink

The goal of NVLink is to break through the bandwidth bottleneck of the PCIe interface and improve the efficiency of data exchange between GPUs. The P100, released in 2016, carried the first generation of NVLink, providing 160GB/s of bandwidth, about 5 times that of PCIe 3.0 x16 at the time. The NVLink2 on the V100 raised the bandwidth to 300GB/s, nearly 5 times PCIe 4.0 x16. The A100 carries NVLink3 with 600GB/s of bandwidth.

The H100 carries NVLink4. Compared with NVLink3, NVLink4 not only increases the number of links but also changes the link structure noticeably. In NVLink3, each link uses four 50Gb/s differential pairs per direction, giving 25GB/s per direction and 50GB/s bidirectionally per link; the A100's 12 NVLink3 links add up to 600GB/s. NVLink4 instead uses two 100Gb/s differential pairs per direction, so the bidirectional bandwidth per link is still 50GB/s, but with fewer signal lines.

Eighteen NVLink4 links give the H100 a total bandwidth of 900GB/s. Most of NVIDIA's GPUs expose NVLink interfaces; the PCIe versions can be interconnected through an NVLink Bridge, but only at limited scale. Larger-scale interconnection is organized through NVLink routed on the motherboard/baseboard, which corresponds to NVIDIA's proprietary SXM form factor.
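
A quick way to keep these generations straight is to derive the per-link and aggregate numbers from the signaling described above; the sketch below simply reproduces the article's figures (differential-pair counts and rates as stated).

```python
# Per-link and aggregate NVLink bandwidth from the signaling parameters above.

def nvlink_link_gb_s(pairs_per_direction, gbps_per_pair):
    """Bidirectional bandwidth of one NVLink link in GB/s."""
    one_direction_gbps = pairs_per_direction * gbps_per_pair
    return 2 * one_direction_gbps / 8  # both directions, bits -> bytes

generations = {
    # name: (differential pairs per direction, Gb/s per pair, links per GPU)
    "NVLink3 (A100)": (4, 50, 12),
    "NVLink4 (H100)": (2, 100, 18),
}

for name, (pairs, rate, links) in generations.items():
    per_link = nvlink_link_gb_s(pairs, rate)
    print(f"{name}: {per_link:.0f} GB/s per link x {links} links = {per_link * links:.0f} GB/s")
# NVLink3 (A100): 50 GB/s per link x 12 links = 600 GB/s
# NVLink4 (H100): 50 GB/s per link x 18 links = 900 GB/s
```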

NVIDIA GPUs in the SXM form factor are mainly used in data centers. The module is rectangular, with no gold-finger edge connector visible from the front; it is a mezzanine card that is "socketed" flat onto the motherboard, much like a CPU, usually in groups of 4 or 8 GPUs. A 4-GPU system can connect the GPUs directly to one another without an NVSwitch, while an 8-GPU system needs NVSwitches.

Organizational structure of the NVIDIA HGX A100 4-GPU system. The 12 NVLinks of each A100 are divided into 3 groups, which are directly connected to the other 3 A100s

 

After several generations of development, NVLink has matured and has begun to be applied to the interconnection between GPU servers, further expanding the size of GPU (and its video memory) clusters.

 

Organization of the NVIDIA HGX H100 8-GPU system. The 18 NVLinks of each H100 are divided into 4 groups and interconnected with 4 NVSwitches respectively.

3. NVLink networking super cluster

At COMPUTEX at the end of May 2023, NVIDIA announced a cluster of 256 Grace Hopper superchips with a total of 144TB of GPU-addressable memory. Large language models (LLMs) such as GPT have a pressing need for memory capacity, and a pool this large fits the development trend of large models. How was this unprecedented capacity achieved?

One of the big innovations is the NVLink4 Network, which allows NVLink to scale beyond a single node. The architecture diagrams of the 256-GPU SuperPODs built from DGX A100 and DGX H100 show the difference intuitively. In the DGX A100 SuperPOD, the 8 GPUs of each DGX node are interconnected through NVLink3, while the 32 nodes are interconnected through HDR InfiniBand 200G network cards and Quantum QM8790 switches. In the DGX H100 SuperPOD, the 8 GPUs inside each node are interconnected using NVLink4, and the nodes themselves are interconnected through the NVLink4 Network, with each node connecting to devices called NVLink Switches.

 

DGX A100 and DGX H100 256 SuperPOD architecture

According to the architecture information provided by NVIDIA, the NVLink Network supports OSFP (Octal Small Form-factor Pluggable) optical ports, which matches NVIDIA's claim that cable reach has increased from 5 meters to 20 meters. The NVLink Switch used in the DGX H100 SuperPOD offers 128 ports across 32 OSFP cages, with a total bandwidth of 6.4TB/s.

 

Network architecture inside the DGX H100 SuperPOD node

Each 8-GPU node has 4 NVSwitches inside, and in the DGX H100 SuperPOD each NVSwitch connects outward through 4 or 5 NVLinks. Each NVLink carries 50GB/s, corresponding to one OSFP port at a very mature 400Gb/s optical rate. Each node therefore needs 18 OSFP connections to the NVLink Network, or 576 connections across 32 nodes, corresponding to 18 external NVLink Switches.
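
The port counts above can be cross-checked with a little arithmetic; the sketch below works purely from the figures quoted in this article (18 external NVLinks per node, 32 nodes, and 32 OSFP cages / 128 ports per NVLink Switch).

```python
# Cross-check of the DGX H100 SuperPOD NVLink Network port counts,
# using only the figures quoted in the article.

NODES = 32
EXTERNAL_NVLINKS_PER_NODE = 18        # 4 NVSwitches x 4-5 external NVLinks each
OSFP_CAGES_PER_NVLINK_SWITCH = 32
PORTS_PER_NVLINK_SWITCH = 128
GB_S_PER_NVLINK = 50                  # bidirectional, per the text

total_osfp_links = NODES * EXTERNAL_NVLINKS_PER_NODE
switches_needed = total_osfp_links // OSFP_CAGES_PER_NVLINK_SWITCH
switch_bandwidth_tb_s = PORTS_PER_NVLINK_SWITCH * GB_S_PER_NVLINK / 1000

print(f"Total OSFP connections: {total_osfp_links}")             # 576
print(f"NVLink Switches needed: {switches_needed}")              # 18
print(f"Per-switch bandwidth:   {switch_bandwidth_tb_s} TB/s")   # 6.4
```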

DGX H100 nodes can also be interconnected through InfiniBand alone. Referring to the DGX H100 BasePOD configuration, the DGX H100 system is configured with 8 H100 GPUs, dual 56-core fourth-generation Intel Xeon Scalable processors, 2TB of DDR5 memory, and ConnectX-7 networking: 3 dual-port cards for management and storage services, and 4 OSFP ports for the compute network.

Going back to the Grace Hopper superchip, NVIDIA provides a simplified schematic in which the 18 NVLink4 links of the Hopper GPU connect to an NVLink Switch, and the NVLink Switch in turn connects "two sets" of Grace Hopper superchips. Any GPU can access the memory of the other CPUs and GPUs in the network through NVLink-C2C and the NVLink Switch.

The NVLink4 Network scales to 256 GPUs; note that the unit is GPUs, not superchips, because the NVLink4 links are provided by the H100 GPU. For the Grace Hopper superchip, the memory limit of such a cluster is (480GB of memory + 96GB of video memory) × 256 nodes = 147,456GB, i.e. the 144TB scale. If NVIDIA were to launch the Grace + 2×Hopper combination mentioned at GTC 2022, then given the NVLink Switch's reach of 256 GPUs there would be 128 Grace CPUs and 256 Hopper GPUs, and the memory capacity of the whole cluster would drop to the order of about 80TB.
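
The capacity figures follow directly from the per-superchip memory sizes; the sketch below reproduces the article's arithmetic for both the shipping 1-CPU/1-GPU superchip and the hypothetical Grace + 2×Hopper variant mentioned at GTC 2022.

```python
# Cluster-wide addressable memory for NVLink-Network-scale Grace Hopper systems.
GRACE_LPDDR5X_GB = 480
HOPPER_HBM_GB = 96
MAX_GPUS = 256  # the NVLink4 Network limit counts GPUs, not superchips

# Shipping GH200: 1 Grace + 1 Hopper per superchip, 256 superchips
gh200_total_gb = (GRACE_LPDDR5X_GB + HOPPER_HBM_GB) * MAX_GPUS
print(f"GH200 cluster: {gh200_total_gb} GB ~= {gh200_total_gb / 1024:.0f} TB")
# 147456 GB ~= 144 TB

# Hypothetical Grace + 2x Hopper: 256 GPUs implies only 128 Grace CPUs
grace_2hopper_gb = GRACE_LPDDR5X_GB * 128 + HOPPER_HBM_GB * 256
print(f"Grace + 2xHopper cluster: {grace_2hopper_gb} GB ~= {grace_2hopper_gb / 1024:.0f} TB")
# ~84 TB, i.e. on the order of 80 TB as stated above
```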

 

Interconnect between Grace Hopper superchips

During COMPUTEX 2023, NVIDIA announced that the Grace Hopper superchip had entered volume production and unveiled the DGX GH200 supercomputer based on it. The NVIDIA DGX GH200 uses 256 Grace Hopper superchips interconnected over NVLink, and the whole cluster presents up to 144TB of shared "video memory" to meet the needs of very large models. A few numbers give a sense of the scale of this exascale system:

  • Computing power: 1 exaFLOPS (FP8); a rough cross-check appears after this list

  • Total fiber length: 150 miles

  • Number of fans: 2112 (60mm)

  • Air volume: 70,000 cubic feet per minute (CFM)

  • Weight: 40,000 pounds

  • Video memory: 144TB (NVLink-connected)

  • Bandwidth: 230TB/s
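
As a rough consistency check, the 1 exaFLOPS figure lines up with 256 Hopper GPUs each delivering roughly 4 PFLOPS of FP8 compute. The per-GPU number used here is the commonly quoted H100 FP8 peak with sparsity, which is an assumption of this sketch rather than a figure from the article.

```python
# Rough check: 256 Hopper GPUs vs. the quoted 1 exaFLOPS (FP8) figure.
GPUS = 256
FP8_PFLOPS_PER_GPU = 3.96  # assumed per-GPU FP8 peak (with sparsity); not stated in the article

total_pflops = GPUS * FP8_PFLOPS_PER_GPU
print(f"Estimated total: {total_pflops:.0f} PFLOPS ~= {total_pflops / 1000:.2f} exaFLOPS")
# ~1014 PFLOPS ~= 1.01 exaFLOPS, consistent with the headline 1 exaFLOPS (FP8)
```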

The 150 miles of optical fiber alone convey the complexity of the network. The overall network resources of the cluster are as follows:

 

Since each Grace Hopper superchip carries only one CPU and one GPU, the number of GPUs per node is much smaller than in a DGX H100. Reaching 256 GPUs therefore requires far more nodes, which makes the NVLink Network architecture more complex.

 

NVLink network architecture within NVIDIA DGX GH200 cluster

Each DGX GH200 node has 3 sets of external NVLink connections, and each L1 NVLink Switch connects 8 nodes. The 256 nodes are divided into 32 groups; each group consists of 8 nodes and 3 L1 NVLink Switches, so 96 L1 switches are required in total. The 32 groups are then tied together through 36 L2 NVLink Switches.
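
The switch counts follow from the grouping described above; the sketch below just restates that arithmetic using the article's node, group, and switch figures.

```python
# DGX GH200 NVLink Network switch-count arithmetic, per the figures in the text.
NODES = 256
NODES_PER_GROUP = 8
L1_SWITCHES_PER_GROUP = 3
L2_SWITCHES = 36

groups = NODES // NODES_PER_GROUP             # 32 groups
l1_switches = groups * L1_SWITCHES_PER_GROUP  # 96 L1 NVLink Switches

print(f"Groups: {groups}, L1 switches: {l1_switches}, L2 switches: {L2_SWITCHES}")
print(f"Total NVLink Switches: {l1_switches + L2_SWITCHES}")  # 132
```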

Compared with the DGX H100 SuperPOD, the DGX GH200 has far more nodes and a correspondingly more complex NVLink Network. Here is a comparison of the two:

 

4. InfiniBand expands the scale

If a cluster larger than 256 GPUs is required, InfiniBand switches need to be introduced. For large Grace Hopper superchip clusters, NVIDIA recommends networking with Quantum-2 switches, which provide NDR 400Gb/s ports. Each node is configured with a BlueField-3 DPU (with ConnectX-7 integrated); each DPU provides two 400Gb/s ports, for a total of 100GB/s per node. A similar level of bandwidth could in principle be reached with Ethernet, but the preference for InfiniBand is understandable given NVIDIA's acquisition of Mellanox.

 

NVIDIA BlueField-3 DPU

There are two architectures for Grace Hopper superchip clusters built on InfiniBand NDR400. One uses InfiniBand connections exclusively; the other mixes NVLink Switches with InfiniBand. What the two have in common is that each node connects to an InfiniBand switch through dual ports (800Gb/s in total), with the DPU occupying x32 of PCIe 5.0 provided by the Grace CPU. The difference is that in the latter architecture each node's GPU also connects to an NVLink Switch, forming several NVLink sub-clusters.
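
A quick unit conversion shows why the x32 PCIe 5.0 allocation is a comfortable fit for the dual-port NIC. The PCIe raw-rate figure below is standard PCIe arithmetic rather than something stated in the article.

```python
# Dual NDR 400Gb/s ports vs. the x32 PCIe 5.0 allocation for the BlueField-3 DPU.
ndr_ports = 2
ndr_gbps = 400
nic_gb_s = ndr_ports * ndr_gbps / 8       # 100 GB/s of network traffic per node

pcie5_lanes = 32
pcie5_gt_s = 32                           # PCIe 5.0 raw rate per lane (standard spec, assumed here)
pcie_gb_s = pcie5_lanes * pcie5_gt_s / 8  # ~128 GB/s raw, per direction

print(f"NIC traffic:      {nic_gb_s:.0f} GB/s")               # 100
print(f"PCIe 5.0 x32 raw: {pcie_gb_s:.0f} GB/s per direction") # 128
```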

Obviously, the mixed InfiniBand + NVLink Switch configuration performs better, thanks to the much greater bandwidth between GPUs within an NVLink sub-cluster and support for atomic operations on remote memory. For example, NVIDIA plans to build the Helios supercomputer out of four DGX GH200 systems connected over a Quantum-2 InfiniBand 400Gb/s network.

 

5. Looking at NVLink from the perspective of the H100 NVL

At GTC 2023, NVIDIA released the NVIDIA H100 NVL, positioned specifically for large language model deployment. Compared with the other two members of the H100 family (SXM and PCIe), it has two distinguishing features: first, the H100 NVL is effectively two H100 PCIe cards connected by three NVLink bridges; second, each card exposes nearly the full 94GB of video memory, more even than the H100 SXM5.

According to NVIDIA's documentation, the H100 PCIe's dual-slot NVLink bridge carries over from the previous-generation A100 PCIe, so the NVLink interconnect bandwidth of the H100 NVL is 600GB/s, still more than 4 times what PCIe 5.0 interconnection offers (128GB/s). The H100 NVL consists of two H100 PCIe cards and targets inference workloads: the high-speed NVLink connection yields a combined memory capacity of up to 188GB to meet the (inference) needs of large language models. Viewing the NVLink interconnect of the H100 NVL as a scaled-down NVLink-C2C helps in understanding how NVLink accelerates memory access between compute units.
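
The "more than 4 times" claim is easy to verify from the two quoted bandwidth numbers; the short sketch below also sums the per-card memory (94GB per card is the figure given above).

```python
# H100 NVL: NVLink bridge vs. PCIe 5.0, and combined memory capacity.
nvlink_bridge_gb_s = 600   # three-bridge NVLink interconnect, per the text
pcie5_x16_gb_s = 128       # PCIe 5.0 x16 bidirectional, per the text
memory_per_card_gb = 94

print(f"NVLink vs PCIe 5.0: {nvlink_bridge_gb_s / pcie5_x16_gb_s:.1f}x")  # ~4.7x
print(f"Combined memory:    {2 * memory_per_card_gb} GB")                  # 188 GB
```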

Blue Ocean Brain's high-performance large-model training platform uses a working fluid as the intermediate heat-transfer medium, carrying heat away from the hot zone to be dissipated remotely (liquid cooling). It supports a variety of hardware accelerators, including CPUs, GPUs, FPGAs and AI chips, to meet the needs of large-scale data processing and complex computing tasks. Its distributed computing architecture handles large-scale data and complex workloads efficiently, providing strong computing support for deep learning, high-performance computing, large-scale model training, and the development of large language model (LLM) algorithms. The platform is highly flexible and scalable, can be customized for different application scenarios and requirements, and allows computing tasks to be deployed and managed quickly, improving the utilization and efficiency of computing resources.

 

Summary

In the field of computing, the CPU and GPU are two key components with different characteristics and complexities in processing data and executing tasks. As computing demands grow, a single CPU or GPU can no longer satisfy high-performance computing on its own, so the combination of multiple kinds of computing power, together with computing-memory interconnection and computing-power interconnection, has become increasingly important.

As the core of a computer system, the CPU is highly flexible and versatile and suits a wide range of computing tasks. Through a complex instruction set and optimized single-thread performance, it executes varied instructions and handles intricate control logic. However, a single CPU offers limited parallelism and cannot by itself meet the demands of high-performance computing.

GPUs were originally designed for graphics rendering and image processing, but over time their compute capability has grown enormously, making them a core component of high-performance computing. A GPU provides massive arrays of parallel processing units and high-bandwidth memory and can execute a huge number of computations simultaneously. Its complexity lies mainly in the parallel architecture and specialized instruction sets, which make programming and optimizing GPU applications more challenging.

To make full use of the strengths of both, combining CPU and GPU computing power becomes crucial. With the CPU handling serial tasks and control flow while the GPU focuses on massively parallel computation, work can be divided and processed in parallel, improving overall performance and efficiency and meeting the needs of different application scenarios.

Computing-memory interconnection refers to the high-speed links between compute units and memory or storage, while computing-power interconnection refers to the high-speed links between the compute units themselves. In high-performance computing, the speed at which data can be moved and accessed is critical to overall performance. Optimizing both kinds of interconnect reduces transfer latency and bottlenecks, raises computing efficiency and throughput, and ensures fast data movement and collaborative computation across the system.

CPU and GPU each play an important role in computing, but neither alone can satisfy today's high-performance computing needs. Combining multiple kinds of computing power and optimizing the computing-memory and computing-power interconnects is the key to higher computing performance and efficiency, enabling a higher level of application performance and driving the continued development and innovation of computing technology.


Source: https://blog.csdn.net/LANHYGPU/article/details/131573288