Nvidia's next-generation H100 GPU is here: can domestic chips catch up?


Written by Lü Jianping


Two years after the A100 GPU debuted at GTC (GPU Technology Conference) 2020, NVIDIA announced, right on schedule at this year's GTC (2022), the H100, which the media have described as a "nuclear bomb" GPU.


After GTC 2020, many Chinese GPU startups scrambled to claim they could surpass the A100; the DPU NVIDIA released at GTC 2021 likewise sent many chip veterans rushing to found startups and pour into DPU development, to the point of becoming a trend. What kind of response will this year's H100 provoke? Let's wait and see.


The media have already covered the H100 at length, with detailed reports on its technology, for example "In-depth interpretation of the 80-billion-transistor nuclear bomb GPU architecture: is it "assembled goods"?" from the chip media outlet Xindongxi, and "In-depth Interpretation of NVIDIA's "HOPPER" GPU Architecture" from Semiconductor Industry Observation; none of that will be repeated here. The purpose of this article is to discuss what the technical ideas and market strategies behind the H100 mean for China. Are they worth the domestic industry following? And by GTC 2024, when the H100's successor arrives, will we have fallen even further behind?


1

Interpreting NVIDIA


Nvidia is committed to claiming that even in the post-Moore era its new products can still outpace traditional Moore's Law, offering the industry more than double the performance of the products of two years earlier while diluting the cost of the added power consumption. Between generations (take Turing, Ampere, and Hopper as examples), it relies on essentially four levers to hit that goal: 1) process improvement brings a frequency increase of less than 1.5x; 2) setting aside the power cost, the number of computing units is doubled; 3) Domain-Specific Architecture (DSA) design doubles performance, e.g. Turing's Tensor Core, Ampere's hardware sparsity, Hopper's Transformer Engine; 4) new, lower data-precision types replace high-precision units for another doubling, e.g. Turing's INT4, Ampere's TF32, Hopper's FP8.


Combining these factors, there is in theory a 1.5 × 2 × 2 × 2 = 12x gap between Nvidia's generational products, i.e. an order of magnitude of headroom. In reality, constrained by the post-Moore's-Law era and the power wall, the improvement always falls far short of an order of magnitude. Following this logic, Nvidia claims the H100 delivers six times the performance of the A100, as in the following graph:


The performance improvement of the H100 over the previous generation A100


However, note that this six-fold improvement is a peak of peaks, a special case of special cases, not average performance in general: it is the peak reached when running a Transformer-type network, and only when the data can be fully represented in FP8. What's more, in the post-Moore's-Law era any talk of performance must also account for power; performance per watt is the more important indicator.


We should further strip out the performance-per-watt gain contributed by the process, in order to know exactly how much the H100's architectural innovation improves performance per watt over the A100. Also, in practice many NVIDIA product innovations take time for the industry to digest, and some new functions take time for the market to accept, so the ultimate across-the-board effect may never arrive and the real impact will not be that large. Concretely, most customers start an upgrade with a straight port, directly moving the previous generation's code onto the new one. As the table below shows, under such normal conditions the H100 raises peak compute by 3.2x over the A100.

 

H100 vs. A100


Assuming TSMC's move from N7 to N4 improves performance per watt by 26%, then under normal conditions the H100 improves performance per watt over the A100 by 3.2 × 350/700 − 1 = 60%. Filtering out the process tailwind, pure architectural innovation contributes only (3.2 × 350)/(1.26 × 700) − 1 ≈ 30% of the peak compute-efficiency improvement.
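For concreteness, the arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the 3.2x uplift, the 350 W and 700 W power figures, and the 26% process gain are the assumptions stated in this article, not measured values.

```python
# Back-of-the-envelope performance-per-watt comparison, using the
# figures assumed in this article (not measured values).
peak_uplift = 3.2      # H100 peak compute relative to A100, "normal conditions"
power_a100 = 350.0     # W, power figure used in the text
power_h100 = 700.0     # W
process_gain = 1.26    # assumed perf/W gain from TSMC N7 -> N4

perf_per_watt = peak_uplift * power_a100 / power_h100   # 1.6x overall
arch_only = perf_per_watt / process_gain                # ~1.27x without process

print(f"overall perf/W improvement:     {perf_per_watt - 1:.0%}")  # 60%
print(f"architecture-only contribution: {arch_only - 1:.0%}")      # ~27%, i.e. ~30%
```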


So how much will customers ultimately pay for that 60% performance-per-watt improvement? We still do not know whether the price/performance of the H100 line has actually improved; what we do know is that this GTC disclosed no price for the H100-based DGX, nor did it repeat the famous line, "the more you buy, the more you save".


2

Technical route analysis


Next we analyze several important technical routes of the H100.


Did GPUs take the DSA route only starting with the H100?


At the end of the Xindongxi report, it is suggested that the H100 marks the start of NVIDIA's GPUs developing in the DSA (Domain-Specific Architecture) direction. In fact, GPUs embraced DSA long before the H100, and that is precisely one key to NVIDIA's ability to calmly face down DSA challengers. What most people, including the masters of DSA, John Hennessy and David Patterson, have not appreciated is that GPU architects have always aimed to fuse DSA into a general-purpose architecture: heterogeneous at the core level, not at the top level of the chip. The following figure illustrates this.


Traditional DSA architecture (left) and GPU-fused DSA architecture (right)


The left side of the figure is the DSA architecture most people have in mind; Google's TPU AI accelerator looks roughly like this. The right side is a schematic of the GPU-fused DSA architecture, from the early texture unit (Texture Unit) and special function unit (Special Function Unit) to the recent tensor core (Tensor Core) and ray tracing core (RT Core). These examples have two things in common:


1. The hardware resources of a DSA design are distributed evenly across the computing units, invoked through special instructions or library calls, and become part of each unit's general-purpose computing core. The DSA never becomes an independent processor at the top level of the chip; it is a natural extension of the programming ecosystem and does not disturb the existing programming model.


2. It targets applications already mature in the market. For example, texture computation serves most graphics applications, and tensor computation serves almost all AI algorithms. The hardware invested can be sized to how often those applications occur, so it does not sit excessively idle.


We can call the GPU's way of integrating DSA designs "DSA generalization": it keeps reinforcing the general-purpose advantage while raising performance. This explains why so-called purpose-built AI chips, the TPU included, cannot crush the GPU and concede to it completely on versatility.


This time NVIDIA added to the H100 a Transformer Engine optimized for Transformer-type networks, along with the corresponding FP8 data format, plus the DPX instruction set optimized for dynamic programming; this can be said to continue the tradition of DSA generalization. As for the Transformer Engine: Transformer-type networks are now recognized as general across application domains, having broken out of natural language processing, and the Transformer Engine sits inside the Tensor Core to perform statistical analysis of layer data, so in the future it may well generalize to other types of networks.
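The idea of statistically analyzing layer data to choose a scale can be illustrated with a toy Python sketch. This is purely hypothetical: the real Transformer Engine heuristics are NVIDIA's own and run inside the Tensor Cores; 448 is the largest normal value of the FP8 E4M3 format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def fake_fp8_quantize(x: np.ndarray):
    """Toy per-tensor FP8 scaling: analyze the data, then scale it into range.

    A real Transformer Engine keeps a per-layer amax history and picks
    scales (and E4M3 vs. E5M2) dynamically; this shows only the core idea.
    """
    amax = float(np.abs(x).max())
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Crudely simulate E4M3's 3-bit mantissa by rounding the significand.
    m, e = np.frexp(x_scaled)
    x_fp8 = np.ldexp(np.round(m * 16) / 16, e)
    return x_fp8, scale  # recover approximate values later with x_fp8 / scale

acts = np.random.randn(1024).astype(np.float32)
q, s = fake_fp8_quantize(acts)
print("max abs error after round trip:", float(np.abs(acts - q / s).max()))
```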


For DPX, NVIDIA likewise cited applications such as gene sequencing and robot path planning. Unlike the earlier DSA accelerators aimed at mature markets, the application scope of the Transformer Engine and DPX remains narrow in the short term and has not been widely accepted by the market; this time NVIDIA is running ahead of it. Whether this is the future direction of GPU DSA generalization is unknown. At Tianshu Zhixin, we are willing to work closely with domestic customers to blaze a trail of DSA generalization suited to the domestic market, with an international technical vision.
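To make the dynamic-programming connection concrete, consider the inner recurrence of a classic DP algorithm. Edit distance is used below purely as a stand-in example; NVIDIA's own examples are genome alignment and path planning, and the exact instruction behavior is theirs to specify.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP recurrence; the hot loop is a chain of add-then-min
    operations, exactly the fused add/min (or add/max) pattern that
    DPX-style instructions are designed to accelerate per table cell."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[m]

print(edit_distance("GATTACA", "GCATGCU"))  # same shape as alignment DP
```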


H100 improves general computing efficiency with "Asynchronous Execution"


The H100 extends the asynchronous-execution route begun with the A100 to improve general-purpose computing efficiency, adding a Tensor Memory Accelerator (TMA) to handle moving large tensors between off-chip memory and in-core shared memory (SMEM), or between SMEMs. SMEM is attached to an SM (Streaming Multiprocessor, NVIDIA's computing unit); to support data movement between SMEMs and fuse them into one logical SMEM, there is now an interconnect network between SMs.


Because AI algorithms are diverse and evolve rapidly, generality and ASIC-level efficiency normally cannot both be had. In my view, the ultimate goal of the asynchronous-execution direction is to fill the performance gap between general-purpose and special-purpose so that we can have both, bringing the GPU's general-purpose computing efficiency ever closer to the dedicated pipeline typical of an ASIC (Application-Specific IC).


The word ASIC has by now been almost entirely eclipsed by DSA; I use ASIC rather than DSA here because the latter is not necessarily pipeline-centric. What makes a pipeline special is that the producer (Producer) and consumer (Consumer) keep working while data flows from one to the other.


I would push this technical direction further as the "graphics-ization of computing", because the graphics pipeline, shown on the left of the figure below, is the archetype of the dedicated pipeline. Although its intermediate nodes have long been replaced by shader programs running in a general-purpose computing pool, the pipeline structure itself survives. Asynchronous execution approaches the efficiency of a dedicated pipeline by not wasting time waiting on data transfers. Facing the post-Moore's-Law era, general-purpose computing borrowing the spirit of the ASIC-style dedicated pipeline is a route that must be taken.


graphics pipeline
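As a conceptual illustration only, the double-buffering pattern below mimics, with host-side Python threads, how asynchronous copies (the job TMA does in hardware) hide transfer latency behind compute. Every name here is a stand-in, not a GPU API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_tile(i):              # stand-in for a bulk TMA-style copy
    time.sleep(0.01)           # pretend this is transfer latency
    return [i] * 1024

def compute(tile):             # stand-in for the math running on an SM
    time.sleep(0.01)
    return sum(tile)

# Double buffering: start the next copy before computing on the current
# tile, so the transfer proceeds while the "producer" and "consumer"
# both stay busy -- the spirit of the dedicated pipeline.
with ThreadPoolExecutor(max_workers=1) as copy_engine:
    pending = copy_engine.submit(load_tile, 0)
    total = 0
    for i in range(1, 8):
        tile = pending.result()                     # wait only if the copy is late
        pending = copy_engine.submit(load_tile, i)  # next copy, asynchronously
        total += compute(tile)                      # overlaps the in-flight copy
    total += compute(pending.result())
print(total)
```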

 
Traditional graphics cannot fully utilize the H100's AI computing power


Of the 66 TPCs in the H100 SXM version, and the 57 in the PCIe version, only two are graphics-capable. The design choice is presumably that, although the graphics-specific hardware in a single TPC does not take much area, multiplied by roughly 30 it becomes unbearable when area and power are already over budget.


Because graphics is so disproportionately small relative to general-purpose computing on the H100, we may as well call the H100 a general-purpose GPU. Conceivably, a general-purpose GPU like the H100 still needs matching graphics capability; the premise is that graphics must make full use of the AI computing power, and that the graphics-specific hardware be simplified without cutting function or performance.


Simplify graphics with mesh shaders


With multiple nodes of the graphics pipeline already replaced by shader programs running on the general-purpose computing pool, why not eliminate a few shader nodes as well? As shown in the figure above, in the advanced graphics standards the new mesh shader, which leans more heavily on compute, can replace every stage from the vertex shader to the geometry shader, so the number of pipeline nodes shrinks dramatically with no loss of function, and some of the specialized hardware linking those nodes can be removed. Performance may even improve, thanks to the flexibility mesh shaders provide. This is the first step in simplifying graphics.


In NVIDIA's metaverse/digital-twin blueprint, the H100 general-purpose GPU line and the RTX graphics GPUs play their separate roles. However, the graphics GPU needs general-purpose compute to support the physics simulation a digital twin requires, and needs AI for super-resolution and ray-tracing denoising. Conversely, the general-purpose GPU needs rendering in order to participate broadly in AI-based content generation and 3D modeling.


My view is that general-purpose and graphics GPUs should converge. The H100 did not do this, because as a general-purpose GPU it is already deeply optimized for AI. A more precise statement is that tensor computing has grown from a co-processing role into the computing-power center of the general-purpose GPU, because AI is dominated by tensor computing.


Traditional graphics-rendering shader algorithms, however, are not tensor-based. This means the only way for a tensor-centric general-purpose GPU to achieve matching graphics is to unify graphics and AI, so that graphics-rendering shaders themselves adopt AI-based algorithms. I call this trend "Graphics Computationalization".


This is difficult for NVIDIA, the market leader in graphics cards, because the choice and writing of shader algorithms lies with graphics application developers. For Tianshu Zhixin, the goal of our graphics work is to support cloud rendering for the metaverse/digital twin, and to have the chance to build, together with customers, an ecosystem belonging to China in which graphics application development is AI-based and optimized for tensor computing, so that the general-purpose GPU can show its talents in the graphics field as well.


3

A grand alliance of deep cooperation and healthy competition


The question everyone cares about most is: how do we approach, or even surpass, Nvidia? As analyzed above, under normal conditions, once process and power factors are filtered out, the H100's architectural innovation contributes only about 30% performance improvement over the A100. If we want to surpass NVIDIA by 2024 we must take a different path. On the technical route, we should:


1. Cooperate with domestic customers on DSA generalization suited to the domestic market, to extend the general-purpose advantage

2. Use the graphics-ization of computing to raise general-purpose computing performance to rival the graphics pipeline

3. Cooperate with the domestic ecosystem to bridge directly to the advanced graphics standards through Graphics Computationalization, letting general-purpose GPUs that specialize in tensor computing show their talents in the graphics field


Nor can we ignore the importance, on the GPU track, of technology that is fully self-developed and broadly applicable. Only by insisting on independent innovation, designing and developing everything from the underlying hardware to the upper-level software ourselves, and refusing the shortcut of buying foreign GPU IP, can we secure complete, independent intellectual property and break the long-standing domestic pattern of acting as an agent for foreign IP. Only a fully self-developed architecture, computing core, instruction set, and basic software stack can respond immediately to fast-changing market demands and sustain independent development unconstrained by foreign IP. Only then, too, can the open testing customers require at different technical levels fundamentally guarantee the security of their usage and their information.


As The Information quoted me in the report "China's 'Little Nvidia' Has a Big Secret: Its Homegrown AI Chip Isn't": writing code line by line to implement the GPU's core functions is the only road to autonomy. And after developing fully self-developed, broadly applicable GPU chips, we must still measure up to NVIDIA in testing, customer adaptation, stable supply, successful volume production, and large-scale deployment; a mere tape-out and bring-up only marks the starting line of the track.


Finally, we should also ask what benchmarking against Nvidia fundamentally means. What we are looking at is a company that, in computing power, sweeps from chips, boards, and servers through small clusters and large clusters up to data centers and even computing-power centers; that, in networking, covers interconnect from chip to chassis to cluster; and that, in applications, leverages the strength of its chips across medicine, the Internet, factories, autonomous driving, and biomedicine.


What China may need is a grand alliance built on deep cooperation and healthy competition, one that can match Nvidia in both technical depth and breadth.


Lü Jianping is Chief Technology Officer (CTO) of Tianshu Zhixin. He holds a Ph.D. in Computer Science from Yale University, has held senior positions at multinational semiconductor giants including Nvidia, Intel, and Samsung, and is a well-known expert in the GPU field.


(This article is published with permission. Original: https://mp.weixin.qq.com/s/rtO8PxRj08GVimT3bfbplA)

