How GPUs can become AI accelerators

Table of contents

0. Preface

1. Starting from the birth of graphics card

2. GPU makes its debut

3. Rendering - making computer images more realistic

4. From GPU to GPGPU

5. CUDA – laying the foundation for NVIDIA's dominance

6. The future is not just about GPUs


0. Preface

As is customary, let me state up front: this article only reflects my own understanding of what I have learned. Although I draw on the valuable insights of others, it may contain inaccuracies. If you find errors, please point them out so we can improve together.

Keywords in this article: GPU, deep learning, GPGPU, rendering, Brook language, stream computing, hardware T&L, CUDA, PyTorch, TOPS, TPU, NPU

The development of deep learning theory has been a gradual process. From the proposal of artificial neural networks in the 1940s, to the introduction and rise of backpropagation in the 1970s and 1980s, to the rise of deep learning after 2006, the field has passed through many stages. Early deep learning was limited by hardware performance: it could not train on large-scale data, nor could networks be made very deep. In recent years, with the continuous improvement of hardware, especially the development of graphics processing units (GPUs), deep learning theory and the broader AI field have begun to develop rapidly and see wide application.

When you first encountered deep learning, did you have questions like these:

  • Isn’t the graphics card used to process computer images? How is it related to deep learning?
  • Why does the GPU handle the deep learning training process faster than the CPU?
  • Why can’t I use the graphics card to accelerate deep learning training even though my computer has a graphics card?

These questions prompted me to write this article. After reviewing a large amount of material, however, I found that explaining them clearly touches on specialized theory from many fields, and I am not an expert in all of them, so I can only take you (and myself) on a popular-science tour to get a glimpse of this deep and vast territory.

1. Starting from the birth of graphics card

In 1981, IBM's 5150 personal computer shipped with the world's first "standalone graphics card": the CGA (Color Graphics Adapter). CGA had two commonly used graphics modes: 320×200 with 4 colors and 640×200 with 2 colors. The well-known Pac-Man game running on CGA looked like this:

In the years that followed, IBM successively launched the EGA (Enhanced Graphics Adapter) and the VGA (Video Graphics Array), which supported higher resolutions and more colors.

However, whether CGA, EGA, or VGA, these "graphics cards" had no computing capability of their own. They only translated the images computed by the CPU into signals the display device could understand. The "graphics card" operated entirely under the CPU's instructions and simply returned results to it; it was purely the CPU's laborer. Strictly speaking, then, the "graphics cards" of this era were not graphics cards at all; "graphics adapter" is the more accurate name.

2. GPU makes its debut

As the demands on image display grew, especially as 3D graphics became popular, the CPU alone could no longer keep up with increasingly complex image processing. A chip with real computing power was needed to handle image processing on its own: the graphics card. The graphics card market of the 1990s was chaotic: dozens of companies each pushed their own standards, and compatibility between them was poor. The typical representative of this period was 3dfx's Voodoo graphics card.

How important compatible display standards and hardware drivers are! I still remember downloading (pirated) games as a kid, waiting eagerly for the download and installation to finish, only to find I couldn't run them!

The truly epoch-making product was the GeForce 256, launched by NVIDIA in September 1999 and billed as the world's first GPU. It not only carried a hardware T&L (Transform & Lighting) engine, whose main job is to handle the rotation of geometry and three-dimensional effects such as light-source shadows, but also supported Microsoft's Direct3D. From then on, NVIDIA began its leadership of the GPU field.

The name NVIDIA comes from Invidia in Roman mythology. In Latin, invidia means envy and the envious gaze, associated with malice and the "evil eye", which is why NVIDIA's logo is also an eye.

However, since the name "Invidia" had already been registered, the company dropped the initial "I" and registered the name NVIDIA instead, while keeping the sound of the "I" in its Chinese name, 英伟达.

With the vigorous development of 3D graphics, especially gaming, manufacturers were driven to iterate GPU performance rapidly. The GPU gradually rose from being the CPU's laborer to standing on an equal footing with it, and once it had real computing power of its own, it took on more and more tasks.

3. Rendering - making computer images more realistic

First, let’s take a look at the image quality of the “Forza Horizon 5” game released in 2021:

Yes, the picture above is not a photograph but an in-game screenshot. The evolution from Pac-Man to Forza Horizon 5 is thanks to rendering. Making computer-generated images look indistinguishable from reality cannot be done by the CPU alone (the CPU's hardware design is poorly suited to rendering workloads, as discussed later), so this task, which initially fell entirely on the CPU, was gradually transferred entirely to the GPU.

Rendering is a complex engineering discipline, and this article cannot introduce every rendering algorithm one by one (nor is that its focus). It is enough to understand that rendering is essentially the process of assigning a color to each pixel, computed from the shape, material, lighting, and other properties of the objects in the scene. Rendering mainly involves the following aspects:

  • Geometry processing: converting the geometric information of a 3D model into a form the computer can process, including descriptions of basic primitives such as points, lines, and surfaces.
  • Ray tracing: a technique that computes the paths light takes through a scene. It can simulate reflection, refraction, and scattering to produce realistic images.
  • Texture mapping: a technique that applies texture images to a model's surface to simulate surface detail such as wood grain or stone grain.
  • Perspective projection: a technique that converts a three-dimensional scene into a two-dimensional image, mimicking human visual perception (a minimal code sketch of this step follows the list).
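
To make the last item a bit more concrete, here is a minimal NumPy sketch of a pinhole-style perspective projection. The focal length f and the sample points are made-up values for illustration, not part of any particular renderer.

import numpy as np

def project_perspective(points, f=1.0):
    """Project 3D points of shape (N, 3) onto a 2D image plane at distance f.

    A minimal pinhole-camera model: x' = f * x / z, y' = f * y / z.
    Assumes the camera looks down the +z axis and all z > 0.
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z, f * y / z], axis=1)

# Two corners of a cube: the farther point (larger z) lands closer to the image center.
corners = [[1.0, 1.0, 2.0],
           [1.0, 1.0, 4.0]]
print(project_perspective(corners))
# [[0.5  0.5 ]
#  [0.25 0.25]]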

Below we take only the most basic operation in geometry processing, moving a point in the three-dimensional coordinate system, as an example. Assume the point (x, y, z) is rotated by angles \alpha, \beta, \gamma about the three coordinate axes and then translated by the displacement (b_1, b_2, b_3); the resulting point (x', y', z') is given by:

\begin{bmatrix} x'\\ y'\\ z' \end{bmatrix} =M(\alpha, \beta, \gamma)\begin{bmatrix} x\\ y\\ z \end{bmatrix} +\begin{bmatrix} b_1\\ b_2\\ b_3 \end{bmatrix}

The rotation matrix M(\alpha, \beta, \gamma) is:

M(\alpha, \beta, \gamma) = \begin{bmatrix} \cos\alpha\cos\gamma - \cos\beta\sin\alpha\sin\gamma & -\cos\beta\cos\gamma\sin\alpha - \cos\alpha\sin\gamma & \sin\alpha\sin\beta \\ \cos\gamma\sin\alpha + \cos\alpha\cos\beta\sin\gamma & \cos\alpha\cos\beta\cos\gamma - \sin\alpha\sin\gamma & -\cos\alpha\sin\beta \\ \sin\beta\sin\gamma & \cos\gamma\sin\beta & \cos\beta \end{bmatrix}

Now that the formula is written out, if you are familiar with the mathematical model of a neural network, have you noticed something? Abbreviate the rotation matrix as:

M = \begin{bmatrix} w_{11} & w_{12}&w_{13} \\ w_{21}& w_{22} &w_{23} \\ w_{31}&w_{32} &w_{33} \end{bmatrix}

Isn't this matrix computation exactly the form of forward propagation in a neural network model?
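
To make the analogy concrete, here is a minimal PyTorch sketch; the angles, translation, and point are arbitrary example values. It builds M from the formula above and checks that "rotate then translate" is exactly the computation of a linear layer y = Wx + b with W = M.

import math
import torch

alpha, beta, gamma = 0.3, 0.5, 0.7          # arbitrary example rotation angles (radians)
b = torch.tensor([1.0, 2.0, 3.0])           # arbitrary example translation

ca, sa = math.cos(alpha), math.sin(alpha)
cb, sb = math.cos(beta),  math.sin(beta)
cg, sg = math.cos(gamma), math.sin(gamma)

# The rotation matrix M(alpha, beta, gamma) from the formula above.
M = torch.tensor([
    [ca*cg - cb*sa*sg, -cb*cg*sa - ca*sg,  sa*sb],
    [cg*sa + ca*cb*sg,  ca*cb*cg - sa*sg, -ca*sb],
    [sb*sg,             cg*sb,             cb   ],
])

p = torch.tensor([0.5, -1.0, 2.0])           # the point (x, y, z)

# Geometric view: rotate, then translate.
p_moved = M @ p + b

# Neural-network view: a linear layer y = Wx + b with W = M.
linear = torch.nn.Linear(3, 3, bias=True)
with torch.no_grad():
    linear.weight.copy_(M)
    linear.bias.copy_(b)

print(torch.allclose(p_moved, linear(p)))    # True: the same computation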

Could the GPU also be applied to neural network computations and other fields dominated by matrix operations, so that it is no longer limited to graphics but becomes more "general purpose"? You are not the first person to think of this!

4. From GPU to GPGPU

In 2004, Ian Buck and others observed in Brook for GPUs: Stream Computing on Graphics Hardware that, with the development of programmable graphics hardware, these processors' instructions had become versatile enough to be used outside rendering. This is where the concept of the general-purpose GPU, the GPGPU (General-Purpose Graphics Processing Unit), was proposed.

The paper mainly introduces Brook, a stream-computing programming system for GPUs. Initially, GPUs could only be programmed in assembly. Before Brook, several C-based high-level languages for GPU programming had been proposed, but they still treated the GPU purely as an image renderer, carried many restrictions, and could not abstract away the underlying hardware. As a result, GPU developers of the time had to master the latest graphics APIs as well as the characteristics and limitations of the specific GPU they were using, which placed extremely high demands on programmers.

The Brook language, by contrast, can expose the capabilities of different hardware; it extends traditional C with a data-parallel (Data Parallelism) programming model and improves the arithmetic intensity (Arithmetic Intensity), the ratio of computation to memory traffic, that the hardware can achieve.

Here is a brief explanation of the stream computing mentioned above, which involves three main concepts:

  • Streams: a stream can be understood as the raw data to be processed. Its elements ① can be processed in parallel, ② arrive dynamically, and ③ can be processed immediately, without waiting for all the data to be collected first;
  • Kernel: a function (algorithm) applied to every element of a stream;
  • Reduction: a mechanism of the kernel that merges multiple streams into one, or reduces a larger stream to a smaller one. If you know convolutional neural networks (CNNs), kernels and reductions will feel familiar (a minimal Python sketch of these three ideas follows this list);
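
Here is a minimal Python sketch of the three concepts. This is plain Python for illustration, not the Brook language itself; in Brook or CUDA the map step would run in parallel on the GPU.

from functools import reduce

stream = [1.0, 2.0, 3.0, 4.0]                 # a "stream": elements that can be processed independently

def kernel(x):
    """A kernel: the same operation applied to every stream element (here, squaring)."""
    return x * x

squared = list(map(kernel, stream))           # applying the kernel across the stream

total = reduce(lambda a, b: a + b, squared)   # a reduction: merging a stream into a single value
print(squared, total)                         # [1.0, 4.0, 9.0, 16.0] 30.0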

Stream computing is now very common in everyday scenarios such as live video streaming, real-time monitoring, and online shopping.

If you are not familiar with CNNs, you can read my earlier CSDN blog post: Understanding of Commonly Used Convolution Kernels in Convolutional Neural Networks and Example Applications Based on PyTorch (complete code attached).

By this point, we should understand why GPUs originally built for image processing can be generalized to other fields. So why can a GPU process deep learning workloads faster than a CPU?

This is determined by the underlying hardware architecture: the CPU is designed to handle complex serial computations (such as sorting algorithms), while the GPU is designed to handle large numbers of simple parallel operations (such as rendering and deep learning algorithms). The difference between CPU and GPU is illustrated below:

The gap is obvious just from the core counts. Compare current consumer-grade parts: the Intel Core i9-13900K CPU has 24 cores and 32 threads, while the core counts of NVIDIA's GeForce series GPUs are as follows:

Clearly, the tens of thousands of cores in a GPU are better suited to the simple, massive-scale computations of deep learning models. This video, in which NVIDIA demonstrates live the difference in how CPUs and GPUs work, explains the GPU's advantage in parallel computing very intuitively: https://www.bilibili.com/video/BV1ry4y1y7KZ/?spm_id_from=autoNext&vd_source=b1f7ee1f8eac76793f6cadd56716bfbf

At this point it may seem as if the GPU simply outperforms the CPU, but keep in mind that the GPU's many cores can only handle simple operations; complex operations still rely on the CPU. Each has its own strengths, and the video above only demonstrates the GPU from the perspective of the work it is best at! A small sketch comparing the two on a matrix multiplication follows.
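
Here is a minimal PyTorch timing sketch of the same matrix multiplication on CPU and GPU. It assumes PyTorch is installed, falls back to a CPU-only message if no CUDA device is present, and absolute timings will of course vary with your hardware.

import time
import torch

n = 4096                                   # matrix size; adjust to your hardware
a = torch.randn(n, n)
b = torch.randn(n, n)

t0 = time.time()
c_cpu = a @ b                              # runs on at most a few dozen CPU cores
t_cpu = time.time() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()               # make sure timing starts after the copy finishes
    t0 = time.time()
    c_gpu = a_gpu @ b_gpu                  # massively parallel: thousands of GPU cores
    torch.cuda.synchronize()               # GPU kernels are asynchronous; wait for completion
    t_gpu = time.time() - t0
    print(f"CPU: {t_cpu:.3f}s  GPU: {t_gpu:.3f}s")
else:
    print(f"CPU: {t_cpu:.3f}s  (no CUDA device available)")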

5. CUDA – laying the foundation for NVIDIA's dominance

In 2006, NVIDIA was thinking hard about how to build a complete GPU ecosystem. Presumably Jensen Huang had also read the paper above, because NVIDIA hired its lead author, Ian Buck (now an NVIDIA vice president in charge of accelerated computing), and in 2007 launched a game-changing, epoch-making computing platform and programming model: CUDA (Compute Unified Device Architecture).

The emergence of CUDA laid the groundwork for NVIDIA's later dominance of the GPU field. Even today, a company's compute capacity is often measured by how many NVIDIA A100 cards it has bought.

We can loosely think of CUDA as NVIDIA's own, dedicated equivalent of the Brook environment. CUDA supports multiple high-level programming languages and ships with many libraries, which greatly eases developers' work.

NVIDIA also invests heavily in improving its GPU hardware to support CUDA.

The Tesla above is not Musk's Tesla… it is NVIDIA's own product line.

As mentioned above, GPUs have a natural advantage for high-throughput parallel algorithms such as deep learning, so mainstream third-party libraries now support CUDA; PyTorch, for example, is largely written in C++ and CUDA under the hood.

After installing the CUDA version of PyTorch, you can check the availability of CUDA:

import torch
print(torch.cuda.is_available())

If True, CUDA is available.
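
As a small follow-up, here is a common pattern for using the GPU when one is available and falling back to the CPU otherwise. This is only a sketch: the model and tensor shapes are placeholders, not code from any particular project.

import torch
import torch.nn as nn

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)        # placeholder model; moves its parameters to the device
x = torch.randn(32, 10, device=device)     # create the input batch directly on the same device

with torch.no_grad():
    y = model(x)                           # the forward pass runs on the GPU if one is present

print(y.device)                            # e.g. cuda:0 or cpu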

At this point, we return to the question from the top of the article: why can't I use my graphics card to accelerate deep learning training even though my computer has one?

Because to get this kind of GPU acceleration you must use a graphics card that supports CUDA, which means an NVIDIA graphics card!!!

In recent years, the AI boom has driven explosive growth in NVIDIA's market value. Other companies have turned around and tried to launch CUDA-like products, but unfortunately NVIDIA, which started back in 2007, is already "far ahead". So it is no exaggeration to say that CUDA laid the foundation for NVIDIA's dominance today. Take a look at the business areas listed on NVIDIA's official website (it is fair to say NVIDIA has become unavoidable in these fields); if you did not know the company, could you guess that its original main business was making graphics cards?

It must be said again that, from this article's telling, every development seems perfectly natural, but NVIDIA's growth was by no means smooth sailing!!!

Leaving aside how the early, weak NVIDIA was pushed to the brink (into the "ICU") several times in disputes with giants such as Microsoft, ATI, AMD, and Intel, the investment in CUDA alone nearly bankrupted the company. Jensen Huang described it this way in the speech below:



Jensen Huang's full speech (requires a VPN in mainland China): https://m.youtube.com/watch?v=mkGy1by5vxw&t=0s

Making every NVIDIA GPU support CUDA was a very bold decision! I still remember that when I was choosing a computer in college, my rule was to buy an A card (AMD/ATI), because N cards (NVIDIA) had a reputation for overheating and booting to a black screen. Those early quality problems were the price of Huang's insistence on redesigning GPUs to be CUDA-compatible!

If Bitcoin, AI, and the metaverse had not exploded later on, NVIDIA might not have made it to today. (Of course, NVIDIA in turn helped drive the explosion of those fields.)

6. The future is not just about GPUs

Although the GPU beats the CPU at AI algorithms, it was not born for them; its versatility was simply discovered and applied across domains to AI. So is it possible to "custom-build" chips specifically for AI workloads?

The answer is yes. A variety of application-specific integrated circuits (ASICs) have been launched; for their target workloads they offer lower power consumption and higher compute than general-purpose GPUs. Here is a brief introduction to two common examples, the NPU and the TPU.

NPU (Neural network Processing Unit, neural network processor)

As the name suggests, an NPU is a processor specialized for neural network models. Its circuit architecture borrows a characteristic of biological neurons: computation and storage are integrated. In the CPU and GPU diagrams above, by contrast, computation (the cores, essentially ALUs) and storage (the caches) are still separate. NPUs are already in wide use, for example the Neural Engine in Apple's A15 chip shown below:

TPU (Tensor Processing Unit, tensor processor)

The TPU was born because, in 2013, Google found that if users used voice search for an average of just three minutes a day, the neural network computation involved would roughly double its data centers' compute requirements. Google therefore set out to design an ASIC with at least ten times the performance of a GPU, and the resulting TPU was publicly described in 2017 [5].

The TPU's main computational unit is a 256×256 matrix multiply unit, and the core design idea is to keep that unit running continuously. With 65,536 8-bit MAC (Multiply-Accumulate) units, a TPU reaches 92 TOPS of compute, about 15 to 30 times the GPUs of its day, while its efficiency (TOPS/W) is about 30 to 80 times higher! That peak figure is easy to sanity-check, as shown below.
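
The 92 TOPS number follows directly from the paper's figures: each MAC performs a multiply and an add (2 operations) per cycle, and the TPU described in [5] runs at a 700 MHz clock, so

65{,}536 \text{ MACs} \times 2 \text{ ops per MAC per cycle} \times 700 \text{ MHz} \approx 91.8 \times 10^{12} \text{ ops/s} \approx 92 \text{ TOPS}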

Two more units worth a quick explanation: TOPS (Tera Operations Per Second) and TFLOPS (Tera Floating-point Operations Per Second).

TOPS is the number of (typically integer) operations a device can perform per second, in trillions; it is mainly quoted for workloads such as image processing and speech recognition.

TFLOPS is the number of floating-point operations a device can perform per second, in trillions; it is mainly quoted for scientific computing, AI training, and other workloads dominated by floating-point arithmetic.

The key difference between the two is the type of operation being counted. There is no fixed conversion between them, but because floating-point operations are more complex than integer operations, a device's TFLOPS figure is usually lower than its TOPS figure. As a rough illustration of how such throughput figures relate to an actual computation, see the matrix multiplication estimate below.
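
For example, multiplying an m×k matrix by a k×n matrix takes roughly 2mkn floating-point operations (one multiply and one add per accumulated term), so if the multiplication takes t seconds, the achieved throughput is approximately

\text{achieved TFLOPS} \approx \frac{2\,m\,k\,n}{t \times 10^{12}}

This is a generic back-of-the-envelope estimate for interpreting spec-sheet numbers, not a figure taken from the referenced papers.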

The main references of this article:

[1] NVIDIA official website: World Leader in Artificial Intelligence Computing | NVIDIA

[2] Fletcher Dunn, Ian Parberry. Fundamentals of 3D Mathematics: Graphics and Game Development [M]. Tsinghua University Press, 2005.

[3] Buck I, Foley T, Horn D, et al. Brook for GPUs: Stream computing on graphics hardware [J]. ACM Transactions on Graphics, 2004, 23(3): 777-786. DOI: 10.1145/1186562.1015800.

[4] Liu Zhenlin, Huang Yongzhong, Wang Lei, et al. Application of Brook on GPU [J]. Journal of Information Engineering University, 2008, 9(1): 5. DOI: 10.3969/j.issn.1671-0673.2008.01.022.

[5] Jouppi N P, Young C, Patil N, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit [J]. ACM SIGARCH Computer Architecture News, 2017, 45(2): 1-12. DOI: 10.1145/3079856.3080246.
