Differences between GPU and CPU

The difference between the design of CPU and GPU

The reason CPUs and GPUs are so different is that they have different design goals: they target two different application scenarios. The CPU needs strong versatility to handle many data types, and it must make logical judgments, which introduces a large number of branches, jumps, and interrupts. All of this makes the CPU's internal structure extremely complicated. The GPU, by contrast, faces highly uniform, independent, large-scale data and a pure computing environment that does not need to be interrupted.

So the CPU and GPU present very different architectures (schematic diagram):
[Figure: CPU vs GPU architecture schematic]

The big difference between CPU and GPU

Image from the nVidia CUDA documentation. In the image, green marks the compute units, orange-red the storage units, and orange the control units.

The GPU devotes its die area to a large number of compute units and very deep pipelines, with only simple control logic and minimal cache. In the CPU, by contrast, the cache occupies a large share of the chip, and there is complex control logic and many optimization circuits; the compute units themselves make up only a small part of the CPU.
[Figure: CPU/GPU comparison of cache, threads, registers, and SIMD units]

It can be seen from the picture above:

Cache, local memory: CPU > GPU

Threads: GPU > CPU

Registers: GPU > CPU. The GPU's many registers support its many threads: each thread needs its own registers, so as the thread count grows, the register file must grow with it.

SIMD units (single instruction, multiple data: the same instruction executed on many data elements simultaneously, in lockstep): GPU > CPU.
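The SIMD idea above can be sketched in plain Python: one conceptual "instruction" covers a whole group of data lanes at once, while a scalar loop handles one element per step. The lane width of 4 is illustrative, not any real hardware's width.

```python
# Toy illustration of SIMD vs scalar execution (not real vector hardware).

def scalar_add(a, b):
    # CPU-style scalar loop: one addition per iteration.
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
    return out

def simd_add(a, b, width=4):
    # SIMD-style: each pass over the loop represents ONE instruction
    # that operates on `width` elements (one lane group) at a time.
    out = []
    for i in range(0, len(a), width):
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    return out

a = list(range(8))
b = [10] * 8
print(scalar_add(a, b))  # [10, 11, 12, 13, 14, 15, 16, 17]
print(simd_add(a, b))    # same result, but in len(a)/width "instructions"
```

The two produce identical results; the point is that the SIMD version needs only `len(a) / width` instruction issues, which is where the GPU's advantage on uniform data comes from.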

The CPU is designed for low latency:
[Figure: CPU design for low latency]

The CPU has powerful ALUs (arithmetic logic units), which can complete arithmetic calculations in just a few clock cycles.

Today's CPUs support 64-bit double precision. A double-precision floating-point addition or multiplication takes only 1 to 3 clock cycles.

The CPU's clock frequency is very high, typically around 1.5 to 3 gigahertz (GHz, 10^9 Hz).

Large caches also reduce latency: a lot of data is kept in the cache, so when you need data you have accessed before, you can fetch it directly from the cache instead of going to memory.
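The caching effect described above can be imitated with Python's `functools.lru_cache`: only the first access to each address goes to the simulated "DRAM"; repeated accesses hit the cache. The `load` function and the access pattern are made up for illustration.

```python
from functools import lru_cache

dram_accesses = 0

@lru_cache(maxsize=None)
def load(addr):
    # Simulated slow memory read; the counter tracks how often we
    # actually go to "DRAM" rather than hitting the cache.
    global dram_accesses
    dram_accesses += 1
    return addr * 2  # stand-in for the data stored at this address

# Five accesses, but only two distinct addresses.
for addr in [0, 1, 0, 1, 0]:
    load(addr)

print(dram_accesses)  # 2 — the other three accesses were cache hits
```

This is exactly the CPU's bet: if the program re-touches the same data, the expensive memory trip is paid only once.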

A complex logic control unit. When a program contains many branches, branch prediction reduces latency.

Data forwarding. When an instruction depends on the result of an earlier instruction, the control logic tracks the instructions' positions in the pipeline and forwards the earlier result to the dependent instruction as soon as possible. These mechanisms require many comparator circuits and forwarding circuits.
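As a rough sketch of why forwarding helps, here is the classic cycle arithmetic for two back-to-back dependent instructions in a 5-stage pipeline (IF/ID/EX/MEM/WB): without forwarding the consumer must stall until the producer's write-back; with forwarding it receives the EX result directly. The stall counts are textbook numbers, not measurements of any specific CPU.

```python
def pipeline_cycles(forwarding):
    # Classic 5-stage RISC pipeline: IF, ID, EX, MEM, WB.
    # Without forwarding, an instruction that needs the previous result
    # must stall until the producer writes back: 2 extra cycles.
    # With forwarding, the EX output is routed straight into the next EX.
    stalls = 0 if forwarding else 2
    # Two overlapped instructions: 5 cycles for the first, 1 more for the
    # second to drain, plus any stall cycles in between.
    return 5 + 1 + stalls

print(pipeline_cycles(forwarding=True))   # 6 cycles
print(pipeline_cycles(forwarding=False))  # 8 cycles
```

The comparator and forwarding circuitry the text mentions is the hardware price paid to get from 8 cycles down to 6 on every dependent pair.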

[Figure: GPU design for high throughput]

The GPU is designed for high throughput:

The GPU's characteristic is many ALUs and little cache. Unlike on the CPU, the cache's purpose is not to keep data for later reuse but to serve the threads: when many threads need the same data, the cache merges those accesses and then issues fewer requests to DRAM (the data to be accessed lives in DRAM, not in the cache). Once the data arrives, the cache forwards it to the corresponding threads — this is its data-forwarding role. Because DRAM must be accessed, however, this naturally introduces latency.

The GPU's control unit (the yellow block on the left of the figure) can combine multiple memory accesses into fewer accesses.
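This access-combining (coalescing) can be modeled in a few lines: group the per-thread byte addresses by the memory segment they fall in, and count distinct segments as DRAM transactions. The 128-byte segment size and 32-thread group below are illustrative numbers, not a statement about any particular GPU.

```python
# Toy model of memory access coalescing: each of 32 "threads" reads one
# 4-byte word; the hardware merges requests that fall in the same
# 128-byte segment into a single DRAM transaction.

def dram_transactions(addresses, segment=128):
    # Distinct segments touched == number of DRAM transactions issued.
    return len({addr // segment for addr in addresses})

# 32 consecutive 4-byte reads -> all land in one 128-byte segment.
coalesced = [tid * 4 for tid in range(32)]
# 32 strided reads, 128 bytes apart -> every thread hits its own segment.
scattered = [tid * 128 for tid in range(32)]

print(dram_transactions(coalesced))  # 1
print(dram_transactions(scattered))  # 32
```

The 32x gap between the two access patterns is why neighboring threads reading neighboring addresses is the preferred layout for GPU code.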

Although the GPU suffers DRAM latency, it has many ALUs and many threads. To balance out memory latency, it makes full use of those many ALUs by scheduling as many threads as possible: while some threads wait on memory, others compute, achieving very high throughput. This is also why GPU ALU pipelines are generally kept very heavily loaded.
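The latency-hiding trade-off can be put as simple arithmetic: if a DRAM access takes `latency` cycles and each thread group can issue `work` cycles of independent arithmetic before it must wait, then roughly `ceil(latency / work)` groups keep the ALUs busy. The numbers below are hypothetical round figures, not vendor specifications.

```python
from math import ceil

def groups_to_hide_latency(latency_cycles, work_cycles_per_group):
    # While one thread group waits on DRAM, the scheduler switches to
    # others; we need enough groups that their combined independent
    # work covers the full memory latency.
    return ceil(latency_cycles / work_cycles_per_group)

# E.g. a 400-cycle DRAM access hidden by groups that each have
# 20 cycles of arithmetic to do between memory requests:
print(groups_to_hide_latency(400, 20))  # 20 groups
```

The formula also shows the failure mode: if each thread has little independent work, the required thread count balloons, which is why "assign as many threads as possible" is the standing advice.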

So the CPU is good at logic control and serial computation, while the GPU is good at large-scale parallel computation on uniform data — which is exactly what tasks like password cracking require. That is why, beyond image processing, GPUs are increasingly used for general-purpose computing.

Most GPU work looks like this: computationally intensive, but with no technical depth, and repeated many times. Imagine a job that requires hundreds of millions of additions, subtractions, multiplications, and divisions: the best approach is to hire dozens of elementary-school students to compute together, each handling one part — the calculations take no skill, just sheer effort. The CPU is like an old professor who can do integrals and differentials, and so commands a high salary; one professor costs as much as twenty students. If you were Foxconn, which would you hire? This is how GPUs work: many simple compute units finishing a huge number of computing tasks through sheer numbers. The strategy rests on one premise: students A and B work independently, with no dependence on each other. Many computation-heavy problems have exactly this property — password cracking, mining, and much of graphics — because they decompose into many identical simple subtasks, each of which can be assigned to one student. But some tasks involve a "flow", where each step depends on the previous one, like a blind date: both sides must meet and want to continue before anything else happens — you can't go pick up the marriage certificate before you've even met. Complicated, sequential problems like that are all handled by the CPU.

All in all, because the CPU and GPU were designed from the start to handle different tasks, their designs differ considerably. When a task resembles the problems the GPU was originally built to solve, the GPU is used. The GPU's computing speed depends on how many students you hire; the CPU's computing speed depends on how powerful a professor you hire. The professor crushes the students on complex tasks, but on simpler tasks he still cannot withstand their sheer numbers. Of course, today's GPUs can also do somewhat more complicated work — roughly an upgrade to junior-high-school level — but they still need the CPU to feed them data before they can start working, so they still depend on the CPU.

What type of program is suitable for running on the GPU?

(1) Compute-intensive programs. A so-called compute-intensive program spends most of its running time on register operations. Register speed matches processor speed, so reading and writing registers incurs almost no delay. For comparison, the latency of reading main memory is a few hundred clock cycles, and reading from disk is far slower still — even an SSD is really too slow by this measure.

(2) Easily parallelized programs. The GPU is essentially a SIMD (Single Instruction, Multiple Data) architecture: it has hundreds or thousands of cores, and it works best when every core does the same thing at the same time.
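A minimal sketch of such an "easily parallelized" workload, in plain Python: the same pure function applied independently to every element, so the data can be split into chunks that could run on separate cores and still reproduce the serial result exactly. The 4-way split is arbitrary.

```python
# An embarrassingly parallel workload: no element depends on any other.

def square(x):
    return x * x

data = list(range(10))

# Serial reference result.
serial = [square(x) for x in data]

# Split into independent chunks (one per imaginary "core") and process
# each chunk separately — order and grouping don't affect the result.
chunks = [data[i::4] for i in range(4)]     # round-robin split, 4 workers
parallel = [None] * len(data)
for worker, chunk in enumerate(chunks):
    for j, x in enumerate(chunk):
        parallel[worker + 4 * j] = square(x)  # write back to original slot

print(parallel == serial)  # True
```

Because each `square(x)` call touches only its own input and output slot, the chunks could be handed to real threads, processes, or GPU cores without any synchronization — the defining trait of GPU-friendly work.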


Origin blog.csdn.net/qq_41174940/article/details/104460899