CUDA Programming Learning: Introduction to Basic Knowledge, Practical Notes (3)

The main content of this article is to introduce some basic knowledge before CUDA programming

References:

Gao Sheng's blog
"CUDA C programming authoritative guide"
and CUDA official documents

The articles and accompanying explainer videos are published simultaneously on the WeChat public account "AI Knowledge Story" and on Bilibili: go out to eat three bowls of rice

1: Parallel Computing

Parallel programs can be divided into:
Task parallelism: generally used in management systems, such as the Taobao trading system, where many people are using it at the same time and the backend must be able to process those requests in parallel.
Data parallelism: generally used in large-scale data computation, where a large amount of data is processed by the same computation program.

CUDA is well suited for data-parallel computing.

In the first step of data parallelism,
the data is divided among threads
(1)
Block partitioning: a whole block of data is cut into small chunks, each chunk is assigned to one thread, and the order in which the chunks execute is arbitrary.
[Figure: block partitioning]

(2)
Cyclic partitioning: threads process adjacent chunks in turn, and each thread handles more than one chunk. For example, with five threads, thread 1 processes chunk 1, thread 2 processes chunk 2, ..., thread 5 processes chunk 5, and then thread 1 processes chunk 6, and so on (a code sketch of both schemes follows the figure below).

[Figure: cyclic partitioning]
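As a rough code sketch of the two schemes (hypothetical kernels; they use the thread-indexing variables covered later in this article, and the per-element work is just a placeholder):

__global__ void block_partition(float *data, int n)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;    // this thread's global id
    int nthreads = gridDim.x * blockDim.x;                    // total number of threads
    int chunk    = (n + nthreads - 1) / nthreads;             // contiguous elements per thread
    for (int i = tid * chunk; i < n && i < (tid + 1) * chunk; ++i)
        data[i] *= 2.0f;                                       // placeholder work on one contiguous chunk
}

__global__ void cyclic_partition(float *data, int n)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)                    // tid, tid + nthreads, tid + 2*nthreads, ...
        data[i] *= 2.0f;                                       // placeholder work on strided elements
}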

2: Computer Architecture

Flynn's Taxonomy classifies computer architectures by the way instructions and data enter the CPU, giving the following four categories:
[Figure: Flynn's taxonomy]
(1) Looking at the data and instruction streams separately:

Single Instruction Single Data (SISD): the traditional serial computer, e.g., the 386.
Single Instruction Multiple Data (SIMD): a parallel architecture such as a vector machine; every core executes the same instruction, but on different data. CPUs today generally include such vector instructions.
Multiple Instruction Single Data (MISD): rare; multiple instructions operate on a single piece of data.
Multiple Instruction Multiple Data (MIMD): a parallel architecture in which multiple cores execute multiple instructions and process multiple data streams asynchronously, achieving spatial parallelism. In most cases MIMD includes SIMD, i.e., an MIMD machine has many computing cores, and each core supports SIMD.

(2) In order to improve parallel computing capabilities, we need to achieve the following performance improvements from the architecture:

Lower Latency
Increase Bandwidth
Increase Throughput

Latency is the time an operation takes from start to finish, usually measured in microseconds; the lower the latency, the better.
Bandwidth is the amount of data processed per unit time, usually expressed in MB/s or GB/s.
Throughput is the number of operations successfully completed per unit time, usually expressed in GFLOPS (billions of floating-point operations per second). Throughput and latency are related, and both reflect computing speed: latency is the time divided by the number of operations, i.e., the time per operation, while throughput is the number of operations divided by the time, i.e., the number of operations per unit time.
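Written as simple formulas (symbols introduced here only for illustration): if T is the total time and N the number of operations completed in it, then latency = T / N and throughput = N / T.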

3: Heterogeneous architecture

We can regard the CPU as the commander, i.e., the host side (host), and the GPU, which performs the bulk of the computation, as the computing device (device).
[Figure: CPU vs GPU architecture]

(1) The figure above roughly reflects the different architectures of the CPU and the GPU.

Left: the CPU, with 4 ALUs that are mainly responsible for logical computation, 1 control unit (Control), 1 DRAM memory, and 1 cache (Cache).

Right: the GPU. Each small green square is an ALU. Pay attention to the SM (streaming multiprocessor) part in the red box: this group of ALUs shares one control unit and one cache. That part is roughly equivalent to a complete multi-core CPU, except that it has many more ALUs and a smaller control portion, so computing capability is increased while control capability is weakened. For programs with complex control (logic), a single GPU SM cannot compete with a CPU, but for tasks with simple logic and large data volumes the GPU has the clear advantage. Also note that a GPU contains many SMs, and the number keeps increasing.

(2) Host code runs on the host side and is compiled into machine code for the host architecture; device code runs on the device and is compiled into machine code for the device architecture. The host machine code and the device machine code are therefore isolated: each executes on its own side, and they cannot be exchanged.

The host-side code mainly controls the device and handles control tasks such as data transfer, while the device side is mainly responsible for computation.
A CPU could perform these computations without a GPU, just much more slowly, so the GPU can be regarded as an accelerator for the CPU.

(3) The difference between CPU and GPU threads:

CPU threads are heavyweight entities: the operating system switches between them, and thread context switches are expensive.
GPU threads are lightweight: a GPU application typically contains tens of thousands of threads, most of which are waiting in queues, and switching between threads has essentially no overhead.
CPU cores are designed to minimize latency for one or two running threads, while GPU cores handle a large number of threads to maximize throughput.

4: CUDA programming structure

The possible execution sequence of a complete CUDA application is shown in the figure below:
[Figure: execution flow of a CUDA application]
Execution proceeds from serial host code to a kernel launch. After a kernel is launched, control returns to the host thread immediately, which means that while the first section of parallel code is still executing on the device, the next section of host code has very likely already started executing asynchronously.
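A minimal sketch of this behavior (hypothetical kernel name, error checking omitted): the launch returns immediately, the host continues running, and cudaDeviceSynchronize() is what makes the host wait for the device to finish.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(void)
{
    // ... parallel device-side work would go here ...
}

int main(void)
{
    my_kernel<<<4, 8>>>();            // the launch returns control to the host thread immediately
    printf("host code: may run while the kernel is still executing on the device\n");
    cudaDeviceSynchronize();          // block the host until the device has finished
    return 0;
}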

5: Memory management

[Figure: CUDA memory model]
(1) The host (CPU) manages device (GPU) memory through functions such as cudaMalloc, cudaMemcpy, cudaMemset, and cudaFree.
(2) The device space contains a grid, which consists of many blocks plus global memory; the size of the grid is the number of blocks it contains.
(3) A block contains many threads plus a shared memory space; the size of a block is the number of threads it contains.
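A minimal sketch of that host-side management pattern (hypothetical buffer names and sizes, error checking omitted):

#include <cuda_runtime.h>
#include <cstdlib>

int main(void)
{
    const int n = 1024;                              // arbitrary element count for the sketch
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);          // host-side buffer
    float *d_data = NULL;                            // device-side buffer
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaMalloc((void **)&d_data, bytes);             // allocate device (global) memory
    cudaMemset(d_data, 0, bytes);                    // set the device buffer to zero bytes
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // copy host -> device (overwrites the zeros)
    /* ... launch kernels that operate on d_data ... */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // copy device -> host
    cudaFree(d_data);                                // release device memory
    free(h_data);                                    // release host memory
    return 0;
}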

6: Thread management

(1) When the kernel function starts executing, how to organize the GPU's threads becomes the most important question. We must be clear that a kernel launch has exactly one grid, a grid can contain many blocks, and each block can contain many threads. This hierarchical organization makes our parallelism more flexible:
[Figure: grid, block, and thread hierarchy]
Threads within the same thread block can cooperate in the following ways:

1 Synchronization
2 Shared memory

Threads in different blocks cannot affect each other! They are physically separated!

(2) Every thread executes the same piece of serial code. To make that same code operate on different data, the first step is to distinguish the threads from one another, so that each thread can be matched with its own portion of the data. If a thread carries no identifier of its own, there is no way to determine what it should do.
We rely on the following two built-in variables to determine a thread's identity:
blockIdx (the position index of the thread block in the thread grid)
threadIdx (the position index of the thread in the thread block)

Both of the above are coordinates. Naturally, there are two corresponding structures that store their ranges, that is, the range of the three fields of threadIdx and the range of the three fields of blockIdx:

blockDim (the dimensions of a thread block, i.e., the range of threadIdx)
gridDim (the dimensions of the grid, i.e., the range of blockIdx)
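A small sketch (hypothetical kernel) that simply prints each thread's coordinates shows what the four built-in variables hold:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void print_coords(void)
{
    // threadIdx: this thread's position inside its block; blockDim: the block size (range of threadIdx)
    // blockIdx:  this block's position inside the grid;   gridDim:  the grid size (range of blockIdx)
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main(void)
{
    print_coords<<<2, 4>>>();         // a grid of 2 blocks with 4 threads each: 8 lines of output
    cudaDeviceSynchronize();          // wait for the device so the output is flushed before exit
    return 0;
}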

7: Kernel function

A kernel function is the code that runs on many threads in the CUDA model. This code runs on the device and is compiled with nvcc; the machine code generated is GPU machine code. So when we write CUDA programs, we write kernel functions. The first step is to make sure the kernel runs correctly and produces correct results; the second step is to optimize the CUDA program, and whether we optimize the algorithm or adjust the memory and thread structures, these optimizations are accomplished by adjusting the code in the kernel function.

We have been regarding the CPU as the controller: running a kernel function is initiated from the CPU, so let us now learn how to launch a kernel.

(1) Kernel function call

kernel_name<<<4,8>>>(argument list);

The triple angle brackets '<<<grid,block>>>' specify the thread configuration with which the device code is executed (or simply the kernel's execution configuration).
The thread layout of the instruction above is: 4 blocks, with 8 threads assigned and invoked in each block.
[Figure: thread layout of the <<<4,8>>> launch]
Our kernel function is replicated onto many threads that execute at the same time. We mentioned the correspondence problem above: having multiple threads repeat the same computation on the same data would clearly waste time, so to make the threads work on different data as we intend, we need to give each thread a unique identifier. Since device memory is linear (essentially all memory hardware on the market stores data linearly), we can, as the figure above shows, combine threadIdx.x and blockIdx.x to obtain a unique identifier for the corresponding thread (we will see later that threadIdx and blockIdx can be combined to produce many different effects).

kernel_name<<<1,32>>>(argument list);  // launch 1 block containing 32 threads

kernel_name<<<32,1>>>(argument list);  // launch 32 blocks of 1 thread each
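Putting the pieces together, a minimal sketch (a hypothetical vector-scaling kernel, error checking omitted) that uses blockIdx.x * blockDim.x + threadIdx.x as the unique identifier described above, launched with the <<<4,8>>> configuration:

#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // the thread's unique identifier
    if (idx < n)                                       // guard so idx never runs past the data
        data[idx] *= 2.0f;
}

int main(void)
{
    const int n = 32;
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<4, 8>>>(d_data, n);       // 4 blocks * 8 threads = 32 threads, one per element
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}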

Origin: blog.csdn.net/qq_40514113/article/details/130818169