Embedded Algorithm Porting and Optimization Study Notes 6: CUDA Programming



CUDA (Compute Unified Device Architecture) is NVIDIA's unified computing device architecture. Anyone working in image and vision will come into contact with CUDA sooner or later: when performance and speed matter, CUDA is an essential tool, and it is a hurdle that vision engineers can hardly go around; you have to work through it before things get practical. CUDA programming is easy to pick up but hard to master, yet for anyone with a background in computer architecture and C programming, getting started should not be too difficult. This article walks through the most important points of CUDA programming from the following five angles, to help you get up to speed quickly:

  • GPU architecture characteristics
  • CUDA thread model
  • CUDA memory model
  • CUDA programming model
  • A small CUDA application example

1. GPU architecture features

First, let's talk about serial computing and parallel computing. We know that the key to high-performance computing is to use multi-core processors for parallel computing.

When we solve a computing task, the natural approach is to decompose it into a series of small tasks and complete them one by one. In serial computing, the processor handles one small task at a time and only moves on to the next after the current one is finished; when all the small tasks are done, the whole task is complete.

But the shortcoming of serial computing is obvious. If we have a multi-core processor, and the small tasks are independent of each other (they do not depend on one another's results, for example my computation does not use your result), why stick with serial programming? To further speed up a large task, we can hand the independent modules to different processor cores to be computed at the same time (this is parallelism), and finally merge the partial results to finish the overall task. In other words, a large computing task is decomposed into small tasks, the independent small tasks are distributed to different processors for parallel computation, and a serial stage finally gathers the results to complete the whole computation.

Therefore, whether a program can be computed in parallel comes down to how it splits into execution modules: which modules are independent of each other and which are strongly coupled. For the independent modules we can design parallel computation and make full use of the multi-core processor to further accelerate the task; for the strongly coupled modules we keep serial code. Combining the two, serial plus parallel, is how a high-performance computation is put together.
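As a small illustration (my own sketch, not from the original article), the two loops below show the difference in plain C++: the first loop's iterations are independent of each other and could be handed to different cores or GPU threads, while the second loop's iterations each depend on the previous result and must stay serial.

#include <vector>

// Illustrative sketch (not from the original article).
// Independent iterations: each element only needs a[i] and b[i], so the loop can be parallelized.
void elementwise_add(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& c)
{
    for (size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] + b[i];
}

// Strongly coupled iterations: a[i] needs a[i - 1], so this loop must run serially as written.
void prefix_sum(std::vector<float>& a)
{
    for (size_t i = 1; i < a.size(); ++i)
        a[i] += a[i - 1];
}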

Next, let's look at the difference between the CPU and the GPU and their respective characteristics. We mentioned "multi-core" several times when discussing serial and parallel computing, so let's approach the topic from the perspective of cores. A CPU consists of a few cores optimized for sequential, serial processing, whereas a GPU consists of thousands of smaller, more efficient cores designed to handle many tasks at the same time. In other words, each CPU core is individually very capable, but there are few of them, so the CPU does not do well at massively parallel work; each GPU core is individually much weaker, but there are very many of them, so the GPU can process many computing tasks simultaneously and supports parallel computing well.

The different hardware characteristics of the GPU and the CPU determine their application scenarios: the CPU is the computer's core for computation and control, while the GPU is mainly used for graphics and image processing. An image is represented in the computer as a matrix, and processing an image means operating on matrices; many matrix operations can be parallelized, which makes image processing fast, so the GPU has plenty of opportunity to shine in the graphics and image field. In a multi-GPU computer hardware system, each GPU contains many SPs and several kinds of memory, and this hardware is the basis of the GPU's efficient parallel computing.

Now let's compare the CPU and the GPU from the perspective of data processing. The CPU needs strong versatility to handle many data types (integers, floating-point numbers, and so on), and it must be good at the large number of branches, jumps and interrupts caused by logical decisions. It is therefore a very capable worker that can handle many things properly, but it also needs a lot of hardware resources of its own, which keeps the core count low (a typical CPU has at most a few dozen cores). The GPU, on the other hand, faces highly uniform, independent, large-scale data and a pure computing environment with no interruptions, so it carries a very large number of cores (the Fermi architecture already had 512). Each core is far less powerful than a CPU core, but there are many more of them, and for simple, regular computing tasks this "strength in numbers" wins out. That is the charm of parallel computing.

To summarize, the characteristics of the two are:

  • CPU: Good at process control and logic processing, irregular data structures, unpredictable storage structures, single-threaded programs, branch-intensive algorithms
  • GPU: Good at data parallel computing, regular data structure, predictable storage mode.
In the current computer architecture, the GPU alone cannot complete a CUDA parallel computing task; the CPU must cooperate with it to complete a high-performance parallel computation.

Generally speaking, the parallel part of a program runs on the GPU and the serial part runs on the CPU; this is heterogeneous computing. Specifically, heterogeneous computing means that processors with different architectures cooperate to complete a computing task: the CPU is responsible for the overall program flow, and the GPU is responsible for the concrete computation. When the GPU threads have finished their work, the results computed on the GPU side are copied back to the CPU side, and the computation is complete.

Therefore, the overall division of labor in a GPU-accelerated application is: the computation-intensive code (roughly 5% of the total code) is handled by the GPU, and the remaining serial code is executed by the CPU.

2. CUDA thread model

Below we introduce CUDA's thread organization. A thread is the most basic unit of program execution, and CUDA's parallel computing is realized by the concurrent execution of thousands of threads.

CUDA's thread model can be summarized as follows:

1. Thread: the basic unit of parallel execution

2. Thread Block: a group of threads that cooperate with each other. A thread block has the following characteristics:

  • Threads in the same block can synchronize with each other
  • Threads in the same block can exchange data quickly through shared memory
  • A block can be organized in 1, 2 or 3 dimensions

3. Grid: a group of thread blocks

  • A grid can be organized in 1 or 2 dimensions
  • All thread blocks in a grid share global memory

4. Kernel: the core program executed on the GPU. A kernel function runs on a grid:

One kernel <-> one grid

Each block and each thread has its own ID, and we locate a particular thread block and thread through the corresponding index:

  • blockIdx (block ID): 1D or 2D
  • threadIdx (thread ID): 1D, 2D or 3D

To understand the kernel, you must have a clear picture of its thread hierarchy. The GPU runs a very large number of lightweight threads in parallel: when a kernel executes on the device, it actually launches many threads. All the threads launched by one kernel are collectively called a grid, and the threads in the same grid share the same global memory space; the grid is the first level of the thread structure. A grid is further divided into many thread blocks (blocks), and each thread block contains many threads; this is the second level.

Both grid and block are defined as variables of type dim3. dim3 can be regarded as a struct with three unsigned integer members (x, y, z), each of which defaults to 1, so grid and block can be flexibly defined as 1-dim, 2-dim or 3-dim structures. When a kernel is called, the grid and block dimensions it uses are specified through the execution configuration <<<grid, block>>>.

The marker <<<grid, block>>> is in effect a two-level indexing scheme. The first level of indexing, (blockIdx.x, blockIdx.y), locates the thread block within the grid; the second level, (threadIdx.x, threadIdx.y, threadIdx.z), locates the specific thread within that block. This is CUDA's thread organization structure, illustrated by the sketch below.
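The following is a minimal sketch of this two-level indexing (an assumed example, not the figure from the original article): a 2-dim grid of 2-dim blocks is launched with <<<grid, block>>>, and each thread computes its own global (x, y) coordinate from blockIdx, blockDim and threadIdx.

#include <cuda_runtime.h>

// Assumed example: each thread derives its global coordinate from its block and thread indices.
__global__ void whoAmI(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (x < width && y < height)
        out[y * width + x] = y * width + x;          // store the linear global thread id
}

int main()
{
    const int width = 1024, height = 768;
    int *d_out;
    cudaMalloc(&d_out, width * height * sizeof(int));

    dim3 block(16, 16);                              // 256 threads per 2-dim block
    dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to cover every element
              (height + block.y - 1) / block.y);
    whoAmI<<<grid, block>>>(d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}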

Here I want to talk about the SP (streaming processor) and the SM (streaming multiprocessor); these two terms confuse many people.

  • SP: the most basic processing unit, the streaming processor, also called a CUDA core. The concrete instructions and tasks are ultimately executed on SPs; the GPU performs parallel computing precisely because many SPs process data at the same time.

  • SM: multiple SPs plus some other resources form a streaming multiprocessor, also called a GPU core; the other resources include the warp scheduler, registers, shared memory and so on. The SM can be regarded as the heart of the GPU (analogous to a CPU core), and registers and shared memory are scarce SM resources. CUDA distributes these resources among all the threads resident on the SM, so these limited resources put a strict cap on the number of active warps per SM, which in turn limits the parallelism.
    It should be pointed out that the number of SPs per SM depends on the GPU architecture: the Fermi GF100 has 32, the GF10x has 48, Kepler has 192, and Maxwell has 128.

In short, the SP is the hardware unit on which threads execute. An SM contains multiple SPs, and a GPU can have several SMs (for example, 16), so a single GPU may contain thousands of SPs. With so many cores "running at the same time", the speed is easy to imagine. The quotation marks are there because, in the software logic, all SPs run in parallel, but physically not all SPs can compute at the same moment (for example, when we have only 8 SMs but 1,024 thread blocks waiting to be scheduled); some threads will be in suspended, ready or other states, and that is a matter of GPU thread scheduling.
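Because these numbers vary from one GPU generation to the next, it is safer to query them at run time than to hard-code them. Below is a small sketch (my own example, using the standard CUDA runtime API) that prints the SM count and a few per-SM and per-block limits of device 0.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("GPU name                        : %s\n", prop.name);
    printf("SM (multiprocessor) count       : %d\n", prop.multiProcessorCount);
    printf("Max threads per block           : %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM              : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("32-bit registers per block      : %d\n", prop.regsPerBlock);
    printf("Shared memory per block (bytes) : %zu\n", prop.sharedMemPerBlock);
    printf("Warp size                       : %d\n", prop.warpSize);
    return 0;
}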

The mapping between CUDA's software thread model and the hardware is as follows:

  • Each thread is executed by a thread processor (SP)
  • Each thread block is executed by a multiprocessor (SM)
  • A kernel corresponds to a grid of thread blocks, and a kernel executes on only one GPU device at a time

A block is a software concept, and a block is scheduled onto only one SM. During development, the programmer tells the GPU hardware how many threads there are and how they are organized by setting the block's dimensions; the concrete scheduling is handled by the SM's warp schedulers. Once a block has been assigned to an SM, it stays on that SM until it finishes executing. An SM can hold multiple blocks at the same time, but they have to be scheduled onto the execution units in turn.

3. CUDA memory model

The memory model in CUDA is divided into the following levels:

  • Each thread uses its own registers
  • Each thread has its own local memory (local memory)
  • Each thread block has its own shared memory (shared memory), which all threads within that thread block can share
  • Each grid has its own global memory (global memory), which can be used by threads of different thread blocks
  • Each grid has its own constant memory (constant memory) and texture memory (texture memory), which can be used by threads of different thread blocks

The speed at which threads access these kinds of memory, from fastest to slowest, is roughly: register > shared memory > local memory / global memory (local memory physically resides in device memory, so it is about as slow as global memory).

These kinds of memory sit at different levels of the hardware hierarchy, from on-chip registers and shared memory down to the off-chip device memory that holds local, global, constant and texture data.
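To make the levels concrete, here is a small kernel sketch (an assumed example): ordinary local variables live in registers, a __shared__ array lives in per-block shared memory, a __constant__ variable lives in constant memory, and the pointers passed to the kernel refer to global memory. The constant scale would be set from the host with cudaMemcpyToSymbol before the launch.

#include <cuda_runtime.h>

__constant__ float scale;                        // constant memory, readable by every thread in the grid

// Assumed example; launch with blocks of at most 256 threads so the tile fits.
__global__ void memoryLevels(const float *in, float *out, int n)   // in/out point to global memory
{
    __shared__ float tile[256];                  // shared memory: one copy per thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                              // ordinary local variable, normally held in a register
    if (i < n)
        v = in[i];                               // load from global memory

    tile[threadIdx.x] = v;                       // stage the value in the block's shared memory
    __syncthreads();                             // wait until every thread in the block has written

    if (i < n)
        out[i] = tile[threadIdx.x] * scale;      // write the result back to global memory
}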

4. CUDA programming model

Having covered all of the hardware-related background above, we can finally start to talk about how to actually write CUDA programs.

Let's start with a few common CUDA terms: the host refers to the CPU and its memory, the device refers to the GPU and its memory, and a kernel is a function that runs on the device and is launched from the host.

The first key point of CUDA programming: keywords

How do we write a program or function that can run on the GPU?

You can use keywords (function qualifiers) to indicate whether a function runs on the CPU or on the GPU. For example, __global__ defines a kernel function, which is called from the CPU and executed on the GPU; note that a __global__ function's return type must be void.
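A short sketch of the three standard function qualifiers (the functions themselves are my own examples):

// __global__ : called from the host, runs on the device, must return void (a kernel)
// __device__ : called from device code, runs on the device
// __host__   : called from the host, runs on the host (the default for plain functions)

__device__ float square(float x)                 // helper callable only from GPU code
{
    return x * x;
}

__global__ void squareAll(float *data, int n)    // kernel: launched from the CPU, runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

__host__ void launchSquareAll(float *d_data, int n)   // ordinary CPU-side wrapper
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    squareAll<<<grid, block>>>(d_data, n);            // execution configuration <<<grid, block>>>
}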

The second key point of CUDA programming: data transfer

How to write data transfer between CPU and GPU?

First, the function interfaces for allocating and freeing GPU memory:

cudaMalloc(): allocates global memory on the device
cudaFree(): frees previously allocated device memory
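A minimal usage sketch (my own example): allocate a buffer of 1024 floats in device global memory, check the returned status, and release it again.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *d_buf = nullptr;
    cudaError_t err = cudaMalloc(&d_buf, 1024 * sizeof(float));   // allocate global memory on the device
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);                                              // release the device memory
    return 0;
}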

Data transfer between CPU memory and GPU memory goes through a single function interface; an enumeration parameter passed to the function indicates the direction of the transfer:

cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction)

enum cudaMemcpyKind:

cudaMemcpyHostToDevice (CPU to GPU)
cudaMemcpyDeviceToHost (GPU to CPU)
cudaMemcpyDeviceToDevice (GPU to GPU)
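Putting the keywords, memory allocation and data transfer together, here is a minimal end-to-end vector-addition sketch (my own example, not taken from the original article) that follows the CPU/GPU division of labor described earlier: the CPU drives the flow, the GPU does the element-wise computation.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host (CPU) buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 1. Allocate global memory on the device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 2. Copy the inputs from the CPU to the GPU.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: enough 256-thread blocks to cover n elements.
    int block = 256;
    int grid  = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    // 4. Copy the result back from the GPU to the CPU.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    // 5. Free device and host memory.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}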

Source: blog.csdn.net/mao_hui_fei/article/details/114258912