What is CUDA - Introduction to CUDA

CPU, GPU
CPU
The CPU (Central Processing Unit) is a very large-scale integrated circuit that serves as the computing core and control core of a computer. Its main job is to interpret computer instructions and process the data used by software.

The CPU, main memory (Memory), and input/output (I/O) devices are collectively referred to as the three core components of a computer. A CPU mainly consists of an arithmetic logic unit (ALU), a control unit (CU), registers, a cache, and the data, control, and status buses that connect them. Simply put: a computing unit, a control unit, and a storage unit. The CPU follows the von Neumann architecture, whose core ideas are the stored program and sequential execution.

Because the CPU architecture devotes a large amount of chip area to storage and control units, while the computing units occupy only a small part, its capacity for large-scale parallel computing is very limited; it is better suited to logic control.

GPU
The graphics card (video card, also called a display adapter) is one of the most basic and important components of a computer. It performs digital-to-analog signal conversion and is responsible for outputting the display image: it connects to the motherboard and converts the computer's digital signals into signals the monitor can display. The graphics card also has image-processing capability of its own, which offloads work from the CPU and improves overall performance. In scientific computing, a graphics card is also called a display accelerator card.

Early graphics functionality was usually integrated on the motherboard and only handled the most basic signal output; it was not used for data processing. Graphics cards are divided into discrete and integrated cards. Generally speaking, a discrete card from the same period offers better performance and higher speed than an integrated one.

| Type | Location | Memory |
| --- | --- | --- |
| Integrated graphics | Integrated on the motherboard; cannot be replaced at will | Uses system memory |
| Discrete graphics | Plugged into an expansion slot on the motherboard (AGP, or PCIe on modern boards) as an independent device; can be replaced or upgraded at any time | Has its own dedicated video memory |

With the rapid development of graphics cards, the concept of GPU was proposed by NVIDIA in 1999. A GPU is a chip on a graphics card, just like a CPU is a chip on a motherboard. Both integrated graphics and discrete graphics have GPUs.
CPU and GPU
Before GPUs existed, essentially all work was done by the CPU. With a GPU, the two divide the labor: the CPU handles logic and serial computation, while the GPU focuses on highly threaded parallel processing (large-scale computing tasks). The GPU is not an independent computing platform; it must work together with the CPU and can be regarded as a coprocessor of the CPU. Therefore, when we talk about GPU parallel computing, we actually mean a heterogeneous computing architecture based on CPU+GPU. In this architecture, the GPU and CPU are connected by the PCIe bus and work together: the CPU side is called the host and the GPU side is called the device.
The GPU contains many more computing cores and is especially suited to data-parallel, compute-intensive tasks such as large-scale matrix operations, while the CPU has fewer cores but can execute complex control logic, making it suitable for control-intensive tasks. In addition, CPU threads are heavyweight and have high context-switching overhead, whereas GPU threads are lightweight because of the large number of cores. A CPU+GPU heterogeneous platform therefore lets the two complement each other: the CPU runs complex, logic-heavy serial code while the GPU runs data-intensive parallel code for maximum efficiency. No matter how fast GPUs develop, they only offload work from the CPU; they cannot replace it.

A GPU has many ALUs and only a small cache. Unlike on a CPU, the purpose of this cache is not to keep data around for later reuse but to serve the threads: if many threads need the same data, the cache coalesces those accesses before going to DRAM.
CUDA programming model foundation
CUDA
In 2006, NVIDIA released CUDA (Compute Unified Device Architecture), a hardware and software architecture for general-purpose computing on GPUs. It is a general-purpose parallel computing platform and programming model built on NVIDIA GPUs that provides a simple interface for GPU programming. With CUDA, applications can be built on GPU computing and use the GPU's parallel computing engine to solve complex computational problems more efficiently. CUDA treats the GPU as a device for data-parallel computing without requiring those computations to be mapped to a graphics API. The operating system's multitasking mechanism allows CUDA programs and graphics applications to access the GPU at the same time, and CUDA makes it possible to write GPU kernel programs in an intuitive way.

CUDA supports several programming languages, such as C/C++, Python, and Fortran. To run such GPU-parallel code, the CUDA toolkit must be installed. Almost without exception, the mainstream deep learning frameworks rely on CUDA for GPU acceleration. There is also cuDNN, a CUDA-based library of primitives for deep neural networks.

On the software side, CUDA consists of a CUDA library, an application programming interface (API) with its runtime, and two higher-level general math libraries, cuFFT and cuBLAS. CUDA improves the flexibility of DRAM reads and writes, making the GPU's memory mechanism more consistent with the CPU's. CUDA also provides on-chip shared memory so that data can be shared between threads; applications can use shared memory to reduce DRAM transfers and become less dependent on DRAM bandwidth.
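
As a quick illustration of how one of these libraries is used, here is a minimal sketch of a SAXPY call (y = alpha*x + y) with cuBLAS; the vector size, values, and scaling factor are made up for the example.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Allocate device memory and copy the input vectors over.
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // cuBLAS computes y = alpha * x + y entirely on the GPU.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    // Copy the result back and clean up.
    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```
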
The programming model
The CUDA architecture introduces the concepts of host and device. A CUDA program contains both host code and device code, and the host and device can communicate with each other so that data can be copied between them.

Host: the CPU together with system memory (RAM) is called the host.

Device: the GPU together with its own video memory is called the device.

DRAM (Dynamic Random Access Memory): the most common type of system memory. DRAM stores each bit in a capacitor and can hold data only for a short time, so it must be refreshed periodically; if a storage cell is not refreshed, its contents are lost (and all data is lost when power is removed).

The execution flow of a typical CUDA program is as follows:

  1. Allocate host memory and initialize the data.
  2. Allocate device memory and copy the data from the host to the device.
  3. Call the CUDA kernel to perform the computation on the device.
  4. Copy the results from device memory back to host memory.
  5. Free the device and host memory.
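
To make the flow concrete, here is a minimal vector-addition sketch that follows these five steps; the kernel name, array size, and block size are chosen arbitrarily for illustration.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Step 3 runs this kernel on the device: each thread adds one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // 1. Allocate host memory and initialize the data.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 2. Allocate device memory and copy the data host -> device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel; round the grid size up so every element is covered.
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // 4. Copy the result device -> host (this also waits for the kernel to finish).
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    // 5. Free device and host memory.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```
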
Thread hierarchy
Kernel
The most important step in the CUDA execution flow is calling a CUDA kernel function to perform parallel computation. The kernel is a central concept in CUDA. In a CUDA program, the host-side code runs on the CPU and is ordinary C/C++ code; when a data-parallel section is reached, CUDA compiles that part into a program the GPU can execute and sends it to the GPU. This GPU program is called the kernel. The device-side code runs on the GPU and is written as kernels (in .cu files). A kernel is declared with the __global__ qualifier, and its execution configuration must be specified with <<<grid, block>>> when it is called.

CUDA distinguishes host and device functions through function type qualifiers. The three main qualifiers are as follows:

  • __global__ : executed on the device, called from the host (some specific GPUs also allow calls from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is asynchronous: the host does not wait for the kernel to finish before executing the next statement.
  • __device__ : executed on the device and callable only from device code; it cannot be combined with __global__.
  • __host__ : executed on the host and callable only from the host. It is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the device and the host.
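
A small sketch of how these qualifiers look in code (the function names and sizes are invented for illustration); note the cudaDeviceSynchronize() call, which is needed if the host wants to wait for the asynchronous kernel to complete:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// __host__ __device__: compiled for both CPU and GPU callers.
__host__ __device__ float square(float x) { return x * x; }

// __device__: callable only from device code.
__device__ float plusOne(float x) { return x + 1.0f; }

// __global__: a kernel, launched from the host with <<<grid, block>>>.
__global__ void transform(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}

int main() {
    const int n = 256;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    transform<<<1, n>>>(d_data, n);   // asynchronous launch
    cudaDeviceSynchronize();          // host waits here for the kernel to finish

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```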


Grid
When a kernel is executed on the device, it actually launches many threads. All the threads launched by one kernel are collectively called a grid, and the threads in the same grid share the same global memory space. The grid is the first level of the thread hierarchy.

Thread block
A grid can be divided into many thread blocks (blocks), and each block contains many threads. Blocks execute in parallel; there is no communication between blocks and no guaranteed execution order. The number of blocks per grid dimension is limited by the hardware (65535 on early devices). The block is the second level of the thread hierarchy.

Both the grid and the block are defined as variables of type dim3. dim3 can be regarded as a struct with three unsigned integer members (x, y, z), each of which defaults to 1. Grids and blocks can therefore be flexibly defined as 1-dim, 2-dim, or 3-dim structures.

In CUDA, every thread executes the kernel function, and each thread is uniquely identified by two built-in coordinate variables, blockIdx and threadIdx: blockIdx gives the position of the thread's block within the grid, and threadIdx gives the thread's position within its block. Both are built-in variables with x, y, and z components.

Computing a thread's linear ID within its block also requires the block's shape, which is available through the built-in variable blockDim; it gives the size of the block in each dimension. For a 2-dim block of size (D_x, D_y), the thread at (x, y) has ID x + y * D_x; for a 3-dim block of size (D_x, D_y, D_z), the thread at (x, y, z) has ID x + y * D_x + z * D_x * D_y. Similarly, the built-in variable gridDim gives the size of the grid in each dimension.
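
A sketch of these index calculations for a 2-dim grid of 2-dim blocks (the matrix size and kernel name are invented for the example):

```cpp
#include <cuda_runtime.h>

// Each thread computes its own (row, col) position in a width x height matrix.
__global__ void fillMatrix(float* m, int width, int height) {
    // Position within the block plus the block's offset within the grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Linear thread ID inside the block: x + y * D_x (as in the formula above).
    int tidInBlock = threadIdx.x + threadIdx.y * blockDim.x;
    (void)tidInBlock;  // not used further; shown only to mirror the text

    if (row < height && col < width)
        m[row * width + col] = (float)(row + col);
}

int main() {
    const int width = 1000, height = 800;
    float* d_m;
    cudaMalloc(&d_m, width * height * sizeof(float));

    // A 2-dim block of 16 x 16 threads and a 2-dim grid that covers the matrix.
    dim3 block(16, 16);                       // z defaults to 1
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillMatrix<<<grid, block>>>(d_m, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_m);
    return 0;
}
```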

Each block has shared memory (Shared Memory) that all threads in the block can access; its lifetime matches that of the block.
Each thread also has its own private local memory (Local Memory). In addition, all threads can access global memory (Global Memory) as well as two read-only memory spaces: constant memory (Constant Memory) and texture memory (Texture Memory).
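
A brief sketch of how these memory spaces appear in CUDA C++ (the names and sizes are illustrative only):

```cpp
#include <cuda_runtime.h>

__constant__ float scale = 2.0f;   // constant memory: read-only in kernels, cached
__device__ float deviceCounter;    // a variable that lives in global memory

__global__ void memorySpaces(const float* in, float* out, int n) {
    __shared__ float tile[256];    // shared memory: one copy per block, lives as long as the block
    float tmp = 0.0f;              // local variable: private to this thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i]; // stage this thread's element in shared memory
        tmp = tile[threadIdx.x] * scale;
        out[i] = tmp;              // results go back to global memory
    }
}

int main() {
    const int n = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    memorySpaces<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```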

Thread
A CUDA parallel program is executed by many threads. Threads are grouped into blocks, and the threads within the same block can synchronize with each other and communicate through shared memory.
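
As a sketch of this block-level cooperation, the kernel below reverses the elements handled by one block: threads exchange data through shared memory and synchronize with __syncthreads() so every write is visible before any thread reads (the names are illustrative, and n is assumed to be a multiple of the block size):

```cpp
#include <cuda_runtime.h>

// Reverses a block-sized chunk of the array using shared memory.
__global__ void reverseBlock(float* data, int n) {
    __shared__ float buf[256];           // must match the block size used at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) buf[threadIdx.x] = data[i];
    __syncthreads();                     // wait until every thread has written its element

    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) data[i] = buf[j];         // read a value written by a different thread
}

int main() {
    const int n = 512;                   // multiple of the block size (256) for simplicity
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    reverseBlock<<<n / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```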

Warp
The warp is the scheduling unit of the GPU when it executes a program and the basic execution unit of the SM. In the current CUDA architecture, a warp is a group of 32 threads that are "woven together" and executed in lockstep. Every thread in a warp executes the same instruction on its own data, which is the so-called SIMT architecture (Single-Instruction, Multiple-Thread).
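
To illustrate warp-level execution, here is a sketch of a warp-wide sum using the shuffle intrinsic __shfl_down_sync, which exchanges values between threads of the same warp without going through shared memory (the kernel name and setup are illustrative, and a warp size of 32 is assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Each warp sums the 32 values held by its threads; lane 0 prints the result.
__global__ void warpSum() {
    int lane = threadIdx.x % warpSize;   // position of this thread inside its warp
    int val = lane + 1;                  // each thread contributes lane + 1

    // Tree reduction across the warp: offsets 16, 8, 4, 2, 1.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0)
        printf("warp %d sum = %d\n", threadIdx.x / warpSize, val);  // 1 + 2 + ... + 32 = 528
}

int main() {
    warpSum<<<1, 64>>>();   // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}
```
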
CUDA's memory model
SP (Streaming Processor): the most basic processing unit, also called a CUDA core. Specific instructions and tasks are ultimately executed on SPs, and GPU parallel computing means many SPs working at the same time.

SM (Streaming Multiprocessor): a core hardware component of the GPU. An SM contains CUDA cores, shared memory, registers, and so on, and can run hundreds of threads concurrently. All threads of a block are placed on the same SM, so the limited resources of an SM constrain the number of threads per block. On early NVIDIA architectures a block could contain at most 512 threads, while later devices support up to 1024. A kernel can be executed by many blocks of the same size simultaneously, so the total number of threads equals the number of threads per block multiplied by the number of blocks.
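
These limits can be queried at runtime with cudaGetDeviceProperties; a short sketch (the printed fields are a small subset of what the struct provides):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device name:             %s\n", prop.name);
    printf("Number of SMs:           %d\n", prop.multiProcessorCount);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```
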
A kernel actually launches many threads, but these threads are only logically parallel: the grid and thread blocks are a logical division, while the SM is the physical layer of execution, and not everything necessarily runs concurrently. The reasons are as follows:

  1. When a kernel is executed, the blocks of its grid are distributed to SMs. A given block is scheduled on only one SM, while an SM can generally hold several blocks, depending on its own resources. The different blocks of a kernel may therefore end up on multiple SMs: the grid is only the logical layer, and the SM is the physical layer that executes it.
  2. When a block is assigned to an SM, it is further divided into warps. The SM uses SIMT, whose basic execution unit is the warp; a warp contains 32 threads that execute the same instruction at the same time, yet each thread has its own instruction address counter, register state, and independent execution path. So although the threads of a warp start from the same program address, they may behave differently: when a branch is encountered, some threads may take it while others do not, and those that do not must simply wait, because the GPU requires all threads of a warp to execute the same instruction in the same cycle. Such warp divergence can degrade performance.

In summary, the SM must allocate shared memory for each block and registers for the threads of each warp, so the resources of an SM limit the number of blocks and warps that can be resident on it concurrently. Different grid and block configurations of the same kernel will therefore yield different performance. Also, because the SM's basic execution unit is a warp of 32 threads, the block size should generally be set to a multiple of 32.
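
A sketch of the divergence issue described above: the first kernel branches on the thread index parity, so both paths must be executed serially within every warp, while the second branches on the warp index, so each warp takes a single path (the kernel names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void divergent(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads of the same warp disagree on this condition, so the warp
    // executes both branches one after the other (warp divergence).
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

__global__ void warpAligned(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All 32 threads of a warp share the same warp index, so each warp
    // takes exactly one branch and no divergence occurs.
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

int main() {
    const int n = 1024;
    float* d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // Block size is a multiple of 32, matching the warp size.
    divergent<<<n / 256, 256>>>(d_out);
    warpAligned<<<n / 256, 256>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```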

Origin: blog.csdn.net/chanlp129/article/details/127680147