CUDA 10.0 Official Documentation Translation and Notes: Programming Model

Background

In the previous article, Introduction to the Translation and Learning of the CUDA 10.0 Official Documentation, I translated the Introduction of the CUDA 10.0 documentation. This article continues from there and translates Chapter 2, Programming Model.

This chapter introduces the main concepts behind the CUDA programming model and outlines how they are expressed in C. For a detailed description of the CUDA C interface, see the next chapter, Programming Interface. The vector addition source code used in this chapter and the next can be found in the vectorAdd sample of the CUDA samples.

Kernel function

CUDA C extends the C language by allowing programmers to define C functions called kernel functions (kernels). Unlike ordinary C functions, a kernel, when called, is executed N times in parallel by N different CUDA threads (each thread executes it once).

A kernel function is declared with the __global__ identifier. When calling it, the number of threads that will execute the kernel must be specified inside the <<<...>>> execution configuration. Each thread that executes the kernel is assigned a unique thread id, which can be accessed within the kernel through the built-in threadIdx variable. As an illustration, the following example adds two vectors A and B of size N and stores the result in C.

Here, each of the N threads (the 1 in the execution configuration denotes a single thread block) executes the kernel VecAdd() once, performing one pairwise addition.

__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    VecAdd<<<1, N>>>(A, B, C); // execute the kernel with N threads
    ...
}

Thread hierarchy

For convenience, threadIdx is a three-component vector, so threads can be identified using a one-, two-, or three-dimensional thread index, forming one-, two-, or three-dimensional groups of threads called thread blocks. This provides a natural way to perform computation across the elements of a domain such as a vector, matrix, or volume.

The thread index and thread id relate to each other in a straightforward way: for a one-dimensional block they are equal; for a two-dimensional block of size (Dx, Dy), the thread id of a thread with index (x, y) is x + y * Dx; for a three-dimensional block of size (Dx, Dy, Dz), the thread id of a thread with index (x, y, z) is x + y * Dx + z * Dx * Dy. The following example adds two N*N matrices A and B and stores the result in C.

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() {
    ...
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
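
As a minimal sketch of the index-to-id formula above (not part of the original sample), the linear thread id inside a three-dimensional block can be computed from the built-in threadIdx and blockDim variables; the kernel name and output buffer here are illustrative only.

// Sketch: linear thread id within a 3D block,
// using the formula x + y*Dx + z*Dx*Dy from the text.
__global__ void LinearId(int* ids) {
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    ids[tid] = tid; // each thread writes its own linear id
}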

Because the threads of a block are expected to reside on the same processor core and must share its limited memory resources, the number of threads per block is limited: on current GPUs a thread block may contain up to 1024 threads. However, a kernel can be executed by multiple thread blocks of the same shape, so the total number of threads equals the number of threads per block times the number of blocks. Thread blocks are organized into a one-, two-, or three-dimensional grid (as shown in the figure below). The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors, but it can greatly exceed either of these.

The number of threads per block and the number of blocks per grid are specified by the two parameters inside <<<...>>>, which can be of type int or dim3, as in the two examples above. Each block in the grid can be identified by a one-, two-, or three-dimensional index, available inside the kernel through the built-in blockIdx variable; the dimensions of the thread block are available through the built-in blockDim variable. Extending the previous MatAdd() example to handle multiple blocks, the code becomes:

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < N && j < N) {
        C[i][j] = A[i][j] + B[i][j];
    }
}

int main() {
    ...
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

A thread block size of 16 * 16 (256 threads), although arbitrary here, is a common choice; the grid is then created with enough blocks so that there is one thread per matrix element. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that is not required. Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores (as shown in the figure below), enabling programmers to write code that scales with the number of cores.
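
When N is not a multiple of the block size, a common pattern (a sketch, not from the original text) is to round the grid dimensions up and rely on the bounds check inside the kernel:

// Sketch: round the grid size up so every element is covered
// even when N is not divisible by the block size.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);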

Threads within a block can cooperate by sharing data through a memory region called shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, synchronization points can be specified inside the kernel by calling the __syncthreads() function, which acts as a barrier: all threads in the block must wait there before any is allowed to proceed. An example of using shared memory is given in the Shared Memory section. In addition to __syncthreads(), the Cooperative Groups section of the official manual provides a rich set of thread-synchronization primitives.
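
As an illustration only (the official shared-memory example appears later in the manual), the following hypothetical kernel sums 256 values per block in shared memory, using __syncthreads() as the barrier between steps; it assumes a one-dimensional block of exactly 256 threads and an input array with one element per launched thread.

// Sketch: block-level sum using shared memory and __syncthreads().
__global__ void BlockSum(const float* in, float* blockSums) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // wait until every thread has stored its value

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                 // barrier before the next reduction step
    }
    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];  // thread 0 writes the block's result
}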

For efficient cooperation, shared memory is expected to be a low-latency memory located near each processor core (much like an L1 cache), and __syncthreads() is expected to be lightweight.

Memory hierarchy

CUDA threads may access data from multiple memory spaces during their execution, as shown in the following figure.

Each thread has its own private local memory. Each thread block has shared memory that is visible to all threads of the block and has the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: constant memory and texture memory. The global, constant, and texture memory spaces are optimized for different memory usages (see the Device Memory Accesses section); texture memory also offers different addressing modes as well as data filtering for some specific data formats (see the Texture and Surface Memory section). Finally, the global, constant, and texture memory spaces are persistent across kernel launches within the same application.
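
A minimal sketch (not from the original document) of how these spaces appear in code, using the standard CUDA variable qualifiers; the names and sizes are illustrative, and the kernel assumes at most 256 threads per block and an output buffer with one element per launched thread.

// Sketch: where variables live in the memory hierarchy.
__device__   float globalData[256];   // global memory: visible to all threads
__constant__ float coeffs[16];        // constant memory: read-only inside kernels

__global__ void MemorySpaces(float* out) {
    __shared__ float tile[256];       // shared memory: one copy per thread block
    float local;                      // per-thread variable (register or local memory)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = globalData[i % 256] * coeffs[i % 16];
    __syncthreads();
    local = tile[threadIdx.x];
    out[i] = local;                   // result written back to global memory
}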

Heterogeneous programming

The CUDA programming model assumes that CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program, as shown in the figure below. In this scenario, kernels execute on the GPU while the rest of the C program executes on the CPU.

The CUDA programming model also assumes that the host and the device maintain their own separate memory spaces, referred to as host memory and device memory, respectively. A program therefore manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime, which handle device memory allocation and deallocation as well as data transfer between host and device memory. Unified Memory bridges the host and device memory spaces with managed memory: a single, coherent memory image accessible to all CPUs and GPUs in the system through a common address space. This capability enables oversubscription of device memory and can greatly simplify porting applications by eliminating the need to explicitly mirror data between host and device.
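
As a hedged sketch of the typical flow described above (not the official sample; buffer names and sizes are made up, and error checking is omitted):

// Sketch: explicit host/device memory management around a kernel launch.
int n = 1 << 20;
size_t bytes = n * sizeof(float);
float* hA = (float*)malloc(bytes);                    // host memory
float* dA;
cudaMalloc((void**)&dA, bytes);                       // device memory
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch a kernel that reads/writes dA ...
cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(dA);
free(hA);

// With Unified (managed) memory the explicit copies disappear:
float* m;
cudaMallocManaged((void**)&m, bytes);                 // accessible from both CPU and GPU
// ... launch a kernel that uses m, then cudaDeviceSynchronize() before reading m on the host ...
cudaFree(m);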

Compute capability

The compute capability of a device is represented by a version number, sometimes also called its SM version. This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU. It consists of a major revision number X and a minor revision number Y, written as X.Y.

Devices with the same major revision number share the same core architecture: the major revision number is 7 for devices based on the Volta architecture, 6 for Pascal, 5 for Maxwell, and 3, 2, and 1 for Kepler, Fermi, and Tesla, respectively. The minor revision number corresponds to incremental improvements to the core architecture, possibly including new features. Turing, the architecture of devices with compute capability 7.5, is an incremental update based on Volta. The Tesla and Fermi architectures are no longer supported since CUDA 7.0 and CUDA 9.0, respectively.
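
A short sketch (not in the original chapter) of how an application can read the compute capability at runtime through the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // query properties of device 0
    printf("Compute capability: %d.%d (%s)\n",
           prop.major, prop.minor, prop.name);   // e.g. 7.5 for a Turing GPU
    return 0;
}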

Conclusion

This chapter introduced kernel functions, the thread and memory hierarchies on the GPU, heterogeneous programming, and compute capability.

A kernel function is called from the CPU and executed on the GPU. The number of threads that execute the kernel (the grid and block dimensions) is configured inside <<<...>>>; the amount of dynamically allocated shared memory and the stream, also specified there, are introduced in later chapters. Inside the kernel we can obtain the current thread index through built-in variables and compute a global index for accessing global data.

The thread hierarchy consists of grids, thread blocks, and threads; each thread block resides on a single streaming multiprocessor, and the configurable sizes are limited by the multiprocessor's hardware resources.

The memory hierarchy includes global memory, texture memory, constant memory, shared memory, and local memory. The first three are device-wide and accessible to the application at any time: global memory can be read and written, texture memory is read through texture objects, and constant memory is often used to cache frequently accessed constants to reduce the number of memory requests. Shared memory is on-chip, belongs to a thread block, can be allocated statically inside the kernel or dynamically in the kernel's execution configuration (<<<...>>>), and has the same lifetime as the block. Local memory is private to each thread and has the same lifetime as the thread (despite its name, it physically resides in device memory, so registers remain the fastest per-thread storage).
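
A sketch of the two shared-memory allocation styles mentioned above (the kernel names and sizes are illustrative only):

// Static allocation: size fixed at compile time (assumes <= 256 threads per block).
__global__ void StaticShared() {
    __shared__ float buf[256];
    buf[threadIdx.x] = 0.0f;
}

// Dynamic allocation: size supplied as the third <<<...>>> parameter at launch.
__global__ void DynamicShared() {
    extern __shared__ float dynBuf[];
    dynBuf[threadIdx.x] = 0.0f;
}

// Example launch: 1 block of 256 threads, 256 floats of dynamic shared memory.
// DynamicShared<<<1, 256, 256 * sizeof(float)>>>();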

The heterogeneous-programming part explains the relationship between host code and device code, and between host memory and device memory (video memory). Because host memory and device memory are separate and not directly visible to each other, the input data generally has to be copied from host memory to device memory before the kernel is launched, and the results copied back afterwards. Copies can be synchronous or asynchronous; asynchronous copies require a synchronization call afterwards. With managed (unified) memory, however, the data does not need to be migrated explicitly.
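
A sketch of the synchronous versus asynchronous copy patterns mentioned above (error checking omitted; the pointers hA and dA are assumed to be allocated elsewhere, and true overlap with host work requires page-locked host memory):

// Synchronous copy: returns only after the transfer has completed.
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);

// Asynchronous copy on a stream: returns immediately, so the host must
// synchronize before it may safely reuse or read the buffers.
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, stream);
// ... launch kernels on the same stream ...
cudaStreamSynchronize(stream);   // or cudaDeviceSynchronize()
cudaStreamDestroy(stream);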

Compute capability is the version identifier of the GPU. Different major versions differ in hardware features and resource limits; differences between minor versions are smaller, and the core architecture stays the same. It is also worth remembering which architecture name corresponds to each major version (for example, Volta corresponds to 7.x devices) so that other documents can be cross-referenced easily.

Starting from the next chapter, we will cover the Programming Interface in detail; it is the core of this document and contains considerably more material.
