CUDA programming learning - introduction to CUDA memory (7)

Foreword

References:

Gao Sheng's blog
"CUDA C Programming Authoritative Guide" (Professional CUDA C Programming)
The official CUDA documentation
"CUDA Programming: Basics and Practice" by Fan Zheyong

All the code from this article is available on my GitHub and will continue to be updated.

Articles and companion videos are published simultaneously on the public account "AI Knowledge Story" and on Bilibili (station B): "go out to eat three bowls of rice".

1: Introduction to memory organization

Memory in modern computers is usually organized as a hierarchy. In this structure there are multiple types of memory, each with a different capacity and latency (latency can be understood as the time the processor spends waiting for memory data). In general, low-latency (fast) memory has a small capacity, while high-latency (slow) memory has a large capacity.

2: CUDA different types of memory

2.1 Global memory:

(1) "Global" means that all threads in a kernel function can access the data in it; this is not the same thing as a "global variable" in C++. We have already used this kind of memory: in the array-addition example, the pointers d_x, d_y and d_z all point to global memory. Since global memory is not located on the GPU chip, it has high latency and relatively low access speed.

(2) The main roles of global memory are to provide data for kernel functions and to transfer data between host and device, and between device and device. First, we use the cudaMalloc function to allocate device memory for a global memory variable. The allocated memory can then be accessed directly in kernel functions, and the data in it can be modified.
(3) Global memory is visible to all threads of the entire grid. That is, all threads of a grid can access (read or write) all data in the global memory pointed to by a device pointer passed into the kernel function.
(4) The lifetime of global memory is determined not by the kernel function but by the host. In the array-addition example, the lifetime of the global memory buffers pointed to by d_x, d_y and d_z starts when the host allocates them with cudaMalloc and ends when the host releases them with cudaFree.
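To make this concrete, here is a minimal sketch of the array-addition example the text refers to; the kernel name, block size and data values are illustrative assumptions, while d_x, d_y and d_z play the same role as above:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one element; x, y and z point to global memory.
__global__ void add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) z[n] = x[n] + y[n];
}

int main(void)
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(double);

    // Host buffers.
    double *h_x = (double *)malloc(bytes);
    double *h_y = (double *)malloc(bytes);
    double *h_z = (double *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_x[i] = 1.0; h_y[i] = 2.0; }

    // Global (device) memory: its lifetime spans cudaMalloc ... cudaFree.
    double *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_z, bytes);

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    add<<<(N + 127) / 128, 128>>>(d_x, d_y, d_z, N);

    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);
    printf("z[0] = %g\n", h_z[0]);   // expect 3

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return 0;
}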

2.2 Static global memory:

Static global memory variables are defined outside any function, as follows:

__device__ T x;    // a single variable
__device__ T y[N]; // a fixed-length array

Here, the qualifier __device__ indicates that the variable lives on the device, not on the host; T is the type of the variable; N is an integer constant.

In a kernel function, static global memory variables can be accessed directly; they do not need to be passed to the kernel as arguments. Static global memory variables cannot be accessed directly in host functions, but the cudaMemcpyToSymbol and cudaMemcpyFromSymbol functions can be used to transfer data between static global memory and host memory.
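A minimal sketch of this mechanism (the variable names and the initial values are illustrative assumptions):

#include <stdio.h>
#include <cuda_runtime.h>

// Static global memory variables, defined outside any function.
__device__ int d_a = 1;
__device__ int d_b[2];

__global__ void my_kernel(void)
{
    // Accessed directly, without being passed as kernel arguments.
    d_b[0] += d_a;
    d_b[1] += d_a;
    printf("device: d_b[0] = %d, d_b[1] = %d\n", d_b[0], d_b[1]);
}

int main(void)
{
    int h_b[2] = {10, 20};
    // Host code cannot touch d_b directly; it must go through the symbol APIs.
    cudaMemcpyToSymbol(d_b, h_b, sizeof(h_b));

    my_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaMemcpyFromSymbol(h_b, d_b, sizeof(h_b));
    printf("host:   h_b[0] = %d, h_b[1] = %d\n", h_b[0], h_b[1]); // expect 11, 21
    return 0;
}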

2.3 Constant memory:

Constant memory is global memory with a constant cache; its amount is limited, only 64 KB in total. Its visibility and lifetime are the same as those of global memory. The difference is that constant memory is read-only for kernels, not writable. Thanks to the cache, accessing constant memory is faster than accessing global memory, but the prerequisite for this higher speed is that the threads of a warp (32 consecutive threads in a thread block) read the same constant memory data.

One way to use constant memory is to define variables with __constant__ outside the kernel function and then use the CUDA runtime API function cudaMemcpyToSymbol, introduced above, to copy data from the host to the device's constant memory for use by kernel functions.
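A minimal sketch of this way of using constant memory (the name c_coeff, the kernel and the values are illustrative assumptions; cudaMemcpyToSymbol is the runtime call named above):

#include <stdio.h>
#include <cuda_runtime.h>

// Constant memory variable, defined outside any function with __constant__.
__constant__ float c_coeff[2];

__global__ void scale(const float *x, float *y, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    // All threads of a warp read the same c_coeff elements: the fast case.
    if (n < N) y[n] = c_coeff[0] * x[n] + c_coeff[1];
}

int main(void)
{
    const float h_coeff[2] = {2.0f, 1.0f};
    // The host writes constant memory via cudaMemcpyToSymbol; kernels only read it.
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));

    const int N = 8;
    float h_x[N], h_y[N];
    for (int i = 0; i < N; ++i) h_x[i] = (float)i;

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, N>>>(d_x, d_y, N);
    cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[3] = %g\n", h_y[3]);   // expect 2*3 + 1 = 7

    cudaFree(d_x); cudaFree(d_y);
    return 0;
}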

In the array-addition example, the kernel parameter const int N is a variable defined on the host side and passed to the kernel by value. In the kernel statement if (n < N), this parameter N is used by every thread. Kernel arguments passed by value like this are placed in constant memory, so every thread in the kernel knows the value of this variable, and accessing it is faster than accessing global memory. Besides single variables, a structure can also be passed to the kernel by value; it likewise uses constant memory.

2.4 Texture memory and surface memory

Texture memory and surface memory are similar to constant memory. They are also a kind of cached global memory, with the same visibility and lifetime, and they are generally read-only (surface memory can also be written). The differences are that texture memory and surface memory have a larger capacity and are used differently from constant memory.
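One common way to use texture memory is through a texture object. The following minimal sketch (the kernel name and the data are illustrative assumptions) reads an ordinary global memory buffer through the cached texture path using cudaCreateTextureObject and tex1Dfetch:

#include <stdio.h>
#include <cuda_runtime.h>

// Read-only access to a global memory buffer through a texture object.
__global__ void copy_via_texture(cudaTextureObject_t tex, float *out, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) out[n] = tex1Dfetch<float>(tex, n);
}

int main(void)
{
    const int N = 8;
    float h_x[N];
    for (int i = 0; i < N; ++i) h_x[i] = 10.0f * i;

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the existing global memory buffer as a 1D (linear) texture resource.
    cudaResourceDesc res_desc = {};
    res_desc.resType = cudaResourceTypeLinear;
    res_desc.res.linear.devPtr = d_x;
    res_desc.res.linear.desc = cudaCreateChannelDesc<float>();
    res_desc.res.linear.sizeInBytes = N * sizeof(float);

    cudaTextureDesc tex_desc = {};
    tex_desc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res_desc, &tex_desc, NULL);

    copy_via_texture<<<1, N>>>(tex, d_y, N);

    float h_y[N];
    cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[3] = %g\n", h_y[3]);   // expect 30

    cudaDestroyTextureObject(tex);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}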

2.5 Registers

Variables defined in a kernel function without any qualifiers are generally stored in registers. Arrays defined in a kernel function without any qualifiers may be stored in registers, but may also be stored in local memory. In addition, the built-in variables mentioned earlier, such as gridDim, blockDim, blockIdx, threadIdx and warpSize, are stored in special registers, and accessing them in kernel functions is efficient. Consider the following statement from the array-addition kernel:

const int n = blockDim.x * blockIdx.x + threadIdx.x;

Here n is a register variable. Registers can be both read and written. The statement above defines a register variable n and assigns (writes) the value computed on the right-hand side to it. In the later statement

z[n] = x[n] + y[n];

the value of the register variable n is used (read).
Register variables are visible only to a single thread. That is, each thread has its own copy of the variable n. Although the same variable name appears in the kernel code, the value of the register variable can be different in different threads.

2.6 Local memory

Local memory is used in almost the same way as registers. Variables defined in a kernel function without any qualifiers may end up in registers or in local memory. Variables that cannot fit into registers, and arrays whose index values cannot be determined at compile time, are likely to be placed in local memory.
This judgment is made automatically by the compiler. For the variable n in the array-addition example, we can be sure that it is in a register rather than in local memory, because the number of registers used by the kernel is far below the limit.
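As an illustration (the kernel below is an assumption, not code from the text), an array indexed with a value known only at run time may be placed in local memory. Compiling with nvcc -c --ptxas-options=-v prints the register and local memory (lmem/spill) usage of each kernel, so the compiler's decision can be checked:

// local_vs_register.cu
#include <cuda_runtime.h>

__global__ void example(const int *in, int *out, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;  // n: a register variable
    if (n >= N) return;

    int tmp[32];  // indexed below with a value unknown at compile time,
                  // so it may be placed in local memory rather than registers
    for (int i = 0; i < 32; ++i) tmp[i] = in[(n + i) % N];
    out[n] = tmp[in[n] & 31];  // runtime-dependent index
}

Compile-only check: nvcc -c --ptxas-options=-v local_vs_register.cu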

2.7 Shared memory

Shared memory is similar to registers in that it resides on the chip and has a read/write speed second only to registers; its amount is also limited. How much shared memory is available depends on the compute capability; the device-query program in Section 4 below prints the per-block and per-SM limits for your GPU.

Unlike registers, shared memory is visible to an entire thread block, and its lifetime coincides with that of the thread block. That is, each thread block has its own copy of every shared memory variable. The value of a shared memory variable can differ between thread blocks. All threads in a thread block can access that block's copy of a shared memory variable, but they cannot access the copies belonging to other thread blocks.
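A minimal sketch of shared memory use, a block-level sum reduction (the kernel is an illustrative assumption, not code from the text; it assumes the block size is a power of two):

#include <stdio.h>
#include <cuda_runtime.h>

// Each block reduces 128 of its elements into one partial sum using shared memory.
__global__ void reduce(const float *x, float *block_sums, const int N)
{
    __shared__ float s_data[128];            // one copy per thread block
    const int tid = threadIdx.x;
    const int n = blockIdx.x * blockDim.x + tid;

    s_data[tid] = (n < N) ? x[n] : 0.0f;     // load from global to shared memory
    __syncthreads();                         // the whole block must finish loading

    // Tree reduction within the block; only threads of this block see s_data.
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
    {
        if (tid < offset) s_data[tid] += s_data[tid + offset];
        __syncthreads();
    }

    if (tid == 0) block_sums[blockIdx.x] = s_data[0];
}

int main(void)
{
    const int N = 1024, block = 128, grid = N / block;
    float h_x[N];
    for (int i = 0; i < N; ++i) h_x[i] = 1.0f;

    float *d_x, *d_sums;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_sums, grid * sizeof(float));
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

    reduce<<<grid, block>>>(d_x, d_sums, N);

    float h_sums[8], sum = 0.0f;
    cudaMemcpy(h_sums, d_sums, grid * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < grid; ++i) sum += h_sums[i];
    printf("sum = %g\n", sum);               // expect 1024

    cudaFree(d_x); cudaFree(d_sums);
    return 0;
}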

2.8 L1, L2 cache

Starting from the Fermi architecture, there is an L1 cache (level one cache) at the SM level and an L2 cache (level two cache) at the device level (one device has multiple SMs). They are mainly used to cache access to global memory and local memory to reduce latency.

3: Streaming multiprocessor SM

A GPU is composed of multiple SMs. An SM consists of the following resources:
• A certain number of registers.
• A certain amount of shared memory.
• Cache of constant memory.
• Caching of texture and surface memory.
• L1 cache.
• Two (compute capability 6.0) or four (other compute capabilities) warp schedulers for rapidly switching between different thread contexts and issuing instructions for ready warps.
• Execution cores, including:
– Several cores for integer operations (INT32).
– Several cores for single-precision floating-point operations (FP32).
– Several cores for double-precision floating-point operations (FP64).
– Several special function units (SFUs) for single-precision floating-point transcendental functions.
– Several mixed-precision tensor cores (introduced with the Volta architecture, suited to the low-precision matrix computations used in machine learning, and not discussed further here).

SM occupancy

Because the various computing resources in an SM are limited, the number of threads resident on an SM may in some cases fail to reach the theoretical maximum. In that case we say the occupancy of the SM is below 100%. Achieving 100% occupancy is neither a necessary nor a sufficient condition for high performance, but in general one should try to keep SM occupancy above some value, such as 25%, to obtain good performance.
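The runtime API can estimate the occupancy a kernel will achieve. Below is a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel and the block size of 128 are illustrative assumptions:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) z[n] = x[n] + y[n];
}

int main(void)
{
    const int block_size = 128;   // illustrative choice
    int max_blocks_per_sm = 0;
    // Ask the runtime how many blocks of this kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, add, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const double occupancy = (double)(max_blocks_per_sm * block_size)
                           / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks per SM: %d\n", max_blocks_per_sm);
    printf("Estimated occupancy:    %.0f%%\n", 100.0 * occupancy);
    return 0;
}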

4: Query device with CUDA runtime API function

#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                       \
do                                                        \
{                                                         \
    const cudaError_t error_code = call;                  \
    if (error_code != cudaSuccess)                        \
    {                                                     \
        printf("CUDA Error:\n");                          \
        printf("    File:       %s\n", __FILE__);         \
        printf("    Line:       %d\n", __LINE__);         \
        printf("    Error code: %d\n", error_code);       \
        printf("    Error text: %s\n",                    \
            cudaGetErrorString(error_code));              \
        exit(1);                                          \
    }                                                     \
} while (0)


int main(int argc, char* argv[])
{
    int device_id = 0;
    if (argc > 1) device_id = atoi(argv[1]);
    CHECK(cudaSetDevice(device_id));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device_id));

    printf("Device id:                                 %d\n",
        device_id);
    printf("Device name:                               %s\n",
        prop.name);
    printf("Compute capability:                        %d.%d\n",
        prop.major, prop.minor);
    printf("Amount of global memory:                   %g GB\n",
        prop.totalGlobalMem / (1024.0 * 1024 * 1024));
    printf("Amount of constant memory:                 %g KB\n",
        prop.totalConstMem / 1024.0);
    printf("Maximum grid size:                         %d %d %d\n",
        prop.maxGridSize[0],
        prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Maximum block size:                        %d %d %d\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1],
        prop.maxThreadsDim[2]);
    printf("Number of SMs:                             %d\n",
        prop.multiProcessorCount);
    printf("Maximum amount of shared memory per block: %g KB\n",
        prop.sharedMemPerBlock / 1024.0);
    printf("Maximum amount of shared memory per SM:    %g KB\n",
        prop.sharedMemPerMultiprocessor / 1024.0);
    printf("Maximum number of registers per block:     %d K\n",
        prop.regsPerBlock / 1024);
    printf("Maximum number of registers per SM:        %d K\n",
        prop.regsPerMultiprocessor / 1024);
    printf("Maximum number of threads per block:       %d\n",
        prop.maxThreadsPerBlock);
    printf("Maximum number of threads per SM:          %d\n",
        prop.maxThreadsPerMultiProcessor);

    return 0;
}
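The program can be compiled and run, for example, as follows (the source file name query.cu and the device index 0 are assumptions); a different command-line argument selects another GPU:

nvcc -o query query.cu
./query 0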

