CUDA programming notes (5)


foreword

CUDA's memory organization is key to getting the most performance out of the GPU, so it is important to use the different kinds of device memory sensibly.

CUDA memory organization

As shown in the table:

memory type                | physical location | access permission    | visible range            | life cycle
global memory              | off-chip          | read and write       | all threads and the host | allocated and freed by the host
constant memory            | off-chip          | read only            | all threads and the host | allocated and freed by the host
texture and surface memory | off-chip          | generally read only  | all threads and the host | allocated and freed by the host
register memory            | on-chip           | read and write       | a single thread          | the owning thread
local memory               | off-chip          | read and write       | a single thread          | the owning thread
shared memory              | on-chip           | read and write       | a single thread block    | the owning thread block

global memory

Definition: global memory here refers to memory whose data can be accessed by all threads of a kernel.
Role: it holds the data used by kernel functions and serves as the medium for data transfer between host and device, and between device and device.
It is not on the GPU chip, so it has high latency and a relatively low access speed when supplying data to kernels.
Its capacity is essentially that of the GPU's device (video) memory.
It is readable and writable.
Dynamic global memory variables: the d_x, d_y, and d_z defined in the earlier CUDA array-addition program are allocated dynamically. First, cudaMalloc() allocates the device memory and cudaMemcpy() transfers the host data to the device; the kernel then accesses the allocated memory and modifies the values in it.
Static global memory variables: cudaMemcpyToSymbol() transfers data from host to device and cudaMemcpyFromSymbol() transfers data from device to host. Inside a kernel, static global memory variables are accessed directly; they do not need to be passed to the kernel as arguments.
They are defined outside of any function with:

__device__ T x;     // a single variable
__device__ T y[N];  // a fixed-length array

Example:
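A minimal sketch of a program using static global memory variables (the names d_x, d_y, and my_kernel here are illustrative, not taken from the original post):

#include <stdio.h>

__device__ int d_x = 1;    // static global memory: a single variable
__device__ int d_y[2];     // static global memory: a fixed-length array

__global__ void my_kernel(void)
{
    // Static global memory variables are accessed directly inside the kernel;
    // they are not passed as arguments.
    d_y[0] += d_x;
    d_y[1] += d_x;
}

int main(void)
{
    int h_y[2] = {10, 20};
    cudaMemcpyToSymbol(d_y, h_y, sizeof(int) * 2);    // host -> static device memory
    my_kernel<<<1, 1>>>();
    cudaMemcpyFromSymbol(h_y, d_y, sizeof(int) * 2);  // static device memory -> host
    printf("h_y[0] = %d, h_y[1] = %d\n", h_y[0], h_y[1]);  // prints 11 and 21
    return 0;
}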

constant memory

Definition: constant memory is global memory that has a dedicated constant cache; its capacity is limited to 64 KB.
Role: the same as global memory.
It can only be read, not written, by kernels, and because of the cache, access to constant memory is faster than access to ordinary global memory.
Use: the const int N in the CUDA array-addition program, passed to the kernel by value, is a variable that uses constant memory.
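Constant memory can also be used explicitly through the __constant__ qualifier together with cudaMemcpyToSymbol(); a minimal sketch (the names c_coeff and scale are illustrative, not from the original program):

__constant__ double c_coeff[2];   // lives in constant memory; read-only inside kernels

__global__ void scale(const double *x, double *y, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) y[n] = c_coeff[0] * x[n] + c_coeff[1];  // reads go through the constant cache
}

// On the host, before launching the kernel:
//     double h_coeff[2] = {2.0, 1.0};
//     cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));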

Texture memory and surface memory

Definition: similar to constant memory, texture memory and surface memory are generally read-only; surface memory can also be written.
For GPUs with compute capability of at least 3.5, reading read-only global memory data through the read-only data cache with the __ldg() function gives much the same speedup as texture memory while keeping the code simpler.
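A sketch of how __ldg() might be used in an array-addition kernel (the kernel name add_ldg is illustrative; double-precision elements are assumed):

__global__ void add_ldg(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N)
    {
        // Read-only loads routed through the read-only data cache.
        z[n] = __ldg(&x[n]) + __ldg(&y[n]);
    }
}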

register

Definition: variables declared without any qualifiers in a kernel are usually stored in registers (though they may end up in local memory).
Registers can be read and written. Register memory is on-chip and has the highest access speed of all memories, but the number of registers is limited.
Use: in the CUDA array-addition program, the statement int n = blockDim.x * blockIdx.x + threadIdx.x; defines n as a register variable and assigns the value computed on the right-hand side to it; the statement z[n] = x[n] + y[n]; in the kernel then uses this register variable to index the arrays.
The life cycle of a register variable matches that of the owning thread: from the point where it is defined to the end of the thread. A register variable is visible only to a single thread; the variable has the same name in different threads, but each thread holds its own copy with its own value.
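Putting the two statements together, a sketch of the array-addition kernel referred to above (assuming double-precision elements; the earlier notes may use a different element type):

__global__ void add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;  // n lives in a register
    if (n < N)
    {
        z[n] = x[n] + y[n];  // each thread handles one element, indexed by its register variable n
    }
}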

local memory

Definition: in usage, almost the same as registers.
Variables that cannot fit in registers may be placed in local memory; this decision is made automatically by the compiler.
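A hypothetical example of a variable likely to be placed in local memory: a per-thread array, especially one indexed with a value known only at run time (the kernel name local_demo is illustrative):

__global__ void local_demo(const int *idx, double *out, const int N)
{
    // A per-thread array; the compiler may place it in local memory rather than
    // registers, especially because it is indexed dynamically below.
    double tmp[32];
    for (int i = 0; i < 32; ++i) tmp[i] = (double)i;

    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) out[n] = tmp[idx[n] & 31];
}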

Shared memory

Definition: similar to registers (on-chip and fast), but shared memory is visible to the entire thread block.
Role: to reduce the number of accesses to global memory, or to improve the access pattern to global memory.
Its life cycle is the same as that of the thread block it belongs to.
Use: to define a shared memory variable in a kernel, add the __shared__ qualifier to the definition statement, for example:

__shared__ real s_y[128];
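A minimal sketch of how such a shared memory array might be used in a block-wide reduction (assuming real is double and a block size of 128; the kernel name reduce_shared is illustrative):

__global__ void reduce_shared(const double *d_x, double *d_y, const int N)
{
    const int tid = threadIdx.x;
    const int n = blockIdx.x * blockDim.x + tid;

    __shared__ double s_y[128];          // visible to the whole thread block
    s_y[tid] = (n < N) ? d_x[n] : 0.0;   // copy from global memory into shared memory
    __syncthreads();                     // wait until every thread in the block has written

    // Binary reduction within the block: each pass halves the number of active threads.
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
    {
        if (tid < offset) s_y[tid] += s_y[tid + offset];
        __syncthreads();
    }

    if (tid == 0) d_y[blockIdx.x] = s_y[0];  // one partial sum per block
}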

L1 and L2 cache

Starting from the Fermi architecture, there is an SM-level L1 cache (first-level cache) and a device-level L2 cache (second-level cache).
They are mainly used to cache accesses to global memory and local memory so as to reduce latency. The L1 and L2 caches are not programmable (at most, the user can give the compiler some hints about how to use them).

Composition of SM

(1) A certain number of registers
(2) A certain amount of shared memory
(3) A cache for constant memory
(4) A cache for texture and surface memory
(5) An L1 cache
(6) Warp schedulers, used to switch rapidly between different thread contexts and to issue execution instructions for warps that are ready
(7) Execution cores: several integer arithmetic cores, several single-precision floating-point cores, several double-precision floating-point cores, several special function units for single-precision floating-point transcendental functions, and several mixed-precision tensor cores

Querying the device with API functions

The following program uses CUDA runtime API functions to query some of the device's specifications.

#include "error.cuh"
#include <stdio.h>

int main(int argc, char *argv[])
{
    
    
    // Set the index of the device to be queried.
    int device_id = 0; 
    if (argc > 1) device_id = atoi(argv[1]);
    // cudaSetDevice() initializes the specified device.
    CHECK(cudaSetDevice(device_id));
    // Define a structure variable that will hold the device properties.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device_id));  // Query the properties of device device_id and store them in prop.

    printf("Device id:                                 %d\n", 
        device_id);
    printf("Device name:                               %s\n",
        prop.name);
    printf("Compute capability:                        %d.%d\n",
        prop.major, prop.minor);
    printf("Amount of global memory:                   %g GB\n",
        prop.totalGlobalMem / (1024.0 * 1024 * 1024));
    printf("Amount of constant memory:                 %g KB\n",
        prop.totalConstMem  / 1024.0);
    printf("Maximum grid size:                         %d %d %d\n",
        prop.maxGridSize[0], 
        prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Maximum block size:                        %d %d %d\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1], 
        prop.maxThreadsDim[2]);
    printf("Number of SMs:                             %d\n",
        prop.multiProcessorCount);
    printf("Maximum amount of shared memory per block: %g KB\n",
        prop.sharedMemPerBlock / 1024.0);
    printf("Maximum amount of shared memory per SM:    %g KB\n",
        prop.sharedMemPerMultiprocessor / 1024.0);
    printf("Maximum number of registers per block:     %d K\n",
        prop.regsPerBlock / 1024);
    printf("Maximum number of registers per SM:        %d K\n",
        prop.regsPerMultiprocessor / 1024);
    printf("Maximum number of threads per block:       %d\n",
        prop.maxThreadsPerBlock);
    printf("Maximum number of threads per SM:          %d\n",
        prop.maxThreadsPerMultiProcessor);

    return 0;
}
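The included error.cuh header was introduced in an earlier note of this series; a minimal sketch of the kind of CHECK macro such a header typically defines (not necessarily identical to the original):

#pragma once
#include <stdio.h>
#include <stdlib.h>

// Wrap a CUDA runtime API call; print diagnostic information and abort on failure.
#define CHECK(call)                                                          \
do                                                                           \
{                                                                            \
    const cudaError_t error_code = call;                                     \
    if (error_code != cudaSuccess)                                           \
    {                                                                        \
        printf("CUDA Error:\n");                                             \
        printf("    File:       %s\n", __FILE__);                            \
        printf("    Line:       %d\n", __LINE__);                            \
        printf("    Error code: %d\n", error_code);                          \
        printf("    Error text: %s\n", cudaGetErrorString(error_code));      \
        exit(1);                                                             \
    }                                                                        \
} while (0)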

Running the program queries the device's settings; from its output you can see the GPU's memory organization and the maximum capacity of each kind of memory.

Summary

This note introduced the CUDA memory organization (global, constant, texture/surface, register, local, and shared memory, plus the L1/L2 caches), the composition of an SM, and how to query device specifications with the runtime API.
References:
If any of the blog content infringes a copyright, please contact me and it will be deleted promptly!
CUDA Programming: Basics and Practice
https://docs.nvidia.com/cuda/
https://docs.nvidia.com/cuda/cuda-runtime-api
https://github.com/brucefan1983/CUDA-Programming

Origin: blog.csdn.net/weixin_41311686/article/details/128743188