Introduction to GPU programming memory types

Memory classification

GPU storage space is mainly divided into two types: programmable and non-programmable.

In the memory hierarchy, caches are usually not programmable (for example, the L1 and L2 caches).

The programmable memory spaces of the GPU are: global, local, texture, constant, shared, and register memory.

They can be ordered by speed and size, as shown in the figure below:

[Figure: GPU memory types ordered by speed and size]

Memory function

The layout of each memory space is shown in the following figure:
[Figure: layout of the GPU memory spaces]
As the figure shows, each thread block of the GPU works directly with only three types of memory: registers, shared memory, and local memory. The remaining constant, texture, and global memory reside outside the thread block.

Although it is tempting to simply pick the fastest memory, the other two characteristics that determine which type of memory to choose are its scope and lifetime.

Registers

Data stored in register memory is visible only to the thread that created it and persists for the lifetime of that thread. Automatic variables declared in a kernel without any special qualifier are placed in registers. An array is also placed in registers when it is of a built-in type and its indices are constants that can be determined at compile time.
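A minimal sketch (the kernel and variable names are illustrative) of variables that typically end up in registers:

```cuda
__global__ void registerDemo(float *out)
{
    // Automatic scalars with no qualifier are private to each thread
    // and normally live in registers.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // This tiny array is indexed only with compile-time constants,
    // so the compiler can keep it in registers as well.
    float coeff[2] = {0.5f, 2.0f};

    out[tid] = coeff[0] * tid + coeff[1];
}
```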

Registers are scarce compared with the other types of memory. The maximum number of registers available per thread block can be read from the regsPerBlock device attribute. **On Fermi, each thread is limited to a maximum of 63 registers, while Kepler allows 255.** When designing a kernel, using fewer registers allows more blocks to reside on an SM, which increases occupancy and improves performance.
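For example, regsPerBlock can be queried through the runtime API; a minimal sketch, assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // regsPerBlock is the number of 32-bit registers available per block.
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```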

In most cases, accessing a register consumes zero extra clock cycles per instruction. However, read-after-write dependencies introduce a latency of approximately 24 clock cycles. For newer CUDA devices with 32 cores per SM, as many as 768 active threads may be required to hide this latency completely.

Local Memory

Local memory is not a physical type of memory but an abstraction over global memory. Its scope is local to the thread, but it resides off-chip, which makes it as expensive to access as global memory. Local memory is used only for automatic variables: the compiler places a variable there when it determines that there is not enough register space to hold it. Automatic variables that are large structures or arrays are also typically placed in local memory.

Local memory is used in the following situations (see the sketch after this list):
1. When all register memory is occupied, local memory is used instead (this is called register spilling).
2. Local arrays whose indices cannot be determined at compile time.
3. Large structures and arrays.
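A minimal sketch (names are illustrative) of case 2, where runtime indexing forces an array into local memory; compiling with `nvcc -Xptxas -v` reports the local memory ("lmem") usage:

```cuda
__global__ void localDemo(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // The index comes from runtime data, so accesses cannot be resolved
    // at compile time; the compiler places this array in local memory.
    float scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = i * 1.0f + tid;

    out[tid] = scratch[idx[tid] & 31];
}
```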

Shared Memory

Variables placed in shared memory must be qualified with __shared__. Shared memory resides on-chip, so it has higher bandwidth and lower latency than local memory and global memory.

Unlike registers, shared memory is declared in the kernel, but its lifetime spans the entire block rather than a single thread. When a block finishes executing, the resources it owns are released and reallocated to other blocks.

Data stored in shared memory is visible to all threads in the block and lasts for the duration of the block. This is very valuable because it allows the threads of a block to communicate and share data.
Data stored in global memory, by contrast, is visible to all threads in the application (including the host) and lasts for as long as the host keeps it allocated.
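A minimal sketch of block-level cooperation through shared memory (assumes a launch with 128 threads per block; names are illustrative):

```cuda
#define BLOCK 128

__global__ void reverseInBlock(int *data)
{
    // One copy per block, visible to all of the block's threads and
    // alive for the lifetime of the block.
    __shared__ int tile[BLOCK];

    int t   = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + t;

    tile[t] = data[gid];
    __syncthreads();  // make sure every thread has written its element

    // Threads exchange data: each one reads an element written by another.
    data[gid] = tile[BLOCK - 1 - t];
}
```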

While shared memory enables this cooperation, it also brings the problem of access conflicts (bank conflicts). To ease this bottleneck, common GPU architectures group threads into warps of 32 according to the warpSize attribute and assign consecutive shared memory words to consecutive banks, so that consecutive threads can be served in parallel.
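A common way to sidestep such conflicts is to pad shared arrays; a sketch, assuming 32x32 thread blocks (names are illustrative):

```cuda
__global__ void transposeTile(const float *in, float *out)
{
    // 33 columns instead of 32: the extra padding word shifts each row
    // by one bank, so column-wise reads no longer hit a single bank.
    __shared__ float tile[32][33];

    int x = threadIdx.x, y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];   // row-wise write, naturally conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];  // column-wise read, conflict-free thanks to padding
}
```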

Constant Memory

Constant Memory and Texture Memory are useful only for very specific kinds of applications. Constant Memory is used for data that is read but never modified during kernel execution. Using Constant Memory instead of Global Memory can reduce the required memory bandwidth, but this performance gain is realized only when threads read the same location. Constant Memory is initialized with the following runtime function:

```cuda
cudaError_t cudaMemcpyToSymbol(const void* symbol, const void* src, size_t count);
```

Constant Memory performs best when all threads in a warp read from the same address, for example when reading the coefficients of a formula. If the threads read from different addresses and each address is read only once, Constant Memory is not a good choice, because a single read from Constant Memory is broadcast to all threads in the warp.
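A minimal sketch of declaring, initializing, and reading Constant Memory (the symbol c_coeff and both functions are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float c_coeff[4];  // read-only during kernel execution

__global__ void polynomial(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = x[i];
    // Every thread of the warp reads the same c_coeff addresses,
    // so each read is broadcast to the whole warp.
    y[i] = ((c_coeff[3] * v + c_coeff[2]) * v + c_coeff[1]) * v + c_coeff[0];
}

void setCoefficients(const float h_coeff[4])
{
    // Host-side initialization before the kernel launch.
    cudaMemcpyToSymbol(c_coeff, h_coeff, 4 * sizeof(float));
}
```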

Texture Memory

Similar to constant memory, texture memory is another type of read-only memory on the device. When reads exhibit spatial locality, that is, when adjacent threads read physically adjacent locations, using texture memory instead of global memory can reduce memory traffic and improve performance.
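A minimal sketch using the texture object API to read a linear device buffer through the texture path (function and variable names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void copyViaTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);  // read-only fetch via the texture path
}

void launchThroughTexture(const float *d_in, float *d_out, int n)
{
    // Describe the existing device buffer as a linear texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = const_cast<float *>(d_in);
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    copyViaTexture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDestroyTextureObject(tex);
}
```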

Global Memory

Global Memory has the largest space and the highest latency, and it is the most commonly used memory on the GPU. Any SM can access global memory data throughout the lifetime of the application. Global variables can be declared statically, using the __device__ qualifier, or allocated dynamically: dynamic allocation uses the familiar cudaMalloc, and cudaFree releases it. Global memory resides in device memory and is accessed through 32-byte, 64-byte, or 128-byte memory transactions.
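A minimal sketch combining a statically declared __device__ variable with a dynamic cudaMalloc/cudaFree allocation (names are illustrative):

```cuda
#include <cuda_runtime.h>

__device__ float d_scale = 2.0f;  // statically declared global variable

__global__ void scaleAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= d_scale;  // any thread on any SM can read it
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    cudaMalloc(&d_data, n * sizeof(float));        // dynamic global allocation
    // ... fill d_data with cudaMemcpy, then launch ...
    scaleAll<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);                              // release when done
    return 0;
}
```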

Generally speaking, the more transactions a request requires, the more potentially unneeded data is transferred, reducing the effective transfer rate.

Design ideas

Most CUDA programs mature gradually: start with global memory, and after that initial version works, consider bringing in other types of memory, such as zero-copy memory, shared memory, and constant memory, and finally registers. To optimize a program, consider the use of faster memory from the very beginning of the design, and know exactly where and how it improves performance. In addition, do not only think about accessing global memory more efficiently; always look for ways to reduce the number of global memory accesses, especially when data is reused.
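Of the memory types listed above, zero-copy memory has not appeared yet; a minimal sketch of how it can be set up, assuming a device that supports mapped pinned memory:

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_data = nullptr, *d_alias = nullptr;

    // Pinned host memory mapped into the device address space: kernels
    // can dereference d_alias directly, with no explicit cudaMemcpy.
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_data, 0);

    // ... launch kernels that read or write through d_alias ...

    cudaFreeHost(h_data);
    return 0;
}
```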

Source: blog.csdn.net/daijingxin/article/details/109352757