Chapter 5 CUDA Memory

"Massively Parallel Processor Programming Practice" learning, other chapters focus on the column CUDA C

When a thread processes data, the data must first be copied from the host into the device's global memory; each thread then uses its block ID and thread ID to locate the portion of the data it is responsible for, so that the computation can proceed in parallel. While a kernel runs, it interacts heavily with global memory, which is implemented in Dynamic Random Access Memory (DRAM), so access latency can be very high. CUDA therefore provides several other kinds of memory that let a kernel avoid many global-memory accesses, improving both memory-access speed and overall kernel performance.
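A minimal sketch of this host-to-device flow, assuming a simple element-wise kernel (the names vecAdd, hostVecAdd, and the 256-thread launch configuration are illustrative, not from the book's text):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    // Block ID and thread ID together locate this thread's element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void hostVecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    // 1. Copy the input from the host into device global memory.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // 2. Launch: every A[i]/B[i]/C[i] access inside the kernel goes to global memory (DRAM).
    vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    // 3. Copy the result back to the host.
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```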

5.1 Importance of Memory Access Efficiency

The CGMA (Compute to Global Memory Access) ratio is the number of floating-point operations performed for each access to global memory within a region of a CUDA program. This ratio reflects the attainable performance of a CUDA kernel more accurately than the raw operation count.
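The missing figure presumably showed the simple (untiled) matrix multiplication kernel; a reconstruction along the lines of the book's example, with the book's usual names (d_M, d_N, d_P, Width):

```cuda
__global__ void matMulKernel(const float *d_M, const float *d_N,
                             float *d_P, int Width) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0.0f;
        for (int k = 0; k < Width; ++k) {
            // Per iteration: two global memory accesses (one element of M,
            // one of N) and two floating-point operations (multiply, add).
            Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
        }
        d_P[Row * Width + Col] = Pvalue;
    }
}
```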

Each iteration of the inner loop performs two global memory accesses and two floating-point operations, so CGMA = 2 : 2 = 1.0.

If global-memory bandwidth is 200 GB/s and a single-precision float occupies 4 bytes, then at most 200 GB/s ÷ 4 B = 50 giga single-precision operands can be fetched per second. With CGMA = 1.0, single-precision throughput therefore cannot exceed 50 GFLOPS.

5.2 Types of CUDA Device Memory


  • Global memory: the memory the host reads and writes; it is the channel for host–device data exchange.
  • Constant memory: read-only for the device, with fast access when threads read the same locations.
  • Registers and shared memory: the on-chip memories. A register is private to a single thread, while shared memory can be accessed by all threads in the same thread block.

(Figure: two different processing architectures.)

Different declarations give a variable a different memory placement, scope, and lifetime:

| Variable declaration | Memory | Scope | Lifetime |
| --- | --- | --- | --- |
| automatic variables other than arrays | register | thread | kernel |
| automatic array variables | local (global memory) | thread | kernel |
| `__device__ __shared__ int SharedVar;` | shared | block | kernel |
| `__device__ int GlobalVar;` | global | grid | application |
| `__device__ __constant__ int ConstVar;` | constant | grid | application |

NOTE: Automatic array variables are stored in global memory
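A hedged set of example declarations matching the table above (the names scopeDemo, coeff, globalVar, and tile are illustrative; launch with 256 threads per block):

```cuda
__constant__ float coeff[16];  // constant memory: grid scope, application lifetime
__device__ float globalVar;    // global memory: grid scope, application lifetime

__global__ void scopeDemo(float *out) {
    __shared__ float tile[256];  // shared memory: one copy per block, kernel lifetime
    int i = threadIdx.x;         // automatic scalar: held in a register, private to the thread
    float local[8];              // automatic array: stored in (local) global memory
    local[i % 8] = coeff[i % 16];
    tile[i] = local[i % 8] + globalVar;
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = tile[i];
}
```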

5.3 A strategy to reduce global memory traffic

Global memory accesses are slow while shared memory accesses are fast, so the data is partitioned into subsets called tiles, each small enough to fit in shared memory, and processed tile by tile; this reduces the number of global-memory accesses and improves efficiency.
Frequently accessed global-memory data can be kept resident in on-chip memory, which cuts global traffic, but keeping too much resident consumes a large amount of the limited on-chip memory. Threads therefore use barrier (fence) synchronization to coordinate the phases, so that each tile stays resident only as long as it is needed; the kernel sketch in Section 5.4 shows this pattern.

5.4 Kernel function for block matrix multiplication

The naive matrix multiplication can be broken into a sequence of phases (a full kernel sketch follows the list below).
In each phase:

  1. Each thread first copies the tile data it needs from global memory into shared memory; each thread loads two elements, one from M and one from N (the example works in 2*2 tiles).
  2. Each thread then performs multiply-and-accumulate operations on the shared data, building up its element of the 2*2 output tile.
  3. As shown in the figure, red marks the thread that computes $P_{0,0}$: it accumulates the products of the red region, using $M_{0,0}$ and $M_{0,1}$ together with the matching elements of N (shown in purple). The other colors behave the same way. Because each element of M or N is used by two threads within the 2*2 tile, the threads each load a different element and then share the whole tile during the same phase.

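A sketch of the tiled kernel in the spirit of the book's example, assuming square Width × Width matrices, a 16 × 16 tile, and Width divisible by TILE_WIDTH:

```cuda
#define TILE_WIDTH 16

__global__ void tiledMatMul(const float *d_M, const float *d_N,
                            float *d_P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    // One phase per pair of tiles along the shared dimension.
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // 1. Cooperative load: each thread fetches one element of M and one of N.
        Mds[ty][tx] = d_M[Row * Width + ph * TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(ph * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // fence: the whole tile must be loaded before use

        // 2. Multiply-accumulate entirely out of fast shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // fence: finish with the tile before it is overwritten
    }
    d_P[Row * Width + Col] = Pvalue;
}
```

With a 16 × 16 tile, each element loaded from global memory is reused by 16 threads, so global traffic drops by a factor of 16 and CGMA rises accordingly.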

5.5 Memory—a factor limiting parallelism

The registers and shared memory available on each SM are limited, and they are allocated in units of thread blocks. If (number of resident blocks) × (registers or shared memory required per block) would exceed the SM's capacity, the hardware reduces the number of blocks running concurrently on that SM, one whole block at a time. Conversely, if the memory and registers could accommodate more threads than the SM's hardware thread limit, that inherent limit becomes the factor that caps parallelism. For example, suppose an SM provides 16,384 registers and a kernel uses 10 registers per thread with 256-thread blocks: each block needs 2,560 registers, so six blocks (1,536 threads) can be resident; if register usage grows to 11 per thread, six blocks would need 16,896 registers, so only five blocks fit.
By querying the hardware information at run time and deciding the shared-memory usage dynamically, GPU utilization can be improved and the computation accelerated. To do this, the shared-memory declaration in the matrix multiplication kernel is changed to an extern, unsized array, and the actual size in bytes is supplied as a third launch-configuration parameter when the kernel is started.
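A sketch of both changes, assuming the book's usual names; packing the M and N tiles into one dynamically sized buffer and the launch() wrapper are illustrative choices, not the book's exact code:

```cuda
#include <cuda_runtime.h>

// Dynamically sized shared memory: the extern array has no compile-time size.
__global__ void tiledKernel(const float *d_M, const float *d_N,
                            float *d_P, int Width, int tileWidth) {
    extern __shared__ float MdsNds[];              // size supplied at launch time
    float *Mds = MdsNds;                           // first half holds the M tile
    float *Nds = MdsNds + tileWidth * tileWidth;   // second half holds the N tile
    // ... tiled multiplication as before, indexing Mds/Nds manually ...
}

void launch(const float *d_M, const float *d_N, float *d_P, int Width) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Choose a tile size, checking it against the hardware's shared-memory capacity.
    int tileWidth = 16;
    size_t bytes = 2 * tileWidth * tileWidth * sizeof(float);
    if (bytes > prop.sharedMemPerBlock) { /* fall back to a smaller tile */ }
    dim3 block(tileWidth, tileWidth);
    dim3 grid(Width / tileWidth, Width / tileWidth);
    // Third launch parameter: dynamic shared-memory size in bytes.
    tiledKernel<<<grid, block, bytes>>>(d_M, d_N, d_P, Width, tileWidth);
}
```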

Origin blog.csdn.net/qq_40491305/article/details/116236291