Introduction to CUDA - CUDA Memory Model

1. Introduction

The CUDA memory model adopts a layered design; this is the biggest difference between CUDA programs and ordinary C programs:

  • Thread-Memory Correspondence
  • Block-Memory Correspondence
  • Grid-Memory Correspondence

The overall CUDA memory model consists of:

  • 1) Registers & Local Memory: ordinary variables declared in a Kernel. Small variables are stored in Registers, while variables too large for Registers are stored in Local Memory. Because Local Memory is slow to access, it should be avoided whenever possible.
  • 2) Shared Memory: Allows Threads in the same Block to communicate with each other.
  • 3) Constant Memory: used to store data that does not change during Kernel execution. It is cached in On-Chip memory, which can greatly reduce the Global Memory traffic generated while the Kernel executes.
  • 4) Global Memory: used to store data interacting with the Host.
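As a minimal sketch of how these four spaces appear in code (all names here are illustrative, not from the original post):

```cuda
#include <cuda_runtime.h>

__constant__ float coeff[16];    // Constant Memory: read-only while the Kernel runs

__global__ void memorySpacesDemo(const float *in, float *out)  // in/out point into Global Memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float x = in[i];             // small scalar: kept in a Register
    float big[256];              // large per-Thread array: may be spilled to Local Memory

    __shared__ float tile[128];  // Shared Memory: one copy per Block (launch with blockDim.x <= 128)
    tile[threadIdx.x] = x;
    __syncthreads();             // make every Thread's write visible Block-wide

    big[0] = tile[threadIdx.x] * coeff[0];
    out[i] = big[0];             // result written back to Global Memory
}
```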

2. Thread-Memory Correspondence

Thread-Memory Correspondence, that is, each Thread corresponds to its own Local Memory (and Registers):

  • Its scope is: private to the corresponding Thread, and inaccessible to other Threads.
  • Its life cycle is: Thread. When the Thread execution is completed, any local memory (and Registers) related to the Thread will be automatically released.
  • However, local memory and registers have completely different performance characteristics.
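A minimal sketch of Thread-private storage (the kernel and its variable names are illustrative):

```cuda
__global__ void perThreadDemo(float *out)
{
    // Each Thread owns a private copy of every variable below;
    // no other Thread can read or write them.
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in a Register
    float acc = 0.0f;                               // held in a Register
    float scratch[64];                              // may be placed in Local Memory

    for (int k = 0; k < 64; ++k) {
        scratch[k] = (float)(k * i);
        acc += scratch[k];
    }
    out[i] = acc;  // the private storage is released when the Thread finishes
}
```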

3. Block-Memory Correspondence

Block-Memory Correspondence, that is, each Block corresponds to its own Shared Memory:

  • Its scope is: every Thread in the same Block can access it.
  • Its life cycle is: Block. When the Block execution is completed, its shared memory contents will be automatically released.
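These scope and lifetime rules can be illustrated with a small broadcast kernel (a sketch; names are illustrative):

```cuda
__global__ void blockBroadcast(const float *in, float *out)
{
    __shared__ float first;   // one copy per Block, shared by all of its Threads

    if (threadIdx.x == 0)
        first = in[blockIdx.x * blockDim.x];  // Thread 0 writes the value once
    __syncthreads();          // every Thread waits until the write is visible

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = first;           // all Threads in the Block read the same value
}   // `first` is released automatically when the Block finishes
```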

4. Grid-Memory Correspondence

Grid-Memory Correspondence, that is, Grids correspond to Global Memory:

  • Its scope is: every Thread in every Grid can access it.
  • Its life cycle is: the entire main() program of the Host code, unless it is released earlier by manually calling cudaFree(...) in the Host code.
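A minimal Host-side sketch of this lifetime (kernelA and kernelB are illustrative placeholders): a Global Memory buffer survives across Kernel launches until cudaFree(...) is called.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] = (float)threadIdx.x; }
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }

int main()
{
    const int N = 256;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(float));  // visible to every Thread of every Grid

    kernelA<<<1, N>>>(d_buf);               // the first Grid writes it
    kernelB<<<1, N>>>(d_buf);               // a later Grid still sees the results

    cudaFree(d_buf);                        // released manually here, or when main() ends
    return 0;
}
```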

5. Device memory mode

[Figure: the Host (CPU + system memory) and the Device (GPU + DRAM)]

  • Host: consists of the CPU and the machine's memory (RAM).
  • Device: consists of the GPU and its DRAM. [The green squares in the Device diagram represent CUDA cores.]

The Device's DRAM contains:

  • Global Memory physical space
  • Local Memory physical space. It should be noted that the Local here does not refer to its physical location; the Local here refers to the scope and lifetime of the memory space.

The GPU of the Device contains:

  • Registers physical space
  • Shared Memory physical space

Each green square in the Device diagram represents a CUDA core. The CUDA cores on the Device are grouped together into streaming Multiprocessors (SMs for short).
Each yellow square in the Device diagram represents an SM; an SM is thus a collection of CUDA cores.

The memory located on the SM is called:

  • “On-Chip” Device memory. Registers and Shared Memory both correspond to “On-Chip” Device memory, because they both physically reside in the GPU's streaming Multiprocessors.

Memory that is not located on the SM is called:

  • “Off-Chip” Device memory, because it does not reside on the GPU chip itself. Correspondingly, Global Memory and Local Memory are both “Off-Chip” Device memories; that is, the DRAM on the Device is “Off-Chip” Device memory.

Take the physical layout of the NVIDIA GeForce Titan GPU as an example:
[Figure: physical layout of the GeForce Titan board]

  • The blue box in the above figure marks the Device's DRAM, which is the “Off-Chip” Device memory.
  • The green box in the above figure marks the GPU chip itself, which holds the “On-Chip” Device memory.

Understanding how Blocks are mapped to SMs is a basic requirement for designing Kernels that achieve good GPU computing performance.

5.1 Memory speed

Different memory spaces have different bandwidth and latency:
[Figure: bandwidth and latency of the different memory spaces]

  • On-chip memory operates faster than off-chip memory.

5.2 Global Memory access

Global Memory is allocated and accessed through the following Host API functions:

  • cudaMalloc()
  • cudaMemset()
  • cudaMemcpy()
  • cudaFree()

Using Global Memory is unavoidable, because it is the only way to transfer data between the Host and the Device. However, Global Memory traffic should be minimized, because Global Memory is very slow; a typical management sequence is sketched below.
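A minimal sketch of the full sequence (sizes and names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_data[1024];                      // Host buffer

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);              // allocate Global Memory on the Device
    cudaMemset(d_data, 0, bytes);            // initialize it on the Device

    // ... launch Kernels that work on d_data here ...

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d_data);                        // release the Global Memory

    printf("first element: %f\n", h_data[0]);
    return 0;
}
```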

The advantages of Global Memory are:

  • It is usually quite large. For example, a typical computer has 8 GB or 16 GB of RAM, while the GeForce Titan has 6 GB and the Tesla K40 has 12 GB of global DRAM.

5.3 Registers and Local Memory

Variables declared in a Kernel are stored in Registers:

  • Corresponds to On-Chip Device memory.
  • The fastest form of memory.

Arrays that are too large to fit in Registers are stored in Local Memory:

  • Corresponds to Off-Chip Device memory.
  • Controlled by the compiler.
  • "Local" refers to range, not location. "Local" here refers to the Local Memory relative to each Thread.
    • Each Thread has its own private local memory and registers, which are not accessible to other Threads.
  • It should be avoided as much as possible, because Local Memory is one of the slowest forms of memory, while Registers are the fastest.
    • Therefore, the design goal is to keep per-Thread data small enough to fit in Registers, so that nothing spills into Local Memory.
    • Register space is a scarce hardware resource, so it should be budgeted carefully.
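A sketch of how the compiler decides between the two (the kernel is illustrative):

```cuda
__global__ void spillDemo(float *out)
{
    float a = 1.0f, b = 2.0f;  // small scalars: almost certainly kept in Registers

    float table[1024];         // 4 KB per Thread: far too large for Registers,
                               // so the compiler places it in Local Memory
    for (int k = 0; k < 1024; ++k)
        table[k] = a * (float)k + b;

    out[threadIdx.x] = table[threadIdx.x];
}
```

Compiling with `nvcc --ptxas-options=-v` prints the Register and Local Memory usage of each Kernel, which makes such spills visible.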

5.4 Shared Memory

With Shared Memory:

  • Supports mutual communication between Threads in the same Block:
    • synchronous communication
  • Shared Memory is a very special kind of memory that is crucial to achieving computing performance and correctness:
    • Shared Memory is fast, second only to Registers, because it is On-Chip Device memory.
    • Supports communication between Threads in the same Block; it can be regarded as a user-managed L1 Cache, usable as “scratch-pad” (high-speed temporary) memory. The close relationship between Shared Memory and the L1 Cache will be introduced later.

Use the __shared__ keyword to indicate that the allocated memory is Shared Memory:
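As a minimal sketch (the images in the original post presumably showed similar code; this kernel reverses each Block's elements and assumes blockDim.x == 256):

```cuda
__global__ void reverseInBlock(float *d)
{
    __shared__ float tile[256];       // statically allocated Shared Memory

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = d[i];                   // stage the data in fast On-Chip memory
    __syncthreads();                  // wait until the whole tile is filled

    d[i] = tile[blockDim.x - 1 - t];  // read it back in reversed order
}
```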

5.5 Constant Memory

Constant Memory is a special area of Device Memory:

  • Used to store data that remains unchanged during Kernel execution.
  • It is read-only for the Kernel.
  • Constant Memory is Off-Chip Device Memory.
  • However, Constant Memory is aggressively cached in On-Chip Memory.

The idea of Constant Memory is:

  • The GPU does not have a large general-purpose cache, so Constant Memory can be used to implement a very simple form of caching.
  • Constant Memory can be very large because it is actually located in Device DRAM. All Threads can access Constant Memory, but it is read-only memory.
  • For data that needs to be accessed frequently but does not change during Kernel execution, Constant Memory can be used.
  • Constant Memory is implemented in off-chip DRAM hardware, but its contents are aggressively cached in On-Chip Memory. Therefore, using Constant Memory can greatly reduce Global Memory traffic during Kernel execution.
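A minimal sketch of the usual pattern (filter and the kernel are illustrative): declare the data with __constant__, fill it from the Host with cudaMemcpyToSymbol(), and read it freely inside the Kernel.

```cuda
#include <cuda_runtime.h>

__constant__ float filter[9];  // resides in Constant Memory, cached On-Chip

__global__ void applyFilter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * filter[i % 9];  // read-only access, served from the cache
}

// Constant Memory can only be written from the Host, before the Kernel runs:
void uploadFilter(const float *h_filter)
{
    cudaMemcpyToSymbol(filter, h_filter, 9 * sizeof(float));
}
```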
