1. Introduction
Earlier posts in this series:
- Introduction to CUDA - Basic Concepts
- Introduction to CUDA - Programming Model
- Introduction to CUDA - For-Loop Parallelization
- Introduction to CUDA - Thread Index in Grid and Block
The CUDA memory model adopts a layered design, which is the biggest difference between CUDA programs and ordinary C programs:
- Thread-Memory Correspondence
- Block-Memory Correspondence
- Grid-Memory Correspondence
The overall CUDA memory model is:
- 1) Registers & Local Memory: regular variables declared in the Kernel. Small variables are kept in Registers, while variables too large to fit (such as big arrays) are placed in Local Memory. Because Local Memory is slow, its use should be avoided as much as possible.
- 2) Shared Memory: Allows Threads in the same Block to communicate with each other.
- 3) Constant Memory: used to store data that does not change during Kernel execution. It is cached in On-Chip memory, which can greatly reduce the Global Memory traffic during Kernel execution.
- 4) Global Memory: used to store data interacting with the Host.
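The four levels above can be seen side by side in a minimal kernel sketch (all names and sizes here are illustrative, not from the original post):

```cuda
// Illustrative sketch: where each kind of declaration lives in the CUDA memory model.
__constant__ float coeff[16];          // Constant Memory: read-only during Kernel execution

__global__ void memorySpaces(const float *in, float *out)   // in/out point into Global Memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // small scalar -> Register
    float big[256];                                 // large per-thread array -> likely Local Memory
    __shared__ float tile[128];                     // Shared Memory: one copy per Block

    big[0] = in[i];
    tile[threadIdx.x % 128] = in[i];
    __syncthreads();                                // Threads in a Block synchronize on Shared Memory
    out[i] = tile[threadIdx.x % 128] * coeff[0] + big[0];
}
```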
2. Thread-Memory Correspondence
Thread-Memory Correspondence, that is, Threads correspond to Local Memory (and Registers):
- Its scope is: private to the corresponding Thread, and inaccessible to other Threads.
- Its life cycle is: Thread. When the Thread execution is completed, any local memory (and Registers) related to the Thread will be automatically released.
- However, local memory and registers have completely different performance characteristics.
3. Block-Memory Correspondence
Block-Memory Correspondence, that is, Blocks correspond to Shared Memory:
- Its scope is: every Thread in the same Block can access it.
- Its life cycle is: Block. When the Block execution is completed, its shared memory contents will be automatically released.
4. Grid-Memory Correspondence
Grid-Memory Correspondence, that is, Grids correspond to Global Memory:
- Its scope is: every Thread in the Grid can access it.
- Its life cycle is: the entire main() program in the Host code, unless it is released earlier by explicitly calling cudaFree(...) in the Host code.
5. Device Memory Model
- Host: consists of the CPU and the machine's main memory.
- Device: consists of the GPU and its DRAM.
In Device's DRAM, there are:
- Global Memory physical space
- Local Memory physical space. It should be noted that the Local here does not refer to its physical location; the Local here refers to the scope and lifetime of the memory space.
The GPU of Device includes:
- Registers physical space
- Shared Memory physical space
The green squares in the Device diagram represent CUDA cores. On the Device, CUDA cores are grouped together into Streaming Multiprocessors (SM for short).
The yellow square in the Device diagram represents an SM; each SM is a collection of CUDA cores.
The memory located on the SM is called:
- "On-Chip" Device memory. Registers and Shared Memory are both On-Chip Device memory, because both physically reside in the GPU's Streaming Multiprocessors.
Memory not located on an SM is called:
- "Off-Chip" Device memory, because it is not on the GPU chip itself. Correspondingly, Global Memory and Local Memory are both "Off-Chip" Device memory. That is, the DRAM on the Device is "Off-Chip" Device memory.
Take the physical layout of the NVIDIA GeForce Titan GPU as an example:
- The blue box in the figure marks the Device's DRAM, which is "Off-Chip" Device memory.
- The green box in the figure marks the GPU chip itself, which holds the "On-Chip" Device memory.
Understanding how Blocks are mapped to SMs is a basic requirement for designing Kernels that achieve good GPU performance.
5.1 Memory speed
Different memory spaces have different bandwidth and latency:
- On-chip memory offers higher bandwidth and lower latency than off-chip memory.
5.2 Global Memory access
Global Memory is managed through the following Host-side API calls:
- cudaMalloc()
- cudaMemset()
- cudaMemcpy()
- cudaFree()
Using Global Memory cannot be avoided entirely, because data must pass through Global Memory to move between the Host and the Device. However, Global Memory traffic should be minimized, because it is very slow.
The advantages of Global Memory are:
- It is usually quite large. For example, typical Host memory is 8 GB or 16 GB, and both the Titan and the Tesla K40 have 6 GB of Global DRAM.
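A typical round-trip through Global Memory using the four API calls listed above might look like this (a minimal sketch; the buffer size and variable names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);   // Host memory
    float *d_data = NULL;                     // will point into Global Memory

    cudaMalloc((void **)&d_data, bytes);      // allocate Global Memory on the Device
    cudaMemset(d_data, 0, bytes);             // initialize it to zero
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy Device -> Host
    cudaFree(d_data);                         // release Global Memory explicitly

    printf("h_data[0] = %f\n", h_data[0]);
    free(h_data);
    return 0;
}
```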
5.3 Registers and Local Memory
Variables declared in Kernel are stored in Register:
- Corresponds to On-Chip Device memory.
- The fastest form of memory.
Arrays that are too large to fit in Register will be stored in Local Memory:
- Corresponds to Off-Chip Device memory.
- Controlled by the compiler.
- "Local" refers to scope, not location: Local Memory is local to each Thread.
- Each Thread has its own private local memory and registers, which are not accessible to other Threads.
- It should be avoided as much as possible because Local Memory is one of the slowest memory forms. Registers are the fastest form of memory.
- Therefore, the design goal is to keep per-Thread data small enough to fit in Registers, so that it does not spill to Local Memory.
- Register space is a scarce hardware resource and should be budgeted carefully.
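The contrast can be sketched in a small kernel (the array size is illustrative; the compiler makes the final placement decision):

```cuda
__global__ void registersVsLocal(float *out)
{
    float a = 1.0f, b = 2.0f;   // small scalars: almost certainly kept in Registers (fast)
    float buf[512];             // large per-thread array: typically spilled to Local Memory (slow)

    for (int i = 0; i < 512; ++i)
        buf[i] = a * (float)i + b;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = buf[tid % 512];  // dynamic indexing also tends to force Local Memory
}
```

Compiling with `nvcc -Xptxas -v` reports per-kernel register usage and any spill loads/stores, which is a practical way to check whether data is landing in Local Memory.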
5.4 Shared Memory
With Shared Memory:
- Supports communication between Threads in the same Block, which must be explicitly synchronized.
- Shared Memory is a very special kind of memory that is crucial to achieving computing performance and correctness:
- Shared Memory has fast processing speed, second only to Registers. Because it is On-Chip device memory.
- Supports communication between Threads in a Block. It can be regarded as a user-managed L1 Cache and used as "scratch-pad" (high-speed temporary storage) memory. The close relationship between Shared Memory and the L1 Cache will be introduced later.
Use the __shared__ keyword to indicate that the allocated memory is Shared Memory:
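For example, a sketch (kernel name and tile size are illustrative) in which each Block reverses a tile of the input in Shared Memory:

```cuda
#define TILE 128

// Illustrative: each Block loads a tile into Shared Memory, then reads it back reversed.
__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float tile[TILE];              // one copy per Block, visible to all its Threads

    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];                // each Thread writes one element
    __syncthreads();                          // wait until the whole Block has written

    out[i] = tile[TILE - 1 - threadIdx.x];    // read an element written by another Thread
}
```

The __syncthreads() barrier is what makes the communication "synchronous": without it, a Thread could read a tile slot before the Thread responsible for it has written.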
5.5 Constant Memory
Constant Memory is a special area of Device Memory:
- Used to store data that remains unchanged during Kernel execution.
- It is read-only for the Kernel.
- Constant Memory is Off-Chip Device Memory.
- However, Constant Memory is automatically cached in On-Chip memory.
The idea of Constant Memory is:
- The GPU does not have a large cache. Thus, constant memory can be used to implement very simple cache types.
- Constant Memory can be very large because it is actually located in Device DRAM. All Threads can access Constant Memory, but it is read-only memory.
- For data that needs to be accessed frequently but does not change during Kernel execution, Constant Memory can be used.
- Constant Memory is implemented in Off-Chip DRAM hardware, but its contents are automatically cached in On-Chip memory. Therefore, using Constant Memory can greatly reduce Global Memory traffic during Kernel execution.
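A minimal sketch of this pattern (the coefficient array and function names are illustrative): the Host writes the unchanging data once with cudaMemcpyToSymbol(), and every Thread then reads it through the constant cache.

```cuda
#include <cuda_runtime.h>

// Illustrative: filter coefficients that never change during Kernel execution.
__constant__ float c_coeff[8];                // lives in Constant Memory

__global__ void applyCoeff(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_coeff[i % 8];      // read-only access, served from the constant cache
}

void setupCoeff(const float *host_coeff)
{
    // Host writes Constant Memory before launching the Kernel.
    cudaMemcpyToSymbol(c_coeff, host_coeff, 8 * sizeof(float));
}
```

Note that the Kernel never writes c_coeff; attempting to assign to it from Device code is a compile-time error, which enforces the read-only property described above.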