CUDA memory organization

The CUDA textbooks currently on the market tend to suffer from translation problems and often contradict one another, so the main content below follows the official CUDA C++ Programming Guide.

1 GPU storage hardware

  • DRAM/HBM: conceptually similar to the CPU's main memory
  • L2 Cache
  • L1 Cache
  • Register
  • Shared Memory
  • Read-only data cache

Note: The above hardware configuration may vary depending on the GPU version.

2 CUDA memory model

2.1 Global Memory

  • Global memory is large, on-board memory characterized by relatively high latency.
  • It can be allocated dynamically from the host with cudaMalloc.
  • It can be declared statically with the __device__ specifier (a short sketch of both follows the note below).

Source of the definition of device memory: "The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively." This sounds much like the definition of global memory, and the textbook indeed treats the two as the same thing.
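A minimal sketch of both allocation styles (the kernel and variable names are made up for illustration): global memory allocated dynamically from the host with cudaMalloc, plus a statically declared __device__ variable written via cudaMemcpyToSymbol.

```cpp
#include <cuda_runtime.h>

__device__ float d_scale;                          // statically declared in global (device) memory

__global__ void scaleArray(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= d_scale;                 // both data and d_scale live in global memory
}

int main() {
    const int n = 1024;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));        // dynamically allocated global memory
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));
    scaleArray<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```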

2.2 Constant Memory

  • The constant memory space resides in device memory and is cached in the constant cache.
  • Global memory with constant cache
  • Read-only to kernel code, but readable and writable from the host
  • Access is fastest when all threads in a warp read the same address (the broadcast appears to work at half-warp granularity); accesses to different addresses within a warp are serialized (a short usage sketch follows).
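A minimal sketch of typical constant-memory usage, assuming a small hypothetical coefficient table: declared with __constant__, filled from the host with cudaMemcpyToSymbol, and read at a uniform address inside the kernel so the warp gets a single broadcast.

```cpp
#include <cuda_runtime.h>

__constant__ float coeffs[16];                     // lives in the constant memory space

__global__ void scaleByCoeff(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in the warp reads the same address, so the read is broadcast.
    if (i < n) out[i] = coeffs[0] * in[i];
}

// Host side: constant memory is read-only in kernels but writable from the host, e.g.
//   float hostCoeffs[16] = { /* ... */ };
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));
```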

At present I have not found much reliable information about the constant cache. I only know that it is a dedicated on-chip cache, and that the constant memory space itself is limited to 64 KB.

In addition, NVIDIA has an official tutorial with a diagram of the SM, from which I would guess that the constant cache sits inside the SM.

2.3 Texture Memory/Surface Memory

  • The texture and surface memory spaces reside in device memory and are cached in texture cache
  • Global memory with cache
  • Read-only from kernels through texture fetches (surface memory additionally supports writes)
  • Optimized for 2D and 3D spatial locality, so better suited to scattered reads

Regarding the texture cache, it may be ① the same unit as the read-only data cache, ② combined with the L1 cache while shared memory stays separate, or ③ unified with both the L1 cache and shared memory; this depends on the architecture.
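As a sketch of the read path described above (assumptions: float data in linear device memory, and made-up kernel and buffer names), the texture object API wraps an ordinary allocation in a cudaTextureObject_t so reads go through the texture cache:

```cpp
#include <cuda_runtime.h>

__global__ void readThroughTex(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i); // fetch served by the texture cache
}

int main() {
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};                 // describe the underlying linear memory
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};                  // describe how it is sampled
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    readThroughTex<<<(n + 127) / 128, 128>>>(tex, d_out, n);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```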

2.4 Shared Memory

  • Shared memory is smaller (than global memory), low-latency, on-chip memory
  • Allocated using the __shared__ memory space specifier (see the sketch after this list)
  • program-managed cache
  • Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block.
  • One SM can run multiple thread blocks at the same time; in that case the resident blocks share (partition) the SM's physical shared memory.
  • Supports the declaration of 1D, 2D, and 3D shared memory arrays
  • Pay attention to bank conflicts
  • Starting with Compute Capability 9.0, the CUDA programming model provides an optional level of hierarchy called thread block clusters, made up of thread blocks. Thread blocks in a cluster can perform read, write, and atomic operations on each other's shared memory.
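A minimal sketch of shared memory as a program-managed cache (the kernel name and the 256-thread block size are assumptions): each block stages its slice of global memory into a __shared__ array and then reduces it cooperatively.

```cpp
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];                    // one statically sized array per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;            // stage global memory into shared memory
    __syncthreads();

    // Tree reduction in shared memory; sequential addressing keeps accesses free of bank conflicts.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];       // one partial sum per block
}

// Launch example: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```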

2.5 Register

  • An automatic variable declared in device code without any of the __device__, __shared__, or __constant__ memory space specifiers generally resides in a register.
  • Registers are on-chip and have the highest access speed of all the memories, but the number available per thread is very limited.

2.6 Local Memory

  • Each thread has private local memory.
  • Automatic variables that the compiler is likely to place in local memory are:
    • Arrays for which it cannot determine that they are indexed with constant quantities (arrays whose index values cannot be determined at compile time),
    • Large structures or arrays that would consume too much register space,
    • Any variable if the kernel uses more registers than are available (register spilling).
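A small sketch (hypothetical kernel) contrasting the two cases from 2.5 and 2.6: an ordinary automatic scalar normally ends up in a register, while a large, dynamically indexed per-thread array is a typical local-memory candidate.

```cpp
__global__ void registerVsLocal(const int* idx, float* out) {
    float x = threadIdx.x * 2.0f;       // automatic scalar: normally kept in a register
    float buf[64];                      // large per-thread array: a local-memory candidate
    for (int i = 0; i < 64; ++i)
        buf[i] = x + i;
    // The index is not known at compile time, which also pushes buf into local memory.
    out[threadIdx.x] = buf[idx[threadIdx.x] % 64];
}
```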

Origin blog.csdn.net/xxt228/article/details/132001604