Memory efficiency in CUDA

GPU memory structure

  • Off-chip storage

    • Constant memory (cached, so reads are fast)
    • Texture memory
    • Global memory
  • On-chip storage

    • 32-bit registers in each SP (per-thread scope)
    • Shared memory (speed close to cache; per-thread-block scope)
    • Read-only constant memory cache (per-grid scope)
    • Read-only texture memory cache
  • Access time of different storage types

    Storage type   Register    Shared memory   Constant memory   Global memory
    Bandwidth      ~8 TB/s     ~1.5 TB/s       ~200 GB/s         ~200 GB/s
    Latency        1 cycle     1–32 cycles     400–600 cycles    400–600 cycles
  • Data storage location

    • Data copied from the host to the GPU with cudaMemcpy lands in global memory, constant memory, or texture memory
    • Placing data in shared memory is manual: the programmer declares a shared memory region and copies the data from global to shared memory
    • Most variables declared inside a kernel live in registers
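The placement rules above can be sketched in a small kernel. This is an illustrative example, not from the original post; the names (`coeffs`, `scale`) and the block size of 256 are assumptions.

```cuda
// Where data ends up, following the rules above (illustrative sketch).
__constant__ float coeffs[16];            // constant memory (off-chip, cached)

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];           // shared memory: declared explicitly
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // local scalars -> registers
    if (i < n) {
        tile[threadIdx.x] = in[i];        // global -> shared copy is also explicit
        __syncthreads();
        out[i] = tile[threadIdx.x] * coeffs[0];
    }
}

// Host side: cudaMemcpy targets global memory,
// while constant memory is filled via cudaMemcpyToSymbol:
//   cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```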

Global memory

Have the threads of a warp request consecutive memory addresses at the same time (coalesced access); the hardware can then combine the requests into a few wide transactions, improving the efficiency of global memory access.
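The difference can be seen in two copy kernels (a sketch; the kernel names and the `stride` parameter are illustrative):

```cuda
// Coalesced: consecutive threads touch consecutive addresses, so a warp's
// 32 loads merge into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // thread k reads in[base + k]
}

// Strided: thread k reads in[k * stride]; when stride is large each load
// may need its own transaction, wasting most of the fetched bandwidth.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```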

  • Common CUDA programming strategy (tiling)
    • Partition the data to be processed into tiles small enough to fit in shared memory
    • Load each tile from global memory into shared memory; having many threads load in parallel exploits memory-level parallelism
    • The threads in the block compute on the tile held in shared memory
    • Copy the results from shared memory back to global memory
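The four steps above are the classic tiled matrix multiply. A minimal sketch, assuming square n×n matrices with n a multiple of the tile width and a (TILE, TILE) block shape:

```cuda
#define TILE 16

// C = A * B, tiled through shared memory (n assumed a multiple of TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Step 2: each thread loads one element of each tile (parallel loads).
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                  // wait until the tile is fully loaded
        // Step 3: compute on the tile held in shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // wait before overwriting the tile
    }
    // Step 4: write the result back to global memory.
    C[row * n + col] = acc;
}
```

Each element of A and B is read from global memory only n/TILE times instead of n times, which is the point of the strategy.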

Shared memory

  • Shared memory is organized into banks that can be accessed in parallel (if the threads of a warp cause no bank conflicts, shared memory is accessed as fast as registers)
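A common bank-conflict case is reading a column of a 32×32 shared array: with 32 four-byte banks, every element of a column falls in the same bank. Padding the array with one extra column shifts each row's bank mapping and removes the conflict. A sketch, assuming a (32, 32) block and a 32×32 input:

```cuda
// Transpose a 32x32 tile through shared memory.
__global__ void transpose_tile(const float *in, float *out) {
    // 33 columns instead of 32: element [r][c] sits at word r*33+c, so a
    // column access hits banks (r+c) % 32 -- all different, no conflict.
    __shared__ float tile[32][33];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];          // coalesced global read
    __syncthreads();
    out[y * 32 + x] = tile[x][y];         // column read, conflict-free via the pad
}
```

Without the `+1` pad, the column read `tile[x][y]` would be a 32-way bank conflict and serialize the warp.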

Origin blog.csdn.net/qq_42573343/article/details/105295513