A brief introduction to GPU architecture and CUDA (to be expanded over time)


SIMD and SIMT

  • SISD: A single instruction stream operates on a single data stream: each instruction processes only its own data, and there are no separate cores executing different instruction streams; in other words, there is only one core.

  • SIMD: The architecture has only one control unit and one instruction memory, so only one instruction can execute at a time, but that instruction processes multiple data elements simultaneously; the instruction's operands are no longer single scalars but vectors. For example, if a, b, and c are all four-dimensional vectors and we want to compute c = a + b, SISD needs four instructions:

    ADD c.x, a.x, b.x
    ADD c.y, a.y, b.y
    ADD c.z, a.z, b.z
    ADD c.w, a.w, b.w
    

    But under SIMD, only one instruction needs to be executed:

    SIMD_ADD c, a, b
    


  • MISD: Multiple instruction streams operate on the same data stream; rarely used in practice.

  • MIMD: Both instructions and data are parallel: different cores execute different instruction streams on different data. Roughly speaking, it can be thought of as multiple SISDs running in parallel.

  • SIMT: An upgraded version of SIMD: the same instruction is executed by multiple concurrent threads, and the operands of that one instruction can differ from thread to thread, as the sketch below shows.
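
To make SIMT concrete, here is a minimal CUDA sketch (the kernel name vec_add is illustrative, not from the original post): every thread runs the same instruction stream, but each thread computes its own index, so the operands of the shared instructions differ per thread.

    // A minimal SIMT sketch: every thread executes the same instruction
    // stream, but on its own data element.
    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread operand index
        if (i < n)
            c[i] = a[i] + b[i];  // the same ADD, with different operands in each thread
    }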

Fermi Architecture

[Figure: Fermi architecture diagram; per-SM detail shown on the right]

  • There are 16 SMs (Streaming Multiprocessors), the black boxes in the figure above
  • The right side of the figure shows a single SM; each SM contains:
    • 32 Cores, arranged in two groups of 16
    • 16 LD/ST units (load/store units)
    • 4 SFUs (Special Function Units)
    • shared memory
  • Each group of 16 Cores is served by:
    • a Warp Scheduler
    • a Dispatch Unit
  • Each Core (Streaming Processor, SP) contains:
    • an INT unit (integer arithmetic unit, ALU)
    • an FP unit (floating-point arithmetic unit)
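
The counts above are specific to Fermi; on any GPU they can be queried at runtime. Below is a minimal sketch using the CUDA runtime's cudaGetDeviceProperties (all fields printed are standard cudaDeviceProp members):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        printf("SM count:             %d\n", prop.multiProcessorCount);
        printf("warp size:            %d\n", prop.warpSize);
        printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }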

CUDA

A CUDA kernel launch has the following general form:

    cuda_kernel<<<grid_size, block_size, Ns, stream>>>(...)

  • grid_size: a dim3 value (components x, y, z) that sets the dimensions of the grid; x * y * z equals the number of blocks launched. All threads in a grid share global memory.
  • block_size: a dim3 value that sets the dimensions of each block; x * y * z equals the number of threads in each block. Threads within a block can synchronize with one another and can access shared memory.
  • Ns: defaults to 0; the number of bytes of shared memory to allocate dynamically for each block.
  • stream: the stream the kernel is issued to; defaults to 0, the GPU's default stream. A complete launch example follows this list.
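
Putting the four launch parameters together, here is a minimal, self-contained sketch (the kernel name scale and its doubling logic are illustrative): it launches with an explicit grid_size, block_size, dynamic shared memory size Ns, and a non-default stream.

    #include <cuda_runtime.h>

    // Illustrative kernel: each block stages its slice of the array in
    // dynamically sized shared memory (the extern array is backed by the
    // Ns bytes passed at launch), then writes doubled values back.
    __global__ void scale(float* data, int n) {
        extern __shared__ float tile[];   // sized by Ns at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = data[i];
        __syncthreads();                  // block-wide barrier: all threads of a block, same SM
        if (i < n) data[i] = 2.0f * tile[threadIdx.x];
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));

        dim3 block(128);                          // block_size
        dim3 grid((n + block.x - 1) / block.x);   // grid_size: ceil(n / block_size)
        size_t ns = block.x * sizeof(float);      // Ns: dynamic shared memory in bytes

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        scale<<<grid, block, ns, stream>>>(d, n); // the full launch form
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d);
        return 0;
    }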

How CUDA concepts map onto the GPU architecture

  • Thread in CUDA: corresponds to a Core (SP) on the GPU.
  • Block in CUDA: can be viewed as being made up of multiple warps, i.e., multiple threads, but it never spans an SM: all threads in a block must reside on the same SM, which is exactly what makes shared memory usable.
  • Warp: the basic unit of scheduling and execution. Every 32 threads in a block form a warp; if the remaining threads come to fewer than 32, they still occupy a full warp, so block_size should preferably be a multiple of 32 (see the sketch after this list). The 32 threads in a warp execute the same instruction at every step, which is SIMT.
  • Blocks in CUDA run entirely within a single SM but can be scheduled onto different SMs, and the same SM can run multiple blocks at the same time (concurrently). Because resources within an SM, such as shared memory and registers, are limited, there are caps: a maximum number of resident blocks per SM and a maximum number of resident threads per SM. Given these caps, the number of threads in a block should be at least (maximum resident threads per SM) / (maximum resident blocks per SM). The ratio differs across architectures; for the GTX 1080 Ti it is 2048 / 32 = 64, and 128 is a good general default for block size (the sketch below reads these limits from the device).
  • An SM can execute multiple warps in parallel, which is determined mainly by how many warp schedulers the SM has. Can the warps executing in parallel belong only to the same block, or can they come from different blocks? Apparently they can come from different blocks, though opinions on this vary.
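
As a sketch of the two sizing rules above (warp rounding and the residency lower bound), the following reads the relevant limits from the device; note that prop.maxBlocksPerMultiProcessor requires CUDA 11 or newer:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Warp rounding: a block of 100 threads still occupies
        // ceil(100 / 32) = 4 warps, wasting 28 lanes in the last warp.
        int block_size = 100;
        int warps = (block_size + prop.warpSize - 1) / prop.warpSize;
        printf("block of %d threads -> %d warps\n", block_size, warps);

        // Lower bound on block size from the residency caps: smaller blocks
        // hit the resident-block limit before the resident-thread limit,
        // leaving the SM underfilled. (maxBlocksPerMultiProcessor: CUDA 11+)
        int min_block = prop.maxThreadsPerMultiProcessor / prop.maxBlocksPerMultiProcessor;
        printf("min useful block size: %d\n", min_block);  // 2048 / 32 = 64 on GTX 1080 Ti
        return 0;
    }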

Origin: blog.csdn.net/qq_43219379/article/details/125346948