Instruction latency hiding

1. Instruction latency hiding

1. Latency and latency hiding

  • Instruction latency is the number of clock cycles an instruction takes from being issued until it completes.
  • If, on every clock cycle, there are ready warps available to execute, the GPU stays in a fully occupied compute state.
  • The masking of instruction latency by keeping the GPU in this fully occupied compute state is called latency hiding.
  • Latency hiding is central to GPU programming: GPUs are designed to run a very large number of lightweight threads, and it is this abundance of threads that covers instruction latency.
  • The number of warps required for latency hiding can be estimated with Little's Law: required parallelism = instruction latency × throughput (see the worked example after this list).
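By Little's Law, the parallelism needed to keep the pipeline busy is latency multiplied by throughput. Below is a minimal host-side sketch of that calculation, using illustrative Fermi-era numbers (an arithmetic latency of 20 cycles and a throughput of 32 operations per cycle per SM); these constants are assumptions, so substitute the values for your own device.

```cuda
#include <stdio.h>

int main(void) {
    // Little's Law: required parallelism = latency x throughput.
    // The constants below are illustrative (Fermi-class arithmetic pipeline);
    // replace them with numbers measured or documented for your GPU.
    const int latency_cycles       = 20;  // arithmetic instruction latency
    const int throughput_per_cycle = 32;  // operations issued per cycle per SM
    const int warp_size            = 32;  // operations produced by one warp instruction

    int ops_in_flight  = latency_cycles * throughput_per_cycle;  // 20 * 32 = 640
    int warps_required = ops_in_flight / warp_size;              // 640 / 32 = 20

    printf("Operations that must be in flight per SM: %d\n", ops_in_flight);
    printf("Warps required per SM:                    %d\n", warps_required);
    return 0;
}
```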

2. Instruction classification

  • GPU instructions fall into two categories: arithmetic instructions and memory access instructions (see the annotated kernel after this list).
  • Arithmetic instruction latency is the number of clock cycles from the start of the operation until its result is available, typically 10 to 20 clock cycles.
  • Memory access instruction latency is the number of clock cycles from the instruction being issued until the data arrives at its destination, typically 400 to 800 clock cycles for global memory.
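A small kernel annotated line by line with the instruction class each statement mostly generates; the kernel name is hypothetical and the latency figures in the comments simply echo the ranges above rather than measurements.

```cuda
// Hypothetical example kernel: comments note the dominant instruction class per line.
__global__ void axpy(const float *x, const float *y, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer arithmetic, ~10-20 cycles
    if (i < n) {
        float xi = x[i];       // global memory load, ~400-800 cycles
        float yi = y[i];       // global memory load, ~400-800 cycles
        out[i] = a * xi + yi;  // fused multiply-add (~10-20 cycles), then a global store
    }
}
```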

2. Parallelism requirements for arithmetic operation instructions

1. Concept of parallelism requirements

  • The parallelism requirement is the number of in-flight operations needed to keep the GPU running at full capacity.
  • For arithmetic instructions, it is measured as the number of operations required to hide arithmetic instruction latency.

  • The 32 threads in a warp execute the same instruction, so one warp instruction corresponds to 32 operations.
  • With an arithmetic latency of about 20 cycles and a throughput of 32 operations per cycle per SM (Fermi-class hardware), 20 × 32 = 640 operations must be in flight, and 640 / 32 = 20 warps per SM are needed to meet the arithmetic parallelism requirement.
  • Parallelism can be raised in two ways: more independent instructions within each thread (instruction-level parallelism) and more concurrently resident threads and warps (thread-level parallelism); see the sketch after this list.
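A minimal sketch of the two approaches, using assumed kernel names and an assumed unroll factor of 4 (none of these specifics come from the original post): scale4 gives each thread four independent operations so the scheduler has instructions to issue while earlier ones are still in flight, while thread-level parallelism comes from launching enough blocks to keep many warps resident on each SM.

```cuda
// Baseline: one operation per thread; latency can only be hidden by other warps.
__global__ void scale1(const float *in, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// More instruction-level parallelism: four independent operations per thread,
// so the warp scheduler can issue later instructions while earlier ones are
// still in flight. (Tail elements when n is not a multiple of 4 are omitted
// for brevity.)
__global__ void scale4(const float *in, float *out, float a, int n) {
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        out[i]     = a * in[i];
        out[i + 1] = a * in[i + 1];
        out[i + 2] = a * in[i + 2];
        out[i + 3] = a * in[i + 3];
    }
}

// Thread-level parallelism comes from the launch configuration: supply enough
// blocks and threads that every SM has more eligible warps than its
// parallelism requirement, e.g.
//   scale4<<<(n / 4 + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
```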

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

Origin blog.csdn.net/m0_46521579/article/details/132798833