NVIDIA GPU SM and CUDA programming understanding

SM Hardware Architecture Basics

For changes in different architectures, please refer to:

A review of GPU architecture changes from the perspective of AI systems -- from Fermi to Ampere (V1.2) - bazyd

Nvidia's GPU architecture has evolved for nearly a decade, from Fermi to Ampere - Programmer Sought

Volta GV100 Streaming Multiprocessor (SM)

GA100 Streaming Multiprocessor (SM)

GA102 Streaming Multiprocessor (SM)

The diagrams above show the SMs of several different architectures. Some notable similarities and differences are worth pointing out:

Each SM is divided into 4 sub-blocks (processing blocks). Pay attention to which resources are shared by the 4 sub-blocks and which are private to each sub-block.

For example, shared memory and the L1 cache are shared by all 4 sub-blocks of an SM, while the register file, CUDA cores, etc. are private to each sub-block. Keeping this in mind is instructive for CUDA programming practice and understanding.

Pay attention to the number of CUDA cores in each sub-block. For example, each sub-block of GV100 and GA100 has 16 INT32 and 16 FP32 CUDA cores, 8 FP64 CUDA cores, and 4 SFUs, while GA102 has no FP64 CUDA cores. The latest Hopper architecture has 32 rather than 16 FP32 CUDA cores per sub-block.

Pay attention to the number of TensorCores for each sub-block and their specific parameter specifications.

CUDA SIMT Programming Model Fundamentals

A few questions that need to be clarified:

What is a CUDA core, and how does it relate to threads?

What is the relationship between warps, thread blocks, and SMs?

How should switching between different warps be understood?

Programs written for the CPU are generally executed serially by a single thread. In SIMD (Single Instruction Multiple Data), one instruction is applied to many data elements at the same time. Nvidia GPUs, on the other hand, use the SIMT (Single Instruction, Multiple Threads) model for parallel computing.

We first need to write a kernel function, and eventually thousands of threads will be created. Each thread independently executes the same kernel instructions, but processes different data:

Although each thread executes the same instruction, all threads are managed according to two levels of block and grid:

add<<<grid_size, block_size>>>(a, b, c);

Each thread block contains dozens to hundreds of threads (generally an integer multiple of 32), and the threads inside a thread block are executed in warps of 32 threads. The 32 threads inside a warp execute in lockstep (each thread executes the same instruction at the same time). Finally, multiple blocks form the whole grid.
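As a concrete illustration (a minimal sketch, not taken from the original text; it assumes grid_size * block_size exactly covers the data), each thread derives a unique global index from its block and thread IDs and processes one element:

__global__ void add(const float* a, const float* b, float* c) {
    // Global index of this thread: offset of its block plus its position in the block.
    // Assumes the launch covers exactly as many threads as there are elements;
    // otherwise pass the element count and add a bounds check.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

// Launched as in the text, e.g. add<<<grid_size, block_size>>>(a, b, c);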

Why this organization corresponds directly to the hardware architecture:

Mapping this back to the SM hardware architecture above: all threads of a thread block are executed on the same SM (spread over its 4 sub-blocks), and one SM can have multiple thread blocks resident and executing at the same time. The entire GPU generally has dozens to hundreds of SMs, depending on the specific hardware specifications.

Meanwhile, the 32 threads of a warp are executed within the same SM sub-block, while the multiple warps of one thread block may be distributed across several SM sub-blocks for execution.

Warp switching within the same sub-block: one way the GPU differs from the CPU is that thread switching is extremely fast. This is because the state of each thread and thread block is kept directly in hardware resources, instead of first saving registers to memory and then loading the new thread's state from memory back into registers before execution. A few points to note:

1. When the current warp is waiting for something (for example, its threads are reading global memory, which takes hundreds of clock cycles), the scheduler switches to another warp for execution, which significantly improves hardware utilization and performance.

2. Therefore, although an SM can only issue from 4 warps at a time (one per sub-block), enough warps should be resident so that there is always something to switch to. The per-block thread limit and the upper limit on thread blocks per SM can be queried through the CUDA device-query interface. However, since every thread and thread block consumes real register and shared memory resources, and those resources are finite, the resource usage of each thread and thread block determines how many threads per block and how many blocks per SM can actually be resident. A real program should therefore plan the number of threads per block and the register and shared memory usage per thread and per block carefully, so that each SM always has enough eligible warps — generally well beyond the 4 warps that can actually issue at once, i.e. at least double that or more.
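For reference, the hardware limits mentioned in point 2 can be read at runtime with the standard device-query API (a minimal sketch; the choice of fields to print and the omission of error checking are just for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // device 0
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("warp size:               %d\n", prop.warpSize);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max blocks per SM:       %d\n", prop.maxBlocksPerMultiProcessor); // CUDA 11+
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}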

Each warp's execution context (program counter, registers, etc.) is maintained on-chip for the lifetime of the warp, and register files, data caches, and shared memory are partitioned among the resident thread blocks. Consequently, switching to another warp on the next cycle carries no cost penalty, unlike other kinds of context switches. However, the maximum number of thread blocks and warps that can reside on an SM is limited by the capacity of the GPU. The next instruction issued can come from the same warp if it has no dependency on the previous instruction, or, more often, from another warp. Many arithmetic instructions take 2 clock cycles to execute.

The specific execution logic of each thread instruction:

What is the relationship between the tens of thousands of threads executed by a CUDA SIMT program and the CUDA cores?

At first it is easy to assume that each thread runs on its own CUDA core, but that is not the case.

We can think of a kernel as a series of instructions. Assume the next instruction is an INT32 operation. The Nvidia GPU dispatches a warp of 32 threads to 16 INT32 arithmetic units to execute instructions simultaneously (or to 16 FP32 units for FP32 operations).

Note that the 32 threads of a warp are dispatched to 16 cores rather than 32: as the SM diagrams above show, each SM sub-block has only 16 FP32 and 16 INT32 CUDA cores, so each FP32/INT32 instruction of a warp actually takes 2 clocks to complete (except on Hopper, which already has 32 FP32 CUDA cores per sub-block).

Similarly, if the next instruction is FP64, the 32 threads of the warp are dispatched to only 8 FP64 CUDA cores, so even more cycles are required.

The CUDA cores in Fermi provide both FP and INT operations (time multiplexed), but similar to the V100 and Turing GPUs, Ampere separates them into separate INT32, FP32 and FP64 units. By separating the FP32 and INT32 cores, it allows concurrent execution of FP32 and INT32 operations and increases instruction issue throughput. Many applications have inner loops that perform pointer arithmetic (integer memory address calculations) combined with floating-point calculations that benefit from the simultaneous execution of FP32 and INT32 instructions. Each iteration of the pipelined loop can update addresses (INT32 pointer arithmetic) and load data for the next iteration while processing the current iteration in FP32.

Here is another view of instruction issue and execution within a Volta processing block (sub-core).

Some considerations and optimization points of CUDA programs

The basic principle is to keep the GPU fully occupied: each SM should execute multiple thread blocks concurrently, and the grid should contain enough thread blocks to feed all the SMs.

An SM can execute multiple thread blocks concurrently:

An SM needs enough resident warps to run concurrently and to switch between, in order to sustain performance. The number of warps each SM can have resident at the same time is the minimum of two limits:

1. Hardware limits and kernel launch parameters. The maximum threads per block and blocks per SM are fixed by hardware and can be queried through the device-query interface. When resources are plentiful relative to what each thread and block consumes, the number of warps resident on an SM is limited by the launch parameters of the kernel, and too small a block size leads to too few threads executing on the SM at once. Generally, a thread block needs 128 or 256 threads to fully occupy an SM; this parameter can be tuned to find an optimal value.

2. Resource usage of threads and thread blocks. If a thread block uses too much shared memory — for example, one block consumes more than half of all shared memory — then an SM can execute at most one such block. To ensure that one SM can run multiple thread blocks concurrently, each block should obviously use only a fraction of the SM's total shared memory. The same holds for registers: with reasonable register usage, a warp can run its 32 threads simultaneously and a sub-block's register file can hold several resident warps, but excessive register usage prevents a sub-block from hosting multiple warps, and in extreme cases the register file is not even enough for the 32 threads of a single warp.
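To check point 2 for a concrete kernel, the runtime's occupancy API reports how many blocks of that kernel can be resident per SM given its register and shared memory usage. A minimal sketch (my_kernel is just a stand-in; substitute the real kernel):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; replace with the real kernel to see its actual occupancy.
__global__ void my_kernel(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

int main() {
    int block_size = 256;
    int max_blocks_per_sm = 0;
    // Blocks of my_kernel (256 threads, 0 bytes of dynamic shared memory) that can
    // be resident on one SM, limited by its register and shared memory consumption.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, my_kernel,
                                                  block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident_warps = max_blocks_per_sm * block_size / prop.warpSize;
    printf("blocks/SM: %d, resident warps/SM: %d (hardware limit %d)\n",
           max_blocks_per_sm, resident_warps,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}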

A grid should have enough thread blocks

A kernel launch corresponds to one grid, which must contain enough thread blocks to make full use of all the SMs of the GPU. Since each SM itself should host multiple resident thread blocks, the grid should supply the GPU's dozens to hundreds of SMs with a correspondingly large multiple of that many blocks.

Here is an example from an actual reduce/layer_norm computation in deep learning. Suppose we compute the mean of each row over the innermost dimension of a [200, 768] tensor. With the naive approach of one thread per row, there are only 200 threads in total, so only one or two thread blocks are generated and only one or two SMs are used — the performance is obviously terrible. If instead one warp computes each row, there are 200 warps; with 4 warps per thread block, that is 50 thread blocks, which can use most of the SMs. One thread block per row is also possible, giving 200 thread blocks and even higher SM utilization.

Of course, this reduce example involves other trade-offs, because a reduction needs to exchange data between threads. When one warp computes a row, recall that each thread's registers live directly in hardware and the whole warp runs on the same SM sub-block, whose threads share one register file; data exchange between different sub-blocks can at best go through shared memory. Threads within a warp can therefore exchange register data directly via warp shuffle (warp-level primitives), which is faster, whereas one thread block computing a row must first exchange data through shared memory. How to balance this trade-off depends on the amount of work per row (the number of elements).
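For illustration, a common block-per-row pattern (a minimal sketch, not from the cited material) combines both mechanisms: each warp first reduces its own values via shuffle, each warp's partial sum goes through shared memory, and the first warp then combines the partials:

// Block-level sum: warp shuffles first, shared memory only for the per-warp partials.
// Assumes blockDim.x is a multiple of 32 and at most 1024 (i.e. at most 32 warps).
__inline__ __device__ float BlockReduceSum(float val) {
    __shared__ float warp_partials[32];              // one slot per warp
    int lane = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;

    // 1) Reduce within each warp through register exchange.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);

    // 2) Lane 0 of each warp publishes its partial sum to shared memory.
    if (lane == 0) warp_partials[warp_id] = val;
    __syncthreads();

    // 3) The first warp reduces the per-warp partials.
    int num_warps = blockDim.x / warpSize;
    val = (threadIdx.x < num_warps) ? warp_partials[lane] : 0.0f;
    if (warp_id == 0)
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xFFFFFFFF, val, offset);

    return val;                                      // thread 0 holds the block-wide sum
}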

Some resource information for the Ampere architecture:

NVIDIA Ampere GPU Architecture Tuning Guide :: CUDA Toolkit Documentation

1.4.1.1. Occupancy
The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64), and other factors influencing warp occupancy are:
‣ The register file size is 64K 32-bit registers per SM.
‣ The maximum number of registers per thread is 255.
‣ The maximum number of thread blocks per SM is 32 for devices of compute capability 8.0 (i.e., A100 GPUs) and 16 for GPUs with compute capability 8.6 (GA102/GA104, e.g., the RTX 3060).
‣ For devices of compute capability 8.0 (i.e., A100 GPUs) shared memory capacity per SM is 164 KB, a 71% increase compared to V100's capacity of 96 KB. For GPUs with compute capability 8.6, shared memory capacity per SM is 100 KB.
‣ For devices of compute capability 8.0 (i.e., A100 GPUs) the maximum shared memory per thread block is 163 KB. For GPUs with compute capability 8.6 maximum shared memory per thread block is 99 KB.

CUDA Storage Hierarchy and Considerations

register

What data will be automatically used as registers?

The usage of shared memory is usually clearly defined and known by the user, but how to determine the usage of registers?

How many registers are available to each thread? How to avoid excessive register usage?

Unlike the CPU, the GPU gives each thread its own hardware registers. When switching threads, there is no saving of registers to memory and reloading from memory, so thread switching is very cheap. However, the SM's total register capacity together with the number of registers used by each thread determines how many threads can be resident concurrently; shared memory imposes the same kind of limit.

The register and shared memory usage of each thread block limits the number of thread blocks each SM can execute concurrently. In addition, excessive register usage may cause register spilling, where data is stored out to local memory, resulting in a performance drop.

https://developer.download.nvidia.com/CUDA/training/register_spilling.pdf

CMakeLists.txt setting to view register and other resource usage at compile time:

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --ptxas-options=-v")
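If nvcc is invoked directly rather than through CMake, the equivalent flag can be passed on the command line (the file name is just a placeholder):

nvcc -Xptxas -v kernel.cu -o kernel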

Do arrays declared in a kernel live in registers or in local memory?

In CUDA programming, is an array allocated by a thread placed in registers or in local memory? - CSDN Blog

Fast Dynamic Indexing of Private Arrays in CUDA | NVIDIA Technical Blog

In short, only when the compiler can statically determine every index into the array will the array be placed in registers; otherwise it goes to local memory.
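A small illustration of that rule (a sketch written for this note, not taken from the cited posts): the first kernel accesses its private array only with indices that are known after loop unrolling, so the array can be promoted to registers; the second uses a data-dependent index, which generally forces the array into local memory.

// Version 1: after unrolling, every index into buf[] is a compile-time constant,
// so the private array can live entirely in registers.
__global__ void static_index(const float* __restrict__ in, float* __restrict__ out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float buf[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i) buf[i] = in[tid * 4 + i];
    out[tid] = buf[0] + buf[1] + buf[2] + buf[3];
}

// Version 2: the final index depends on runtime data, so the compiler generally
// has to place buf[] in local memory (verify with --ptxas-options=-v).
__global__ void dynamic_index(const float* __restrict__ in, float* __restrict__ out,
                              const int* __restrict__ idx, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float buf[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i) buf[i] = in[tid * 4 + i];
    out[tid] = buf[idx[tid] & 3];
}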

Warp shuffle

https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf

Warp Level Primitives Using CUDA - NVIDIA Tech Blog

Since all threads of a warp run on the same SM sub-block, their registers live in the same register file, which allows threads within a warp to exchange data efficiently through registers. The warp shuffle instructions provide exactly this function. Between different warps of the same thread block, the only efficient way to exchange data is through shared memory.

Code that accumulates the data of all threads in a warp into the first thread using warp shuffle:

template <typename T>
__inline__ __device__ T WarpReduceSum(T data) {
#pragma unroll 5 // for warp_size = 32
    for (int offset = 16; offset > 0; offset >>= 1) {
        data += __shfl_down_sync(0xFFFFFFFF, data, offset);
    }
    // optional broadcast value of the first thread to all threads in warp
    data = __shfl_sync(0xFFFFFFFF, data, 0);
    return data;
}

Shared memory

What causes bank conflicts and how can they be avoided?

https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

https://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf

When a bank conflict occurs, the warp is not switched out to hide the delay (the conflicting accesses are simply serialized), so the impact on performance is relatively large.

The common way to avoid bank conflicts is padding: lengthen the row stride of the original matrix so that the actual matrix is a sub-matrix of the matrix stored in shared memory. For a matrix of width 32, adding +1 of padding per row lets different threads access the same column without bank conflicts (in the referenced figure, d_id is the id of a data element and b_id the id of its bank). Other paddings such as +4 or +8 also work and additionally keep each row 128/256-byte aligned.
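A minimal sketch of the padded layout (the kernel and its names are illustrative): one warp computes the row sums of a 32x32 tile; at every step the 32 threads read one element from each of the 32 rows, which would be a 32-way bank conflict with a row stride of 32 but is conflict-free with a stride of 33.

// Row sums of a 32x32 matrix computed by a single warp (block = (32, 1)).
__global__ void row_sum_padded(const float* __restrict__ in, float* __restrict__ out) {
    // +1 padding: row stride 33 instead of 32, so tile[0][i], tile[1][i], ...,
    // tile[31][i] map to 32 different banks ((r*33 + i) % 32 == (r + i) % 32).
    __shared__ float tile[32][33];

    int lane = threadIdx.x;
    for (int r = 0; r < 32; ++r)
        tile[r][lane] = in[r * 32 + lane];     // coalesced loads from global memory
    __syncwarp();

    float s = 0.0f;
    for (int i = 0; i < 32; ++i)
        s += tile[lane][i];                    // each thread walks its own row; at step i the warp
                                               // touches column i of all 32 rows, which is
                                               // conflict-free only because of the padding
    out[lane] = s;
}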

How to implement double buffer/prefetch?

Double buffering uses two buffers to pipeline memory reads/write-backs with computation, keeping the compute units busy all the time. Implementing it requires asynchronous execution so that computation and data copies can overlap.

Loading from global memory into a register is itself asynchronous (it does not block subsequent instructions until the register is actually used). However, architectures before Ampere have to stage global-to-shared copies through registers, and since the write to shared memory depends on the register being ready, it must wait for the global memory load to complete. To get asynchrony from global to shared on pre-Ampere architectures, the staging can be done manually: first issue the global-to-register load, then execute other independent computation instructions, and only afterwards copy the register contents to shared memory, thereby hiding the global memory latency.

Ampere introduces a new asynchronous copy instruction, LDGSTS, which reads from global memory into shared memory without going through registers. This reduces register pressure and unnecessary data movement and also saves power. Because the instruction is asynchronous, it can run as a background operation overlapped with foreground compute instructions, further improving overall efficiency.
A simple demo code for double buffer:

prefetch data block0
for loop:
    prefetch next block
    compute cur block

CUDA 11.0 introduces an async-copy feature that can be used within device code to explicitly
manage the asynchronous copying of data from global memory to shared memory. This
feature enables CUDA kernels to overlap copying data from global to shared memory with
computation. It also avoids an intermediary register file access traditionally present between
the global memory read and the shared memory write.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#async_data_operations
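For illustration, here is a minimal double-buffered sketch written against the CUDA 11 pipeline primitives (__pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior, which map to LDGSTS on Ampere). The kernel shape, tile size, and the trivial "compute" step are assumptions made for the example; the programming guide's cuda::pipeline / memcpy_async APIs are the more general interface.

#include <cuda_pipeline.h>   // pipeline primitives, CUDA 11+

#define TILE 256             // elements per tile, one element per thread

// Double-buffered sum over num_tiles tiles of TILE floats; launch with one block
// of TILE threads. Illustrative only: the "compute" is just an accumulation.
__global__ void double_buffer_sum(const float* __restrict__ in, float* __restrict__ out,
                                  int num_tiles) {
    __shared__ float buf[2][TILE];
    int tid = threadIdx.x;

    // Prefetch tile 0 into buffer 0 (asynchronous; no register staging on Ampere).
    __pipeline_memcpy_async(&buf[0][tid], &in[tid], sizeof(float));
    __pipeline_commit();

    float acc = 0.0f;
    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < num_tiles) {    // start copying the next tile in the background
            __pipeline_memcpy_async(&buf[nxt][tid], &in[(t + 1) * TILE + tid], sizeof(float));
            __pipeline_commit();
        }
        __pipeline_wait_prior(t + 1 < num_tiles ? 1 : 0);  // current tile has arrived
        __syncthreads();
        acc += buf[cur][tid];       // "compute" on the current tile while the next streams in
        __syncthreads();
    }
    atomicAdd(out, acc);
}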

global memory

Questions to be clarified:

How to understand and apply coalesced memory accesses?

Coalesced memory access is one of the most important considerations in performance optimization.

Coalesced access means that the addresses read by the 32 threads of a warp at the same time form a contiguous span of n*128 bytes (with each thread reading 4 bytes; the element index read by a thread does not have to match its thread id, as long as the warp as a whole reads a contiguous region), and the access must be aligned (the start address of the contiguous region is an integer multiple of 128 bytes).

Note that for data types such as int8 or fp16, if each thread reads only 1 element, a warp reads less than 128 bytes. In that case, vector types such as float2 or float4 can be used to load a wider chunk per thread, with the elements then unpacked inside the thread.

Coalesced memory access: adjacent threads should preferably read adjacent data elements.

Vectorized load/store: data elements come in various types with different sizes, so it is best to combine coalesced loads/stores with vectorized loads/stores, i.e., each thread reads a vector type such as float2, float4, or half4. A thread should not read a type that is too wide: generally read at least 4 bytes (float / half2) but usually no more than 16 bytes (float4 / half8) per access.

cuda-samples/helper_math.h at master · NVIDIA/cuda-samples · GitHub
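A minimal sketch of a coalesced, vectorized elementwise kernel (assuming the element count is a multiple of 4 and the pointers are 16-byte aligned, as cudaMalloc guarantees; a real kernel would add a tail path for the remainder):

// Each thread loads and stores 16 bytes (one float4): coalesced and vectorized.
__global__ void scale_vec4(const float4* __restrict__ in, float4* __restrict__ out,
                           float alpha, int n4) {   // n4 = number of float4 elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];                           // one 16-byte transaction per thread
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        out[i] = v;
    }
}

// Illustrative launch for n floats (n % 4 == 0):
// scale_vec4<<<(n / 4 + 255) / 256, 256>>>(
//     reinterpret_cast<const float4*>(d_in), reinterpret_cast<float4*>(d_out), 2.0f, n / 4);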

If the matrix row length is not an integer multiple of 4, 8, or 32 elements, padding can be considered; otherwise the starting address of rows other than the first is not 128-byte aligned.
The following matrix transposition example is a good illustration of how to use shared memory to achieve coalesced access for both the read of the input matrix and the write-back of the output matrix while also avoiding shared memory bank conflicts: CUDA learning (2) matrix transposition and optimization (coalesced access, shared memory, bank conflict) - bzdww

The idea is that each warp handles a 32x32 tile: the 32 threads of the warp read the rows of the tile one after another, so the input read is coalesced. The transposition itself is just an index transformation applied when reading from or writing to shared memory. Then, based on the transposed tile in shared memory, the rows of the output matrix are written back one after another, so the output write is coalesced as well. Since 32 threads always access the same row or column of the 32x32 tile at the same time, a plain 32x32 shared memory tile causes bank conflicts either on the read or on the write-back. But with a [32][33] tile, each load writes shared mem[i][0:32] and each write-back reads shared mem[0:32][i], and no bank conflict occurs.
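A sketch of that scheme, following the well-known NVIDIA transpose example (here a 32x8 thread block covers one 32x32 tile, rather than literally one warp per tile, and rows/cols are assumed to be multiples of 32):

#define TILE_DIM 32
#define BLOCK_ROWS 8

// Transpose a rows x cols matrix; block = (32, 8), each block moves one 32x32 tile.
__global__ void transpose_smem(const float* __restrict__ in, float* __restrict__ out,
                               int rows, int cols) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;     // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;     // row in the input

    // Coalesced reads of the input tile, row by row.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * cols + x];
    __syncthreads();

    // The output tile is the mirrored tile: swap the block coordinates.
    x = blockIdx.y * TILE_DIM + threadIdx.x;         // column in the output
    y = blockIdx.x * TILE_DIM + threadIdx.y;         // row in the output

    // Coalesced writes of the output; the column-wise read of tile[][] is
    // conflict-free thanks to the padding.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Illustrative launch: dim3 block(32, 8), grid(cols / 32, rows / 32);
// transpose_smem<<<grid, block>>>(d_in, d_out, rows, cols);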

constant memory, texture memory

These are generally only used in specific application scenarios. Is there still room to apply them in AI workloads?

CUDA Warp-Level Primitives

Using CUDA Warp-Level Primitives | NVIDIA Technical Blog

Register Cache: Caching for Warp-Centric CUDA Programs | NVIDIA Technical Blog

Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

Thread blocks are automatically scheduled onto the SM and executed in units of warps. This process is mostly invisible to the programmer, but it can also be manipulated explicitly at the warp level. For example, warp-level register exchange: because the threads of one warp execute on the same SM sub-block and their registers sit in the same register file, threads of the same warp can exchange register data with each other (register shuffle), whereas different warps may execute on different sub-blocks and can only exchange data through shared memory.

CUDA 9 introduced three categories of new or updated warp-level primitives.

  1. Synchronized data exchange: exchange data between threads in a warp.
    • __all_sync, __any_sync, __uni_sync, __ballot_sync
    • __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
    • __match_any_sync, __match_all_sync
  2. Active mask query: returns a 32-bit mask indicating which threads in a warp are active together with the current executing thread.
    • __activemask
  3. Thread synchronization: synchronize threads in a warp and provide a memory fence.
    • __syncwarp

Please see the  CUDA Programming Guide for detailed descriptions of these primitives.

Here is an example that uses one warp per row to compute the mean of each row of a two-dimensional tensor, based on warp shuffle:

__global__ void reduce_mean_row_warp(const float* __restrict__ A,
                                     float* __restrict__ B,
                                     int row, int col) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int cur_row = tid / warpSize;
    int start_col = tid % warpSize;

    if (cur_row < row) {
        float ratio = 1.0f / col;
        int addr_offset = cur_row * col;

        float mean_val = 0;
        for (int i = start_col; i < col; i += warpSize) {
            mean_val += ratio * A[addr_offset + i]; // method 1
        }
        // use warp shuffle to get correct mean for thread 0 from all threads in a warp
        mean_val += __shfl_down_sync(0xFFFFFFFF, mean_val, 16);
        mean_val += __shfl_down_sync(0xFFFFFFFF, mean_val, 8);
        mean_val += __shfl_down_sync(0xFFFFFFFF, mean_val, 4);
        mean_val += __shfl_down_sync(0xFFFFFFFF, mean_val, 2);
        mean_val += __shfl_down_sync(0xFFFFFFFF, mean_val, 1);

        if (start_col == 0) {
            B[cur_row] = mean_val;
        }
    }
}

TensorCore

How do threads in a warp use TensorCore together?

to do

Other common considerations

Warp divergence caused by branches should be avoided as much as possible, for example by arranging the work so that all threads of a warp take the same branch.

The __restrict__ keyword may enable some compiler optimizations; it has essentially the same semantics as the C99 restrict keyword (see its use in the kernels above).

The most important thing in performance optimization is to know where the bottleneck is

1. Where is the bottleneck of the whole model? Is it memory allocation or data copies? Or are certain operators time-consuming?

2. Where is the bottleneck within a single operator? The computation itself? Data reads and writes? The bias calculation?

Some new features of CUDA

Asynchronous memory allocation

Memory allocation and reuse is an extremely important part of an inference engine, because every allocation and release of memory is time-consuming. It is usually necessary to implement a memory pool, allocate memory in advance, and then reuse it from the pool to improve performance.

However, CUDA 11.2 introduces new APIs that implement this functionality under the hood, eliminating the need for users to implement complex memory-reuse schemes themselves:

cudaMallocAsync(&ptr, size, stream); // Allocates physical memory
kernel<<<...,stream>>>(ptr);
cudaFreeAsync(ptr, stream);          // releases memory back into a pool
cudaMallocAsync(&ptr, size, stream); // Reuses previously freed pointer
kernel<<<...,stream>>>(ptr);
cudaFreeAsync(ptr, stream);          // releases memory back into a pool
....                                 // Executes other work in the stream

profiling tool

nvprof

nvprof python xx.py

nvprof xx_bin

NVIDIA Nsight Systems

NVIDIA Nsight Systems | NVIDIA Developer

On the Linux side, download the Linux Host .run installer; on the Windows side, download the Windows Host installer; install each separately.

Run nsys profile xx to generate a qdrep file, then open it with the Windows host application to see the visualized timeline.

To use nsys on Windows, first add the nsys path to the system PATH environment variable: C:\Program Files\NVIDIA Corporation\Nsight Systems 2022.3.4\target-windows-x64

Then execute from the Anaconda prompt command line:

nsys profile D:\ProgramData\Anaconda3\python.exe matmul_tf.py

After upgrading Python, running python xx.py directly may fail to find the interpreter; in that case use the full path as shown above.

ref

A review of GPU architecture changes from the perspective of AI systems -- from Fermi to Ampere (V1.2) - bazyd

Nvidia's GPU architecture has evolved for nearly a decade, from Fermi to Ampere - Programmer Sought

GPU architecture and rendering - Zhihu

"CUDA Parallel Programming GPU Programming Guide"

《 PROFESSIONALCUDA C Programming》

Diving Deep Into The Nvidia Ampere GPU Architecture

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

NVIDIA A100 Tensor Core GPU Architecture

Drilling Down Into Nvidia’s “Pascal” GPU

Overall Process Design (1) - Hierarchical Structure of CUDA Programs - Zhihu

https://jonathan-hui.medium.com/ai-chips-a100-gpu-with-nvidia-ampere-architecture-3034ed685e6e

In-depth analysis of Nvidia Ampere architecture - Programmer Sought

https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

CUDA micro-architecture and instruction set (4) - instruction issue and warp scheduling - Zhihu

Programming Guide :: CUDA Toolkit Documentation

How to evaluate the new GPU H100 released by Nvidia on March 22? - Zhihu

NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

Statement:

Parts of this article draw on the content of the documents and web pages cited above.


Origin blog.csdn.net/u013701860/article/details/121311135