Fermi architecture GPU personal notes

Some reference articles:
  • CUDA C++ Programming Guide
  • NVIDIA's Fermi: The First Complete GPU Architecture
  • Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them
  • Nvidia GPU Architecture - Fermi Architecture
  • Notes on Principles of Computer Organization - GPU Graphics Processor
  • Some Notes on GPU (SIMT)
  • NVIDIA GPU architecture review
  • CUDA programming study notes-02 (GPU hardware architecture)
Check the compute capability corresponding to your GPU. My configuration:
(Image omitted)
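The compute capability and the hardware limits discussed below can also be queried programmatically. A minimal sketch using the CUDA runtime API (device index 0 is an assumption):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("name:                  %s\n", prop.name);
    std::printf("compute capability:    %d.%d\n", prop.major, prop.minor);
    std::printf("SM count:              %d\n", prop.multiProcessorCount);
    std::printf("warp size:             %d\n", prop.warpSize);
    std::printf("max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```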

1 GPU hardware architecture

(Figures: Fermi architecture block diagram and an enlarged view of one SM, omitted)

  • Host interface: Connects the GPU to the CPU via PCI-Express
  • GigaThread / GigaThread Engine: the global scheduler that distributes thread blocks to the SMs' warp schedulers

This switching is managed by the chip-level GigaThread hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.
This centralized scheduler is another point of departure from conventional CPU design. In a multicore or multiprocessor server, no one CPU is “in charge”. All tasks, including the operating system’s kernel itself, may be run on any available CPU. This approach allows each operating system to follow a different philosophy in kernel design, from large monolithic kernels like Linux’s to the microkernel design of QNX and hybrid designs like Windows 7. But the generality of this approach is also its weakness, because it requires complex CPUs to spend time and energy performing functions that could also be handled by much simpler hardware.
With Fermi, the intended applications, the principles of stream processing, and the kernel and thread model were all known in advance, so a more efficient scheduling method could be implemented in the GigaThread engine.
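As a rough, hedged illustration (not from the original article): from the programmer's side, giving the GigaThread engine several kernels to manage at once is done by launching independent kernels into different CUDA streams; whether they actually overlap depends on the device and on how many resources each kernel needs.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {          // trivial per-element work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Two independent kernels in two streams: the GigaThread engine is free
    // to schedule their thread blocks onto the SMs concurrently.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```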

  • SM (Streaming Multiprocessor): as shown in the black box in the figure and in the enlarged view on the right, each SM contains 32 Cores (CUDA Cores)
    • Core (CUDA Core): the basic execution unit. Each Core contains an integer arithmetic logic unit (ALU) and a floating-point unit (FPU).
    • Register File: the SM's register file, from which each resident thread gets its own registers
    • SFU (Special Function Unit): special function units for transcendental and other mathematical functions, such as inverse square root, sine, and cosine
    • Warp Scheduler: selects warps and assigns their instructions to the execution units
    • Dispatch Unit: instruction dispatch unit
    • LD/ST (Load/Store): load/store units that compute source and destination operand addresses for 16 threads per cycle (there are 16 LD/ST units per SM); they handle accesses to GMEM, SMEM, and LMEM

Before the term CUDA Core, there was the concept of SP (Streaming Processor), which is essentially the same thing; starting with the Fermi architecture, SP was renamed CUDA Core.

In the architectures released afterwards, different types of Cores are placed in the same SM. For example, starting from the Volta architecture there are separate INT32 units, FP32 units, and Tensor Cores, to better support concurrent execution.
The concept of the CUDA Core has gradually faded, replaced by this finer-grained division of Core types.

In the more recently launched architectures, each SM also contains several Tex (texture) units, which are used to access textures and read-only GMEM.
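A hedged aside (not in the original text): on Kepler and later GPUs, read-only global data can be steered through this texture/read-only cache path from ordinary CUDA C++, either by marking pointers const __restrict__ or by using the __ldg() intrinsic.

```cpp
__global__ void copy_readonly(const float* __restrict__ in,
                              float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);   // load routed through the read-only (texture) data cache
}
```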

  • Warp (thread bundle): the warp is the fundamental unit of dispatch within a single SM (the warp is the basic execution unit of an SM). A warp contains 32 parallel threads (all current GPU architectures seem to use this number; it can be obtained via warpSize inside a kernel). These 32 threads execute in SIMT mode, i.e. all threads execute the same instruction in lock-step, but each thread applies the instruction to its own data and may take its own branch.

If fewer than 32 threads in a warp actually need to work, the warp still runs as a whole, but the surplus threads are in an inactive state.

Inside an SM, a warp is broken down into specific instructions that are issued to the execution units.
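A minimal kernel sketch of the warp concepts above: each thread derives its warp and lane index from warpSize, and with the 40-thread launch suggested in the comment the second warp runs with only 8 active lanes (the kernel and launch shape are illustrative, not from the original text).

```cpp
#include <cstdio>

__global__ void warp_info() {
    int tid  = threadIdx.x;
    int warp = tid / warpSize;   // which warp within the block
    int lane = tid % warpSize;   // lane (0..31) within the warp
    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}

// Example launch: warp_info<<<1, 40>>>();
// 40 threads -> warp 0 has all 32 lanes active, warp 1 has only 8;
// the other 24 lanes of warp 1 are scheduled with the warp but stay inactive.
```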

Personal understanding: an SM contains multiple Cores, but a Core does not represent a runnable thread, and the number of Cores in an SM does not represent the maximum number of parallel threads. The SM's scheduler issues instructions to the Cores so that all Cores can be utilized as fully and efficiently as possible. In general, the number of threads in a thread block is chosen as an integer multiple of the warp size to get the most out of the hardware (not an absolute rule; there are special cases).
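A host-side sketch of that rule of thumb: pick the threads-per-block as a multiple of the warp size and round the grid up so every element is covered (the kernel and the factor of 8 are illustrative assumptions).

```cpp
#include <cuda_runtime.h>

__global__ void times_two(float* data, int n) {   // illustrative kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                   // guard the padded tail threads
}

void launch_times_two(float* d_data, int n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Threads per block as a multiple of the warp size (8 * 32 = 256 today),
    // grid size rounded up so every element is covered.
    int threadsPerBlock = 8 * prop.warpSize;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    times_two<<<blocks, threadsPerBlock>>>(d_data, n);
}
```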

2 The relationship between CUDA and the GPU hardware architecture

The figures here come from a Bilibili video.
(Figures omitted)
This explains well how shared memory corresponds to the shared memory of a Block (created by declaring a variable with the __shared__ qualifier).
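A minimal sketch of block-level shared memory declared with __shared__ (a toy per-block reduction; the fixed size of 256 assumes the kernel is launched with 256 threads per block, a power of two).

```cpp
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];              // one slot per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;      // stage this thread's element in shared memory
    __syncthreads();

    // Tree reduction: all intermediate traffic stays in the block's shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0]; // one partial sum per block
}
```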

  • An SM can hold one or more resident Blocks. How many depends on several factors: (1) the shared memory and register limits; (2) the limits on the number of threads and blocks per SM (see the occupancy sketch after this list).
  • All Blocks resident on an SM keep their context in fixed on-chip locations (note that this differs from the CPU). For example, a block's shared memory stays in shared memory across thread switches, which is why switching threads costs much less than on a CPU.
  • Although multiple Blocks may be resident on an SM, once a Block is scheduled it runs to completion (there is no time-slice rotation as on a CPU).
  • At any one time an SM can only issue one Warp (earlier GPUs) or a small number of Warps (recent GPUs), each of 32 Threads, and following the SIMT principle each warp executes one instruction at a time.
  • Latency is hidden through Warp scheduling.
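To see how the limits in the first bullet play out for a concrete kernel, the CUDA runtime's occupancy query reports how many blocks of that kernel can be resident on one SM at once; a sketch with an illustrative kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float* x) { x[threadIdx.x] += 1.0f; }   // illustrative kernel

int main() {
    const int blockSize = 256;
    int blocksPerSM = 0;
    // How many 256-thread blocks of `dummy` (no dynamic shared memory) fit on one
    // SM, given its register, shared-memory, thread, and block limits.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("resident blocks per SM: %d (%d of %d threads)\n",
                blocksPerSM, blocksPerSM * blockSize, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```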

There is actually very little material on Thread and Block scheduling, and some of it is contradictory, so the content above is mostly my own summary. If there are any mistakes, please point them out.


Origin blog.csdn.net/xxt228/article/details/131882716