A Beginner’s Guide to GPU Architecture and Computing

Most engineers are familiar with CPUs and sequential programming because they have been writing code for CPUs since they started programming. By contrast, relatively little is known about the inner workings of GPUs and what makes them special. Over the past decade, GPUs have become extremely important thanks to their widespread use in deep learning, so every software engineer should understand the basics of how they work. This article aims to provide that background.

 

The author of this article is software engineer Abhinav Upadhyay. He based most of the content on the fourth edition of "Programming Massively Parallel Processors" (Hwu et al.), which covers GPU architecture and the GPU execution model. While the article focuses on NVIDIA GPUs, the basic concepts and methods of GPU programming also apply to products from other vendors.

 

(This article is compiled and published by OneFlow. Please contact us for authorization for reprinting. Original text: https://codeconfessions.substack.com/p/gpu-computing)

 

Author | Abhinav Upadhyay

Compiled by OneFlow

Translation | Wan Zilin, Yang Ting

 

1

 

Comparing CPUs and GPUs

 

First, we will compare CPUs and GPUs. This helps frame the discussion of GPU computing, but it is a big topic that is hard to cover in a single section, so we will just highlight a few key points.

 

The main difference between CPUs and GPUs is their design goals. CPUs are designed to execute sequential instructions [1]. Over the years, many features have been introduced in CPU design to improve sequential execution performance. The focus is on reducing instruction execution latency so that the CPU can execute a series of instructions as quickly as possible. These features include instruction pipelines, out-of-order execution, speculative execution, and multi-level caching (to name just a few).

 

GPUs are designed for massive parallelism and high throughput, at the cost of moderate to high instruction latency. This design direction has been shaped by their widespread use in video games, graphics, numerical computing, and now deep learning, all of which require enormous amounts of linear algebra and numerical computation to be performed at very high speed. A lot of effort has therefore gone into improving the throughput of these devices.

 

Let's consider a concrete example: a CPU can add two numbers faster than a GPU due to lower instruction latency. When performing multiple such calculations in sequence, the CPU is able to complete them faster than the GPU. However, when millions or even billions of such calculations need to be performed, the GPU will complete these computing tasks faster than the CPU due to its powerful massive parallel capabilities.

 

We can illustrate this with concrete numbers. Hardware performance in numerical computation is measured in floating-point operations per second (FLOPS). NVIDIA's Ampere A100 offers a throughput of 19.5 TFLOPS at 32-bit precision. By comparison, a 24-core Intel processor (as of 2021) offers only about 0.66 TFLOPS at the same precision. And the gap in throughput between GPUs and CPUs has been widening year after year.

 

The figure below compares the architecture of CPU and GPU.

 

Figure 1: Comparison of chip designs of CPU and GPU. Quoted from "CUDA C++ Programming Guide" (NVIDIA)

 

As the figure shows, CPUs devote a large share of chip area to features that reduce instruction latency, such as large caches, relatively few arithmetic logic units (ALUs), and more control units. In contrast, GPUs spend the chip area on a large number of ALUs to maximize compute power and throughput, and use only a very small fraction of it for caches and control units, the very components that reduce latency on CPUs.

 

Latency tolerance and high throughput

 

You may be wondering how a GPU can tolerate high latency and still deliver high performance. It does so through its large number of threads and massive compute capacity. Even when individual instructions have high latency, the GPU schedules threads so that the compute units are doing useful work at every point in time: when some threads are waiting on instruction results, the GPU switches to other threads that are not waiting. This keeps the compute units running at their maximum capacity and provides high throughput. We will get a clearer picture of this later when we discuss how kernels run on the GPU.

 

2

 

GPU architecture

 

We already know that GPUs are good for achieving high throughput, but how are they architected to achieve this? This section will explore this.

 

GPU computing architecture

 

A GPU consists of an array of streaming multiprocessors (SMs), and each SM is in turn composed of multiple streaming processors, also called cores. For example, the NVIDIA H100 GPU has 132 SMs with 64 cores each, for a total of 8448 cores.

 

Each SM has a certain amount of on-chip memory, often called shared memory or scratchpad memory, which is shared by all of its cores. The SM's control-unit resources are likewise shared by all the cores. In addition, each SM is equipped with hardware-based thread schedulers for executing threads.

 

Each SM is also equipped with several functional units and other accelerated compute units, such as tensor cores or ray-tracing units, to serve the specific compute requirements of the workloads the GPU is built for.

 

 

Figure 2: GPU computing architecture

 

Next, let’s dig into GPU memory and understand the details.

 

GPU memory architecture

 

GPUs have multiple layers of different types of memory, each with a specific purpose. The figure below shows the memory hierarchy of an SM in the GPU.

 

Figure 3: GPU memory architecture based on Cornell University Virtual Workshop

 

Let's dissect it:

 

  • Registers : Let’s start with the registers. Each SM in the GPU has a large number of registers. For example, NVIDIA's A100 and H100 models have 65536 registers per SM. These registers are shared between cores and dynamically allocated based on thread demand. During execution, each thread is assigned private registers that cannot be read or written by other threads.

     

  • Constant cache: Next are the on-chip constant caches, which cache constant data used by the code running on the SM. To take advantage of them, programmers must explicitly declare objects as constants in their code so that the GPU can cache them in the constant cache.

     

  • Shared memory: Each SM also has a shared memory, or scratchpad, a small, fast, low-latency block of on-chip programmable SRAM shared by the thread blocks running on that SM. The idea behind shared memory is that if multiple threads need the same piece of data, only one thread should load it from global memory while the others share it. Used well, shared memory reduces redundant loads from global memory and improves kernel performance. It can also serve as a synchronization mechanism between the threads of a thread block. (A short sketch of constant and shared memory follows this list.)

     

  • L1 cache : Each SM also has an L1 cache, which can cache frequently accessed data from the L2 cache.

     

  • L2 cache: All SMs share an L2 cache, which caches frequently accessed global-memory data to reduce latency. Note that both L1 and L2 are transparent to the SM: the SM does not know whether a given piece of data came from L1 or L2; as far as it is concerned, it is reading from global memory. This is similar to how the L1/L2/L3 caches work in a CPU.

     

  • Global memory: The GPU also has an off-chip global memory, a large-capacity, high-bandwidth DRAM. For example, the NVIDIA H100 has 80 GB of high-bandwidth memory (HBM) with a bandwidth of 3000 GB/s. Because it is far from the SMs, the latency of global memory is quite high. However, the additional layers of on-chip memory and the large number of compute units help hide this latency.
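To make the constant-cache and shared-memory ideas concrete, here is a minimal CUDA sketch (all names are illustrative, and it assumes the kernel is launched with blocks of 256 threads): a __constant__ coefficient that the hardware serves through the constant cache, and a __shared__ tile that each thread block loads from global memory once and then reuses.

    #include <cuda_runtime.h>

    __constant__ float scale;   // constant memory, served through the constant cache

    // Each block loads a 256-element tile into shared memory, then thread 0
    // reuses the data its 255 peers loaded to compute a per-block sum.
    __global__ void scaleAndSum(const float* in, float* out, int n) {
        __shared__ float tile[256];                 // on-chip SRAM, shared by the whole block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;  // one global-memory load per thread
        __syncthreads();                            // wait until the full tile is resident

        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int j = 0; j < blockDim.x; ++j)    // reuse the data from shared memory
                sum += tile[j];
            out[blockIdx.x] = sum;
        }
    }

On the host side, the constant would be set before the launch with cudaMemcpyToSymbol(scale, &value, sizeof(float)).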

 

Now that we understand the key components of GPU hardware, let's take a closer look and understand how these components come into play when executing code.

 

3

 

Understanding the GPU execution model

 

 

To understand how the GPU executes the kernel, we first need to understand what a kernel is and its configuration.

 

Introduction to CUDA Kernels and Thread Blocks

 

CUDA is a programming interface provided by NVIDIA for writing programs that run on its GPUs. In CUDA, you express the calculations you want to run on the GPU in a form similar to a C/C++ function. This function is called a kernel. The kernel operates in parallel on vectors of numbers that are supplied to it as function arguments. A simple example is a kernel that performs vector addition, that is, takes two vectors as input, adds them element-wise, and writes the result to a third vector.

 

To execute a kernel on the GPU, we launch many threads, collectively called a grid. A grid has more structure than that, though: it is composed of one or more thread blocks (sometimes simply called blocks), and each thread block is composed of one or more threads.

 

The number of thread blocks and threads per block depends on the size of the data and the degree of parallelism we want. For example, in the vector addition case, to add two 256-element vectors we can configure a single thread block with 256 threads, so that each thread processes one element. For larger data there may not be enough threads available on the GPU, in which case each thread may need to process multiple data points.
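Jumping ahead slightly to the CUDA syntax shown in the next section, a common way to handle that case is a so-called grid-stride loop: each thread starts at its own global index and then advances by the total number of threads in the grid, so together the threads cover all elements. A minimal, illustrative sketch:

    __global__ void vecAddStrided(const float* a, const float* b, float* c, int n) {
        int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            c[i] = a[i] + b[i];                // each thread handles every stride-th element
    }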

 

Figure 4: Thread block grid. Quoted from "CUDA C++ Programming Guide" (NVIDIA)

 

Writing a kernel involves two parts. The first is the host code, which runs on the CPU: it loads the data, allocates GPU memory, and launches the kernel with the configured grid of threads. The second is the device (GPU) code, which actually executes on the GPU.

 

For the vector addition example, the following figure shows the host code.

 

Figure 5: Host code of the CUDA kernel for adding two vectors.
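In outline, such host code typically looks like the minimal sketch below (error checking omitted; the names are illustrative, and the kernel itself is the device code shown in the next figure):

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n);  // defined in the device code

    void vecAddHost(const float* hA, const float* hB, float* hC, int n) {
        size_t bytes = n * sizeof(float);
        float *dA, *dB, *dC;

        // Allocate global (device) memory and copy the inputs from host to device.
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        // Launch a grid with enough 256-thread blocks to cover all n elements.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

        // Copy the result back to host memory and free the device buffers.
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
    }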

 

The picture below shows the device code, which defines the actual kernel function.

 

Figure 6: Device code containing the vector addition kernel definition.
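Again as a minimal, illustrative sketch matching the host code above, the device code boils down to a function in which each thread computes one output element:

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        // Global index of this thread: its position within its block plus
        // the offset of its block within the grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // the grid may be slightly larger than n, so guard the access
            c[i] = a[i] + b[i];
    }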

Since the focus of this article is not on teaching CUDA, we will not discuss this code in more depth. Now, let's look at the specific steps to execute the kernel on the GPU.

 

4

 

Steps to execute Kernel on GPU

 

1. Copy data from host to device

 

Before a kernel is scheduled for execution, all the data it requires must be copied from host (CPU) memory to the GPU's global memory (device memory). That said, on the latest GPU hardware we can also read data directly from host memory using unified virtual memory (see the paper "EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal in GPUs").
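As a minimal, illustrative sketch of that alternative, managed (unified) memory gives the host and the device a single pointer, so no explicit copy is issued before the kernel launch; the runtime migrates or accesses pages on demand:

    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        float* a;
        cudaMallocManaged(&a, n * sizeof(float));   // one allocation, visible to CPU and GPU
        for (int i = 0; i < n; ++i)
            a[i] = 1.0f;                            // initialize directly on the host
        // ... launch a kernel that reads and writes `a`, with no cudaMemcpy ...
        cudaDeviceSynchronize();                    // let the GPU finish before the host touches `a` again
        cudaFree(a);
        return 0;
    }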

 

2. Scheduling of thread blocks on SM

 

Once the GPU has all the required data in its memory, it assigns thread blocks to SMs. All threads within a block are processed by the same SM at the same time; to that end, the GPU must reserve resources for those threads on the SM before it starts executing them. In practice, multiple thread blocks can be assigned to the same SM for parallel execution.

 

Figure 7: Assigning thread blocks to SMs

 

Since the number of SMs is limited and a large kernel may contain a large number of thread blocks, not all thread blocks can be allocated for execution immediately. The GPU maintains a list of thread blocks to be allocated and executed. When any thread block is completed, the GPU will select a thread block from the list for execution.

 

3. Single instruction, multiple threads (SIMT) and warps

 

As discussed, all threads in a block are assigned to the same SM. After that, the threads are further divided into groups of 32, called warps [2], and each warp is assigned to a set of cores called a processing block for execution.

 

The SM executes all threads within a warp together by fetching and issuing the same instruction to all of them; the threads then execute that instruction simultaneously on different parts of the data. In the vector addition example, all threads in a warp might be executing the add instruction, but on different indices of the vectors.

 

Since multiple threads execute the same instructions simultaneously, the execution model of this warp is also called single instruction multi-threading (SIMT). This is similar to Single Instruction Multiple Data (SIMD) instructions in CPUs.
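Warp-level execution is also something CUDA code can exploit directly. In the minimal sketch below (illustrative, and assuming the block size is a multiple of 32 and the output buffer has one slot per warp), each warp sums its own 32 values using shuffle instructions, with all 32 threads executing the same instruction in lockstep at every step:

    __global__ void warpSums(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % warpSize;          // this thread's position within its warp (0..31)
        float val = (i < n) ? in[i] : 0.0f;

        // At each step every thread of the warp executes the same shuffle,
        // reading the value held by the thread `offset` lanes away.
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);

        if (lane == 0 && i < n)                     // lane 0 now holds its warp's sum
            out[i / warpSize] = val;
    }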

Volta and later GPU generations introduced an alternative scheduling mechanism called independent thread scheduling. It allows full concurrency between threads, regardless of warp boundaries, which can be used to make better use of execution resources or as a synchronization mechanism between threads. This article does not cover independent thread scheduling, but you can read more about it in the CUDA Programming Guide.

 

4. Warp scheduling and latency tolerance

 

There are some interesting things worth discussing about how warp works.

 

Even though every processing block (group of cores) within an SM is handling warps, only a few of them are actively executing instructions at any given moment, because the number of execution units available in an SM is limited.

 

Some instructions take a long time to complete, leaving their warp waiting for the result. In such cases, the SM puts the waiting warp to sleep and executes another warp that is not waiting on anything. This lets the GPU make maximum use of all available compute resources and increases throughput.

 

Zero-overhead scheduling: since every thread in every warp has its own set of registers, there is no extra cost when the SM switches from executing one warp to another.

 

This is unlike context switching between processes on a CPU: there, if a process has to wait for a long-running operation, the CPU schedules another process on the core in the meantime, but such context switches are expensive because the CPU must save the register state to main memory and restore the state of the other process.

 

5. Copy result data from device to host memory

 

Finally, when all threads of the kernel have finished executing, the last step is to copy the results back to the host memory.

 

Although we've covered everything about typical kernel execution, there's one more point worth discussing: dynamic resource partitioning.

 

5

 

Resource partitioning and occupancy

 

We measure GPU resource utilization through a metric called "occupancy", which represents the ratio between the number of warps allocated to the SM and the maximum number of warps that the SM can support. To achieve maximum throughput, we want to have 100% occupancy. However, in practice, this is not easy to achieve due to various constraints.

 

Why can't we always reach 100% occupancy? An SM has a fixed pool of execution resources, including registers, shared memory, thread-block slots, and thread slots. These resources are dynamically partitioned among threads based on their demands and the GPU's limits. For example, on the NVIDIA H100, each SM can handle 32 thread blocks and 64 warps (i.e., 2048 threads), with at most 1024 threads per block. If we launch a kernel with a block size of 1024 threads, the GPU splits the 2048 available thread slots into 2 blocks.

Dynamic partitioning vs. fixed partitioning: dynamic partitioning makes more efficient use of the GPU's compute resources. A fixed scheme, which gives every thread block the same amount of execution resources, is not always efficient: in some cases threads would be allocated more resources than they actually need, wasting resources and reducing throughput.

 

Below we use an example to illustrate the impact of resource allocation on SM occupancy. Suppose we use a thread block of 32 threads and require a total of 2048 threads, then we will need 64 such thread blocks. However, each SM can only handle 32 thread blocks at a time. Therefore, even though an SM can run 2048 threads, it can only run 1024 threads at a time, which is only 50% occupancy.

 

Likewise, each SM has 65536 registers. For 2048 threads to execute simultaneously, each thread can use at most 32 registers (65536 / 2048 = 32). If a kernel needs 64 registers per thread, each SM can run only 1024 threads, again yielding 50% occupancy.
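The CUDA runtime can do this arithmetic for us. The minimal sketch below (the kernel myKernel is just a placeholder) asks how many blocks of a given size can be resident on one SM for that kernel, given its register and shared-memory usage, and derives the theoretical occupancy:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;
        int blocksPerSM = 0;
        // How many blocks of myKernel fit on one SM at once, given its resource usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

        float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
        return 0;
    }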

 

The challenge with under-occupancy is that it may not provide sufficient latency tolerance or required compute throughput to achieve optimal performance from the hardware.

 

Writing an efficient GPU kernel is therefore a complex task: we must allocate resources wisely and keep latency low while maintaining high occupancy. For example, using many registers per thread can make code run faster but may reduce occupancy, so careful optimization of the code is crucial.
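CUDA gives programmers some control over this trade-off. For example, the __launch_bounds__ qualifier tells the compiler the maximum block size a kernel will be launched with and, optionally, how many blocks per SM we would like to keep resident, so the compiler can cap per-thread register usage accordingly. A minimal, illustrative sketch:

    // Promise the compiler at most 256 threads per block and ask for at least
    // 4 resident blocks per SM; the compiler limits register usage to make
    // that occupancy achievable.
    __global__ void __launch_bounds__(256, 4) boundedKernel(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= 2.0f;
    }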

 

6

 

Summary

 

I understand that the sheer number of new terms and concepts can be daunting, so here is a summary of the key points for quick review.

 

  • The GPU is composed of multiple SMs, and each SM contains multiple processing cores.

  • The GPU has an off-chip global memory, usually high-bandwidth memory (HBM) or dynamic random access memory (DRAM). Because it is off-chip and far from the SMs, its latency is high.

  • There are two levels of cache in the GPU: an off-chip L2 cache and an on-chip L1 cache. They work similarly to the L1/L2 caches in a CPU.

  • There is a small piece of configurable shared memory on each SM. This shared memory is shared between processing cores. Typically, threads within a thread block will load a piece of data into shared memory and reuse it when needed, rather than loading it from global memory each time.

  • Each SM has a large number of registers, and the registers are divided according to thread requirements. NVIDIA H100 has 65536 registers per SM.

  • When executing the kernel on the GPU, we need to start a thread grid. A grid is composed of one or more thread blocks, and each thread block is composed of one or more threads.

  • Depending on resource availability, the GPU allocates one or more thread blocks to execute on the SM. All threads in the same thread block will be assigned to the same SM for execution. The purpose of this is to make full use of data locality and achieve synchronization between threads.

  • Threads assigned to SMs are further divided into groups of size 32, called warps. All threads within a warp execute the same instructions simultaneously, but on different parts of the data (SIMT) (although newer generations of GPUs also support independent thread scheduling).

  • The GPU dynamically divides resources between threads based on the needs of each thread and the limitations of the SM. Programmers need to carefully optimize the code to ensure the highest SM occupancy during execution.

 

Footnotes

 

[1] Yes, thanks to hyper-threading technology and multi-core processors, CPUs can also perform tasks in parallel. But a lot of work has been devoted to improving the performance of sequential execution for a long time.

[2] In the current generation of NVIDIA GPUs, the warp size is 32. But this size may change in future hardware iterations.

 

 


 

 

Try OneFlow: github.com/Oneflow-Inc/oneflow/

 

