Getting Started with GPU Parallel Computing

0. Preface

Before GPUs existed, essentially all computation was handled by the CPU. With a GPU present, the two divide the work: the CPU is responsible for logic-heavy control and serial computation, while the GPU focuses on highly threaded parallel workloads (large-scale computing tasks).

The GPU is not a standalone computing platform; it must work together with the CPU and can be regarded as a coprocessor of the CPU. Therefore, when we talk about GPU parallel computing, we actually mean a heterogeneous computing architecture based on CPU+GPU.

NVIDIA and AMD are the main discrete graphics card manufacturers; their cards are commonly called "N cards" and "A cards" respectively. N cards mainly support CUDA programming, while A cards mainly support OpenCL programming. NVIDIA is the leader in the GPU industry, with its GPU products accounting for roughly 80% of the market.

1. CPU vs GPU

For this section, reading the blog post "What is the difference between CPU and GPU?" is recommended.

The architecture of CPU and GPU is shown in the figure below:

[Figure: CPU and GPU architecture diagrams]
It can be understood intuitively as follows:

The CPU devotes roughly 25% of its area to ALUs (compute units), 25% to control units, and 50% to cache.

The GPU devotes roughly 90% of its area to ALUs (compute units), 5% to control units, and 5% to cache.

The two therefore have different structures, which leads to different characteristics:

CPU: strong control, weak computation; more resources are devoted to caching
GPU: strong computation, weak control; more resources are devoted to data computation

Therefore, GPUs are designed with more transistors devoted to data processing rather than to data caching and flow control, which makes a high degree of parallel computing possible.

In addition, the advantages of the GPU include:

  • Focus on floating-point computation / high cost-effectiveness: by design, the GPU avoids or simplifies complex functionality unrelated to floating-point computation, such as branch handling and logic control, and concentrates on floating-point throughput, which gives it a large advantage in manufacturing cost.
  • Good performance per watt: the GPU integrates a large number of lightweight processing cores running at limited clock frequencies, so the power consumed per operation is very small.

NVIDIA has three mainstream GPU product lines:

  1. Tesla series: designed for high-performance computing and expensive; the well-known deep learning training GPUs A100, V100, and P100 all belong to this line.
  2. GeForce series: aimed at gaming and entertainment, but of course also usable for deep learning; examples include the GeForce RTX 3060 and GeForce RTX 4090.
  3. Quadro series: designed for professional graphics and image processing; I have never used it personally, so I will not introduce it here.

2. Introduction to Parallel Computing

Parallel computing refers to the use of multiple processors to jointly solve the same problem, where each processor undertakes part of the computing task.

Parallel computing includes temporal parallelism (pipelining independent pieces of work) and spatial parallelism (for example, computing on blocks of a matrix). Both require good load balancing and low communication overhead (such as the communication between CPU and GPU).

The precondition for parallel computing is that the problem has some degree of parallelism, that is, it can be decomposed into multiple subtasks that can be executed in parallel.

Individual high-performance computing nodes are mainly divided into:

  1. Homogeneous nodes: use only one kind of processor (CPUs only), such as Intel Xeon or AMD Opteron CPUs.
  2. Heterogeneous nodes: combine different kinds of processors, split into a host side and a device side that focus on logical operations and floating-point computation respectively. GPU parallel computing generally uses heterogeneous nodes.

3. Introduction to CUDA

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture and platform released by NVIDIA in 2007. It does not require a graphics API and uses a C-like language, which makes development easy.

CUDA is a heterogeneous programming model built as an extension of the C language; for example, it adds qualifiers such as __device__ and __shared__. Starting from CUDA 3.0, C++ programming is also supported.
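For illustration, here is a minimal sketch of how these qualifiers appear in code. The function and variable names are made up for the example, and it assumes a block size of at most 256 threads:

    // __device__ : compiled for the GPU and callable only from GPU code
    __device__ float square(float x) { return x * x; }

    // __global__ : marks a kernel function that the host can launch
    __global__ void squareAll(float *data, int n) {
        // __shared__ : one copy per thread block, visible to all threads in the block
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = data[i];
            data[i] = square(tile[threadIdx.x]);
        }
    }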

Code developed with CUDA is divided into two kinds when it actually runs:

  • Host code running on the CPU (Host Code)
  • Device code running on the GPU (Device Code)

The CUDA parallel computing function that runs on the GPU is called a kernel function. A complete CUDA program consists of device-side kernel functions (the parallel parts) together with host-side serial processing code.

A kernel is executed by many threads on the GPU. Learning CUDA programming is mainly about learning how to write kernel functions.
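As a minimal sketch (the kernel name, launch dimensions, and device pointers below are illustrative assumptions, not from the original text), a kernel that adds two vectors element by element looks like this; each thread computes one output element, and the host launches the kernel with the <<<blocks, threads>>> syntax:

    // __global__ marks a kernel: launched from the host, executed by many GPU threads
    __global__ void addKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard threads past the end
            c[i] = a[i] + b[i];
    }

    // Host side: launch enough blocks of 256 threads to cover n elements
    // addKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);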

4. CUDA processing flow

The processing flow of CUDA is:

  1. Copy data from system memory to GPU memory
  2. The CPU issues instructions that drive the GPU's computation
  3. The GPU's CUDA cores process the data in parallel
  4. The GPU returns the final result of the CUDA processing to system memory

The basic flow of CUDA program execution is:

  1. Allocate host memory and device (video) memory
  2. Initialize the host memory
  3. Copy the data to be processed from host memory to GPU memory
  4. Launch the kernel function you have written to perform the computation
  5. Copy the results from GPU memory back to host memory
  6. Process the data copied back to host memory
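The sketch below walks through these six steps for a simple vector addition. It is a minimal illustration rather than code from the original post: the kernel name addKernel, the array size, and the block size of 256 are assumptions for the example, and error checking is omitted for brevity. The runtime calls cudaMalloc, cudaMemcpy, and cudaFree and the <<<...>>> launch syntax are standard CUDA.

    #include <cuda_runtime.h>
    #include <cstdlib>

    // The same kind of vector-add kernel sketched in Section 3
    __global__ void addKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;                  // number of elements (illustrative)
        const size_t bytes = n * sizeof(float);

        // 1-2. Allocate and initialize host memory; allocate device memory
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        // 3. Copy the input data from host memory to GPU memory
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 4. Launch the kernel: enough blocks of 256 threads to cover n elements
        addKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

        // 5. Copy the result from GPU memory back to host memory
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        // 6. Use h_c on the host, then release device and host memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }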
