How to use GPU computing

About GPU computing and CPU computing:

The GPU executes GPU machine instructions, not a language, and the same is true of the CPU: every language must either be compiled into machine instructions for the target platform or translated into them by an interpreter at run time. When Python does deep learning and needs GPU acceleration, the underlying libraries are still converted into GPU instructions by a compiler. The tools generally used are CUDA, OpenCL, and DirectCompute. CUDA compiles C/C++ code into GPU instructions and, together with its API-calling code, works only on NVIDIA GPUs. OpenCL and DirectCompute do not distinguish between graphics cards: as long as the hardware supports general-purpose computing and the driver implements the API, they can be used. An OpenCL kernel is written in a C-like language and is ultimately compiled into GPU instructions; what host languages such as C++ and Java do is simply call the compiled kernel: copy the data in, pass the parameters, and have the CPU retrieve the results after the GPU finishes the calculation.

GPU computing is generally aimed at large volumes of uniform data; "uniform" means the data elements all share one type, i.e. an array of fixed type and size. Because the hardware is SIMD/SIMT, it suits simple computations where every data element is processed the same way, and it is not suitable for logic with complex conditional jumps. The usage scenarios for GPU acceleration are therefore relatively limited. For small data, CPU-side SIMD instruction acceleration can be preferred, because GPU acceleration may require memory copies and its overhead is relatively high. With SIMD you directly use the CPU's special instruction sets (MMX, SSE, AVX); Java should have a corresponding package, and C++ uses the intrinsics header files.
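A minimal CUDA sketch of that workflow (the kernel name and sizes here are illustrative, not from the original text): the host copies data to the GPU, launches the compiled kernel with its parameters, and copies the result back once the GPU finishes.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the same simple operation applied to uniform data.
__global__ void addArrays(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    // Copy input data from CPU memory to GPU memory (the copy overhead mentioned above).
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: parameters are passed much like a function call.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addArrays<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back so the CPU can use it.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```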

The GPU is particularly suitable for data-parallel tasks that follow the single-program, multiple-data model; that is, it mainly supports SPMD parallel computing. (This is the idea behind MATLAB's spmd: different data are processed by the same program, and that program can of course contain code for handling different cases internally. The parallel idea of parfor, by contrast, is to assign the same batch of data to different iterations of a for loop for processing. The code inside an spmd block has few restrictions, so spmd is far more flexible than parfor.) A CUDA sketch of the same idea follows.
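In CUDA terms, SPMD looks like the following minimal sketch (the kernel name, data, and cases are invented for illustration): every thread executes the same program on its own element, and the program branches internally to handle different situations.

```cpp
#include <cuda_runtime.h>

// Same program for every thread, different data per thread; the program
// itself contains code for two different situations.
__global__ void classifyAndScale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] >= 0.0f)
        x[i] *= 2.0f;   // case 1: non-negative values are doubled
    else
        x[i] = 0.0f;    // case 2: negative values are clamped to zero
}

int main() {
    float h[4] = {-1.0f, 2.0f, -3.0f, 4.0f};
    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    classifyAndScale<<<1, 4>>>(d, 4);   // one block of four threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```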

Machine learning needs big data for training; that is to say, a huge number of repeated, parallel computations are required. This is exactly the GPU's specialty, so it is a very good fit.
Big data is already a science and technology closely tied to people's lives, and with the spread of the Internet and the Internet of Things, big data will bring ever greater commercial value to modern society. In the big-data era, the sheer volume of data and the demands of data-processing algorithms mean that analyzing massive data with the CPU alone is inefficient, so people began building new computing architectures to improve big-data processing performance, and the GPU was among the first processors applied to it. Although the GPU lacks the CPU's complex control logic, it has a quantity of computing resources the CPU cannot match, and its very uniform structure suits parallel computation over large amounts of data.

The following six types of operations are well suited to GPU computation:

(1) A large number of lightweight operations

That is, applying the same formula or computation to a large amount of data, or to the same data many times over. The formula itself is not complicated, but it is executed a very large number of times; this is where the GPU has an inherent advantage.

(2) Highly parallel

High parallelism means that the operations on individual data elements do not affect one another, i.e. the coupling between them is low. Because of the GPU's hardware design, work-groups do not communicate with one another; only work-items within the same work-group can communicate. The GPU therefore does not support highly coupled computations, such as iterative algorithms, well; this is a constraint of the GPU itself (see the reduction sketch after this list).

(3) Computationally intensive

Tasks can be divided into compute-intensive and IO-intensive. Compute-intensive tasks combine a small amount of IO with a large amount of computation and consume mostly processor resources. IO-intensive tasks combine frequent IO with a small amount of computation; they are dominated by traffic between registers, memory, and device memories, and their main limitation is memory bandwidth.

(4) Simple control

Compared with the GPU, the CPU is better at judgment, logic control, branching, and so on; it has general-purpose computing capability and contains powerful ALUs (arithmetic logic units). The GPU, in contrast, is better suited to simple logical operations.

(5) Multiple stages of execution

The computing task can be decomposed into multiple small programs, or the same program can be executed in multiple stages. This is similar to a cluster decomposing one task into fragments and distributing them to each node for execution to increase computing speed.

(6) Floating point operations

GPUs are good at floating point operations.
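As referenced in item (2), here is a minimal CUDA sketch of that communication rule (names and sizes are illustrative): threads within one block cooperate through shared memory and a barrier, but separate blocks cannot exchange data during a kernel, so a global sum is computed as per-block partial sums that the CPU combines.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its slice in shared memory; blocks never communicate,
// so each writes one partial sum and the CPU (or a second kernel) finishes.
__global__ void partialSums(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // barrier: valid only within a block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1024, threads = 256, blocks = n / threads;
    float *hin = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) hin[i] = 1.0f;
    float *din, *dsum;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dsum, blocks * sizeof(float));
    cudaMemcpy(din, hin, n * sizeof(float), cudaMemcpyHostToDevice);
    partialSums<<<blocks, threads>>>(din, dsum, n);
    float hsum[blocks];
    cudaMemcpy(hsum, dsum, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += hsum[b];  // CPU combines the pieces
    printf("sum = %f (expected %d)\n", total, n);
    cudaFree(din); cudaFree(dsum); free(hin);
    return 0;
}
```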

On computers, the CPU has always been responsible for "computing", and the more cores a CPU has, the stronger its computing power. Compared with the dozen or so cores of a CPU, however, a GPU can host thousands of processing units. In the past, GPU technology was mainly used for image rendering and realistic scene simulation.

Now, GPU computing is widely used in deep learning and high-performance computing (HPC), and the GPU increasingly behaves like a higher-performance CPU. Its "massively parallel computing" capability has begun to be exploited, and its positioning has shifted from co-processor to mainstream processor.

CUDA stands for "Compute Unified Device Architecture" and is an application development environment and tool set created by NVIDIA specifically for GPU computing. CUDA consists of three main components: a compiler that provides access to the parallel computing resources on the GPU, a computing-specific runtime driver, and a set of optimized scientific computing libraries developed for CUDA. The core of CUDA is a specially developed C compiler that simplifies the coding of GPU parallel programs; programmers familiar with C can focus on developing the parallel program rather than wrestling with graphics APIs. To simplify development, CUDA's C compiler allows programmers to write a mix of CPU and GPU code in a single source file. A few simple annotations added to the C program inform the CUDA compiler which functions are processed by the CPU and which by the GPU. Functions processed by the GPU are compiled by the CUDA compiler, while code processed by the CPU is compiled by a standard C compiler.
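The "simple annotations" are CUDA's function qualifiers. A minimal sketch of one such mixed source file (the function names are invented for illustration):

```cpp
#include <cuda_runtime.h>

// __device__: compiled for the GPU, callable only from GPU code.
__device__ float square(float x) { return x * x; }

// __global__: compiled for the GPU, launched from CPU code.
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// Unqualified (or __host__): compiled by the standard C/C++ compiler for the CPU.
int main() {
    float h[4] = {1, 2, 3, 4};
    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    squareAll<<<1, 4>>>(d, 4);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```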

When developing GPU computing applications with CUDA, developers use a new programming model to map parallel data onto the GPU. A CUDA program subdivides the data to be processed into smaller blocks and then processes those blocks in parallel. While the GPU program runs, the developer only needs to run the program on the host CPU; the CUDA driver automatically loads and executes the program on the GPU. The host-side program exchanges information with the GPU over the high-speed PCI Express bus, and data transfers, GPU kernel launches, and other CPU-GPU interactions are completed by calling special operations in the runtime driver. These high-level operations free programmers from manually managing GPU computing resources. GPUs using CUDA technology can run either as flexible thread processors, with thousands of threads collaborating to solve a complex problem, or as stream processors in specific applications where individual threads do not exchange information. CUDA-capable applications can thus use the GPU for fine-grained, data-intensive processing while the multi-core CPU handles complex coarse-grained tasks such as control and data management.
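A sketch of the "subdivide into smaller blocks" model described above, with an assumed image size and kernel name: the grid/block launch configuration carves a 2D dataset into pieces, and runtime-driver calls handle the CPU-GPU interaction.

```cpp
#include <cuda_runtime.h>

// Brighten a 2D image: the grid/block launch configuration subdivides the
// data, and each thread handles exactly one pixel.
__global__ void brighten(unsigned char *img, int width, int height, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int v = img[idx] + delta;
        img[idx] = (unsigned char)(v > 255 ? 255 : v);
    }
}

int main() {
    const int width = 1920, height = 1080;       // assumed image size
    size_t bytes = (size_t)width * height;
    unsigned char *dimg;
    cudaMalloc(&dimg, bytes);
    cudaMemset(dimg, 100, bytes);                // stand-in for real pixel data

    // 16x16 threads per block; enough blocks to cover the whole image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    brighten<<<grid, block>>>(dimg, width, height, 40);

    cudaDeviceSynchronize();                     // runtime call: wait for the GPU
    cudaFree(dimg);
    return 0;
}
```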

Using GPU computing in MATLAB:

MATLAB® lets you use NVIDIA® GPUs to accelerate AI, deep learning, and other compute-intensive analytics without being a CUDA® programmer. With MATLAB and Parallel Computing Toolbox™, you can:

Invoke NVIDIA GPUs directly in MATLAB with over 500 built-in functions available.

Access multiple GPUs on desktops, compute clusters, and the cloud using MATLAB workers and MATLAB Parallel Server™.

Generate CUDA code directly from MATLAB with GPU Coder™ for deployment to data centers, clouds, and embedded devices.

Use GPU Coder to generate NVIDIA TensorRT™ code from MATLAB for low-latency and high-throughput inference.

Deploy MATLAB AI applications to data centers equipped with NVIDIA GPUs, integrating with enterprise systems using MATLAB Production Server™.


Origin blog.csdn.net/qq_42152032/article/details/131342001