Translation and Study Notes for the CUDA 10.0 Official Documentation: Introduction

Background

Starting with this post, I will share my translation of and study notes on the Programming Guide in the CUDA 10.0 official documentation over a series of blog posts. Because there is a lot of material, I will cover one chapter per post. The translation may fall short in places, so I recommend consulting the original text: https://docs.nvidia.com/cuda/archive/10.0/cuda-c-programming-guide/index.html

Introduction

From Graphics Processing to General-Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-resolution 3D graphics, the programmable GPU has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational power and memory bandwidth, as shown in the following two figures.

The reason behind the difference in floating-point capability between the CPU and the GPU is that the GPU is designed specifically for compute-intensive, highly parallel workloads, especially graphics rendering, so it devotes far more of its transistors to data processing than to data caching or flow control, as shown in the figure below.


In addition, GPUs are particularly well suited to problems that can be expressed as data-parallel computations, that is, the same program is executed on many data elements in parallel, with high arithmetic intensity (many more arithmetic operations than memory operations). Because the same program is executed for each data element, there is less need for sophisticated flow control; and because it runs on many data elements with high arithmetic intensity, memory access latency can be hidden with computation instead of large data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up their computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications, such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition, can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing and physics simulation to computational finance and computational biology.
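To make this mapping concrete, here is a minimal sketch of my own (not taken from the documentation) of a CUDA C kernel that assigns one thread to each pixel of an 8-bit grayscale image; the kernel name and parameters are purely illustrative.

```
// Hypothetical example: one thread per pixel, brightening an 8-bit grayscale image.
__global__ void brighten(unsigned char *pixels, int numPixels, int delta)
{
    // Compute this thread's global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {              // guard: the last block may have surplus threads
        int v = pixels[i] + delta;
        pixels[i] = (v > 255) ? 255 : v;   // clamp to the 8-bit range
    }
}
```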

CUDA: A General-Purpose Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model that uses the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language. As shown in the figure below, other languages, application programming interfaces, and directive-based approaches are also supported, such as Fortran, DirectCompute, and OpenACC.
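As a rough sketch of what that C-based environment looks like in practice, the host code below allocates device memory, copies an image to the GPU, launches the hypothetical brighten kernel from the earlier sketch, and copies the result back; all names and sizes are my own assumptions, and the <<<...>>> launch syntax is the language extension CUDA adds to C.

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;                       // one megapixel image (arbitrary size)
    size_t bytes = n * sizeof(unsigned char);

    // Prepare the input on the host (CPU).
    unsigned char *h_img = (unsigned char *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_img[i] = i % 200;

    // Allocate device (GPU) memory and copy the image over.
    unsigned char *d_img;
    cudaMalloc(&d_img, bytes);
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover every pixel.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    brighten<<<blocks, threads>>>(d_img, n, 40);

    // Copy the result back and clean up.
    cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);
    printf("first pixel after brightening: %d\n", h_img[0]);
    cudaFree(d_img);
    free(h_img);
    return 0;
}
```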

A Scalable Programming Model

The emergence of multi-core CPUs and many-core GPUs means that mainstream processor chips are now parallel systems, and their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to take advantage of the increasing number of cores, much as 3D graphics applications transparently scale their parallelism to many-core GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At the core of CUDA are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization, which are exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse-grained sub-problems that can be solved independently in parallel by blocks of threads, and to further divide each sub-problem into finer-grained pieces that can be solved cooperatively in parallel by all the threads within a block.
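The block-wise reduction below is a small sketch of my own (not from the documentation) that touches all three abstractions: the grid/block hierarchy, a per-block __shared__ buffer, and __syncthreads() as the barrier. It assumes a launch with 256 threads per block.

```
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float cache[256];            // shared memory: visible to all threads of one block

    int tid = threadIdx.x;                  // position within the block (fine-grained)
    int i   = blockIdx.x * blockDim.x + tid;   // position within the grid (coarse-grained)

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier: wait until every thread has stored its value

    // The threads of a block cooperate on a tree reduction over the shared buffer.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];   // each block independently produces one partial sum
}
```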

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any available multiprocessor within the GPU, in any order, concurrently or sequentially (as shown in the figure below), so only the runtime system needs to know the number of physical multiprocessors.

This scalable programming model allows the GPU architecture to span a wide market range by scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs.
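A common idiom that makes this scalability concrete is the grid-stride loop; the sketch below is my own illustration, with assumed names and launch configuration, of a kernel whose correctness does not depend on how many blocks actually run at the same time.

```
__global__ void scale(float *data, int n, float factor)
{
    // Grid-stride loop: each thread processes every (gridDim.x * blockDim.x)-th element,
    // so the kernel is correct whether the runtime executes its blocks on a GPU with
    // 2 multiprocessors or 80, concurrently or one after another.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= factor;
    }
}

// A possible launch: the configuration is derived from the problem size, not the hardware.
// scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
```

The same compiled kernel then runs unchanged across the whole product range described above; only how many blocks execute concurrently differs.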

Conclusion

The above is the first chapter of the official CUDA 10.0 documentation. My takeaways from it are as follows:

1. Most of a GPU's transistors are devoted to ALUs, so it is weak at complex flow control but strong at computation, which makes it complementary to the CPU. When developing, we should try to let the CPU handle the logical decisions and let the GPU handle the computation;

2. CUDA's interfaces are all written in C or C++, so we need to master C or C++ first; those with formal training in computer science should already be familiar with both;

3. CUDA can accelerate many compute-intensive applications. For examples, see the book "CUDA High Performance Computing", which walks through applications implemented with CUDA such as a flashlight app, heat visualization, and oscillators. The general pattern is to hand UI rendering off to OpenGL and the computation off to CUDA.

The next chapter will introduce the CUDA programming model, including the thread hierarchy, the memory hierarchy, kernel functions, and heterogeneous programming.


Original post: blog.csdn.net/qq_37475168/article/details/110293902