"CUDA C++ Programming Guide" Chapter 1 Introduction to CUDA

Chapter 1 Introduction to CUDA

1.1 Benefits of using GPUs

Graphics processing units (GPUs) offer much higher instruction throughput and memory bandwidth than CPUs within a similar price and power envelope, and many applications take advantage of these higher capabilities to run faster on the GPU than on the CPU. Other computing devices, such as FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.

This performance difference between GPUs and CPUs exists because they are designed with different goals in mind. A CPU is designed to execute a sequence of operations (called a thread) as fast as possible and can execute a few dozen such threads in parallel, whereas a GPU is designed to execute thousands of threads in parallel (amortizing slower single-thread performance to achieve greater throughput).

GPUs are specialized for highly parallel computation, and are therefore built so that more transistors are devoted to data processing rather than to data caching and flow control. Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.

[Figure 1: Example distribution of chip resources for a CPU versus a GPU]
Devoting more transistors to data processing, for example floating-point computations, benefits highly parallel computation; the GPU can hide memory access latencies with computation instead of relying on large data caches and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors.

Typically, an application has a mix of parallel and sequential parts, so systems are designed with a mix of GPUs and CPUs to maximize overall performance. Applications with a high degree of parallelism can exploit the massively parallel nature of the GPU to achieve higher performance than on the CPU.

1.2 CUDA: A Universal Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than a CPU can.

CUDA provides a software environment that allows developers to use C++ as a high-level language. Other languages, application programming interfaces, and directive-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC.
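As a brief illustration (a minimal sketch of my own, not code taken from the guide), the program below shows what these C++ language extensions look like: a __global__ kernel that adds two vectors, launched on the GPU with the <<<...>>> syntax. The names vecAdd, a, b, and c are invented for this example.

    #include <cstdio>

    // __global__ marks a function (a "kernel") that runs on the GPU.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        // Managed (unified) memory is reachable from both the CPU and the GPU.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;                        // threads per block
        int blocks = (n + threads - 1) / threads; // enough blocks to cover n
        vecAdd<<<blocks, threads>>>(a, b, c, n);  // <<<...>>> is the launch syntax
        cudaDeviceSynchronize();                  // wait for the kernel to finish

        printf("c[0] = %f\n", c[0]);              // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Compiled with nvcc, this runs the addition with one GPU thread per array element.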


1.3 Scalable programming model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to take advantage of the growing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model aims to overcome this challenge while keeping the learning curve low for programmers familiar with standard programming languages such as C.

At its core are three key abstractions—a hierarchy of thread groups, shared memory, and barrier synchronization—that are exposed to programmers only as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently and in parallel by blocks of threads, and to divide each sub-problem into finer pieces that can be solved cooperatively in parallel by all the threads within a block, as the sketch below illustrates.
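As a concrete (and hypothetical) sketch of this decomposition, the kernel below gives each thread block one independent chunk of an array to sum, while the threads inside the block cooperate through shared memory and the __syncthreads() barrier; it assumes a launch with exactly 256 threads per block.

    // Sketch: per-block partial sums. Assumes a launch with 256 threads per block.
    __global__ void blockSum(const float* in, float* partial, int n) {
        __shared__ float tile[256];             // shared memory, visible block-wide
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                        // barrier: all loads complete first

        // Cooperative tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = tile[0];      // one coarse result per block
    }

Each block's chunk is the coarse-grained sub-problem; the loads and the tree reduction inside the block are the finer-grained, cooperative part.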

This decomposition preserves the expressive power of the language by allowing threads to cooperate in solving each sub-problem, while at the same time enabling automatic scalability. Indeed, each thread block can be scheduled in any order on any available multiprocessor within the GPU, and executed concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors; only the runtime system needs to know the physical multiprocessor count.
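As a small host-side illustration (my own addition, not an example from the guide), the CUDA runtime API can report the multiprocessor count of the installed device, so a program can adapt to it rather than hard-code it:

    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // query properties of device 0
        printf("%s has %d streaming multiprocessors\n",
               prop.name, prop.multiProcessorCount);
        return 0;
    }

The same compiled kernel runs unchanged whether that count is 2 or 100; the runtime simply distributes the thread blocks across whatever multiprocessors are available.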

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs.

[Figure: Automatic scalability]
A GPU is built around an array of streaming multiprocessors (SMs). A multithreaded program is partitioned into blocks of threads that execute independently of each other, so that a GPU with more multiprocessors executes the program in less time than a GPU with fewer multiprocessors.

1.4 Document structure

This documentation is organized into the following sections (later sections will be posted to this CUDA column over time):

  1. Introduction: a general introduction to CUDA (this chapter)
  2. Programming Model: outlines the CUDA programming model
  3. Programming Interface: describes the programming interface
  4. Hardware Implementation: describes the hardware implementation
  5. Performance Guidelines: gives some guidance on how to achieve maximum performance
  6. CUDA-Enabled GPUs: lists all CUDA-enabled devices
  7. C++ Language Extensions: a detailed description of all extensions to the C++ language
  8. Cooperative Groups: describes the synchronization primitives for groups of CUDA threads
  9. CUDA Dynamic Parallelism: describes how to launch and synchronize one kernel from another
  10. Virtual Memory Management: describes how to manage the unified virtual address space
  11. Stream Ordered Memory Allocator: describes how applications allocate and free memory
  12. Graph Memory Nodes: describes how graphs can create and own memory allocations
  13. Mathematical Functions: lists the mathematical functions supported by CUDA
  14. C++ Language Support: lists the C++ features supported in device code
  15. Texture Fetching: gives more detail on texture fetching
  16. Compute Capabilities: gives the technical specifications of various devices, along with more architectural details
  17. Driver API: introduces the low-level driver API
  18. CUDA Environment Variables: lists all the CUDA environment variables
  19. Unified Memory Programming: introduces the Unified Memory programming model

When the GPU was first created, two decades ago, it was designed as a specialized processor to accelerate graphics rendering. Driven by the insatiable market demand for real-time, high-definition 3D graphics, it has since evolved into a general-purpose processor used for many more workloads than graphics rendering alone.
