
Chapter 3 Introduction to CUDA

"Massively Parallel Processor Programming Practice" learning, other chapters focus on the column CUDA C


This chapter uses vector addition (vector add) as its starting point and describes how to rewrite a C-language vector addition into vector addition in CUDA-extended C.

1.1 Traditional vector addition

Traditional vector addition is implemented with a simple loop on the host (CPU).
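The original figure showed the C code; a minimal loop-based version (the function name vecAdd is illustrative, not necessarily the book's exact listing) might look like this:

```c
/* Host-only vector addition: C[i] = A[i] + B[i], computed one element at a time */
void vecAdd(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
}
```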

1.2 CUDA addition acceleration

CUDA vector addition parallelizes the work with many threads: n threads are launched at once, each thread computes one element-wise addition, so a vector of length n is computed in parallel. A program using the CUDA extensions requires three steps (a host-side sketch follows the list):

  1. Allocate memory on the device (the GPU) and copy the input data from the host to the device
  2. Use the device API to operate on the allocated memory (computation on the device is expressed as kernel functions)
  3. Copy the result back to the host
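A host-side sketch of these three steps could look like the following (error checking is omitted; the names h_A, d_A, etc. and the kernel vecAddKernel, defined in the next section, are illustrative):

```c
#include <cuda_runtime.h>
#include <math.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n); /* defined below */

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Step 1: allocate device memory and copy the inputs from host to device */
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    /* Step 2: operate on the device memory through a kernel launch */
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

    /* Step 3: copy the result back to the host and release device memory */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```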

Implementation of the addition kernel function

When operating on a vector with many threads, the threads are organized into thread blocks. A thread block contains blockDim.x (for example, 256) threads, and every thread executes the same kernel code.

In the addition kernel, threadIdx.x is the index of the thread within its own thread block, i.e. the j-th thread with 0 <= j <= blockDim.x - 1. The global index i of a thread across all blocks is therefore i = blockIdx.x * blockDim.x + threadIdx.x; a sketch of the kernel follows.
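A minimal sketch (the name vecAddKernel carries over from the host code above; the guard i < n keeps surplus threads in the last block from writing past the end of the vectors):

```c
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    /* Global index of this thread across all thread blocks */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* The last block may contain more threads than remaining elements */
    if (i < n)
        C[i] = A[i] + B[i];
}
```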

It is worth noting that in CUDA's mixed host/device programming model, code executes on the host by default. A function that is to execute on the device must be declared with the __global__ qualifier.
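The kernel above is declared with __global__, meaning it executes on the device and is launched from the host. CUDA also provides __device__ (executes on the device, callable only from device code) and __host__ (executes on the host, the default for unqualified functions); the helper names below are purely illustrative:

```c
/* Kernel: executed on the device, launched from the host */
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n);

/* Device function: executed on the device, callable only from device code */
__device__ float addOne(float x) { return x + 1.0f; }

/* Host function: executed on the host (same as writing no qualifier at all) */
__host__ void printResult(const float *C, int n);
```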

Kernel function parameters

Looking back at the kernel launch, the execution configuration between <<< >>> specifies:

  • ceil(n/256.0) — the number of thread blocks, i.e. enough blocks to cover all n elements
  • 256 — the number of threads in each thread block
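Putting it together, a launch for a vector of length n might look like this (variable names carried over from the sketches above):

```c
int threadsPerBlock = 256;
int blocksPerGrid = (int)ceil(n / 256.0);   /* enough blocks to cover all n elements */
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
```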

Origin blog.csdn.net/qq_40491305/article/details/114528176