Chapter 3 Introduction to CUDA
"Massively Parallel Processor Programming Practice" learning, other chapters focus on the column CUDA C
CUDA C programming friendly link:
- Chapter 3 Introduction to CUDA - CUDA C Programming Vector Addition
- Chapter 4 CUDA Data Parallel Execution Model
- Chapter 5 CUDA Memory
- Chapter 6 CUDA Performance Optimization (with link to original book)
- Kernel functions: Introduction to CUDA Programming (1) - thread organization and kernel usage illustrated through image operations
- Extension: CUDA convolution computation and optimization, using one-dimensional convolution as an example
This chapter takes vector addition (vector add) as its starting point and describes how to rewrite a C-language vector-addition program in CUDA's extended C.
1.1 Traditional vector addition
Traditional vector addition is implemented with a serial loop over the elements.
1.2 CUDA addition acceleration
CUDA vector addition is parallelized across many threads: n threads are launched at once, each thread computes one element-wise addition, so a vector of length n is processed in parallel. A program using the CUDA extensions requires three steps:
- First allocate memory on the device (the GPU), and copy the input data from the host (the CPU) to the device
- Operate on the allocated memory using the device API (computation on the device is expressed as kernel functions)
- Copy the result back to the host
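The three steps above can be sketched as host code. This is a sketch, not the book's exact listing: the kernel name `vecAddKernel` and the 256-thread block size are illustrative, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n);

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Step 1: allocate device memory and copy inputs host -> device */
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    /* Step 2: launch the kernel on the device */
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);

    /* Step 3: copy the result device -> host, then free device memory */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```

Note that `h_` and `d_` prefixes are a common convention for distinguishing host and device pointers; the host cannot safely dereference a device pointer, and vice versa.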
The specific implementation of the addition function
When a vector is processed by multiple threads, the threads are organized into thread blocks. A thread block contains blockDim.x threads (for example, 256), and every thread executes the same kernel code.
In the kernel, threadIdx.x is the index of the thread within its own thread block: in a given block, thread j satisfies 0 <= j <= blockDim.x - 1. The global index of a thread across all blocks is therefore i = blockIdx.x * blockDim.x + threadIdx.x, i.e. the block index times the number of threads per block, plus the thread's index within its block.
It is worth noting that in CUDA's heterogeneous programming model, code executes on the host by default. A function that is to execute on the device must be declared with the `__global__` execution-space qualifier.
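A minimal kernel sketch is shown below (the name `vecAddKernel` is illustrative). The boundary check is needed because the grid may contain more threads than vector elements:

```cuda
/* __global__ marks a function that runs on the device
   and is launched from host code. */
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    /* Global index: block index * threads per block + index within the block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* Guard: with ceil(n/256.0) blocks of 256 threads, the last block
       may contain threads whose index falls beyond the end of the vector. */
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```

Each thread computes exactly one element, so the serial loop disappears: the loop index is replaced by the thread's global index.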
Kernel configuration parameters
Looking again at the kernel launch, the execution configuration between <<< >>> specifies:
- ceil(n/256.0): the number of thread blocks needed to cover all n elements
- 256: the number of threads in each thread block