Programming Massively Parallel Processors - Chapter 4: CUDA Data Parallel Execution Model

Chapter 4 Data Parallel Execution Model

"Massively Parallel Processor Programming Practice" learning, other chapters focus on the column CUDA C


Threads are grouped into thread blocks, thread blocks are grouped into a grid, and each kernel launch corresponds to one grid. All threads of a kernel execute the same code; the difference is that each thread belongs to a particular block (identified by blockIdx) and has a unique position within that block (threadIdx). Every thread therefore has unique coordinates and can independently decide which data it accesses.
Same execution process + different data sources = parallel data processing
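
As a minimal sketch of these unique coordinates (the kernel name whoAmI and the 2 x 4 launch shape are illustrative, not from the book):

```cpp
#include <cstdio>

// Every thread prints its own block and thread coordinates; no two threads
// in the grid print the same pair.
__global__ void whoAmI() {
    printf("block (%d,%d,%d), thread (%d,%d,%d)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main() {
    whoAmI<<<dim3(2, 1, 1), dim3(4, 1, 1)>>>();  // 2 blocks of 4 threads each
    cudaDeviceSynchronize();                     // wait for device-side printf
    return 0;
}
```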

4.1 CUDA thread organization

Thread blocks within a grid, and threads within a block, are both organized with up to three-dimensional coordinates.


The launch configuration can also be chosen dynamically: the grid size is computed from the problem size n at the point where the kernel (for example, a vector-addition kernel) is launched.
A quick-start example with a 1D grid and 1D thread blocks:
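A minimal sketch under the usual conventions of the book (the name vecAddKernel and the choice of 256 threads per block are assumptions):

```cpp
// Each thread handles one vector element; the grid size is derived from n.
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // guard: the last block may be only partly used
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
void vecAdd(float* d_A, float* d_B, float* d_C, int n) {
    dim3 dimGrid((n + 255) / 256, 1, 1);  // ceil(n / 256) blocks
    dim3 dimBlock(256, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);
}
```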

Specific examples:
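A sketch of such a launch configuration (demoKernel is a hypothetical placeholder; only the dimGrid/dimBlock setup matters here):

```cpp
// Writes each element's row index; the grid is 2 x 2 blocks of 16 x 16 threads.
__global__ void demoKernel(int* d_out, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    d_out[row * width + col] = row;
}

void launchDemo(int* d_out) {
    dim3 dimGrid(2, 2, 1);    // the grid contains 2 * 2 * 1 = 4 thread blocks
    dim3 dimBlock(16, 16, 1); // each block contains 16 * 16 * 1 = 256 threads
    demoKernel<<<dimGrid, dimBlock>>>(d_out, 2 * 16);  // total width = 32 elements
}
```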

Note that dimGrid(2,2,1) means the grid contains 2*2*1 thread blocks, while dimBlock(16,16,1) means each block contains 16*16*1 threads.

4.2 Mapping between threads and multidimensional data

Choose the thread organization to match the shape of the data: a 2D organization for an image, and so on.
Suppose each thread block is 16*16. A 76*64 picture then needs ceil(76/16) * ceil(64/16) = 5*4 thread blocks. Let n and m be the extents of the picture in x and y, and let d_Pin and d_Pout be the device pointers to the source and destination image data; then:
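A sketch of the 2D mapping using the names given above (d_Pin, d_Pout, n, m); the per-pixel operation, scaling by 2, is only illustrative:

```cpp
// One thread per pixel: threads that fall outside the picture do nothing.
__global__ void pictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position in the picture
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position in the picture
    if (col < n && row < m)
        d_Pout[row * n + col] = 2.0f * d_Pin[row * n + col];
}

// Host-side launch: for a 76 x 64 picture with 16 x 16 blocks this gives 5 x 4 blocks.
void processPicture(float* d_Pin, float* d_Pout, int n, int m) {
    dim3 dimGrid((n + 15) / 16, (m + 15) / 16, 1);
    dim3 dimBlock(16, 16, 1);
    pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
}
```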

For three-dimensional data, an additional z coordinate is needed on top of the 2D scheme:
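A sketch of the 3D extension, assuming an n * m * depth volume (the names follow the 2D example above):

```cpp
// One thread per volume element; z selects the plane (slice).
__global__ void volumeKernel(float* d_Pin, float* d_Pout, int n, int m, int depth) {
    int col   = blockIdx.x * blockDim.x + threadIdx.x;
    int row   = blockIdx.y * blockDim.y + threadIdx.y;
    int plane = blockIdx.z * blockDim.z + threadIdx.z;  // the extra z information
    if (col < n && row < m && plane < depth) {
        int idx = (plane * m + row) * n + col;  // linearize z, then y, then x
        d_Pout[idx] = 2.0f * d_Pin[idx];
    }
}
```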

4.3 Matrix multiplication

In image processing, each thread corresponds to one pixel to be processed (the source data), and applying the same operation to every pixel gives a parallel computation. In matrix multiplication, each thread instead corresponds to one element of the destination matrix (the destination data) and computes the same sum $\sum_k d^1_{i,k} \cdot d^2_{k,j}$, which again yields a parallel computation.

The specific implementation is as follows:
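A sketch of the kernel described above, in which each thread produces one element of the destination matrix; square Width * Width matrices and the names d_M, d_N, d_P are assumptions:

```cpp
// Each thread computes one element of d_P as the dot product of a row of d_M
// and a column of d_N (the sum over k in the formula above).
__global__ void matrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float pValue = 0.0f;
        for (int k = 0; k < Width; ++k)
            pValue += d_M[row * Width + k] * d_N[k * Width + col];
        d_P[row * Width + col] = pValue;
    }
}
```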

4.4 Thread Synchronization and Transparent Scalability

__syncthreads() provides barrier synchronization: all threads of the same block, fast or slow, must reach the barrier before any of them can continue.
Barrier deadlock: if __syncthreads() is placed inside divergent branches, threads of the same block wait at different barriers and deadlock.
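A sketch of the deadlock pattern (badKernel is a hypothetical name; never write code like this):

```cpp
// Threads of the same block that diverge reach *different* __syncthreads()
// calls and wait for each other forever.
__global__ void badKernel(float* d_A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) {
            d_A[i] += 1.0f;
            __syncthreads();   // barrier reached only by even-numbered threads
        } else {
            d_A[i] += 2.0f;
            __syncthreads();   // a different barrier, reached only by odd-numbered threads
        }
    }
}
```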

There is no barrier synchronization between different thread blocks, and they can be executed in any order.
Transparent scalability: because thread blocks do not depend on each other's execution order, the same grid can run on a device with few SMs (blocks executed largely one after another) or many SMs (blocks executed in parallel) without any change to the code.

4.5 Resource allocation of thread blocks

Execution resources are organized into streaming multiprocessors (SMs). Several thread blocks can be assigned to one SM, but every device limits both the number of blocks and the number of threads that may reside on an SM. For example, if an SM allows at most 64 threads and 2 blocks, it can be assigned 2 blocks of 32 threads each, but not 4 blocks of 16 threads each, even though the total thread count is the same.

4.6 Query device properties


Query the properties of all devices:
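A sketch of such a query with cudaGetDeviceCount and cudaGetDeviceProperties; the printed fields are just a small selection of what cudaDeviceProp provides:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);               // number of CUDA devices present
    for (int i = 0; i < devCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);       // fill in the property struct
        printf("Device %d: %s\n", i, prop.name);
        printf("  SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  max block dims: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  max grid dims:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
```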

Other parameters can be read from the same cudaDeviceProp structure, for example clockRate and the per-block register and shared-memory limits (regsPerBlock, sharedMemPerBlock).

4.7 Thread Scheduling and Latency Tolerance

Once a thread block is assigned to an SM, it is further divided into warps of 32 threads; the warp is the unit of thread scheduling within an SM.
The SM executes all threads of a warp in single-instruction, multiple-data (SIMD) fashion: at any given time, all threads of a warp fetch and execute the same instruction.
An SM contains multiple streaming processors (SPs), and the SPs are the components that actually execute instructions. In general, the number of SPs in an SM is smaller than the number of threads assigned to it, so the hardware executes only a subset of the resident warps at any moment. When an SM holds many warps, it sets aside warps that are waiting on long-latency instructions and runs warps whose next instruction is ready, so that the hardware is not left idle. This mechanism of covering the latency of some threads with the execution of others is called latency tolerance or latency hiding.
This is also why many warps are assigned to one SM even though the number of SPs is smaller than the number of resident warps or threads.


Origin blog.csdn.net/qq_40491305/article/details/116235878