NVIDIA CUDA Massively Parallel Processor Programming (2): Data-Parallel Execution Exercises

  1. If one SM of a CUDA device can hold 1536 threads and 4 thread blocks, which of the following thread-block configurations yields the most threads in one SM?
    a. 128 threads per thread block
    b. 256 threads per thread block
    c. 512 threads per thread block
    d. 1024 threads per thread block
    Answer: c. With 512 threads per block, the SM holds 1536/512 = 3 blocks, for 1536 threads in total. Options a and b are capped by the 4-block limit (4 x 128 = 512 and 4 x 256 = 1024 threads), and option d by the 1536-thread limit (only one block of 1024 threads fits).

  2. In vector addition, assume the vector length is 2000, each thread produces one output element, and each thread block contains 512 threads. How many threads are in the grid?
    a. 2000
    b. 2024
    c. 2048
    d. 2096
    Answer: c. The grid needs ceil(2000/512) = 4 blocks, and 4 x 512 = 2048 threads.
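
A minimal sketch of this launch arithmetic and of the bounds check that question 3 refers to (the kernel and variable names here are illustrative assumptions):

__global__ void vecAdd(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // bounds check: threads 2000..2047 do nothing
        C[i] = A[i] + B[i];
}

int n = 2000;
int threadsPerBlock = 512;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(2000/512) = 4
// vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);  // 4 x 512 = 2048 threads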

  3. In the previous question, how many warps exhibit branch divergence because of the bounds check on the vector length?
    a. 1
    b. 2
    c. 3
    d. 6
    Answer: a. Threads are grouped into 32-thread warps. Warps 0 through 61 (threads 0-1983) all pass the bounds check, and the last warp (threads 2016-2047) uniformly fails it, so neither diverges. Only the warp covering threads 1984-2015 contains some threads that satisfy the if condition and some that do not, so it takes two control-flow paths: one divergent warp.

  4. You are writing an image-processing kernel for a 400x900 image. Each thread processes one pixel and the thread blocks are square. How should the grid and block dimensions be set so that each block contains as many threads as possible?
    Answer: The largest square block is 32x32 = 1024 threads, the per-block maximum. Covering the 900-pixel width takes ceil(900/32) = 29 blocks and the 400-pixel height takes ceil(400/32) = 13 blocks, so the grid is 29x13 blocks.
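
A minimal launch sketch for this configuration (the kernel name, image layout, and the pixel operation are assumptions for illustration):

__global__ void processImage(unsigned char* img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)                // edge blocks overhang the image
        img[row * width + col] = 255 - img[row * width + col];
}

dim3 dimBlock(32, 32);                 // 1024 threads, the per-block maximum
dim3 dimGrid((900 + 31) / 32,          // ceil(900/32) = 29
             (400 + 31) / 32);         // ceil(400/32) = 13
// processImage<<<dimGrid, dimBlock>>>(d_img, 900, 400);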

  5. In the setting of the previous question, how many idle threads will be generated?
    Answer: The grid launches 928 x 416 threads (29 x 32 = 928 across, 13 x 32 = 416 down) for only 900 x 400 pixels. Idle threads: 16 x 928 + 28 x 416 - 16 x 28 = 14,848 + 11,648 - 448 = 26,048 (16 extra rows across the full 928-thread width, plus 28 extra columns down the full 416-thread height, minus the 16 x 28 corner counted twice). Equivalently, 928 x 416 - 900 x 400 = 386,048 - 360,000 = 26,048.

  6. A thread block contains 8 threads, all executing the same section of code. The execution times of the threads are 2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, and 2.9 seconds, and the rest of each thread's time is spent waiting at the synchronization point. What percentage of the threads' total execution time is spent waiting at the synchronization point?
    Answer: The slowest thread takes 3.0 s, so every thread effectively runs until the 3.0 s mark. Waiting time = sum of (3.0 - t_i) = 1.0 + 0.7 + 0 + 0.2 + 0.6 + 1.1 + 0.4 + 0.1 = 4.1 s. Total execution time = 8 x 3.0 = 24.0 s, so the fraction spent waiting is 4.1/24.0 ≈ 17.1%.

  7. A CUDA programmer says that if a kernel uses only 32 threads per thread block, the __syncthreads() instruction can be omitted. Is that right?
    Answer: On the surface this seems plausible, because 32 threads form a single warp and warps have traditionally executed in lockstep. But relying on it can cause problems: since the Volta architecture, threads within a warp are scheduled independently and lockstep execution is no longer guaranteed, so it is best to keep the barrier (or at least use warp-level synchronization).
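
A minimal sketch of the hazard (the array-reversal kernel here is an illustrative assumption): even in a single 32-thread block, communicating through shared memory needs a synchronization point on modern GPUs.

__global__ void reverse32(float* data)
{
    __shared__ float buf[32];
    int t = threadIdx.x;
    buf[t] = data[t];
    __syncwarp();            // or __syncthreads(); without it the read below may
                             // see stale data under independent thread scheduling
    data[t] = buf[31 - t];
}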

  8. I want to do tiled matrix multiplication on a 1024x1024 matrix, using a 32x32 grid with 512 threads per thread block and one element per thread. Can I do it?
    Answer: No. The grid launches 32 x 32 x 512 = 524,288 threads in total, but the matrix has 1024 x 1024 = 1,048,576 elements. With one element per thread, half the matrix would go unprocessed; each thread would have to handle two elements instead.
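
One configuration that does cover the matrix with one element per thread, as a sketch (the kernel name is an assumption): keep the 32x32 grid but use full 32x32 blocks, giving 1024 blocks x 1024 threads = 1,048,576 threads.

dim3 dimBlock(32, 32);            // 1024 threads per block
dim3 dimGrid(1024/32, 1024/32);   // 32 x 32 = 1024 blocks
// matrixMul<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, 1024);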

  9. To process a matrix tile by tile, a novice CUDA programmer wrote the following kernel to transpose each tile of the matrix. The tile size is BLOCK_WIDTH x BLOCK_WIDTH, each dimension of matrix A is a multiple of BLOCK_WIDTH, and BLOCK_WIDTH may be any value from 1 to 20. The launch and kernel code are as follows:

dim3 blockDim(BLOCK_WIDTH, BLOCK_WIDTH);
dim3 gridDim(A_width / blockDim.x, A_height / blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);

__global__ void
BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

    // global index of this thread's element in A
    int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx]; // load tile into shared memory
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y]; // write the transposed element back
}

a. For what values of BLOCK_WIDTH will the kernel execute correctly on the device?
Because __syncthreads() is not used, correct execution relies on all threads of a block belonging to one warp (32 threads), so BLOCK_WIDTH^2 <= 32: BLOCK_WIDTH may be 1 to 5 (5 x 5 = 25 <= 32, while 6 x 6 = 36 > 32). Anything greater than 5 will not work, and even 1 to 5 depends on lockstep warp execution, which, as in question 7, is not guaranteed on newer architectures.
b. If the kernel does not execute correctly for all BLOCK_WIDTH values, modify the code so that it does.

Add __syncthreads() between the load into shared memory,

blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];

and the write back,

A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];

so that every element of the tile has been loaded before any thread reads its transposed position.
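
For reference, a corrected kernel with the barrier in place (a sketch based on the code above):

__global__ void
BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

    int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx]; // load tile
    __syncthreads();  // wait until the whole tile is in shared memory
    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y]; // write transposed element
}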


Source: blog.csdn.net/weixin_45773137/article/details/124897806