Detailed explanation of CUDA shared memory

Why do you need shared memory?

Shared memory is much faster to access than global memory. For programs that access global memory many times, especially operations that would otherwise write intermediate results back to global memory repeatedly, staging the temporary results in shared memory and computing there significantly improves speed.

1. For example, reduction sum, which adds up a series of numbers: the reduction algorithm repeatedly folds the second half of the array onto the first half, halving the active range in each step.
If global memory is used, each halving step must read and write half of the remaining elements in global memory, so the total number of global-memory accesses is N + N/2 + N/4 + ... + 1. Reducing the number of global-memory accesses therefore improves speed.
How can this be improved? Copy the data from global memory into shared memory once, at the start. Then global memory is accessed only N times in total, and each thread needs to access it only once.
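Summing the series makes the saving concrete: N + N/2 + N/4 + ... + 1 = 2N - 1, so the shared-memory version needs roughly half the global-memory accesses (N instead of about 2N).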

2. Another example is histogram statistics:
Suppose the histogram has 256 bins and N numbers are counted. With global memory only, counting requires reading each of the N inputs and then atomically updating a global bin for each, so global memory is accessed 2N times. If the 256 bins are moved into shared memory, only N global-memory accesses are needed, and the operation is faster.
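As a rough estimate (my back-of-the-envelope, not from the original post): with shared bins, the global traffic is the N input reads plus a final merge of 256 bins per block, i.e. N + 256 × (number of blocks), versus about 2N with global bins. Since the number of blocks is tiny compared with N, this roughly halves global-memory traffic.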

What is shared memory?

In CUDA, a grid contains multiple blocks, and a block contains multiple threads.
Shared memory is shared only by the threads within a block; different blocks do not share it.
It is declared inside the kernel function with the __shared__ keyword, for example:
__shared__ int sharedata[thread_perblock];
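A minimal sketch of the whole pattern (the kernel name copy_to_shared and the size 128 are illustrative assumptions, not from the original post):

    #define thread_perblock 128

    __global__ void copy_to_shared(const int *data) {
        // One shared-memory slot per thread in the block.
        __shared__ int sharedata[thread_perblock];

        // Each thread stages one element of global memory into its own slot.
        sharedata[threadIdx.x] = data[threadIdx.x + blockDim.x * blockIdx.x];

        // ... synchronize, then compute on sharedata (see the next section) ...
    }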

Intra-block synchronization

When working with shared memory, it is generally necessary to call __syncthreads() to synchronize within the thread block. It acts as a barrier: every thread in the block must reach it before any thread is allowed to continue, which guarantees that all shared-memory writes issued before the barrier are visible afterwards.
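A small sketch of why the barrier matters: without it, a thread could read a neighbor's slot before the neighbor has written it (the kernel and names here are illustrative, and blockDim.x is assumed to be 128):

    __global__ void neighbor_read(const int *in, int *out) {
        __shared__ int s[128];  // assumes blockDim.x == 128
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        s[threadIdx.x] = in[tid];

        __syncthreads();  // barrier: every write to s[] is now visible

        // Safe only after the barrier: this reads a value written by another thread.
        out[tid] = s[(threadIdx.x + 1) % blockDim.x];
    }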

How to implement it

To use shared memory, its size is generally set to the number of threads in the block. That way each thread owns one element of the shared array, indexed by its thread id within the block:
int thread_id = threadIdx.x;
Shared memory is not shared across blocks, so the mapping from a global index to a shared-memory index is:
int tid = threadIdx.x + blockDim.x * blockIdx.x;
sharedata[thread_id] = data[tid];
Whichever block tid falls in determines which block's shared memory the element is mapped to. For example, with blockDim.x = 8, global indices 16..23 fall in block 2 and map to sharedata[0..7] of that block's copy.

Once the mapping is done, the computation can proceed. Work inside the block uses the block as the basic unit, with thread_id as the index within the block.
For example, reduction sum:

    int mid = thread_perblock / 2;  // reduction sum; thread_perblock must be a power of 2
    while (mid != 0) {
        if (thread_id < mid) {
            sharedata[thread_id] += sharedata[thread_id + mid];
        }
        __syncthreads();
        mid /= 2;
    }

After each round of the in-block reduction, __syncthreads() waits for every thread in the block to finish that round; the while loop controls the number of rounds.
After a block has finished, its result in shared memory is written out to global memory, completing the whole process:

    out[blockIdx.x] = sharedata[0];  // sharedata in different blocks does not interfere

The code

1. Reduction sum:

#include <cstdio>

#define thread_perblock 8

__global__ void reduce_add(int *data, int *out, int N) {
    __shared__ int sharedata[thread_perblock];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int thread_id = threadIdx.x;

    // Stage the data into shared memory. When the array is larger than the
    // total number of threads, each thread handles several elements, striding
    // by (blockDim.x * gridDim.x). Shared memory is not zero-initialized, so
    // clear each slot first.
    sharedata[thread_id] = 0;
    while (tid < N) {
        printf("tid = %d, %d\n", tid, data[tid]);
        sharedata[thread_id] += data[tid];
        tid += blockDim.x * gridDim.x;
    }
    __syncthreads();  // in-block synchronization

    int mid = thread_perblock / 2;  // reduction sum; thread_perblock must be a power of 2
    while (mid != 0) {
        if (thread_id < mid) {
            sharedata[thread_id] += sharedata[thread_id + mid];
            printf("blockIdx.x = %d, sharedata[%d] = %d, mid = %d\n", blockIdx.x, thread_id, sharedata[thread_id], mid);
        }
        __syncthreads();
        mid /= 2;
    }

    out[blockIdx.x] = sharedata[0];  // sharedata in different blocks does not interfere
    printf("blockIdx.x = %d, sharedata[0] = %d\n", blockIdx.x, sharedata[0]);
}
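A minimal host-side sketch of how reduce_add might be driven (the block/grid sizes, the test data, and the final CPU-side sum over the per-block partials are my assumptions, not from the original post; it assumes the same .cu file as the kernel above):

    int main() {
        const int N = 32;
        const int blocks = 4;  // with thread_perblock == 8, 4 * 8 = 32 threads cover N
        int h_data[N], h_out[blocks];
        for (int i = 0; i < N; ++i) h_data[i] = 1;  // expected sum: N

        int *d_data, *d_out;
        cudaMalloc(&d_data, N * sizeof(int));
        cudaMalloc(&d_out, blocks * sizeof(int));
        cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

        reduce_add<<<blocks, thread_perblock>>>(d_data, d_out, N);
        cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

        int sum = 0;  // final reduction over the per-block partial sums
        for (int i = 0; i < blocks; ++i) sum += h_out[i];
        printf("sum = %d\n", sum);

        cudaFree(d_data);
        cudaFree(d_out);
        return 0;
    }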

2. Histogram statistics:

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {
    // One shared copy of the 256 bins per block; the kernel must be launched
    // with blockDim.x == 256 so that each thread owns exactly one bin.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    // Wait until every thread has initialized its bin.
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&temp[buffer[i]], 1);  // count in shared memory first, reducing global-memory accesses
        i += offset;
    }
    __syncthreads();  // wait for all threads to finish counting this block's share

    // Merge this block's bins into the global histogram.
    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
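A sketch of how the kernel might be launched (the grid size and the zero-initialization with cudaMemset are my assumptions; the 256-thread block size is required by the kernel above):

    // d_buffer holds `size` bytes of input; d_histo is 256 unsigned ints.
    cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));
    int blocks = 32;  // illustrative; tune to the GPU
    histo_kernel<<<blocks, 256>>>(d_buffer, size, d_histo);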

Origin: blog.csdn.net/long630576366/article/details/131282779