CUDA programming (2): basics and simple examples (parallel reduction, shared memory)

What is Parallel Reduction

       Parallel reduction is a very basic parallel algorithm. Simply put, given N input values, a binary operator that satisfies the associative law is applied across them until a single result remains.

Where parallel reduction applies

Data characteristics:

(1) The elements of the dataset have no ordering requirement.

(2) The data can be divided into several small subsets, with each thread (or block) processing one subset.

For example: operations such as finding the maximum, finding the minimum, summation, and multiplication; a serial sketch of the sum case follows below.
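
       For intuition, the serial counterpart simply folds the operator over the data in a loop. A minimal CPU sketch of a sum reduction (the function name ReduceSerial is illustrative, not from any library):

#include <iostream>
using namespace std;

// Serial sum reduction: fold the binary operator (+) over N inputs to get one result, O(N) steps.
int ReduceSerial(const int* data, int n)
{
  int result = 0;            // identity element of addition
  for (int i = 0; i < n; i++)
    result += data[i];
  return result;
}

int main()
{
  int data[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  cout << ReduceSerial(data, 8) << endl;  // prints 28
  return 0;
}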

Unoptimized parallel reduction

       First allocate storage for 8 ints, as shown in the first row of the figure below. Add each pair of adjacent numbers and write the result back into the slot of the first number of the pair. In the next iteration, these partial results are again added in pairs to produce the next level of results, and this repeats until a single final result remains. Compared with serial computation, the number of steps drops from O(N) to O(log N). (The computation time can be shortened, but only when the data volume is large, typically hundreds of thousands of elements or more; otherwise a serial sum may well be faster.)

[Figure: pairwise (interleaved-addressing) reduction of 8 elements in global memory]

#include <cuda_runtime.h>
#include <iostream>
using namespace std;

//computes one partial sum per block and writes it to outs[blockIdx.x]
__global__ void Reduction1(int* in, int* outs, int sizes)
{
  int tid = threadIdx.x;
  int myid = blockIdx.x * blockDim.x + threadIdx.x;
  //in-place reduction in global memory
  for (int stride = 1; stride < blockDim.x; stride *= 2)
  {
    //only threads whose index is a multiple of 2*stride do the add;
    //the bounds check keeps the last, partially filled block in range
    if ((tid % (2 * stride)) == 0 && myid + stride < sizes)
    {
      in[myid] += in[myid + stride];
    }
    __syncthreads();  //every thread in the block reaches this barrier
  }
  if (tid == 0 && myid < sizes)
    outs[blockIdx.x] = in[myid];
}


int main()
{
  int allnum = 8;
  int* data = new int[allnum];
  for (int i = 0; i < allnum; i++)
  {
    data[i] = i;
  }


  cudaError_t cudaStatus;
  bool label = true;
  //select the CUDA device
  cudaStatus = cudaSetDevice(0);
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaSetDevice failed!" << endl;
    label = false;
  }
  //define the grid and block dimensions (shape)
  dim3 threadsPerBlock(1024, 1);//[x,y,z]
  dim3 blocksPerGrid((allnum + threadsPerBlock.x - 1) / threadsPerBlock.x, 1);
  //declare pointers and allocate GPU memory for them
  int* InGpu = nullptr;
  int* OutGpu = nullptr;
  cudaStatus = cudaMalloc((void**)&InGpu, sizeof(int) * allnum);         
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaMalloc InGpu failed!" << endl;
    label = false;
  }
  cudaStatus = cudaMalloc((void**)&OutGpu, sizeof(int) * blocksPerGrid.x);
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaMalloc OutGpu failed!" << endl;
    label = false;
  }
  //copy the input data from the CPU to the GPU
  cudaStatus = cudaMemcpy(InGpu, data, sizeof(int) * allnum, cudaMemcpyHostToDevice);
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaMemcpy InGpu failed!" << endl;
    label = false;
  }
  //launch the kernel on the GPU
  Reduction1<<<blocksPerGrid, threadsPerBlock>>>(InGpu, OutGpu, allnum);
  cudaStatus = cudaGetLastError();
  if (cudaStatus != cudaSuccess)
  {
    cout << "addKernel launch failed:" << cudaGetErrorString(cudaStatus) << endl;
    label = false;
  }
  cudaStatus = cudaDeviceSynchronize();
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaDeviceSynchronize failed:" << cudaGetErrorString(cudaStatus) << endl;
    label = false;
  }
  //copy the result back to CPU memory
  int* outs = new int[blocksPerGrid.x];
  cudaStatus = cudaMemcpy(outs, OutGpu, sizeof(int) * blocksPerGrid.x, cudaMemcpyDeviceToHost);
  if (cudaStatus != cudaSuccess)
  {
    cout << "cudaMemcpy OutGpu failed!" << endl;
    label = false;
  }
  //free GPU memory
  cudaFree(InGpu);
  cudaFree(OutGpu);
  cudaDeviceReset();
  for (unsigned int i = 0; i < blocksPerGrid.x; i++)
  {
    cout << outs[i] << endl;
  }
  delete[] data;
  delete[] outs;
  return 0;
}

       On closer inspection you will notice that only half of the threads do useful work in the first round, and the number of active threads halves again after each round. Because the active threads are interleaved across the block, however, every warp still has to be scheduled, and due to the SIMT hardware design this divergent access pattern hurts efficiency.
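
       Whether the GPU version actually beats a serial sum, and how much the optimizations below help, can be checked by timing the kernel with CUDA events. A minimal sketch, assuming the Reduction1 kernel and the InGpu/OutGpu buffers from the example above are already set up:

//measure kernel time in milliseconds with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Reduction1<<<blocksPerGrid, threadsPerBlock>>>(InGpu, OutGpu, allnum);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   //wait until the kernel and the stop event have finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);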

Optimized parallel reduction

       Modify the memory access pattern so that the active threads stay contiguous, as shown in the figure below:

[Figure: sequential-addressing reduction, with the active threads kept contiguous]

__global__ void Reduction2(int* in, int* outs, int sizes)
{
  unsigned int tid = threadIdx.x;
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
  //in-place reduction in global memory, sequential addressing
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)//right shift by 1 divides by 2
  {
    //the bounds check keeps the last, partially filled block in range
    if (tid < stride && idx + stride < sizes)
    {
      in[idx] += in[idx + stride];
    }
    __syncthreads();  //every thread in the block reaches this barrier
  }
  if (tid == 0 && idx < sizes)
    outs[blockIdx.x] = in[idx];
}

       Optimizing the memory access pattern can significantly improve the running speed of the program; in particular, coalesced loads and stores should be used wherever possible. Using shared memory can improve performance further: its access latency is roughly 100 times lower than that of uncached global memory. Shared memory is a scarce resource, however, and only a small amount is available per block, so making full use of it is a real test of an engineer's skill.
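
       How much shared memory is actually available per block can be queried at runtime with cudaGetDeviceProperties; a small sketch for device 0:

//query the shared-memory capacity of device 0
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
cout << prop.name << ": " << prop.sharedMemPerBlock << " bytes of shared memory per block" << endl;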

Shared-memory parallel reduction

       When data is shared between threads we need to be careful to avoid race conditions: although the threads of a block logically run in parallel, not all of them can execute physically at the same time. To guarantee correct results when parallel threads cooperate, the threads must be synchronized; CUDA provides __syncthreads() for this. The __shared__ qualifier declares shared memory in CUDA C/C++ device code.

__global__ void Reduction2_share(int* in, int* outs, int sizes)
{
  extern __shared__ int sharem[];  //dynamically sized shared memory
  unsigned int tid = threadIdx.x;
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

  //load one element per thread into shared memory; pad out-of-range threads with 0
  sharem[tid] = (idx < sizes) ? in[idx] : 0;
  __syncthreads();  //make sure every thread in the block has finished its load
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)//right shift by 1 divides by 2
  {
    if (tid < stride)
    {
      sharem[tid] += sharem[tid + stride];
    }
    __syncthreads();
  }
  if (tid == 0)
    outs[blockIdx.x] = sharem[0];
}

       In this example the amount of shared memory is not known at compile time, so the size of the shared-memory allocation per thread block (in bytes) must be passed as the optional third execution-configuration parameter, as shown below:

Reduction2_share<<<blocksPerGrid, threadsPerBlock, threadsPerBlock.x * sizeof(int)>>>(InGpu, OutGpu, allnum);
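
       If the block size is fixed at compile time, the shared array can instead be declared with a static size inside the kernel, and the third execution-configuration parameter is not needed. A sketch of that variant (the kernel name Reduction2_share_static is illustrative):

#define BLOCK_SIZE 1024

__global__ void Reduction2_share_static(int* in, int* outs, int sizes)
{
  __shared__ int sharem[BLOCK_SIZE];  //statically sized shared memory
  unsigned int tid = threadIdx.x;
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

  sharem[tid] = (idx < sizes) ? in[idx] : 0;  //pad out-of-range threads with 0
  __syncthreads();

  for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
  {
    if (tid < stride)
      sharem[tid] += sharem[tid + stride];
    __syncthreads();
  }
  if (tid == 0)
    outs[blockIdx.x] = sharem[0];
}

//launched without the dynamic shared-memory size:
//Reduction2_share_static<<<blocksPerGrid, threadsPerBlock>>>(InGpu, OutGpu, allnum);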

Summary

       CUDA parallel reduction can significantly improve the performance of parallel programs by partitioning the data across blocks to reduce global memory traffic and by using shared memory to accelerate the per-block computation. In addition, choosing an appropriate thread block size and optimizing the shared-memory access pattern can improve performance further.
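
       Note that each kernel above only produces one partial sum per block, so when the grid contains more than one block a final combine step is still needed. A minimal host-side sketch, reusing the outs array copied back in the first example (for very large grids the partial sums could instead be reduced by a second kernel launch):

//combine the per-block partial sums on the CPU to obtain the final result
int total = 0;
for (unsigned int i = 0; i < blocksPerGrid.x; i++)
  total += outs[i];
cout << "total = " << total << endl;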

