CUDA Programming Learning - CUDA Shared Memory Performance Optimization (9)

Foreword

References:

Gao Sheng's blog
"CUDA C Programming: The Authoritative Guide"
The official CUDA documentation
"CUDA Programming: Basics and Practice" by Fan Zheyong

All of the code in this article is available on my GitHub and will continue to be updated.

Articles and companion videos are published simultaneously on the WeChat public account "AI Knowledge Story" and on Bilibili: go out to eat three bowls of rice

1: Shared memory

Shared memory is a kind of cache that the programmer can manipulate directly. It has two main uses:
(1) reducing the number of global memory accesses in a kernel and enabling efficient communication within a thread block;
(2) improving the coalescing of global memory accesses.

The following is a reduction written in C++ and run on the CPU. Given an array x with N elements, we want the sum of all of its elements,
that is, sum = x[0] + x[1] + ... + x[N - 1].

#include <stdint.h>
#include <stdlib.h>  // for exit()
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)



#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

const int NUM_REPEATS = 20;
void timing(const real* x, const int N);
real reduce(const real* x, const int N);

int main(void)
{
    const int N = 100000000;
    const int M = sizeof(real) * N;
    real* x = (real*)malloc(M);
    for (int n = 0; n < N; ++n)
    {
        x[n] = 1.23;
    }

    timing(x, N);

    free(x);
    return 0;
}

void timing(const real* x, const int N)
{
    real sum = 0;

    for (int repeat = 0; repeat < NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);  // not wrapped in CHECK: may return cudaErrorNotReady, which is not an error here

        sum = reduce(x, N);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    printf("sum = %f.\n", sum);
}

real reduce(const real* x, const int N)
{
    real sum = 0.0;
    for (int n = 0; n < N; ++n)
    {
        sum += x[n];
    }
    return sum;
}


2: Thread synchronization mechanism

In a multi-threaded program, instructions in two different threads may execute in an order different from the one suggested by the code.
To guarantee that statements in a kernel take effect in their order of appearance, some synchronization mechanism is needed. CUDA provides the synchronization function __syncthreads, which can only be used inside kernel functions. Its simplest form takes no arguments:
__syncthreads();
This function guarantees that every thread in a thread block has finished executing all statements before it, before any thread executes the statements after it. It only synchronizes threads within the same thread block; the execution order of threads in different blocks remains undefined.
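As a minimal sketch of why the barrier matters (the kernel name and the fixed tile size of 128 are illustrative assumptions, not part of the original program), the following kernel reverses each block's tile of data through shared memory; without __syncthreads(), a thread could read an element that its neighbor has not yet written:

// Illustrative kernel: each block reverses its tile via shared memory.
// Assumes it is launched with blockDim.x == 128.
__global__ void reverse_block(const float* in, float* out)
{
    __shared__ float tile[128];
    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[i];      // every thread stages one element
    __syncthreads();        // wait until all writes to tile[] are visible

    // Only safe after the barrier: read an element written by another thread.
    out[i] = tile[blockDim.x - 1 - tid];
}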

3: Use thread synchronization to reduce calculations

Assume the number of array elements is a power of 2 (we will remove this assumption later). We can then add each element in the second half of the array to the corresponding element in the first half. Repeating this process, the first element of the resulting array is the sum of all elements in the original array. This is the so-called binary reduction method. For example, with N = 8 the offsets are 4, 2, 1: first x[0..3] += x[4..7], then x[0..1] += x[2..3], and finally x[0] += x[1]. A serial sketch of the idea follows.
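The sketch below is illustrative host code, not part of the original program:

// Serial binary reduction for N a power of 2 (illustrative only).
// After the loop, x[0] holds the sum of the original N elements.
void binary_reduce(real* x, int N)
{
    for (int offset = N / 2; offset > 0; offset /= 2)
    {
        for (int i = 0; i < offset; ++i)
        {
            x[i] += x[i + offset];  // fold the second half onto the first
        }
    }
}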

3.1 Reduction using only global memory

void __global__ reduce_global(real* d_x, real* d_y)
{
    const int tid = threadIdx.x;
    // Define pointer x as the address of element blockDim.x * blockIdx.x of
    // the array d_x. In different thread blocks, x points to different parts
    // of global memory, so each block processes its own slice of d_x.
    real* x = d_x + blockDim.x * blockIdx.x;

    // blockDim.x >> 1 is equivalent to blockDim.x / 2; in kernels, bit
    // operations are more efficient than the corresponding integer operations.
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            x[tid] += x[tid + offset];
        }
        // Synchronization: threads within the same thread block execute
        // instructions in code order relative to this barrier (intra-block
        // sync only; blocks are not synchronized with each other).
        __syncthreads();
    }

    if (tid == 0)
    {
        d_y[blockIdx.x] = x[0];
    }
}

3.2 Reduction using static shared memory

void __global__ reduce_shared(real* d_x, real* d_y)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int n = bid * blockDim.x + tid;
    // Define the shared memory array s_y[128]; note the keyword __shared__.
    __shared__ real s_y[128];
    s_y[tid] = (n < N) ? d_x[n] : 0.0;
    __syncthreads();
    // The reduction now uses the shared memory variable in place of the
    // original global memory variable. Remember: each thread block operates
    // on its own copy. When the reduction finishes, s_y[0] in each block
    // holds the sum of that block's array elements.
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            s_y[tid] += s_y[tid + offset];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        d_y[bid] = s_y[0];
    }
}

3.3 Reduction using dynamic shared memory

In the previous kernel, we gave the shared memory array a fixed length (128) and assumed that this length equals the execution configuration parameter block_size (i.e., blockDim.x inside the kernel). If the array length is accidentally written incorrectly when defining the shared memory variable, the kernel may produce errors or lose performance.

One way to reduce the chance of such an error is to use dynamic shared memory:

  1. Add a third parameter to the execution configuration when launching the kernel:
<<<grid_size, block_size, sizeof(real) * block_size>>>
The first two parameters are the grid size and the block size;
the third is the number of bytes of dynamic shared memory each thread block needs.
  2. Change the declaration of the shared memory variable in the kernel, as shown in the sketch after this list:
extern __shared__ real s_y[];  // dynamic declaration
__shared__ real s_y[128];      // static declaration
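The sketch below is illustrative (the kernel name my_reduce is an assumption, and the reduction body is elided); it shows only how the declaration and the launch configuration fit together:

// Kernel using dynamically allocated shared memory.
void __global__ my_reduce(const real* d_x, real* d_y)
{
    extern __shared__ real s_y[];  // size is fixed at launch time, not here
    s_y[threadIdx.x] = d_x[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... binary reduction over s_y as before ...
}

// Launch: the third configuration parameter is the number of bytes of
// dynamic shared memory per block.
my_reduce<<<grid_size, block_size, sizeof(real) * block_size>>>(d_x, d_y);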

Complete reduction program:

#include <stdint.h>
#include <stdlib.h>  // for exit()
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

const int NUM_REPEATS = 100;
const int N = 100000000;
const int M = sizeof(real) * N;
const int BLOCK_SIZE = 128;

void timing(real* h_x, real* d_x, const int method);

int main(void)
{
    real* h_x = (real*)malloc(M);
    for (int n = 0; n < N; ++n)
    {
        h_x[n] = 1.23;
    }
    real* d_x;
    CHECK(cudaMalloc(&d_x, M));

    printf("\nUsing global memory only:\n");
    timing(h_x, d_x, 0);
    printf("\nUsing static shared memory:\n");
    timing(h_x, d_x, 1);
    printf("\nUsing dynamic shared memory:\n");
    timing(h_x, d_x, 2);

    free(h_x);
    CHECK(cudaFree(d_x));
    return 0;
}

void __global__ reduce_global(real* d_x, real* d_y)
{
    const int tid = threadIdx.x;
    // Define pointer x as the address of element blockDim.x * blockIdx.x of
    // the array d_x. In different thread blocks, x points to different parts
    // of global memory, so each block processes its own slice of d_x.
    real* x = d_x + blockDim.x * blockIdx.x;

    // blockDim.x >> 1 is equivalent to blockDim.x / 2; in kernels, bit
    // operations are more efficient than the corresponding integer operations.
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            x[tid] += x[tid + offset];
        }
        // Synchronization: threads within the same thread block execute
        // instructions in code order relative to this barrier (intra-block
        // sync only; blocks are not synchronized with each other).
        __syncthreads();
    }

    if (tid == 0)
    {
        d_y[blockIdx.x] = x[0];
    }
}

void __global__ reduce_shared(real* d_x, real* d_y)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int n = bid * blockDim.x + tid;
    // Define the shared memory array s_y[128]; note the keyword __shared__.
    __shared__ real s_y[128];
    // Copy data from global memory into shared memory. A key feature of
    // shared memory: each thread block has its own copy of the variable.
    s_y[tid] = (n < N) ? d_x[n] : 0.0;
    // Call __syncthreads to synchronize within the thread block.
    __syncthreads();
    // The reduction now uses the shared memory variable in place of the
    // original global memory variable. Remember: each thread block operates
    // on its own copy. When the reduction finishes, s_y[0] in each block
    // holds the sum of that block's array elements.
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            s_y[tid] += s_y[tid + offset];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        d_y[bid] = s_y[0];
    }
}

void __global__ reduce_dynamic(real* d_x, real* d_y)
{
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int n = bid * blockDim.x + tid;
    // Declare the dynamic shared memory array s_y[] with the qualifier
    // extern; the array size must not be specified here.
    extern __shared__ real s_y[];
    s_y[tid] = (n < N) ? d_x[n] : 0.0;
    __syncthreads();

    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1)
    {
        if (tid < offset)
        {
            s_y[tid] += s_y[tid + offset];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        // Copy each block's reduction result from shared memory s_y[0]
        // to global memory d_y[bid].
        d_y[bid] = s_y[0];
    }
}

real reduce(real* d_x, const int method)
{
    int grid_size = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    const int ymem = sizeof(real) * grid_size;
    const int smem = sizeof(real) * BLOCK_SIZE;
    real* d_y;
    CHECK(cudaMalloc(&d_y, ymem));
    real* h_y = (real*)malloc(ymem);

    switch (method)
    {
    case 0:
        reduce_global<<<grid_size, BLOCK_SIZE>>>(d_x, d_y);
        break;
    case 1:
        reduce_shared<<<grid_size, BLOCK_SIZE>>>(d_x, d_y);
        break;
    case 2:
        reduce_dynamic<<<grid_size, BLOCK_SIZE, smem>>>(d_x, d_y);
        break;
    default:
        printf("Error: wrong method\n");
        exit(1);
        break;
    }

    CHECK(cudaMemcpy(h_y, d_y, ymem, cudaMemcpyDeviceToHost));

    real result = 0.0;
    for (int n = 0; n < grid_size; ++n)
    {
        result += h_y[n];
    }

    free(h_y);
    CHECK(cudaFree(d_y));
    return result;
}

void timing(real* h_x, real* d_x, const int method)
{
    real sum = 0;

    for (int repeat = 0; repeat < NUM_REPEATS; ++repeat)
    {
        CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice));

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);  // not wrapped in CHECK: may return cudaErrorNotReady, which is not an error here

        sum = reduce(d_x, method);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    printf("sum = %f.\n", sum);
}

Result comparison:
Using global memory only takes about 25 ms. The computed sum is wrong: it should be 1.23*10^8, but the printed value carries many spurious decimals.
Using static shared memory takes about 28 ms; the result is also wrong.
Using dynamic shared memory takes about 29 ms.
Conclusion:
(1) Global memory has the slowest access speed of all device memories, and its use should be minimized. Registers are the most efficient, but in problems that require thread cooperation, registers visible only to a single thread are not enough; we need shared memory, which is visible to the whole thread block.
(2) There is almost no difference in execution time between a kernel using dynamic shared memory and one using static shared memory, so using dynamic shared memory does not hurt performance, while it can sometimes improve maintainability.
(3) Using shared memory to improve access to global memory does not necessarily improve kernel performance. When optimizing CUDA programs, different optimization schemes should generally be tested and compared.
(4) The error in the computed sum comes from the "big numbers eat small numbers" phenomenon in the accumulation. Single-precision floating-point numbers have only 6 or 7 accurate significant digits. In the reduce function above, once sum has accumulated beyond roughly 30 million, adding 1.23 to it no longer changes its value (the small number is "eaten" by the large one, which stays unchanged).
A well-known remedy is the Kahan summation algorithm, sketched below.
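A minimal host-side sketch of Kahan (compensated) summation, assuming single precision; the function name kahan_sum is illustrative, not part of the original program:

// Kahan summation: c accumulates the low-order bits that a plain
// addition would otherwise lose.
float kahan_sum(const float* x, const int N)
{
    float sum = 0.0f;
    float c = 0.0f;                // running compensation
    for (int n = 0; n < N; ++n)
    {
        const float y = x[n] - c;  // corrected next term
        const float t = sum + y;   // low-order bits of y are lost here
        c = (t - sum) - y;         // recover the lost part
        sum = t;
    }
    return sum;
}

With this scheme the rounding error no longer grows with N, so summing 10^8 copies of 1.23 in single precision stays close to 1.23*10^8.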
