CUDA Programming Notes: GPU Acceleration and Event-Based Timing (5)

Foreword

References:

Gao Sheng's blog
"Professional CUDA C Programming" (CUDA C编程权威指南)
The official CUDA documentation
"CUDA Programming: Basics and Practice" by Fan Zheyong

All the code in this article is available on my GitHub and will be updated over time.

Articles and companion videos are published on the WeChat public account "AI Knowledge Story" and on Bilibili under the handle 出去吃三碗饭 ("go out to eat three bowls of rice").

1: Commonly used timing methods

 1  cudaEvent_t start, stop;
 2  CHECK(cudaEventCreate(&start));
 3  CHECK(cudaEventCreate(&stop));
 4  CHECK(cudaEventRecord(start));
 5  cudaEventQuery(start); // the CHECK macro must not be used here (see the discussion in Chapter 4)
 6
 7  // code block to be timed
 8
 9  CHECK(cudaEventRecord(stop));
10  CHECK(cudaEventSynchronize(stop));
11  float elapsed_time;
12  CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
13  printf("Time = %g ms.\n", elapsed_time);
14
15  CHECK(cudaEventDestroy(start));
16  CHECK(cudaEventDestroy(stop));

Line 1: Define two variables, start and stop, of the CUDA event type cudaEvent_t.
Lines 2-3: Initialize the two events with cudaEventCreate.
Line 4: Pass start to cudaEventRecord to record a start event just before the code block to be timed.
Line 5: This call can be omitted for GPUs in TCC driver mode, but must be kept for GPUs in WDDM driver mode.
Line 7: Stands for the code block to be timed.
Line 9: Pass stop to cudaEventRecord to record a stop event just after the code block to be timed.
Line 10: cudaEventSynchronize makes the host wait until the event stop has actually been recorded.
Lines 11-13: Call cudaEventElapsedTime to compute the elapsed time between the two events start and stop (in ms) and print it.
Lines 15-16: Call cudaEventDestroy to destroy the two CUDA events start and stop.
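The sixteen-line pattern above can also be wrapped in a small RAII helper so it is not repeated at every call site. The sketch below is my own illustration, not from the book; the class name GpuTimer is an invented one, and CHECK-style error handling is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal RAII wrapper around the cudaEvent_t timing pattern above.
// GpuTimer is an illustrative name, not part of the CUDA API.
class GpuTimer {
    cudaEvent_t start_, stop_;
public:
    GpuTimer()  { cudaEventCreate(&start_); cudaEventCreate(&stop_); }
    ~GpuTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }

    void start() { cudaEventRecord(start_); }

    // Records the stop event, waits for it, and returns elapsed milliseconds.
    float stop() {
        cudaEventRecord(stop_);
        cudaEventSynchronize(stop_);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start_, stop_);
        return ms;
    }
};
```

Usage: construct a GpuTimer, call start() before the code block and stop() after it; the destructor releases both events automatically.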

2: Test add2cpu performance

Adding 100,000,000 elements on the CPU gives the timings below, about 171 ms.


#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#ifdef USE_DP
typedef double real;
const real EPSILON = 1.0e-15;
#else
typedef float real;
const real EPSILON = 1.0e-6f;
#endif

const int NUM_REPEATS = 10;
const real a = 1.23;
const real b = 2.34;
const real c = 3.57;
void add(const real* x, const real* y, real* z, const int N);
void check(const real* z, const int N);

int main(void)
{
    const int N = 100000000;
    const int M = sizeof(real) * N;
    real* x = (real*)malloc(M);
    real* y = (real*)malloc(M);
    real* z = (real*)malloc(M);

    for (int n = 0; n < N; ++n)
    {
        x[n] = a;
        y[n] = b;
    }

    float t_sum = 0;
    float t2_sum = 0;
    for (int repeat = 0; repeat <= NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);

        add(x, y, z, N);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        if (repeat > 0)
        {
            t_sum += elapsed_time;
            t2_sum += elapsed_time * elapsed_time;
        }

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    const float t_ave = t_sum / NUM_REPEATS;
    const float t_err = sqrt(t2_sum / NUM_REPEATS - t_ave * t_ave);
    printf("Time = %g +- %g ms.\n", t_ave, t_err);

    check(z, N);

    free(x);
    free(y);
    free(z);
    return 0;
}

void add(const real* x, const real* y, real* z, const int N)
{
    for (int n = 0; n < N; ++n)
    {
        z[n] = x[n] + y[n];
    }
}

void check(const real* z, const int N)
{
    bool has_error = false;
    for (int n = 0; n < N; ++n)
    {
        if (fabs(z[n] - c) > EPSILON)
        {
            has_error = true;
        }
    }
    printf("%s\n", has_error ? "Has errors" : "No errors");
}



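The repeat loop above reports mean ± standard deviation over NUM_REPEATS timed runs, discarding the first (warm-up) run. The same statistics can be checked in isolation; this host-only sketch uses a made-up set of sample times and a helper name (timing_stats) of my own:

```cpp
#include <cmath>

// Mean and population standard deviation of n timing samples (in ms),
// matching the t_ave / t_err computation in the listing above.
void timing_stats(const float* times, int n, float* t_ave, float* t_err)
{
    float t_sum = 0.0f, t2_sum = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        t_sum += times[i];
        t2_sum += times[i] * times[i];
    }
    *t_ave = t_sum / n;
    *t_err = sqrtf(t2_sum / n - *t_ave * *t_ave);
}
```

For example, the samples {170, 172, 171, 169, 173} ms give a mean of 171 ms with a standard deviation of sqrt(2) ≈ 1.41 ms.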

3: Test add2gpu performance

Adding 100,000,000 elements on the GPU gives the timings below, about 7.94 ms.
We can speed this up further by tuning the values of grid_size and block_size in the kernel launch:
add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
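The launch configuration in the listing below uses the usual integer ceiling division so that grid_size * block_size covers all N elements. As a small host-side sketch (the helper name grid_size_for is my own):

```cpp
// Number of blocks needed so that grid_size * block_size >= N:
// the integer ceiling-division idiom used in the listing below.
int grid_size_for(int N, int block_size)
{
    return (N + block_size - 1) / block_size;
}
```

With N = 100,000,000 and block_size = 128 this gives 781,250 blocks; if N were one element larger, one extra block would be launched and the kernel's `if (n < N)` guard would mask the unused threads.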


#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#ifdef USE_DP
typedef double real;
const real EPSILON = 1.0e-15;
#else
typedef float real;
const real EPSILON = 1.0e-6f;
#endif

const int NUM_REPEATS = 10;
const real a = 1.23;
const real b = 2.34;
const real c = 3.57;
void __global__ add(const real *x, const real *y, real *z, const int N);
void check(const real *z, const int N);

int main(void)
{
    const int N = 100000000;
    const int M = sizeof(real) * N;
    real *h_x = (real*) malloc(M);
    real *h_y = (real*) malloc(M);
    real *h_z = (real*) malloc(M);

    for (int n = 0; n < N; ++n)
    {
        h_x[n] = a;
        h_y[n] = b;
    }

    real *d_x, *d_y, *d_z;
    CHECK(cudaMalloc((void **)&d_x, M));
    CHECK(cudaMalloc((void **)&d_y, M));
    CHECK(cudaMalloc((void **)&d_z, M));
    CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice));

    const int block_size = 128;
    const int grid_size = (N + block_size - 1) / block_size;

    float t_sum = 0;
    float t2_sum = 0;
    for (int repeat = 0; repeat <= NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);

        add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        if (repeat > 0)
        {
            t_sum += elapsed_time;
            t2_sum += elapsed_time * elapsed_time;
        }

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    const float t_ave = t_sum / NUM_REPEATS;
    const float t_err = sqrt(t2_sum / NUM_REPEATS - t_ave * t_ave);
    printf("Time = %g +- %g ms.\n", t_ave, t_err);

    CHECK(cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost));
    check(h_z, N);

    free(h_x);
    free(h_y);
    free(h_z);
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    CHECK(cudaFree(d_z));
    return 0;
}

void __global__ add(const real *x, const real *y, real *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N)
    {
        z[n] = x[n] + y[n];
    }
}

void check(const real *z, const int N)
{
    bool has_error = false;
    for (int n = 0; n < N; ++n)
    {
        if (fabs(z[n] - c) > EPSILON)
        {
            has_error = true;
        }
    }
    printf("%s\n", has_error ? "Has errors" : "No errors");
}



4: Increase arithmetic complexity: arithmetic2cpu

A loop and sqrt calls are added to the per-element computation to raise its complexity. With N set to 10000, the
CPU running time rises from 171 ms to 370 ms (although N is much smaller, each element now costs far more to compute).
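Concretely: starting from x = 0, the inner loop increments x until sqrt(x) >= x0 = 100, so it performs 10,000 sqrt-and-increment steps per element. This host-only sketch of a single element's work (the helper name iterations_for is my own) makes the cost explicit:

```cpp
#include <cmath>

// Number of inner-loop iterations for one element starting at x:
// keeps incrementing until sqrt(x) >= x0, as in arithmetic() below.
int iterations_for(double x, double x0)
{
    int iters = 0;
    while (sqrt(x) < x0)
    {
        x += 1.0;
        ++iters;
    }
    return iters;
}
```

So each of the 10,000 elements does 10,000 iterations: roughly 10^8 sqrt evaluations in total, which is why the smaller array is slower than the plain 10^8-element addition.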

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

const int NUM_REPEATS = 10;
const real x0 = 100.0;
void arithmetic(real* x, const real x0, const int N);

int main(void)
{
    const int N = 10000;
    const int M = sizeof(real) * N;
    real* x = (real*)malloc(M);

    float t_sum = 0;
    float t2_sum = 0;
    for (int repeat = 0; repeat <= NUM_REPEATS; ++repeat)
    {
        for (int n = 0; n < N; ++n)
        {
            x[n] = 0.0;
        }

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);

        arithmetic(x, x0, N);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        if (repeat > 0)
        {
            t_sum += elapsed_time;
            t2_sum += elapsed_time * elapsed_time;
        }

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    const float t_ave = t_sum / NUM_REPEATS;
    const float t_err = sqrt(t2_sum / NUM_REPEATS - t_ave * t_ave);
    printf("Time = %g +- %g ms.\n", t_ave, t_err);

    free(x);
    return 0;
}

void arithmetic(real* x, const real x0, const int N)
{
    for (int n = 0; n < N; ++n)
    {
        real x_tmp = x[n];
        while (sqrt(x_tmp) < x0)
        {
            ++x_tmp;
        }
        x[n] = x_tmp;
    }
}


5: Increase arithmetic complexity: arithmetic2gpu

The same loop and sqrt are used in the per-element computation on the GPU. With N set to 10000, the
GPU running time rises from 7.94 ms to 10.97 ms (again, N is smaller but each element costs far more to compute).

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#ifdef USE_DP
typedef double real;
#else
typedef float real;
#endif

const int NUM_REPEATS = 10;
const real x0 = 100.0;
void __global__ arithmetic(real* d_x, const real x0, const int N);

int main(void)
{
    const int N = 10000;
    const int M = sizeof(real) * N;
    real* h_x = (real*)malloc(M);
    real* d_x;
    CHECK(cudaMalloc((void**)&d_x, M));

    float t_sum = 0;
    float t2_sum = 0;
    for (int repeat = 0; repeat <= NUM_REPEATS; ++repeat)
    {
        for (int n = 0; n < N; ++n)
        {
            h_x[n] = 0.0;
        }
        CHECK(cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice));

        const int block_size = 128;
        const int grid_size = (N + block_size - 1) / block_size;

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        cudaEventQuery(start);

        arithmetic<<<grid_size, block_size>>>(d_x, x0, N);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        if (repeat > 0)
        {
            t_sum += elapsed_time;
            t2_sum += elapsed_time * elapsed_time;
        }

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    const float t_ave = t_sum / NUM_REPEATS;
    const float t_err = sqrt(t2_sum / NUM_REPEATS - t_ave * t_ave);
    printf("Time = %g +- %g ms.\n", t_ave, t_err);

    free(h_x);
    CHECK(cudaFree(d_x));
    return 0;
}

void __global__ arithmetic(real* d_x, const real x0, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N)
    {
        real x_tmp = d_x[n];
        while (sqrt(x_tmp) < x0)
        {
            ++x_tmp;
        }
        d_x[n] = x_tmp;
    }
}


6: Experimental results

Environment: CUDA 11.6, NVIDIA RTX 3050 Ti graphics card.

        Complex looped arithmetic (N = 10,000)    Adding 100,000,000 elements
CPU     370 ms                                    171 ms
GPU     10.97 ms                                  7.94 ms

7: Summary (optimizing performance)

Necessary conditions for good performance:

(1) The proportion of time spent on data transfer is small.
(2) The arithmetic intensity of the kernel is high.
(3) The number of threads launched by the kernel is large.

Corresponding programming practices:

• Reduce data transfer between the host and the device.
• Increase the arithmetic intensity of kernels.
• Increase the parallelism (the number of threads) of kernels.

8: Expansion

(1) The proportion of data transfer
If the only purpose of a program is to compute the sum of two arrays, using the GPU may well be slower than using the CPU, because far more time is spent transferring data between CPU and GPU than on the computation (the summation) itself. The theoretical peak bandwidth between the GPU's compute cores and device memory is much higher than the bandwidth of transfers between GPU and CPU.
If the task is redesigned to perform not one array addition but 10,000 of them, with data transferred only at the beginning and the end of the program, the proportion of time spent on transfer becomes negligible, and the performance of the whole CUDA program improves greatly.
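A sketch of that structure (my own illustration, reusing the add kernel pattern from Section 3): transfer in once, launch many times, transfer out once.

```cuda
#include <cuda_runtime.h>

__global__ void add(const float* x, const float* y, float* z, int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) z[n] = x[n] + y[n];
}

// Amortize two host<->device transfers over 'reps' kernel launches:
// the transfer cost is paid once, the addition runs on-device every time.
// (Error checking omitted for brevity; helper name is illustrative.)
void add_many_times(const float* h_x, const float* h_y, float* h_z,
                    int N, int reps)
{
    const size_t M = sizeof(float) * N;
    float *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, M); cudaMalloc(&d_y, M); cudaMalloc(&d_z, M);

    cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);  // one transfer in
    cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);

    const int block_size = 128;
    const int grid_size = (N + block_size - 1) / block_size;
    for (int i = 0; i < reps; ++i)                    // no transfers inside
    {
        add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
    }

    cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);  // one transfer out
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
}
```

With reps = 10000, the two bulk copies are a fixed cost shared by every launch, so the transfer fraction of total runtime shrinks toward zero.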

(2) Arithmetic intensity
It is hard to obtain a high speedup for array addition because the arithmetic intensity of the problem is low. The arithmetic intensity of a computational problem is the ratio of the amount of arithmetic work to the amount of necessary memory traffic.
In array addition, for each pair of elements we must first read two values from device memory, perform one addition, and finally write the result back to device memory. That is a single arithmetic operation for two reads and one write, so the arithmetic intensity is low, and in CUDA, reading and writing device memory is expensive (time-consuming).

(3) Parallel scale
The parallel scale can be measured by the total number of threads on the GPU.
From the hardware point of view, a GPU consists of multiple streaming multiprocessors (SMs), each containing a number of CUDA cores, and each SM is relatively independent. From the Kepler architecture to the Volta architecture, at most 2048 threads can reside in one SM; for the Turing architecture this number is 1024. A GPU generally has from a few to several dozen SMs (depending on the model), so a GPU can host tens of thousands to hundreds of thousands of resident threads in total. If a kernel launches far fewer threads than this, it is difficult to obtain a high speedup.
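These per-device numbers can be queried at runtime with cudaGetDeviceProperties; a short sketch (field names are from the CUDA runtime's cudaDeviceProp struct):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the maximum number of resident threads the GPU can host:
// (threads per SM) x (number of SMs).
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    const int resident = prop.maxThreadsPerMultiProcessor *
                         prop.multiProcessorCount;
    printf("%s: %d SMs x %d threads/SM = %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    return 0;
}
```

Comparing the kernel's total launched thread count (grid_size * block_size) against this number gives a quick sanity check on whether the launch is large enough to fill the device.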

Origin blog.csdn.net/qq_40514113/article/details/130902279