CUDA: Implementation of Matrix Multiplication (Shared Memory)

This article is part of my study notes from the 2022 "CUDA on Platform" online training camp.

1. Basics of Matrix Multiplication

Matrix multiplication is one of the foundations of linear algebra. Simply put, each element of the result comes from multiplying a row of matrix A with a column of matrix B and summing the products. The CPU-side code can be written by directly simulating this definition, so I will not go into detail here; I am sure you already have matrix multiplication at your fingertips.
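In symbols, with A of size m × n and B of size n × k (the same shape convention the code below uses), every element of the m × k result C is:

C[i][j] = Σ_{h=0..n-1} A[i][h] * B[h][j],    0 ≤ i < m,  0 ≤ j < k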

2. CPU-side implementation of matrix multiplication

void cpu_matrix_mult(int* h_a, int* h_b, int* h_result, int m, int n, int k) {
    // A is m x n, B is n x k, the result is m x k
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            int tmp = 0;
            for (int h = 0; h < n; ++h)
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}

The CPU code simply simulates the definition: the outer two loops traverse the positions of the result matrix, and the innermost loop walks along a row of A and a column of B, multiplying and accumulating. The thing to notice is that if matrices A and B are large enough, this becomes a huge amount of computation, and serial execution on the CPU will come under enormous pressure. So we can bring in parallel-programming ideas and implement it in CUDA. Without further ado, let's start writing the GPU-side code.
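Before moving to the GPU, here is a tiny host-only check of cpu_matrix_mult (my own example, not from the original article; it assumes it is compiled together with the function definition above):

#include <stdio.h>

// defined above
void cpu_matrix_mult(int* h_a, int* h_b, int* h_result, int m, int n, int k);

int main(void)
{
    // A is 2x3, B is 3x2, so the result C is 2x2
    int a[2 * 3] = { 1, 2, 3,
                     4, 5, 6 };
    int b[3 * 2] = { 7,  8,
                     9, 10,
                    11, 12 };
    int c[2 * 2] = { 0 };

    cpu_matrix_mult(a, b, c, 2, 3, 2);

    // Expected result: 58 64 / 139 154
    printf("%d %d\n%d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}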

3. GPU-side implementation of matrix multiplication (Shared Memory)

Considering that the matrices may be large enough, the code in this article directly handles the case where the GPU's shared memory cannot hold them all at once, and solves it with a moving tile.
In the earlier articles of this CUDA study series we already implemented a version that does not use shared memory; its code is as follows:

__global__ void gpu_matrix_mult(int* d_a, int* d_b, int* d_c, int m, int n, int k) {
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < m && col < k) {
        int tmp = 0;  // accumulate locally instead of updating global memory n times
        for (int i = 0; i < n; i++) {
            tmp += d_a[row * n + i] * d_b[col + i * k];
        }
        d_c[row * k + col] = tmp;
    }
}
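For completeness, a launch sketch for this kernel (it uses the same grid/block configuration as the full program in section 4; d_a, d_b, d_c are assumed to be device buffers of the right size and BLOCK_SIZE is defined as in section 4):

// One thread per element of the m x k result matrix
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((k + BLOCK_SIZE - 1) / BLOCK_SIZE,   // grid x covers the k columns of C
             (m + BLOCK_SIZE - 1) / BLOCK_SIZE);  // grid y covers the m rows of C
gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);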

d_a, d_b, and d_c are all arrays that live in global memory, and the code accesses global memory many times during execution. Because global memory latency is relatively high, this greatly reduces execution efficiency, so we introduce shared memory to optimize it. The main idea is that shared memory is written once and read many times, which cuts down the global-memory traffic during execution. First, use the __shared__ qualifier to define two arrays that live in shared memory:

__shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
__shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];

In this article the tile is a square whose side length equals the current block size, BLOCK_SIZE. Inside the kernel each thread plays two roles: (1) copy data from global memory into shared memory, and (2) compute one value of the result matrix. The code moves the tile along, copying one tile at a time and computing as it goes: as shown in the figure below, the tile loaded into smem_m moves in the positive x direction (across A) and the tile loaded into smem_n moves in the positive y direction (down B), with a step size equal to the tile's side length. A sub-concept involved in this movement is the stride, the index of the current tile step. Because n is not necessarily a multiple of BLOCK_SIZE, the loop lets stride run from 0 up to n / BLOCK_SIZE (integer division) inclusive, so the leftover part of the matrices is covered by one final, zero-padded tile.
[Figure: the tile loaded into smem_m slides along the x axis of A, while the tile loaded into smem_n slides down the y axis of B]
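As a quick worked example with the sizes used in section 4 (n = 222, BLOCK_SIZE = 32):

n / BLOCK_SIZE = 222 / 32 = 6   (integer division)
stride = 0, 1, ..., 6           → 7 tile steps in total
tiles 0–5 cover columns 0–191 of A (and rows 0–191 of B); tile 6 covers
columns 192–221, and its last 2 lanes are filled with zeros by the else branches.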

for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
    // Load one element of A's current tile, or 0 if it falls outside the matrix
    int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
    if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
        smem_m[threadIdx.y][threadIdx.x] = a[idm];
    }
    else {
        smem_m[threadIdx.y][threadIdx.x] = 0;
    }
    // Load one element of B's current tile, or 0 if it falls outside the matrix
    int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
    if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
        smem_n[threadIdx.y][threadIdx.x] = b[idn];
    }
    else {
        smem_n[threadIdx.y][threadIdx.x] = 0;
    }
    __syncthreads();  // wait until the whole tile is in shared memory
    // Accumulate the partial dot product contributed by this tile
    for (int i = 0; i < BLOCK_SIZE; i++) {
        tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
    }
    __syncthreads();  // don't overwrite the tile while other threads are still reading it
}
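To make the indexing concrete, the two offsets can be rewritten (same formulas as above, just regrouped):

idm = row * n + (stride * BLOCK_SIZE + threadIdx.x)   →  element A[row][stride * BLOCK_SIZE + threadIdx.x]
idn = (stride * BLOCK_SIZE + threadIdx.y) * k + col   →  element B[stride * BLOCK_SIZE + threadIdx.y][col]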

Since a thread's copy position in A differs from its copy position in B, idm and idn have to be computed separately, and every thread must take part in this collective activity. We use the __syncthreads() function to synchronize the threads of the current block. After the synchronization, the partial result of the current tile step is accumulated into a temporary variable tmp, and only after the tile has performed all of its moves is tmp written back to global memory:

if (row < m && col < k)
{
    c[row * k + col] = tmp;
}

Special attention must be paid to where the boundary checks go. Each thread has both its own computing task and a collective task (copying data from global memory into shared memory); a thread must not skip the collective task just because its own computation falls outside the result matrix. With the above analysis we arrive at the complete GPU kernel optimized with shared memory:

__global__ void gpu_matrix_mult(int* a, int* b, int* c, int m, int n, int k)
{
    __shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int tmp = 0;
    for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
        // Load the current tiles of A and B into shared memory (zero-padded at the edges)
        int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
        if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
            smem_m[threadIdx.y][threadIdx.x] = a[idm];
        }
        else {
            smem_m[threadIdx.y][threadIdx.x] = 0;
        }
        int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
        if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
            smem_n[threadIdx.y][threadIdx.x] = b[idn];
        }
        else {
            smem_n[threadIdx.y][threadIdx.x] = 0;
        }
        __syncthreads();
        // Accumulate the partial dot product for this tile
        for (int i = 0; i < BLOCK_SIZE; i++) {
            tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < m && col < k)
    {
        c[row * k + col] = tmp;
    }
}

4. Code reference


#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

#define BLOCK_SIZE 32

__global__ void gpu_matrix_mult(int* a, int* b, int* c, int m, int n, int k)
{
    __shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int tmp = 0;
    for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
        // Load the current tiles of A and B into shared memory (zero-padded at the edges)
        int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
        if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
            smem_m[threadIdx.y][threadIdx.x] = a[idm];
        }
        else {
            smem_m[threadIdx.y][threadIdx.x] = 0;
        }
        int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
        if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
            smem_n[threadIdx.y][threadIdx.x] = b[idn];
        }
        else {
            smem_n[threadIdx.y][threadIdx.x] = 0;
        }
        __syncthreads();
        // Accumulate the partial dot product for this tile
        for (int i = 0; i < BLOCK_SIZE; i++) {
            tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < m && col < k)
    {
        c[row * k + col] = tmp;
    }
}

void cpu_matrix_mult(int* h_a, int* h_b, int* h_result, int m, int n, int k) {
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            int tmp = 0;
            for (int h = 0; h < n; ++h)
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}

int main(int argc, char const* argv[])
{
    int m = 111;
    int n = 222;
    int k = 333;

    int* h_a, * h_b, * h_c, * h_cc;
    cudaMallocHost((void**)&h_a, sizeof(int) * m * n);
    cudaMallocHost((void**)&h_b, sizeof(int) * n * k);
    cudaMallocHost((void**)&h_c, sizeof(int) * m * k);
    cudaMallocHost((void**)&h_cc, sizeof(int) * m * k);

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            h_a[i * n + j] = rand() % 1024;
        }
    }

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            h_b[i * k + j] = rand() % 1024;
        }
    }

    int* d_a, * d_b, * d_c;
    CHECK(cudaMalloc((void**)&d_a, sizeof(int) * m * n));
    CHECK(cudaMalloc((void**)&d_b, sizeof(int) * n * k));
    CHECK(cudaMalloc((void**)&d_c, sizeof(int) * m * k));

    // copy matrix A and B from host to device memory
    CHECK(cudaMemcpy(d_a, h_a, sizeof(int) * m * n, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, sizeof(int) * n * k, cudaMemcpyHostToDevice));

    unsigned int grid_rows = (m + BLOCK_SIZE - 1) / BLOCK_SIZE;
    unsigned int grid_cols = (k + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 dimGrid(grid_cols, grid_rows);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

    cudaEvent_t cudastart;
    cudaEvent_t cudaend;
    cudaEventCreate(&cudastart);
    cudaEventCreate(&cudaend);

    cudaEventRecord(cudastart);
    cudaEventQuery(cudastart);
    gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
    cudaEventRecord(cudaend);
    cudaEventSynchronize(cudaend);

    float ms;
    cudaEventElapsedTime(&ms, cudastart, cudaend);
    printf("GPU time is %fms\n", ms);

    cudaMemcpy(h_c, d_c, sizeof(int) * m * k, cudaMemcpyDeviceToHost);

    cpu_matrix_mult(h_a, h_b, h_cc, m, n, k);

    // compare the GPU result with the CPU reference (integer results, so compare exactly)
    int ok = 1;
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            if (h_cc[i * k + j] != h_c[i * k + j])
            {
                ok = 0;
            }
        }
    }

    if (ok)
    {
        printf("Pass!!!\n");
    }
    else
    {
        printf("Error!!!\n");
    }

    // free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);
    cudaFreeHost(h_cc);
    return 0;
}

The result of the operation is as follows:
[Screenshot of the program output]

5. Practical experience

1. The role change brought about by __syncthreads()

Compared with the earlier code, the biggest difference in this kernel is that the for loop contains two __syncthreads() calls. Teacher Huan described this, professionally and abstractly, as each thread changing its identity at __syncthreads(): before the barrier, every thread takes part in the collective activity and is responsible for data transfer; after the barrier, every thread is responsible for its own computation. You can see that __syncthreads() has a powerful effect on the threads within the same block.

2. Synchronization in parallel thinking

This code embodies parallel thinking. Because the assignment into shared memory is executed in parallel and each thread runs at its own pace, a thread's computation could, without synchronization, happen before some other thread's write to shared memory, which leads to wrong results. Thread synchronization solves this problem nicely: by making sure every thread in the current block has finished its shared-memory assignment before the computation begins, the problem above is avoided. This shows how important the idea of synchronization is in parallel programming; the cost of missing it is often hard to make up for later while hunting bugs, so we should think it through carefully while writing the code rather than getting stuck in the debugging stage.
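To make the risk concrete, here is a minimal, self-contained example of my own (not from the article): every thread reads a shared-memory slot written by a different thread, so the __syncthreads() between the write and the read is what keeps the result correct.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 8

__global__ void reverse_block(const int* in, int* out)
{
    __shared__ int s[N];
    int t = threadIdx.x;
    s[t] = in[t];           // collective phase: each thread writes one slot
    __syncthreads();        // without this, the read below may race with another thread's write
    out[t] = s[N - 1 - t];  // individual phase: read a slot written by a different thread
}

int main(void)
{
    int h_in[N] = { 0, 1, 2, 3, 4, 5, 6, 7 }, h_out[N];
    int* d_in, * d_out;
    cudaMalloc((void**)&d_in, N * sizeof(int));
    cudaMalloc((void**)&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    reverse_block<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", h_out[i]);  // expected: 7 6 5 4 3 2 1 0
    printf("\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}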

3. Improving hardware utilization

The tile size in this article, BLOCK_SIZE x BLOCK_SIZE, was chosen for ease of demonstration and understanding; in practice it should be set to a more reasonable value. Since every tile move fills a tile-sized piece of shared memory, the tile size has a considerable impact on shared-memory usage and on performance. At the same time, to use the hardware efficiently, the settings of gridDim and blockDim also need to be tuned.
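As a small sketch of the kind of check one might do before picking BLOCK_SIZE (my own example; it only queries the device, it does not tune anything automatically):

#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Shared memory needed by one block of the tiled kernel: two BLOCK_SIZE x BLOCK_SIZE int tiles
    size_t smem_per_block = 2ull * BLOCK_SIZE * BLOCK_SIZE * sizeof(int);

    printf("Device: %s\n", prop.name);
    printf("Tile shared memory per block:      %zu bytes\n", smem_per_block);
    printf("Shared memory available per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Threads per block used: %d (max %d)\n",
           BLOCK_SIZE * BLOCK_SIZE, prop.maxThreadsPerBlock);
    return 0;
}

With BLOCK_SIZE = 32 the two tiles occupy 8 KB of shared memory and a block uses 1024 threads, which is already the per-block thread limit on most current GPUs, so larger tiles would need a different blocking scheme.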


Original article: blog.csdn.net/m0_46197553/article/details/125665881