This article is part of my study notes from the 2022 CUDA on Platform online training camp.
GPU-side implementation of matrix multiplication
1. Basics of matrix multiplication
Matrix multiplication is a fundamental operation in linear algebra. Simply put, each element of the result is the dot product of a row of matrix A and a column of matrix B. The CPU-side code is a direct translation of this definition, so I won't explain it in detail here. I trust you are already comfortable with matrix multiplication.
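For reference, here is the definition written out, assuming A is m × n, B is n × k and the result C is m × k, with zero-based indices (the same shapes and index names the code in this article uses):

```latex
C_{ij} = \sum_{h=0}^{n-1} A_{ih}\, B_{hj}, \qquad 0 \le i < m,\quad 0 \le j < k
```

With row-major storage, A_{ih} sits at offset i * n + h and B_{hj} at offset h * k + j.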
2. CPU-side implementation of matrix multiplication
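// h_a is m x n, h_b is n x k, h_result is m x k (all row-major)
void cpu_matrix_mult(int* h_a, int* h_b, int* h_result, int m, int n, int k) {
    for (int i = 0; i < m; ++i)              // row of the result
    {
        for (int j = 0; j < k; ++j)          // column of the result
        {
            int tmp = 0;
            for (int h = 0; h < n; ++h)      // walk row i of A and column j of B
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}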
The CPU-side code follows the definition directly. The two outer loops traverse every position in the result matrix, and the innermost loop walks along a row of A and a column of B, multiplying and summing. If A and B are large enough, this becomes a huge computing task, and serial execution on the CPU comes under enormous pressure. So we turn to parallel programming and implement it with CUDA. Without further ado, let's write the GPU-side code.
3. GPU-side implementation of matrix multiplication (Shared Memory)
Because the matrices may be large, the code in this article directly handles the case where GPU shared memory is too small to hold them, and solves it with a moving tile.
In the previous article, "CUDA学习介绍" (an introduction to learning CUDA), we implemented a version that does not use shared memory. The code is as follows:
__global__ void gpu_matrix_mult(int* d_a, int* d_b, int* d_c, int m, int n, int k) {
    // each thread computes one element of the m x k result
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < m && col < k) {
        int tmp = 0;  // accumulate locally: cudaMalloc does not zero d_c
        for (int i = 0; i < n; i++) {
            tmp += d_a[row * n + i] * d_b[col + i * k];
        }
        d_c[row * k + col] = tmp;
    }
}
d_a, d_b and d_c are all arrays that reside in global memory, and the code accesses global memory many times during execution. Because global memory latency is relatively high, this greatly reduces the efficiency of the code. So we introduce shared memory as an optimization, mainly relying on its write-once, read-many usage pattern to reduce the cost of data transfer during execution. First, use the __shared__ qualifier to define two arrays that reside in shared memory:
__shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
__shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];
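As a rough, back-of-the-envelope check on why staging data in shared memory helps, the host-side sketch below counts how many times the two kernels read A and B from global memory. The function names are hypothetical and not part of the article's code. The count is of read operations only: it ignores caching, and for the tiled kernel it is an upper bound, because slots that fall outside the matrices are filled with 0 without a read.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reads of A and B issued by the kernel without shared
// memory. Each of the m * k output elements reads n values from A and n from B.
std::int64_t naive_global_reads(std::int64_t m, std::int64_t n, std::int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return 2 * m * k * n;
}

// Hypothetical helper: upper bound on reads of A and B for the tiled kernel.
// For every tile position along n that contains data, each thread of each
// block reads at most one element of A and one element of B.
std::int64_t tiled_global_reads_upper_bound(std::int64_t m, std::int64_t n,
                                            std::int64_t k, std::int64_t block) {
    assert(m > 0 && n > 0 && k > 0 && block > 0);
    const std::int64_t grid_rows = (m + block - 1) / block;
    const std::int64_t grid_cols = (k + block - 1) / block;
    const std::int64_t data_tiles = (n + block - 1) / block;
    return grid_rows * grid_cols * data_tiles * 2 * block * block;
}
```

When m, n and k are all multiples of the tile side length, the two counts differ by exactly that side length: each value fetched into shared memory is reused by a whole row or column of threads instead of being fetched again.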
In this article the tile is a square whose side length equals the block size. In the kernel, each thread plays two roles: 1. copy data from global memory into shared memory, 2. compute one value of the result matrix. Our code uses the moving-tile method: it handles one tile at a time, copying and computing as the tile moves. As shown in the figure below, smem_m moves in the positive direction of the x axis, and smem_n moves in the positive direction of the y axis. The step size is the side length of the tile. The moving process involves a step counter (stride in the code), which is the number of steps the tile has moved so far. stride runs from 0 up to n / BLOCK_SIZE (rounded down), inclusive, so the last, partially filled tile is also covered when n is not a multiple of BLOCK_SIZE:
for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
    // copy one element of the current tile of A, or 0 if it is out of range
    int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
    if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
        smem_m[threadIdx.y][threadIdx.x] = a[idm];
    }
    else {
        smem_m[threadIdx.y][threadIdx.x] = 0;
    }
    // copy one element of the current tile of B, or 0 if it is out of range
    int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
    if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
        smem_n[threadIdx.y][threadIdx.x] = b[idn];
    }
    else {
        smem_n[threadIdx.y][threadIdx.x] = 0;
    }
    __syncthreads();  // both tiles are fully written before anyone reads them
    for (int i = 0; i < BLOCK_SIZE; i++) {
        tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
    }
    __syncthreads();  // everyone has finished reading before the tiles are overwritten
}
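The loop bound stride <= n / BLOCK_SIZE deserves a second look. The host-side sketch below compares the number of iterations it produces with a rounded-up division, which is a common alternative. Both function names are hypothetical. The two agree whenever n is not a multiple of the tile side length. When n is an exact multiple, the bound used in the loop runs one extra iteration in which both tiles are filled entirely with 0, which costs a little time but does not change the result.

```cpp
#include <cassert>

// Iterations executed by: for (stride = 0; stride <= n / block; stride++)
int steps_as_written(int n, int block) {
    assert(n > 0 && block > 0);
    return n / block + 1;
}

// Hypothetical alternative: only the tile positions that contain data,
// i.e. n / block rounded up.
int steps_rounded_up(int n, int block) {
    assert(n > 0 && block > 0);
    return (n + block - 1) / block;
}
```

For the n = 222 and BLOCK_SIZE = 32 used later in this article, both give 7 steps.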
Since the position the current thread copies from in A is different from the position it copies from in B, idm and idn have to be computed separately. To make sure that all threads have finished this collective work, we use the __syncthreads() function to synchronize the threads of the current block. After the synchronization, the result computed for the current tile is accumulated into a temporary variable tmp, and tmp is written to global memory once the tile has completed all of its moves:
if (row < m && col < k)
{
    c[row * k + col] = tmp;
}
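The two flat indices in the loop, idm and idn, each pack a (row, column) pair. The host-side sketch below unpacks them, as one way to convince yourself that they address the intended elements. The struct and function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Position of an element inside a row-major matrix.
struct Pos {
    int r;
    int c;
};

// Element of A (m x n) that a thread in result row `row` with threadIdx.x == tx
// copies at step `stride`. Its flat index is stride * block + row * n + tx.
Pos a_source(int row, int tx, int stride, int block) {
    assert(block > 0 && tx >= 0 && tx < block);
    return {row, stride * block + tx};
}

// Element of B (n x k) that a thread in result column `col` with threadIdx.y == ty
// copies at step `stride`. Its flat index is stride * block * k + col + ty * k.
Pos b_source(int col, int ty, int stride, int block) {
    assert(block > 0 && ty >= 0 && ty < block);
    return {stride * block + ty, col};
}

// Row-major flat index of a position in a matrix with `width` columns.
int flat(Pos p, int width) {
    return p.r * width + p.c;
}
```

Seen this way, the thread stays in its own row of A and moves right by one tile per step, and stays in its own column of B and moves down by one tile per step. The range checks in the loop are exactly "column of A < n" and "row of B < n".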
Pay special attention to the conditions. Each thread has its own computing task as well as the current collective task (copying data from global memory into shared memory). A thread must not skip the collective task just because its own computing task is invalid. Putting the analysis above together, we get the complete GPU code of the shared-memory-optimized version:
__global__ void gpu_matrix_mult(int* a, int* b, int* c, int m, int n, int k)
{
    // one tile of A and one tile of B, shared by all threads of the block
    __shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int tmp = 0;
    for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
        // collective task: copy one element of A, or 0 if it is out of range
        int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
        if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
            smem_m[threadIdx.y][threadIdx.x] = a[idm];
        }
        else {
            smem_m[threadIdx.y][threadIdx.x] = 0;
        }
        // collective task: copy one element of B, or 0 if it is out of range
        int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
        if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
            smem_n[threadIdx.y][threadIdx.x] = b[idn];
        }
        else {
            smem_n[threadIdx.y][threadIdx.x] = 0;
        }
        __syncthreads();  // both tiles are fully written before anyone reads them
        // own task: partial dot product over the current tile
        for (int i = 0; i < BLOCK_SIZE; i++) {
            tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
        }
        __syncthreads();  // everyone has finished reading before the tiles are overwritten
    }
    // only threads that map to a real result element write back
    if (row < m && col < k)
    {
        c[row * k + col] = tmp;
    }
}
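If no GPU is at hand, the tiling logic can still be checked on the host. The sketch below is an emulation, not the article's code: loops over blocks and threads stand in for the CUDA grid, and finishing the whole copy loop before the compute loop starts stands in for __syncthreads(). The function names and the flattened tile arrays are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain triple loop, used as the reference result.
std::vector<int> naive_matmul(const std::vector<int>& a, const std::vector<int>& b,
                              int m, int n, int k) {
    std::vector<int> c(static_cast<std::size_t>(m) * k, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            for (int h = 0; h < n; ++h)
                c[i * k + j] += a[i * n + h] * b[h * k + j];
    return c;
}

// Host-side emulation of the tiled kernel with tile side length `block`.
std::vector<int> tiled_matmul_emulated(const std::vector<int>& a, const std::vector<int>& b,
                                       int m, int n, int k, int block) {
    assert(block > 0);
    assert(a.size() == static_cast<std::size_t>(m) * n);
    assert(b.size() == static_cast<std::size_t>(n) * k);
    std::vector<int> c(static_cast<std::size_t>(m) * k, 0);
    std::vector<int> smem_m(block * block), smem_n(block * block), tmp(block * block);
    const int grid_rows = (m + block - 1) / block;
    const int grid_cols = (k + block - 1) / block;
    for (int by = 0; by < grid_rows; ++by) {
        for (int bx = 0; bx < grid_cols; ++bx) {
            tmp.assign(tmp.size(), 0);  // one accumulator per emulated thread
            for (int stride = 0; stride <= n / block; ++stride) {
                // Phase 1 (collective task): every thread copies, even when its
                // own result element lies outside the matrix.
                for (int ty = 0; ty < block; ++ty) {
                    for (int tx = 0; tx < block; ++tx) {
                        const int row = by * block + ty, col = bx * block + tx;
                        const int a_col = stride * block + tx;
                        const int b_row = stride * block + ty;
                        smem_m[ty * block + tx] =
                            (row < m && a_col < n) ? a[row * n + a_col] : 0;
                        smem_n[ty * block + tx] =
                            (col < k && b_row < n) ? b[b_row * k + col] : 0;
                    }
                }
                // Phase 2 (own task): partial dot product over the current tile.
                for (int ty = 0; ty < block; ++ty)
                    for (int tx = 0; tx < block; ++tx)
                        for (int i = 0; i < block; ++i)
                            tmp[ty * block + tx] +=
                                smem_m[ty * block + i] * smem_n[i * block + tx];
            }
            // Guarded write-back, as in the kernel.
            for (int ty = 0; ty < block; ++ty) {
                for (int tx = 0; tx < block; ++tx) {
                    const int row = by * block + ty, col = bx * block + tx;
                    if (row < m && col < k) c[row * k + col] = tmp[ty * block + tx];
                }
            }
        }
    }
    return c;
}
```

Choosing sizes that are not multiples of the tile side length (for example 5 × 7 times 7 × 3 with a side length of 4) exercises the zero-filling and the guarded write-back, which is where an indexing mistake would be most likely to show up.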
4. Code reference
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdlib.h>
#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)
#define BLOCK_SIZE 32
__global__ void gpu_matrix_mult(int* a, int* b, int* c, int m, int n, int k)
{
    // one tile of A and one tile of B, shared by all threads of the block
    __shared__ int smem_m[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int smem_n[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int tmp = 0;
    for (int stride = 0; stride <= n / BLOCK_SIZE; stride++) {
        // collective task: copy one element of A, or 0 if it is out of range
        int idm = stride * BLOCK_SIZE + row * n + threadIdx.x;
        if (row < m && BLOCK_SIZE * stride + threadIdx.x < n) {
            smem_m[threadIdx.y][threadIdx.x] = a[idm];
        }
        else {
            smem_m[threadIdx.y][threadIdx.x] = 0;
        }
        // collective task: copy one element of B, or 0 if it is out of range
        int idn = stride * BLOCK_SIZE * k + col + threadIdx.y * k;
        if (col < k && BLOCK_SIZE * stride + threadIdx.y < n) {
            smem_n[threadIdx.y][threadIdx.x] = b[idn];
        }
        else {
            smem_n[threadIdx.y][threadIdx.x] = 0;
        }
        __syncthreads();  // both tiles are fully written before anyone reads them
        // own task: partial dot product over the current tile
        for (int i = 0; i < BLOCK_SIZE; i++) {
            tmp += smem_m[threadIdx.y][i] * smem_n[i][threadIdx.x];
        }
        __syncthreads();  // everyone has finished reading before the tiles are overwritten
    }
    // only threads that map to a real result element write back
    if (row < m && col < k)
    {
        c[row * k + col] = tmp;
    }
}
// CPU reference used to verify the GPU result
void cpu_matrix_mult(int* h_a, int* h_b, int* h_result, int m, int n, int k) {
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            int tmp = 0;
            for (int h = 0; h < n; ++h)
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}
int main(int argc, char const* argv[])
{
    int m = 111;
    int n = 222;
    int k = 333;
    // pinned host buffers: A (m x n), B (n x k), GPU result, CPU reference
    int* h_a, * h_b, * h_c, * h_cc;
    CHECK(cudaMallocHost((void**)&h_a, sizeof(int) * m * n));
    CHECK(cudaMallocHost((void**)&h_b, sizeof(int) * n * k));
    CHECK(cudaMallocHost((void**)&h_c, sizeof(int) * m * k));
    CHECK(cudaMallocHost((void**)&h_cc, sizeof(int) * m * k));
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            h_a[i * n + j] = rand() % 1024;
        }
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            h_b[i * k + j] = rand() % 1024;
        }
    }
    int* d_a, * d_b, * d_c;
    CHECK(cudaMalloc((void**)&d_a, sizeof(int) * m * n));
    CHECK(cudaMalloc((void**)&d_b, sizeof(int) * n * k));
    CHECK(cudaMalloc((void**)&d_c, sizeof(int) * m * k));
    // copy matrix A and B from host to device memory
    CHECK(cudaMemcpy(d_a, h_a, sizeof(int) * m * n, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, sizeof(int) * n * k, cudaMemcpyHostToDevice));
    // one thread per result element, rounded up to whole blocks
    unsigned int grid_rows = (m + BLOCK_SIZE - 1) / BLOCK_SIZE;
    unsigned int grid_cols = (k + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 dimGrid(grid_cols, grid_rows);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    // time the kernel with CUDA events
    cudaEvent_t cudastart;
    cudaEvent_t cudaend;
    CHECK(cudaEventCreate(&cudastart));
    CHECK(cudaEventCreate(&cudaend));
    CHECK(cudaEventRecord(cudastart));
    cudaEventQuery(cudastart);  // not wrapped in CHECK: may return cudaErrorNotReady
    gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
    CHECK(cudaGetLastError());  // catches launch-configuration errors
    CHECK(cudaEventRecord(cudaend));
    CHECK(cudaEventSynchronize(cudaend));
    float ms;
    CHECK(cudaEventElapsedTime(&ms, cudastart, cudaend));
    printf("GPU time is %fms\n", ms);
    CHECK(cudaMemcpy(h_c, d_c, (sizeof(int) * m * k), cudaMemcpyDeviceToHost));
    // verify against the CPU result; the data are integers, so compare exactly
    cpu_matrix_mult(h_a, h_b, h_cc, m, n, k);
    int ok = 1;
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            if (h_cc[i * k + j] != h_c[i * k + j])
            {
                ok = 0;
            }
        }
    }
    if (ok)
    {
        printf("Pass!!!\n");
    }
    else
    {
        printf("Error!!!\n");
    }
    // free memory
    cudaEventDestroy(cudastart);
    cudaEventDestroy(cudaend);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);
    cudaFreeHost(h_cc);
    return 0;
}
The result of the operation is as follows:
5. Practical experience
1. Role change through __syncthreads()
Compared with the previous code, the biggest difference in this code is that there are two __syncthreads() calls inside the for loop. Teacher huan described this, in professional and abstract terms, as each thread changing its identity at __syncthreads(). Before __syncthreads(), every thread takes part in the collective activity and is responsible for moving data. After the synchronization, every thread is responsible for its own computation. This shows how much influence __syncthreads() has on the threads within the same block.
2. Synchronization in parallel thinking
This code embodies parallel thinking. The assignments to shared memory are executed in parallel, and the threads do not all run at the same speed. Without synchronization, a thread may start computing before the assignments to shared memory have finished, which leads to wrong results. Thread synchronization solves this problem well: by making sure that all threads of the current block have written their values to shared memory before any of them computes, the problem is avoided. The second __syncthreads() at the end of each iteration matters for the same reason in the other direction: it keeps a fast thread from overwriting the tiles for the next step while slower threads are still reading the current ones. This shows how important it is to think about synchronization in parallel programming. A lapse here is hard to make up for later while hunting for the bug, so we must think carefully while writing the code to avoid getting stuck while debugging.
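The effect of a missing barrier can be replayed deterministically on the host. The sketch below is an emulation under simplifying assumptions, not GPU code: a single row of emulated threads each stores one value into a shared array and then sums the whole array. With barrier set to false it replays one particular unlucky ordering, in which each thread stores and reads before the threads after it have stored. Real GPU threads interleave unpredictably, so this is only one of many possible outcomes. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum seen by each emulated thread after one tile step.
// src   : the new tile data, one element per thread
// stale : what the shared array held before this step
// barrier == true  : all threads store, then all threads read
//                    (the ordering that __syncthreads() guarantees)
// barrier == false : thread t stores and reads before threads t+1.. have stored
std::vector<int> emulate_step(const std::vector<int>& src,
                              const std::vector<int>& stale, bool barrier) {
    assert(src.size() == stale.size());
    const std::size_t count = src.size();
    std::vector<int> smem = stale;
    std::vector<int> sums(count, 0);
    if (barrier) {
        for (std::size_t t = 0; t < count; ++t) smem[t] = src[t];
        for (std::size_t t = 0; t < count; ++t)
            for (std::size_t i = 0; i < count; ++i) sums[t] += smem[i];
    } else {
        for (std::size_t t = 0; t < count; ++t) {
            smem[t] = src[t];
            for (std::size_t i = 0; i < count; ++i) sums[t] += smem[i];
        }
    }
    return sums;
}
```

With src = {1, 2, 3, 4} and a zeroed shared array, the barrier version gives every thread the full sum 10, while the replayed ordering without a barrier gives 1, 3, 6 and 10, so only the last thread gets the right answer.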
3. Improve the efficiency of hardware usage
The tile size in this article is set to BLOCK_SIZE*BLOCK_SIZE for the convenience of demonstration and understanding. In practice, a more reasonable value should be chosen. Since every move copies one tile's worth of data, the tile size determines how much shared memory each block uses and therefore has a considerable impact on performance. At the same time, to improve the efficiency of hardware usage, the gridDim and blockDim settings also need to be tuned.
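A small host-side sketch can make the trade-off concrete before any tuning is attempted. It computes, for a candidate tile side length, the grid dimensions, the threads per block and the shared memory taken by the two int tiles. The struct and function names are hypothetical. The numbers it produces still have to be compared against the limits of the actual device, which can be queried with cudaGetDeviceProperties.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of what one choice of tile side length costs.
struct LaunchPlan {
    unsigned grid_x;             // blocks along the k (column) direction
    unsigned grid_y;             // blocks along the m (row) direction
    unsigned threads_per_block;  // block * block
    std::size_t smem_bytes;      // two int tiles of block * block elements each
};

LaunchPlan plan_launch(int m, int k, int block) {
    assert(m > 0 && k > 0 && block > 0);
    LaunchPlan p;
    p.grid_x = static_cast<unsigned>((k + block - 1) / block);
    p.grid_y = static_cast<unsigned>((m + block - 1) / block);
    p.threads_per_block = static_cast<unsigned>(block * block);
    p.smem_bytes = 2 * sizeof(int) * static_cast<std::size_t>(block) * block;
    return p;
}
```

For the 111 × 333 result in this article, a side length of 32 gives an 11 × 4 grid of 1024-thread blocks, and 1024 is the largest block size current NVIDIA GPUs allow, so there is no room to grow the tile further. A side length of 16 gives a 21 × 7 grid of 256-thread blocks and a quarter of the shared memory per block. Which one runs faster depends on the device and has to be measured.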