Foreword
References:
Gao Sheng's blog
"CUDA C Programming: The Authoritative Guide" (the Chinese edition of "Professional CUDA C Programming")
The official CUDA documentation
"CUDA Programming: Basics and Practice" by Fan Zheyong
Bilibili reference: Monte Carlo plus tree
All the code in this article is available on my GitHub and will be updated gradually.
Articles and explanatory videos are published simultaneously on the WeChat public account "AI Knowledge Story" and on Bilibili: go out to eat three bowls of rice
0: The relationship between CUDA and PyTorch
[Figure: relationship between CUDA and PyTorch; the image source and a more detailed article were linked in the original post]
Convolution calculation
1: Convolution in CUDA
Code overview:
(1) CHECK: an error-checking macro for debugging (wrapping every CUDA runtime call in it is a good habit)
(2) getThreadNum(): queries the device and returns the maximum number of threads per block
(3) conv: the kernel function that computes the convolution
(4) main:
allocate memory on the CPU and define the data: img and kernel (the convolution kernel)
copy the data from the CPU to the GPU
run the computation on the GPU (launch the kernel function conv)
copy the result from the GPU back to the CPU
print img, kernel, and the result
free the memory
#include <stdint.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Number of timed runs (one extra warm-up run is executed first).
const int NUM_REPEATS = 10;

// Wrap a CUDA runtime call: on failure, print the location and error text, then exit.
#define CHECK(call)                                   \
do                                                    \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

// Alternative error handler with the same purpose as CHECK (not used below).
static void HandleError(cudaError_t err,
    const char* file,
    int line)
{
    if (err != cudaSuccess)
    {
        printf("%s in %s at line %d\n",
            cudaGetErrorString(err),
            file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

// Print basic information about device 0 and return its maximum threads per block.
int getThreadNum()
{
    cudaDeviceProp prop;
    int count;
    CHECK(cudaGetDeviceCount(&count));
    printf("gpu num %d\n", count);
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("max thread num: %d\n", prop.maxThreadsPerBlock);
    printf("max grid dimensions: (%d, %d, %d)\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return prop.maxThreadsPerBlock;
}

// One thread per output pixel; pixels outside the image are treated as 0 (zero padding).
__global__ void conv(float* img, float* kernel, float* result,
    int width, int height, int kernelSize)
{
    int ti = threadIdx.x;
    int bi = blockIdx.x;
    int id = (bi * blockDim.x + ti);
    if (id >= width * height)
    {
        return;
    }
    int row = id / width;
    int col = id % width;
    // Accumulate in a local variable: result comes from cudaMalloc and is not
    // zero-initialized, and the kernel is launched several times for timing.
    float sum = 0;
    for (int i = 0; i < kernelSize; ++i)
    {
        for (int j = 0; j < kernelSize; ++j)
        {
            float imgValue = 0;
            int curRow = row - kernelSize / 2 + i;
            int curCol = col - kernelSize / 2 + j;
            if (curRow >= 0 && curCol >= 0 && curRow < height && curCol < width)
            {
                imgValue = img[curRow * width + curCol];
            }
            sum += kernel[i * kernelSize + j] * imgValue;
        }
    }
    result[id] = sum;
}

int main()
{
    // define the image on the host
    int width = 1000;
    int height = 1000;
    float* img = new float[width * height];
    for (int row = 0; row < height; ++row)
    {
        for (int col = 0; col < width; ++col)
        {
            img[col + row * width] = (col + row) % 256;
        }
    }

    // define the convolution kernel on the host: every row is [-1, 0, 1]
    int kernelSize = 3;
    float* kernel = new float[kernelSize * kernelSize];
    for (int i = 0; i < kernelSize * kernelSize; ++i)
    {
        kernel[i] = i % kernelSize - 1;
    }

    // allocate device memory
    float* imgGpu;
    float* kernelGpu;
    float* resultGpu;
    CHECK(cudaMalloc((void**)&imgGpu, width * height * sizeof(float)));
    CHECK(cudaMalloc((void**)&kernelGpu, kernelSize * kernelSize * sizeof(float)));
    CHECK(cudaMalloc((void**)&resultGpu, width * height * sizeof(float)));

    // copy the input data from host to device
    CHECK(cudaMemcpy(imgGpu, img,
        width * height * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(kernelGpu, kernel,
        kernelSize * kernelSize * sizeof(float), cudaMemcpyHostToDevice));

    int threadNum = getThreadNum();
    // Ceiling division: enough blocks to give every pixel its own thread.
    int blockNum = (width * height + threadNum - 1) / threadNum;

    float t_sum = 0;
    float t2_sum = 0;
    // repeat == 0 is a warm-up run and is excluded from the statistics.
    for (int repeat = 0; repeat <= NUM_REPEATS; ++repeat)
    {
        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));
        CHECK(cudaEventRecord(start));
        // Not wrapped in CHECK: it may return cudaErrorNotReady, which is not a failure.
        cudaEventQuery(start);

        conv<<<blockNum, threadNum>>>
            (imgGpu, kernelGpu, resultGpu, width, height, kernelSize);

        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        float elapsed_time;
        CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
        printf("Time = %g ms.\n", elapsed_time);

        if (repeat > 0)
        {
            t_sum += elapsed_time;
            t2_sum += elapsed_time * elapsed_time;
        }

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
    }

    // mean and standard deviation of the timed runs
    const float t_ave = t_sum / NUM_REPEATS;
    const float t_err = sqrt(t2_sum / NUM_REPEATS - t_ave * t_ave);
    printf("Time = %g +- %g ms.\n", t_ave, t_err);

    // copy the result from device to host
    float* result = new float[width * height];
    CHECK(cudaMemcpy(result, resultGpu,
        width * height * sizeof(float), cudaMemcpyDeviceToHost));

    // visualization: print the top-left 10x10 corner of img, the kernel, and the result
    printf("img\n");
    for (int row = 0; row < 10; ++row)
    {
        for (int col = 0; col < 10; ++col)
        {
            printf("%2.0f ", img[col + row * width]);
        }
        printf("\n");
    }
    printf("kernel\n");
    for (int row = 0; row < kernelSize; ++row)
    {
        for (int col = 0; col < kernelSize; ++col)
        {
            printf("%2.0f ", kernel[col + row * kernelSize]);
        }
        printf("\n");
    }
    printf("result\n");
    for (int row = 0; row < 10; ++row)
    {
        for (int col = 0; col < 10; ++col)
        {
            printf("%2.0f ", result[col + row * width]);
        }
        printf("\n");
    }

    // free device and host memory
    CHECK(cudaFree(imgGpu));
    CHECK(cudaFree(kernelGpu));
    CHECK(cudaFree(resultGpu));
    delete[] img;
    delete[] kernel;
    delete[] result;
    return 0;
}
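To check the GPU output, it can help to have a slow CPU reference that follows the same indexing as the conv kernel function. The following is a minimal pure-Python sketch, not the author's code: the names conv_reference and blocks_needed are made up for illustration, and the 4x3 image is only a small test case. Each loop iteration plays the role of one GPU thread, and pixels outside the image count as 0.

```python
def conv_reference(img, kernel, width, height, kernel_size):
    """CPU reference for the conv kernel: one loop iteration per GPU thread."""
    result = [0.0] * (width * height)
    half = kernel_size // 2
    for idx in range(width * height):  # idx plays the role of the global thread id
        row, col = divmod(idx, width)
        acc = 0.0
        for i in range(kernel_size):
            for j in range(kernel_size):
                r = row - half + i
                c = col - half + j
                # Pixels outside the image contribute 0 (zero padding).
                if 0 <= r < height and 0 <= c < width:
                    acc += kernel[i * kernel_size + j] * img[r * width + c]
        result[idx] = acc
    return result


def blocks_needed(total_threads, threads_per_block):
    """Ceiling division: the smallest block count that covers every pixel."""
    return (total_threads + threads_per_block - 1) // threads_per_block


# Same fill rules as main(), on a tiny image so the values can be checked by hand.
width, height, kernel_size = 4, 3, 3
img = [float((col + row) % 256) for row in range(height) for col in range(width)]
kernel = [float(i % kernel_size - 1) for i in range(kernel_size * kernel_size)]
result = conv_reference(img, kernel, width, height, kernel_size)

for row in range(height):
    print(" ".join(f"{result[col + row * width]:4.0f}" for col in range(width)))
print("blocks for 1000x1000 pixels, 1024 threads per block:",
      blocks_needed(1000 * 1000, 1024))
```

Running the same function on the full 1000x1000 image is slow in pure Python, so comparing only a small corner against the GPU result is usually enough.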
Running time:
Computed 1 time
Computed 50 times
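The "Time = ... +- ..." line printed by the timing loop is the mean and the standard deviation of the timed runs, computed from the running sums t_sum and t2_sum. The sketch below reproduces that formula in Python; the sample times are hypothetical values chosen for illustration, not measurements, and mean_and_error is a made-up helper name.

```python
import math
import statistics


def mean_and_error(times_ms):
    """Mean and population standard deviation from the two running sums."""
    n = len(times_ms)
    t_sum = sum(times_ms)
    t2_sum = sum(t * t for t in times_ms)
    t_ave = t_sum / n
    # max(..., 0.0) guards against a tiny negative value caused by rounding.
    t_err = math.sqrt(max(t2_sum / n - t_ave * t_ave, 0.0))
    return t_ave, t_err


# Hypothetical per-run times in milliseconds (assumed, not measured).
sample_times_ms = [1.4, 1.6, 2.2, 1.5, 1.6, 1.7, 1.5, 1.6, 1.4, 1.5]
t_ave, t_err = mean_and_error(sample_times_ms)
print(f"Time = {t_ave:g} +- {t_err:g} ms.")
```

The result agrees with statistics.pstdev, which is a convenient cross-check when porting the timing code.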
2: Convolution in PyTorch
GPU
import time
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

width = 1000
height = 1000
# img = torch.ones([height, width])
img = torch.randn([height, width])
img = img.to(device)
kernel = torch.tensor([[-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0]])
# conv2d expects input of shape (N, C, H, W) and weight of shape (out_channels, in_channels, kH, kW)
img = torch.reshape(img, (1, 1, height, width))
kernel = torch.reshape(kernel, (1, 1, 3, 3))
kernel = kernel.to(device)

# Warm-up run, so that one-time initialization is not included in the timing.
# torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
output = F.conv2d(img, kernel, stride=1)

# CUDA kernels are launched asynchronously, so wait for the GPU before reading the clock;
# otherwise the timer mostly measures the launch overhead.
if device.type == "cuda":
    torch.cuda.synchronize()
# perf_counter() returns seconds; multiply by 1000 to get ms
start = time.perf_counter()
output = F.conv2d(img, kernel, stride=1)
if device.type == "cuda":
    torch.cuda.synchronize()
end = time.perf_counter()

print("start time:", start)
print("end time:", end)
print("total:", end - start)
print("output size:", output.shape)
print("output tensor:", output)
Computed 1 time
Computed 50 times
CPU
import time
import torch
import torch.nn.functional as F

width = 1000
height = 1000
# img = torch.ones([height, width])
img = torch.randn([height, width])
kernel = torch.tensor([[-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0]])
# conv2d expects input of shape (N, C, H, W) and weight of shape (out_channels, in_channels, kH, kW)
img = torch.reshape(img, (1, 1, height, width))
kernel = torch.reshape(kernel, (1, 1, 3, 3))

# torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
# perf_counter() returns seconds; multiply by 1000 to get ms
start = time.perf_counter()
output = F.conv2d(img, kernel, stride=1)
end = time.perf_counter()

print("start time:", start)
print("end time:", end)
print("total:", end - start)
print("output size:", output.shape)
print("output tensor:", output)
Computed 1 time
Computed 50 times
Performance comparison
                 1 run                    50 runs
CUDA             1.4-2.2 ms (≈1.6 ms)     9 ms
PyTorch (CPU)    10 ms                    290 ms
PyTorch (GPU)    0.1 ms                   2.4 ms
7: Summary (performance optimization)
Necessary conditions for good performance:
(1) Data transfer accounts for a small share of the total run time.
(2) The arithmetic intensity of the kernel function is high.
(3) The number of threads launched by the kernel function is large.
Programming techniques:
• Reduce data transfer between host and device.
• Increase the arithmetic intensity of kernel functions.
• Increase the parallelism of kernel functions.
8: Extension
(1) The proportion of data transfer
If a program only computes the sum of two arrays, using the GPU may be slower than using the CPU, because far more time is spent transferring data between the CPU and the GPU than on the computation (the summation) itself. The theoretical peak bandwidth between the GPU's compute cores and device memory is much higher than the bandwidth between the GPU and the CPU.
If the task is instead designed to perform 10,000 array additions, with data transferred only at the beginning and end of the program, the share of time spent on data transfer becomes negligible, and the performance of the CUDA program as a whole improves greatly.
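The effect of repeating the computation can be estimated with a rough bandwidth model. The sketch below is only an illustration: the two bandwidth values (16 GB/s between host and device, 500 GB/s for device memory) are assumed round numbers rather than measurements, and transfer_fraction is a hypothetical helper. It treats each array addition as purely memory-bound.

```python
def transfer_fraction(n_elements, n_additions,
                      host_device_gb_s=16.0,   # assumed host-device bandwidth
                      device_mem_gb_s=500.0,   # assumed device memory bandwidth
                      bytes_per_element=4):
    """Estimated share of total time spent on host-device transfers."""
    # Two input arrays go to the device and one result comes back.
    transfer_bytes = 3 * n_elements * bytes_per_element
    transfer_time = transfer_bytes / (host_device_gb_s * 1e9)
    # Each addition reads two values and writes one value in device memory.
    compute_bytes = n_additions * 3 * n_elements * bytes_per_element
    compute_time = compute_bytes / (device_mem_gb_s * 1e9)
    return transfer_time / (transfer_time + compute_time)


n = 100_000_000  # array length, chosen arbitrarily
for additions in (1, 100, 10_000):
    share = transfer_fraction(n, additions)
    print(f"{additions:>6} additions: transfer share = {share:.2%}")
```

Under these assumptions the transfer share is about 97% for a single addition and about 0.3% for 10,000 additions, which is the trend described above.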
(2) Arithmetic intensity
It is difficult to obtain a high speedup for array addition because the arithmetic intensity of the problem is low. The arithmetic intensity of a computational problem is the ratio of the amount of arithmetic work to the amount of necessary memory access.
For example, to sum one pair of values in array addition, the pair must first be fetched from device memory, then added, and finally the result must be stored back to device memory. The arithmetic intensity is low because only one addition is performed for two loads and one store. In CUDA, reading and writing device memory is expensive (time-consuming).
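The ratio described above can be written down directly. The following sketch assumes 4-byte float values and counts one floating-point operation per addition; arithmetic_intensity is a hypothetical helper, and the second case is an invented kernel used only for comparison.

```python
def arithmetic_intensity(flops, loads, stores, bytes_per_value=4):
    """Arithmetic operations per byte of device memory traffic."""
    return flops / ((loads + stores) * bytes_per_value)


# Array addition: one addition for two loads and one store.
add_intensity = arithmetic_intensity(flops=1, loads=2, stores=1)

# Hypothetical kernel: same memory traffic, but 100 operations per element.
heavy_intensity = arithmetic_intensity(flops=100, loads=2, stores=1)

print(f"array addition:      {add_intensity:.3f} FLOP/byte")
print(f"hypothetical kernel: {heavy_intensity:.3f} FLOP/byte")
```

The higher the value, the more arithmetic the GPU performs per expensive device memory access.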
(3) Parallel scale
The parallel scale can be measured by the total number of threads on the GPU.
From a hardware point of view, a GPU consists of multiple streaming multiprocessors (SMs), and each SM has a number of CUDA cores. The SMs are relatively independent of each other. From the Kepler architecture to the Volta architecture, the maximum number of threads that can reside on one SM is 2048; for the Turing architecture it is 1024. A GPU generally has from a few to several dozen SMs, depending on the model, so a GPU can host tens of thousands to hundreds of thousands of resident threads in total. If the number of threads launched by a kernel function is much smaller than this, it is difficult to obtain a high speedup.
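As a worked example of these numbers, the sketch below multiplies the per-SM limit by an SM count and compares the result with the 1000 x 1000 = 1,000,000 threads needed when one thread handles one pixel of a 1000x1000 image. The per-SM limits come from the paragraph above; the SM counts (80 and 68) are illustrative assumptions, and max_resident_threads is a hypothetical helper.

```python
# Maximum resident threads per SM, as stated in the text.
MAX_THREADS_PER_SM = {
    "Kepler": 2048,
    "Maxwell": 2048,
    "Pascal": 2048,
    "Volta": 2048,
    "Turing": 1024,
}


def max_resident_threads(architecture, sm_count):
    """Upper bound on the number of threads resident on the GPU at once."""
    return MAX_THREADS_PER_SM[architecture] * sm_count


launched = 1000 * 1000  # one thread per pixel of a 1000x1000 image

# SM counts below are assumptions for illustration only.
for arch, sm_count in (("Volta", 80), ("Turing", 68)):
    resident = max_resident_threads(arch, sm_count)
    print(f"{arch}: {resident} resident threads, "
          f"launched/resident = {launched / resident:.1f}")
```

In both cases the launch is several times larger than the resident capacity, so the thread count itself is unlikely to be the limiting factor for an image of this size.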