GPU programming

GPU programming must take both the CPU and the GPU hardware into account. This style of programming is called heterogeneous programming.

Execution begins on the CPU (the host). When the program reaches a section that benefits from massive parallelism, that work is offloaded to the GPU (the device), and the results are then copied back to the CPU for any further computation.

Vector addition on CPU

  • Allocate memory
  • Initialize the two vectors
  • Add element-wise in a loop, or use vectorized addition
  • Free the memory

On the CPU this computation is limited mainly by memory bandwidth rather than arithmetic throughput. For example, the following code adds two vectors of one million elements each.

#include <iostream>

// Add the elements of two arrays on the CPU
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void) {
    int N = 1 << 20; // 1M elements

    float *x = new float[N]; // Allocate memory
    float *y = new float[N];

    // Initialize x and y on the CPU
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run on 1M elements on the CPU
    add(N, x, y);

    // Free memory
    delete[] x;
    delete[] y;
    return 0;
}

Vector addition on GPU

A function executed on the GPU is called a kernel; kernels are launched from CPU (host) code.

  • Allocate memory on the GPU (video memory)
  • Copy the input data from the CPU to the GPU
  • Launch the kernel
  • Wait for the computation to finish
  • Copy the result back to the CPU

Serial vector addition on the GPU

float *x = new float[N];
float *y = new float[N];
int size = N * sizeof(float);
float *d_x, *d_y; // device copies of x and y
cudaMalloc((void **)&d_x, size); // allocate memory on the GPU
cudaMalloc((void **)&d_y, size);
cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice); // copy data from CPU to GPU
cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);
// Run kernel on GPU
add<<<1,1>>>(N, d_x, d_y); // launch the kernel; <<<1,1>>> means a single thread
// Copy result back to host
cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost); // return the result to the CPU
// Free memory
cudaFree(d_x); cudaFree(d_y);
delete [] x; delete [] y;


// GPU function to add two vectors
__global__ // this qualifier marks the following function as a kernel
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
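A practical note not covered above: kernel launches are asynchronous, and CUDA runtime calls return a cudaError_t status. A common convention (a sketch, not part of the original post) wraps every call in a checking macro:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking wrapper (a widely used convention)
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_x, size));
// CUDA_CHECK(cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice));
// add<<<1, 1>>>(N, d_x, d_y);
// CUDA_CHECK(cudaGetLastError());      // catches launch errors
// CUDA_CHECK(cudaDeviceSynchronize()); // catches execution errors
```

Here cudaMemcpy back to the host implicitly synchronizes with the kernel on the default stream, which is why the snippet above works without an explicit wait.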

To improve speed with parallel computing, multiple threads must work on the data simultaneously, which requires rewriting the kernel.

// GPU function to add two vectors
__global__
void add(int n, float *x, float *y) {
    int index = threadIdx.x;  // index of this CUDA thread within the block
    int stride = blockDim.x;  // number of threads in the block
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

add<<<1,256>>>(N, d_x, d_y); // compute with 256 threads in a single block
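A single block is limited to 1024 threads, so covering a million elements efficiently needs many blocks. The usual next step (a sketch following the same pattern, using blockIdx.x and a grid-stride loop) looks like this:

```cuda
// GPU function to add two vectors, parallelized across many blocks
__global__
void add(int n, float *x, float *y) {
    // Global index of this thread across the whole grid
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Launch enough 256-thread blocks to cover all N elements
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, d_x, d_y);
```

The grid-stride loop keeps the kernel correct for any N, even when the grid has fewer threads than elements.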

Origin blog.csdn.net/weixin_44659309/article/details/134368502