1. Basic concepts
Process: an instance of a computer program that is being executed.
Context: the data a processor needs about a process (for example register contents and the program counter) so that execution can be suspended and later resumed from the same point.
Concurrency: several tasks make progress by sharing one processor through context switching, for example with round-robin scheduling; mainly used on single-core processors.
Parallelism: several threads execute at the same time on different cores.
2. The first CUDA program
2.1 CUDA installation
Find out the micro-architecture and compute capability of your GPU, then download a CUDA version that supports it.
P.S. There are plenty of tutorials online for this part.
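Once any CUDA toolkit is installed, the compute capability can also be read at run time. The following is a minimal sketch, assuming device 0 is the GPU of interest; `sm_number` is a hypothetical helper introduced here for illustration, not part of the CUDA API.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: turns compute capability major.minor into the
// number used in nvcc flags such as -arch=sm_86.
int sm_number(int major, int minor)
{
    return major * 10 + minor;
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }

    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU we care about.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("Could not query device 0\n");
        return 1;
    }
    printf("Device 0: %s\n", prop.name);
    printf("Compute capability: %d.%d (sm_%d)\n",
           prop.major, prop.minor, sm_number(prop.major, prop.minor));
    return 0;
}
```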
2.2 The first CUDA program
2.2.1 Configure the CUDA template in Visual Studio
You can roughly follow this blog, but I installed CUDA first and Visual Studio afterwards, so the configuration did not succeed. I tried every approach that avoided reinstalling CUDA and they all failed, so in the end I reinstalled CUDA.
3. The basic steps of CUDA programming
Host code: the code that runs on the CPU
Device code: the code that runs on the GPU
3.1 A Simple Demonstration Program
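#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Kernel: device code, executed once by every launched thread
__global__ void hello_Cuda()
{
    printf("hello CUDA world \n");
}

int main()
{
    //hello_Cuda<<<1, 1>>>();
    // Multiple threads execute the same operation
    //hello_Cuda<<<20, 20>>>();
    //hello_Cuda<<<1, 20>>>(); // same output as the launch below
    hello_Cuda<<<20, 1>>>();
    // The host does not wait for the kernel to finish,
    // so force it to wait until the kernel is done, i.e. synchronize
    cudaDeviceSynchronize();
    // Usually the results would now be copied back to the host;
    // here we only reset the device
    cudaDeviceReset();
    return 0;
}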
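The demonstration kernel only prints, so nothing has to be copied back. When a kernel produces data, the usual sequence is to allocate device memory, launch the kernel, and copy the result to the host. The following is a minimal sketch of that sequence; the kernel `fill_index`, the helper `is_index_sequence`, and the array size 20 are assumptions made for illustration and are not part of the original program.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: each thread writes its own index into the output.
__global__ void fill_index(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i;
    }
}

// Hypothetical host-side check: true if a[i] == i for all i < n.
bool is_index_sequence(const int *a, int n)
{
    for (int i = 0; i < n; ++i) {
        if (a[i] != i) return false;
    }
    return true;
}

int main()
{
    const int n = 20;
    int host_out[n];
    int *dev_out = NULL;

    // Step 1: allocate memory on the device
    if (cudaMalloc((void **)&dev_out, n * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // Step 2: launch the kernel (1 block of n threads)
    fill_index<<<1, n>>>(dev_out, n);

    // Step 3: copy the result back; this cudaMemcpy runs after the
    // kernel has finished and blocks the host until the copy is done
    cudaError_t err = cudaMemcpy(host_out, dev_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev_out);
        return 1;
    }

    printf("result is %s\n", is_index_sequence(host_out, n) ? "correct" : "wrong");

    // Step 4: release device memory and reset the device
    cudaFree(dev_out);
    cudaDeviceReset();
    return 0;
}
```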
3.2 Key concepts
Block and grid are the first two parameters of a kernel launch:
- the number of blocks in the grid
- the number of threads in each block
dim3 block(4); // y and z default to 1
dim3 grid(8);
hello_Cuda<<<grid, block>>>();
The number of blocks in each dimension can also be derived from the problem size instead of being fixed:
int nx, ny;
nx = 16; // total threads in x
ny = 4;  // total threads in y
dim3 block(8, 4);
// 16 / 8 = 2 blocks in x, 4 / 4 = 1 block in y
dim3 grid(nx / block.x, ny / block.y);
hello_Cuda<<<grid, block>>>();
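The division `nx / block.x` is exact here because 16 and 4 are multiples of the block dimensions. When they are not, integer division drops the remainder and some elements get no thread. A common pattern, sketched below, rounds the grid size up and guards against out-of-range threads inside the kernel. `ceil_div` and `count_cells` are hypothetical names, and the sizes 18 and 5 are chosen only to show a non-divisible case.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: smallest number of blocks of size b that covers n items.
int ceil_div(int n, int b)
{
    return (n + b - 1) / b;
}

// Hypothetical kernel: counts the threads that fall inside the nx x ny range.
__global__ void count_cells(int nx, int ny, int *counter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard: the rounded-up grid launches extra threads past the edges
    if (x < nx && y < ny) {
        atomicAdd(counter, 1);
    }
}

int main()
{
    int nx = 18, ny = 5; // not multiples of the block dimensions
    dim3 block(8, 4);
    dim3 grid(ceil_div(nx, block.x), ceil_div(ny, block.y));

    int *dev_counter = NULL;
    if (cudaMalloc((void **)&dev_counter, sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dev_counter, 0, sizeof(int));

    count_cells<<<grid, block>>>(nx, ny, dev_counter);

    int in_range = 0;
    cudaMemcpy(&in_range, dev_counter, sizeof(int), cudaMemcpyDeviceToHost);

    printf("grid: %u x %u blocks\n", grid.x, grid.y);
    printf("threads inside the range: %d (expected nx * ny = %d)\n",
           in_range, nx * ny);

    cudaFree(dev_counter);
    cudaDeviceReset();
    return 0;
}
```

With plain `nx / block.x` the same sizes would give a 2 x 1 grid, leaving the last two columns and the last row without threads.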
3.3 Limitations of block and grid
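These limits depend on the device, so one option is to query them at run time rather than hard-code them. The sketch below reads the relevant fields of `cudaDeviceProp` for device 0 (an assumption) and checks a block shape against them; `block_fits` is a hypothetical helper, not a CUDA function.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: true if the block shape respects both the
// per-dimension limits and the total threads-per-block limit.
bool block_fits(dim3 block, const cudaDeviceProp &prop)
{
    unsigned long long total =
        (unsigned long long)block.x * block.y * block.z;
    return block.x <= (unsigned int)prop.maxThreadsDim[0]
        && block.y <= (unsigned int)prop.maxThreadsDim[1]
        && block.z <= (unsigned int)prop.maxThreadsDim[2]
        && total <= (unsigned long long)prop.maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU we care about.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("Could not query device 0\n");
        return 1;
    }

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    dim3 block(8, 4);
    printf("block(8, 4) fits: %s\n", block_fits(block, prop) ? "yes" : "no");
    return 0;
}
```

A launch whose configuration exceeds these limits does not run the kernel; the failure can be observed by calling `cudaGetLastError()` right after the launch.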
4. Thread indexing
threadIdx is initialized according to the thread's position within its block.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Each thread prints its own index within its block
__global__ void print_thread_idx()
{
    printf("threadIdx.x : %d , threadIdx.y : %d, threadIdx.z : %d \n",
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main()
{
    int nx, ny;
    nx = 16;
    ny = 16;
    dim3 block(8, 8);
    // 2 x 2 blocks of 8 x 8 threads each
    dim3 grid(nx / block.x, ny / block.y);
    print_thread_idx<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
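`threadIdx` alone repeats in every block, so the program above prints each index combination once per block. To tell threads apart, it is usually combined with `blockIdx`, `blockDim` and `gridDim`. The following sketch, assuming the same 16 x 16 layout, computes a unique row-major global index per thread; `global_index_2d` is a hypothetical helper, not a CUDA built-in.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: row-major global index of a thread in a 2D launch.
// bx, by: block index; tx, ty: thread index within the block;
// bdx, bdy: block dimensions; gdx: number of blocks in x.
__host__ __device__ int global_index_2d(int bx, int by, int tx, int ty,
                                        int bdx, int bdy, int gdx)
{
    int gx = bx * bdx + tx; // global column
    int gy = by * bdy + ty; // global row
    int width = gdx * bdx;  // total threads per row
    return gy * width + gx;
}

__global__ void print_global_idx()
{
    int gid = global_index_2d(blockIdx.x, blockIdx.y,
                              threadIdx.x, threadIdx.y,
                              blockDim.x, blockDim.y, gridDim.x);
    // Print only the first thread of each block to keep the output short
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        printf("blockIdx (%d, %d): first thread has global index %d\n",
               (int)blockIdx.x, (int)blockIdx.y, gid);
    }
}

int main()
{
    int nx = 16, ny = 16;
    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);
    print_global_idx<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
```

The order in which the blocks print is not guaranteed, since blocks are scheduled independently.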