[Notes] CUDA Master Class 1-4

1. Basic concepts

Process: an instance of a computer program that is being executed.
Context: a collection of data about a process that allows the processor to suspend or hold its execution and resume it later.
Concurrency: several tasks make progress on one core through context switching, for example with round-robin scheduling; mainly used on single-core processors.
Parallelism: multiple threads executing at the same time.

2. The first CUDA program

2.1 CUDA installation

Check the microarchitecture and compute capability of your GPU, and download the matching version of CUDA.
P.S. There are plenty of tutorials online for this step.

2.2 The first CUDA program

2.2.1 Configure the CUDA template in Visual Studio

You can roughly follow this blog, but I installed CUDA first and Visual Studio afterwards, so the configuration did not succeed. I tried everything short of reinstalling CUDA and nothing worked, so in the end I reinstalled CUDA.

3. The basic steps of CUDA programming

Host code: the code that runs on the CPU.
Device code: the code that runs on the GPU.

3.1 A Simple Demonstration Program

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void hello_Cuda()
{
	printf("hello CUDA world \n");
}

int main()
{
	//hello_Cuda<<<1, 1>>>();
	// Multiple threads execute the same operation
	//hello_Cuda<<<20, 20>>>();
	//hello_Cuda<<<1, 20>>>(); // same output as the line below
	hello_Cuda<<<20, 1>>>();

	// The host does not wait for the kernel to finish

	// Force the host to wait until the kernel has finished, i.e. synchronize
	cudaDeviceSynchronize();

	// Normally the results would be copied back to the host at this point;
	// here we simply reset the device
	cudaDeviceReset();

	return 0;
}
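
Because the launch is asynchronous, a failure in the launch or inside the kernel does not show up at the call site. A minimal sketch of how the demo above could report such failures, assuming a hypothetical helper named `check` (it is not part of the original notes); `cudaGetLastError`, `cudaDeviceSynchronize` and `cudaGetErrorString` are CUDA runtime calls:

```cuda
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

// Hypothetical helper (not from the original notes): prints a message for a
// failed runtime call and returns true only when the status is cudaSuccess.
bool check(cudaError_t status, const char* what)
{
	if (status != cudaSuccess)
	{
		fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
		return false;
	}
	return true;
}

__global__ void hello_Cuda()
{
	printf("hello CUDA world \n");
}

int main()
{
	hello_Cuda<<<20, 1>>>();

	// Launch errors (for example an invalid configuration) are reported here.
	bool ok = check(cudaGetLastError(), "kernel launch");

	// Errors raised while the kernel runs surface when the host synchronizes.
	ok = check(cudaDeviceSynchronize(), "cudaDeviceSynchronize") && ok;

	cudaDeviceReset();
	return ok ? 0 : 1;
}
```

Checking both calls is a design choice: the first catches a bad launch immediately, the second catches problems that only appear once the kernel has actually run.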


3.2 Key concepts

block and grid are the first two parameters of a kernel launch:

  1. grid: the number of blocks
  2. block: the number of threads in each block

	dim3 block(4); // y and z default to 1
	dim3 grid(8);
	hello_Cuda<<<grid, block>>>();

The number of blocks in each dimension can also be computed from the problem size:


int nx, ny;
nx = 16;
ny = 4;
dim3 block(8, 4);
dim3 grid(nx / block.x, ny / block.y);
hello_Cuda<<<grid, block>>>();
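
The integer division `nx / block.x` only covers the whole problem when `nx` is a multiple of `block.x`; otherwise the remainder is dropped and part of the data gets no thread. A minimal sketch of the usual round-up workaround, assuming a hypothetical helper `ceil_div` and a size `nx = 18` picked purely for illustration (neither appears in the original notes):

```cuda
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

// Hypothetical helper: smallest number of blocks of size block_size
// that covers n elements (integer division rounded up).
unsigned int ceil_div(unsigned int n, unsigned int block_size)
{
	return (n + block_size - 1) / block_size;
}

__global__ void hello_Cuda()
{
	printf("hello CUDA world \n");
}

int main()
{
	unsigned int nx = 18; // deliberately not a multiple of block.x
	unsigned int ny = 4;
	dim3 block(8, 4);

	// nx / block.x would give 2 blocks in x (16 threads) and miss 2 columns;
	// rounding up gives 3 blocks (24 threads), so a few threads are surplus.
	dim3 grid(ceil_div(nx, block.x), ceil_div(ny, block.y));
	printf("grid = (%u, %u, %u)\n", grid.x, grid.y, grid.z);

	hello_Cuda<<<grid, block>>>();
	cudaDeviceSynchronize();
	cudaDeviceReset();
	return 0;
}
```

A kernel that indexes into real data would then need a bounds check so that the surplus threads do nothing.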

3.3 Limitations of block and grid

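
The exact limits depend on the device. A minimal sketch that reads them at run time with `cudaGetDeviceProperties`, assuming device 0 is the GPU of interest; the helper `fits_in_block` is hypothetical and not part of the original notes:

```cuda
#include "cuda_runtime.h"

#include <stdio.h>

// Hypothetical helper: true when a block shape respects both the
// per-dimension limits and the total thread limit reported for the device.
bool fits_in_block(const cudaDeviceProp& prop, dim3 block)
{
	unsigned long long total =
		(unsigned long long)block.x * block.y * block.z;
	return block.x <= (unsigned int)prop.maxThreadsDim[0]
		&& block.y <= (unsigned int)prop.maxThreadsDim[1]
		&& block.z <= (unsigned int)prop.maxThreadsDim[2]
		&& total <= (unsigned long long)prop.maxThreadsPerBlock;
}

int main()
{
	cudaDeviceProp prop;
	// Assumption: device 0 is the GPU we want to inspect.
	if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
	{
		fprintf(stderr, "could not query device 0\n");
		return 1;
	}

	printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
	printf("max block dimensions  : %d x %d x %d\n",
		prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
	printf("max grid dimensions   : %d x %d x %d\n",
		prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

	dim3 block(8, 4);
	printf("block(8, 4) fits: %s\n", fits_in_block(prop, block) ? "yes" : "no");
	return 0;
}
```

Note that a block can satisfy every per-dimension limit and still be rejected because the product of its dimensions exceeds the per-block thread limit, which is why the helper checks both.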

4. Thread index

threadIdx is initialized according to the thread's position within its block.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void print_thread_idx()
{
	printf(
		"threadIdx.x : %d , threadIdx.y : %d, threadIdx.z : %d \n",
		threadIdx.x, threadIdx.y, threadIdx.z
	);
}

int main()
{
	int nx, ny;
	nx = 16;
	ny = 16;
	dim3 block(8, 8);
	dim3 grid(nx / block.x, ny / block.y);
	print_thread_idx<<<grid, block>>>();

	cudaDeviceSynchronize();

	cudaDeviceReset();

	return 0;
}
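
Since `threadIdx` restarts at zero in every block, the program above prints the same 64 index triples for each of its four blocks. A minimal sketch of how the built-in `blockIdx` and `blockDim` variables could be combined with `threadIdx` to tell the threads apart; the helper `global_index` and the kernel name `print_global_idx` are hypothetical and not part of the original notes:

```cuda
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

// Hypothetical helper: position of a thread along one dimension of the
// whole grid, given its block index, the block size and its thread index.
__host__ __device__ unsigned int global_index(
	unsigned int block_idx, unsigned int block_dim, unsigned int thread_idx)
{
	return block_idx * block_dim + thread_idx;
}

__global__ void print_global_idx()
{
	unsigned int gx = global_index(blockIdx.x, blockDim.x, threadIdx.x);
	unsigned int gy = global_index(blockIdx.y, blockDim.y, threadIdx.y);
	printf("blockIdx (%u, %u) threadIdx (%u, %u) -> global (%u, %u)\n",
		blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, gx, gy);
}

int main()
{
	int nx = 16;
	int ny = 16;
	dim3 block(8, 8);
	dim3 grid(nx / block.x, ny / block.y);
	print_global_idx<<<grid, block>>>();

	cudaDeviceSynchronize();
	cudaDeviceReset();
	return 0;
}
```

The helper is marked `__host__ __device__` only so that the same arithmetic can also be checked on the CPU; inside a kernel the expression is usually written inline.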

Reposted from: blog.csdn.net/weixin_50862344/article/details/130435837