"The Definitive Guide to cuda C Programming" 02 - Memory Management and Thread Management

 A typical CUDA programming structure consists of 5 main steps.

  1. Allocate GPU memory.
  2. Copy data from CPU memory to GPU memory.
  3. Call the CUDA kernel function to complete the operation specified by the program.
  4. Copy data from GPU memory back to CPU memory.
  5. Free up GPU memory space.
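The skeleton below is a minimal sketch of these five steps using the runtime calls covered later in this post; the kernel name and the buffers h_data, d_data, nBytes, grid, and block are illustrative placeholders.

// 1. allocate GPU memory
float* d_data;
cudaMalloc((void**)&d_data, nBytes);

// 2. copy input from CPU (host) memory to GPU (device) memory
cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);

// 3. launch the kernel that does the actual work
myKernel<<<grid, block>>>(d_data);

// 4. copy the result back from GPU memory to CPU memory
cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);

// 5. free the GPU memory
cudaFree(d_data);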

Let's take a look at how to allocate GPU memory first.

Table of contents

1. Memory management functions

1.1 Allocating memory

1.2 Data copy

2. GPU memory structure

3. Example

3.1 Pure C version (addition on the CPU only)

3.2 CUDA version (addition on the GPU)

3.2.1 Thread Hierarchy

3.2.2 Definition

3.2.3 Synchronization issues

3.2.4 Kernel function

3.2.5 Debugging errors

3.2.6 Complete cuda program


1. Memory management functions

There are four memory management functions, and their purposes correspond one-to-one with the standard C library: cudaMalloc/malloc, cudaMemcpy/memcpy, cudaMemset/memset, and cudaFree/free. The difference is that the C functions allocate, copy, set, and release memory on the CPU, while the CUDA functions perform the same operations on the GPU.

1.1 Allocating memory

cudaError_t cudaMalloc (void** devPtr, size_t size)

Allocates size bytes of linear memory on the device (GPU) and returns a pointer to the allocation in devPtr. The memory is later released with cudaFree.
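A minimal sketch of allocating and later releasing device memory (d_a and nBytes are illustrative names):

size_t nBytes = 1024 * sizeof(float);
float* d_a = NULL;

// allocate nBytes of device (GPU) memory; d_a is only dereferenceable in device code
cudaError_t err = cudaMalloc((void**)&d_a, nBytes);
if (err != cudaSuccess)
	printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

// ... use d_a in kernels ...

cudaFree(d_a);  // step 5: free the GPU memory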

1.2 Data copy

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)

This data copy function transfers count bytes between the host and the device. The transfer direction is specified by kind, which takes one of four values: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.

This function is synchronous: the host application blocks until cudaMemcpy returns, i.e., until the transfer has completed.

The cudaError_t value returned by these functions can be converted into a human-readable error message with:

const char* cudaGetErrorString(cudaError_t error)

This function is analogous to strerror in the C standard library.
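A minimal sketch of a checked host-to-device copy (h_a, d_a, and nBytes are the names used in the example later in this post):

// blocks the host until all nBytes have been copied from h_a (host) to d_a (device)
cudaError_t err = cudaMemcpy(d_a, h_a, nBytes, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
	printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));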

2. GPU memory structure

The two main types of memory on the GPU are global memory and shared memory. Global memory is analogous to system memory on the CPU side, and shared memory is analogous to the CPU cache.
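For orientation, shared memory is declared inside a kernel with the __shared__ qualifier. This is a minimal sketch (the kernel is illustrative and not used in the example below):

__global__ void stageIntoShared(float* in)
{
	__shared__ float tile[256];           // one copy per block, visible to all threads in that block (assumes blockDim.x <= 256)
	tile[threadIdx.x] = in[threadIdx.x];  // stage data from global memory into shared memory
	__syncthreads();                      // wait until every thread in the block has written its element
}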

3. Example

Task: add the numbers in array a to array b element by element and store the result in array c.

3.1 Pure C version (addition on the CPU only)

#include <time.h>
#include <stdlib.h>  // srand

// cpu
void sumArraysOnHost(float* a, float* b, float* c, const int N)
{
	for (int i = 0; i < N; i++)
	{
		c[i] = a[i] + b[i];
	}
}

void initialData(float* p, const int N)
{
	// generate a seed from the current time
	time_t t;
	srand((unsigned int)time(&t));  // seed the random number generator

	for (int i = 0; i < N; i++)
	{
		p[i] = (float)(rand() & 0xFF) / 10.0f;  // random value in [0.0, 25.5]
	}
}

int main(void)
{
	// 1. Allocate host memory
	int nElem = 1024;
	size_t nBytes = nElem * sizeof(float);
	float* h_a, * h_b, * h_c;
	h_a = (float*)malloc(nBytes);
	h_b = (float*)malloc(nBytes);
	h_c = (float*)malloc(nBytes);

	// Initialize the input arrays
	initialData(h_a, nElem);
	initialData(h_b, nElem);

	// 2. Add the arrays directly on the CPU
	sumArraysOnHost(h_a, h_b, h_c, nElem);

	// 3. Free host memory
	free(h_a);
	free(h_b);
	free(h_c);
	
	return 0;
}

3.2 CUDA version (addition on the GPU)

Now put the addition operation on the GPU. The complete, typical CUDA program structure is given below.

3.2.1 Thread Hierarchy

The thread hierarchy: a grid contains many blocks, and a block contains many threads.

All threads spawned by a single kernel launch are collectively referred to as a grid. All threads in the same grid share the same global memory space (the equivalent of system memory). Two built-in coordinate variables identify a thread: blockIdx (the index of the block within the grid) and threadIdx (the index of the thread within its block).

When a kernel executes, the CUDA runtime assigns these coordinate variables blockIdx and threadIdx (automatically generated) to each thread.

3.2.2 Definition

Define the block size, then compute the grid size from the block size and the amount of data. For example, with 6 data elements:

int nElem = 6;

// define a one-dimensional block
dim3 block(3);  // 3 threads per block

// define a one-dimensional grid with enough blocks to cover all elements
dim3 grid((nElem + block.x - 1) / block.x);  // round up: (6 + 3 - 1) / 3 = 2 blocks

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void checkIndex(void) {
	printf("blockIdx: (%d, %d, %d) threadIdx: (%d, %d, %d) \n"
		"gridDim: (%d, %d, %d) blockDim: (%d, %d, %d) \n", 
		blockIdx.x, blockIdx.y, blockIdx.z,
		threadIdx.x, threadIdx.y, threadIdx.z,
		gridDim.x, gridDim.y, gridDim.z,
		blockDim.x, blockDim.y, blockDim.z
		);
}

int main(void)
{
	int nElem = 6;
    // define a one-dimensional block
    dim3 block(3);  // 3 threads per block
    // define a one-dimensional grid with enough blocks to cover all elements
    dim3 grid((nElem + block.x - 1) / block.x);  // (6 + 3 - 1) / 3 = 2

	// check grid and block dimensions from the host side.
	printf("host grid dim:  (%d, %d, %d) \n", grid.x, grid.y, grid.z);
	printf("host block dim: (%d, %d, %d) \n", block.x, block.y, block.z);

	// check grid and block dimensions from the device side.
	checkIndex<<<grid, block>>>();  // launch: <<<grid_dim, block_dim>>>

	// reset the device before leaving
	cudaDeviceReset();

	return 0;
}

When the kernel checkIndex executes, it can be seen that the CUDA runtime assigns the coordinate variables blockIdx and threadIdx to each thread. In the output:

(1) blockDim is (3,1,1) and gridDim is (2,1,1).

(2) blockIdx takes the values (0,0,0) and (1,0,0); the order in which the blocks execute (and therefore print) is not deterministic.

(3) Within every block, threadIdx runs from (0,0,0) -> (1,0,0) -> (2,0,0).

3.2.3 Synchronization issues

(1) Kernel launches are asynchronous with respect to the host thread: after launching a kernel, the host continues executing without waiting for the kernel to finish.

(2) When synchronization is required, you can explicitly force the host to wait until all previously launched device work has finished by calling:

cudaError_t cudaDeviceSynchronize(void);

(3) Some CUDA API calls also perform an implicit synchronization between the host and the device. For example, with cudaMemcpy the host must wait for the copy to complete before it continues executing.
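A minimal sketch of explicit synchronization after a launch (the kernel name and launch configuration are illustrative):

someKernel<<<grid, block>>>(d_a);           // returns immediately; the host does not wait
cudaError_t err = cudaDeviceSynchronize();  // block the host until all preceding device work is done
if (err != cudaSuccess)
	printf("kernel failed: %s\n", cudaGetErrorString(err));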

3.2.4 Kernel function

The kernel function is the code executed on the device side. When a kernel function is invoked, many different CUDA threads execute the same computational task in parallel.

__global__ void kernel_name(argument list);

Notice:

(1) The kernel function must be of void return type;

(2) The function type qualifier (modifier) determines where the function executes and from where it can be called:

__global__ functions execute on the device and are called from the host (these are the kernels launched with <<<...>>>); __device__ functions execute on the device and are called only from device code; __host__ functions execute on the host and are called only from host code. The __device__ and __host__ qualifiers can be used together so that the function is compiled for both the host and the device.
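A minimal sketch of a function compiled for both sides (the function name is illustrative):

__host__ __device__ float addOne(float x)  // usable from host code and from device (kernel) code
{
	return x + 1.0f;
}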

(3) Limitations of CUDA kernel functions

 The following restrictions apply to all kernel functions:

  1. They can only access device memory.
  2. They must have a void return type.
  3. They do not support a variable number of arguments.
  4. They do not support static variables.
  5. They exhibit asynchronous behavior (control returns to the host before the kernel completes).

// host side: plain C
void sumArraysOnHost(float* a, float* b, float* c, const int N)
{
	for (int i = 0; i < N; i++)
		c[i] = a[i] + b[i];
}

// device side: the loop is removed; the built-in thread coordinate variable replaces the array index
__global__ void sumArraysOnDevice(float* a, float* b, float* c)
{
	int i = threadIdx.x;  // for sumArraysOnDevice<<<1, 32>>>(a, b, c);
	//int i = blockIdx.x; // for sumArraysOnDevice<<<32, 1>>>(a, b, c);
	c[i] = a[i] + b[i];
}

// launch configuration: a single block containing 32 threads, executed in parallel
sumArraysOnDevice<<<1, 32>>>(a, b, c);
// force the kernel to run with one block and one thread; this emulates serial execution and helps with debugging and verifying results
sumArraysOnDevice<<<1, 1>>>(a, b, c);

If the grid contains only one block with 32 threads in it, threadIdx.x can be used as the index;

if the grid contains 32 blocks with one thread in each block, blockIdx.x can be used as the index.
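In the general case (several blocks, each containing several threads), the usual global index combines both coordinates. This is the standard pattern, shown here as a sketch rather than as part of the original example:

__global__ void sumArraysGeneral(float* a, float* b, float* c, const int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index across all blocks
	if (i < N)                                      // guard: the last block may contain extra threads
		c[i] = a[i] + b[i];
}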

3.2.5 Debugging errors

#define CHECK(call) \
{ \
	const cudaError_t error = call; \
	if (error != cudaSuccess) \
	{ \
		printf("Error: %s: %d, ", __FILE__, __LINE__); \
		printf("code: %d, reason: %s\n", error, cudaGetErrorString(error)); \
		exit(1); \
	} \
}

CHECK(cudaMemcpy(d_c, gpuRef, nBytes, cudaMemcpyHostToDevice));
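Note that a kernel launch itself does not return a cudaError_t; a common pattern (a sketch, not part of the original code) is to check for launch and execution errors afterwards:

sumArraysOnDevice<<<grid, block>>>(d_a, d_b, d_c);
CHECK(cudaGetLastError());       // catches errors in the launch configuration
CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel was running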

3.2.6 Complete cuda program

#include "cuda_runtime.h"
#include "device_launch_parameters.h"  // threadIdx

#include <stdio.h>   // printf
#include <time.h>    // time_t
#include <stdlib.h>  // rand, srand
#include <memory.h>  // memset
#include <math.h>    // fabs

#define CHECK(call)                                   \
{                                                     \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess)                    \
    {                                                 \
        printf("CUDA Error:\n");                      \
        printf("    File:       %s\n", __FILE__);     \
        printf("    Line:       %d\n", __LINE__);     \
        printf("    Error code: %d\n", error_code);   \
        printf("    Error text: %s\n",                \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
}


void checkResult(float* hostRef, float* deviceRef, const int N)
{
	double eps = 1.0E-8;
	int match = 1;
	for (int i = 0; i < N; i++)
	{
		if (fabs(hostRef[i] - deviceRef[i]) > eps)
		{
			match = 0;
			printf("\nArrays do not match\n");
			printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i], deviceRef[i], i);
			break;
		}
	}
	if (match)
		printf("Arrays match!\n");
}

void initialData(float* p, const int N)
{
	// generate a seed from the current time
	time_t t;
	srand((unsigned int)time(&t));  // seed the random number generator

	for (int i = 0; i < N; i++)
	{
		p[i] = (float)(rand() & 0xFF) / 10.0f;  // random value in [0.0, 25.5]
	}
}


__global__ void checkIndex(void) {
	printf("blockIdx: (%d, %d, %d) threadIdx: (%d, %d, %d) \n"
		"gridDim: (%d, %d, %d) blockDim: (%d, %d, %d) \n", 
		blockIdx.x, blockIdx.y, blockIdx.z,
		threadIdx.x, threadIdx.y, threadIdx.z,
		gridDim.x, gridDim.y, gridDim.z,
		blockDim.x, blockDim.y, blockDim.z
		);
}

// cpu
void sumArraysOnHost(float* a, float* b, float* c, const int N)
{
	for (int i = 0; i < N; i++)
	{
		c[i] = a[i] + b[i];
	}
}

// device side: the loop is removed; each thread handles one element
__global__ void sumArraysOnDevice(float* a, float* b, float* c, const int N)
{
	int i = threadIdx.x;
	if (i < N)  // guard in case more threads than elements are launched
		c[i] = a[i] + b[i];
}


int main(void)
{
	int device = 0;
	cudaSetDevice(device);  // select which GPU to use

	// 1. Allocate memory
	// host memory
	int nElem = 32;
	size_t nBytes = nElem * sizeof(float);
	float* h_a, * h_b, * hostRef, *gpuRef;
	h_a = (float*)malloc(nBytes);
	h_b = (float*)malloc(nBytes);
	hostRef = (float*)malloc(nBytes); // result computed on the host
	gpuRef = (float*)malloc(nBytes);  // result copied back from the device
	// initialize the input arrays
	initialData(h_a, nElem);
	initialData(h_b, nElem);
	memset(hostRef, 0, nBytes);
	memset(gpuRef, 0, nBytes);

	// device memory
	float* d_a, * d_b, * d_c;
	cudaMalloc((float**)&d_a, nBytes);
	cudaMalloc((float**)&d_b, nBytes);
	cudaMalloc((float**)&d_c, nBytes);

	// 2. Transfer data from host to device
	cudaMemcpy(d_a, h_a, nBytes, cudaMemcpyHostToDevice);
	cudaMemcpy(d_b, h_b, nBytes, cudaMemcpyHostToDevice);

	// 3. Launch the device kernel from the host
	dim3 block(nElem);
	dim3 grid(nElem / block.x);
	sumArraysOnDevice<<<grid, block>>>(d_a, d_b, d_c, nElem);

	// 4. Transfer the result from device back to host
	cudaMemcpy(gpuRef, d_c, nBytes, cudaMemcpyDeviceToHost);

	// verify the result against the host computation
	sumArraysOnHost(h_a, h_b, hostRef, nElem);
	checkResult(hostRef, gpuRef, nElem);

	// 5. Free memory
	cudaFree(d_a);
	cudaFree(d_b);
	cudaFree(d_c);

	free(h_a);
	free(h_b);
	free(hostRef);
	free(gpuRef);

	return 0;
}

Source: blog.csdn.net/jizhidexiaoming/article/details/132010214