CUDA (II): Getting Started with CUDA C

Copyright notice: this is an original article by the author; reposting without permission is not allowed. https://blog.csdn.net/qq_24990189/article/details/89516490

 

Goals:

Write a first piece of CUDA C code

Understand the difference between code written for the host and code written for the device

Learn how to launch device code from the host

Learn how to use device memory on a CUDA-capable device

Learn how to query the information of the CUDA-capable devices in the system

Contents

1. CUDA C vs. standard C: basic concepts

2. Distinguishing host code from device code

3. Passing parameters to device code

4. Allocating and freeing memory on the device

cudaMalloc()

cudaFree()

5. Accessing device memory

Method 1: use a device pointer in device code

Method 2: call cudaMemcpy()

6. Querying device properties

cudaGetDevice

Data types: cudaDeviceProp

cudaGetDeviceProperties

7. Using device properties

cudaChooseDevice

cudaSetDevice

8. Summary

 

1. CUDA C vs. standard C: basic concepts

Host: the CPU and the system's memory.

Device: the GPU and its memory.

Kernel: a function executed on the GPU device.

Host code: compiled by the host compiler (GNU gcc on Linux, Microsoft Visual C on Windows).

Device code: compiled by NVIDIA's nvcc compiler.

2. Distinguishing host code from device code

#include <stdio.h>

// __global__ tells the compiler that this is device code: the function runs on the
// device, and the parentheses hold the parameters passed to the device code.
__global__ void kernelFun(void){
}

// The parameters inside the <<<>>> angle brackets are not arguments to the device
// code; they tell the runtime how to launch the device code (detailed later).
int main(){
    kernelFun<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

(1) CUDA C uses this qualifier syntax to mark a function as "device code", so that host code is sent to one compiler while device code is sent to another.

(2) A strength of CUDA C is its language-level integration with C: a call into device code looks very much like a call into host code.

What actually happens behind this function call is discussed in detail later.

In short, the CUDA C compiler and runtime take care of invoking device code from host code.

3. Passing parameters to device code

Key points:

1. Parameters can be passed to a kernel just as they are passed to a C function.

The runtime system handles all of the complexity of getting the parameters from the host to the device.

2. For the device to do any useful work, memory must be allocated on it.
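As a concrete sketch of both points, here is the classic single-thread integer-add example, widely used in CUDA tutorials (it relies on cudaMalloc() and cudaMemcpy(), which sections 4 and 5 describe):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

// Kernel parameters are declared exactly like ordinary C function parameters.
// c must point to device memory (see section 4).
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;

    cudaMalloc((void**)&dev_c, sizeof(int));  // result slot on the device
    add<<<1, 1>>>(2, 7, dev_c);               // arguments go in the parentheses, as usual
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
```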

4. Allocating and freeing memory on the device

cudaMalloc()

Reference: "CUDA_Runtime_API.pdf"

__host__ __device__ cudaError_t cudaMalloc(void **devPtr, size_t size)
Allocate memory on the device.
Parameters
    devPtr
        - Pointer to allocated device memory
    size
        - Requested allocation size in bytes
Returns
    cudaSuccess, cudaErrorMemoryAllocation
Description
    Allocates size bytes of linear memory on the device and returns in *devPtr a pointer
    to the allocated memory. The allocated memory is suitably aligned for any kind of
    variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation
    in case of failure.
    The device version of cudaFree cannot be used with a *devPtr allocated using the host
    API, and vice versa.
See also:
    cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMalloc3D,
    cudaMalloc3DArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc

Purpose: tells the CUDA runtime to allocate memory on the device.

Parameters: devPtr is a pointer to the variable that will hold the address of the newly allocated memory; size is the requested allocation size in bytes.

Return type: cudaError_t (cudaSuccess on success, cudaErrorMemoryAllocation on failure).

Notes:

     (1) A pointer allocated by cudaMalloc() can be passed to a kernel executing on the device.

     (2) Device code can use a pointer allocated by cudaMalloc() to read from and write to memory.

     (3) A pointer allocated by cudaMalloc() can be passed to functions executing on the host.

     (4) Host code must not dereference a pointer returned by cudaMalloc() to read or write memory.

cudaFree()

__host__ __device__ cudaError_t cudaFree(void *devPtr)
Frees memory on the device.
Parameters
devPtr
    - Device pointer to memory to free
Returns
    cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorInitializationError
Description
    Frees the memory space pointed to by devPtr, which must have been returned by a
previous call to cudaMalloc() or cudaMallocPitch(). Otherwise, or if cudaFree(devPtr)
has already been called before, an error is returned. If devPtr is 0, no operation is
performed. cudaFree() returns cudaErrorInvalidDevicePointer in case of failure.
The device version of cudaFree cannot be used with a *devPtr allocated using the host
API, and vice versa.

Note that this function may also return error codes from previous, asynchronous
launches.

See also:
    cudaMalloc, cudaMallocPitch, cudaMallocArray, cudaFreeArray, cudaMallocHost ( C
API), cudaFreeHost, cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc
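Putting the two calls together, here is a minimal allocate/free sketch that checks the returned cudaError_t (the buffer size and variable name are arbitrary choices for illustration):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

int main(void) {
    int *dev_buf = NULL;

    // Request 1024 ints of linear device memory and check the status code.
    cudaError_t err = cudaMalloc((void**)&dev_buf, 1024 * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... hand dev_buf to kernels here; never dereference it in host code ...

    cudaFree(dev_buf);  // release the allocation when it is no longer needed
    return 0;
}
```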

5. Accessing device memory

Method 1: use a device pointer in device code

Within device code, device pointers are used in exactly the same way as pointers in standard C.

In short: host pointers may only access memory from host code, and device pointers may only access memory from device code.

Method 2: call cudaMemcpy()

cudaMemcpy():

The cudaMemcpyKind parameter specifies whether device memory is the source or the destination of the copy.

__host__ cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)
Copies data between host and device.
Parameters
    dst
        - Destination memory address
    src
        - Source memory address
    count
        - Size in bytes to copy
    kind
        - Type of transfer
Returns
    cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer,
    cudaErrorInvalidMemcpyDirection
Description
    Copies count bytes from the memory area pointed to by src to the memory area
pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost,
cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault
is recommended, in which case the type of transfer is inferred from the pointer values.
However, cudaMemcpyDefault is only allowed on systems that support unified virtual
addressing. Calling cudaMemcpy() with dst and src pointers that do not match the
direction of the copy results in an undefined behavior.

‣ Note that this function may also return error codes from previous, asynchronous
launches.
‣ This function exhibits synchronous behavior for most use cases.

See also:
cudaMemcpy2D, cudaMemcpyToArray, cudaMemcpy2DToArray,
cudaMemcpyFromArray, cudaMemcpy2DFromArray, cudaMemcpyArrayToArray,
cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol,
cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpyToArrayAsync,
cudaMemcpy2DToArrayAsync, cudaMemcpyFromArrayAsync,
cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync,
cudaMemcpyFromSymbolAsync
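A small round-trip sketch: the same cudaMemcpy() call copies in either direction, and the kind argument tells the runtime which pointer is the source:

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

int main(void) {
    int host_in[4] = {1, 2, 3, 4};
    int host_out[4] = {0};
    int *dev_buf;

    cudaMalloc((void**)&dev_buf, sizeof(host_in));

    // Host -> Device: src is the host array, dst is device memory.
    cudaMemcpy(dev_buf, host_in, sizeof(host_in), cudaMemcpyHostToDevice);

    // Device -> Host: same call, opposite direction.
    cudaMemcpy(host_out, dev_buf, sizeof(host_in), cudaMemcpyDeviceToHost);

    printf("round trip: %d %d %d %d\n",
           host_out[0], host_out[1], host_out[2], host_out[3]);

    cudaFree(dev_buf);
    return 0;
}
```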

6. Querying device properties

cudaGetDevice

Returns which device the calling host thread is currently using. (To find out how many CUDA-capable devices the system contains, call cudaGetDeviceCount() instead.)

__host__ __device__ cudaError_t cudaGetDevice(int *device)
Returns which device is currently being used.
Parameters
    device
        - Returns the device on which the active host thread executes the device code.
Returns
    cudaSuccess
Description
    Returns in *device the current device for the calling host thread.

Note that this function may also return error codes from previous, asynchronous
launches.

See also:
cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice
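The distinction is easy to see side by side: cudaGetDeviceCount() reports how many CUDA-capable devices exist, while cudaGetDevice() reports which one the calling host thread is currently using. A minimal sketch:

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

int main(void) {
    int count = 0;
    int current = -1;

    cudaGetDeviceCount(&count);  // number of CUDA-capable devices in the system
    cudaGetDevice(&current);     // device the calling host thread currently uses

    printf("%d CUDA device(s); current device is %d\n", count, current);
    return 0;
}
```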

Data types: cudaDeviceProp

cudaDeviceProp is one of the data types used by the CUDA Runtime; it holds a device's properties. See "CUDA_Runtime_API.pdf" for the full reference.

5.3. cudaDeviceProp Struct Reference
CUDA device properties
int cudaDeviceProp::asyncEngineCount
    Number of asynchronous engines
int cudaDeviceProp::canMapHostMemory
    Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer
int cudaDeviceProp::clockRate
    Clock frequency in kilohertz

int cudaDeviceProp::computeMode
    Compute mode (See cudaComputeMode)
int cudaDeviceProp::concurrentKernels
    Device can possibly execute multiple kernels concurrently
int cudaDeviceProp::concurrentManagedAccess
    Device can coherently access managed memory concurrently with the CPU
int cudaDeviceProp::deviceOverlap
    Device can concurrently copy memory and execute a kernel. Deprecated. Use instead
asyncEngineCount.

cudaGetDeviceProperties

Use device properties as you come to need them; there is no need to understand every field at once.

__host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Returns information about the compute-device.
Parameters
prop
    - Properties for the specified device
device
    - Device number to get properties for
Returns
    cudaSuccess, cudaErrorInvalidDevice
Description
Returns in *prop the properties of device dev. The cudaDeviceProp structure is defined
as:
struct cudaDeviceProp {
    char name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    int clockRate;
    size_t totalConstMem;
    int major;
    int minor;
    size_t textureAlignment;
    size_t texturePitchAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int maxTexture1D;
    int maxTexture1DMipmap;
    int maxTexture1DLinear;
    int maxTexture2D[2];
    int maxTexture2DMipmap[2];
    int maxTexture2DLinear[3];
    int maxTexture2DGather[2];
    int maxTexture3D[3];
    int maxTexture3DAlt[3];
    int maxTextureCubemap;
    int maxTexture1DLayered[2];
    int maxTexture2DLayered[3];
    int maxTextureCubemapLayered[2];
    int maxSurface1D;
    int maxSurface2D[2];
    int maxSurface3D[3];
    int maxSurface1DLayered[2];
    int maxSurface2DLayered[3];
    int maxSurfaceCubemap;
    int maxSurfaceCubemapLayered[2];
    size_t surfaceAlignment;
    int concurrentKernels;
    int ECCEnabled;
    int pciBusID;
    int pciDeviceID;
    int pciDomainID;
    int tccDriver;
    int asyncEngineCount;
    int unifiedAddressing;
    int memoryClockRate;
    int memoryBusWidth;
    int l2CacheSize;
    int maxThreadsPerMultiProcessor;
    int streamPrioritiesSupported;
    int globalL1CacheSupported;
    int localL1CacheSupported;
    size_t sharedMemPerMultiprocessor;
    int regsPerMultiprocessor;
    int managedMemSupported;
    int isMultiGpuBoard;
    int multiGpuBoardGroupID;
    int singleToDoublePrecisionPerfRatio;
    int pageableMemoryAccess;
    int concurrentManagedAccess;
}


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>


int main(void) {
	cudaDeviceProp prop;  // filled in with each device's properties below

	int count;
	cudaGetDeviceCount(&count);  // number of CUDA-capable devices
	for (int i = 0; i < count; i++) {
		cudaGetDeviceProperties(&prop, i);  // fetch the properties of device i
		printf("   --- General Information for device %d ---\n", i);
		printf("Name:  %s\n", prop.name);
		printf("Compute capability:  %d.%d\n", prop.major, prop.minor);
		printf("Clock rate:  %d\n", prop.clockRate);
		printf("Device copy overlap:  ");
		if (prop.deviceOverlap)
			printf("Enabled\n");
		else
			printf("Disabled\n");
		printf("Kernel execution timeout :  ");
		if (prop.kernelExecTimeoutEnabled)
			printf("Enabled\n");
		else
			printf("Disabled\n");

		printf("   --- Memory Information for device %d ---\n", i);
		printf("Total global mem:  %zu\n", prop.totalGlobalMem);
		printf("Total constant Mem:  %zu\n", prop.totalConstMem);
		printf("Max mem pitch:  %zu\n", prop.memPitch);
		printf("Texture Alignment:  %zu\n", prop.textureAlignment);

		printf("   --- MP Information for device %d ---\n", i);
		printf("Multiprocessor count:  %d\n",
			prop.multiProcessorCount);
		printf("Shared mem per block:  %zu\n", prop.sharedMemPerBlock);
		printf("Registers per block:  %d\n", prop.regsPerBlock);
		printf("Threads in warp:  %d\n", prop.warpSize);
		printf("Max threads per block:  %d\n",
			prop.maxThreadsPerBlock);
		printf("Max thread dimensions:  (%d, %d, %d)\n",
			prop.maxThreadsDim[0], prop.maxThreadsDim[1],
			prop.maxThreadsDim[2]);
		printf("Max grid dimensions:  (%d, %d, %d)\n",
			prop.maxGridSize[0], prop.maxGridSize[1],
			prop.maxGridSize[2]);
		printf("\n");
	}
	return 0;
}

Output:

7. Using device properties

As shown above, cudaGetDeviceProperties() can be used to query all kinds of properties.

Problem: on a multi-GPU platform, iterating over every device by hand is somewhat tedious, so the CUDA runtime provides an automatic way to perform this iteration.

NVIDIA's SLI (Scalable Link Interface) technology allows multiple independent GPUs to be installed side by side.

If an application depends on certain GPU properties, or needs to run on the fastest GPU in the system, you should get to know this API: the CUDA runtime itself does not guarantee that the optimal or most suitable GPU is chosen for the application.

cudaChooseDevice

__host__ cudaError_t cudaChooseDevice(int *device, const cudaDeviceProp *prop)

Select compute-device which best matches criteria.

Parameters
    device
        - Device with best match
    prop
        - Desired device properties
Returns
    cudaSuccess, cudaErrorInvalidValue
Description
    Returns in *device the device which has properties that best match *prop.

Note that this function may also return error codes from previous, asynchronous
launches.

See also:
cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties

cudaSetDevice

__host__ cudaError_t cudaSetDevice(int device)
Set device to be used for GPU executions.

Parameters
    device
        - Device on which the active host thread should execute the device code.
Returns
    cudaSuccess, cudaErrorInvalidDevice, cudaErrorDeviceAlreadyInUse
Description
    Sets device as the current device for the calling host thread. Valid device IDs range from 0 to (cudaGetDeviceCount() - 1).

    Any device memory subsequently allocated from this host thread using cudaMalloc(),
cudaMallocPitch() or cudaMallocArray() will be physically resident on device.
Any host memory allocated from this host thread using cudaMallocHost() or
cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with
device. Any streams or events created from this host thread will be associated with
device. Any kernels launched from this host thread using the <<<>>> operator or
cudaLaunchKernel() will be executed on device.

   This call may be made from any host thread, to any device, and at any time. This
function will do no synchronization with the previous or new device, and should be
considered a very low overhead call.

Note that this function may also return error codes from previous, asynchronous
launches.

See also:
cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaChooseDevice

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <string.h>   // for memset()


int main(void) {
	cudaDeviceProp prop;
	int dev;

	cudaGetDevice(&dev);
	printf("ID of current CUDA device:  %d\n", dev);

	memset(&prop, 0, sizeof(cudaDeviceProp));  // start from an empty property struct, then set only the fields we care about
	prop.major = 1;
	prop.minor = 3;
	cudaChooseDevice(&dev, &prop); // the runtime returns the ID of the device that best matches these properties
	printf("ID of CUDA device closest to revision 1.3:  %d\n", dev);

	cudaSetDevice(dev); // all subsequent device operations run on this device
	return 0;
}

8. Summary:

  • Qualifiers let us specify which code runs on the device and which runs on the host.
  • The __global__ keyword tells the compiler that the function should run on the GPU.
  • Dedicated APIs manage device memory: cudaMalloc(), cudaMemcpy(), and cudaFree() respectively allocate device memory, copy data between device and host, and free device memory.

 
