Writing a first "Hello, World!" program in CUDA C is no different from writing one in standard C.
We call the CPU and the system's memory the host, and the GPU and its memory the device. The plain hello world program runs entirely on the host and does not use any computing device other than the host.
Kernel function call
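#include <stdio.h>
__global__ void kernel(void) {
    // empty kernel: runs on the device and does nothing
}
int main(void) {
    kernel<<<1,1>>>();   // launch 1 block of 1 thread on the device
    printf("Hello, World!\n");
    return 0;
}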
The kernel() function carries the __global__ qualifier, which tells the compiler that the function runs on the device rather than on the host. The main() function is handed over to the host compiler as usual.
Why does the call to kernel() need angle brackets and numeric values?
CUDA C needs some syntax to mark a function as device code, so that host code can be sent to one compiler and device code to another. The CUDA compiler and runtime then take care of invoking the device code from the host code.
So this call actually invokes device code. The values inside the angle brackets are arguments to the runtime system, not to the device code: they tell the runtime how to launch the device code. Arguments meant for the device code itself go inside the parentheses, as in an ordinary C function call.
Passing parameters to a kernel involves nothing special: apart from the angle-bracket syntax, a kernel call looks exactly like any function call in standard C. At runtime, the CUDA runtime handles all the work of getting those parameters from the host to the device.
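To make the two kinds of arguments concrete, here is a minimal sketch, assuming a GPU of compute capability 2.0 or later (needed for printf inside a kernel). The kernel print_sum and the helper launch_print_sum are hypothetical names for illustration. The values in <<<blocks, threads>>> are read by the runtime (number of blocks, then threads per block), while (a, b) are passed to the kernel as ordinary by-value parameters.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel taking ordinary by-value parameters, like a C function.
// Uses device-side printf (compute capability 2.0 and later).
__global__ void print_sum(int a, int b) {
    printf("%d + %d = %d\n", a, b, a + b);
}

// Hypothetical helper: launches print_sum with the given launch
// configuration and returns the CUDA status of the launch.
cudaError_t launch_print_sum(int blocks, int threads, int a, int b) {
    // <<<blocks, threads>>> goes to the runtime; (a, b) goes to the kernel.
    print_sum<<<blocks, threads>>>(a, b);
    cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // wait for the kernel; flushes its printf output
}

int main(void) {
    // One block of one thread: the kernel body runs once.
    cudaError_t err = launch_print_sum(1, 1, 2, 7);
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With launch_print_sum(2, 3, 2, 7) instead, the kernel body would run 2 × 3 = 6 times, once per thread, so the same line would be printed six times.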
For the device to do anything useful, such as computing a value and returning it to the host, we need to allocate memory on it. cudaMalloc() tells the CUDA runtime to allocate memory on the device. Its first parameter is a pointer to the pointer that will hold the address of the newly allocated memory, and its second parameter is the size of the allocation in bytes. It behaves like malloc(), down to handing back an untyped void* address, except that this address is written through the first argument instead of being the function's return value; cudaMalloc() itself returns a cudaError_t status code.
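A minimal sketch contrasting the two allocation calls. compare_allocations is a hypothetical helper, and returning cudaErrorMemoryAllocation when host malloc() fails is a choice made here for brevity, not a CUDA convention.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n ints on the host and on the device,
// showing how each call hands back its pointer.
// Returns cudaSuccess if the device allocation and free both succeed.
cudaError_t compare_allocations(size_t n) {
    // malloc(): the new address IS the return value.
    int *host_buf = (int *)malloc(n * sizeof(int));
    if (host_buf == NULL) return cudaErrorMemoryAllocation;

    // cudaMalloc(): the new address is written through the first
    // argument (a void**); the return value is a status code.
    int *dev_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&dev_buf, n * sizeof(int));
    if (err == cudaSuccess) {
        // dev_buf holds a device address: host code may store or pass it,
        // but must not dereference it (dev_buf[0] = 1 here would be a bug).
        err = cudaFree(dev_buf);
    }
    free(host_buf);
    return err;
}

int main(void) {
    cudaError_t err = compare_allocations(1024);
    printf("device allocation: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```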
dev_c is a host variable: the pointer itself is stored in host memory, but the address it holds points to memory allocated on the device. Inside the kernel, which receives dev_c as its parameter c, the statement *c = a + b adds the parameters a and b and stores the result in the device memory that c points to.
Host code must not dereference the pointer obtained from cudaMalloc(). Host code can pass this pointer around as an argument and perform pointer arithmetic on it, but it cannot use the pointer to read or write memory.
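cudaMemcpy() copies memory much like memcpy() in C, but takes one extra parameter that states which of the pointers is a device pointer. With cudaMemcpyDeviceToHost, the source pointer is a device pointer and the destination pointer is a host pointer (cudaMemcpyHostToDevice and cudaMemcpyDeviceToDevice cover the other directions). Why is the first argument the address of a variable rather than a pointer? If the result should land in an ordinary host int c, the call is written cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost): c is not a pointer, so &c supplies the host address that the destination parameter expects.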
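The pieces above can be put together in a sketch close to the classic add example: the kernel add() performs *c = a + b on the device, and cudaMemcpy() brings the result back into a host int. The helper add_on_device() and its use of -1 as an error value are assumptions for illustration; a real program would report errors separately, since -1 is also a valid sum.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Device code: c points to device memory allocated with cudaMalloc().
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

// Hypothetical helper: computes a + b on the device and copies the result
// back. Returns the sum, or -1 if a CUDA call fails (a simplification).
int add_on_device(int a, int b) {
    int c = 0;            // ordinary host variable
    int *dev_c = NULL;    // host variable holding a device address

    if (cudaMalloc((void **)&dev_c, sizeof(int)) != cudaSuccess) return -1;

    add<<<1, 1>>>(a, b, dev_c);

    // Destination is the host pointer &c, source is the device pointer dev_c.
    // cudaMemcpy waits for the preceding kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c);
    return err == cudaSuccess ? c : -1;
}

int main(void) {
    printf("2 + 7 = %d\n", add_on_device(2, 7));
    return 0;
}
```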
The rules for using device pointers are as follows:
You can pass a pointer allocated by cudaMalloc() to a function executing on the device.
You can read/write memory through a pointer allocated by cudaMalloc() in device code.
You can pass a pointer allocated by cudaMalloc() to a function executing on the host.
You cannot read/write memory through a pointer allocated by cudaMalloc() in host code.
You must use cudaFree() to free memory allocated by cudaMalloc().
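A sketch that exercises each rule in the list: the host allocates a device array, does pointer arithmetic on it, passes the offset pointer to a kernel that writes through it, copies the data back, and frees it. write_via_offset() and set_value() are hypothetical names, and the four-element array size is arbitrary.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Device code may read and write through a cudaMalloc() pointer.
__global__ void set_value(int *p, int value) {
    *p = value;
}

// Hypothetical helper: writes `value` into element `index` of a device array
// and reads the array back. Returns the element read back, or -1 on error.
int write_via_offset(int index, int value) {
    int host[4] = {0};
    const int n = sizeof(host) / sizeof(host[0]);
    int *dev = NULL;

    if (index < 0 || index >= n) return -1;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dev, 0, n * sizeof(int));   // cudaMalloc does not zero memory

    // Allowed on the host: pointer arithmetic and passing the pointer along.
    int *target = dev + index;
    set_value<<<1, 1>>>(target, value);

    // Not allowed on the host: *target = value;  (host cannot dereference)

    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);   // memory from cudaMalloc() is released with cudaFree()
    return err == cudaSuccess ? host[index] : -1;
}

int main(void) {
    printf("element 2 = %d\n", write_via_offset(2, 42));
    return 0;
}
```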
Device query
int count; cudaGetDeviceCount(&count);
Stores the number of CUDA-capable devices in count.
cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0);
Fills prop with the properties of device 0 (devices are numbered from 0).
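A minimal device-query sketch that combines the two calls and prints a few properties of every device rather than only device 0. device_count() is a hypothetical helper; name, major, minor, totalGlobalMem, multiProcessorCount and maxThreadsPerBlock are fields of cudaDeviceProp, which has many more.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of CUDA devices, or 0 if the
// query fails (for example, when no driver is installed).
int device_count(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    return count;
}

int main(void) {
    int count = device_count();
    printf("CUDA devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        printf("Device %d: %s\n", i, prop.name);
        printf("  compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  global memory:      %zu bytes\n", prop.totalGlobalMem);
        printf("  multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  max threads/block:  %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```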