Getting Started with CUDA: Analysis of Basic Concepts

1. GPU and CUDA

A GPU, or Graphics Processing Unit, is a processor originally designed for image and video processing. It consists of many small cores that handle many simple computations simultaneously. Unlike a CPU, which has a handful of powerful cores, a GPU has thousands of cores, which makes it well suited to parallel computing.

CUDA is a parallel computing platform and programming model introduced by NVIDIA for high-performance computing on GPUs; it is the de facto standard for general-purpose GPU computing.

2. CUDA programming basics

#include <stdio.h>

// Kernel: runs on the GPU; each launched thread prints one message.
__global__ void helloCUDA()
{
    printf("Hello CUDA from GPU!\n");
}

int main()
{
    // Launch the kernel with 1 block containing 1 thread.
    helloCUDA<<<1, 1>>>();

    // Wait for the GPU to finish so the printf output appears before the program exits.
    cudaDeviceSynchronize();
    return 0;
}


2.1 Kernel function

The core concept in CUDA is the kernel: a function that executes on the GPU and processes many data elements in parallel.

Inside a Kernel, special qualifiers and built-in variables are used to access GPU memory, threads, and blocks; these include __global__, __shared__, and __device__. The host launches a Kernel on the GPU with the "<<<...>>>" execution-configuration syntax. In the example above, a Kernel named "helloCUDA" is defined and launched on the GPU with "<<<1,1>>>".
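For illustration only (this snippet is not part of the original example, and the function names are made up), a __device__ helper can be called from a __global__ kernel, which the host then launches with the <<<...>>> syntax:

// Sketch: __device__ and __global__ qualifiers (illustrative names).
__device__ float square(float x)        // runs on the GPU, callable only from GPU code
{
    return x * x;
}

__global__ void squareAll(float *data)  // a kernel: runs on the GPU, launched from the host
{
    data[threadIdx.x] = square(data[threadIdx.x]);
}

// Host-side launch: one block of 64 threads (data must point to at least 64 floats in GPU memory).
// squareAll<<<1, 64>>>(d_data);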

2.2 Thread Grid and Thread Blocks

Each kernel launch creates a thread grid. A thread grid is composed of multiple thread blocks, and a thread block is composed of multiple threads. When writing CUDA programs, it is necessary to be clear about the relationship between thread grids and thread blocks.

The relationship between thread blocks and thread grids is very important. Threads within a thread block can share on-chip shared memory and synchronize with each other, while threads across the whole grid can only share global memory. Reading global memory is slow on a GPU, so using shared memory can significantly improve performance. In CUDA, thread blocks and threads are referred to simply as "Block" and "Thread".

A Block is composed of multiple Threads, and one kernel launch can start multiple Blocks. In the hello example above, a single Block with a single Thread is started using "<<<1,1>>>"; inside a kernel, the built-in variables "blockIdx.x" and "threadIdx.x" give the IDs of the current Block and Thread.
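As a short sketch (not from the original article; the kernel name fillWithIndex is made up), the built-in variables blockIdx, blockDim, and threadIdx are typically combined to give each thread a unique global index; the vector-addition example later in this article uses exactly this pattern:

// Each thread computes its own global index from its block and thread IDs.
__global__ void fillWithIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // the last block may be only partly used
        out[i] = i;
    }
}

// Example launch for n = 1000 with 256 threads per block -> 4 blocks:
// fillWithIndex<<<(1000 + 255) / 256, 256>>>(d_out, 1000);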

3. CUDA memory model

CUDA uses a special memory model to manage GPU memory. In CUDA programs, the following four types of memory can be used:

  • Global memory: the main memory pool on the GPU. It can be accessed by all Blocks and Threads, and data can be transferred between it and the host.
  • Shared memory: memory shared between the threads of a single Block. It can be used to speed up data access (a short sketch of its use appears after the table below).
  • Constant memory: a read-only memory area that can be accessed by all Blocks and Threads, usually used to store constant data.
  • Texture memory: a cached memory path for image-like data, with extra features such as hardware interpolation and addressing modes.
In a CUDA program, the following functions and qualifiers are used to allocate, release, and access GPU memory:
    cudaMalloc(): allocates global memory on the GPU.
    cudaFree(): releases global memory on the GPU.
    cudaMemcpy(): copies data between host and device.
    __global__: qualifier that defines a Kernel to run on the GPU.
    __shared__: qualifier that declares variables in shared memory.
Function / qualifier    Explanation
cudaMalloc()            Allocates global memory on the GPU
cudaFree()              Releases global memory on the GPU
cudaMemcpy()            Copies data between host and device
__global__              Defines a Kernel that runs on the GPU
__shared__              Declares variables in shared memory
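The vector-addition example below demonstrates cudaMalloc(), cudaMemcpy(), and cudaFree(). Shared memory is not used in that example, so here is a separate hedged sketch (the kernel name blockSum and the fixed block size of 256 are assumptions for illustration) of how __shared__ and __syncthreads() might be combined:

// Sketch: each block sums up to 256 elements using shared memory (illustrative only).
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float buf[256];                      // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // read slow global memory only once
    __syncthreads();                                // make every thread's write visible to the block

    if (threadIdx.x == 0) {                         // thread 0 combines the block's values
        float s = 0.0f;
        for (int j = 0; j < blockDim.x; j++) s += buf[j];
        blockSums[blockIdx.x] = s;
    }
}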

4. Example program: vector addition

Now consider a CUDA program for vector addition: two vectors are added on the GPU and the result is stored in a third vector. Here is the program:

#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Global index of this thread across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    // Problem size and host/device pointers.
    int n = 1000;
    float *a, *b, *c;        // host vectors
    float *d_a, *d_b, *d_c;  // device vectors
    int size = n * sizeof(float);

    // Allocate host memory.
    a = (float*)malloc(size);
    b = (float*)malloc(size);
    c = (float*)malloc(size);

    // Allocate device (GPU) memory.
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Initialize the input vectors on the host.
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // Copy the inputs from the host to the device.
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch configuration: enough blocks of 256 threads to cover n elements.
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

    // Copy the result back to the host (this call also waits for the kernel to finish).
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Print the result.
    for (int i = 0; i < n; i++) {
        printf("%f\n", c[i]);
    }

    // Release host and device memory.
    free(a);
    free(b);
    free(c);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

In this program, memory for three vectors (a, b, c) is first allocated on the host, and memory for three vectors (d_a, d_b, d_c) is allocated on the GPU using cudaMalloc(). The vectors a and b on the host are then copied into vectors d_a and d_b on the GPU using the cudaMemcpy() function.

Next, the number of Blocks to launch and the number of Threads per Block are calculated. In this example the block size is 256 and the vector size is 1000, so (1000 + 255) / 256 = 4 blocks are needed; the last block has some idle threads, which is why the kernel checks i < n. The <<<…>>> syntax is then used to launch the Kernel and perform the vector addition on the GPU.

Finally, use the cudaMemcpy() function to copy the vector d_c on the GPU back to the host, and then release all host and GPU memory.
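The program above omits error checking for brevity. As a hedged sketch of common practice (not part of the original program), the return values of the runtime API calls and of the kernel launch can be checked like this:

// Sketch: minimal error checking around an allocation and a kernel launch.
cudaError_t err = cudaMalloc(&d_a, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
}

vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
err = cudaGetLastError();                 // reports launch-configuration errors
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
}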

The Kernel function in this program is vector addition, which is defined as follows:

__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Global index of this thread across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

In this Kernel, each thread computes the sum of one pair of elements. The variable i is the thread's unique global identifier, computed from the block index and the thread index using the CUDA built-in variables blockIdx, blockDim, and threadIdx. If i is less than the vector size n, the thread adds the corresponding elements of a and b and stores the result in c; this check is needed because the last block may contain more threads than there are remaining elements.

Note that the program uses the <<<…>>> syntax when launching the Kernel; it specifies how many Blocks to launch and how many Threads each Block contains.

Additionally, the program uses the cudaMemcpy() function, which copies data between the host and the GPU. This function takes four parameters: the destination pointer, the source pointer, the number of bytes to copy, and the copy direction (from host to GPU or from GPU to host).
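For reference, the two copies in the example above follow that order, destination first, with the direction given by the cudaMemcpyKind constants:

// Destination pointer first, then source pointer, byte count, and direction.
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);   // device -> host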

5. Summary

This article briefly introduced the basic concepts and programming model of CUDA, including kernel functions, thread grids and thread blocks, and memory management. A simple CUDA program that adds two vectors was also presented.

Origin blog.csdn.net/Algabeno/article/details/129050172