Detailed explanation of CUDA programming interface

This article takes a detailed look at the core concepts in Chapter 3 (Programming Interface) of the NVIDIA CUDA Programming Guide: the NVCC compiler, the CUDA runtime, version management and compatibility, compute modes, mode switching, and Tesla Compute Cluster (TCC) mode on Windows.

1. Compilation with NVCC

NVCC (NVIDIA CUDA Compiler) is the compiler used to build CUDA C/C++ code. To compile CUDA code with NVCC, give the CUDA source file the .cu extension and compile it with a command like the following:

nvcc -o my_program my_program.cu

In this section, we will detail the usage and options of the NVCC compiler. The following is an overview of the contents of this section:

  • Introduction to NVCC Compiler
  • Common Compilation Options
  • Example: Compile a simple CUDA program
  • Example: Compile CUDA program using Makefile

1.1. Introduction to NVCC Compiler

NVCC is one of the core components of the CUDA platform; it converts CUDA C/C++ code into binary code the GPU can execute. NVCC works in two stages: first, it separates the device code and the host code in a CUDA source file; then it compiles the device code to PTX (NVIDIA's intermediate GPU assembly language) and/or binary cubin objects, and hands the host code to a host C/C++ compiler (such as GCC or Clang).

1.2. Common compilation options

NVCC supports many compilation options; the most common ones are listed below, followed by a combined usage example:

  • -arch=sm_XX: Set the target architecture, for example, -arch=sm_35 to compile a program for compute capability 3.5;
  • -c: Compile only, do not link;
  • -o: Set the output file name;
  • -I: Set the header file search path;
  • -L: Set the library file search path;
  • -l: Link library files.
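
For example, to compile a source file against headers and libraries in custom locations and then link it (the file names, paths, and the cuBLAS library here are only placeholders for illustration):

nvcc -arch=sm_70 -I./include -c kernel.cu -o kernel.o
nvcc -arch=sm_70 -L/usr/local/cuda/lib64 -lcublas -o app kernel.o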

1.3. Example: compile a simple CUDA program

Suppose we have a simple CUDA program, vector_add.cu, with the following contents:

#include <iostream>
#include <cuda_runtime.h>

__global__ void vector_add(const float *A, const float *B, float *C, int N) {
    // Each thread computes one element of the output vector.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    // ...host code omitted...
}
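
The omitted host code typically allocates device buffers, copies the inputs to the GPU, launches the kernel, and copies the result back. A minimal sketch of what main might contain (error checking omitted for brevity):

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Host buffers with sample data.
    float *hA = new float[N], *hB = new float[N], *hC = new float[N];
    for (int i = 0; i < N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device buffers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // One thread per element, rounded up to whole blocks.
    const int threadsPerBlock = 256;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocks, threadsPerBlock>>>(dA, dB, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    std::cout << "C[0] = " << hC[0] << std::endl;

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}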

To compile this program using NVCC, we can execute the following command:

nvcc -o vector_add vector_add.cu

After successful compilation, we can run the resulting executable vector_add.

1.4. Example: Compile CUDA program using Makefile

In actual projects, we usually need to compile multiple source files and link multiple library files. In order to simplify the compilation process, we can use Makefile to manage compilation options and dependencies. Here is a simple Makefile example:

# Set the NVCC compiler and compilation options
NVCC = nvcc
CFLAGS = -arch=sm_35 -I/usr/local/cuda/include

# Set the target and source files
TARGET = vector_add
SOURCES = vector_add.cu

# Build rules
all: $(TARGET)

$(TARGET): $(SOURCES)
	$(NVCC) $(CFLAGS) -o $@ $^

clean:
	rm -f $(TARGET)

With this Makefile, we can simply run the make command to build the CUDA program.

2. CUDA Runtime

The CUDA runtime is a high-level API for managing devices, memory, and execution. It consists of the following parts:

  • Device management: querying devices, selecting a device, retrieving device properties, and so on;
  • Memory management: allocating and freeing device memory, transferring data, and so on;
  • Execution management: launching kernels, synchronization, stream management, and so on.

In this section, we will introduce the usage of the CUDA runtime API and points to watch out for. The following is an overview of the contents of this section:

  • Device Management API
  • Memory Management API
  • Execution Management API

2.1. Device Management API

The device management API is mainly used to query and configure CUDA devices. Some commonly used device management APIs are listed below, with a combined sketch after the list:

  • cudaGetDeviceCount(int *count): Get the number of available CUDA devices;
  • cudaGetDeviceProperties(cudaDeviceProp *prop, int device): Get the properties of the specified device;
  • cudaSetDevice(int device): Select the device used by the current thread;
  • cudaGetDevice(int *device): Get the device used by the current thread.
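
A short sketch tying these calls together: it enumerates every device, prints its name and compute capability, then selects device 0:

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    std::cout << "Found " << count << " CUDA device(s)" << std::endl;

    for (int device = 0; device < count; ++device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        std::cout << "Device " << device << ": " << prop.name
                  << " (compute capability " << prop.major << "." << prop.minor << ")"
                  << std::endl;
    }

    // Make device 0 current for subsequent CUDA calls on this thread.
    cudaSetDevice(0);
    return 0;
}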

2.2. Memory management API

The memory management API is mainly used to allocate, free, and transfer device memory. Some commonly used memory management APIs are listed below, with a short sketch after the list:

  • cudaMalloc(void **devPtr, size_t size): allocate device memory;
  • cudaFree(void *devPtr): free device memory;
  • cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind): copy data between host and device, with kind giving the direction (e.g. cudaMemcpyHostToDevice).
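
A minimal sketch of the allocate/copy/free life cycle:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 256;
    float host_data[N] = {0};

    // Allocate a device buffer of N floats.
    float *dev_data = nullptr;
    cudaMalloc(&dev_data, N * sizeof(float));

    // Copy host -> device, then device -> host.
    cudaMemcpy(dev_data, host_data, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(host_data, dev_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_data);
    printf("done\n");
    return 0;
}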

2.3. Execution Management API

The execution management API is mainly used to launch kernels, synchronize, and manage streams. Some commonly used execution management constructs are listed below, with a sketch after the list:

  • __global__: declares a kernel function;
  • kernel<<<gridDim, blockDim>>>(...): launches a kernel;
  • cudaStreamCreate(cudaStream_t *pStream): creates a stream;
  • cudaStreamDestroy(cudaStream_t stream): destroys a stream;
  • cudaStreamSynchronize(cudaStream_t stream): waits for all work in a stream to finish;
  • cudaDeviceSynchronize(): waits for all outstanding work on the device to finish.
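
A sketch that launches a trivial kernel on a dedicated stream and waits for it (the fill kernel is a made-up example):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *data, float value, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int N = 1024;
    float *dev = nullptr;
    cudaMalloc(&dev, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The fourth launch parameter selects the stream
    // (the third is the dynamic shared memory size).
    fill<<<(N + 255) / 256, 256, 0, stream>>>(dev, 3.14f, N);

    cudaStreamSynchronize(stream);  // wait for the kernel on this stream
    cudaStreamDestroy(stream);
    cudaFree(dev);
    printf("done\n");
    return 0;
}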

3. Versioning and Compatibility

CUDA uses version numbers to express the compatibility of its API and ABI (Application Binary Interface). A CUDA version number consists of three parts: major version, minor version, and revision, for example CUDA 10.2.89.

The major and minor version numbers express API compatibility: if the CUDA version an application was built against matches the CUDA version installed on the system in both major and minor number, the application runs normally. The revision number expresses ABI compatibility: if the revision number of the application's CUDA version is the same as or lower than that of the installed CUDA version, the application runs normally.

In this section, we detail the concepts and practices of CUDA version management and compatibility. The following is an overview of the contents of this section:

  • CUDA version number
  • API compatibility
  • ABI compatibility
  • Example: Check CUDA version

3.1. CUDA version number

The CUDA version number consists of three parts:

  • Major version: marks major features and architectural changes of the CUDA platform;
  • Minor version: marks smaller features and performance improvements;
  • Revision: marks bug fixes and fine-grained optimizations.

For example, CUDA 10.2.89 represents a major version number of 10, a minor version number of 2, and a revision number of 89.

3.2. API Compatibility

API compatibility means that a CUDA program can run correctly on different versions of the CUDA platform. It is determined mainly by the major and minor version numbers:

  • If the program's CUDA version matches the installed CUDA version in both major and minor number, the program runs normally;
  • If the program's CUDA version is newer than the installed one (a higher major number, or the same major number with a higher minor number), the program may not run properly;
  • If the program's CUDA version is older than the installed one, the program can generally still run, because the CUDA platform stays backward compatible with older APIs.

3.3. ABI Compatibility

ABI compatibility means that the compiled binary of a CUDA program can run correctly on different versions of the CUDA platform. It is determined primarily by the revision number:

  • If the revision number of the program's CUDA version is the same as or lower than that of the installed CUDA version, the program runs normally;
  • If it is higher, the program may not function correctly.

3.4. Example: Check CUDA version

To check the CUDA version, we can use the following method:

  • Check the version of CUDA runtime library: cat /usr/local/cuda/version.txt;
  • Check the version of CUDA compiler: nvcc --version;
  • Check the version of CUDA driver: nvidia-smi;
  • Get version information in CUDA program:
#include <iostream>
#include <cuda_runtime.h>

int main() {
    int runtime_version = 0;
    int driver_version = 0;
    // Versions are returned as 1000 * major + 10 * minor,
    // e.g. 10020 means CUDA 10.2.
    cudaRuntimeGetVersion(&runtime_version);
    cudaDriverGetVersion(&driver_version);
    std::cout << "CUDA Runtime Version: " << runtime_version << std::endl;
    std::cout << "CUDA Driver Version: " << driver_version << std::endl;
    return 0;
}

4. Compute Modes

CUDA supports several compute modes for controlling access to a device. The compute modes are:

  • Default: the device can be accessed by multiple host threads and multiple contexts at the same time;
  • Exclusive Process: only one process at a time may create a context on the device, though multiple threads of that process may share it;
  • Prohibited: the device cannot be accessed by any thread or context.

In this section, we detail the concepts and practice of CUDA's compute modes. The following is an overview of the contents of this section:

  • Introduction to Compute Modes
  • Example: Setting the Compute Mode
  • Example: Querying the Compute Mode

4.1. Introduction to Compute Modes

The compute mode is a mechanism for controlling access to a CUDA device. By setting the compute mode, we control how the device is shared among threads and contexts. The modes in detail:

  • Default: the device can be accessed by multiple threads and contexts simultaneously. This mode suits most cases, since it lets several CUDA programs run in parallel;
  • Exclusive Process: only one process may create a context on the device, shared by the threads of that process. This mode suits high-performance computing scenarios that need exclusive use of the device;
  • Prohibited: no thread or context may access the device. This mode suits security or power-saving scenarios where the CUDA device must be disabled.

4.2. Example: Setting the Compute Mode

The compute mode cannot be changed from inside a CUDA program through the runtime API; it can only be queried there. (The call cudaDeviceSetSharedMemConfig, which sometimes appears in examples like this, actually configures the shared-memory bank size, not the compute mode.) Setting the mode requires administrator privileges and is done in one of two ways:

  • Set the compute mode programmatically through the NVML library. A minimal sketch, assuming the program is linked against NVML (-lnvidia-ml):
#include <iostream>
#include <nvml.h>

int main() {
    nvmlInit();

    nvmlDevice_t device;
    nvmlDeviceGetHandleByIndex(0, &device);

    // Set the compute mode to Exclusive Process (requires root/administrator).
    nvmlReturn_t err = nvmlDeviceSetComputeMode(device, NVML_COMPUTEMODE_EXCLUSIVE_PROCESS);
    if (err != NVML_SUCCESS) {
        std::cerr << "Failed to set compute mode: " << nvmlErrorString(err) << std::endl;
        nvmlShutdown();
        return 1;
    }

    nvmlShutdown();
    return 0;
}
  • Set the compute mode using the nvidia-smi command-line tool:
# Set the compute mode of device 0 to Exclusive Process
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

4.3. Example: Querying the Compute Mode

To query the compute mode, we can use the following methods:

  • Query the compute mode in a CUDA program:
#include <iostream>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    // Read the compute mode from the device properties
    // (the computeMode field holds a cudaComputeMode value).
    cudaDeviceProp device_prop;
    cudaGetDeviceProperties(&device_prop, device);

    switch (device_prop.computeMode) {
        case cudaComputeModeDefault:
            std::cout << "Compute mode: Default" << std::endl;
            break;
        case cudaComputeModeExclusiveProcess:
            std::cout << "Compute mode: Exclusive Process" << std::endl;
            break;
        case cudaComputeModeProhibited:
            std::cout << "Compute mode: Prohibited" << std::endl;
            break;
        default:
            std::cerr << "Unknown compute mode" << std::endl;
            return 1;
    }

    return 0;
}
  • Query the compute mode using the nvidia-smi command-line tool:
# Query the compute mode of device 0
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv

5. Mode Switches

CUDA's mode-switching capability lets users change the compute mode while the system is running. Mode switching is performed with the nvidia-smi command-line tool, so users can switch the compute mode whenever needed.

Here is an example of a mode switch:

  1. View the current compute mode:
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv
  2. Switch the compute mode to Exclusive Process:
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  3. Verify that the compute mode has changed:
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv

Note that switching modes can affect CUDA programs that are already running. Before switching the compute mode, make sure all related programs have exited.

6. Tesla Compute Cluster Mode for Windows

Tesla Compute Cluster mode (TCC mode for short) is a high-performance computing mode that NVIDIA provides on the Windows platform. TCC mode is designed to improve the performance and stability of Tesla GPUs in high-performance computing scenarios, and provides the following features:

  • Optimized memory transfer performance;
  • Support large memory paging;
  • Support Peer-to-Peer memory transfer;
  • Lower kernel launch latency;
  • System Management Interrupt (SMI) isolation.

To enable TCC mode, use the NVIDIA Control Panel or the nvidia-smi command-line tool. With TCC mode enabled, the Tesla GPU operates as a dedicated compute device and can no longer be used to drive a display.
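
With nvidia-smi, the driver model can be switched with the -dm option (this requires administrator privileges, and a reboot for the change to take effect):

# Switch device 0 to the TCC driver model (reboot required)
nvidia-smi -i 0 -dm TCC

# Switch back to the default WDDM display driver model
nvidia-smi -i 0 -dm WDDM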

Note that TCC mode applies only to Tesla-class GPUs on the Windows platform; other platforms and GPU product lines are not affected.

Origin: blog.csdn.net/kunhe0512/article/details/131017674