2.3 CUDA Driver API: Context Management and Its Role

Foreword

Teacher Du launched the "TensorRT high-performance deployment from scratch" course. I watched it before, but I didn't take notes and have forgotten a lot, so I'm going through it again and taking notes along the way.

This lesson covers the condensed CUDA tutorial: Driver API context management and its role.

The course outline can be seen in the mind map below

[Figure: course outline mind map]

1. CUcontext

What you need to know about a context:

  1. A context associates all operations performed on a GPU
  2. A context is bound to one device, and one device can have multiple contexts
  3. Each thread keeps a stack of contexts; the top of the stack is the context currently in use, and push and pop functions manipulate this stack. All API calls operate on the current context

Imagine if every operation required passing a device handle just to specify which device to run on; that would be cumbersome.

The figure below compares the code without context and with context


Figure 1-1 Comparison of code without context and code with context

As the figure shows, without a context we would call CUDA's memory allocation, free, and copy functions directly, and each of these functions would need a device identifier parameter to specify which device to operate on. Every call would be independent, with no association with the device established.

With a context, we first create one with the cuCtxCreate function and associate it with a specific device, then make it current with the cuCtxPushCurrent function. Subsequent memory allocation, free, and copy calls automatically operate on the current context, without explicitly specifying a device identifier.

After the context-related operations are done, we use the cuCtxPopCurrent function to pop the context off the current-context stack, restoring the previous context.

The advantage of using a context is that a series of related CUDA operations is grouped under one context, which simplifies the code.
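To make the contrast concrete, here is a minimal sketch of the "with context" style. The context-free calls appear only as comments because they are hypothetical (the real Driver API has no device-parameter variants of these functions), and the CHECK macro is a simplified stand-in for the checkDriver macro used in the full sample later in this post:

```cpp
#include <cuda.h>
#include <stdio.h>

// Simplified error check (same idea as the checkDriver macro in the full sample)
#define CHECK(op) do { CUresult r = (op); if (r != CUDA_SUCCESS) { \
    printf("%s failed: %d\n", #op, (int)r); return 1; } } while (0)

int main(){
    // Hypothetical context-free style (illustration only; NOT the real API):
    //   allocOnDevice(&ptr, bytes, device);   // every call must name the device
    //   freeOnDevice(ptr, device);

    // Real Driver API style: bind a context once, then all calls target it.
    CHECK(cuInit(0));
    CUcontext ctx = nullptr;
    CHECK(cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, 0)); // ctx becomes the current context
    CUdeviceptr p = 0;
    CHECK(cuMemAlloc(&p, 1024));  // no device argument: operates on the current context
    CHECK(cuMemFree(p));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

This needs a CUDA-capable GPU and the driver library (e.g. compile with nvcc and link against -lcuda), so it is a sketch rather than something you can run anywhere.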

Everything above concerns manual context management. For automatic context management, the following points are worth noting:

  1. Context access is a high-frequency operation; a thread essentially sticks to one GPU and uses a single context, and multiple contexts are rarely needed
  2. Managing multiple contexts with CreateContext, PushCurrent, and PopCurrent is cumbersome, so it should be simplified
  3. Hence cuDevicePrimaryCtxRetain was introduced: it associates a primary context with the device, so you no longer manage allocation, release, setting, and the stack yourself
  4. primaryContext: give it a device id and it hands you the primary context (note that, unlike cuCtxCreate, it does not make the context current by itself); one device corresponds to one primary context
  5. Regardless of which thread asks, the same device id yields the same primary context, and the primary context is thread-safe
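As a hedged sketch of the primary-context pattern (error checking omitted for brevity; cuCtxSetCurrent is used here because cuDevicePrimaryCtxRetain returns the context without making it current):

```cpp
#include <cuda.h>

int main(){
    cuInit(0);
    CUdevice device = 0;
    CUcontext ctx = nullptr;

    // Retain the device's primary context (created on first retain)
    cuDevicePrimaryCtxRetain(&ctx, device);
    cuCtxSetCurrent(ctx);                // make it current for this thread

    CUdeviceptr p = 0;
    cuMemAlloc(&p, 1 << 20);             // uses the primary context; no push/pop needed
    cuMemFree(p);

    cuCtxSetCurrent(nullptr);            // detach from this thread
    cuDevicePrimaryCtxRelease(device);   // every Retain must be paired with a Release
    return 0;
}
```

The CUDA runtime API is built on exactly this mechanism, which is why runtime code never manages contexts explicitly.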

The figure below compares code that manages the context manually with code that manages it automatically


Figure 1-2 Comparison of manual and automatic context management code

In the figure above, we use the cuDevicePrimaryCtxRetain function to obtain the primary context associated with a specific device. Subsequent memory allocation, free, and copy calls operate through this primary context without explicitly setting the current context. In this case there is no need to manage the context stack explicitly, and the code is more concise.

The sample code for the context case is as follows:


// CUDA driver header cuda.h
#include <cuda.h>   // difference between #include <> and "":
#include <stdio.h>  // <> : standard/system headers
#include <string.h> // "" : user-defined headers; see readme.md -> 5 for details

#define checkDriver(op)  __check_cuda_driver((op), #op, __FILE__, __LINE__)

bool __check_cuda_driver(CUresult code, const char* op, const char* file, int line){

    if(code != CUresult::CUDA_SUCCESS){

        // The call did not return CUDA_SUCCESS (0): report the error and return false
        const char* err_name = nullptr;     // will point to the error name string
        const char* err_message = nullptr;  // will point to the error description
        cuGetErrorName(code, &err_name);
        cuGetErrorString(code, &err_message);
        printf("%s:%d  %s failed. \n  code = %s, message = %s\n", file, line, op, err_name, err_message);
        return false;
    }
    return true;
}

int main(){

    // Initialize the CUDA driver
    checkDriver(cuInit(0));

    // Create contexts for the device
    CUcontext ctxA = nullptr;                                   // CUcontext is actually struct CUctx_st* (a pointer to struct CUctx_st)
    CUcontext ctxB = nullptr;
    CUdevice device = 0;
    checkDriver(cuCtxCreate(&ctxA, CU_CTX_SCHED_AUTO, device)); // create ctxA to manage data on the given device; for the flags parameter see https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDA__CTX_g65dc0012348bc84810e2103a40d8e2cf.html
    checkDriver(cuCtxCreate(&ctxB, CU_CTX_SCHED_AUTO, device)); // see 1.ctx-stack.jpg
    printf("ctxA = %p\n", ctxA);
    printf("ctxB = %p\n", ctxB);
    /*
        context stack:
            ctxB -- top <--- current_context
            ctxA
            ...
     */

    // Query the current context
    CUcontext current_context = nullptr;
    checkDriver(cuCtxGetCurrent(&current_context));             // at this point current_context is the most recently created context (ctxB)
    printf("current_context = %p\n", current_context);

    // A stack can be used to manage multiple contexts for a device
    // Push a context
    checkDriver(cuCtxPushCurrent(ctxA));                        // push ctxA onto this CPU thread's stack; each thread manages switching between contexts via its own stack
    checkDriver(cuCtxGetCurrent(&current_context));             // get current_context (the top of the stack)
    printf("after pushing, current_context = %p\n", current_context);
    /*
        context stack:
            ctxA -- top <--- current_context
            ctxB
            ...
    */

    // Pop the current context
    CUcontext popped_ctx = nullptr;
    checkDriver(cuCtxPopCurrent(&popped_ctx));                   // pop the current context and store it in popped_ctx
    checkDriver(cuCtxGetCurrent(&current_context));              // get current_context (the new top of the stack)
    printf("after popping, popped_ctx = %p\n", popped_ctx);      // the popped one is ctxA
    printf("after popping, current_context = %p\n", current_context); // current_context is ctxB

    checkDriver(cuCtxDestroy(ctxA));
    checkDriver(cuCtxDestroy(ctxB));

    // Prefer cuDevicePrimaryCtxRetain to obtain the context associated with a device
    // Note: this is important -- the runtime API is built on it, automatically associating a single context with each device
    checkDriver(cuDevicePrimaryCtxRetain(&ctxA, device));       // retain the device's primary context into ctxA
    printf("ctxA = %p\n", ctxA);
    checkDriver(cuDevicePrimaryCtxRelease(device));
    return 0;
}

The running effect is as follows:


Figure 1-3 Output of the context example

The code begins by creating two contexts, ctxA and ctxB, by calling cuCtxCreate for a specific device (identified by device). It then uses the cuCtxGetCurrent function to get the current context and print its address. Right after creation, the current context has the same address as ctxB.

Next, the code pushes ctxA onto the context stack with the cuCtxPushCurrent function, making it the current context. It then pops the current context with the cuCtxPopCurrent function, receiving the popped context in the popped_ctx variable. Calling cuCtxGetCurrent again shows that the current context has changed back to ctxB, while popped_ctx holds the popped ctxA.

Finally, the code demonstrates using the cuDevicePrimaryCtxRetain function to manage the context automatically.

2. Supplementary knowledge

Some supplementary knowledge about contexts (from Teacher Du):

  1. A context holds all the state of a device associated with a particular process. For example, a kernel you write causes different kinds of state on the GPU (memory mappings, allocations, loaded code); the context stores all of this management data to control and use the device
  • An analogy: you have separate conversations with Xiao Ming and Xiao Hong. With Xiao Ming you talk about "china" (the country), with Xiao Hong about "china" (porcelain); same word, different things. Such per-conversation state needs managing, and that is why we create a context
  • A GPU context is the analogue of a CPU process. There can be multiple contexts on one GPU, isolated from each other; one context per device is recommended
  • Reference: https://aiprobook.com/deep-learning-for-programmers/
  2. What context management does:
  • Holds the list of allocated memory
  • Holds the kernel code loaded onto the device
  • Unifies memory management between CPU and GPU
  3. How to manage a context:
  • With the CUDA Driver API the context must be managed explicitly
    • Create it at the start with cuCtxCreate() and destroy it at the end with cuCtxDestroy(); like file handles, it must be switched manually
    • Better: create the context with cuDevicePrimaryCtxRetain()!
    • cuCtxGetCurrent() gets the current context
    • Multiple contexts can be managed as a stack: cuCtxPushCurrent() pushes, cuCtxPopCurrent() pops
    • Calling cuCtxPushCurrent() on ctxA, like cuCtxCreate(), puts ctxA on top of the stack (making it the current context)
  • The CUDA runtime creates the context automatically, based on cuDevicePrimaryCtxRetain()

Summary

This lesson covered context management in the Driver API. By manually creating and managing a context, a series of related CUDA operations can be grouped under one context, simplifying the code. That said, it is recommended to use cuDevicePrimaryCtxRetain to obtain the context associated with a device: there is no need to manage the context stack explicitly, and the code is more concise.


Origin blog.csdn.net/qq_40672115/article/details/131606115