Day 7: CUDA Streams

Understanding CUDA Streams

CUDA programs typically process large amounts of data, and memory bandwidth often becomes the main bottleneck.

With streams, a CUDA program can overlap memory transfers with numerical computation, thereby improving data throughput.

Since the GPU and the CPU cannot directly read each other's memory, a CUDA program generally consists of three steps:

  1. Move data from CPU memory to GPU memory
  2. The GPU performs calculations and saves the results in GPU memory
  3. Copy the result from GPU memory to CPU memory

Without any special handling, CUDA uses only one stream, the default stream.

In that case each step must finish completely before the next one begins, and this is where streams show their value.

A CUDA stream represents a queue of GPU operations; the operations in the queue are executed sequentially, in the order they were added to the stream.

A stream can be regarded as a task on the GPU, and different tasks (different streams) can execute concurrently.
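
A minimal sketch of that idea (the add_one kernel, the sizes, and the launch configuration are illustrative, not from the original post): two independent pieces of work are placed in two different streams, and the GPU is then free to run them concurrently.

#include <cuda_runtime.h>
#include <stdio.h>

// Trivial kernel that gives each stream some independent work to do
__global__ void add_one(float* data, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) data[i] += 1.0f;
}

int main(){
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1 = nullptr, s2 = nullptr;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each launch is enqueued into its own stream; because the two kernels
    // are independent, the GPU is free to execute them concurrently.
    add_one<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    add_one<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    // Wait for both streams (both "tasks") to finish
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}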

To use CUDA streams for this purpose, first select a device that supports the device overlap feature.

A GPU that supports device overlap can execute a CUDA kernel while copying data between the host and the device.

This overlap capability is important and can improve the execution efficiency of GPU programs to a certain extent.
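
Whether a given GPU supports this overlap can be read from its device properties. A small sketch, assuming device 0 (deviceOverlap and asyncEngineCount are fields of cudaDeviceProp in the CUDA runtime API; the rest is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // deviceOverlap: can copy memory and run a kernel at the same time
    // asyncEngineCount: number of copy engines (2 allows copies in both directions at once)
    printf("deviceOverlap    = %d\n", prop.deviceOverlap);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    if(!prop.deviceOverlap)
        printf("This device cannot overlap copies with kernel execution.\n");
    return 0;
}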

In general, CPU memory is much larger than GPU memory. For large data sets, the data in the CPU buffer cannot be transferred to the GPU all at once; it has to be transferred in chunks.

If the GPU also executes kernels while those chunks are being transferred, this kind of asynchronous operation uses the device's overlap capability to improve computing performance.
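
A minimal sketch of that chunked, overlapped pattern (the scale kernel, chunk size, and two-stream double buffering are illustrative choices): the host buffer is pinned, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are queued asynchronously into one of two streams, so copies for one chunk can overlap with computation on another.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float* data, int n, float factor){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) data[i] *= factor;
}

int main(){
    const int total  = 1 << 22;          // total number of floats
    const int chunk  = 1 << 20;          // elements per chunk
    const int nchunk = total / chunk;

    // Pinned host memory is required for truly asynchronous copies
    float* host = nullptr;
    cudaMallocHost(&host, total * sizeof(float));
    for(int i = 0; i < total; ++i) host[i] = 1.0f;

    // One device buffer per stream (double buffering)
    cudaStream_t streams[2];
    float* device[2];
    for(int s = 0; s < 2; ++s){
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&device[s], chunk * sizeof(float));
    }

    for(int c = 0; c < nchunk; ++c){
        int s = c % 2;                    // round-robin over the two streams
        float* src = host + c * chunk;

        // The H2D copy, kernel, and D2H copy are queued in stream s and the calls
        // return immediately; work queued in the other stream can overlap with them.
        cudaMemcpyAsync(device[s], src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(device[s], chunk, 2.0f);
        cudaMemcpyAsync(src, device[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for(int s = 0; s < 2; ++s){
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
        cudaFree(device[s]);
    }
    printf("host[0] = %f\n", host[0]);    // expect 2.0
    cudaFreeHost(host);
    return 0;
}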


CUDA Stream execution process

  1. A stream is a handle that can be regarded as a command queue
    • The CUDA executor reads instructions from the stream and executes them one by one
    • For example, calling cudaMemcpyAsync is equivalent to adding a cudaMemcpy instruction to the stream's queue
    • Stream-based functions add their instructions to the stream and return immediately, without waiting for the instructions to finish executing
    • cudaStreamSynchronize waits until all instructions in the stream have been executed, i.e. until the queue is empty
  2. Be careful when using streams
    • Because asynchronous functions return immediately, consider the lifetime of the arguments you pass in, and release them only after confirming that the work using them has finished
  3. Events can also be added to a stream to monitor whether a certain checkpoint has been reached (see the timing sketch after this list)
    • cudaEventCreate: create an event
    • cudaEventRecord: record an event, i.e. add it to the stream; its state is updated when the queue reaches it
    • cudaEventQuery: query the current state of an event
    • cudaEventElapsedTime: compute the time interval between two events; if you want to measure the execution time of kernels, this function gives the most accurate statistics
    • cudaEventSynchronize: synchronize on an event, i.e. wait until it has been reached
    • cudaStreamWaitEvent: make a stream wait for an event
  4. For synchronous functions such as cudaMemcpy, the default stream is used implicitly; the call is roughly equivalent to executing
    • cudaMemcpyAsync(..., default stream) to enqueue the copy, followed by
    • cudaStreamSynchronize(default stream) to wait for it to complete
    • The default stream is similar to a current context and is associated with the current device
    • Heavy use of the default stream therefore serializes work and leads to poor performance
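
A small sketch of timing work in a stream with events (the busy kernel and the sizes are illustrative): two events bracket the kernel launch in the same stream, and cudaEventElapsedTime reports the time between them.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void busy(float* data, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(){
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream = nullptr;
    cudaStreamCreate(&stream);

    cudaEvent_t start = nullptr, stop = nullptr;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record "start", enqueue the kernel, record "stop" -- all in the same
    // stream, so the two events bracket exactly the kernel's execution
    cudaEventRecord(start, stream);
    busy<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaEventRecord(stop, stream);

    // Wait until the "stop" event has been reached, then read the elapsed time
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}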

Code example


// CUDA runtime header
#include <cuda_runtime.h>

#include <stdio.h>
#include <string.h>

#define checkRuntime(op)  __check_cuda_runtime((op), #op, __FILE__, __LINE__)

bool __check_cuda_runtime(cudaError_t code, const char* op, const char* file, int line){
    if(code != cudaSuccess){
        const char* err_name    = cudaGetErrorName(code);
        const char* err_message = cudaGetErrorString(code);
        printf("runtime error %s:%d  %s failed. \n  code = %s, message = %s\n", file, line, op, err_name, err_message);
        return false;
    }
    return true;
}

int main(){

    int device_id = 0;
    checkRuntime(cudaSetDevice(device_id));

    cudaStream_t stream = nullptr;
    checkRuntime(cudaStreamCreate(&stream));

    // Allocate memory on the GPU
    float* memory_device = nullptr;
    checkRuntime(cudaMalloc(&memory_device, 100 * sizeof(float)));

    // Allocate memory on the CPU, put data into it, and copy the data to the GPU
    float* memory_host = new float[100];
    memory_host[2] = 520.25;
    checkRuntime(cudaMemcpyAsync(memory_device, memory_host, sizeof(float) * 100, cudaMemcpyHostToDevice, stream)); // asynchronous copy: the host thread does not have to wait for the copy to finish (note that memory_host is pageable, so the runtime stages it and the call may not be fully asynchronous)

    // Allocate pinned (page-locked) memory on the CPU and copy the data back from the GPU
    float* memory_page_locked = nullptr;
    checkRuntime(cudaMallocHost(&memory_page_locked, 100 * sizeof(float)));
    checkRuntime(cudaMemcpyAsync(memory_page_locked, memory_device, sizeof(float) * 100, cudaMemcpyDeviceToHost, stream)); // asynchronous copy: the host thread does not have to wait for the copy to finish
    checkRuntime(cudaStreamSynchronize(stream)); // wait until all work queued in the stream has completed

    printf("%f\n", memory_page_locked[2]);

    // Free memory
    checkRuntime(cudaFreeHost(memory_page_locked));
    checkRuntime(cudaFree(memory_device));
    checkRuntime(cudaStreamDestroy(stream));
    delete [] memory_host;
    return 0;
}
