CUDA Streams: Exploiting Parallel Execution to Improve Performance

Introduction

CUDA streams are a core concept in CUDA programming. A stream is a mechanism for executing sequences of CUDA commands asynchronously, allowing device parallelism to be exploited and application performance to be improved.

In this article, I will introduce the basic concepts of CUDA streams, how to create and use streams, and how to use streams to execute multiple sequences of CUDA commands in parallel to improve the performance of applications on the GPU.

1. CUDA stream overview

Streams are an important mechanism in CUDA parallel computing. Data transfer between the CPU and GPU is a time-consuming operation, but the GPU can execute kernels while transfers are still in flight. CUDA streams allow multiple sequences of CUDA commands to execute concurrently on the device, taking full advantage of this parallelism and improving application performance.

In CUDA, each stream represents a sequence of CUDA commands executed in issue order: a command in a stream does not begin until all commands issued earlier in that same stream have completed.

2. Creating and using CUDA streams

In CUDA programming, a new CUDA stream can be created by calling the cudaStreamCreate() function. The prototype of the cudaStreamCreate() function is as follows:

cudaError_t cudaStreamCreate(cudaStream_t* pStream);

The cudaStreamCreate() function creates a new CUDA stream and stores a handle to it in pStream. If the stream is created successfully, cudaSuccess is returned; otherwise an appropriate error code is returned.
Here is an example of creating a CUDA stream:

cudaStream_t stream;
cudaStreamCreate(&stream);

In the above example, a new CUDA stream is created using the cudaStreamCreate() function and a handle to the stream is stored in stream.
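As with any CUDA runtime call, the return value should be checked in real code. A minimal sketch (the error message text is illustrative):

cudaStream_t stream;
cudaError_t err = cudaStreamCreate(&stream);
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the error code to a readable message
    fprintf(stderr, "cudaStreamCreate failed: %s\n", cudaGetErrorString(err));
    return 1;
}
/* ... use the stream ... */
cudaStreamDestroy(stream);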

To add a CUDA command to a stream, use the asynchronous variant of the command and pass the stream as an argument. For example, a kernel can be launched in a particular stream by passing the stream as the fourth parameter of the execution configuration:

myKernel<<<gridSize, blockSize, 0, stream>>>(/* arguments */);

In the example above, myKernel is a CUDA kernel function, gridSize and blockSize are the grid and block dimensions used when launching the kernel, the third parameter (0 here) is the amount of dynamic shared memory, and the last parameter, stream, specifies the stream in which the kernel is launched.

After issuing CUDA commands, you can call the cudaStreamSynchronize() function to block the host until all commands in the stream have completed.
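For example, continuing with the myKernel launch from above, the host can block until the stream drains:

myKernel<<<gridSize, blockSize, 0, stream>>>(/* arguments */);
cudaStreamSynchronize(stream);  // blocks the host until all work in `stream` has finished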

3. Parallel execution using CUDA streams

When you have multiple CUDA operations that can run concurrently, you can use CUDA streams to achieve this parallelism. Each stream performs its operations asynchronously and independently of other streams: operations are ordered within a stream, but not between streams. In practice, mutually independent operations can be distributed across different streams so that higher parallelism and throughput are achieved.

In CUDA, CUDA streams can be created, destroyed and managed using the following functions:

cudaError_t cudaStreamCreate(cudaStream_t *stream);
cudaError_t cudaStreamDestroy(cudaStream_t stream);
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);
Function                  Description
cudaStreamCreate()        Creates a new CUDA stream and stores its handle in the pointer passed as stream
cudaStreamDestroy()       Destroys a CUDA stream and frees all resources associated with it
cudaStreamSynchronize()   Blocks the CPU thread until all previously submitted operations in the stream have completed
cudaStreamQuery()         Queries whether all operations in a stream have completed, without blocking the CPU thread
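cudaStreamQuery() returns cudaSuccess when the stream is empty and cudaErrorNotReady while work is still pending, which allows host work to overlap with device work. A minimal sketch, assuming a hypothetical doCpuWork() helper:

// Poll the stream without blocking the host thread
while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    doCpuWork();  // hypothetical host-side work overlapped with the GPU
}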

To submit operations to a CUDA stream, the following asynchronous functions can be used:

cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream);
cudaError_t cudaMemsetAsync(void* devPtr, int value, size_t count, cudaStream_t stream);
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, void** args, size_t sharedMem, cudaStream_t stream);
Function             Description
cudaMemcpyAsync()    Asynchronously copies memory in the specified stream
cudaMemsetAsync()    Asynchronously sets device memory to the specified value in the specified stream
cudaLaunchKernel()   Asynchronously launches a CUDA kernel function in the specified stream
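Note that cudaMemcpyAsync() can overlap with other work only when the host buffer is page-locked (pinned) memory allocated with cudaMallocHost(); with pageable memory the copy may fall back to synchronous behavior. A minimal sketch (buffer and kernel names are illustrative):

float *h_buf, *d_buf;
size_t bytes = 1 << 20;
cudaMallocHost(&h_buf, bytes);  // pinned host memory enables true async copies
cudaMalloc(&d_buf, bytes);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// the kernel is in the same stream, so it runs only after the copy completes
process<<<grid, block, 0, stream>>>(d_buf);  // hypothetical kernel
cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

cudaFreeHost(h_buf);
cudaFree(d_buf);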

The following code example shows how two streams can each execute a kernel and an asynchronous memory copy concurrently:

#include <stdio.h>
#include <cuda_runtime.h>

// Squares each element and adds 1
__global__ void kernel(int* a, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        a[idx] *= a[idx];
        a[idx] += 1;
    }
}

int main()
{
    int N = 1000000;
    int half = N / 2;

    // Pinned host memory allows cudaMemcpyAsync() to overlap with kernel execution
    int* h_a;
    cudaMallocHost(&h_a, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        h_a[i] = i;
    }

    int* d_a;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (half + threadsPerBlock - 1) / threadsPerBlock;

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Each stream processes its own half of the array, so the two
    // kernel/copy pairs are independent and may overlap on the device
    kernel<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_a, half);
    cudaMemcpyAsync(h_a, d_a, half * sizeof(int), cudaMemcpyDeviceToHost, stream1);

    kernel<<<blocksPerGrid, threadsPerBlock, 0, stream2>>>(d_a + half, N - half);
    cudaMemcpyAsync(h_a + half, d_a + half, (N - half) * sizeof(int),
                    cudaMemcpyDeviceToHost, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);

    for (int i = 0; i < 10; i++) {
        printf("%d ", h_a[i]);
    }
    printf("\n");

    cudaFreeHost(h_a);
    cudaFree(d_a);

    return 0;
}

In this example, two streams, stream1 and stream2, are created, and each is issued a kernel launch followed by an asynchronous device-to-host copy. cudaStreamSynchronize() is then called on each stream to wait for its work to complete, and the streams are destroyed.
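When one stream's work must wait on another's, CUDA events provide cross-stream ordering without blocking the host. A sketch, assuming hypothetical producer and consumer kernels that share a buffer d_buf:

cudaEvent_t done;
cudaEventCreate(&done);

producer<<<grid, block, 0, stream1>>>(d_buf);  // hypothetical kernel
cudaEventRecord(done, stream1);                // mark completion point in stream1

cudaStreamWaitEvent(stream2, done, 0);         // stream2 waits for the event
consumer<<<grid, block, 0, stream2>>>(d_buf);  // hypothetical kernel

cudaEventDestroy(done);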

4. Summary

This article described how to use CUDA streams to improve parallel execution efficiency. It began with what CUDA streams are and the benefits they offer, then discussed how to create and manage them, and finally showed how to launch kernels and asynchronous memory operations in CUDA streams.


Origin blog.csdn.net/Algabeno/article/details/129152135