CUDA ---- Stream and Event (repost)

Stream

Generally speaking, parallelism in CUDA C manifests at the following two levels:

  • Kernel level
  • Grid level

So far we have been discussing the kernel level, that is, a single kernel (task) being executed in parallel by many threads on the GPU. The concept of a stream belongs to the grid level, which refers to multiple kernels executing simultaneously on one device.

Introduction to Stream and Event

A CUDA stream is a sequence of asynchronous CUDA operations that execute on the device in the order in which the host code issues them. A stream maintains the order of these operations, lets them enter the work queue once their prerequisites are satisfied, and allows us to query their status. These operations include host-to-device data transfers, kernel launches, and other actions initiated by the host but executed by the device. Their execution is always asynchronous with respect to the host; the CUDA runtime decides when they actually run. We can use the corresponding CUDA APIs to make sure results are only consumed after all operations have completed. Operations within the same stream have a strict execution order, but operations in different streams have no such restriction.

Since operations in different streams execute asynchronously with respect to each other, coordinating them lets us make full use of the hardware. We are already familiar with the typical CUDA programming model:

  • Transfer input data from host to device
  • Execute the kernel on the device
  • Transfer the results from the device back to the host

In many cases the kernel takes much longer to execute than the data transfers, so it is natural to hide the CPU-GPU communication time behind the execution of other kernels. This is done by placing data transfers and kernel execution in different streams. Streams can be used to implement pipelining and double buffering (front/back buffers).

CUDA APIs can be divided into two types: synchronous and asynchronous. Synchronous functions block the calling host thread, while asynchronous functions return control to the host immediately so it can continue with subsequent work. Asynchronous functions and streams are the two cornerstones of grid-level parallelism.

From a software perspective, operations in different streams can execute in parallel, but this is not necessarily true from a hardware perspective. Depending on PCIe bandwidth or the resources available on each SM, one stream may still have to wait for another to finish. The following sections briefly introduce how streams behave on devices of different compute capability (CC) versions.

Cuda Streams

All CUDA operations (including kernel execution and data transfers) run, explicitly or implicitly, in a stream. There are two types of streams:

  • Implicitly declared stream (NULL stream)
  • Explicitly declared stream (non-NULL stream)

The NULL stream is the default; it is what the earlier blog posts, which did not mention streams, implicitly used. A stream that you declare explicitly is a non-NULL stream.

Asynchronous and stream-based kernel execution and data transmission can achieve the following types of parallel:

  • Host computation overlapped with device computation
  • Host computation overlapped with host-to-device data transfer
  • Host-to-device data transfer overlapped with device computation
  • Concurrent computation on the device

The following code is the common pattern used so far; it uses the NULL stream by default:

cudaMemcpy(..., cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(...);
cudaMemcpy(..., cudaMemcpyDeviceToHost);

From the device's perspective, all three operations above use the default stream and execute in order, from top to bottom of the code. The device itself has no knowledge of what else the host is doing. From the host's perspective, the data transfers are synchronous and wait until the operation completes. The kernel launch, however, is asynchronous: the host regains control almost immediately, regardless of whether the kernel has finished, and proceeds to the next step. Obviously, this asynchronous behavior helps overlap computation on the device with work on the host.
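To make this asynchrony concrete, here is a minimal, self-contained sketch (the kernel and sizes are made up for illustration) in which the host keeps running while a kernel executes in the default stream:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread increments one element.
__global__ void incrementKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch returns almost immediately; the kernel runs asynchronously.
    incrementKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // The host can do unrelated work here while the GPU is busy.
    printf("Host continues while the kernel executes...\n");

    // Block until all device work has finished.
    cudaDeviceSynchronize();
    printf("Kernel finished.\n");

    cudaFree(d_data);
    return 0;
}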

The above was covered in previous blog posts. What is new here is that data transfers can also be executed asynchronously, which requires the streams discussed in this post: we must explicitly pass a stream to dispatch the transfer. The following is the asynchronous version of cudaMemcpy:

cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);

Note the newly added last parameter. With it, once the host issues this operation for the device to execute, control returns to the host immediately. With the default value, the call above uses the default stream. To declare a new stream, use the following API:

cudaError_t cudaStreamCreate(cudaStream_t* pStream);

This creates a stream that can be passed to asynchronous CUDA API functions. A common source of confusion with such functions is that the error code they return may actually have been produced by an earlier asynchronous call; in other words, a function returning an error does not necessarily mean that this particular call caused the error.
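A small sketch of defensive error checking (the CHECK macro is our own helper, not part of the CUDA API), which also illustrates that the error seen at a given call may come from an earlier asynchronous operation:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro to check CUDA runtime return codes.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
        }                                                              \
    } while (0)

int main() {
    cudaStream_t stream;
    // The error returned here could stem from a previously issued
    // asynchronous operation, not necessarily from cudaStreamCreate itself.
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}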

When performing an asynchronous data transfer, we must use pinned (non-pageable) host memory. Pinned memory is allocated as follows (see the previous blog post for details):

cudaError_t cudaMallocHost(void **ptr, size_t size);
cudaError_t cudaHostAlloc(void **pHost, size_t size, unsigned int flags);

Pinning host memory forces its physical location in CPU memory to stay fixed for the entire lifetime of the program; otherwise the operating system is free to change the physical address behind a host virtual address at any time. If an asynchronous transfer were issued on pageable host memory, the operating system might move the data to a different physical location while the transfer is in flight (since the call is asynchronous, the CPU continues doing other work that can affect this memory), and the CUDA runtime performing the transfer at that moment would produce undefined behavior.
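A minimal sketch putting pinned memory and cudaMemcpyAsync together (buffer names and sizes are illustrative):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 22;
    float *h_pinned, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost(&h_pinned, bytes);   // pinned (page-locked) host memory
    cudaMalloc(&d_buf, bytes);

    // Asynchronous copy: legal because the host buffer is pinned.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch kernels in the same stream here ...

    cudaStreamSynchronize(stream);      // wait before reusing the buffers

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}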

Specifying a stream for a kernel launch is just as simple: add the stream as a launch parameter:

kernel_name<<<grid, block, sharedMemSize, stream>>>(argument list);
// Non-default stream declaration
cudaStream_t stream;
// Initialize
cudaStreamCreate(&stream);
// Resource release
cudaError_t cudaStreamDestroy(cudaStream_t stream);

When releasing the stream, if work is still pending on it, cudaStreamDestroy returns immediately, and the resources are released automatically once the pending work completes.

Since all stream operations are asynchronous, APIs are needed to synchronize with them when necessary:

cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);

The first forces the host to block until all operations in the stream have completed. The second checks whether all operations in the stream have completed without blocking the host: it returns cudaSuccess if they have, and cudaErrorNotReady otherwise.
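For example, here is a small sketch of non-blocking polling, assuming a hypothetical doHostWork() that keeps the CPU busy in the meantime:

#include <cuda_runtime.h>

// Hypothetical host-side work done while the GPU is busy.
static void doHostWork() { /* e.g. prepare the next batch of input */ }

void waitWithPolling(cudaStream_t stream) {
    // Non-blocking: keep doing host work until the stream has drained.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        doHostWork();
    }
    // Blocking alternative:
    // cudaStreamSynchronize(stream);
}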

Let's take a look at a code snippet to help understand:


for (int i = 0; i < nStreams; i++) {
    int offset = i * bytesPerStream;
    cudaMemcpyAsync(&d_a[offset], &a[offset], bytesPerStream, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < nStreams; i++) {
    cudaStreamSynchronize(streams[i]);
}


This code uses three streams; the data transfers and kernel launches are distributed across these concurrent streams.

[Figure: timeline of the three streams, with transfers and kernels overlapping like a pipeline]

The figure above shows a pipeline, so little more needs to be said. Note, however, that the data transfer operations in the figure are not executed in parallel even though they are in different streams. As usual, the blame falls on hardware resources: there are only so many of them, and software-level optimization is essentially about keeping every hardware resource busy at all times (evil capitalism, um...). Here the bottleneck is the PCIe bus. From a programming point of view these operations are still independent of one another, but they must serialize whenever they share a hardware resource. A device with two copy engines can overlap two data transfers, but only if they are issued in different streams and go in different directions.

The maximum number of concurrent kernels depends on the device: Fermi supports 16 concurrent kernel executions, Kepler 32. The actual concurrency is further limited by shared memory, registers, and other device resources.
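Whether a particular device supports concurrent kernels and how many copy engines it has can be queried from its properties; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s (CC %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Concurrent kernel execution supported: %d\n", prop.concurrentKernels);
    printf("Async copy engines: %d\n", prop.asyncEngineCount);
    return 0;
}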

Stream Scheduling

Conceptually, all streams run simultaneously. In practice, however, this is usually not the case.

False Dependencies

Although Fermi supports up to 16-way concurrency, physically all streams are funneled into a single hardware work queue for scheduling. When a grid is selected for execution, the scheduler checks its dependencies; if it depends on a previous task it must wait, and because there is only one queue, all tasks behind it must also wait, even tasks that belong to other streams. As shown in the figure below:

[Figure: tasks A-B-C, P-Q-R, and X-Y-Z from three streams packed into a single hardware work queue]

C and P, and R and X, could run in parallel because they are in different streams, but A-B-C, P-Q-R, and X-Y-Z cannot; for example, both C and P have to wait until B completes.

Hyper-Q

This false-dependency problem was solved in the Kepler generation by a technology called Hyper-Q. The simple, blunt understanding is: if one work queue is not enough, add more, so Kepler has 32 hardware work queues. This technology also enables compute and graphics work to run at the same time. Of course, if more than 32 streams are created, false dependencies can still occur.

[Figure: Hyper-Q with 32 hardware work queues, one per stream]

Stream Priorities

For compute capability 3.5 and above, streams can be given a priority:

cudaError_t cudaStreamCreateWithPriority(cudaStream_t* pStream, unsigned int flags, int priority);

This function creates a stream with the given priority; grids in a high-priority stream can preempt the execution of grids in a low-priority stream. The priority only applies to kernels, not to data transfers. If the requested priority falls outside the allowed range, it is automatically clamped to the highest or lowest allowed value. The valid range can be queried with the following function:

cudaError_t cudaDeviceGetStreamPriorityRange(int *leastPriority, int *greatestPriority);

As the names suggest, leastPriority is the lower bound and greatestPriority is the upper bound. As usual, a smaller numeric value means a higher priority. If the device does not support stream priorities, both values are returned as 0.
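A minimal sketch combining the two priority APIs (stream names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    printf("Priority range: %d (lowest) .. %d (highest)\n",
           leastPriority, greatestPriority);

    cudaStream_t highPrio, lowPrio;
    // Smaller value = higher priority; out-of-range values are clamped.
    cudaStreamCreateWithPriority(&highPrio, cudaStreamDefault, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamDefault, leastPriority);

    // ... launch latency-sensitive kernels in highPrio, bulk work in lowPrio ...

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}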

Cuda Events

An event is an important concept related to streams; it marks a specific point in a stream's execution. Its main uses are:

  • Synchronizing stream execution
  • Controlling the pace at which the device runs

The CUDA API provides functions to insert an event into a stream and to query whether the event has completed. An event is considered complete only when all operations preceding its position in the stream have finished executing. An event recorded in the default stream applies to all streams.

Creation and Destruction

// Declare
cudaEvent_t event;
// Create
cudaError_t cudaEventCreate(cudaEvent_t* event);
// Destroy
cudaError_t cudaEventDestroy(cudaEvent_t event);

As with stream destruction, if the event's related operations have not yet completed when cudaEventDestroy is called, the call returns immediately and the resources are released automatically once those operations finish.

Recording Events and Measuring Elapsed Time

An event marks a point in a stream's execution, and we can check whether the stream being executed has reached that point. An event is inserted as just another operation among the stream's operations; when it executes, all it does is set a flag marking it as complete. The following function associates an event with the specified stream:

cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);

Waiting for an event blocks the calling host thread; this synchronization is done with the following function:

cudaError_t cudaEventSynchronize(cudaEvent_t event);

This function is similar to cudaStreamSynchronize, except that it waits for a single event rather than the whole stream to finish executing. We can also use the following API to test whether the event has completed without blocking the host:

cudaError_t cudaEventQuery(cudaEvent_t event);

This function is similar to cudaStreamQuery. In addition, there is a dedicated API to measure the time interval between two events:

cudaError_t cudaEventElapsedTime(float* ms, cudaEvent_t start, cudaEvent_t stop);

This returns the time between start and stop in milliseconds. The two events do not have to be associated with the same stream, but note that if either is recorded in a non-NULL stream, the measured interval may be larger than expected. This is because cudaEventRecord happens asynchronously and we cannot guarantee that the measured interval covers exactly the work between the two events; since we usually only want the time taken by the GPU work, start and stop are typically recorded in the default stream.

The following code simply shows how to use events to measure time:


// create two events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// record start event on the default stream
cudaEventRecord(start);
// execute kernel
kernel<<<grid, block>>>(arguments);
// record stop event on the default stream
cudaEventRecord(stop);
// wait until the stop event completes
cudaEventSynchronize(stop);
// calculate the elapsed time between two events
float time;
cudaEventElapsedTime(&time, start, stop);
// clean up the two events
cudaEventDestroy(start);
cudaEventDestroy(stop);


Stream Synchronization

Since all non-default stream operations are non-blocking for the host, corresponding synchronization operations are required.

From the perspective of the host, cuda operations can be divided into two categories:

  • Memory related operations
  • Kernel launch

Kernel launches are asynchronous with respect to the host, while many memory operations, such as cudaMemcpy, are synchronous. However, the CUDA runtime also provides asynchronous versions of these memory operations.

We already know that streams come in two kinds: synchronous (the NULL stream) and asynchronous (non-NULL streams), where synchronous and asynchronous are with respect to the host. Asynchronous streams do not block host execution; most operations in the synchronous stream block the host, with kernel launches being the exception.

Asynchronous streams are further divided into blocking and non-blocking streams, where blocking/non-blocking describes how an asynchronous stream interacts with the synchronous (NULL) stream. If an asynchronous stream is a blocking stream, operations in the NULL stream can block operations in that stream; if it is a non-blocking stream, its operations are not blocked by operations in the NULL stream (a bit convoluted...).

Blocking and non-blocking streams

cudaStreamCreate creates a blocking stream, meaning that operations issued to it can be blocked by earlier operations in the NULL stream. Generally speaking, when an operation is issued to the NULL stream, the CUDA context first waits for all operations previously issued to blocking streams to complete before executing it; likewise, any operation issued to a blocking stream waits for all previously issued NULL-stream operations to complete before it starts.

For example:

kernel_1<<<1, 1, 0, stream_1>>>();
kernel_2<<<1, 1>>>();
kernel_3<<<1, 1, 0, stream_2>>>();

From the device's point of view, these three kernels execute serially; from the host's point of view, of course, the launches are asynchronous and non-blocking. Besides the blocking streams created by cudaStreamCreate, we can also create non-blocking streams with the following API:

cudaError_t cudaStreamCreateWithFlags(cudaStream_t* pStream, unsigned int flags);
// There are two possible flags; the first (default) creates a blocking stream,
// the second a non-blocking stream:
cudaStreamDefault: default stream creation flag (blocking)
cudaStreamNonBlocking: asynchronous stream creation flag (non-blocking)

If the streams used for kernel_1 and kernel_3 above are created with the second flag, they are no longer blocked by the NULL-stream launch of kernel_2.
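A minimal sketch of the same example with non-blocking streams (the kernels are empty placeholders):

#include <cuda_runtime.h>

__global__ void kernel_1() {}
__global__ void kernel_2() {}
__global__ void kernel_3() {}

int main() {
    cudaStream_t stream_1, stream_2;
    // Non-blocking streams: not synchronized with the NULL stream.
    cudaStreamCreateWithFlags(&stream_1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&stream_2, cudaStreamNonBlocking);

    kernel_1<<<1, 1, 0, stream_1>>>();
    kernel_2<<<1, 1>>>();                 // NULL stream
    kernel_3<<<1, 1, 0, stream_2>>>();    // no longer waits for kernel_2

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream_1);
    cudaStreamDestroy(stream_2);
    return 0;
}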

Implicit Synchronization

Cuda has two types of synchronization between host and device: explicit and implicit. We have learned that the explicit synchronization APIs are:

  • cudaDeviceSynchronize
  • cudaStreamSynchronize
  • cudaEventSynchronize

These three functions are called explicitly by the host to wait for work on the device to finish.

We have also seen implicit synchronization before: cudaMemcpy, for example, implicitly synchronizes the device and the host. It is called implicit because the synchronization is only a side effect of the data transfer. Understanding these implicit synchronizations matters, because inadvertently calling such a function can cause a drastic drop in performance.

Implicit synchronization deserves special attention in CUDA programming because it can cause unexpected blocking, usually on the device side. Many memory-related operations imply a synchronization of all operations on the current device, such as the following (an anti-pattern sketch follows the list):

  • A page-locked host memory allocation
  • A device memory allocation
  • A device memset
  • A memory copy between two addresses on the same device
  • A modification to the L1/shared memory configuration
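A sketch of an anti-pattern to watch out for (names are illustrative): allocating device memory between asynchronous operations introduces an implicit synchronization and destroys the overlap:

#include <cuda_runtime.h>

// Anti-pattern sketch: the cudaMalloc in the middle implicitly synchronizes
// the device, so the two async copies cannot overlap as intended.
void badOverlap(float *d_a, float *d_b, const float *h_a, const float *h_b,
                size_t bytes, cudaStream_t stream1, cudaStream_t stream2) {
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);

    float *d_tmp;
    cudaMalloc(&d_tmp, bytes);   // implicit synchronization point
    cudaFree(d_tmp);             // (freeing device memory also synchronizes)

    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);
    // Better: allocate everything up front, before issuing asynchronous work.
}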

Explicit Synchronization

From the grid level, the explicit synchronization methods are as follows:

  • Synchronizing the device
  • Synchronizing a stream
  • Synchronizing an event in a stream
  • Synchronizing across streams using an event

We can use the previously mentioned cudaDeviceSynchronize to synchronize all operations on the device. This function makes the host wait until all computation and data transfer operations on the device have completed. It is obviously a heavyweight function, and its use should be minimized.

With cudaStreamSynchronize the host waits for all operations in a specific stream to complete; the non-blocking cudaStreamQuery can instead be used to test whether they have completed.

CUDA events allow more fine-grained blocking and synchronization. The related functions are cudaEventSynchronize and cudaEventQuery, used much like their stream counterparts. In addition, cudaStreamWaitEvent provides a flexible way to introduce dependencies between streams:

cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event);

This function makes the given stream wait for a specific event. The event may be recorded in the same stream or in a different one. The cross-stream case is shown in the following figure:

[Figure: stream2 waits on an event recorded in stream1 before continuing]

stream2 waits for the event in stream1 to complete before continuing execution.
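A minimal sketch of such a cross-stream dependency (the producer/consumer kernels are placeholders; the third argument of cudaStreamWaitEvent is a reserved flags field that must be 0):

#include <cuda_runtime.h>

__global__ void producer(float *data) { /* fill data */ }
__global__ void consumer(const float *data) { /* read data */ }

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaStream_t stream1, stream2;
    cudaEvent_t done;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaEventCreate(&done);

    producer<<<4, 256, 0, stream1>>>(d_data);
    cudaEventRecord(done, stream1);            // mark this point in stream1

    cudaStreamWaitEvent(stream2, done, 0);     // stream2 waits for the event
    consumer<<<4, 256, 0, stream2>>>(d_data);  // runs only after producer ends

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d_data);
    return 0;
}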

Configurable Events

An event can be configured with the following function and flags:

cudaError_t cudaEventCreateWithFlags(cudaEvent_t* event, unsigned int flags);
cudaEventDefault
cudaEventBlockingSync
cudaEventDisableTiming
cudaEventInterprocess

cudaEventBlockingSync indicates that synchronizing on the event blocks the calling host thread. The default behavior of cudaEventSynchronize is to spin, constantly polling the event's status on the CPU. With cudaEventBlockingSync, the calling thread instead goes to sleep and yields control to other threads or processes until the event completes. This wastes fewer CPU cycles, but it also increases the latency between the event completing and the waiting thread waking up.

cudaEventDisableTiming specifies that the event is used only for synchronization and does not record timing data. Dropping the overhead of recording timestamps improves the performance of cudaStreamWaitEvent and cudaEventQuery.

cudaEventInterprocess specifies that the event can be used as an inter-process event.
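A minimal sketch creating a synchronization-only event:

#include <cuda_runtime.h>

int main() {
    cudaEvent_t syncOnly;
    // No timestamp is recorded, which makes recording and waiting cheaper;
    // such an event cannot be used with cudaEventElapsedTime.
    cudaEventCreateWithFlags(&syncOnly, cudaEventDisableTiming);

    // ... cudaEventRecord(syncOnly, someStream); cudaStreamWaitEvent(...); ...

    cudaEventDestroy(syncOnly);
    return 0;
}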

 

NVIDIA CUDA Zone: https://developer.nvidia.com/cuda-zone

CUDA online documentation: http://docs.nvidia.com/cuda/index.html#

When reprinting, please credit the original: http://www.cnblogs.com/1024incn/p/5891051.html

Source: https://blog.csdn.net/csdn1126274345/article/details/102097226