Cooperative Groups: A Translation and Study of the CUDA 10.0 Official Documentation

Table of Contents

Background

Introduction

Intra-block Groups

Thread Groups and Thread Blocks

Tiled Partitions

Thread Block Tiles

Warp Shuffle Functions

Warp Vote Functions

Warp Match Functions

Coalesced Groups

Using Cooperative Groups Within a Block

Discovery Pattern

Warp-Synchronous Code Pattern

Composition

Grid Synchronization

Multi-Device Synchronization

Conclusion

Background

Today we translate the last part of the CUDA 10.0 official documentation that deserves our attention: Cooperative Groups.

Introduction

Cooperative Groups are an extension of the CUDA programming model, introduced in CUDA 9, for organizing groups of communicating threads. Cooperative Groups allow developers to express the granularity at which threads communicate, enabling richer and more efficient parallel decompositions.

Prior to this (see the article on the CUDA programming model), the CUDA programming model provided a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block, implemented by the __syncthreads() function. However, programmers would like to define and synchronize groups of threads at other granularities, to gain better performance, design flexibility, and software reuse in the form of group-wide collective function interfaces. In an effort to express broader patterns of parallel interaction, many performance-oriented programmers have written their own ad hoc, custom, and unsafe primitives for synchronizing threads within a warp or across thread blocks on a single GPU. While the resulting performance improvements were often remarkable, this led to a proliferation of fragmented code that becomes difficult to write, tune, and maintain over time and across GPU generations. Cooperative Groups solve this problem by providing a safe and future-proof mechanism for writing high-performance code.

The Cooperative Groups programming model extension describes synchronization patterns both within and across CUDA thread blocks. It gives applications a way to define their own groups of threads and to synchronize them. It also provides new launch APIs that enforce certain restrictions and can therefore guarantee that the synchronization will work. These primitives enable new patterns of cooperative parallelism in CUDA, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid.

Expressing groups as first-class program objects improves software composition, because collective functions can receive an explicit object representing the group of participating threads. This object also makes the programmer's intent explicit, which eliminates fragmented code, rules out unsound compiler optimizations and architectural assumptions, and adapts better to new GPU generations.

The cooperative group programming model consists of the following elements:

  • Data types for representing groups of cooperating threads;
  • Operations to obtain the implicit groups defined by the CUDA launch API (e.g., the thread block);
  • Operations for partitioning existing groups into new groups;
  • A barrier operation to synchronize a given group;
  • Operations to inspect group properties, as well as group-specific collective operations.

Intra-block Groups

In this section, we describe the functions available for creating thread groups that can synchronize and cooperate within a block. Note that synchronizing cooperative groups across thread blocks or devices requires some additional considerations, which are described later.

Cooperative Groups require CUDA version >= 9.0. To use the functionality, add the header file #include <cooperative_groups.h> and use the cooperative groups namespace: using namespace cooperative_groups;. Any code containing cooperative group functions can then be compiled with nvcc in the usual way.

Thread Groups and Thread Blocks

Every CUDA programmer is already familiar with one group of threads: the thread block (see the article on the CUDA programming model). The Cooperative Groups extension introduces a new data type, thread_block, to represent this concept explicitly in the kernel. The group can be initialized like this: thread_block g = this_thread_block();. The thread_block data type is derived from the more general thread_group data type, which can represent a wider range of groups and provides the following functions:

void sync(); // synchronize the threads in the group
unsigned size(); // number of threads in the group
unsigned thread_rank(); // rank of the calling thread within the group, in [0, size()-1]
bool is_valid(); // whether the group has violated any API constraints

thread_block additionally provides the following block-specific functions:

dim3 group_index(); // 3-dimensional index of the block within the grid
dim3 thread_index(); // 3-dimensional index of the thread within the block

For example, if the group g has been initialized as above, then g.sync(); synchronizes all threads in the block and is equivalent to __syncthreads();. Note that all threads in the group must execute the collective operation uniformly; otherwise the behavior is undefined.
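
As a minimal sketch of how these functions fit together (the kernel name, the 256-entry buffer, and the assumption of a 1D launch are illustrative, not taken from the original):

__global__ void block_group_demo(int *out) {
    thread_block block = this_thread_block();

    __shared__ int buffer[256];                        // assumes the block has at most 256 threads
    buffer[block.thread_rank()] = (int)block.thread_rank();

    block.sync();                                      // equivalent to __syncthreads()

    if (block.thread_rank() == 0) {
        // group_index() carries the same information as blockIdx
        out[block.group_index().x] = buffer[block.size() - 1];
    }
}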

Tiled Partitions

The tiled_partition() function can be used to decompose a thread block into multiple smaller cooperative thread groups. For example, if we first create a group containing all the threads in the block:

thread_block wholeBlock = this_thread_block();

Then, we can divide it into smaller groups, such as 32 threads per group:

thread_group tile32 = tiled_partition(wholeBlock, 32);

Furthermore, we can divide each group of 32 threads into smaller groups, such as 4 threads per group:

thread_group tile4 = tiled_partition(tile32, 4);

Then, if we add the following code:

if (tile4.thread_rank() == 0) printf("Hello from tile4 rank 0\n");

Then one line will be printed for every four threads: rank 0 of each tile4 group prints, which corresponds to threads 0, 4, 8, 12, ... of wholeBlock. Note that tile sizes can currently only be a power of 2 and no larger than 32.

Thread Block Tiles

A templated version of the tiled_partition function is also available, where a template parameter specifies the tile size. Because the size is known at compile time, there is more opportunity for optimized execution. As in the previous section, the following code creates two sets of tiles, of sizes 32 and 4:

thread_block_tile<32> tile32 = tiled_partition<32>(this_thread_block());
thread_block_tile<4> tile4 = tiled_partition<4>(this_thread_block());

Note that the thread_block_tile templated data structure is used here, and the group size is passed to the tiled_partition() function as a template parameter rather than a function argument.

Thread block tiles also provide the following additional functions:

.shfl()
.shfl_down()
.shfl_up()
.shfl_xor()
.any()
.all()
.ballot()
.match_any()
.match_all()

These cooperative, synchronizing operations are analogous to the warp shuffle, warp vote, and warp match functions, which are briefly introduced below.
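
As a sketch of how these tile-level operations can be used (assuming the header and namespace shown earlier; the function and kernel names are illustrative, and the kernel assumes a 1D launch where every thread has a valid input element), a 32-thread tile can compute a sum with shfl_down():

__device__ int tile_sum(thread_block_tile<32> tile, int value) {
    // halve the number of lanes holding partial sums at every step
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        value += tile.shfl_down(value, offset);
    }
    return value;                       // after the loop, rank 0 holds the tile's total
}

__global__ void tile_reduce_demo(const int *in, int *out) {
    thread_block_tile<32> tile = tiled_partition<32>(this_thread_block());
    int v = in[blockIdx.x * blockDim.x + threadIdx.x];
    int total = tile_sum(tile, v);
    if (tile.thread_rank() == 0) {
        atomicAdd(out, total);          // one atomic per 32-thread tile
    }
}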

Warp Shuffle Functions

The warp shuffle functions exchange and broadcast data between the threads of a warp without using shared memory. The function prototypes are as follows:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

Here T is the type of the data to be exchanged, which can be int, unsigned int, long, unsigned long, long long, unsigned long long, float, or double. If the header cuda_fp16.h is included, T can also be __half or __half2. mask marks the threads participating in the exchange. srcLane identifies the source lane of the broadcast; if srcLane is greater than or equal to width, the actual source lane is srcLane % width. width is the size of the subgroups within which the exchange is performed; it must be a power of 2 and no larger than 32, and the value is broadcast within each subgroup of that size. The functions return the value of var held by the source lane.

For __shfl_sync(), the source lane is srcLane; for __shfl_up_sync(), the source lane is the caller's lane ID minus delta; for __shfl_down_sync(), the caller's lane ID plus delta; and for __shfl_xor_sync(), the caller's lane ID XOR laneMask.
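
For illustration, a common use of these intrinsics is a butterfly reduction across a full warp. The sketch below assumes all 32 lanes are active (hence the full mask), which is an assumption on my part and not part of the original text:

__device__ float warp_reduce_sum(float v) {
    unsigned mask = 0xffffffff;                       // all 32 lanes participate
    for (int laneMask = warpSize / 2; laneMask > 0; laneMask /= 2) {
        v += __shfl_xor_sync(mask, v, laneMask);      // exchange with the lane whose id differs by laneMask
    }
    return v;                                         // every lane ends up holding the warp-wide sum
}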

Warp Vote Functions

The warp vote functions allow the threads of a warp to perform a reduce-and-broadcast operation. The prototypes are as follows:

int __all_sync(unsigned mask, int predicate);
int __any_sync(unsigned mask, int predicate);
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();

Here predicate is the value being voted on and mask selects the threads participating in the vote. Each function reads an integer predicate from each participating thread of the warp, evaluates whether it is non-zero, and broadcasts the result to every participating thread. The behavior of each function is summarized below:

  • __all_sync(): Evaluates predicate for all non-exited threads specified by mask, and returns non-zero only if predicate is non-zero for all of them.
  • __any_sync(): Evaluates predicate for all non-exited threads specified by mask, and returns non-zero if predicate is non-zero for any of them.
  • __ballot_sync(): Evaluates predicate for all non-exited threads specified by mask, and returns an integer whose Nth bit is 1 if and only if lane N of the warp is active and its predicate is non-zero.
  • __activemask(): Returns a 4-byte mask of all currently active threads in the calling warp. The Nth bit is 1 if lane N is active when the function is called; bits corresponding to exited or inactive threads are 0. Note that threads that are converged when this function is called are not guaranteed to remain converged at later instructions unless those instructions are warp-synchronous built-ins.
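
A small sketch of how these vote functions are typically combined; the function name vote_demo, the predicate x > 0, and the flags output array are illustrative assumptions:

__device__ void vote_demo(float x, int *flags) {
    unsigned mask = __activemask();                   // lanes currently executing this branch
    int pred = (x > 0.0f);

    int all_pos = __all_sync(mask, pred);             // non-zero iff every participating lane has x > 0
    int any_pos = __any_sync(mask, pred);             // non-zero iff at least one lane has x > 0
    int num_pos = __popc(__ballot_sync(mask, pred));  // how many participating lanes have x > 0

    int lane = threadIdx.x % warpSize;
    if (lane == __ffs(mask) - 1) {                    // lowest-numbered participating lane records the results
        flags[0] = all_pos;
        flags[1] = any_pos;
        flags[2] = num_pos;
    }
}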

Warp Match Functions

The warp match functions perform a synchronized broadcast-and-compare operation between the threads of a warp. They are supported on devices of compute capability >= 7.0. The function prototypes are as follows:

unsigned int __match_any_sync(unsigned mask, T value);
unsigned int __match_all_sync(unsigned mask, T value, int *pred);

T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float, or double. value is the value to be broadcast and compared, and mask selects the participating threads. The two functions return different results, as summarized below:

  • __match_any_sync(): Returns the mask of those threads in mask whose value is equal to the calling thread's value.
  • __match_all_sync(): Returns mask if all threads in mask have the same value, and sets *pred to true; otherwise it returns 0 and sets *pred to false.
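
A brief sketch of a typical use of __match_any_sync() (requires compute capability 7.0 or higher, as noted above): grouping the lanes of a warp by a key so that only one lane per distinct key issues an atomic update. The names match_demo and counters, and the assumption that key is a valid index into counters, are illustrative:

__device__ void match_demo(int key, int *counters) {
    unsigned mask = __activemask();
    unsigned peers = __match_any_sync(mask, key);     // participating lanes holding the same key as this lane
    int leader = __ffs(peers) - 1;                    // lowest-numbered lane of that subgroup
    int lane = threadIdx.x % warpSize;

    if (lane == leader) {
        atomicAdd(&counters[key], __popc(peers));     // one update per distinct key in the warp
    }
}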

Returning to the thread block tiles of cooperative groups: the tile member functions listed above provide these operations in the context of user-defined thread groups, with better flexibility and productivity.

Coalesced Groups

In CUDA's SIMT architecture (see the article on the CUDA hardware implementation), the multiprocessor executes threads in groups of 32, called warps. If a data-dependent conditional branch in the application code causes the threads of a warp to diverge, the warp serially executes each branch path while disabling the threads that are not on that path. The threads that remain active on the current path are said to be coalesced. Cooperative Groups can discover, and create a group containing, all coalesced threads: coalesced_group active = coalesced_threads();. For example, consider a scenario where a branch in the code leaves only the 2nd, 4th, and 8th thread of each warp active. The statement above, executed in that branch, will create for each warp a group named active containing those three active threads (with ranks 0, 1, and 2 within the group).
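
A minimal sketch of this discovery in code; the kernel name, the flags-based branch, and the per-warp output array are illustrative assumptions:

__global__ void coalesced_demo(const int *flags, int *out) {
    if (flags[threadIdx.x]) {                          // data-dependent branch: only some lanes enter
        coalesced_group active = coalesced_threads();  // the lanes that took this branch
        // ranks inside 'active' are packed 0, 1, 2, ... regardless of the original lane ids
        if (active.thread_rank() == 0) {
            int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
            out[warp_id] = (int)active.size();         // each warp's leader records how many lanes were active
        }
        active.sync();
    }
}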

Using Cooperative Groups Within a Block

In this section, the functionality of cooperative groups is illustrated through some examples.

Discovery Pattern

Commonly, developers need to work with the currently active set of threads. No assumption is made about which threads are present; developers simply work with the threads that happen to be there. This can be seen in the following example of "aggregating an atomic increment across the active threads of a warp" (written using the correct CUDA 9.0 intrinsics):

{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask); // number of active threads
    unsigned int prefix = __popc(writemask & __lanemask_lt()); // number of active lanes with a lower lane id.
    // For example, with active mask 01010, the 2nd lane sees __lanemask_lt() == 00001, so its prefix is 0
    // (and the 4th lane's prefix is 1). A prefix of 0 therefore identifies the first active lane.

    int elected_lane = __ffs(writemask) - 1; // lowest-numbered active lane
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }

    base_offset = __shfl_sync(writemask, base_offset, elected_lane); // broadcast the pre-increment value from elected_lane to all active lanes
    int thread_offset = prefix + base_offset;

    return thread_offset;
}

Rewritten with the cooperative groups API, this becomes:

{
    cg::coalesced_group g = cg::coalesced_threads(); // the group of currently active threads

    int prev;
    if (g.thread_rank() == 0) { // the first active thread
        prev = atomicAdd(p, g.size()); // one atomic add on behalf of the whole group
    }

    prev = g.thread_rank() + g.shfl(prev, 0); // rank within the group + the old value broadcast from rank 0
    return prev;
}
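
As a usage sketch, assuming the body above is wrapped in a __device__ helper named atomicAggInc (the helper name, the kernel name, and the stream-compaction use case are illustrative, not from the original):

__global__ void filter_demo(const int *in, int *out, int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {                // divergent branch: only some lanes reach this point
        // atomicAggInc issues one atomicAdd per group of coalesced lanes,
        // then hands each lane its own output slot
        out[atomicAggInc(counter)] = in[i];
    }
}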

Warp-Synchronous Code Pattern

Developers may have warp-synchronous code that makes implicit assumptions about the warp size and is written around that number. With cooperative groups, the size must now be specified explicitly:

auto g = tiled_partition<16>(this_thread_block());

Alternatively, developers may want to partition the algorithm differently, without using the warp-synchronous built-ins and their tile-size template parameter:

auto g = tiled_partition(this_thread_block(), 8);

In this case, the group g can still be synchronized and parallel algorithms can still be built on top of it, but functions such as shfl() are not available:

__global__ void cooperative_kernel(...) {
    // obtain the default block thread group
    thread_group my_block = this_thread_block();

    // partition into tiles of 32 threads; the tiles evenly divide the group,
    // and the threads within each tile are contiguous
    thread_group my_tile = tiled_partition(my_block, 32);

    // operate only on the first 32 threads of the block
    if (my_block.thread_rank() < 32) {
        // ...
        my_tile.sync();
    }
}

Composition

In the past, code like the following carried implicit constraints on how it could be used:

__device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}

__global__ void parallel_kernel(float *x){
    // ...
    // every thread must call sum()
    sum(x, n);
}

All threads of the thread block must reach the __syncthreads() barrier, but this requirement is hidden from the developer calling sum(). With cooperative groups, a better way to write this is:

__device__ int sum(const thread_group& g, int *x, int n)
{
    // ...
    g.sync();
    return total;
}

__global__ void parallel_kernel(...)
{
    // ...
    sum(this_thread_block(), x, n);
    // ...
}
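
A minimal sketch of what such a group-aware sum might look like, using a shared-memory tree reduction. The 256-entry buffer, the power-of-two group size, the assumption that the whole thread block calls it, and the reduction scheme are illustrative assumptions on my part, not the original implementation:

__device__ int sum(const thread_group &g, int *x, int n)
{
    __shared__ int partial[256];                 // assumes g.size() is a power of two and at most 256
    unsigned rank = g.thread_rank();

    int acc = 0;
    for (int i = rank; i < n; i += g.size())     // strided loop: each thread accumulates part of x
        acc += x[i];
    partial[rank] = acc;
    g.sync();

    for (unsigned stride = g.size() / 2; stride > 0; stride /= 2) {   // tree reduction within the group
        if (rank < stride)
            partial[rank] += partial[rank + stride];
        g.sync();
    }
    return partial[0];                           // every thread of the group returns the total
}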

Grid Synchronization

Before cooperative group synchronization was introduced, the CUDA programming model only allowed synchronization between thread blocks at kernel-completion boundaries, and a kernel boundary carries with it an implicit invalidation of state and potential performance implications. For example, in a particular use case an application may have a large number of small kernels, each representing one stage of a pipeline; the current programming model needs these kernel boundaries to ensure that the thread blocks producing data have finished before the thread blocks of the next pipeline stage consume it. In such cases, the ability to synchronize across all thread blocks of the grid would allow the application to be restructured into a single kernel that synchronizes the device whenever a given stage is complete.

To synchronize the grid within a kernel, use the grid group: grid_group grid = this_grid();, and then call grid.sync();. To support grid synchronization, the kernel must be launched with the CUDA runtime launch API cudaLaunchCooperativeKernel() instead of the <<<...>>> execution configuration syntax:

cudaLaunchCooperativeKernel(const T *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem = 0, cudaStream_t stream = 0)
// or the corresponding CUDA driver API function; kernels launched this way cannot use the dynamic parallelism feature of Appendix A
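
A kernel-side sketch of grid-wide synchronization between two phases (the kernel and buffer names are illustrative; the kernel must be launched with cudaLaunchCooperativeKernel as shown in the examples that follow):

__global__ void two_phase_kernel(float *data, float *tmp, unsigned int n) {
    grid_group grid = this_grid();

    // Phase 1: every thread of the whole grid produces into tmp
    for (unsigned long long i = grid.thread_rank(); i < n; i += grid.size())
        tmp[i] = 2.0f * data[i];

    grid.sync();   // all thread blocks of the grid reach this point before phase 2 begins

    // Phase 2: safely consume values that other blocks produced in phase 1
    for (unsigned long long i = grid.thread_rank(); i < n; i += grid.size())
        data[i] = tmp[i] + tmp[n - 1 - i];
}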

To guarantee that all thread blocks of the grid are resident on the GPU at the same time, the number of blocks launched needs to be considered carefully. For example, we can launch as follows:

cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// initialize, then launch
cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);

Alternatively, we can use the occupancy calculator to compute how many thread blocks can reside on a multiprocessor simultaneously:

cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, my_kernel, numThreads, 0);
// initialize (deviceProp as above), then launch one block per available slot across all multiprocessors
cudaLaunchCooperativeKernel((void*)my_kernel, numBlocksPerSm * deviceProp.multiProcessorCount, numThreads, args);

Note also that to use grid synchronization, the device code must be compiled with separate (relocatable) compilation and then linked at device link time (for details, see the Using Separate Compilation chapter of the CUDA Compiler Driver NVCC document). The simplest example is shown below:

nvcc -arch=sm_61 -rdc=true mytestfile.cu -o mytest

We also need to make sure the device supports the cooperative launch attribute, which can be queried with the CUDA driver API cuDeviceGetAttribute():

int pi = 0;
CUdevice dev;
cuDeviceGet(&dev, 0); // query device 0
cuDeviceGetAttribute(&pi, CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH, dev);

If pi is 1, the attribute is supported on device 0. Only devices of compute capability >= 6.0 support the cooperative launch attribute. Additionally, programs that use cooperative launch should be run on Linux without MPS, or on Windows with the device in TCC mode.

Multi-Device Synchronization

To support synchronization across multiple devices with cooperative groups, we need to use the CUDA API cuLaunchCooperativeKernelMultiDevice(), an important extension of the existing CUDA APIs that allows a single host thread to launch a kernel across multiple devices. In addition to the constraints and guarantees of cuLaunchCooperativeKernel(), cuLaunchCooperativeKernelMultiDevice() has the following semantics:

  • The API guarantees that the launch is atomic: if the call succeeds, the specified number of thread blocks is launched on all of the specified devices;
  • The kernels launched through this API must be identical. The driver performs no explicit check for this, since such a check is largely infeasible in the driver; it is the application's responsibility to ensure it;
  • No two entries in launchParamsList may map to the same device;
  • All target devices of such a launch must have identical compute capability: both the major and the minor version must match;
  • The block size, grid size, and amount of shared memory per grid must be the same across all devices. Note that this means the maximum number of blocks that can be launched per device is limited by the device with the fewest multiprocessors;
  • Any user-defined __device__, __constant__, or __managed__ device global variables present in the module that owns the CUfunction being launched are instantiated independently on every device; the user is responsible for initializing such variables correctly.

The launch parameters should be defined using the following structure:

typedef struct CUDA_LAUNCH_PARAMS_st {
    CUfunction function;
    unsigned int gridDimX;
    unsigned int gridDimY;
    unsigned int gridDimZ;
    unsigned int blockDimX;
    unsigned int blockDimY;
    unsigned int blockDimZ;
    unsigned int sharedMemBytes;
    CUstream hStream;
    void **kernelParams;
} CUDA_LAUNCH_PARAMS;

Then pass it to the launch API:

cuLaunchCooperativeKernelMultiDevice(CUDA_LAUNCH_PARAMS *launchParamsList, unsigned int numDevices, unsigned int flags = 0);
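
A hedged sketch of how launchParamsList might be filled for two devices (error checking omitted; kernel_fn, kernelArgs, the per-device streams, and the grid/block dimensions are illustrative assumptions that must satisfy the constraints listed above):

CUDA_LAUNCH_PARAMS params[2];
for (int dev = 0; dev < 2; ++dev) {
    params[dev].function       = kernel_fn[dev];  // CUfunction for the same kernel, from the module loaded on device dev
    params[dev].gridDimX       = numBlocks;       // identical grid/block dimensions on all devices
    params[dev].gridDimY       = 1;
    params[dev].gridDimZ       = 1;
    params[dev].blockDimX      = numThreads;
    params[dev].blockDimY      = 1;
    params[dev].blockDimZ      = 1;
    params[dev].sharedMemBytes = 0;
    params[dev].hStream        = stream[dev];     // each entry uses a stream created on a different device
    params[dev].kernelParams   = kernelArgs;
}
cuLaunchCooperativeKernelMultiDevice(params, 2, 0);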

This launch is analogous to the grid-synchronization launch described above, and the synchronization call is similarly analogous:

multi_grid_group multi_grid = this_multi_grid();
multi_grid.sync();

Separate compilation is also required, as in the previous section.

We should make sure the devices support the multi-device cooperative launch attribute, in the same way as described in the previous section, except that the attribute to query is CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH. Only devices of compute capability >= 6.0 support this attribute, and programs that use multi-device cooperative launch should be run on Linux without MPS, or on Windows with the devices in TCC mode.

Conclusion

This concludes my translation of the CUDA 10.0 official documentation. I translated the whole document myself, and my English is limited; if anything is inaccurate, please leave suggestions in the comment section.

Origin blog.csdn.net/qq_37475168/article/details/112388296