CUDA parallel reduction

References: https://developer.download.nvidia.cn/assets/cuda/files/reduction.pdf

The referenced slides walk through six optimization stages for a scalar reduction; the key ideas are as follows (a combined code sketch appears after the list):

1. Avoid warp divergence.

2. Sequential addressing is bank-conflict free (my understanding is that contiguous addressing is faster because it avoids shared-memory bank conflicts).

3. Unroll the last warp (i.e. warpReduce; note the role of the volatile keyword here: the shared-memory argument of warpReduce must be declared volatile).

4. Use templates to fully unroll the loop and avoid conditional statements (if...else).

5. Multiple adds per thread (that is, before reducing shared_data, have each thread accumulate as many elements as possible: for example, if shared_data holds 256 values and the input array has 2560 elements, each thread first sums 10 elements, the resulting 256 partial sums are written into shared_data, and only then does the reduction run).
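Putting the five ideas together, here is a minimal sketch that roughly follows the final kernel in the referenced slides (the names warpReduce and reduce6 come from the slides; the extra bounds check for inputs that are not a multiple of the tile size is my own addition):

```cuda
// Unrolled warp reduction (idea 3): volatile stops the compiler from caching
// sdata[] in registers, so each lane sees the other lanes' latest writes.
template <unsigned int blockSize>
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >=  8) sdata[tid] += sdata[tid + 4];
    if (blockSize >=  4) sdata[tid] += sdata[tid + 2];
    if (blockSize >=  2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(const float *g_idata, float *g_odata, unsigned int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // Idea 5: each thread sums several input elements while loading.
    float sum = 0.0f;
    while (i < n) {
        sum += g_idata[i];
        if (i + blockSize < n) sum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Ideas 1, 2, 4: sequential addressing (no bank conflicts, no divergence
    // inside active warps), unrolled at compile time via the blockSize template.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Idea 3: finish the last 32 partial sums without __syncthreads().
    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Example launch (one partial sum per block; reduce the partial sums afterwards):
// reduce6<256><<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```

Note that the volatile trick relies on implicit warp-synchronous execution; on Volta and newer GPUs it is safer to add __syncwarp() between the steps or to use warp shuffles, as in the warp-primitive sketch further below.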

Besides scalar reduction there are also column reduction and row reduction; for those, refer to the TensorFlow source: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/reduction_gpu_kernels.cu.h

When the number of columns or rows is less than 16, warp-level primitive programming techniques are used as well; see: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

The reason is simple: when the count is below 16, a single warp covers at least two rows or columns, each of which only needs a handful of numbers summed, and within a warp these sums can be computed exactly with warp-level primitives.
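As a minimal sketch of that idea (not the TensorFlow implementation): assume rows of width exactly 16, num_rows divisible by 16, and a launch whose total thread count equals num_rows * 16, so every lane of every warp participates in the shuffle; each warp then reduces two rows with __shfl_down_sync.

```cuda
#include <cuda_runtime.h>

// Sum the 16 values held by one half-warp (lanes 0-15 or 16-31).
// width = 16 makes __shfl_down_sync operate on independent 16-lane segments.
__device__ float halfWarpRowSum(float val) {
    for (int offset = 8; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset, 16);
    return val;  // lanes 0 and 16 end up holding the two row sums
}

// One warp handles two consecutive rows of width 16.
__global__ void rowSum16(const float *in, float *out, int num_rows) {
    int lane = threadIdx.x & 31;                              // lane within the warp
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // global warp index
    int row  = warp * 2 + (lane >> 4);                        // which of the two rows

    float v = in[row * 16 + (lane & 15)];
    float sum = halfWarpRowSum(v);
    if ((lane & 15) == 0) out[row] = sum;  // lanes 0 and 16 write the results
}

// Example launch (256 threads = 8 warps = 16 rows per block):
// rowSum16<<<num_rows * 16 / 256, 256>>>(d_in, d_out, num_rows);
```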

Internally, TensorFlow calls APIs provided by the CUB library (CUB is a library that wraps a number of CUDA primitives); in fact, you can implement the same reductions yourself.
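For reference, here is a minimal sketch of calling CUB's device-wide sum directly (standard cub::DeviceReduce usage; error checking omitted, and the buffer management is only for illustration):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sum num_items floats that already live in device memory.
float sumWithCub(const float *d_in, int num_items) {
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    // First call with a null workspace only reports the required temp size.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_out);
    return result;
}
```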

Origin: www.cnblogs.com/deepllz/p/11350764.html