[2023 CANN Training Camp Season 2] Ascend C Operator Development Advanced: Ascend C Tiling Calculation

Understanding the basic concept of Tiling

In this section I encountered a new concept, Tiling calculation. In Ascend C operator development, a vector operator is organized as three basic tasks: CopyIn, Compute, and CopyOut. The CopyIn task moves the input tensors xGm and yGm from Global Memory to Local Memory. Local Memory, however, cannot hold the operator's full inputs and outputs at once, so only a portion of the data is moved in for each round of computation and then moved back out, after which the next portion is moved in; this repeats until the complete result has been produced. Splitting the full data set into such blocks is what is called Tiling calculation.
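To make the three-task pipeline concrete, here is a minimal sketch of what Process() and CopyIn() look like in the Add sample. The member names xGm, yGm, inQueueX, inQueueY, tileNum and tileLength are assumptions taken from that sample, not part of this article's listings:

__aicore__ inline void Process()
{
    // With double buffering the data is split into tileNum * BUFFER_NUM pieces,
    // and each piece goes through CopyIn -> Compute -> CopyOut.
    int32_t loopCount = this->tileNum * BUFFER_NUM;
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);   // move one tile from Global Memory to Local Memory
        Compute(i);  // compute on the tile that now resides in Local Memory
        CopyOut(i);  // move the result tile back to Global Memory
    }
}

__aicore__ inline void CopyIn(int32_t progress)
{
    LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    // Only tileLength elements are moved per iteration, because Local Memory
    // cannot hold the whole input at once.
    DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
    DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}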

Two ways to implement Tiling

Tiling is implemented in two scenarios: the fixed shape scenario and the dynamic shape scenario.

Fixed shape scenario: the input size is fixed, so implementation is straightforward; only the handling of that one shape has to be considered, and optimization is also relatively easy.
Dynamic shape scenario: the shape is passed into the kernel function through its input parameters, so the same kernel can handle changing shapes. Implementation is harder because different logic branches must be handled, and optimization is correspondingly more difficult.

Comparison of the add_custom kernel function in the two scenarios

Fixed shape implementation (host-side constant definitions):

#include "add_custom_unalign_tiling.h"
#include "register/op_def_registry.h"

namespace optiling {
constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
// smallest unit to which the shape must be aligned
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;

This code defines the constants used on the host side: the number of blocks (BLOCK_DIM), the size in bytes of a half-precision element, the 32-byte block size, and ALIGN_NUM, the smallest unit (in elements) to which the shape must be aligned.
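For illustration, here is a minimal sketch of how such an ALIGN_NUM constant is typically used, with a hypothetical AlignUp helper that is not part of the sample: the element count is rounded up to the nearest multiple of ALIGN_NUM so that every 32-byte block moved by DataCopy is fully covered.

// Hypothetical helper: round len up to the nearest multiple of align.
constexpr uint32_t AlignUp(uint32_t len, uint32_t align)
{
    return (len + align - 1) / align * align;
}
// With SIZE_OF_HALF == 2 and BLOCK_SIZE == 32, ALIGN_NUM == 16 elements,
// so e.g. a total length of 100 half elements is aligned up to 112.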

Dynamic shape kernel function implementation:

#include "kernel_operator.h"
using namespace AscendC;
constexpr int32_t BUFFER_NUM = 2;
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    // Parse the tiling data passed in from the host into the tilingData struct.
    GET_TILING_DATA(tilingData, tiling);
    KernelAdd op;
    // totalLength and tileNum come from the tiling data instead of compile-time constants.
    op.Init(x, y, z, tilingData.totalLength, tilingData.tileNum);
    // Run the code branch that matches the tiling key chosen on the host side.
    if (TILING_KEY_IS(1)) {
        op.Process();
    }
}
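The two fields read from tilingData above, totalLength and tileNum, have to be defined and filled on the host side. In framework-based custom operator projects this is usually done with the tiling data macros; a sketch under that assumption (everything apart from the two field names is illustrative) looks like:

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)
  TILING_DATA_FIELD_DEF(uint32_t, totalLength);  // total number of elements to process
  TILING_DATA_FIELD_DEF(uint32_t, tileNum);      // number of tiles per core
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
}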

Dynamic shape sample demonstration

The Ascend C vector addition implementation for the fixed shape scenario is in samples/cplusplus/level1_single_api/4_op_dev/6_ascendc_custom_op/kernel_invocation/Add/add_custom.cpp, and the corresponding dynamic shape implementation is in samples/cplusplus/level1_single_api/4_op_dev/6_ascendc_custom_op/kernel_invocation/Add_tile/add_custom.cpp.

The following walks through the samples for the two scenarios.
1. Kernel function
The difference between the two kernel functions is that the dynamic shape version takes two extra parameters, workspace and tiling, while the fixed shape version takes only x, y, and z. In addition, the dynamic shape kernel calls GET_TILING_DATA and passes two extra arguments to op.Init (a sketch of the fixed shape signature follows the screenshots below).
[Screenshot: add_custom kernel function in the fixed shape scenario]

[Screenshot: add_custom kernel function in the dynamic shape scenario]
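For comparison with the dynamic shape kernel listed earlier, here is a sketch of the fixed shape kernel function based on the Add sample, with no workspace/tiling parameters and no GET_TILING_DATA call:

// Fixed shape: only the three data pointers are passed in; all sizes are
// compile-time constants inside KernelAdd.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}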

2. Init() function
In Init(), the fixed shape scenario uses compile-time constants for the block and tile sizes, while the dynamic shape scenario stores the values from the tiling data in member variables (see the sketch after the screenshots below).
[Screenshot: Init() in the fixed shape scenario]

[Screenshot: Init() in the dynamic shape scenario]
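A sketch of the two Init() variants, based on the Add and Add_tile samples; the constants BLOCK_LENGTH and TILE_LENGTH and the queue/member names are assumptions taken from those samples:

// Fixed shape: block and tile sizes are compile-time constants.
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    xGm.SetGlobalBuffer((__gm__ half *)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half *)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half *)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}

// Dynamic shape: sizes come from the tiling data and are kept in member variables.
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)
{
    this->blockLength = totalLength / GetBlockNum();
    this->tileNum = tileNum;
    this->tileLength = this->blockLength / tileNum / BUFFER_NUM;
    xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * GetBlockIdx(), this->blockLength);
    yGm.SetGlobalBuffer((__gm__ half *)y + this->blockLength * GetBlockIdx(), this->blockLength);
    zGm.SetGlobalBuffer((__gm__ half *)z + this->blockLength * GetBlockIdx(), this->blockLength);
    pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(half));
}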

To run these two scenarios in CPU mode, first execute the run.sh script in the Add_tile folder with the following command:
bash run.sh add_custom ascend910 AiCore cpu
The results are as follows:
[Screenshot: CPU-mode run output; multiple processes produce the same md5sum value]

As the output shows, several different processes run and they all produce the same md5sum value. Compared with the fixed shape scenario, the dynamic shape scenario performs noticeably more scalar calculations.

Origin blog.csdn.net/qq_45257495/article/details/134349953