Introduction to operator development series (1)

Steps to write an operator:

1. Determine the input and output (data types and dimensions): clarify the operator's function, that is, the operation to be performed
2. Create the header file and source file of the operator: use C syntax to declare and define the operator function in the header file and source file respectively
3. Declare the operator function in the header file: use the extern keyword to declare the prototype of the operator function, including the function name, parameter list, return value type, etc.
4. Define the operator function in the source file: implement the specific logic of the operator function, writing the code that performs the calculation according to the operator's function and its input/output requirements (see the minimal sketch after this list)
5. Compile and build the operator: use the Ascend C compiler to compile the operator's source file into an executable file (the operator can also be packaged into a library file as needed)
6. Test and verify the operator: write test code that checks the operator produces correct results
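
As a minimal illustration of steps 2 through 4, here is a generic C sketch of an element-wise add operator. This is plain C, not the Ascend C kernel API; the file names and the function name vector_add are hypothetical:

/* vector_add.h -- step 3: declare the operator prototype */
#ifndef VECTOR_ADD_H
#define VECTOR_ADD_H

extern void vector_add(const float *x, const float *y, float *z, int n);

#endif

/* vector_add.c -- step 4: define the operator logic */
#include "vector_add.h"

void vector_add(const float *x, const float *y, float *z, int n)
{
    /* element-wise add: z[i] = x[i] + y[i] */
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}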

Operator fusion

Operator fusion is a deep learning model optimization technique that merges multiple operators (operations) into one, thereby reducing the amount of computation and the number of parameters and improving model performance and efficiency. In deep learning, operators usually refer to functions that operate on tensors (multidimensional arrays), such as convolutions and fully connected layers.

By fusing multiple operators, the amount of calculation and number of parameters can be reduced, thereby improving calculation speed and memory usage efficiency. In addition, operator fusion can also help reduce model size and facilitate deployment on resource-constrained devices such as mobile devices.

Operator fusion can usually be achieved in two ways:

  1. Weight Fusion: fuse the weight matrices of multiple operators into one matrix, thereby reducing the number of parameters (see the sketch after this list).
  2. Operator Scheduling Fusion: adjust the computation order of operators so that multiple operators are merged into one, thereby reducing the amount of computation and memory usage.
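
A classic instance of weight fusion is folding a BatchNorm layer into the convolution that precedes it. The sketch below uses plain NumPy, and the function name fold_bn_into_conv is illustrative:

import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta,
    # so the per-channel scale can be absorbed into W and the shift into b.
    scale = gamma / np.sqrt(var + eps)        # shape: (out_channels,)
    w_fused = w * scale.reshape(-1, 1, 1, 1)  # w: (out_ch, in_ch, kh, kw)
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

After folding, the BatchNorm node disappears from the graph and inference runs a single convolution with the fused weights, which is exactly the reduction in parameters and computation described above.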

It should be noted that operator fusion can make the computational graph more complex and may therefore require more compilation and optimization time. In addition, not all operators can be fused; whether fusion pays off must be evaluated for the specific problem and hardware environment.

Implementing an NMS operator in MindSpore

import mindspore as ms
import mindspore.ops as ops

class NMS(ms.nn.Cell):
    def __init__(self, iou_threshold=0.5):
        super(NMS, self).__init__()
        # NMSWithMask takes the IoU threshold at construction time
        self.non_max_suppression = ops.NMSWithMask(iou_threshold)
        self.concat = ops.Concat(axis=1)
        self.reshape = ops.Reshape()

    def construct(self, boxes, scores):
        # NMSWithMask expects boxes of shape (N, 5): (x1, y1, x2, y2, score)
        scores = self.reshape(scores, (-1, 1))
        bboxes = self.concat((boxes, scores))
        output_boxes, indices, mask = self.non_max_suppression(bboxes)
        return output_boxes, indices, mask

# Create the NMS operator
nms_op = NMS(iou_threshold=0.5)

# Apply non-maximum suppression; boxes has shape (N, 4), scores has shape (N,)
output_boxes, indices, mask = nms_op(boxes, scores)
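
A quick sanity check with hypothetical data (three boxes, the first two overlapping heavily) might look like this; only the rows where mask is True survive suppression:

import numpy as np

boxes = ms.Tensor(np.array([[0, 0, 10, 10],
                            [1, 1, 11, 11],
                            [20, 20, 30, 30]], dtype=np.float32))
scores = ms.Tensor(np.array([0.9, 0.8, 0.7], dtype=np.float32))

out_boxes, indices, mask = nms_op(boxes, scores)
kept = out_boxes.asnumpy()[mask.asnumpy()]  # boxes kept after suppression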

Operator compilation process

Frontend Compilation

Syntax analysis: parse the operator source code into an abstract syntax tree (AST)
Semantic analysis: perform static analysis on the AST, checking types, scopes, etc., to produce a semantically checked AST
Intermediate code generation: convert the checked AST into an intermediate representation (for example, three-address code in GIMPLE form in GCC)

Optimization Compilation

Perform various optimizations, such as constant folding, dead code elimination, and loop optimization, to obtain more efficient intermediate code
Optimize for specific architectures, for example through SIMD vectorization
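
To make constant folding concrete, here is a small self-contained sketch that folds constant arithmetic in a Python AST using the standard ast module (real compilers apply the same idea to their own intermediate representation):

import ast

class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            if isinstance(node.op, ast.Add):
                return ast.copy_location(
                    ast.Constant(node.left.value + node.right.value), node)
            if isinstance(node.op, ast.Mult):
                return ast.copy_location(
                    ast.Constant(node.left.value * node.right.value), node)
        return node

tree = ast.parse("y = 2 * 3 + 4 + x")
folded = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(folded))  # prints: y = 10 + x  (ast.unparse needs Python 3.9+)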

Code Generation

Convert the optimized intermediate code into architecture-specific assembly code or machine code
Perform register allocation, instruction selection, instruction scheduling, etc.
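
As a loose analogy (bytecode rather than machine code), Python's standard dis module shows what a lowered instruction sequence for a small operator looks like:

import dis

def axpy(a, x, y):
    return a * x + y

dis.dis(axpy)  # prints the instruction sequence generated for this function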

Backend compilation

Assemble the generated assembly code into object code, complete linking, and produce the final operator library
Perform code verification to ensure the compilation is correct

Commonly used tools and frameworks in the compilation process include LLVM, GCC, and TVM. Through compilation, the operator's high-level language code is efficiently converted into machine code that specific hardware can execute, fully exploiting the hardware's performance.

——————————————————————————————————————
For deep learning neural network operators, the compilation process differs slightly from that of general operators, mainly in the following respects:

Tensorization

Design appropriate data types for tensor data representation in neural networks, such as float16, int8, etc.
Reduce data storage and transmission costs while controlling accuracy loss
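
For instance, lowering float32 tensors to int8 typically applies a scheme like the following symmetric quantization sketch (plain NumPy; the function names are illustrative):

import numpy as np

def quantize_int8(x):
    # Symmetric quantization: map [-max|x|, +max|x|] onto [-127, 127]
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -1.2, 0.7, 2.4], dtype=np.float32)
q, scale = quantize_int8(x)
print(dequantize_int8(q, scale))  # close to x, with a small rounding error

The int8 tensor takes a quarter of the storage and bandwidth of float32, at the cost of the rounding error visible in the round trip.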

Optimization for specific hardware

For example, on GPUs: consider thread block division, shared memory usage, loop unrolling, etc.
For example, on TPUs: consider optimization for the systolic array architecture to improve parallelism

Automatic differentiation

Automatically generate the derivative code required for backpropagation
Support dynamic shape changes and accelerate the training process
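
The idea of generating derivative code mechanically can be sketched in a few lines with forward-mode dual numbers (a toy illustration of the principle, not how MindSpore implements it):

# Each value carries its derivative; arithmetic rules propagate both.
class Dual:
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

x = Dual(3.0, 1.0)   # seed dx/dx = 1
y = x * x + x        # y = x^2 + x
print(y.val, y.grad) # 12.0 7.0, since dy/dx = 2x + 1 = 7 at x = 3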

Graph-level optimization

Overall optimization across operators, such as operator fusion and memory reuse
Provides performance improvements at a higher level than per-operator optimization

Dynamic shape support

The shape of each layer may change during training, so compilation needs to account for shape uncertainty
Support changing shapes through dynamic scheduling

Deep learning compilation must support more of these optimizations automatically in order to generate efficient code, fully utilize hardware performance, and achieve low latency and high throughput for both training and inference.

Origin: blog.csdn.net/weixin_44659309/article/details/133125174