KuiperInfer deep learning inference framework - source code reading and secondary development (3): computation graph

Foreword: KuiperInfer is a high-performance deep learning inference library implemented from scratch, and its Chinese tutorial is already very complete. This blog series is mainly a set of study notes for myself and a tutorial for secondary development; more AI inference enthusiasts are welcome to join in. This post covers the computation graph, focusing on: what are the shortcomings of ONNX? Why choose PNNX? How is the computation graph built? And which optimization techniques in PNNX's computation graph are worth learning?

Table of contents

Why choose PNNX?

From the perspective of OpenPPL, why not choose ONNX?

Discussion

Building the computation graph

C++ forward declaration

Learning: computation graph optimization in PNNX

1. Constant Folding

2. Remove redundant operators

3. Various fusion passes

4. Operator equivalence transforms

References


Why choose PNNX?

The first question has to go to nihui: why write another pnnx when onnx already exists? Partly, of course, for the sake of output and promotion. But onnx also has real shortcomings of its own, and I strongly recommend nihui's own blog post:

PNNX: PyTorch Neural Network Exchange

Here I summarize:

  • Some PyTorch models use operators that have no ONNX equivalent, so exporting to ONNX can simply fail.
  • Exporting a PyTorch model to ONNX sometimes produces a very complex computation graph, which hurts inference efficiency.
  • ONNX's support for operators from third-party libraries is limited, and the exported computation graphs are complex and fragmented.

From the perspective of OpenPPL, why not choose ONNX?

As I wrote in the earlier openppl/ppq article, a lot of computation graph optimization is done before quantization:

https://xduwq.blog.csdn.net/article/details/128839708

In fact, quantization accuracy depends heavily on inference-side support. If onnx graphs are used for inference directly, operators get decomposed into many small pieces and far too many quantization nodes end up inserted, which seriously hurts the quantized model's inference. That is why ppq does a lot of preprocessing before quantization: computation graph optimization.

Discussion

Of course, many experienced people think there is no need to spend all this effort rebuilding what ONNX already does. Setting aside personal factors (the impact on promotions and raises), many believe it is enough to do some graph optimization in the style of OpenPPL.

Personally, I find PNNX very pleasant to use; try it and it quickly wins you over...

Building the computation graph

Fu Dagou explains this in great detail: KuiperInfer wraps PNNX and then converts it into its own computation graph. See the original article and video:

Self-made deep learning inference framework - Lesson 7 - Build your own computation graph (video)

Self-made deep learning inference framework - Lesson 7 - Build your own computation graph (Zhihu)

C++ forward declaration

Defining the IR makes use of C++ forward declarations. First look at the source code (in KuiperInfer/ir.h):

class Operator;
class Operand
{
public:
    void remove_consumer(const Operator* c);

    Operator* producer;
    std::vector<Operator*> consumers;

    // 0=null 1=f32 2=f64 3=f16 4=i32 5=i64 6=i16 7=i8 8=u8 9=bool 10=cp64 11=cp128 12=cp32
    int type;
    std::vector<int> shape;

    // keep std::string typed member the last for cross cxxabi compatibility
    std::string name;

    std::map<std::string, Parameter> params;

};

class Operator
{
public:
    std::vector<Operand*> inputs;
    std::vector<Operand*> outputs;

    // keep std::string typed member the last for cross cxxabi compatibility
    std::string type;
    std::string name;

    std::vector<std::string> inputnames;
    std::map<std::string, Parameter> params;
    std::map<std::string, Attribute> attrs;
};

Forward declaration: a class can be declared without being defined; such a declaration is sometimes called a forward declaration. Between the declaration and the definition, Operator is an incomplete type: we know Operator is a type, but not which members it contains. An incomplete type can be used only in limited ways. You cannot define objects of it; you can only define pointers and references to it, and declare (but not define) functions that take it as a parameter type or return it.

Note that even with a forward declaration in place, you cannot define objects of that class, nor use its members inside inline member functions, until the complete class definition has been seen. In other words, a forward declaration only lets you use the declared name; you cannot touch any details of the class.

So the following is wrong:

class Fred;    // forward declaration

class Barney
{
public:
    void method()
    {
        x->yabbaDabbaDo();    // error: a member of Fred is used before Fred is defined
    }
private:
    Fred* x;   // OK: after the forward declaration, a pointer to Fred may be declared
};

class Fred
{
public:
    void yabbaDabbaDo();
private:
    Barney* y;
};

Learning: Calculation graph optimization method in PNNX

Computation graph optimization is one of the big selling points of many AI inference frameworks and AI compilers, and PNNX does a very good job here; it is well worth studying and borrowing from!

ncnn/tools/pnnx at master · Tencent/ncnn · GitHub

PNNX organizes its optimization into five levels of passes, much like the high-level-to-low-level pass progression in AI compilers:

(Figure: PNNX's five levels of passes)
(Figure: pass optimization in CINN)

I don't have much hands-on experience in this area myself, so let's look at how an expert sums it up:

How to choose a deep learning inference framework? - Zhihu

1. Constant Folding

The first is constant folding. Some computations use constants; these can be initialized as part of the graph and need not be computed during inference. Situations where constant folding applies include:

  • Operators that have no input node yet are not graph inputs themselves, for example a constant fed into an operator; such Nodes need to be picked out;
  • Intermediate computations on Shape: if the input shape is fixed, the resulting shape can be fixed too. For torchscript this is trickier, because torchscript is dynamic by default;
  • Some odd constructs can be shrunk. For example, a Pad computed outside a Conv can be folded into the Conv itself.

After finding these foldable operators, the folding itself turns those Nodes into Attributes: a new Operator is created, and the attribute becomes a fixed parameter of that Operator.
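To make this concrete, here is a minimal sketch of that folding step, written against the pnnx-style IR shown earlier. The Graph container and the evaluate_constant helper are hypothetical placeholders for illustration, not actual pnnx APIs:

#include <vector>

// Hypothetical container mirroring pnnx's graph: it owns the operator list.
struct Graph
{
    std::vector<Operator*> ops;
};

// Hypothetical helper: runs a constant operator offline and captures its value.
Attribute evaluate_constant(const Operator* op);

void fold_constants(Graph& graph)
{
    for (Operator* op : graph.ops)
    {
        // Candidate: no input operands, yet not a graph input, so the node
        // produces a value that is already known at export time.
        if (!op->inputs.empty() || op->type == "pnnx.Input")
            continue;

        Attribute folded = evaluate_constant(op);

        // Hand the result to every consumer as a fixed Attribute instead of
        // a runtime input; the folded node can then be erased from the graph.
        for (Operand* out : op->outputs)
            for (Operator* consumer : out->consumers)
                consumer->attrs[out->name] = folded;
    }
}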

2. Remove redundant operators

Some very simple examples:

  • Identity Elimination;
  • Slice Elimination;
  • Unsqueeze Elimination;
  • Dropout Elimination;
  • Expand Elimination;
  • Pooling Elimination;
  • Duplicated Reshape Elimination;
  • Opposite Operator Elimination;
  • Common Subgraph Elimination;

These are all basic operations. Everyone knows dropout. Identity simply returns its input: whatever you feed it, it hands straight back. It looks like it does nothing, yet some networks do use it. For example, if you want to delete a certain operator without affecting the operations after it (assuming the dimensions stay unchanged), you can simply replace that layer with Identity.

The logic for removing Identity is also very simple: traverse the whole graph looking for operators whose output is identical to their input, delete them, and connect each one's input directly to the next node.
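As a sketch of that traversal (again against the pnnx-style IR above, with the same hypothetical Graph container as in the constant-folding sketch):

void eliminate_identity(Graph& graph)
{
    for (auto it = graph.ops.begin(); it != graph.ops.end(); )
    {
        Operator* op = *it;
        if (op->type != "Identity")
        {
            ++it;
            continue;
        }

        Operand* in = op->inputs[0];
        Operand* out = op->outputs[0];

        // Rewire every consumer of the Identity's output so it reads the
        // Identity's input directly, bypassing the node.
        for (Operator* consumer : out->consumers)
        {
            for (Operand*& operand : consumer->inputs)
            {
                if (operand == out)
                    operand = in;
            }
            in->consumers.push_back(consumer);
        }
        in->remove_consumer(op);

        // Drop the node itself (freeing the now-dangling Operand is omitted).
        it = graph.ops.erase(it);
    }
}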

Slice elimination mainly targets the pattern where a dimension has size d and the slice is [0, d-1]: such a slice changes nothing, so we can kill it directly. The check is also very simple:

// shape is known, and input and output shapes are identical: the slice is a no-op
if (!op->inputs[0]->shape.empty() && op->inputs[0]->shape == op->outputs[0]->shape)
{
    matched = true;
}

We simply check that the input shape is known (non-empty) and identical to the output shape, i.e. that the slice accomplished nothing.

Expand can be removed the same way: if the dimensions don't change after the Expand, the Expand was a no-op. Kill it directly; the check can reuse the Slice logic.

Pooling can be killed when the kernel is 1x1: a 1x1 pool does nothing at all, so why keep it?

Of course, whether the last two cases matter depends on the models you actually need to support; experienced algorithm engineers generally don't write such no-op layers in the first place.

Repeated Reshapes can also be merged into one.

Adjacent operators with opposite meanings can be removed as well. For example, a Squeeze followed immediately by an Unsqueeze on the same axis can be replaced with Identity, which a later pass then strips out. So here you see Identity in action.
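A sketch of the matching condition for such a pair (the operator type strings and the "dim" parameter name follow PyTorch conventions and are my assumptions, not pnnx's exact schema):

bool is_cancelling_pair(const Operator* op)
{
    if (op->type != "torch.squeeze")
        return false;

    // The squeeze's output must feed exactly one consumer...
    const Operand* out = op->outputs[0];
    if (out->consumers.size() != 1)
        return false;

    // ...and that consumer must be an unsqueeze on the same axis: the pair
    // cancels out and can be rewritten as Identity.
    const Operator* next = out->consumers[0];
    return next->type == "torch.unsqueeze"
        && op->params.at("dim") == next->params.at("dim");
}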

Common subgraphs can also be optimized: for example, several identical inputs into the same Node or Module can be merged into one.

Finally, there are some quantization-related operations that can be removed as well, such as:

  • Quant-dequant Elimination;

I won't go into QDQ support for now; that rabbit hole is fairly deep.

3. Various fusion passes

These operations are also used in quantization, and the fusions are very much worthwhile: they can greatly improve the model's running speed. For example:

  • Conv + (Add) Bias fusion;
  • Conv + Scale fusion;
  • Conv + Mul fusion;
  • Conv + ReLU fusion;
  • MatMul + Add fusion;
  • MatMul + Scale fusion;
  • BN + Scale fusion;
  • Conv + BN fusion;
  • Conv + BN + Bias fusion;
  • Conv + BN + ReLU fusion;
  • FC + BN fusion;
  • FC + Add fusion;

These are computation-equivalent rewrites; the point is to reduce the number of memory accesses and thereby speed up inference. The precondition is that the framework's inference side actually supports the corresponding fused operator.
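As an example of what such a fusion actually computes, here is a minimal sketch of folding a BatchNorm into the preceding Conv's weight and bias. Plain std::vector buffers stand in for real tensor types, and the per-output-channel layout is my assumption:

#include <cmath>
#include <vector>

void fuse_conv_bn(std::vector<float>& conv_weight,   // [out_c * in_c * kh * kw]
                  std::vector<float>& conv_bias,     // [out_c]
                  const std::vector<float>& bn_mean, // running mean, [out_c]
                  const std::vector<float>& bn_var,  // running variance, [out_c]
                  const std::vector<float>& bn_gamma,
                  const std::vector<float>& bn_beta,
                  float eps = 1e-5f)
{
    const size_t out_c = conv_bias.size();
    const size_t per_channel = conv_weight.size() / out_c;

    for (size_t oc = 0; oc < out_c; ++oc)
    {
        // BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta, applied to the
        // conv output, folds into a per-channel scale on the weight plus a
        // shifted bias.
        const float scale = bn_gamma[oc] / std::sqrt(bn_var[oc] + eps);

        for (size_t i = 0; i < per_channel; ++i)
            conv_weight[oc * per_channel + i] *= scale;

        conv_bias[oc] = (conv_bias[oc] - bn_mean[oc]) * scale + bn_beta[oc];
    }
}

After this rewrite the BN node disappears entirely, which is exactly where the memory-access savings come from: one kernel and one pass over the activations instead of two.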

On top of these, you can do some higher-level fusion passes, which likewise require strong support from the inference layer. For example:

  • BERT Embedding + LayerNorm + AttentionMask fusion;
  • FC + LayerNorm fusion;
  • MultiHead Attention fusion;

These fusions can push model speed further. FasterTransformer and ONNX Runtime inference for HuggingFace transformers both do this; for Transformer-based models in particular, fusing into large operators is well worth it.

Of course, this part also has quantization-related fusions:

  • Conv + BN + Relu -> ConvInt8;
  • Conv1dQuant -> Conv2dQuant;

4. Operator equivalence transforms

Rewriting operators into equivalent forms is also an essential step. Here we replace rarely used operators with commonly used ones; since common operators usually have well-optimized kernels, the result runs faster. Examples (see the sketch after the list):

  • MatMul to Conv2D;
  • FC (InnerProduct) to Conv2D 1x1;
  • BN to Scale;
  • ShuffleChannel to Reshape + Permute;
  • GroupConv to Conv + Slice
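For instance, the FC-to-1x1-Conv rewrite is almost pure metadata: an FC weight of shape [out_features, in_features] has exactly the layout of a 1x1 convolution weight of shape [out_features, in_features, 1, 1]. A minimal sketch against the IR above (the parameter names follow PyTorch conventions and are assumptions on my part):

void fc_to_conv1x1(Operator* op)
{
    if (op->type != "nn.Linear")
        return;

    // Only the metadata changes; the raw weight data is already laid out
    // exactly as a 1x1 convolution kernel expects.
    op->type = "nn.Conv2d";
    op->params["in_channels"] = op->params["in_features"];
    op->params["out_channels"] = op->params["out_features"];
    op->params["kernel_size"] = std::vector<int>{1, 1};
    op->params["stride"] = std::vector<int>{1, 1};
    op->params["padding"] = std::vector<int>{0, 0};
    op->params.erase("in_features");
    op->params.erase("out_features");
}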

That's about it for these passes. At the very end, after all these optimizations, any orphaned operators can be deleted.

References

Original article: blog.csdn.net/qq_41895747/article/details/130246930