Graph optimizations for deep learning performance

This is a summary of some of the graph optimizations the author has implemented on real deep learning models. Some of them already exist in current deep learning frameworks, while others are the author's own discoveries that existing frameworks do not yet implement. New discoveries will continue to be added in the future, and readers are welcome to contribute their own optimizations with broader application scenarios.

Basic concept: a deep learning model is a directed acyclic graph (DAG) whose nodes are operators. A subgraph of this graph can be replaced with another, computationally equivalent subgraph whose implementation is more efficient. Conventional graph optimization combines existing framework operators into a new subgraph that replaces the old one; operator fusion is a special case of graph optimization in which the old subgraph is replaced by a manually written or automatically generated new operator.

OneHot+MatMul to Gather

The pattern from the BERT model shown in the figure below can be replaced by a single Gather, which reduces the number and variety of operators and significantly simplifies the computation.
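A minimal numpy sketch of the equivalence, assuming the usual BERT embedding-lookup form (one-hot over the last axis, on/off values 1/0); shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, hidden = 8, 4
indices = rng.integers(0, depth, size=(2, 3))        # [batch, seq]
W = rng.standard_normal((depth, hidden)).astype(np.float32)

one_hot = np.eye(depth, dtype=np.float32)[indices]   # OneHot -> [batch, seq, depth]
matmul_out = one_hot @ W                              # OneHot + MatMul
gather_out = W[indices]                               # Gather(W, indices, axis=0)

assert np.allclose(matmul_out, gather_out)
```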

SpaceToDepth and DepthToSpace operators

In the figure, reshape + transpose (perm=[0, 1, 3, 5, 2, 4]) is equivalent to SpaceToDepth. Since transpose appears in so many places, replacing this pattern allows a more targeted, optimized implementation. In addition, this optimization makes it easier to optimize layout conversions such as NCHW to NHWC.

Similarly, transpose(perm=[0, 1, 4, 2, 5, 3]) here is equivalent to DepthToSpace.
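A small numpy sketch of both decompositions, written in the ONNX NCHW reference form (the perms in the figures may differ because the layout there is different, but the pattern-matching idea is the same):

```python
import numpy as np

def space_to_depth(x, b):
    # reshape -> transpose -> reshape, ONNX SpaceToDepth reference decomposition
    n, c, h, w = x.shape
    t = x.reshape(n, c, h // b, b, w // b, b)
    t = t.transpose(0, 3, 5, 1, 2, 4)
    return t.reshape(n, c * b * b, h // b, w // b)

def depth_to_space(x, b):
    # inverse decomposition (DCR mode)
    n, c, h, w = x.shape
    t = x.reshape(n, b, b, c // (b * b), h, w)
    t = t.transpose(0, 3, 4, 1, 5, 2)
    return t.reshape(n, c // (b * b), h * b, w * b)

x = np.arange(2 * 4 * 6 * 6, dtype=np.float32).reshape(2, 4, 6, 6)
assert np.allclose(depth_to_space(space_to_depth(x, 2), 2), x)
```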

Of course, more than one perm can map a 6D transpose to SpaceToDepth or DepthToSpace. You can refer to the ONNX definitions of these two operators for other possible perms. For example, there is a variant with the channel dimension placed last:

This pattern is actually doing a SpaceToDepth.

The standard SpaceToDepth interprets NCHW as NCH1H0W1W0 and transposes it to NH0W0CH1W1, with H0 and W0 being the block size. The transpose in the pattern above may differ from the standard SpaceToDepth, for example the arrangement may be NW0H0CH1W1, but it can still be replaced directly by reshape + transpose + reshape.

Combining transpose+reshape+transpose

In a more complicated situation, the reshape operator both splits and merges dimensions at the same time:

Any number of chained transpose+reshape+transpose operators like the ones above can be combined into a single reshape+transpose+reshape pattern.

A relatively simple algorithm for combining any transpose+reshape+transpose chain is:

Split the input shape into its smallest prime factors; for example, [4, 9] is split into [a0, a1, a2, a3] (= [2, 2, 3, 3]). We use symbolic shapes to represent the split result, where each symbol corresponds to an actual prime factor.

One problem here is that, for example, 6 can be split into [2, 3] or [3, 2], and some split orders may cause the subsequent processing to fail. For example, if 6 is split into [2, 3] but the reshape target is [3, 2], the match fails, whereas splitting it into [3, 2] works. You can try different split orders.

Then group these small units according to the input shape; for example, [4, 9] becomes [[a0, a1], [a2, a3]].

From this grouping, derive the output shape in the same symbolic form, for example [[a2, a3], [a0, a1]].

Flatten all the symbols and regroup them by the continuity of the symbol indices, giving [a2, a3], [a0, a1].

Based on this information, it is easy to derive the final merged reshape and transpose. A small numerical check of such a merge is sketched below.
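The sketch below uses an illustrative shape and perms (not taken from the figures) to check that a transpose+reshape+transpose chain collapses into at most reshape+transpose+reshape; in this particular case the leading reshape is the identity:

```python
import numpy as np

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# Original chain: transpose -> reshape -> transpose.
chain = x.transpose(0, 2, 1, 3).reshape(2, 4, 15).transpose(0, 2, 1)

# Merged form: a single transpose + reshape computes the same result.
merged = x.transpose(0, 1, 3, 2).reshape(2, 15, 4)

assert np.array_equal(chain, merged)
```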

These operators can also be fused with certain kinds of elementwise operators.

Merge effect demo:

Not every case can be merged into a single transpose; in cases like this one, you can consider merging only part of the chain.

Split and bias add exchange order

This pattern can be seen in the CLIP text encoder.

This Mul operator can be placed anywhere between the reshape and transpose below; moving it to a suitable position allows it to be fused better with other operators. The Split/bias-Add exchange itself is sketched below.
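A minimal sketch of the Split and bias-Add exchange with illustrative shapes (not from the figure): adding a bias before a Split along the same axis is equivalent to splitting first and adding the matching slice of the bias to each output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 12))
b = rng.standard_normal(12)

before = np.split(x + b, 3, axis=1)                                   # Add then Split
after = [xi + bi for xi, bi in zip(np.split(x, 3, axis=1),            # Split then Add
                                   np.split(b, 3))]

for u, v in zip(before, after):
    assert np.allclose(u, v)
```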

Matrix multiplication + BN fusion

This is similar to Conv2D + batch normalization (BN) fusion; see: Convolution + BN fusion - Luchang-Li's Blog - CSDN Blog.
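A numpy sketch of folding an inference-mode BN that follows MatMul + bias into the MatMul weight and bias (gamma, beta, mean, var, eps are the BN parameters; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 8, 4
x = rng.standard_normal((3, k))
W = rng.standard_normal((k, n))
b = rng.standard_normal(n)
gamma, beta = rng.standard_normal(n), rng.standard_normal(n)
mean, var, eps = rng.standard_normal(n), rng.random(n) + 0.1, 1e-5

scale = gamma / np.sqrt(var + eps)
reference = (x @ W + b - mean) * scale + beta          # MatMul + bias, then BN
fused = x @ (W * scale) + ((b - mean) * scale + beta)  # folded weight and bias

assert np.allclose(reference, fused)
```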

Merge adjacent Conv2D or MatMul

As long as there is no nonlinear operator between them, two adjacent Conv2D or MatMul operators can be combined directly into one.
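A minimal check of the MatMul case: the two constant weights are pre-multiplied into one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 4))

# Two adjacent MatMuls with no nonlinearity in between collapse into one MatMul.
assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))
```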

Fusion of MatMul with Add and Mul vector operations

In the figure, the MatMul and the three following Add and Mul operators are combined into one MatMul + Add. In fact, any number of Add and Mul operators following a MatMul can be fused into it, provided certain prerequisites hold, for example that the second input of each Add and Mul is constant and is 1D (or a broadcastable 2D vector).

For example, combining MatMul + bias with a following Add or Mul vector:

(in*B + bias0) + bias1 = in*B + (bias0 + bias1)

(in*B + bias0) * scale1 = in*(B*scale1) + (bias0*scale1)

Therefore the chain can always be merged (a MatMul without bias can be treated as having a zero bias).
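A numpy sketch of folding a Mul and an Add that follow MatMul + bias into a single weight and bias, using the two identities above (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
B = rng.standard_normal((8, 4))
bias0, bias1, scale1 = (rng.standard_normal(4) for _ in range(3))

reference = ((x @ B + bias0) * scale1) + bias1   # MatMul + Add + Mul + Add chain

W_fused = B * scale1                             # fold the Mul into the weight
b_fused = bias0 * scale1 + bias1                 # fold the Mul and second Add into the bias
fused = x @ W_fused + b_fused

assert np.allclose(reference, fused)
```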

This fusion always improves performance, but it may sacrifice accuracy, especially when fusing the Mul operator. Because the original MatMul weight, bias and scale are usually small values less than 1, folding the scale into the weight and bias makes those values even smaller, which may reduce accuracy.

You can also swap a Mul and an Add by dividing the Add constant by the Mul scalar, so that the Add can be folded into the MatMul bias and the following Mul can then be folded into the MatMul as well. However, in FP16 this may cause a large accuracy loss.

DepthwiseConv+Add

The computation is x*w1 + bias1 + x, which can be rewritten as x*w1 + x*w2 + bias1, where * denotes convolution.

Here w2 is a delta kernel, i.e. a filter with 1 in the center and 0 elsewhere, so the whole computation can be converted to

x*(w1+w2)+bias1
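A single-channel numpy/scipy sketch (zero padding, 'same' output size) of folding the residual Add into the filter via the delta kernel; a real depthwise conv applies this per channel:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7))
w1 = rng.standard_normal((3, 3))
bias1 = 0.5

w2 = np.zeros((3, 3))
w2[1, 1] = 1.0                                         # delta filter: conv(x, w2) == x

reference = correlate2d(x, w1, mode="same") + bias1 + x
fused = correlate2d(x, w1 + w2, mode="same") + bias1   # x*(w1+w2) + bias1

assert np.allclose(reference, fused)
```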

(X + b1)*w2 = X*w2 + b1*w2

Using this identity, the first Add can be moved below the MatMul, after which the two following Adds can be merged into one.
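A minimal check of moving the Add below the MatMul (illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
b1 = rng.standard_normal(8)
W = rng.standard_normal((8, 4))

# (X + b1) @ W == X @ W + (b1 @ W); b1 @ W is a constant that can merge with a later bias.
assert np.allclose((x + b1) @ W, x @ W + b1 @ W)
```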

LayerNorm operator merge

In some deep learning frameworks, LayerNorm is composed of multiple small operators. By recognizing the pattern above and replacing it with a handwritten fused LayerNorm operator, performance can be improved several times. ONNX opset 17 adds a definition for this fused operator (LayerNormalization).
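For reference, a sketch of the small-operator LayerNorm computation that the pattern matcher looks for, written op-by-op the way it appears in the graph; a fused kernel computes the same result in a single operator:

```python
import numpy as np

def layernorm_small_ops(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)              # ReduceMean
    centered = x - mean                                # Sub
    var = (centered ** 2).mean(axis=-1, keepdims=True) # Pow + ReduceMean
    normed = centered / np.sqrt(var + eps)             # Add eps + Sqrt + Div
    return normed * gamma + beta                       # Mul + Add

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 16)).astype(np.float32)
gamma = np.ones(16, dtype=np.float32)
beta = np.zeros(16, dtype=np.float32)
print(layernorm_small_ops(x, gamma, beta).shape)       # (2, 5, 16)
```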

Different layernorm implementations in models

This is another LayerNorm variant, probably caused by how the model code was written: two Sub operators appear after the ReduceMean, but they compute exactly the same thing, so the pattern can be handled in the same way.

In some models, when a pb model is converted to ONNX, the reduce in LayerNorm is replaced by GlobalAveragePool. When the input rank is 3, it can be replaced directly by ReduceMean so that the LayerNorm pattern match above does not fail.

The three Slices here can actually be replaced by one Split; of course, you can also benchmark which implementation is better.
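A small sketch (illustrative shapes) of replacing three contiguous, equal-sized Slices along one axis with a single Split:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 12))

slices = [x[:, :, 0:4], x[:, :, 4:8], x[:, :, 8:12]]   # three Slice operators
splits = np.split(x, 3, axis=2)                        # one Split operator

for s, p in zip(slices, splits):
    assert np.array_equal(s, p)
```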

Merged into ReduceSumSquare

The computation in the figure below can be rewritten with reshape + transpose to compute more efficiently; of course, it is best to make this change at the model code level. This shows how different developers express the same computation in different ways, and how different compositions of the same computation can have very different efficiency.

Merge the same calculation

This applies especially to layout-transformation operators such as transpose and reshape.

How to identify these identical computation patterns more intelligently and efficiently, rather than judging them manually every time, is also a problem worth researching.

Delete unnecessary operators

For example, Identity and Expand.

The Expand in the figure below is actually unnecessary: Where, as a ternary elementwise operator, has its own broadcast capability, while Expand has to write a large tensor back to memory and read it again, wasting time. A general rule: if an Expand is followed by an elementwise operator, and the other inputs of that elementwise operator together with Expand's input shape already broadcast to the elementwise output shape, the Expand can be deleted.
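A sketch of this removal rule using numpy's broadcasting check (the shapes are illustrative, assuming a Where(cond, a, b) whose input a comes from an Expand):

```python
import numpy as np

def expand_is_removable(expand_in_shape, other_in_shapes, elemwise_out_shape):
    # The Expand is redundant when its *input* shape already broadcasts with the
    # other elementwise inputs to the elementwise output shape.
    try:
        return np.broadcast_shapes(expand_in_shape, *other_in_shapes) == tuple(elemwise_out_shape)
    except ValueError:
        return False

# a = Expand(a0), a0 has shape (1, 16); cond is (8, 1), b is (8, 16), output is (8, 16).
print(expand_is_removable((1, 16), [(8, 1), (8, 16)], (8, 16)))   # True -> delete the Expand
```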

In some cases it can also be changed directly into a reshape, and a reshape generally requires no actual computation.

split concat tensor merge

The situation in the figure above is probably also caused by how the model code was written: many small tensors are split off and then concatenated back together. The inputs to the Concat that come from the Split are contiguous, so the tensors in the middle of the Split can be merged. In other words, the figure splits out 24 tensors, but in fact only 2 need to be split out, and the first Concat can be removed entirely.

Shape-related constant folding

When the shape is fixed and known, the Shape operator can be computed in advance and replaced with a constant, so that the operators downstream of the Shape can then be constant-folded.

Optimization related to transpose operator

Several Common Scenarios for Transpose Operator Optimization - Luchang-Li's Blog - CSDN Blog

Operator Order Adjustment

Adjust the order so that the Div is folded into the MatMul and bias_add parameters; the second case is similar. Here it is worth abstracting a more general algorithm to handle a wider range of situations.

Multiple parallel branches that perform the same computation can be merged into a batched computation, which significantly reduces the number of operators and increases their arithmetic intensity, as in the scenario below. Another example is the three attention projection MatMuls in a transformer model, which can be combined into one batched MatMul, or into a special implementation that accepts multiple inputs and biases at the same time.

Under certain conditions, multi-way parallel Slices can be replaced by a Gather, and all the following elementwise operators can then be merged into one computation. The case in the figure below is a little more complicated: the Slices can be replaced by Gathers and the elementwise operators merged accordingly, but merging the GatherV2 operators requires a custom batch GatherV2 operator.

Adjacent convolution/MatMul + bias operators with no nonlinear layer in between can be merged directly, based on the linearity of convolution and matrix multiplication.

Operator replacement

If only one axis is sliced, the Slice can be replaced with a Gather, simplifying the pattern and allowing a more efficient implementation.
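A minimal sketch of replacing a single-axis Slice with a Gather over contiguous indices (illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 8))

sliced = x[:, 2:7, :]                            # Slice on axis 1
gathered = np.take(x, np.arange(2, 7), axis=1)   # Gather(indices=[2..6], axis=1)

assert np.array_equal(sliced, gathered)
```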

ONNX and TensorFlow pb model graph optimization methods

onnx model graph optimization/model modification_Luchang-Li's Blog-CSDN Blog_onnx.checker.check_model

TensorFlow pb model modification and optimization_Luchang-Li's blog-CSDN blog_tensorflow model optimization


Origin blog.csdn.net/u013701860/article/details/126808023