Model Compression - Reparameterization


The main methods of model compression are re-parameterization, pruning, quantization, etc. This chapter mainly covers re-parameterization.
When dealing with identity mappings, when padding a convolution kernel or generating a new one, some methods directly use Dirac initialization for the added weights.
See here for Dirac initialization: https://www.e-learn.cn/topic/1523429

Reparameterization

Reparameterization includes parallel merging and serial merging (mainly merging modules into a single convolution). To end up with only one convolution, serial merging must be done first and parallel merging afterwards, and a serially merged segment cannot contain ReLU, sigmoid, or tanh: merging is only possible when the whole segment is described by a single (linear) formula, and ReLU is not, because its formulas for x>0 and x<0 are different.
Advantages: the multi-branch network used during training lets the model obtain better feature representations, and the parallel branches are fused into a single serial path for testing, which reduces computation and parameters and improves speed (in theory the recognition accuracy after fusion is the same as before fusion; in practice it is slightly lower).

Reparameterization methods of RepVGG and ACNet

ACNet

For the add operation: $w_1\cdot x+w_2\cdot x=(w_1+w_2)\cdot x$.
(Figure: ACNet's parallel 3×3, 1×3, and 3×1 convolution branches)
As shown in the figure, a 1×3 Conv is equivalent to a 3×3 Conv whose first and third rows are 0, and a 3×1 Conv is equivalent to a 3×3 Conv whose first and third columns are 0.
At this point the three Convs of different sizes in the figure are equivalent to three convolutions of the same size, and the add formula above can be applied.
In practice each Conv is followed by a BN; the fusion of Conv and BN is covered in the RepVGG section below.
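A minimal sketch of this padding-and-add fusion, assuming each branch's BN has already been folded into its weights and biases (see the RepVGG section below); the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def fuse_acnet_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1):
    """Pad the 1x3 and 3x1 kernels to 3x3 and add the three branches.

    All weights are assumed to already include their BN parameters.
    Shapes: w3x3 [N, C, 3, 3], w1x3 [N, C, 1, 3], w3x1 [N, C, 3, 1].
    """
    # Place the 1x3 kernel in the middle row: pad one zero row above and below.
    w1x3_padded = F.pad(w1x3, (0, 0, 1, 1))
    # Place the 3x1 kernel in the middle column: pad one zero column left and right.
    w3x1_padded = F.pad(w3x1, (1, 1, 0, 0))
    w = w3x3 + w1x3_padded + w3x1_padded
    b = b3x3 + b1x3 + b3x1
    return w, b

# Usage: the fused (w, b) behave like the sum of the three branch outputs.
w3x3, w1x3, w3x1 = torch.randn(8, 4, 3, 3), torch.randn(8, 4, 1, 3), torch.randn(8, 4, 3, 1)
biases = [torch.randn(8) for _ in range(3)]
w_fused, b_fused = fuse_acnet_branches(w3x3, w1x3, w3x1, *biases)
```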

RepVGG

RepVGG is very similar to ACNet. The main idea is to use a multi-branch network during training and merge the branches into a single path for testing, which improves speed without any retraining. (For parallel fusion there are restrictions on the branches; for a block like the one in the figure below, each branch is first merged into a single convolution, and then the branches are merged together.) The reparameterization process of RepVGG is:
(Figures: a RepVGG block and its reparameterization into a single 3×3 Conv)

  1. Fuse each Conv with the BN that follows it: $w_{i,:,:,:}'=\frac{\gamma_i}{\sigma_i}w_{i,:,:,:}$, $b_i'=-\frac{\mu_i\gamma_i}{\sigma_i}+\beta_i$, where $w_i$ is the original convolution weight and $\mu_i$, $\sigma_i^2$, $\gamma_i$, $\beta_i$ are the mean, variance, scale factor, and shift factor of BN, respectively.
  2. Convert each fused Conv to a 3×3 Conv (the largest kernel size). A 1×1 Conv is converted to a 3×3 Conv by placing its weight at the center and setting the surrounding weights to 0. The identity mapping is replaced by a 3×3 Conv whose i-th kernel has a center weight of 1 in its i-th channel and 0 everywhere else. (Convs of other sizes are handled the same way.)
  3. Merge all 3×3 Convs of the branches: add the weights w and biases b of all branches' kernels to obtain a single new 3×3 Conv (as in the add rule from ACNet above).
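A small sketch of step 2, assuming a PyTorch-style kernel layout [out_channels, in_channels, kH, kW]; the function names are illustrative:

```python
import torch
import torch.nn.functional as F

def pad_1x1_to_3x3(w1x1):
    """Place a 1x1 kernel at the centre of a 3x3 kernel, zeros elsewhere."""
    return F.pad(w1x1, (1, 1, 1, 1))          # [N, C, 1, 1] -> [N, C, 3, 3]

def identity_to_3x3(channels):
    """Identity branch as a 3x3 Conv: kernel i has a centre 1 in its i-th channel."""
    w = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        w[i, i, 1, 1] = 1.0
    return w

# Step 3 is then a plain sum of the three 3x3 kernels (and of their biases).
```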

Derivation:
The second and third steps above are straightforward (note that the third step only works when the branches are joined by add, not by concatenation). The main work is deriving the first step, which is done here.
Each computation in a Conv can be written as $Conv(x)=w\cdot x$ (no bias term).
The BN function is $BN(x)=\gamma_i\cdot\frac{x-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}+\beta_i$, where $\epsilon$ is a very small value.
So the Conv+BN formula is $\gamma_i\cdot\frac{w\cdot x-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}+\beta_i=\frac{\gamma_i}{\sqrt{\sigma_i^2+\epsilon}}w_i\cdot x-\frac{\gamma_i\mu_i}{\sqrt{\sigma_i^2+\epsilon}}+\beta_i$.
Since $\epsilon$ is very small it can be dropped, giving $\frac{\gamma_i}{\sigma_i}w_i\cdot x-\frac{\gamma_i\mu_i}{\sigma_i}+\beta_i$, i.e. a new convolution $Conv(x)=w'\cdot x+b'$ with $w_{i,:,:,:}'=\frac{\gamma_i}{\sigma_i}w_{i,:,:,:}$ and $b_i'=-\frac{\mu_i\gamma_i}{\sigma_i}+\beta_i$.
Because $\epsilon$ is dropped, and because of the limited numerical precision of the computer when the weights are fused, the actual accuracy at test time drops slightly.
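A minimal sketch of this Conv+BN fusion (here $\epsilon$ is kept under the square root instead of being dropped); the function and argument names are illustrative:

```python
import torch

def fuse_conv_bn(conv_w, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Fold BN (mean, var, gamma, beta) into the preceding conv weights.

    conv_w: [N, C, k, k]; the conv is assumed to have no bias.
    Returns (w', b') such that w' * x + b' == BN(conv(x)).
    """
    std = torch.sqrt(bn_var + eps)                           # sigma_i
    w_fused = conv_w * (bn_gamma / std).reshape(-1, 1, 1, 1) # scale each output channel
    b_fused = bn_beta - bn_gamma * bn_mean / std             # new per-channel bias
    return w_fused, b_fused
```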

Diverse Branch Block (DBB)

(Figure: the structure of the Diverse Branch Block)
As can be seen from the figure above, DBB involves both serial and parallel fusion. Parallel fusion was explained in the RepVGG and ACNet sections above, so the focus here is serial fusion.
The fusion of Conv and BN was also covered above, so it is not repeated here.

Serial 1x1 Conv and KxK Conv Fusion

This covers the fusion of a 1x1 Conv followed in series by a KxK Conv.
The fusion method proposed in DBB: first swap dimension 0 and dimension 1 of the 1x1 Conv kernel (dimension 0 is the number of kernels, dimension 1 is the number of channels), then use it to perform a Conv operation on the KxK Conv kernel to obtain the merged kernel weights, and finally convolve the input with the merged kernel, as shown in the figure below.
Formula: $X\bigotimes F_1\bigotimes F_2=X\bigotimes(F_2\bigotimes \mathrm{trans}(F_1))$

(Figures: the serial 1x1 Conv + KxK Conv fusion process)
Suppose the input is C×H×W and there are M 1x1 Conv kernels, so the 1x1 Conv $\in R^{M\times C\times 1\times 1}$ and its output is M×H×W; there are N KxK Conv kernels, so the KxK Conv $\in R^{N\times M\times K\times K}$ and its output is N×H×W.
After swapping dimension 0 and dimension 1 of the 1x1 Conv, the 1x1 Conv $\in R^{C\times M\times 1\times 1}$, i.e. it becomes C 1x1 kernels with M channels each. Performing the Conv operation with it on the KxK Conv gives the merged KxK Conv $\in R^{N\times C\times K\times K}$.
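A small sketch of this serial fusion, with a numerical check on random tensors (biases are omitted for simplicity; shapes follow the description above):

```python
import torch
import torch.nn.functional as F

def fuse_1x1_kxk(w1x1, wkxk):
    """Fuse a sequential 1x1 Conv (F1) and KxK Conv (F2) into one KxK Conv.

    w1x1: [M, C, 1, 1]  (M kernels, C input channels)
    wkxk: [N, M, K, K]  (N kernels, M input channels)
    Returns a kernel of shape [N, C, K, K]: X * F1 * F2 == X * conv2d(F2, trans(F1)).
    """
    return F.conv2d(wkxk, w1x1.permute(1, 0, 2, 3))

# Quick equivalence check on random tensors.
x = torch.randn(1, 4, 8, 8)
f1 = torch.randn(6, 4, 1, 1)     # 1x1 conv: 4 -> 6 channels
f2 = torch.randn(5, 6, 3, 3)     # 3x3 conv: 6 -> 5 channels
y_seq = F.conv2d(F.conv2d(x, f1), f2, padding=1)
y_fused = F.conv2d(x, fuse_1x1_kxk(f1, f2), padding=1)
print(torch.allclose(y_seq, y_fused, atol=1e-4))   # True
```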

1x1 Conv and AVG (Average Pooling) Fusion

AVG is equivalent to a Conv with N kernels (assuming the input consists of N feature maps): all weights of the i-th kernel's i-th channel are $\frac{1}{size^2}$, and the weights of the remaining channels are 0. At this point, the fusion of 1x1 Conv and AVG becomes the fusion of 1x1 Conv and KxK Conv, which is the serial case above.
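A sketch of this AVG-to-Conv conversion (the function name is illustrative):

```python
import torch

def avgpool_to_conv_kernel(channels, size):
    """Average pooling as a Conv: kernel i has 1/size^2 on its i-th channel only."""
    w = torch.zeros(channels, channels, size, size)
    for i in range(channels):
        w[i, i, :, :] = 1.0 / (size * size)
    return w
```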

Concat (Splicing) Fusion

When each branch is implemented with a KxK Conv, concatenating the outputs of all branches is equivalent to a single KxK Conv whose weight is the branch kernels spliced together along the kernel (output-channel) dimension.
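A sketch of this splicing with illustrative shapes:

```python
import torch

w_branch1, b_branch1 = torch.randn(4, 3, 3, 3), torch.randn(4)
w_branch2, b_branch2 = torch.randn(6, 3, 3, 3), torch.randn(6)
# Concatenating branch outputs along channels == one conv whose kernels are
# the branch kernels stacked along the output (kernel) dimension.
w_cat = torch.cat([w_branch1, w_branch2], dim=0)   # [4+6, 3, 3, 3]
b_cat = torch.cat([b_branch1, b_branch2], dim=0)
```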

RMNet

(Figure: the RM operation used in RMNet)
RMNet is built on ResNet. In a ResNet block, any ReLU appears at the end of the block, so the input received by the next block is guaranteed to be non-negative, which is what keeps the identity mapping linear on that input. For a network like MobileNetV2, the ReLU sits in the middle of the block, so the next block's input is not guaranteed to be non-negative. In that case ReLU cannot be used directly; instead a PReLU (LeakyReLU-style) is used: its alpha parameter is set to 1 for the identity-mapped input features, keeping a linear mapping, and to 0 for the convolved features, which is equivalent to ReLU.
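A sketch of that PReLU trick in PyTorch; which channels carry the copied input is a hypothetical split here:

```python
import torch
import torch.nn as nn

conv_ch, identity_ch = 16, 16          # hypothetical split: conv features + copied input features
prelu = nn.PReLU(num_parameters=conv_ch + identity_ch)
with torch.no_grad():
    prelu.weight[:conv_ch] = 0.0       # alpha = 0 -> behaves like ReLU on the convolved features
    prelu.weight[conv_ch:] = 1.0       # alpha = 1 -> identity mapping on the copied input features
```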

Removing the residual: no downsampling

As shown in the figure above, the residual connection spans two convolutions. The fusion method is as follows:
the input is known to be non-negative, the feature channels number N, there is a ReLU after the first Conv in the block, and the output of the second Conv is added to the block input and then passed through ReLU.
(Figure: the RM operation on a residual block without downsampling)
Step 1: Expand the first Conv layer in the block by appending N kernels; the appended i-th kernel has a center weight of 1 in its i-th channel and 0 elsewhere. The first Conv now outputs N extra feature maps, and these N maps are identical to the block input. Since the block input is non-negative and the appended weights are only 0 and 1, the extra maps pass through ReLU unchanged (the i-th extra map equals the i-th map of the block input).
Step 2: Expand the second Conv layer in the block: each kernel of the second Conv gets N extra channels, and the i-th kernel's i-th extra channel has a center value of 1, the rest 0. (The output of the i-th kernel is the i-th feature map; in the residual it would be added to the i-th map of the block input. That i-th input map is now the i-th extra output of the previous Conv layer, so giving the i-th kernel a center 1 on the corresponding extra channel reproduces the addition.)
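A sketch of the two steps on a plain residual block, assuming a PyTorch-style [out, in, kH, kW] kernel layout and a non-negative block input; names and shapes are illustrative:

```python
import torch

def rm_expand(w1, b1, w2, b2):
    """RM operation on a plain residual block (no downsampling).

    w1, w2: [N, N, 3, 3] kernels of the two 3x3 Convs; the block computes
    conv2(relu(conv1(x))) + x, and the block input x is assumed non-negative.
    Returns expanded kernels with conv2'(relu(conv1'(x))) == conv2(relu(conv1(x))) + x.
    """
    n = w1.shape[0]
    centre_one = torch.zeros(n, n, 3, 3)
    for i in range(n):
        centre_one[i, i, 1, 1] = 1.0
    # Step 1: append N kernels that simply copy the block input through conv1 + ReLU.
    w1_new = torch.cat([w1, centre_one], dim=0)            # [2N, N, 3, 3]
    b1_new = torch.cat([b1, torch.zeros(n)], dim=0)
    # Step 2: give every conv2 kernel N extra channels; kernel i has a centre 1
    # on its i-th extra channel, so the copied input is added back (the residual).
    w2_new = torch.cat([w2, centre_one.clone()], dim=1)    # [N, 2N, 3, 3]
    return w1_new, b1_new, w2_new, b2
```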

Removing the residual: with downsampling

There are two types, the first one is as follows:
(Figure: the first downsampling case)
The 1x1 convolution with stride=2 in the bypass branch is zero-padded into a 3x3 convolution, and it also expands the number of channels (this convolution expanded the channels originally).
At this point the convolution outputs can be both positive and negative (similar to the MobileNetV2 situation discussed above). To preserve the identity mapping, a PReLU is used here: the residual branch (on the left) has alpha = 0, which is equivalent to ReLU, while the bypass branch has alpha = 1, which is the identity map.
Then a Dirac-initialized 3x3 convolution is appended to preserve the identity mapping, and finally everything can be merged into the form on the far right.
(Figure: the second downsampling case)
The second type: first an identity-mapping 3x3 convolution with stride=2 reduces the resolution, then a 3x3 convolution with stride=1 expands the number of channels.
How the two branches are fused: the first-layer convolutions of the two branches are spliced along the kernel (output) dimension; if the left branch has an m-kernel Conv and the right an n-kernel Conv, splicing gives an (m+n)-kernel Conv. The two bottom Convs with s=1 are spliced along the channel dimension. (Whether the channels are expanded in the first or the second layer, the right-hand first-layer Conv only affects the number of kernels, not the number of channels per kernel, so the splicing above is fine. The original 1x1 Conv with s=2 in the second branch becomes a 1x1 Conv with s=1, is padded into a 3x3 Conv, and is then channel-spliced with the left branch's second-layer Conv.)
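A sketch of this splicing with illustrative shapes (chosen to match the parameter counts below):

```python
import torch

C = 4                                       # block input channels (illustrative)
# Residual branch: 3x3 s=2 Conv (C -> 2C), then 3x3 s=1 Conv (2C -> 2C).
w1_res = torch.randn(2 * C, C, 3, 3)
w2_res = torch.randn(2 * C, 2 * C, 3, 3)
# Bypass branch: Dirac-initialised 3x3 s=2 Conv (C -> C), then the original 1x1 s=2
# Conv rewritten as a zero-padded 3x3 s=1 Conv (C -> 2C).
w1_byp = torch.randn(C, C, 3, 3)
w2_byp = torch.randn(2 * C, C, 3, 3)

# First layer: splice along the kernel (output) dimension -> C to 3C, stride 2.
w1_merged = torch.cat([w1_res, w1_byp], dim=0)      # [3C, C, 3, 3]
# Second layer: splice along the channel dimension of each kernel -> 3C to 2C, stride 1.
w2_merged = torch.cat([w2_res, w2_byp], dim=1)      # [2C, 3C, 3, 3]
```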
In terms of parameter count, the second type is smaller:

  • Scheme 1: Conv1(C→4C, 3×3) + PReLU(4C) + Conv2(4C→2C, 3×3) = 108C^2 + 4C (the H and W of the two schemes cancel, so only C is counted; C is the channel count of the block input, and 4C is the expanded channel count: the first layer of the left and right branches each originally expanded to 2C, so after merging it becomes 4C).
  • Scheme 2: Conv1(C→3C, 3×3) + Conv2(3C→2C, 3×3) = 81C^2 (the left branch expands to 2C and the right branch does not expand, so Conv1 outputs 3C in total, and 2C channels are actually output at the end).
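Writing out the 3×3 kernel parameter counts behind these numbers (H×W cancels, so only kernel parameters are counted):

Scheme 1: $3\cdot 3\cdot C\cdot 4C + 3\cdot 3\cdot 4C\cdot 2C = 36C^2 + 72C^2 = 108C^2$, plus $4C$ PReLU parameters.
Scheme 2: $3\cdot 3\cdot C\cdot 3C + 3\cdot 3\cdot 3C\cdot 2C = 27C^2 + 54C^2 = 81C^2$.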


Source: blog.csdn.net/weixin_39994739/article/details/123872722