[Paper Notes] RepVGG: Making VGG-style ConvNets Great Again, and an Analysis of the Use and Effect of Network Reparameterization

  This article studies RepVGG and several other papers by the same author in the field of network reparameterization, summarizes the main principles, tests the effect of the reparameterization method, and analyzes its value and significance.
  RepVGG was accepted at CVPR 2021. Its author is Dr. Ding Xiaohan of Tsinghua University, who has several other papers in this field:

  ACNet:https://arxiv.org/abs/1908.03930
  RepVGG:https://arxiv.org/abs/2101.03697
  DiverseBranchBlock:https://arxiv.org/abs/2103.13425
  ResRep:https://arxiv.org/abs/2007.03260
  RepMLP:https://arxiv.org/abs/2105.01883

  The author explains these papers in more detail in his Zhihu column: https://www.zhihu.com/column/c_168760745 . The code is all open source at https://github.com/DingXiaoH .
  Here I organize the key points of the papers according to my own understanding and put them into practice, in order to deepen my study.

1. The principle of network reparameterization

  This series of papers revolves around the technique of network reparameterization, which can be regarded as a new technique in the field of model compression. Its main idea is to expand a KxK convolutional layer into a parallel or serial combination of several specific layers. The expanded network is more complex and has more parameters, so it is expected to learn better during training; at inference time, the parameters of the large network are equivalently converted back into the original KxK convolutional layer. The network thus enjoys both the stronger learning ability of the large network during training and the higher speed of the small network during inference.
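  As a rough illustration of this train-then-fold workflow, here is a minimal sketch in PyTorch. The `fold()` method and the recursion over child modules are naming conventions I assume for illustration, not an API from the author's repository.

```python
# Hypothetical sketch: after training, walk the model and replace every
# multi-branch block that knows how to fold itself with its equivalent
# single convolution. "fold()" is an assumed convention, not official code.
import torch.nn as nn

def reparameterize(model: nn.Module) -> nn.Module:
    for name, child in model.named_children():
        if hasattr(child, "fold"):
            setattr(model, name, child.fold())   # multi-branch -> single conv
        else:
            reparameterize(child)                # recurse into submodules
    return model

# Usage after training: deploy_model = reparameterize(trained_model).eval()
# The deployed model produces mathematically identical outputs but runs
# with plain KxK convolutions only.
```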
  In the DBB (DiverseBranchBlock) paper, the author lists six structures that can be equivalently converted to a KxK convolutional layer (Figure 2 of that paper); Transform VI among them is the structure from the ACNet paper. Besides these six, there is one more structure not covered in the DBB paper, namely the one used in RepVGG. I draw all of them in Figure 1 below.

Figure 1. Various modules that can be equivalently reparameterized into a KxK convolution

  These structures can be summarized by the following points:
(1) "BN absorption": a KxK convolutional layer can be merged with the BN layer that follows it, converting the BN parameters into the convolution's weights and bias and thereby removing the BN layer (a verification sketch is given after Figure 2 below).
(2) Convolution is a linear operation and satisfies superposition, f(ax + by) = af(x) + bf(y); in particular conv(x; W1) + conv(x; W2) = conv(x; W1 + W2), so two parallel KxK convolutions that are added together can be replaced by a single KxK convolution whose kernel is the sum of the two kernels.
(3) A 1x1 convolution can be viewed as a KxK convolution that is zero everywhere except at the center (provided K is odd); a 1xK or Kx1 convolution can be viewed as a KxK convolution whose outer rows or columns are zero; the identity mapping is a special 1x1 convolution, and corresponds to a KxK convolution with a CxCxKxK kernel in which the CxC part is the identity matrix: the center of each KxK slice on the diagonal is 1 and everything else is 0.
(4) A 1x1 convolution followed by a KxK convolution (in series) can be converted into a single KxK convolution by multiplying their kernels (the multiplication needs some transpose operations; see the original paper for the formula).
(5) A KxK average pooling is equivalent to a KxK convolution whose within-channel kernel elements are all 1/(K*K) and whose cross-channel elements are zero.
(6) Because of the superposition property, the structures above can be stacked and combined without limit; for example, the figure below shows the DBB module given in the paper.


Figure 2. DBB module
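  To make points (1) to (3) concrete, below is a minimal verification sketch in PyTorch (my own code, not the author's official implementation): it folds a BN layer into the preceding convolution, pads a 1x1 kernel into a 3x3 kernel, and checks that a 3x3 + 1x1 two-branch block collapses into a single 3x3 convolution with identical outputs. All layer sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return (kernel, bias) of a single conv equivalent to conv followed by bn (eval mode)."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                          # per-output-channel factor
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        bias = bias + conv.bias * scale
    return kernel, bias

torch.manual_seed(0)
C = 4
conv3 = nn.Conv2d(C, C, 3, padding=1, bias=False)
conv1 = nn.Conv2d(C, C, 1, bias=False)
bn3 = nn.BatchNorm2d(C).eval()
bn1 = nn.BatchNorm2d(C).eval()
# Give the BN layers non-trivial statistics so the check is meaningful.
for bn in (bn3, bn1):
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2)
    bn.weight.data.uniform_(0.5, 2)
    bn.bias.data.uniform_(-1, 1)

x = torch.randn(2, C, 8, 8)
y_branches = bn3(conv3(x)) + bn1(conv1(x))           # train-time two-branch block

k3, b3 = fuse_conv_bn(conv3, bn3)                    # point (1): BN absorption
k1, b1 = fuse_conv_bn(conv1, bn1)
k1_padded = F.pad(k1, [1, 1, 1, 1])                  # point (3): 1x1 kernel -> 3x3 kernel
y_merged = F.conv2d(x, k3 + k1_padded, b3 + b1, padding=1)  # point (2): sum of kernels

print(torch.allclose(y_branches, y_merged, atol=1e-5))  # True
```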

  RepMLP is the author's most recent paper. It points out that not only can a KxK convolution be made equivalent to a combination of more complex structures, but a fully connected layer (MLP) is also a linear operation, and a convolution can therefore be expressed as an equivalent fully connected layer.
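  As a quick sanity check of this claim (not the RepMLP construction itself, which is more involved), the sketch below builds the dense matrix of a small convolution by feeding unit impulses through it and verifies that a single fully connected layer reproduces the convolution's output. The shapes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C_in, C_out, K, H, W = 2, 3, 3, 5, 5
conv = nn.Conv2d(C_in, C_out, K, padding=K // 2, bias=False)
x = torch.randn(1, C_in, H, W)

# Build the equivalent dense matrix column by column: feed unit impulses
# through the conv and record its responses (possible because conv is linear).
eye = torch.eye(C_in * H * W)
dense = torch.stack([conv(e.view(1, C_in, H, W)).flatten() for e in eye], dim=1)

y_conv = conv(x).flatten()
y_fc = dense @ x.flatten()                 # one fully connected layer
print(torch.allclose(y_conv, y_fc, atol=1e-5))  # True
```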

2. The purpose and significance of network reparameterization

  This method is mainly useful for KxK convolutions, and it is of course most effective on the pure 3x3 convolutions of a VGG. In RepVGG the author demonstrates the striking effect of the technique on VGG: a plain VGG-style network that matches current SOTA models is really eye-catching. However, a pure VGG also consumes a lot of GPU memory and is not well suited to deployment on edge devices, which defeats the original purpose of model compression, so I think RepVGG itself has limited practical significance. In addition, network reparameterization cannot compress an arbitrary model; it can only compress certain specific structures into a KxK convolution, so it is not a general method.
  However, KxK convolutions are very common in mainstream neural networks. If these KxK layers are expanded into more complex equivalent structures for training and then reparameterized back for inference, the model's accuracy may improve without affecting inference speed, which could be a useful point.

3. My test

  To understand this method more deeply and test its effect, I designed an experiment. To keep the test objective, I did not use CIFAR10, ImageNet, VGG, ResNet and so on, which the author has already tested, but picked a task more or less at random: a competition on AI for wireless communication that I had just finished. The current mainstream method for this task is CRNet, which contains a variety of convolutional layers; we replace them with equivalent multi-branch modules and observe whether training improves. The structure of CRNet is shown in the figure below. The network compresses and encodes a wireless signal on the user side and decodes it after it is sent to the base station, aiming for a high recovery rate; for details see the original paper https://arxiv.org/abs/1910.14322 . We do not need to worry about the step that reparameterizes the trained network back into KxK convolutional layers for inference, because it is a strict mathematical derivation and cannot go wrong. What is uncertain is whether training the expanded multi-branch network really helps, so that is what we focus on.

Figure 3. CRNet

  This network contains 3x3, 5x5, 1x9, 9x1, 1x5 and 5x1 convolutions. We can try replacing these convolutional layers with DBB or other equivalent parallel structures for training, and then fold the parameters back into the original CRNet structure at inference time, as sketched below.
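  As an illustration of the simplest of these replacements, the hypothetical `RepKx1` block below trains a Kx1 convolution together with a parallel 1x1 branch and then folds the branch back into a single Kx1 convolution for inference. It is my own sketch under the assumption K = 9, not code taken from CRNet or the author's repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepKx1(nn.Module):
    """Train-time Kx1 + 1x1 two-branch block that folds to a single Kx1 conv."""
    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.k = k
        self.conv_kx1 = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.conv_1x1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv_kx1(x) + self.conv_1x1(x)

    def fold(self) -> nn.Conv2d:
        """Merge the 1x1 branch into the Kx1 conv for deployment."""
        fused = nn.Conv2d(self.conv_kx1.in_channels, self.conv_kx1.out_channels,
                          (self.k, 1), padding=(self.k // 2, 0))
        # Pad the 1x1 kernel to Kx1: zeros everywhere except the center position.
        k1 = F.pad(self.conv_1x1.weight.data, [0, 0, self.k // 2, self.k // 2])
        fused.weight.data = self.conv_kx1.weight.data + k1
        fused.bias.data = self.conv_kx1.bias.data + self.conv_1x1.bias.data
        return fused

x = torch.randn(1, 8, 16, 16)
block = RepKx1(8).eval()
print(torch.allclose(block(x), block.fold()(x), atol=1e-5))  # True
```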

Table 1. Training results after replacing some of CRNet's layers with equivalent branch structures

| Network structure | Params | FLOPs | Training time/epoch | Recovery rate (ep10) | Recovery rate (ep100) |
| --- | --- | --- | --- | --- | --- |
| CRNet | 926k | 711M | 234s | 0.8043 | 0.8152 |
| KxK -> DBB, Kx1 -> Kx1 + 1x1 | 2273k | 1744M | 590s | 0.8007 | - |
| KxK -> DBB | 2157k | 1656M | 480s | 0.8037 | 0.8173 |
| Kx1 -> Kx1 + 1x1 | 926k | 711M | 234s | 0.7996 | - |

Note: for better competition results, the CRNet used here has more middle-layer channels than in the original paper, namely 128.
  The table shows that after equivalent replacement, training a larger network with more branches is not necessarily better. In this example, adding a parallel 1x1 branch to the Kx1 and 1xK convolutions is harmful: the training result actually gets worse. In other cases the larger equivalent network does help: replacing the KxK convolutions with DBB improves the result slightly, by about 0.002 over the original CRNet after 100 epochs, but training is roughly twice as slow.
  My analysis is that CRNet itself already uses a multi-branch structure and does not contain many KxK convolutions, so the method has little effect here. For a network with fewer branches and a large number of KxK convolutions, the effect may be more obvious.

4. Summary

  Reparameterization cleverly replaces a KxK convolutional layer with parallel branch structures whose parameters can be linearly combined. This improves training accuracy at the cost of some training speed, and the parameters are mapped back into the original KxK convolutional layer at inference time, so inference speed is unchanged while accuracy improves. The approach only works for certain structures and cannot reparameterize an arbitrary network into a smaller one. It should work better for regular networks that have few branches and many KxK convolutions, and it is best suited to scenarios that demand high inference speed on hardware that runs KxK convolutions efficiently. Because of these constraints, I personally think the practical value of this method is still limited, but it is very instructive: it shows that the parameters of a large network can be mapped onto a small network while keeping the small network's inference speed unchanged.

Origin blog.csdn.net/Brikie/article/details/120086924