Detailed explanation of the RepLKNet paper: a super-large 31×31 convolution kernel model

Working hard for a smooth graduation! The name is read as Rep-L-K-Net.

Paper Title: Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

Original paper address: https://arxiv.org/pdf/2203.06717.pdf 

Code: GitHub - DingXiaoH/RepLKNet-pytorch: Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs (CVPR 2022)

Table of contents

1. What is the main purpose of the paper?

2. Reasons for large convolution kernel architecture design

2.1 Advantages

2.2 Problems

2.3 Guiding Principles of Large Convolution Kernel Architecture

3. Overall structure of RepLKNet

4. Summary


1. What is the main purpose of the paper?

(1) In convolutional neural networks, using a small number of large convolution kernels works better than stacking a large number of small ones

(2) It proposes RepLKNet, a CNN architecture built around large convolution kernels, with kernel sizes up to 31×31. It is significantly stronger than traditional CNN architectures on classification, detection, and segmentation, achieves performance comparable to or better than mainstream Vision Transformers, and runs more efficiently

(3) For ViT, the large receptive field of the self-attention module is an important reason for its excellent performance. With a large-kernel design, a CNN architecture can reach comparable performance and, in properties such as shape bias, behaves more like ViT

2. Reasons for large convolution kernel architecture design

2.1 Advantages

(1) Compared with simply going deeper, large convolution kernels improve the effective receptive field more efficiently

According to effective receptive field theory, the effective receptive field is proportional to the kernel size and to the square root of the number of convolution layers. In other words, directly enlarging the convolution kernel is a more effective way to grow the receptive field than adding more depth.
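As a rough back-of-the-envelope illustration of this proportionality (constants are ignored, and the layer counts below are arbitrary examples, not numbers from the paper):

```python
import math

def relative_erf(kernel_size: int, num_layers: int) -> float:
    # ERF grows linearly with kernel size but only with the square root of depth
    return kernel_size * math.sqrt(num_layers)

print(relative_erf(3, 100))  # 30.0 -- one hundred stacked 3x3 layers
print(relative_erf(31, 4))   # 62.0 -- just four 31x31 layers
```

Under this crude model, a handful of 31×31 layers already covers a larger effective receptive field than a very deep stack of 3×3 layers.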

(2) Large convolution kernels partially avoid the optimization problems brought by increasing model depth

ResNet can seemingly be made very deep, even to hundreds of layers, but its effective depth is not that large: much of the signal passes through the shortcut branches, which does not increase the effective depth. As the paper's figure shows, going from ResNet-101 to ResNet-152 greatly increases the depth but leaves the effective receptive field essentially unchanged, whereas increasing the kernel size from 13 to 31 enlarges the effective receptive field very significantly.

(3) FCN-based (fully convolutional) downstream vision tasks benefit more significantly from large convolution kernels

2.2 Problems

(1) Large convolution kernels are not efficient enough: the amount of computation grows quadratically with the kernel size

(2) Large convolution kernels have difficulty capturing local features and are prone to over-smoothing

(3) Compared with the self-attention module, the inductive bias of convolution is too strong, which limits its representation ability on large datasets

2.3 Guiding Principles of Large Convolution Kernel Architecture

(1) Use structures such as depthwise convolution to sparsify the large convolution, supplemented by appropriate low-level optimization (see the sketch at the end of this subsection)

(2) Identity shortcut is very important in large convolution kernel design

(3) Use small convolution kernels for re-parameterization to avoid the over-smoothing problem

(4) Pay attention to performance on downstream tasks, not just ImageNet accuracy numbers

       For downstream tasks, a larger receptive field is generally better, and large convolution kernels provide exactly that; for classification, shape matters more, yet traditional CNNs tend to learn texture features, and enlarging the convolution kernel pushes the network to learn more shape information.

(5) Large convolution kernels are also useful on small feature maps, and large-kernel models can be trained at conventional resolutions

       Based on the above principles and borrowing the macro-architecture of Swin Transformer, the paper proposes the RepLKNet structure. The difference is that it replaces Swin Transformer's window attention with large depthwise convolutions and applies re-parameterization to the large convolutions; the small kernels are absorbed into the large ones at inference time, so they add no extra computational cost.
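As a minimal PyTorch sketch of guidelines (1) and (2), the block below wraps a 31×31 depthwise convolution in an identity shortcut (the class name, the use of BatchNorm, and the channel count are illustrative assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class LargeDepthwiseBlock(nn.Module):
    """A 31x31 depthwise convolution with an identity shortcut (sketch)."""
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        # groups=channels makes the convolution depthwise, so its cost scales with
        # C * k * k instead of C * C * k * k as in a dense convolution
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # identity shortcut: the large-kernel branch only has to learn a residual
        return x + self.bn(self.dw(x))

x = torch.randn(1, 64, 56, 56)
print(LargeDepthwiseBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```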

3. Overall structure of RepLKNet

The overall layout follows the Swin Transformer structure. The main change is replacing attention with super-large convolutions and their accompanying structures, plus a few CNN-style adjustments. Following the five guidelines above, the design elements of RepLKNet include shortcuts, super-large depthwise kernels, and small-kernel re-parameterization.

The structure of RepLKNet is shown in the figure, and the details of each module are as follows:

(1) Stem: since RepLKNet mainly targets downstream tasks, it needs to capture more detail in the early stage of the network. After an initial stride-2 3×3 convolution for downsampling, a 3×3 depthwise convolution extracts low-level features, followed by a 1×1 convolution and another 3×3 depthwise convolution for further downsampling.
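A minimal sketch of this stem order (the channel width is assumed, and normalization/activation layers are omitted for brevity):

```python
import torch.nn as nn

def stem(in_channels: int = 3, width: int = 128) -> nn.Sequential:
    # Sketch of the stem order described above:
    # 3x3 stride-2 conv -> 3x3 depthwise -> 1x1 conv -> 3x3 depthwise stride-2
    return nn.Sequential(
        nn.Conv2d(in_channels, width, 3, stride=2, padding=1),          # 2x downsampling
        nn.Conv2d(width, width, 3, padding=1, groups=width),            # depthwise, keeps low-level detail
        nn.Conv2d(width, width, 1),                                     # 1x1 channel mixing
        nn.Conv2d(width, width, 3, stride=2, padding=1, groups=width),  # depthwise, another 2x downsampling
    )
```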

(2) Stages 1-4: the core layers of RepLKNet, built by stacking RepLK Blocks and ConvFFNs. A RepLK Block contains a normalization layer, 1×1 convolutions, a depthwise separable convolution, and, most importantly, a residual connection.

Following guideline (3), each depthwise convolution has a parallel 5×5 depthwise convolution for structural re-parameterization. Besides the receptive field and spatial feature extraction ability, the representational power of the model also depends on the feature dimension. To add non-linearity and information exchange between channels, a 1×1 convolution is used to increase the feature dimension before the depthwise convolution.
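To make the re-parameterization step concrete, here is a hedged sketch of how a parallel 5×5 depthwise branch can be folded into the 31×31 kernel at inference time: zero-pad the small kernel to 31×31 and add the weights and biases (a full implementation would also fold any BatchNorm layers first, which this sketch omits):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

channels = 64
big = nn.Conv2d(channels, channels, 31, padding=15, groups=channels)
small = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)

# Training time: the two parallel branches are simply summed
x = torch.randn(1, channels, 56, 56)
y_train = big(x) + small(x)

# Inference time: pad the 5x5 kernel to 31x31 and absorb it into one convolution
merged = nn.Conv2d(channels, channels, 31, padding=15, groups=channels)
pad = (31 - 5) // 2  # 13 zeros on every side of the small kernel
merged.weight.data = big.weight.data + F.pad(small.weight.data, [pad] * 4)
merged.bias.data = big.bias.data + small.bias.data

print(torch.allclose(y_train, merged(x), atol=1e-4))  # True: same function, single conv
```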

(3) ConvFFN: referring to the Feed-Forward Network (FFN) used in Transformers and MLP networks, the paper proposes a CNN-style ConvFFN, consisting of a shortcut connection and two 1×1 convolutions with GELU in between, using 1×1 convolutions in place of fully connected layers. Like the RepLK Block, ConvFFN uses a residual connection. In practice, the intermediate features of ConvFFN are 4× the input dimension. Following ViT and Swin, a ConvFFN is placed after each RepLK Block.
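A minimal sketch of such a ConvFFN, assuming BatchNorm as the normalization layer and the 4× expansion mentioned above:

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """CNN-style FFN: norm -> 1x1 conv -> GELU -> 1x1 conv, with a shortcut (sketch)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion               # intermediate features are ~4x the input
        self.norm = nn.BatchNorm2d(channels)
        self.fc1 = nn.Conv2d(channels, hidden, 1)   # 1x1 conv in place of a fully connected layer
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        return x + self.fc2(self.act(self.fc1(self.norm(x))))  # shortcut connection
```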

(4) Transition Blocks: placed between stages; they first use a 1×1 convolution to expand the feature dimension and then use 3×3 depthwise convolution for 2× downsampling.
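A minimal sketch of such a transition block, under the assumption that a single stride-2 3×3 depthwise convolution handles the 2× downsampling (normalization/activation layers omitted):

```python
import torch.nn as nn

def transition(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 1),                        # expand the feature dimension
        nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1,
                  groups=out_channels),                                 # depthwise 2x downsampling
    )
```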
 

4. Summary

(1) Large convolutions have a larger receptive field: a single layer with a large kernel is more effective than many layers of small kernels.

(2) Large convolutions learn more shape information: RepLKNet with large kernels pays more attention to shape features; when the kernel is reduced (RepLKNet-3), it shifts back toward texture and context features.

(3) Dense (ordinary) convolution vs. dilated convolution: dilated convolution inserts holes between adjacent kernel elements so that the same kernel covers a larger receptive field, which is a common way to enlarge the region a convolution sees. The paper compares dilated depthwise convolution with ordinary depthwise convolution: although the maximum receptive field can be the same, the representational power of the dilated version is much weaker and its accuracy drops very noticeably. In other words, although a dilated convolution has a large receptive field, its computation actually uses very few features. RepLKNet instead uses depthwise separable convolution to make large kernels practical; it also enlarges the receptive field and achieves better results, while a MobileNet V2 built purely from dilated convolutions performs worse than the original model. Dilated convolution is quite interesting in its own right; I'll write a separate article once I understand it better.
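A small illustration of the contrast (the kernel size and dilation below are chosen only so that both layers cover the same 31×31 maximum receptive field; they are not the paper's ablation settings):

```python
import torch.nn as nn

channels = 64
# Dense depthwise 31x31: every output looks at 31*31 = 961 input positions
dense = nn.Conv2d(channels, channels, 31, padding=15, groups=channels)
# Dilated depthwise 3x3 with dilation 15: the same 31x31 maximum receptive field,
# but only 3*3 = 9 input positions actually contribute to each output
dilated = nn.Conv2d(channels, channels, 3, padding=15, dilation=15, groups=channels)

print(sum(p.numel() for p in dense.parameters()))    # 61568 -- dense sampling of the window
print(sum(p.numel() for p in dilated.parameters()))  # 640   -- very sparse sampling of the window
```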

That's it, done. A bit of a quick post, but I can do this!

Origin blog.csdn.net/Zosse/article/details/127024471