[Paper Notes] RepLKNet Paper Reading Notes

paper: Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

github: https://github.com/DingXiaoH/RepLKNet-pytorch

aistudio: no GPU? Try the model online with one click

Since the introduction of VGG, CNN architectures have appeared one after another, but most follow VGG's design philosophy: build a large receptive field by stacking many small convolution kernels while keeping the parameter count low (two stacked 3x3 kernels cover the same receptive field as one 5x5 kernel, but with fewer parameters: 2 x 3 x 3 = 18 < 25 = 5 x 5). As ViT gradually reached SOTA performance on various vision tasks, CNNs seemed to fall behind. RepLKNet breaks this pattern: it uses large convolution kernels inside a CNN and achieves state-of-the-art performance on various vision tasks.
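For concreteness, here is a tiny PyTorch check of that parameter arithmetic (single-channel convolutions with bias omitted, purely for illustration):

```python
import torch.nn as nn

# Two stacked 3x3 convolutions see the same 5x5 input region as one 5x5
# convolution, but carry 2 * 3 * 3 = 18 weights instead of 5 * 5 = 25.
stacked = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
)
single = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)

print(sum(p.numel() for p in stacked.parameters()))  # 18
print(sum(p.numel() for p in single.parameters()))   # 25
```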


Contents

1. Introduction

2. Large convolution kernels

1. Large depthwise separable convolutions are actually efficient

2. The identity shortcut structure is very important for large convolution kernels

3. Structural Reparameterization

4. Large convolution kernels benefit downstream tasks more than ImageNet classification

5. Large convolution kernels are also effective on small feature maps (13x13 kernels on 7x7 feature maps)

3. Network structure

4. Experimental results

5. Discussion

1. Large convolutions have a larger receptive field

2. Large convolutions can learn more shape information

3. Dense (ordinary) convolution vs. atrous convolution


1. Introduction

ViT has become a research hotspot as it tops the leaderboards of various tasks. What makes ViT so powerful? Some attribute it to the multi-head self-attention mechanism (MHSA), but many researchers hold different views.

The attention mechanism in ViT makes it easy to capture both global and local information. In CNNs, however, large convolution kernels are rarely used outside the first layer; the receptive field is usually enlarged by stacking many small kernels. Only a few older networks and some architectures found by neural architecture search use kernels larger than 5x5. This raises the question: what happens if we use large convolutions instead of small ones?

The author introduced large depthwise separable convolution kernels into convolutional neural networks for a series of experiments and summarized the following guidelines for using large kernels:

(1) Large-kernel convolutions can still be computed efficiently;

(2) The identity shortcut (residual connection) is very important for networks with large kernels;

(3) Re-parameterizing with small convolution kernels helps with the optimization problem;

(4) Compared with the ImageNet classification task, large-kernel networks gain more on downstream tasks;

(5) Large convolution kernels are effective even when the feature map is small.

2. Large convolution kernels

1. Large depthwise separable convolutions are actually efficient

The larger the convolution kernel, the higher the computational cost, and using depthwise (separable) convolution effectively reduces this cost. On the parallel architecture of modern GPUs, small kernels have low computational density, while larger kernels have higher density, so as the kernel grows, the running time increases far less than the kernel's FLOPs. With low-level optimization of the convolution computation, it can be even faster.
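A minimal PyTorch sketch of why the depthwise form matters, using a hypothetical channel count; note that PyTorch's stock depthwise convolution does not include the optimized large-kernel implementation the paper relies on, so this only illustrates the parameter/FLOP reduction, not the speed:

```python
import torch
import torch.nn as nn

C = 64  # hypothetical channel count, chosen only for illustration

# A dense 31x31 convolution mixes channels and costs C times more parameters
# (and FLOPs) than its depthwise counterpart with groups == channels.
dense_31 = nn.Conv2d(C, C, kernel_size=31, padding=15, bias=False)
depthwise_31 = nn.Conv2d(C, C, kernel_size=31, padding=15, groups=C, bias=False)

print(sum(p.numel() for p in dense_31.parameters()))      # 64 * 64 * 31 * 31
print(sum(p.numel() for p in depthwise_31.parameters()))  # 64 * 1 * 31 * 31

x = torch.randn(1, C, 56, 56)
print(depthwise_31(x).shape)  # torch.Size([1, 64, 56, 56])
```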

2. The identity shortcut structure is very important for large convolution kernels

On MobileNetV2, replacing the 3x3 kernel in the depthwise separable convolution with a 13x13 kernel raises accuracy by 0.77%; but without the identity shortcut, accuracy drops to only 53.98%.
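A rough sketch of the kind of block this ablation is about, with the shortcut made switchable. This is only an illustration under my own simplifications (plain BN + ReLU), not the exact MobileNetV2 variant used in the paper:

```python
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Depthwise large-kernel block with an optional identity shortcut."""

    def __init__(self, channels, kernel_size=13, use_shortcut=True):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn(self.dw(x)))
        # The ablation above suggests that dropping this shortcut makes the
        # large-kernel network much harder to optimize.
        return x + out if self.use_shortcut else out

x = torch.randn(1, 64, 28, 28)
print(LargeKernelDWBlock(64)(x).shape)  # torch.Size([1, 64, 28, 28])
```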

3. Structural Reparameterization

As shown in the figure below, when larger kernels are used on MobileNetV2 and the kernel size grows from 9 to 13, performance drops. With small-kernel re-parameterization, performance improves significantly (the 13x13 kernel then outperforms the 9x9).
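A minimal sketch of the re-parameterization idea: train a small depthwise kernel in parallel with the large one, then fold it into the large kernel for inference. BatchNorm folding is omitted for brevity (the official repository merges BN into each branch first), and the class/method names here are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamLargeKernelConv(nn.Module):
    """Large depthwise conv plus a parallel small-kernel branch (training time)."""

    def __init__(self, channels, large_k=13, small_k=5):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k, padding=large_k // 2,
                               groups=channels, bias=False)
        self.small = nn.Conv2d(channels, channels, small_k, padding=small_k // 2,
                               groups=channels, bias=False)

    def forward(self, x):
        # During training the two branches are simply summed.
        return self.large(x) + self.small(x)

    def merge(self):
        # For inference, zero-pad the small kernel to the large size and add it
        # into the large kernel, leaving a single equivalent convolution.
        pad = (self.large.kernel_size[0] - self.small.kernel_size[0]) // 2
        merged = nn.Conv2d(self.large.in_channels, self.large.out_channels,
                           self.large.kernel_size, padding=self.large.padding,
                           groups=self.large.groups, bias=False)
        merged.weight.data = self.large.weight.data + F.pad(
            self.small.weight.data, [pad] * 4)
        return merged

m = ReparamLargeKernelConv(32)
x = torch.randn(1, 32, 16, 16)
print(torch.allclose(m(x), m.merge()(x), atol=1e-5))  # True
```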

4. Large convolution kernels benefit downstream tasks more than ImageNet classification

The following table compares RepLKNet on the classification task and the semantic segmentation task. As the kernel size increases, the mIoU on semantic segmentation improves more than the classification accuracy. There are two reasons why large kernels are more effective on downstream tasks:

(1) For downstream tasks, the larger the receptive field, the better, and a large convolution kernel can bring a larger receptive field;

(2) For such tasks, shape is more important than texture, but traditional CNNs learn mostly texture features; enlarging the convolution kernel lets the network learn more shape information.

5. Large convolution kernels are also effective on small feature maps (13x13 kernels on 7x7 feature maps)

Enlarge the kernels in the last stage of MobileNetV2, where the feature map is only 7x7. As the table below shows, large convolution kernels are effective even on small feature maps.
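A quick shape check with a hypothetical channel count: with "same" padding, a 13x13 depthwise kernel applies cleanly to a 7x7 feature map, and every output position sees the entire map:

```python
import torch
import torch.nn as nn

C = 320  # hypothetical channel count for the last stage, chosen for illustration
conv = nn.Conv2d(C, C, kernel_size=13, padding=6, groups=C, bias=False)

x = torch.randn(1, C, 7, 7)
print(conv(x).shape)  # torch.Size([1, 320, 7, 7])
```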

3. Network structure

The network structure is shown below.
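Since the figure is not reproduced here, below is only a rough PyTorch sketch of the two block types that, as I read the paper, alternate inside each stage: a RepLK-style block (1x1 conv, depthwise large-kernel conv, 1x1 conv, with a shortcut) and a ConvFFN block. The official implementation adds BatchNorm, the small-kernel re-param branch, drop path, and other details omitted here:

```python
import torch
import torch.nn as nn

class RepLKBlockSketch(nn.Module):
    """Simplified RepLK-style block: 1x1 -> depthwise large-kernel -> 1x1."""

    def __init__(self, channels, kernel_size=31):
        super().__init__()
        self.pw1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.pw2 = nn.Conv2d(channels, channels, 1, bias=False)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.pw2(self.act(self.dw(self.act(self.pw1(x)))))

class ConvFFNSketch(nn.Module):
    """Simplified ConvFFN block: two 1x1 convs with GELU in between."""

    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.net(x)

x = torch.randn(1, 128, 28, 28)
print(ConvFFNSketch(128)(RepLKBlockSketch(128)(x)).shape)  # (1, 128, 28, 28)
```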

4. Experimental results

1. Image classification

2. Semantic segmentation

3. Object detection

5. Discussion

1. Large convolutions have a larger receptive field

As the figure below shows, the receptive fields of ResNet-101 and ResNet-152 are almost the same, while for RepLKNet the receptive field clearly grows as the kernel size increases.
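A simplified way to probe this kind of effective receptive field is to backpropagate from the central output activation and look at the input-gradient magnitudes. This is only a rough stand-in for the paper's visualization, demonstrated on a toy stack of depthwise 13x13 convolutions:

```python
import torch
import torch.nn as nn

def erf_map(model, input_size=224, channels=3):
    """Gradient of the central output activation w.r.t. the input,
    aggregated over channels, as a crude effective-receptive-field map."""
    x = torch.randn(1, channels, input_size, input_size, requires_grad=True)
    y = model(x)
    center = y[..., y.shape[-2] // 2, y.shape[-1] // 2].sum()
    center.backward()
    return x.grad.abs().sum(dim=1).squeeze(0)  # (H, W) contribution map

# Toy model (hypothetical): four stacked depthwise 13x13 convolutions.
toy = nn.Sequential(*[nn.Conv2d(3, 3, 13, padding=6, groups=3, bias=False)
                      for _ in range(4)])
print(erf_map(toy).shape)  # torch.Size([224, 224])
```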

2. Large convolutions can learn more shape information

Human vision relies more on object shape when classifying images, whereas convolutional networks learn mostly texture information. The figure below compares the proportion of shape information learned by several networks on 16 object categories; as the kernel size increases, the network learns more shape information.

3. Dense (ordinary) convolution vs. atrous convolution

Atrous (dilated) convolution is often used to enlarge the receptive field. The table below compares dilated convolution with the large-kernel convolution on MobileNetV2. The sparse sampling of dilated convolution cannot capture enough information, which degrades model performance.
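A small single-channel comparison of the two: a 3x3 kernel with dilation 6 spans the same 13x13 window as a dense 13x13 kernel, but samples only 9 of its 169 positions:

```python
import torch.nn as nn

# Both layers cover a 13x13 window, but the dilated one is far sparser.
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=6, padding=6, bias=False)
dense = nn.Conv2d(1, 1, kernel_size=13, padding=6, bias=False)

print(sum(p.numel() for p in dilated.parameters()))  # 9
print(sum(p.numel() for p in dense.parameters()))    # 169
```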


Origin blog.csdn.net/qq_40035462/article/details/123642988