CVPR 2022 | Convolution kernels as large as 31x31, boosting accuracy while staying efficient! Tsinghua & Megvii propose RepLKNet, a new vision backbone...


This article is reproduced from: Ding Situ public account

How long has it been since you adjusted the kernel size?

When you tune the depth, width, groups, and input resolution of a convolutional network (CNN), do you ever recall that there is another design dimension, kernel size, that has always been in plain sight yet is routinely ignored, defaulting to 3x3 or 5x5?

When you are tired of tuning Transformer hyperparameters, don't you wish for a simple, efficient, easy-to-deploy model whose downstream performance is no weaker than a Transformer's?

Our work published in CVPR 2022 shows that kernel size is a very important yet long-overlooked design dimension in CNNs. With the support of modern model design, the larger the convolution kernel, the better the results: kernels as large as 31x31 work very well (as shown in the table below, where the left column lists the kernel size used in each of the model's four stages)! Even on large-scale downstream tasks, our proposed large-kernel model RepLKNet performs better than or on par with Transformers such as Swin!

[Table: ImageNet and downstream results for different per-stage kernel sizes; the left column lists the kernel sizes of the four stages]

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

Paper: https://arxiv.org/abs/2203.06717

MegEngine code and model:

https://github.com/megvii-research/RepLKNet

PyTorch code and model:

https://github.com/DingXiaoH/RepLKNet-pytorch



TL;DR version

Here is a two-minute summary.

A. What does this work contribute to the community's knowledge and understanding of CNNs and Transformers?

We challenge the following habitual assumptions:

1. Do very large kernels not only fail to add accuracy, but actually lose it? We show that the fact that very large kernels were not useful in the past does not mean they cannot be useful now; our scientific understanding always spirals upward. With the support of modern CNN design (shortcuts, re-parameterization, etc.), the larger the kernel, the higher the accuracy!

2. Are very large kernels hopelessly inefficient? We found that very large depth-wise convolutions add few FLOPs, and with some low-level optimization they run even faster: the compute density of a 31x31 kernel can be up to 70 times that of a 3x3 (see the sketch after this list)!

3. Can large kernels only be used on large feature maps? We found that even a 13x13 convolution on a 7x7 feature map brings a small gain.

4. Does ImageNet accuracy tell the whole story? We found that performance on downstream tasks (object detection, semantic segmentation, etc.) is not necessarily correlated with ImageNet accuracy.

5. Do very deep CNNs (such as ResNet-152), which stack many 3x3 convolutions, therefore have large receptive fields? We found that the effective receptive field of deep small-kernel models is actually quite small; conversely, a few large kernels yield a very large effective receptive field.

6. Are Transformers (ViT, Swin, etc.) strong on downstream tasks because the nature of self-attention (the Query-Key-Value formulation) is inherently superior? Our experiments with large kernels suggest that kernel size may be the real key to downstream gains.
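As a concrete illustration of point 2, here is a minimal sketch (not from the paper; the channel count and feature-map size are arbitrary choices) comparing parameter and FLOP counts for a 31x31 depth-wise convolution, a 3x3 depth-wise convolution, and an ordinary dense 3x3 convolution:

```python
import torch.nn as nn

# Illustrative settings (not from the paper): 256 channels, 14x14 feature map.
C, H, W = 256, 14, 14

dw31   = nn.Conv2d(C, C, kernel_size=31, padding=15, groups=C, bias=False)  # depth-wise 31x31
dw3    = nn.Conv2d(C, C, kernel_size=3,  padding=1,  groups=C, bias=False)  # depth-wise 3x3
dense3 = nn.Conv2d(C, C, kernel_size=3,  padding=1,            bias=False)  # dense 3x3

def conv_flops(conv: nn.Conv2d, out_h: int, out_w: int) -> int:
    """Multiply-accumulate count: one MAC per weight per output position."""
    return conv.weight.numel() * out_h * out_w

for name, conv in [("dw 31x31", dw31), ("dw 3x3", dw3), ("dense 3x3", dense3)]:
    print(f"{name}: params={conv.weight.numel()}, FLOPs={conv_flops(conv, H, W)}")
# The depth-wise 31x31 has ~106x the FLOPs of the depth-wise 3x3 (961/9),
# yet still fewer parameters and FLOPs than the ordinary dense 3x3.
```

The FLOP ratio between the two depth-wise layers is the 106x mentioned in the text, but because depth-wise convolutions have low compute density, the wall-clock gap is far smaller, which is exactly what the low-level optimization in MegEngine exploits.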

B. What specific work did we do?

1. Through a series of exploratory experiments, we summarize five guidelines for applying large kernels in modern CNNs:

    a. Use depth-wise large convolutions, preferably with low-level optimization (already integrated into the open-source framework MegEngine)

    b. Add identity shortcuts

    c. Re-parameterize with small kernels (i.e., the structural re-parameterization methodology; see our RepVGG from last year, reference [1])

    d. Judge large kernels by downstream performance, not just by ImageNet accuracy

    e. Large kernels are also useful on small feature maps, and large-kernel models can be trained at ordinary resolutions

2. Based on these guidelines and simply borrowing the macro-architecture of Swin Transformer, we propose RepLKNet, an architecture that uses many very large convolutions such as 27x27 and 31x31. The rest of the architecture is extremely simple, consisting only of structures such as 1x1 convolutions and Batch Norm, with no attention at all.

3. Based on large kernels, we discuss and analyze topics such as the effective receptive field, shape bias (does the model rely on object shape or local texture when making decisions?), and the reasons behind Transformers' strong performance. We found that the effective receptive field of traditional deep small-kernel models such as ResNet-152 is actually not large, whereas large-kernel models not only have larger effective receptive fields but are also more human-like (higher shape bias). The key to Transformers may be the large receptive field rather than the specific form of self-attention. For example, the figure below shows the effective receptive fields of ResNet-101, ResNet-152, a RepLKNet with 13x13 kernels throughout, and a RepLKNet with kernels as large as 31x31; the shallower large-kernel models clearly have much larger effective receptive fields.

[Figure: effective receptive fields of ResNet-101, ResNet-152, RepLKNet-13, and RepLKNet-31]

C. How well does the proposed architecture RepLKNet work?

1. On ImageNet, RepLKNet is comparable to Swin-Base. With additional training data, the largest model reaches an accuracy of 87.8%. Very large kernels were never designed for chasing ImageNet numbers, so this result can be considered satisfactory.

2. On Cityscapes semantic segmentation, RepLKNet-Base with only ImageNet-1K pretraining even surpasses Swin-Large with ImageNet-22K pretraining, a win across both model scale and data scale.

3. On ADE20K semantic segmentation, the ImageNet-1K-pretrained models far exceed traditional small-kernel CNNs such as ResNet and ResNeSt. The Base-level model clearly surpasses Swin, the Large model is comparable to Swin, and the largest model achieves an mIoU of 56%.

4. On COCO object detection, RepLKNet exceeds the traditional ResNeXt-101 of comparable size by a large margin (more than 4.4 mAP), is comparable to Swin, and reaches 55.5% mAP at the largest scale.

The following is a detailed introduction

Motivation: why do we need large kernels?

Why study something as retro-sounding as large kernels in this day and age?

1. To revive a design element that was "killed off by mistake" and restore the reputation of large kernels. Historically, AlexNet used 11x11 convolutions, but after VGG appeared, large kernels were gradually abandoned, marking a paradigm shift from shallow models with large kernels to deep models with small kernels. The reasons included the poor efficiency of large kernels (the parameters and computation of a convolution grow quadratically with kernel size) and the observation that accuracy became worse as kernels grew. But times have changed: can large kernels, which did not work historically, work with the support of modern techniques?

2. To overcome an inherent flaw of traditional deep small-kernel CNNs. We used to believe a large kernel could be replaced by several small ones; for example, one 7x7 can be replaced by three stacked 3x3s, which is cheaper (3x3x3 < 1x7x7) and, in principle, better (deeper, more nonlinearity). Some readers may think that although deep stacks of small kernels are prone to optimization problems, ResNet has already solved this (ResNet-152 contains 50 layers of 3x3 convolutions), so where is the flaw? The price ResNet pays is that even though the model's theoretical maximum receptive field is large, its actual effective depth is not deep (Reference 2), so the effective receptive field is not large either. This may also be why traditional CNNs match Transformers on ImageNet yet are generally inferior on downstream tasks. In other words, ResNet essentially helps us sidestep the problem that deep models are hard to optimize, rather than truly solving it. Given this fundamental limitation of the deep, small-kernel paradigm, how would a shallow, large-kernel design paradigm fare?

3. To understand why Transformers work. Transformers are known to perform well, especially on downstream tasks such as detection and segmentation. The basic component of a Transformer is self-attention, whose essence is a Query-Key-Value operation performed at a global scale or within a large window. So is the source of the Transformer's power the Query-Key-Value formulation itself? We conjecture that the "global scale or large window" may be the real key; verifying this on the CNN side requires large convolution kernels.

Exploratory experiments

To understand how large kernels should be used, we conducted a series of exploratory experiments on MobileNet V2 and distilled five guidelines. The details are omitted here; only the conclusions are stated:

1. Depth-wise large kernels can be quite efficient. With our optimization (integrated into the open-source framework MegEngine), the running time of a 31x31 depth-wise convolution can be as low as 1.5 times that of a 3x3, while its FLOPs are 106 times higher (31x31/9), which means the former's efficiency (FLOPs processed per unit time) is 71 times the latter's!

2. Without an identity shortcut, enlarging the kernel hurts badly (a 15% accuracy drop on ImageNet); with a shortcut, enlarging the kernel improves accuracy.

3. To push the kernel size further, from large to very large, re-parameterize with small kernels (Reference 1): add a parallel 3x3 or 5x5 convolution during training and merge it equivalently into the large kernel after training (a minimal sketch of this merge follows this list). In this way, the model can effectively capture features at different scales. We also found that the smaller the dataset and the model, the more important re-parameterization is; conversely, on our very large dataset MegData73M, the gain from re-parameterization is small (0.1%). This echoes ViT: the larger the data, the less important the inductive bias.

4. What we care about is gains on the target task, not on ImageNet; ImageNet accuracy does not necessarily transfer to downstream tasks. As the kernel keeps growing, ImageNet accuracy plateaus, but Cityscapes and ADE20K semantic segmentation can still improve by one or two points, while the extra parameters and computation brought by the larger kernel are tiny, so the trade-off is extremely cost-effective!

5. Somewhat counter-intuitively, using a 13x13 kernel on a small 7x7 feature map also helps a little. In other words, large-kernel models do not necessarily need high-resolution training; they can be trained much like small-kernel models, which is fast and economical!
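To make guideline 3 concrete, the following is a minimal sketch of the merge step, not the official RepLKNet code: each conv-BN branch is folded into an equivalent convolution, the small kernel is zero-padded to the large kernel's size, and the two are summed. Channel count, kernel sizes, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    # Fold a BatchNorm that follows a conv into an equivalent conv weight/bias.
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

def merge_small_into_large(large, large_bn, small, small_bn):
    # Single large-kernel conv equivalent to the sum of the two conv-BN branches.
    lw, lb = fuse_conv_bn(large, large_bn)
    sw, sb = fuse_conv_bn(small, small_bn)
    pad = (large.kernel_size[0] - small.kernel_size[0]) // 2
    sw = F.pad(sw, [pad, pad, pad, pad])   # zero-pad the small kernel, e.g. 5x5 -> 31x31
    return lw + sw, lb + sb

C = 64  # illustrative channel count
large, large_bn = nn.Conv2d(C, C, 31, padding=15, groups=C, bias=False), nn.BatchNorm2d(C)
small, small_bn = nn.Conv2d(C, C, 5, padding=2, groups=C, bias=False), nn.BatchNorm2d(C)
large_bn.eval(); small_bn.eval()           # merging assumes inference-mode BN statistics

w, b = merge_small_into_large(large, large_bn, small, small_bn)
merged = nn.Conv2d(C, C, 31, padding=15, groups=C, bias=True)
merged.weight.data, merged.bias.data = w.detach(), b.detach()

# The merged conv reproduces the two-branch output exactly (up to float error).
x = torch.randn(2, C, 56, 56)
print(torch.allclose(merged(x), large_bn(large(x)) + small_bn(small(x)), atol=1e-4))
```

After the merge, inference uses only the single large-kernel convolution, so the parallel small-kernel branch costs nothing at deployment time.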

RepLKNet: Very Large Convolutional Kernel Architecture

We take Swin as the main point of comparison and have no intention of chasing SOTA, so we simply borrow Swin's macro-architecture to design a very-large-kernel architecture. It mainly amounts to replacing attention with very large convolutions and their supporting structures, plus a few CNN-style changes. Following the five guidelines above, the design elements of RepLKNet include shortcuts, depth-wise very large kernels, and small-kernel re-parameterization (a minimal block sketch follows the figure below).

[Figure: RepLKNet architecture]
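As a rough sketch of these design elements (a simplification, not the official implementation; the block layout, activation choices, and channel counts are assumptions), a depth-wise large-kernel block with an identity shortcut and a parallel small-kernel branch for re-parameterization might look like this:

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Simplified sketch: 1x1 conv -> depth-wise large-kernel conv (with a parallel
    small-kernel branch to be merged after training) -> 1x1 conv, plus a shortcut."""
    def __init__(self, channels: int, kernel_size: int = 31, small_kernel: int = 5):
        super().__init__()
        self.pw1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                 nn.BatchNorm2d(channels), nn.ReLU())
        self.dw_large = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels, bias=False),
            nn.BatchNorm2d(channels))
        self.dw_small = nn.Sequential(
            nn.Conv2d(channels, channels, small_kernel,
                      padding=small_kernel // 2, groups=channels, bias=False),
            nn.BatchNorm2d(channels))
        self.pw2 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                 nn.BatchNorm2d(channels))
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.pw1(x)
        out = self.act(self.dw_large(out) + self.dw_small(out))  # re-param branches
        out = self.pw2(out)
        return x + out                                            # identity shortcut

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```

At inference time, the dw_small branch would be merged into dw_large as in the merge sketch shown earlier.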

Scaling up the kernel size: the bigger, the better!

We set different kernel sizes for the four stages of RepLKNet and ran experiments on ImageNet and ADE20K semantic segmentation. The results are quite interesting: ImageNet accuracy still improves when going from 7x7 to 13x13, but stops improving beyond 13x13. On ADE20K, however, going from 13x13 in all four stages to 31-29-27-13 raises mIoU by 0.82, while the parameter count grows by only 5.3% and FLOPs by only 3.5%. The following experiments therefore mainly use kernel sizes of 31-29-27-13, a model called RepLKNet-31B; widening it by 1.5x overall gives RepLKNet-31L.

[Table: ImageNet and ADE20K results for different per-stage kernel sizes]

Cityscapes Semantic Segmentation

RepLKNet-31B is slightly smaller than Swin-Base, yet using only ImageNet-1K pretraining its mIoU exceeds that of Swin-Large pretrained on ImageNet-22K, a win across both model scale and data scale.

[Table: Cityscapes semantic segmentation results]

ADE20K Semantic Segmentation

RepLKNet performs strongly, especially at the Base level: its mIoU is 6.1 higher than a ResNet of comparable size, reflecting the significant advantage of a few large kernels over many small ones. (The same conclusion holds on COCO object detection, where RepLKNet-31B is 4.4 mAP higher than a comparably sized ResNeXt-101.) RepLKNet-XL is a larger model pretrained on the private dataset MegData-73M and reaches 56.0 mIoU (compared with ViT-L, this model is actually not that big).

[Table: ADE20K semantic segmentation results]

ImageNet classification, COCO object detection

See the "Too long to read" section or paper for results.

Discussion and analysis

Effective receptive field: large-kernel models far exceed deep small-kernel models

We visualized the effective receptive fields of RepLKNet-31, RepLKNet-13 (the model mentioned above with 13x13 kernels in every stage), ResNet-101, and ResNet-152 (see the paper for the method). The effective receptive field of ResNet-101 turns out to be quite small, and ResNet-152 improves little over it; the effective receptive field of RepLKNet-13 is already very large, and RepLKNet-31 enlarges it further by increasing the kernel size.

[Figure: effective receptive field visualizations of ResNet-101, ResNet-152, RepLKNet-13, and RepLKNet-31]
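The paper describes its own measurement protocol; as a hedged illustration of the common idea behind effective-receptive-field estimation (following Reference 2: the gradient of a central output activation with respect to the input, aggregated over images), a minimal sketch might look like the following. The backbone in the commented usage is just a placeholder.

```python
import torch

def effective_receptive_field(model, images):
    """Rough ERF estimate: accumulate |d(center activation)/d(input)| over images.
    `model` should map an image batch to a feature map of shape (N, C, H, W)."""
    erf = None
    for img in images:
        x = img.unsqueeze(0).requires_grad_(True)
        feat = model(x)                              # expected shape (1, C, H, W)
        h, w = feat.shape[-2] // 2, feat.shape[-1] // 2
        center = feat[:, :, h, w].sum()              # central output position, all channels
        grad, = torch.autograd.grad(center, x)
        contrib = grad.abs().sum(dim=1).squeeze(0)   # aggregate over input channels
        erf = contrib if erf is None else erf + contrib
    return erf / len(images)                         # (H_in, W_in) saliency map

# Usage sketch with a placeholder backbone (any feature extractor works):
# import torchvision
# backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# features = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()
# erf_map = effective_receptive_field(features, [torch.randn(3, 224, 224) for _ in range(8)])
```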

Shape bias: Large kernel models are more human-like

We also studied shape bias, i.e., how much of a model's predictions rely on shape rather than texture; human shape bias is around 90% (see the diamond markers on the left of the figure below). The models examined are Swin, ResNet-152, RepLKNet-31, and RepLKNet-3 (the small-kernel baseline mentioned above with 3x3 kernels in every stage). RepLKNet-3 has the same kernel size as ResNet-152 (3x3), and their shape biases are also very close (the two vertical solid lines in the figure almost coincide). Interestingly, a prior work on shape bias reports that ViT (global attention) has a high shape bias (see the figure in Reference 3), whereas we found that the shape bias of Swin (local attention within windows) is actually not high (figure below). This suggests that the form of attention is not the key; its scope of action is, which also explains the high shape bias (i.e., more human-like behavior) of RepLKNet-31.

[Figure: shape bias of humans, Swin, ResNet-152, RepLKNet-31, and RepLKNet-3]
