Large-kernel CNN unifies multiple modalities and achieves SOTA on time-series forecasting, and it is extremely simple and efficient!

Today I would like to introduce this year's new work (which is also a joint sequel to the two lines of work on structural re-parameterization and large-kernel convolution). Everyone is welcome to follow it and give it a star!

Paper: https://arxiv.org/abs/2311.15599

Model: https://huggingface.co/DingXiaoH/UniRepLKNet/tree/main

Home page: https://invictus717.github.io/UniRepLKNet/

GitHub (all code, all models, and all scripts for reproducing the experiments are released; come by and give it a star!): https://github.com/AILab-CVC/UniRepLKNet


TL;DR version

Q: What contribution does this article make?

Answer: Four guidelines for large-kernel CNN architecture design, plus a powerful backbone called UniRepLKNet. Using only ImageNet-22K pre-training, it reaches SOTA in both accuracy and speed: 88% top-1 accuracy on ImageNet, 56.4 box AP on COCO, and 55.6 mIoU on ADE20K, with a large advantage in measured speed. Using this backbone, which was designed for images, we also reach SOTA on a very large-scale time-series forecasting task (global temperature and wind speed prediction; the previous SOTA was a Transformer specially designed for this task and published in a Nature sub-journal). On point cloud, audio, and video, with extremely simple preprocessing and no change to the model structure, it exceeds or approaches the SOTA level.

Question: Why do we still need to study CNNs in an era when Transformers supposedly unify all modalities?

Answer: Transformer and CNN are just two structural design ideas that blend into each other; there is no reason to believe the former is essentially superior. The purpose of research is to correct our understanding of the unknown, and "Transformer unifies all modalities" is the perception this paper attempts to correct. Just as before ConvNeXt, RepLKNet, and other works appeared in early 2022, the mainstream perception was that "Transformer beats CNN on downstream tasks such as object detection and semantic segmentation"; those works revised it to "CNN and Transformer are comparable on image tasks, but Transformer still beats CNN on other modalities." We now revise it further: on point cloud, audio, and video, CNN is much stronger than we imagined; in time-series forecasting, a field where CNN has rarely dominated in recent years (LSTMs and similar models were once mainstream, and Transformers have become more and more common over the past two years), CNN can surpass Transformer and successfully "take over" the field; and in terms of unification across modalities, CNN may be no weaker than Transformer.


Large-kernel CNN architecture design

(This chapter discusses the ideas and reasoning behind the model architecture design at length. Readers not interested in structural design can skip to the next chapter.)

In RepLKNet [1] in 2022, I proposed several design principles for building modern CNNs with very large convolution kernels (from 13x13 up to 31x31) and showed how to use such kernels correctly. From an architectural perspective, however, RepLKNet simply reuses the overall architecture of Swin Transformer without any changes. SLaK pushed the kernel size further to 51x51, but it simply adopted the ConvNeXt architecture. Generally speaking, current large-kernel CNN architecture design follows either existing CNN design principles or existing Transformer design principles.

We cannot help but ask: can such architectures fully exploit the advantages of large convolution kernels?

What is the advantage of a large convolution kernel? We see it as a large receptive field that does not rely on deep stacking. And what does depth bring? On the one hand, a higher level of feature abstraction; on the other hand, stronger general representation ability. Our new architectural design principles come from thinking about these three concepts, so let us first compare them in detail.

  • The maximum (theoretical) receptive field describes how far away a point on the feature map can, in principle, establish a spatial connection with other points when the model extracts a spatial pattern. For example, stacking three 3x3 convolutions gives a maximum receptive field of 7x7, so an output point can aggregate information from input points at most three positions away from it. In a real CNN, however, even if two points are theoretically connected, the connection is realized through several layers, is very weak, and has minimal impact on the final output. This motivates the concept of the effective receptive field. A theoretical analysis shows that the effective receptive field of a model is proportional to the kernel size multiplied by the square root of the number of layers [2] (see the back-of-the-envelope sketch after this list). In other words, as the network gets deeper and deeper, the gain in effective receptive field from further deepening the model diminishes at the margin. Work such as RepLKNet shows that although a model like ResNet-152 has dozens of 3x3 convolutional layers, its effective receptive field is actually not large and is not essentially better than that of ResNet-101.

  • The level of feature abstraction is also related to spatial patterns. When discussing receptive fields we focus on how widely the model can perceive, but we should also care about how high a level of abstraction it can perceive. The intuitive picture of a convolutional network is that convolution kernels extract spatial patterns layer by layer, so the abstraction level of the resulting features rises, for example from edges to textures, from textures to object parts, and then to objects. Beyond this intuitive picture, the abstractions a CNN extracts may also be ones humans cannot interpret.

  • The general representation ability brought by depth comes from more trainable parameters and nonlinear activation functions. Generally speaking, deeper models with more parameters can fit more complex functions and thus learn more complex representations.
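The following is a back-of-the-envelope sketch of the rule quoted above that the effective receptive field grows in proportion to kernel size times the square root of depth [2]. The constant factor is dropped and the numbers are only illustrative, not values from the paper.

```python
import math

# Rough ERF estimate following ERF ∝ K * sqrt(L) [2]; the constant factor is
# dropped, so only relative comparisons between the settings are meaningful.
def erf_estimate(kernel_size: int, num_layers: int) -> float:
    return kernel_size * math.sqrt(num_layers)

print(erf_estimate(3, 50))    # ~21.2 : fifty 3x3 layers
print(erf_estimate(3, 100))   # ~30.0 : doubling the depth only helps by sqrt(2)
print(erf_estimate(13, 12))   # ~45.0 : a dozen 13x13 layers already see much further
```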

What do traditional convolutional network architectures have in common? We note that when we add a 3x3 or 5x5 convolutional layer to a network, we actually expect it to do three things simultaneously: increase the receptive field, increase the level of abstraction, and generally improve representational power by increasing depth.

What limitations does this bring to traditional convolutional network design?

  • Small convolution kernels must be stacked in large numbers to achieve a large receptive field, so we have to use many 3x3 or 5x5 layers, and even then the resulting effective receptive field is not satisfactory.

  • With more convolutional layers, the feature abstraction level is certainly sufficient, but how high an abstraction level is enough? No one knows, because the abstraction level is tightly coupled with the receptive field and cannot be adjusted independently.

  • The convolutional layers take up too many parameters and too much computation, so under a fixed model-size budget it is hard to further improve general representation ability.

With large convolution kernels, we can achieve a sufficient effective receptive field with only a few layers. If we still follow the traditional CNN paradigm of stacking convolution kernels, what goes wrong?

  • The receptive field may become too large. The problem is not just wasted computation: in some downstream frameworks (such as UPerNet for semantic segmentation), low-level features in the backbone may obtain too large a receptive field too early, which can be harmful. Low-level features are supposed to be local, and UPerNet gains its improvement precisely by combining them with high-level features; if the low-level receptive field is already very large, those features effectively become global ones.

  • On the premise that the receptive field is already fully sufficient, a depthwise 3x3 is clearly capable of the simple job of turning lower-level features into features of a higher abstraction level; insisting on a 31x31 for this is simply unnecessary.

  • The model may reach a large receptive field with only a few layers, but if we stop at such a shallow depth (for example, RepLKNet uses only 24 very-large-kernel convolutional layers and 24 FFN structures), the model's representation ability may be insufficient.

So what principles should we follow when designing a large-kernel CNN architecture? Our answer is to decouple the three elements above and use a dedicated structure for each desired effect. It is exactly the essential advantage of large convolution kernels that allows us to achieve this decoupling.

  • Use a small number of large convolution kernels to ensure a large receptive field.

  • Use small convolutions such as depthwise 3x3 to improve the feature abstraction level.

  • Use efficient structures (such as the SE block, bottleneck structures, etc.) to increase the depth of the model and enhance its general representation ability.

Guided by this idea, we conducted a series of systematic studies and propose four architectural guidelines for designing large-kernel CNNs, briefly summarized as follows:

  • Regarding local structure design: use some efficient structures like SE or bottleneck to increase depth.

  • Regarding re-parameterization: capture sparse patterns with dilated convolutions. This paper proposes a sub-module called the Dilated Reparam Block, which uses dilated convolutions in parallel with the large-kernel convolution. Using the idea of structural re-parameterization, the entire block can be equivalently converted into a single large-kernel convolution, because a dilated convolution with a small kernel is equivalent to a non-dilated convolution with a larger, sparse kernel, as shown in the figure below (a code sketch of this equivalence follows the figure).

  • Regarding kernel size: choose the kernel size according to the downstream task and the specific framework used. As mentioned above, for the semantic segmentation framework UPerNet, letting low-level features obtain too large a receptive field too early may be harmful. But this does not mean that a large kernel reduces the representation ability of the model or the quality of the final features! RepLKNet's conclusion that "performance at least does not get worse as the kernel size increases" has not been overturned (RepLKNet uses DeepLabv3 for semantic segmentation, which does not rely on the locality of low-level features); it has only been refined. For the tasks covered in this paper, 13x13 is sufficient.

  • Regarding the scaling rule: for a small model that already uses many large kernels, when increasing the depth (for example, from the 18 layers of the Tiny-level model to the 36 layers of the Base-level model), the added blocks should use depthwise 3x3. There is no need to add more large kernels, since the receptive field is already large enough, but using an efficient operation like 3x3 to raise the feature abstraction level is always beneficial.

[Figure: the Dilated Reparam Block proposed in this paper and its re-parameterization into a single large-kernel convolution]
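Below is a minimal sketch of the equivalence stated above, namely that a dilated convolution with a small kernel equals a non-dilated convolution with a larger, sparse kernel. It only illustrates the principle and is not the paper's actual merging code; in particular, the Dilated Reparam Block also fuses BN layers before merging branches, which is omitted here, and the helper name dilated_to_dense_kernel is ours.

```python
import torch
import torch.nn.functional as F

def dilated_to_dense_kernel(w, r):
    """Insert zeros between the taps of a small kernel w (C_out, C_in/groups, k, k)
    so that a convolution with dilation r becomes an ordinary convolution with an
    equivalent sparse kernel of size (k - 1) * r + 1."""
    k = w.shape[-1]
    big = (k - 1) * r + 1
    dense = torch.zeros(w.shape[0], w.shape[1], big, big, dtype=w.dtype)
    dense[:, :, ::r, ::r] = w  # the original taps land on a dilated grid
    return dense

# Sanity check: a depthwise 3x3 conv with dilation 3 matches a depthwise 7x7 conv
# whose kernel is the zero-interleaved version of the same 3x3 weights.
x = torch.randn(1, 4, 32, 32)
w = torch.randn(4, 1, 3, 3)                                   # depthwise weights
y_dilated = F.conv2d(x, w, padding=3, dilation=3, groups=4)
y_dense = F.conv2d(x, dilated_to_dense_kernel(w, 3), padding=3, groups=4)
print(torch.allclose(y_dilated, y_dense, atol=1e-6))          # True
```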

Based on this, the UniRepLKNet model structure we propose is as follows and is very, very simple: each block consists mainly of three parts, a depthwise conv, an SE block, and an FFN. The depthwise conv can be a large-kernel convolution (the Dilated Reparam Block described above) or just a depthwise 3x3. A simplified sketch of such a block is given below.
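The following is a highly simplified, assumption-based sketch of this block layout: normalization layers, LayerScale, and the parallel dilated branches of the Dilated Reparam Block are all omitted, and the class names SEBlock and SimplifiedBlock are ours rather than names from the official repository.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling followed by channel re-weighting."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))

class SimplifiedBlock(nn.Module):
    """Depthwise conv (large-kernel or 3x3) -> SE -> FFN, each with a residual."""
    def __init__(self, dim, kernel_size=13, ffn_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.se = SEBlock(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ffn_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * ffn_ratio, dim, 1))

    def forward(self, x):
        x = x + self.se(self.dwconv(x))   # spatial mixing + channel attention
        return x + self.ffn(x)            # channel mixing
```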


Minimalist designs for using UniRepLKNet on other modalities

Out of an eternal pursuit of simplicity and universality, when applying UniRepLKNet to other modalities we make no changes to the main body of the model architecture (all experiments below use UniRepLKNet-Small). We only preprocess video, audio, point cloud, and time-series data into a C x H x W embedding map, just as we represent an image as a 3 x H x W tensor. For example:

  • We regard the audio spectrogram (T x F) as a single-channel image, that is, C=1, H=T, W=F;

  • We perform three-view projection of the point cloud and obtain three single-channel images, so C=3, H and W can be specified arbitrarily;

  • We simply splice the frames of a video together into one large image (for example, splicing 16 frames of 3 x 224 x 224 yields a 3 x 896 x 896 input; see the sketch after this list);

  • For time-series data, we borrow the embedding layer from CorrFormer [3] to map the data into a latent-space tensor and then, quite crudely, reshape it directly into a single-channel image format.
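As a concrete illustration of the video case above, the sketch below arranges the frames of a clip into a square mosaic. It is an assumption-laden toy (the exact frame layout and the helper name video_to_mosaic are ours, not the paper's), but it shows how little preprocessing is involved.

```python
import torch

def video_to_mosaic(frames: torch.Tensor) -> torch.Tensor:
    """Arrange T frames of shape (T, 3, H, W) into one image of shape
    (3, g*H, g*W), where g = sqrt(T); e.g. 16 frames form a 4x4 grid."""
    t, c, h, w = frames.shape
    g = int(round(t ** 0.5))
    assert g * g == t, "number of frames must be a perfect square"
    grid = frames.reshape(g, g, c, h, w).permute(2, 0, 3, 1, 4)  # (C, g, H, g, W)
    return grid.reshape(c, g * h, g * w)

clip = torch.randn(16, 3, 224, 224)
print(video_to_mosaic(clip).shape)  # torch.Size([3, 896, 896])
```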

The results reported below show that although this design is surprisingly simple, it works extremely well.

Results: ImageNet, COCO, ADE20K

As the three classic image benchmarks, results on ImageNet, COCO, and ADE20K are of course indispensable. We use at most ImageNet-22K pre-training and do not use any larger data.

Although large-kernel CNNs were never primarily aimed at ImageNet (image classification places modest demands on representation ability and receptive field and cannot exploit the full potential of large kernels), UniRepLKNet still surpasses many of the latest models, and its measured speed is especially gratifying. For example, UniRepLKNet-XL reaches 88% ImageNet accuracy while actually running three times as fast as DeiT III-L. The advantage of the smaller UniRepLKNet variants over specially designed lightweight models such as FastViT is also very clear.


On COCO object detection, our strongest competitor is InternImage [4]: UniRepLKNet-L falls short of InternImage-L, but UniRepLKNet-XL exceeds InternImage-XL. Considering the InternImage team's deep expertise in object detection, achieving this was not easy for us.


On ADE20K semantic segmentation, UniRepLKNet has a significant advantage, reaching up to 55.6 mIoU, a full 1.6 mIoU higher than ConvNeXt-XL.


Results: Time series prediction, audio, video, point cloud

To verify UniRepLKNet's ability to handle time-series data, we took on a "Nature-level" task with an extremely large data scale: global temperature and wind speed forecasting. Although UniRepLKNet was originally designed for image-oriented tasks, it outperforms CorrFormer [3], the previous SOTA designed specifically for this task.


This finding is particularly interesting because the effectiveness of time-series forecasting is generally believed to depend on a model's ability to capture dependencies in the data, which sounds exactly like the home ground of attention and Transformers; yet now a CNN has "taken it over". Considering that Transformer itself was borrowed from NLP into CV, a strange feeling arises. As professional player Sun Yifeng has repeatedly proven throughout his StarCraft career, every situation eventually changes.

Our minimalist approach also works amazingly well on audio, video, and point cloud tasks (see the paper for details).

Conclusion

Beyond proposing a backbone that is very powerful on images, the findings reported in this paper suggest that the potential of large-kernel CNNs is still far from fully exploited. Even in "unified modeling across modalities", supposedly Transformer's theoretical strength, large-kernel CNNs are more powerful than we imagined. The paper also reports supporting evidence: reducing the kernel size from 13 to 11 significantly degrades performance on all four modalities (see the paper for details).

Other FAQs

Q: For the vision field, amid the wave of Transformer development, what is the significance of continuing to study CNNs? Is it because CNNs are more efficient in certain scenarios (such as small models or edge devices)?

Answer: This question seems to carry a hidden assumption, namely that "Transformer is inherently stronger than CNN", so that "CNN can only survive in niches that Transformer looks down on or has not yet had time to challenge." In fact, the question of which of Transformer and CNN is stronger has been debated from 2020 to 2023 and is no longer interesting. Both implement learnable sequence or spatial modeling, both are trained as black boxes, and the final results are similar; what evidence do we have that Transformer is essentially stronger than CNN? Even if we drop the qualifier "certain scenarios", and even if cost and deployment are not considered, there is no reason to believe Transformer has a higher performance ceiling under ideal conditions. Proponents of Transformer's essential superiority generally argue that Transformer has a better scaling law and wins when the data and model scale become extremely large. However, Google's recent work (https://arxiv.org/abs/2310.16764) trained NFNet (an old model about the same age as ViT) on JFT-4B and also reached 90.4% ImageNet accuracy. This shows that, at least in the image domain, Transformer and CNN are just two kinds of models that blend into each other, and the same goes for ViT. There is no need to carry this who-is-stronger debate into 2024.

Q: How can CNNs be used for various generation tasks? Is this an essential weakness of CNN?

Answer: When ViT first appeared, it could only do image classification; even object detection and semantic segmentation were not easy for it, and it was widely judged to be "hard to use for downstream tasks, hard to deploy, and expensive to train; players without a TPU are advised to stay away from it."

References

[1] Ding, Xiaohan, Xiangyu Zhang, Jungong Han, and Guiguang Ding. "Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11963-11975. 2022.

[2] Luo, Wenjie, Yujia Li, Raquel Urtasun, and Richard Zemel. "Understanding the effective receptive field in deep convolutional neural networks." Advances in Neural Information Processing Systems 29 (2016).

[3] Wu, Haixu, Hang Zhou, Mingsheng Long, and Jianmin Wang. "Interpretable weather forecasting for worldwide stations with a unified deep model." Nature Machine Intelligence (2023): 1-10.

[4] Wang, Wenhai, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu et al. "InternImage: Exploring large-scale vision foundation models with deformable convolutions." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14408-14419. 2023.
