CVPR 2022 | Beyond Swin! Huawei Noah & Peking University propose Wave-MLP: a new visual backbone network


Reprinted from: Heart of the Machine

Researchers from Huawei's Noah's Ark Lab, Peking University, and the University of Sydney have proposed a new architecture for visual MLPs inspired by quantum mechanics.

In recent years, new architectures for computer vision have emerged in rapid succession, including vision Transformers and MLPs, which have surpassed CNNs on many tasks and attracted wide attention. Among them, the vision MLP has an extremely simple architecture, consisting only of stacked multilayer perceptrons (MLPs). Compared with CNNs and Transformers, these concise MLP architectures introduce less inductive bias and promise stronger generalization.

However, existing vision MLP architectures still underperform CNNs and Transformers. Researchers from Huawei's Noah's Ark Lab, Peking University, and the University of Sydney have proposed a quantum-mechanics-inspired vision MLP architecture that achieves state-of-the-art performance on multiple tasks, including ImageNet classification, COCO detection, and ADE20K segmentation.


An Image Patch is a Wave: Quantum Inspired Vision MLP

Paper address: https://arxiv.org/abs/2111.12294

PyTorch code: https://github.com/huawei-noah/CV-Backbones/tree/master/wavemlp_pytorch

MindSpore code: https://gitee.com/mindspore/models/tree/master/research/cv/wave_mlp

Wave-MLP

Inspired by the wave-particle duality of quantum mechanics, this work represents each image patch (token) in the MLP as a wave function, yielding a new vision MLP architecture, Wave-MLP, whose performance substantially surpasses existing MLP architectures and Transformers.

Quantum mechanics is a branch of physics that describes the laws of motion of microscopic particles, and classical mechanics can be regarded as a special case of quantum mechanics. A fundamental property of quantum mechanics is wave-particle duality, that is, all individuals (such as electrons, photons, atoms, etc.) can be described using both particle terms and wave terms. A wave usually includes two properties, amplitude and phase. The amplitude represents the maximum intensity a wave can reach, and the phase indicates where it is currently in a cycle. Representing a particle in the classical sense in the form of a wave (for example, de Broglie wave) can more completely describe the motion state of microscopic particles.

Can the image patches in a vision MLP likewise be represented as waves? This work uses the amplitude to express the actual information contained in each token, and the phase to express the token's current state. When aggregating information from different tokens, the phase difference between tokens modulates how they combine (as shown in Figure 3). Considering that tokens from different input images carry different semantic content, the study uses a simple fully connected module to dynamically estimate the phase of each token. To aggregate tokens carrying both amplitude and phase information, the authors propose a phase-aware token mixing module (PATM, shown in Figure 1 below). Alternately stacking PATM modules and MLP modules constitutes the entire Wave-MLP architecture.


Figure 1: A unit in the Wave-MLP architecture

Compared with existing vision Transformer and MLP architectures, Wave-MLP shows clear performance advantages (Figure 2 below). On ImageNet, the Wave-MLP-S model achieves 82.6% top-1 accuracy with 4.5G FLOPs, 1.3 points higher than Swin-T at similar computational cost. Wave-MLP also generalizes well to downstream tasks such as object detection and semantic segmentation.


Figure 2: Comparison of Wave-MLP with existing visual Transformer, MLP architectures

Representing tokens as waves

In Wave-MLP, each token is represented as a wave $\tilde{z}_j$ carrying both amplitude and phase information:

$$\tilde{z}_j = |z_j| \odot e^{i\theta_j}, \quad j = 1, 2, \dots, n \qquad (1)$$

where $i$ is the imaginary unit satisfying $i^2 = -1$, $|\cdot|$ denotes the absolute value, and $\odot$ is element-wise multiplication. The amplitude $|z_j|$ is a real-valued feature representing the content of each token, while the phase $\theta_j$ indicates the token's current position within a wave cycle.
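As a minimal sketch of formula (1), assuming hypothetical token count and channel dimensions, each token's real-valued amplitude can be turned into a complex-valued wave element-wise:

```python
import numpy as np

# Hypothetical shapes: n tokens, each with a d-dimensional feature.
n, d = 4, 8
rng = np.random.default_rng(0)

amplitude = np.abs(rng.standard_normal((n, d)))   # |z_j|: real-valued token content
phase = rng.uniform(0.0, 2 * np.pi, size=(n, d))  # theta_j: position within a wave cycle

# Formula (1): each token becomes an element-wise complex wave |z_j| * exp(i * theta_j).
wave = amplitude * np.exp(1j * phase)

# The modulus of the wave recovers the amplitude exactly.
print(wave.shape, np.allclose(np.abs(wave), amplitude))
```

Note that the complex form loses no information: the amplitude is the modulus of the wave and the phase is its angle.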

The phase difference between two tokens strongly influences how they aggregate (Figure 3 below). When two tokens are in phase, they reinforce each other, producing a wave with larger amplitude (Fig. 3(b)); when they have opposite phases, the superposed waves cancel each other. In other cases the interaction is more complex but still depends on the phase difference (Fig. 3(a)). Classical methods that represent a token with a plain real value are in fact a special case of the formula above.
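The constructive and destructive interference described above can be checked numerically. This is an illustrative sketch, not the paper's code; the unit amplitudes are arbitrary:

```python
import numpy as np

amp = np.ones(4)                    # two waves with identical unit amplitudes
w1 = amp * np.exp(1j * 0.0)         # phase 0
w_same = amp * np.exp(1j * 0.0)     # same phase as w1
w_opp = amp * np.exp(1j * np.pi)    # opposite phase (difference of pi)

constructive = np.abs(w1 + w_same)  # amplitudes add up -> 2
destructive = np.abs(w1 + w_opp)    # amplitudes cancel -> ~0

print(constructive[0], destructive[0])
```

In-phase waves double the amplitude, while opposite-phase waves cancel almost exactly (up to floating-point error), matching Figure 3.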


Figure 3: Aggregation process of two waves with different phases. The left side shows the superposition of two waves in the complex domain, and the right side shows their projection on the real axis as a function of phase. The dashed line represents two waves with different initial phases, and the solid line is their superposition.

Phase-aware token aggregation

Equation (1) contains an amplitude and a phase. The amplitude $z_j$ plays the role of an ordinary real-valued feature and can be generated by a standard Channel-FC:

$$z_j = \text{Channel-FC}(x_j; W^c) = W^c x_j, \quad j = 1, 2, \dots, n \qquad (2)$$

where $W^c$ is a learnable weight matrix.

The phase can be estimated in a number of ways. So that the phase captures the specific properties of each input, this study uses a learnable estimation module to generate $\theta_j$. Given the amplitude $z_j$ and phase $\theta_j$, the wave representation $\tilde{z}_j$ of a token follows from formula (1). Using Euler's formula, formula (1) can also be expanded into the concatenation of two real-valued parts:

$$\tilde{z}_j = |z_j| \odot \cos\theta_j + i\,|z_j| \odot \sin\theta_j \qquad (3)$$
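The Euler expansion can be verified with a quick numerical sketch (hypothetical values, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
z = np.abs(rng.standard_normal(5))        # amplitudes |z_j|
theta = rng.uniform(0.0, 2 * np.pi, 5)    # phases theta_j

complex_form = z * np.exp(1j * theta)     # formula (1): |z_j| * exp(i * theta_j)
real_part = z * np.cos(theta)             # first term of formula (3)
imag_part = z * np.sin(theta)             # second term of formula (3)

# Euler's formula: the two representations are identical.
print(np.allclose(complex_form, real_part + 1j * imag_part))  # True
```

This equivalence is what lets the network work entirely with two real-valued vectors instead of complex arithmetic.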

The wave functions of different tokens are then aggregated through a Token-FC, producing a complex-valued output:

$$\tilde{o}_j = \sum_k W^t_{jk}\, z_k \odot \cos\theta_k + i \sum_k W^i_{jk}\, z_k \odot \sin\theta_k, \quad j = 1, 2, \dots, n \qquad (4)$$

Similar to the measurement process in quantum mechanics, the complex-valued output must be mapped back into the real domain to obtain meaningful values. Summing the real and imaginary parts with learnable weights gives the module output:

$$o_j = \sum_k W^t_{jk}\, z_k \odot \cos\theta_k + W^i_{jk}\, z_k \odot \sin\theta_k, \quad j = 1, 2, \dots, n \qquad (5)$$
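The aggregation of formulas (4)–(5) can be sketched as two matrix products over the token axis. This is an illustrative simplification with hypothetical shapes and random weights, not the released Wave-MLP implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 8                                   # hypothetical token count / channels
z = np.abs(rng.standard_normal((n, d)))       # amplitudes |z_j|, from a Channel-FC
theta = rng.uniform(0.0, 2 * np.pi, (n, d))   # phases theta_j, from the estimation module

W_t = rng.standard_normal((n, n))             # Token-FC weights for the real part
W_i = rng.standard_normal((n, n))             # Token-FC weights for the imaginary part

# Formula (5): aggregate across tokens, then sum the weighted real and
# imaginary parts to obtain a real-valued output o_j.
o = W_t @ (z * np.cos(theta)) + W_i @ (z * np.sin(theta))

print(o.shape)  # (4, 8)
```

Note how the phase enters only through $\cos\theta_k$ and $\sin\theta_k$, which is what lets tokens with different phases attenuate or reinforce each other during mixing.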

In the vision MLP, this study constructs a phase-aware token mixing module (PATM, Fig. 1) to carry out this token aggregation. Alternately stacking PATM modules and channel-mixing MLPs forms the entire Wave-MLP architecture.

Experimental results

This study conducts extensive experiments on the large-scale classification dataset ImageNet, object detection dataset COCO and semantic segmentation dataset ADE20K.

Image classification results on ImageNet are shown in Tables 1 and 2: compared with existing vision MLP and Transformer architectures, Wave-MLP shows clear performance advantages.

[Tables 1 and 2: ImageNet classification results]

On downstream tasks such as object detection and semantic segmentation, Wave-MLP likewise delivers better performance.

[Tables: COCO object detection and ADE20K semantic segmentation results]



Origin: blog.csdn.net/amusi1994/article/details/123675864