CVPR 2022 Oral | MLPs march into low-level vision! Google proposes MAXIM: SOTA on multiple image processing tasks, and the code is open source!


Author: Fake Panda | Reprinted with permission (source: Zhihu) | Editor: CVer

https://zhuanlan.zhihu.com/p/481256924

Are you tired of the "inflexibility" and "high memory cost" of the latest Transformer/MLP models?

Are you frustrated that the latest Transformer/MLP models cannot adapt to "different input resolutions"?

Then try MAXIM! This MLP-based, UNet-style backbone offers both "global" and "local" receptive fields, runs directly on "high-resolution images" at "linear complexity", is "fully convolutional", and is "plug and play". Code and models are open-sourced. What are you waiting for?!

Too long; didn't read (TL;DR)

  • We propose MAXIM, a general backbone for image restoration/enhancement, which for the first time brings the recently popular "MLP" [1] to low-level vision; it reaches SOTA performance on more than 10 datasets across five image processing tasks (denoising, deblurring, deraining, dehazing, and enhancement);

  • A "plug and play" multi-axis threshold MLP block (Multi-Axis gMLP block) is proposed, which realizes global/local spatial information interaction under linear complexity, and solves the pain point that MLP/Transformer cannot handle images of different resolutions [2], and has the characteristics of full convolution [3], which is tailored for the underlying vision task, and can also be applied to other dense prediction tasks (left for future filling);

  • Another "plug-and-play" cross -gating MLP block is proposed, which can painlessly replace the cross-attention mechanism, and also enjoys global/local receptive fields and full convolution characteristics in linear complexity.


MAXIM: Multi-Axis MLP for Image Processing

Paper address: arxiv.org/abs/2201.02973

Code/models/experimental results:

https://github.com/google-research/maxim

Chinese video explanation (very detailed, with lots of background, beginner-friendly):

youtu.be/gpUrUJwZxRQ

[Figure] MAXIM model architecture diagram (source: MAXIM: Multi-Axis MLP for Image Processing)


Introduction

It's 2022; are you still obsessing over tuning the parameters of a "convolutional neural network"?

The Vision Transformer (ViT) [4] has been out for little more than a year, and it is already sweeping every major area of vision! Inspired by ViT's elegant architecture, all kinds of clever follow-ups have sprung up: MLP-Mixer [5], proposed by Google Brain, replaces self-attention with MLPs and builds a pure-MLP architecture with strong performance; the gMLP model [6], from another Google Brain team, builds a gated MLP module and painlessly beats the Transformer in both vision and language modeling. Some big names on Zhihu couldn't help asking: is MLP all you need? [1]

[Figure] MLP renaissance? (via Twitter @giffmana)

New visual backbones such as ViT, Mixer, and gMLP have led a paradigm shift that departs fundamentally from traditional convolutional neural network (CNN) design, namely "global models" (or non-local networks [7]): we no longer rely on the long-standing priors about 2D images, translation invariance and local dependency; instead we rely on global receptive fields and the "money power" of pre-training on very large-scale data [8]. Of course, another feature of ViT comes from the definition of the attention mechanism itself, namely input-adaptive dynamic weighting, but here we mainly discuss the global interaction property of these Transformer-like models.

A global model performs global spatial interaction on the input feature map: each output pixel is a weighted combination of every position of the input feature, which costs O(N) multiplications (with N = HW denoting the number of spatial positions). Producing the whole output feature map of size N therefore requires O(N^2) multiplications, which is the origin of the high computational complexity of the attention mechanism/Transformer. In essence, global models with dense receptive fields such as ViT, Mixer, and gMLP all have quadratic computational complexity. Operators that scale quadratically are hard to adopt as general-purpose modules across major vision tasks, such as object detection and semantic segmentation that require training/inference at high resolution, let alone almost all low-level vision tasks: denoising, deblurring, super-resolution, deraining, dehazing, desnowing, shadow removal, demoiréing, reflection removal, watermark removal, demosaicing, and more...
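To put concrete numbers on this: for a modest 256×256 input, N = HW = 65,536, so a single layer of dense global mixing already needs on the order of N^2 ≈ 4.3 × 10^9 pairwise multiplications (and an N×N interaction matrix of the same size); doubling the resolution to 512×512 multiplies both costs by 16. This is why such operators are usually confined to heavily downsampled feature maps.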

Still, why not just use it anyway? The IPT model, jointly built by Huawei and Peking University, was the first to apply the ViT model to multiple low-level vision tasks; it topped the major leaderboards and was published at CVPR 2021 [9][10]. Although its performance is good, the global attention mechanism used by IPT has some obvious limitations: (1) it requires a large amount of data for pre-training (e.g., ImageNet), and (2) it cannot directly run inference on high-resolution images. In practice, the input image is usually cut into patches, each patch is processed separately, and the outputs are stitched back into the full image. This often leaves visible "blocking artifacts" in the output (as shown in the figure below), and inference is also relatively slow, which limits real-world deployment.

[Figure] Patch-based inference can produce blocking artifacts (source: Pre-Trained Image Processing Transformer)

The property of being able to train on small image patches and then run inference directly on large images is called being "fully convolutional" [3]. This property is very important for low-level vision, because image restoration and enhancement operate on pixels: the output must be the same size as the input, and the input cannot simply be resized as in image classification. Clearly, the current mainstream global networks ViT, Mixer, and gMLP cannot solve this pain point of adapting to different image resolutions.
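As a toy illustration (hypothetical code, not from MAXIM or any of the papers above), the snippet below contrasts a small convolution, which slides the same 3×3 weights over inputs of any size, with a spatial-mixing MLP whose weight matrix is tied to a fixed number of tokens and therefore breaks the moment the resolution changes:

```python
# Toy illustration of the "fully-convolutional" property (hypothetical example).
import numpy as np
from scipy.signal import convolve2d

kernel = np.random.randn(3, 3)              # conv weights "trained" on small patches
small = np.random.randn(64, 64)             # training-size patch
large = np.random.randn(256, 256)           # full-resolution test image

# Convolution: the same 3x3 weights apply to any input size.
out_small = convolve2d(small, kernel, mode="same")   # -> (64, 64)
out_large = convolve2d(large, kernel, mode="same")   # -> (256, 256)

# Spatial-mixing MLP: the weight matrix is tied to exactly 64*64 tokens,
# so a 256x256 input cannot be fed through it without retraining/resizing.
w_mix = np.random.randn(64 * 64, 64 * 64)
ok = small.reshape(-1) @ w_mix                        # works
# large.reshape(-1) @ w_mix                           # shape mismatch -> ValueError
```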

Here it comes, here it comes, wearing CNN's clothes!

Then Swin Transformer was born and won the ICCV 2021 Marr Prize [11]. Swin's contributions are remarkable, such as the hierarchical structure and the local (window) attention mechanism that tames the computational complexity. And here is the point: the local attention proposed by Swin brings good news to low-level vision, because it has the "fully convolutional" property! The fundamental reason is that self-attention operates inside small 7x7 windows, and all windows across the spatial extent share weights; running at a larger size simply means having more windows. Come to think of it, isn't that the same idea as strided convolution? CNN forever!

As a very natural CNN-inspired improvement, local attention suits low-level vision well, so it quickly entered all kinds of low-level vision tasks. Two of the earliest such works are (1) Uformer [12] (CVPR 2022) from the University of Science and Technology of China and (2) SwinIR [13] (ICCVW 2021) from Luc Van Gool's group, both of which borrow or refine the idea of local attention and apply it to multiple low-level vision tasks, achieving impressive performance.

However, local attention re-introduces locality and goes back to basics, while giving up a key feature of global models: "global interaction". In our humble opinion, shifted window attention is only a supplement to local attention and does not really solve the pain point of performing global interaction efficiently (a personal view, please go easy on me.. but this is one of the motivations of this work).

Method

We design MAXIM, the first MLP-based, UNet-style general backbone for low-level vision. Compared with previous work on low-level vision networks, MAXIM has the following advantages:

  • MAXIM has a global receptive field on images of any size, and requires only linear complexity;

  • MAXIM can run inference directly on images of any size, i.e., it is "fully convolutional";

  • MAXIM balances local and global operators, so the network reaches SOTA performance without pre-training on very large datasets.

[Figure] MAXIM network architecture diagram

The MAXIM backbone architecture is shown above. It has the basic structure of a symmetric UNet, with a down-sampling encoder, a bottleneck at the bottom, and an up-sampling decoder. Each encoder/decoder/bottleneck stage adopts the same design as Figure 2(b): a multi-axis gated MLP block (global interaction) plus a residual convolutional channel-attention block (local interaction). Inspired by Attention-UNet [14], we additionally place cross-gating blocks in the middle of the UNet, using the high-level semantic features output by the bottleneck to modulate the skip-connection features between encoder and decoder. It is worth noting that, unlike the usual UNet variants, every module in the MAXIM backbone has a global/local receptive field, which gives it greater learning capacity.
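To make this wiring concrete, here is a heavily simplified structural sketch in plain Python (hypothetical helper names; this is not the official JAX implementation). The multi-axis block and cross-gating block it calls are the ones described in the next two subsections; here they are passed in as opaque callables:

```python
import numpy as np

def maxim_backbone(x, conv_rcab, multi_axis_block, cross_gate, down, up, depth=3):
    """Simplified MAXIM-style UNet: every stage mixes a local conv/channel-attention
    block with a global multi-axis gMLP block, and bottleneck features gate the skips."""
    skips = []
    for _ in range(depth):                           # encoder
        x = multi_axis_block(conv_rcab(x))
        skips.append(x)
        x = down(x)
    x = multi_axis_block(conv_rcab(x))               # bottleneck
    bottleneck = x
    for skip in reversed(skips):                     # decoder
        x = up(x)
        gated_skip = cross_gate(skip, bottleneck)    # modulate the skip with bottleneck features
        x = multi_axis_block(conv_rcab(x + gated_skip))
    return x

# Smoke test with trivial stand-ins for the real learned blocks.
f = lambda t: t                                      # identity in place of the learned blocks
down = lambda t: t[::2, ::2]                         # naive 2x downsampling
up = lambda t: np.repeat(np.repeat(t, 2, 0), 2, 1)   # naive 2x upsampling
y = maxim_backbone(np.zeros((64, 64, 32)), f, f, lambda s, b: s, down, up)
print(y.shape)  # (64, 64, 32)
```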

Magic modification 1: the multi-axis gated MLP block

The core contribution of this model is the proposed multi-axis gated MLP block (Multi-Axis gMLP block), a plug-and-play parallel block that enables global/local spatial interaction at linear complexity. We were inspired by the multi-axis self-attention module in HiT-GAN [15] (NeurIPS 2021), which performs efficient global/local information interaction on low-resolution feature maps and achieves SOTA on several image generation tasks. But that multi-axis is not quite this multi-axis: we need to use it on high-resolution low-level vision tasks, it has to be "fully convolutional", and at the same time we did not want to sacrifice the important global receptive field. So the tinkering began:

[Figure] Multi-axis gMLP block

As shown in the figure above, the input features are first channel-projected and then split into two heads, for global and local interaction respectively. Half of the heads enter the local branch (red in the figure), where a gMLP operator performs local spatial interaction within windows of a fixed size; the other half enter the global branch (green in the figure), where a gMLP operator again performs spatial interaction, but across the positions of a fixed-size grid, i.e., globally with dilation. Note that both the Block and Grid operations are window partitions (as in Swin), but the Block operation fixes the [window size] while the Grid operation fixes the [number of windows] (i.e., the grid size). In each of the two parallel branches, the operator only mixes along one fixed axis at a time and shares parameters along the other axis, which yields the "fully convolutional" property and global/local receptive fields at the same time. Since the window size and grid size are always fixed, the module also has linear computational complexity O(N).
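Below is a minimal NumPy sketch (illustrative only, not the official implementation) of the Block and Grid partitions plus a toy spatial-gating stand-in. The key point it tries to show: the (T, T) mixing weights are shared across all windows (Block) or all grid cells (Grid), which is what gives the fully convolutional behavior and the linear cost.

```python
import numpy as np

def block_partition(x, b):
    """(H, W, C) -> (num_windows, b*b, C): tokens inside each local b x b window."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, C)

def grid_partition(x, g):
    """(H, W, C) -> (num_cells, g*g, C): one token per position of the dilated g x g grid."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, g * g, C)

def spatial_gating(tokens, w_spatial):
    """Toy stand-in for gMLP spatial gating: split channels, mix one half along the
    token axis with a (T, T) weight shared by every window/grid cell, gate the other."""
    u, v = np.split(tokens, 2, axis=-1)
    return u * np.einsum("ts,nsc->ntc", w_spatial, v)

x = np.random.randn(64, 64, 32)                                                 # input feature map
local_out  = spatial_gating(block_partition(x, b=8), np.random.randn(64, 64))   # local branch
global_out = spatial_gating(grid_partition(x, g=8),  np.random.randn(64, 64))   # global branch
```

At inference on a larger image, only the number of windows/grid cells grows; the mixing weights stay the same, so the module runs unchanged at any resolution divisible by the window and grid sizes.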

Magic modification 2: the cross-gating block

One classic UNet variant is Attention-UNet [14], which adds a cross-attention mechanism on the symmetric skip connections to adaptively gate which feature maps get passed through. Inspired by this, we made a second modification and propose the [cross-gating block], shown in Figure 2(c). Its design closely follows the multi-axis gated MLP block and likewise uses multi-axis global/local gMLP interaction. The only difference is that after the gMLP operator extracts the gating weights, they are exchanged and multiplied across the two inputs. For example, if X and Y are two different input features, the idea of cross-gating can be written roughly as (the exact formula is in the paper and code):

X' = X ⊙ g(Y),    Y' = Y ⊙ g(X)

where g(·) denotes the gating weights produced by the multi-axis gMLP branch.

With this, we have the first pure-MLP cross-gating block capable of multi-feature interaction; it can perform global/local cross-information exchange and mutual modulation, is functionally equivalent to the cross-attention mechanism, and can be used plug-and-play without a second thought.
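A bare-bones sketch of the idea (hypothetical names; the real block wraps this in the multi-axis machinery above plus input/output projections) might look like:

```python
import numpy as np

def cross_gate(x, y, gate_fn):
    """Cross-gating: each input is modulated by gating weights computed from the *other*
    input. gate_fn maps a feature map to same-shape gating weights, e.g. a multi-axis
    gMLP branch; a plain sigmoid stands in for it below."""
    gx, gy = gate_fn(x), gate_fn(y)
    return x * gy, y * gx

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))          # toy gating function
x, y = np.random.randn(64, 64, 32), np.random.randn(64, 64, 32)
x_hat, y_hat = cross_gate(x, y, sigmoid)
```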

Magic modification 3: the multi-stage, multi-scale architecture

Circuit diagram warning!!! Circuit diagram warning!!! Circuit diagram warning!!!

[Figure] MAXIM adopts a multi-stage, multi-scale architecture that stacks multiple backbone networks.

To balance performance against computational cost, MAXIM adopts an improved multi-stage design and uses a deep-supervision strategy over the multi-stage, multi-scale outputs. In this paper, 2-stage and 3-stage networks are used for different tasks: MAXIM-2S and MAXIM-3S. Although MAXIM is a multi-stage network, it can still be trained end to end, without stage-wise or progressive training. At inference time, only the full-resolution output of the final stage is kept as the result. The loss function is a weighted sum, over all multi-stage and multi-scale outputs, of the Charbonnier loss and an L1 loss computed after a frequency-domain transform, which can be written as:

L = Σ_{s=1..S} Σ_{n=1..N} [ L_char(Rs,n, Tn) + λ · L_freq(Rs,n, Tn) ]

where Rs,n denotes the output image of the network at scale n in stage s, Tn denotes the target image (ground truth) at scale n, L_char is the Charbonnier loss, and L_freq is the frequency-domain L1 loss. The multi-stage, multi-scale design draws on the experience of previous works such as MPRNet [16], HINet [17], and MIMO-UNet [18].
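For concreteness, here is a small sketch of this objective (assumed tensor layout; the weight on the frequency term is illustrative, see the paper/code for exact values):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss, a smooth approximation of L1."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def frequency_l1(pred, target):
    """L1 distance between the 2D Fourier transforms of prediction and target."""
    return np.mean(np.abs(np.fft.fft2(pred, axes=(0, 1)) - np.fft.fft2(target, axes=(0, 1))))

def maxim_loss(outputs, targets, lam=0.1):
    """outputs[s][n]: prediction of stage s at scale n; targets[n]: ground truth resized
    to scale n; lam weights the frequency-domain term (illustrative value)."""
    total = 0.0
    for stage_outputs in outputs:                      # sum over stages s
        for r_sn, t_n in zip(stage_outputs, targets):  # sum over scales n
            total += charbonnier(r_sn, t_n) + lam * frequency_l1(r_sn, t_n)
    return total

# Example: 2 stages x 2 scales of random predictions against matching targets.
targets = [np.random.rand(64, 64, 3), np.random.rand(32, 32, 3)]
outputs = [[t + 0.01 * np.random.randn(*t.shape) for t in targets] for _ in range(2)]
print(maxim_loss(outputs, targets))
```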

[PS] I have to say this module is the author's only regret so far: the proposed structure still feels too complicated and not elegant enough. During early tuning with a single MAXIM backbone, I could never get close to SOTA performance; there was a long stretch of pain, wandering, and confusion, and a lot of carbon emissions went to waste. Only later did I learn the lesson: respect the domain knowledge of each field, and don't casually question why networks in that field are designed the way they are; that is the hard-won result of our predecessors' work. Multi-stage networks have become the classic architecture for harder restoration tasks such as deblurring and deraining; sure enough, MAXIM's performance took off immediately after switching to the multi-stage architecture. Surprisingly, multiple small stages give a much better performance/cost trade-off than a single-stage large network (see the ablation study). Surprised or not?

Experiments

Experimental setup

We aim to build a unified backbone network applicable to a wide range of low-level vision/image processing tasks. We therefore train and test on as many as 17 datasets across five different tasks. The datasets used are summarized as follows:

[Table] Summary of the five tasks and 17 datasets on which MAXIM is evaluated

Experimental results

Quantitative and qualitative results are shown below. A picture is worth a thousand words, so I won't go into detail; more results can be found in the appendix of the paper.

1. Denoising on the SIDD and DND datasets

[Figure] Real-noise denoising performance comparison
[Figure] Real-noise denoising visual comparison

2. Deblurring on the GoPro, HIDE, RealBlur, and REDS datasets

[Figure] Deblurring performance comparison
[Figure] Deblurring visual comparison

3. Deraining on the Rain100L, Rain100H, Test100, Test1200, Test2800, and RainDrop datasets

[Figure] Deraining performance comparison
[Figure] Deraining visual comparison
[Figure] Raindrop removal visual comparison

4. Dehazing on the RESIDE indoor and outdoor datasets

[Figure] Dehazing performance comparison
[Figure] Dehazing visual comparison

5. (Low-light) Enhancement on the MIT-Adobe FiveK and LOL datasets

[Figure] Enhancement performance comparison
[Figure] Enhancement visual comparison
[Figure] Low-light enhancement visual comparison

Ablation experiment

[Table] Ablation experiment summary

We run extensive ablation experiments to understand the MAXIM network:

  • [Module ablation] We found that each of the newly proposed modules improves the final performance. The modules tested include the cross-gating blocks (both within and across stages), the SAM block, and multi-scale deep supervision;

  • The core of this paper is a globally interactive MLP network that works on high-resolution images. So the question is: [how much does going global actually help]? Ablation B shows that the local and global MLP branches bring comparable improvements on their own, and work even better when combined;

  • [Why multi-stage?] Practice is the sole criterion of truth. We found that using multiple stages improves performance more noticeably than making a single network deeper and wider, and also gives a better balance of parameters and computation;

  • [Generality] The proposed multi-axis parallel block is a general recipe for converting operators that cannot handle varying resolutions into local/global operators with linear complexity and resolution adaptivity. We tried self-attention, gMLP, MLP-Mixer, and FFT [19] as different spatial mixing operators, and found that self-attention and gMLP achieve the best performance, while Mixer and FFT compute faster.

-- End --

References

  1. [Rant] MLP is all you need? https://zhuanlan.zhihu.com/p/370780575

  2. The IPT Transformer (CVPR 2021) cannot directly handle different resolutions. https://github.com/huawei-noah/Pretrained-IPT/issues/18

  3. "Fully-convolutional": a natural property of convolutional networks that lets them be applied to images of different resolutions. https://arxiv.org/abs/1411.4038

  4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929

  5. MLP-Mixer: An all-MLP Architecture for Vision. https://arxiv.org/abs/2105.01601

  6. Pay Attention to MLPs. https://proceedings.neurips.cc/paper/2021/hash/4cc05b35c2f937c5bd9e7d41d3686fff-Abstract.html

  7. Non-local Neural Networks. https://openaccess.thecvf.com/content_cvpr_2018/html/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.html

  8. GPT-3: Money Is All You Need. https://twitter.com/arankomatsuzaki/status/1270981237805592576?s=20&t=jEDzZJ2KrCIRUYVKL0vW7A

  9. The Transformer jointly built by Huawei and Peking University surpasses CNNs in CV: SOTA on multiple low-level vision tasks. https://zhuanlan.zhihu.com/p/328534225

  10. IPT, CVPR 2021 | Pre-trained Transformer for low-level vision | Walkthrough of Huawei's open-source code. https://zhuanlan.zhihu.com/p/384972190

  11. What do you think of Swin Transformer winning the ICCV 2021 best paper award? https://www.zhihu.com/question/492057377

  12. Uformer: A General U-Shaped Transformer for Image Restoration. https://arxiv.org/abs/2106.03106

  13. SwinIR: Image Restoration Using Swin Transformer. https://arxiv.org/abs/2108.10257

  14. Attention U-Net: Learning Where to Look for the Pancreas. https://arxiv.org/abs/1804.03999

  15. Improved Transformer for High-Resolution GANs. https://arxiv.org/abs/2106.07631

  16. Multi-Stage Progressive Image Restoration. https://arxiv.org/abs/2102.02808

  17. HINet: Half Instance Normalization Network for Image Restoration. https://arxiv.org/abs/2105.06086

  18. Rethinking Coarse-to-Fine Approach in Single Image Deblurring. https://arxiv.org/abs/2108.05054

  19. Global Filter Networks for Image Classification. https://arxiv.org/abs/2107.00645
