If you love iRMB, you won't be EMO? | ICCV 2023: An Inverted Residual Mobile Block Design Combining CNN and Transformer

Title: Rethinking Mobile Block for Efficient Attention-based Models
Paper: https://arxiv.org/abs/2301.01146
Code: https://github.com/zhangzjn/EMO

1. Introduction

This paper focuses on the design of lightweight and efficient model structures, weighing the trade-off among parameter count (#Params), computation (FLOPs), and accuracy. CNN-based models use the inverted residual block (IRB, Inverted Residual Block) [1] as their basic building block, whereas attention-based research still lacks a comparable basic structure.

Based on this, the paper rethinks the IRB in MobileNetv2 [1] and the multi-head self-attention (MHSA, Multi-Head Self-Attention) and feed-forward network (FFN, Feed-Forward Network) in Transformer [2], and abstracts them from a unified perspective into a Meta Mobile Block (MMB) with a single skip connection. Following a simple and effective design principle, the authors further instantiate a basic module for mobile applications, the Inverted Residual Mobile Block (iRMB), which combines the static short-range modeling capability of CNNs with the dynamic long-range feature interaction capability of Transformers, and build a lightweight backbone model, EMO (Efficient MOdel), composed only of iRMB.

Extensive experiments demonstrate the effectiveness of the proposed method, such as:

  • The EMO models at the 1M/2M/5M scales achieve 71.5/75.1/78.4 Top-1 accuracy on the ImageNet-1K classification dataset, surpassing current SoTA CNN/Transformer-based lightweight models;
  • Based on the SSDLite framework, EMO obtains 22.0/25.2/27.9 mAP with only 0.6G/0.9G/1.8G FLOPs;
  • Based on the DeepLabv3 framework, EMO obtains 33.5/35.3/37.8 mIoU with only 2.4G/3.5G/5.8G FLOPs, showing better results under mobile settings.

2. Methods and contributions

Motivation

The IRB based on depth-wise convolution has become the basic building block of lightweight convolutional models. Related efficient models (such as MobileNetv2 [1], EfficientNet [3], ShuffleNet [4], etc.) run fast on mobile devices, but limited by the static inductive bias of CNNs, their performance (especially on downstream tasks such as detection and segmentation) still leaves room for improvement. On the other hand, thanks to the dynamic modeling capability of Transformer, many works based on ViT [5] (Vision Transformer), such as DeiT [6], PVT [7], and Swin Transformer [8], have achieved clear gains over CNN models. However, due to the quadratic computational cost of MHSA (Multi-Head Self-Attention), Transformers are rarely used in mobile scenarios.

Therefore, some researchers have tried to design hybrid models that combine the two, but these generally introduce complex structures [9,10,11] or multiple hybrid modules [10,12], which are unfriendly to deployment and optimization on the application side. This prompts the question of whether the advantages of CNN and Transformer structures can be combined into a lightweight basic module similar to IRB. Based on this, the paper first abstracts the MMB (Meta Mobile Block) to unify the IRB and the MHSA/FFN in Transformer, then instantiates the efficient iRMB (Inverted Residual Mobile Block), and finally uses only this module to build the efficient EMO (Efficient MOdel) lightweight backbone.

Meta Mobile Block

As shown on the left side of the figure above, by abstracting the IRB in MobileNetv2 and the core MHSA and FFN modules in Transformer, the authors propose a unified MMB that inductively represents these structures, using an expansion rate λ and an efficient operator F to instantiate different modules.
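To make the abstraction concrete, here is a minimal PyTorch sketch of the MMB idea, written as an illustration rather than taken from the official repository: expand the channels by a rate λ, apply an efficient operator F, project back, and add a single skip connection. The class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class MetaMobileBlock(nn.Module):
    """Hypothetical sketch of the MMB abstraction: expand -> efficient operator F -> project, with one skip."""
    def __init__(self, dim, expansion, efficient_op):
        super().__init__()
        hidden = int(dim * expansion)          # expansion rate λ
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.op = efficient_op(hidden)         # efficient operator F
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        return x + self.project(self.op(self.expand(x)))  # single skip connection

# Different choices of F recover familiar blocks (in simplified 2D form):
#   F = identity                  -> an FFN-like block (two point-wise convs)
#   F = 3x3 depth-wise conv       -> an IRB-like block from MobileNetv2
#   F = multi-head self-attention -> an MHSA-based Transformer-like block
dw_conv = lambda c: nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)
block = MetaMobileBlock(dim=64, expansion=4, efficient_op=dw_conv)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```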

Inverted Residual Mobile Block (iRMB)

The behavior of different instantiations mainly comes from the specific form of the efficient operator F. Considering lightweight design and ease of use, the authors model F as a cascade of Expanded Window MHSA (EW-MHSA) and Depth-Wise Convolution (DW-Conv), which combines the advantages of dynamic global modeling and static local information fusion, effectively enlarging the receptive field and improving capability on downstream tasks.
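The following simplified PyTorch sketch illustrates this cascade, with plain window-based MHSA standing in for EW-MHSA (the official EW-MHSA additionally computes Q/K from the unexpanded input; consult the repository for the exact implementation). All names and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleIRMB(nn.Module):
    """Simplified iRMB sketch: 1x1 expand -> window MHSA -> 3x3 DW-Conv -> 1x1 project, plus skip."""
    def __init__(self, dim, expansion=4, num_heads=4, window=7):
        super().__init__()
        hidden = int(dim * expansion)
        self.window = window
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        ws = self.window
        z = self.expand(x)
        d = z.shape[1]
        # split into non-overlapping ws x ws windows and attend within each window
        zw = z.unfold(2, ws, ws).unfold(3, ws, ws)                 # (b, d, h/ws, w/ws, ws, ws)
        zw = zw.permute(0, 2, 3, 4, 5, 1).reshape(-1, ws * ws, d)  # each window becomes a token sequence
        zw, _ = self.attn(zw, zw, zw)                              # dynamic modeling within each window
        z = zw.reshape(b, h // ws, w // ws, ws, ws, d)
        z = z.permute(0, 5, 1, 3, 2, 4).reshape(b, d, h, w)        # stitch windows back together
        z = self.dw(z)                                             # static local modeling
        return x + self.project(z)

block = SimpleIRMB(dim=48)
print(block(torch.randn(1, 48, 14, 14)).shape)  # torch.Size([1, 48, 14, 14])
```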

Further, the authors set the expansion rate λ to 4 and replace the standard Transformer blocks in DeiT and PVT with iRMB to evaluate its performance. As shown in the following table, iRMB achieves higher performance with fewer parameters and less computation under the same training settings.

Lightweight and efficient model

To better evaluate mobile lightweight models, the authors define the following four criteria:

  1. Usability: simple implementation without complex operators, easy to optimize on the application side.
  2. Simplicity: as few core modules as possible, to keep model complexity low.
  3. Effectiveness: good performance on classification and downstream tasks.
  4. Efficiency: a good balance among parameter count, computation, and accuracy.

The table below summarizes the differences between this method and several other mainstream lightweight models:

Based on the above criteria, the authors design the efficient EMO model:

  1. At the macro-framework level, EMO is composed only of iRMB without diversified modules, which differs from the design philosophy of current lightweight models and can be called the simple way.
  2. At the micro-module level, iRMB consists only of convolution and multi-head self-attention without other complex structures. Furthermore, thanks to DW-Conv, iRMB can implement downsampling through strides and needs no positional embeddings to introduce an inductive bias for MHSA.
  3. At the model-variant level, the authors simply increase the expansion rate and the number of channels stage by stage (a rough sketch follows this list). The detailed configuration is shown in the following table:

  4. At the training-strategy level, EMO does not rely on strong data augmentation or training tricks, which further reflects the effectiveness of its module design.
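As a rough illustration of the stage-wise design above, the sketch below assembles an EMO-style backbone from inverted-residual placeholder blocks, with channel widths and expansion rates growing per stage. The depths, channels, and expansion rates are illustrative placeholders, not the official EMO-1M/2M/5M configurations, and in the real EMO downsampling happens inside the first iRMB of each stage via a stride-2 DW-Conv rather than a separate layer.

```python
import torch
import torch.nn as nn

class IRMBLike(nn.Module):
    """Placeholder inverted-residual block standing in for iRMB (attention omitted for brevity)."""
    def __init__(self, dim, expansion):
        super().__init__()
        hidden = int(dim * expansion)
        self.body = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # DW-Conv
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return x + self.body(x)

def build_emo_like(stages=((2, 32, 2.0), (2, 48, 2.5), (4, 80, 3.0), (2, 160, 3.5))):
    """stages = (depth, channels, expansion rate) per stage -- illustrative values, not the paper's."""
    layers = [nn.Conv2d(3, stages[0][1], 3, stride=2, padding=1)]  # simple stem
    c_prev = stages[0][1]
    for depth, channels, expansion in stages:
        layers.append(nn.Conv2d(c_prev, channels, 3, stride=2, padding=1))  # stand-in stage downsampling
        layers += [IRMBLike(channels, expansion) for _ in range(depth)]
        c_prev = channels
    return nn.Sequential(*layers)

model = build_emo_like()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M params")
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 160, 7, 7])
```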

3. Experiment

ImageNet-1K classification

Object detection

Semantic segmentation

Qualitative results

In addition, the authors conduct extensive qualitative and quantitative experiments to illustrate the effectiveness of the proposed method; please refer to the original paper for further details.

Summary

From a technical point of view, this work rethinks some of the key lightweight designs of ViT and CNN, in a spirit similar to the earlier MetaFormer. Under the constraints of the mobile setting, the paper proposes a simple and effective module, the Inverted Residual Mobile Block (iRMB), and achieves leading results on multiple datasets without using strong data augmentation. The overall design is simple and effective, and the code and models have been open sourced. Welcome to try them!

References

[1] Sandler, Mark, et al. “Mobilenetv2: Inverted residuals and linear bottlenecks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
[3] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International conference on machine learning. PMLR, 2019.
[4] Zhang, Xiangyu, et al. “Shufflenet: An extremely efficient convolutional neural network for mobile devices.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[5] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
[6] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International conference on machine learning. PMLR, 2021.
[7] Wang, Wenhai, et al. “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[8] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[9] Chen, Yinpeng, et al. “Mobile-former: Bridging mobilenet and transformer.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[10] Maaz, Muhammad, et al. “Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications.” European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[11] Mehta, Sachin, and Mohammad Rastegari. “Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer.” ICLR. 2022.
[12] Pan, Junting, et al. “Edgevits: Competing light-weight cnns on mobile devices with vision transformers.” European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
