ICCV 2023 | EfficientViT: a SOTA semantic segmentation model for edge devices, also enabling efficient SAM inference

Introduction

This paper addresses the excessive computational cost of deploying state-of-the-art semantic segmentation models on edge devices. The authors point out that previous semantic segmentation models usually rely on self-attention mechanisms, computationally intensive large convolution kernels, or complex topologies to obtain good performance, but these approaches are not suitable for edge devices. To this end, the paper proposes EfficientViT, a new family of semantic segmentation models aimed at efficient semantic segmentation on edge devices.

The core of EfficientViT is a novel lightweight multi-scale attention module. It replaces softmax attention with ReLU-based global attention, reducing computational complexity from quadratic to linear while retaining the same feature extraction ability, and it uses small convolution kernels to aggregate nearby Q/K/V tokens into multi-scale tokens, on which global attention is then performed. This combines a global receptive field with multi-scale learning.

Ultimately, the model achieves a good balance between performance and hardware efficiency, providing a feasible solution for deploying semantic segmentation applications on edge devices. EfficientViT offers significant advantages in speed and performance over previous semantic segmentation models, making it a strong choice for practical applications.

Method

Lightweight Multi-Scale Attention

The lightweight multi-scale attention module is designed to balance performance and efficiency for semantic segmentation on edge devices. On the performance side, a global receptive field and multi-scale learning are both important for improving semantic segmentation models. Unlike previous multi-scale attention modules, the key question explored in this article is how to achieve the same global receptive field and multi-scale learning while relying only on hardware-friendly operators, in contrast to the Transformer's native MHSA, which has high computational cost and low efficiency.

The right side of Figure 2 shows the structure of the LMSA module. Its core idea is to trade a small amount of model capacity for a significant improvement in computational efficiency. A key component, used to achieve the global receptive field, is lightweight ReLU-based attention.

ReLU-based Attention

To achieve a global receptive field, traditional methods usually rely on heavy softmax self-attention, but its computational complexity is too high for edge devices. This article therefore proposes replacing it with lightweight ReLU-based global attention. This is a linear attention module that has been used in other fields before, but had not yet been applied to semantic segmentation.

By using ReLU-based global attention, a global receptive field can be achieved while the computational complexity drops from quadratic to linear in the number of tokens, without sacrificing the original feature extraction capability. Furthermore, ReLU-based global attention does not involve hardware-inefficient operations such as softmax, and is therefore more efficient on hardware.
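The quadratic-to-linear trick comes from reordering the matrix products: with a ReLU kernel in place of softmax, the shared term `K^T V` can be computed once and reused for every query. A minimal single-head NumPy sketch (illustrative only, not the paper's implementation) showing that the linear form matches the explicit quadratic form:

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU-based linear attention: O(N * d^2) instead of O(N^2 * d).

    q, k, v: (N, d) arrays for a single attention head.
    """
    q = np.maximum(q, 0.0)          # ReLU replaces softmax as the similarity kernel
    k = np.maximum(k, 0.0)
    kv = k.T @ v                    # (d, d): computed once, shared by all queries
    z = k.sum(axis=0)               # (d,): shared normalizer accumulator
    num = q @ kv                    # (N, d)
    den = q @ z + eps               # (N,)
    return num / den[:, None]

def relu_attention_quadratic(q, k, v, eps=1e-6):
    """Reference: same kernel, but via the explicit (N, N) similarity matrix."""
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    sim = q @ k.T                   # (N, N): the quadratic cost we avoid above
    return (sim @ v) / (sim.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out_lin = relu_linear_attention(q, k, v)
out_quad = relu_attention_quadratic(q, k, v)
print(np.allclose(out_lin, out_quad, atol=1e-5))  # True
```

Because the ReLU outputs are non-negative, the normalizer is always positive, so no softmax (or other exponentiation) is needed anywhere in the computation.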

Multi-Scale Token

ReLU-based attention alone has limited capacity, so to enhance multi-scale learning the paper aggregates contextual information from nearby Q/K/V tokens to obtain multi-scale tokens. This aggregation is performed independently for each Q, K, and V in each head. To avoid hurting hardware efficiency, the authors use small convolution kernels to perform the aggregation.

In addition, to execute these operations efficiently on GPU, the paper fuses all depthwise separable convolutions (DWConv) into a single DWConv and all 1x1 convolutions into a single 1x1 group convolution, following the group convolution paradigm. After the multi-scale tokens are generated, a global attention operation is performed on them to extract multi-scale global features. Finally, features from the different scales are concatenated along the head dimension and fed into a final linear projection layer to fuse them.
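The fusion of per-branch 1x1 convolutions into one group convolution is exact, because each branch reads a disjoint channel slice: the grouped weight matrix is just the block-diagonal stack of the branch weights, launched as one op instead of several. A small NumPy sketch of this equivalence (treating a 1x1 conv as a per-token matmul; shapes here are illustrative):

```python
import numpy as np

def group_conv1x1(x, weights):
    """One fused 1x1 group convolution over all groups.

    x: (N, C) token features; weights: list of (C_g_out, C_g_in) per group.
    Equivalent to slicing x per group and applying each weight separately,
    but expressed as a single block-diagonal matmul (one kernel launch).
    """
    W = np.zeros((sum(w.shape[0] for w in weights),
                  sum(w.shape[1] for w in weights)))
    r = c = 0
    for w in weights:               # lay branch weights on the block diagonal
        W[r:r + w.shape[0], c:c + w.shape[1]] = w
        r += w.shape[0]
        c += w.shape[1]
    return x @ W.T

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 12))                     # 16 tokens, 12 channels
ws = [rng.standard_normal((4, 4)) for _ in range(3)]  # 3 groups of 4 channels

fused = group_conv1x1(x, ws)
separate = np.concatenate(
    [x[:, i * 4:(i + 1) * 4] @ w.T for i, w in enumerate(ws)], axis=1)
print(np.allclose(fused, separate))  # True
```

On GPU the benefit is fewer, larger kernel launches rather than fewer FLOPs; frameworks express the same thing via the `groups` argument of a convolution.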

EfficientViT Architecture

As mentioned earlier, EfficientViT is a model family built on the Lightweight Multi-Scale Attention (LMSA) module and designed for semantic segmentation tasks on edge devices. The framework is as follows:

As shown in the figure, EfficientViT adopts a standard backbone-head (encoder-decoder) architecture, and its core building block is the EfficientViT module, shown on the left side of Figure 2. This module includes two key components:

  • Lightweight MSA Module: This module is used to extract contextual information to help the model understand the global information in the image.
  • MBConv: used to extract local information, helping to capture local features in the image.

The following is a brief introduction to its network structure:

  • Backbone: The backbone of EfficientViT follows the standard design, consisting of a stem followed by four stages, in which the feature map size gradually decreases while the number of channels gradually increases. EfficientViT modules are inserted into the third and fourth stages. For downsampling, MBConv with a stride of 2 is used.

  • Head: P2, P3, and P4 denote the outputs of the second, third, and fourth stages, forming a feature map pyramid. For simplicity and efficiency, 1x1 convolutions and standard upsampling operations (e.g., bilinear/bicubic upsampling) are used to match their spatial and channel sizes, after which they are fused by addition. Since the EfficientViT backbone already has strong contextual information extraction capabilities, the head design is kept simple, comprising several MBConv blocks and the output layers (i.e., prediction and upsampling). In experiments, the researchers found this simplified head sufficient to achieve SOTA performance, thanks to the lightweight MSA module.
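The head's match-upsample-add fusion can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's code: random untrained 1x1-conv weights, nearest-neighbor upsampling in place of bilinear/bicubic, and hypothetical channel counts for P2/P3/P4:

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_pyramid(feats, weights, out_hw):
    """Match channels with 1x1 convs, upsample to a common size, fuse by addition.

    feats:   list of (C_i, H_i, W_i) maps, e.g. the P2/P3/P4 outputs.
    weights: list of (C_out, C_i) 1x1-conv weights (hypothetical, untrained).
    """
    fused = None
    for x, w in zip(feats, weights):
        y = np.einsum('oc,chw->ohw', w, x)          # 1x1 conv = per-pixel matmul
        y = upsample_nearest(y, out_hw // x.shape[1])  # bring to P2 resolution
        fused = y if fused is None else fused + y   # additive fusion
    return fused

rng = np.random.default_rng(2)
p2 = rng.standard_normal((32, 16, 16))   # hypothetical stage outputs
p3 = rng.standard_normal((64, 8, 8))
p4 = rng.standard_normal((128, 4, 4))
ws = [rng.standard_normal((48, c)) for c in (32, 64, 128)]
out = fuse_pyramid([p2, p3, p4], ws, out_hw=16)
print(out.shape)  # (48, 16, 16)
```

The fused map would then pass through the head's MBConv blocks and the prediction/upsampling layers.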

In addition, in order to meet different efficiency requirements, a scaling mechanism is also provided. The detailed configuration is shown in the table below:

Experiments

First, as shown in the table below, the EfficientViT backbone performs well on the ImageNet image classification task. In particular, EfficientViT-B3 achieves 84.2% Top-1 accuracy on ImageNet, a 0.2 improvement over EfficientNet-B6 while being 7.9x faster. Compared with ConvNeXt, inference is considerably faster at comparable accuracy.

Second, on the Cityscapes semantic segmentation dataset, EfficientViT outperforms previous SOTA semantic segmentation models, providing significant efficiency gains without sacrificing performance. In particular, compared with SegFormer, it achieves up to 13x savings in MACs (multiply-accumulate operations) and up to 15x lower latency while delivering higher mIoU (mean intersection over union). Compared with SegNeXt, EfficientViT achieves up to 9.3x speedup on mobile devices while maintaining higher mIoU. Even at similar computational cost, EfficientViT-B3 achieves an mIoU gain of +4.7 over SegFormer-B1.

On the ADE20K semantic segmentation dataset, EfficientViT also achieves significant efficiency improvements. For example, EfficientViT-B1 achieves an mIoU gain of +0.5 over SegFormer-B1 while reducing MACs by 5.9x and latency by 6.5x. EfficientViT-B2 achieves an mIoU gain of +0.8 over SegNeXt-S while reducing MACs by 2.7x and latency by 5.2x.

Conclusion

This article provides an in-depth exploration of efficient semantic segmentation architecture design for edge devices. In particular, the paper introduces a lightweight multi-scale attention module that achieves a global receptive field and multi-scale learning simultaneously using only lightweight, hardware-efficient operations, and is therefore efficient on edge devices. As a result, this SOTA semantic segmentation model achieves significant speedups without performance loss.



Origin blog.csdn.net/CVHub/article/details/134224830