[Reading Paper] FMViT: A multiple-frequency mixing Vision Transformer-Looking forward to the source code

FMViT: A multiple-frequency mixing Vision Transformer

Abstract

  • The Transformer model has been widely used in computer vision tasks in recent years. However, since the time and memory complexity of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to achieve efficient performance in real industrial deployment scenarios, such as the TensorRT and CoreML pipelines that CNNs already serve well. Although there have been recent attempts to design CNN-Transformer hybrid architectures to solve this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture called FMViT. It enhances the expressive power of the model by mixing high-frequency and low-frequency features, allowing it to capture both local and global information effectively. In addition, we introduce deployment-friendly mechanisms such as convolutional multi-group reparameterization (gMLP), lightweight multi-head self-attention (RLMHSA), and the convolutional fusion block (CFB) to further improve the model's performance and reduce computational overhead.

  • Our experiments show that FMViT outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% in top-1 accuracy on the ImageNet dataset (83.3% vs. 80.8%) while maintaining similar inference latency. Furthermore, FMViT achieves performance comparable to EfficientNet-B5 while improving inference speed by 43%. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset with comparable inference latency (78.5% vs. 75.9%). Our code can be found at https://github.com/tany0699/FMViT.

Introduction

  • In recent years, Vision Transformers (ViTs) have achieved success in a variety of computer vision applications such as image classification, object detection, and semantic segmentation, and have received widespread attention from industry and academia. Despite this, convolutional neural networks (CNNs) are still the first choice for real-world vision tasks, mainly because ViTs often run slower than traditional CNNs (such as ResNets). The inference speed of the Transformer model is limited by elements such as the multi-head self-attention (MHSA) mechanism, non-fused LayerNorm and GELU layers, and frequent memory accesses.

  • A lot of effort has been put into freeing ViTs from their high-latency issues. For example, models such as Swin, PoolFormer, Reformer, MaxViT, SepViT, and MobileViT strive to develop more effective spatial attention methods and alleviate the quadratic growth of MHSA computational complexity. Meanwhile, other projects, including EfficientFormer and MobileViT, explore hybrid CNN-Transformer architectures that balance accuracy and latency by integrating efficient convolution blocks with efficient Transformer blocks. It is worth noting that most current state-of-the-art (SOTA) models are designed as CNN-Transformer hybrid structures: they mainly use convolution blocks in the early stages and stack Transformer blocks in the final stages.

  • Currently, in existing work, neither convolution blocks nor Transformer blocks achieve efficiency and performance at the same time. Although the trade-off between accuracy and latency has improved over the years, the overall performance of modern hybrid designs still needs improvement. This study introduces four key components for designing effective vision Transformer networks to address these challenges. First, inspired by NextViT's mixing of high-frequency and low-frequency features, a powerful multi-frequency mixing block (FMB) is introduced to fuse multiple high-frequency and low-frequency features, enhancing the model's information flow and expressiveness. Second, a lightweight convolutional fusion block (CFB) is proposed, which combines the local modeling ability of convolution with convolutional multi-group reparameterization, further improving modeling performance. Third, a convolutional multi-group reparameterization method is proposed: it integrates the spatial information of different sub-channels during the training phase and fuses the parallel branches into a single convolution during the inference phase, improving the accuracy of the model while maintaining inference speed. Finally, a lightweight self-attention block called RLMHSA is developed, which employs a lightweight, reparameterized design to enhance modeling capability and speed up the inference phase.

  • Based on the above methods, a CNN-Transformer hybrid architecture called FMViT is proposed. As in NextViT, TensorRT and CoreML are used to represent actual deployment environments on server-side and mobile devices respectively, and their inference latency reflects real time consumption in industry.

  • As shown in the figure below, FMViT achieves the best balance between latency and accuracy on the ImageNet-1K classification task. On TensorRT, FMViT's top-1 accuracy on the ImageNet dataset exceeds ResNet101 by 2.5% while maintaining comparable inference latency. At the same time, its performance is comparable to EfficientNet-B5, with inference speed increased by 43%. On CoreML, its top-1 accuracy on the ImageNet dataset is 2.6% higher than MobileOne while maintaining similar inference latency.


    • Speed-performance trade-off on ImageNet1K

  • Our main contributions are summarized below:

    • An efficient multi-frequency mixing block (FMB) is proposed to combine multiple sets of high-frequency and low-frequency features, enhancing the model's information flow and expressive ability.

    • A lightweight convolutional fusion block (CFB) is proposed, which effectively combines the local modeling capability of convolution with convolutional multi-group reparameterization to further improve modeling performance.

    • A convolutional multi-group reparameterization method is proposed, which integrates the spatial information of different sub-channels during the training phase and fuses the parallel branches into a single convolution during the inference phase, improving the accuracy of the model while keeping inference speed unchanged.

    • A multi-group multilayer perceptron (gMLP) block is proposed to fuse global and local information and enhance the expressive ability of the model.

    • A lightweight self-attention module (RLMHSA) is proposed, which uses a lightweight, reparameterized design to enhance the module's global modeling capability and speed up the inference phase.

Related Work

Convolutional Networks
  • Since 2012, convolutional neural networks (CNNs) have been the de facto standard visual architecture for various computer vision applications, such as semantic segmentation, object recognition, and image classification. ResNet uses residual connections to prevent network degradation, allowing deep networks to capture high-level abstractions. DenseNet, on the other hand, promotes the connection of feature maps and the reuse of features. MobileNets introduced depthwise and pointwise convolutions to create models with less memory usage and faster response times. By using grouped pointwise convolutions and channel shuffling, ShuffleNet significantly reduces computational cost. ShuffleNetV2 argues that network architecture design should prioritize direct metrics, such as speed, rather than indirect metrics, such as FLOPs. ConvNeXt revisits the architecture of vision Transformers and proposes a pure CNN model that competes with state-of-the-art hierarchical vision Transformers on a range of computer vision benchmarks while retaining the simplicity and efficiency of traditional CNNs.
Vision Transformers
  • The concept of Transformer was originally introduced in the field of Natural Language Processing (NLP). ViT achieves self-attention by segmenting images into small patches and treating these patches as words, thus demonstrating the Transformer's efficacy in a variety of vision-related tasks. The knowledge distillation teacher-student method proposed by DeiT is specially designed for transformers. T2T-ViT introduces a unique token-to-token process that progressively tokenizes images into tokens and structurally aggregates them. Swin Transformer introduces a common backbone that builds hierarchical features with a computational cost that scales linearly with image size. At the same time, PiT added a pooling layer to ViT and verified its effectiveness through experiments.
Hybrid Models
  • Recent research shows that hybrid designs integrating convolutions and Transformers effectively leverage the advantages of both architectures. BoTNet replaces the spatial convolutions in the last three bottleneck blocks of ResNet with global self-attention. Meanwhile, lightweight and efficient ViTs, such as MobileViT and MobileViTv2, have been introduced for mobile devices. Mobile-Former bridges MobileNet and Transformer with a lightweight cross-attention mechanism, improving computational efficiency and representational capability. EfficientFormer and EfficientFormerV2 adhere to a dimensionally consistent design, combining hardware-friendly 4D MetaBlocks with powerful 3D MHSA blocks and searching jointly for size and speed via NAS. ToMe proposes a way to accelerate ViT models without training. BiFormer establishes an efficient pyramid network architecture through bi-level routing attention. NextViT captures one high-frequency feature and one low-frequency feature in the network separately and then blends them to enhance the model's modeling capability.
Structural Reparameterization
  • During the training phase, reparameterization uses complex modules to improve model performance; following the linearity of the convolution operator, these complex modules are merged into simpler ones during the inference stage. This increases the model's inference speed without sacrificing accuracy. ACNet was among the first to use reparameterization, merging asymmetric convolutions into a standard convolution, while RepVGG reparameterizes skip connections, thereby reducing memory access costs. DBB further extends six popular reparameterization structures. The concept of linear training-time overparameterization was introduced to enhance the capabilities of such models. MobileOne uses overparameterization to improve the performance of vision models on mobile devices.

Methods

  • This section introduces the proposed FMViT architecture, followed by a discussion and analysis of its key designs: the convolutional fusion block (CFB), the multi-frequency mixing block (FMB), the lightweight multi-head attention module (RLMHSA), the convolutional multi-group reparameterization method, and the MLP module (gMLP) built with it.

Overview

  • The figure below illustrates the overall FMViT architecture. FMViT adopts a traditional pyramid structure, with each stage consisting of a downsampling module and either a convolution-only convolutional fusion block (CFB) or a Transformer-based multi-frequency mixing block (FMB). The stem reduces the spatial resolution to a quarter of the original input image, and each subsequent stage halves the spatial resolution while gradually increasing the number of channels. We explored the information interaction module and, inspired by MetaFormer, introduced the convolutional fusion block (CFB) to handle short-range (local) dependencies. The multi-frequency mixing block (FMB) is proposed to further integrate local and global information by decomposing multiple frequency bands.


    • The figure shows the overall structure of FMViT. It mainly includes the convolutional fusion block (CFB), the multi-frequency mixing block (FMB), the lightweight multi-head attention module (RLMHSA), and the reparameterized multi-layer perceptron module (gMLP).

  • This multi-channel frequency fusion enhances the modeling capability of the model. To reduce the computational complexity of multi-head self-attention (MHSA), a lightweight RLMHSA module is proposed: through parameter sharing and reparameterization, the inference speed of the model is improved without affecting its accuracy. Built from these core modules, a series of CNN-Transformer hybrid architectures achieves the best balance between accuracy and latency on mobile CPUs and server GPUs, surpassing state-of-the-art models.

Convolutional multi-group reparameterization
  • 1x1 convolution is a linear fusion and channel-conversion mechanism with global (cross-channel) modeling capability, while the translation invariance of kxk convolution captures local spatial structure. The lack of local modeling between adjacent channel features limits the effective information fusion of non-grouped convolution operators. In the training phase, we propose to reparameterize kxk convolutions. We first define a convolution operation as CONV(Kh, Kw, G), where Kh and Kw denote the kernel size and G denotes the number of groups.

  • Assume the original convolution is defined as CONV_A = CONV(Kh, Kw, G). During the training phase, multiple convolutions with different group sizes are connected in parallel:

    • $CONV_B = CONV_A + \sum_{i=1}^{N} CONV(K_{h_i}, K_{w_i}, G_i)$

    • where $\forall i \in N,\; G_i \ge G,\; K_{h_i} \le K_h,\; K_{w_i} \le K_w$, and N is a constant.

  • The figure below illustrates the process of reparameterizing CONV_B into CONV_A during the inference phase. Any CONV(K_{h_i}, K_{w_i}, G_i) convolution is equivalent to a sparse CONV(Kh, Kw, G) convolution: the weights shown with dotted lines in the figure are held at zero, while the other weights remain unchanged. By additivity, two parallel convolutions with the same number of groups can be reparameterized into a single convolution CONV_A at inference time, as shown in the lower half of the figure, where the left side represents the training stage and the right side the equivalent inference stage. Convolutional multi-group reparameterization improves model performance without affecting the inference time of the deployed model; a code sketch of this merge is given after the figure.


    • Top: a schematic illustrating convolutional multi-group reparameterization; CONV(K_{h_i}, K_{w_i}, G_i) is equivalent to a sparse CONV(Kh, Kw, G). Bottom: the multiple groups of convolutions used in the training phase are equivalent to a single convolution in the inference phase.
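  • To make the merging rule concrete, here is a minimal sketch in plain PyTorch (helper names are ours, not the authors' released code) that expands a grouped convolution into an equivalent dense convolution and folds a parallel branch into the main convolution, then checks numerical equivalence for the simplified case where all branches share the same kernel size:

```python
import torch
import torch.nn as nn

def grouped_to_dense(conv: nn.Conv2d) -> torch.Tensor:
    """Expand a grouped Conv2d weight into an equivalent dense (groups=1) weight;
    the cross-group positions are simply zero (the 'dotted-line' weights)."""
    g = conv.groups
    out_c, in_c_per_g, kh, kw = conv.weight.shape
    dense = torch.zeros(out_c, in_c_per_g * g, kh, kw)
    out_per_g = out_c // g
    for i in range(g):
        dense[i * out_per_g:(i + 1) * out_per_g,
              i * in_c_per_g:(i + 1) * in_c_per_g] = conv.weight.data[i * out_per_g:(i + 1) * out_per_g]
    return dense

@torch.no_grad()
def reparameterize(main: nn.Conv2d, branches: list) -> nn.Conv2d:
    """Fold parallel grouped branches into the main conv: CONV_B = CONV_A + sum_i CONV(..., G_i)."""
    fused = nn.Conv2d(main.in_channels, main.out_channels, main.kernel_size,
                      stride=main.stride, padding=main.padding, bias=True)
    weight = grouped_to_dense(main)
    bias = main.bias.data.clone() if main.bias is not None else torch.zeros(main.out_channels)
    for b in branches:
        weight += grouped_to_dense(b)
        if b.bias is not None:
            bias += b.bias.data
    fused.weight.data.copy_(weight)
    fused.bias.data.copy_(bias)
    return fused

# Quick equivalence check: training-time parallel sum vs. fused inference conv.
x = torch.randn(1, 8, 16, 16)
conv_a = nn.Conv2d(8, 8, 1)                        # CONV_A, groups=1
branch = nn.Conv2d(8, 8, 1, groups=4)              # CONV(1,1,G_i) with G_i=4
y_train = conv_a(x) + branch(x)
y_infer = reparameterize(conv_a, [branch])(x)
print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```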

  • RepMLP proposes a reparameterization of the multilayer perceptron (MLP): it fuses a conv KxK into a fully connected (FC) layer, but the resulting weight matrix is sparse and the converted parameter count becomes KxK times the original, which is unsuitable for lightweight scenarios. In a Transformer, the MLP has global modeling capability and contributes greatly to performance, but its lack of local modeling capability limits its potential. To help the original convolutions perform grouped modeling of local channels, multiple parallel 1x1 convolutions with G_i > 1 are added to the two original CONV(1,1,1) convolutions to reconstruct the MLP module. This lets the block attend to information from different representation subspaces at different locations and achieve efficient local representation learning.

    • $CONV'_X(1,1,1) = CONV_X(1,1,1) + \sum_{i=1}^{N} CONV(1, 1, G_i)$

    • Here X = 1 or X = 2, denoting CONV1 or CONV2. CONV1 and CONV2 are the two convolutions of the MLP before reparameterization, and the corresponding convolutions after reparameterization are denoted CONV'1 and CONV'2.

  • To enhance the mixing of global and local information in the MLP, a depthwise convolution is added between the two convolutions. A shortcut connection ensures that the global and local information flows do not interfere with each other, and the added depthwise convolution is also reparameterized. Experimental results on ImageNet-1K show that strengthening local information fusion with the depthwise convolution improves the performance of the MLP by 1.96%.
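  • A hypothetical sketch of such a gMLP block is given below (module names, group sizes, and norm/activation placement are our assumptions for illustration): each 1x1 convolution carries parallel grouped 1x1 branches during training, and a depthwise convolution with a shortcut sits between the two.

```python
import torch
import torch.nn as nn

class RepConv1x1(nn.Module):
    """1x1 conv plus parallel grouped 1x1 branches (foldable after training)."""
    def __init__(self, in_c, out_c, group_sizes=(2, 4)):
        super().__init__()
        self.main = nn.Conv2d(in_c, out_c, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_c, out_c, 1, groups=g) for g in group_sizes)

    def forward(self, x):
        return self.main(x) + sum(b(x) for b in self.branches)

class GMLP(nn.Module):
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = RepConv1x1(dim, hidden)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # local mixing
        self.fc2 = RepConv1x1(hidden, dim)

    def forward(self, x):
        h = self.act(self.bn(self.fc1(x)))
        h = h + self.dw(h)              # shortcut keeps global and local flows separate
        return self.fc2(h)

print(GMLP(64)(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```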

Convolutional Fusion Block (CFB)
  • The Transformer block has demonstrated remarkable results in various vision tasks, and the attention-based token mixer and the MetaFormer paradigm highlight its inherent strengths. However, the Transformer block can be inefficient at inference time, especially on mobile devices where multi-head attention, LayerNorm, and GELU are poorly optimized. Building on the MetaFormer paradigm, we propose an efficient CFB module that uses only depthwise separable convolution (DWConv) as the token mixer. CFB keeps the deployment advantages of the bottleneck block while achieving performance comparable to the Transformer block, as shown in the figure above. CFB is built from DWConv and an MLP, following the general MetaFormer design, and ensures accuracy while significantly improving inference deployment performance. In addition, reparameterization is applied during training to further improve CFB: the DWConv uses standard, widely used reparameterization, and the MLP adopts the convolutional multi-group reparameterization method proposed in this article.
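  • The following is a rough, self-contained sketch of a CFB-style block under the MetaFormer pattern described above (a depthwise-convolution token mixer plus a channel MLP with BN/ReLU); the exact layer arrangement and reparameterization branches of the released model may differ.

```python
import torch
import torch.nn as nn

class CFB(nn.Module):
    def __init__(self, dim, mlp_ratio=2):
        super().__init__()
        # Token mixer: depthwise separable convolution (DWConv + pointwise).
        self.mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, 1),
        )
        # Channel MLP (in FMViT this would be the reparameterized gMLP).
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        x = x + self.mixer(x)   # local token mixing
        x = x + self.mlp(x)     # channel mixing
        return x

print(CFB(96)(torch.randn(1, 96, 28, 28)).shape)  # torch.Size([1, 96, 28, 28])
```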
Multi-frequency Fusion Block (FMB)
  • While CFB effectively learns local representations, the need to capture global information remains. The Transformer block captures low-frequency signals, which provide global information such as shape and structure. Existing research suggests that the Transformer block may partially attenuate high-frequency information, including local texture details. To extract more fundamental and distinctive features, signals from different frequency bands must be carefully integrated, as they are all crucial to the human visual system.

  • High-frequency signals provide local information that is indispensable for preserving detail. Because different frequency bands carry distinct information, and high-frequency signals are susceptible to degradation in the Transformer block, fusing multiple high-frequency features with low-frequency features can enhance the model's information flow and expressive ability, inspired by information distillation and frequency mixing in image super-resolution. As shown in the figure, CFB modules first capture high-frequency features, producing three groups of high-frequency features at different frequencies. Patch-embedding fusion then concatenates them with the output of the lightweight multi-head attention module to combine the high- and low-frequency signals, and an MLP layer extracts more fundamental and salient features. This can be expressed as follows (a forward-pass sketch is given after the formula):

    • $z_1 = f_1(x^{l-1})$, $z_2 = f_2(z_1)$, $z_3 = f_3(z_2)$, $z_4 = f_4(z_3)$, $z = \mathrm{Concat}(x^{l-1}, z_1, z_2, z_3, z_4)$, $x^l = z + \mathrm{MLP}(z)$

    • Here $x^{l-1}$ denotes the input of the (l-1)-th block and $x^l$ the output of the l-th block. Concat refers to the concatenation operation. f1-f3 are high-pass filters producing different high-frequency signals, instantiated as CFB modules, and f4 is a low-pass filter producing the low-frequency signal, instantiated as RLMHSA.
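  • As a reading aid, the forward pass of this mixing rule can be sketched as follows, with plain convolutions standing in for the CFB branches (f1-f3) and the RLMHSA branch (f4); the channel sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FMB(nn.Module):
    def __init__(self, f1, f2, f3, f4, mlp):
        super().__init__()
        self.f1, self.f2, self.f3, self.f4 = f1, f2, f3, f4  # cascaded frequency branches
        self.mlp = mlp                                        # e.g. a gMLP block

    def forward(self, x):
        z1 = self.f1(x)                 # high-frequency (CFB)
        z2 = self.f2(z1)
        z3 = self.f3(z2)
        z4 = self.f4(z3)                # low-frequency (RLMHSA)
        z = torch.cat([x, z1, z2, z3, z4], dim=1)
        return z + self.mlp(z)          # residual channel mixing

dim, fm = 64, 32                        # stage channels / FM-Channels (illustrative)
f1 = nn.Conv2d(dim, fm, 3, padding=1)   # stand-ins for the real CFB / RLMHSA branches
f2 = nn.Conv2d(fm, fm, 3, padding=1)
f3 = nn.Conv2d(fm, fm, 3, padding=1)
f4 = nn.Conv2d(fm, fm, 1)
out_c = dim + 4 * fm
mlp = nn.Sequential(nn.Conv2d(out_c, out_c, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(out_c, out_c, 1))
print(FMB(f1, f2, f3, f4, mlp)(torch.randn(1, dim, 14, 14)).shape)  # (1, 192, 14, 14)
```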

  • Unlike Transformer blocks that rely on LN and GELU, FMB always uses BN and ReLU as its normalization and activation layers. These operators are speed-friendly and compute efficiently, especially on mobile devices, with minimal performance penalty. FMB can thus collect and integrate multi-frequency information within a lightweight framework, significantly improving performance compared to traditional pure Transformer blocks.

Lightweight Multi-head Self-Attention (RLMHSA)
  • The Transformer's computational cost is proportional to the square of the number of input tokens, making it expensive for large inputs. Despite the relatively small number of parameters, inference time on mobile devices is long, so a more lightweight self-attention module is needed. ViT-LSLA replaces the Key (K) and Value (V) of self-attention with the original input X, realizing a lightweight self-attention structure. As shown in the figure above, we propose a lightweight multi-head self-attention that shares parameters and then applies reparameterization. The original MHSA is defined as follows (a code sketch of the final lightweight form is given after the derivation):

    • $\mathrm{Atten}(Q, K, V) = \mathrm{softmax}(QK^T)V$

    • where $Q = XW_q$, $K = XW_k$, $V = XW_v$, the input $X \in \mathbb{R}^{M \times d}$, the Query/Key/Value projection matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$, M is the number of tokens, and d is the token dimension. Transforming the equation above gives:

    • $\mathrm{Atten}(Q, K, V) = \mathrm{Softmax}(XW_q(XW_k)^T)XW_v = \mathrm{Softmax}(XW_qW_k^TX^T)XW_v = \mathrm{Softmax}(XW_{qk}X^T)XW_v = \mathrm{Softmax}(X(XW_{qk}^T)^T)XW_v = \mathrm{Atten}(X, K', V)$

    • Merging the projection matrices of Q and K achieves parameter sharing and yields a new matrix $W_{qk}$, with $K' = XW_{qk}^T$. Furthermore, $W_{qk}$ and $W_v$ are allowed to share parameters through a single projection convolution, denoted $W = W_{qk}^T = W_v$; then:

    • $\mathrm{Atten}(X, K', V') = \mathrm{Softmax}(X(XW)^T)XW$

    • Here K' = XW and V' = XW; the structure of RLMHSA is shown in the figure above. Only one convolution is needed to project the input: K' and V' share the same projection, eliminating the need for two separate convolutions, and Q is the raw input. During training, convolutional multi-group reparameterization is used to add local and global information fusion to the RLMHSA module, which improves performance over MHSA. Experimental results show that, compared with MHSA on the ImageNet-1K classification task, RLMHSA reduces the parameter count by 8M, speeds up inference by 3%, and improves accuracy by 0.5%.
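  • A single-head sketch of this shared-projection attention, written directly from the final equation (one linear map W produces both K' and V', Q is the raw input, and the usual 1/sqrt(d) scaling is omitted to mirror the formula), might look like this:

```python
import torch
import torch.nn as nn

class RLMHSA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)   # shared projection W

    def forward(self, x):                          # x: (B, M, d) tokens
        kv = self.w(x)                             # K' = V' = XW
        attn = torch.softmax(x @ kv.transpose(1, 2), dim=-1)  # Softmax(X (XW)^T)
        return attn @ kv                           # ... XW

print(RLMHSA(64)(torch.randn(2, 49, 64)).shape)    # torch.Size([2, 49, 64])
```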

Stem block and Patch Embedding block
  • Following FastViT, the initial stage of the model uses a stem with two downsampling operations to reduce the computational load. A lightweight structure is achieved with a Conv3x3 + DWConv3x3 design, and the initial convolution is reparameterized with Conv3x1 and Conv1x3 branches for vertical and horizontal edge extraction. There are several cases for patch embedding: if the input and output channel numbers and the token resolution are the same, no operation is required; if the channel numbers differ but the token resolution is unchanged, a Conv1x1 performs channel conversion; if the token resolution changes, a lightweight avg-pool performs downsampling, followed by a Conv1x1 for fusion or conversion. During training, the convolutional multi-group reparameterization method is also applied to improve accuracy.
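  • The stem and patch-embedding choices described above could be sketched as follows; the channel widths and the placement of normalization/activation are assumptions, and the training-time 3x1/1x3 reparameterization branches are only noted in comments.

```python
import torch
import torch.nn as nn

def stem(in_c=3, out_c=64):
    # Two stride-2 steps reduce spatial resolution to 1/4 of the input.
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),                 # Conv3x3 (trained with
        nn.BatchNorm2d(out_c), nn.ReLU(inplace=True),                   # parallel 3x1/1x3 branches)
        nn.Conv2d(out_c, out_c, 3, stride=2, padding=1, groups=out_c),  # DWConv3x3
        nn.Conv2d(out_c, out_c, 1),
        nn.BatchNorm2d(out_c), nn.ReLU(inplace=True),
    )

def patch_embed(in_c, out_c, downsample):
    if downsample:                      # token resolution changes: avg-pool then 1x1 fuse
        return nn.Sequential(nn.AvgPool2d(2, 2), nn.Conv2d(in_c, out_c, 1))
    if in_c != out_c:                   # only the channel count changes
        return nn.Conv2d(in_c, out_c, 1)
    return nn.Identity()                # nothing to do

x = torch.randn(1, 3, 224, 224)
print(stem()(x).shape)                                          # (1, 64, 56, 56)
print(patch_embed(64, 128, downsample=True)(stem()(x)).shape)   # (1, 128, 28, 28)
```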
FMViT Architectures
  • As shown in the table below, this study proposes five model variants based on parameter count and model size: FMViT-T, FMViT-S, FMViT-M, FMViT-B, and FMViT-L. "Channels" here refers to the number of channels output by the sub-modules within a block; "FM-Channels" is the number of frequency-division channels in FMB; and "S" denotes the stride of the convolution or Avg-pool downsampling. The expansion ratio of each MLP layer is set to 2, and the head dimension in RLMHSA is fixed at 32. Consistent with NextViT, BatchNorm and ReLU are used as the normalization and activation functions.


    • Architecture details of FMViT variants.

Experimental Results

  • In these experiments, we use PyTorch 1.12.1 to measure PyTorch inference latency and TensorRT-8.5.3 to measure TensorRT (TRT) inference latency; both were measured on an A10 GPU with CUDA 11.3. CoreML inference latency was measured on an iPhone 13 running iOS 16.6. All batch sizes are set to 1.
ImageNet-1K Classification
  • We performed image classification experiments on the ImageNet-1K dataset, which includes approximately 1.28 million training images and 50,000 validation images across 1,000 categories. For a fair comparison, we adopt the training recipe of recent vision Transformers with slight adjustments. All FMViT variants were trained on 8 V100 GPUs with a total batch size of 2048 for 300 epochs. Input images are resized to a resolution of 224 x 224. We use the AdamW optimizer with a weight decay of 0.1. For all FMViT variants, the learning rate is decayed with a cosine schedule starting from 4e-5, with a linear warm-up for the first 20 epochs.
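  • For reference, a minimal PyTorch sketch of such a schedule (AdamW with the quoted weight decay, 20 warm-up epochs, cosine decay over 300 epochs) is shown below; the model is a placeholder and the values are simply those quoted above, not a verified reproduction of the authors' recipe.

```python
import math
import torch

def make_optimizer_and_scheduler(model, base_lr=4e-5, weight_decay=0.1,
                                 warmup_epochs=20, total_epochs=300):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                      # linear warm-up
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * t))     # cosine decay to zero
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Usage: step the scheduler once per epoch after the training loop over batches.
opt, sched = make_optimizer_and_scheduler(torch.nn.Linear(8, 8))
```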

  • As shown in the table below, our approach achieves the best balance of accuracy and latency compared with state-of-the-art CNNs, ViTs, and hybrid networks. Benchmarked against well-known CNNs such as ResNet101, FMViT's top-1 accuracy on the ImageNet dataset exceeds ResNet101 by 2.5% (83.3% vs. 80.8%) while being 45% faster on CoreML. At the same time, its performance is comparable to EfficientNet-B5, with inference speed increased by 43%. Among ViTs, FMViT-B outperforms Swin-T by 1.1% while being about 6x faster on CoreML and TensorRT. At similar inference speed on TensorRT, FMViT-S is 6.3% more accurate than DeiT-T (78.5% vs. 72.2%), and on CoreML the gap is 8.1% (80.3% vs. 72.2%). Compared with hybrid methods, FMViT-L performs on par with EfficientFormer-L7 but achieves 30% and 96% faster inference on TensorRT and CoreML, respectively. Compared with MobileOne-S1, FMViT-S achieves a 2.6% accuracy improvement (78.5% vs. 75.9%) at comparable CoreML inference speed, and with 11% faster CoreML inference it achieves a 2.9% accuracy improvement (78.5% vs. 75.6%). These results demonstrate that the proposed FMViT design is a feasible and promising paradigm.


    • Comparison of different state-of-the-art classification methods for ImageNet-1K.

Object Detection and Instance Segmentation
  • We evaluate FMViT on object detection and instance segmentation using the Mask R-CNN framework on COCO 2017. All models are first pre-trained on ImageNet-1K and then fine-tuned using the settings of previous work. We use the AdamW optimizer with a weight decay of 0.05 and train for 12 epochs, with a warm-up of 500 iterations; the learning rate is decayed by a factor of 10 at epochs 8 and 11. The input resolution is 1344x800. For a fair comparison, we only measure backbone latency, and the test environment is consistent with the classification experiments.

  • The table below gives evaluation results with the Mask R-CNN framework; to be fair, latency is measured on the backbone only. As shown, FMViT-B exceeds ResNet101 by 3.7 $AP^b$ while achieving 16% faster inference on TensorRT. FMViT-B's inference speed on TensorRT is comparable to PoolFormer-S12, with a 6.8 $AP^b$ improvement. Compared with EfficientFormer-L3, FMViT-B achieves 7% faster inference on TensorRT and a 2.7 $AP^b$ improvement. Compared with Next-ViT-S, FMViT-B's inference on CoreML is 3.9x faster with a 0.3 $AP^b$ improvement. FMViT-L exceeds EfficientFormer-L7 by 3.8 $AP^b$ with 32% faster inference on TensorRT, and it matches ResNeSt101's inference speed on TensorRT while improving $AP^b$ by 1.2. Mask AP shows similar advantages. In summary, FMViT performs well on object detection and instance segmentation while maintaining low inference latency.


    • Comparison of different backbones on object detection and instance segmentation tasks with Mask R-CNN. Superscripts b and m denote box detection and mask instance segmentation.

ADE20K Semantic Segmentation
  • We conducted semantic segmentation experiments on the ADE20K dataset, which includes approximately 20K training images and 2K validation images across 150 categories. For fair comparison, we follow the training protocol of previous vision Transformers with the Semantic FPN framework. The model is first pre-trained on ImageNet-1K at a resolution of 224x224 and then trained on ADE20K with an input size of 512x512. For the Semantic FPN framework, we adopt the AdamW optimizer with both learning rate and weight decay set to 0.0001, and train the entire network for 40K iterations with a total batch size of 32. Considering the complexity of implementing the framework's individual modules on TensorRT and CoreML, we restrict the latency evaluation to the backbone, keeping the same test settings as classification and using an input size of 512x512 to measure latency.

  • The table below shows that FMViT-B exceeds ResNet101 by 4.7 mIoU while maintaining comparable inference speed on TensorRT and CoreML, and exceeds Swin-T by 2.0 mIoU. Compared with PoolFormer-S24, it achieves 3.2 mIoU higher performance and 8% faster TensorRT inference. Compared with Next-ViT-S, it improves performance by 0.5 mIoU with 18% and 43% faster inference on TensorRT and CoreML, respectively. FMViT-L is 1.7 mIoU higher than Swin-S, 4.5 mIoU higher than ResNeSt101 while being 25% faster on CoreML, and matches PoolFormer-S36's inference speed with a 4.9 mIoU advantage. Its inference on TensorRT and CoreML is 2.5% and 29% faster than Next-ViT-B with comparable mIoU. These comprehensive tests show that FMViT has great potential in segmentation tasks.


    • Comparison of different backbones on the ADE20K semantic segmentation task.

Ablation Study

  • We conducted a series of experiments to verify the effectiveness of the FMB, gMLP, and RLMHSA modules in FMViT, as shown in the table below. Here, we incorporate the proposed modules into the FMViT-T0 model and follow the same training method as the original model: RLMHSA replaces the traditional MHSA, gMLP replaces the standard MLP, and when FMB mixing is not used, only the standard MHSA output features are fed directly to the MLP.


      • Compare different modules.
  • Experimental results show that replacing the standard MHSA with our more streamlined RLMHSA improves classification performance despite the reduction in parameters and FLOPs. After replacing the traditional MLP module with the convolutional multi-group reparameterized gMLP module, the parameter count and FLOPs at inference remain unchanged, but classification performance improves. Finally, introducing the FMB module increases the number of parameters and FLOPs while further improving classification performance.

Visualization
  • NextViT establishes that CNN convolution blocks tend to capture high-frequency signals, while ViT blocks tend to capture low-frequency signals. Our proposed FMB captures multiple high-frequency and low-frequency signals simultaneously, obtaining richer texture information and more accurate global information and thereby enhancing FMViT's modeling capability. To better understand FMViT, we visualize the Fourier spectra of the high- and low-frequency features mixed in FMViT-T0. There are five features of different frequencies, denoted f1-f5, as illustrated in the figure below. The output feature f1 of RLMHSA captures low-frequency signals, showing that RLMHSA is good at modeling global information. f2-f5, the outputs of the various CFBs, capture different high-frequency signals; each corresponds to a different frequency band, so they excel at handling various textures. Fusing the f1-f5 frequency features enhances the expressive ability of the model.


    • Fourier spectra of output features of different modules in FMViT.

Conclusion

  • In this study, we introduce a hybrid ViT architecture, FMViT, optimized for efficient deployment on mobile devices and server GPUs. The architecture enhances the model's predictive power by merging high- and low-frequency features at different frequencies, strengthening its ability to capture both local and global information. Experimental results show that FMViT achieves state-of-the-art latency/accuracy trade-offs across a variety of vision tasks, including image classification, object detection, and semantic segmentation. However, the presented models were stacked together manually without further search. Future work could employ network architecture search (NAS) or other stacking strategies to explore how different combinations of modules affect performance.


Source: blog.csdn.net/weixin_43424450/article/details/134452465