【ViT】【MobileNet】【Apple】MobileViT Note

Topic

A rare paper from Apple: by combining MobileNet and ViT, it achieves good results.

Abstract

Light-weight convolutional neural networks (CNNs) are the de facto choice for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) are employed. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight, low-latency network for mobile vision tasks? To this end, we introduce MobileViT, a light-weight, general-purpose vision transformer for mobile devices. MobileViT presents a different perspective on the global processing of information with transformers. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves a top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based), respectively, for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 with a similar number of parameters.

Introduction

Self-attention-based models, especially vision transformers (ViTs), are an alternative to convolutional neural networks (CNNs) for learning visual representations. Briefly, ViT divides an image into a sequence of non-overlapping patches and then uses multi-head self-attention in transformers to learn representations across these patches. The general trend is to increase the number of parameters in ViT networks to improve performance. However, these performance gains come at the cost of model size (network parameters) and latency. Many real-world applications (e.g., augmented reality and autonomous wheelchairs) require visual recognition tasks (e.g., object detection and semantic segmentation) to run on resource-constrained mobile devices in a timely manner. To be effective, ViT models for such tasks should be light-weight and fast. Even when the model size of a ViT is reduced to match the resource constraints of mobile devices, its performance is significantly worse than that of light-weight CNNs. For instance, for a parameter budget of about 5-6 million, DeiT (Touvron et al., 2021a) is 3% less accurate than MobileNetv3 (Howard et al., 2019). Therefore, designing light-weight ViT models is imperative.
Light-weight CNNs power many mobile vision tasks. However, ViT-based networks are still far from being used on such devices. Unlike light-weight CNNs, which are easy to optimize and integrate with task-specific networks, ViTs are heavy-weight (e.g., ViT-B/16 vs. MobileNetv3: 86 vs. 7.5 million parameters), harder to optimize (Xiao et al., 2021), need extensive data augmentation and L2 regularization to prevent over-fitting (Touvron et al., 2021a; Wang et al., 2021), and require expensive decoders for down-stream tasks, especially dense prediction tasks. For instance, a ViT-based segmentation network (Ranftl et al., 2021) learns about 345 million parameters and achieves performance similar to the CNN-based network DeepLabv3 (Chen et al., 2017), which has 59 million parameters. The need for more parameters in ViT-based models is likely because they lack the image-specific inductive bias that is inherent to CNNs (Xiao et al., 2021). Hybrid approaches that combine convolutions and transformers yield robust and high-performing ViT models.

Related Work

[Light-weight CNNs]
[Vision transformers]
Combining convolutions and transformers results in more robust and higher-performing ViTs compared to plain ViTs. However, an open question remains: how can the strengths of convolutions and transformers be combined to build light-weight networks for mobile vision tasks? This paper focuses on designing light-weight ViT models that outperform state-of-the-art models. To this end, we introduce MobileViT, which combines the strengths of CNNs and ViTs to build a light-weight, general-purpose, and mobile-friendly network. MobileViT brings some novel observations:

  1. Better performance
  2. Better generalization capability
  3. Better robustness

Structure

  • MobileViT block. As shown in Figure 1b, the MobileViT block aims to model both local and global information in an input tensor with fewer parameters. Formally, for a given input tensor X ∈ R^(H×W×C), MobileViT applies an n×n standard convolutional layer followed by a point-wise (1×1) convolutional layer to produce X_L ∈ R^(H×W×d). The n×n convolutional layer encodes local spatial information, while the point-wise convolution projects the tensor into a higher-dimensional space (d dimensions, where d > C) by learning linear combinations of the input channels.
  • With MobileViT, we wish to model long-range, non-local dependencies with an effective H×W receptive field. One widely studied method for long-range dependency modeling is dilated convolution. However, this approach requires careful selection of the dilation rate; otherwise, weights are applied to padded zeros rather than valid spatial regions (Yu & Koltun, 2016; Chen et al., 2017; Mehta et al., 2018). Another promising solution is self-attention (Wang et al., 2018; Ramachandran et al., 2019; Bello et al., 2019; Dosovitskiy et al., 2021). Among self-attention methods, vision transformers (ViTs) with multi-head self-attention show good results on visual recognition tasks. However, ViTs are heavy-weight and exhibit sub-standard optimizability, because they lack spatial inductive biases (Xiao et al., 2021; Graham et al., 2021).
  • A standard convolution can be viewed as a stack of three sequential operations: (1) unfolding, (2) matrix multiplication (learning local representations), and (3) folding. The MobileViT block is similar to convolution in that it also uses the same building blocks, but it replaces the local processing (matrix multiplication) in convolution with deeper global processing (a stack of transformer layers). As a result, MobileViT has convolution-like properties (e.g., spatial bias), and the MobileViT block can be viewed as "transformers as convolutions". One advantage of this intentionally simple design is that low-level implementations of convolutions and transformers are available out of the box, allowing MobileViT to be used on different devices without extra effort. A minimal sketch of this block is given after this list.
  • Figure 5 compares the standard and multi-scale samplers. Here, we use PyTorch's DistributedDataParallel as the standard sampler. Overall, the multi-scale sampler (i) reduces training time because it requires fewer optimizer updates with variably-sized batches (Fig. 5b), (ii) improves performance by about 0.5% (Fig. 10; §B), and (iii) forces the network to learn better multi-scale representations (§B), i.e., the same network yields better performance when evaluated at different spatial resolutions than a network trained with the standard sampler. In §B, we also show that the multi-scale sampler is generic and improves the performance of CNNs as well (e.g., MobileNetv2). A simplified sampler sketch also follows this list.
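
To make the block above concrete, here is a minimal PyTorch-style sketch of a MobileViT-like block: an n×n convolution plus a point-wise projection for the local representation, an unfold into non-overlapping patches, a small transformer stack applied across patches, a fold back to a feature map, and a fusion convolution with the input. The layer sizes, the use of nn.TransformerEncoder, and the fusion details are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class MobileViTBlock(nn.Module):
    """Sketch of a MobileViT-style block: local conv -> unfold -> transformer
    across patches -> fold -> fuse with the input. Sizes and the use of
    nn.TransformerEncoder are illustrative assumptions, not the paper's code."""

    def __init__(self, in_channels, d_model, patch_size=2, num_layers=2, num_heads=4):
        super().__init__()
        self.ph = self.pw = patch_size
        # Local representation: n x n conv encodes local spatial information,
        # then a point-wise (1x1) conv projects to d > C dimensions.
        self.local_rep = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.Conv2d(in_channels, d_model, kernel_size=1),
        )
        # Global representation: a small stack of transformer encoder layers.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads,
            dim_feedforward=2 * d_model, batch_first=True)
        self.global_rep = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Project back to C channels, then fuse with the input via a 3x3 conv.
        self.proj = nn.Conv2d(d_model, in_channels, kernel_size=1)
        self.fusion = nn.Conv2d(2 * in_channels, in_channels, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape            # assumes h and w are divisible by patch_size
        y = self.local_rep(x)           # B x d x H x W
        d = y.shape[1]
        # Unfold into non-overlapping ph x pw patches: tokens at the same
        # position inside each patch form one sequence of length N = HW / (ph*pw).
        y = y.reshape(b, d, h // self.ph, self.ph, w // self.pw, self.pw)
        y = y.permute(0, 3, 5, 2, 4, 1)                 # B, ph, pw, H/ph, W/pw, d
        y = y.reshape(b * self.ph * self.pw, -1, d)     # (B * P) x N x d
        y = self.global_rep(y)                          # inter-patch self-attention
        # Fold the tokens back into a B x d x H x W feature map.
        y = y.reshape(b, self.ph, self.pw, h // self.ph, w // self.pw, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(b, d, h, w)
        y = self.proj(y)
        # Concatenate with the input and fuse, giving convolution-like locality
        # plus transformer-based global context.
        return self.fusion(torch.cat([x, y], dim=1))

For example, MobileViTBlock(in_channels=64, d_model=96)(torch.randn(2, 64, 32, 32)) returns a tensor of the same shape (2, 64, 32, 32), i.e., the block preserves the spatial size and channel count of its input.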
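
The multi-scale sampler can likewise be sketched as a batch sampler that picks a spatial resolution per batch and scales the batch size inversely with it, so the pixel budget per optimizer update stays roughly constant. The resolutions, base batch size, and the (index, height, width) protocol between sampler and dataset below are assumptions for illustration; the paper's sampler is DDP-aware and differs in detail.

import random
import torch
from torch.utils.data import Sampler


class MultiScaleBatchSampler(Sampler):
    """Simplified multi-scale batch sampler: each batch is tagged with a
    (height, width) drawn from a fixed set, and the batch size shrinks as the
    resolution grows so the per-update pixel budget stays roughly constant.
    Single-process illustration only (the paper's sampler is DDP-aware)."""

    def __init__(self, dataset_len, base_batch_size=128, base_res=(256, 256),
                 resolutions=((160, 160), (192, 192), (256, 256), (320, 320))):
        self.dataset_len = dataset_len
        self.base_batch_size = base_batch_size
        self.base_pixels = base_res[0] * base_res[1]
        self.resolutions = resolutions

    def __iter__(self):
        indices = torch.randperm(self.dataset_len).tolist()
        i = 0
        while i < self.dataset_len:
            h, w = random.choice(self.resolutions)
            # Keep batch_size * H * W roughly constant across resolutions.
            bs = max(1, (self.base_batch_size * self.base_pixels) // (h * w))
            batch = indices[i:i + bs]
            i += bs
            # A cooperating Dataset.__getitem__ must accept (index, h, w) and
            # resize the image to (h, w); pass this object to the DataLoader
            # as its batch_sampler.
            yield [(idx, h, w) for idx in batch]

    def __len__(self):
        # The number of batches varies per epoch; report a rough estimate.
        return self.dataset_len // self.base_batch_size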

Experiments


Conclusion and Discussion

We observe that MobileViT and other ViT-based networks (e.g., DeiT and PiT) are slower than MobileNetv2 on mobile devices (Table 3). This observation contradicts previous work showing that ViTs are more scalable than CNNs (Dosovitskiy et al., 2021). The difference is mainly due to two reasons. First, dedicated CUDA kernels exist for transformers on GPUs and are used out of the box in ViTs to improve their scalability and efficiency on GPUs (e.g., Shoeybi et al., 2019; Lepikhin et al., 2021). Second, CNNs benefit from several device-level optimizations, including batch normalization fused with convolutional layers (Jacob et al., 2018). These optimizations improve latency and memory access. However, such dedicated and optimized operations for transformers are currently unavailable on mobile devices. Consequently, the resulting inference graphs of MobileViT and ViT-based networks on mobile devices are sub-optimal. We believe that, similar to CNNs, the inference speed of MobileViT and ViTs will improve further in the future with dedicated device-level operations.
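
As an illustration of one of the device-level optimizations mentioned above, the sketch below folds a BatchNorm2d into the preceding Conv2d so that inference runs a single convolution. The helper name and structure are my own, for illustration only; deployment toolchains ship their own fused kernels.

import torch
import torch.nn as nn


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution so that
    inference runs a single conv. Hypothetical helper for illustration."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    # BN(conv(x)) = scale * (W x + b - mean) + beta, with scale = gamma / sqrt(var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

In eval mode, fuse_conv_bn(conv, bn)(x) matches bn(conv(x)) up to floating-point error while running one operation instead of two, which is the kind of latency and memory-access saving CNNs already enjoy on mobile devices.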

Self-Evaluation

To put it simply, combining the two lines of work, CNNs and transformers, has produced good results. The appendix of the original paper is quite detailed; if you are interested, read the MobileViT paper carefully.


Origin: blog.csdn.net/weixin_53415043/article/details/128976228