1.3ms latency | Tsinghua ICCV 2023: RepViT, the latest open-source mobile neural network architecture, is blazingly fast!

TL;DR: Today I share a paper on RepViT. This work focuses on improving the performance of lightweight CNNs on resource-constrained mobile devices by re-examining their design and incorporating the effective architectural choices of lightweight ViTs.

[Figure: accuracy-latency comparison between RepViT and mainstream mobile ViT architectures]

As the figure shows, RepViT indeed outperforms other mainstream mobile ViT architectures. Next, let's look at this work's contributions:

  1. The paper notes that lightweight ViTs usually outperform lightweight CNNs on vision tasks, mainly thanks to their multi-head self-attention (MHSA) modules, which let the model learn global representations. However, the architectural differences between lightweight ViTs and lightweight CNNs have not been fully studied.

  2. In this study, the authors gradually improve a standard lightweight CNN (specifically the mobile-friendly MobileNetV3) by incorporating the effective architectural choices of lightweight ViTs. This gives birth to a new family of purely lightweight CNNs, namely RepViT. Notably, although RepViT has a MetaFormer-style structure, it is composed entirely of convolutions.

  3. Experimental results show that RepViT surpasses existing state-of-the-art lightweight ViTs in both performance and efficiency on various vision tasks, including ImageNet classification, object detection and instance segmentation on COCO-2017, and semantic segmentation on ADE20K. In particular, on ImageNet, RepViT reaches nearly 1ms latency on an iPhone 12 with over 80% Top-1 accuracy, a first for lightweight models.

Now, the question everyone should care about is: how do you design such a low-latency yet high-accuracy model?

Method


Recall that ConvNeXt started from the ResNet50 architecture and, through rigorous theoretical and experimental analysis, arrived at an excellent pure convolutional architecture comparable to Swin Transformer. Similarly, RepViT gradually integrates the architectural designs of lightweight ViTs into a standard lightweight CNN, MobileNetV3-L, giving it a targeted makeover. During this process, the authors consider design elements at different levels of granularity and optimize through a series of steps.


Alignment of Training Recipes

First, the paper introduces a metric for measuring latency on mobile devices and aligns the training recipe with existing lightweight ViTs. This step mainly ensures consistent model training and involves two parts: latency measurement and training-strategy adjustment.

Latency Metrics

To more accurately measure model performance on real mobile devices, the authors directly measure each model's actual on-device latency as the benchmark metric. This differs from previous studies, which mainly optimize inference speed through proxy indicators such as FLOPs or model size; such proxies do not always correlate well with actual latency in mobile applications.
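As a rough illustration of this latency-first mindset, here is a minimal, generic timing loop. The paper benchmarks on a real iPhone 12 through its mobile runtime; this PyTorch sketch only shows the measurement procedure, and the function name and defaults here are our own:

```python
import time
import torch

def measure_latency_ms(model, input_size=(1, 3, 224, 224), warmup=50, runs=200):
    """Average wall-clock latency per forward pass, in milliseconds."""
    model.eval()
    x = torch.randn(*input_size)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs stabilize caches/allocations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000
```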

Alignment of Training Strategies

Here, the training strategy of MobileNetV3-L is adjusted to align with other lightweight ViT models. This means using the AdamW optimizer [practically a must-have for ViT models] with 5 epochs of warm-up, and a cosine-annealing learning-rate schedule over 300 epochs of training. Although this adjustment causes a slight drop in model accuracy, it guarantees a fair comparison.
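A minimal sketch of this recipe in PyTorch (the base learning rate and weight decay are hypothetical; the post only fixes AdamW, the 5-epoch warm-up, and the 300-epoch cosine schedule):

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(8, 8)  # stand-in for MobileNetV3-L
epochs, warmup_epochs = 300, 5

# lr and weight_decay below are assumed values, not taken from the paper.
optimizer = AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),  # warm-up
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),         # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... one training epoch over the dataset ...
    scheduler.step()  # advance the warm-up/cosine schedule once per epoch
```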

Optimization of Block Design

Next, the authors explore the optimal block design based on consistent training settings. Block design is an important part of CNN architecture, and optimizing block design can help improve the performance of the network.

[Figure: block design]

Separate Token Mixer and Channel Mixer

This step improves on the MobileNetV3-L block structure by separating the token mixer and the channel mixer. The original MobileNetV3 block consists of a 1x1 expansion convolution, followed by a depthwise convolution and a 1x1 projection layer, with input and output connected by a residual connection. On this basis, RepViT moves the depthwise convolution up so that the channel mixer and token mixer are separated. To improve performance, structural re-parameterization is also introduced, giving the depthwise filters a multi-branch topology at training time. In the end, the authors successfully separate the token mixer and channel mixer in the MobileNetV3 block and name the result the RepViT block.
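To make the separation concrete, below is a minimal PyTorch sketch of such a block. This is not the authors' implementation: the exact placement of normalization, activation, SE, and residuals is simplified, and the inference-time fusion of the re-parameterized branches is omitted. (The channel mixer here already uses the expansion ratio of 2 discussed in the next step.)

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Minimal SE layer: squeeze via global average pooling, excite via two 1x1 convs."""
    def __init__(self, dim, rd_ratio=0.25):
        super().__init__()
        rd = max(1, int(dim * rd_ratio))
        self.fc1 = nn.Conv2d(dim, rd, 1)
        self.fc2 = nn.Conv2d(rd, dim, 1)

    def forward(self, x):
        s = x.mean((2, 3), keepdim=True)                               # squeeze
        return x * torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))   # excite

class RepDWConv(nn.Module):
    """Train-time multi-branch depthwise conv (3x3 DW branch + BN identity branch).
    At inference the branches can be fused into a single 3x3 DW conv via
    structural re-parameterization (fusion code omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )
        self.identity = nn.BatchNorm2d(dim)

    def forward(self, x):
        return self.dw(x) + self.identity(x)

class RepViTBlock(nn.Module):
    """Depthwise token mixer followed by a 1x1-conv channel mixer,
    each wrapped in its own residual connection."""
    def __init__(self, dim, expansion=2, use_se=False):
        super().__init__()
        self.token_mixer = RepDWConv(dim)
        self.se = SqueezeExcite(dim) if use_se else nn.Identity()
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1, bias=False),
            nn.BatchNorm2d(dim * expansion),
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.se(self.token_mixer(x))   # spatial (token) mixing
        return x + self.channel_mixer(x)       # per-location channel mixing
```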

Decrease the Expansion Ratio and Increase the Width

In the channel mixer, the original expansion ratio is 4, meaning the hidden dimension of the MLP block is four times the input dimension. This consumes a lot of computing resources and weighs heavily on inference time. To alleviate this, the ratio is reduced to 2, which cuts parameter redundancy and latency, bringing MobileNetV3-L's latency down to 0.65ms. Subsequently, increasing the network width (the number of channels in each stage) lifts Top-1 accuracy to 73.5%, while latency only rises to 0.89ms!
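A back-of-the-envelope check of why this helps (ignoring biases and BN):

```python
# Parameters in the channel mixer's two 1x1 convs scale linearly with the
# expansion ratio r: (dim * dim*r) + (dim*r * dim) = 2 * r * dim^2.
dim = 64
for r in (4, 2):
    params = 2 * r * dim * dim
    print(f"expansion ratio {r}: ~{params:,} weights")
# -> ratio 2 halves the channel-mixer cost relative to ratio 4
```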

Optimization of Macro-architectural Elements

In this step, the paper further optimizes MobileNetV3-L's performance on mobile devices, starting from the macro-architectural elements: the stem, the downsampling layers, the classifier, and the overall stage ratio. Optimizing these elements can significantly improve the model's performance.

Using Convolutional Extractors in the Shallow Layers

[Figure: stem structures]

ViTs typically use a "patchify" operation as the stem, splitting the input image into non-overlapping patches. However, this approach suffers from training-optimization issues and sensitivity to training recipes. The authors therefore adopt early convolutions instead, an approach already taken by many lightweight ViTs, whereas MobileNetV3-L uses a more complex stem for its 4x downsampling. Although the initial number of filters increases to 24 this way, total latency drops to 0.86ms and top-1 accuracy rises to 73.9%.
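A sketch of what such an early-convolution stem can look like. The exact layer layout is an assumption on our part; the post only states the 4x downsampling and the 24 initial filters:

```python
import torch.nn as nn

def early_conv_stem(out_channels=24):
    """Two stride-2 3x3 convolutions give the 4x downsampling (assumed layout)."""
    return nn.Sequential(
        nn.Conv2d(3, out_channels // 2, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_channels // 2),
        nn.GELU(),
        nn.Conv2d(out_channels // 2, out_channels, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.GELU(),
    )
```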

Deeper Downsampling Layers

[Figure: downsampling layer structures]

In ViTs, spatial downsampling is usually achieved through a separate patch merging layer. Accordingly, we can adopt a separate and deeper downsampling layer here to increase network depth and reduce the information loss caused by lowering the resolution. Specifically, the authors first use a 1x1 convolution to adjust the channel dimension, then form a feed-forward network by connecting the input and output of two 1x1 convolutions with a residual. They also place a RepViT block in front to further deepen the downsampling layer, which improves top-1 accuracy to 75.4% at a latency of 0.96ms.
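A sketch of this deeper downsampling layer, following the description above. The stride-2 depthwise convolution used for the actual spatial reduction is an assumption on our part, and the RepViT block placed in front is omitted:

```python
import torch.nn as nn

class DeepDownsample(nn.Module):
    def __init__(self, in_dim, out_dim, expansion=2):
        super().__init__()
        self.spatial = nn.Sequential(      # stride-2 depthwise conv (assumed)
            nn.Conv2d(in_dim, in_dim, 3, stride=2, padding=1, groups=in_dim, bias=False),
            nn.BatchNorm2d(in_dim),
        )
        self.channel = nn.Sequential(      # 1x1 conv adjusts the channel dimension
            nn.Conv2d(in_dim, out_dim, 1, bias=False),
            nn.BatchNorm2d(out_dim),
        )
        self.ffn = nn.Sequential(          # feed-forward net of two 1x1 convs
            nn.Conv2d(out_dim, out_dim * expansion, 1),
            nn.GELU(),
            nn.Conv2d(out_dim * expansion, out_dim, 1),
        )

    def forward(self, x):
        x = self.channel(self.spatial(x))
        return x + self.ffn(x)             # residual connects the FFN's input/output
```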

Simpler Classifier

[Figure: classifier structures]

In lightweight ViTs, the classifier usually consists of a global average pooling layer followed by a linear layer, whereas MobileNetV3-L uses a more complex classifier. Since the final stage now has more channels, the authors replace it with the simple version: a global average pooling layer plus a linear layer. This reduces latency to 0.77ms, while top-1 accuracy drops to 74.8%.
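The simplified head is tiny; a sketch (the final-stage width below is a placeholder value, not from the paper):

```python
import torch.nn as nn

final_dim, num_classes = 384, 1000  # hypothetical final-stage width; ImageNet classes

classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),            # global average pooling
    nn.Flatten(),
    nn.Linear(final_dim, num_classes),  # single linear layer
)
```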

Overall Stage Ratio

The stage ratio represents the ratio of the number of blocks across stages, and thus the distribution of computation among them. The paper chooses a more optimal stage ratio of 1:1:7:1 and then increases network depth to 2:2:14:2, achieving a deeper layout. This step improves top-1 accuracy to 76.9% at a latency of 1.02ms.
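In code, the deeper layout just changes the per-stage block counts; a sketch reusing the RepViTBlock from the earlier snippet (the channel widths are invented for illustration):

```python
# Depths follow the 2:2:14:2 layout; widths are hypothetical placeholders.
depths = (2, 2, 14, 2)
widths = (48, 96, 192, 384)
stages = [[RepViTBlock(w) for _ in range(d)] for d, w in zip(depths, widths)]
```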

Micro-design Adjustments

Next, RepViT tunes the lightweight CNN through layer-by-layer micro-design, which includes selecting appropriate convolution kernel sizes and optimizing the placement of the squeeze-and-excitation (SE) layers. Both can significantly improve model performance.

Selection of Convolution Kernel Size

It is well known that the performance and latency of CNNs are usually affected by convolution kernel size. For example, to model long-range context dependencies the way MHSA does, ConvNeXt uses large convolution kernels and achieves significant performance gains. However, large kernels are unfriendly to mobile devices because of their computational complexity and memory-access cost. MobileNetV3-L mainly uses 3x3 convolutions, with 5x5 convolutions in some blocks. The authors replace the latter with 3x3 convolutions, bringing latency down to 1.00ms while maintaining 76.9% top-1 accuracy.

Placement of the SE Layer

One advantage of self-attention modules over convolutions is their ability to adjust weights according to the input, known as a data-driven property. As a channel-attention module, the SE layer compensates for convolutions' lack of this data-driven behavior, bringing better performance. MobileNetV3-L incorporates SE layers in some blocks, mainly in the last two stages. However, lower-resolution stages gain less accuracy from the global average pooling inside SE than higher-resolution stages do. The authors therefore devise a strategy of using SE layers in alternating blocks across all stages, maximizing the accuracy gain with minimal latency cost; this step boosts top-1 accuracy to 77.4% while reducing latency to 0.87ms.

[In fact, Baidu ran experiments comparing this long ago and reached the same conclusion: SE layers are better placed toward the deeper layers.]
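A sketch of this cross-block placement, reusing the RepViTBlock (and its use_se flag) from the earlier snippet:

```python
import torch.nn as nn

def build_stage(dim, depth):
    # Enable SE on every other block of every stage.
    return nn.Sequential(*[RepViTBlock(dim, use_se=(i % 2 == 0)) for i in range(depth)])
```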

Network Architecture

Finally, integrating all of the improvements above yields the overall architecture of the RepViT model, which comes in several variants, e.g., RepViT-M1/M2/M3. As before, the variants are distinguished mainly by the number of channels and blocks per stage.

[Figure: overall RepViT architecture]

Experiments

Image Classification

[Figure: ImageNet classification results]

Detection and Segmentation

[Figure: object detection, instance segmentation, and semantic segmentation results]

Summary

This paper revisits the efficient design of lightweight CNNs by introducing the architectural choices of lightweight ViTs, leading to RepViT, a new family of lightweight CNNs designed for resource-constrained mobile devices. On various vision tasks, RepViT surpasses existing state-of-the-art lightweight ViTs and CNNs in both performance and latency, highlighting the potential of purely lightweight CNNs on mobile devices. [Source code attached at the end of the article]


Origin blog.csdn.net/jacke121/article/details/131877411