YOLOv8 improvement series: YOLOv8 + RepViT re-examines mobile CNNs from a ViT perspective and effectively improves detection performance. 1.3 ms latency: RepViT, Tsinghua's recently open-sourced mobile network architecture from ICCV 2023.

YOLOv8 latest improvement series

[Click here for the RepViT paper](https://arxiv.org/abs/2307.09283)

For the detailed improvement tutorial and source code, see Bilibili: the "AI academic beeping beast" channel. The source-code link is in the album, and there is also a link in the pinned posts. Thank you for your support, and may your research stay far ahead!

As of publication, the YOLOv8 improvement source-code package on Bilibili has been updated with 29+ loss-function improvements. Combining 2-4 of them yields more than 50,000 improvement variants regardless of insertion position, and millions once different positions are taken into account. For more AI academic content, follow the Bilibili channel "AI academic beeping beast".


1. Overview of RepViT

1.1 Abstract of the RepViT paper

In recent years, lightweight vision Transformers (ViTs) have demonstrated higher performance and lower latency than lightweight convolutional neural networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural differences between lightweight ViTs and lightweight CNNs have not been fully studied. In this study, we revisit the efficient design of lightweight CNNs and highlight their potential on mobile devices. We gradually improve the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This gives rise to a new family of purely lightweight CNNs, RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency across various vision tasks. On ImageNet, RepViT achieves over 80% top-1 accuracy with nearly 1 ms latency on an iPhone 12, which, to our knowledge, is a first for a lightweight model.

1.2 Introduction to RepViT

Research on lightweight models has always been a focus in computer vision, with the goal of achieving excellent performance at a reduced computational cost. Lightweight models are particularly relevant for resource-constrained mobile devices, enabling edge deployment of vision models. Over the past decade, researchers have mainly focused on the design of lightweight convolutional neural networks (CNNs) and proposed many efficient design principles, including separable convolutions, inverted bottleneck structures, channel shuffling, and structural reparameterization, resulting in representative models such as MobileNets, ShuffleNets, and RepVGG.

On the other hand, vision Transformers (ViTs) have become another efficient solution for learning visual representations. Compared with CNNs, ViTs show superior performance on various computer vision tasks. However, ViT models are generally large and have high latency, making them unsuitable for resource-constrained mobile devices. Researchers have therefore begun to explore lightweight ViT designs. Many efficient ViT design principles have been proposed, greatly improving the computational efficiency of ViTs on mobile devices and producing representative models such as EfficientFormer and MobileViT. These lightweight ViTs demonstrate stronger performance and lower latency than CNNs on mobile devices.

The reason why lightweight ViTs outperform lightweight CNNs is often attributed to the multi-head attention module, which enables the model to learn global representations. However, there are noteworthy differences between lightweight ViTs and lightweight CNNs in terms of block structure, macro- and micro-architectural design, but these differences have not been fully studied. This naturally begs the question: Can architectural choices for lightweight ViTs improve the performance of lightweight CNNs? In this work, we revisit the design of lightweight CNNs in conjunction with architectural choices for lightweight ViTs. Our aim is to bridge the gap between lightweight CNNs and lightweight ViTs and highlight the application potential of the former compared to the latter on mobile devices.

2. Experiment

[Figure: accuracy-latency comparison of RepViT with mainstream mobile ViT architectures]
It can be seen that RepViT is indeed superior compared to other mainstream mobile ViT architectures. Next, let’s take a look at what contributions this work has made:

2.1 Highlights of this article (contributions)

1. As noted in the paper, lightweight ViTs generally perform better than lightweight CNNs on vision tasks, mainly thanks to their multi-head self-attention (MHSA) modules, which allow the model to learn global representations. However, the architectural differences between lightweight ViTs and lightweight CNNs have not been fully studied.
2. In this study, the authors gradually improve the mobile-friendliness of a standard lightweight CNN (specifically MobileNetV3) by integrating the effective architectural choices of lightweight ViTs, giving rise to a new family of purely lightweight CNNs, RepViT. Notably, although RepViT has a MetaFormer-style structure, it is composed entirely of convolutions.
3. Experimental results show that RepViT surpasses existing state-of-the-art lightweight ViTs in both performance and efficiency across various vision tasks, including ImageNet classification, object detection and instance segmentation on COCO-2017, and semantic segmentation on ADE20K. In particular, on ImageNet, RepViT reaches nearly 1 ms latency and over 80% top-1 accuracy on an iPhone 12, a first for a lightweight model.

2.2 Method

In ConvNeXt, the authors started from the ResNet-50 architecture and, through careful experimental analysis, arrived at an excellent purely convolutional architecture comparable to Swin Transformer. Similarly, RepViT performs targeted modifications by gradually integrating the architectural designs of lightweight ViTs into a standard lightweight CNN, MobileNetV3-L. In this process, the authors consider design elements at different levels of granularity and reach the optimization goal through a series of steps.

Detailed optimization steps are as follows:

2.2.1 Alignment of training recipes

The paper introduces a metric to measure latency on mobile devices and aligns the training strategy with that of existing lightweight ViTs. This step mainly ensures consistency of model training and involves two aspects: the latency measurement and the training strategy.

2.2.2 Latency metrics

To measure model performance on real mobile devices more accurately, the authors chose to directly measure the model's actual latency on device as the baseline metric. This differs from previous studies, which mainly optimized for proxies such as FLOPs or model size; these proxies do not always reflect the actual latency in mobile applications.

2.2.3 Alignment of training strategies

Here, the training strategy of MobileNetV3-L is aligned with that of other lightweight ViT models: 5 epochs of warm-up, the AdamW optimizer (widely used for ViT models), and 300 epochs of training with a cosine-annealing learning-rate schedule. Although this adjustment leads to a slight drop in model accuracy, it guarantees a fair comparison.
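A minimal PyTorch sketch of this aligned training recipe, assuming an ImageNet-style training loop (`train_one_epoch` is a placeholder); the learning rate and weight decay are illustrative, not the paper's values:

```python
import torch
from torchvision.models import mobilenet_v3_large
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Baseline model; lr and weight decay below are illustrative placeholders.
model = mobilenet_v3_large(num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 300
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),  # 5-epoch warm-up
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),   # cosine annealing
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    train_one_epoch(model, optimizer)  # assumed training loop, not shown here
    scheduler.step()
```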

2.2.4 Optimization of block design

Based on consistent training settings, the authors explored optimal block designs. Block design is an important part of CNN architecture, and optimizing block design helps improve the performance of the network.

2.2.5 Separate Token mixer and channel mixer

This step improves the block structure of MobileNetV3-L by separating the token mixer and the channel mixer. The original MobileNetV3 block consists of a 1x1 expansion convolution, followed by a depthwise convolution and a 1x1 projection layer, with a residual connection linking input and output. RepViT moves the depthwise convolution earlier so that the channel mixer and token mixer are separated. To improve performance, structural reparameterization is also introduced, giving the depthwise filters a multi-branch topology at training time. Finally, the authors successfully separate the token mixer and channel mixer in the MobileNetV3 block and name the result the RepViT block.
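The following is a simplified PyTorch sketch of such a block, reconstructed from the description above rather than taken from the authors' code; the activation choice and exact branch details are assumptions, and the multi-branch depthwise convolution is shown only in its training-time form (at inference it would be fused into a single 3x3 depthwise convolution):

```python
import torch.nn as nn

class RepDWConv(nn.Module):
    """Training-time multi-branch depthwise convolution (3x3 + 1x1 + identity),
    meant to be reparameterized into a single 3x3 depthwise conv at inference."""
    def __init__(self, dim):
        super().__init__()
        self.dw3x3 = nn.Sequential(nn.Conv2d(dim, dim, 3, 1, 1, groups=dim, bias=False),
                                   nn.BatchNorm2d(dim))
        self.dw1x1 = nn.Sequential(nn.Conv2d(dim, dim, 1, 1, 0, groups=dim, bias=False),
                                   nn.BatchNorm2d(dim))
        self.identity = nn.BatchNorm2d(dim)

    def forward(self, x):
        return self.dw3x3(x) + self.dw1x1(x) + self.identity(x)

class RepViTBlockSketch(nn.Module):
    """Token mixer (reparameterizable depthwise conv, optional SE) followed by a
    channel mixer (1x1 expand -> 1x1 project) with a residual connection."""
    def __init__(self, dim, expansion=2, se=None):
        super().__init__()
        self.token_mixer = RepDWConv(dim)
        self.se = se if se is not None else nn.Identity()  # SE placement: see section 2.2.13
        hidden = dim * expansion
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = self.se(self.token_mixer(x))   # token mixing over spatial positions
        return x + self.channel_mixer(x)   # channel mixing with a residual
```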

2.2.6 Reduce expansion ratio and increase width

In the channel mixer, the original expansion ratio is 4, meaning the hidden dimension of the MLP block is four times the input dimension, which consumes considerable computing resources and significantly impacts inference time. To alleviate this, the expansion ratio is reduced to 2, cutting parameter redundancy and latency and bringing MobileNetV3-L's latency down to 0.65 ms. Then, by increasing the network width, i.e. the number of channels in each stage, top-1 accuracy rises to 73.5% while latency only increases to 0.89 ms.
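As a rough illustration of why this helps: the two 1x1 convolutions of the channel mixer dominate its parameter count, and that count scales linearly with the expansion ratio (the channel width below is a placeholder, not a value from the paper):

```python
# Parameters of the channel mixer's two 1x1 convs, ignoring norms and biases:
# dim*hidden + hidden*dim = 2 * dim^2 * ratio.
dim = 96  # illustrative channel width
for ratio in (4, 2):
    params = 2 * dim * dim * ratio
    print(f"expansion ratio {ratio}: ~{params / 1e3:.1f}K channel-mixer params per block")
```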

2.2.7 Optimization of macro-architectural elements

In this step, this article further optimizes the performance of MobileNetV3-L on mobile devices, mainly starting from macro-architectural elements, including stem, downsampling layer, classifier and overall stage ratio. By optimizing these macro-architectural elements, the performance of the model can be significantly improved.

2.2.8 Shallow networks using convolutional extractors

ViTs typically use a "patchify" operation as the stem, splitting the input image into non-overlapping patches. However, this simple stem suffers from training-optimization problems and sensitivity to training recipes, so the authors instead adopt early convolutions, an approach already used by several lightweight ViTs, in place of MobileNetV3-L's more complex 4x-downsampling stem. With the convolutional stem, even though the initial number of filters is increased to 24, total latency drops to 0.86 ms and top-1 accuracy rises to 73.9%.
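A hedged sketch of such an "early convolutions" stem: two 3x3 stride-2 convolutions give the same overall 4x downsampling; 24 is the initial filter count mentioned above, while the intermediate width and the activation are assumptions:

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, stride):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(out_ch))

stem = nn.Sequential(
    conv_bn(3, 12, stride=2),   # 2x downsampling, intermediate width assumed
    nn.GELU(),
    conv_bn(12, 24, stride=2),  # 2x downsampling again, 24 output filters
)
```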

2.2.9 Deeper downsampling layers

In ViTs, spatial downsampling is usually implemented by a separate patch-merging layer. Correspondingly, RepViT adopts a separate, deeper downsampling layer to increase network depth and reduce the information loss caused by the resolution reduction. Specifically, the authors first use a 1x1 convolution to adjust the channel dimension, then form a feed-forward network from two 1x1 convolutions connected by a residual. They also add a RepViT block in front to further deepen the downsampling layer. This step raises top-1 accuracy to 75.4% at a latency of 0.96 ms.
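A hedged sketch of this deeper downsampling layer, reusing the `RepViTBlockSketch` from section 2.2.5; the stride-2 depthwise convolution used for the actual spatial reduction is an assumption, since the text above only describes the channel-adjusting 1x1 convolution, the residual feed-forward network, and the preceding RepViT block:

```python
import torch.nn as nn

class DownsampleSketch(nn.Module):
    def __init__(self, in_dim, out_dim, expansion=2):
        super().__init__()
        self.pre_block = RepViTBlockSketch(in_dim)       # RepViT block placed in front
        self.spatial = nn.Sequential(                    # assumed stride-2 depthwise conv
            nn.Conv2d(in_dim, in_dim, 3, 2, 1, groups=in_dim, bias=False),
            nn.BatchNorm2d(in_dim))
        self.channel = nn.Conv2d(in_dim, out_dim, 1, bias=False)  # 1x1 conv adjusts channels
        hidden = out_dim * expansion
        self.ffn = nn.Sequential(                        # feed-forward net of two 1x1 convs
            nn.Conv2d(out_dim, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, out_dim, 1, bias=False), nn.BatchNorm2d(out_dim))

    def forward(self, x):
        x = self.channel(self.spatial(self.pre_block(x)))
        return x + self.ffn(x)                           # residual around the FFN
```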

2.2.10 Simpler Classifier

In lightweight ViTs, the classifier usually consists of a global average pooling layer followed by a linear layer, whereas MobileNetV3-L uses a more complex classifier. Because the final stage now has more channels, the authors replace it with the simple classifier: a global average pooling layer and a linear layer. This step reduces latency to 0.77 ms with a top-1 accuracy of 74.8%.
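The simplified head is just global average pooling plus one linear layer; a minimal sketch (the channel count and class count are illustrative):

```python
import torch.nn as nn

final_dim, num_classes = 384, 1000   # illustrative values
classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # global average pooling
    nn.Flatten(),
    nn.Linear(final_dim, num_classes),
)
```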

2.2.11 Overall stage proportion

The stage ratio represents the ratio of the number of blocks in the different stages and thus indicates how computation is distributed across them. The paper chooses a better stage ratio of 1:1:7:1 and then deepens the network to 2:2:14:2, achieving a deeper layout. This step raises top-1 accuracy to 76.9% at a latency of 1.02 ms.
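A sketch of how the 2:2:14:2 ratio distributes blocks across stages, again reusing `RepViTBlockSketch`; the per-stage channel widths are placeholders, and the downsampling layers between stages are omitted for brevity:

```python
import torch.nn as nn

depths = [2, 2, 14, 2]            # the 2:2:14:2 stage ratio
channels = [48, 96, 192, 384]     # hypothetical widths, for illustration only

stages = nn.ModuleList(
    nn.Sequential(*[RepViTBlockSketch(dim) for _ in range(depth)])
    for depth, dim in zip(depths, channels)
)
```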

2.2.12 Selection of convolution kernel size

It is known that the performance and latency of CNNs are usually affected by the convolution kernel size. For example, to model long-range context dependencies as MHSA does, ConvNeXt uses large convolution kernels, which brings significant performance improvements. However, large kernels are not mobile-friendly because of their computational complexity and memory access costs. MobileNetV3-L mainly uses 3x3 convolutions, with 5x5 convolutions in some blocks. The authors replace these with 3x3 convolutions, which reduces latency to 1.00 ms while maintaining 76.9% top-1 accuracy.

2.2.13 Location of SE layer

One advantage of self-attention modules over convolutions is the ability to adjust weights based on the input, known as a data-driven property. As a channel attention module, the SE layer can compensate for convolution's lack of this data-driven property and thereby improve performance. MobileNetV3-L adds SE layers in some blocks, mainly in the last two stages. However, the lower-resolution stages gain smaller accuracy improvements from the global average pooling used by SE than the higher-resolution stages. The authors therefore adopt a strategy of using SE layers in a cross-block manner in all stages, maximizing the accuracy gain at the smallest latency increment. This step improves top-1 accuracy to 77.4% at a latency of 0.87 ms.

Note: [In fact, Baidu ran experiments and comparisons on this point long ago and reached a similar conclusion: the SE layer is more effective when placed closer to the deeper layers.]
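A hedged sketch of the cross-block placement described above: a minimal SE layer and a stage builder that gives every other block an SE module (whether the even or odd blocks receive SE is an assumption), reusing `RepViTBlockSketch` from section 2.2.5:

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Minimal SE layer: global average pooling -> bottleneck -> channel reweighting."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(),
                                nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))

def build_stage(dim, depth):
    # SE in every other block of every stage ("cross-block" placement).
    return nn.Sequential(*[
        RepViTBlockSketch(dim, se=SqueezeExcite(dim) if i % 2 == 0 else None)
        for i in range(depth)
    ])
```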

2.2.14 Microscopic design adjustments

RepViT also adjusts the lightweight CNN through layer-by-layer micro design, which includes selecting appropriate convolution kernel sizes and optimizing the placement of the squeeze-and-excitation (SE) layers, as described in the two steps above. Both adjustments significantly improve model performance.

2.3 Network architecture

Finally, by integrating the above improvement strategies, we obtain the overall RepViT architecture, which comes in several variants such as RepViT-M1/M2/M3. As before, the variants are distinguished mainly by the number of channels and blocks in each stage.
[Figure: overall RepViT network architecture]

3. Improvement tutorial

3.1 Modify YAML file

For the detailed tutorial and source code for this step, see the Bilibili channel "AI academic beeping beast" linked above.

3.2 Create a new RepViT.py

For the detailed tutorial and source code for this step, see the Bilibili channel "AI academic beeping beast" linked above.

3.3 Modify tasks.py

For the detailed tutorial and source code for this step, see the Bilibili channel "AI academic beeping beast" linked above.

4. Verify that it works

Execute the command:

python train.py
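If the project is built on the Ultralytics package (which the tasks.py modification in step 3.3 suggests), a quick way to confirm that the RepViT backbone actually builds before a long run is a small smoke test like the one below; the YAML file name is a hypothetical placeholder for whatever you created in step 3.1:

```python
from ultralytics import YOLO

model = YOLO("yolov8n-repvit.yaml")   # hypothetical config created in step 3.1
model.info()                          # prints the layer list and parameter count
model.train(data="coco128.yaml", epochs=1, imgsz=640)  # one-epoch smoke test
```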

That completes the changes. Follow the Bilibili channel "AI academic beeping beast" to get on the fast track of research and stay far ahead of your peers!



Original post: blog.csdn.net/weixin_51692073/article/details/133157017