1.3 ms Latency | Tsinghua's ICCV 2023 RepViT: the latest open-source mobile network architecture, and it is wickedly fast!

Introduction

TL;DR: Today I share a paper on RepViT. This work re-examines the design of lightweight convolutional neural networks and integrates the effective architectural choices of lightweight ViTs into lightweight CNNs, with the goal of improving performance on resource-constrained mobile devices.

As the results show, RepViT is indeed superior to other mainstream mobile ViT architectures. Next, let's look at the contributions of this work:

  1. The paper notes that lightweight ViTs usually outperform lightweight CNNs on vision tasks, mainly thanks to their multi-head self-attention module (MHSA), which allows the model to learn global representations. However, the architectural differences between lightweight ViTs and lightweight CNNs have not been fully studied.

  2. In this study, the authors gradually improve the mobile-friendliness of a standard lightweight CNN (specifically, MobileNetV3) by incorporating the effective architectural choices of lightweight ViTs. This leads to a new family of purely lightweight CNNs, namely RepViT. Notably, although RepViT has a MetaFormer structure, it is composed entirely of convolutions.

  3. Experimental results show that RepViT surpasses existing state-of-the-art lightweight ViTs, demonstrating superior performance and efficiency across various vision tasks, including ImageNet classification, object detection and instance segmentation on COCO-2017, and semantic segmentation on ADE20K. In particular, on ImageNet, RepViT achieves nearly 1 ms latency on an iPhone 12 with over 80% Top-1 accuracy, a first for lightweight models.

The next question everyone should be asking is: how do you design a model with such low latency yet such high accuracy?

Method

In ConvNeXt, the authors started from ResNet50 and, through rigorous theoretical and experimental analysis against the Swin Transformer architecture, ultimately designed an excellent pure convolutional network comparable to it. Similarly, RepViT works mainly by gradually integrating the architectural designs of lightweight ViTs into a standard lightweight CNN, namely MobileNetV3-L, through targeted modifications. In this process, the authors consider design elements at different levels of granularity and reach the optimization goal through a series of steps.

Alignment of Training Recipes

First, the paper introduces a metric for measuring latency on mobile devices and aligns the training recipe with that of existing lightweight ViTs. This step mainly ensures consistency of model training and involves two concerns: latency measurement and training-recipe adjustment.

Latency Metrics

To measure a model's performance on real mobile devices more accurately, the authors choose to directly measure the model's actual on-device latency as the benchmark metric. This differs from previous studies, which mostly optimize a model's inference speed via proxies such as FLOPs or model size; such proxies do not always reflect actual latency in mobile applications well.
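
For intuition, here is a generic warm-up-then-average timing loop in PyTorch. The paper measures on an actual iPhone 12 (via Core ML), so this sketch only illustrates the methodology, not their exact setup:

```python
import time
import torch
import torch.nn as nn

def measure_latency_ms(model: nn.Module, input_size=(1, 3, 224, 224),
                       warmup: int = 20, runs: int = 100) -> float:
    """Average single-batch forward latency in milliseconds."""
    model.eval()
    x = torch.randn(input_size)
    with torch.no_grad():
        for _ in range(warmup):      # warm-up runs stabilize caches/JIT
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000.0

print(measure_latency_ms(nn.Conv2d(3, 16, 3)))  # toy usage example
```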

Training Recipe Alignment

Here, the training recipe of MobileNetV3-L is adjusted to align with that of other lightweight ViT models. This includes using the AdamW optimizer (a staple for ViT models), a 5-epoch warm-up, and a cosine-annealing learning-rate schedule over 300 epochs of training. Although this adjustment causes a slight drop in model accuracy, it guarantees fairness.
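
A minimal PyTorch sketch of such a recipe follows. Only the optimizer choice, the 5-epoch warm-up, and the 300-epoch cosine schedule come from the text; the learning rate and weight decay are placeholder assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for MobileNetV3-L
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # 5-epoch linear warm-up, then cosine annealing for the rest
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=5),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=295),
    ],
    milestones=[5],
)  # call scheduler.step() once per epoch, 300 epochs total
```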

Optimization of Block Design

Next, the authors explore the optimal block design under these consistent training settings. The block is a key component of a CNN architecture, and optimizing its design helps improve the network's performance.

Separate Token Mixer and Channel Mixer

This step improves the block structure of MobileNetV3-L by separating the token mixer and the channel mixer. The original MobileNetV3 block contains a 1x1 expansion convolution followed by a depthwise convolution and a 1x1 projection layer, with a residual connection joining input and output. Building on this, RepViT moves the depthwise convolution up front so that the channel mixer and token mixer can be separated. To improve performance, structural re-parameterization is also introduced, giving the depthwise filters a multi-branch topology at training time. In the end, the authors successfully separate the token mixer and channel mixer within the MobileNetV3 block and name the result the RepViT block.
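
To make the structure concrete, here is a minimal sketch of such a block, assuming a 3x3 depthwise token mixer with a parallel BN identity branch (the re-parameterized multi-branch topology) and a residual pointwise FFN. It follows the description above, not the official implementation:

```python
import torch
import torch.nn as nn

class RepViTBlock(nn.Module):
    """Sketch of a RepViT-style block (assumed structure, not official code).

    Token mixer: depthwise 3x3 conv + BN identity branch; the two branches
    can be fused into a single depthwise conv at inference time.
    Channel mixer: residual 1x1 expand / 1x1 project FFN.
    """
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.dw_bn = nn.BatchNorm2d(dim)
        self.id_bn = nn.BatchNorm2d(dim)  # identity branch, fused away later
        hidden = dim * expansion
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.dw_bn(self.dw(x)) + self.id_bn(x)  # token mixing
        return x + self.ffn(x)                      # channel mixing
```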

Reducing the Expansion Ratio and Increasing Width

In the channel mixer, the original expansion ratio is 4, meaning the hidden dimension of the MLP block is four times the input dimension. This consumes a large amount of compute and weighs heavily on inference time. To mitigate this, the expansion ratio can be lowered to 2, reducing parameter redundancy and latency and bringing MobileNetV3-L's latency down to 0.65 ms. Subsequently, increasing the network width, i.e. the number of channels in each stage, lifts Top-1 accuracy to 73.5% while latency rises only to 0.89 ms!
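
A quick back-of-the-envelope check (the width here is an arbitrary placeholder) shows why halving the expansion ratio helps:

```python
# The channel mixer's two 1x1 convs hold C*(e*C) + (e*C)*C = 2*e*C^2
# weights for input width C and expansion ratio e, so dropping e from
# 4 to 2 halves the channel-mixer parameters.
C = 128
for e in (4, 2):
    print(f"e={e}: {2 * e * C * C:,} params")
```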

Optimization of Macro Architectural Elements

In this step, the paper further optimizes MobileNetV3-L's performance on mobile devices, starting from the macro architectural elements: the stem, the downsampling layers, the classifier, and the overall stage ratio. Optimizing these macro elements yields a significant improvement in model performance.

Early Convolutions as the Stem

ViTs typically use a "patchify" operation that splits the input image into non-overlapping patches as the stem. However, this approach suffers in training optimizability and sensitivity to the training recipe. The authors therefore adopt early convolutions instead, an approach already embraced by many lightweight ViTs. By comparison, MobileNetV3-L uses a more complex stem for 4x downsampling. With this change, although the initial number of filters increases to 24, total latency drops to 0.86 ms while Top-1 accuracy improves to 73.9%.
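
As a sketch of what such an early-convolution stem might look like, two stride-2 3x3 convolutions give the same 4x downsampling. The output width of 24 follows the text; the intermediate width is an assumption:

```python
import torch.nn as nn

stem = nn.Sequential(
    # two stride-2 3x3 convs: 224x224 -> 56x56 (4x downsampling)
    nn.Conv2d(3, 12, 3, stride=2, padding=1), nn.BatchNorm2d(12), nn.GELU(),
    nn.Conv2d(12, 24, 3, stride=2, padding=1), nn.BatchNorm2d(24), nn.GELU(),
)
```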

Deeper Downsampling Layers

In ViTs, spatial downsampling is usually realized by a separate patch-merging layer. Accordingly, a separate, deeper downsampling layer can be adopted here to increase network depth and reduce the information loss caused by resolution reduction. Concretely, the authors first use a 1x1 convolution to adjust the channel dimension, then connect the input and output of two 1x1 convolutions through a residual connection, forming a feed-forward network. They additionally prepend a RepViT block to deepen the downsampling layer further. This step raises Top-1 accuracy to 75.4% at a latency of 0.96 ms.
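
A minimal sketch of such a deeper downsampling layer follows. The prepended RepViT-style part is simplified here to a stride-2 depthwise convolution, and the exact ordering beyond the text is an assumption:

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Sketch of a deeper downsampling layer (details assumed)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.spatial = nn.Sequential(   # resolution: H x W -> H/2 x W/2
            nn.Conv2d(in_dim, in_dim, 3, stride=2, padding=1, groups=in_dim),
            nn.BatchNorm2d(in_dim),
        )
        self.channel = nn.Conv2d(in_dim, out_dim, 1)  # adjust channel dim
        self.ffn = nn.Sequential(                     # residual 1x1 FFN
            nn.Conv2d(out_dim, out_dim * 2, 1), nn.GELU(),
            nn.Conv2d(out_dim * 2, out_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel(self.spatial(x))
        return x + self.ffn(x)
```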

A Simpler Classifier

In lightweight ViTs, the classifier usually consists of a global average pooling layer followed by a linear layer. MobileNetV3-L, by contrast, uses a more complex classifier. Since the final stage now has more channels, the authors replace it with this simple classifier, i.e. a global average pooling layer plus a linear layer. This step lowers latency to 0.77 ms, with Top-1 accuracy at 74.8%.
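
In PyTorch this simple classifier is only a few lines (the 384-channel final width and 1000 classes are placeholder assumptions):

```python
import torch.nn as nn

final_dim, num_classes = 384, 1000  # hypothetical values
classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # global average pooling
    nn.Flatten(),
    nn.Linear(final_dim, num_classes),
)
```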

Overall Stage Ratio

The stage ratio is the ratio of the number of blocks in the different stages, and thus describes how computation is distributed across them. The paper chooses a better stage ratio of 1:1:7:1, then increases network depth to 2:2:14:2, yielding a deeper layout. This step raises Top-1 accuracy to 76.9% at a latency of 1.02 ms.
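
Reusing the RepViTBlock sketch from earlier, the stages could be assembled like this (the channel widths are placeholders, not the paper's values):

```python
import torch.nn as nn

depths = [2, 2, 14, 2]       # the 2:2:14:2 layout from the text
widths = [48, 96, 192, 384]  # hypothetical per-stage channel counts
stages = nn.ModuleList(
    nn.Sequential(*[RepViTBlock(w) for _ in range(d)])
    for d, w in zip(depths, widths)
)
```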

Micro-Design Adjustments

Next, RepViT tunes the lightweight CNN through layer-by-layer micro design, which includes choosing an appropriate convolution kernel size and optimizing the placement of the squeeze-and-excitation (SE) layers. Both methods significantly improve model performance.

Choice of Convolution Kernel Size

It is well known that the performance and latency of CNNs are often affected by kernel size. For example, to model long-range context dependencies the way MHSA does, ConvNeXt uses large kernels and achieves notable performance gains. However, large kernels are unfriendly to mobile devices because of their computational complexity and memory access cost. MobileNetV3-L mainly uses 3x3 convolutions, with 5x5 convolutions in some blocks. The authors replace these with 3x3 convolutions, which lowers latency to 1.00 ms while keeping Top-1 accuracy at 76.9%.
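
A rough cost comparison illustrates why (the feature-map size here is an arbitrary example):

```python
# FLOPs of a depthwise conv scale with k^2: a k x k depthwise conv over a
# C-channel H x W feature map costs about H * W * C * k^2 MACs, so a 5x5
# kernel costs 25/9 ~ 2.8x as much as a 3x3 one.
H, W, C = 14, 14, 256
for k in (3, 5):
    print(f"{k}x{k}: {H * W * C * k * k:,} MACs")
```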

Placement of the SE Layer

One advantage of self-attention over convolution is the ability to adjust weights according to the input, known as the data-driven property. As a channel attention module, the SE layer can compensate for convolution's lack of this data-driven property, bringing better performance. MobileNetV3-L adds SE layers in certain blocks, concentrated mainly in the last two stages. However, compared with the higher-resolution stages, the lower-resolution stages gain less accuracy from the global average pooling operation inside SE. The authors therefore design a strategy of using SE layers in a cross-block fashion across all stages, maximizing the accuracy gain at minimal latency cost. This step raises Top-1 accuracy to 77.4% while lowering latency to 0.87 ms.
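
For reference, here is a standard SE module plus a sketch of the cross-block placement, interpreted here as enabling SE in every other block of a stage (the 14-block stage length follows the layout above; the exact alternation pattern is an assumption):

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Standard squeeze-and-excitation channel attention."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool
        return x * w[:, :, None, None]   # excite: reweight channels

# Cross-block placement: SE in every other block, e.g. for a 14-block stage.
use_se = [i % 2 == 0 for i in range(14)]
```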

[Incidentally, Baidu ran comparison experiments on this long ago and reached the same conclusion: SE layers work best when placed toward the deeper layers.]

Network Architecture

Finally, integrating all of the above improvements yields the overall RepViT architecture, which comes in several variants such as RepViT-M1/M2/M3. As usual, the variants are distinguished mainly by the number of channels and blocks in each stage.

Experiments

Image Classification

[Figure: ImageNet classification results]

Detection and Segmentation

[Figure: COCO object detection/instance segmentation and ADE20K semantic segmentation results]

Summary

By introducing the architectural choices of lightweight ViTs, this paper re-examines the efficient design of lightweight CNNs. The result is RepViT, a new family of lightweight CNNs designed for resource-constrained mobile devices. On a variety of vision tasks, RepViT outperforms existing state-of-the-art lightweight ViTs and CNNs, showing superior performance and latency. This highlights the potential of purely lightweight CNNs for mobile devices.

Final Words

If you are also interested in research on neural network architectures, you are very welcome to scan the QR code at the bottom of the screen, or search the WeChat ID cv_huber directly, to add the editor as a friend. Please note: school/company - research direction - nickname. Come discuss more interesting neural network architectures with thousands of scholars and experts!


Origin juejin.im/post/7258526520167350309