OREPA: Alibaba proposes a re-parameterization strategy with fast training, halving the memory and doubling the speed | CVPR 2022

This paper proposes an online re-parameterization method, OREPA, which re-parameterizes complex structures into a single convolutional layer during the training phase, greatly reducing training time. To achieve this, the paper replaces the training-time BN layers with linear scaling layers, preserving the diversity of optimization directions and the feature representation ability. The experimental results show that OREPA achieves strong accuracy and efficiency across a variety of tasks.

Source: WeChat public account [Xiaofei's Algorithm Engineering Notes]

Paper: Online Convolutional Re-parameterization

Introduction


  In addition to accuracy, the inference speed of a model also matters. To obtain deployment-friendly, high-accuracy models, many recent studies propose improving performance through structural re-parameterization. A structurally re-parameterized model has different structures in the training and inference phases: a complex structure is used during training to reach high accuracy, and after training it is compressed via equivalent transformations into a linear layer that supports fast inference. The compressed model usually has a concise architecture, such as a VGG-like or ResNet-like structure. From this perspective, the re-parameterization strategy improves model performance without introducing any extra inference-time cost. This public account previously published an interpretation of the RepVGG paper, "RepVGG: VGG, God Forever! | 2021 New Article"; interested readers can refer to it.
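  To make the equivalent transformation concrete, below is a minimal PyTorch sketch (my own illustration, not code from the paper) of the standard conv+BN folding that offline re-parameterization methods such as RepVGG perform after training:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-mode BN layer into the preceding convolution."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                      # per-channel gamma / std
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = bn.bias + (conv_bias - bn.running_mean) * scale
    return fused
```

  Because BN at inference time is just a per-channel affine transform, the fused convolution produces exactly the same outputs as the conv followed by BN; this is the sense in which the complex training structure is "compressed into a linear layer".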

  The BN layer is a key component of re-parameterized models: a BN layer follows each convolutional layer. As shown in Fig. 1b, removing the BN layers causes a severe accuracy drop. In the inference stage the complex structure can be compressed into a single convolutional layer, but in the training phase, since each BN layer non-linearly divides its feature map by the standard deviation, every branch has to be computed individually. This produces a large number of intermediate computations (high FLOPs) and buffered feature maps (high memory usage), i.e. a huge training overhead. To make matters worse, the high training cost hinders the exploration of more complex and potentially more powerful re-parameterized structures.
Why are BN layers so important for re-parameterization? Through experiments and analysis, the paper finds that the scaling factor of the BN layer diversifies the optimization directions of the different branches. Based on this finding, the paper proposes the online re-parameterization method OREPA, shown in Figure 1c, which consists of two steps:

  • Block linearization: remove all non-linear normalization (BN) layers and introduce linear scaling layers instead. Like the BN layers, the linear scaling layers diversify the optimization directions of the branches, and, being linear, they can be merged with the rest of the structure during training.
  • Block squeezing: simplify the complex linear structure into a single convolutional layer.

  OREPA removes the computational and memory overhead caused by the intermediate layers, significantly reducing training cost (65%-75% memory saving and a 1.5-2.3x speedup) with little impact on performance, which makes it feasible to explore more complex re-parameterized structures. To verify this, the paper further proposes several re-parameterized components for better performance.
The contributions of the paper include the following three points:

  • The online re-parameterization method OREPA is proposed, which greatly improves the training efficiency of re-parameterized models and makes it possible to explore stronger re-parameterized structures.
  • Based on an analysis of how re-parameterized models work, the BN layer is replaced with a linear scaling layer, which preserves the diversity of optimization directions and the feature representation ability.
  • Experiments on various vision tasks show that OREPA outperforms previous re-parameterized models in both accuracy and training efficiency.

Online Re-Parameterization


  OREPA simplifies the complex training-time structure into a single convolutional layer while maintaining the same accuracy. The transformation process of OREPA is shown in Figure 2 and consists of two steps: block linearization and block squeezing.

Preliminaries: Normalization in Re-param

  The BN layer is the key component of the multi-layer, multi-branch structures used in re-parameterization and the foundation of a re-parameterized model's performance. Taking DBB and RepVGG as examples, removing the BN layers (and instead applying a single BN after the branches are summed) leads to an obvious performance drop, as shown in Table 1.
Somewhat unexpectedly, using BN layers also brings excessively high training cost. In the inference stage, all intermediate operations in the re-parameterized structure are linear and can be merged. In the training stage, however, the BN layers are non-linear (each divides its feature map by the standard deviation), so merging is impossible. Without merging, every intermediate operation has to be computed separately, producing huge computation and memory costs. Moreover, the high cost also hinders the exploration of more complex structures.
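  A tiny numeric illustration of this point (my own, not from the paper): in training mode each branch's BN normalizes by that branch's own batch statistics, so the branches cannot be folded into one kernel before the normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 16, 32, 32)
w1 = torch.randn(16, 16, 3, 3)
w2 = torch.randn(16, 16, 3, 3)
bn1, bn2, bn_merged = nn.BatchNorm2d(16), nn.BatchNorm2d(16), nn.BatchNorm2d(16)

# training-mode BN: each branch is normalized by its own statistics,
# so every branch output has to be materialized separately
per_branch = bn1(F.conv2d(x, w1, padding=1)) + bn2(F.conv2d(x, w2, padding=1))

# merging the kernels first and normalizing once gives a different result
merged_first = bn_merged(F.conv2d(x, w1 + w2, padding=1))

print(torch.allclose(per_branch, merged_first))   # False
```

  This per-branch intermediate computation and feature-map buffering is exactly the cost that OREPA removes by making every branch linear.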

Block Linearization

  Although the BN layers prevent merging during training, they cannot simply be removed because of the accuracy drop. To solve this problem, the paper introduces a channel-wise linear scaling layer as a linear replacement for BN, scaling the feature map with a learnable vector. The linear scaling layer has an effect similar to the BN layer: it guides the branches to optimize in different directions, which is the core of re-parameterization's performance gain.
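  A minimal sketch of such a channel-wise scaling layer (my own illustration; since the layer is linear, it can also be folded directly into the weight of the preceding convolution):

```python
import torch
import torch.nn as nn

class ChannelScale(nn.Module):
    """Learnable per-channel scale: a linear stand-in for BN at the end of a branch."""
    def __init__(self, channels, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((1, channels, 1, 1), init))

    def forward(self, x):
        return x * self.scale

# Folding the scale into a preceding conv weight W of shape [out, in, kh, kw]:
#   W_folded = scale.reshape(-1, 1, 1, 1) * W
```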

  Based on the linear scaling layer, the re-parameterized structure is modified as shown in Figure 3, in the following three steps:

  • Remove all non-linear layers, i.e. the normalization layers in the re-parameterized structure.
  • To keep the optimization diverse, add a scaling layer, the linear replacement of BN, at the end of each branch.
  • To stabilize training, add a single BN layer after the sum of all branches.

  After block linearization, only linear layers remain in the re-parameterized structure, which means that all components of the structure can be merged during the training phase.
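  Putting the steps together, here is a simplified sketch (my own, not the paper's implementation) of a linearized block: several parallel 3x3 branches, each ending in its own channel-wise scale, squeezed into a single kernel at every training step, followed by one BN after the sum:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearizedRepBlock(nn.Module):
    def __init__(self, in_ch, out_ch, branches=3):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01) for _ in range(branches)])
        self.scales = nn.ParameterList(
            [nn.Parameter(torch.ones(out_ch)) for _ in range(branches)])
        self.post_bn = nn.BatchNorm2d(out_ch)   # single BN after all branches, for stability

    def forward(self, x):
        # all branches are linear, so they are merged into one kernel before the
        # convolution: no per-branch feature maps are ever computed or stored
        merged = sum(s.reshape(-1, 1, 1, 1) * w
                     for s, w in zip(self.scales, self.weights))
        return self.post_bn(F.conv2d(x, merged, padding=1))
```

  The forward pass costs the same as a single 3x3 convolution plus one BN, regardless of how many branches the block has.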

Block Squeezing

  Block squeezing converts the computation- and memory-heavy operations on intermediate feature maps into operations on single convolutional kernels, which reduces the extra training cost of re-parameterization from $O(H \times W)$ to $O(K_H \times K_W)$, where $(K_H, K_W)$ is the shape of the convolutional kernel.
In general, no matter how complex the linear re-parameterized structure is, the following two properties always hold:

  • All linear layers in the re-parameterized structure (e.g. depthwise convolution, average pooling, and the proposed linear scaling) can be represented by a convolutional layer with corresponding parameters; the proof is in the appendix of the original paper.
  • The re-parameterized structure can be represented as a set of parallel branches, each containing a sequence of convolutional layers.

  With these two properties, multi-layer (i.e. sequential) and multi-branch (i.e. parallel) structures can both be squeezed into a single convolution, as shown in Figure 4a and Figure 4b. The original paper gives the formal proofs of the transformations; interested readers can refer to the corresponding section, but they are not essential for understanding the idea of block squeezing.
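  A rough PyTorch sketch of the two squeeze rules (my own illustration; the sequential case below only covers the 1x1-then-kxk merge popularized by DBB, with groups=1 and no bias — the general proofs are in the paper's appendix):

```python
import torch
import torch.nn.functional as F

def squeeze_parallel(kernels):
    """Parallel branches collapse to the sum of their kernels,
    after zero-padding smaller (e.g. 1x1) kernels to the largest size."""
    k = max(w.shape[-1] for w in kernels)
    padded = [F.pad(w, [(k - w.shape[-1]) // 2] * 4) for w in kernels]
    return sum(padded)

def squeeze_sequential_1x1_kxk(k1, k2):
    """A 1x1 conv (k1: [mid, in, 1, 1]) followed by a kxk conv (k2: [out, mid, k, k])
    equals a single kxk conv with the kernel returned below (groups=1, no bias)."""
    return F.conv2d(k2, k1.permute(1, 0, 2, 3))
```

  A complex linear structure is squeezed by applying these rules repeatedly: first collapse the sequence of layers inside each branch, then sum the resulting kernels of all parallel branches.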

Gradient Analysis on Multi-branch Topology

  The paper analyzes the effect of the multi-branch topology and of block linearization from the perspective of gradient back-propagation. The analysis involves some derivations; interested readers can refer to the corresponding section of the original paper. The two main conclusions are:

  • With branch-shared block linearization, the optimization direction and magnitude of the multi-branch structure are the same as those of a single branch.
  • With branch-wise (independent) block linearization, the optimization direction and magnitude of the multi-branch structure differ from those of a single branch.

  These conclusions show the importance of the block linearization step: once the BN layers are removed, the scaling layers keep the optimization directions diverse and prevent the multi-branch structure from degenerating into a single branch.
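  A simplified single-layer view of this argument (my own condensation, ignoring the post-BN and any non-linearity): consider two parallel branches with weights $W_1$, $W_2$ acting on the same input $x$. With a branch-shared scale $\gamma$,

$$y = \gamma\,(W_1 + W_2)\,x \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial W_2} = \gamma\,\frac{\partial \mathcal{L}}{\partial y}\,x^{\top},$$

so both branches receive identical gradients and the block optimizes exactly like a single branch with weight $W_1 + W_2$. With branch-wise scales $\gamma_1$, $\gamma_2$,

$$y = (\gamma_1 W_1 + \gamma_2 W_2)\,x \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial W_i} = \gamma_i\,\frac{\partial \mathcal{L}}{\partial y}\,x^{\top},$$

which differs across branches whenever $\gamma_1 \neq \gamma_2$, preserving diverse optimization directions.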

Block Design

  Since OREPA saves a large amount of training cost, it opens up the possibility of exploring more complex training structures. Based on DBB, the paper designs a new re-parameterized model, OREPA-ResNet, adding the following components:

  • Frequency prior filter: FcaNet points out that the pooling layer is a special case of frequency-domain filtering; following this work, a 1x1 convolution + frequency filtering branch is added.
  • Linear depthwise separable convolution: slightly modify the depthwise separable convolution by removing the intermediate non-linear activation so that it can be merged during training.
  • Re-parameterization for 1x1 convolution: previous studies mainly re-parameterize 3x3 convolutional layers and ignore the 1x1 convolution, which is important in bottleneck structures. The paper therefore adds an extra 1x1 conv + 1x1 conv branch to re-parameterize the 1x1 convolution as well.
  • Linear deep stem: networks usually adopt a 7x7 conv + 3x3 conv as the stem, and some networks replace it with three stacked 3x3 convolutions for better accuracy. The paper argues that such a stacked design is computationally expensive on the high-resolution feature maps at the beginning of the network, so the three 3x3 convolutions, together with the proposed linear layers, are squeezed into a single 7x7 convolutional layer, greatly reducing computation while preserving accuracy (see the sketch after this list).
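  Below is a rough numerical sketch (my own, using stride 1 and bias-free convolutions for simplicity; the paper performs this merge analytically at the weight level) of how a stack of three 3x3 convolutions collapses into one equivalent 7x7 kernel:

```python
import torch
import torch.nn as nn

# hypothetical bias-free deep stem: three stacked 3x3 convolutions
stem = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1, bias=False),
    nn.Conv2d(32, 32, 3, padding=1, bias=False),
    nn.Conv2d(32, 32, 3, padding=1, bias=False),
)

@torch.no_grad()
def equivalent_kernel(stack, in_ch, k=7, size=15):
    """Probe a stack of bias-free linear layers with per-channel unit impulses
    to recover the single k x k kernel it is equivalent to."""
    x = torch.zeros(in_ch, in_ch, size, size)
    c = size // 2
    for i in range(in_ch):
        x[i, i, c, c] = 1.0                       # impulse in channel i only
    y = stack(x)                                   # [in_ch, out_ch, size, size]
    r = k // 2
    patch = y[:, :, c - r:c + r + 1, c - r:c + r + 1]
    # the impulse response of a cross-correlation comes out spatially flipped
    return patch.flip(2, 3).permute(1, 0, 2, 3).contiguous()   # [out_ch, in_ch, k, k]

w7 = equivalent_kernel(stem, in_ch=3)              # one 7x7 kernel replacing the stack
```

  A single 7x7 convolution with this kernel reproduces the stacked stem's output (up to boundary effects from the intermediate zero padding) while touching the high-resolution feature map only once.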

  The block design of OREPA-ResNet is shown in Figure 6 (a downsampling block); it is eventually merged into a single 3x3 convolution for both training and inference.

Experiment


  Comparison experiments of the individual components.

  The effect of the scaling layers on the similarity between the branches of each layer.

  Comparison of linear scaling strategies; channel-wise scaling performs best.

  Training time comparison between online and offline re-parameterization.

  Comparison with other re-parameterization strategies.

  Comparison on detection and segmentation tasks.

Conclusion


  The paper proposes the online re-parameterization method OREPA, which re-parameterizes complex structures into a single convolutional layer already during the training phase, greatly reducing training time. To achieve this, the paper replaces the training-time BN layers with linear scaling layers, preserving the diversity of optimization directions and the feature representation ability. The experimental results show that OREPA achieves good accuracy and efficiency on a variety of tasks.



If this article is helpful to you, please give it a like or a "Looking"~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes]
