CVPR 2022 | Online Re-Param | OREPA further speeds up AI training, with slightly better accuracy than RepVGG!


Author: ChaucerG

Source: Jizhi Shutong


Structural reparameterization has attracted increasing attention in various computer vision tasks. It aims to improve the performance of deep models without introducing any inference-time cost. Although such models are efficient during inference, they rely heavily on complicated training-time blocks to achieve high accuracy, which leads to large additional training costs.

In this paper, the authors propose Online Convolutional Re-parameterization (OREPA), a two-stage pipeline that reduces the huge training overhead by squeezing the complex training-time block into a single convolution. To achieve this, a linear scaling layer is introduced to better optimize the online blocks. With the training cost reduced, the authors also explore several more effective re-parameterization components. Compared with state-of-the-art reparameterized models, OREPA saves about 70% of the extra training-time memory and accelerates training by about 2×. Meanwhile, equipped with OREPA, the models outperform previous methods on ImageNet by up to +0.6%. The authors also conduct experiments on object detection and semantic segmentation and show consistent improvements on downstream tasks.

1 Introduction

Convolutional Neural Networks (CNNs) have been successfully applied in many computer vision tasks, including image classification, object detection, semantic segmentation, etc. The trade-off between accuracy and model efficiency has also been widely discussed.

In general, a higher accuracy model usually requires a more complex block, a wider or deeper structure. However, such models are always too heavy to deploy, especially in scenarios where hardware performance is limited and real-time inference is required. Given the efficiency, smaller, more compact and faster models are naturally preferred.

In order to obtain a deployment-friendly and high-accuracy model, some researchers have proposed structural reparameterization methods to unleash performance. In these methods, the model has different structures in the training phase and the inference phase. Specifically, a complex training-time topology, i.e. the reparameterized block, is used to improve performance. After training, the complex block is re-parameterized into a single linear layer via an equivalent transformation. The reparameterized model usually has a neat architecture, for example a VGG-like or ResNet-like structure. From this perspective, the reparameterization strategy can improve model performance without introducing any additional inference-time cost.

[Figure 1]

The BN layer is a key component of the reparameterized model. In a reparameterized block (Fig. 1(b)), a BN layer is added immediately after each convolutional layer, and removing these BN layers causes a clear performance drop. From the efficiency point of view, however, such BN layers bring a huge computational overhead in the training phase. During inference, the complex block can be compressed into a single convolutional layer. During training, however, the BN layers are non-linear, i.e. they divide the feature map by its standard deviation, which prevents the whole block from being merged. As a result, there are a large number of intermediate computations (large FLOPs) and buffered feature maps (high memory usage). To make matters worse, such a high training budget makes it difficult to explore more complex and potentially stronger reparameterization blocks. Naturally, the following question arises:
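
As a concrete illustration of the inference-time merge mentioned above, the following minimal PyTorch sketch folds an inference-mode BN layer into the convolution that precedes it; the helper name and the random check are illustrative, not code from the paper.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(conv_w, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold an inference-mode BN into the preceding bias-free convolution."""
    std = torch.sqrt(running_var + eps)                      # per-channel std
    fused_w = conv_w * (gamma / std).reshape(-1, 1, 1, 1)    # scale each output filter
    fused_b = beta - gamma * running_mean / std              # absorbed bias
    return fused_w, fused_b

# Equivalence check with fixed (inference) statistics.
x = torch.randn(1, 8, 16, 16)
w = torch.randn(16, 8, 3, 3)
gamma, beta = torch.rand(16) + 0.5, torch.randn(16)
mean, var = torch.randn(16), torch.rand(16) + 0.1

y_ref = F.batch_norm(F.conv2d(x, w, padding=1), mean, var, gamma, beta, training=False)
fw, fb = fuse_conv_bn(w, gamma, beta, mean, var)
print(torch.allclose(y_ref, F.conv2d(x, fw, bias=fb, padding=1), atol=1e-4))   # True
```

This per-branch folding (followed by summing the branch kernels) is what DBB and RepVGG perform once training has finished; the problem discussed next is that it cannot be done while BN is still running on batch statistics.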

Why is normalization so important in reparameterization?

Through analysis and experiments, the authors believe that the scale factors in the BN layer are the most important because they can diversify the optimization directions of different branches.

Based on these observations, the authors propose Online Re-Parameterization (OREPA) (Fig. 1(c)), a two-stage pipeline that squeezes the complex training-time re-param block into a single convolution.

In the first stage, block linearization, all non-linear BN layers are removed and a linear scaling layer is introduced. These layers have similar properties to BN layers, so they diversify the optimization of the different branches. Furthermore, being linear, they can be merged into the convolutional layers during training.

The second stage, block squeezing, simplifies the complex linear block into a single convolutional layer. By moving the computational and storage overhead of the intermediate layers from feature maps to kernels, OREPA significantly reduces the training cost with only a very small impact on performance.

Furthermore, the high efficiency makes it possible to explore more complex reparameterization topologies. To verify this, the authors further propose several reparameterized components for better performance.

The proposed OREPA is evaluated on the ImageNet classification task. Compared to state-of-the-art reparameterized models, OREPA reduces the additional training-time GPU memory cost by 65% to 75% and speeds up the training process by 1.5–2.3×. Meanwhile, OREPA-ResNet and OREPA-VGG consistently outperform the previous DBB and RepVGG methods by +0.2%∼+0.6%. The authors also evaluate OREPA on downstream tasks, namely object detection and semantic segmentation, and find that OREPA brings performance improvements on these tasks as well.

The main contributions of this paper are:

  1. The online convolutional reparameterization (OREPA) strategy is proposed, which greatly improves the training efficiency of the reparameterization model and makes it possible to explore stronger reparameterization blocks;

  2. Through an analysis of the working mechanism of the reparameterization model, the BN layer is replaced by the introduced linear scaling layer, which still provides diverse optimization directions and maintains representational ability;

  3. Experiments on various vision tasks show that OREPA outperforms previous reparameterized models (DBB/RepVGG) in both accuracy and training efficiency.

2 Related work

2.1 Structural Reparameterization

Structural reparameterization has recently received much attention and has been applied to many computer vision tasks, such as compact model design, architecture search, and pruning. Reparameterization means that different architectures can be converted into each other through equivalent transformations of their parameters. For example, a 1×1 convolution branch and a 3×3 convolution branch can be merged into a single 3×3 convolution branch. In the training phase, multi-branch and multi-layer topologies are designed to replace ordinary linear layers (such as convolutional or fully connected layers) to strengthen the model; Cao et al. discuss how to incorporate depthwise separable convolution kernels during training. During inference, the complex training-time model is then converted into a simpler one for faster inference.

While benefiting from the complex training-time topologies, current reparameterization methods incur a non-negligible additional computational cost during training. As the blocks become more complex in pursuit of stronger representations, GPU memory usage and training time grow longer and longer, eventually becoming unacceptable. Unlike previous reparameterization methods, this paper focuses more on the training cost and proposes a general online convolutional reparameterization strategy that makes structural reparameterization affordable at training time.

2.2 Normalization

BN was proposed to alleviate the vanishing gradient problem when training very deep neural networks. It is believed that BN layers are very important because they smooth the loss. Recent research on BN-free neural networks claims that BN layers are not indispensable. With good initialization and proper regularization, BN layers can be removed gracefully.

For the reparameterization model, the authors believe that the BN layer in the reparameterization block is critical. The BN-free variant will suffer from performance degradation. However, BN layers are non-linear, that is, they divide the feature map by its standard deviation, which prevents merging blocks online. To make online reparameterization feasible, the authors remove all BN layers in the reparameterization block and introduce a linear alternative to the BN layer, the linear scaling layer.

2.3 Convolution decomposition

Standard convolutional layers are computationally intensive, resulting in large FLOPs and parameter counts. Therefore, convolution decomposition methods have been proposed and are widely used in lightweight models for mobile devices. The reparameterization method can also be seen as a form of convolution decomposition, but it leans toward more complex topologies. The difference in this method is that the convolution is decomposed at the kernel level rather than at the structure level.

3 Online Reparameterization

In this section, the key component of the reparameterized model, namely the BN layer, is analyzed first. On this basis, online reparameterization (OREPA) is proposed, which aims to greatly reduce the training-time budget of reparameterized models. OREPA is able to simplify a complex training-time block into a single convolutional layer while maintaining high accuracy.

The overall pipeline of OREPA is shown in Figure 2, which includes a Block Linearization stage and a Block Squeezing stage.

[Figure 2: the overall pipeline of OREPA]

The authors deeply investigate the effectiveness of reparameterization by analyzing the optimization diversity of multi-layer and multi-branch structures, and demonstrate that the proposed linear scaling layer and BN layer have similar effects.

Finally, as the training budget is reduced, more components are further explored to achieve a stronger reparameterized model with a slight increase in cost.

3.1 Normalization in reparameterization

The authors argue that the intermediate BN layer is a key component of the multi-layer and multi-branch structure in the reparameterization process. Taking SoTA models DBB and RepVGG as examples, removing these layers leads to severe performance degradation, as shown in Table 1.

[Table 1]

This observation is also supported experimentally by Ding et al. Therefore, the authors believe that the intermediate BN layers are essential to the performance of reparameterized models.

However, the use of intermediate BN layers brings a higher training budget. The authors note that in the inference stage, all intermediate operations in the reparameterization block are linear and thus can be combined into a single convolutional layer, resulting in a simple structure.

But during training, BN layers are non-linear, i.e. they divide the feature map by its standard deviation. Therefore, intermediate operations should be calculated separately, which will result in higher computational and memory costs. To make matters worse, such a high cost will prevent the exploration of more powerful training modules.
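
The obstacle can be made concrete with a small, purely illustrative check: in training mode, BN normalizes each channel by the statistics of the current batch, so the per-channel factor it applies changes from batch to batch and cannot be absorbed into any fixed kernel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(4, 4, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(4).train()          # training mode: uses batch statistics

xa = torch.randn(8, 4, 16, 16)
xb = torch.randn(8, 4, 16, 16)
with torch.no_grad():
    va = conv(xa).var(dim=(0, 2, 3), unbiased=False)
    vb = conv(xb).var(dim=(0, 2, 3), unbiased=False)
    # Effective per-channel scale applied by BN in train mode: gamma / sqrt(var + eps).
    # It differs between the two batches, so no single merged kernel can reproduce conv+BN.
    print(bn.weight / torch.sqrt(va + bn.eps))
    print(bn.weight / torch.sqrt(vb + bn.eps))
```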

3.2 Block Linearization

As mentioned in 3.1, the intermediate BN layers prevent the separate layers from being merged during training. However, simply removing them is not an option, as it hurts performance. To address this dilemma, the authors introduce a channel-level linear scaling operation as a linear alternative to BN.

The scaling layer contains a learnable vector that scales the feature map along the channel dimension. Linear scaling layers have a similar effect to BN layers: both encourage the multiple branches to be optimized in diverse directions, which is the key to the performance improvement brought by reparameterization. In addition to this effect on performance, linear scaling layers can also be merged during training, enabling online reparameterization.
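
A minimal sketch of such a scaling layer is shown below; the class name, the initialization to 1, and the fold_into helper are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LinearScaling(nn.Module):
    """Channel-wise linear scaling: a linear stand-in for the branch-level BN."""
    def __init__(self, channels, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init))

    def forward(self, x):                        # x: (N, C, H, W)
        return x * self.scale.view(1, -1, 1, 1)

    def fold_into(self, conv_weight):            # conv_weight: (C_out, C_in, k, k)
        # Being linear and per-channel, the layer can be absorbed into the conv kernel.
        return conv_weight * self.scale.view(-1, 1, 1, 1)
```

Because the scaling acts per output channel, folding it into the kernel is exact, which is what makes the training-time (online) merge possible.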

[Figure 3]

Based on the linear scaling layer, the authors modified the reparameterization block, as shown in Figure 3. Specifically, the linearization phase of the block consists of the following 3 steps:

  • First, all non-linear layers, i.e. BN layers in the reparameterization block, are removed

  • Second, to keep the optimization diversity, a scaling layer is added at the end of each branch, which is a linear alternative to BN

  • Finally, to stabilize the training process, a BN layer is added after the addition of all branches.

Once the linearization phase is complete, there are only linear layers in the reparameterization block, which means that all components in the block can be merged during the training phase.
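
Putting the three steps together, a linearized two-branch block might look like the following sketch; the branch choice and channel counts here are arbitrary, and the real OREPA blocks contain more branches.

```python
import torch
import torch.nn as nn

class LinearizedBlock(nn.Module):
    """After block linearization: no per-branch BN, one scaling vector per branch,
    and a single post-addition BN for training stability."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.s3 = nn.Parameter(torch.ones(channels))   # branch-level scaling
        self.s1 = nn.Parameter(torch.ones(channels))
        self.post_bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = self.conv3(x) * self.s3.view(1, -1, 1, 1) \
            + self.conv1(x) * self.s1.view(1, -1, 1, 1)
        return self.post_bn(y)
```

Everything before post_bn is linear, so the next stage can squeeze the whole branch structure into one convolution and execute only that single convolution at each training step.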

3.3 Block Squeezing

The Block Squeezing step converts the computation- and memory-expensive operations on intermediate feature maps into operations on the kernels. This means the extra computation and memory of the reparameterization branches are reduced from $\mathcal{O}(H \times W)$ to $\mathcal{O}(k_h \times k_w)$, where $H \times W$ and $k_h \times k_w$ are the spatial dimensions of the feature maps and the convolution kernels, respectively.

In general, no matter how complex the linear reparameterization block is, the following 2 properties always hold:

  • All linear layers in the block, such as depthwise convolution, average pooling and the proposed linear scaling, can be represented by a degenerate convolutional layer with corresponding parameters (see the sketch after this list);

  • A block can be represented by a series of parallel branches, each of which consists of a series of convolutional layers.
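
The first property can be checked directly: average pooling and channel-wise scaling, for instance, are reproduced exactly by suitably parameterized (depthwise) convolutions. A small illustrative sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)

# 3x3 average pooling == depthwise 3x3 convolution whose kernel is filled with 1/9.
w_avg = torch.full((3, 1, 3, 3), 1.0 / 9)
print(torch.allclose(F.avg_pool2d(x, 3, stride=1, padding=1),
                     F.conv2d(x, w_avg, padding=1, groups=3), atol=1e-6))    # True

# Channel-wise scaling == depthwise 1x1 convolution whose kernel holds the scale vector.
s = torch.randn(3)
print(torch.allclose(x * s.view(1, 3, 1, 1),
                     F.conv2d(x, s.view(3, 1, 1, 1), groups=3), atol=1e-6))  # True
```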

Given these two properties, a block can be compressed into a single convolution if both of the following can be simplified into a single convolution:

  1. the multi-layer (i.e. sequential) structures;

  2. the multi-branch (i.e. parallel) structures.

[Figure 4: sequential (a) and parallel (b) structures]

In the following sections, we show how to simplify the sequential structure (Fig. 4(a)) and the parallel structure (Fig. 4(b)).

First, define the notation for convolution. Let $C_{in}$ and $C_{out}$ denote the numbers of input and output channels of a two-dimensional convolution kernel $W \in \mathbb{R}^{C_{out} \times C_{in} \times k_h \times k_w}$ of spatial size $k_h \times k_w$, and let $X$ and $Y$ denote the input and output tensors. The bias is omitted as a common practice, and the convolution process is written as

$$Y = W \otimes X.$$

1. Simplifying the sequential structure

Consider a stack of convolutional layers, expressed as

$$Y = W_N \otimes \big(W_{N-1} \otimes (\cdots (W_1 \otimes X))\big),$$

where the output channels of each layer match the input channels of the next. According to the associative law, these layers can be compressed into one layer by first convolving the kernels with each other:

$$Y = \big(W_N \otimes W_{N-1} \otimes \cdots \otimes W_1\big) \otimes X = W_E \otimes X,$$

where $W_j$ is the weight of the $j$-th layer and $W_E$ denotes the end-to-end mapping matrix.
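
For the common special case of a 1×1 convolution followed by a 3×3 convolution, this kernel-level composition takes only a few lines in PyTorch. The helper below is an illustrative sketch (stride 1, bias-free), not the paper's code:

```python
import torch
import torch.nn.functional as F

def squeeze_1x1_then_3x3(w1, w3):
    """Squeeze a Conv1x1 -> Conv3x3 sequence into a single 3x3 kernel.

    w1: (C_mid, C_in, 1, 1), w3: (C_out, C_mid, 3, 3) -> (C_out, C_in, 3, 3)
    """
    # Convolving the kernels themselves yields the end-to-end kernel.
    return F.conv2d(w3, w1.permute(1, 0, 2, 3))

# Equivalence check on random data.
x = torch.randn(1, 4, 10, 10)
w1 = torch.randn(8, 4, 1, 1)
w3 = torch.randn(16, 8, 3, 3)
y_seq = F.conv2d(F.conv2d(x, w1), w3, padding=1)
y_one = F.conv2d(x, squeeze_1x1_then_3x3(w1, w3), padding=1)
print(torch.allclose(y_seq, y_one, atol=1e-4))   # True
```

Note that the composition happens entirely in kernel space; no intermediate feature map is ever materialized.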

2. Simplifying the parallel structure

According to the linearity of convolution, multiple parallel branches can be merged into one branch:

$$Y = \sum_{i} \big(W_i \otimes X\big) = \Big(\sum_{i} W_i\Big) \otimes X = W_E \otimes X,$$

where $W_i$ is the weight of the $i$-th branch and $W_E = \sum_i W_i$ is the unified weight. It is worth noting that when merging kernels of different sizes, the spatial centers of the kernels need to be aligned; for example, a 1×1 kernel should be aligned with the center of a 3×3 kernel (as in RepVGG).
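
As a concrete sketch of this rule together with the center alignment, a 1×1 branch can be absorbed into a 3×3 branch by zero-padding its kernel to 3×3 before summing (same channels, stride 1, bias-free); the helper and the random check below are illustrative:

```python
import torch
import torch.nn.functional as F

def merge_1x1_into_3x3(w3, w1):
    """Merge a parallel 1x1 branch into a 3x3 branch (stride 1, padding 1)."""
    # Zero-pad the 1x1 kernel so its value sits at the spatial center of a 3x3 kernel.
    return w3 + F.pad(w1, [1, 1, 1, 1])

x = torch.randn(1, 8, 14, 14)
w3 = torch.randn(16, 8, 3, 3)
w1 = torch.randn(16, 8, 1, 1)
y_two_branch = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1)
y_merged = F.conv2d(x, merge_1x1_into_3x3(w3, w1), padding=1)
print(torch.allclose(y_two_branch, y_merged, atol=1e-4))   # True
```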

3. Training overhead: from features to kernels

No matter how complex a block is, it must consist only of multi-branch and multi-layer sub-topologies. Therefore, it can be reduced to a single convolution by recursively applying the above two simplification rules. In the end, a single end-to-end mapping weight is obtained, and only one convolution needs to be performed per block during training. This effectively converts the operations on intermediate feature maps (convolutions, additions) into operations on the convolution kernels, so the extra computation and memory cost drops from the order of the feature-map size $H \times W$ to the order of the kernel size $k_h \times k_w$ (for a 56×56 feature map and 3×3 kernels, for example, roughly a $56 \cdot 56 / (3 \cdot 3) \approx 350\times$ reduction in the intermediate operations).

3.4 Gradient analysis of multi-branch topology

To understand why the block linearization step is feasible, i.e. why the scaling layer is important, the authors analyze how reparameterization affects the optimization of the unified (end-to-end) weight.

The conclusion is that, for a branch whose BN layer has been removed, adding a scaling layer diversifies its optimization direction and prevents the multiple branches from degenerating into a single optimization direction.

To simplify the notation, consider only a single element $y$ of the output $Y$. Consider a scaled convolution branch

$$y = \gamma \cdot \big(\mathbf{W}^{\top}\mathbf{x}\big),$$

where $\mathbf{x}$ is the vectorized pixels within the sliding window, $\mathbf{W}$ is the convolution kernel corresponding to a particular output channel, and $\gamma$ is the scale factor. Assuming all parameters are updated by stochastic gradient descent, they evolve as

$$\mathbf{W} \leftarrow \mathbf{W} - \eta\,\frac{\partial L}{\partial \mathbf{W}}, \qquad \gamma \leftarrow \gamma - \eta\,\frac{\partial L}{\partial \gamma},$$

where $L$ is the loss function of the entire model and $\eta$ is the learning rate. The output gradient $\partial L / \partial y$ is shared by all branches of a multi-branch topology, since the branch outputs are simply summed:

$$\frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial y}, \qquad y = \sum_i y_i.$$

For a purely sequential (single-branch) structure, the end-to-end weight is optimized in the same way as the weights under the update rule above, i.e. it follows the same end-to-end mapping, so no change in optimization is introduced; this conclusion is also supported by experiments. In contrast, a multi-branch topology with branch-level scaling layers does provide such a variation. The end-to-end weight becomes

$$\mathbf{W}_E = \sum_{i} \gamma_i\,\mathbf{W}_i,$$

and, after substituting the per-branch SGD updates of $\gamma_i$ and $\mathbf{W}_i$, the update of $\mathbf{W}_E$ differs from that of a plain single weight, provided the same preconditions and the following two conditions hold:

Condition 1: At least two of all the branches are active.


Condition 2: The initial state of each active branch is different.


When these conditions are satisfied, the multi-branch structure does not degenerate into a single structure: the branches are propagated forward and backward simultaneously, but they are optimized in different directions. This also reveals why the scale factors are important.

Note that when the weights of each branch are randomly initialized and the scaling factors are initialized to 1, both Condition 1 and Condition 2 are always satisfied.
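
The role of these conditions can be illustrated with a tiny, purely hypothetical 1-D experiment: starting from the same end-to-end weight, one SGD step on a plain weight and one SGD step on a two-branch scaled parameterization already yield different end-to-end weights.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5)
target = torch.randn(())

def sgd_step(params, end_to_end, lr=0.1):
    loss = (end_to_end(params) @ x - target) ** 2
    grads = torch.autograd.grad(loss, params)
    return [p - lr * g for p, g in zip(params, grads)]

w0 = torch.randn(5)

# (a) plain single weight
(w_plain,) = sgd_step([w0.clone().requires_grad_()], lambda p: p[0])

# (b) two scaled branches whose sum starts from the same end-to-end weight
w1 = torch.randn(5, requires_grad=True)
w2 = (w0 - w1).detach().requires_grad_()
s1 = torch.ones((), requires_grad=True)
s2 = torch.ones((), requires_grad=True)
new = sgd_step([w1, w2, s1, s2], lambda p: p[2] * p[0] + p[3] * p[1])
w_branch = new[2] * new[0] + new[3] * new[1]

print(torch.allclose(w_plain, w_branch))   # False: the optimization trajectories differ
```

Without the scaling factors, every branch would receive an identical gradient, so the extra branches would only rescale the step size rather than change its direction, which matches the degeneration described above.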


Proposition 1: For a single-branch linear mapping, the entire end-to-end weight matrix is optimized differently when it is partially or fully reparameterized into a multi-branch topology with more than one branch. If a layer of the mapping is reparameterized into a topology with at most one branch, the optimization remains the same.

The discussion so far has expanded on how reparameterization affects optimization. In fact, all currently valid reparameterization topologies can be verified by Proposition 1.


For a detailed analysis, please refer to the appendix of the paper.

3.5 Block design

Since the proposed OREPA saves a large amount of training cost, it enables the exploration of more complex training blocks. To this end, a new reparameterized model, OREPA-ResNet, is designed by linearizing the state-of-the-art model DBB and inserting the following components (Fig. 5).

[Figure 5]

1. Frequency prior filter

In DBB, pooling layers are used in blocks. Qin et al. argue that pooling layers are a special case of frequency filters. To this end, the authors add a Conv1×1-frequency filter branch.

2. Linear depthwise separable convolution

The depthwise separable convolutions are slightly modified to allow merging during training by removing the intermediate non-linear activation layers.
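
A sketch of why the activation-free version can be merged during training: a 3×3 depthwise convolution followed by a 1×1 pointwise convolution composes, at the kernel level, into a single dense 3×3 convolution. The helper below is illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def merge_dw_pw(dw_w, pw_w):
    """Merge depthwise 3x3 (dw_w: C_in,1,3,3) + pointwise 1x1 (pw_w: C_out,C_in,1,1)
    into one dense 3x3 kernel of shape (C_out, C_in, 3, 3). Stride 1, no bias."""
    # M[o, c, :, :] = pw[o, c] * dw[c, :, :]
    return pw_w[:, :, 0, 0].unsqueeze(-1).unsqueeze(-1) * dw_w[:, 0].unsqueeze(0)

x = torch.randn(2, 6, 12, 12)
dw = torch.randn(6, 1, 3, 3)
pw = torch.randn(10, 6, 1, 1)
y_sep = F.conv2d(F.conv2d(x, dw, padding=1, groups=6), pw)
y_one = F.conv2d(x, merge_dw_pw(dw, pw), padding=1)
print(torch.allclose(y_sep, y_one, atol=1e-4))   # True
```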

3. Reparameterized 1×1 convolution

Previous work mainly focused on the reparameterization of 3×3 convolutional layers, while ignoring the reparameterization of 1×1 convolutional layers. The authors recommend reparameterizing the 1×1 layers because they play an important role in the bottleneck structure. Specifically, an additional Conv1×1-Conv1×1 branch is added to Block.

4. Linear Stem

Large convolution kernels are usually placed in the very first layers (e.g. the 7×7 stem) in order to achieve a larger receptive field. Guo et al. replace the 7×7 convolution with stacked 3×3 convolutional layers for higher accuracy. However, because of the high resolution at the earliest layers, such stacking incurs a larger computational overhead. Note that, with the proposed linear scaling layers, the stacked deep stem can be squeezed into a single 7×7 convolutional layer during training, which greatly reduces the training cost while maintaining high accuracy.

[Figure 6]

For each block in OREPA-ResNet (Figure 6):

  1. Added a frequency prior filter and a linear depthwise separable convolution

  2. Replace all Stems (i.e. initial 7×7 convolutions) with the proposed linear deep stem layer

  3. In the bottleneck blocks, in addition to 1 and 2, the original 1×1 convolution branch is further replaced by the proposed rep 1×1 block.

4 Experiments

4.1 Ablation experiment

1. Linear scaling layers and optimization diversity

[Figure 7]

Experiments are first performed to verify the core idea that the proposed linear scaling layer plays a similar role to the BN layer. According to the analysis in 3.4, both the scaling layer and the BN layer should diversify the optimization directions. To verify this, the authors visualize the branch-level similarity of all branches in Figure 7 and find that using scaling layers significantly increases the diversity of the different branches.

[Table 2]

The validity of this diversity is verified in Table 2. Taking the ResNet-18 structure as an example, the two layers (BN and linear scaling) bring similar performance gains (i.e. 0.42 vs. 0.40). This strongly supports the paper's point that in reparameterization, it is the scaling part, not the statistical normalization part, that matters most.

2. Various linearization strategies

The authors try various linearization strategies for the scaling layer. Specifically, four variants are compared:

  • Vector: Uses a channel-level vector and performs scaling along the channel axis.

  • Scalar: Uses a single scalar to scale the entire feature map.

  • W/o scaling: Removes the branch-level scaling layers.

  • W/o post-addition BN: Removes the post-addition BN layer.

[Table 3]

From Table 3, it is found that deploying a scalar scaling layer, or no scaling layer at all, leads to inferior results. Therefore, vector scaling is chosen as the default strategy.

The authors also study the effectiveness of the post-addition BN layer. As described in 3.2, the post-BN layer is added to stabilize the training process. To demonstrate this, the authors remove these layers; as shown in the last row of Table 3, the gradients become infinite and the model fails to converge.

3. Every Component Matters

Experiments are conducted on both ResNet-18 and ResNet-50 structures. As shown in Table 2, each component contributes to improved performance.

4. Online vs. offline

[Figure 8]

The authors compare the training cost of OREPA-ResNet-18 with that of DBB-18. The consumed memory (Fig. 8(a)) and the training time (Fig. 8(b)) are reported.

As the number of components increases, the offline reparameterized model faces rapidly increasing memory utilization and training time; the deep stem cannot even be introduced into the ResNet-18 model because of its high memory cost. In contrast, the online reparameterization strategy increases training speed by 4× and saves up to about 96% of the additional GPU memory. The overall training overhead is roughly on the same level as that of the base model (the plain ResNet).

4.2 Comparison with other reparameterizations

[Table 4]

From Table 4, it is observed that on the ResNet series, OREPA consistently improves performance (by about +0.36%) across the various models. At the same time, it accelerates training by 1.5× to 2.3× and saves about 70% of the additional training-time memory.


The authors also perform experiments on the VGG structure, comparing OREPA-VGG with RepVGG. For the OREPA-VGG model, they simply use the OREPA-res-3×3 branch used in OREPA-ResNet. This modification introduces only a marginal additional training cost while bringing a significant performance gain (+0.25%∼+0.6%).

4.3 Object Detection and Semantic Segmentation

[Table: results on object detection and semantic segmentation]

4.4 Limitations

When simply transferring the proposed OREPA from ResNet to RepVGG, the authors found inconsistent performance between residual-based and residual-free (VGG-like) structures. Therefore, all three branches are kept in the RepVGG block to maintain competitive accuracy, which slightly increases the computational cost. This is an interesting phenomenon.

5 References

[1] OREPA: Online Convolutional Re-parameterization. CVPR 2022.


Source: blog.csdn.net/qq_29462849/article/details/124014181