Dynamic convolution evolves again! Channel Fusion Replaces Attention, Reduces Parameters by 75% and Significantly Improves Performance | ICLR 2021

This article covers a breakthrough exploration of dynamic convolution by researchers from Microsoft and the University of California San Diego. Targeting the large parameter counts and difficult joint optimization of existing dynamic convolutions (such as CondConv and DY-Conv), it proposes a solution: a dynamic channel fusion mechanism that replaces the previous dynamic attention.


paper: https://arxiv.org/abs/2103.08756

code: https://github.com/liyunsheng13/dcd

Compared with CondConv and DY-Conv, the proposed DCD achieves better performance with fewer parameters. Taking MobileNetV2 as an example, the proposed method requires only 25% of the parameters of CondConv while achieving a 0.6% accuracy improvement (75.2% vs. 74.6%). The article offers a deeper explanation of dynamic convolution and is well worth studying.

Abstract

Recent studies on dynamic convolution have shown that adaptively aggregating K static convolution kernels can effectively improve the performance of existing CNNs. However, dynamic convolution also has two limitations: (1) it increases the number of kernel parameters K-fold; (2) the joint optimization of the dynamic attention and the static convolution kernels is extremely challenging.

We rethink dynamic convolution from the perspective of matrix decomposition and reveal the key issue: dynamic convolution projects the input into a high-dimensional latent space and then applies dynamic attention over channel groups. To address this, we propose replacing the dynamic attention over channel groups with dynamic channel fusion. Dynamic channel fusion not only reduces the dimensionality of the latent space but also mitigates the joint-optimization difficulty. As a result, the proposed method is easier to train and requires fewer parameters without any loss of accuracy.

Dynamic Neural Network

Dynamic convolution differs from conventional convolution in that its kernel parameters change dynamically with the input, whereas conventional convolution applies the same kernel parameters to every input.

The well-known SE attention mechanism is a classic dynamic network: it adaptively adjusts the weighting coefficient of each channel according to the input. SKNet adaptively adjusts attention across kernels of different sizes. Google's CondConv and Microsoft's DY-Conv both adopt the ensemble idea that "many hands make light work", adaptively aggregating multiple static convolution kernels.

(Figure: examples of dynamic neural networks)

Dynamic neural networks include, but are not limited to, the kernel-adaptive convolutions discussed here; there are also networks with adaptive depth, networks with adaptive input resolution, and so on (see the figure above). For a more systematic introduction, refer to the survey "Dynamic Neural Networks: A Survey"; a more comprehensive taxonomy from that survey is shown below.

(Figure: taxonomy of dynamic neural networks, from "Dynamic Neural Networks: A Survey")

Dynamic Convolution Decomposition

The most essential idea of dynamic convolution is to aggregate multiple convolution kernels into new, input-dependent weights:

W(x) = Σ_{k=1}^{K} π_k(x) W_k,  subject to  0 ≤ π_k(x) ≤ 1,  Σ_{k=1}^{K} π_k(x) = 1,

where π_k(x) is the attention score assigned to the k-th static kernel W_k.
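As a minimal sketch of this aggregation (not the authors' code; the attention branch with two fully connected layers follows the common DY-Conv/CondConv recipe, and all names are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_kernel(x, kernels, W1, W2):
    """Aggregate K static 1x1-conv kernels with input-dependent attention.

    x       : (C, H, W) input feature map
    kernels : (K, C, C) stack of K static kernels W_k
    W1, W2  : weights of the two FC layers of the attention branch
    """
    pooled = x.mean(axis=(1, 2))                    # global average pooling -> (C,)
    pi = softmax(W2 @ np.maximum(W1 @ pooled, 0.0)) # (K,) attention, sums to 1
    return np.einsum('k,kij->ij', pi, kernels)      # W(x) = sum_k pi_k(x) W_k

rng = np.random.default_rng(0)
C, K = 8, 4
x = rng.standard_normal((C, 6, 6))
kernels = rng.standard_normal((K, C, C))
W1 = rng.standard_normal((C // 2, C))
W2 = rng.standard_normal((K, C // 2))
W = dynamic_kernel(x, kernels, W1, W2)
assert W.shape == (C, C)
```

The aggregated kernel W(x) is then used as an ordinary convolution weight for this particular input.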

At present, well-known dynamic convolutions include CondConv proposed by Google, DY-Conv proposed by MSRA, DyNet proposed by Huawei, and so on. However, dynamic convolution has two main limitations:

  • Lack of compactness: the number of kernel parameters grows K-fold;

  • A difficult joint-optimization problem between the attention scores and the static kernels.

In view of these two problems, from the perspective of matrix decomposition we rewrite dynamic convolution as:

W(x) = Σ_k π_k(x) W_k = W_0 + Σ_k π_k(x) ΔW_k,

where W_0 = (1/K) Σ_k W_k is the mean kernel and ΔW_k = W_k − W_0 is the residual kernel matrix. The latter is further decomposed using SVD, ΔW_k = U_k S_k V_k^T, at which point:

W(x) = W_0 + U Π(x) S V^T,

where U = [U_1, …, U_K], V = [V_1, …, V_K], S = diag(S_1, …, S_K), and Π(x) is a diagonal matrix stacking the attention scores π_k(x).
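The first step of this rewrite is a simple algebraic identity: because the attention scores sum to one, aggregating K kernels equals the mean kernel plus an attention-weighted sum of residuals. A quick NumPy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
K, C = 4, 8
W_k = rng.standard_normal((K, C, C))       # K static kernels
pi = rng.random(K); pi /= pi.sum()         # attention with sum_k pi_k = 1

W0 = W_k.mean(axis=0)                      # mean kernel W_0
dW = W_k - W0                              # residual kernels dW_k = W_k - W_0

lhs = np.einsum('k,kij->ij', pi, W_k)      # sum_k pi_k W_k
rhs = W0 + np.einsum('k,kij->ij', pi, dW)  # W_0 + sum_k pi_k dW_k
assert np.allclose(lhs, rhs)               # identical because sum_k pi_k = 1
```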

A schematic of this decomposition shows that the dynamic behavior is carried entirely by the dynamic residual: it projects the input x into a higher-dimensional latent space (of KC channels) and then applies dynamic attention there. This also means that the limitation of conventional dynamic convolution stems from the attention over channel groups: it introduces a high-dimensional latent space, in which small attention values may suppress the learning of the corresponding kernels.

Dynamic Channel Fusion

To solve the above problems, we propose Dynamic Convolution Decomposition (DCD), which uses dynamic channel fusion to replace dynamic attention. DCD performs channel fusion with a full dynamic matrix Φ(x), so the dynamic residual takes the form P Φ(x) Q^T. The key innovation of dynamic channel fusion is that it significantly reduces the dimensionality of the latent space. Dynamic convolution based on dynamic channel fusion can be expressed as:

W(x) = W_0 + P Φ(x) Q^T

Here Q ∈ R^{C×L} compresses the input into a low-dimensional space, the resulting L channels are dynamically fused by Φ(x) ∈ R^{L×L}, and P ∈ R^{C×L} expands the result back to the output channels. This process is the dynamic convolution decomposition. The dimension L of the latent space is constrained so that L² ≤ C (by default L = 2^⌊log₂√C⌋).
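The compress-fuse-expand structure can be sketched in a few lines of NumPy (shapes and the default choice of L follow the description above; Φ would in practice be produced by a dynamic branch, here it is just a placeholder matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
C = 64
L = 2 ** int(np.log2(np.sqrt(C)))   # default latent dimension: L = 8, L^2 <= C

W0  = rng.standard_normal((C, C))   # static (mean) kernel
P   = rng.standard_normal((C, L))   # expands L latent channels back to C
Q   = rng.standard_normal((C, L))   # Q^T compresses C channels down to L
Phi = rng.standard_normal((L, L))   # dynamic channel-fusion matrix (input-dependent)

W = W0 + P @ Phi @ Q.T              # W(x) = W_0 + P Phi(x) Q^T
assert W.shape == (C, C)
# the dynamic residual has rank at most L, far below the channel count C
assert np.linalg.matrix_rank(P @ Phi @ Q.T) <= L
```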

Through this method, the number of static kernel parameters is significantly reduced (from KC² to C² + 2CL), yielding a more compact model.

At the same time, dynamic channel fusion also alleviates the joint-optimization problem of conventional dynamic convolution: since each column of P and Q is associated with multiple dynamic coefficients of Φ(x), its learning is unlikely to be suppressed by a few small dynamic coefficients.

All in all, DCD adopts a form of dynamic aggregation different from conventional dynamic convolution, which can be summarized as follows:

  • Conventional dynamic convolution uses a shared attention mechanism to aggregate non-shared static basis vectors in a high-dimensional latent space;

  • DCD uses a non-shared dynamic channel fusion mechanism to aggregate shared static basis vectors in a low-dimensional latent space.

General Formulation

The preceding content focused on the dynamic residual and proposed the dynamic channel fusion mechanism. Next we turn to the static kernel. Relaxing its constraint yields a more general form:

W(x) = Λ(x) W_0 + P Φ(x) Q^T

where Λ(x) is a C×C diagonal matrix, i.e., channel-wise attention applied to the static kernel. This generalized form further improves model performance.
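A sketch of the general form (illustrative NumPy, with Λ(x) represented as a vector of per-channel coefficients; applying the diagonal matrix is just row-wise rescaling of W_0):

```python
import numpy as np

rng = np.random.default_rng(3)
C, L = 16, 4
W0   = rng.standard_normal((C, C))
P    = rng.standard_normal((C, L))
Q    = rng.standard_normal((C, L))
Phi  = rng.standard_normal((L, L))
lam  = rng.random(C)                     # diagonal of Lambda(x): per-channel attention

W = np.diag(lam) @ W0 + P @ Phi @ Q.T    # W(x) = Lambda(x) W_0 + P Phi(x) Q^T
assert W.shape == (C, C)
# multiplying by the diagonal matrix equals scaling each row of W_0
assert np.allclose(np.diag(lam) @ W0, lam[:, None] * W0)
```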

Note: the dynamic channel attention Λ(x) here is similar to, but different from, SE. The differences are as follows:

  • Λ(x) acts in parallel with the convolution and shares its input x; rescaling the kernel W_0 costs on the order of O(C²), independent of the feature-map resolution;

  • SE is located after the convolution and takes the convolution output as its input; rescaling the output feature map costs on the order of O(HWC);

  • Obviously, the larger the feature-map resolution, the more computation SE requires.

Implementation

(Figure: DCD implementation with a lightweight dynamic branch)

The figure above shows the implementation of DCD: a lightweight dynamic branch generates the dynamic coefficients, namely the channel attention Λ(x) and the channel-fusion matrix Φ(x). This branch is implemented like SE: global average pooling followed by two fully connected layers. The final convolution kernel is then assembled according to the formula above. As with static convolution, DCD is followed by BatchNorm and a nonlinear activation layer.
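The dynamic branch can be sketched as follows (a rough approximation, not the authors' code: the bottleneck ratio, the ReLU/sigmoid choices, and all names are assumptions for illustration):

```python
import numpy as np

def dynamic_branch(x, W1, W2_phi, W2_lam, L):
    """SE-style branch: global average pool + two FC layers, emitting the
    channel-fusion matrix Phi(x) and the channel attention Lambda(x)."""
    pooled = x.mean(axis=(1, 2))                       # (C,) pooled descriptor
    hidden = np.maximum(W1 @ pooled, 0.0)              # ReLU bottleneck
    Phi = (W2_phi @ hidden).reshape(L, L)              # L*L dynamic fusion coefficients
    lam = 1.0 / (1.0 + np.exp(-(W2_lam @ hidden)))     # sigmoid channel attention in (0, 1)
    return Phi, lam

rng = np.random.default_rng(4)
C, L, r = 32, 4, 4                                     # r: bottleneck reduction ratio
x = rng.standard_normal((C, 7, 7))
W1 = rng.standard_normal((C // r, C))
W2_phi = rng.standard_normal((L * L, C // r))
W2_lam = rng.standard_normal((C, C // r))
Phi, lam = dynamic_branch(x, W1, W2_phi, W2_lam, L)
assert Phi.shape == (L, L) and lam.shape == (C,)
assert np.all((lam > 0) & (lam < 1))
```

The outputs Φ(x) and Λ(x) then plug directly into W(x) = Λ(x) W_0 + P Φ(x) Q^T.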

In terms of computational complexity, DCD is similar to conventional dynamic convolution, so we mainly analyze the parameter count. Static convolution and conventional dynamic convolution require C² and KC² kernel parameters respectively, while DCD requires C² + 2CL (plus a lightweight dynamic branch). Because L² ≤ C, we have 2CL ≤ 2C√C, so the total stays close to C² — much smaller than the KC² of DynamicConv and CondConv.
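A back-of-the-envelope comparison of kernel-parameter counts (dynamic-branch parameters omitted; K values are the common defaults of DY-Conv, and the count formulas follow the analysis above):

```python
import numpy as np

def params_1x1(C, K=4, L=None):
    """Rough kernel-parameter counts for one 1x1 conv layer with C channels."""
    if L is None:
        L = 2 ** int(np.log2(np.sqrt(C)))  # default latent dimension
    static  = C * C                         # one static kernel
    dynamic = K * C * C                     # K kernels (DY-Conv / CondConv style)
    dcd     = C * C + 2 * C * L             # W_0 plus the P and Q projections
    return static, dynamic, dcd

static, dynamic, dcd = params_1x1(C=512, K=4)
assert dynamic == 4 * static               # conventional dynamic conv: K-fold growth
assert dcd < 1.1 * static                  # DCD stays close to the static baseline
```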

Extension

The preceding introduction used 1×1 convolution as the running example. Next, we extend DCD in three ways: (a) sparse dynamic residual; (b) depth-wise convolution; (c) k×k convolution.

Sparse Dynamic Residual

The dynamic residual can be further simplified into a block-diagonal form, expressed as follows:

W(x) = Λ(x) W_0 + diag(P_1 Φ_1(x) Q_1^T, …, P_B Φ_B(x) Q_B^T),

where each of the B blocks has its own P_b, Q_b ∈ R^{(C/B)×L} and fusion matrix Φ_b(x); B = 1 recovers the original formulation. Note that the static kernel remains a full matrix — only the dynamic residual part is sparse. The figure below shows a schematic of this form. Follow-up experiments show that B = 8 incurs only minimal performance degradation while still outperforming the static kernel.
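Constructing the block-diagonal residual is straightforward; a small NumPy sketch (illustrative shapes only):

```python
import numpy as np

rng = np.random.default_rng(5)
C, B = 32, 4
Cb, L = C // B, 2                      # per-block channels and latent dimension

residual = np.zeros((C, C))
for b in range(B):                     # one low-rank residual per diagonal block
    P = rng.standard_normal((Cb, L))
    Phi = rng.standard_normal((L, L))  # per-block dynamic fusion matrix
    Q = rng.standard_normal((Cb, L))
    s = slice(b * Cb, (b + 1) * Cb)
    residual[s, s] = P @ Phi @ Q.T

# off-diagonal blocks of the dynamic residual stay exactly zero
assert np.allclose(residual[:Cb, Cb:], 0.0)
W = rng.standard_normal((C, C)) + residual  # static kernel stays a full matrix
assert W.shape == (C, C)
```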

(Figure: sparse dynamic residual with a block-diagonal structure)

Depth-wise convolution

The weights of a depth-wise convolution form a C×k² matrix, and DCD is obtained by replacing Q in the earlier formula with a matrix R:

W(x) = Λ(x) W_0 + P Φ(x) R^T,

where W_0 ∈ R^{C×k²}, P ∈ R^{C×L}, R ∈ R^{k²×L}, and Λ(x) remains diagonal. R^T reduces the number of kernel elements from k² to L; Φ(x) performs the dynamic fusion; and P expands the fused result across the channels. We default to L = ⌊k²/2⌋. Since depth-wise convolution is channel-separated, DCD here performs dynamic fusion of kernel elements rather than channels.
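A shape-level sketch of the depth-wise variant (Λ(x) again represented by a per-channel vector; Φ is a placeholder for the dynamically generated fusion matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
C, k = 32, 3
L = k * k // 2                          # default latent size: floor(k^2 / 2) = 4

W0  = rng.standard_normal((C, k * k))   # depth-wise kernel viewed as a C x k^2 matrix
lam = rng.random(C)                     # diagonal channel attention Lambda(x)
P   = rng.standard_normal((C, L))       # expands the fused result across channels
Phi = rng.standard_normal((L, L))       # dynamic fusion over kernel elements
R   = rng.standard_normal((k * k, L))   # R^T reduces k^2 kernel elements to L

W = lam[:, None] * W0 + P @ Phi @ R.T   # W(x) = Lambda(x) W_0 + P Phi(x) R^T
assert W.shape == (C, k * k)
```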

k×k convolution

Since the kernel of a k×k convolution is a C×C×k² tensor, DCD is extended by introducing, in addition to P and Q over the channels, a matrix R ∈ R^{k²×L_k} over the kernel elements, so that the dynamic residual becomes a joint low-rank decomposition (a tensor analogue of P Φ(x) Q^T) with latent dimensions L for channels and L_k for kernel elements.

The parameters have meanings similar to the depth-wise case, with default values following the paper:

(Figure: DCD extended to k×k convolution)

We found that L matters more than L_k. Therefore, we reduce L_k to 1 and increase L accordingly. In this case, R simplifies to a one-hot vector over the k² kernel elements. The figure above shows a schematic of this form: as seen in panel (b), the dynamic residual has only one non-zero slice, which is equivalent to a 1×1 convolution. Therefore, DCD for a k×k convolution amounts to adding a 1×1 dynamic residual to the static k×k kernel.
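The one-hot simplification can be checked numerically: placing the channel-fusion residual on a single kernel-element slice leaves every other slice untouched (illustrative sketch; the choice of the center element is an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
C, k, L = 16, 3, 8
W0 = rng.standard_normal((C, C, k * k))     # static k x k kernel as a C x C x k^2 tensor

P   = rng.standard_normal((C, L))
Phi = rng.standard_normal((L, L))
Q   = rng.standard_normal((C, L))
r   = np.zeros(k * k); r[k * k // 2] = 1.0  # one-hot R over kernel elements (center)

# residual tensor: (P Phi Q^T) placed on the single slice selected by r
residual = np.einsum('ij,m->ijm', P @ Phi @ Q.T, r)
W = W0 + residual
# only the selected element carries a dynamic residual -> 1x1-conv behaviour
assert np.allclose(residual[:, :, 0], 0.0)
assert not np.allclose(residual[:, :, k * k // 2], 0.0)
assert W.shape == (C, C, k * k)
```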

Experiments

To verify the effectiveness of the proposed scheme, we conducted a series of comparative and ablation experiments on the ImageNet dataset. Baseline models include ResNet and MobileNetV2: DCD replaces all convolutions in ResNet, and the 1×1 (pointwise) convolutions in MobileNetV2.

The component ablation of DCD (conducted on two lightweight baseline models) shows that:

  • Compared with static convolution, either dynamic component alone significantly improves model accuracy;

  • Compared with dynamic channel attention, dynamic channel fusion yields slightly higher accuracy, parameter count, and FLOPs; combining the two further improves performance.

(Table: ImageNet comparison with other dynamic convolution methods)

The table above compares the proposed method with other dynamic convolutions. It can be seen that DCD significantly reduces the number of model parameters while improving accuracy. For example, on MobileNetV2 ×1.0, DCD needs only 50% of DynamicConv's parameters and 25% of CondConv's to achieve comparable accuracy; on ResNet-18, it requires only 33% of DynamicConv's parameters while outperforming it by 0.4%.


Origin blog.csdn.net/jacke121/article/details/115264073