On the differences and connections between CondConv, DynamicConv, and DyNet

This article was first published on the WeChat official account of the Jishi platform (ID: extrememart)

CondConv: https://arxiv.org/abs/1904.04971

DynamicConv: https://arxiv.org/abs/1912.03458

DyNet:  https://arxiv.org/abs/2004.10694

When I first saw Huawei's DyNet paper last month, I felt it was yet another version of CondConv. Why "yet another"? Because I had the same feeling earlier when I saw MSRA's DynamicConv. I had not had time to sit down and think through the differences among the three; since I happened to be free recently, I spent some time analyzing them.

1. CondConv

In conventional convolution, the kernel parameters are fixed after training and all input samples are treated "equally"; in CondConv, the kernel parameters are computed as a function of the input. The process can be described as:

$$\mathrm{Output}(x) = \sigma\big((\alpha_1 \cdot W_1 + \cdots + \alpha_n \cdot W_n) * x\big)$$

where each $\alpha_i = r_i(x)$ is a sample-dependent weighting coefficient produced by a routing function. In CondConv, each kernel $W_i$ has the same dimensions as a standard convolution kernel.

Increasing the capacity of conventional convolution relies on enlarging the kernel size and the number of channels, which in turn increases the overall computation of the network. CondConv, by contrast, only needs to compute a weighted combination of multiple expert kernels per input sample before performing the convolution. The key point is that each fused kernel is computed only once and then applied at every spatial position. This means network capacity is increased by adding more experts, at only a small inference-time cost: each additional parameter requires only one extra multiply-add.

[Figure: (a) CondConv, which fuses the expert kernels before the convolution; (b) the equivalent mixture-of-experts view, which convolves with each expert and then combines the outputs]

CondConv is equivalent to a linear combination of multiple static convolutions, as shown in Figure (b) above. It therefore has the same capacity as an ensemble of n experts but is more efficient to compute. The crucial ingredient is the weighting coefficients: they must be data-dependent, otherwise CondConv degenerates into a static convolution. So how are the data-dependent coefficients designed? The author computes them in three steps:

$$r(x) = \mathrm{Sigmoid}\big(\mathrm{GlobalAveragePool}(x)\, R\big)$$

i.e., global average pooling, a fully connected layer $R$, and a sigmoid activation.

The proposed CondConv can replace the standard convolution in existing networks, and it also applies to depthwise convolutions and fully connected layers.

Clearly, CondConv is a form of dynamic filter convolution. The way the kernel is generated here differs somewhat from other dynamic filter designs, but the essence is the same: data dependence.
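To make the three-step routing and the kernel fusion concrete, below is a minimal PyTorch-style sketch of the idea. This is my own illustrative code, not the official implementation; all names and shapes are assumptions. Note the grouped-convolution trick that folds the batch into the channel dimension so each sample can be convolved with its own fused kernel in a single call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Minimal sketch of conditionally parameterized convolution (CondConv)."""

    def __init__(self, in_ch, out_ch, kernel_size, num_experts=4, padding=0):
        super().__init__()
        self.out_ch, self.padding = out_ch, padding
        # One kernel per expert; each has the shape of a standard conv kernel.
        self.weight = nn.Parameter(
            0.01 * torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size))
        self.route = nn.Linear(in_ch, num_experts)  # routing FC layer

    def forward(self, x):
        b, c, h, w = x.shape
        # Routing in three steps: global average pool -> FC -> sigmoid.
        alpha = torch.sigmoid(self.route(x.mean(dim=(2, 3))))        # (B, K)
        # Fuse the K expert kernels into one kernel per sample.
        kernel = torch.einsum('bk,koixy->boixy', alpha, self.weight)  # (B, O, I, kh, kw)
        # Fold the batch into the channel axis and use a grouped conv so
        # every sample is convolved once with its own fused kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernel.reshape(-1, c, *kernel.shape[-2:]),
                       padding=self.padding, groups=b)
        return out.reshape(b, self.out_ch, out.shape[-2], out.shape[-1])
```

For instance, `CondConv2d(64, 64, 3, num_experts=8, padding=1)` would act as a drop-in replacement for a 3x3 convolution with eight experts.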

2. DynamicConv

The goal of dynamic convolution is to strike a balance between network performance and computational cost. Conventional ways of improving network performance (going wider and deeper) usually increase the computational burden, and are therefore unfriendly to efficient networks.

The dynamic convolution proposed by the authors increases neither the depth nor the width of the network; instead, it improves the model's expressive power by fusing multiple convolution kernels. Note that the fused kernel depends on the input: different inputs produce different kernels, which is where the name "dynamic convolution" comes from.

2.1 Dynamic Perceptron

First, the authors define the traditional perceptron as $y = g(W^{T}x + b)$, where $W$, $b$, and $g$ denote the weight, bias, and activation function respectively; then they define the dynamic perceptron as follows:

$$y = g\big(\tilde{W}(x)^{T}x + \tilde{b}(x)\big), \qquad \tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{W}_k, \quad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{b}_k$$

subject to $0 \le \pi_k(x) \le 1$ and $\sum_{k=1}^{K}\pi_k(x) = 1$.

Here $\pi_k(x)$ denotes the attention weight of the k-th linear function. The attention weights are not fixed but vary with the input. Therefore, compared with the static perceptron, the dynamic perceptron has stronger representational power.

Compared with the static perceptron, the dynamic perceptron has a larger model size. It involves two additional computations: (a) attention weight computation and (b) dynamic weight fusion. Nevertheless, these two extra computations are negligible compared with the computation of the perceptron itself:

$$O\big(\pi(x)\big) + O\Big(\sum_k \pi_k(x)\tilde{W}_k\Big) + O\Big(\sum_k \pi_k(x)\tilde{b}_k\Big) \ll O\big(\tilde{W}(x)^{T}x + \tilde{b}(x)\big)$$
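As a concrete illustration of the dynamic perceptron, here is a short sketch under my own assumptions (the attention branch and initialization are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class DynamicPerceptron(nn.Module):
    """Sketch: y = g(W~(x)^T x + b~(x)) with W~(x) = sum_k pi_k(x) W_k."""

    def __init__(self, dim_in, dim_out, K=4):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(K, dim_out, dim_in))
        self.b = nn.Parameter(torch.zeros(K, dim_out))
        self.attn = nn.Linear(dim_in, K)  # produces attention logits from x

    def forward(self, x):                                 # x: (B, dim_in)
        # Softmax enforces 0 <= pi_k <= 1 and sum_k pi_k = 1.
        pi = torch.softmax(self.attn(x), dim=-1)          # (B, K)
        W = torch.einsum('bk,koi->boi', pi, self.W)       # per-sample weight
        b = pi @ self.b                                   # per-sample bias
        y = torch.einsum('boi,bi->bo', W, x) + b
        return torch.relu(y)                              # g = ReLU here
```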

2.2 Dynamic Convolution

Similar to the dynamic perceptron, dynamic convolution also aggregates K kernels. Following the classic design in CNNs, the authors attach BatchNorm and ReLU after the dynamic convolution.

  • Attention: the authors use a lightweight squeeze-and-excitation style branch to extract the attention weights, as shown in the figure above. The difference from SENet is that SENet applies the attention mechanism over channels, while dynamic convolution applies it over convolution kernels.
  • Kernel aggregation: since the kernels are small, the aggregation step is computationally cheap. The paper's table comparing the computation of dynamic and static convolution shows that the extra cost is very limited.

  • Dynamic CNN: dynamic convolution can easily replace the convolutions in existing network architectures, such as 1x1 convolution, 3x3 convolution, group convolution, and depthwise convolution. It is also complementary to other techniques (such as SE, ReLU6, Mish, etc.). A minimal code sketch of such a layer is given below.
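For a rough sense of the aggregation cost: fusing K=4 kernels of a 3x3, 64-to-64 convolution takes about 4·9·64·64 ≈ 0.15M multiply-adds, while the convolution itself on a 56x56 feature map takes about 9·64·64·56·56 ≈ 116M, so the overhead is on the order of 0.1%. Below is a minimal PyTorch-style sketch of such a layer; this is illustrative code under my own assumptions (squeeze ratio, initialization), including a τ parameter for the smoothed attention discussed under Training Strategy below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of dynamic convolution: softmax attention over K kernels."""

    def __init__(self, in_ch, out_ch, kernel_size, K=4, padding=0, tau=1.0):
        super().__init__()
        self.out_ch, self.padding, self.tau = out_ch, padding, tau
        self.weight = nn.Parameter(
            0.01 * torch.randn(K, out_ch, in_ch, kernel_size, kernel_size))
        # Lightweight SE-style attention branch: GAP -> FC -> ReLU -> FC.
        hidden = max(in_ch // 4, 1)
        self.fc1 = nn.Linear(in_ch, hidden)
        self.fc2 = nn.Linear(hidden, K)

    def forward(self, x):
        b, c, h, w = x.shape
        z = self.fc2(F.relu(self.fc1(x.mean(dim=(2, 3)))))    # (B, K) logits
        # Softmax over kernels; tau > 1 flattens the attention (the
        # "smoothed attention" used to ease training).
        pi = torch.softmax(z / self.tau, dim=-1)
        kernel = torch.einsum('bk,koixy->boixy', pi, self.weight)
        # Same batch-into-channels grouped-conv trick as in CondConv.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernel.reshape(-1, c, *kernel.shape[-2:]),
                       padding=self.padding, groups=b)
        return out.reshape(b, self.out_ch, out.shape[-2], out.shape[-1])
```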

Training Strategy

Training a deep dynamic convolutional network is quite challenging, because the convolution kernels and the attention branch must be optimized jointly. The blue line in the right figure below shows the training and validation errors of DY-MobileNetV2: convergence is slow, and the final accuracy is only 64.8%, worse than the 65.4% of its static-convolution counterpart.

The authors attribute this to the sparsity of the attention: only a few kernels are effectively trained, which makes training inefficient, and the inefficiency worsens as the network deepens. To verify this, they evaluate a DY-MobileNetV2 variant that replaces only the last 1x1 convolution of each block with dynamic convolution (left figure above). It converges faster and reaches higher accuracy (65.9%).

To solve this problem, the authors propose smoothing the attention so as to encourage more convolution kernels to be optimized simultaneously. The smoothing is described as follows:

$$\pi_k = \frac{\exp(z_k/\tau)}{\sum_{j}\exp(z_j/\tau)}$$

where $z_k$ is the attention logit for the k-th kernel and $\tau$ is the temperature; a larger $\tau$ makes the attention distribution flatter, so more kernels receive gradients.

As the figure above shows, the improved training mechanism converges faster and reaches higher accuracy.
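A sketch of the kind of temperature schedule this implies (the concrete numbers here are illustrative assumptions, not necessarily the paper's):

```python
def attention_temperature(epoch, warmup_epochs=10, tau_init=30.0):
    """Linearly anneal the softmax temperature from tau_init down to 1.

    A large initial tau makes the attention near-uniform, so every kernel
    receives gradient early in training; tau = 1 recovers plain softmax.
    """
    if epoch >= warmup_epochs:
        return 1.0
    return tau_init - (tau_init - 1.0) * epoch / warmup_epochs

# e.g. plug into the DynamicConv2d sketch above:
#   layer.tau = attention_temperature(epoch)
```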

3. DyNet

The figure above shows the overall framework of the dynamic convolution proposed by the authors. It contains a coefficient prediction module and a dynamic generation module. The coefficient prediction module is trainable and predicts the fusion coefficients of the fixed convolution kernels; the dynamic generation module then uses the predicted coefficients to generate the dynamic convolution kernel. The method is easy to implement, "plug and play", and can be readily embedded into existing networks.

3.1 Motivation

As noted in the network pruning literature, the convolution kernels in a CNN are correlated. The authors plot the distribution of the Pearson product-moment correlation coefficient of features for different networks, shown in the figure below. Pruning techniques try to reduce this correlation through compression, but in lightweight networks the correlation is hard to compress away. The authors argue that this correlation is important for maintaining performance, because correlated kernels must cooperate to produce noise-irrelevant features.

Through their analysis, the authors found that noise-irrelevant features can be obtained by dynamically fusing multiple fixed kernels, without requiring the cooperative combination of multiple correlated kernels. Based on this finding, they propose their dynamic convolution.

3.2 Dynamic Convolution

The goal of dynamic convolution here is to learn a set of kernel coefficients and use them to fuse multiple fixed kernels into one dynamic kernel. The authors use a trainable coefficient prediction module to predict the coefficients, and a dynamic generation module to fuse the fixed kernels.

  • Coefficient prediction module: it predicts the fusion coefficients from the image content. As shown in the figure above, it consists of a global average pooling layer followed by fully connected layers, whose output dimension equals the number of fixed convolution kernels.
  • Dynamic generation module: the weight of a dynamic convolution corresponds to a group of fixed kernels per dynamic kernel, each with the shape of a standard kernel; the group size (the number of fixed kernels per dynamic kernel) is a hyperparameter. Denoting the fixed kernels by $w_k$ and the predicted coefficients by $\eta_k(x)$, the fusion of the dynamic kernel can be described as:

$$\tilde{w}(x) = \sum_{k=1}^{K} \eta_k(x)\, w_k$$

  • Training algorithm: the conventional batch training algorithm does not directly apply to the proposed dynamic convolution, because different inputs within the same batch have different kernels. The authors therefore propose, during training, to fuse feature maps with the coefficients instead of fusing kernels. The two are equivalent, as the following shows:

$$\tilde{w}(x) * x = \Big(\sum_{k=1}^{K} \eta_k(x)\, w_k\Big) * x = \sum_{k=1}^{K} \eta_k(x)\,\big(w_k * x\big)$$

which holds by the linearity of convolution.
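The following sketch (hypothetical code of mine, not DyNet's release, which is not public) shows the two equivalent paths: kernel fusion for inference versus feature fusion for training, with a numerical check of their equality:

```python
import torch
import torch.nn.functional as F

def dyconv_infer(x, kernels, eta, padding=1):
    """Inference: fuse K fixed kernels into one dynamic kernel per sample."""
    b, c, h, w = x.shape
    kernel = torch.einsum('bk,koixy->boixy', eta, kernels)
    out = F.conv2d(x.reshape(1, b * c, h, w),
                   kernel.reshape(-1, c, *kernel.shape[-2:]),
                   padding=padding, groups=b)
    return out.reshape(b, kernels.shape[1], out.shape[-2], out.shape[-1])

def dyconv_train(x, kernels, eta, padding=1):
    """Training: convolve with every fixed kernel, then fuse the features.

    Equivalent by linearity, and it lets the whole batch share the K
    static convolutions instead of needing per-sample kernels.
    """
    K = kernels.shape[0]
    feats = torch.stack([F.conv2d(x, kernels[k], padding=padding)
                         for k in range(K)])            # (K, B, O, H, W)
    return torch.einsum('bk,kbohw->bohw', eta, feats)

# Quick numerical check of the equivalence:
x = torch.randn(2, 8, 16, 16)
kernels = torch.randn(4, 8, 8, 3, 3)          # (K, out, in, kh, kw)
eta = torch.sigmoid(torch.randn(2, 4))        # per-sample coefficients
assert torch.allclose(dyconv_infer(x, kernels, eta),
                      dyconv_train(x, kernels, eta), atol=1e-3)
```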

4. Analysis

I have now introduced the principles of CondConv, DynamicConv, and DyNet. Do you also feel that the three are the same thing?

Yes, in essence the three are the same: they all adaptively fuse multiple convolution kernels according to the input, and the fusion mechanism is the same. There really is no essential difference among them.

Of course there are some differences, which show up in the following aspects:

  • The starting point differs. CondConv's starting point can be understood as the proverb "three cobblers combined make a Zhuge Liang", i.e., a mixture of experts; DynamicConv starts from the attention mechanism; DyNet starts from "noise-irrelevant features". Perhaps this is what is called "different routes to the same destination".
  • The network architecture differs. CondConv and DynamicConv are "plug and play" at the convolution level, while DyNet is embedded at the block level.
  • The fusion coefficient modules are not completely consistent. CondConv's fusion coefficient module uses sigmoid for normalization, MSRA's DynamicConv uses softmax, and DyNet again uses sigmoid.
  • The training methods differ. CondConv uses conventional training and folds the batch into the channel dimension to implement the per-sample convolution as a group convolution; DynamicConv recognizes the training difficulty (a problem the other two papers do not mention) and proposes a solution; DyNet notes that the standard batch training procedure does not directly apply and proposes the training-phase feature fusion scheme.
  • Open source: so far only CondConv has been open-sourced; the other two have not.
  • Model comparison: the later schemes do not compare with the earlier ones. CondConv was proposed first, yet DynamicConv and DyNet do not compare against it. DyNet at least mentions the other two, while DynamicConv does not even mention CondConv, which it really should have.

5. Conclusion

This article introduced the dynamic convolutions proposed by Google, MSRA, and Huawei, all of which adaptively fuse multiple convolution kernels, and compared the differences and connections among the three from several angles.

Chronologically, Google's CondConv was proposed first, MSRA's DynamicConv second, and Huawei's DyNet last. However, the DyNet paper mentions: "This technique has been deployed in HUAWEI at the beginning of 2019, and the patent is filed in May 2019 as well". The specifics are unknown, but judging by publication dates, Google's CondConv is undoubtedly the earliest. Why the latter two did not compare with it, I do not know.

From the perspective of network architecture, Google's CondConv and MSRA's DynamicConv are simple "plug and play" replacements; Huawei's DyNet goes one step further by adjusting at the block level, which is more conducive to efficient network architecture design.

From the perspective of open-source reproduction, Google's CondConv has been open-sourced with pre-trained models, while the other two have released no code. In this respect Google's CondConv is undoubtedly the more generous, haha; personally I think "not open-sourcing is just hooliganism", haha...

Finally, I still look forward to MSRA and Huawei open-sourcing their code and pre-trained models as soon as possible. Looking forward to it...

Appendix

20200520 Supplement: Because I had been reading the first version of DynamicConvolution, I did not see its comparison with CondConv at the time. I checked arxiv early this morning and found that the article was updated on Tue, 31 Mar 2020 21:56:49. In the updated version the author compares the two, so I add the following.

The author notes the difference between the two: CondConv's weight fusion coefficient prediction module ends with a sigmoid normalization, while DynamicConvolution uses softmax. Although sigmoid can produce a nearly uniform attention distribution, it results in a relatively large kernel space, as shown in the figure below. DynamicConvolution's softmax yields a smaller kernel space, but it also makes training harder, which leads to the author's second improvement: the smoothed attention, i.e., the softmax temperature.

Edited on 2020-05-20
