The easiest-to-understand interpretation of the GhostNet network

Preliminary knowledge

Before introducing the GhostNet network, let us first review grouped convolution and depthwise separable convolution.

Grouped Convolution

[Figure: grouped convolution]

Suppose input.shape = $[C_1, H, W]$ and output.shape = $[C_2, H^1, W^1]$, with the channels divided into $g$ groups.
The input feature map size of each group: $W \times H \times \frac{C_1}{g}$, for a total of $g$ groups.
The size of a single convolution kernel in each group: $k \times k \times \frac{C_1}{g}$; the $C_2$ kernels are divided into $g$ groups, so each group holds $\frac{C_2}{g}$ of them.
The output feature map size of each group: $W^1 \times H^1 \times \frac{C_2}{g}$; concatenating the $g$ groups yields the final $C_2$-channel output.
Now let us compute the parameter count and the operation count of grouped convolution:
Parameters: $params = k^2 \times \frac{C_1}{g} \times \frac{C_2}{g} \times g = k^2 \frac{C_1}{g} C_2$
Operations: $FLOPs = k^2 \times \frac{C_1}{g} \times C_2 \times W^1 \times H^1 = k^2 \frac{C_1}{g} C_2 W^1 H^1$
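As a quick sanity check of the two formulas above, the parameter and operation counts can be computed in plain Python (the concrete sizes below are made-up examples):

```python
def grouped_conv_params(k, c1, c2, g):
    # k*k*(C1/g) weights per kernel, C2/g kernels per group, g groups
    return k * k * (c1 // g) * (c2 // g) * g

def grouped_conv_flops(k, c1, c2, g, h_out, w_out):
    # one multiply-accumulate per weight per output position
    return k * k * (c1 // g) * c2 * h_out * w_out

# g = 1 is ordinary convolution; g = 4 cuts the cost by a factor of 4
print(grouped_conv_params(3, 64, 128, 1))         # 73728
print(grouped_conv_params(3, 64, 128, 4))         # 18432
print(grouped_conv_flops(3, 64, 128, 4, 56, 56))  # 57802752
```

Note that both counts shrink by exactly the factor $g$ relative to ordinary convolution.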

  • Analysis:
    "Improving CNN efficiency with hierarchical filter groups": paper address
    Official analysis: Alex believes that grouped convolution increases the block-diagonal correlation between filters, reduces the number of training parameters, and makes overfitting less likely, an effect similar to regularization.

  • Code implementation (PyTorch provides the relevant parameter; take 2d as an example)

import torch
import torch.nn as nn
...
# note: the parameter is `groups` (plural), not `group`
model = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                  kernel_size=kernel_size, stride=stride, padding=1,
                  dilation=dilation, groups=group_num)

From the code above, we can clearly see that grouped convolution is done with a single convolution call.
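To confirm this, the weight tensor of a grouped nn.Conv2d can be inspected directly (a small sketch; the channel sizes are made up):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                 padding=1, groups=4, bias=False)
# weight shape is [C2, C1/g, k, k]: each kernel only sees C1/g input channels
print(conv.weight.shape)    # torch.Size([128, 16, 3, 3])
print(conv.weight.numel())  # 18432 = 3*3 * (64/4) * 128
```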

Depthwise Separable Convolution

[Figure: depthwise separable convolution (depthwise step and pointwise step)]
MobileNets: Paper address
Depthwise separable convolution is the essence of MobileNet. It consists of a depthwise convolution followed by a pointwise convolution. I used to think of depthwise separable convolution as an extreme case of grouped convolution (simply set the number of groups to Cin). On reflection, the key difference is that grouped convolution performs only one convolution (a single nn.Conv2d suffices, with the results of the groups concatenated), whereas depthwise separable convolution performs two convolution operations. The first is the depthwise convolution (which filters each input channel independently), with kernel size K × K × 1 and K·K·Cin parameters in total. The second, performed to obtain a Cout-dimensional output, uses kernels of size 1 × 1 × Cin and contributes 1·1·Cin·Cout parameters in total. The output of this second convolution is the output of the depthwise separable convolution.

  • example

Take an example to compare the parameter counts. Suppose input.shape = $[c_1, H, W]$ and output.shape = $[c_2, H, W]$:
(a) Regular convolution: $params = k \times k \times c_1 \times c_2$
(b) Depthwise separable convolution: $params = k \times k \times c_1 + 1 \times 1 \times c_1 \times c_2$

From the example above, we can clearly see that to produce the same output.shape, depthwise separable convolution uses roughly an order of magnitude fewer parameters than regular convolution.
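The order-of-magnitude gap is easy to verify in plain Python (the sizes below are arbitrary examples):

```python
def regular_conv_params(k, c1, c2):
    return k * k * c1 * c2

def dw_separable_params(k, c1, c2):
    # depthwise part (k*k*c1) plus pointwise part (1*1*c1*c2)
    return k * k * c1 + 1 * 1 * c1 * c2

print(regular_conv_params(3, 64, 128))   # 73728
print(dw_separable_params(3, 64, 128))   # 8768, roughly 8.4x fewer
```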

  • Code implementation (PyTorch)

import torch
import torch.nn as nn
...
model = nn.Sequential(
        # depthwise: one group per input channel (groups = in_channel)
        nn.Conv2d(in_channels=in_channel, out_channels=in_channel,
                  kernel_size=kernel_size, stride=stride, padding=1,
                  dilation=dilation, groups=in_channel),
        # pointwise: 1x1 convolution changes the channel count to out_channel
        nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                  kernel_size=1, padding=0)
        )

The reason this article introduces grouped convolution first and depthwise separable convolution second is that, in my view, the latter is the extreme case of the former (set the group count of grouped convolution to in_channel, so that each group contains exactly one channel). The two do differ considerably in form: grouped convolution needs only one convolution operation, while depthwise separable convolution needs two, first a depth-wise convolution and then a point-wise convolution. But they are essentially the same idea.

Depthwise separable convolution cannot reach an arbitrary specified output channel count with a single convolution; this follows from setting groups to in_channel, which ties the output channel count of the depthwise step to in_channel rather than the required Cout. That is why a 1×1 convolution is then used to change the final number of output channels. It is natural to think of BottleNeck here: doesn't BottleNeck first use a 1×1 convolution to reduce the channel count (and thus the parameter count), then apply a 3×3 convolution to the reduced feature map, and finally use another 1×1 convolution to restore the original channel count? The purpose of BottleNeck is to reduce the parameter count; it is mentioned here to illustrate that 1×1 convolutions are commonly used to change the number of channels.
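The role of the 1×1 convolution as a pure channel changer can be seen in a few lines (the shapes below are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)                 # [N, C, H, W]
pointwise = nn.Conv2d(64, 256, kernel_size=1)  # mixes channels, no spatial context
print(pointwise(x).shape)                      # torch.Size([1, 256, 32, 32])
```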

GhostNet Network

GhostNet: paper address
The figure below visualizes some of the intermediate feature maps generated by the first residual block in ResNet-50. We can see that many feature maps are highly similar (marked with the same color in the figure); in other words, there are many redundant feature maps. So, from another perspective: can we use a series of cheap linear transformations to generate many "ghost" feature maps that reveal the required information from the original features at a small cost? (The redundant feature maps themselves are necessary for the network to understand the input data comprehensively.) This is the core idea of the entire paper.

[Figure: feature maps of the first residual block in ResNet-50; similar pairs marked with the same color]

As shown below, I divide Ghost-Module into three parts:

  • The first part is an ordinary convolution operation, but its output channel count is deliberately kept small: call it $m$ ($m < n$), where $n$ is the number of channels of the final Output.
  • The second part is a grouped convolution operation. Remember that it is a grouped convolution; many bloggers describe it as a depthwise separable convolution, but as mentioned above, depthwise separable convolution takes two steps, while the code here performs a single convolution, so grouped convolution is the easier way to understand it. This grouped convolution produces the part of the Output feature maps drawn in red in the figure below.
  • The third part, Identity, is easy to understand: the $m$ channels obtained from the first part's convolution are concatenated with the $m(s-1)$ channels obtained from the second part's grouped convolution.
    [Figure: Ghost module structure]
    Here, $Φ_i$ is a linear transformation. In practice the form of $Φ_i$ is not fixed: it can be a 3×3 linear kernel or a 5×5 linear kernel (this is somewhat similar to the depthwise-convolution idea, except that depthwise convolution keeps the channel count unchanged before and after, whereas $Φ_i$ can generate the required number of channels, and multiple linear transformations can be applied to the same channel's feature map; in fact, when $s = 2$, $Φ_i$ is exactly a depthwise convolution). In theory, convolution kernels of different sizes could be combined for the linear transformations, but considering inference on CPU or GPU, the author recommends using all 3×3 kernels or all 5×5 kernels.
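Putting the three parts together, here is a minimal PyTorch sketch of a Ghost module. The class and argument names are my own, and the cheap grouped convolution stands in for the linear transformations $Φ_i$; treat it as an illustration of the structure described above rather than the official implementation:

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        self.oup = oup
        init_channels = math.ceil(oup / ratio)      # m = n / s intrinsic maps
        new_channels = init_channels * (ratio - 1)  # m * (s - 1) "ghost" maps

        # part 1: ordinary convolution with a small output channel count m
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride,
                      kernel_size // 2, bias=False),
            nn.ReLU(inplace=True),
        )
        # part 2: cheap grouped convolution (groups = m), the role of Phi_i
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1,
                      dw_size // 2, groups=init_channels, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)      # m channels
        x2 = self.cheap_operation(x1)  # m * (s - 1) channels
        # part 3: identity, then concatenate along the channel dimension
        out = torch.cat([x1, x2], dim=1)
        return out[:, :self.oup, :, :]

ghost = GhostModule(16, 32, ratio=2)
print(ghost(torch.randn(1, 16, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])
```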

The code

This figure comes from a detailed explanation of GhostNet. First, the number of output channels of Output is $n$. After the first part's convolution, the channel count is $m = \frac{n}{s}$; the second part's grouped convolution produces $m \times (s-1)$ output channels; then the third step, the Identity mapping, concatenates the feature maps obtained in the previous two parts along the channel dimension (dim = 1). The Output feature map therefore has $channel = m + m \times (s-1) = s \times m$ channels, and since $m = \frac{n}{s}$, the number of output channels is still $n$.
[Figure: Ghost module channel calculation]
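The channel bookkeeping above can be checked directly (the values of $n$ and $s$ below are arbitrary):

```python
for n, s in [(64, 2), (64, 4),(96, 3)]:
    m = n // s             # channels from the first (ordinary) convolution
    ghost = m * (s - 1)    # channels from the second (grouped) convolution
    assert m + ghost == n  # concatenation restores the required n channels
print("channel counts check out")
```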

Advantages of the Ghost Module over ordinary convolution

[Figure: comparison of ordinary convolution and the Ghost module]
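The theoretical advantage is easy to recompute in plain Python: with kernel size $k$ for the ordinary convolution and $d$ for the cheap linear kernels, the parameter compression ratio of the Ghost module works out to roughly $s$ (the sizes below are made-up examples):

```python
def ordinary_conv_params(k, c, n):
    return k * k * c * n

def ghost_module_params(k, d, c, n, s):
    m = n // s
    # primary convolution + cheap grouped convolution
    return k * k * c * m + d * d * m * (s - 1)

k = d = 3
c, n, s = 64, 128, 2
p_ord = ordinary_conv_params(k, c, n)         # 73728
p_ghost = ghost_module_params(k, d, c, n, s)  # 37440
print(p_ord / p_ghost)                        # ~1.97, close to s = 2
```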

Ghost Bottleneck (G-bneck)

[Figure: Ghost bottleneck structures for stride 1 and stride 2]
A G-bneck contains two Ghost modules. The first Ghost module acts as an expansion layer that increases the number of channels; the ratio of its output channel count to its input channel count is called the expansion ratio. The second Ghost module reduces the number of channels to match the channels of the shortcut branch. When the stride is 2, a depthwise convolution layer with stride 2 is inserted between the two Ghost modules.
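The wiring of a stride-1 G-bneck can be sketched as follows. To keep the block self-contained, plain 1×1 convolutions stand in for the two Ghost modules; only the expand / reduce / shortcut structure is the point here:

```python
import torch
import torch.nn as nn

class GBneckSketch(nn.Module):
    def __init__(self, inp, hidden, oup):
        super().__init__()
        self.ghost1 = nn.Conv2d(inp, hidden, 1)  # expansion-layer stand-in
        self.act = nn.ReLU(inplace=True)
        self.ghost2 = nn.Conv2d(hidden, oup, 1)  # reduces channels to match shortcut
        # shortcut: identity if shapes already match, else a 1x1 projection
        self.shortcut = (nn.Identity() if inp == oup
                         else nn.Conv2d(inp, oup, 1))

    def forward(self, x):
        return self.ghost2(self.act(self.ghost1(x))) + self.shortcut(x)

blk = GBneckSketch(16, 48, 16)  # expansion ratio = 48 / 16 = 3
print(blk(torch.randn(1, 16, 28, 28)).shape)  # torch.Size([1, 16, 28, 28])
```

For the stride-2 variant, a stride-2 depthwise convolution would be inserted between ghost1 and ghost2, as described above.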

Parameter comparison

[Figure: parameter comparison]

GhostNet network structure

[Figure: GhostNet network architecture]


Origin blog.csdn.net/weixin_54546190/article/details/126439207