Deep learning basics: the role of the 1x1 convolution kernel in CNNs

Foreword

I won't go into the details of convolutional neural networks here. This post consolidates material I have read from other bloggers; experienced readers can skip ahead.
The role of the 1x1 convolution kernel can be summarized in the following points:

  1. Increase network depth (add nonlinear mappings)
  2. Increase/decrease dimensionality (channel count)
  3. Cross-channel information interaction
  4. Reduce convolution parameters (simplify the model)


1. Ordinary convolution

[Figure: a 5x5 input convolved with a 3x3 kernel produces a 3x3 output]

Here we first show the most common convolution setting (with a single channel): a 5x5 image is convolved with a 3x3 kernel to extract features, giving a 3x3 result. If the kernel here is 1x1 instead, the effect is as follows:
[Figure: a 5x5 input convolved with a 1x1 kernel produces a 5x5 output]
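To make the shapes concrete, here is a minimal sketch (PyTorch, with a random single-channel 5x5 input) confirming that a 3x3 kernel shrinks the output to 3x3 while a 1x1 kernel preserves the 5x5 size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)             # batch=1, channels=1, 5x5 input

conv3 = nn.Conv2d(1, 1, kernel_size=3)  # no padding, stride 1
conv1 = nn.Conv2d(1, 1, kernel_size=1)

print(conv3(x).shape)  # torch.Size([1, 1, 3, 3]) -> spatial size shrinks
print(conv1(x).shape)  # torch.Size([1, 1, 5, 5]) -> spatial size preserved
```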

2. Functions of the 1x1 convolution kernel

2.1 Increase the depth of the network (increase the number of nonlinear mappings)

First, consider network depth directly. Although the 1x1 kernel is small, it is still a convolution kernel, so adding a 1x1 convolutional layer naturally increases the depth of the network.
A 1x1 convolution can greatly increase nonlinearity (through the activation function that follows it) while keeping the scale of the feature map unchanged (i.e., without losing resolution), allowing the network to become very deep. The convolution performed by a 1x1 kernel is equivalent to a per-pixel fully connected computation across channels. By adding a nonlinear activation after it, the nonlinearity of the network increases, so the network can express more complex features.
To elaborate, quoting the "frank909" blog:
Digging deeper: what exactly are the benefits of increasing network depth? Why use a 1x1 kernel to do it, and not some other size?

This involves the receptive field. The larger the convolution kernel, the larger the receptive field of a single node on the feature map it produces; and as the network gets deeper, the receptive field of a node on a later feature map grows as well. The features therefore become more and more abstract.

But sometimes we want to deepen the network without enlarging the receptive field, simply to introduce more nonlinearity.

The 1x1 convolution kernel happens to do exactly that.

We know that the size of the output feature map depends on the kernel size and the stride; if the kernel is 1x1 and the stride is 1, the output size does not change.

And a convolution layer is usually followed by an activation function, such as Sigmoid or ReLU.

Therefore, while the input size stays the same, more nonlinearity is introduced, which strengthens the expressive power of the neural network.
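As a minimal sketch (PyTorch; the 64-channel, 28x28 feature map is just an assumed example), a 1x1 convolution followed by ReLU adds one more nonlinear layer without changing the resolution:

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 28, 28)   # hypothetical 64-channel feature map

extra_nonlinearity = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1, stride=1),  # 1x1 conv: per-pixel linear map over channels
    nn.ReLU(inplace=True),                       # the nonlinearity we actually want to add
)

out = extra_nonlinearity(feat)
print(out.shape)  # torch.Size([1, 64, 28, 28]) -> same resolution, one layer deeper
```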

2.2 Increasing/decreasing dimensionality

Here, increasing and decreasing dimensionality specifically refers to changing the number of channels. Once the kernel size is fixed (1x1 with stride 1), the height and width of the feature map stay the same, so the "dimension" in question is the channel dimension. By changing the number of convolution kernels we change the number of channels of the output feature map, and in this way the amount of data flowing through the layer can be increased or decreased.
The following two examples show the effect clearly.

2.2.1 Increasing dimensionality

[Figure: 1x1 convolutions increasing the number of channels]

2.2.2 Decreasing dimensionality

[Figure: 1x1 convolutions decreasing the number of channels]
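A minimal sketch (PyTorch; the channel counts 64, 256 and 16 are hypothetical) of changing only the channel dimension with 1x1 convolutions:

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 28, 28)         # 64-channel feature map

up   = nn.Conv2d(64, 256, kernel_size=1)  # 256 kernels -> 256 output channels
down = nn.Conv2d(64, 16, kernel_size=1)   # 16 kernels  -> 16 output channels

print(up(feat).shape)    # torch.Size([1, 256, 28, 28])  increased dimensionality
print(down(feat).shape)  # torch.Size([1, 16, 28, 28])   decreased dimensionality
```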

It is clear that both increasing and decreasing dimensionality are achieved by changing the number of convolution kernels: the number of channels of the output feature map equals the number of kernels. Note that this is not unique to the 1x1 kernel; kernels of other sizes can change the channel count too. So why do we choose the 1x1 convolution kernel?

2.2.3 Why use a 1x1 kernel to increase/decrease dimensionality

When we only want to change the number of channels, the 1x1 kernel is the smallest possible choice: any larger kernel would add computation and parameters, and the memory footprint would grow accordingly. So if the goal is simply to increase or decrease the channels of a feature map, the 1x1 kernel is the most appropriate choice, since it uses the fewest weight parameters, as the sketch below shows.
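For instance, a rough count (plain Python; the 256-to-64 channel change is an assumed example, bias ignored) of the weights needed by a 1x1 versus a 3x3 kernel for the same channel change:

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution layer (bias ignored)."""
    return c_out * k * k * c_in

c_in, c_out = 256, 64                # hypothetical: project 256 channels down to 64

print(conv_weights(c_in, c_out, 1))  # 16384  weights with a 1x1 kernel
print(conv_weights(c_in, c_out, 3))  # 147456 weights with a 3x3 kernel (9x more)
```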

2.3 Cross-channel Information Interaction

A 1x1 kernel has a single weight per input channel. When it acts on a multi-channel feature map, it is therefore equivalent to a linear combination of the channels: each channel is multiplied by a coefficient and the results are summed. The output feature map thus integrates information from multiple channels, which enriches the features the network extracts.

Using a 1x1 kernel to decrease or increase dimensionality is, in effect, a linear combination of information across channels.

For example: appending 28 kernels of size 1x1 after a 3x3 convolution layer with 64 output channels turns the 64-channel feature map into a 28-channel one. The original 64 channels are linearly recombined into 28 channels — this is the information interaction between channels.

Note: the linear combination happens only along the channel dimension; along W and H the kernel is still a sliding window with shared weights.
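A minimal sketch (PyTorch; the 64-to-28 channel mapping mirrors the example above) checking that a 1x1 convolution at every pixel is exactly a linear combination over the input channels, i.e. a matrix multiply along the channel dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)                     # 64-channel feature map
conv = nn.Conv2d(64, 28, kernel_size=1, bias=False)

out_conv = conv(x)                               # shape (1, 28, 7, 7)

# Same computation written as a per-pixel matrix multiply over channels
w = conv.weight.view(28, 64)                     # (out_channels, in_channels)
out_mm = torch.einsum('oc,bchw->bohw', w, x)

print(torch.allclose(out_conv, out_mm, atol=1e-6))  # True
```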

2.4 Reduce convolution parameters (simplify the model)

The following examples count only the weights (bias terms are not included).

2.4.1 Adding a 1x1 convolution before a convolution layer: weight counts

(1) Without a 1x1 convolution
[Figure: weight count without a 1x1 convolution]
(2) With a 1x1 convolution
[Figure: weight count with a 1x1 convolution inserted]
It can be seen from the figures that the layer without the 1x1 convolution uses roughly 10 times as many weights as the layer that has one.
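The exact numbers in the figures are not reproduced here, but the following rough count (plain Python; the 192 input channels, 32 output channels via a 5x5 kernel, and the 1x1 reduction to 16 channels are assumed values borrowed from the GoogLeNet example below) shows how a factor of roughly 10 can arise:

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution layer (bias ignored)."""
    return c_out * k * k * c_in

# Assumed configuration: 192 input channels, 32 output channels, 5x5 kernel
direct = conv_weights(192, 32, 5)                                # 153600 weights

# Insert a 1x1 convolution that first reduces the 192 channels to 16
bottleneck = conv_weights(192, 16, 1) + conv_weights(16, 32, 5)  # 3072 + 12800 = 15872

print(direct / bottleneck)   # ~9.7, i.e. roughly 10x fewer weights
```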

2.4.2 GoogLeNet's 3a module

(1) Without 1x1 convolutions

[Figure: Inception 3a module without 1x1 convolutions]

Number of weights: 192 × (1×1×64) + 192 × (3×3×128) + 192 × (5×5×32) = 387072

Some notes on this design:
(1) Using kernels of different sizes means different receptive field sizes; concatenating the outputs at the end fuses features of different scales.
(2) The kernel sizes 1, 3 and 5 are chosen mainly for convenient alignment: with stride = 1 and pad = 0, 1, 2 respectively, the convolutions produce feature maps of the same spatial size, which can then be concatenated directly (see the sketch after this list).
(3) The paper notes that pooling has proven effective in many places, so it is also embedded in the Inception module.
(4) The deeper the network, the more abstract the features and the larger the receptive field each feature involves, so as the number of layers increases, the proportion of 3x3 and 5x5 convolutions also increases.
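A minimal sketch (PyTorch; the channel counts follow the weight calculation above) verifying note (2): with stride 1 and padding 0, 1, 2, the three branches keep the same spatial size and can be concatenated directly:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)   # 192-channel input, 28x28 spatial size

b1 = nn.Conv2d(192, 64,  kernel_size=1, stride=1, padding=0)
b3 = nn.Conv2d(192, 128, kernel_size=3, stride=1, padding=1)
b5 = nn.Conv2d(192, 32,  kernel_size=5, stride=1, padding=2)

out = torch.cat([b1(x), b3(x), b5(x)], dim=1)   # concatenate along channels
print(out.shape)  # torch.Size([1, 224, 28, 28]) -> every branch stayed 28x28
```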
(2) With 1x1 convolutions

[Figure: Inception 3a module with 1x1 convolutions added before the 3x3 and 5x5 branches]

Number of weights: 192 × (1×1×64) + (192×1×1×96 + 96×3×3×128) + (192×1×1×16 + 16×5×5×32) = 157184

The module without 1x1 convolutions therefore has roughly 2.5 times as many weights as the module that uses them (387072 vs. 157184), as the short computation below confirms.
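A short computation (plain Python, bias ignored) reproducing both totals:

```python
def conv_weights(c_in, c_out, k):
    return c_out * k * k * c_in

# Without 1x1 reductions: the 1x1, 3x3 and 5x5 branches all read the 192-channel input
without = (conv_weights(192, 64, 1)
           + conv_weights(192, 128, 3)
           + conv_weights(192, 32, 5))

# With 1x1 reductions: 3x3 branch goes 192 -> 96 -> 128, 5x5 branch goes 192 -> 16 -> 32
with_1x1 = (conv_weights(192, 64, 1)
            + conv_weights(192, 96, 1) + conv_weights(96, 128, 3)
            + conv_weights(192, 16, 1) + conv_weights(16, 32, 5))

print(without, with_1x1)   # 387072 157184
print(without / with_1x1)  # ~2.46
```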

2.4.3 ResNet

ResNet also uses 1×1 convolutions, placed both before and after the 3×3 convolution layer: they first reduce and then restore the dimensionality, which further cuts the number of parameters. The block on the right is called the "bottleneck design", and its purpose is plain: to reduce the parameter count. The first 1x1 convolution reduces the 256 channels to 64, and the final 1x1 convolution restores them to 256.

[Figure: ResNet basic residual block (left) and bottleneck block (right)]

When the feature map has 256 channels, the problem is that applying 3×3 convolutions directly becomes computationally expensive. The approach here is to project down to 64 channels with a 1×1 convolution, perform a 3×3 convolution that keeps the channel count unchanged, and then project back up to 256 channels with another 1×1 convolution. Because the block's input has 256 channels, the output must match; with this design the complexity ends up almost the same as that of the basic block on the left.
Parameters of the left block: 64 × (3×3×64) + 64 × (3×3×64) = 73728
If the channel count were raised to 256: 256 × (3×3×256) + 256 × (3×3×256) = 1179648
Parameters of the right (bottleneck) block: 256 × (1×1×64) + 64 × (3×3×64) + 64 × (1×1×256) = 69632

So even with 256 input channels, the bottleneck block with its two added 1x1 convolution layers (69632 parameters) costs roughly the same as the original 64-channel residual block (73728 parameters), rather than the 1179648 parameters a plain 256-channel block would need. A small sketch of this block follows.
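A minimal sketch (PyTorch) of the bottleneck layout described above; the real block also contains batch normalization and the skip connection, which are omitted here:

```python
import torch
import torch.nn as nn

# Bottleneck convolutions only: 256 -> 64 -> 64 -> 256 channels
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),             # 1x1: project down to 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # 3x3: convolve at 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),             # 1x1: project back up to 256
)

x = torch.randn(1, 256, 56, 56)
print(bottleneck(x).shape)                                  # torch.Size([1, 256, 56, 56])
print(sum(p.numel() for p in bottleneck.parameters()))      # 69632 weights, as computed above
```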

The conventional residual block is used in ResNets with 34 layers or fewer; the bottleneck design is used in the deeper networks, mainly to reduce computation and parameters (a practical consideration).

References

One article to understand the role of the 1x1 convolution kernel in convolutional neural networks
[Deep Learning] The role of the 1x1 convolution kernel in CNN
Deep learning: the role of the 1x1 convolution kernel
The role of 1x1 convolution
