Convolution operation for machine learning

Many good ideas are never heard again once they cross the semantic gap.


Convolution is a commonplace topic in the field of images and vision, but for me the specific details of how it works are still worth learning.

Convolution principle

Convolution uses a small matrix (or higher-dimensional tensor) that acts on the image matrix (or feature matrix) and outputs a single, meaningful value.

Specifically, this process is called feature mapping, and each feature map captures one type of image feature.
For example, for an image containing a cat, three convolution kernels can be applied to the image, and the result of each kernel's convolution is one feature-mapping result. The three kernels might respond to cat-ear, cat-paw, and cat-tail features respectively.
By increasing the number of convolution kernels used, the features of the image can be captured more fully.
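As a quick illustration, here is a minimal PyTorch sketch (the image size, kernel size, and kernel count are arbitrary choices for this example): three kernels act on one image and each produces its own feature map.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)    # a batch of one RGB image, 32x32
conv = nn.Conv2d(in_channels=3,      # must match the image's channel count
                 out_channels=3,     # three kernels -> three feature maps
                 kernel_size=5)
feature_maps = conv(image)
print(feature_maps.shape)            # torch.Size([1, 3, 28, 28])
```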

In practical applications, consider recognizing the digits in the MNIST data set: since MNIST images are all grayscale, there is (at least) one feature map; correspondingly, an RGB image has (at least) three feature maps, one per channel.
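In PyTorch this is just the in_channels argument; a small sketch (the out_channels and kernel_size values here are arbitrary):

```python
import torch.nn as nn

mnist_conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)  # grayscale: 1 input map
rgb_conv   = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # RGB: 3 input maps
```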

The size of the convolution kernel is tied to the feature map: the number of layers (channels) of the kernel must match the number of layers of the feature map. For example, feature-map layer $L_1$ is dot-multiplied with kernel layer $W_1$, $L_2$ with $W_2$, ..., $L_n$ with $W_n$, giving a one-dimensional vector whose dimension equals the number of feature-map (or kernel) layers. Summing this vector's entries yields a single numerical value (the feature value).
[Figure: a multi-layer feature map (blue) convolved with a multi-layer convolution kernel (green)]
A more specific description follows (quoted from Qiu Xipeng's "Neural Networks and Deep Learning"):
[Figure: standard multi-channel convolution, from the book]
$X^1, \dots, X^D$ are the different layers of a feature map, and $W^{p,1}, \dots, W^{p,D}$ are the different layers of one convolution kernel. Each $W^{p,d}$ is dot-multiplied with $X^d$ and the results are summed; a bias is then added (one bias value per summation result), and the output feature map is obtained through the activation function.
PS: The purpose of the bias is to make feature extraction more flexible, weighting the information and adjusting its proportion.
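To make the formula concrete, here is a by-hand version (with made-up shapes) checked against PyTorch's built-in F.conv2d: each kernel layer is dot-multiplied with its feature-map layer, the D results are summed, a bias is added, and an activation is applied.

```python
import torch
import torch.nn.functional as F

D = 3                                          # number of feature-map layers
X = torch.randn(D, 4, 4)                       # feature map with D layers
W = torch.randn(1, D, 4, 4)                    # one kernel, also with D layers
b = torch.tensor([0.5])                        # bias value

# dot products per layer, sum over everything, add bias, apply activation
manual = torch.relu((X * W[0]).sum() + b)
builtin = torch.relu(F.conv2d(X.unsqueeze(0), W, bias=b))
print(torch.allclose(manual, builtin.flatten()))  # True
```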


Convolution application

Convolutional layers replace fully connected layers.


The convolutional layer performs local-region feature fusion, while the fully connected layer performs global feature-information fusion. The difference between the two is the scope over which they act, which creates the possibility of replacing one with the other.

But why can convolutional layers replace fully connected layers?

The number of parameters in a fully connected layer is tied to the (fixed) size of the previous layer, which in turn forces the network's input image size to be fixed. This limitation makes training time-consuming and complicated for many tasks. If the convolution operation is used instead, the number of convolution kernels places no constraint on the size of the previous layer's feature map, which removes the image-size restriction.
For example, suppose the network input is a 512×512×3 image, the feature map feeding the fully connected layer is 16×16×8, and the fully connected layer has 128 neurons. We can then use 128 convolution kernels of size 16×16×8 to perform the convolution and output a 1×1×128 feature map.
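A sketch of this replacement in PyTorch, following the shapes in the example above:

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 8, 16, 16)          # the previous layer's 16x16x8 feature map

fc = nn.Linear(16 * 16 * 8, 128)          # fully connected version: 128 neurons
out_fc = fc(feat.flatten(1))              # shape: [1, 128]

conv = nn.Conv2d(8, 128, kernel_size=16)  # 128 kernels of size 16x16x8
out_conv = conv(feat)                     # shape: [1, 128, 1, 1]

print(out_fc.shape, out_conv.shape)
```

Both layers have the same number of parameters (16·16·8·128 weights plus 128 biases), but the convolutional version still runs if a larger feature map is fed in, producing a spatial grid of outputs instead of failing on a size mismatch.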

Convolution result analysis

The general formula for the output size of a convolution is
$$N = \frac{\text{input size} - \text{kernel size} + 2 \times \text{padding}}{\text{stride}} + 1$$

Library Functions
nn.Conv2d(in_channels,out_channels,kernel_size,stride,padding)

For example, if the input image size is 227×227×3, the convolution kernel size is 5×5×3, padding is 3, and stride is 2, then the convolution output size is $\frac{227 - 5 + 2 \times 3}{2} + 1 = 115$.
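The same example can be checked against the library function (the out_channels value below is an arbitrary choice; it does not affect the spatial size):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 227, 227)
conv = nn.Conv2d(in_channels=3, out_channels=8,
                 kernel_size=5, stride=2, padding=3)
print(conv(x).shape)               # torch.Size([1, 8, 115, 115])

# direct use of the formula: N = (input - kernel + 2*padding) / stride + 1
print((227 - 5 + 2 * 3) // 2 + 1)  # 115
```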


Convolution derivative operation

In convolution we mainly care about the kernel parameters and the bias coefficients, and we update and optimize these parameters via error backpropagation.

Of course, there is usually an activation layer as well. In that case, simply differentiate the result with respect to the activation function first, and then differentiate the activation function with respect to each parameter (the chain rule).
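A minimal sketch of this in PyTorch autograd (the loss here is a placeholder for illustration): backpropagation produces gradients for exactly these parameters, the kernel weights and the biases.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 4, kernel_size=3)
x = torch.randn(1, 3, 8, 8)

out = torch.relu(conv(x))      # convolution followed by an activation layer
loss = out.sum()               # placeholder loss
loss.backward()                # chain rule: through the activation, then the parameters

print(conv.weight.grad.shape)  # torch.Size([4, 3, 3, 3]) -- kernel gradients
print(conv.bias.grad.shape)    # torch.Size([4])          -- bias gradients
```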


Origin blog.csdn.net/qq_44116998/article/details/128823983