Convolution operation
This is not difficult to understand. In a computer, an image is made up of individual pixels and can be represented as a matrix.
Assume a 5x5 input image, and define a 3x3 matrix (whose values are randomly generated) as the convolution kernel.
Take this convolution kernel, place it over the 3x3 region in the upper-left corner of the input image, multiply the kernel and that region element-wise to get 9 numbers, and add those 9 numbers together to produce one result.
Then move the convolution kernel one space to the right and repeat the calculation to get another number.
After that, continue moving right and calculate again.
Three calculations give the values of the first output row.
Then move down one space and repeat the whole process until the entire 5x5 input image has been covered, yielding 9 results in total.
These 9 numbers are the result of our convolution, and this whole procedure is the convolution operation.
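The sliding-window computation described above can be sketched in plain Python (no framework). This is a minimal illustration with a hypothetical input and a kernel chosen so the result is easy to check by hand:

```python
# Naive 2D convolution (cross-correlation): slide the kernel over the
# image, multiply corresponding positions, and sum the products.
def conv2d_naive(image, kernel, stride=1):
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # multiply the 9 corresponding positions, then sum
            s = 0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(s)
        output.append(row)
    return output

image = [[1, 2, 3, 4, 5],
         [5, 4, 3, 2, 1],
         [1, 2, 3, 4, 5],
         [5, 4, 3, 2, 1],
         [1, 2, 3, 4, 5]]
# a kernel that just picks out the center pixel, for easy checking
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]

result = conv2d_naive(image, kernel)
# a 5x5 input with a 3x3 kernel and stride 1 gives a 3x3 output (9 numbers)
```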
Then there are a few questions:
- Q1: Does each move to the right have to be exactly 1 space?
- A1: No. Moving 1 space means the stride is 1; if we set the stride to 2, the kernel moves 2 spaces each time. The stride is a parameter we set ourselves.
- Q2: How are the values in the convolution kernel set?
- A2: They are randomly generated at first (and are then learned and updated during training).
- Q3: So after convolution, must the image become smaller?
- A3: No. In the example above, the 5x5 input is convolved into a 3x3 output. But if we pad a ring of pixels around the 5x5 image, it becomes a 7x7 image, and convolving it with the same kernel yields a 5x5 output. In practice we really do this: a parameter called padding indicates whether to pad, and we can set both the width of the padding and the value used to fill it, which is usually 0.
By the way, here is the formula:
Assume the input image is W x W, the kernel size is F x F, the stride is S, and the padding is P (the number of pixels added on each side). Then:
output size = (W - F + 2P)/S + 1
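The formula can be wrapped in a small helper to check the two cases discussed above (3x3 output without padding, 5x5 output with one ring of padding):

```python
# Output size of a convolution: W = input size, F = kernel size,
# S = stride, P = padding (pixels added on each side).
def conv_output_size(W, F, S=1, P=0):
    return (W - F + 2 * P) // S + 1

# 5x5 input, 3x3 kernel, stride 1, no padding -> 3
# 5x5 input, 3x3 kernel, stride 1, padding 1 (effectively 7x7) -> 5
```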
Now that the whole convolution process is clear, the following picture should make sense.
The figure shows an input image of 5x5, a 3x3 convolution kernel, stride 1, and padding=1, so the output is 5x5.
Actual operation
The convolution process works as described above, but when actually writing code we do not have to implement every step ourselves so laboriously.
The framework already encapsulates the corresponding functions for us; we only need to call a function and pass it the relevant parameters.
Let's take the PyTorch framework as an example (TensorFlow is similar).
We need to set the following parameters when using Conv2d. Let's explain a few commonly used ones:
- in_channels: the number of input channels
- out_channels: the number of output channels
- kernel_size: the size of the convolution kernel, of type int or tuple. If the kernel is square, a single integer side length is enough; if not, pass a tuple giving the height and width. (The kernel values do not need to be set by you; you only specify the size, and the values are randomly initialized.)
- stride: the step size, i.e. how many pixels the kernel moves each time (default 1)
- padding: how many rings of pixels to pad; the default is 0, meaning no padding
- dilation: the spacing between kernel elements (set this for dilated, a.k.a. atrous, convolution)
- groups: controls the connections between inputs and outputs
- bias: whether to add a learned bias to the output (default True)
- padding_mode: the padding mode
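A minimal sketch of calling nn.Conv2d with these parameters (the input tensor here is made up for illustration; only the shapes matter):

```python
# Construct a convolution layer with the parameters described above.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3,
                 stride=1, padding=1)

x = torch.randn(1, 1, 5, 5)   # (batch, channels, height, width)
y = conv(x)
# padding=1 keeps the spatial size: the output is also 5x5
print(y.shape)  # torch.Size([1, 1, 5, 5])
```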
Filter and kernel
Here we focus on explaining the number of channels.
Suppose a picture is 6x6 with 1 channel (e.g. a black-and-white image), the kernel size is 3x3, the stride is 1, and there is no padding (padding=0). We will leave the question of how to set out_channels aside for now and come back to it later.
That is, the current parameter settings are:
in_channels=1
kernel_size=3
stride=1
padding=0
We can calculate that the output image is 4x4. I drew a schematic diagram so you can see it.
We also know that an RGB image has three channels. So if the picture above were an RGB image, what would the output be?
In other words, the parameter settings become:
in_channels=3
kernel_size=3
stride=1
padding=0
As shown in the figure, the output is still a single channel.
You can see that the convolution kernel here has become a stack of three layers.
Some students understand the single-channel convolution above but not the multi-channel convolution.
When your input image has three channels, the convolution kernel also has three channels.
In fact, the key point is in_channels: it is the number of input channels, and it is also the number of channels of the filter.
We call a single 3x3 layer a convolution kernel; if the input is a three-channel image, then our convolution kernel is also three-channel.
A single-layer convolution kernel is called a kernel, and a multi-layer stack of them is called a filter.
Note: this explanation is not strictly correct; it is just meant to make things easy to understand. As for the specific meanings of kernel and filter, there are historical reasons: the terms were borrowed from other disciplines, and for learning neural networks today you do not need to study in detail what kernel and filter originally referred to, as long as you understand that both are convolution kernels. You can also see the explanations given by readers in the comment section of this blog post.
When the input image is three-channel, the convolution kernel is also three-channel, and the operation is performed between this new convolution kernel (containing 27 numbers) and the corresponding positions of the input image.
The 27 numbers are multiplied by the corresponding 27 numbers in the input image and then summed to produce one number. Repeating this calculation over the entire input image yields 16 numbers (a 4x4 output).
As shown in the picture, the result of the calculation is therefore still a single channel.
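As a sanity check of the numbers above (27 weights, a single-channel 4x4 output with 16 values), here is a minimal PyTorch sketch of the three-channel case:

```python
# 6x6 three-channel input, one 3x3 kernel of depth 3, stride 1, no padding.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3,
                 stride=1, padding=0, bias=False)

# one filter, depth 3 (= in_channels), 3x3 spatially: 27 weights in total
print(conv.weight.shape)  # torch.Size([1, 3, 3, 3])

x = torch.randn(1, 3, 6, 6)
y = conv(x)
print(y.shape)  # torch.Size([1, 1, 4, 4]) -- one channel, 16 values
```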
With that, the concepts of kernel and filter are clear:
- kernel: a 2D matrix, height x width.
- filter: a three-dimensional block, height x width x depth, where the depth is the number of kernels it consists of.
You could say the kernel is the basic element of the filter: multiple kernels make up a filter.
In essence, both kernel and filter are convolution kernels, but one corresponds to a single channel and the other to multiple channels, so the depth of a filter is determined by the number of input channels.
Then there are two questions:
- How many kernels should a filter contain? The answer: as many as there are input channels.
- How many filters should there be in one layer? The answer: it depends on how many features we want to extract. A filter is responsible for extracting one kind of feature, so we can set as many filters as we want output channels. And which parameter sets the number of filters? It is the one we skipped earlier: out_channels. Don't forget, it can also be set manually. In the picture above, one filter produces a single-channel result; if you set out_channels=2, the output will have 2 channels, as the picture shows.
To sum up:
- The number of filters determines the number of output channels. When we write code, we do not specify the number of filters directly; we specify out_channels, so the output channel count is our hyperparameter.
- in_channels determines the number of channels (the depth) of each filter, and out_channels determines the number of filters.
- The out_channels of this convolution layer is the in_channels of the next layer.
- So, within one layer, out_channels and in_channels are not related to each other.
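These relationships can be checked directly on nn.Conv2d's weight tensor, whose shape is (out_channels, in_channels, kernel_height, kernel_width), i.e. out_channels filters, each made of in_channels kernels:

```python
# 2 filters (out_channels=2), each of depth 3 (in_channels=3), 3x3 spatially.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
print(conv.weight.shape)  # torch.Size([2, 3, 3, 3])

# this layer's out_channels becomes the next layer's in_channels
next_conv = nn.Conv2d(in_channels=2, out_channels=16, kernel_size=3)
```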
1x1 convolutional layer
The 1x1 convolution layer is a special convolution layer.
The height and width of its kernel are both 1, which means it does not recognize spatial information: it only ever looks at one spatial pixel at a time.
But we often use it to merge channels.
Its output value is a weighted combination of the values of the different input channels at the corresponding position.
In other words, the role of a 1x1 kernel is to fuse information across channels. With no spatial matching involved, it directly maps input channels to output channels at each position, which is equivalent to pulling each pixel's channels into a vector, where the number of channels equals the number of features.
A 1x1 convolution layer is thus equivalent to a fully connected layer applied at every pixel: like a fully connected layer, it does not consider spatial information and only performs fusion in the feature (i.e. input-channel) dimension.
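A minimal sketch of this equivalence: the same 1x1 convolution computed once with nn.Conv2d and once with nn.Linear applied at every pixel (the reshaping here is illustrative, not from the original post):

```python
# A 1x1 convolution mixes channels but never touches spatial neighbours.
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=1, bias=False)

x = torch.randn(1, 3, 5, 5)
y = conv1x1(x)
# spatial size unchanged; only the channel count changes
print(y.shape)  # torch.Size([1, 2, 5, 5])

# the same computation as a fully connected layer applied per pixel
fc = nn.Linear(3, 2, bias=False)
fc.weight.data = conv1x1.weight.data.view(2, 3)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
# both paths give the same numbers
```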
Visual example
We can use an actual network, LeNet-5, to see what we just explained.
Its input is a 32x32 handwritten digit picture.
6@28x28 means the first convolution layer has 6 output channels and an output size of 28x28.
The second layer is a pooling layer: the channel count stays at 6, and the size is halved to 14x14.
The third is another convolution layer: 16 channels, size 10x10.
The fourth is a pooling layer: 16 channels, size 5x5.
Finally come two fully connected layers, and then the output.
The first layer of LeNet-5 is a convolution layer: the input is 32x32x1, the kernel size is 5x5, stride=1, padding=0, and the output is 6@28x28. The input here is single-channel, i.e. in_channels=1, so the depth of each filter is 1; but 6 output channels are required, i.e. out_channels=6, so 6 filters are needed, and we finally obtain six 28x28 feature maps.
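This first layer can be reproduced in a few lines of PyTorch to confirm the shapes:

```python
# LeNet-5's first layer: 32x32 single-channel input, six 5x5 filters,
# stride 1, no padding -> 6 @ 28x28.
import torch
import torch.nn as nn

c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
x = torch.randn(1, 1, 32, 32)
y = c1(x)
print(y.shape)  # torch.Size([1, 6, 28, 28])
```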
As shown in the figure, this is the visualization of the whole LeNet-5 network. The blue layer is the 32x32 input; after convolution we get the next, yellow layer. You can see that the yellow layer is a cube; if we unfold it, there are indeed six 28x28 results.
The website for this visualization is: https://tensorspace.org/index.html
Pooling
Once the convolution operation is understood, pooling is much simpler. The pooling operation also uses a window, say 3x3: place it over the corresponding 3x3 region of the input and select the largest of those nine numbers as the output. This is called max pooling.
Output channels = input channels (when the input has multiple channels, each channel is pooled separately).
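A minimal sketch of both points: a tiny 3x3 example where the window picks the largest of nine numbers, and the 2x2 pooling used in LeNet-5, which halves each spatial dimension and keeps the channel count:

```python
# Max pooling: the window outputs the largest value it covers.
import torch
import torch.nn as nn

# a 3x3 window over a 3x3 input picks the single largest of the nine numbers
x_small = torch.tensor([[[[1., 2., 3.],
                          [4., 9., 6.],
                          [7., 8., 5.]]]])
m = nn.MaxPool2d(kernel_size=3)(x_small)
print(m.item())  # 9.0

# LeNet-5 style 2x2 pooling: channels unchanged, spatial size halved
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(torch.randn(1, 6, 28, 28))
print(y.shape)  # torch.Size([1, 6, 14, 14])
```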
Fully connected layer
The fully connected layer generally comes at the end of a convolutional neural network. Its input is the result of the preceding convolution and pooling. That result matrix is first "flattened" into a column vector. So how does the fully connected layer operate on this vector?
As shown in the figure, suppose $x_1$, $x_2$, $x_3$ on the left are the vector obtained after flattening. Then we compute $x_1 \times w_{11} + x_2 \times w_{21} + x_3 \times w_{31} = b_1$.
In the same way, $b_2$ is calculated; this calculation process can be expressed as a matrix operation.
In this operation, by increasing the number of columns in the weight matrix $W$ we can get a different number of results. For example, if $W$ is 3x3, then a 1x3 result is obtained. So the fully connected layer outputs a vector, and we can define how many final results it holds.
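The flatten-and-multiply step above can be sketched in PyTorch. The sizes follow the LeNet-5 example (16 maps of 5x5 flatten to 400 values; 120 outputs is LeNet-5's choice, but any number works):

```python
# Flatten the pooled feature maps into a vector, then multiply by a
# weight matrix whose output size we choose freely.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)     # e.g. LeNet-5 after the second pooling
flat = nn.Flatten()(x)           # -> (1, 400)
fc = nn.Linear(400, 120)         # we pick the number of outputs
y = fc(flat)
print(y.shape)  # torch.Size([1, 120])
```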
So what's the point?
Fully connected (FC) layers play the role of "classifier" in the convolutional neural network. If the convolution layers, pooling layers, and activation functions map the original data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space.
Doing so reduces the impact of feature position on classification. A feature map is a matrix, so the position of a feature affects classification: to recognize a cat in an image, if the cat is in the upper-left corner, then the upper-left region detects it but the lower-right region does not. The fully connected layer, however, integrates the two-dimensional matrix into a single output value: the predicted probability of a cat. No matter where the cat is, as long as the probability is large, there is a cat. This ignores spatial structure but enhances robustness.