Convolutional neural networks, illustrated: convolution, pooling, full connection (the number of channels, the concepts of kernel and filter)

Convolution operation

This is not difficult to understand. We know that in a computer an image is made up of individual pixels, so it can be represented as a matrix.
Assume a 5x5 input image, and define a 3x3 matrix whose values are randomly generated; this is the convolution kernel.
[Figure: a 5x5 input image and a 3x3 convolution kernel]
We then take this convolution kernel, select the 3x3 region in the upper-left corner of the input image, multiply the kernel and that region element-wise to get 9 numbers, and add those 9 numbers together to obtain a single result.
[Figure: the kernel applied to the upper-left 3x3 window]
Then we move the convolution kernel one position to the right and repeat the same calculation to get another number.
[Figure: the kernel shifted one position to the right]
After that calculation, we continue moving right and compute again.
[Figure: the kernel shifted to the third position]
Three calculations give us the first row of results.
[Figure: the first row of output values]
Then we move down one row and repeat the whole procedure, until we have covered the entire 5x5 input image and obtained 9 results.
[Figure: the complete 3x3 output]
These 9 numbers are the result of our convolution, and this whole procedure is the convolution operation.
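The sliding-window computation above can be written out directly. Here is a minimal sketch in PyTorch (stride 1, no padding); the function name `conv2d_naive` is just an illustrative choice, not a library function:

```python
import torch

def conv2d_naive(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), multiplying
    element-wise at each position and summing -- the walk-through above."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = torch.zeros(H - kH + 1, W - kW + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out

image = torch.randn(5, 5)   # the 5x5 input
kernel = torch.randn(3, 3)  # values randomly generated, as in the example
print(conv2d_naive(image, kernel).shape)  # torch.Size([3, 3]) -- the 9 results
```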
Then there are a few questions:

  • Q1: Does the kernel have to move right by exactly 1 position each time?
  • A1: No. Moving 1 position means the stride is 1. If we set the stride to 2, the kernel moves 2 positions each time; the stride (step size) is a parameter we set ourselves.
  • Q2: How are the values in the convolution kernel set?
  • A2: They are randomly generated at first (and will be learned and updated during training).
  • Q3: So after convolution, must the image become smaller?
  • A3: No. In the example above, the 5x5 input is convolved into a 3x3 output. But if we pad the 5x5 image with one ring of pixels, it becomes a 7x7 image, and convolving it with the same kernel yields a 5x5 output. In practice we often do this: there is a parameter padding that controls whether to pad, how many pixels to pad, and what value to pad with (generally 0).

By the way, here is the formula:
Assume the input image is W x W, the kernel size is F x F, the stride is S, and the padding is P (the number of pixels added on each side). Then:
output size = (W - F + 2P)/S + 1

Having understood the whole convolution process, the following picture should now make sense.
It shows an input image of 5x5, a 3x3 kernel, stride 1, and padding=1, so the output is (5 - 3 + 2)/1 + 1 = 5, i.e. 5x5.
[Figure: 5x5 input, 3x3 kernel, stride 1, padding 1, 5x5 output]
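As a quick sanity check, the formula can be coded up in a few lines (a hypothetical helper, not a library function):

```python
def conv_output_size(W, F, P=0, S=1):
    """Output size = (W - F + 2P)/S + 1 for a WxW input and FxF kernel."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(5, 3))       # 3 -- the 5x5 example with no padding
print(conv_output_size(5, 3, P=1))  # 5 -- with padding=1, the output stays 5x5
print(conv_output_size(7, 3))       # 5 -- the 5x5 image padded out to 7x7
```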

Actual operation

The convolution process is as described above, but when actually writing code we do not need to implement every step ourselves.
The framework has already encapsulated the corresponding functions for us; we only need to call them and pass the relevant parameters.
Let's take the PyTorch framework as an example (TensorFlow is similar).
We need to set the following parameters when constructing Conv2d:
[Figure: the parameter list of torch.nn.Conv2d]
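For reference, the constructor of torch.nn.Conv2d in current PyTorch versions looks like this (recent releases also add device and dtype arguments):

```python
torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1,
                padding=0, dilation=1, groups=1, bias=True,
                padding_mode='zeros')
```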
Let's explain a few commonly used ones:

  • in_channels: the number of input channels
  • out_channels: the number of output channels
  • kernel_size: the size of the convolution kernel; the type is int or tuple. When the kernel is square, only an integer side length is needed; if it is not square, a tuple giving the height and width must be passed. (You do not set the kernel's values yourself; you only give it a size, and the values inside are randomly generated.)
  • stride: the step size (how many pixels the kernel moves each time; the default is 1)
  • padding: how many rings of pixels to pad around the input; the default is 0 (no padding)
  • dilation: controls the spacing between kernel elements (set this to do dilated/atrous convolution)
  • groups: controls the connections between inputs and outputs
  • bias: whether to add a learnable bias to the output; the default is True
  • padding_mode: sets the padding mode
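A minimal usage sketch, matching the 5x5 example from earlier (assuming PyTorch is installed; the input shape is (batch, channels, height, width)):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)  # one single-channel 5x5 image

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3,
                 stride=1, padding=0)
y = conv(x)
print(y.shape)  # torch.Size([1, 1, 3, 3]) -- matches (5 - 3 + 0)/1 + 1 = 3
```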

Filter and kernel

Here we focus on explaining the number of channels.
Suppose a picture is 6x6 with 1 channel (e.g. a grayscale image), the kernel size is 3x3, the stride is 1, and there is no padding (padding=0). For now we will not consider how out_channels is set; that will be discussed later.
That is, the current parameter settings are: in_channels=1, kernel_size=3, stride=1, padding=0.
We can calculate that the output image is 4x4. I drew a schematic diagram, shown below:
[Figure: a single-channel 6x6 input convolved into a 4x4 output]
We also know that an RGB image has three channels. So if the picture above were an RGB image, what would the output be?
In other words, the parameter settings are: in_channels=3, kernel_size=3, stride=1, padding=0.
As the figure shows, our output still has 1 channel.
[Figure: a three-channel input convolved into a single-channel output]

You can see that the convolution kernel here has become three layers stacked together.
Some students understand the single-channel convolution above but not the multi-channel case.
When your input image has three channels, the convolution kernel also has three channels.
In fact, the key point is in_channels: it is the number of input channels, and it is also the number of channels of the filter.
We call the single 3x3 matrix a convolution kernel (kernel),
and if the input is a three-channel image, our convolution kernel also becomes three-channel.
A single-layer convolution kernel is called a kernel, and a multi-layer stack of kernels is called a filter.

Note: this explanation is not strictly correct; it is just an aid to understanding. The specific meanings of kernel and filter have historical reasons, as these terms were borrowed from other disciplines. For learning neural networks today, you do not need to study exactly what kernel and filter originally referred to, as long as you understand that both are convolution kernels. You can also read the explanations given by readers in the comment section of the original blog post.

[Figure: a three-channel 6x6 input convolved with a three-channel 3x3 filter]
When your input image has three channels, the convolution kernel is also three-channel.
The operation is performed between this new convolution kernel (27 numbers in total) and the corresponding positions of the input image:
the 27 numbers are multiplied with the 27 corresponding numbers in the input, then summed to give a single number. Repeating this calculation over the entire input image yields 16 numbers, i.e. a 4x4 output.
As shown in the picture:
[Figure: the 27 products summed into one output value]

Therefore, the result of the calculation is still a single channel.

With this, the concepts of kernel and filter become clear.
kernel: a 2D matrix, length × width.
filter: a three-dimensional cube, length × width × depth, where the depth is the number of kernels it consists of.
It can be said that the kernel is the basic element of the filter: multiple kernels form a filter.
In essence, both kernel and filter are convolution kernels; one corresponds to a single channel and the other to multiple channels,
so the depth of a filter depends on the number of input channels.

Then there are two questions:
How many kernels should a filter contain? The answer: the number is determined by the number of input channels (in_channels).
How many filters should there be in one layer? The answer: it depends on how many features we want to extract. A filter is responsible for extracting a certain feature, so we can set as many filters as we want the output to have. The parameter that sets the number of filters is out_channels, which we did not mention before. In the picture above, the result of one filter's operation is a single channel; if you set out_channels=2, the output will have 2 channels. As the picture shows:
[Figure: two filters producing a two-channel output]

So, to sum up:
the number of filters determines the number of output channels.
When we write code, we do not specify the number of filters directly; we specify the number of output channels, so out_channels is our hyperparameter.
in_channels determines the number of channels of each filter, and out_channels determines the number of filters. The out_channels of this convolutional layer becomes the in_channels of the next layer.
So out_channels and in_channels are not related to each other.
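We can check this relationship in PyTorch: the weight tensor of Conv2d has shape (out_channels, in_channels, kernel_height, kernel_width), i.e. out_channels filters, each made of in_channels kernels:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
print(conv.weight.shape)  # torch.Size([2, 3, 3, 3]): 2 filters of 3 kernels each
print(conv.bias.shape)    # torch.Size([2]): one bias per filter / output channel
```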

1x1 convolutional layer

The 1x1 convolutional layer is a special convolutional layer.
The height and width of its kernel are both 1, which means it does not recognize spatial information: it only looks at one spatial pixel at a time.
But we often use it to fuse channels.
Its output value is equivalent to a weighted sum of the values of the different channels at the corresponding input position.
In other words, the function of the 1x1 kernel is to fuse the information of different channels; there is no spatial pattern matching. It directly fuses the input channels into the output channels, which is equivalent to pulling each input position into a vector whose length (the number of channels) equals the number of features.
A 1x1 convolutional layer is therefore equivalent to a fully connected layer applied at every position: like a fully connected layer, it does not consider spatial information and only performs fusion in the feature dimension (i.e. the input-channel dimension).
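A small sketch to verify this equivalence: a 1x1 convolution produces the same numbers as a fully connected layer applied independently at every pixel (the weights are copied across for the comparison):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 4, 4)                       # 3 input channels
conv1x1 = nn.Conv2d(3, 2, kernel_size=1, bias=False)

fc = nn.Linear(3, 2, bias=False)                  # acts on the channel dimension
fc.weight.data = conv1x1.weight.data.view(2, 3)   # reuse the same weights

y_conv = conv1x1(x)                                   # shape (1, 2, 4, 4)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # same math, per pixel
print(torch.allclose(y_conv, y_fc, atol=1e-6))        # True
```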

Visual example

We can look at an actual network, LeNet-5, to see what we just explained.
[Figure: the LeNet-5 architecture]
The input is a 32x32 handwritten digit image.
6@28x28 means: the first convolutional layer outputs 6 channels of size 28x28.
The second layer is a pooling layer: the number of channels stays at 6, and the size is halved to 14x14.
The third is again a convolutional layer: 16 channels, size 10x10.
The fourth is a pooling layer: 16 channels, size 5x5.
Finally come two fully connected layers, and then the output.

The first layer of LeNet-5 is a convolutional layer. The input is 32x32x1, the kernel size is 5x5, stride=1, padding=0, and the output is 6@28x28. The input here is single-channel,
that is, in_channels=1, so the depth of each filter is 1. The output is required to have 6 channels, that is, out_channels=6,
so 6 filters are needed, and we finally get 6 images of size 28x28.
As shown in the figure, this is the visualization of the entire LeNet-5 network. The blue layer is the 32x32 input; after convolution we get the next layer, the yellow one. You can see that the yellow layer is a cube; we can unfold it:
[Figure: the LeNet-5 visualization, with the first convolution's output shown as a cube]
After unfolding, there are indeed six 28x28 results.
[Figure: the six 28x28 feature maps, unfolded]
The website address of this visualization is: https://tensorspace.org/index.html
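Putting the pieces together, here is a minimal PyTorch sketch of the LeNet-5 structure described above. Two assumptions to note: the hidden widths 120 and 84 come from the classic paper (the text above does not state them), and ReLU plus max pooling are modern substitutions for the original sigmoid activations and average pooling:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1  -> 6@28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 6@28x28  -> 6@14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 6@14x14  -> 16@10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 16@10x10 -> 16@5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                     # 16*5*5 = 400
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = LeNet5()
print(net(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```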

Pooling

After understanding the convolution operation, pooling is much simpler. The pooling operation uses a kernel, say 3x3, places it over the corresponding 3x3 positions of the input image, and selects the largest of those nine numbers as the output. This is called max pooling.
Output channels = input channels
(when the input has multiple channels, each channel is pooled separately).
[Figure: max pooling over a multi-channel input]
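A quick check in PyTorch, using the 2x2 max pooling from the LeNet-5 example (channels unchanged, height and width halved):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 28, 28)  # 6 channels, as after LeNet-5's first convolution
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)  # torch.Size([1, 6, 14, 14])
```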

Full connection

The fully connected layer generally comes at the end of a convolutional neural network. Its input is the result of the preceding convolution and pooling, which must first be "flattened": the result matrix is stretched into a column vector. So how does the fully connected layer operate on this vector?
[Figure: three inputs fully connected to two outputs]
As shown in the figure, suppose x1, x2, and x3 on the left form the vector obtained after flattening. Then we compute x1 × w11 + x2 × w21 + x3 × w31 = b1.
In the same way, b2 is computed. This calculation can be expressed as a matrix operation: (b1, b2) = (x1, x2, x3) · W, where W is a 3x2 weight matrix.
[Figure: the same computation written as a matrix multiplication]
In this operation, as long as we increase the number of columns of the matrix W, we get a different number of outputs. For example, if W is 3x3, the result will be 1x3. So the fully connected layer outputs a vector, and we can define how many final outputs it has.
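The same computation in PyTorch: nn.Linear holds the weight matrix W (plus a bias term, which the diagram above omits):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3)                          # the flattened vector (x1, x2, x3)
fc = nn.Linear(in_features=3, out_features=2)  # W is 3x2, outputs are (b1, b2)
print(fc(x).shape)  # torch.Size([1, 2])
```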
So what's the point?
Fully connected (FC) layers act as the "classifier" of the whole convolutional neural network. If the convolutional layers, pooling layers, and activation functions map the raw data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space.
Doing so reduces the impact of feature position on classification. A feature map is a matrix, so the position of a feature affects classification: for example, to recognize the cat in an image, a detector tuned to the upper-left corner might find a cat there but miss one in the lower-right corner. The fully connected layer, however, integrates the whole two-dimensional matrix into a single output value: the predicted probability of a cat. No matter where the cat is, if the probability is large, there is a cat. This ignores spatial structure and enhances robustness.


Source: blog.csdn.net/holly_Z_P_F/article/details/122377935