Deep Learning - Fully Convolutional Neural Network (FCN)

1. Introduction
Fully Convolutional Networks (FCN) is a framework for semantic image segmentation proposed by Jonathan Long et al. in 2015 in the paper Fully Convolutional Networks for Semantic Segmentation, and it is pioneering work in the field of segmentation. We know that a neural network whose per-layer parameter structure is fixed at design time requires a fixed input image size. For example, AlexNet, VGGNet, GoogLeNet and other networks require a fixed-size input image to work properly. The essence of FCN is to allow an already designed network to accept input images of any size.
2. FCN network structure
The FCN network structure has two main parts: the fully convolutional part and the deconvolution part. The fully convolutional part is a classic CNN backbone (such as VGG, ResNet, etc.) used to extract features; the deconvolution part recovers a semantic segmentation map at the original resolution through upsampling. The input of FCN can be a color image of any size; the output has the same spatial size as the input, and the number of channels is n (the number of target categories) + 1 (the background).
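To make this two-part structure concrete, here is a minimal PyTorch sketch. The layer sizes, channel counts, and the single 4x transposed convolution are illustrative assumptions of mine, not the exact VGG-based architecture of the paper:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """A toy FCN: convolutional backbone + 1x1 classifier + upsampling."""
    def __init__(self, n_classes=2):  # e.g. cat and dog; +1 channel for background
        super().__init__()
        self.backbone = nn.Sequential(            # "full convolution" part
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 1/4 resolution
        )
        self.classifier = nn.Conv2d(32, n_classes + 1, 1)  # per-location class scores
        self.upsample = nn.ConvTranspose2d(                # "deconvolution" part
            n_classes + 1, n_classes + 1, kernel_size=4, stride=4)

    def forward(self, x):
        return self.upsample(self.classifier(self.backbone(x)))

x = torch.randn(1, 3, 64, 96)   # any H, W divisible by 4 works here
print(TinyFCN()(x).shape)       # torch.Size([1, 3, 64, 96]): input size, n+1 channels
```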
3. CNN and FCN network structure comparison
CNN network
If we want to design a network that distinguishes cats, dogs, and background, the CNN architecture could be as follows:
[Figure: CNN architecture for cat/dog/background classification]
If the input is a 14x14x3 color image, then as shown above it first passes through a 5x5 convolutional layer with 16 output channels, producing a 10x10x16 set of feature maps. A 2x2 pooling layer then reduces this to 5x5x16. After a Flatten, the features enter two fully connected layers of 50 units each, and finally the classification result is output.
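As a sketch, the network just described might look like this in PyTorch (the 16-channel 5x5 convolution, 2x2 pooling, and two 50-unit fully connected layers come from the text; the ReLU activations and the final 3-way output layer are my assumptions):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # 14x14x3 -> 10x10x16 (no padding)
    nn.MaxPool2d(2),                   # 10x10x16 -> 5x5x16
    nn.Flatten(),                      # 5*5*16 = 400 values
    nn.Linear(400, 50), nn.ReLU(),     # first 50-unit fully connected layer
    nn.Linear(50, 50), nn.ReLU(),      # second 50-unit fully connected layer
    nn.Linear(50, 3),                  # scores for cat / dog / background
)
print(cnn(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 3])
```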
Here, Flatten requires the feature map output by the convolutional layers to have a fixed size, because it connects every pixel of the feature map to the fully connected layer. This is what forces the whole network to have a fixed input size.
For example, in a neural network with a fully connected layer, if the input images are all the same size, the features obtained after convolution are also the same size. If the convolutional feature size is a × b and it is followed by a fully connected layer with c units, then the weight matrix between the convolutional output and the fully connected layer has size (a × b) × c. But if the input size differs from the original, the new convolutional output is a′ × b′, and the weight matrix would have to be (a′ × b′) × c instead. The size of the weight matrix has changed, so the trained weights can no longer be used.
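Reusing the cnn sketch above, we can see this failure directly: a 16x16 input flattens to 16x6x6 = 576 values, which no longer matches the 400x50 weight matrix of the first fully connected layer:

```python
try:
    cnn(torch.randn(1, 3, 16, 16))   # 16x16 -> 12x12 -> 6x6 -> 576 values, not 400
except RuntimeError as e:
    print(e)  # shape mismatch: a (a'×b')×c weight matrix would be needed instead
```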

FCN network
A fully convolutional neural network, as the name implies, is built entirely from convolutional layers, as shown in the figure below:
[Figure: FCN version of the same network]
The first two stages of this network are the same as the CNN above, but where the CNN flattens its features, the FCN instead uses a convolutional layer with a 5x5 kernel and 50 output channels, and the subsequent fully connected layers are replaced with 1x1 convolutional layers. A 1x1 convolution is in fact equivalent to a fully connected operation.
Comparing the two figures above, the main difference between the fully convolutional network and the CNN is that the FCN replaces the fully connected layers in the CNN with convolution operations.
After switching to fully convolutional operations, there is no longer a fixed neuron count imposed by a fully connected input layer, so the network can accept images of different sizes, and the training and test images no longer need to be the same size.
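A sketch of the converted network, reusing the layer sizes from the figure (again with assumed ReLU activations):

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),    # 14x14x3 -> 10x10x16, same as the CNN
    nn.MaxPool2d(2),                    # -> 5x5x16
    nn.Conv2d(16, 50, kernel_size=5),   # replaces Flatten + first Linear: -> 1x1x50
    nn.ReLU(),
    nn.Conv2d(50, 50, kernel_size=1),   # 1x1 convolution ~ fully connected layer
    nn.ReLU(),
    nn.Conv2d(50, 3, kernel_size=1),    # per-location cat/dog/background scores
)
print(fcn(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 3, 1, 1])
print(fcn(torch.randn(1, 3, 32, 32)).shape)  # a larger input also works
```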
This raises a question: if the input size differs, the output size must differ too, so how should we interpret the output of an FCN?
4. Understanding the output of the FCN network
Size change of the output feature map
Let's first look at how the feature-map size changes in the network above, ignoring the number of channels:
[Figure: feature-map size changes for a 14x14 input]
In the picture above, the input is a 14x14 image. After a 5x5 convolution (without padding), a 10x10 feature map is obtained; after a 2x2 pooling, the size is halved to a 5x5 feature map; after another 5x5 convolution, the feature map becomes 1x1. Two 1x1 convolutions (similar to fully connected operations) then follow, and the final output is a 1x1 result. That 1x1 output represents the classification of the whole 14x14 input region. For the cat/dog/background task above, the final output is a 1x1 result with 3 channels, where each value is the classification score of the 14x14 input image for the corresponding class.
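We can trace these sizes with the fcn sketch from above, printing the shape after each layer:

```python
x = torch.randn(1, 3, 14, 14)
for layer in fcn:                      # fcn is the Sequential defined earlier
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
# Conv2d    (1, 16, 10, 10)
# MaxPool2d (1, 16, 5, 5)
# Conv2d    (1, 50, 1, 1)
# ReLU      (1, 50, 1, 1)
# Conv2d    (1, 50, 1, 1)
# ReLU      (1, 50, 1, 1)
# Conv2d    (1, 3, 1, 1)   <- one 3-channel score vector for the whole 14x14 region
```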
Input images of different sizes
So, can the network really accept input of any size? Let's see what result we get when a larger picture is fed in, as shown in the figure below:
[Figure: feature-map size changes for a 16x16 input]
In the picture above, the input size has changed from 14x14 to 16x16. After a 5x5 convolution (without padding), a 12x12 feature map is obtained; after a 2x2 pooling, the size is halved to a 6x6 feature map; after another 5x5 convolution, the feature map becomes 2x2. Two 1x1 convolutions (similar to fully connected operations) then produce a 2x2 output, which represents the classification of the 16x16 image. But the output is 2x2: how does it correspond to the input?
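Running the same fcn sketch on a 16x16 input confirms this: nothing about the network changes, only the output grid grows:

```python
print(fcn(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 3, 2, 2])
# a 2x2 grid of 3-channel score vectors, one per 14x14 region of the input
```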
Which pixel corresponds to which area?
Let's look at the picture below:
insert image description here
Working backwards through the convolutions and pooling, we know from the 14x14 case above that the final 1x1 output represents the classification result of the whole 14x14 input. Tracing the receptive field of each output location in the same way, the upper-left value of the 2x2 output corresponds to the red box in the 16x16 input, the upper-right value corresponds to the yellow box, the lower-left value corresponds to the black box, and the lower-right value corresponds to the purple box, where each box is 14x14 in size. In other words, each value of the output represents the classification of one region of the input image.

