"Fully Convolutional Networks for Semantic Segmentation" study notes

1. General Concept

The core idea of FCN can be summarized in one sentence: train an end-to-end network directly supervised by the segmentation ground truth (the pixel-labeled image), so that the network makes a pixelwise prediction and directly outputs the classification result (also called a label map).
Core idea
This paper builds on three ideas from current CNNs:
(1) A fully convolutional (fully conv) network without fully connected layers (fc), which can adapt to inputs of any size.
(2) A deconvolution layer that increases the spatial size of the data, making fine-grained outputs possible.
(3) A skip structure that combines the results of layers at different depths, ensuring both robustness and accuracy.

2. How to make pixelwise prediction

A traditional CNN subsamples, so the output shrinks after each convolution and pooling stage. To make a pixelwise prediction, the output must be restored to the input size, which means upsampling is required. The solution is as follows:
(1) The last fully connected layers of a traditional network such as AlexNet or VGG are converted into convolutional layers.
In the paper's figure, the upper part shows a traditional CNN classification model: the network ends in fully connected layers that produce a prediction for a certain class. We do not need such a single prediction; replacing those layers with convolutions, as in the bottom half of the figure, yields a 16*16 feature map instead.
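The equivalence between a fully connected layer and a convolution can be checked directly: an FC layer over a flattened feature map computes the same numbers as a "valid" convolution whose kernel covers the whole map. A minimal numpy sketch (all sizes here are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sizes: a 2x2x3 feature map feeding an FC layer with 4 outputs.
H, W, C, K = 2, 2, 3, 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((H, W, C))
fc_w = rng.standard_normal((K, H * W * C))   # FC weight matrix

# FC applied to the flattened feature map
fc_out = fc_w @ feat.reshape(-1)

# Same weights reshaped into K convolution kernels of size HxWxC,
# applied at the single valid position of a "valid" convolution
conv_w = fc_w.reshape(K, H, W, C)
conv_out = np.array([(conv_w[k] * feat).sum() for k in range(K)])

print(np.allclose(fc_out, conv_out))  # the two views agree
```

Because the convolutional view slides over any input size, the converted network can produce a spatial grid of predictions (e.g. the 16*16 feature map above) instead of a single vector.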
(2) Adding an upsampling layer
This layer is implemented with bilinear interpolation, the standard two-dimensional interpolation from digital image processing. The feature map is enlarged, and a crop layer then trims the excess so the output matches the size of the ground truth, allowing a prediction for every pixel. Bilinear interpolation can be implemented as a convolution. Unlike an ordinary conv layer, the kernel of this deconv layer is not randomly initialized: one bilinear kernel of the required enlargement size is generated per class.

The point of this design is that an existing classical network can be used to initialize the weights of the first half (the classification part), after which only the second half of the network needs to be trained.

3. Network training

3.1 Network structure

(1) The first half is initialized with a classical classification network. The last fully connected layers (red) are removed and their parameters discarded. For VGG16, for example, the part in front of the fully connected layers is retained, and the trained VGG16 model parameters are used directly for initialization.
(2) The second half is illustrated with the figures from the paper.
FCN-32s upsamples directly after classification is complete; FCN-16s fuses the pool4 result with a 2x upsampling of the conv7 result; FCN-8s fuses the pool3 result with a 2x upsampling of pool4 and a 4x upsampling of conv7.
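The skip fusion above is just an elementwise sum of score maps at matching resolutions. A minimal numpy sketch of the FCN-16s wiring (the 2x upsampling here is plain repetition just to show the shapes; the paper uses a learned, bilinear-initialized deconvolution):

```python
import numpy as np

n_classes = 21
conv7_scores = np.zeros((10, 10, n_classes))   # coarsest prediction (hypothetical sizes)
pool4_scores = np.ones((20, 20, n_classes))    # finer prediction, from pool4

# 2x upsample the coarse scores, then fuse with the pool4 scores
up2 = conv7_scores.repeat(2, axis=0).repeat(2, axis=1)
fcn16_scores = pool4_scores + up2               # skip fusion for FCN-16s
print(fcn16_scores.shape)                       # (20, 20, 21)
```

FCN-8s repeats the same pattern one level shallower, fusing the pool3 scores with an upsampling of this result.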

3.2 FCN-32s

The small feature map (16*16*4096) is first turned into a small segmentation score map (16*16*21), which is then upsampled directly to a full-size image in a single step.
The stride of the deconvolution (orange) is 32, so this network is called FCN-32s.
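The stride arithmetic behind the name can be spelled out: a backbone with a total downsampling stride of 32 shrinks the input by 32x, so one stride-32 deconvolution restores the original resolution (the 512 input size below is an illustrative example consistent with the 16*16 map mentioned above):

```python
# Total downsampling stride of the backbone is 32, so a 512x512 input
# yields a 16x16 score map; one stride-32 deconvolution maps it back.
input_size = 512
total_stride = 32
score_map = input_size // total_stride      # spatial size of the score map
restored = score_map * total_stride         # size after the deconvolution
print(score_map, restored)                  # 16 512
```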

3.3 FCN-16s

Upsampling is done in two steps (orange x 2). Before the second upsampling, the predictions (blue) from the 4th pooling layer (green) are fused in; this skip structure improves accuracy.
The second deconvolution has stride 16, so this network is called FCN-16s.

3.4 FCN-8s

Upsampling is done in three steps (orange x 3). The prediction from the third pooling layer is fused in as well. The third deconvolution has stride 8, so this network is denoted FCN-8s.

Comparing the segmentation results of the structures above shows that predictions from shallower layers contain more detail, so segmentations that use them are finer. Fusing predictions from three different depths leads to the conclusion that shallower results are more refined while deeper ones are more robust. As for why an FCN-4s is not used, it is presumably not robust enough.

