FCN paper notes: Fully Convolutional Networks for Semantic Segmentation

1. Paper related information

Time: 2014
Topic: Fully Convolutional Networks for Semantic Segmentation
Paper Address: https://arxiv.org/abs/1411.4038
code: https://github.com/shelhamer/fcn.berkeleyvision.org
Author: Jonathan Long, Evan Shelhamer, Trevor Darrell

2. Paper details

Background and introduction

The powerful model structure of CNN can learn hierarchical features.
Shallow convolutional layers have small receptive fields and learn strong local (spatial) information.
Deep convolutional layers have large receptive fields and learn rich, abstract semantic information, but they are not sensitive to precise location.

After a traditional CNN passes an image through a series of convolutional and pooling layers, the feature map becomes much smaller and the final output is highly abstract. Such abstract features perform well for image classification, because classification is image-wise; but semantic segmentation requires precise localization (it is pixel-wise), which abstract semantic features alone cannot provide.
To this end, the authors propose a fully convolutional network that accepts input images of any size, produces an output of the corresponding size, and can be trained end-to-end, pixel-to-pixel, with efficient inference. The network fuses the semantic information in deep convolutional layers with the spatial information in shallow layers to obtain accurate, detailed segmentation results.

FCN architecture

[Figure: FCN architecture]
The techniques used:

1. Convolutionalization:

A network used for classification usually ends with fully connected layers, which flatten the two-dimensional feature map into one dimension and thereby discard spatial information; the network is trained to output class scores, from which the classification label is taken.

The output of semantic segmentation must be a segmentation map, which, whatever its size, is at least two-dimensional. Therefore the fully connected layers are discarded and replaced with convolutional layers; this replacement is called convolutionalization.
[Figure: replacing fully connected layers with convolutions]

As the figure shows, the top row is a traditional classification network: fully connected layers follow the convolutions and output a one-dimensional vector whose entries are the class scores. The bottom row is the convolutionalized network: the fully connected layers are removed and replaced by convolutional layers, which preserve the original spatial information. This also removes the fixed-input-size requirement imposed by fully connected layers, so the network's input and output sizes can be arbitrary.
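To make convolutionalization concrete, here is a minimal numpy sketch (all shapes and variable names are illustrative, not taken from the paper). It shows that a fully connected layer is equivalent to a convolution whose kernel covers the whole feature map, and that the same reshaped weights then slide over a larger input to produce a spatial score map:

```python
import numpy as np

def conv2d_valid(x, w):
    """Valid cross-correlation: x is (H, W, C_in), w is (kH, kW, C_in, C_out)."""
    kh, kw, cin, cout = w.shape
    H, W, _ = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((7, 7, 8))        # final feature map of a toy classifier
W_fc = rng.standard_normal((7 * 7 * 8, 10))  # fully connected layer -> 10 classes

# Fully connected output on the fixed-size input
fc_out = feat.reshape(-1) @ W_fc             # shape (10,)

# Same weights reshaped into a 7x7 convolution kernel: identical result
W_conv = W_fc.reshape(7, 7, 8, 10)
conv_out = conv2d_valid(feat, W_conv)        # shape (1, 1, 10)

# On a larger input, the convolutionalized layer yields a spatial score map
big_feat = rng.standard_normal((10, 12, 8))
heatmap = conv2d_valid(big_feat, W_conv)     # shape (4, 6, 10)
```

The reshaped weights give exactly the same scores as the fully connected layer on the original input size, while larger inputs simply produce a coarse grid of per-location class scores.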

2. Upsampling

Upsampling can be done in two ways:
1. A resize operation, i.e. interpolation as in traditional image processing.
2. A transposed convolution (conv_transpose), also called deconvolution.
FCN uses the second method, which generates the heat map shown in the figure above.
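As a minimal numpy sketch of the first (resize) approach, using nearest-neighbour interpolation for brevity (bilinear is the more common choice in practice; the function name is illustrative):

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour resize: each pixel becomes a factor x factor block."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

x = np.array([[1., 2.],
              [3., 4.]])
up2 = upsample_nearest(x, 2)  # 2x2 -> 4x4
```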

Transposed convolution swaps the forward and backward passes of an ordinary convolution in the network.
[Figure: convolution vs. transposed convolution]

More convolution and deconvolution diagrams
Although a transposed convolution layer, like an ordinary convolution layer, has trainable parameters, the authors found in their experiments that making it learnable brought no performance gain, so the learning rate of the transposed convolution layers was set to zero (they are fixed to bilinear upsampling kernels).
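A small numpy sketch of these fixed bilinear transposed-convolution kernels, modeled on the kernel-construction logic in the reference implementation (single-channel, simplified; function names are mine):

```python
import numpy as np

def bilinear_kernel(factor):
    """Bilinear upsampling kernel of size 2*factor - factor % 2,
    the kind of kernel the transposed-conv layers are fixed to."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def conv_transpose2d(x, w, stride):
    """Transposed convolution: each input pixel 'stamps' the kernel,
    scaled by its value, onto the stride-spaced output grid."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((stride * (H - 1) + k, stride * (W - 1) + k))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

# 2x upsampling of a constant image reproduces the constant away from the border
up = conv_transpose2d(np.ones((3, 3)), bilinear_kernel(2), stride=2)
```

Because the overlapping kernel taps sum to one at every interior position, a constant input stays constant in the interior, which is exactly what a sensible upsampler should do.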

3. Skip Architecture

[Figure: skip architecture combining coarse, deep features with fine, shallow features]
As shown in the figure, the output is eventually upsampled back to the same dimensions as the input, and several variants are obtained:

  • For FCN-32s, the pool5 feature is directly upsampled 32x, and softmax prediction at each point of the 32x upsampled feature gives the segmentation map.
  • For FCN-16s, the pool5 feature is first upsampled 2x, the pool4 feature is added to it point-wise, and the sum is then upsampled 16x before softmax prediction.
  • For FCN-8s, the pool4 + 2x-upsampled-pool5 sum is formed as above, upsampled 2x again, and added point-wise to the pool3 feature; that is, more features are fused. The rest proceeds as for FCN-16s, with a final 8x upsampling.
    The outputs of these different-stride upsamplings are shown below; FCN-8s gives the best result.
    [Figure: FCN-32s / FCN-16s / FCN-8s segmentation outputs]
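The skip fusion above can be sketched in numpy with toy shapes (nearest-neighbour upsampling stands in for the learned transposed convolutions; all shapes and names are illustrative):

```python
import numpy as np

def up(x, factor):
    # stand-in for learned transposed-conv upsampling
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

H = 32  # toy input side length
rng = np.random.default_rng(0)
pool3 = rng.standard_normal((H // 8, H // 8))    # stride-8 feature
pool4 = rng.standard_normal((H // 16, H // 16))  # stride-16 feature
pool5 = rng.standard_normal((H // 32, H // 32))  # stride-32 feature

fcn32s = up(pool5, 32)                 # coarsest: direct 32x upsampling

fused16 = pool4 + up(pool5, 2)         # fuse pool4 with 2x-upsampled pool5
fcn16s = up(fused16, 16)

fused8 = pool3 + up(fused16, 2)        # fuse pool3 as well
fcn8s = up(fused8, 8)                  # finest
```

All three variants end up at the input resolution; they differ only in how much fine spatial detail was fused in before the final upsampling.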

Evaluation indicators of semantic segmentation:

Intersection over union (IU) measures the overlap between predicted and ground-truth regions. Let n_ij be the number of pixels of class i predicted to belong to class j, let n_cl be the number of classes, and let t_i = Σ_j n_ij be the total number of pixels of class i. The metrics are then computed as follows:

  • Pixel accuracy: the fraction of all pixels correctly classified, Σ_i n_ii / Σ_i t_i.
  • Mean accuracy: the average of the per-class accuracies, (1/n_cl) Σ_i n_ii / t_i.
  • Mean IU: the IU of a class is the intersection of its predicted and true pixels divided by their union; mean IU averages this over all classes, (1/n_cl) Σ_i n_ii / (t_i + Σ_j n_ji − n_ii).
  • Frequency weighted IU: each class's IU weighted by that class's share of all pixels, (Σ_k t_k)^(−1) Σ_i t_i n_ii / (t_i + Σ_j n_ji − n_ii).
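All four metrics can be computed from one confusion matrix; a small numpy sketch (the example matrix and function name are made up for illustration):

```python
import numpy as np

def segmentation_metrics(n):
    """n[i, j] = number of pixels of class i predicted as class j."""
    t = n.sum(axis=1)                       # t_i: total pixels of class i
    diag = np.diag(n)                       # n_ii: correctly classified pixels
    iu = diag / (t + n.sum(axis=0) - diag)  # per-class intersection over union
    return {
        "pixel_acc": diag.sum() / t.sum(),
        "mean_acc": (diag / t).mean(),
        "mean_iu": iu.mean(),
        "fw_iu": (t * iu).sum() / t.sum(),
    }

# toy 2-class confusion matrix: 100 pixels total
n = np.array([[50.0, 5.0],
              [10.0, 35.0]])
m = segmentation_metrics(n)
```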

Summary:

Important ideas of semantic segmentation:
  • Downsampling + upsampling: convolution + deconvolution / resize
  • Multi-scale feature fusion: point-wise feature addition / channel-dimension concatenation
  • Pixel-level segmentation map: predict the category of each pixel
Two methods of feature fusion:

Concatenation along the channel dimension: the spatial dimensions must match. For example, the dense connections of DenseNet.
Point-wise addition: the number of channels must match. For example, the fusion in this paper and the shortcut of ResNet.
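The two fusion methods in a minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

a = np.ones((4, 4, 16))   # (H, W, C) feature map
b = np.zeros((4, 4, 16))  # another feature map with matching dimensions

# Concatenation along channels (DenseNet-style): spatial dims must match,
# the channel count grows
cat = np.concatenate([a, b], axis=-1)  # (4, 4, 32)

# Point-wise addition (FCN skip fusion, ResNet shortcut): channel counts
# must match, the shape is unchanged
add = a + b                            # (4, 4, 16)
```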

Reference article:

https://zhuanlan.zhihu.com/p/22976342
https://zhuanlan.zhihu.com/p/31428783

Origin blog.csdn.net/yanghao201607030101/article/details/110012082