FCN learning and understanding (Fully Convolutional Networks for Semantic Segmentation)

The paper Fully Convolutional Networks for Semantic Segmentation is a milestone in image segmentation.

To clarify the focus of my study process:
FCN open-source code
GitHub address: https://github.com/shelhamer/fcn.berkeleyvision.org
Core idea
This paper brings together three key ideas from contemporary CNN work:

  • A fully convolutional (fully conv) network without fully connected layers (fc), which can accept input of any size.

  • A deconvolution (deconv) layer that enlarges the feature maps, making fine-grained output possible.

  • A skip structure that combines results from layers of different depths, ensuring both robustness and precision.

Some highlights:

The loss function is the sum of the per-pixel losses over the spatial map of the last layer, with a softmax loss applied at each pixel.
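As a minimal sketch of this loss (written in PyTorch rather than the paper's Caffe code, with made-up tensor shapes), the per-pixel softmax cross-entropy can be summed over the spatial map like this:

```python
import torch
import torch.nn as nn

# `scores` is the network output: one 21-channel prediction per spatial position.
# `labels` holds the ground-truth class index of every pixel.
scores = torch.randn(1, 21, 64, 64)            # (batch, classes, H, W)
labels = torch.randint(0, 21, (1, 64, 64))     # (batch, H, W)

# reduction='sum' adds up the softmax cross-entropy of every pixel,
# matching "the sum of the per-pixel losses over the spatial map".
criterion = nn.CrossEntropyLoss(reduction='sum')
loss = criterion(scores, labels)
print(loss.item())
```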

The skip structure fuses the outputs of multiple (three) layers: the shallower layers should be able to predict more location information, because their receptive fields are small and they see finer detail.

When upsampling a lower-resolution layer, the result may differ in size from the earlier layer because of padding and similar effects; in that case crop it, and once the two are the same size and spatially aligned, fuse the two layers (the reference code fuses them with an elementwise sum after cropping).
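A rough sketch of this crop-then-fuse step (the reference Caffe code uses a Crop layer followed by an elementwise sum; the PyTorch code and shapes below are hypothetical):

```python
import torch

def crop_to(src, ref):
    """Center-crop `src` (N, C, H, W) so its spatial size matches `ref`."""
    dh = src.shape[2] - ref.shape[2]
    dw = src.shape[3] - ref.shape[3]
    top, left = dh // 2, dw // 2
    return src[:, :, top:top + ref.shape[2], left:left + ref.shape[3]]

# Hypothetical tensors: an upsampled coarse prediction that ended up slightly
# larger than the pool4-stage prediction because of padding.
upsampled_coarse = torch.randn(1, 21, 38, 38)
pool4_pred = torch.randn(1, 21, 34, 34)

# Crop to the same size, then fuse the spatially aligned predictions.
fused = crop_to(upsampled_coarse, pool4_pred) + pool4_pred
print(fused.shape)   # torch.Size([1, 21, 34, 34])
```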

Preliminary knowledge:
CNN and FCN
A typical CNN attaches several fully connected layers after the convolutional layers, mapping the feature maps produced by the convolutions into a fixed-length feature vector. This classic CNN structure suits image-level classification and regression tasks, where the expected output is a classification probability for the whole input image. For example, AlexNet finally outputs a 1000-dimensional vector giving the probability that the input image belongs to each category.
FCN instead classifies the image at the pixel level, which solves the problem of semantic-level image segmentation. Unlike the classic CNN, which uses fully connected layers after the convolutional layers to obtain a fixed-length feature vector for classification, FCN can accept an input image of any size. It uses a deconvolution layer to upsample the feature map of the last convolutional layer back to the size of the input image, so that a prediction can be generated for each pixel while the spatial information of the original input is preserved, and finally performs pixel-by-pixel classification on the upsampled feature map.
The fully convolutional network (FCN) recovers, from the abstract features, the category to which each pixel belongs; that is, classification is extended from the image level down to the pixel level.
FCN converts the fully connected layers of the traditional CNN into convolutional layers. As shown in the figure below, in the traditional CNN structure the first 5 layers are convolutional, the 6th and 7th layers are each a one-dimensional vector of length 4096, and the 8th layer is a one-dimensional vector of length 1000 corresponding to the probabilities of 1000 categories. FCN represents these 3 layers as convolutional layers whose kernel sizes (number of channels, width, height) are (4096, 7, 7), (4096, 1, 1) and (1000, 1, 1) respectively. Since every layer is now convolutional, the network is called a fully convolutional network.
Note: Simply put, the difference between FCN and CNN is that FCN replaces the final fully connected layers of the CNN with convolutional layers, so its output is a labeled image rather than a class vector.
Network Structure
The network structure is as follows. The input can be a color image of any size; the output has the same height and width as the input, and its depth is 21 (20 object classes + background; the PASCAL VOC dataset has 20 object categories).
Full convolution - feature extraction
The part above the dotted line is the fully convolutional network (blue: convolution, green: max pooling). For input images of different sizes, the size (height, width) of each layer's data changes accordingly, while the depth (number of channels) stays fixed.
This part is adapted from AlexNet, the classic network for deep-learning classification; the only change is that the last two fully connected (fc) layers are turned into convolutional layers.
In the paper, the classification network that achieves the highest accuracy is VGG16, but the structure illustrated here is based on AlexNet, which makes the figure easier to draw.
Converting fully connected layers to convolutional layers: of the two possible conversions (conv to fc and fc to conv), converting fully connected layers to convolutional layers is the more useful in practice. Suppose the input of a convolutional neural network is a 224x224x3 image, and a series of convolution and downsampling layers turn it into an activation volume of size 7x7x512. AlexNet then uses two fully connected layers of size 4096, and a final fully connected layer with 1000 neurons that computes the classification scores. Each of these three fully connected layers can be converted into a convolutional layer:
For the first fully connected layer, which looks at the [7x7x512] volume, set the filter size to F=7, so the output volume is [1x1x4096].
For the second fully connected layer, set the filter size to F=1, so the output volume is [1x1x4096].
Do the same for the last fully connected layer, with F=1, so the final output is the [1x1x1000] vector of class scores.
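Below is a sketch of these three converted layers, written in PyTorch for illustration with the channel counts from the text (it is not the released model). It also shows the point of the conversion: a larger input simply yields a larger score map instead of an error.

```python
import torch
import torch.nn as nn

# fc6 -> 7x7 convolution, fc7 -> 1x1 convolution, fc8 -> 1x1 convolution.
fc6_as_conv = nn.Conv2d(512, 4096, kernel_size=7)    # [7x7x512] -> [1x1x4096]
fc7_as_conv = nn.Conv2d(4096, 4096, kernel_size=1)   # [1x1x4096] -> [1x1x4096]
fc8_as_conv = nn.Conv2d(4096, 1000, kernel_size=1)   # [1x1x4096] -> [1x1x1000]

x = torch.randn(1, 512, 7, 7)                         # the 7x7x512 activation volume
out = fc8_as_conv(fc7_as_conv(fc6_as_conv(x)))
print(out.shape)                                      # torch.Size([1, 1000, 1, 1])

# Because these are convolutions, a larger input produces a spatial grid of scores:
x_big = torch.randn(1, 512, 14, 14)
print(fc8_as_conv(fc7_as_conv(fc6_as_conv(x_big))).shape)  # torch.Size([1, 1000, 8, 8])
```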
Below the dotted line, convolutional layers (blue ×3) predict 21-channel classification results from different stages of the convolutional network.
Example: the input of the first prediction module is 16×16×4096, the convolution kernel size is 1×1, and the output is 16×16×21.
This is equivalent to applying a fully connected layer at each pixel position, predicting 21 class scores from the 4096-dimensional features.
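A small sketch of that prediction module (PyTorch, shapes taken from the example above): a 1×1 convolution acting as a per-pixel fully connected classifier.

```python
import torch
import torch.nn as nn

# 1x1 convolution mapping 4096-dimensional features to 21 class scores at every position.
score_layer = nn.Conv2d(4096, 21, kernel_size=1)

features = torch.randn(1, 4096, 16, 16)   # 16x16x4096 input from the example
scores = score_layer(features)
print(scores.shape)                        # torch.Size([1, 21, 16, 16])
```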
How the pixel-by-pixel classification is predicted:
Reference blog: http://www.cnblogs.com/gujianhan/p/6030639.html

The deconvolution layer upsamples the feature map of the last convolutional layer back to the size of the input image, so that a prediction can be generated for each pixel while the spatial information of the original input is preserved; pixel-by-pixel classification is then performed on the upsampled feature map.
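A sketch of such an upsampling layer, assuming PyTorch: the paper initializes its deconvolution filters with bilinear interpolation weights, which the helper below reproduces; the 32× factor and shapes are for illustration (FCN-32s style).

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Bilinear-interpolation weights, the initialization FCN uses for its deconv layers."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    weight[range(channels), range(channels), :, :] = filt
    return torch.from_numpy(weight)

# A 32x upsampling deconvolution for the 21-channel score map.
upscore = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, bias=False)
upscore.weight.data.copy_(bilinear_kernel(21, 64))

coarse = torch.randn(1, 21, 7, 7)      # e.g. the H/32 x W/32 heatmap of a 224x224 input
print(upscore(coarse).shape)           # torch.Size([1, 21, 256, 256]); cropped back to 224x224 in practice
```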

Specific process:
After multiple convolutions and pooling operations, the feature maps become smaller and smaller and their resolution lower and lower. When the map reaches H/32 × W/32, the smallest layer, the resulting output is called a heatmap; this heatmap is the key high-dimensional feature map.

After obtaining the heatmap of high-dimensional features, the most important and final step is upsampling: enlarging the heatmap back to the size of the original image.

(That is, the high-dimensional feature map is translated back into a segmentation of the original image.)
The final output is 21 heatmaps upsampled to the size of the original image. To classify each pixel and produce the semantically segmented image, a small trick is used: for each pixel position, take the channel with the largest value (probability) among the 21 heatmaps as that pixel's class. This yields the classified image, such as the dog-and-cat segmentation shown on the right of the figure.
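That per-pixel "take the maximum" trick is just an argmax over the channel dimension; a small sketch (PyTorch, hypothetical shapes):

```python
import torch

# 21 upsampled heatmaps at the original image size.
upsampled_scores = torch.randn(1, 21, 224, 224)

# For each pixel, pick the channel with the highest score as that pixel's class.
segmentation = upsampled_scores.argmax(dim=1)     # (1, 224, 224) map of class indices 0..20
print(segmentation.shape, segmentation.unique())
```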
Deconvolution - Upsampling
Reprinted: https://blog.csdn.net/qq_36269513/article/details/80420363
