Semantic Segmentation Series (1): Understanding FCN

In studying this paper, I focused mainly on three questions:

1. Why can a fully convolutional network accept input images of any size?
2. Why convert the fully connected layers into convolutions?
3. How does deconvolution upsample?

Paper: https://arxiv.org/pdf/1605.06211.pdf

1. Why can a fully convolutional network accept input images of any size?

Because the parameters of a convolutional layer depend only on the kernel size and the number of input and output channels, not on the spatial size of the image.

The only difference between a convolutional layer and a fully connected layer is that the neurons of a convolutional layer are connected only to a local region of the input, and the neurons within the same channel share their weights.

Both convolutional layers and fully connected layers compute dot products, so they have the same functional form. A convolutional layer can therefore be converted into an equivalent fully connected layer, and a fully connected layer can be converted into an equivalent convolutional layer.

For example:

In VGGNet [1], the first fully connected layer takes a 7 * 7 * 512 input and produces a 4096-dimensional output. This can be represented equivalently by a convolutional layer with a 7 * 7 kernel, stride 1, no padding, and 4096 output channels; its output is 1 * 1 * 4096, exactly the same as the fully connected layer's. The subsequent fully connected layers can likewise be replaced by equivalent 1 * 1 convolutions.

The usual way to convert a fully connected layer into a convolutional layer is to set the kernel size to the spatial size of the input. The advantage is that a convolutional layer places no constraint on the input size, so the network can be evaluated efficiently on a larger test image in a sliding-window fashion.
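As a minimal sketch (PyTorch is my assumption here, not something the paper prescribes), the fc6 equivalence above can be checked numerically:

```python
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7, stride=1, padding=0)

# Reuse the FC weights as the convolution kernel: identical parameters, reshaped.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))          # shape (1, 4096)
out_conv = conv(x)                 # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-4))  # True

# The convolutional form also accepts larger inputs and slides over them:
print(conv(torch.randn(1, 512, 14, 14)).shape)  # (1, 4096, 8, 8)
```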

2. Why convert the fully connected layers into convolutions?

  1. Too many parameters. If the input image is 1000 * 1000 pixels, the input layer has 1000 * 1000 nodes. Suppose the first hidden layer has only 100 nodes (not a large number); this one layer alone already has (1000 * 1000 + 1) * 100 = 100,000,100 parameters, which is far too many. Enlarging the image even slightly adds many more parameters, so fully connected networks scale poorly.
  2. Positional information between pixels is ignored. In image recognition, each pixel is closely related to its neighbors, while distant pixels may be almost unrelated. If a neuron is connected to every neuron in the previous layer, all pixels of the image are treated alike, which contradicts this locality assumption. After learning the weight of every connection, we would likely find that a great many of them are very small (i.e., those connections barely matter). Spending effort learning large numbers of unimportant weights is very inefficient.
  3. The number of layers is limited. In general, the more layers a network has, the stronger its expressive power. But training a fully connected neural network by gradient descent is difficult because the gradients struggle to propagate through more than three fully connected layers. We therefore cannot build a deep fully connected neural network, which limits its capability.

Convolutional neural networks solve these three problems with three main ideas (a numerical illustration follows the list):

  1. Local connections. This is the most natural idea: each neuron is connected not to all neurons in the previous layer but only to a small local region. This reduces the number of parameters.
  2. Weight sharing. A group of connections can share the same weights rather than each connection having its own weight, which reduces the number of parameters further.
  3. Downsampling. Pooling can be used to reduce the number of samples in each layer, further reducing the number of parameters while also improving the robustness of the model.
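A rough numerical check of the first two ideas (an assumed toy setup, not from the paper): the same 100 hidden units cost vastly fewer parameters when connections are local and weights are shared:

```python
import torch.nn as nn

fc = nn.Linear(1000 * 1000, 100)          # one weight per pixel per unit
conv = nn.Conv2d(1, 100, kernel_size=3)   # local 3x3 connections, weights shared spatially

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(fc))    # 100000100 = (1000*1000 + 1) * 100
print(n_params(conv))  # 1000      = (3*3*1 + 1) * 100
```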

3. How does deconvolution upsample?

I have not yet fully worked through the low-level details of the deconvolution operation, but at the level of the network architecture it is understandable.

First, the output-size formulas for convolution and deconvolution:

k: convolution kernel size, s: stride, p: padding, X * X: input image size

Convolution output size: (X - k + 2 * p) / s + 1

Deconvolution output size: (X - 1) * s + k (assuming p = 0)
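These formulas can be checked quickly (a sketch, assuming PyTorch, whose ConvTranspose2d is the deconvolution here):

```python
import torch
import torch.nn as nn

X, k, s, p = 224, 7, 2, 3
conv = nn.Conv2d(1, 1, kernel_size=k, stride=s, padding=p)
print(conv(torch.zeros(1, 1, X, X)).shape[-1])    # (224 - 7 + 2*3) // 2 + 1 = 112

X, k, s = 7, 64, 32
deconv = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=s)
print(deconv(torch.zeros(1, 1, X, X)).shape[-1])  # (7 - 1) * 32 + 64 = 256
```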

 

Walking through the network (FCN-32s):

  1. Taking AlexNet as the example, P5 -> F6 becomes P5 -> C6. In some code I have seen the convolution kernel is 6 * 6, in others 7 * 7 (this is related to the dataset. PS: my personal impression).
  2. F6 -> F7 becomes C6 -> C7, a convolutional layer with 1 * 1 kernels.
  3. This is followed by a C8 layer producing 21 feature maps (the paper uses the VOC dataset: 20 classes + 1 background = 21). With a different dataset, the number of feature maps finally generated should differ accordingly.
  4. After C8, a deconvolution performs the upsampling, with a ratio of 32 (the upsampling ratio equals the cumulative downsampling (pooling) ratio of the network).
  5. The network has a crop layer so that the enlarged image exactly matches the size of the original input image (a code sketch follows this list).
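A minimal sketch of this FCN-32s head (PyTorch assumed; the 6 * 6 kernel follows point 1 above, while the padding and the 64/32 deconvolution parameters are my own choices to make the sizes work out, not necessarily the authors'):

```python
import torch.nn as nn

class FCN32sHead(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # P5 -> C6 (was F6): 256 = AlexNet pool5 channels; padding is an assumption
        self.c6 = nn.Conv2d(256, 4096, kernel_size=6, padding=3)
        self.c7 = nn.Conv2d(4096, 4096, kernel_size=1)          # F7 -> C7: 1 * 1 conv
        self.c8 = nn.Conv2d(4096, num_classes, kernel_size=1)   # 21 score maps for VOC
        # One deconvolution upsamples 32x: output size = (X - 1) * 32 + 64
        self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=64, stride=32, bias=False)

    def forward(self, pool5, out_h, out_w):
        x = self.c8(self.c7(self.c6(pool5)))
        x = self.up32(x)
        return x[:, :, :out_h, :out_w]   # crop back to the original input size
```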

The modifications above give FCN-32s: the feature map from the final convolution is enlarged 32 times in one step. The results are not very good, which is why FCN-16s and FCN-8s appeared.

FCN-16s:

Again taking AlexNet as the example (first enlarge 2 times, then enlarge 16 times):

  1. Take the output of pool4 and generate an additional 21 score maps from it with a 1 * 1 convolution.
  2. Upsample the C8 feature map 2 times with a deconvolution, then crop it.
  3. Add the two results together (elementwise addition of the corresponding feature map values).
  4. Upsample the sum 16 times, obtaining a feature map enlarged 32 times overall, and crop it against the original image (see the sketch after this list).
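A sketch of these four steps (PyTorch assumed; the 512 channels of pool4 are a backbone-dependent assumption, as are the deconvolution kernel sizes):

```python
import torch.nn as nn

score_pool4 = nn.Conv2d(512, 21, kernel_size=1)  # step 1: 21 score maps from pool4
up2  = nn.ConvTranspose2d(21, 21, kernel_size=4,  stride=2,  bias=False)  # 2x up
up16 = nn.ConvTranspose2d(21, 21, kernel_size=32, stride=16, bias=False)  # 16x up

def fcn16s(pool4, c8_scores, out_h, out_w):
    a = up2(c8_scores)                   # step 2: upsample the C8 score maps 2x
    b = score_pool4(pool4)
    a = a[:, :, :b.size(2), :b.size(3)]  # crop to match before adding
    out = up16(a + b)                    # steps 3-4: elementwise add, then 16x up
    return out[:, :, :out_h, :out_w]     # final crop to the input size
```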

FCN-8s:

Again taking AlexNet as the example (first enlarge 2 times, then 2 times again, and finally 8 times; the crop operations are not repeated here, these are mainly the steps):

  1. Generate 21 score maps from the output of pool4 with a 1 * 1 convolution.
  2. Add them to the 2x-upsampled C8 feature map to obtain a feature map X.
  3. Upsample X 2 times and add it to the result of pool3 (also scored with a 1 * 1 convolution) to obtain a feature map Y.
  4. Upsample Y 8 times to obtain the required feature map, then crop it against the original image to get the final result (a sketch continuing the FCN-16s code follows).
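Continuing the FCN-16s sketch above (it reuses up2 and score_pool4 from that code; the 256 channels of pool3 are again a backbone-dependent assumption):

```python
score_pool3 = nn.Conv2d(256, 21, kernel_size=1)  # 21 score maps from pool3
up2b = nn.ConvTranspose2d(21, 21, kernel_size=4,  stride=2, bias=False)
up8  = nn.ConvTranspose2d(21, 21, kernel_size=16, stride=8, bias=False)

def fcn8s(pool3, pool4, c8_scores, out_h, out_w):
    x = up2(c8_scores)                       # 2x-upsampled C8 scores
    b = score_pool4(pool4)
    x = x[:, :, :b.size(2), :b.size(3)] + b  # feature map X (steps 1-2)
    y = up2b(x)                              # upsample X by 2
    c = score_pool3(pool3)
    y = y[:, :, :c.size(2), :c.size(3)] + c  # feature map Y (step 3)
    out = up8(y)                             # step 4: upsample 8x
    return out[:, :, :out_h, :out_w]         # crop against the original image
```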

For more semantic segmentation papers and related code, see: https://github.com/mrgloom/awesome-semantic-segmentation
