Plain-Language Detailed Interpretation (5): U-Net

I. Introduction

FCN is the pioneering semantic segmentation model, and U-Net, as a derivative of FCN, resembles it in many ways (for FCN, see my earlier article: https://blog.csdn.net/dongjinkun/article/details/109586004). For example, both discard the parameter-heavy fully connected layers in favor of a fully convolutional design. Most semantic segmentation networks today are modifications of U-Net's "elegant" architecture, so everyone should have a sense of just how important U-Net is. Follow along as we dig into the elegant U-Net model step by step!

II. U-Net Network Structure

[Figure: U-Net network structure]

  • 1. U-Net consists of a contracting path and an expanding path. The contracting path extracts features with convolutional layers; the expanding path upsamples the feature maps so that the image can be predicted at the pixel level.
  • 2. U-Net uses the same skip structure as FCN (the "copy and crop" in the paper): it combines the local, detailed features from shallow convolutional layers with the global, abstract features from deep layers to segment the image more accurately.
  • 3. Unlike FCN, U-Net does not simply add the two feature maps; it first concatenates them into a feature map with twice the channels, and then convolves.
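The difference between FCN-style addition and U-Net-style concatenation can be sketched with NumPy. The shapes below are illustrative placeholders, not the paper's exact dimensions:

```python
import numpy as np

# Two feature maps with the same spatial size and channel count
# (channels-first layout: C x H x W); shapes chosen for illustration.
shallow = np.random.rand(64, 200, 200)  # cropped map from the contracting path
deep = np.random.rand(64, 200, 200)     # upsampled map from the expanding path

# FCN-style fusion: element-wise addition, channel count unchanged
fused_add = shallow + deep
print(fused_add.shape)  # (64, 200, 200)

# U-Net-style fusion: concatenation along the channel axis doubles
# the channels; a following convolution then mixes the two halves
fused_cat = np.concatenate([shallow, deep], axis=0)
print(fused_cat.shape)  # (128, 200, 200)
```

In a real network the concatenated map is immediately passed through a convolution, which learns how to weight the shallow and deep features instead of fixing them to equal weight as addition does.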

III. Overlap-tile Strategy

Careful readers may notice that U-Net is not a strictly symmetric structure: feature maps at the same level differ in size (number the levels 0, 1, 2, 3, 4 from top to bottom). For example, at level 1 the feature map on the contracting path is 284*284, while the feature map on the expanding path is 200*200. Why does this happen? Because U-Net uses valid convolution (in other words, no padding; with padding it would be same convolution), and with many valid convolutional layers the feature maps keep shrinking. Clearly, if we feed in the original image, the output of U-Net will be smaller than the input. But for pixel-level prediction we need input image size = output image size, so how do we keep them equal?
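The shrinkage arithmetic is easy to check: each 3x3 valid convolution trims kernel-1 = 2 pixels from every side length. A tiny helper (my own sketch, not code from the paper) reproduces the sizes in the architecture diagram:

```python
def valid_conv_out(size, kernel=3, n_convs=1):
    """Side length after n_convs 'valid' (no padding, stride 1) convolutions."""
    for _ in range(n_convs):
        size = size - (kernel - 1)
    return size

# U-Net applies two 3x3 valid convolutions at each level:
print(valid_conv_out(572, n_convs=2))  # 568: the first two convs of U-Net
print(valid_conv_out(284, n_convs=2))  # 280: level 1 of the contracting path
```

This is why the 572*572 input ends up as a 388*388 output, and why the contracting-path map must be cropped before it is concatenated with the expanding-path map.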

  • The simplest way is of course to use same convolution, but the paper does not. Evidently same convolution has drawbacks: the deeper the model, the more abstract the feature maps, and the effect of padding accumulates layer by layer.
  • Another way is to directly upsample the final output feature map, using either interpolation or transposed convolution. Interpolation is not learnable, so it introduces some error; transposed convolution adds parameters, and the model may not learn them well.
    How does the paper solve this problem?
    The paper proposes a strategy called overlap-tile: pad the image before it enters the network so that the final output matches the original image in size. In particular, this padding is mirror padding, which provides context when predicting the border regions. In the figure below, the right side is the original image and the left side is the image after mirror padding.
    The overlap-tile strategy can be combined with patching (splitting the image). When memory is limited and the whole large image cannot be predicted at once, first mirror-pad the image, then split the padded image into fixed-size patches in order. This allows seamless segmentation of arbitrarily large images, while each patch still gets its surrounding context. In addition, when data is scarce, splitting each image into multiple patches effectively enlarges the dataset. More importantly, this strategy does not rescale the original image: every pixel keeps its original value, so no error is introduced by scaling.
    [Figure: overlap-tile — right: original image; left: image after mirror padding]
    Tips:
    [Figure: illustration of mirror padding]
    Each color in the mirror-padding illustration marks one step. Take red as an example: with the column [1, 3] as the reference, the column to its right is [2, 4], so after mirroring, the column to its left should obviously also be [2, 4]. Repeating this gives the full mirror padding.
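The mirroring rule above is exactly NumPy's `reflect` padding mode, so a tiny example can confirm it:

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

# 'reflect' mirrors about the border pixel without repeating it:
# the value just inside the image is copied just outside.
padded = np.pad(img, pad_width=1, mode='reflect')
print(padded)
# [[4 3 4 3]
#  [2 1 2 1]
#  [4 3 4 3]
#  [2 1 2 1]]
```

Reading off the columns: the reference column [1, 3] has [2, 4] on its right, and the mirrored column on its left is also [2, 4], matching the rule described above.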

IV. Explanation of Related Concepts

Attentive readers may notice several unfamiliar concepts in the experimental results table: pixel error, rand error, and warping error.

4.1 pixel error

This is the simplest way to evaluate image segmentation: compare the predicted labels with the ground-truth labels, and divide the number of mislabeled pixels by the total number of pixels. That ratio is the pixel error.
Its advantage is simplicity, but it is overly sensitive to positional shifts: shifts invisible to the naked eye can produce a large pixel error.
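The definition above is a one-liner in NumPy (the arrays here are made-up toy labels):

```python
import numpy as np

def pixel_error(pred, gt):
    """Fraction of pixels whose predicted label differs from the ground truth."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    return np.mean(pred != gt)

pred = np.array([[0, 1],
                 [1, 1]])
gt = np.array([[0, 1],
               [0, 1]])
print(pixel_error(pred, gt))  # 0.25: one of the four pixels is wrong
```

The sensitivity to shifts is easy to see from this formula: shifting a thin predicted boundary by a single pixel flips every pixel along it, even though the segmentation looks essentially identical.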

4.2 rand error

[Figure: definition of rand error]

4.3 warping error

[Figure: definition of warping error]

V. References

Origin blog.csdn.net/dongjinkun/article/details/109723161