Detailed explanation of the Unet network


  • Reference: U-Net: Convolutional Networks for Biomedical
    Image Segmentation
  • Authors: Olaf Ronneberger, Philipp Fischer, and Thomas Brox

What is the Unet model

Unet is an excellent semantic segmentation model, and its main execution pipeline is similar to that of other semantic segmentation models. The difference from an ordinary classification CNN is that such a CNN performs image-level classification, while Unet performs pixel-level classification: its output is the category of each pixel.

Unet loss function

Main components: softmax activation function + weighted cross-entropy loss function + weight calculation function

softmax activation function

At each pixel, the network produces one raw activation per class, so after softmax the number of output values equals the number of categories in the label. Softmax transforms the activations at each pixel into a probability distribution whose values are positive and sum to 1, giving the confidence of each class at each pixel.
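In the paper's notation, with $a_k(x)$ the activation in feature channel $k$ at pixel position $x \in \Omega$ and $K$ the number of classes, the pixel-wise softmax is

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$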

Cross-entropy loss function

The cross-entropy loss function measures the difference between two probability distributions:

$$E = -\sum_{c=1}^{K} y_c \log(p_c)$$

In the above formula, $y_c$ represents the true distribution of the sample (its value is either 0 or 1), and $p_c$ represents the predicted probability of class $c$.

This paper uses a cross-entropy loss function with boundary weights:
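Written out (eq. (1) of the paper, with the sign chosen so that $E$ is minimized):

$$E = -\sum_{x \in \Omega} w(x)\,\log\big(p_{\ell(x)}(x)\big)$$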

$p$ is the output value after softmax processing;

$\ell : \Omega \to \{1, \dots, K\}$ is the true label of each pixel;

$p_{\ell(x)}(x)$ is the softmax output at pixel $x$ in the channel of the class given by its true label $\ell(x)$;

$w : \Omega \to \mathbb{R}$ is a weight map that assigns each pixel a weight during training.

weight calculation function

The formula for $w(x)$ is mainly based on the form of a normal (Gaussian) distribution.
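Concretely, the paper defines (its eq. (2)), with $w_0 = 10$ and $\sigma \approx 5$ pixels:

$$w(x) = w_c(x) + w_0 \cdot \exp\!\left(-\frac{\big(d_1(x) + d_2(x)\big)^2}{2\sigma^2}\right)$$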
$w_c(x)$ is precomputed for each ground-truth segmentation to compensate for the different frequencies of pixels of each class in the training dataset;

$d_1(x)$ is the distance to the border of the nearest cell;

$d_2(x)$ is the distance to the border of the second-nearest cell.

When both $d_1$ and $d_2$ equal 0, the exponential term reaches its maximum value $w_0$; the smaller $d_1$ and $d_2$ are, the larger this term, and hence the larger the overall weight. Small $d_1$ and $d_2$ mean the pixel lies close to two cell borders at once, so the narrow background gaps between touching cells receive the largest weights, which forces the network to learn to separate touching cells.

The role of the weight map: it adjusts the importance of certain regions of the image when the loss is computed. Extra weight is placed on the border regions where two cells touch, so that the network pays more attention to this kind of touching-edge information.
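A minimal NumPy/SciPy sketch of the border term of this weight map (the class-balancing term $w_c$ is omitted, and the instance-mask format and function name are assumptions, not code from this blog):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def border_weight_map(instance_mask, w0=10.0, sigma=5.0):
    """Border term of the Unet weight map (eq. (2) of the paper, sketch).

    instance_mask: 2D int array, 0 = background, 1..N = cell instances.
    """
    cell_ids = [i for i in np.unique(instance_mask) if i != 0]
    if len(cell_ids) < 2:
        # Fewer than two cells: no gap between touching cells to weight.
        return np.zeros(instance_mask.shape, dtype=np.float64)
    # For each cell, the distance of every pixel to that cell (0 inside it).
    dists = np.stack([distance_transform_edt(instance_mask != i)
                      for i in cell_ids])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]  # nearest and second-nearest cell
    return w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
```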

Summary: first use softmax to obtain the confidence of each class at each pixel, then use cross-entropy to measure the gap between the prediction and the label.
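As a sketch of how these pieces combine in PyTorch (the function name and shapes are assumptions; note that `F.cross_entropy` already applies log-softmax to raw logits):

```python
import torch
import torch.nn.functional as F

def weighted_ce_loss(logits, target, weight_map):
    """logits: (N, K, H, W) raw scores; target: (N, H, W) class indices;
    weight_map: (N, H, W) per-pixel weights w(x)."""
    per_pixel = F.cross_entropy(logits, target, reduction='none')  # (N, H, W)
    return (per_pixel * weight_map).mean()
```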

Unet main structure

(Figure: the overall Unet architecture, divided into three parts)

Unet can be divided into three parts, as shown in the figure above:

The first part is the backbone feature-extraction network, which produces a series of feature layers. Unet's backbone is similar to VGG: a stack of convolutions and max pooling. From it we obtain five preliminary effective feature layers, which are used for feature fusion in the second part.

The second part is the enhanced feature-extraction network. The five preliminary effective feature layers from the backbone are upsampled and fused step by step into one final effective feature layer that integrates all the features.

The third part is the prediction part. The final effective feature layer is used to classify each feature point, which is equivalent to classifying each pixel.

Backbone Feature Extraction Network

The backbone feature-extraction part of Unet consists of convolution layers and max-pooling layers, and its overall structure is similar to VGG.

When the input image size is 512x512x3, the execution is as follows:
1. conv1: apply two 3x3 convolutions with 64 channels to obtain a preliminary effective feature layer of [512,512,64], then 2x2 max pooling to obtain a [256,256,64] feature layer.
2. conv2: apply two 3x3 convolutions with 128 channels to obtain a preliminary effective feature layer of [256,256,128], then 2x2 max pooling to obtain a [128,128,128] feature layer.
3. conv3: apply three 3x3 convolutions with 256 channels to obtain a preliminary effective feature layer of [128,128,256], then 2x2 max pooling to obtain a [64,64,256] feature layer.
4. conv4: apply three 3x3 convolutions with 512 channels to obtain a preliminary effective feature layer of [64,64,512], then 2x2 max pooling to obtain a [32,32,512] feature layer.
5. conv5: apply three 3x3 convolutions with 512 channels to obtain a preliminary effective feature layer of [32,32,512].
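A minimal PyTorch sketch of this VGG-style backbone (padded 3x3 convolutions so the spatial sizes match the numbers above; class and function names are illustrative, not the blog's actual code):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs repetitions of 3x3 conv + ReLU, keeping spatial size."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class VGGBackbone(nn.Module):
    """Produces the five preliminary effective feature layers listed above."""
    def __init__(self):
        super().__init__()
        self.conv1 = conv_block(3, 64, 2)      # -> [512,512,64]
        self.conv2 = conv_block(64, 128, 2)    # -> [256,256,128]
        self.conv3 = conv_block(128, 256, 3)   # -> [128,128,256]
        self.conv4 = conv_block(256, 512, 3)   # -> [64,64,512]
        self.conv5 = conv_block(512, 512, 3)   # -> [32,32,512]
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(self.pool(f1))
        f3 = self.conv3(self.pool(f2))
        f4 = self.conv4(self.pool(f3))
        f5 = self.conv5(self.pool(f4))
        return f1, f2, f3, f4, f5
```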

Why 572x572, not 512x512?


Because a patch at the image border has no surrounding pixels, convolution loses information at the edge of the image, so the border pixels are extrapolated by mirror expansion (reflecting the image at its borders).
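A quick way to see the numbers: mirror-padding a 512x512 tile by 30 pixels on each side gives the paper's 572x572 input (NumPy sketch):

```python
import numpy as np

tile = np.zeros((512, 512))                # one input tile
padded = np.pad(tile, 30, mode='reflect')  # mirror the borders outward
print(padded.shape)                        # (572, 572)
```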

Enhanced feature extraction structure

Using the first part, we obtain five preliminary effective feature layers. In the enhanced feature-extraction network, these five layers are used for feature fusion: a feature layer is upsampled and then stacked (concatenated along the channel dimension) with the corresponding backbone feature layer.

To make the network easier to build and more versatile, our Unet differs slightly from the structure in the figure above: when upsampling, we directly upsample by a factor of two and then perform feature fusion, so the final feature layer has the same height and width as the input image.
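A sketch of one such fusion step in PyTorch (bilinear upsampling is one common choice for the "twice upsampling"; a transposed convolution would also work, and the names and channel arguments are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsample x2, concatenate with the skip feature, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)  # stack along the channel dimension
        return self.conv(x)
```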


Use features to get predictions

The process of using the features to obtain the prediction result is: apply a 1x1 convolution to the final effective feature layer to adjust its number of channels to num_classes.
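In PyTorch this head is a single layer (the 64 input channels assume the final fused feature layer above has 64 channels; adjust as needed):

```python
import torch.nn as nn

num_classes = 2                                    # e.g. cell vs. background
head = nn.Conv2d(64, num_classes, kernel_size=1)   # channel adjustment only
```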


code reproduction

Dataset: ISBI

Model training:

(Figure: training results)
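For reference, a minimal training loop over the pieces sketched above (`weighted_ce_loss` is the function from the loss section; `model` and the dataset are hypothetical stand-ins for the reproduction code, and the Adam optimizer and learning rate are assumptions, not necessarily the settings used here):

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=40, lr=1e-4, device='cuda'):
    # dataset is assumed to yield (image, label, weight_map) triples.
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        total = 0.0
        for image, target, weight_map in loader:
            image, target = image.to(device), target.to(device)
            weight_map = weight_map.to(device)
            loss = weighted_ce_loss(model(image), target, weight_map)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        print(f'epoch {epoch}: mean loss {total / len(loader):.4f}')
```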

Test set:
(Figures: predicted segmentations on the test set)


Source: blog.csdn.net/qq_58529413/article/details/125704059