"U-Net: Convolutional Networks for Biomedical Image Segmentation" study notes

1. General

In the 2015 paper "U-Net: Convolutional Networks for Biomedical Image Segmentation", the authors proposed a network that can be trained on a small amount of data and still achieves good segmentation accuracy at high speed: segmenting a 512×512 image takes less than a second on a GPU. The network structure is shown below; because it looks like a "U", it is called U-Net.
[Figure: U-Net architecture diagram]
From the network diagram above, the contracting (left) path is built from repeated 3×3 convolutions followed by 2×2 max pooling, and the number of feature channels is doubled at each downsampling step. In the expanding (right) path, each step upsamples with a 2×2 up-convolution and concatenates the correspondingly cropped feature map copied from the contracting path (cropping is necessary because every unpadded convolution loses border pixels). This structure lets the network propagate context information to the higher-resolution layers during upsampling. A minimal sketch of one such step is shown below.
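As a minimal sketch of one contracting step and the matching up-convolution, written in PyTorch (the paper's original implementation was Caffe, so all names and shapes here are illustrative):

```python
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two unpadded 3x3 convolutions + ReLU, as in the paper;
    # each convolution trims 1 border pixel on every side.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),  # no padding
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

# Contracting step: convolve, then 2x2 max pool; channels double (64 -> 128).
enc = double_conv(64, 128)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Expanding step: 2x2 up-convolution that halves the channels (128 -> 64);
# its output is concatenated with the cropped feature map from the left path.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
```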
The overall idea is similar to FCN. One difference is that U-Net does not use a model such as VGG pre-trained on ImageNet: U-Net performs binary segmentation of medical images, where an ImageNet pre-trained model brings little benefit, and U-Net's structure can be freely deepened to fit one's own dataset, for example when the targets require a larger receptive field. Another difference is how shallow features are fused: U-Net stacks them by channel concatenation (the white blocks in the figure above, copied directly from the blue blocks on the left), whereas FCN uses an element-wise sum. In Caffe terms, U-Net uses a Concat layer while FCN uses an Eltwise layer.
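The difference between the two fusion styles is easy to see in code; a small PyTorch illustration (shapes made up for the example):

```python
import torch

skip = torch.randn(1, 64, 56, 56)  # center-cropped feature map from the contracting path
up   = torch.randn(1, 64, 56, 56)  # upsampled feature map of the same spatial size

fused_unet = torch.cat([skip, up], dim=1)  # U-Net / Caffe Concat: 128 channels
fused_fcn  = skip + up                     # FCN / Caffe Eltwise: still 64 channels
print(fused_unet.shape, fused_fcn.shape)   # [1, 128, 56, 56] vs [1, 64, 56, 56]
```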
[Figure: overlap-tile strategy for seamless segmentation of arbitrarily large images]
The network accepts an image of any size as input; missing context at the image border is supplied by mirroring. For example, to segment the yellow region on the right of the figure above, the network needs the data inside the blue box on the left; where that context extends beyond the image, it is extrapolated by mirror padding.
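A small NumPy sketch of the mirror extrapolation. The padding width of 92 pixels follows from the paper's 572×572 input / 388×388 output geometry, i.e. the unpadded convolutions consume 92 pixels of context on every side:

```python
import numpy as np

image = np.random.rand(512, 512)  # hypothetical input image
# Mirror the border pixels outward to supply the missing context.
padded = np.pad(image, pad_width=92, mode='reflect')
print(padded.shape)  # (696, 696) -> the network output then covers the full 512x512
```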

2. Training

In the figure below, (a) is the raw image; (b) is the ground-truth segmentation; (c) is the generated segmentation mask, with black as background and white as target; (d) is the pixel-wise loss weight map, which forces the network to learn the border pixels between touching cells.
[Figure: (a) raw image, (b) ground truth, (c) generated segmentation mask, (d) pixel-wise loss weight map]
The energy function is computed as a pixel-wise soft-max over the final feature map combined with the cross-entropy loss. The soft-max is defined as

$$p_k(\mathbf{x}) = \frac{\exp(a_k(\mathbf{x}))}{\sum_{k'=1}^{K} \exp(a_{k'}(\mathbf{x}))}$$

where $a_k(\mathbf{x})$ denotes the activation in feature channel $k$ at pixel position $\mathbf{x}$ and $K$ is the number of classes. The cross entropy then penalizes, at each position, the deviation of $p_{\ell(\mathbf{x})}(\mathbf{x})$ from 1:

$$E = \sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \log \bigl( p_{\ell(\mathbf{x})}(\mathbf{x}) \bigr)$$

where $\ell(\mathbf{x})$ is the true class label of each pixel and $w$ is a weight map that gives some pixels more importance in training. The separation border between touching cells is computed using morphological operations, and $w(\mathbf{x})$ is defined as

$$w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \cdot \exp\left( -\frac{\bigl( d_1(\mathbf{x}) + d_2(\mathbf{x}) \bigr)^2}{2\sigma^2} \right)$$

where $w_c$ is the weight map that balances the class frequencies, $d_1$ denotes the distance to the border of the nearest cell, and $d_2$ the distance to the border of the second nearest cell; in the experiments $w_0 = 10$ and $\sigma \approx 5$ pixels.
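A sketch of how $w(\mathbf{x})$ can be computed with SciPy distance transforms, assuming an instance-labeled ground truth (0 = background, 1..N = individual cells). The paper does not spell out $w_c$, so the inverse-class-frequency version below is an assumption:

```python
import numpy as np
from scipy import ndimage

def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2))."""
    cell_ids = [i for i in np.unique(labels) if i != 0]
    if len(cell_ids) >= 2:
        # Per cell: distance of every pixel to that cell's border
        # (distance to the nearest pixel belonging to the cell).
        dists = np.stack([ndimage.distance_transform_edt(labels != i)
                          for i in cell_ids])
        dists.sort(axis=0)
        d1, d2 = dists[0], dists[1]  # nearest and second-nearest cell
    else:
        d1 = d2 = np.zeros(labels.shape)  # fewer than two cells: no touching borders
    border_term = w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
    # w_c: balance foreground/background frequencies (assumed form, not from the paper).
    fg = labels > 0
    p_fg = max(fg.mean(), 1e-8)
    w_c = np.where(fg, 0.5 / p_fg, 0.5 / max(1.0 - p_fg, 1e-8))
    return w_c + border_term
```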

Network weight initialization: weights are drawn from a Gaussian with standard deviation sqrt(2/N), where N is the number of incoming nodes of one neuron. For example, for a 3×3 convolution whose input has 64 feature channels, N = 3·3·64 = 576.
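This is the He initialization scheme; a quick NumPy check of the numbers (PyTorch users get the same via torch.nn.init.kaiming_normal_):

```python
import numpy as np

def he_init(shape):
    # shape = (out_channels, in_channels, kH, kW);
    # N = in_channels * kH * kW, the fan-in of one output neuron.
    n = np.prod(shape[1:])
    return np.random.normal(0.0, np.sqrt(2.0 / n), size=shape)

w = he_init((128, 64, 3, 3))  # N = 3*3*64 = 576
print(np.sqrt(2.0 / 576))     # target std ~= 0.0589
print(w.std())                # empirical std, close to the target
```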

3. Data augmentation

The amount of training data is critical for both the accuracy and the robustness of a deep network, but in medical imaging this condition is rarely met, which makes data augmentation (expansion) necessary. In this paper, smooth deformations are generated using random displacement vectors on a coarse 3×3 grid, with the displacements sampled from a Gaussian distribution with a standard deviation of 10 pixels. Per-pixel displacements are then computed using bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.
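A sketch of that elastic deformation with SciPy; order-3 spline upsampling stands in for the paper's bicubic interpolation, and the function name and defaults are illustrative:

```python
import numpy as np
from scipy import ndimage

def elastic_deform(image, grid=3, sigma=10.0, seed=None):
    """Random displacements on a coarse grid x grid lattice, upsampled to
    per-pixel displacements with cubic splines, then applied to the image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Coarse displacement fields, sampled from N(0, sigma^2), in pixels.
    dx = rng.normal(0.0, sigma, (grid, grid))
    dy = rng.normal(0.0, sigma, (grid, grid))
    # Upsample to full resolution with order-3 spline interpolation.
    dx = ndimage.zoom(dx, (h / grid, w / grid), order=3)
    dy = ndimage.zoom(dy, (h / grid, w / grid), order=3)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # Sample the image at the displaced coordinates.
    return ndimage.map_coordinates(image, [rows + dy, cols + dx],
                                   order=1, mode='reflect')
```

The same displacement field must also be applied to the ground-truth mask, using order=0 so the labels stay discrete.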
