Paper reading: U-NetPlus: A Modified Encoder-Decoder UNet Architecture for Semantic and Instance Segmentation

Reading notes on a paper that improves UNet for semantic and instance segmentation, with the goal of assisting minimally invasive surgery:
U-NetPlus: A Modified Encoder-Decoder U-Net Architecture for Semantic and Instance Segmentation of Surgical Instrument

Summary:

In minimally invasive surgery, surgical instruments are difficult to track within the endoscope's field of view, which limits the surgeon's freedom of operation. Segmenting the surgical video with a deep-learning semantic segmentation framework can help. The UNetPlus proposed in this paper modifies the original UNet encoder-decoder network: the encoder is replaced with a pre-trained network, and the decoder uses nearest-neighbor upsampling instead of transposed convolution. The model performs semantic and instance segmentation on the MICCAI 2017 EndoVis Challenge data and reports a substantial improvement.

# Section I Introduction

Minimally Invasive Surgery:

Compared with traditional, more invasive surgery, laparoscopic minimally invasive surgery is favored for its low risk of infection and short hospital stay. Tracking the surgical instruments within the endoscope's field of view is very important, but it faces a series of difficulties such as occlusion and changes in lighting.


Segmentation network:

DNN-based segmentation networks have been applied successfully to street scenes, autonomous driving, and similar tasks, but clinical use demands higher segmentation accuracy and precision, because even small errors must be avoided. The difficulty of obtaining large numbers of medical images also limits the clinical application of segmentation networks. At present, the shortage of data is addressed mainly through patch-based training, data augmentation, and transfer learning.

Multi-class segmentation in the surgical field was first proposed in 2018. That work is also based on the traditional UNet encoder-decoder network, but its decoder relies on transposed convolution (deconvolution): instead of mapping a 4×4 input to a 1-pixel output, it maps a 1-pixel input to a 4×4 output. This greatly increases the number of parameters and easily causes uneven overlap during deconvolution.
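
The uneven overlap mentioned above can be illustrated with a small 1-D sketch. It only counts how many input pixels write to each output position of a transposed convolution without padding. The kernel sizes and stride below are illustrative assumptions, not values taken from the paper.

```python
def overlap_counts(n_in, kernel, stride):
    """Count how many input pixels contribute to each output position
    of a 1-D transposed convolution with no padding."""
    n_out = (n_in - 1) * stride + kernel
    counts = [0] * n_out
    for i in range(n_in):            # each input pixel ...
        start = i * stride
        for k in range(kernel):      # ... is spread over `kernel` outputs
            counts[start + k] += 1
    return counts


# Kernel size not divisible by the stride: interior counts alternate (uneven).
print(overlap_counts(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
# Kernel size divisible by the stride: interior counts are uniform.
print(overlap_counts(4, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

The alternating 2, 1, 2, 1 pattern in the first case is the kind of uneven overlap that tends to show up as checkerboard-like artifacts in 2-D outputs.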

 
Therefore, this paper replaces the encoder of the original UNet with pre-trained VGG-11 and VGG-16 networks that use batch normalization (BN), which speeds up model convergence. It also replaces the transposed convolution in the decoder with nearest-neighbor interpolation, which removes the artifacts caused by transposed convolution and reduces the number of parameters. The resulting model is also used for instance segmentation in surgery.
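
To give a rough sense of the parameter saving, here is a back-of-the-envelope sketch that counts the weights of a single 2-D transposed convolution layer. The channel counts and kernel size are hypothetical and chosen only for illustration; they are not the paper's layer configuration.

```python
def transposed_conv_params(in_ch, out_ch, kernel, bias=True):
    """Learnable parameters of one 2-D transposed convolution layer:
    one kernel x kernel filter per (input channel, output channel) pair,
    plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


# Nearest-neighbor interpolation only copies values, so the upsampling
# step itself has no learnable parameters.
NEAREST_NEIGHBOR_PARAMS = 0

# Hypothetical decoder stage: 512 -> 256 channels with a 3x3 kernel.
print(transposed_conv_params(512, 256, kernel=3))  # 1179904
print(NEAREST_NEIGHBOR_PARAMS)                     # 0
```

This compares only the upsampling operators. Any ordinary convolutions placed around the interpolation step still carry their own weights.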

# Section II UNetPlus

Part A

The overall framework of UNetPlus is shown in Fig 1. Like UNet, it produces pixel-level segmentation results with an encoder-decoder structure, and it implements the skip connections between encoder and decoder by concatenation, which helps prevent problems such as vanishing gradients. Weights are usually initialized randomly for training, but because the number of medical images is limited, this easily leads to overfitting, so transfer learning is often used to initialize the network parameters. This paper therefore uses VGG-11/VGG-16 pre-trained on ImageNet as the feature extraction network.
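
The concatenation-style skip connection can be sketched in plain Python as joining two feature maps along the channel axis. The `[channel][row][col]` layout, the helper names, and the sizes are assumptions made for this illustration only.

```python
def zeros(channels, height, width):
    """Build a dummy feature map laid out as [channel][row][col]."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(decoder_feat, encoder_feat):
    """Concatenate two feature maps along the channel axis.
    The spatial sizes must match, as in a UNet-style skip connection."""
    h, w = len(decoder_feat[0]), len(decoder_feat[0][0])
    for channel in encoder_feat:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("spatial sizes must match before concatenation")
    return decoder_feat + encoder_feat


decoder_feat = zeros(2, 4, 4)   # hypothetical upsampled decoder output
encoder_feat = zeros(3, 4, 4)   # hypothetical encoder features, same resolution
merged = concat_channels(decoder_feat, encoder_feat)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 5 4 4
```

The channel counts add up (2 + 3 = 5) while the spatial size is unchanged, so the following layer sees both the decoder features and the higher-resolution encoder features.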

For example, VGG-11 contains 7 convolutional layers with 3×3 kernels. The number of channels grows from 64 in the first layer to 512 in the last, and a BN layer is added after each convolutional layer.

Fig 1 UNetPlus

Downsampling reduces the spatial size of the features and increases the number of feature maps. Upsampling does the opposite: it gradually reduces the number of feature maps and increases their spatial size until a pixel-level segmentation map is obtained. To obtain high-resolution segmentation results, this paper uses nearest-neighbor interpolation for upsampling, with the stride and kernel size set for each block.

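
As a minimal sketch of nearest-neighbor upsampling, the function below enlarges a single-channel feature map by an integer factor simply by repeating each value. The scale factor of 2 and the toy input are assumptions for illustration. The paper's decoder works on multi-channel tensors in PyTorch.

```python
def nearest_upsample(feature_map, scale=2):
    """Upsample a 2-D feature map (list of rows) by an integer factor,
    copying each value into a scale x scale block."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(scale)]  # repeat columns
        out.extend(list(wide) for _ in range(scale))           # repeat rows
    return out


small = [[1, 2],
         [3, 4]]
for row in nearest_upsample(small):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

Every output pixel receives exactly one input value, so there is no overlap between neighboring contributions and nothing to learn in this step.
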
Other details:
Dataset: MICCAI 2017 Endoscopic Vision (EndoVis) Challenge

Data augmentation: affine and elastic transformations are applied using the albumentations library.

Experiments: the implementation is based on PyTorch. The unwanted black borders in the video frames are first cropped, and the frames are then normalized. The model is trained for 100 epochs with the Adam optimizer. The evaluation metrics are IoU (Jaccard index) and the Dice coefficient.
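
The two evaluation metrics can be sketched for binary masks as follows. This is a minimal illustration on flat 0/1 lists, not the paper's evaluation code. Returning 1.0 when both masks are empty is an assumption made here to avoid division by zero.

```python
def iou_and_dice(pred, target):
    """Compute IoU (Jaccard index) and Dice coefficient for two binary
    masks given as flat sequences of 0/1 labels of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (assumption)
    iou = intersection / union
    dice = 2 * intersection / (pred_sum + target_sum)
    return iou, dice


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)
print(iou, round(dice, 4))  # 0.5 0.6667
```

The two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same prediction.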

# Section III Experiment Results

Part A Quantitative Analysis


This section first compares UNetPlus with other networks (UNet, UNet+NN, Tiramisu, etc.) on binary segmentation and on multi-class segmentation (shaft, wrist, and claspers). UNet with nearest-neighbor upsampling (UNet+NN) improves on the original UNet. UNetPlus converges faster on binary segmentation, with IoU improving by 10% and Dice by 5%, and on multi-class segmentation it surpasses the previously best-performing Tiramisu model.
Part B Qualitative Analysis


Fig 4 shows that the original UNet, Tiramisu, and the other baselines produce mis-segmentations. Taking multi-class segmentation as an example, UNet cannot correctly separate the instrument handle from the head, and the segmentation quality of Tiramisu is not as good as that of UNetPlus.


Fig 4

Part C Visualization of Attention



To explore the reasons for the performance improvement of UNetPlus, the paper visualizes saliency heat maps. For example, because the Tiramisu network uses a pre-trained network, its attention is better than that of UNet+NN, and UNetPlus shows the strongest attention on the head of the tweezers.










# Section IV Conclusions
The UNetPlus proposed in this paper uses a pre-trained encoder and a decoder based on nearest-neighbor interpolation to improve the segmentation quality of UNet, which makes it well suited to tracking instruments in minimally invasive surgery.
In short, the encoder of UNet is replaced with a pre-trained VGG, and the transposed convolution in the decoder is replaced with nearest-neighbor (NN) interpolation.


Origin blog.csdn.net/qq_37151108/article/details/105979763