Intensive reading of deep learning papers [8]: ParseNet


The U-shaped encoder-decoder structure laid the foundation for deep learning semantic segmentation. As baseline models improved, the focus of the field shifted from refining pixel recovery via upsampling in the original encoder-decoder architecture to a new question: how to make more effective use of image context and extract multi-scale features. This gave rise to the second mainstream structural design in semantic segmentation: the multi-scale structure. The next few paper interpretations will survey network designs centered on image context and multi-scale features, including ParseNet, PSPNet, the DeepLab series built around dilated (atrous) convolution, HRNet, and other representative multi-scale designs.

Since the introduction of Fully Convolutional Networks (FCN) and UNet, mainstream improvements have revolved around the encoder-decoder structure. However, some improvements at the time were less "mainstream", aiming instead at strengthening the network's ability to extract global information. After FCN was proposed, some researchers argued that FCN ignores the global information of the image as a whole and therefore cannot effectively exploit semantic context in certain application scenarios. Besides improving the overall understanding of the image, global information also helps the model judge local image patches. The previous mainstream approach was to integrate probabilistic graphical models into CNN training to capture pixel-level context, for example by adding a Conditional Random Field (CRF) to the model, but this makes the model harder to train and less efficient.

To use global image information efficiently, related research proposed ParseNet, an efficient end-to-end semantic segmentation network built on the FCN structure. It uses global information to guide local predictions while introducing little additional computational overhead. The paper that proposed ParseNet, ParseNet: Looking Wider to See Better, was published in 2015 and is a context-oriented improvement of FCN. In semantic segmentation, contextual information is critical to model performance: with only local information, the classification of a pixel can become ambiguous. Although in theory deep convolutional layers have very large receptive fields, in practice the effective receptive field is much smaller, which is not enough to capture global image information. ParseNet obtains context directly on top of FCN through global average pooling. Figure 1 shows ParseNet's context extraction module. Specifically, the feature map is pooled with global average pooling to obtain a global feature vector, the global feature is L2-normalized, and the normalized feature is then unpooled (broadcast back to the spatial size) and fused with the local feature map. This simple structure greatly improves segmentation quality. As shown in Figure 2, ParseNet attends to the global information in the image and preserves the integrity of the segmented objects.
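The pool-normalize-unpool-fuse pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' Caffe implementation; the function names are ours, and for simplicity only the global branch is L2-normalized here (the paper normalizes the local branch as well before fusion).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """L2-normalize a vector (eps guards against division by zero)."""
    return x / (np.linalg.norm(x) + eps)

def parsenet_context(features):
    """Fuse global context into a local feature map, ParseNet-style.

    features: array of shape (C, H, W) -- a convolutional feature map.
    Returns an array of shape (2C, H, W): the local features concatenated
    with the broadcast ("unpooled") global feature.
    """
    C, H, W = features.shape
    # 1) Global average pooling -> one C-dimensional global descriptor.
    global_feat = features.mean(axis=(1, 2))                        # (C,)
    # 2) L2 normalization so global and local feature scales are comparable.
    global_feat = l2_normalize(global_feat)                         # (C,)
    # 3) "Unpool": broadcast the global vector back to the H x W grid.
    global_map = np.broadcast_to(global_feat[:, None, None], (C, H, W))
    # 4) Early fusion: concatenate along the channel axis.
    return np.concatenate([features, global_map], axis=0)

fmap = np.random.rand(64, 16, 16).astype(np.float32)
fused = parsenet_context(fmap)
print(fused.shape)  # (128, 16, 16)
```

After fusion, a 1x1 convolution (pixel classifier) would operate on the 2C-channel map, so every pixel's prediction sees both its local features and the image-level context.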

[Figure 1: ParseNet's context extraction module]

[Figure 2: With global context, ParseNet preserves the integrity of segmented objects]

Regarding the fusion of global and local features, ParseNet offers two methods: early fusion and late fusion. Early fusion is the method shown in Figure 1: the global features are unpooled and fused directly with the local features, after which pixel classification is performed. Late fusion instead classifies the global and local features separately and then merges the two predictions, for example by weighting. Whether early or late fusion is used, the results are similar as long as an appropriate normalization method is chosen.
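Late fusion can likewise be sketched as a weighted merge of two per-pixel class-score maps. This is a minimal illustration of the idea, assuming the two branches each already produce a (num_classes, H, W) score map; the weighting scheme and names here are ours, not taken from the paper.

```python
import numpy as np

def late_fusion(local_scores, global_scores, w=0.5):
    """Late fusion: merge two per-pixel class-score maps by weighting.

    local_scores, global_scores: (num_classes, H, W) score maps produced
    by two separate classifiers (local branch and global-context branch).
    Returns the fused scores and the per-pixel predicted class labels.
    """
    fused = w * local_scores + (1.0 - w) * global_scores
    labels = fused.argmax(axis=0)  # (H, W): predicted class per pixel
    return fused, labels

# Toy example: the local branch votes for class 1, the global branch
# votes (more strongly) for class 2; the weight w decides who wins.
local_s = np.zeros((3, 2, 2)); local_s[1] = 1.0
global_s = np.zeros((3, 2, 2)); global_s[2] = 2.0
_, labels = late_fusion(local_s, global_s, w=0.5)
print(labels)  # all pixels predicted as class 2
```

With w=0.5 the stronger global vote wins every pixel; raising w toward 1.0 lets the local branch dominate instead, which is the kind of trade-off a weighted late fusion exposes.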

The figure below shows ParseNet's segmentation results on the PASCAL VOC 2012 dataset. It can be seen that ParseNet clearly attends to the global information of the image.

[Figure: ParseNet segmentation results on the PASCAL VOC 2012 dataset]

The ParseNet authors' Caffe-based source code is available at:

https://github.com/weiliu89/caffe

Past highlights:

 Intensive reading of deep learning papers [7]: nnUNet

 Intensive reading of deep learning papers [6]: UNet++

 Intensive reading of deep learning papers [5]: Attention UNet

 Intensive reading of deep learning papers [4]: RefineNet

 Intensive reading of deep learning papers [3]: SegNet

 Intensive reading of deep learning papers [2]: UNet network

 Intensive reading of deep learning papers [1]: FCN full convolutional network


Origin blog.csdn.net/weixin_37737254/article/details/126047330