SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

     Original paper (in English): https://arxiv.org/abs/1511.00561

 

    SegNet is another classic network for image segmentation. As the title says, SegNet is a deep, fully convolutional network with its own encoder-decoder structure, designed for image segmentation. Its core consists of an encoding network, a corresponding decoding network, and a final pixel-wise classification layer. The encoding network is topologically identical to the 13 convolutional layers of VGG16. The encoder produces low-resolution feature maps, and the role of the decoding network is to map these coarse features back to feature maps at the full input resolution for pixel-wise classification; this mapping must produce features that allow accurate boundary localization. The authors emphasize that the novelty of SegNet lies in the way the decoder upsamples its low-resolution input feature maps: the decoder performs nonlinear upsampling using the pooling indices computed in the max-pooling step of the corresponding encoder. This eliminates the need to learn upsampling. The upsampled maps are sparse and are then convolved with trainable filter banks to produce dense feature maps. Segmentation results would otherwise be coarse, mainly because max pooling and downsampling reduce the resolution of the feature maps; for small objects to be delineated, boundary information must be preserved in the extracted feature maps before it is lost. SGD is used for training, since end-to-end training can adjust all the weights of the network jointly. As mentioned above, the encoding network of SegNet matches the 13 convolutional layers of VGG16; because the fully connected layers of VGG are discarded, SegNet is much smaller than comparable networks and easier to train. A key component of SegNet is the decoding network, whose hierarchy mirrors the encoding network: each decoder uses the max-pooling indices obtained from its corresponding encoder to perform nonlinear upsampling of its input feature maps, an idea inspired by an architecture designed for unsupervised feature learning. Reusing the max-pooling indices during decoding has several practical advantages: it improves boundary delineation; it reduces the number of parameters to be trained end-to-end, since the upsampling step itself has nothing to learn; and this form of upsampling can be incorporated into any encoder-decoder architecture. Recent deep frameworks for segmentation share the same encoding network, namely VGG16, and carry a huge number of trainable parameters; they differ mainly in the training and inference of their decoding networks.
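    To make the index-based upsampling concrete, here is a minimal PyTorch sketch (my own illustration under assumed shapes, not the authors' code): the encoder's max pooling records the argmax locations, and the decoder reuses them to scatter values back into a sparse map at twice the resolution, which a trainable convolution then densifies.

```python
import torch
import torch.nn as nn

# Max pooling that also returns the argmax indices (SegNet stores these).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Unpooling that places values back at the recorded argmax positions.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 8, 8)        # a hypothetical encoder feature map
pooled, indices = pool(x)           # 1x64x4x4, plus the pooling indices
sparse = unpool(pooled, indices)    # 1x64x8x8, non-zero only at argmax positions

# A trainable filter bank then convolves the sparse map into a dense one.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
dense = conv(sparse)
print(pooled.shape, sparse.shape, dense.shape)
```

    Note that the upsampling step itself has no learnable weights; only the convolution that follows it is trained.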

    The author then compares the SegNet decoding technique with that of FCN, focusing on the practical trade-offs in segmentation architecture design. Recent deep frameworks for segmentation have similar encoding networks, typically VGG16, but differ in how their decoding networks are trained and used for inference. Because the number of trainable parameters is so large, these networks are difficult to train end-to-end, which has led to multi-stage training, appending a network to a pre-trained architecture (as with FCN), supporting aids such as region proposals at inference, disjoint training of the classification and segmentation networks, and the use of additional data for pre-training or for full training.

    The author reviews the history of segmentation, which I record here. Before the advent of deep neural networks, the best approaches relied on hand-designed features for each pixel to be classified: classically, a patch is fed to a random forest or boosting classifier to predict the class probability of the center pixel, and the noisy per-pixel predictions are then smoothed using pairwise or higher-order CRFs (Conditional Random Fields) to improve accuracy. Building on such patch-based prediction, features based on appearance or shape, or derived from SfM (structure from motion, an offline algorithm for 3D reconstruction from an unordered collection of images; reference: https://blog.csdn.net/qq_20791919/article/details/74936438), have been explored on road scene understanding tests.

  A more recent approach generates high-quality unaries by trying to predict the labels of all pixels in the patch instead of only the label of the center pixel (personally, I read this "unary" as the unary potential of the CRF). Although this improves the random-forest-based unaries, thinly structured classes are still classified poorly. Another approach advocates a combination of hand-designed features and spatio-temporal superpixels to achieve higher accuracy. The best method on the CamVid test addresses the frequency imbalance between labels by combining object detection outputs with classifier predictions in a CRF framework. On indoor RGBD datasets, the author mentions methods that use RGB-SIFT, depth SIFT, and pixel location as input to a neural network classifier, followed by a CRF for smoothing; accuracy is improved further with a richer feature set that includes LBP and region segmentation. In more recent work, a combination of RGB and depth-based cues is used to infer class segmentation and support relations. Another approach focuses on real-time joint reconstruction and semantic segmentation, where random forests are used as the classifier (random forests really are powerful); boundary detection and hierarchical grouping are used before class segmentation. A common feature of all these methods is the use of hand-designed features to classify RGB or RGBD images.

    One way to apply classification networks to segmentation is to take the features of the deepest layer and match them to the image size, but the resulting classification maps are blocky. Another approach uses a recurrent neural network to merge several low-resolution predictions into a prediction map at the input image's resolution, but its ability to delineate boundaries is very poor.

     Newer frameworks for segmentation learn to decode, that is, to map low-resolution image representations to pixel-wise predictions. In these, the encoding network that produces the low-resolution representation is the VGG16 classification network, with 13 convolutional layers and 3 fully connected layers. The decoding network differs from architecture to architecture and produces, for each pixel, the features used for subsequent classification.

     Each decoder in FCN learns to upsample its input feature maps and combines them with the corresponding encoder's feature maps to form the input to the next decoder. This architecture has a large number of trainable parameters in the encoding network but very few in the decoding network, and its overall size makes it difficult to train end-to-end. The authors of FCN therefore used a staged training process in which each decoder is added to an already-trained network one at a time, and growth stops when no further gain in performance is observed. In practice the network stopped growing after three decoders, and ignoring the high-resolution feature maps leads to a loss of edge information. That concerns training; beyond it, reusing the feature maps produced by the encoder in the decoder consumes considerable memory at test time.
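     For contrast with SegNet's index-based scheme, here is a hedged sketch of one FCN-style decoder step (the channel counts and sizes are assumptions of mine): upsampling is learned with a transposed convolution, and the result is fused with the corresponding encoder feature map by elementwise addition.

```python
import torch
import torch.nn as nn

# Learned upsampling: a transposed convolution doubles the spatial size.
upsample = nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1)

decoder_in = torch.randn(1, 256, 16, 16)    # output of the previous decoder
encoder_map = torch.randn(1, 256, 32, 32)   # matching encoder feature map

# FCN fuses by elementwise addition; the fused map feeds the next decoder.
fused = upsample(decoder_in) + encoder_map
print(fused.shape)  # torch.Size([1, 256, 32, 32])
```

     The trade-off described above is visible here: the transposed convolution adds trainable weights, and the full-size encoder map must be kept around at test time, whereas SegNet stores only the pooling indices.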

     FCN has also been extended by appending an RNN and fine-tuning: the RNN layers mimic the sharp boundary delineation ability of a CRF while exploiting the feature representation of FCN. This improves significantly on FCN-8, though the improvement shrinks as more training data is used. The FCN+CRF-RNN combination shows its advantage under joint training, albeit at the cost of more complex training and inference; notably, the deconvolutional network performs significantly better than FCN on its own. Here the author asks whether the perceived advantage of CRF-RNN will diminish as feedforward segmentation engines get better. In any case, a CRF-RNN can be appended to any segmentation architecture, including SegNet. The authors also mention that multi-scale deep architectures are popular. They come in two flavors: those that feed input images at several scales to a shared deep feature extraction network, and those that combine feature maps from different layers of a single deep architecture. The common idea is to extract features at multiple scales to provide both local and global context, while the feature maps from the early encoding layers retain more high-frequency detail and thus yield sharper class boundaries. Such multi-path feature extraction is expensive, typically requiring multiple stages of training and data augmentation, and inference is costly too. Others add a CRF to the multi-scale network and train them jointly. Deconvolutional networks and their semi-supervised variants likewise use the max-pooling indices of the encoder feature maps to implement nonlinear upsampling in the decoder network.

      The network structure proposed by the authors is an encoder-decoder network. The encoder obtains feature maps through convolution, a nonlinear unit, and max pooling with downsampling. For each sample, the indices of the maximum locations computed during pooling are stored and passed to the decoder. The decoder upsamples the feature maps using these stored indices, then convolves the upsampled maps with trainable kernels to reconstruct the input. A structure of this kind was originally used for unsupervised pre-training, learning feature hierarchies from small input patches; here the entire image is used so that the encoder hierarchy can be learned for segmentation.

     The network structure of SegNet

        [Figure: the SegNet network architecture, an encoder network and a mirrored decoder network followed by a pixel-wise classification layer]

        SegNet has an encoding network and a corresponding decoding network, followed by a final pixel-wise classification layer. The encoding network consists of 13 convolutional layers, corresponding to the first 13 convolutional layers of VGG16 as designed for object classification; training can therefore be initialized from weights trained for classification on large datasets. SegNet discards the fully connected layers (which also greatly reduces the number of network parameters) in order to retain higher-resolution feature maps at the deepest encoder output. Each encoder in the encoding network performs convolutions to produce a set of feature maps, which are batch-normalized and then passed through ReLU. A max pooling operation with a 2*2 window and a stride of 2 follows, downsampling the result by a factor of two. Max pooling achieves translation invariance to small spatial displacements in the input image, and downsampling gives each pixel of the resulting feature map a large spatial context in the input image. Stacking several layers of max pooling and downsampling yields more translation invariance, and hence more robust classification, but at a corresponding loss of spatial resolution in the feature maps. An image representation that is lossy with respect to boundary detail is bad for segmentation, where boundary delineation matters. Therefore, boundary information must be captured and stored from the encoder feature maps before downsampling. Storing all the feature maps is impractical, so only the position of the maximum value in each pooling window of each encoder feature map is stored. The corresponding decoder in the decoding network then upsamples its input feature map using the max-pooling indices stored from the matching encoder feature map, which produces sparse feature maps. One encoder stage is sketched in code below, and the figure after it illustrates the decoding process of SegNet.
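        Here is a minimal PyTorch sketch of one encoder stage as just described (my own illustration; the channel counts and 3*3 kernel size are placeholders):

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One SegNet-style encoder stage: conv -> batch norm -> ReLU, then
    2x2/stride-2 max pooling that also returns the argmax indices."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

    def forward(self, x):
        x = torch.relu(self.bn(self.conv(x)))
        pooled, indices = self.pool(x)   # indices are kept for the decoder
        return pooled, indices

stage = EncoderStage(3, 64)
out, idx = stage(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 64, 112, 112]), downsampled by two
```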

        [Figure: the decoding process of SegNet, in which stored max-pooling indices produce sparse upsampled maps that trainable decoder filters convolve into dense feature maps]

        The figure also notes that the decoded feature maps are convolved to produce dense feature maps and then batch-normalized. Somewhat counter-intuitively, although the first encoder's input has 3 (RGB) channels, the decoder corresponding to that first encoder produces a multi-channel feature map, unlike the other decoders in the network (which produce feature maps of the same size and channel count as their encoder inputs). The high-dimensional feature representation output by the last decoder is fed to a trainable softmax classifier that classifies each pixel independently. DeconvNet and U-Net have structures similar to SegNet's. However, DeconvNet retains the fully connected layers and therefore has far more parameters, making end-to-end training difficult. U-Net, for its part, does not reuse pooling indices; instead it transfers the entire encoder feature map to the corresponding decoder and concatenates it with the decoder feature map obtained by upsampling via deconvolution. U-Net also lacks the conv5 and max-pool5 block of VGG, whereas SegNet uses all of the pre-trained convolutional layer weights from VGG as pre-trained weights.
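        As a companion to the encoder sketch above, here is a hedged sketch of one decoder stage plus the final pixel-wise classifier (again my own illustration; the 3*3 decoder kernel, channel counts, and class count are assumptions):

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One SegNet-style decoder stage: unpool with the stored encoder
    indices, then convolve and batch-normalize to densify the sparse map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, indices):
        x = self.unpool(x, indices)            # sparse upsampling, no weights to learn
        return torch.relu(self.bn(self.conv(x)))  # nonlinearity assumed; SegNet-Basic drops it

# The last decoder's output feeds a trainable pixel-wise softmax classifier:
# a 1x1 convolution over K classes, then softmax across the channel dimension.
num_classes = 11                               # e.g. CamVid's 11 classes
classify = nn.Conv2d(64, num_classes, kernel_size=1)

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
_, idx = pool(torch.randn(1, 64, 224, 224))    # indices from a matching encoder
stage = DecoderStage(64, 64)
logits = classify(stage(torch.randn(1, 64, 112, 112), idx))
probs = torch.softmax(logits, dim=1)           # per-pixel class probabilities
print(probs.shape)                             # 1 x num_classes x 224 x 224
```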

The author later mentions a much smaller variant, SegNet-Basic, with only four encoders and four decoders; it uses no bias after the convolutions, and there is no ReLU nonlinearity in the decoder network. The kernel size of all encoder and decoder layers is fixed at 7*7, so that more contextual information can be gathered from the feature maps.
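In code, the SegNet-Basic deviations from the sketches above would look roughly like this (again an assumption-laden sketch, not the released implementation):

```python
import torch.nn as nn

# 7x7 kernels with no bias, per the SegNet-Basic description above.
enc_conv = nn.Conv2d(64, 64, kernel_size=7, padding=3, bias=False)
dec_conv = nn.Conv2d(64, 64, kernel_size=7, padding=3, bias=False)

# Encoder stage: conv -> batch norm -> ReLU -> max pool (indices kept).
# Decoder stage: unpool -> conv -> batch norm, with NO ReLU afterwards.
```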

 

For specific experiments and analysis, please refer to the original paper.

 
