[Computer Vision | Semantic Segmentation] Essentials: An introduction to common semantic segmentation algorithms (1)

1. U-Net

U-Net is a semantic segmentation architecture. It consists of a contracting path and an expanding path. The contracting path follows the typical architecture of a convolutional network: repeated application of two 3x3 convolutions (unpadded), each followed by a rectified linear unit (ReLU), and then a 2x2 max-pooling operation with stride 2 for downsampling. Each downsampling step doubles the number of feature channels. Each step in the expanding path consists of an upsampling of the feature map followed by a 2x2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary because border pixels are lost in every unpadded convolution. In the final layer, a 1x1 convolution maps each 64-component feature vector to the required number of classes. In total, the network has 23 convolutional layers.
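The size bookkeeping implied by unpadded convolutions, cropping, and the layer count can be traced with a short sketch. It assumes four downsampling steps and a 572x572 input tile (values from the original U-Net paper, not stated above); the function name `trace_unet` is hypothetical.

```python
def trace_unet(input_size=572, depth=4):
    """Trace spatial sizes and count conv layers in a U-Net with unpadded 3x3 convs.

    input_size and depth are assumed values, not taken from the text above.
    """
    size = input_size
    skip_sizes = []
    conv_layers = 0

    # Contracting path: two unpadded 3x3 convs (each trims 2 pixels), then 2x2 max-pool.
    for _ in range(depth):
        size -= 4
        conv_layers += 2
        skip_sizes.append(size)
        if size % 2:
            raise ValueError("feature map size must be even before 2x2 pooling")
        size //= 2

    # Bottleneck: two more unpadded 3x3 convs.
    size -= 4
    conv_layers += 2

    # Expanding path: the up-convolution doubles the size, the skip map is
    # center-cropped to match, then two unpadded 3x3 convs follow.
    crops = []
    for skip in reversed(skip_sizes):
        size *= 2
        conv_layers += 1                  # the 2x2 up-convolution
        crops.append((skip - size) // 2)  # pixels cropped from each border of the skip map
        size -= 4
        conv_layers += 2

    conv_layers += 1  # final 1x1 convolution to class scores
    return size, crops, conv_layers


out_size, crops, n_convs = trace_unet()
print(out_size, crops, n_convs)  # 388 [4, 16, 40, 88] 23
```

Under these assumptions the output map is smaller than the input (388 versus 572), which is why the skip feature maps must be cropped before concatenation.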

2. Fully Convolutional Network

Fully Convolutional Network (FCN) is an architecture mainly used for semantic segmentation. It employs only locally connected layers such as convolution, pooling, and upsampling. Avoiding dense layers means fewer parameters (which makes the network faster to train). It also means that an FCN can handle variable image sizes, since all connections are local.
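A minimal sketch of why local connections allow variable image sizes: the same fixed set of filter weights slides over inputs of any size, and only the output size changes. The helper `conv2d_valid` and the toy inputs are illustrative assumptions, not part of any FCN implementation.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


kernel = [[1, 0, -1]] * 3                                  # one fixed 3x3 filter (9 weights)
small = [[float(c) for c in range(5)] for _ in range(5)]   # 5x5 input
large = [[float(c) for c in range(8)] for _ in range(6)]   # 6x8 input

for img in (small, large):
    out = conv2d_valid(img, kernel)
    print(len(img), len(img[0]), "->", len(out), len(out[0]))
# 5 5 -> 3 3
# 6 8 -> 4 6
```

A dense layer, by contrast, has one weight per input element, so its parameter count is tied to a single input size.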

The network consists of a downsampling path to extract and interpret context and an upsampling path to allow localization.

FCN also employs skip connections to recover fine-grained spatial information lost in the downsampling path.
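As a rough illustration of such a skip connection, the sketch below upsamples a coarse score map by 2x and adds it element-wise to a finer score map from an earlier layer. Nearest-neighbor upsampling is used here only for brevity (an assumption; FCN itself uses interpolated or learned upsampling), and both function names are hypothetical.

```python
def upsample2x_nearest(score_map):
    """Double height and width by repeating every value (nearest-neighbor)."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_skip(coarse, fine):
    """Upsample the coarse map and sum it with the finer map from the skip connection."""
    up = upsample2x_nearest(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[10, 20],
          [30, 40]]            # deep, low-resolution scores
fine = [[0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0]]          # shallow, high-resolution scores

for row in fuse_skip(coarse, fine):
    print(row)
# [10, 11, 20, 21]
# [11, 10, 21, 20]
# [30, 31, 40, 41]
# [31, 30, 41, 40]
```

The fused map keeps the coarse semantic scores while the finer map contributes detail that pooling removed.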

3. SegNet

SegNet is a semantic segmentation model. The core trainable segmentation architecture consists of an encoder network, a corresponding decoder network, and a pixel-wise classification layer. The encoder network is topologically identical to the 13 convolutional layers of the VGG16 network. The role of the decoder network is to map the low-resolution encoder feature maps to feature maps at full input resolution for pixel-wise classification. The novelty of SegNet lies in the way the decoder upsamples its lower-resolution input feature maps. Specifically, the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform nonlinear upsampling.
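The index-based upsampling can be sketched in plain Python as follows. The function names are hypothetical, and the sketch handles a single channel with even height and width only.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max-pooling (stride 2) that also records where each maximum came from.

    Indices are flat positions (row * width + col) in the input map.
    Assumes the height and width of fmap are even.
    """
    height, width = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, height, 2):
        pooled_row, index_row = [], []
        for j in range(0, width, 2):
            window = [(i + di, j + dj) for di in (0, 1) for dj in (0, 1)]
            r, c = max(window, key=lambda rc: fmap[rc[0]][rc[1]])
            pooled_row.append(fmap[r][c])
            index_row.append(r * width + c)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_h, out_w):
    """Place each pooled value back at its recorded position; all other cells stay 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, idx in zip(pooled_row, index_row):
            out[idx // out_w][idx % out_w] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, indices = max_pool_2x2_with_indices(fmap)
print(pooled)   # [[4, 5], [6, 7]]
print(indices)  # [[4, 7], [12, 10]]
print(max_unpool_2x2(pooled, indices, 4, 4))
# [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

The unpooled map is sparse; in SegNet the decoder's convolutions that follow turn such sparse maps into dense ones.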

4. DeepLab

DeepLab is a semantic segmentation architecture. First, the input image is passed through a network that uses dilated (atrous) convolutions. The output of the network is then upsampled by bilinear interpolation, and the result is refined by a fully connected CRF to obtain the final prediction.
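A dilated convolution spaces its kernel taps `rate` samples apart, which widens the receptive field without adding weights. The one-dimensional sketch below illustrates the idea; the function names and the toy signal are assumptions made for illustration.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D dilated (atrous) convolution without padding."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Extent of the input covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


signal = list(range(10))
kernel = [1, 1, 1]

print(atrous_conv1d(signal, kernel, rate=1))  # [3, 6, 9, 12, 15, 18, 21, 24]
print(atrous_conv1d(signal, kernel, rate=2))  # [6, 9, 12, 15, 18, 21]
print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))  # 3 5
```

With `rate=2` the same three weights cover five input samples, which is how context grows without extra pooling.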

5. DeepLabv3

DeepLabv3 is a semantic segmentation architecture that builds on DeepLabv2 with several modifications. To handle objects at multiple scales, it uses modules that apply atrous convolutions in cascade or in parallel, capturing multi-scale context with multiple atrous rates. In addition, the Atrous Spatial Pyramid Pooling (ASPP) module from DeepLabv2 is augmented with image-level features that encode global context, which further improves performance.

The change to the ASPP module is as follows: the authors apply global average pooling to the last feature map of the model, feed the resulting image-level features into a 1 × 1 convolution with 256 filters (and batch normalization), and then bilinearly upsample the features to the desired spatial dimensions. The improved ASPP therefore consists of (a) one 1 × 1 convolution and three 3 × 3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features.
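The branch layout described above can be written down as a small table, together with the input extent each 3 × 3 atrous kernel covers. This is a bookkeeping sketch only; the function name is hypothetical, and the concatenation of all branches before a further 1 × 1 convolution is taken from the DeepLabv3 paper rather than from the text above.

```python
def aspp_branches(rates=(6, 12, 18), filters=256):
    """List the parallel ASPP branches as (name, kernel, rate, effective_kernel, filters)."""
    branches = [("1x1 conv", 1, 1, 1, filters)]
    for rate in rates:
        effective = 3 + (3 - 1) * (rate - 1)  # input extent covered by the dilated 3x3 kernel
        branches.append(("3x3 atrous conv", 3, rate, effective, filters))
    # Image-level branch: global average pooling -> 1x1 conv -> bilinear upsampling.
    branches.append(("image pooling", 1, 1, None, filters))
    return branches


branches = aspp_branches()
for branch in branches:
    print(branch)

# Assumption from the DeepLabv3 paper: the branches are concatenated along channels.
concat_channels = sum(b[4] for b in branches)
print(concat_channels)  # 1280
```

With rates 6, 12, and 18 the three atrous branches cover 13, 25, and 37 feature-map positions per side, which is what lets one module see several scales at once.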

Another interesting difference is that DeepLabv2's DenseCRF post-processing is no longer required.

6. UNet++

UNet++ is a semantic segmentation architecture based on U-Net. By using nested, densely connected decoder subnetworks, it enhances the processing of the extracted features, and the authors report that it outperforms U-Net on medical image segmentation tasks covering electron microscopy (EM), cells, nuclei, brain tumors, liver, and lung nodules.
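One way to picture the nested, dense connectivity is to list which nodes feed each node. The sketch below follows the indexing used in the UNet++ paper as I understand it (node (i, j) sits at depth i, and j counts the convolution blocks along the skip pathway); the function name and the default depth of 4 are assumptions.

```python
def unetpp_node_inputs(depth=4):
    """Map each node (i, j) to the list of nodes whose outputs it receives."""
    inputs = {}
    for j in range(depth + 1):
        for i in range(depth + 1 - j):
            if j == 0:
                # Encoder (backbone) node: fed by the downsampled node one level above.
                inputs[(i, j)] = [(i - 1, 0)] if i > 0 else []
            else:
                # Nested node: all earlier nodes on the same level (dense skips)
                # plus the upsampled output of the node one level below.
                same_level = [(i, k) for k in range(j)]
                inputs[(i, j)] = same_level + [(i + 1, j - 1)]
    return inputs


graph = unetpp_node_inputs()
print(len(graph))     # 15
print(graph[(0, 1)])  # [(0, 0), (1, 0)]
print(graph[(0, 4)])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3)]
```

In a plain U-Net only the first and last node of each level exist, so the intermediate nodes listed here are what the nesting adds.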

7. PSPNet

PSPNet (Pyramid Scene Parsing Network) is a semantic segmentation model that uses a pyramid pooling module to exploit global context information through context aggregation over different regions. Together, local and global cues make the final prediction more reliable. The authors also propose an optimization strategy.

Given an input image, PSPNet uses a pre-trained CNN with the dilated network strategy to extract the feature map. The final feature map is 1/8 the size of the input image. On top of this map, a pyramid pooling module gathers context information. With a 4-level pyramid, the pooling kernels cover the whole image, half of it, and smaller portions. These are fused into a global prior, which is then concatenated with the original feature map. A final convolutional layer produces the prediction map.
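The pooling step of the pyramid can be sketched with adaptive average pooling on a single-channel map. The bin sizes (1, 2, 3, 6) are taken from the PSPNet paper and are an assumption here, since the text above only says the pyramid has four levels; the function names are hypothetical.

```python
import math


def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D map (nested lists) down to bins x bins cells."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for bi in range(bins):
        r0, r1 = (bi * height) // bins, math.ceil((bi + 1) * height / bins)
        row = []
        for bj in range(bins):
            c0, c1 = (bj * width) // bins, math.ceil((bj + 1) * width / bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Pool the same map at several granularities, from global to fine."""
    return {bins: adaptive_avg_pool(fmap, bins) for bins in bin_sizes}


fmap = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6x6 feature map
pooled = pyramid_pool(fmap)
print(pooled[1])  # [[17.5]]
print(pooled[2])  # [[7.0, 10.0], [25.0, 28.0]]
```

In the full module each pooled map would then be reduced by a convolution, upsampled back to the feature-map size, and concatenated with the original features; those steps are omitted here.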

8. EfficientDet

EfficientDet is an object detection model that leverages multiple optimizations and backbone tuning, such as using BiFPN, as well as a compound scaling approach that uniformly scales the resolution, depth, and width of all backbones, feature networks, and box/class prediction networks simultaneously.
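Compound scaling ties every dimension to a single coefficient. The sketch below uses the scaling rules as given in the EfficientDet paper, to the best of my recollection (they are not stated in the text above); the function name is hypothetical, and the raw width is left unrounded, whereas the published configurations round it to hardware-friendly values.

```python
def efficientdet_scaling(phi):
    """Scale BiFPN, prediction heads, and input resolution from one coefficient phi."""
    return {
        "bifpn_width": 64 * 1.35 ** phi,      # channels, before any rounding
        "bifpn_depth": 3 + phi,               # number of BiFPN layers
        "head_depth": 3 + phi // 3,           # layers in the box/class networks
        "input_resolution": 512 + 128 * phi,  # pixels per side
    }


for phi in (0, 3):
    print(phi, efficientdet_scaling(phi))
```

Raising `phi` by one grows width, depth, and resolution together instead of tuning each by hand.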

9. SegFormer

SegFormer is a Transformer-based semantic segmentation framework that combines a Transformer encoder with a lightweight multi-layer perceptron (MLP) decoder. SegFormer has two attractive features: 1) It contains a novel hierarchically structured Transformer encoder that outputs multi-scale features. The encoder does not require positional encoding, which avoids interpolating positional codes, a step that degrades performance when the test resolution differs from the training resolution. 2) It avoids complex decoders. The proposed MLP decoder aggregates information from different layers, combining local attention and global attention to produce a powerful representation.
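The way the MLP decoder aggregates multi-scale features can be followed as shape bookkeeping. The encoder channel widths (32, 64, 160, 256), the strides (4, 8, 16, 32), the decoder width of 256, and the 19 classes are example values assumed for illustration; the function name is hypothetical.

```python
def segformer_decoder_shapes(height, width,
                             encoder_channels=(32, 64, 160, 256),
                             embed_dim=256, num_classes=19):
    """Return (channels, height, width) at each step of an all-MLP decoder."""
    strides = (4, 8, 16, 32)
    stages = [(c, height // s, width // s)
              for c, s in zip(encoder_channels, strides)]
    target_h, target_w = stages[0][1], stages[0][2]  # 1/4 of the input resolution

    # 1) A linear layer per stage unifies the channel dimension.
    # 2) Each stage is upsampled to 1/4 resolution.
    unified = [(embed_dim, target_h, target_w) for _ in stages]
    # 3) The stages are concatenated along channels and fused by another linear layer.
    concatenated = (embed_dim * len(stages), target_h, target_w)
    fused = (embed_dim, target_h, target_w)
    # 4) A final linear layer predicts per-pixel class scores.
    logits = (num_classes, target_h, target_w)
    return {"stages": stages, "unified": unified,
            "concatenated": concatenated, "fused": fused, "logits": logits}


shapes = segformer_decoder_shapes(512, 512)
print(shapes["stages"])        # [(32, 128, 128), (64, 64, 64), (160, 32, 32), (256, 16, 16)]
print(shapes["concatenated"])  # (1024, 128, 128)
print(shapes["logits"])        # (19, 128, 128)
```

Every decoder step is a per-pixel linear layer or a resize, which is what keeps the decoder lightweight.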

10. ENet

ENet is a semantic segmentation model built on a compact encoder-decoder architecture. Some of its design choices include:

Downsampling follows the SegNet approach: the indices of the elements selected in the max-pooling layers are saved and used to produce sparse upsampled maps in the decoder.
Early downsampling reduces the cost of processing large input frames in the early stages of the network. The first two blocks of ENet heavily reduce the input size and use only a small set of feature maps.
Use PReLU as the activation function.
Use dilated convolutions.
Use Spatial Dropout for regularization.
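For reference, PReLU from the list above behaves like ReLU for positive inputs and scales negative inputs by a learned slope. A minimal sketch, with the slope fixed at 0.25 purely as an example value (in a network it is a trainable parameter):

```python
def prelu(x, slope):
    """Parametric ReLU: identity for positive inputs, slope * x otherwise."""
    return x if x > 0 else slope * x


activations = [prelu(v, slope=0.25) for v in (-2.0, 0.0, 3.0)]
print(activations)  # [-0.5, 0.0, 3.0]
```

A slope of 0 would give plain ReLU and a slope of 1 the identity, so the learned value lets each layer choose between the two.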

Origin blog.csdn.net/wzk4869/article/details/132863871