[Computer Vision | Semantic Segmentation] Essentials: an introduction to common semantic segmentation algorithms (3)

1. Criss-Cross Network

Criss-Cross Network (CCNet) aims to capture full-image contextual information in an effective and efficient way. Specifically, for each pixel, a novel criss-cross attention module collects contextual information from all pixels on its criss-cross path (its row and column). By applying the module recurrently, each pixel can ultimately capture dependencies over the entire image. CCNet has the following advantages: 1) GPU memory friendly: the recurrent criss-cross attention module uses about 11x less GPU memory than a non-local block. 2) High computational efficiency: recurrent criss-cross attention reduces the FLOPs of a non-local block by approximately 85%. 3) State-of-the-art performance.
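To make the row-and-column idea concrete, here is a minimal PyTorch sketch of a criss-cross attention layer. It follows the shape bookkeeping described above but omits details of the official implementation (for example, masking the duplicate self-position on the row path), so read it as an illustration rather than the reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Sketch: each pixel attends only to its own row and column
    (H + W positions) instead of all H x W positions."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, _, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Row affinities: (b, h, w, w); column affinities: (b, h, w, h).
        e_row = torch.matmul(q.permute(0, 2, 3, 1), k.permute(0, 2, 1, 3))
        e_col = torch.matmul(q.permute(0, 3, 2, 1), k.permute(0, 3, 1, 2))
        e_col = e_col.permute(0, 2, 1, 3)

        # Joint softmax over each pixel's criss-cross path.
        attn = F.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)
        a_row, a_col = attn[..., :w], attn[..., w:]

        out_row = torch.matmul(a_row, v.permute(0, 2, 3, 1))               # (b, h, w, c)
        out_col = torch.matmul(a_col.permute(0, 2, 1, 3),
                               v.permute(0, 3, 2, 1)).permute(0, 2, 1, 3)  # (b, h, w, c)
        return self.gamma * (out_row + out_col).permute(0, 3, 1, 2) + x

# "Recurrent" usage: two passes let information flow along rows *and*
# columns, so every pixel ends up conditioned on the whole image.
cca = CrissCrossAttention(64)
feat = torch.randn(2, 64, 32, 32)
for _ in range(2):
    feat = cca(feat)
```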


2. LiteSeg

LiteSeg is a lightweight architecture for semantic segmentation. It uses a deeper version of the Atrous Spatial Pyramid Pooling module (ASPP) together with short and long residual connections and depthwise separable convolutions, resulting in a faster, more efficient model.
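As a concrete illustration of the two ingredients named above, below is a minimal PyTorch sketch of a depthwise separable convolution and a small deeper-ASPP-style module built from it. The dilation rates and channel widths are assumptions made for illustration, not LiteSeg's published configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise 3x3 followed by pointwise 1x1: far fewer multiply-adds
    # than a standard 3x3 convolution with the same in/out channels.
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class DeeperASPPSketch(nn.Module):
    # Parallel atrous branches at several rates, concatenated and fused by
    # a 1x1 conv; the rates (3, 6, 9) are illustrative assumptions.
    def __init__(self, in_ch, out_ch, rates=(3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv(in_ch, out_ch, dilation=r) for r in rates])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = DeeperASPPSketch(256, 128)(torch.randn(1, 256, 32, 32))  # -> (1, 128, 32, 32)
```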


3. EdgeFlow

EdgeFlow is an interactive segmentation architecture that fully exploits the interaction information from user clicks via an edge guidance flow. The idea of interactive segmentation is to progressively improve the segmentation mask through user clicks. Based on the user clicks, EdgeFlow adopts an edge-guided scheme that uses the object edges estimated in previous iterations as prior information rather than the masks themselves, since directly feeding previous masks back as input can lead to poor segmentation results.

The architecture is a coarse-to-fine network consisting of CoarseNet and FineNet. CoarseNet uses HRNet-18+OCR as the base segmentation model, with an edge guidance flow added to process the interaction information. FineNet uses three atrous convolution blocks to refine the coarse mask.
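The interaction loop described above can be sketched as follows. The toy CoarseNet/FineNet below are stand-ins (the real CoarseNet is HRNet-18+OCR); the channel counts, dilation rates, and click encoding are assumptions made only to show how the edge prior from one round feeds the next.

```python
import torch
import torch.nn as nn

class ToyCoarseNet(nn.Module):
    # Stand-in for HRNet-18+OCR with edge guidance flow: takes the image,
    # two click maps (positive/negative) and the previous edge map, and
    # predicts coarse mask logits plus new edge logits.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 2, 3, padding=1))

    def forward(self, x):
        out = self.net(x)
        return out[:, :1], out[:, 1:]  # coarse mask logits, edge logits

class ToyFineNet(nn.Module):
    # Three atrous convolution blocks refining the coarse mask
    # (the dilation rates here are assumptions).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=4, dilation=4), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=8, dilation=8))

    def forward(self, image, coarse_mask):
        return self.net(torch.cat([image, coarse_mask], dim=1))

coarse_net, fine_net = ToyCoarseNet(), ToyFineNet()
image = torch.randn(1, 3, 64, 64)
clicks = torch.zeros(1, 2, 64, 64)     # filled in as the user clicks
prev_edge = torch.zeros(1, 1, 64, 64)  # no edge prior before the first round

for _ in range(3):  # one iteration per user interaction
    coarse_logits, edge_logits = coarse_net(torch.cat([image, clicks, prev_edge], 1))
    refined = fine_net(image, torch.sigmoid(coarse_logits))
    prev_edge = torch.sigmoid(edge_logits)  # edges, not masks, carry over
```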


4. BiSeNet V2

BiSeNet V2 is a two-pathway architecture for real-time semantic segmentation. One pathway, called the detail branch, is designed to capture spatial details with wide channels and shallow layers. The other pathway, called the semantic branch, extracts categorical semantics with narrow channels and deep layers. The semantic branch only needs a large receptive field to capture semantic context, while detail information is supplied by the detail branch; it can therefore be made very lightweight, with few channels and a fast-downsampling strategy. The two types of feature representation are merged to build a stronger, more comprehensive representation.
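The wide-and-shallow versus narrow-and-deep contrast can be seen in a toy sketch like the one below. The layer counts and channel widths are illustrative, not the paper's exact configuration, and the simple addition at the end stands in for BiSeNet V2's guided aggregation layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Detail branch: wide channels, shallow, mild downsampling (stride 8 overall).
detail_branch = nn.Sequential(conv_bn_relu(3, 64, 2), conv_bn_relu(64, 64),
                              conv_bn_relu(64, 128, 2), conv_bn_relu(128, 128, 2))

# Semantic branch: narrow channels, deeper, fast downsampling (stride 32 overall).
semantic_branch = nn.Sequential(conv_bn_relu(3, 16, 2), conv_bn_relu(16, 16, 2),
                                conv_bn_relu(16, 32, 2), conv_bn_relu(32, 64, 2),
                                conv_bn_relu(64, 128, 2))

x = torch.randn(1, 3, 256, 256)
detail = detail_branch(x)      # (1, 128, 32, 32): rich spatial detail
semantic = semantic_branch(x)  # (1, 128, 8, 8): large receptive field
semantic = F.interpolate(semantic, size=detail.shape[2:],
                         mode='bilinear', align_corners=False)
fused = detail + semantic      # stand-in for the guided aggregation layer
```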


5. EfficientUNet++

The decoder architecture is inspired by the UNet++ structure and EfficientNet building blocks. While retaining the structure of UNet++, EfficientUNet++ achieves higher performance and significantly lower computational complexity through two simple modifications (sketched after this list):

1) Replace UNet++'s 3x3 convolutions with residual bottleneck blocks that use depthwise convolutions.
2) Apply channel and spatial attention to the bottleneck feature maps using concurrent spatial and channel squeeze-and-excitation (scSE) blocks.
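Below is a minimal sketch of an scSE block of the kind referenced in the second modification. The reduction ratio and the element-wise addition used to combine the two attention paths are common choices, not necessarily the exact ones in EfficientUNet++.

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    # Concurrent spatial and channel squeeze-and-excitation:
    # one path recalibrates channels, the other recalibrates positions.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(   # channel attention (squeeze space)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(   # spatial attention (squeeze channels)
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

y = SCSE(128)(torch.randn(1, 128, 32, 32))  # same shape in, same shape out
```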


6. PSANet

PSANet is a semantic segmentation architecture that uses a Point-wise Spatial Attention (PSA) module to aggregate long-range contextual information in a flexible, adaptive manner. Each position in the feature map is connected to all other positions through adaptively predicted attention maps, harvesting information from both near and far. In addition, the authors design a bi-directional information propagation path for a full understanding of complex scenes: each position collects information from all other positions to aid its own prediction, and conversely, the information at each position is distributed globally to assist the predictions of all other positions. Finally, the bi-directionally aggregated contextual information is fused with local features to form the final representation of the complex scene.

The authors use a ResNet-based FCN as the backbone of PSANet. The proposed PSA module is then used to aggregate long-range contextual information on top of the local representation. It follows stage 5 of ResNet, the final stage of the FCN backbone. Stage-5 features are semantically stronger, so aggregating them yields a more comprehensive representation of the long-range context; moreover, the spatial size of the stage-5 feature map is smaller, which reduces computational overhead and memory consumption. In addition to the main loss, an auxiliary loss branch is also applied.
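A heavily simplified sketch of the collect/distribute idea follows. The real PSA module predicts over-complete attention maps of size (2H-1)x(2W-1) and crops them per position; this toy version instead fixes the feature-map size at construction time, so it should be read only as an illustration of the bi-directional flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSASketch(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        n = height * width
        # Each pixel predicts one attention weight per other pixel.
        self.collect = nn.Conv2d(channels, n, 1)
        self.distribute = nn.Conv2d(channels, n, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        feats = x.reshape(b, c, n)
        # Collect: a_col[b, j, p] = how much target p takes from source j.
        a_col = F.softmax(self.collect(x).reshape(b, n, n), dim=1)
        # Distribute: a_dis[b, p, j] = how much source j sends to target p.
        a_dis = F.softmax(self.distribute(x).reshape(b, n, n), dim=1)
        collected = torch.einsum('bcj,bjp->bcp', feats, a_col)
        distributed = torch.einsum('bcj,bpj->bcp', feats, a_dis)
        out = torch.cat([collected, distributed], dim=1).reshape(b, 2 * c, h, w)
        return out  # fused with the local features downstream

psa = PSASketch(64, 16, 16)
y = psa(torch.randn(2, 64, 16, 16))  # -> (2, 128, 16, 16)
```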


7. The Ikshana Hypothesis of Human Scene Understanding Mechanism


8. Adaptive Early-Learning Correction

Adaptive Early-Learning Correction for Segmentation from Noisy Annotations

9. O-Net

10. Difference of Gaussian Random Forest

11. Context Aggregated Bi-lateral Network for Semantic Segmentation

As the demand for autonomous systems grows, pixel-wise semantic segmentation for visual scene understanding must be not only accurate but also efficient enough for potential real-time applications. In this paper, we propose the Context Aggregation Network, a dual-branch convolutional neural network that significantly reduces computational cost compared with state-of-the-art methods while maintaining competitive prediction accuracy. Building on existing dual-branch architectures for high-speed semantic segmentation, we design a high-resolution branch for effective spatial detail and a context branch with lightweight versions of the global aggregation and local distribution blocks, which captures both the long-range and local contextual dependencies required for accurate semantic segmentation at low computational overhead. We evaluate our method on two semantic segmentation datasets, Cityscapes and UAVid. On the Cityscapes test set, our model achieves state-of-the-art results with 75.9% mIoU at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. On the UAVid dataset, our network achieves an mIoU of 63.5% at high execution speed (15 FPS).
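The "global aggregation and local distribution" pairing can be illustrated with a loose sketch like the following, where average pooling stands in for the paper's lightweight aggregation block and a learned per-pixel gate plays the role of local distribution; both substitutions are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class GALDSketch(nn.Module):
    # Global aggregation then local distribution, in miniature:
    # pool a global descriptor, then let each pixel gate how much
    # of that context it absorbs.
    def __init__(self, channels):
        super().__init__()
        self.aggregate = nn.AdaptiveAvgPool2d(1)  # global aggregation stand-in
        self.project = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())   # local distribution weights

    def forward(self, x):
        g = self.project(self.aggregate(x))       # (b, c, 1, 1) context vector
        return x + self.gate(x) * g               # context only where it is wanted

y = GALDSketch(128)(torch.randn(1, 128, 64, 64))
```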
