[Computer Vision] The segmentation papers at CVPR 2023 are truly a clash of the titans (introducing the first 12 papers: image segmentation, panoptic segmentation, semantic segmentation, instance segmentation)

1. Image Segmentation

1.1 AutoFocusFormer: Image Segmentation off the Grid

Paper address:

https://arxiv.org/abs/2304.12406

Real-world images often have highly imbalanced content density. Some regions are very uniform, such as large patches of blue sky, while others are scattered with many small objects. However, the uniform grid downsampling strategy commonly used in convolutional deep networks treats all regions equally, so small objects end up represented at very few spatial locations, which hurts tasks such as segmentation. Intuitively, retaining more of the pixels that represent small objects during downsampling helps preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image-recognition backbone that performs adaptive downsampling by learning to retain the pixels most important for the task. Since adaptive downsampling produces a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, supported by a balanced clustering module and a learnable neighborhood merging module, which yields representations for point-based versions of state-of-the-art segmentation heads. Experiments show that AutoFocusFormer (AFF) achieves significant improvements over baseline models of similar size.

Recommended reason:

The paper presents AutoFocusFormer (AFF), a local-attention transformer image-recognition backbone that performs adaptive downsampling by learning to retain the pixels most important for the task. Abandoning the classic grid structure, the paper develops a new point-based local attention block, supported by a balanced clustering module and a learnable neighborhood merging module, which yields representations for point-based versions of state-of-the-art segmentation heads. Experiments show that AutoFocusFormer (AFF) achieves significant improvements over baseline models of similar size.
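
To make the adaptive-downsampling idea more concrete, here is a minimal sketch (my own illustration, not the AFF code): a small scoring network ranks tokens and only the highest-scoring fraction is kept, producing an irregular set of retained points. The module name, the scoring MLP and the keep ratio are all assumptions made for the example.

```python
# Minimal sketch of learned adaptive downsampling (illustrative, not official AFF code).
# Each token receives an importance score; only the top fraction survives, so regions
# dense with small objects can keep more tokens than uniform regions such as sky.
import torch
import torch.nn as nn

class AdaptiveDownsample(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1))
        self.keep_ratio = keep_ratio

    def forward(self, feats: torch.Tensor, pos: torch.Tensor):
        # feats: (B, N, C) token features; pos: (B, N, 2) their image-plane coordinates
        B, N, _ = feats.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(feats).squeeze(-1)                      # (B, N)
        idx = scores.topk(k, dim=1).indices                         # most important tokens
        gather = lambda t: t.gather(1, idx.unsqueeze(-1).expand(-1, -1, t.shape[-1]))
        # modulate kept tokens by their score so the scoring network receives gradients
        kept = gather(feats) * torch.sigmoid(gather(scores.unsqueeze(-1)))
        return kept, gather(pos)                                     # irregular point set

x, xy = torch.randn(2, 1024, 96), torch.rand(2, 1024, 2)
f, p = AdaptiveDownsample(96)(x, xy)
print(f.shape, p.shape)  # torch.Size([2, 256, 96]) torch.Size([2, 256, 2])
```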

1.2 FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Paper address:

https://arxiv.org/abs/2303.17225

Recently, open-vocabulary learning has emerged to accomplish segmentation of arbitrary categories from text descriptions, which lets segmentation systems generalize to broader application scenarios. However, existing methods focus on designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation across segmentation tasks and hinder the unification of segmentation models. Therefore, in this paper we propose FreeSeg, a general framework for unified, universal and open-vocabulary image segmentation. FreeSeg optimizes an all-in-one network through a single round of training and uses the same architecture and parameters to seamlessly handle various segmentation tasks at inference time. In addition, adaptive prompt learning helps the unified model capture task-aware and category-sensitive concepts, improving its robustness across multiple tasks and diverse scenarios. Extensive experimental results show that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, substantially outperforming the best task-specific architectures: by 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, and 20.1% PQ on panoptic segmentation for unseen classes.

Recommended reason:

The paper proposes FreeSeg, a general framework for unified, universal and open-vocabulary image segmentation. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, substantially outperforming the best task-specific architectures: by 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, and 20.1% PQ on panoptic segmentation for unseen classes.
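
The unifying mechanism can be pictured with a small sketch (illustrative only, not the FreeSeg implementation): a shared set of mask proposals is classified by comparing proposal embeddings against text embeddings built from a task prompt plus a category name, so one network serves semantic, instance and panoptic segmentation. The encoder stand-in and prompt wording below are hypothetical.

```python
# Toy sketch of "one network, many segmentation tasks" via prompts
# (illustrative only, not the official FreeSeg code).
import torch
import torch.nn.functional as F

def classify_masks(mask_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """mask_emb: (Q, D) per-proposal embeddings; text_emb: (K, D) per-category embeddings."""
    mask_emb = F.normalize(mask_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return mask_emb @ text_emb.t()               # (Q, K) cosine-similarity logits

def encode_prompts(categories, task, dim=256):
    # stand-in for a text encoder applied to prompts such as
    # f"a photo of a {category} for {task} segmentation" (hypothetical wording)
    return torch.randn(len(categories), dim)

queries = torch.randn(100, 256)                  # mask proposals from one shared decoder
for task in ["semantic", "instance", "panoptic"]:
    logits = classify_masks(queries, encode_prompts(["person", "car", "sky"], task))
    print(task, logits.shape)                    # same queries, task-specific classification
```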

1.3 Parameter Efficient Local Implicit Image Function Network for Face Segmentation

Paper address:

https://arxiv.org/abs/2303.15122

Face parsing is defined as the pixel-wise labeling of images containing faces. The labels identify key facial regions such as the eyes, lips, nose and hair. In this work, we exploit the structural consistency of human faces to propose a lightweight face-parsing method using a local implicit function network, FP-LIIF. We propose a simple architecture with a convolutional encoder and a pixel-wise MLP decoder that uses 1/26 the number of parameters of state-of-the-art models, yet matches or outperforms them on benchmarks such as CelebAMask-HQ and LaPa. We do not use any pre-training, and unlike other works, our network can generate segmentations at different resolutions without changing the input resolution. Thanks to its higher FPS and smaller model size, this work enables face segmentation on low-compute or low-bandwidth devices.

Recommended reason:

Face parsing is defined as the pixel-wise labeling of images containing faces, where the labels identify key facial regions such as the eyes, lips, nose and hair. Exploiting the structural consistency of human faces, this paper proposes a lightweight face-parsing method using a local implicit function network, FP-LIIF. It uses a simple architecture with a convolutional encoder and a pixel-wise MLP decoder that has 1/26 the parameters of state-of-the-art models, yet matches or outperforms them on benchmarks such as CelebAMask-HQ and LaPa.
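
The resolution-free decoding highlighted above can be sketched as follows (a rough illustration of a local implicit image function, not the FP-LIIF code): a pixel-wise MLP maps a locally sampled feature plus a query coordinate to class logits, so the output grid is chosen at query time. Dimensions and the number of classes are placeholders.

```python
# Rough sketch of decoding with a local implicit image function (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=64, num_classes=11):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, feat_map: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, h, w) encoder output; coords: (B, N, 2) query points in [-1, 1]
        sampled = F.grid_sample(feat_map, coords.unsqueeze(1), align_corners=False)  # (B, C, 1, N)
        sampled = sampled.squeeze(2).transpose(1, 2)                                 # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))                        # (B, N, classes)

dec = ImplicitDecoder()
feats = torch.randn(1, 64, 32, 32)              # low-resolution features from a conv encoder
H = W = 256                                     # decode at any resolution without re-encoding
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2)
print(dec(feats, coords).reshape(1, H, W, -1).shape)  # torch.Size([1, 256, 256, 11])
```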

2. Panoptic Segmentation

2.1 You Only Segment Once: Towards Real-Time Panoptic Segmentation

Paper address:

https://arxiv.org/abs/2303.14651

In this paper, we propose YOSO, a real-time panoptic segmentation framework. YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps, so you only need to segment once for both the instance and semantic segmentation tasks. To reduce computational overhead, we design a feature pyramid aggregator for feature-map extraction and a separable dynamic decoder for panoptic kernel generation. The aggregator re-parameterizes interpolation-first modules in a convolution-first way, which significantly speeds up the pipeline at no extra cost. The decoder performs multi-head cross-attention via separable dynamic convolution to improve efficiency and accuracy. To the best of our knowledge, YOSO is the first real-time panoptic segmentation framework that delivers performance competitive with state-of-the-art models. Specifically, YOSO achieves 46.4 PQ at 45.6 FPS on COCO, 52.5 PQ at 22.6 FPS on Cityscapes, 38.0 PQ at 35.4 FPS on ADE20K, and 34.1 PQ at 7.1 FPS on Mapillary Vistas. The code is available at this https URL.

Recommended reason:

This paper presents YOSO, a real-time panoptic segmentation framework. YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps, so only one segmentation pass is needed for both the instance and semantic segmentation tasks. To reduce computational overhead, a feature pyramid aggregator for feature-map extraction and a separable dynamic decoder for panoptic kernel generation are designed.
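
The core "segment once" prediction step can be written in a couple of lines (a schematic, not the YOSO code): each dynamically generated kernel is convolved, here as a per-location dot product, with the aggregated feature map to produce one mask covering either a thing or a stuff region. The tensor sizes below are arbitrary.

```python
# Schematic of mask prediction by dynamic convolution between kernels and features.
import torch

B, K, C, H, W = 1, 100, 256, 128, 256     # K candidate kernels (things and stuff)
features = torch.randn(B, C, H, W)        # aggregated feature map
kernels = torch.randn(B, K, C)            # dynamically generated 1x1 kernels

masks = torch.einsum("bkc,bchw->bkhw", kernels, features).sigmoid()
print(masks.shape)  # torch.Size([1, 100, 128, 256]) -- one mask per kernel in a single pass
```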

2.2 UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration

Paper address:

https://arxiv.org/abs/2206.15083

Domain-adaptive panoptic segmentation aims to alleviate data-annotation challenges by leveraging readily available annotated data in one or more related source domains. However, existing studies employ two separate networks for instance segmentation and semantic segmentation, which leads to excessive network parameters and complicated, computationally intensive training and inference. We design UniDAformer, a unified domain-adaptive panoptic segmentation transformer that is simple yet achieves domain-adaptive instance segmentation and semantic segmentation simultaneously in a single network. UniDAformer introduces Hierarchical Mask Calibration (HMC), which corrects inaccurate predictions at the region, superpixel, and pixel level through online self-training on the fly. It has three unique features: 1) it enables unified domain-adaptive panoptic adaptation; 2) it mitigates false predictions and effectively improves domain-adaptive panoptic segmentation; 3) it is end-to-end trainable with a much simpler training and inference pipeline. Extensive experiments on multiple public benchmarks show that UniDAformer achieves superior domain-adaptive panoptic segmentation compared with the state-of-the-art.

Recommended reason:

This paper designs UniDAformer, a unified domain-adaptive panoptic segmentation transformer that is simple yet achieves domain-adaptive instance segmentation and semantic segmentation simultaneously in a single network. It has three unique features: 1) it enables unified domain-adaptive panoptic adaptation; 2) it mitigates false predictions and effectively improves domain-adaptive panoptic segmentation; 3) it is end-to-end trainable with a much simpler training and inference pipeline. Extensive experiments on multiple public benchmarks show that UniDAformer achieves superior domain-adaptive panoptic segmentation compared with the state-of-the-art.
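
As a very simplified illustration of correcting noisy pseudo-masks before self-training (loosely inspired by the Hierarchical Mask Calibration idea; the real HMC also operates on superpixels and is considerably more involved), one can gate a pseudo-mask at the pixel level by confidence and at the region level by mean confidence. The thresholds and shapes below are made up.

```python
# Simplified two-level pseudo-mask calibration (illustrative only, not the paper's HMC).
import torch

def calibrate_pseudo_masks(mask_probs: torch.Tensor, region_thr=0.6, pixel_thr=0.5):
    """mask_probs: (K, H, W) predicted probabilities for K candidate regions."""
    binary = mask_probs > pixel_thr                                   # pixel-level decision
    conf = torch.where(binary, mask_probs, torch.zeros_like(mask_probs))
    region_conf = conf.sum(dim=(1, 2)) / binary.sum(dim=(1, 2)).clamp(min=1)
    keep = region_conf > region_thr                                   # region-level decision
    return binary & keep[:, None, None]                               # calibrated pseudo-masks

pseudo = calibrate_pseudo_masks(torch.rand(10, 64, 64))
print(pseudo.shape, pseudo.dtype)  # torch.Size([10, 64, 64]) torch.bool
```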

2.3 Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

Paper address:

https://arxiv.org/abs/2303.04803

We propose ODISE: Open-Vocabulary Diffusion-Based Panoptic Segmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have a remarkable ability to generate high-quality images from diverse open-vocabulary language descriptions, which suggests that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of these two models to perform panoptic segmentation of any category in the wild. Our method outperforms the previous state of the art on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art. We open-source our code and models at this https URL.

Recommended reason:

This paper proposes ODISE: Open-Vocabulary Diffusion-Based Panoptic Segmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. The method outperforms the previous state of the art on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, it achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art.
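
The open-vocabulary classification step described above can be sketched as follows (illustrative only, not the ODISE code): frozen backbone features are mask-pooled into one embedding per predicted region and compared against text embeddings of arbitrary category names. The random tensors stand in for a frozen diffusion/CLIP image backbone and a CLIP text encoder.

```python
# Sketch of classifying masks against open-vocabulary text embeddings (illustrative only).
import torch
import torch.nn.functional as F

def open_vocab_classify(feat: torch.Tensor, masks: torch.Tensor, text_emb: torch.Tensor):
    """feat: (C, H, W) frozen features; masks: (K, H, W) binary masks; text_emb: (N, C)."""
    area = masks.flatten(1).sum(-1).clamp(min=1)                          # (K,)
    pooled = torch.einsum("khw,chw->kc", masks.float(), feat) / area[:, None]
    pooled, text_emb = F.normalize(pooled, dim=-1), F.normalize(text_emb, dim=-1)
    return (pooled @ text_emb.t()).softmax(dim=-1)                        # (K, N) class probabilities

feat = torch.randn(512, 64, 64)          # stand-in for frozen internal representations
masks = torch.rand(5, 64, 64) > 0.5      # stand-in for predicted panoptic masks
text = torch.randn(20, 512)              # stand-in for text embeddings of 20 class names
print(open_vocab_classify(feat, masks, text).shape)  # torch.Size([5, 20])
```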

3. Semantic Segmentation

3.1 Federated Incremental Semantic Segmentation

Paper address:

https://arxiv.org/abs/2304.04620

Federated-learning-based semantic segmentation (FSS) has attracted widespread attention because it trains on decentralized data held by local clients. However, most FSS models assume that the set of categories is fixed in advance, so they severely forget old categories in real-world applications where local clients receive new categories incrementally and have no memory to store the old ones. Moreover, new clients that collect only new categories may join the global training of FSS, further aggravating catastrophic forgetting. To overcome these challenges, we propose a Forgetting-Balanced Learning (FBL) model that addresses heterogeneous forgetting of old categories from both intra-client and inter-client aspects. Specifically, under the guidance of pseudo-labels generated by adaptive class-balanced pseudo-labeling, we develop a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss to rectify intra-client heterogeneous forgetting of old categories with background shift. They perform balanced gradient propagation and relation-consistency distillation within local clients. Furthermore, to address heterogeneous forgetting from the inter-client aspect, we propose a task transition monitor, which can identify new categories under privacy protection and store the latest old global model for relation distillation. Qualitative experiments show that our model improves substantially over comparison methods.

Recommended reason:

This paper proposes a Forgetting-Balanced Learning (FBL) model to address heterogeneous forgetting of old categories from both intra-client and inter-client aspects. Guided by pseudo-labels generated by adaptive class-balanced pseudo-labeling, a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss are developed to rectify intra-client heterogeneous forgetting of old categories with background shift. In addition, the paper proposes a task transition monitor, which can identify new categories under privacy protection and store the latest old global model for relation distillation. Qualitative experiments show that the model improves considerably over the comparison methods.
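
One ingredient mentioned above, adaptive class-balanced pseudo-labeling, can be illustrated with a toy per-class thresholding rule (my own simplification; the percentile and the exact recipe are not taken from the paper): each class gets its own confidence threshold, so rarer old classes are not drowned out by easy ones.

```python
# Toy class-balanced pseudo-labelling with per-class confidence thresholds (illustrative only).
import torch

def class_balanced_pseudo_labels(probs: torch.Tensor, percentile: float = 0.6):
    """probs: (C, H, W) softmax output on an unlabelled client image; -1 marks ignored pixels."""
    conf, label = probs.max(dim=0)                        # per-pixel confidence and argmax class
    pseudo = torch.full_like(label, -1)
    for c in range(probs.shape[0]):
        sel = label == c
        if sel.any():
            thr = torch.quantile(conf[sel], percentile)   # per-class adaptive threshold
            pseudo[sel & (conf >= thr)] = c
    return pseudo

probs = torch.softmax(torch.randn(19, 128, 128), dim=0)
print((class_balanced_pseudo_labels(probs) >= 0).float().mean())  # fraction of pixels kept
```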

3.2 Exploiting the Complementarity of 2D and 3D Networks to Address Domain-Shift in 3D Semantic Segmentation

Paper address:

https://arxiv.org/abs/2304.02991

3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality. However, the task is extremely challenging due to the ambiguities caused by the unstructured, sparse, and colorless nature of 3D point clouds. One possible solution is to combine the 3D information with information from sensors of other modalities, such as RGB cameras. Recent multimodal 3D semantic segmentation networks exploit these modalities with two branches that process the 2D and 3D information independently, striving to preserve the strength of each modality. In this work, we first explain why this design choice is effective, and then show how it can be improved to make multimodal semantic segmentation more robust to domain shift. Our surprisingly simple contribution achieves state-of-the-art performance on four popular multimodal unsupervised domain adaptation benchmarks, as well as better results in a domain generalization scenario.

Recommended reason:

3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality. However, the task is extremely challenging due to the ambiguities caused by the unstructured, sparse, and colorless nature of 3D point clouds. The paper's contribution achieves state-of-the-art performance on four popular multimodal unsupervised domain adaptation benchmarks, as well as better results in a domain generalization scenario.
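
The two-branch design can be pictured with a minimal fusion sketch (the plain averaging below is purely illustrative and not the scheme proposed in the paper): the 2D network's predictions, projected onto the points, and the 3D network's predictions are combined at the output, so each modality keeps its own strengths.

```python
# Minimal illustration of fusing per-point predictions from a 2D and a 3D branch.
import torch

def fuse_predictions(logits_2d: torch.Tensor, logits_3d: torch.Tensor) -> torch.Tensor:
    """logits_2d: (N, C) image-branch logits projected onto the N points;
    logits_3d: (N, C) point-cloud-branch logits."""
    probs = 0.5 * logits_2d.softmax(-1) + 0.5 * logits_3d.softmax(-1)
    return probs.argmax(-1)                   # fused per-point labels

n_points, n_classes = 4096, 10
pred = fuse_predictions(torch.randn(n_points, n_classes), torch.randn(n_points, n_classes))
print(pred.shape)  # torch.Size([4096])
```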

3.3 Instant Domain Augmentation for LiDAR Semantic Segmentation

Paper address:

https://arxiv.org/abs/2303.14378

Despite the increasing popularity of LiDAR sensors, perception algorithms that use 3D LiDAR data still struggle with the "sensor-bias problem": their performance degrades significantly when LiDAR sensors with unseen specifications are used at test time, due to the domain gap. This paper proposes a fast and flexible LiDAR augmentation method for the semantic segmentation task, called "LiDomAug". It aggregates raw LiDAR scans and creates a LiDAR scan of any configuration while accounting for dynamic distortion and occlusion, enabling instant domain augmentation. The on-demand augmentation module runs at 330 FPS, so it can be seamlessly integrated into the data loader of a learning framework. In our experiments, learning-based methods assisted by the proposed LiDomAug suffer less from the sensor-bias problem and achieve new state-of-the-art domain-adaptation performance on the SemanticKITTI and nuScenes datasets without using target-domain data. We also present a sensor-agnostic model that works well across various LiDAR configurations.

Recommended reason:

This paper proposes a fast and flexible LiDAR augmentation method for the semantic segmentation task, called "LiDomAug". It aggregates raw LiDAR scans and creates a LiDAR scan of any configuration while accounting for dynamic distortion and occlusion, enabling instant domain augmentation. In the experiments, learning-based methods assisted by the proposed LiDomAug suffer less from the sensor-bias problem and achieve new state-of-the-art domain-adaptation performance on the SemanticKITTI and nuScenes datasets without using target-domain data.
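
To give a feel for "creating a LiDAR scan of any configuration", here is a simplified re-sampling sketch (my own illustration; the real LiDomAug also aggregates frames and models distortion and occlusion): points are binned into the beam/azimuth grid of a hypothetical target sensor and only the closest return per cell is kept. The field-of-view and resolution values are placeholders.

```python
# Simplified re-sampling of a LiDAR scan to a different (hypothetical) sensor layout.
import numpy as np

def resample_scan(points, n_beams=32, h_res=1024, fov_up=10.0, fov_down=-30.0):
    """points: (N, 3) xyz from the source sensor -> at most one return per target cell."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1) + 1e-6
    yaw, pitch = np.arctan2(y, x), np.arcsin(z / r)
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    row = np.clip(((pitch - fov_down) / (fov_up - fov_down) * n_beams).astype(int), 0, n_beams - 1)
    col = np.clip(((yaw + np.pi) / (2 * np.pi) * h_res).astype(int), 0, h_res - 1)
    cell = row * h_res + col
    order = np.argsort(r)                              # nearest return first
    _, first = np.unique(cell[order], return_index=True)
    return points[order[first]]                        # closest point in each target cell

scan = np.random.randn(120000, 3) * 20.0
print(resample_scan(scan).shape)                       # at most 32 * 1024 points remain
```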

4. Instance Segmentation

4.1 SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation

Paper address:

https://arxiv.org/abs/2303.08578

Weakly supervised instance segmentation using only bounding-box annotations has attracted much research attention recently. Most current efforts use low-level image features as extra supervision without explicitly exploiting the high-level semantic information of objects, which becomes ineffective when foreground objects have an appearance similar to the background or to other nearby objects. We propose a new approach to box-supervised instance segmentation by developing a semantic-aware instance mask (SIM) generation paradigm. Instead of relying heavily on local pairwise affinities between neighboring pixels, we construct a group of category-wise feature centroids as prototypes to identify foreground objects and assign them semantic-level pseudo-labels. Considering that semantic-aware prototypes cannot distinguish different instances of the same semantic class, we propose a self-correction mechanism to rectify falsely activated regions while enhancing correct ones. Furthermore, to handle occlusions between objects, we tailor the copy-paste operation to the weakly supervised instance segmentation task to augment challenging training data. Extensive experimental results demonstrate that our proposed SIM method outperforms other state-of-the-art methods.

Recommended reason:

Weakly supervised instance segmentation using only bounding box annotations has attracted extensive research attention recently. This paper proposes a new method for box-supervised instance segmentation by developing the Semantic-Aware Instance Mask (SIM) generation paradigm. Considering that semantic-aware prototypes cannot distinguish different instances of the same semantics, the paper proposes a self-correction mechanism to correct wrongly activated regions while enhancing correct ones. Extensive experimental results show that the proposed SIM method outperforms other state-of-the-art methods.
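
The prototype idea can be sketched in a few lines (a schematic of my own, not the SIM implementation): pixels inside a ground-truth box receive semantic pseudo-labels from their similarity to per-class prototype vectors rather than from low-level pairwise affinity. Feature dimensions and the number of classes are placeholders.

```python
# Schematic of prototype-based pseudo-labelling inside a box (illustrative only).
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(pix_feat: torch.Tensor, prototypes: torch.Tensor, box):
    """pix_feat: (C, H, W) features; prototypes: (K, C) class centroids; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    region = F.normalize(pix_feat[:, y1:y2, x1:x2], dim=0)                 # (C, h, w)
    sim = torch.einsum("kc,chw->khw", F.normalize(prototypes, dim=1), region)
    return sim.argmax(dim=0)                                               # (h, w) pseudo-labels

feat = torch.randn(256, 64, 64)
protos = torch.randn(21, 256)       # e.g. 20 foreground classes + background
print(prototype_pseudo_labels(feat, protos, (8, 8, 40, 48)).shape)  # torch.Size([40, 32])
```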

4.2 DynaMask: Dynamic Mask Selection for Instance Segmentation

Paper address:

https://arxiv.org/abs/2303.07868

Representative instance segmentation methods mostly segment object instances with masks of a fixed resolution, such as a 28×28 grid. However, low-resolution masks lose rich details, while high-resolution masks incur quadratic computational overhead. Predicting the optimal binary mask for each instance is a challenging task. In this paper, we propose to dynamically select a suitable mask for each object proposal. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality segmentation of objects. Specifically, an efficient region-level top-down path (r-FPN) is introduced to incorporate complementary contextual and detailed information from different stages of the image-level FPN (i-FPN). Then, to mitigate the increased computational and memory cost of using large masks, we develop a mask switching module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, DynaMask, brings consistent and significant performance improvements over other state-of-the-art methods at a modest computational overhead.

Recommended reason:

To mitigate the increased computational and memory cost of using large masks, this paper develops a mask switching module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, DynaMask, brings consistent and significant performance improvements over other state-of-the-art methods at a modest computational overhead.
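
The mask switching module can be pictured as a tiny per-instance gate over a set of candidate resolutions (a toy version; the gating network, its input and the resolution set below are illustrative, not the paper's exact design).

```python
# Toy per-instance mask-resolution selector (illustrative only).
import torch
import torch.nn as nn

class MaskSwitch(nn.Module):
    def __init__(self, in_dim=256, resolutions=(28, 56, 112)):
        super().__init__()
        self.resolutions = resolutions
        self.gate = nn.Linear(in_dim, len(resolutions))   # tiny classifier: which resolution?

    def forward(self, inst_feat: torch.Tensor):
        # inst_feat: (N, C) one pooled feature per instance proposal
        choice = self.gate(inst_feat).argmax(dim=-1)      # hard choice at inference time
        return [self.resolutions[c] for c in choice.tolist()]

msm = MaskSwitch()
print(msm(torch.randn(4, 256)))  # e.g. [28, 112, 56, 28] -- complex objects get finer masks
```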

4.3 ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution

Paper address:

https://arxiv.org/abs/2303.00246

Existing methods for 3D instance segmentation are dominated by a bottom-up design: a manually fine-tuned algorithm groups points into clusters, followed by a refinement network. However, because they rely on the quality of the clusters, these methods produce unreliable results when (1) nearby objects of the same semantic class are packed together, or (2) large objects have loosely connected regions. To address these limitations, we introduce ISBNet, a novel cluster-free method that represents instances as kernels and decodes instance masks via dynamic convolution. To efficiently generate high-recall and discriminative kernels, we propose a simple strategy named Instance-aware Farthest Point Sampling to sample candidates, and utilize a local aggregation layer inspired by PointNet++ to encode candidate features. Furthermore, we show that predicting and exploiting 3D axis-aligned bounding boxes in the dynamic convolution further boosts performance. Our method sets new state-of-the-art results in AP on ScanNetV2 (55.9), S3DIS (60.8) and STPLS3D (49.2), while maintaining a fast inference time (237 ms per scene on ScanNetV2).

Recommended reason:

Existing methods for 3D instance segmentation are dominated by a bottom-up design: a manually fine-tuned algorithm groups points into clusters, followed by a refinement network. To address the limitations of this design, this paper introduces ISBNet, a novel cluster-free method that represents instances as kernels and decodes instance masks via dynamic convolution. To efficiently generate high-recall and discriminative kernels, a simple strategy named instance-aware farthest point sampling is proposed to sample candidates, together with a local aggregation layer inspired by PointNet++ to encode candidate features.
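
For reference, plain farthest point sampling, the building block that the instance-aware variant starts from, looks like the following sketch (the instance-aware weighting itself is not reproduced here).

```python
# Plain farthest point sampling over a point cloud (reference sketch).
import torch

def farthest_point_sampling(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """xyz: (N, 3) point cloud -> indices of k well-spread candidate points."""
    n = xyz.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = int(torch.randint(n, (1,)))                 # random seed point
    for i in range(k):
        idx[i] = farthest
        dist = torch.minimum(dist, (xyz - xyz[farthest]).pow(2).sum(-1))
        farthest = int(dist.argmax())                      # next pick: farthest from all picks
    return idx

pts = torch.rand(20000, 3)
centers = pts[farthest_point_sampling(pts, 256)]           # 256 instance-kernel candidates
print(centers.shape)  # torch.Size([256, 3])
```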

Source: blog.csdn.net/wzk4869/article/details/131325920