[Computer Vision | Image Segmentation] arXiv Computer Vision Academic Express on Image Segmentation (Collection of Papers, September 1)

1. Segmentation | Semantics-related (10 papers)

1.1 PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction

PointOcc: Cylindrical tri-perspective view for point-based 3D semantic occupancy prediction

https://arxiv.org/abs/2308.16896

Semantic segmentation in autonomous driving has evolved from sparse point segmentation to dense voxel segmentation, where the goal is to predict the semantic occupancy of each voxel in the 3D space of interest. The dense nature of the prediction space renders existing efficient methods based on 2D projections (e.g., bird's-eye view, range view) ineffective, since they can only describe subspaces of the 3D scene. To solve this problem, we propose a cylindrical tri-perspective view (TPV) to represent point clouds effectively and comprehensively, and a PointOcc model to process them efficiently. Considering the distance distribution of LiDAR point clouds, we construct the tri-perspective view in a cylindrical coordinate system to model nearer areas more finely. We employ spatial group pooling to preserve structural details during projection and a 2D backbone to efficiently process each TPV plane. Finally, we obtain the features of each point by aggregating its projected features on each TPV plane without any post-processing. Extensive experiments on 3D occupancy prediction and LiDAR segmentation benchmarks demonstrate that the proposed PointOcc achieves state-of-the-art performance at much faster speed. Specifically, despite using only LiDAR, PointOcc significantly outperforms all other methods, including multimodal methods, by a large margin on the OpenOccupancy benchmark. Code: https://github.com/wzzheng/PointOcc.
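
As an illustration of the projection idea described above, the following minimal sketch (not the authors' implementation) maps LiDAR points onto the three planes of a cylindrical tri-perspective view with per-cell max pooling; the grid sizes, range limits, and feature dimension are arbitrary assumptions.

```python
# Illustrative sketch only: project point features onto the rho-phi, rho-z, and phi-z
# planes of a cylindrical tri-perspective view with per-cell max pooling.
import numpy as np

def cylindrical_tpv(points, feats, grid=(64, 64, 32), rho_max=50.0, z_range=(-3.0, 5.0)):
    """points: (N, 3) xyz; feats: (N, C) per-point features."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)                      # in [-pi, pi]

    # Discretise each cylindrical coordinate into its grid.
    r_idx = np.clip((rho / rho_max * grid[0]).astype(int), 0, grid[0] - 1)
    p_idx = np.clip(((phi + np.pi) / (2 * np.pi) * grid[1]).astype(int), 0, grid[1] - 1)
    z_idx = np.clip(((z - z_range[0]) / (z_range[1] - z_range[0]) * grid[2]).astype(int),
                    0, grid[2] - 1)

    C = feats.shape[1]
    planes = {
        "rho_phi": np.full((grid[0], grid[1], C), -np.inf),
        "rho_z":   np.full((grid[0], grid[2], C), -np.inf),
        "phi_z":   np.full((grid[1], grid[2], C), -np.inf),
    }
    # Max-pool point features into each plane (a stand-in for the paper's
    # spatial group pooling along the collapsed axis).
    for c in range(C):
        np.maximum.at(planes["rho_phi"][:, :, c], (r_idx, p_idx), feats[:, c])
        np.maximum.at(planes["rho_z"][:, :, c],   (r_idx, z_idx), feats[:, c])
        np.maximum.at(planes["phi_z"][:, :, c],   (p_idx, z_idx), feats[:, c])
    for k in planes:
        planes[k][np.isinf(planes[k])] = 0.0     # empty cells -> 0
    return planes

# Example: 10k random points with 16-dim features.
pts = np.random.randn(10000, 3) * np.array([20.0, 20.0, 1.5])
f = np.random.randn(10000, 16)
tpv = cylindrical_tpv(pts, f)
print({k: v.shape for k, v in tpv.items()})
```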

1.2 Coarse-to-Fine Amodal Segmentation with Shape Prior

Coarse-to-fine amodal segmentation based on shape priors

https://arxiv.org/abs/2308.16825

Amodal object segmentation is a challenging task that involves segmenting both the visible and occluded parts of an object. In this paper, we propose a novel approach, called coarse-to-fine segmentation (C2F-Seg), that addresses amodal segmentation by modeling it progressively from coarse to fine. C2F-Seg initially reduces the learning space from the pixel-level image space to a vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and the visible segment. However, this latent space lacks detailed information about the object, which makes it difficult to provide an accurate segmentation directly. To address this problem, we propose a convolutional refinement module that injects fine-grained information and provides a more precise amodal object segmentation based on visual features and the coarse predicted segmentation. To aid research on amodal object segmentation, we create a synthetic amodal dataset named MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We evaluate our model extensively on two benchmark datasets: KINS and COCOA. Our empirical results demonstrate the superiority of C2F-Seg. Furthermore, we demonstrate the potential of our approach on the video amodal object segmentation tasks of FISH and our proposed MOViD-A. Project page: http://jianxgao.github.io/C2F-Seg.
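
The refinement stage lends itself to a small schematic. The module below is a hedged sketch of a convolutional refiner that fuses visual features with a coarse amodal mask; channel sizes and depth are assumptions, not the paper's configuration.

```python
# Minimal sketch of a refinement module: image features + coarse mask -> refined mask.
import torch
import torch.nn as nn

class MaskRefiner(nn.Module):
    def __init__(self, feat_ch=64, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, feats, coarse_mask):
        # feats: (B, C, H, W) visual features; coarse_mask: (B, 1, H, W) coarse prediction.
        x = torch.cat([feats, coarse_mask], dim=1)
        return torch.sigmoid(self.net(x))        # refined amodal mask

refiner = MaskRefiner()
out = refiner(torch.randn(2, 64, 96, 96), torch.rand(2, 1, 96, 96))
print(out.shape)  # torch.Size([2, 1, 96, 96])
```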

1.3 BTSeg: Barlow Twins Regularization for Domain Adaptation in Semantic Segmentation

BTSeg: Barlow Twins regularization for domain adaptation in semantic segmentation

https://arxiv.org/abs/2308.16819

Semantic image segmentation is a key component of many computer vision systems, such as autonomous driving. In such applications, adverse conditions (heavy rain, nighttime, snow, extreme lighting) pose specific challenges, yet are often underrepresented in the available datasets. Generating more training data is tedious and expensive, and the process is inherently error-prone due to aleatoric uncertainty. To address this challenging problem, we propose BTSeg, which exploits image-level correspondences as a weakly supervised signal to learn a segmentation model that is agnostic to adverse conditions. To this end, our method uses the Barlow Twins loss from the field of unsupervised learning and treats images taken at the same location but under different adverse conditions as "augmentations" of the same unknown underlying base image. This allows training segmentation models that are robust to the appearance changes introduced by different adverse conditions. We evaluate our method on ACDC and the new, challenging ACG benchmark to demonstrate its robustness and generalization capabilities. Compared with current state-of-the-art methods, our method performs favorably while being simpler to implement and train. The code will be released upon acceptance.
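
The Barlow Twins objective the abstract refers to is compact enough to sketch. The snippet below is a minimal, generic version of that loss applied to paired embeddings of the same location under two conditions; the normalization details and the off-diagonal weight are assumptions.

```python
# Minimal sketch of a Barlow Twins loss on paired embeddings (same place, two conditions).
import torch

def barlow_twins_loss(z_a, z_b, lambda_off=5e-3):
    # z_a, z_b: (N, D) embeddings of the two "augmentations" of the same scene.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    n = z_a.shape[0]
    c = z_a.T @ z_b / n                          # (D, D) cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()            # push diagonal towards 1
    off_diag = c.pow(2).sum() - diag.pow(2).sum()  # push off-diagonal towards 0
    return on_diag + lambda_off * off_diag

loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```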

1.4 Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models

Ref-Diff: Zero-shot referring image segmentation based on generative models

https://arxiv.org/abs/2308.16777

Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask based on a given referring description, without training on this type of paired data. Current zero-shot methods mainly focus on using pre-trained discriminative models (e.g., CLIP). However, we have observed that generative models (e.g., Stable Diffusion) have the potential to understand the relationships between various visual elements and textual descriptions, which has rarely been investigated in this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff), which exploits the fine-grained multimodal information of generative models. We demonstrate that, without a proposal generator, a generative model alone can achieve performance comparable to existing SOTA weakly supervised models. When we combine generative and discriminative models, our Ref-Diff significantly outperforms these competing methods. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.
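
As a rough illustration of how generative attention could drive mask selection, the sketch below scores candidate masks against a word-level relevance map that is assumed to have been extracted from a generative model's cross-attention for the referring phrase; the extraction step itself is not shown, and all tensors here are dummies.

```python
# Schematic sketch: pick the candidate mask best covered by a (hypothetical) relevance map.
import torch

def score_masks(attn_map, masks, eps=1e-6):
    # attn_map: (H, W) relevance of each pixel to the referring expression.
    # masks: (K, H, W) binary candidate masks from any mask source.
    inside = (masks * attn_map).flatten(1).sum(-1)
    area = masks.flatten(1).sum(-1) + eps
    return inside / area                          # mean relevance inside each mask

attn = torch.rand(64, 64)
cands = (torch.rand(5, 64, 64) > 0.5).float()
best = score_masks(attn, cands).argmax()
print(int(best))
```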

1.5 Semi-Supervised SAR ATR Framework with Transductive Auxiliary Segmentation

Semi-supervised SAR ATR framework based on transductive auxiliary segmentation

https://arxiv.org/abs/2308.16633

Convolutional neural networks (CNNs) have achieved good performance in synthetic aperture radar (SAR) automatic target recognition (ATR). However, the performance of CNNs depends heavily on a large amount of training data. Insufficient labeling of SAR images limits recognition performance and even renders some ATR methods inapplicable; furthermore, many existing CNNs fail outright when labeled training data are very scarce. To address these challenges, we propose a semi-supervised SAR ATR framework with transductive auxiliary segmentation (SFAS). The framework focuses on exploiting an auxiliary loss on the available unlabeled samples as a form of transductive regularization. Through auxiliary segmentation of unlabeled SAR samples and an information residual loss (IRL) during training, the framework follows the proposed training cycle and gradually exploits the combined information of recognition and segmentation to construct a useful inductive bias, thus achieving high performance. Experiments on the MSTAR dataset demonstrate the effectiveness of the proposed SFAS for few-shot learning. With 20 training samples per class, the recognition rate can reach 94.18%, and accurate segmentation results are obtained at the same time. Under extended operating conditions (EOCs), the recognition rate remains above 88.00% with 10 training samples per class.
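
The following is a schematic sketch of how an auxiliary segmentation loss on unlabeled samples can regularize a recognition network, in the spirit of the description above; the auxiliary targets and the weighting are placeholders, and the information residual loss is not reproduced here.

```python
# Schematic multi-task loss: supervised classification on labeled SAR chips plus an
# auxiliary segmentation term on unlabeled chips (placeholder targets, assumed weighting).
import torch
import torch.nn.functional as F

def sfas_style_loss(cls_logits_l, labels, seg_logits_u, aux_seg_u, w_seg=0.5):
    # cls_logits_l: (B, num_classes) predictions on labeled samples.
    # seg_logits_u: (B, 1, H, W) segmentation logits on unlabeled samples.
    # aux_seg_u:    (B, 1, H, W) auxiliary segmentation targets (placeholder here).
    l_cls = F.cross_entropy(cls_logits_l, labels)
    l_seg = F.binary_cross_entropy_with_logits(seg_logits_u, aux_seg_u)
    return l_cls + w_seg * l_seg

loss = sfas_style_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                       torch.randn(8, 1, 64, 64), torch.rand(8, 1, 64, 64))
print(loss.item())
```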

1.6 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

3D-STMN: Dependency-driven superpoint-text matching network for end-to-end 3D referring expression segmentation

https://arxiv.org/abs/2308.16632

In 3D Referring Expression Segmentation (3D-RES), early methods adopt a two-stage paradigm that extracts segmentation proposals and then matches them with the referring expression. However, this conventional paradigm encounters significant challenges, most notably lackluster initial proposals and significantly slowed inference. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) enriched with dependency-driven insights. One of the keys of our model is the Superpoint-Text Matching (STM) mechanism. Unlike conventional approaches that navigate through instance proposals, STM directly relates linguistic cues to their respective superpoints, clusters of semantically related points. This architectural decision enables our model to effectively exploit cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs rather than the sparser instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate a Dependency-Driven Interaction (DDI) module to deepen the network's semantic understanding of referring expressions. Using dependency trees as beacons, this module identifies the intricate relationships between primary terms and their associated descriptors, thereby improving the localization and segmentation capabilities of our model. Comprehensive experiments on the ScanRefer benchmark show that our model not only sets a new performance standard, registering an mIoU gain of 11.7 points, but also achieves a remarkable boost in inference speed, surpassing traditional methods by 95.7 times. Code and models are available at https://github.com/sosppxo/3D-STMN.
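
A minimal sketch of the superpoint-text matching idea is given below: cosine similarity between superpoint embeddings and an expression embedding selects the referred superpoints, and the selection is broadcast back to points. Embedding sizes and the threshold are assumptions.

```python
# Minimal sketch: select superpoints by similarity to the expression, then expand to points.
import torch
import torch.nn.functional as F

def superpoint_text_match(sp_feats, text_feat, sp_ids, thresh=0.5):
    # sp_feats: (S, D) superpoint features; text_feat: (D,) expression embedding.
    # sp_ids: (N,) superpoint index of each of the N points in the scene.
    sim = F.cosine_similarity(sp_feats, text_feat.unsqueeze(0), dim=-1)  # (S,)
    sp_mask = (sim > thresh).float()                                     # selected superpoints
    return sp_mask[sp_ids]                                               # (N,) point-level mask

mask = superpoint_text_match(torch.randn(200, 256), torch.randn(256),
                             torch.randint(0, 200, (100000,)))
print(mask.shape, mask.sum().item())
```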

1.7 Self-Sampling Meta SAM: Enhancing Few-shot Medical Image Segmentation with Meta-Learning

Self-Sampling Meta SAM: Enhancing few-shot medical image segmentation with meta-learning

https://arxiv.org/abs/2308.16466

While the Segment Anything Model (SAM) performs well in the semantic segmentation of natural images, its performance deteriorates significantly when applied to medical images, mainly because medical images are insufficiently represented in its training data. Collecting a comprehensive, universally applicable dataset and training a model on it is particularly challenging due to the long-tail problem common in medical images. To address this gap, we propose a Self-Sampling Meta SAM (SSM-SAM) framework for few-shot medical image segmentation. Our innovation lies in the design of three key modules: 1) an online fast gradient descent optimizer, further optimized by a meta-learner, which ensures rapid and robust adaptation to new tasks; 2) a self-sampling module designed to provide well-aligned visual prompts to improve attention allocation; and 3) a robust attention-based decoder specifically designed for medical few-shot learning to capture the relationships between different slices. Extensive experiments on a popular abdominal CT dataset and an MRI dataset show that the proposed method achieves significant improvements over state-of-the-art methods in few-shot segmentation, with average improvements of 10.21% and 1.80% in DSC, respectively. In summary, we present a novel approach for rapid online adaptation in interactive image segmentation, adapting to a new organ in just 0.83 minutes. The code will be made publicly available on GitHub upon acceptance.
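
The fast online-adaptation idea can be sketched as a few gradient steps on the support slices of a new organ before predicting the query slice. The tiny stand-in model, step count, and learning rate below are illustrative assumptions, not the SSM-SAM architecture.

```python
# Sketch of fast online adaptation: a few SGD steps on support slices, then predict the query.
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_and_predict(model, support_x, support_y, query_x, inner_steps=5, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(inner_steps):                 # rapid task-specific adaptation
        loss = F.binary_cross_entropy_with_logits(model(support_x), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.sigmoid(model(query_x))     # prediction for the unseen slice

seg_head = nn.Conv2d(3, 1, 3, padding=1)         # stand-in for a segmentation decoder
pred = adapt_and_predict(seg_head, torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64),
                         torch.randn(1, 3, 64, 64))
print(pred.shape)
```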

1.8 Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for Semi-Supervised Medical Image Segmentation

Dual-decoder consistency via pseudo-label-guided data augmentation for semi-supervised medical image segmentation

https://arxiv.org/abs/2308.16573

Medical image segmentation methods often rely on fully supervised approaches to achieve excellent performance, which requires a large number of labeled images for training. However, annotating medical images is expensive and time-consuming. Semi-supervised learning offers a solution by leveraging a large number of unlabeled images alongside a limited set of annotated ones. In this paper, we introduce a semi-supervised medical image segmentation method based on the mean-teacher model, called Dual-decoder Consistency via Pseudo-label guided data Augmentation (DCPA). The method combines consistency regularization, pseudo-labels, and data augmentation to improve the effectiveness of semi-supervised segmentation. First, the proposed model comprises a student model and a teacher model that share an encoder and use two different decoders with different upsampling strategies. Minimizing the difference between the decoder outputs enforces a consistent representation, which serves as a regularizer during student training. Second, we introduce a mixing operation that blends unlabeled data with labeled data to create mixed samples for data augmentation. Finally, pseudo-labels are generated by the teacher model and used as labels for the mixed data when computing the unsupervised loss. We compare the segmentation results of the DCPA model with those of six state-of-the-art semi-supervised methods on three publicly available medical datasets. Beyond the classic 10% and 20% semi-supervised settings, we investigate performance with even less supervision (5% labeled data). Experimental results show that our method consistently outperforms existing semi-supervised medical image segmentation methods in all three semi-supervised settings.
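
Two ingredients of this recipe, the EMA teacher update and the mixing of unlabeled images with teacher pseudo-labels, are sketched below; the convex-combination mixing and the decay value are assumptions and may differ from the paper's exact operations.

```python
# Sketch: exponential-moving-average teacher update and labeled/unlabeled mixing.
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # Teacher weights track the student as an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def mix_batches(x_l, y_l, x_u, pseudo_u, alpha=0.5):
    # Convex combination of labeled and unlabeled images and of their (pseudo-)labels.
    x_mix = alpha * x_l + (1 - alpha) * x_u
    y_mix = alpha * y_l + (1 - alpha) * pseudo_u
    return x_mix, y_mix

x_mix, y_mix = mix_batches(torch.randn(2, 1, 96, 96), torch.rand(2, 1, 96, 96),
                           torch.randn(2, 1, 96, 96), torch.rand(2, 1, 96, 96))
print(x_mix.shape, y_mix.shape)
```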

1.9 Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

Improving segmentation of multiple sclerosis lesions across clinical sites: a federated learning approach with noise-resilient training

https://arxiv.org/abs/2308.16376

Accurate measurement of multiple sclerosis (MS) evolution with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps guide treatment strategies. Deep learning models have shown promise for automatically segmenting MS lesions, but a lack of accurately annotated data has hindered progress in this field. Obtaining sufficient data from a single clinical research center is challenging and does not address the need for heterogeneity for model robustness. In contrast, collecting data from multiple sites introduces data privacy concerns and potential label noise due to different annotation standards. To address this dilemma, we explore the use of a federated learning framework while accounting for label noise. Our approach enables collaboration between multiple clinical sites without compromising data privacy in a federated learning paradigm, incorporating a noise-robust training strategy based on label correction. Specifically, we introduce a decoupled hard label correction (DHLC) strategy that takes into account the imbalanced distribution and blurred boundaries of MS lesions, enabling correction of misannotations based on prediction confidence. We also introduce the Centralized Enhanced Label Correction (CELC) strategy, which utilizes the aggregated central model as the correction teacher for all sites, improving the reliability of the correction process. Extensive experiments on two multi-site datasets demonstrate the effectiveness and robustness of our proposed method, indicating its potential for multi-site collaborative clinical applications.
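
A hedged sketch of the two building blocks mentioned above, federated weight averaging and confidence-based hard label correction, is given below; the thresholds and size-weighted aggregation are assumptions rather than the paper's settings.

```python
# Schematic sketch: FedAvg-style aggregation plus confidence-based hard label correction.
import torch

@torch.no_grad()
def fed_avg(site_state_dicts, site_sizes):
    # Size-weighted average of per-site model weights (assumed aggregation rule).
    total = float(sum(site_sizes))
    avg = {}
    for key in site_state_dicts[0]:
        avg[key] = sum(sd[key] * (n / total) for sd, n in zip(site_state_dicts, site_sizes))
    return avg

def correct_labels(prob, noisy_label, high=0.9, low=0.1):
    # prob: (B, 1, H, W) lesion probability from a teacher model.
    # Flip voxels the model is very confident about; keep the rest unchanged.
    corrected = noisy_label.clone()
    corrected[prob > high] = 1.0
    corrected[prob < low] = 0.0
    return corrected

fixed = correct_labels(torch.rand(1, 1, 32, 32), torch.randint(0, 2, (1, 1, 32, 32)).float())
print(fixed.unique())
```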

1.10 A Recycling Training Strategy for Medical Image Segmentation with Diffusion Denoising Models

A recycling training strategy for medical image segmentation based on diffusion denoising models

https://arxiv.org/abs/2308.16355

Denoising diffusion models have been applied to image segmentation by generating segmentation masks conditioned on the image. Existing research focuses mainly on adjusting the model architecture or improving inference, such as test-time sampling strategies. In this work, we focus instead on improving the training strategy and propose a novel recycling method. During each training step, a segmentation mask is first predicted given an image and random noise. This predicted mask, replacing the conventional ground-truth mask, is then used for the denoising task during training. This approach can be interpreted as aligning the training strategy with inference by removing the reliance on the ground-truth mask when generating noisy samples. Our proposed method significantly outperforms standard diffusion training, self-conditioning, and existing recycling strategies on multiple medical imaging datasets: muscle ultrasound, abdominal CT, prostate MR, and brain MR. This holds for two widely adopted sampling strategies: denoising diffusion probabilistic models and denoising diffusion implicit models. Importantly, existing diffusion models often exhibit declining or unstable performance during inference, whereas our recycling method consistently improves or maintains performance. Furthermore, we show for the first time that the proposed recycling-based diffusion model achieves performance equivalent to non-diffusion-based supervised training under a fair comparison with the same network architecture and computational budget. The paper summarizes these quantitative results and discusses their value, with a fully reproducible JAX-based implementation released at https://github.com/mathpluscode/ImgX-DiffSeg.
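
One recycling training step, as described in the abstract, can be sketched as follows: a mask is first predicted from the image and random noise, and that prediction, rather than the ground-truth mask, is noised for the denoising pass. The toy network, the linear noise schedule, and the choice of loss target are assumptions.

```python
# Sketch of a recycling training step for a diffusion-based segmentation model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def recycling_step(model, image, gt_mask, T=1000):
    t = torch.randint(1, T, (1,)).item()
    alpha = 1.0 - t / T                              # assumed simple noise schedule
    noise = torch.randn_like(gt_mask)

    with torch.no_grad():                            # pass 1: predict a mask from pure noise
        pred_mask = torch.sigmoid(model(image, noise))

    # Pass 2: noise the *predicted* mask instead of the ground truth, then denoise.
    x_t = alpha ** 0.5 * pred_mask + (1 - alpha) ** 0.5 * noise
    recon = model(image, x_t)
    return F.binary_cross_entropy_with_logits(recon, gt_mask)

class ToySegDiff(nn.Module):                         # stand-in for a conditioned U-Net
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(2, 1, 3, padding=1)
    def forward(self, image, mask_t):
        return self.net(torch.cat([image, mask_t], dim=1))

loss = recycling_step(ToySegDiff(), torch.randn(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
print(loss.item())
```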

Origin blog.csdn.net/wzk4869/article/details/132753218