[Computer Vision | Image Segmentation] arXiv Computer Vision Academic Express on Image Segmentation (Collection of Papers, September 20)

1. Segmentation | Semantics-related (11 papers)

1.1 Few-Shot Panoptic Segmentation With Foundation Models

Few-shot panoptic segmentation based on foundation models

https://arxiv.org/abs/2309.10726

Current state-of-the-art methods for panoptic segmentation require large amounts of annotated training data, which is laborious and expensive to obtain, posing a significant obstacle to their widespread adoption. Meanwhile, recent breakthroughs in visual representation learning have triggered a paradigm shift toward large foundation models that can be trained on completely unlabeled images. In this work, we exploit such task-agnostic image features to enable few-shot panoptic segmentation, proposing Segmenting Panoptic Information with Nearly 0 labels (SPINO). Specifically, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that, although trained on only 10 annotated images, our method predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, SPINO achieves competitive results against fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for leveraging foundation models to learn complex visual recognition tasks. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems in both outdoor and indoor environments. To facilitate future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.
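
As a rough illustration of the recipe described above (not the authors' exact architecture), the sketch below pairs a frozen self-supervised backbone with two lightweight convolutional heads for semantic segmentation and boundary estimation; the backbone stub, channel sizes, and head designs are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneStub(nn.Module):
    """Stands in for a frozen DINOv2-style feature extractor (hypothetical stub)."""
    def __init__(self, out_channels: int = 384):
        super().__init__()
        self.proj = nn.Conv2d(3, out_channels, kernel_size=14, stride=14)
        for p in self.parameters():
            p.requires_grad = False  # the backbone stays frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # coarse patch features, stride 14

class LightweightHead(nn.Module):
    """Small conv head decoding frozen features into dense predictions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_ch, 1),
        )

    def forward(self, feats, size):
        return F.interpolate(self.net(feats), size=size, mode="bilinear",
                             align_corners=False)

backbone = FrozenBackboneStub()
semantic_head = LightweightHead(384, out_ch=19)  # e.g. 19 semantic classes
boundary_head = LightweightHead(384, out_ch=1)   # instance-boundary logits

img = torch.randn(1, 3, 224, 224)
feats = backbone(img)
sem_logits = semantic_head(feats, img.shape[-2:])
bnd_logits = boundary_head(feats, img.shape[-2:])
print(sem_logits.shape, bnd_logits.shape)
```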

1.2 Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation

Label-free 3D segmentation via cross-modal and cross-domain knowledge transfer

https://arxiv.org/abs/2309.10649

Current state-of-the-art point cloud-based perception methods often rely on large-scale labeled data, which is expensive to annotate manually. A natural alternative is to explore unsupervised methods for 3D perception tasks; however, such methods often suffer from substantial performance degradation. Fortunately, a large number of image-based datasets already exist, which suggests an alternative: transferring knowledge from 2D images to 3D point clouds. Specifically, we propose a new approach to the challenging cross-modal and cross-domain adaptation task that fully explores the relationship between images and point clouds and designs effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art 3D point cloud semantic segmentation performance on SemanticKITTI, compared with existing unsupervised and weakly supervised baselines, by using knowledge from KITTI360 and GTA5.
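
To make the idea of feature alignment concrete, here is a minimal sketch of one plausible ingredient: a cosine-distance loss pulling each 3D point's feature toward the feature of the pixel it projects to. The precomputed point-pixel pairing and the loss form are assumptions; the paper's actual alignment strategy is more elaborate.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(point_feats, pixel_feats):
    """point_feats: (N, C) features of 3D points; pixel_feats: (N, C)
    features of the pixels those points project to (pairing assumed
    precomputed from camera calibration). Cosine-distance alignment."""
    p = F.normalize(point_feats, dim=1)
    q = F.normalize(pixel_feats, dim=1)
    return (1.0 - (p * q).sum(dim=1)).mean()

point_feats = torch.randn(1024, 64, requires_grad=True)  # 3D branch output
pixel_feats = torch.randn(1024, 64)                      # 2D branch, detached
loss = cross_modal_alignment_loss(point_feats, pixel_feats)
loss.backward()
print(float(loss))
```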

1.3 Edge-aware Feature Aggregation Network for Polyp Segmentation

Edge-aware feature aggregation network for polyp segmentation

https://arxiv.org/abs/2309.10523

In clinical practice, accurate polyp segmentation is crucial for the early diagnosis and prevention of colorectal cancer (CRC). However, due to scale variations and blurred polyp boundaries, achieving satisfactory segmentation performance across different scales and shapes remains challenging. In this study, we propose a novel edge-aware feature aggregation network (EFA-Net) for polyp segmentation, which fully exploits cross-level and multi-scale features to improve performance. Specifically, we first present an edge-aware guidance module (EGM) that combines low-level features with high-level features to learn an edge-enhanced feature, which is incorporated into each decoder unit using a layer-by-layer strategy. Furthermore, a scale-aware convolution module (SCM) is proposed to learn scale-aware features by using dilated convolutions with different rates, effectively handling scale variations. In addition, a cross-level fusion module (CFM) is proposed to effectively integrate cross-level features, exploiting both local and global context information. Finally, the outputs of the CFMs are adaptively weighted by the learned edge-aware features and then used to produce multiple side-output segmentation maps. Experimental results on five widely adopted colonoscopy datasets show that our EFA-Net outperforms state-of-the-art polyp segmentation methods in terms of generalization and effectiveness.
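
The multi-scale dilated-convolution idea behind the SCM can be sketched as follows; the dilation rates, channel widths, and fusion step are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ScaleAwareConv(nn.Module):
    """Parallel dilated convolutions at several rates, fused by a 1x1 conv."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        # each branch sees a different receptive field, covering scale variation
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(out)

x = torch.randn(1, 64, 32, 32)
print(ScaleAwareConv(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```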

1.4 Spatial-Assistant Encoder-Decoder Network for Real Time Semantic Segmentation

Spatial-assistant encoder-decoder network for real-time semantic segmentation

https://arxiv.org/abs/2309.10519

Semantic segmentation is an essential technology for self-driving cars to understand their surroundings. Currently, real-time semantic segmentation networks usually adopt either an encoder-decoder architecture or a dual-path architecture. In general, encoder-decoder models tend to be faster, while dual-path models exhibit higher accuracy. To combine these two advantages, we propose the spatial-assistant encoder-decoder network (SANet), which fuses the two architectures. In the overall architecture, we adhere to the encoder-decoder design while keeping the feature maps in the middle part of the encoder at the same resolution and using an atrous convolution branch for feature extraction. At the end of the encoder, we integrate an asymmetric pooling pyramid pooling module (APPPM) to optimize semantic extraction from the feature maps; this module contains asymmetric pooling layers that extract features at multiple resolutions. In the decoder, we propose a hybrid attention module, SAD, which integrates horizontal and vertical attention to facilitate the combination of the branches. To confirm the effectiveness of our approach, SANet achieves competitive results on the real-time CamVid and Cityscapes benchmarks. On a single 2080 Ti GPU, SANet achieves 78.4% mIoU at 65.1 FPS on the Cityscapes test set and 78.8% mIoU at 147 FPS on the CamVid test set. The training code and models for SANet are available at https://github.com/CuZaoo/SANet-main
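
A minimal sketch of the strip-shaped ("asymmetric") pooling idea underlying APPPM is shown below; the kernel sizes and the fusion step are assumptions, not the published module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricPooling(nn.Module):
    """Strip pooling with 1xk and kx1 windows at several sizes, then fusion."""
    def __init__(self, channels: int, ks=(3, 5)):
        super().__init__()
        self.fuse = nn.Conv2d(channels * (2 * len(ks) + 1), channels, 1)
        self.ks = ks

    def forward(self, x):
        outs = [x]
        for k in self.ks:
            # horizontal and vertical strip pooling capture row/column context
            outs.append(F.avg_pool2d(x, (1, k), stride=1, padding=(0, k // 2)))
            outs.append(F.avg_pool2d(x, (k, 1), stride=1, padding=(k // 2, 0)))
        return self.fuse(torch.cat(outs, dim=1))

x = torch.randn(1, 128, 16, 32)
print(AsymmetricPooling(128)(x).shape)  # torch.Size([1, 128, 16, 32])
```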

1.5 Uncertainty Estimation in Instance Segmentation with Star-convex Shapes

Uncertainty estimation in instance segmentation with star-convex shapes

https://arxiv.org/abs/2309.10513

Instance segmentation has seen promising progress through deep neural network based algorithms. However, these models often produce incorrect predictions with unwarranted confidence. Assessing prediction uncertainty therefore becomes key to informed decision-making. Existing methods mainly focus on quantifying uncertainty in classification or regression tasks, paying little attention to instance segmentation. Our study addresses the challenge of estimating the spatial certainty associated with the locations of instances with star-convex shapes. Two distinct clustering approaches are evaluated, which compute spatial and fractional certainty per instance from samples generated by Monte Carlo dropout or the deep ensemble technique. Our study shows that combining spatial and fractional certainty scores yields better-calibrated estimates than either certainty score alone. Notably, our experimental results show that the deep ensemble technique combined with our novel radial clustering approach proves to be an effective strategy. Our findings highlight the significance of calibration for assessing model reliability and the certainty of its decisions.
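
Monte Carlo dropout, one of the two sampling techniques evaluated above, can be sketched as follows; the toy model is a placeholder, and in the paper the resulting samples feed clustering that yields per-instance spatial and fractional certainty.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.5),             # stays stochastic during sampling
    nn.Conv2d(16, 1, 1),
)

def mc_dropout_samples(model, x, n_samples=20):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        return torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])

x = torch.randn(1, 3, 64, 64)
samples = mc_dropout_samples(model, x)   # (n_samples, B, 1, H, W)
mean_pred = samples.mean(dim=0)          # average foreground probability
uncertainty = samples.var(dim=0)         # per-pixel predictive variance
print(mean_pred.shape, float(uncertainty.mean()))
```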

1.6 Single-Image based unsupervised joint segmentation and denoising

Unsupervised joint segmentation and denoising based on a single image

https://arxiv.org/abs/2309.10511

In this work, we develop an unsupervised method for the joint segmentation and denoising of a single image. To this end, we combine the advantages of variational segmentation methods with the power of self-supervised, single-image-based deep learning. A major advantage of our approach is that, in contrast to data-driven methods that require a large number of labeled samples, our model can segment an image into meaningful regions without any training database. Furthermore, we introduce a novel energy functional in which denoising and segmentation are coupled so that both tasks benefit from each other. This specific combination with self-supervised image denoising addresses a limitation of existing single-image-based variational segmentation methods: their inability to handle high noise levels or generic texture. We propose a unified optimization strategy and show that, especially for very noisy images as found in microscopy, our joint approach outperforms its sequential counterpart as well as alternative methods that focus purely on denoising or segmentation. A further comparison with supervised deep learning methods designed for the same application highlights the strong performance of our method.
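
To illustrate how denoising and segmentation can be coupled in a single energy, here is a minimal sketch under strong assumptions: a data-fidelity term plus a piecewise-constant (Chan-Vese-style) region term evaluated on the denoised image, with a total-variation-style length penalty on the mask. The paper's actual energy functional differs; this only shows the coupling idea.

```python
import torch

def coupled_energy(noisy, denoised, soft_mask, lam=1.0, mu=0.1):
    """noisy, denoised: (B,1,H,W); soft_mask: (B,1,H,W) in [0,1]."""
    fidelity = ((denoised - noisy) ** 2).mean()  # stay close to the data
    # region means of the *denoised* image under the current segmentation
    c1 = (denoised * soft_mask).sum() / soft_mask.sum().clamp(min=1e-6)
    c0 = (denoised * (1 - soft_mask)).sum() / (1 - soft_mask).sum().clamp(min=1e-6)
    region = (soft_mask * (denoised - c1) ** 2
              + (1 - soft_mask) * (denoised - c0) ** 2).mean()
    # total-variation-style penalty encouraging short region boundaries
    tv = (soft_mask.diff(dim=-1).abs().mean()
          + soft_mask.diff(dim=-2).abs().mean())
    return fidelity + lam * region + mu * tv

noisy = torch.rand(1, 1, 64, 64)
denoised = noisy.clone().requires_grad_(True)
mask = torch.full_like(noisy, 0.5, requires_grad=True)
coupled_energy(noisy, denoised, mask).backward()  # both tasks get gradients
```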

1.7 RECALL+: Adversarial Web-based Replay for Continual Learning in Semantic Segmentation

RECALL+: adversarial web-based replay for continual learning in semantic segmentation

https://arxiv.org/abs/2309.10479

Catastrophic forgetting of previous knowledge is a key issue in continual learning and is typically handled through various regularization strategies. However, existing methods struggle especially when several incremental steps are performed. In this paper, we extend our previous approach (RECALL) and address forgetting by exploiting unsupervised web-crawled data to retrieve examples of old classes from online databases. Unlike the original method, which performed no evaluation of the web data, here we introduce two novel approaches based on adversarial learning and adaptive thresholding to select, from the web data, only those samples whose statistics closely resemble those of the no-longer-available training data. Additionally, we improve the pseudo-labeling scheme to achieve more accurate labeling of the web data, also taking into account the step at which each class was learned. Experimental results show that this enhanced method achieves remarkable results, especially when multiple incremental learning steps are performed.
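
A minimal sketch of the sample-selection idea (not the paper's exact procedure): keep only web images whose "looks like training data" score, e.g. from an adversarially trained discriminator, exceeds a data-dependent threshold.

```python
import torch

def select_web_samples(scores: torch.Tensor, keep_fraction: float = 0.3):
    """scores: (N,) per-image score indicating similarity to the original
    training data. Keeps the top fraction via an adaptive threshold."""
    threshold = torch.quantile(scores, 1.0 - keep_fraction)
    keep = scores >= threshold
    return keep.nonzero(as_tuple=False).squeeze(1), threshold

scores = torch.rand(1000)    # e.g. sigmoid outputs of a domain discriminator
idx, thr = select_web_samples(scores)
print(len(idx), float(thr))  # ~300 retained web samples
```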

1.8 Fully automated landmarking and facial segmentation on 3D photographs

Fully automated landmarking and facial segmentation on 3D photographs

https://arxiv.org/abs/2309.10472

Three-dimensional facial stereophotogrammetry provides a detailed representation of craniofacial soft tissue without the use of ionizing radiation. While manual annotation of landmarks serves as the current gold standard for cephalometric analysis, it is a time-consuming process and prone to human error. The aim of this study was to develop and evaluate an automated cephalometric annotation method using a deep learning-based approach. Ten landmarks were manually annotated on 2897 3D facial photographs by a single observer. The automated landmarking workflow involves two successive DiffusionNet models and additional algorithms for facial segmentation. The dataset was randomly split into a training dataset and a test dataset: the training dataset was used to train the deep learning networks, while the test dataset was used to evaluate the performance of the automated workflow. The precision of the workflow was evaluated by calculating the Euclidean distances between automated and manual landmarks and comparing them with the intra- and inter-observer variability of manual annotation and of a semi-automated landmarking method. The workflow was successful in 98.6% of all test cases. The deep learning-based landmarking method achieved precise and consistent landmark annotation. The mean precision of 1.69 (+/-1.15) mm was comparable to the inter-observer variability of manual annotation (1.31 +/-0.91 mm). 69% of the Euclidean distances between automated and manual landmarks were within 2 mm. The proposed DiffusionNet-based approach achieves fully automated landmark annotation on 3D photographs, enabling quantitative analysis of large datasets and supporting diagnosis, follow-up, and virtual surgical planning.
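
The reported precision metric is straightforward to reproduce; the sketch below computes per-landmark Euclidean distances and the fraction within 2 mm, assuming both annotations are in millimetres and in point-to-point correspondence.

```python
import numpy as np

def landmark_errors(auto_pts: np.ndarray, manual_pts: np.ndarray):
    """auto_pts, manual_pts: (N_landmarks, 3) coordinates in mm."""
    dists = np.linalg.norm(auto_pts - manual_pts, axis=1)
    return dists.mean(), dists.std(), (dists < 2.0).mean()

auto_pts = np.random.rand(10, 3) * 100           # 10 landmarks (simulated)
manual_pts = auto_pts + np.random.randn(10, 3)   # simulated annotation noise
mean_mm, std_mm, within_2mm = landmark_errors(auto_pts, manual_pts)
print(f"{mean_mm:.2f} (+/-{std_mm:.2f}) mm, {within_2mm:.0%} within 2 mm")
```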

1.9 UPL-SFDA: Uncertainty-aware Pseudo Label Guided Source-Free Domain Adaptation for Medical Image Segmentation

UPL-SFDA: Uncertainty-aware pseudo-label guided source-free domain adaptation for medical image segmentation

https://arxiv.org/abs/2309.10244

Domain adaptation (DA) is important for deep learning-based medical image segmentation models to handle test images from new target domains. Since source-domain data are often unavailable when a trained model is deployed at a new center, source-free domain adaptation (SFDA) is appealing for data- and annotation-efficient adaptation to the target domain. However, existing SFDA methods have limited performance due to the lack of sufficient supervision, with source-domain images unavailable and target-domain images unlabeled. We propose a novel uncertainty-aware pseudo-label guided (UPL) SFDA method for medical image segmentation. Specifically, we propose target domain growing (TDG), which enhances the diversity of predictions in the target domain by duplicating the prediction head of the pre-trained model multiple times with perturbations. The different predictions of these duplicated heads are used to obtain pseudo-labels for unlabeled target-domain images, together with their uncertainties, to identify reliable pseudo-labels. We also propose a twice forward pass supervision (TFS) strategy that uses reliable pseudo-labels obtained in one forward pass to supervise the predictions in the next forward pass. The adaptation is further regularized by an entropy-minimization term on the mean prediction, which encourages confident and consistent results across the different prediction heads. UPL-SFDA was validated on a multi-site heart MRI segmentation dataset, a cross-modality fetal brain segmentation dataset, and a 3D fetal tissue segmentation dataset. Compared with the baseline, the average Dice on these three tasks improved by 5.54, 5.01, and 6.89 percentage points, respectively, outperforming several state-of-the-art SFDA methods.
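
Two of the ingredients above, uncertainty-filtered pseudo-labels from duplicated prediction heads and entropy minimization on the mean prediction, can be sketched as follows; the head outputs are simulated and the reliability threshold is an assumption.

```python
import torch

def pseudo_labels_and_entropy(head_probs: torch.Tensor, max_std: float = 0.1):
    """head_probs: (K, B, C, H, W) softmax outputs of K duplicated heads."""
    mean_p = head_probs.mean(dim=0)                    # (B, C, H, W)
    uncertainty = head_probs.std(dim=0).mean(dim=1)    # (B, H, W) disagreement
    pseudo = mean_p.argmax(dim=1)                      # (B, H, W) hard labels
    reliable = uncertainty < max_std                   # mask of trusted pixels
    entropy = -(mean_p * mean_p.clamp(min=1e-8).log()).sum(dim=1).mean()
    return pseudo, reliable, entropy

head_probs = torch.softmax(torch.randn(4, 2, 3, 32, 32), dim=2)  # 4 heads
pseudo, reliable, ent = pseudo_labels_and_entropy(head_probs)
print(pseudo.shape, reliable.float().mean().item(), ent.item())
```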

1.10 Multi-level feature fusion network combining attention mechanisms for polyp segmentation

Multi-level feature fusion network combined with attention mechanism for polyp segmentation

https://arxiv.org/abs/2309.10219

Clinically, automated polyp segmentation technology has the potential to significantly improve the efficiency and accuracy of medical diagnosis, thereby reducing patients' risk of colorectal cancer. Unfortunately, existing methods suffer from two significant weaknesses that can impair segmentation accuracy. First, the features extracted by the encoder are not adequately filtered and utilized. Second, the semantic conflicts and information redundancy caused by feature fusion are not attended to. To overcome these limitations, we propose a novel polyp segmentation method named MLFF-Net, which exploits multi-level feature fusion and attention mechanisms. Specifically, MLFF-Net comprises three modules: a multi-scale attention module (MAM), a high-level feature enhancement module (HFEM), and a global attention module (GAM). The MAM extracts multi-scale information and polyp details from the shallow outputs of the encoder. In the HFEM, the deep features of the encoder complement each other through aggregation; meanwhile, an attention mechanism redistributes the weights of the aggregated features, weakening the conflicting and redundant parts while highlighting information useful for the task. The GAM combines encoder and decoder features and computes global dependencies to counter the locality of the receptive field. Experimental results on five public datasets show that the proposed method not only segments multiple types of polyps but also outperforms existing methods in both accuracy and generalization ability.
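
A minimal sketch of attention-based reweighting of fused features, in the spirit of the modules described above: a squeeze-and-excitation-style channel gate applied after two feature maps are merged, damping redundant channels. The widths and reduction ratio are illustrative assumptions, not the paper's modules.

```python
import torch
import torch.nn as nn

class FuseWithChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, a, b):
        fused = self.fuse(torch.cat([a, b], dim=1))
        return fused * self.gate(fused)  # reweight channels, damp redundancy

a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(FuseWithChannelAttention(64)(a, b).shape)  # torch.Size([1, 64, 32, 32])
```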

1.11 An Empirical Study of Attention Networks for Semantic Segmentation

Empirical study on attention network for semantic segmentation

https://arxiv.org/abs/2309.10217

Semantic segmentation is an important problem in computer vision. In recent years, end-to-end convolutional neural networks have become the common solution for semantic segmentation, achieving much higher accuracy than traditional methods. Recently, attention-based decoders have achieved state-of-the-art (SOTA) performance on various datasets. However, these networks are usually compared with previous SOTA networks only in terms of mIoU to prove their superiority, ignoring characteristics such as computational complexity and per-category accuracy, which are essential for engineering applications. In addition, the methods for analyzing FLOPs and memory are not consistent across different networks, which makes fair comparison difficult. Furthermore, although various methods utilize attention for semantic segmentation, a summary of these methods is lacking. This paper first conducts experiments with attention networks, analyzes their computational complexity, and compares their performance. It then summarizes the scenarios each network is suited to and the key points that should be considered when building attention networks. Finally, it points out future development directions for attention networks.
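
The consistency issue the paper raises can be addressed by profiling every network under one protocol; the sketch below uses fvcore's FLOP counter (one possible tool, assumed here) and PyTorch's peak-memory statistics, with a placeholder network standing in for the real models.

```python
import torch
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

def profile(model: torch.nn.Module, input_shape=(1, 3, 1024, 2048)):
    """Count FLOPs and peak GPU memory with the same protocol for every net."""
    model.eval()
    x = torch.randn(*input_shape)
    flops = FlopCountAnalysis(model, x).total()
    peak_mem = None
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        with torch.no_grad():
            model.cuda()(x.cuda())
        peak_mem = torch.cuda.max_memory_allocated()  # bytes
    return flops, peak_mem

model = torch.nn.Conv2d(3, 19, 3, padding=1)  # placeholder network
flops, peak = profile(model)
print(f"{flops / 1e9:.2f} GFLOPs, peak memory (bytes): {peak}")
```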
