[Computer Vision | Image Segmentation] arXiv Computer Vision Digest on Image Segmentation (a collection of papers from August 1)

Article directory

1. Segmentation | Semantics-related (16 papers)

1.1 DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation


https://arxiv.org/abs/2307.16803

In this technical report, we present the results of a study conducted on the Human-Object Interaction 4D (HOI4D) dataset for the task of egocentric action segmentation. As a relatively new field of research, point cloud video methods may not be good at temporal modeling, especially for long point cloud videos (e.g., 150 frames). In contrast, traditional methods for video understanding are well developed, and their effectiveness in temporal modeling has been extensively validated on many large-scale video datasets. Therefore, we convert point cloud videos to depth videos and adopt traditional video modeling methods to improve 4D action segmentation. By fusing the depth and point cloud video methods, accuracy is significantly improved. The proposed method, DPMix (Mixture of Depth and Point Cloud Video Experts), won first place in the 4D action segmentation track of the HOI4D Challenge 2023.
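The abstract does not spell out how the two expert families are fused. A minimal late-fusion sketch of the idea, averaging per-frame class probabilities from a depth-video expert and a point-cloud-video expert, might look like the following; the logit shapes and the mixing weight `alpha` are illustrative assumptions, not the paper's actual scheme:

```python
import torch

def fuse_expert_predictions(depth_logits: torch.Tensor,
                            pc_logits: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Late fusion of two per-frame action-segmentation experts.

    depth_logits, pc_logits: (T, num_classes) frame-wise logits from the
    depth-video expert and the point-cloud-video expert (hypothetical shapes).
    alpha: mixing weight for the depth expert (assumed; tuned in practice).
    """
    fused = alpha * depth_logits.softmax(-1) + (1 - alpha) * pc_logits.softmax(-1)
    return fused.argmax(-1)  # per-frame action labels
```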

1.2 Investigating and Improving Latent Density Segmentation Models for Aleatoric Uncertainty Quantification in Medical Imaging


https://arxiv.org/abs/2307.16694

Data uncertainties, such as sensor noise or occlusions, can introduce irreducible ambiguities into images, leading to different yet plausible semantic hypotheses. In machine learning, this ambiguity is commonly referred to as aleatoric uncertainty. Latent density models can be utilized to address this problem in image segmentation. The most popular approach is the Probabilistic U-Net (PU-Net), which uses latent normal densities to optimize a conditional-data log-likelihood evidence lower bound. In this work, we demonstrate that the PU-Net latent space is severely inhomogeneous. As a result, the effectiveness of gradient descent is inhibited and the model becomes extremely sensitive to the localization of samples in the latent space, leading to flawed predictions. To address this, we propose the Sinkhorn PU-Net (SPU-Net), which uses the Sinkhorn divergence to promote homogeneity across all latent dimensions, effectively improving gradient-descent updates and model robustness. Our results show that, applied to public datasets for various clinical segmentation problems, SPU-Net achieves up to an 11% performance improvement over previous latent-variable models for probabilistic segmentation on the Hungarian-matched metric. The results indicate that latent density modeling for medical image segmentation can be significantly improved by encouraging a homogeneous latent space.
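The abstract does not give SPU-Net's exact objective. As a rough sketch of the core idea, adding a Sinkhorn-divergence penalty that pulls the latent samples toward an isotropic reference, here is a minimal log-domain implementation in PyTorch; the reference distribution and the weight `lam` are assumptions:

```python
import math
import torch

def entropic_ot(x, y, eps=0.1, iters=100):
    """Entropic OT cost between point clouds x:(n,d), y:(m,d), uniform marginals."""
    C = torch.cdist(x, y) ** 2                      # squared-Euclidean cost matrix
    n, m = C.shape
    log_a = torch.full((n, 1), -math.log(n), device=x.device)
    log_b = torch.full((1, m), -math.log(m), device=x.device)
    f = torch.zeros(n, 1, device=x.device)
    g = torch.zeros(1, m, device=x.device)
    for _ in range(iters):                          # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g - C) / eps + log_b, dim=1, keepdim=True)
        g = -eps * torch.logsumexp((f - C) / eps + log_a, dim=0, keepdim=True)
    log_P = (f + g - C) / eps + log_a + log_b       # log of the transport plan
    return (log_P.exp() * C).sum()

def sinkhorn_divergence(x, y, eps=0.1):
    """Debiased Sinkhorn divergence S(x, y) >= 0."""
    return (entropic_ot(x, y, eps)
            - 0.5 * entropic_ot(x, x, eps)
            - 0.5 * entropic_ot(y, y, eps))

# Hypothetical use inside a PU-Net-style training step:
# z = posterior.rsample()                  # latent samples, shape (batch, d)
# ref = torch.randn_like(z)                # isotropic reference (assumption)
# loss = elbo_loss + lam * sinkhorn_divergence(z, ref)
```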

1.3 Domain Adaptation for Medical Image Segmentation using Transformation-Invariant Self-Training


https://arxiv.org/abs/2307.16660

Models that can leverage unlabeled data are crucial for overcoming the large distribution gaps between datasets acquired with different imaging devices and configurations. In this regard, self-training techniques based on pseudo-labels have been shown to be highly effective for semi-supervised domain adaptation. However, the unreliability of pseudo-labels can hinder the ability of self-training techniques to induce abstract representations from the unlabeled target dataset, especially when the distribution gap is large. Since the performance of a neural network should be invariant to image transformations, we exploit this fact to identify uncertain pseudo-labels. Indeed, we argue that transformation-invariant detections provide a more reasonable approximation of the ground truth. Accordingly, we propose a semi-supervised learning strategy for domain adaptation called Transformation-Invariant Self-Training (TI-ST). The proposed method assesses the reliability of pixel-wise pseudo-labels and filters out unreliable detections during self-training. We perform a comprehensive evaluation of domain adaptation using three different modalities of medical images, two different network architectures, and several alternative state-of-the-art domain adaptation methods. Experimental results confirm the superiority of our proposed method in mitigating the lack of target-domain annotations and improving target-domain segmentation performance.
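As a concrete illustration of the transformation-invariance test, the sketch below uses a single horizontal flip as the transform plus a confidence threshold; the actual TI-ST may use different transforms and reliability criteria:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def transformation_invariant_pseudo_labels(model, images, tau=0.9):
    """Keep only pixel pseudo-labels that survive a horizontal flip.

    images: (B, C, H, W). Predictions that change under an
    invariance-preserving transform are treated as unreliable and
    excluded from self-training. tau is a hypothetical threshold.
    """
    probs = F.softmax(model(images), dim=1)                     # (B, K, H, W)
    flipped = F.softmax(model(torch.flip(images, dims=[3])), dim=1)
    probs_back = torch.flip(flipped, dims=[3])                  # undo the flip

    labels = probs.argmax(dim=1)
    labels_tf = probs_back.argmax(dim=1)
    confidence = 0.5 * (probs + probs_back).amax(dim=1)

    # reliable = consistent across the transform AND confident enough
    reliable = (labels == labels_tf) & (confidence > tau)
    return labels, reliable   # use `reliable` to mask the self-training loss
```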

1.4 Audio-visual segmentation, sound localization, semantic-aware sounding objects localization


https://arxiv.org/abs/2307.16620

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to obtain sounding object masks. However, we observe that existing techniques tend to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are usually the most salient objects in the AVS dataset. Thus, current AVS methods may fail to localize genuinely sounding objects due to this dataset bias. In this work, we propose an audio-visual instance-aware segmentation method to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video through an object segmentation network, and then associates the sounding object candidates with the given audio. We note that an object may be a sounding object in one video but a silent one in another. This introduces ambiguity into the training of our object segmentation network, since only sounding objects have corresponding segmentation masks. We therefore propose a silent-object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of the audio is unknown, especially when multiple sound sources are present, we propose to explore the audio-visual semantic correlation and then associate the audio with potential objects. Specifically, we incorporate the predicted audio category scores into the potential instance masks, which highlights the corresponding sounding instances while suppressing inaudible ones. By enforcing that the attended instance masks resemble the ground-truth masks, we are able to establish audio-visual semantic correlation. Experimental results on the AVS benchmark demonstrate that our method can effectively segment sounding objects without being biased towards salient objects.
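One plausible reading of "incorporating audio category scores into instance masks" is to weight each candidate mask by how strongly the audio supports that instance's predicted category. The sketch below is only an illustration under that assumption; all names and shapes are hypothetical:

```python
import torch

def audio_weighted_masks(instance_masks, instance_class_logits, audio_class_scores):
    """Highlight instances whose category is active in the audio.

    instance_masks:        (N, H, W) soft masks from the segmentation network
    instance_class_logits: (N, C) per-instance class predictions
    audio_class_scores:    (C,) predicted audio category scores
    """
    inst_probs = instance_class_logits.softmax(dim=-1)     # (N, C)
    # how strongly each instance's category is supported by the audio
    audibility = inst_probs @ audio_class_scores           # (N,)
    # suppress instances of silent categories, keep sounding ones
    return instance_masks * audibility.view(-1, 1, 1)
```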

1.5 Contrastive Conditional Latent Diffusion for Audio-visual Segmentation


https://arxiv.org/abs/2307.16579

We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to thoroughly explore the contribution of audio. We interpret AVS as a conditional generation task, where the audio is defined as the conditional variable for segmenting the sound producer(s). Under this new interpretation, it becomes especially necessary to model the correlation between the audio and the final segmentation map to ensure the audio's contribution. We introduce a latent diffusion model into our framework to enable semantically correlated representation learning. Specifically, our diffusion model learns the conditional generation process of ground-truth segmentation maps, which enables ground-truth-aware inference when performing the denoising process at test time. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to the model output. We then introduce contrastive learning into our framework to learn audio-visual correspondence, which is shown to be consistent with maximizing the mutual information between the model's predictions and the audio data. In this way, we explicitly maximize the contribution of the audio to AVS via a contrastively learned latent diffusion model. Experimental results on the benchmark dataset verify the effectiveness of our solution.
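The mutual-information argument is commonly operationalized with an InfoNCE objective between paired embeddings; whether the paper uses exactly this loss is not stated, so the following is a generic sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb, mask_emb, temperature=0.07):
    """Symmetric InfoNCE between paired audio and segmentation embeddings.

    audio_emb, mask_emb: (B, D), where row i of each is the matched pair.
    Minimizing this loss maximizes a standard lower bound on the mutual
    information between the two modalities.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(mask_emb, dim=-1)
    logits = a @ v.t() / temperature             # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # diagonal entries are positives, all off-diagonal pairs are negatives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```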

1.6 Transferable Attack for Semantic Segmentation


https://arxiv.org/abs/2307.16572

Semantic segmentation models are known to be vulnerable to small input perturbations. In this paper, we comprehensively analyze the performance of semantic segmentation models under adversarial attack and observe that adversarial examples generated from a source model often fail to attack target models; that is, conventional attack methods such as PGD and FGSM do not transfer well to target models. It is therefore necessary to study transferable attacks, in particular transferable attacks for semantic segmentation. We find that, to achieve a transferable attack, the attack should come with effective data augmentation and translation-invariant features to cope with unseen models, together with a stable optimization strategy to find the optimal attack direction. Based on these observations, we propose an ensemble attack for semantic segmentation that aggregates several transferable attacks from classification to achieve a more effective attack with higher transferability.
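The named ingredients (data augmentation, translation invariance, stable optimization) correspond to well-known transferability techniques from classification: input diversity, translation-invariant gradient smoothing, and momentum. A hedged sketch combining them over a source-model ensemble follows; all hyperparameters are illustrative defaults, not the paper's settings, and square inputs in [0, 1] are assumed:

```python
import torch
import torch.nn.functional as F

def ensemble_transferable_attack(models, images, labels,
                                 eps=8 / 255, alpha=2 / 255, steps=10, mu=1.0):
    """Ensemble PGD with momentum, random resize-and-pad (input diversity),
    and kernel-smoothed gradients (translation invariance).

    models: list of source segmentation nets mapping (B,C,H,W) -> (B,K,H,W),
    assumed to preserve spatial size. labels: (B,H,W) class map.
    """
    b, c, h, w = images.shape
    adv = images.clone().detach()
    g = torch.zeros_like(images)                          # momentum accumulator
    kernel = torch.ones(c, 1, 7, 7, device=images.device) / 49.0
    for _ in range(steps):
        adv.requires_grad_(True)
        # input diversity: random downscale, pad back to the original size
        s = torch.randint(int(w * 0.9), w + 1, (1,)).item()
        x = F.interpolate(adv, size=s, mode="bilinear", align_corners=False)
        x = F.pad(x, (0, w - s, 0, h - s))
        # average pixel-wise CE over the source ensemble (gradient ascent)
        loss = sum(F.cross_entropy(m(x), labels) for m in models) / len(models)
        grad, = torch.autograd.grad(loss, adv)
        grad = F.conv2d(grad, kernel, padding=3, groups=c)  # TI smoothing
        g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
        adv = adv.detach() + alpha * g.sign()
        adv = torch.max(torch.min(adv, images + eps), images - eps).clamp(0, 1)
    return adv.detach()
```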

1.7 Towards Unbalanced Motion: Part-Decoupling Network for Video Portrait Segmentation


https://arxiv.org/abs/2307.16565

Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, the simplicity of existing VPS datasets has limited extensive research on the task. In this work, we propose a new, complex, large-scale multi-scene video portrait segmentation dataset, MVPS, consisting of 101 video clips from 7 scene categories, in which 10,843 sampled frames are finely annotated at the pixel level. The dataset covers diverse scenes and complex background environments and is currently the most complex dataset for VPS. Through observing a large number of portrait videos during dataset construction, we found that, owing to the articulated structure of the human body, the motion of a portrait is part-wise correlated, with different parts moving relatively independently; that is, the motion of different portrait parts is unbalanced. Given this imbalance, an intuitive and reasonable idea is that decoupling the portrait into multiple parts allows the different motion states within the portrait to be better exploited. To this end, we propose a Part-Decoupling Network (PDNet) for video portrait segmentation. Specifically, an Inter-Part Discriminative Attention (IPDA) module is proposed, which unsupervisedly segments the portrait into multiple parts and applies distinct attention to the discriminative features specific to each part. In this way, appropriate attention can be paid to portrait parts with unbalanced motion to extract part-discriminative correlations, yielding more accurate portrait segmentation. Experimental results show that our method achieves leading performance compared with state-of-the-art methods.

1.8 Rethinking Collaborative Perception from the Spatial-Temporal Importance of Semantic Information


https://arxiv.org/abs/2307.16517

Collaboration by sharing semantic information is crucial for enhancing perception. However, existing collaborative perception methods often focus only on the spatial features of semantic information, ignoring the importance of the temporal dimension for collaborator selection and semantic information fusion, which degrades performance. In this paper, we propose a novel collaborative perception framework, IoSI-CP, which considers the importance of semantic information (IoSI) from both the temporal and spatial dimensions. Specifically, we develop an IoSI-based collaborator selection method that effectively identifies beneficial collaborators while excluding those that bring negative payoffs. Furthermore, we propose a semantic information fusion algorithm called HPHA (History Prior Hybrid Attention), which integrates a multi-scale transformer module and a short-term attention module to capture IoSI from the spatial and temporal dimensions and assigns different weights for efficient aggregation. Extensive experiments on two open datasets show that our proposed IoSI-CP significantly improves perception performance compared with state-of-the-art methods.

1.9 3D Medical Image Segmentation with Sparse Annotation via Cross-Teaching between 3D and 2D Networks


https://arxiv.org/abs/2307.16256

Medical image segmentation typically requires large, precisely annotated datasets. However, obtaining pixel-level annotations is a labor-intensive task that demands significant effort from domain experts, making it challenging to obtain in real clinical scenarios. In this context, reducing the amount of required annotation is a more practical approach. One viable direction is sparse annotation, which involves annotating only a few slices; it has several advantages over traditional weak annotations such as bounding boxes and scribbles, as it preserves exact boundaries. However, learning from sparse annotation is challenging because the supervision signal is scarce. To address this issue, we propose a framework that can robustly learn from sparse annotation via cross-teaching between 3D and 2D networks. Considering the characteristics of these networks, we develop two strategies for pseudo-label selection: hard-soft confidence thresholding and consensus label fusion. Our experimental results on the MMWHS dataset demonstrate that our method outperforms state-of-the-art (SOTA) semi-supervised segmentation methods. Moreover, our approach achieves results comparable to the fully supervised upper bound.
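The abstract names the two pseudo-label strategies without detailing them. A minimal sketch of cross-teaching with a guessed hard-soft thresholding scheme is shown below; the threshold values and weighting are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cross_teaching_loss(net3d, net2d, volume, hard_t=0.9, soft_t=0.6):
    """Unsupervised cross-teaching between a 3D and a 2D network (sketch).

    volume: (B, C, D, H, W). The 3D net predicts whole volumes; the 2D net
    predicts slice-by-slice along depth. Each network learns from the other's
    pseudo-labels: pixels above hard_t get full weight, pixels between
    soft_t and hard_t half weight, the rest are ignored (hypothetical rule).
    """
    B, C, D, H, W = volume.shape
    p3d = F.softmax(net3d(volume), dim=1)                       # (B,K,D,H,W)
    slices = volume.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
    p2d = F.softmax(net2d(slices), dim=1)                       # (B*D,K,H,W)
    p2d = p2d.reshape(B, D, -1, H, W).permute(0, 2, 1, 3, 4)    # (B,K,D,H,W)

    def teach(student, teacher):
        conf, pseudo = teacher.detach().max(dim=1)              # (B,D,H,W)
        weight = 0.5 * (conf > soft_t).float() + 0.5 * (conf > hard_t).float()
        ce = F.nll_loss(student.clamp_min(1e-8).log(), pseudo, reduction="none")
        return (weight * ce).mean()

    return teach(p2d, p3d) + teach(p3d, p2d)  # each net teaches the other
```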

1.10 ScribbleVC: Scribble-supervised Medical Image Segmentation with Vision-Class Embedding


https://arxiv.org/abs/2307.16226

Medical image segmentation plays a vital role in clinical decision-making, treatment planning, and disease monitoring. However, accurate segmentation of medical images is challenging due to several factors, such as the lack of high-quality annotations, imaging noise, and anatomical variation across patients. Moreover, there is still a considerable performance gap between existing label-efficient methods and fully supervised ones. To address these challenges, we propose ScribbleVC, a novel framework for scribble-supervised medical image segmentation that exploits vision and class embeddings through a multimodal information augmentation mechanism. In addition, ScribbleVC jointly utilizes CNN features and Transformer features to achieve better visual feature extraction. The proposed method combines a scribble-based approach with a segmentation network and a class-embedding module to produce accurate segmentation masks. We evaluate ScribbleVC on three benchmark datasets and compare it with state-of-the-art methods. The experimental results demonstrate that our method outperforms existing methods in terms of accuracy, robustness, and efficiency. The dataset and code are released on GitHub.
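For context, scribble supervision is usually trained with a partial cross-entropy that scores only the annotated pixels; this is a common baseline loss, not necessarily ScribbleVC's full objective, which adds the embedding modules on top:

```python
import torch.nn.functional as F

def partial_cross_entropy(logits, scribbles, ignore_index=255):
    """Cross-entropy computed only on scribble-annotated pixels.

    logits: (B, K, H, W); scribbles: (B, H, W) with unlabeled pixels set
    to ignore_index (a conventional choice, assumed here).
    """
    return F.cross_entropy(logits, scribbles, ignore_index=ignore_index)
```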

1.11 PD-SEG: Population Disaggregation Using Deep Segmentation Networks For Improved Built Settlement Mask


https://arxiv.org/abs/2307.16084

Any policy-level decision-making process or academic research involving the optimal use of resources for development and planning initiatives relies on accurate population-density statistics. Current state-of-the-art datasets from WorldPop and Meta fail to achieve this goal for developing countries such as Pakistan; the inputs to their algorithms yield flawed estimates that fail to capture spatial and land-use dynamics. To accurately estimate population at 30 m x 30 m resolution, we use precise built-settlement masks obtained via deep segmentation networks and satellite imagery. Points-of-interest (POI) data are also used to exclude non-residential areas.
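The underlying arithmetic is dasymetric mapping: spread each administrative unit's population count over the pixels the mask marks as residential. A minimal sketch under that assumption (not the paper's exact pipeline, which may weight pixels unevenly):

```python
import numpy as np

def disaggregate_population(district_pop, settlement_mask, poi_exclusion):
    """Spread a district's population uniformly over built-settlement pixels.

    settlement_mask: boolean (H, W) from the segmentation network, True where
    built-up. poi_exclusion: boolean (H, W), True for non-residential pixels
    derived from POI data. Returns a (H, W) per-pixel population grid at the
    mask's resolution (30 m here).
    """
    residential = settlement_mask & ~poi_exclusion
    n_pixels = residential.sum()
    grid = np.zeros(settlement_mask.shape, dtype=float)
    if n_pixels > 0:
        grid[residential] = district_pop / n_pixels   # uniform allocation
    return grid
```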

1.12 XMem++: Production-level Video Segmentation From Few Annotated Frames


https://arxiv.org/abs/2307.15958

Despite advances in user-guided video segmentation, consistently extracting complex objects in highly complex scenes remains a labor-intensive task, especially for production use. It is not uncommon for a majority of frames to require annotation. We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, which improves existing memory-based models with a permanent memory module. Most existing methods focus on single-frame annotation, whereas our approach can efficiently handle multiple user-selected frames with different appearances of the same object or region. Our method can extract highly consistent results while keeping the required number of frame annotations low. We further introduce an iterative, attention-based frame proposal mechanism that computes the next best frame to annotate. Our method is real-time and does not require retraining after each user input. We also introduce a new dataset, PUMaVOS, which covers new and challenging use cases not found in previous benchmarks. We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as on long videos, while requiring significantly fewer frame annotations than any existing method.

1.13 CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation


https://arxiv.org/abs/2307.15942

Most studies on nighttime semantic segmentation are based on domain adaptation approaches and image input. However, limited by the low dynamic range of conventional cameras, images fail to capture structural details and boundary information in low-light conditions. As a new type of vision sensor, the event camera complements conventional cameras with its high dynamic range. To this end, we propose a novel unsupervised cross-modality domain adaptation (CMDA) framework that leverages multimodal (image and event) information for nighttime semantic segmentation, with labels available only on daytime images. In CMDA, we design an image motion extractor to extract motion information and an image content extractor to extract content information from images, thereby bridging the gap between different modalities (image to event) and domains (day to night). In addition, we introduce the first image-event dataset for nighttime semantic segmentation. Extensive experiments on public image datasets and the proposed image-event dataset demonstrate the effectiveness of our method.

1.14 A hybrid approach for improving U-Net variants in medical image segmentation


https://arxiv.org/abs/2307.16462

Medical image segmentation is crucial to the field of medical imaging as it enables professionals to more accurately examine and understand the information provided by different imaging modalities. The technique of segmenting medical images into various segments or regions of interest is called medical image segmentation. The resulting segmented images can be used for many different purposes, including diagnosis, surgical planning, and treatment assessment.
In the initial stage of the research, the main focus is a review of existing deep learning methods, including MultiResUNet, Attention U-Net, the classic U-Net, and other variants. Attention mechanisms, which dynamically assign greater weight to key information via feature vectors or maps, are used by most of these variants to improve accuracy, but they come with stricter network parameter requirements. They suffer from issues such as overfitting, due to their very large number of trainable parameters, and very high inference time.
Therefore, this study aims to use depthwise separable convolutions to reduce the network's parameter requirements while maintaining performance on certain medical image segmentation tasks, such as skin lesion segmentation, using attention mechanisms and residual connections.
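For reference, a depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution, cutting the parameter count roughly by a factor of the kernel area for large channel counts. A minimal PyTorch building block (illustrative, not the paper's exact architecture):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv.

    Parameters drop from c_in * c_out * 9 (standard 3x3 conv) to
    c_in * 9 + c_in * c_out; e.g. for 256 -> 256 channels that is
    589,824 vs 67,840 weights.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1,
                                   groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```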

1.15 An objective validation of polyp and instrument segmentation methods in colonoscopy through Medico 2020 polyp segmentation and MedAI 2021 transparency challenges


https://arxiv.org/abs/2307.16262

Due to the importance of early detection of precancerous polyps, automated analysis of colonoscopy images has been an active area of research. However, detecting polyps during real-time examination can be challenging due to various factors, such as differences in endoscopists' skill and experience, lack of attentiveness, and high miss rates caused by fatigue. Deep learning has emerged as a promising solution to this challenge, as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. Beyond algorithm accuracy, transparency and interpretability are critical for explaining why and how an algorithm makes a prediction. Furthermore, most algorithms are developed on private data, with closed-source or proprietary software, and the methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we organized the "Medico Automated Polyp Segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analysis of each contribution, highlight the strengths of the best-performing methods, and discuss the possibility of translating such methods into the clinic. For the transparency task, a multidisciplinary team including gastroenterologists assessed each submission and evaluated the teams on open-source practices, failure-case analysis, ablation studies, and the usability and understandability of their evaluations, to gain deeper insight into the models' credibility for clinical deployment. Through a comprehensive analysis of the challenges, we not only highlight advances in polyp and surgical-instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.

1.16 Cross-dimensional transfer learning in medical image segmentation with deep learning


https://arxiv.org/abs/2307.15872

Over the past decade, convolutional neural networks have emerged and advanced the state of the art in a variety of image analysis and computer vision applications. The performance of 2D image classification networks keeps improving as they are trained on databases of millions of natural images. However, progress in medical image analysis is hindered by limited annotated data and acquisition constraints. These limitations are even more pronounced for volumetric medical imaging data. In this paper, we introduce an efficient way to transfer the efficiency of 2D classification networks trained on natural images to 2D, 3D unimodal, and multimodal medical image segmentation applications. In this direction, we design novel architectures based on two key principles: weight transfer, by embedding a pre-trained 2D encoder into a higher-dimensional U-Net; and dimension transfer, by expanding a 2D segmentation network into a higher-dimensional one. The proposed networks are tested on benchmarks comprising different modalities: MR, CT, and ultrasound images. Our 2D network ranked first in the CAMUS challenge dedicated to echocardiographic data segmentation, surpassing the state of the art. On the 2D/3D MR and CT abdominal images of the CHAOS challenge, our approach outperforms the other 2D-based methods described in the challenge paper by a large margin on the Dice, RAVD, ASSD, and MSSD scores and ranks third on the online evaluation platform. Our 3D network, applied to the BraTS 2022 competition, also achieved promising results, reaching average Dice scores of 91.69% (91.22%) for the whole tumor, 83.23% (84.77%) for the tumor core, and 81.75% (83.88%) for the enhancing tumor with the weight (dimension) transfer-based approach. Experimental and qualitative results illustrate the effectiveness of our multidimensional medical image segmentation methods.
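One common way to realize such dimension transfer is to "inflate" pre-trained 2D kernels along a new depth axis (as popularized by I3D); whether this paper uses exactly that scheme is not stated, so the sketch below is only illustrative:

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, depth: int = 3) -> torch.Tensor:
    """Inflate a pre-trained 2D conv kernel to 3D (I3D-style, assumed here).

    w2d: (c_out, c_in, k, k) -> (c_out, c_in, depth, k, k), replicated along
    the new depth axis and divided by `depth` so activations keep roughly
    the same scale on depth-constant inputs.
    """
    return w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth

# e.g. copying an ImageNet-pretrained 2D encoder layer into a 3D U-Net encoder:
# conv3d.weight.data.copy_(inflate_conv_weight(conv2d.weight.data))
```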
