[Computer Vision | Object Detection | Image Segmentation] arXiv Computer Vision Academic Express on Object Detection and Image Segmentation (a collection of papers from August 3)

1. Detection-related (8 papers)

1.1 Revisiting DETR Pre-training for Object Detection

https://arxiv.org/abs/2308.01300

DETR-based approaches have established new records on the COCO detection and segmentation benchmarks, so many recent studies have shown growing interest in how to further improve DETR-based approaches by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies have claimed significant improvements in accuracy. In this paper, we take a closer look at their experimental methodology and check whether these approaches remain effective against the latest state of the art, such as $\mathcal{H}$-Deformable-DETR. We conduct thorough experiments on the COCO object detection task to study the influence of the choice of pre-training dataset and of the localization and classification target generation schemes. Unfortunately, we find that a previous representative self-supervised method, DETReg, fails to improve the performance of strong DETR-based methods in the full-data regime. We further analyze the reason and find that simply combining a more accurate box predictor with the Objects365 benchmark can significantly improve the results of subsequent experiments. We demonstrate the effectiveness of our approach by achieving strong object detection results of AP = 59.3% on the COCO val set, surpassing $\mathcal{H}$-Deformable-DETR + Swin-L by +1.4%. Finally, we generate a series of synthetic pre-training datasets by combining a recent image-to-text captioning model (LLaVA) with a text-to-image generation model (SDXL). Notably, pre-training on these synthetic datasets leads to significant improvements in object detection performance. Looking forward, we anticipate that scaling up synthetic pre-training datasets will bring substantial benefits.
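
As a rough illustration of the synthetic-data pipeline sketched in the abstract (caption real images, then re-render the captions), here is a minimal Python sketch. It assumes the Hugging Face `diffusers` package for SDXL; `caption_with_llava` is a hypothetical placeholder for whichever LLaVA inference wrapper you use, and the pseudo-labeling and pre-training steps are omitted.

```python
# Hedged sketch of the caption -> generate loop described above (not the paper's code).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def caption_with_llava(image):
    """Placeholder: return an image-to-text caption from a LLaVA model."""
    raise NotImplementedError  # hypothetical wrapper, not a real API call

def synthesize_pretraining_image(real_image):
    # 1) Describe a real image with an image-to-text model (LLaVA).
    prompt = caption_with_llava(real_image)
    # 2) Re-render the description with a text-to-image model (SDXL).
    return pipe(prompt).images[0]  # PIL.Image, to be pseudo-labeled for pre-training
```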

1.2 A Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images and Digital Elevation Model Data

https://arxiv.org/abs/2308.01251

As a highly damaging natural disaster, landslides often cause huge losses to human beings, so reliable landslide detection is essential. However, traditional remote sensing detection of landslides suffers from problems such as visual blur and small dataset size, which pose great challenges. To reliably extract semantic features, a hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) is proposed, which enhances the extraction of locally salient features of landslide boundaries via HPCL and fuses the heterogeneous information of high-resolution remote sensing images and digital elevation model data in the semantic space. To make full use of the precious samples, a contrastive learning method based on a global queue of hyper-pixel-wise sample pairs is proposed, which includes the construction of a global queue that stores hyper-pixel-wise samples and a momentum-encoder update scheme, reliably improving the ability to extract semantic features. Experiments were carried out on an old-landslide dataset from the Loess Plateau, and the results show that, compared with the previous old-landslide segmentation model, the proposed model greatly improves the reliability of old-landslide detection: the mIoU metric improves from 0.620 to 0.651, the Landslide IoU metric from 0.334 to 0.394, and the F1-score from 0.501 to 0.565.
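
The global queue and momentum encoder described above follow the general MoCo recipe; below is a minimal, assumption-level PyTorch sketch of such a queue plus an InfoNCE loss over superpixel embeddings, not the authors' HPCL-Net code.

```python
# MoCo-style global queue and momentum-encoder update, sketched for illustration.
import torch
import torch.nn.functional as F

class SuperpixelQueue:
    def __init__(self, dim=128, size=4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)  # stored negatives
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):  # keys: (B, dim), L2-normalized superpixel embeddings
        n = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.queue.shape[0]

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Key encoder trails the query encoder as an exponential moving average.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

def info_nce(q, k_pos, queue, tau=0.07):
    # q, k_pos: (B, dim) normalized embeddings; negatives come from the global queue.
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) positive similarities
    l_neg = q @ queue.queue.t()                    # (B, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.shape[0], dtype=torch.long)  # positives sit at index 0
    return F.cross_entropy(logits, labels)
```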

1.3 WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection

https://arxiv.org/abs/2308.01042

Multispectral pedestrian detection achieves better visibility in challenging conditions and thus has wide application in tasks where both accuracy and computational cost are critical. Most existing methods treat the RGB and infrared modalities equally, usually adopting two symmetric CNN backbones for multimodal feature extraction; this ignores the substantial differences between the modalities and makes it very difficult to reduce computational cost and fuse the modalities effectively. In this work, we propose a novel and efficient framework, named WCCNet, that can differentially extract rich features from the different spectra with low computational complexity and semantically rearrange these features for effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT), which allows fast inference and training, is embedded to construct a dual-stream backbone for efficient feature extraction. The DWT layers of WCCNet extract frequency components of the infrared modality, while the CNN layers extract spatial-domain features of the RGB modality. This not only greatly reduces computational complexity, but also improves the extraction of infrared features to facilitate subsequent crossmodal fusion. Based on the extracted features, we carefully design a crossmodal rearranging fusion module (CMRF), which can mitigate spatial misalignment and merge semantically complementary features of spatially related local regions to amplify crossmodal complementary information. We conduct comprehensive evaluations on the KAIST and FLIR benchmarks, where WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy. We also perform ablation studies and thoroughly analyze the impact of different components on the performance of WCCNet.
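
To make the asymmetric two-stream idea concrete, here is a small sketch using PyWavelets for the infrared frequency branch and a toy CNN stem for the RGB branch. The layer sizes and the plain concatenation standing in for the CMRF module are illustrative assumptions, not WCCNet's design.

```python
# DWT branch for infrared + CNN branch for RGB, sketched under stated assumptions.
import numpy as np
import pywt
import torch
import torch.nn as nn

def ir_frequency_features(ir: np.ndarray) -> np.ndarray:
    """One-level 2-D Haar DWT of a single-channel infrared image."""
    cA, (cH, cV, cD) = pywt.dwt2(ir, "haar")
    return np.stack([cA, cH, cV, cD])  # (4, H/2, W/2) frequency components

rgb_stem = nn.Sequential(  # toy spatial-domain branch for the RGB modality
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
)

ir = np.random.rand(256, 256).astype(np.float32)
rgb = torch.randn(1, 3, 256, 256)
ir_feat = torch.from_numpy(ir_frequency_features(ir)).float().unsqueeze(0)  # (1,4,128,128)
rgb_feat = rgb_stem(rgb)                                                    # (1,16,128,128)
fused = torch.cat([ir_feat, rgb_feat], dim=1)  # naive stand-in for the CMRF fusion
```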

1.4 Three Factors to Improve Out-of-Distribution Detection

https://arxiv.org/abs/2308.01030

In out-of-distribution (OOD) detection, fine-tuning with auxiliary data as outlier exposure has shown encouraging performance. However, previous methods suffer from a trade-off between classification accuracy (ACC) and OOD detection performance (AUROC, FPR, AUPR). To improve this trade-off, we make three contributions: (i) incorporating a self-knowledge-distillation loss can improve the accuracy of the network; (ii) sampling semi-hard outlier data for training can improve OOD detection with minimal impact on accuracy; (iii) introducing our novel supervised contrastive learning can simultaneously improve OOD detection performance and network accuracy. By incorporating all three factors, our approach improves both accuracy and OOD detection performance by addressing the trade-off between classification and OOD detection. Our method improves over previous methods on both kinds of performance metrics.
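
Factor (ii) can be illustrated in a few lines: keep only the auxiliary outliers whose confidence score lands in a middle band, neither trivially easy nor hardest. The max-softmax score and the band fractions below are assumptions for illustration, not the paper's exact criterion.

```python
# Semi-hard outlier sampling, sketched with an assumed max-softmax score.
import torch

def semi_hard_outliers(logits_aux: torch.Tensor, lo=0.25, hi=0.75) -> torch.Tensor:
    """logits_aux: (N, C) classifier logits on auxiliary outlier images."""
    scores = logits_aux.softmax(dim=1).max(dim=1).values  # confidence per outlier
    order = scores.argsort()                              # ascending: easy -> hard
    n = len(order)
    return order[int(lo * n):int(hi * n)]                 # indices of semi-hard samples
```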

1.5 MDT3D: Multi-Dataset Training for LiDAR 3D Object Detection Generalization

https://arxiv.org/abs/2308.01000

Supervised 3D object detection models have shown increasingly better performance in the single-domain case, where the training data come from the same environment and sensor as the test data. However, in real-world scenarios, data from the target domain may not be available for fine-tuning or for domain adaptation methods. Indeed, 3D object detection models trained on a source dataset with a specific point distribution have shown difficulty generalizing to unseen datasets. Therefore, we decided to leverage the information available from several annotated source datasets with our Multi-Dataset Training for 3D object detection (MDT3D) method, to increase the robustness of 3D object detection models when tested in new environments with different sensor configurations. To tackle the labelling gap between datasets, we use a new label mapping based on coarse labels. Furthermore, we show how we manage the mixture of datasets during training, and finally introduce a new cross-dataset augmentation method: cross-dataset object injection. We demonstrate that this training paradigm yields improvements for different types of 3D object detection models. The source code and additional results for this research project will be publicly available on GitHub: https://github.com/LouisSF/MDT3D
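
A coarse-label mapping of the kind described above might look like the following sketch; the dataset taxonomies and the naive point-cloud object injection are illustrative assumptions, not the paper's exact mapping or augmentation code.

```python
# Coarse-label mapping across LiDAR datasets plus a naive object-injection sketch.
import numpy as np

COARSE_MAP = {  # example taxonomies, not MDT3D's exact tables
    "nuscenes": {"car": "vehicle", "truck": "vehicle",
                 "pedestrian": "pedestrian", "bicycle": "cyclist"},
    "kitti":    {"Car": "vehicle", "Van": "vehicle",
                 "Pedestrian": "pedestrian", "Cyclist": "cyclist"},
}

def to_coarse(dataset: str, label: str) -> str:
    """Map a dataset-specific label to the shared coarse taxonomy."""
    return COARSE_MAP[dataset].get(label, "ignore")

def inject_objects(target_points: np.ndarray, source_objects) -> np.ndarray:
    """Copy annotated objects (the points inside their 3-D boxes) from source
    scenes into a target scene. target_points: (N, 3); source_objects: list of
    (points, box3d) pairs cropped from other datasets."""
    return np.concatenate([target_points] + [pts for pts, _box in source_objects], axis=0)
```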

1.6 ForensicsForest Family: A Series of Multi-scale Hierarchical Cascade Forests for Detecting GAN-generated Faces

https://arxiv.org/abs/2308.00964

Significant advances in generative models have greatly improved the realism of generated faces, raising serious concerns for society. Because recent GAN-generated faces are so realistic, forgery traces have become more imperceptible, increasing the challenge for forensics. To combat GAN-generated faces, many convolutional neural network (CNN) based countermeasures have emerged owing to their strong learning ability. In this paper, we rethink this problem and explore a new approach based on forest models instead of CNNs. Specifically, we describe a simple and effective family of forest-based methods, called {\em ForensicsForest Family}, to detect GAN-generated faces. The ForensicsForest family consists of three variants: {\em ForensicsForest}, {\em Hybrid ForensicsForest} and {\em Divide-and-Conquer ForensicsForest}. ForensicsForest is a newly proposed multi-scale hierarchical cascade forest. It takes semantic, frequency and biological features as input, hierarchically cascades features of different levels for authenticity prediction, and then employs a multi-scale ensemble scheme that comprehensively considers information from different levels, further improving performance. Based on ForensicsForest, we develop Hybrid ForensicsForest, an extended version that integrates CNN layers into the model to further refine the effectiveness of the augmented features. Furthermore, to reduce memory overhead during training, we propose Divide-and-Conquer ForensicsForest, which can build a forest model using only a portion of the training samples: in the training phase, we train several candidate forest models using subsets of the training samples, and the final ForensicsForest is then assembled by picking suitable components from these candidate models…
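
For readers unfamiliar with cascade forests, the following is a minimal gcForest-style sketch with scikit-learn, where each level appends the previous level's class probabilities to the input features. The real method also uses multi-scale inputs and, in practice, out-of-fold probabilities; this toy version omits both.

```python
# Minimal hierarchical cascade-forest sketch (illustrative, not ForensicsForest itself).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_forest(X_train, y_train, X_test, levels=3):
    """Each level re-trains a forest on the raw features augmented with the
    previous level's class-probability vector."""
    aug_train, aug_test = X_train, X_test
    for _ in range(levels):
        rf = RandomForestClassifier(n_estimators=100).fit(aug_train, y_train)
        p_train = rf.predict_proba(aug_train)   # NOTE: real systems use out-of-fold
        p_test = rf.predict_proba(aug_test)     # probabilities to avoid overfitting
        aug_train = np.hstack([X_train, p_train])
        aug_test = np.hstack([X_test, p_test])
    return p_test  # final-level real/fake probabilities
```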

1.7 Detection and Segmentation of Cosmic Objects Based on Adaptive Thresholding and Back Propagation Neural Network

https://arxiv.org/abs/2308.00926

Astronomical imagery provides information about the wide variety of cosmic objects in the universe. Classifying and detecting celestial objects is challenging because of the sheer volume of celestial-object data, the numerous bright sources and noise present in the images, and the spatial gap between the object and the satellite camera. We propose adaptive-thresholding-method (ATM) based segmentation and a backpropagation neural network (BPNN) for cosmic object detection, together with a series of well-structured preprocessing steps aimed at improving segmentation and detection.
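
A minimal sketch of the first (ATM) stage with OpenCV is shown below; the median blur, block size, and offset values are illustrative assumptions, and the resulting component statistics would then feed a small BPNN (e.g., MLP) classifier.

```python
# Adaptive-thresholding segmentation stage, sketched with assumed parameters.
import cv2

img = cv2.imread("sky_patch.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
blur = cv2.medianBlur(img, 5)                            # preprocessing: denoise
# Keep pixels brighter than their local Gaussian-weighted mean (C = -5 offset).
mask = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                             cv2.THRESH_BINARY, blockSize=31, C=-5)
n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
# Each component's stats (area, bounding box, centroid) could then be classified
# by a small backpropagation neural network to label the detected object.
```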

1.8 Multi-task learning for classification, segmentation, reconstruction, and detection on chest CT scans

https://arxiv.org/abs/2308.01137

Lung cancer and COVID-19 are among the diseases with the highest morbidity and mortality in the world. Identifying lesions at an early stage of disease is difficult and time-consuming for physicians. Multi-task learning is therefore an approach for extracting important features, such as lesions, from small amounts of medical data, because it learns to generalize better. We propose a novel multi-task framework for classification, segmentation, reconstruction and detection. To our knowledge, we are the first to add detection to a multi-task solution. Furthermore, we examine the possibility of using two different backbones and different loss functions in the segmentation task.
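
A shared-encoder, four-head layout of the kind the abstract describes could be sketched as follows in PyTorch; the encoder depth, the head designs (including the per-cell detection encoding), and the loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Shared encoder with classification, segmentation, reconstruction, and detection heads.
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, num_classes))
        self.seg_head = nn.Conv2d(64, 1, 1)   # per-pixel lesion-mask logits
        self.rec_head = nn.Conv2d(64, 1, 1)   # low-resolution reconstruction
        self.det_head = nn.Conv2d(64, 5, 1)   # assumed per-cell (obj, x, y, w, h)

    def forward(self, x):
        f = self.encoder(x)
        return self.cls_head(f), self.seg_head(f), self.rec_head(f), self.det_head(f)

# Training would minimize a weighted sum of the four task losses, e.g.
# loss = w1 * ce + w2 * dice + w3 * mse + w4 * det_loss
```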

2. Segmentation | Semantics-related (5 papers)

2.1 Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

https://arxiv.org/abs/2308.01189

This paper aims to address dense labeling problems, where a significant portion of the dataset can be pruned without sacrificing much accuracy. We observe that, on standard medical image segmentation benchmarks, metrics based on the loss-gradient norm of individual training examples, which work for image classification, fail to identify the important samples. To address this issue, we propose a data pruning method that accounts for the training dynamics on target regions using a Dynamic Average Dice (DAD) score. To the best of our knowledge, we are among the first in the field of medical image analysis to address the importance of data in dense labeling tasks, making the following contributions: (1) investigating the underlying causes through rigorous empirical analysis, and (2) identifying effective data pruning methods for dense labeling problems. Our solution can serve as a strong yet simple baseline for selecting important examples for data-efficient medical image segmentation.
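
A Dynamic-Average-Dice-style criterion can be sketched in a few lines: record each sample's Dice score across epochs, average, and rank. Which end of the ranking to keep is the paper's design choice; the "keep hardest" rule below is an assumption for illustration only.

```python
# Per-sample Dice tracking across epochs as a pruning criterion (sketch).
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps=1e-6) -> float:
    inter = (pred * gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

class DADTracker:
    def __init__(self, n_samples: int):
        self.history = [[] for _ in range(n_samples)]  # Dice per sample per epoch

    def update(self, idx: int, pred: np.ndarray, gt: np.ndarray):
        self.history[idx].append(dice(pred, gt))

    def keep_indices(self, frac=0.8) -> np.ndarray:
        dad = np.array([np.mean(h) if h else 1.0 for h in self.history])
        return np.argsort(dad)[: int(frac * len(dad))]  # assumed: keep hardest samples
```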

2.2 DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation

https://arxiv.org/abs/2308.01127

Class-incremental semantic segmentation (CISS) extends the traditional segmentation task by incrementally learning newly added classes. Previous work has introduced generative replay, which replays old-class samples generated by a pre-trained GAN, to address catastrophic forgetting and privacy concerns. However, the generated images lack semantic precision and exhibit a distribution shift, which leads to inaccurate masks and further degrades segmentation performance. To address these challenges, we propose DiffusePast, a novel framework featuring a diffusion-based generative replay module that generates semantically accurate images, with more reliable masks, guided by different instructions (e.g., text prompts or edge maps). Specifically, DiffusePast introduces a dual-generator paradigm that focuses on generating old-class images whose distribution is consistent with the downstream dataset while preserving the structure and layout of the original images, leading to more accurate masks. To accommodate the new visual concepts of newly added classes, we update the dual generator with the embeddings of class-wise tokens. Furthermore, we assign adequate pseudo-labels of old classes to the background pixels of the new-step images, further mitigating the forgetting of previously learned knowledge. Through comprehensive experiments, our method achieves competitive results on mainstream benchmarks, striking a better balance between the performance of old and new classes.
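
Generative replay itself reduces to a simple loop: synthesize images for each old class and pair them with pseudo-masks for rehearsal. The sketch below uses a plain Stable Diffusion pipeline from `diffusers` as a stand-in; the model ID, prompt template, and omission of edge-map guidance are assumptions, and DiffusePast's dual generator is considerably more involved.

```python
# Generative-replay loop sketched with an off-the-shelf diffusion pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def replay_batch(old_classes, per_class=4):
    """Synthesize replay images for each old class; a segmentation model or the
    conditioning signal would then supply pseudo-label masks for rehearsal."""
    images = []
    for cls in old_classes:
        out = pipe([f"a photo of a {cls}"] * per_class)  # assumed prompt template
        images.extend(out.images)
    return images
```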

2.3 Training-Free Instance Segmentation from Semantic Image Segmentation Masks

https://arxiv.org/abs/2308.00949

In recent years, instance segmentation has attracted great attention across a wide range of applications. However, training a fully supervised instance segmentation model requires expensive instance-level and pixel-level annotations. In contrast, weakly supervised instance segmentation methods (i.e., those using image-level class labels or point labels) struggle to meet the accuracy and recall requirements of real-world scenarios. In this paper, we propose a new paradigm called Training-Free Instance Segmentation (TFISeg), which obtains instance segmentation results from the mask predictions of off-the-shelf semantic segmentation models. TFISeg requires no training of semantic and/or instance segmentation models and avoids the need for instance-level image annotations; it is therefore very efficient. Specifically, we first obtain a semantic segmentation mask of the input image from a trained semantic segmentation model. Then, we compute per-pixel displacement-field vectors based on the segmentation mask, which can distinguish representations that belong to the same category but different instances, i.e., capture instance-level object information. Finally, instance segmentation results are obtained after refinement by a learnable, category-agnostic object-boundary branch. Extensive experiments on two challenging datasets with representative semantic segmentation baselines (including CNNs and Transformers) demonstrate that TFISeg achieves competitive results compared with state-of-the-art fully supervised instance segmentation methods, without requiring additional human annotation or increased computational cost. Code is available at: TFISeg
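
The mask-to-instances step can be illustrated with a deliberately naive stand-in: split each semantic class into connected components. TFISeg computes per-pixel displacement fields rather than connected components, so treat this only as an illustration of the interface, not the method.

```python
# Training-free instances from a semantic mask via connected components (naive stand-in).
import numpy as np
from scipy import ndimage

def instances_from_semantic(sem_mask: np.ndarray):
    """sem_mask: (H, W) integer class IDs, with 0 as background."""
    instances = []
    for cls in np.unique(sem_mask):
        if cls == 0:
            continue
        labeled, n = ndimage.label(sem_mask == cls)  # split class region into blobs
        for i in range(1, n + 1):
            instances.append({"class_id": int(cls), "mask": labeled == i})
    return instances
```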

2.4 CMUNeXt: An Efficient Medical Image Segmentation Network based on Large Kernel and Skip Fusion

https://arxiv.org/abs/2308.01239

The U-shaped architecture has become an important paradigm in the design of medical image segmentation networks. However, because of the inherent locality of convolution, fully convolutional segmentation networks with U-shaped architectures struggle to extract global contextual information effectively, which is crucial for precise lesion localization. Although hybrid architectures that combine CNNs and Transformers can address this issue, their application in real medical scenarios is limited by the computational resource constraints of practical environments and edge devices. Furthermore, the convolutional inductive bias in lightweight networks fits scarce medical data well, something that Transformer-based networks lack. To extract global contextual information while exploiting this inductive bias, we propose CMUNeXt, an efficient fully convolutional lightweight medical image segmentation network that enables fast and accurate auxiliary diagnosis in real scenarios. CMUNeXt uses a large kernel and an inverted bottleneck design to thoroughly mix distant spatial and location information, efficiently extracting global contextual information. We also introduce a skip-fusion block designed to provide smooth skip connections and ensure sufficient feature fusion. Experimental results on multiple medical image datasets show that CMUNeXt outperforms existing heavyweight and lightweight medical image segmentation networks in segmentation performance while offering faster inference, a lighter model, and lower computational cost. The code is available at https://github.com/FengheTan9/CMUNeXt.
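
The large-kernel, inverted-bottleneck design mentioned above resembles a ConvNeXt-style block; a minimal PyTorch sketch follows, with the kernel size and expansion ratio as illustrative choices rather than CMUNeXt's exact configuration.

```python
# Large-kernel depthwise conv + inverted bottleneck block (illustrative sketch).
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, expand=4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                            groups=dim)              # depthwise large kernel: wide context
        self.norm = nn.BatchNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, dim * expand, 1)   # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expand, dim, 1)   # contract back to input width

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))  # residual
```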

2.5 Decomposing and Coupling Saliency Map for Lesion Segmentation in Ultrasound Images

https://arxiv.org/abs/2308.00947

Complex scenes in ultrasound images, where adjacent tissue (i.e., background) shares similar intensities with the lesion region and even contains richer texture patterns than the lesion (i.e., foreground), pose a unique challenge for accurate lesion segmentation. This work proposes a decomposition-coupling network, called DC-Net, that handles this (foreground-background) challenge by disentangling and fusing saliency maps. DC-Net consists of decomposition and coupling subnetworks: the former preliminarily decomposes the original image into foreground and background saliency maps, and the latter performs precise segmentation assisted by the saliency priors. The coupling subnetwork involves three aspects of fusion strategy: 1) region-level feature aggregation (via a differentiable context-pooling operator in the encoder) to adaptively preserve local contextual details with a larger receptive field during dimensionality reduction; 2) relation-aware representation fusion (via a cross-correlation fusion module in the decoder) to efficiently fuse low-level visual features and high-level semantic features during resolution restoration; and 3) dependency-aware prior incorporation (via a coupler) to enhance foreground-salient representations with complementary information derived from the background representations. Furthermore, a harmonic loss function is introduced to encourage the network to focus more on low-confidence and hard samples. The proposed method is evaluated on two ultrasound lesion segmentation tasks and shows significant performance improvement over existing state-of-the-art methods.
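
The harmonic loss is described only at a high level; as a hedged stand-in, a focal-style weighting that up-weights low-confidence pixels captures the stated intent of focusing on hard samples, though it is not the paper's exact formulation.

```python
# Focal-style weighting of hard, low-confidence pixels (illustrative stand-in).
import torch
import torch.nn.functional as F

def hard_pixel_weighted_bce(logits, target, gamma=2.0):
    p = torch.sigmoid(logits)
    pt = torch.where(target > 0.5, p, 1 - p)     # confidence assigned to the true label
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return ((1 - pt) ** gamma * bce).mean()      # low-confidence pixels dominate the loss
```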
