[Computer Vision | Object Detection | Image Segmentation] arXiv Computer Vision Academic Express: Object Detection and Image Segmentation (July 27 Paper Collection)

1. Detection (6 papers)

1.1 Memory-Efficient Graph Convolutional Networks for Object Classification and Detection with Event Cameras

Memory-Efficient Graph Convolutional Networks for Event-Camera Object Classification and Detection

https://arxiv.org/abs/2307.14124

Recent advances in event camera research emphasize processing data in its original sparse form, which preserves the sensor's distinctive properties: high temporal resolution, high dynamic range, low latency, and resistance to motion blur. A promising approach to analyzing event data is graph convolutional networks (GCNs). However, current research in this area focuses mainly on optimizing computational cost and ignores the associated memory cost. In this paper, we consider both factors together to achieve satisfactory results at relatively low model complexity. To this end, we perform a comparative analysis of different graph convolution operations, considering execution time, the number of trainable model parameters, data format requirements, and training outcomes. Our results show a 450-fold reduction in the number of parameters of the feature extraction module and a 4.5-fold reduction in the size of the data representation, while maintaining a classification accuracy of 52.3%, which is 6.3% higher than the operation used in state-of-the-art approaches. To further evaluate performance, we implemented an object detection architecture and evaluated it on the N-Caltech101 dataset. The results show an accuracy of 53.7% mAP@0.5 and an execution rate of 82 graphs per second.
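
Detection accuracy of this kind is usually scored by matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold, with 0.5 a common choice. The following is a minimal sketch of that matching test only, not the paper's evaluation code; the `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Full mAP additionally ranks predictions by confidence and averages precision over recall levels and classes; that part is omitted here.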

1.2 PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

PNT-Edge: Robust Edge Detection with Noisy Labels by Learning Pixel-Level Noise Transitions

https://arxiv.org/abs/2307.14070

Previous edge detection methods have achieved high performance by relying on large-scale training data with pixel-level labels. However, it is difficult to annotate edges accurately by hand, especially for large datasets, so datasets inevitably contain noisy labels. This label noise problem has been studied extensively in classification but remains under-explored in edge detection. To address label noise in edge detection, this paper proposes learning pixel-level noise transitions to model the label corruption process. To achieve this, we develop a novel Pixel-wise Shift Learning (PSL) module that estimates the transition from clean to noisy labels as a displacement field. Using the estimated noise transitions, our model, named PNT-Edge, is able to fit its predictions to clean labels. Furthermore, a local edge density regularization term is designed to exploit local structure information for better transition learning. This term encourages learning large shifts for edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of the proposed method in mitigating the effects of label noise. Code will be available on GitHub.
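
To make the idea of a label transition expressed as a displacement field concrete, here is a minimal sketch that shifts each edge pixel of a clean label map by a per-pixel offset. It is an illustration under stated assumptions, not the PSL module: the offsets here are hand-written integers rather than learned, the forward-warping direction is assumed, and `warp_labels` is a hypothetical name.

```python
def warp_labels(clean, shift):
    """Move every edge pixel of `clean` by its (dy, dx) entry in `shift`.

    clean: 2D list of 0/1 edge labels.
    shift: 2D list of (dy, dx) integer offsets, same shape as `clean`.
    """
    height, width = len(clean), len(clean[0])
    noisy = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if clean[y][x]:
                dy, dx = shift[y][x]
                # Clamp the target position so it stays inside the image.
                ny = min(max(y + dy, 0), height - 1)
                nx = min(max(x + dx, 0), width - 1)
                noisy[ny][nx] = 1
    return noisy


# A vertical edge in column 1, mislabeled one pixel to the right.
clean_map = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
one_right = [[(0, 1)] * 3 for _ in range(3)]
noisy_map = warp_labels(clean_map, one_right)
```

In this toy case `noisy_map` holds the same edge in column 2, which is the kind of small spatial misalignment that hand-drawn edge annotations tend to contain.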

1.3 Controllable Guide-Space for Generalizable Face Forgery Detection

Controllable Guide-Space for Generalizable Face Forgery Detection

https://arxiv.org/abs/2307.14039

Recent research on face forgery detection has shown satisfactory performance on forgery methods that appear in the training data, but results on unseen domains are far from ideal. This has motivated many works on improving generalization, but forgery-irrelevant information, such as image background and identity, still exists in the features of different domains and causes unexpected clustering, which limits generalization. In this paper, we propose a controllable guide-space (GS) method to improve the discrimination of different forgery domains, thereby increasing the forgery relevance of the features and improving generalization. A well-designed guide-space achieves both a proper separation of the forgery domains and a large distance between the real and forged domains in an explicit and controllable manner. Moreover, for better discrimination, we use a decoupling module to weaken the interference of forgery-irrelevant correlations between domains. We also adjust the decision boundary manifold according to the degree of clustering of same-domain features within a neighborhood. Extensive experiments in multiple in-domain and cross-domain settings confirm that our method achieves state-of-the-art generalization.

1.4 EasyNet: An Easy Network for 3D Industrial Anomaly Detection

EasyNet: A Simple 3D Industrial Anomaly Detection Network

https://arxiv.org/abs/2307.13925

3D anomaly detection is an emerging and important computer vision task in industrial manufacturing. Many advanced algorithms have been published in recent years, but most of them cannot meet the needs of industrial manufacturing. They have several disadvantages: i) they are difficult to deploy on production lines because they rely heavily on large pre-trained models; ii) they greatly increase storage overhead through heavy use of memory banks; iii) their inference cannot run in real time. To overcome these problems, we propose a simple and deployment-friendly network (called EasyNet) that uses neither pre-trained models nor memory banks. First, we design a multi-scale multi-modal feature encoder-decoder to accurately reconstruct the segmentation maps of anomalous regions and encourage interaction between RGB images and depth images. Second, we adopt a multi-modal anomaly segmentation network to obtain accurate anomaly maps. Third, we propose an attention-based information entropy fusion module for feature fusion during inference, making the network suitable for real-time deployment. Extensive experiments show that EasyNet achieves an anomaly detection AUROC of 92.6% without using pre-trained models or memory banks. Moreover, EasyNet is faster than existing methods, with a high frame rate of 94.55 FPS on a Tesla V100 GPU.
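
For readers unfamiliar with the AUROC figure, it is the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it directly from that definition; the function name and the toy scores are made up for illustration and are unrelated to EasyNet's outputs.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous samples, 0 for normal samples.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples; it is written this way for clarity rather than speed.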

1.5 A real-time material breakage detection for offshore wind turbines based on improved neural network algorithm

Real-time detection of material damage in offshore wind turbines based on improved neural network algorithm

https://arxiv.org/abs/2307.13765

The integrity of offshore wind turbines, key to sustainable energy production, is often compromised by surface material imperfections. Despite the availability of various detection technologies, limitations remain in terms of cost-effectiveness, efficiency, and applicability. To address these shortcomings, this study introduces a novel approach that leverages an advanced version of the YOLOv8 object detection model, supplemented by a convolutional block attention module (CBAM) to improve feature recognition. The optimized loss function further refines the learning process. Our method is rigorously tested using a dataset of 5,432 images from Saemangeum offshore wind farms and publicly available datasets. The results show a substantial improvement in defect detection stability, marking a major step towards efficient turbine maintenance. The contribution of this study points the way for future research that could revolutionize sustainable energy practices.

1.6 TMR-RD: Training-based Model Refinement and Representation Disagreement for Semi-Supervised Object Detection

TMR-RD: Semi-Supervised Object Detection with Training-Based Model Refinement and Representation Disagreement

https://arxiv.org/abs/2307.13755

Semi-supervised object detection (SSOD) combines limited labeled data with a large amount of unlabeled data to improve the performance and generalization of existing object detectors. Despite many advances, recent SSOD methods are still challenged by noisy or misleading pseudo-labels, the classical exponential moving average (EMA) strategy, and the consensus that teacher and student models reach late in training. This paper proposes a new training-based model refinement (TMR) stage and a simple yet effective representation disagreement (RD) strategy to address the limitations of classical EMA and the consensus problem. The TMR stage optimizes a lightweight scaling operation on the teacher-student models to refine their weights and to prevent overfitting or forgetting patterns learned from unlabeled data. Meanwhile, the RD strategy helps keep the two models diverged, encouraging the student model to explore complementary representations. Furthermore, we use cascaded regression to generate more reliable pseudo-labels for supervising the student model. Extensive experiments show that our method outperforms state-of-the-art SSOD methods. Specifically, the proposed method outperforms the Unbiased-Teacher method by an average mAP margin of 4.6% and 5.3% on the MS-COCO dataset with partially labeled and fully labeled data, respectively.
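
For context, the classical EMA strategy that the abstract identifies as a limitation updates each teacher weight as a slowly moving average of the corresponding student weight. Below is a minimal sketch of that baseline update only, not of the TMR stage; the weights are plain floats in a dictionary and the decay value is an assumed example.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    teacher, student: dicts mapping parameter names to float values.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# With a decay close to 1 the teacher changes slowly and ends up tracking
# the student, which is one way the two models come to agree late in training.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(2):
    teacher = ema_update(teacher, student, decay=0.9)
```

After the two updates above the teacher weight has moved from 1.0 toward the student's 0.0 by a factor of 0.9 per step.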

2. Segmentation | Semantics (7 papers)

2.1 Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks

Resolution-Aware Atrous Rate Design for Semantic Segmentation Networks

https://arxiv.org/abs/2307.14179

DeepLab is a widely used deep neural network for semantic segmentation, which owes its success to its parallel architecture called atrous spatial pyramid pooling (ASPP). ASPP uses multiple atrous convolutions with different atrous rates to extract both local and global information. However, fixed atrous rates are used for the ASPP module, which limits the size of its field of view. In principle, the atrous rate should be a hyperparameter that changes the size of the field of view according to the target task or dataset. However, there are no guidelines for choosing the atrous rate. This study proposes practical guidelines for obtaining an optimal atrous rate. First, an effective receptive field for semantic segmentation is introduced to analyze the internal behavior of segmentation networks. We observed that using an ASPP module produces a specific pattern in the effective receptive field, and we traced this pattern to reveal the module's underlying mechanism. From this, we derive practical guidelines for obtaining the optimal atrous rate, which should be controlled based on the size of the input image. Compared with other values, using the optimal atrous rate consistently improves segmentation results across several datasets, including STARE, CHASE_DB1, HRF, Cityscapes, and iSAID.
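
The link between the atrous rate and the field of view follows from the standard formula for the span of a dilated kernel: a k-tap kernel with rate r covers k + (k - 1)(r - 1) input positions along each axis. The sketch below evaluates this for a 3×3 kernel; the rates 6, 12, and 18 are a common fixed ASPP setting used here as an example, not the rates the paper recommends.

```python
def effective_kernel_size(kernel_size, rate):
    """Span, in input pixels, of a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Span of a 3x3 atrous convolution for each example rate.
spans = {rate: effective_kernel_size(3, rate) for rate in (1, 6, 12, 18)}
# spans == {1: 3, 6: 13, 12: 25, 18: 37}
```

A span of 37 pixels covers most of a small feature map but only a fraction of a large one, which gives some intuition for why a rate tuned to one input resolution may not suit another.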

2.2 Pre-Training with Diffusion models for Dental Radiography segmentation

Dental Image Segmentation Pre-training Based on Diffusion Model

https://arxiv.org/abs/2307.14066

Segmentation of medical radiographs, especially dental radiographs, is severely limited by label scarcity, because annotation requires specific expertise and is labor-intensive. In this work, we propose a simple pre-training approach for semantic segmentation that uses Denoising Diffusion Probabilistic Models (DDPM), which have shown impressive results in generative modeling. Our approach achieves strong label efficiency and requires no architectural modifications between pre-training and the downstream task. We propose to first pre-train a U-Net with the DDPM training objective and then fine-tune the resulting model on the segmentation task. Our experimental results show that the proposed method is competitive with state-of-the-art pre-training methods for segmenting dental radiographs.
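
As a reminder of what the DDPM training objective asks of the network, the sketch below noises a clean sample to a given timestep and scores a noise prediction with mean squared error. It uses plain Python lists in place of images and omits the U-Net itself; the linear beta schedule values and all function names are assumptions for illustration, not details taken from the paper.

```python
import math
import random


def linear_betas(steps, start=1e-4, end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if steps == 1:
        return [start]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta) up to each timestep."""
    out, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)
```

During pre-training the network would receive `xt` and `t` and be trained to output `eps`; fine-tuning for segmentation then reuses the same network with a different target.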

2.3 Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation

Unite-Divide-Unite: Jointly Boosting Trunk and Structure for High-Accuracy Dichotomous Image Segmentation

https://arxiv.org/abs/2307.14052

High-accuracy dichotomous image segmentation (DIS) aims to precisely locate category-agnostic foreground objects in natural scenes. The main challenge of DIS is to identify the dominant region with high accuracy while rendering detailed object structure. However, directly using a general-purpose encoder-decoder architecture may lead to an oversupply of high-level features and neglect the shallow spatial information needed to segment fine structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and divides complementary features to simultaneously improve the identification of trunk and structure. The proposed UDUN has several strengths. First, a dual-size input is fed into a shared backbone to produce more comprehensive and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) decouples multi-scale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information, respectively. Furthermore, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascaded integration for unified high-accuracy segmentation. As a result, UDUN outperforms state-of-the-art competitors on all six evaluation metrics on the overall DIS-TE set, achieving a weighted F-measure of 0.772 and an HCE of 977. With 1024×1024 input, our model runs inference in real time at 65.3 fps using ResNet-18.

2.4 3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing Capability

3D Semantic Subspace Traverser: Enhancing Shape Editing Capabilities for 3D Generative Models

https://arxiv.org/abs/2307.14051

Shape generation is the practice of producing 3D shapes in various representations for 3D content creation. Previous research on 3D shape generation has focused mainly on the quality and structure of shapes, with little or no consideration of semantic information. As a result, such generative models often fail to preserve the semantic consistency of the shape structure or to enable manipulation of a shape's semantic attributes during generation. In this paper, we propose a new semantic generative model named 3D Semantic Subspace Traverser, which exploits semantic attributes for category-specific 3D shape generation and editing. Our method uses implicit functions as the 3D shape representation and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes. Each dimension of the subspace corresponds to a specific semantic attribute, and we can edit the attributes of a generated shape by traversing the coefficients of these dimensions. Experimental results show that our method can generate plausible shapes with complex structures and supports editing of semantic attributes. Code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser.
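
Editing by traversing subspace coefficients can be pictured as moving a latent code along a few fixed directions. The sketch below shows only that linear step, under the assumption that the directions are already known; in the paper they are discovered by the model, and the generator that turns the edited code into a shape is omitted. The name `traverse` and the toy vectors are hypothetical.

```python
def traverse(latent, directions, coefficients):
    """Return latent + sum_i coefficients[i] * directions[i].

    latent: list of floats, the starting latent code.
    directions: list of direction vectors, each the same length as `latent`.
    coefficients: one step size per direction.
    """
    edited = list(latent)
    for direction, coefficient in zip(directions, coefficients):
        for index, component in enumerate(direction):
            edited[index] += coefficient * component
    return edited


# Move 2 units along the first direction and -1 along the second.
base_code = [0.0, 0.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
edited_code = traverse(base_code, basis, [2.0, -1.0])
```

If each direction controlled a single attribute, changing one coefficient while holding the others fixed would edit that attribute alone.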

2.5 Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

Improving Semi-Supervised Semantic Segmentation with a Dual-Level Siamese Structure Network

https://arxiv.org/abs/2307.13938

Semi-supervised semantic segmentation (SSS) is an important task that uses both labeled and unlabeled data to reduce the cost of labeling training samples. However, the effectiveness of SSS algorithms remains limited because they struggle to fully exploit the potential of unlabeled data. To address this issue, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. DSSN is designed to make maximal use of the available unlabeled data by aligning positive pairs with a pixel-wise contrastive loss, using strongly augmented views in both the low-level image space and the high-level feature space. Furthermore, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that either perform no selection or apply one predefined threshold to all classes. Specifically, our strategy selects the top high-confidence predictions of the weak view for each class to generate pseudo-labels that supervise the strongly augmented views. This strategy takes class imbalance into account and improves the performance of long-tailed classes. Our method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, significantly outperforming other SSS algorithms.
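
The contrast between one global threshold and per-class selection can be shown with a small sketch. It keeps the most confident fraction of predictions within each class, so a class whose confidences are uniformly low still contributes pseudo-labels. This is an illustration of the general idea only; the `keep_ratio` parameter, the tuple layout, and the function name are assumptions rather than the paper's exact rule.

```python
def select_pseudo_labels(predictions, keep_ratio=0.5):
    """Keep the top `keep_ratio` most confident predictions of each class.

    predictions: list of (pixel_id, class_id, confidence) tuples.
    Returns a dict mapping each selected pixel_id to its class_id.
    """
    by_class = {}
    for pixel_id, class_id, confidence in predictions:
        by_class.setdefault(class_id, []).append((confidence, pixel_id))
    selected = {}
    for class_id, items in by_class.items():
        items.sort(reverse=True)  # most confident first
        keep = max(1, int(len(items) * keep_ratio))
        for _, pixel_id in items[:keep]:
            selected[pixel_id] = class_id
    return selected


# Class 0 is easy (high confidence); class 1 is rare and hard.
toy_predictions = [
    (0, 0, 0.99), (1, 0, 0.95), (2, 0, 0.90), (3, 0, 0.85),
    (4, 1, 0.40), (5, 1, 0.30),
]
pseudo = select_pseudo_labels(toy_predictions, keep_ratio=0.5)
```

Here `pseudo` keeps pixels 0 and 1 for class 0 and pixel 4 for class 1, whereas a single threshold of 0.9 applied to every class would discard class 1 entirely.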

2.6 Deepfake Image Generation for Improved Brain Tumor Segmentation

Deepfake Image Generation for Improved Brain Tumor Segmentation

https://arxiv.org/abs/2307.14273

As the world progresses in technology and health, awareness of disease increases through the detection of asymptomatic signs. Detecting and treating tumors at an early stage is important because they can be life-threatening. Computer-aided techniques are used to overcome lingering limitations in disease diagnosis, yet brain tumor segmentation remains a difficult process, especially when multimodal data are involved. This is mainly due to the lack of data and corresponding labels, which leads to poor training results. This work investigates the feasibility of deepfake image generation for effective brain tumor segmentation. To this end, a generative adversarial network performs image-to-image translation to increase the dataset size, followed by image segmentation using a U-Net-based convolutional neural network trained with the deepfake images. The performance of the proposed method is compared with ground truth on four publicly available datasets. Results show improved performance on image segmentation quality metrics, and the approach may be helpful when training with limited data.

2.7 Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in Musculoskeletal Segmentation of Lower Extremities

Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in Lower Extremity Musculoskeletal Segmentation

https://arxiv.org/abs/2307.13986

Purpose: Obtaining manual annotations to train deep learning (DL) models for automatic segmentation is often time-consuming. Uncertainty-based Bayesian active learning (BAL) is a widely adopted approach to reducing annotation effort. Building on BAL, this study introduces a hybrid representation-enhanced sampling strategy that integrates both density and diversity criteria to save manual annotation cost by efficiently selecting the most informative samples. Methods: Using a Bayesian U-Net-based BAL framework, experiments were performed on two lower extremity (LE) datasets of MRI and CT images. Our method selects uncertain samples with high density and diversity for manual revision, optimizing for maximum similarity to unlabeled instances and minimum similarity to existing training data. We evaluate accuracy using the Dice score and efficiency using a proposed metric called reduced annotation cost (RAC). We further evaluate the impact of various acquisition rules on BAL performance and design an ablation study to estimate effectiveness. Results: The proposed method is superior or non-inferior to other methods on both datasets across two acquisition rules, and the quantitative results reveal the pros and cons of the acquisition rules. Our ablation study on volume-wise acquisition shows that combining the density and diversity criteria outperforms using either of them alone in musculoskeletal segmentation. Conclusions: Our sampling method proved effective in reducing annotation costs for image segmentation tasks. The combination of the proposed method and our BAL framework provides a semi-automatic way to annotate medical image datasets efficiently.
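
The Dice score used for accuracy measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, and the RAC metric proposed in the paper is not reproduced.

```python
def dice(pred, target):
    """Dice coefficient of two equally long binary masks (flat lists of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly (assumed convention).
    return 2.0 * intersection / total if total else 1.0
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one foreground pixel out of two each, giving a Dice score of 0.5.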


Origin blog.csdn.net/wzk4869/article/details/132015687