[Computer Vision | Object Detection] arXiv Computer Vision Academic Express on Object Detection (July 19 Paper Collection)

1. Detection-related (9 papers)

1.1 GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

https://arxiv.org/abs/2307.09472

Efficiency is critical for 3D lane detection due to practical deployment requirements. In this work, we propose a simple, fast, end-to-end detector that still maintains high detection accuracy. Specifically, we design a set of fully convolutional heads based on row-wise classification. Unlike previous counterparts, ours supports recognizing both vertical and horizontal lanes, and our method is the first to perform row-wise classification in bird's-eye view. In the heads, the features are split into groups, with each group of features corresponding to one lane instance. During training, predictions are associated with lane labels through the proposed one-to-one matching to compute the loss, and no post-processing is needed at inference. In this way, our fully convolutional detector, GroupLane, realizes DETR-like end-to-end detection. Evaluated on three real-world 3D lane benchmarks, OpenLane, Once-3DLanes, and OpenLane-Huawei, GroupLane with a ConvNeXt-Base backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 score on the OpenLane validation set. With a ResNet18 backbone, GroupLane still beats PersFormer by 4.9% F1 score while running nearly 7 times faster with only 13.3% of its FLOPs.
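The one-to-one matching mentioned above pairs each predicted lane with at most one ground-truth lane so the loss can be computed without NMS-style post-processing. The abstract does not give the paper's exact cost terms, so the sketch below illustrates the general idea with a hypothetical cost combining a classification score and a row-wise L1 distance, solved with SciPy's Hungarian matcher:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_lanes(pred_scores, pred_rows, gt_rows, l1_weight=1.0):
    """One-to-one matching of predicted lanes to ground-truth lanes.

    pred_scores: (N,) confidence that each predicted lane exists.
    pred_rows:   (N, R) predicted lateral position per BEV row.
    gt_rows:     (M, R) ground-truth lateral position per BEV row.
    Returns (pred_idx, gt_idx) index arrays of matched pairs.
    """
    # Hypothetical cost: low confidence and large row-wise L1 error both
    # make a pair more expensive (the paper's actual terms may differ).
    cls_cost = -pred_scores[:, None]                                          # (N, 1)
    reg_cost = np.abs(pred_rows[:, None, :] - gt_rows[None, :, :]).mean(-1)   # (N, M)
    cost = cls_cost + l1_weight * reg_cost                                    # (N, M)
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian matching
    return pred_idx, gt_idx

# Toy usage: 4 predicted lanes, 2 ground-truth lanes, 10 BEV rows.
rng = np.random.default_rng(0)
p_idx, g_idx = match_lanes(rng.random(4), rng.random((4, 10)), rng.random((2, 10)))
print(list(zip(p_idx, g_idx)))  # each GT lane matched to exactly one prediction
```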

1.2 Occlusion Aware Student Emotion Recognition based on Facial Action Unit Detection

https://arxiv.org/abs/2307.09465

Given that approximately half of science, technology, engineering, and mathematics (STEM) undergraduates at US colleges and universities leave by the end of their first year [15], improving the quality of the classroom environment is critical. This study focuses on monitoring students' emotions in the classroom as an indicator of their engagement and proposes ways to address this. We experimentally evaluate the effect of different facial parts on the performance of emotion recognition models. To test the proposed model under partial occlusion, a dataset with artificially occluded faces is introduced. The novelty of this work lies in the proposed occlusion-aware architecture for facial action unit (AU) extraction, which employs attention mechanisms and adaptive feature learning. The extracted AUs can later be used to classify facial expressions in classroom settings.
The findings of this paper provide valuable insights into handling occlusions when analyzing facial images for emotional engagement analysis. The presented experiments demonstrate the importance of considering occlusions for improving the reliability of facial analysis models in classroom settings, and the findings can be extended to other settings where occlusion is prevalent.
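The abstract mentions an attention mechanism that handles occluded facial regions before AU extraction, but not the exact architecture. The PyTorch sketch below is a minimal illustration of that idea, with a per-region attention gate that can learn to suppress features from occluded patches; all module names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class OcclusionAwareAUHead(nn.Module):
    """Minimal sketch: attention-gated pooling over facial region features.

    region_feats: (B, R, D) features for R facial regions (e.g. eyes, mouth).
    Occluded regions should receive low attention weights after training.
    """
    def __init__(self, feat_dim=256, num_aus=12):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)                 # per-region score
        self.au_classifier = nn.Linear(feat_dim, num_aus)  # multi-label AUs

    def forward(self, region_feats):
        weights = torch.softmax(self.attn(region_feats), dim=1)  # (B, R, 1)
        pooled = (weights * region_feats).sum(dim=1)             # (B, D)
        return torch.sigmoid(self.au_classifier(pooled))         # AU activations

# Toy usage: batch of 2 faces, 9 regions, 256-dim features.
head = OcclusionAwareAUHead()
aus = head(torch.randn(2, 9, 256))
print(aus.shape)  # torch.Size([2, 12])
```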

1.3 Knowledge Distillation for Object Detection: from generic to remote sensing datasets

https://arxiv.org/abs/2307.09264

Knowledge distillation, a well-known model compression technique, is an active research area in both computer vision and remote sensing. In this paper, we evaluate various off-the-shelf knowledge distillation methods for object detection in the remote sensing setting, methods originally developed on general-purpose computer vision datasets such as Pascal VOC. In particular, both logit-imitation and feature-imitation methods are applied to vehicle detection on well-known benchmarks such as the xView and VEDAI datasets. Extensive experiments compare the relative performance and interrelationships of these methods. The results show large variations and confirm the importance of result aggregation and cross-validation on remote sensing datasets.
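Logit imitation, one of the two families evaluated here, trains the student to match the teacher's temperature-softened class distribution via a KL divergence, following the classic Hinton-style formulation. The sketch below shows that generic loss, not the paper's specific recipe:

```python
import torch
import torch.nn.functional as F

def logit_imitation_loss(student_logits, teacher_logits, temperature=4.0):
    """Classic KD loss: KL divergence between temperature-softened logits.

    Both tensors have shape (N, C), e.g. per-RoI class logits in a detector.
    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy usage: 8 proposals, 5 classes; in practice this term is added to the
# detection loss with a weight chosen on a validation set.
loss = logit_imitation_loss(torch.randn(8, 5), torch.randn(8, 5))
print(loss.item())
```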

1.4 A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

https://arxiv.org/abs/2307.09220

As two of the most fundamental tasks in computer vision, object detection and segmentation have made great progress in the deep learning era. Because manual labeling is expensive, the annotated categories in existing datasets are usually small in scale and predefined, i.e., state-of-the-art detectors and segmenters cannot generalize beyond their closed vocabularies. To address this limitation, Open-Vocabulary Detection (OVD) and Segmentation (OVS) have attracted increasing attention in the past few years. In this survey, we provide a comprehensive review of past and recent developments in OVD and OVS. To this end, we develop a taxonomy based on task type and methodology, and find that the permission and usage of weak supervision signals can well distinguish different approaches, including visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge-distillation-based, and transfer-learning-based methods. The proposed taxonomy is general across tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D scene and video understanding. For each category, we discuss in detail its key principles, key challenges, development paths, strengths, and weaknesses. Furthermore, we benchmark each task as well as the important components of each method. Finally, several promising directions are provided to stimulate future research.
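Among the surveyed families, visual-semantic space mapping replaces a detector's fixed classifier with text embeddings of category names, so new classes can be queried at test time without retraining. The sketch below is a schematic of this idea, scoring region features against assumed precomputed text embeddings (e.g. from a CLIP-style encoder); it does not reproduce any specific surveyed method:

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(region_feats, text_embeds, temperature=0.01):
    """Score each region against an arbitrary list of class-name embeddings.

    region_feats: (N, D) region features from the detector.
    text_embeds:  (C, D) embeddings of class names from a text encoder;
                  C can change at test time with no retraining.
    Returns (N, C) class probabilities.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.T / temperature  # cosine similarity
    return logits.softmax(dim=-1)

# Toy usage: 3 regions scored against a vocabulary of 4 class names.
probs = open_vocab_classify(torch.randn(3, 512), torch.randn(4, 512))
print(probs.shape)  # torch.Size([3, 4])
```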

1.5 You’ve Got Two Teachers: Co-evolutionary Image and Report Distillation for Semi-supervised Anatomical Abnormality Detection in Chest X-ray

https://arxiv.org/abs/2307.09184

Anatomical abnormality detection in chest X-rays (CXR) aims to localize and characterize cardiopulmonary radiological findings in radiographs, which can speed up clinical workflows and reduce observational oversights. Most existing methods attempt this task either in a fully supervised setting, which requires expensive abnormality annotations at scale, or in a weakly supervised setting, which still lags far behind fully supervised methods in performance. In this work, we propose a co-evolutionary image and report distillation (CEIRD) framework for semi-supervised abnormality detection in CXR, which grounds visual detection results with text-classified abnormalities in paired radiology reports, and vice versa. Specifically, on top of the classic teacher-student pseudo-label distillation (TSD) paradigm, we additionally introduce an auxiliary report classification model, whose predictions are used for report-guided pseudo-detection label refinement (RPDLR) in the main visual detection task. Conversely, we also use the predictions of the visual detection model for abnormality-guided pseudo-classification label refinement (APCLR) in the auxiliary report classification task, and propose a co-evolution strategy in which the visual and report models promote each other, with RPDLR and APCLR executed alternately. In this way, we effectively incorporate the weak supervision from reports into a semi-supervised TSD pipeline. Besides the cross-modal pseudo-label refinement, we further propose an intra-image-modality adaptive non-maximum suppression, in which the pseudo-detection labels generated by the teacher visual model are dynamically rectified by the student's high-confidence predictions. Experimental results on the public MIMIC-CXR benchmark demonstrate that CEIRD outperforms several state-of-the-art weakly and semi-supervised methods.
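The TSD backbone this framework builds on keeps an exponential-moving-average (EMA) teacher whose confident detections become pseudo-labels for unlabeled images; CEIRD then refines those labels with report predictions. The sketch below shows only the generic TSD ingredients (EMA update and confidence filtering); the RPDLR/APCLR refinement steps are paper-specific and omitted:

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average teacher update, standard in TSD pipelines."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes, scores, threshold=0.7):
    """Keep only the teacher's high-confidence detections as pseudo-labels."""
    keep = scores >= threshold
    return boxes[keep], scores[keep]

# Toy usage with a stand-in model; a real pipeline would use a detector.
student = torch.nn.Linear(4, 2)
teacher = copy.deepcopy(student)
ema_update(teacher, student)
boxes, scores = filter_pseudo_labels(torch.rand(5, 4), torch.rand(5))
print(boxes.shape[0], "pseudo-boxes kept")
```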

1.6 Jean-Luc Picard at Touché 2023: Comparing Image Generation, Stance Detection and Feature Matching for Image Retrieval for Arguments

https://arxiv.org/abs/2307.09172

Participating in the shared task "Image Retrieval for Arguments", we use different pipelines for image retrieval that combine image generation, stance detection, preselection, and feature matching. We submit four runs with different pipeline layouts and compare them against the given baseline. Our pipelines perform comparably to the baseline.
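Feature matching, one pipeline component named above, can be illustrated with a standard local-descriptor matcher. The OpenCV ORB sketch below is a generic example of ranking candidate images by match count, not the authors' actual implementation; the distance cutoff is a hypothetical choice:

```python
import cv2

def match_score(query_path, candidate_path, max_features=500):
    """Rank candidates by the number of good ORB descriptor matches."""
    orb = cv2.ORB_create(nfeatures=max_features)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    img_q = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    img_c = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
    _, des_q = orb.detectAndCompute(img_q, None)
    _, des_c = orb.detectAndCompute(img_c, None)
    if des_q is None or des_c is None:
        return 0  # no keypoints found in one of the images
    matches = bf.match(des_q, des_c)
    # Count only reasonably close matches (hypothetical distance cutoff).
    return sum(1 for m in matches if m.distance < 40)

# Toy usage: retrieve by sorting candidates by descending match score.
# ranked = sorted(candidates, key=lambda p: match_score("query.jpg", p), reverse=True)
```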

1.7 MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

https://arxiv.org/abs/2307.09155

In this paper, we propose a novel and effective multi-level fusion network, MLF-DET, for high-performance cross-modal 3D object detection, which integrates feature-level fusion and decision-level fusion to fully exploit the information in images. For feature-level fusion, we present a Multi-scale Voxel-Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For decision-level fusion, we propose a lightweight Feature-cued Confidence Rectification (FCR) module, which further exploits image semantics to rectify the confidence of detection candidates. Furthermore, we devise an effective data augmentation strategy called occlusion-aware GT sampling (OGS) to retain more sampled objects in the training scenes and thus reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Remarkably, on the highly competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP, achieving state-of-the-art performance without bells and whistles.
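Decision-level fusion here rectifies each 3D candidate's confidence with a semantic score from the image branch. The abstract does not give the FCR formula, so the sketch below shows one simple hypothetical rectification, a geometric blend of the LiDAR confidence and the image score at the candidate's projected box:

```python
import numpy as np

def rectify_confidence(lidar_scores, image_scores, alpha=0.5):
    """Hypothetical decision-level fusion: blend per-candidate confidences.

    lidar_scores: (N,) confidences from the LiDAR detection branch.
    image_scores: (N,) semantic scores of each candidate's projected 2D box
                  from the image branch.
    A geometric blend keeps the result in [0, 1] and penalizes candidates
    that either modality scores low.
    """
    return lidar_scores ** (1.0 - alpha) * image_scores ** alpha

# Toy usage: a box the image branch distrusts gets demoted.
print(rectify_confidence(np.array([0.9, 0.9]), np.array([0.9, 0.2])))
```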

1.8 MVA2023 Small Object Detection Challenge for Spotting Birds: Dataset, Methods, and Results

https://arxiv.org/abs/2307.09143

Small Object Detection (SOD) is an important machine vision topic because (i) various real-world applications require detecting distant objects, and (ii) SOD is challenging owing to the noisy, blurred, and less informative appearance of small objects. This paper presents a new SOD dataset, the Small Object Detection for Spotting Birds (SOD4SB) dataset, consisting of 39,070 images containing 137,121 bird instances, and details the challenge built on it. A total of 223 participants took part in the challenge. The paper briefly introduces the award-winning methods. The dataset, baseline code, and the evaluation website for the public test set are all publicly available.

1.9 In Defense of Clip-based Video Relation Detection

https://arxiv.org/abs/2307.08984

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial and temporal bounding boxes. Existing VidVRD methods can be broadly divided into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based pipeline that classifies relations on short-clip tubelet pairs and then merges them into long-video relations, whereas top-down methods classify relations directly on long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that effective modeling of spatial and temporal context plays a more important role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context on clips. We demonstrate that using clip tubelets can achieve superior performance over most video-based methods. Moreover, clip tubelets offer more flexibility in model design and help alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information when compressing long tubelet features. Extensive experiments on two challenging VidVRD benchmarks validate that our HCM achieves new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
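The bottom-up paradigm described above classifies relations on short clips and then links them over time into video-level relations. The sketch below shows a minimal greedy association that merges consecutive clip-level triplets sharing the same subject, predicate, and object into longer spans; it is a generic illustration of the merging step, not HCM itself:

```python
from collections import defaultdict

def merge_clip_relations(clip_preds):
    """Greedily merge per-clip triplets into video-level relation spans.

    clip_preds: list (ordered by clip index) of sets of
                (subject_id, predicate, object_id) triplets.
    Returns {triplet: [(start_clip, end_clip), ...]} with maximal spans.
    """
    spans = defaultdict(list)
    for t, triplets in enumerate(clip_preds):
        for triplet in triplets:
            intervals = spans[triplet]
            # Extend the current span if the triplet also held in clip t-1.
            if intervals and intervals[-1][1] == t - 1:
                intervals[-1] = (intervals[-1][0], t)
            else:
                intervals.append((t, t))
    return dict(spans)

# Toy usage: "person-ride-bike" holds in clips 0-2, reappears in clip 4.
preds = [{("p1", "ride", "b1")}, {("p1", "ride", "b1")},
         {("p1", "ride", "b1")}, set(), {("p1", "ride", "b1")}]
print(merge_clip_relations(preds))
# {('p1', 'ride', 'b1'): [(0, 2), (4, 4)]}
```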

Source: blog.csdn.net/wzk4869/article/details/131884087