[Computer Vision | Object Detection] arXiv Computer Vision Academic Express on Object Detection (Paper Collection, September 14)

1. Detection related (13 articles)

1.1 Polygon Intersection-over-Union Loss for Viewpoint-Agnostic Monocular 3D Vehicle Detection

Viewpoint-agnostic monocular 3D vehicle detection with a polygon intersection-over-union loss

https://arxiv.org/abs/2309.07104

Monocular 3D object detection is a challenging task because depth information is difficult to obtain from 2D images. A subset of viewpoint-agnostic monocular 3D detection methods does not explicitly exploit scene homography or geometry during training, meaning that the trained models can detect objects in images taken from arbitrary viewpoints. Such work predicts the projection of the 3D bounding box onto the image plane in order to estimate the position of the 3D box, but these projections are not rectangular, so computing the IoU between the projected polygons is not straightforward. This work proposes an efficient, fully differentiable algorithm for computing the IoU between two convex polygons, which can be used to compute the IoU between two 3D bounding-box footprints observed from any angle. We test the proposed polygon IoU loss (PIoU loss) on three state-of-the-art viewpoint-agnostic 3D detection models. Experiments show that the proposed PIoU loss converges faster than the L1 loss, and that combining the PIoU loss with the L1 loss in the 3D detection models gives better results than the L1 loss alone (+1.64% AP70 for MonoCon on cars, +0.18% AP70 for RTM3D on cars, and +0.83%/+2.46% AP50/AP25 for MonoRCNN on bikes).
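
For readers curious about the geometry behind the PIoU loss: the IoU of two convex polygons can be computed by clipping one polygon against the other (Sutherland-Hodgman) and measuring areas with the shoelace formula. The NumPy sketch below illustrates only that computation; the paper's loss is a differentiable (e.g. PyTorch) version of the same idea, and the function names and test boxes here are purely illustrative.

```python
import numpy as np

def polygon_area(poly):
    """Shoelace formula; poly is an (N, 2) array of vertices in order."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def clip_polygon(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by the convex polygon `clipper`.
    Both are (N, 2) arrays with vertices in counter-clockwise (CCW) order."""
    def inside(p, a, b):
        # True if p lies on the left side of the directed edge a -> b (CCW convention).
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersection(p1, p2, a, b):
        # Intersection of segment p1-p2 with the infinite line through edge a-b.
        d1, d2 = p2 - p1, b - a
        denom = d1[0] * d2[1] - d1[1] * d2[0]
        t = ((a[0] - p1[0]) * d2[1] - (a[1] - p1[1]) * d2[0]) / denom
        return p1 + t * d1

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        input_list, output = output, []
        for j in range(len(input_list)):
            cur, prev = input_list[j], input_list[j - 1]
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersection(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersection(prev, cur, a, b))
        if not output:
            return np.empty((0, 2))
    return np.array(output)

def polygon_iou(poly_a, poly_b):
    """IoU of two convex polygons given as (N, 2) vertex arrays (CCW order)."""
    inter = clip_polygon(poly_a, poly_b)
    inter_area = polygon_area(inter) if len(inter) >= 3 else 0.0
    union_area = polygon_area(poly_a) + polygon_area(poly_b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

# Two overlapping footprints (e.g. projected 3D box bottoms) viewed from above.
box_a = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 1.0]])
box_b = np.array([[1.0, 0.0], [3.0, 0.0], [3.0, 1.0], [1.0, 1.0]])
print(polygon_iou(box_a, box_b))  # 1/3 for these boxes
```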

1.2 RadarLCD: Learnable Radar-based Loop Closure Detection Pipeline

RadarLCD: a learnable radar-based loop closure detection pipeline

https://arxiv.org/abs/2309.07094

Loop closure detection (LCD) is an important task in robotics and computer vision and a fundamental component of applications in many fields, including object recognition, image retrieval, and video analysis. LCD consists of identifying whether the robot has returned to a previously visited location, called a loop, and then estimating the relative roto-translation with respect to the analyzed location. Radar sensors have many advantages, such as the ability to operate in varied weather conditions and compatibility with other commonly used sensors. This study introduces RadarLCD, a novel supervised deep learning pipeline specifically designed for loop closure detection with FMCW (Frequency Modulated Continuous Wave) radar sensors. RadarLCD, a learning-based LCD method explicitly designed for radar, makes a significant contribution by leveraging the pre-trained HERO (Hybrid Estimate Radar Odometry) model. HERO was originally developed for radar odometry, and its features are used to select keypoints for the LCD task. The method is evaluated on a variety of FMCW radar dataset scenarios and compared with state-of-the-art systems such as Scan Context for place recognition and ICP for loop closure. The results show that RadarLCD outperforms the alternatives in several aspects of loop closure detection.
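
The abstract does not detail RadarLCD's internals, so the following is only a generic sketch of the decision any loop-closure detector ultimately makes: compare a global descriptor of the current radar scan against descriptors of older scans and flag a loop when the similarity exceeds a threshold. The descriptor source, the minimum gap, and the threshold are placeholders, not values from the paper.

```python
import numpy as np

def detect_loop_closure(query_descriptor, past_descriptors, min_gap=50, threshold=0.85):
    """Generic loop-closure test: compare the current scan's global descriptor
    against descriptors of sufficiently old scans and return the best match
    if its cosine similarity exceeds `threshold`.

    query_descriptor: (D,) descriptor of the current radar scan
    past_descriptors: (N, D) descriptors of previous scans, in time order
    min_gap: ignore the most recent scans to avoid trivial self-matches
    """
    if len(past_descriptors) <= min_gap:
        return None
    candidates = np.asarray(past_descriptors[:-min_gap])
    q = query_descriptor / (np.linalg.norm(query_descriptor) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    similarities = c @ q
    best = int(np.argmax(similarities))
    if similarities[best] >= threshold:
        return best, float(similarities[best])  # index of the matched scan and its score
    return None
```

In a full pipeline such as RadarLCD, the descriptors would come from the learned model (with HERO-derived keypoints), and a detected loop would then be passed to a registration method such as ICP to estimate the relative pose.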

1.3 SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

SupFusion: Supervised LiDAR-camera fusion for 3D object detection

https://arxiv.org/abs/2309.07084

In this paper, we propose a novel training strategy called SupFusion, which provides auxiliary feature-level supervision for effective LiDAR-camera fusion and significantly improves detection performance. Our strategy involves a data augmentation method named Polar Sampling, which densifies sparse objects and trains an auxiliary model to generate high-quality features as supervision. These features are then used to train the LiDAR-camera fusion model, where the fused features are optimized to mimic the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which, combined with the SupFusion strategy, consistently outperforms previous fusion methods. Our proposal therefore has the following advantages. First, SupFusion introduces auxiliary feature-level supervision that improves LiDAR-camera detection performance without adding any inference cost. Second, the proposed deep fusion continuously improves the capability of the detector. Our proposed SupFusion and deep fusion modules are plug-and-play, and we conduct extensive experiments to demonstrate their effectiveness. Specifically, we obtain an improvement of around 2% 3D mAP on the KITTI benchmark with multiple LiDAR-camera 3D detectors.
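
Conceptually, the feature-level supervision amounts to adding an auxiliary loss that pulls the fused LiDAR-camera features toward the high-quality features produced by the auxiliary (teacher) model, alongside the usual detection loss. A minimal PyTorch sketch of that idea follows; the tensor names, shapes, MSE term, and weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def supervised_fusion_loss(fused_features, teacher_features, detection_loss, mimic_weight=1.0):
    """Combine the ordinary detection loss with a feature-level supervision term.

    fused_features:   features from the LiDAR-camera fusion model, shape (B, C, H, W)
    teacher_features: high-quality features from the frozen auxiliary model, same shape
    detection_loss:   scalar loss already computed by the detection head
    """
    # Teacher features act only as targets; gradients must not flow into the teacher.
    mimic_loss = F.mse_loss(fused_features, teacher_features.detach())
    return detection_loss + mimic_weight * mimic_loss

# Usage inside a training step (shapes are illustrative):
fused = torch.randn(2, 64, 32, 32, requires_grad=True)
teacher = torch.randn(2, 64, 32, 32)
det_loss = torch.tensor(1.5, requires_grad=True)
loss = supervised_fusion_loss(fused, teacher, det_loss)
loss.backward()
```

Because the teacher and the mimicry term are used only at training time, inference cost is unchanged, which matches the first advantage claimed above.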

1.4 FAIR: Frequency-aware Image Restoration for Industrial Visual Anomaly Detection

FAIR: Frequency-aware image restoration for industrial visual anomaly detection

https://arxiv.org/abs/2309.07068

Anomaly detection models based on image reconstruction have been widely studied in industrial visual inspection. However, existing models often suffer from a trade-off between the reconstruction fidelity of normal samples and the distinguishability of abnormal reconstructions, which hurts performance. In this paper, we find that this trade-off can be better mitigated by exploiting the different frequency biases between normal and abnormal reconstruction errors. To this end, we propose Frequency-Aware Image Restoration (FAIR), a new self-supervised image restoration task that recovers the high-frequency components of images. It enables accurate reconstruction of normal patterns while mitigating undesirable generalization to anomalies. Using only a simple vanilla UNet, FAIR achieves state-of-the-art performance with higher efficiency on various defect detection datasets. Code: https://github.com/liutongkun/FAIR.
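
As a schematic reading of the restoration task described above (not the paper's implementation), one can low-pass an image, train a network to re-synthesize the missing high frequencies, and score anomalies by the restoration error. In the PyTorch sketch below, `unet` stands for any image-to-image network such as the vanilla UNet mentioned in the abstract, and the Gaussian low-pass filter, loss, and scoring are illustrative choices.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(images, kernel_size=7, sigma=2.0):
    """Depthwise Gaussian blur used as a simple low-pass filter. images: (B, C, H, W)."""
    coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).expand(images.shape[1], 1, kernel_size, kernel_size).contiguous()
    return F.conv2d(images, kernel, padding=kernel_size // 2, groups=images.shape[1])

def restoration_training_step(unet, images, optimizer):
    """Self-supervised step: feed the low-passed image, regress the original."""
    low_pass = gaussian_blur(images)
    restored = unet(low_pass)
    loss = F.l1_loss(restored, images)   # the network must re-synthesize high frequencies
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def anomaly_map(unet, images):
    """At test time, large restoration errors indicate anomalous (defective) regions."""
    with torch.no_grad():
        restored = unet(gaussian_blur(images))
    return (restored - images).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
```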

1.5 Dynamic Causal Disentanglement Model for Dialogue Emotion Detection

Dynamic causal disentanglement model for conversational emotion detection

https://arxiv.org/abs/2309.06928

Emotion detection is a key technology that is widely used in many fields. Although incorporating commonsense knowledge has proven beneficial for existing emotion detection methods, dialogue-based emotion detection still faces many difficulties and challenges due to the variability of human agency and conversation content; moreover, emotions are often expressed implicitly, which means that much of the real emotion remains hidden in a sea of irrelevant words and dialogue. In this paper, we propose a dynamic causal disentanglement model based on latent variable separation. The model effectively decomposes the content of conversations and investigates the temporal accumulation of emotions, thereby enabling more precise emotion recognition. First, we introduce a novel causal directed acyclic graph (DAG) to establish the correlations between hidden emotional information and other observed elements. Subsequently, our approach utilizes pre-extracted personal attributes and the distributions of latent variables of discourse topics as guidance factors, aiming to separate out irrelevant information. Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and latent variables, enabling the accumulation of emotion-related information throughout the conversation. To guide this decomposition process, we utilize ChatGPT-4.0 and an LSTM network to extract discourse topics and personal attributes as observed information. Finally, we evaluate our method on two popular dialogue emotion detection datasets, and the experimental results verify the superiority of the model.
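
The abstract describes the model only at a high level, so the snippet below is merely a toy illustration of one ingredient: inferring a Gaussian latent variable per utterance with an LSTM that is conditioned on guidance features (e.g. speaker attributes or topic information). All names, dimensions, and the single-encoder structure are hypothetical and should not be read as the authors' architecture.

```python
import torch
import torch.nn as nn

class UtteranceLatentEncoder(nn.Module):
    """Toy sequence encoder: an LSTM over utterance embeddings, conditioned on
    guidance features (e.g. speaker attributes / topic), producing a Gaussian
    latent variable per utterance. Hypothetical illustration only."""

    def __init__(self, utter_dim=768, guide_dim=64, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(utter_dim + guide_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, utterances, guidance):
        # utterances: (B, T, utter_dim), guidance: (B, T, guide_dim)
        h, _ = self.lstm(torch.cat([utterances, guidance], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

encoder = UtteranceLatentEncoder()
z, mu, logvar = encoder(torch.randn(2, 10, 768), torch.randn(2, 10, 64))
print(z.shape)  # (2, 10, 32): one latent vector per utterance
```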

1.6 CCSPNet-Joint: Efficient Joint Training Method for Traffic Sign Detection Under Extreme Conditions

CCSPNet-Joint: An efficient joint training method for traffic sign detection under extreme conditions

https://arxiv.org/abs/2309.06902

Traffic sign detection is an important research direction in intelligent driving. Unfortunately, existing methods often ignore extreme conditions such as fog, rain, and motion blur. Furthermore, end-to-end training strategies for image denoising and object detection models fail to effectively exploit inter-model information. To solve these problems, we propose CCSPNet, an efficient feature extraction module based on Transformers and CNNs, which effectively utilizes contextual information, achieves faster inference speed, and provides stronger feature enhancement. Furthermore, we establish the correlation between the object detection and image denoising tasks and propose a joint training model, CCSPNet-Joint, to improve data efficiency and generalization. Finally, to validate our method, we create the CCTSDB-AUG dataset for traffic sign detection in extreme conditions. Extensive experiments show that CCSPNet achieves state-of-the-art performance for traffic sign detection under extreme conditions. Compared with the end-to-end method, CCSPNet-Joint improves precision by 5.32% and mAP@0.5 by 18.09%.
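
The joint training contrasted with end-to-end training above boils down to optimizing the denoising and detection objectives together through one combined loss, so gradients from the detector also shape the denoiser. The sketch below is a generic illustration of that setup; the loss weights, the `detector.compute_loss` interface, and the L1 restoration term are assumptions, not CCSPNet-Joint's actual interfaces.

```python
import torch

def joint_training_step(denoiser, detector, degraded_images, clean_images, targets,
                        optimizer, det_weight=1.0, restore_weight=1.0):
    """One joint optimization step over a denoising model and a detection model.

    degraded_images: images under extreme conditions (fog, rain, motion blur)
    clean_images:    the corresponding clean images (restoration targets)
    targets:         ground-truth traffic-sign annotations for the detector
    """
    restored = denoiser(degraded_images)
    restore_loss = torch.nn.functional.l1_loss(restored, clean_images)
    det_loss = detector.compute_loss(restored, targets)  # hypothetical detector API
    loss = restore_weight * restore_loss + det_weight * det_loss
    optimizer.zero_grad()
    loss.backward()          # gradients flow through the detector back into the denoiser
    optimizer.step()
    return loss.item()
```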

1.7 Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

Video infringement detection based on feature disentanglement and mutual information maximization

https://arxiv.org/abs/2309.06877

The self-media era has given us a large number of high-quality videos. Unfortunately, frequent video copyright infringements now seriously damage the interests and enthusiasm of video creators, so identifying infringing videos is an urgent task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and expect the network to extract useful representations. Despite its simplicity, this paradigm relies heavily on raw entangled features and lacks constraints that guarantee the extraction of useful task-relevant semantics. In this paper, we address this challenge from two aspects: (1) We propose to decompose the original high-dimensional feature into multiple sub-features, explicitly disentangling the feature into exclusive low-dimensional components. We expect the sub-features to encode non-overlapping semantics of the original feature and to remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information. Evaluated on two large-scale benchmark datasets (i.e., SVD and VCSL), our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets a new state of the art on the VCSL benchmark dataset. Our code and models have been released at https://github.com/yyyoooooo/DMI/, hoping to contribute to the community.
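
Point (1) above, splitting an entangled high-dimensional feature into exclusive low-dimensional sub-features, can be sketched with separate projection heads plus a penalty that discourages the sub-features from overlapping. The pairwise cosine-similarity penalty below is only an illustrative stand-in; the paper's actual objective is derived from the mutual-information analysis mentioned in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubFeatureDisentangler(nn.Module):
    """Project an entangled feature into K low-dimensional sub-features."""

    def __init__(self, in_dim=2048, sub_dim=128, num_sub=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, sub_dim) for _ in range(num_sub))

    def forward(self, x):
        # x: (B, in_dim)  ->  list of K tensors, each (B, sub_dim), L2-normalized
        return [F.normalize(head(x), dim=-1) for head in self.heads]

def overlap_penalty(sub_features):
    """Encourage sub-features to be dissimilar: penalize pairwise cosine similarity."""
    penalty, k = 0.0, len(sub_features)
    for i in range(k):
        for j in range(i + 1, k):
            penalty = penalty + (sub_features[i] * sub_features[j]).sum(dim=-1).abs().mean()
    return penalty / (k * (k - 1) / 2)

model = SubFeatureDisentangler()
subs = model(torch.randn(8, 2048))
print(overlap_penalty(subs))  # added to the task loss during training
```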

1.8 Remote Sensing Object Detection Meets Deep Learning: A Meta-review of Challenges and Advances

Remote sensing object detection meets deep learning: a meta-review of challenges and advances

https://arxiv.org/abs/2309.06751

Remote sensing object detection (RSOD) is one of the most basic and challenging research topics in the remote sensing field and has long received widespread attention. In recent years, deep learning has demonstrated powerful feature representation capabilities and led to a great leap in the development of RSOD techniques. In this era of rapid technological development, this article aims to comprehensively review the latest achievements of deep learning-based RSOD methods. More than 300 papers are covered in this review. We identify five main challenges in RSOD, including multi-scale object detection, rotated object detection, weak object detection, tiny object detection, and object detection with limited supervision, and systematically review the corresponding methods, organized in a hierarchical taxonomy. We also review the widely used benchmark datasets and evaluation metrics in the RSOD field, as well as the application scenarios of RSOD, and provide future research directions to further promote RSOD research.

1.9 MFL-YOLO: An Object Detection Model for Damaged Traffic Signs

MFL-YOLO: an object detection model for damaged traffic signs

https://arxiv.org/abs/2309.06750

Traffic signs are important facilities for ensuring traffic safety and smooth flow, but they may be damaged for various reasons, posing great safety risks. It is therefore of great significance to study methods for detecting damaged traffic signs, which existing object detection techniques have largely neglected. Since damaged traffic signs are close in appearance to normal ones, it is difficult to capture detailed local damage features with traditional object detection methods. This article proposes an improved object detection method based on YOLOv5s, namely MFL-YOLO (Mutual Feature Levels Loss enhanced YOLO). We design a simple cross-layer loss function so that each layer of the model has its own role, which helps the model learn more diverse features and improves fine-grained detection. The method can be applied as a plug-and-play module and improves accuracy without increasing structural or computational complexity. We also replace the traditional convolutions and CSP blocks in the neck of YOLOv5s with GSConv and VoVGSCSP to reduce model size and computational cost. Compared with YOLOv5s, our MFL-YOLO improves the F1 score and mAP by 4.3 and 5.1 while reducing FLOPs by 8.9%. Grad-CAM heatmap visualizations show that our model focuses better on the local details of damaged traffic signs. In addition, we conduct experiments on CCTSDB2021 and TT100K to further verify the generalization of our model.
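
The cross-layer loss is not spelled out in the abstract, so the snippet below only shows the general shape of a multi-level objective in which each feature level receives its own supervised term (so that "each layer has its own role") and the terms are summed. The per-level loss function and the weights are placeholders rather than the MFL loss itself.

```python
def multi_level_loss(per_level_predictions, targets, per_level_loss_fn, level_weights=None):
    """Sum a supervised loss computed independently at each feature level.

    per_level_predictions: list of prediction tensors, one per detection level
    per_level_loss_fn:     callable (prediction, targets) -> scalar loss (placeholder)
    level_weights:         optional list of per-level weights
    """
    if level_weights is None:
        level_weights = [1.0] * len(per_level_predictions)
    total = 0.0
    for weight, prediction in zip(level_weights, per_level_predictions):
        total = total + weight * per_level_loss_fn(prediction, targets)
    return total

# Toy usage with a stand-in per-level loss:
preds = [[0.2, 0.4], [0.1], [0.3, 0.3, 0.1]]
dummy_loss = lambda pred, tgt: sum(pred)   # stand-in for a real detection loss
print(multi_level_loss(preds, targets=None, per_level_loss_fn=dummy_loss))
```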

1.10 Integrating GAN and Texture Synthesis for Enhanced Road Damage Detection

Integrating GAN and texture synthesis for enhanced road damage detection

https://arxiv.org/abs/2309.06747

In the field of traffic safety and road maintenance, accurate detection of road damage is crucial for ensuring driving safety and extending road durability. However, current methods often fall short due to limited data. Previous attempts have used generative adversarial networks to generate damage with different shapes and manually integrate it into appropriate locations. However, the problem has not been well explored and faces two challenges. First, such methods only enrich the location and shape of the damage but ignore the diversity of damage severity, and their realism needs further improvement. Second, they require considerable manual effort. To address these challenges, we propose an innovative approach. In addition to using a GAN to generate damage of different shapes, we also adopt texture synthesis to extract road texture. These two elements are then blended with varying weights, allowing us to control the severity of the synthetic damage, which is then embedded into the original image via Poisson blending. Our approach ensures both rich damage severity and better alignment with the background. To save labor, we exploit structural similarity for automatic sample selection during the embedding process. Each original image is augmented with versions at different severity levels, and we adopt a simple screening strategy to mitigate distribution drift. Experiments are conducted on a public road damage dataset. The proposed method not only eliminates the need for manual labor but also achieves significant improvements, raising mAP by 4.1% and the F1 score by 4.5%.
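
The synthesis steps described above map onto operations available in standard libraries: weighted blending of a GAN-generated damage patch with an extracted road texture to control severity, Poisson blending into the original image (OpenCV's seamlessClone), and SSIM for screening placements. The sketch below assumes the damage and texture patches already exist and only illustrates those three operations; it is not the authors' code.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def blend_damage(damage_patch, texture_patch, severity):
    """Mix a GAN-generated damage patch with road texture; higher `severity`
    keeps more of the damage appearance (both patches are uint8, same size)."""
    return cv2.addWeighted(damage_patch, severity, texture_patch, 1.0 - severity, 0)

def embed_damage(road_image, patch, center):
    """Poisson-blend the synthetic damage patch into the road image at `center` (x, y)."""
    mask = 255 * np.ones(patch.shape[:2], dtype=np.uint8)
    return cv2.seamlessClone(patch, road_image, mask, center, cv2.NORMAL_CLONE)

def region_similarity(region_a, region_b):
    """SSIM between two same-sized image regions, e.g. a candidate placement region
    and a reference road-texture patch, usable for automatic placement screening."""
    return structural_similarity(cv2.cvtColor(region_a, cv2.COLOR_BGR2GRAY),
                                 cv2.cvtColor(region_b, cv2.COLOR_BGR2GRAY))
```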

1.11 MTD: Multi-Timestep Detector for Delayed Streaming Perception

MTD: A multi-timestep detector for delayed streaming perception

https://arxiv.org/abs/2309.06742

Autonomous driving systems require real-time environmental perception to ensure user safety and experience. Streaming perception is a task of reporting the current state of the world, used to evaluate the latency and accuracy of autonomous driving systems. In practical applications, factors such as hardware limitations and high temperatures inevitably cause delays in autonomous driving systems, resulting in a mismatch between model output and world state. To solve this problem, this paper proposes the Multi-Timestep Detector (MTD), an end-to-end detector that uses dynamic routing for multi-branch future prediction, making the model robust to delay fluctuations. A Delay Analysis Module (DAM) is proposed to improve existing delay-sensing methods by continuously monitoring the model inference stack and the trend of computational delay. Furthermore, a novel Timestep Branch Module (TBM) is constructed, which includes a static flow and an adaptive flow to adaptively predict specific timesteps according to the delay trend. The proposed method is evaluated on the Argoverse-HD dataset, and experimental results show that it achieves state-of-the-art performance across various delay settings.
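
The Delay Analysis Module is described as monitoring inference delay and its trend so the detector can choose how far ahead to predict. The toy bookkeeping below illustrates that idea only; the windowing, weighting, and rounding are invented for the example, and the dynamic routing between the static and adaptive branches is not shown.

```python
from collections import deque

class DelayAnalyzer:
    """Track recent end-to-end inference delays and suggest how many future
    frames (timesteps) the detector should predict to compensate."""

    def __init__(self, frame_interval_s=1 / 30, window=20):
        self.frame_interval_s = frame_interval_s
        self.history = deque(maxlen=window)

    def record(self, delay_s):
        self.history.append(delay_s)

    def suggested_timesteps(self):
        if not self.history:
            return 1
        # Weight recent delays more heavily to follow the current trend.
        weights = range(1, len(self.history) + 1)
        weighted_delay = sum(w * d for w, d in zip(weights, self.history)) / sum(weights)
        # Predict roughly as many frames ahead as the delay spans, at least one.
        return max(1, round(weighted_delay / self.frame_interval_s))

analyzer = DelayAnalyzer()
for delay in [0.05, 0.06, 0.07, 0.08]:   # measured delays in seconds
    analyzer.record(delay)
print(analyzer.suggested_timesteps())    # 2: the delay now spans about two frame intervals
```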

1.12 ShaDocFormer: A Shadow-attentive Threshold Detector with Cascaded Fusion Refiner for Document Shadow Removal

ShaDocFormer: A shadow-attentive threshold detector with a cascaded fusion refiner for document shadow removal

https://arxiv.org/abs/2309.06670

Document shadows are a common problem when capturing documents with mobile devices, and they severely affect document readability. Current methods face various challenges, including inaccurate detection of shadow masks and poor estimation of illumination. In this paper, we propose ShaDocFormer, a Transformer-based architecture that integrates traditional methods and deep learning techniques to solve the document shadow removal problem. The ShaDocFormer architecture consists of two components: the Shadow-attentive Threshold Detector (STD) and the Cascaded Fusion Refiner (CFR). The STD module adopts a traditional thresholding technique and uses the Transformer's attention mechanism to gather global information, achieving accurate detection of shadow masks. The cascaded and aggregating structure of the CFR module facilitates a coarse-to-fine restoration of the entire image. As a result, ShaDocFormer excels at accurately detecting and capturing shadow and lighting variations, enabling effective shadow removal. Extensive experiments show that ShaDocFormer outperforms current state-of-the-art methods in both qualitative and quantitative measurements.
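
The "traditional thresholding technique" inside the STD module can be illustrated with a classical global threshold on the document's luminance, e.g. Otsu's method; the attention-based refinement and the CFR restoration stage are learned components and are not shown. The blur kernel size and the use of Otsu below are illustrative choices, not taken from the paper.

```python
import cv2

def coarse_shadow_mask(document_bgr):
    """Classical thresholding baseline for a coarse document shadow mask:
    dark pixels in the blurred luminance channel are treated as candidate shadow."""
    gray = cv2.cvtColor(document_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (21, 21), 0)          # suppress text strokes
    _, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return mask  # 255 where the pixel is likely in shadow

# Example usage (path is illustrative):
# mask = coarse_shadow_mask(cv2.imread("document.jpg"))
```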

1.13 DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention

DF-TransFusion: Multimodal deepfake detection based on lip-audio cross-attention and facial self-attention

https://arxiv.org/abs/2309.06511

With the rise of manipulated media, deepfake detection has become a priority for protecting the authenticity of digital content. In this paper, we propose a novel multi-modal audio-video framework designed to process audio and video inputs simultaneously for the deepfake detection task. Our model exploits lip synchronization with the input audio through a cross-attention mechanism while extracting visual cues with a fine-tuned VGG-16 network. Subsequently, a Transformer encoder network is employed to perform facial self-attention. We conduct multiple ablation studies highlighting the different strengths of our approach. Our multi-modal methodology outperforms state-of-the-art multi-modal deepfake detection techniques in terms of per-video F-1 and AUC scores.
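
The lip-audio cross-attention can be sketched with a standard multi-head attention layer in which the lip features supply the queries and the audio features supply the keys and values. The feature extractors (the fine-tuned VGG-16 and an audio encoder) and the facial self-attention encoder are omitted, and all dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipAudioCrossAttention(nn.Module):
    """Cross-attention between lip-region features (queries) and audio features
    (keys/values). Dimensions and the single-layer design are illustrative."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lip_features, audio_features):
        # lip_features:   (B, T_video, dim)  e.g. per-frame lip embeddings
        # audio_features: (B, T_audio, dim)  e.g. per-chunk audio embeddings
        attended, _ = self.attn(query=lip_features, key=audio_features, value=audio_features)
        return self.norm(lip_features + attended)  # residual fusion of the two modalities

fusion = LipAudioCrossAttention()
out = fusion(torch.randn(2, 30, 256), torch.randn(2, 100, 256))
print(out.shape)  # (2, 30, 256)
```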

Origin: blog.csdn.net/wzk4869/article/details/132895567