[Computer Vision | Object Detection] arXiv Computer Vision Express on Object Detection (paper collection, May 30)

1. Detection related (16 papers)

1.1 Contextual Object Detection with Multimodal Large Language Models

Paper address:

https://arxiv.org/abs/2305.18279

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks such as image captioning and question answering, but lack an essential perception ability: object detection. In this work, we address this limitation by introducing a new research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Furthermore, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within the human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. Github: https://github.com/yuhangzang/ContextDET.
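
To make the generate-then-detect idea concrete, here is a minimal, hypothetical PyTorch skeleton in the spirit of the paper: a language decoder first predicts object words from multimodal context, and a visual decoder then predicts boxes conditioned on those word embeddings. Every module below (the conv-patch encoder, single decoder layers, the vocabulary size) is a simplified stand-in, not ContextDET's actual architecture.

```python
# Hypothetical generate-then-detect skeleton; all modules are toy stand-ins.
import torch
import torch.nn as nn

class GenerateThenDetect(nn.Module):
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.visual_encoder = nn.Sequential(          # stand-in for a ViT backbone
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.Flatten(2))
        self.context_decoder = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.word_head = nn.Linear(dim, vocab_size)   # predicts object words
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.box_decoder = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.box_head = nn.Linear(dim, 4)             # (cx, cy, w, h) per word

    def forward(self, image, text_tokens):
        vis = self.visual_encoder(image).transpose(1, 2)   # B x N x D patch tokens
        ctx = self.context_decoder(self.word_embed(text_tokens), vis)
        word_logits = self.word_head(ctx)                  # "generate" step
        words = word_logits.argmax(-1)                     # predicted object words
        q = self.box_decoder(self.word_embed(words), vis)  # "detect" step
        return word_logits, self.box_head(q).sigmoid()     # boxes in [0, 1]

model = GenerateThenDetect()
logits, boxes = model(torch.randn(1, 3, 224, 224), torch.randint(0, 30522, (1, 8)))
```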

1.2 Towards minimizing efforts for Morphing Attacks – Deep embeddings for morphing pair selection and improved Morphing Attack Detection

Paper address:

https://arxiv.org/abs/2305.18216

Face morphing attacks pose a threat to the security of identity documents, especially with respect to subsequent access control, since they allow both contributing subjects to use the same document. In this study, face embeddings serve two purposes: pre-selecting images for large-scale morphing attack generation, and detecting potential morphing attacks. We build on previous embedding studies for both use cases using the MagFace model. For the first objective, we employ a pre-selection algorithm that pairs individuals based on face embedding similarity. We quantify the attack potential of differently morphed face images to compare how well the pre-selection supports automatically generating large numbers of successful morphing attacks. Regarding the second objective, we compare the ability of embeddings from two state-of-the-art face recognition systems to detect morphing attacks. Our results show that ArcFace and MagFace embeddings provide valuable pre-selection of images. Both the open-source and the COTS face recognition system are vulnerable to the generated attacks, especially when pre-selection is based on embeddings rather than on random pairing constrained only by soft biometrics. More accurate face recognition systems exhibit greater vulnerability to attacks, with the COTS system being the most vulnerable. Furthermore, MagFace embeddings serve as a robust alternative to the previously used ArcFace embeddings for detecting morphed face images. The experimental results confirm the benefits of face embeddings both for pre-selecting morphed face images and for detecting them, supported by extensive analysis of various designed attacks. The MagFace model proves to be a powerful alternative to the commonly used ArcFace model for both pre-selection and attack detection.
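
A minimal sketch of the embedding-based pre-selection step, under the assumption that one already has face embeddings from a model such as ArcFace or MagFace: rank all candidate pairs by cosine similarity and keep the most similar ones as promising morph pairs. The `preselect_pairs` helper and the random embeddings are illustrative only.

```python
# Hypothetical embedding-based morph pair pre-selection: rank candidate pairs
# by cosine similarity of their face embeddings and keep the most similar.
import numpy as np
from itertools import combinations

def preselect_pairs(embeddings: np.ndarray, top_k: int = 100):
    """embeddings: (n_subjects, d) face embeddings, one per subject."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scored = [(float(e[i] @ e[j]), i, j) for i, j in combinations(range(len(e)), 2)]
    scored.sort(reverse=True)                    # most similar faces first
    return [(i, j) for _, i, j in scored[:top_k]]

# Toy usage with random vectors standing in for real face embeddings.
pairs = preselect_pairs(np.random.randn(50, 512), top_k=10)
```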

1.3 Mining Negative Temporal Contexts For False Positive Suppression In Real-Time Ultrasound Lesion Detection

Paper address:

https://arxiv.org/abs/2305.18060

Real-time lesion detection during ultrasound scanning can help radiologists make accurate cancer diagnoses. However, this important task remains challenging and underexplored. Generic real-time object detection models may mistakenly report a significant number of false positives (FPs) when applied to ultrasound videos, potentially misleading junior radiologists. One key issue is that they fail to exploit the negative symptoms in previous frames, denoted as Negative Temporal Contexts (NTC). To address this issue, we propose to extract contexts from previous frames, including NTC, under the guidance of inverse optical flow. By aggregating the extracted contexts, we endow the model with the ability to suppress FPs by leveraging NTC. We call the resulting model UltraDet. The proposed UltraDet shows significant improvements over previous state-of-the-art methods and achieves real-time inference speed. To facilitate future research, we will release the code, checkpoints, and high-quality labels of the CVA-BUS dataset used in the experiments.
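
A hedged sketch of the flow-guided context extraction the abstract describes: warp a previous frame's feature map into the current frame with an optical-flow field via `grid_sample`, then fuse it with the current features. UltraDet's actual aggregation and FP suppression are more elaborate; only the warping mechanic is shown, and the naive averaging at the end is an assumption.

```python
# Hypothetical flow-guided temporal context extraction: warp previous-frame
# features into the current frame along a flow field, then fuse the two maps.
import torch
import torch.nn.functional as F

def warp_features(prev_feat, flow):
    """prev_feat: B x C x H x W features; flow: B x 2 x H x W pixel offsets."""
    b, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()          # H x W x 2 base grid
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # displace by flow
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1         # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(prev_feat, grid, align_corners=True)

feat_prev, flow = torch.randn(1, 64, 32, 32), torch.randn(1, 2, 32, 32)
context = warp_features(feat_prev, flow)
fused = 0.5 * (context + torch.randn(1, 64, 32, 32))      # naive aggregation
```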

1.4 Pedestrian detection with high-resolution event camera

Paper address:

https://arxiv.org/abs/2305.18008

Despite the continuous advancement of computer vision algorithms, implementing perception and control systems for autonomous vehicles such as drones and self-driving cars remains challenging. Video streams captured by traditional cameras are often prone to problems such as motion blur or image degradation under difficult lighting conditions. In addition, the frame rate (typically 30 or 60 frames per second) can be a limiting factor in certain scenarios. Event cameras (DVS, Dynamic Vision Sensors) are a potentially interesting technology for addressing the above problems. In this paper, we compare two methods of processing event data with deep learning for the task of pedestrian detection: a frame-like representation processed by a convolutional neural network, and an asynchronous sparse convolutional neural network. The obtained results illustrate the potential of event cameras and allow evaluating the accuracy and efficiency of these methods on high-resolution (1280 × 720 pixels) footage.
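
For the frame-based branch of such a comparison, raw events are commonly binned into an image-like tensor that a standard CNN can consume. The sketch below builds a two-channel (positive/negative polarity) event-count histogram; this is a common representation, not necessarily the exact one used in the paper.

```python
# Common frame representation for event data: accumulate events from a time
# window into a two-channel per-polarity histogram image for a standard CNN.
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int):
    """events: (n, 4) array of (x, y, timestamp, polarity in {0, 1})."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = events[:, 3].astype(int)
    np.add.at(frame, (p, y, x), 1.0)   # count events per pixel and polarity
    return frame

# Toy usage at the paper's 1280 x 720 resolution with random events.
ev = np.column_stack([np.random.randint(0, 1280, 1000),
                      np.random.randint(0, 720, 1000),
                      np.sort(np.random.rand(1000)),
                      np.random.randint(0, 2, 1000)])
frame = events_to_frame(ev, height=720, width=1280)
```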

1.5 View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection

Paper address:

https://arxiv.org/abs/2305.17972

For autonomous vehicles, driving safely is highly dependent on the ability to correctly perceive the 3D spatial environment, so the task of 3D object detection represents a fundamental aspect of perception. While 3D sensors deliver accurate metric perception, monocular approaches enjoy cost and availability advantages that are valuable in a wide range of applications. Unfortunately, training monocular methods requires a large amount of annotated data. Interestingly, self-supervised approaches have recently been successfully applied to ease the training process and unlock access to widely available unlabeled data. While related research exploits different priors, including LIDAR scans and stereo images, such priors again limit applicability. Therefore, in this work, we propose a novel approach for self-supervised 3D object detection purely from RGB sequences, exploiting multi-view constraints and weak labels. Our experiments on the KITTI 3D dataset show performance comparable to state-of-the-art self-supervised methods that use LIDAR scans or stereo images.
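
One way to realize a multi-view constraint of this kind, sketched under assumptions: project a 3D object center predicted in one view into a second view with known intrinsics `K` and relative pose `T_rel`, and penalize disagreement with the 2D prediction made there. The paper's full constraints and weak labels are richer than this single reprojection term.

```python
# Hypothetical multi-view consistency loss: project a predicted 3D center
# into a second camera view and penalize the distance to that view's 2D
# prediction. K (intrinsics) and T_rel (view1 -> view2) are assumed known.
import torch

def project(points_3d, K, T):
    """points_3d: N x 3 in view-1 coordinates -> N x 2 pixels in view 2."""
    p = torch.cat([points_3d, torch.ones(len(points_3d), 1)], dim=1)  # homogeneous
    cam2 = (T @ p.T).T[:, :3]                  # transform into view-2 frame
    uv = (K @ cam2.T).T
    return uv[:, :2] / uv[:, 2:3]              # perspective divide

def consistency_loss(center_view1, pred_uv_view2, K, T_rel):
    return torch.nn.functional.smooth_l1_loss(
        project(center_view1, K, T_rel), pred_uv_view2)

K = torch.tensor([[720., 0., 640.], [0., 720., 360.], [0., 0., 1.]])
T_rel = torch.eye(4)                           # identity pose for the toy example
loss = consistency_loss(torch.tensor([[1.0, 0.5, 10.0]]),
                        torch.tensor([[712.0, 396.0]]), K, T_rel)
```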

1.6 CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models

Paper address:

https://arxiv.org/abs/2305.17932

Camouflaged object detection (COD) is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. Existing COD methods mainly cast the problem as semantic segmentation, which suffers from overconfident incorrect predictions. In this paper, we propose a new paradigm that treats COD as a conditional mask generation task, utilizing diffusion models. Our method, dubbed CamoDiffusion, employs the denoising process of a diffusion model to iteratively reduce the noise of the mask. Owing to the stochastic sampling process of diffusion, our model is capable of sampling multiple possible predictions from the mask distribution, avoiding the problem of overconfident point estimates. Moreover, we develop specialized learning strategies, including an innovative ensemble approach for generating robust predictions and a customized forward diffusion method for efficient training, tailored to the COD task. Extensive experiments on three COD datasets attest to the superior performance of our model compared to existing state-of-the-art methods, particularly on the most challenging COD10K dataset, where our method achieves an MAE of 0.019.
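
A toy sketch of the conditional-mask-generation loop: starting from pure noise, a denoiser conditioned on the image iteratively refines the mask, and averaging several stochastic samples gives the ensemble prediction mentioned above. The one-layer denoiser and the crude update rule are stand-ins, not CamoDiffusion's schedule or network.

```python
# Toy conditional diffusion for mask generation: denoise from pure noise,
# conditioned on the image; average several samples as a simple ensemble.
import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):                 # stand-in for the real network
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 1, 3, padding=1)   # mask + RGB condition
    def forward(self, noisy_mask, image, t):       # t unused in this toy model
        return self.net(torch.cat([noisy_mask, image], dim=1))  # predicts x0

@torch.no_grad()
def sample_masks(model, image, steps=10, n_samples=4):
    masks = []
    for _ in range(n_samples):                     # ensemble over random seeds
        x = torch.randn(1, 1, *image.shape[-2:])   # start from pure noise
        for t in reversed(range(steps)):
            x0 = model(x, image, t)                # denoise toward a clean mask
            alpha = t / steps
            x = alpha * x + (1 - alpha) * x0       # crude interpolation update
            x = x + 0.1 * alpha * torch.randn_like(x)
        masks.append(torch.sigmoid(x0))
    return torch.stack(masks).mean(0)              # averaged ensemble mask

mask = sample_masks(MaskDenoiser(), torch.randn(1, 3, 64, 64))
```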

1.7 T2FNorm: Extremely Simple Scaled Train-time Feature Normalization for OOD Detection

Paper address:

https://arxiv.org/abs/2305.17797

Neural networks are notorious for being overconfident predictors, posing a significant challenge to their safe deployment in real-world applications. While feature normalization has garnered considerable attention in the deep learning literature, current train-time regularization methods for out-of-distribution (OOD) detection have yet to fully exploit this potential. Indeed, naive incorporation of feature normalization within neural networks does not guarantee improved OOD detection performance. In this work, we introduce T2FNorm, a novel approach to training neural networks that transforms features into a hyperspherical space via normalization, while employing the non-transformed space for OOD scoring. This method yields a surprising enhancement in OOD detection capability without compromising in-distribution (ID) accuracy. Our investigation demonstrates that the proposed technique substantially reduces the norm of features across all samples, and even more so for out-of-distribution samples, thereby addressing the prevalent concern of overconfidence in neural networks. The proposed method also significantly improves various post-hoc OOD detection methods.
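
The abstract's recipe is simple enough to sketch directly: L2-normalize (and temperature-scale) features before the classifier during training, but use the unnormalized feature norm for OOD scoring at test time, since ID samples tend to retain larger norms. Layer sizes and the temperature value below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the T2FNorm idea: hyperspherical features at train time,
# unnormalized feature norm as the OOD score at test time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class T2FNormNet(nn.Module):
    def __init__(self, backbone_dim=512, num_classes=10, tau=0.04):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a real backbone
            nn.Flatten(), nn.Linear(3 * 32 * 32, backbone_dim))
        self.head = nn.Linear(backbone_dim, num_classes)
        self.tau = tau                                # temperature-like scale

    def forward(self, x):
        feat = self.backbone(x)
        logits = self.head(F.normalize(feat, dim=1) / self.tau)  # train-time norm
        return logits, feat

    @torch.no_grad()
    def ood_score(self, x):
        _, feat = self.forward(x)
        return feat.norm(dim=1)     # higher norm -> more likely in-distribution

net = T2FNormNet()
logits, _ = net(torch.randn(4, 3, 32, 32))
scores = net.ood_score(torch.randn(4, 3, 32, 32))
```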

1.8 Real-time Object Detection: YOLOv1 Re-Implementation in PyTorch

Paper address:

https://arxiv.org/abs/2305.17786

Real-time object detection is a key problem that computer vision systems must solve in order to make timely, appropriate decisions based on detection results. I chose the YOLOv1 architecture and implemented it in the PyTorch framework to become familiar with the entire object detection pipeline, and I experimented with modifications to the original architecture to improve the results. Finally, I compare the metrics of my implementation with those of the original.
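
As a reminder of what such a re-implementation must handle, the sketch below decodes YOLOv1's S×S×(B·5+C) output tensor into image-relative boxes: each grid cell predicts B boxes (cell-relative x, y offsets, image-relative w, h, and a confidence) plus C class probabilities shared by the cell. The decoding follows the original paper; the random tensor stands in for a real network output.

```python
# Decode a YOLOv1 prediction tensor (S x S x (B*5 + C)) into boxes.
import torch

def decode_yolov1(pred, S=7, B=2, C=20):
    """pred: S x S x (B*5 + C) tensor -> list of (cx, cy, w, h, conf, class)."""
    boxes = []
    cls_prob = pred[..., B * 5:]                   # class scores shared per cell
    for i in range(S):                             # grid row
        for j in range(S):                         # grid column
            for b in range(B):
                x, y, w, h, conf = pred[i, j, b * 5: b * 5 + 5].tolist()
                cx, cy = (j + x) / S, (i + y) / S  # cell offset -> image coords
                cls = int(cls_prob[i, j].argmax())
                boxes.append((cx, cy, w, h, conf, cls))
    return boxes

boxes = decode_yolov1(torch.rand(7, 7, 30))        # 7x7 grid, B=2, C=20
```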

1.9 Lighting and Rotation Invariant Real-time Vehicle Wheel Detector based on YOLOv5

Paper address:

https://arxiv.org/abs/2305.17785

In computer vision, creating an object detector based on a convolutional neural network (CNN) architecture presents some common challenges during initial development. These challenges are even more pronounced when the model must adapt to images captured under various camera orientations, lighting conditions, and environmental changes. Obtaining initial training samples that cover all of these conditions can be a huge challenge, with substantial time and cost burdens. While this problem can exist when creating any type of object detector, it is worse for less common object types, for which no pre-labeled image datasets are publicly available; even when public datasets exist, they are not always reliable or comprehensive for rare object types. The vehicle wheel is the example chosen here to demonstrate a method for creating an illumination- and rotation-invariant real-time detector based on the YOLOv5 architecture. Our goal is to provide a simple method that can serve as a reference for developing other types of real-time object detectors.
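
When labeled samples cannot cover every condition, heavy augmentation is the usual workaround; a plausible torchvision pipeline for a rotation- and lighting-tolerant wheel detector is sketched below. This is an assumption about the general approach, not the paper's exact recipe, and for detection training the bounding boxes would have to be transformed together with the images (omitted here).

```python
# Hypothetical augmentation pipeline for lighting- and rotation-tolerance:
# full-circle rotations plus strong photometric jitter on each training view.
import torchvision.transforms as T
from PIL import Image

augment = T.Compose([
    T.RandomRotation(degrees=180, expand=True),   # wheels look alike at any angle
    T.ColorJitter(brightness=0.6, contrast=0.6, saturation=0.4, hue=0.05),
    T.RandomGrayscale(p=0.1),                     # simulate weak color cues
    T.ToTensor(),
])

img = Image.new("RGB", (640, 640))                # stand-in for a real photo
sample = augment(img)                             # one augmented training view
```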

1.10 Image Hash Minimization for Tamper Detection

Paper address:

https://arxiv.org/abs/2305.17748

Tamper detection using image hashes is a very common problem today, and several studies and advances have addressed it. However, most existing methods lose tamper detection accuracy when the tampered area is small, and they require long image hashes. In this paper, we propose a novel approach that minimizes the hash length as an explicit objective while improving performance when the tampered area is small.
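
For intuition, here is a hypothetical block-wise perceptual-hash baseline: compute a 64-bit average hash per image block and flag blocks whose hashes differ strongly between the original and the received image. The paper's contribution is minimizing the total hash length; this sketch only illustrates the tamper-localization mechanism such hashes enable.

```python
# Hypothetical block-wise average-hash tamper localization.
import numpy as np

def block_hashes(gray: np.ndarray, block: int = 64):
    """gray: H x W grayscale image -> dict mapping (row, col) -> 64-bit hash."""
    hashes = {}
    for r in range(0, gray.shape[0] - block + 1, block):
        for c in range(0, gray.shape[1] - block + 1, block):
            patch = gray[r:r + block, c:c + block]
            # Mean-pool the block down to 8x8, then threshold at the mean (aHash).
            small = patch.reshape(8, block // 8, 8, block // 8).mean(axis=(1, 3))
            hashes[(r, c)] = (small > small.mean()).flatten()
    return hashes

def tampered_blocks(h_orig, h_recv, max_hamming=8):
    return [k for k in h_orig
            if np.count_nonzero(h_orig[k] != h_recv[k]) > max_hamming]

img = np.random.rand(256, 256)
forged = img.copy()
forged[64:128, 64:128] = 0                      # simulated tampered region
bad = tampered_blocks(block_hashes(img), block_hashes(forged))
```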

1.11 k-NNN: Nearest Neighbors of Neighbors for Anomaly Detection

Paper address:

https://arxiv.org/abs/2305.17695

Anomaly detection aims to identify images that deviate significantly from the norm. We focus on algorithms that embed the normal training examples in a space where, given a test image, anomalies are detected based on the distance of its features to the k nearest training neighbors. We propose a new operator that takes into account the varying structure and importance of features in the embedding space. Interestingly, it considers not only the nearest neighbors, but also the neighbors of these neighbors (k-NNN). We show that simply replacing the nearest-neighbor component of existing algorithms with our k-NNN operator, while leaving the rest of each algorithm untouched, improves its results. This holds both for common homogeneous datasets, such as those of a specific type of flower or nut, and for more diverse datasets.
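
A minimal sketch of the neighbors-of-neighbors idea, assuming precomputed embeddings: gather a test point's k nearest training samples, expand the set with those samples' own neighbors, and score against the expanded set. The paper's operator weights these neighbors by local structure; plain averaging is used here for brevity.

```python
# Hypothetical neighbors-of-neighbors (k-NNN-style) anomaly score.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knnn_score(train: np.ndarray, query: np.ndarray, k: int = 5):
    nn = NearestNeighbors(n_neighbors=k).fit(train)
    _, idx = nn.kneighbors(query)                   # k nearest training samples
    scores = []
    for q, first_ring in zip(query, idx):
        _, idx2 = nn.kneighbors(train[first_ring])  # neighbors of those neighbors
        expanded = np.unique(np.concatenate([first_ring, idx2.ravel()]))
        scores.append(np.linalg.norm(train[expanded] - q, axis=1).mean())
    return np.array(scores)                         # higher = more anomalous

train = np.random.randn(500, 128)                   # normal training embeddings
scores = knnn_score(train, np.random.randn(3, 128))
```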

1.12 Deep Learning based Fingerprint Presentation Attack Detection: A Comprehensive Survey

Paper address:

https://arxiv.org/abs/2305.17522

The vulnerability of fingerprint authentication systems to presentation attacks raises security concerns for their use in highly secure access control applications. Fingerprint Presentation Attack Detection (FPAD) methods are therefore crucial for ensuring reliable fingerprint authentication. Owing to the limited generalization ability of traditional handcrafted approaches, deep learning-based FPAD has become mainstream and has achieved remarkable performance over the past decade. Existing reviews focus more on handcrafted than on deep learning-based methods and are outdated. To stimulate future research, we focus exclusively on recent deep learning-based FPAD methods. In this paper, we first briefly introduce the most common presentation attack instruments (PAIs) and the publicly available fingerprint presentation attack (PA) datasets. We then categorize existing deep learning FPAD methods into contact-based, contactless, and smartphone-based approaches. Finally, we conclude the paper by discussing the open challenges at the current stage and highlighting promising future directions.

1.13 FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detection

Paper address:

https://arxiv.org/abs/2305.17449

With the development of artificial intelligence, road object detection has become a prominent topic in computer vision, mostly addressed with perspective cameras. Fisheye lenses provide omnidirectional wide coverage, allowing road intersections to be monitored with fewer cameras, at the cost of view distortion. To the best of our knowledge, there is currently no open dataset prepared for traffic surveillance with fisheye cameras. This paper introduces the open FishEye8K benchmark dataset for road object detection tasks, comprising 157K bounding boxes over five classes (pedestrian, bicycle, car, bus, and truck). In addition, we present benchmark results for state-of-the-art (SoTA) models, including variants of YOLOv5, YOLOR, YOLOv7, and YOLOv8. The dataset consists of 8,000 images recorded from 22 videos of traffic surveillance in Hsinchu, Taiwan, using 18 fisheye cameras at resolutions of 1080×1080 and 1280×1280. Owing to the heavy distortion of ultra-wide panoramic and hemispherical fisheye camera images and the numerous road participants, especially motorcyclists, the data annotation and validation process was arduous and time-consuming. To avoid bias, frames from a given camera are assigned entirely to either the training or the test set, keeping the number of images and bounding boxes in each class at a ratio of approximately 70:30. The experimental results show that YOLOv8 and YOLOR perform best at input sizes of 640×640 and 1280×1280, respectively. The dataset will be available on GitHub in PASCAL VOC, MS COCO, and YOLO annotation formats. The FishEye8K benchmark will make a significant contribution to fisheye video analytics and smart city applications.
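
The camera-wise split described above can be sketched as a small greedy assignment: whole cameras go to either train or test, so frames from one camera never leak across sets, while the image counts approach the 70:30 target. The per-camera counts below are invented for illustration.

```python
# Sketch of a camera-wise 70:30 split: each camera's frames go entirely to
# train or test, with cameras greedily assigned to approximate the ratio.
def split_by_camera(images_per_camera: dict, train_ratio: float = 0.7):
    total = sum(images_per_camera.values())
    train, test, train_count = [], [], 0
    # Assign the largest cameras first so the ratio converges quickly.
    for cam, n in sorted(images_per_camera.items(), key=lambda kv: -kv[1]):
        if train_count + n <= total * train_ratio:
            train.append(cam)
            train_count += n
        else:
            test.append(cam)
    return train, test

cams = {f"cam{i:02d}": 300 + 40 * i for i in range(18)}   # 18 fisheye cameras
train_cams, test_cams = split_by_camera(cams)
```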

1.14 Robust Lane Detection through Self Pre-training with Masked Sequential Autoencoders and Fine-tuning with Customized PolyLoss

Paper address:

https://arxiv.org/abs/2305.17271

Lane detection is crucial for vehicle localization and is the foundation of autonomous driving and many intelligent and advanced driver assistance systems. Existing vision-based lane detection methods do not fully exploit valuable features and aggregate contextual information, especially the interrelationships between lane lines and other regions of the image across consecutive frames. To fill this research gap and improve lane detection performance, this paper proposes a pipeline consisting of self pre-training with a masked sequential autoencoder and fine-tuning with a customized PolyLoss for an end-to-end neural network model that uses multiple consecutive image frames. The masked sequential autoencoder pre-trains the neural network model with the objective of reconstructing the missing pixels in randomly masked images. In the fine-tuning stage, which performs lane detection segmentation, consecutive image frames are used as input, and the pre-trained model weights are transferred and further updated via backpropagation, with a customized PolyLoss computing the weighted error between the output lane detection results and the labeled ground truth. Extensive experimental results show that, with the proposed pipeline, the lane detection performance in both normal and challenging scenes advances beyond the state of the art, achieving the best test accuracy (98.38%), precision (0.937), and F1-measure (0.924) on the normal-scene test set, and the best overall accuracy (98.36%) and precision (0.844) on the challenging-scene test set, while greatly reducing training time.
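
A toy sketch of the masked pre-training objective: hide random patches of the input, reconstruct with an autoencoder, and compute the loss only on the hidden pixels. The real pipeline masks sequences of consecutive frames and uses a segmentation backbone; the single frame and two-layer network here are stand-ins.

```python
# Toy masked-autoencoder pre-training: reconstruct randomly hidden patches.
import torch
import torch.nn as nn

def random_patch_mask(x, patch=16, ratio=0.5):
    """Zero out a random fraction of patch-aligned regions of x."""
    b, _, h, w = x.shape
    keep = torch.rand(b, 1, h // patch, w // patch) > ratio   # True = visible
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return x * mask, mask

autoencoder = nn.Sequential(          # toy stand-in for the real network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))

frames = torch.randn(2, 3, 128, 256)              # stand-in road images
masked, mask = random_patch_mask(frames)
recon = autoencoder(masked)
loss = ((recon - frames) ** 2 * (~mask)).mean()   # loss on hidden pixels only
```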

1.15 VoxDet: Voxel Learning for Novel Instance Detection

Paper address:

https://arxiv.org/abs/2305.17220

Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional approaches, which mainly rely on 2D representations and matching techniques, are often inadequate for handling pose variations and occlusions. To address this, we introduce VoxDet, a pioneering 3D geometry-aware framework that leverages a strong 3D voxel representation and a reliable voxel matching mechanism. VoxDet first proposes a Template Voxel Aggregation (TVA) module to efficiently convert multi-view 2D images into 3D voxel features. By exploiting the associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation exhibits heightened resilience to occlusion and pose variations. We also find that a 3D reconstruction objective helps pre-train the 2D-to-3D mapping in TVA. Second, for fast alignment with the template voxel, VoxDet incorporates a Query Voxel Matching (QVM) module. 2D queries are first converted into their voxel representations using the learned 2D-to-3D mapping. We find that since the 3D voxel representation encodes geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-Video, and the newly built RoboTools benchmarks, where VoxDet outperforms various 2D baselines remarkably, with 20% higher recall and faster speed. To the best of our knowledge, VoxDet is the first to exploit implicit 3D knowledge for 2D tasks.
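
To illustrate "estimate the rotation, then compare aligned voxels", the sketch below brute-forces candidate yaw angles by rotating a query voxel grid with 3D `affine_grid`/`grid_sample` and keeping the best cosine match against the template voxel. VoxDet learns this alignment inside QVM; the exhaustive search and the yaw-only rotation are simplifying assumptions.

```python
# Hypothetical rotate-then-compare voxel matching via brute-force yaw search.
import math
import torch
import torch.nn.functional as F

def yaw_matrix(theta):
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, 0., s, 0.],
                         [0., 1., 0., 0.],
                         [-s, 0., c, 0.]])       # 3x4 affine for affine_grid

def match_voxels(template, query, n_angles=36):
    """template, query: 1 x C x D x H x W voxel feature grids."""
    best = (-1.0, 0.0)
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        grid = F.affine_grid(yaw_matrix(theta).unsqueeze(0), query.shape,
                             align_corners=False)
        rotated = F.grid_sample(query, grid, align_corners=False)
        score = F.cosine_similarity(rotated.flatten(1), template.flatten(1)).item()
        best = max(best, (score, theta))
    return best                                   # (similarity, yaw in radians)

score, yaw = match_voxels(torch.randn(1, 8, 16, 16, 16),
                          torch.randn(1, 8, 16, 16, 16))
```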

1.16 Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models

Paper address:

https://arxiv.org/abs/2305.17207

We focus on the challenge of out-of-distribution (OOD) detection in deep learning models, a critical aspect of ensuring reliability. Despite considerable effort, the problem remains significantly challenging in deep learning models due to their propensity to output overconfident predictions for OOD inputs. We propose a novel one-class open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of the in-domain and OOD content. Our approach is designed to detect anything not in-domain and offers the flexibility to detect a wide variety of OOD, defined via fine-grained or coarse-grained labels, or even in natural language. We evaluate our approach on challenging benchmarks, including large-scale datasets containing fine-grained, semantically similar classes, distributionally shifted images, and multi-object images containing a mixture of in-domain and OOD objects. Our method shows superior performance over previous methods on all benchmarks. Code is publicly available.
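
A hedged sketch of the zero-shot mechanism with an off-the-shelf CLIP model: describe the in-domain classes and some OOD concepts in natural language, and use the softmax mass assigned to the in-domain prompts as the in-distribution score. The prompt sets and the blank test image here are illustrative; the paper combines richer domain and OOD descriptions.

```python
# Hypothetical zero-shot OOD scoring with a text-image model (CLIP):
# softmax mass on in-domain prompts serves as the in-distribution score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

in_domain = ["a photo of a dog", "a photo of a cat"]      # in-domain classes
ood = ["a photo of a car", "a photo of something else"]   # OOD descriptions

image = Image.new("RGB", (224, 224))            # stand-in for a real image
inputs = processor(text=in_domain + ood, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

id_score = probs[: len(in_domain)].sum()        # mass on in-domain prompts
print(f"in-distribution score: {id_score:.3f}")
```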
