[Computer Vision | Object Detection] arXiv Computer Vision Academic Express on Object Detection (August 15 Paper Collection)

1. Detection-related (15 papers)

1.1 Diving with Penguins: Detecting Penguins and their Prey in Animal-borne Underwater Videos via Deep Learning

https://arxiv.org/abs/2308.07267

The African penguin (Spheniscus demersus) is an endangered species. Little is known about their underwater hunting strategies and associated predation success rates, yet this knowledge is crucial for guiding conservation. Modern biologging techniques have the potential to provide valuable insights, but manually analyzing large volumes of footage from animal-borne video recorders (AVRs) is time-consuming. In this paper, we release a dataset of animal-borne underwater videos of penguins and introduce a ready-to-deploy deep learning system capable of robust detection of penguins (mAP50@…%) and fish (mAP50@…%). We note that the detector clearly benefits from bubble learning to improve accuracy. Extending this detector into a two-stream behavior recognition network, we also present first results on recognizing penguin predation behavior in underwater video. While the results are promising, further work is needed before predation-behavior detection is usable in live scenarios. In summary, we provide a highly reliable underwater penguin detector, a fish detector, and a worthy first attempt at automated visual detection of complex marine predator behavior. We release the networks, the DivingWithPenguins video dataset, annotations, splits, and weights for full reproducibility and immediate use by practitioners.

1.2 Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

https://arxiv.org/abs/2308.07202

Bottom-up, segmentation-based methods have gradually become mainstream for real-time scene text detection thanks to their flexible representation of arbitrarily shaped text and their simple pipelines. Despite great progress, these methods still lack robustness and suffer from false positives and instance adhesion. Unlike existing methods that integrate multi-granularity features or multiple outputs, we approach the problem from the perspective of representation learning, using auxiliary tasks that let the encoder jointly learn robust features alongside the main per-pixel classification task during optimization. For semantic representation learning, we propose Global Dense Semantic Contrast (GDSC), in which a vector is extracted as a global semantic representation and then contrasted element-wise against dense grid features. To learn instance-aware representations, we propose combining top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues to the encoder. With the proposed GDSC and TDM, the encoder learns stronger representations without introducing any extra parameters or computation at inference time. Equipped with a very light decoder, the detector achieves more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method outperforms or is comparable to existing methods in both accuracy and speed. Specifically, on a single GeForce RTX 2080 Ti GPU, it achieves an F-measure of 87.2% at 48.2 FPS on Total-Text and an F-measure of 89.6% at 36.9 FPS on MSRA-TD500.
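
As a rough illustration of the GDSC idea, here is a minimal PyTorch sketch of one plausible reading: a global text vector is pooled from the feature map and contrasted densely against every grid element. The mask-pooled formulation and the per-pixel binary-contrast loss are assumptions of this sketch, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gdsc_loss(feat, text_mask, tau=0.1):
    """Hypothetical Global Dense Semantic Contrast (GDSC) sketch.

    feat:      (B, C, H, W) encoder feature map
    text_mask: (B, 1, H, W) float 0/1 mask, 1 = text pixel
    """
    feat = F.normalize(feat, dim=1)                    # unit-norm grid features
    # Global semantic vector: masked average over text pixels.
    denom = text_mask.sum(dim=(2, 3)).clamp(min=1.0)   # (B, 1)
    g = (feat * text_mask).sum(dim=(2, 3)) / denom     # (B, C)
    g = F.normalize(g, dim=1)
    # Dense similarity between the global vector and every grid element.
    sim = torch.einsum("bc,bchw->bhw", g, feat) / tau  # (B, H, W)
    # Text pixels should agree with the global vector, background should not.
    return F.binary_cross_entropy_with_logits(sim, text_mask.squeeze(1))
```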

1.3 Survey on video anomaly detection in dynamic scenes with moving cameras

https://arxiv.org/abs/2308.07050

The increasing popularity of compact and inexpensive cameras, such as dashboard cameras, body cameras, and cameras mounted on robots, has led to growing interest in detecting anomalies in dynamic scenes recorded by moving cameras. However, existing reviews focus mainly on video anomaly detection (VAD) methods that assume static cameras. The literature on VAD with moving cameras remains fragmented, with no comprehensive review to date. To address this gap, we present the first comprehensive survey of moving-camera video anomaly detection (MC-VAD). We examine the research papers related to MC-VAD, critically assessing their limitations and highlighting the associated challenges. Our exploration covers three application domains: security, urban mobility, and marine environments, which in turn encompass six specific tasks. We compile an extensive list of 25 publicly available datasets spanning four environments: underwater, water surface, ground, and aerial. We summarize the types of anomalies these datasets correspond to or contain, and propose five main categories of methods for detecting such anomalies. Finally, we identify future research directions and discuss novel contributions that could advance the field of MC-VAD. Through this survey, we aim to provide a valuable reference for researchers and practitioners striving to develop and advance state-of-the-art MC-VAD methods.

1.4 PatchContrast: Self-Supervised Pre-training for 3D Object Detection

https://arxiv.org/abs/2308.06985

Accurately detecting objects in the environment is a key challenge for autonomous vehicles. However, obtaining annotated data for detection is expensive and time-consuming. We introduce PatchContrast, a novel self-supervised point cloud pre-training framework for 3D object detection. We propose to leverage two levels of abstraction to learn discriminative representations from unlabeled data: proposal level and patch level. The proposal level aims to localize an object relative to its surroundings, while the patch level adds information about the internal connections between object components, thereby distinguishing different objects based on their individual components. We demonstrate how these levels can be integrated into self-supervised pre-training of various backbones to enhance downstream 3D detection tasks. We show that our method outperforms existing state-of-the-art models on three commonly used 3D detection datasets.
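
The proposal-level contrast can be illustrated with a standard InfoNCE loss, sketched below in PyTorch. Treating embeddings of the same proposal under two augmented views of a point cloud as positives is an assumption in the spirit of the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def proposal_infonce(z1, z2, tau=0.07):
    """Minimal InfoNCE sketch for proposal-level contrast.

    z1, z2: (N, D) embeddings of the same N proposals under two augmented
    views; row i of z1 and row i of z2 form a positive pair.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                          # (N, N) similarities
    labels = torch.arange(z1.size(0), device=z1.device) # positives on diagonal
    return F.cross_entropy(logits, labels)
```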

1.5 PV-SSD: A Projection and Voxel-based Double Branch Single-Stage 3D Object Detector

https://arxiv.org/abs/2308.06791

LiDAR-based 3D object detection and classification is critical for autonomous driving. However, real-time inference from extremely sparse 3D data is a formidable challenge. A common approach is to project point clouds onto a bird's-eye or perspective view, effectively converting them into an image-like data format. However, such over-compression of point cloud data often causes information loss. To address this, we propose PV-SSD, a 3D object detector based on dual-branch voxel and projection feature extraction. In the feature-extraction stage, a voxel-feature branch containing rich local semantic information is added and fully fused with the projection features to reduce the loss of local information caused by projection. Compared with previous work, good performance is achieved. In addition, this paper makes the following contributions: 1) a voxel feature extraction method with variable receptive fields; 2) a feature-point sampling method that uses weighted sampling to select the feature points most beneficial to the detection task; 3) an MSSFA module built on top of the SSFA module. To verify the effectiveness of our method, we design comparative experiments.
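
A toy PyTorch sketch of the dual-branch fusion idea follows: image-like projection (BEV) features are concatenated with voxel features and mixed by a convolution. The module name and the assumption that both branches have already been reduced to the same BEV resolution are ours, not the paper's.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Toy projection/voxel feature fusion (names are hypothetical)."""

    def __init__(self, c_proj, c_voxel, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_proj + c_voxel, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, proj_bev, voxel_bev):
        # Concatenate projection features with voxel features that carry
        # local semantics, then mix them with a 3x3 convolution.
        return self.fuse(torch.cat([proj_bev, voxel_bev], dim=1))
```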

1.6 Target before Shooting: Accurate Anomaly Detection and Localization under One Millisecond via Cascade Patch Retrieval

https://arxiv.org/abs/2308.06748

In this work, by re-examining the "matching" nature of anomaly detection (AD), we propose a new AD framework that simultaneously achieves record AD accuracy and remarkably high running speed. In this framework, the anomaly detection problem is solved via a cascaded patch-retrieval procedure that retrieves the nearest neighbors for each test image patch in a coarse-to-fine fashion. Given a test sample, the top-K most similar training images are first selected via a robust histogram-matching process. Next, the nearest neighbor of each test patch is retrieved at similar geometric locations within these "global nearest neighbors", using carefully trained local metrics. Finally, the anomaly score of each test image patch is computed from the distance to its "local nearest neighbor" and its "non-background" probability. The proposed method is accordingly named Cascaded Patch Retrieval (CPR). Unlike traditional patch-matching-based AD algorithms, CPR selects an appropriate "target" (reference image and location) before "shooting" (patch matching). On the well-established MVTec AD, BTAD, and MVTec-3D AD datasets, the proposed algorithm consistently outperforms all comparable SOTA methods by significant margins across various AD metrics. CPR is also highly efficient: it runs at 113 FPS in its standard setting, while its simplified version needs less than 1 ms to process an image at the cost of a negligible loss in accuracy. The code for CPR is available at https://github.com/flyinghu123/CPR.
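
The coarse-to-fine retrieval pipeline can be sketched in a few lines of NumPy. This toy version substitutes plain L1 histogram distances and Euclidean patch distances for CPR's trained metrics, and omits the "non-background" probability term.

```python
import numpy as np

def cpr_anomaly_map(test_hist, test_patches, train_hists, train_patches,
                    k=3, r=1):
    """Coarse-to-fine retrieval sketch (simplified, untrained metrics).

    test_hist:     (H_bins,) histogram of the test image
    test_patches:  (Gh, Gw, D) per-location descriptors of the test image
    train_hists:   (N, H_bins) histograms of the training images
    train_patches: (N, Gh, Gw, D) per-location training descriptors
    """
    # Stage 1: pick the top-k "global nearest neighbour" training images.
    dists = np.abs(train_hists - test_hist).sum(axis=1)   # L1 distance
    topk = np.argsort(dists)[:k]
    Gh, Gw, D = test_patches.shape
    amap = np.zeros((Gh, Gw))
    # Stage 2: for each patch, search only nearby locations in those images.
    for i in range(Gh):
        for j in range(Gw):
            i0, i1 = max(0, i - r), min(Gh, i + r + 1)
            j0, j1 = max(0, j - r), min(Gw, j + r + 1)
            cand = train_patches[topk, i0:i1, j0:j1].reshape(-1, D)
            d = np.linalg.norm(cand - test_patches[i, j], axis=1)
            amap[i, j] = d.min()   # distance to the "local nearest neighbour"
    return amap
```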

1.7 Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection

https://arxiv.org/abs/2308.06701

Camouflaged objects blended into natural scenes pose significant challenges to deep learning models for both detection and synthesis. Although camouflaged object detection is an important task in computer vision, research on this topic has been limited by the scarcity of data. We propose a framework for synthesizing camouflage data to improve the detection of camouflaged objects in natural scenes. Our method employs a generative model to produce realistic camouflage images, which can then be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize camouflage images, which are fed back to expand the dataset. Our framework outperforms current state-of-the-art methods on three datasets (COD10K, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. The approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks, and provides a novel way to introduce more diversity and broader distributions into current camouflage datasets.
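
A hedged sketch of the supervision signal described above: a generator blends a foreground object into a background, and a frozen camouflage-distribution classifier provides the training signal. All interfaces here (generator(fg, bg), classifier(img)) are hypothetical stand-ins, not the paper's actual networks.

```python
import torch
import torch.nn.functional as F

def generator_step(generator, classifier, opt, fg, bg):
    """One hypothetical training step for camouflage image synthesis.

    generator:  blends foreground object `fg` into background `bg`
    classifier: frozen, outputs a logit for "looks like a real
                camouflage image" (requires_grad disabled beforehand)
    """
    fake = generator(fg, bg)        # synthesized camouflage image
    logit = classifier(fake)        # (B, 1) camouflage-distribution logit
    # Push synthesized images toward the camouflage distribution.
    loss = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return fake.detach(), loss.item()
```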

1.8 Tiny and Efficient Model for the Edge Detection Generalization

https://arxiv.org/abs/2308.06468

Most high-level computer vision tasks rely on low-level image operations as their initial step. Operations such as edge detection, image enhancement, and super-resolution provide the foundations for higher-level image analysis. In this work, we address edge detection with three main goals: simplicity, efficiency, and generalization, motivated by the increasing complexity of state-of-the-art (SOTA) edge detection models in pursuit of better accuracy. To achieve this, we present the Tiny and Efficient Edge Detector (TEED), a lightweight convolutional neural network with only 58K parameters, less than 0.2% the size of state-of-the-art models. Training on the BIPED dataset takes less than 30 minutes, with each epoch requiring less than 5 minutes. Our proposed model is easy to train, converges within the first few epochs, and produces sharp, high-quality edge maps. Furthermore, we propose a new dataset for testing the generalization of edge detection, with samples drawn from popular edge detection and image segmentation datasets. Source code is available at https://github.com/xavysp/TEED.
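
To put a 58K-parameter budget in perspective, the toy PyTorch model below (emphatically not the TEED architecture) shows how quickly even a few small convolutional layers consume such a budget:

```python
import torch.nn as nn

# A toy edge-detection CNN used only to illustrate parameter budgets.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, 1),            # single-channel edge-map head
)
n_params = sum(p.numel() for p in net.parameters())
print(f"{n_params:,} parameters")   # ~14k here; TEED's full model is ~58k
```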

1.9 Improved YOLOv8 Detection Algorithm in Security Inspection Image

https://arxiv.org/abs/2308.06452

Security inspection is the first line of defense in protecting people's lives and property, and intelligent security inspection is an inevitable trend in the industry's future development. To address the problems of overlapping detection targets, false detection of contraband, and missed detections in X-ray image inspection, we propose CSS-YOLO, an improved X-ray contraband detection algorithm based on YOLOv8s.

1.10 M&M: Tackling False Positives in Mammography with a Multi-view and Multi-instance Learning Sparse Detector

https://arxiv.org/abs/2308.06420

Deep learning-based object detection methods show promise for improving screening mammography, but high false-positive rates can hamper their effectiveness in clinical practice. To reduce false positives, we identify three challenges: (1) unlike natural images, a malignant mammogram typically contains only one malignant finding; (2) mammography exams contain two views of each breast, and both views should be considered for a proper assessment; (3) most mammograms are negative and contain no findings. In this work, we address these three challenges by: (1) leveraging Sparse R-CNN and showing that sparse detectors are better suited to mammography than dense detectors; (2) including a multi-view cross-attention module to synthesize information across views; (3) incorporating multiple-instance learning (MIL) to train with unannotated images and perform breast-level classification. The resulting model, M&M, is a Multi-view and Multi-instance learning system that both localizes malignant findings and provides breast-level predictions. We validate M&M's detection and classification performance using five mammography datasets, and demonstrate the effectiveness of each proposed component through comprehensive ablation studies.
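
The multi-view cross-attention component might look roughly like the PyTorch sketch below, where tokens from one mammographic view attend to tokens from the complementary view. The block structure and dimensions are assumptions for illustration, not M&M's exact design.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Sketch of a multi-view cross-attention block (interface assumed).

    Tokens from one view (e.g. CC) query tokens from the other view
    (e.g. MLO), enriching each view's features with the other's context.
    """

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, main_tokens, other_tokens):
        # Query: current view; key/value: the complementary view.
        ctx, _ = self.attn(main_tokens, other_tokens, other_tokens)
        return self.norm(main_tokens + ctx)   # residual + layer norm
```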

1.11 Improving Pseudo Labels for Open-Vocabulary Object Detection

https://arxiv.org/abs/2308.06412

Recent studies have shown promising results in open-vocabulary object detection (OVD) by using pseudo labels (PLs) generated by pre-trained vision-and-language models (VLMs). However, owing to the gap between VLM pre-training objectives and OVD, the PLs generated by VLMs are very noisy, which limits further improvement. In this paper, we aim to reduce the noise in PLs and propose a method combining online Self-training And a Split-and-fusion head for OVD, termed SAS-Det. First, self-training fine-tunes the VLM to generate high-quality PLs while preventing forgetting of the knowledge learned during pre-training. Second, a split-and-fusion (SAF) head is designed to remove the noise in the localization of PLs, which existing methods usually ignore. It also fuses the complementary knowledge learned from precise ground truth and from noisy pseudo labels to boost performance. Extensive experiments show that SAS-Det is both efficient and effective: our pseudo-labeling is 3x faster than existing methods, and SAS-Det significantly outperforms state-of-the-art models of the same scale, achieving 37.4 AP50 on the novel categories of COCO and 27.3 APr on LVIS.
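
The online self-training part can be illustrated with a teacher-student sketch: a slowly updated teacher produces confidence-filtered pseudo boxes for the student to train on. The torchvision-style detector output and the EMA update are assumptions of this sketch; the SAF head itself is not shown.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(teacher, images, score_thr=0.8):
    """Keep only confident teacher detections as pseudo ground truth.

    Assumes a torchvision-style detector returning a list of dicts with
    'boxes', 'scores', and 'labels' per image.
    """
    labels = []
    for p in teacher(images):
        keep = p["scores"] > score_thr
        labels.append({"boxes": p["boxes"][keep], "labels": p["labels"][keep]})
    return labels

@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    """Slowly track the student so pseudo labels stay stable (the
    "online" part of self-training)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```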

1.12 Detecting and Preventing Hallucinations in Large Vision Language Models

https://arxiv.org/abs/2308.06394

Instruction-tuned large vision-language models (LVLMs) have made significant progress in generalizing across a variety of multimodal tasks, especially visual question answering (VQA). However, generating detailed, visually grounded responses remains challenging for these models. We find that even the current state-of-the-art LVLM (InstructBLIP) still produces a staggering 30% hallucinated text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multimodal hallucination detection dataset for detailed image descriptions. Unlike previous work that considers only object hallucination, we additionally annotate unfaithful entity descriptions and relationships. To demonstrate the potential of this dataset for preference alignment, we propose fine-grained Direct Preference Optimization (DPO), and also train fine-grained multimodal reward models whose effectiveness we evaluate with best-of-n rejection sampling. We conduct human evaluations of both DPO and rejection sampling and find that they reduce hallucination rates by 41% and 55% respectively, a significant improvement over the baseline.
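
Best-of-n rejection sampling with a reward model is simple to state in code. In the sketch below, generate and reward_model are placeholder callables standing in for the LVLM sampler and the fine-grained reward model; the interface is ours, not the paper's.

```python
def best_of_n(prompt, image, generate, reward_model, n=8):
    """Best-of-n rejection sampling sketch (placeholder callables).

    generate(prompt, image)          -> one sampled response string
    reward_model(prompt, image, r)   -> scalar groundedness score
    """
    candidates = [generate(prompt, image) for _ in range(n)]
    scores = [reward_model(prompt, image, c) for c in candidates]
    # Keep the response the reward model judges least hallucinatory.
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```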

1.13 Towards Packaging Unit Detection for Automated Palletizing Tasks

https://arxiv.org/abs/2308.06306

For various automated palletizing tasks, detecting packaging units is a critical step before an industrial robot can actually handle them. We propose a method for this challenging problem that is trained entirely on synthetically generated data and can be robustly applied to arbitrary real-world packaging units without further training or setup effort. The proposed approach is able to handle sparse and low-quality sensor data, can exploit prior knowledge where available, and generalizes to a wide range of products and application scenarios. To demonstrate its practical use, we conduct an extensive evaluation on real-world data from a variety of different retail products. Furthermore, we have integrated our method into a laboratory demonstrator, and a commercial solution will be marketed through an industrial partner.

1.14 Out-of-distribution multi-view auto-encoders for prostate cancer lesion detection

https://arxiv.org/abs/2308.06481

Traditional deep learning (DL) methods based on the supervised learning paradigm require large amounts of annotated data, which are rarely available in the medical domain. Unsupervised out-of-distribution (OOD) detection is an alternative that requires less annotated data; moreover, OOD applications can exploit the class skew commonly found in medical data. Magnetic resonance imaging (MRI) has proven useful for prostate cancer (PCa) diagnosis and management, but current DL approaches rely on axial T2w MRI, which suffers from low out-of-plane resolution. We propose a multi-stream approach that takes advantage of different T2w orientations to improve the performance of OOD methods for PCa lesion detection. We evaluate our approach on a publicly available dataset and obtain better detection results than the single-orientation approach in terms of AUC (82.3 vs 73.1). Our findings show the potential of OOD approaches for MRI-based PCa lesion detection.
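
One common way to realize multi-stream OOD detection is to train one autoencoder per orientation on in-distribution (lesion-free) data and fuse the reconstruction errors. The sketch below assumes that setup; the fusion-by-averaging choice and the dict interface are ours, not necessarily the paper's.

```python
import torch

def multiview_ood_score(autoencoders, views):
    """OOD scoring sketch: one autoencoder per T2w orientation.

    autoencoders: dict like {"axial": ae_ax, "sagittal": ae_sag, ...},
                  each trained only on in-distribution slices
    views:        dict mapping the same keys to tensors (B, 1, H, W)
    """
    scores = []
    for name, ae in autoencoders.items():
        recon = ae(views[name])
        # High reconstruction error = unlike the healthy training data.
        err = (recon - views[name]).pow(2).mean(dim=(1, 2, 3))  # (B,)
        scores.append(err)
    return torch.stack(scores).mean(dim=0)   # fuse streams by averaging
```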

1.15 Deep Learning-Based Open Source Toolkit for Eosinophil Detection in Pediatric Eosinophilic Esophagitis

https://arxiv.org/abs/2308.06333

Eosinophilic esophagitis (EoE) is a chronic, immune/antigen-mediated disease of the esophagus characterized by symptoms of esophageal dysfunction and histologic evidence of eosinophil-predominant inflammation. Because the microscopic presentation of EoE in images is complex, current approaches that rely on manual identification are not only labor-intensive but also error-prone. In this study, we develop an open-source toolkit, Open-EoE, that performs end-to-end whole-slide image (WSI)-level eosinophil (Eos) detection via Docker with a single command. The toolkit supports three state-of-the-art deep learning-based object detection models, and further optimizes performance with an ensemble learning strategy that improves the accuracy and reliability of the results. Experimental results show that Open-EoE effectively detects Eos on a test set of 289 WSIs. At the widely accepted diagnostic threshold of >= 15 Eos per high-power field (HPF), Open-EoE achieves an accuracy of 91%, showing good agreement with pathologist assessments. This suggests a promising avenue for integrating machine learning methods into the EoE diagnostic workflow. The Docker image and source code are publicly available at https://github.com/hrlblab/Open-EoE.
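
The ensemble-then-threshold logic can be sketched as follows. The median-based ensembling and the function interface are assumptions for illustration, while the >= 15 Eos/HPF criterion comes from the abstract.

```python
import statistics

def diagnose_wsi(counts_per_model, n_hpf, threshold=15.0):
    """Sketch of ensemble counting plus the diagnostic threshold.

    counts_per_model: eosinophil counts from each detector on one WSI,
                      e.g. [5210, 4987, 5105] for three detection models
    n_hpf:            number of high-power fields covered by the WSI
    """
    # Ensemble by taking the median count across the detection models.
    eos_total = statistics.median(counts_per_model)
    eos_per_hpf = eos_total / n_hpf
    # Widely accepted diagnostic criterion: >= 15 Eos per HPF.
    return eos_per_hpf >= threshold, eos_per_hpf
```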

Source: blog.csdn.net/wzk4869/article/details/132467084