[Computer Vision | Image Classification] arXiv Computer Vision Academic Express on Image Classification (July 17 Paper Collection)

1. Classification | Recognition (11 papers)

1.1 Multimodal Distillation for Egocentric Action Recognition

https://arxiv.org/abs/2307.07483

The focus of egocentric video understanding is modeling hand-object interactions. Standard models such as CNNs or Vision Transformers that receive RGB frames as input perform well; however, their performance improves further when additional input modalities (object detections, optical flow, audio, etc.) provide complementary cues. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such multimodal approaches while using only RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and Something-Something datasets, students taught by multimodal teachers tend to be more accurate and better calibrated than architectures trained on ground-truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework that addresses the issues arising when multimodal knowledge distillation is applied naively. Finally, we demonstrate the achieved reduction in computational complexity and show that our method maintains higher performance as the number of input views is reduced.
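
The distillation setup above follows the standard teacher-student recipe. Below is a minimal PyTorch sketch of the core loss, assuming a generic frozen multimodal teacher and an RGB-only student (the function name and hyperparameters are illustrative, not the paper's exact formulation):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend cross-entropy on ground-truth labels with a KL term that
    matches the student's softened predictions to the teacher's."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # standard scale correction for soft targets
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

At inference only the student and its RGB input are needed, which is where the reduction in computational complexity comes from.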

1.2 Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification

https://arxiv.org/abs/2307.07482

Whole slide image (WSI) assessment is a challenging and critical step in cancer diagnosis and treatment planning. WSIs require high magnification to facilitate sub-cellular analysis, and at gigapixel scale, precise annotation at the patch or even pixel level is tedious and requires domain experts. Coarse-grained labels, on the other hand, are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL). In our work, we propose a novel embedding-based dual-query MIL pipeline (DQ-MIL). We contribute to both the embedding and aggregation steps. Embedding models are currently limited in their ability to generalize, as a common visual feature representation is not yet available. Through our work, we explore the potential of dynamic meta-embeddings based on cutting-edge self-supervised pre-trained models in the context of MIL. Furthermore, we propose a new MIL architecture capable of combining MIL attention with correlated self-attention. The dual-query perceiver design of our approach enables us to exploit the concept of self-distillation and to combine the advantages of a small model with the rich feature representation of a large model in a low-data regime. We demonstrate the superior performance of our method on three histopathology datasets, where we show a 10% improvement over state-of-the-art methods.
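
For readers unfamiliar with attention-based MIL aggregation, here is a minimal PyTorch sketch of the classic attention pooling of Ilse et al. (2018), which dual-query designs build on; it is illustrative background, not the DQ-MIL architecture itself:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-based MIL pooling: score each patch embedding, softmax the
    scores over the bag, and return the weighted sum as the slide embedding."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, patches):            # patches: (num_patches, dim)
        weights = torch.softmax(self.attn(patches), dim=0)  # (N, 1)
        return (weights * patches).sum(dim=0)               # (dim,)
```

A slide-level classifier then operates on the pooled embedding, so only the coarse slide label is needed for training.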

1.3 Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

https://arxiv.org/abs/2307.07469

Interactive action recognition plays an important role in human-computer interaction and collaboration. Previous methods use late fusion or joint attention mechanisms to capture interaction relations, which either have limited learning ability or adapt inefficiently to more interacting entities. Evaluation in more general settings with diverse entity types is also lacking, because these methods assume that the type of each entity is known in advance. To address these issues, we propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network incorporates a tokenizer that partitions inputs into Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motion of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along the three dimensions of ISTs, a multi-head self-attention block integrated with 3D convolutions is designed to capture inter-token correlations. When modeling correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions; to this end, entity rearrangement is proposed to eliminate the orderliness of interchangeable entities in ISTs. Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
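
As a rough illustration of the token idea (not the ISTA-Net block itself), the sketch below flattens the entity, temporal, and spatial dimensions of skeleton data into one token sequence and applies multi-head self-attention over it; the shapes and head count are assumptions:

```python
import torch
import torch.nn as nn

class SpatioTemporalTokenAttention(nn.Module):
    """Treat every (entity, frame, joint) position as a token and let
    multi-head self-attention model correlations across all three axes."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                    # x: (B, entities, T, joints, C)
        B, E, T, J, C = x.shape
        tokens = x.reshape(B, E * T * J, C)  # one joint token sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, E, T, J, C)

# e.g. two interacting people, 16 frames, 25 joints, 64-dim features:
# y = SpatioTemporalTokenAttention(64)(torch.randn(1, 2, 16, 25, 64))
```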

1.4 Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing

https://arxiv.org/abs/2307.07378

Developments in computer vision and in-situ monitoring using visual sensors allow the collection of large datasets from additive manufacturing (AM) processes. Such datasets can be used with machine learning techniques to improve the quality of AM. This paper investigates two scenarios: first, using a convolutional neural network (CNN) to accurately classify defects in an image dataset from AM, and second, applying active learning techniques to the developed classification model. The latter enables a human-in-the-loop mechanism that reduces the amount of data required for training and the cost of generating training data.
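
An active-learning loop of this kind typically queries the images the current model is least sure about. Below is a minimal sketch of such a query step, assuming a loader that yields (image batch, sample index) pairs; the helper name and heuristic are illustrative, not taken from the paper:

```python
import torch

def least_confidence_query(model, unlabeled_loader, k=32, device="cpu"):
    """Return indices of the k unlabeled images with the lowest top-class
    probability; these are sent to a human annotator for labeling."""
    model.eval()
    confidences, indices = [], []
    with torch.no_grad():
        for images, idx in unlabeled_loader:
            probs = torch.softmax(model(images.to(device)), dim=-1)
            confidences.append(probs.max(dim=-1).values.cpu())
            indices.append(idx)
    order = torch.argsort(torch.cat(confidences))  # least confident first
    return torch.cat(indices)[order[:k]]
```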

1.5 3D Shape-Based Myocardial Infarction Prediction Using Point Cloud Classification Networks

https://arxiv.org/abs/2307.07298

Myocardial infarction (MI) is one of the most prevalent cardiovascular diseases, and clinical decisions related to it are often based on single-valued imaging biomarkers. However, such metrics only approximate the complex 3D structure and physiology of the heart, thus hindering better understanding and prediction of MI outcomes. In this work, we investigate the utility of complete 3D heart shapes, in the form of point clouds, for improving the detection of myocardial infarction events. To this end, we propose a fully automated multi-step pipeline consisting of a 3D heart surface reconstruction step followed by a point cloud classification network. Our method leverages recent advances in deep learning on point cloud geometry to enable direct and efficient multi-scale learning on high-resolution surface models of cardiac anatomy. We evaluate the prevalent MI detection and incident MI prediction tasks on 1068 UK Biobank subjects and find that our method improves over clinical baselines by 13% and 5%, respectively. In addition, we analyze the role of 3D shape-based MI detection for each ventricle and cardiac phase, and perform a visual analysis of the morphological and physiological patterns typically associated with MI outcomes.
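
A minimal sketch of how a point cloud classifier can operate directly on reconstructed heart surfaces, in the spirit of PointNet rather than the paper's exact multi-scale network (class name and layer sizes are hypothetical):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Per-point shared MLP followed by order-invariant max pooling,
    then a small MLP head for binary MI classification."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(),
                                  nn.Linear(64, num_classes))

    def forward(self, pts):                    # pts: (B, N, 3) surface points
        feats = self.point_mlp(pts.transpose(1, 2))   # (B, 256, N)
        global_feat = feats.max(dim=2).values         # symmetric pooling
        return self.head(global_feat)
```

The max pooling makes the prediction invariant to point ordering, which is what allows direct learning on surface models sampled as point clouds.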

1.6 One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching

https://arxiv.org/abs/2307.07286

One-shot skeleton action recognition, which aims to learn a skeleton action recognition model from a single training sample, has attracted increasing interest due to the challenge of collecting and annotating large-scale skeleton action data. However, most existing studies match skeleton sequences directly by comparing their feature vectors, ignoring the spatial structure and temporal order of skeleton data. This paper proposes a novel one-shot skeleton action recognition technique that performs recognition via multi-scale spatio-temporal feature matching. We represent skeleton data at multiple spatial and temporal scales and achieve optimal feature matching from two perspectives. The first is multi-scale matching, which captures the scale-wise semantic relevance of skeleton data at multiple spatial and temporal scales simultaneously. The second is cross-scale matching, which handles different motion magnitudes and speeds by capturing sample correlations across multiple scales. Extensive experiments on three large-scale datasets (NTU RGB+D, NTU RGB+D 120, and PKU-MMD) show that our method achieves excellent one-shot skeleton action recognition and consistently outperforms the state of the art by large margins.
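
One way to picture multi-scale matching is sketched below: per-frame embeddings are average-pooled at several temporal scales, and the query is matched against the one-shot support sample at each scale. This is a simplified stand-in under assumed (T, D) feature inputs, not the paper's optimal matching formulation:

```python
import torch
import torch.nn.functional as F

def multiscale_match(query, support, scales=(1, 2, 4)):
    """query, support: (T, D) per-frame embeddings. Pool at several temporal
    scales, find the best-matching support window for each query window,
    and average the resulting similarities across scales."""
    sims = []
    for scale in scales:
        q = F.avg_pool1d(query.t().unsqueeze(0), scale).squeeze(0).t()
        s = F.avg_pool1d(support.t().unsqueeze(0), scale).squeeze(0).t()
        # cosine similarity between every query window and support window
        sim = F.normalize(q, dim=1) @ F.normalize(s, dim=1).t()
        sims.append(sim.max(dim=1).values.mean())  # best match per window
    return torch.stack(sims).mean()
```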

1.7 Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition

https://arxiv.org/abs/2307.07214

Open-set image recognition is a challenging topic in computer vision. Most works in the existing literature focus on learning more discriminative features from input images, but they are usually insensitive to high- or low-frequency components in those features, resulting in degraded performance on fine-grained image recognition. To address this issue, we propose a Complementary Frequency-Varying Awareness Network (CFAN) that better captures both high-frequency and low-frequency information. The proposed CFAN consists of three sequential modules: (i) a feature extraction module learns preliminary features from the input image; (ii) a frequency-varying filtering module separates the high-frequency and low-frequency components of these features in the frequency domain via a frequency-tunable filter; (iii) a complementary temporal aggregation module aggregates the high-frequency and low-frequency components into discriminative features via two long short-term memory networks. Based on CFAN, we further propose an open-set fine-grained image recognition method, CFAN-OSFGR, which learns image features through CFAN and classifies them through a linear classifier. Experimental results on 3 fine-grained datasets and 2 coarse-grained datasets show that CFAN-OSFGR significantly outperforms 9 state-of-the-art methods in most cases.
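
The frequency separation step can be pictured with a simple FFT low-pass mask, a simplified stand-in for CFAN's learnable frequency-tunable filters (the cutoff parameter is an assumption):

```python
import torch

def split_frequency_components(feat, cutoff_ratio=0.25):
    """Split a (B, C, H, W) feature map into low- and high-frequency parts
    by masking the centered 2D spectrum with a circular low-pass filter."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    B, C, H, W = feat.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float().sqrt()
    mask = (dist <= cutoff_ratio * min(H, W)).float().to(feat.device)
    low_spec = spec * mask
    low = torch.fft.ifft2(torch.fft.ifftshift(low_spec, dim=(-2, -1))).real
    return low, feat - low    # low- and high-frequency components
```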

1.8 LightFormer: An End-to-End Model for Intersection Right-of-Way Recognition Using Traffic Light Signals and an Attention Mechanism

https://arxiv.org/abs/2307.07196

For smart vehicles driving through signalized intersections, it is critical to determine whether the vehicle has the right of way for a given traffic light state. To address this, camera-based sensors can be used to determine whether the vehicle has permission to go straight, turn left, or turn right. This paper proposes a novel end-to-end intersection right-of-way recognition model, LightFormer, to generate the right-of-way status of available driving directions at complex urban intersections. The model includes a spatio-temporal internal structure with an attention mechanism that incorporates features from past image frames to help classify the right-of-way state of the current frame. In addition, a modified multi-weight arc loss is introduced to improve the model's classification performance. Finally, LightFormer is trained and tested on two public traffic light datasets with manually augmented labels to demonstrate its effectiveness.
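
For context on the loss family: a standard additive angular margin ("arc") loss is sketched below. The paper's modified multi-weight variant builds on this kind of objective; the hyperparameters here are conventional defaults, not LightFormer's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginLoss(nn.Module):
    """ArcFace-style loss: classify by the angle between the embedding and a
    class weight vector, adding a margin m to the target class angle."""
    def __init__(self, dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        with_margin = torch.cos(theta + self.m)     # penalize target class
        onehot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(onehot, with_margin, cos) * self.s
        return F.cross_entropy(logits, labels)
```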

1.9 Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

https://arxiv.org/abs/2307.07057

We study Speech Intent Classification and Slot Filling (SICSF) and propose using an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves new state-of-the-art results on the SLURP dataset with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained via self-supervised learning (SSL) and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while SSL encoders require full fine-tuning to achieve comparable results. Furthermore, we provide an in-depth comparison of end-to-end models with cascade models (ASR+NLU), and show that E2E models outperform cascade models unless an oracle ASR model is provided. Last but not least, ours is the first E2E model to match the performance of a cascade with an oracle ASR model. Code, checkpoints and configurations are available.
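
The adapter idea referenced above is the standard bottleneck-with-residual design; a minimal sketch follows (dimensions assumed, and this is not NeMo's exact adapter class):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, nonlinearity, project up, residual.
    Inserted into each layer of a frozen encoder so that only the small
    adapter weights are trained."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps init stable

# Parameter-efficient setup: freeze the pretrained encoder, train adapters only.
# for p in encoder.parameters():
#     p.requires_grad = False
```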

1.10 A metric learning approach for endoscopic kidney stone identification

https://arxiv.org/abs/2307.07046

Recently, several deep learning (DL) methods have been proposed for automatic identification of kidney stones during ureteroscopy to enable rapid treatment decisions. Even though these DL methods yield promising results, they are mainly applicable to kidney stone types for which large amounts of labeled data are available, while only a few labeled images exist for some rare types. This contribution leverages deep metric learning (DML) methods to i) handle such classes with few samples, ii) generalize well to out-of-distribution samples, and iii) cope better with new classes added to the database. The proposed guided deep metric learning approach is based on a novel architecture designed to learn data representations in an improved way. The solution is inspired by few-shot learning (FSL) and uses a teacher-student approach: the teacher model (GEMINI) generates a reduced hypothesis space based on prior knowledge from the labeled data, which is used to guide the student model (i.e., a ResNet50). Extensive testing was first performed on two separate datasets, a set of images of kidney stone fragment surfaces and a set of images of fragment sections. The proposed DML method improves recognition accuracy by 10% and 12% over the DL method and other DML methods, respectively. Furthermore, model embeddings from the two dataset types are merged through a multi-view scheme that exploits both surface and section information. Tests with the resulting hybrid model improve recognition accuracy by at least 3% and up to 30%, respectively, compared to DL models and shallow machine learning methods.
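
As background on the DML family the paper builds on, the classic triplet objective is sketched below (a generic metric-learning loss, not the guided GEMINI formulation):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull embeddings of the same stone type together and push different
    types apart by at least `margin` (inputs: (B, D) embedding batches)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```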

1.11 Bridging the Gap: Heterogeneous Face Recognition with Conditional Adaptive Instance Modulation

https://arxiv.org/abs/2307.07032

Heterogeneous face recognition (HFR) aims to match face images across different domains, such as thermal and visible spectra, extending the applicability of face recognition (FR) systems to challenging scenarios. However, the domain gap and the limited availability of large-scale datasets in the target domain make it difficult to train robust and invariant HFR models from scratch. In this work, we treat different modalities as different styles and propose a framework that adapts feature maps to bridge the domain gap. We introduce a new Conditional Adaptive Instance Modulation (CAIM) module that can be integrated into a pre-trained FR network to turn it into an HFR network. The CAIM block modulates intermediate feature maps to match the style of the target modality, effectively bridging the domain gap. Our proposed method allows end-to-end training with a minimal number of paired samples. We extensively evaluate our method on multiple challenging benchmarks, showing superior performance compared to state-of-the-art methods. The source code and protocols used to reproduce the results will be made publicly available.
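
The modulation idea can be pictured as conditional instance normalization: normalize an intermediate feature map, then re-scale and re-shift it with parameters selected by the input modality. The sketch below is an AdaIN-style approximation of that idea, not the paper's exact CAIM block:

```python
import torch.nn as nn

class ConditionalInstanceModulation(nn.Module):
    """Instance-normalize features, then apply a per-modality learned
    scale (gamma) and shift (beta) to restyle them for the target domain."""
    def __init__(self, channels, num_modalities=2):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Embedding(num_modalities, channels)
        self.beta = nn.Embedding(num_modalities, channels)

    def forward(self, x, modality):    # x: (B, C, H, W); modality: (B,) long
        g = self.gamma(modality)[:, :, None, None]
        b = self.beta(modality)[:, :, None, None]
        return g * self.norm(x) + b
```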
