[Computer Vision | Image Classification] arXiv Computer Vision Academic Express on Image Classification (Collection of Papers, December 5) (Part 1)

1. Classification | Recognition (14 papers)

1.1 Object Recognition as Next Token Prediction

https://arxiv.org/abs/2312.02142

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels as independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, to sample tokens of multiple labels in parallel and rank the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pre-trained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. Code is available at https://github.com/kaiyuyue/nxtp
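
To make the masking scheme concrete, below is a minimal PyTorch sketch, our own assumption rather than the authors' released code, of a non-causal attention mask with the two features the abstract describes: image tokens act as a fully visible prefix, and each label's tokens attend causally within that label but never across labels.

```python
import torch

def build_noncausal_mask(num_image_tokens: int, label_lengths: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend). Hypothetical token layout:
    [image tokens | label 1 tokens | label 2 tokens | ...]."""
    total = num_image_tokens + sum(label_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Image tokens form a fully visible (bidirectional) prefix.
    mask[:num_image_tokens, :num_image_tokens] = True

    # Each label's tokens see the image prefix and attend causally within
    # the same label, but never to tokens of other labels (independence).
    offset = num_image_tokens
    for length in label_lengths:
        rows = mask[offset:offset + length]
        rows[:, :num_image_tokens] = True
        rows[:, offset:offset + length] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        offset += length
    return mask

# Example: 4 image tokens and two labels of 3 and 2 tokens each.
print(build_noncausal_mask(4, [3, 2]).int())
```

Because no label's tokens can attend to another label's, tokens for all candidate labels can be sampled in one parallel pass, which is what makes the one-shot sampling described above possible.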

1.2 Effective Adapter for Face Recognition in the Wild

https://arxiv.org/abs/2312.01734

In this paper, we address the challenge of face recognition in the wild, where images often suffer from low quality and real-world distortions. Traditional heuristic approaches, which train models either directly on these degraded images or on their counterparts enhanced by face restoration techniques, have proven ineffective, mainly due to the degradation of facial features and the discrepancy between image domains. To overcome these issues, we propose an effective adapter for augmenting existing face recognition models trained on high-quality facial datasets. The key idea of our adapter is to process both the unrefined and the enhanced images with two similar structures, one of which is frozen while the other is trainable. This design brings two benefits. First, the dual-input system minimizes the domain gap while providing varied perspectives to the face recognition model, since the enhanced image can be regarded as a complex non-linear transformation of the original image by the restoration model. Second, both similar structures can be initialized from pre-trained models without losing prior knowledge. Extensive experiments in the zero-shot setting show that our method outperforms the baselines by approximately 3%, 4%, and 7% on three datasets. Our code will be made publicly available at https://github.com/liuyunhaozz/FaceAdapter/.
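
The dual-structure idea lends itself to a compact sketch: a frozen copy of a pre-trained recognition backbone processes the raw degraded face while a trainable twin processes its restored counterpart, and the two features are fused. Everything below (the class name, the linear fusion, the feature dimension) is an illustrative assumption rather than the paper's actual architecture.

```python
import copy
import torch
import torch.nn as nn

class DualBranchAdapter(nn.Module):
    """Two similar branches: one frozen (preserves pre-trained knowledge),
    one trainable; assumes the backbone maps an image to a (B, feat_dim) vector."""

    def __init__(self, pretrained_backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.frozen = pretrained_backbone
        for p in self.frozen.parameters():
            p.requires_grad = False                           # keep past knowledge intact
        self.trainable = copy.deepcopy(pretrained_backbone)   # same initialization, fine-tuned
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)         # placeholder fusion step

    def forward(self, raw_img: torch.Tensor, enhanced_img: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            f_raw = self.frozen(raw_img)        # perspective of the unrefined input
        f_enh = self.trainable(enhanced_img)    # perspective of the restored input
        return self.fuse(torch.cat([f_raw, f_enh], dim=-1))
```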

1.3 RiskBench: A Scenario-based Benchmark for Risk Identification

https://arxiv.org/abs/2312.01659

Intelligent driving systems aim to achieve a zero-collision mobility experience, which requires interdisciplinary efforts to improve safety performance. This work focuses on risk identification, the process of identifying and analyzing risks stemming from dynamic traffic agents and unexpected events. While the community has made significant progress, current evaluations of different risk identification algorithms use independent datasets, making direct comparisons difficult and hindering collective progress toward safety-performance enhancement. To address this limitation, we introduce RiskBench, a large-scale scenario-based benchmark for risk identification. We design a scenario taxonomy and augmentation pipeline to systematically collect ground-truth risks under diverse scenarios. We assess the ability of ten algorithms to (1) detect and locate risks, (2) anticipate risks, and (3) facilitate decision-making. We conduct extensive experiments and conclude with future research directions for risk identification. Our aim is to encourage collaborative efforts toward a society with zero collisions. We have made our dataset and benchmark toolkit publicly available on the project page: https://hcis-lab.github.io/RiskBench/

1.4 TextAug: Test time Text Augmentation for Multimodal Person Re-identification

https://arxiv.org/abs/2312.01605

Multimodal person re-identification is gaining popularity in the research community due to its effectiveness compared to its unimodal counterparts. However, the bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples. Data augmentation techniques such as cropping, flipping, and rotation are often employed in the image domain to improve the generalization of deep learning models. Augmenting in modalities other than images, such as text, is challenging and requires significant computational resources and external data sources. In this study, we investigate the effectiveness of two computer vision data augmentation techniques, Cutout and CutMix, for text augmentation in multimodal person re-identification. Our approach merges these two augmentation strategies into one strategy called CutMixOut, which involves randomly removing words or sub-phrases from a sentence (Cutout) and blending parts of two or more sentences to create diverse examples (CutMix), with each operation assigned a certain probability. The augmentation is applied at inference time without any prior training. Our results demonstrate that the proposed technique is simple yet effective in improving performance on multimodal person re-identification benchmarks.
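
One plausible reading of CutMixOut in code is the following: Cutout drops random words, CutMix splices the front of one caption onto the back of another, and each operation fires with its own probability at inference time. The word-level granularity, the probabilities, and the splice points below are our assumptions for illustration, not the paper's exact procedure.

```python
import random

def cutout(sentence: str, p_drop: float = 0.15) -> str:
    """Randomly remove words from a sentence (text analogue of image Cutout)."""
    kept = [w for w in sentence.split() if random.random() > p_drop]
    return " ".join(kept) if kept else sentence

def cutmix(sent_a: str, sent_b: str) -> str:
    """Splice the front of one sentence onto the back of another (text CutMix)."""
    a, b = sent_a.split(), sent_b.split()
    cut_a = random.randint(1, max(1, len(a) - 1))   # keep at least one word of a
    cut_b = random.randint(0, max(0, len(b) - 1))
    return " ".join(a[:cut_a] + b[cut_b:])

def cutmixout(sent_a: str, sent_b: str,
              p_cutmix: float = 0.5, p_cutout: float = 0.5) -> str:
    """Apply each operation with its own probability, at inference time."""
    out = sent_a
    if random.random() < p_cutmix:
        out = cutmix(out, sent_b)
    if random.random() < p_cutout:
        out = cutout(out)
    return out

print(cutmixout("a man wearing a red jacket", "a person carrying a black backpack"))
```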

1.5 D²ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

https://arxiv.org/abs/2312.01431

Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. Typical fine-tuning-based adaptation paradigms are prone to overfitting in few-shot learning scenarios and offer little modeling flexibility for learning temporal features in video data. In this work, we propose the Disentangled-and-Deformable Spatio-Temporal Adapter (D²ST-Adapter), a novel adapter tuning framework for few-shot action recognition, which is designed with a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. Furthermore, we devise a deformable spatio-temporal attention module as the core component of the D²ST-Adapter, which can be specialized in the corresponding pathways to model spatial and temporal features respectively, enabling our D²ST-Adapter to encode features with a global view of the 3D spatio-temporal space while maintaining a lightweight design. Extensive experiments with instantiations of our method on both a pre-trained ResNet and a ViT demonstrate that it outperforms state-of-the-art methods for few-shot action recognition. Our method is particularly well suited to challenging scenarios where temporal dynamics are critical for action recognition.
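
As a rough skeleton of the dual-pathway design, the sketch below routes a bottlenecked feature through separate spatial and temporal branches and merges them back residually. The paper's deformable spatio-temporal attention is replaced by plain per-frame and cross-frame convolutions purely as placeholders, and the bottleneck width and token layout are assumed.

```python
import torch
import torch.nn as nn

class DualPathwayAdapter(nn.Module):
    """Skeleton of a disentangled spatio-temporal adapter: a bottleneck
    down-projection feeds two parallel pathways, one spatial and one
    temporal, whose outputs are merged back into the frozen backbone."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Spatial pathway: mixes within each frame only (1 x 3 x 3 kernel).
        self.spatial = nn.Conv3d(bottleneck, bottleneck, (1, 3, 3), padding=(0, 1, 1))
        # Temporal pathway: mixes across frames only (3 x 1 x 1 kernel).
        self.temporal = nn.Conv3d(bottleneck, bottleneck, (3, 1, 1), padding=(1, 0, 0))
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed layout: x is (batch, frames, height, width, dim) backbone tokens.
        h = self.down(x).permute(0, 4, 1, 2, 3)   # to (B, C, T, H, W) for Conv3d
        h = self.spatial(h) + self.temporal(h)    # disentangled dual pathways
        h = h.permute(0, 2, 3, 4, 1)              # back to (B, T, H, W, C)
        return x + self.up(h)                     # residual adapter output
```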

1.6 DiFace: Cross-Modal Face Recognition through Controlled Diffusion

https://arxiv.org/abs/2312.01367

Diffusion probabilistic models (DPMs) have demonstrated remarkable capabilities in generating visual media of outstanding quality and realism. Nonetheless, their potential in non-generative domains, such as face recognition, has yet to be thoroughly investigated. Meanwhile, although multimodal face recognition methods have been extensively developed, their focus remains mainly on the visual modality. In this context, face recognition through text description presents a unique and promising solution that not only transcends the limitations of application scenarios, but also expands the potential for research in cross-modal face recognition. Regrettably, this avenue remains underexplored and underutilized, primarily due to three challenges: 1) the inherent imprecision of text descriptions; 2) the substantial gap between text and images; and 3) the great obstacle posed by insufficient databases. To tackle this problem, we present DiFace, a solution that effectively achieves face recognition from text through a controllable diffusion process, by establishing its theoretical connection with probability transport. Our approach not only unlocks the potential of DPMs across a broader range of tasks, but also, to the best of our knowledge, achieves significant accuracy in text-to-image face recognition for the first time, as demonstrated by our experiments on both verification and identification.

1.7 Facial Emotion Recognition Under Mask Coverage Using a Data Augmentation Technique

https://arxiv.org/abs/2312.01335

Recognizing human emotions with AI-based computer vision systems when individuals wear face masks presents a new challenge in the current COVID-19 pandemic. In this study, we propose a facial emotion recognition system capable of recognizing emotions from individuals wearing different face masks. A novel data augmentation technique, which applies four mask types to each face image, is used to improve the performance of our model. We evaluate the effectiveness of four convolutional neural networks, AlexNet, SqueezeNet, ResNet-50, and VGGFace2, trained using transfer learning. Experimental results show that our model works effectively in multi-mask mode compared to single-mask mode. The VGGFace2 network achieved the highest accuracy on the JAFFE dataset: 97.82% in person-dependent mode and 74.21% in person-independent mode. We also evaluated the proposed model on the UIBVFED dataset, where ResNet-50 performed best, with an accuracy of 73.68% in person-dependent mode and 59.57% in person-independent mode. In addition, we employ metrics such as precision, sensitivity, specificity, AUC, F1 score, and the confusion matrix to measure the efficiency of our system in detail. Furthermore, the LIME algorithm is used to visualize the decision-making strategies of the CNNs.
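
A simplified sketch of the augmentation step: each training face is expanded into one variant per mask type by compositing a mask overlay onto the lower face region. A real pipeline would align the overlay using facial landmarks; the four mask-type names and the `mouth_box` placement here are assumptions for illustration only.

```python
from PIL import Image

MASK_TYPES = ["surgical", "cloth", "n95", "valve"]  # illustrative four mask styles

def augment_with_masks(face: Image.Image,
                       mask_images: dict[str, Image.Image],
                       mouth_box: tuple[int, int, int, int]) -> list[Image.Image]:
    """Produce one masked variant per mask type by pasting a mask overlay
    over the lower face region given by mouth_box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = mouth_box
    variants = []
    for name in MASK_TYPES:
        overlay = mask_images[name].resize((x1 - x0, y1 - y0)).convert("RGBA")
        out = face.convert("RGBA").copy()
        out.paste(overlay, (x0, y0), overlay)   # alpha-composite the mask overlay
        variants.append(out.convert("RGB"))
    return variants                             # four masked copies per input face
```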
