Paper Express: AAAI 2023 | 16 Tencent Youtu papers at a glance, covering multi-label classification, pose estimation, object detection, HOI, few-shot learning, and other research directions

Recently, AAAI 2023 (the conference of the Association for the Advancement of Artificial Intelligence) announced its acceptance results. A total of 8,777 papers were submitted, of which 1,721 were accepted, for an acceptance rate of 19.6%.

AAAI is one of the main academic organizations in the field of artificial intelligence. It is an international non-profit scientific organization that aims to promote research and application of artificial intelligence and to enhance public understanding of the field. The conference was first held in 1980 and focuses on both theory and applications, as well as social, philosophical, economic, and other topics that have an important impact on the development of artificial intelligence.

This year, Tencent Youtu Lab has 16 papers accepted, covering research directions such as multi-label classification, pose estimation, object detection, HOI, and few-shot learning, demonstrating Tencent Youtu's technical capabilities and academic achievements in the field of artificial intelligence.

The following is an overview of the papers selected by Tencent Youtu Lab:

01

Towards Facial Expression Recognition with Noisy Labels

Attack Can Benefit: An Adversarial Approach to Recognizing Facial Expressions under Noisy Annotations

Large-scale facial expression datasets usually suffer from severe label noise, and models are prone to overfitting noisy-label samples. At the same time, these datasets also exhibit extreme class imbalance. The two problems are coupled with each other, making label noise in facial expression recognition data difficult to handle.

In this paper, we propose a novel noisy-label localization and relabeling method that uses adversarial attacks to locate noisy-label samples. First, to alleviate the impact of the imbalanced data distribution, we propose a divide-and-conquer strategy that splits the entire training set into two relatively balanced subsets.

Second, based on two observations, namely (1) for deep convolutional neural networks trained with noisy labels, data near the decision boundary is harder to distinguish and more likely to be mislabeled, and (2) the network's memorization of noisy labels leads to significant adversarial vulnerability, we design a geometry-aware adversarial vulnerability estimation method that discovers the most attackable data in the training set and marks them as candidate noisy samples. Finally, the remaining clean data are used to relabel these candidate noisy samples.
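
For intuition only, the sketch below flags samples whose prediction flips under a one-step FGSM perturbation as candidate noisy samples. It is a minimal stand-in for the geometry-aware adversarial vulnerability estimation described above, with an illustrative step size `epsilon`, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def flag_attackable_samples(model, images, labels, epsilon=2.0 / 255):
    """Mark samples whose prediction flips under a one-step FGSM attack.

    A minimal sketch of adversarial-vulnerability-based candidate selection;
    the paper's geometry-aware estimator is more involved.
    """
    model.eval()
    images = images.clone().detach().requires_grad_(True)
    logits = model(images)
    loss = F.cross_entropy(logits, labels)
    grad = torch.autograd.grad(loss, images)[0]

    # One-step attack in the direction that increases the loss.
    adv_images = (images + epsilon * grad.sign()).clamp(0, 1)
    with torch.no_grad():
        adv_pred = model(adv_images).argmax(dim=1)
        clean_pred = logits.argmax(dim=1)

    # Easily flipped samples are treated as candidate noisy labels.
    return (adv_pred != clean_pred) & (clean_pred == labels)
```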

Experimental results show that our method achieves SOTA, and the associated visualization results also demonstrate the advantages of the proposed method.

02

Research on the adversarial robustness of federated learning

Delving into the Adversarial Robustness of Federated Learning

Similar to centrally trained models, models trained with federated learning (FL) also lack adversarial robustness. This paper focuses on adversarial robustness in federated learning. To better understand the robustness of existing FL methods, we evaluate various adversarial attacks and adversarial training methods.

Furthermore, we reveal a downside of directly adopting adversarial training in FL: it can severely compromise accuracy on clean examples, especially in non-IID settings. In this work, we propose decision boundary-based federated adversarial training (DBFAT), which consists of two components, local reweighting and global regularization, to improve both the accuracy and robustness of FL systems.
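
As a rough illustration of adversarial training on a federated client, the sketch below mixes a clean loss and a PGD-adversarial loss with a fixed weight; the fixed weight is only a placeholder for DBFAT's decision-boundary-based local reweighting, and the global regularization term is omitted.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard PGD attack used to craft adversarial examples locally."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def local_adversarial_step(model, optimizer, x, y, clean_weight=0.5):
    """One client update mixing clean and adversarial losses.

    The fixed clean_weight stands in for DBFAT's decision-boundary-based
    reweighting, which is not reproduced here.
    """
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = clean_weight * F.cross_entropy(model(x), y) \
        + (1 - clean_weight) * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```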

Extensive experiments on multiple datasets show that DBFAT consistently outperforms other baseline methods in both IID and non-IID settings.

03

TaCo: A Contrastive Learning-Based Text Attribute Recognition Method

TaCo: Textual Attribute Recognition via Contrastive Learning

With the accelerating digitization of office work, AI is being used to automatically, quickly, and accurately analyze the content of input document images and to further understand, extract, and summarize it. This is document intelligence (DocAI), currently a popular research direction at the intersection of computer vision and natural language processing. In Youtu's real business scenarios, document intelligence has delivered solid business value and played a key role in form understanding, layout analysis, and other applications. The distinctive multimodal nature of visually rich documents, that is, the tight coupling of textual content, image information, and the overall document layout, not only increases the complexity of the problem but also opens up new opportunities for technical innovation.

Text is an important carrier of information. Beyond its content, its visual attributes, such as font, color, italics, boldness, and underlining, also convey the designer's intent. Accurately recognizing these visual attributes would directly help design practitioners obtain materials quickly and enable productivity tools such as converting document images to Word. However, with thousands of Chinese and English fonts, an open-ended color space, and states such as bold and italic, accurately judging the visual attributes of text is a big challenge even for typography experts. Developing the ability to recognize the visual attributes of text therefore has the potential to enable a wide range of applications.

Designing a text visual attribute recognition system is not as simple as it seems, because the differences between text visual attributes are often subtle. Taking fonts as an example, two different fonts often differ only in small local details. The ever-growing set of new text styles further exacerbates the recognition difficulty and places higher demands on the system's generalization. In addition, we observed in practice that even scanned PDFs and well-captured photos introduce noise and blur, making fine local details harder to distinguish and the feature space harder to partition.

From an algorithmic point of view, text visual attribute recognition can be formulated as a multi-label classification problem: the input is a text image, and the output is each visual attribute of the text. Existing solutions can be divided into three categories: 1) methods based on handcrafted feature descriptors and template matching, where different text attributes usually have different visual styles that can be described and identified through statistical features; 2) classification methods based on deep neural networks, which use a network to extract features for recognition; 3) sequence-based attribute recognition methods. In practice, the characters in a single text line usually share consistent attributes, so by treating the input image as a continuous signal sequence and modeling the temporal correlation, the related information and semantic consistency between characters can be exploited to improve recognition.

Unfortunately, the above solutions suffer from: 1) complicated data preprocessing, since supervised methods rely on a large amount of expert-labeled data; 2) poor scalability, supporting only a set of pre-defined categories; and 3) low accuracy, as it is difficult to capture the subtle differences between similar attributes in real-world scenarios.

Based on the above observations, we designed TaCo (Textual Attribute Recognition via Contrastive Learning) to bridge this gap.
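
As a minimal illustration of the contrastive idea behind TaCo, the sketch below computes a standard InfoNCE loss between two augmented views of the same batch of text-line crops; the encoder, the augmentations, and the temperature are assumptions, and TaCo's attribute-specific design is not reproduced.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """InfoNCE loss between two augmented views of the same text crops.

    z1, z2: (N, D) embeddings of the two views. A generic contrastive
    objective, not TaCo's exact formulation.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```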

04

A Self-Supervised Visual Pre-Training Method Based on a Geminated Gestalt Autoencoder

The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training

In recent years, the self-supervised masked image modeling (MIM) paradigm has been attracting more and more research interest due to its excellent ability to learn visual representations from unlabeled data. The paradigm follows a "mask-and-reconstruct" process that recovers content from masked images. To learn high-level semantic abstractions, a line of work reconstructs pixels under large-scale masking strategies.

However, this type of method suffers from an "over-smoothing" problem. In contrast, work in the other direction introduces additional data and directly injects semantics into the supervision in an offline manner. Different from the above methods, we shift the perspective to the Fourier domain, which provides a global view, and propose a new masked image modeling (MIM) method called the Geminated Gestalt Autoencoder (Ge2-AE) for visual pre-training.

Specifically, we equip the model with a pair of parallel decoders that reconstruct the image content from pixel space and frequency space, respectively, under mutual constraints. With this design, the pre-trained encoder learns more robust visual representations, and a series of experimental results on downstream recognition tasks confirm the effectiveness of the method.
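
A minimal sketch of a paired pixel/frequency reconstruction objective is given below, using a plain FFT of the target image; the decoder architectures, the loss weighting, and Ge2-AE's mutual-constraint mechanism are not reproduced.

```python
import torch
import torch.nn.functional as F

def dual_reconstruction_loss(pixel_pred, freq_pred, target, freq_weight=1.0):
    """Pixel-space plus frequency-space reconstruction loss.

    pixel_pred / target: (B, C, H, W) images; freq_pred: predicted spectrum
    stored as (B, C, H, W, 2) real/imag channels. A simplified stand-in for
    Ge2-AE's paired decoders.
    """
    pixel_loss = F.mse_loss(pixel_pred, target)

    target_fft = torch.fft.fft2(target, norm="ortho")
    target_fft = torch.stack([target_fft.real, target_fft.imag], dim=-1)
    freq_loss = F.mse_loss(freq_pred, target_fft)

    return pixel_loss + freq_weight * freq_loss
```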

We also conduct quantitative and qualitative experiments to study what our method learns. To our knowledge, this is the first MIM work to approach visual pre-training from the frequency-domain perspective.

05

Locate Then Generate: A Scene-Text Visual Question Answering Method that Bridges Vision and Language with Bounding Boxes

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

*This paper was jointly completed by Tencent Youtu Lab and the University of Science and Technology of China

In this paper, we propose a novel framework for multimodal scene-text visual question answering (STVQA), which reads scene text in images to answer questions. Unlike text or visual objects that exist independently, scene text is itself a visual object in the image and naturally connects the textual and visual modalities by carrying linguistic semantics.

Different from traditional STVQA models, which treat the linguistic semantics and visual semantics of scene text as two independent features, this paper proposes the "Locate Then Generate" (LTG) paradigm, which explicitly unifies the two semantics and uses spatial bounding boxes as the bridge connecting them.

Specifically, LTG first uses an answer localization module (ALM), composed of a region proposal network and a language refinement network, to localize image regions that may contain answer words; the two networks are linked through a one-to-one mapping via scene-text bounding boxes. Next, given the answer words selected by ALM, LTG generates a readable answer sequence with an answer generation module (AGM) based on a pre-trained language model. Thanks to the explicit alignment of visual and linguistic semantics, even without any scene-text-based pre-training task, LTG improves absolute accuracy by 6.06% on the TextVQA dataset and 6.92% on the ST-VQA dataset compared with non-pretrained baselines. We further show that LTG effectively unifies the visual and textual modalities through spatial bounding-box connections, an aspect that was largely understudied in previous methods.

06

Robust webly-supervised prototypical learning guided by a few real-world samples

FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning

Recent webly-supervised learning (WSL) research aims to exploit the large amount of data accessible on the Internet. Most existing methods focus on learning noise-robust models from web images, but often ignore the performance degradation caused by the domain gap between web images and the real-world business domain. Only by addressing this gap can the practical value of open-source web datasets be fully exploited.

To this end, we propose FoPro, a method that uses a small number of real-world samples to guide prototypical representation learning on web images. It requires only a few labeled samples from the real business scenario and significantly improves model performance in the real business domain.

Specifically, the method uses a small amount of real-scene data to initialize the feature representation of each class center as a "realistic" prototype. Then, the intra-class distance between web image instances and the realistic prototypes is reduced via contrastive learning. Finally, the method uses metric learning to measure the distance between web images and each class prototype. The class prototypes are continuously refined by nearby high-quality web images in the representation space and are used to remove out-of-distribution (OOD) samples that lie far away.
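
A minimal sketch of the prototype idea follows: web-image features are pulled toward "realistic" class prototypes with a contrastive loss, and the prototypes are refreshed by an exponential moving average. FoPro's metric learning and OOD-removal steps are omitted, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """Pull web-image features toward their class prototypes.

    features: (N, D) normalized embeddings; prototypes: (C, D) normalized
    class centers initialized from a few real-world samples.
    """
    logits = features @ prototypes.t() / temperature   # (N, C)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def update_prototypes(prototypes, features, labels, momentum=0.99):
    """EMA refresh of prototypes with (assumed already filtered) clean features."""
    for c in labels.unique():
        class_mean = F.normalize(features[labels == c].mean(0), dim=0)
        prototypes[c] = F.normalize(
            momentum * prototypes[c] + (1 - momentum) * class_mean, dim=0)
    return prototypes
```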

In the experiments, FoPro uses a few real-world samples to guide training on web datasets and is evaluated on real-world datasets. The method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-real-sample setting, FoPro shows superior generalization in real-world scenarios.

07

A General Coarse-to-Fine Vision Transformer Acceleration Scheme

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

*This paper was jointly completed by Tencent Youtu Lab and Xiamen University

The core operation of Vision Transformers (ViTs) is self-attention, whose computational complexity grows quadratically with the number of input tokens. Therefore, the most direct way to reduce the computation of a ViT is to reduce the number of tokens at inference time, i.e., the number of patches into which the image is divided.

In this paper, the number of tokens at inference time is reduced through two-stage adaptive inference: the first stage divides the image into coarse-grained (large-size) patches, aiming to recognize "easy" samples with less computation; the second stage further splits the coarse-grained patches with high information content from the first stage into fine-grained (small-size) patches, aiming to recognize "hard" samples with as little computation as possible.

This paper also designs a global attention mechanism to identify informative coarse-grained patches, and a feature reuse mechanism to increase the model capacity of the two-stage inference. Without hurting Top-1 accuracy, the method reduces the FLOPs of LV-ViT-S by 53% on ImageNet-1k, and the measured GPU inference speed is roughly doubled.
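
The two-stage early-exit logic can be sketched as below, assuming a hypothetical `model(image, patch_size=..., refine_patches=...)` interface that also returns per-patch informativeness scores; the actual CF-ViT additionally reuses first-stage features and scores patches with global attention.

```python
import torch

@torch.no_grad()
def coarse_to_fine_inference(model, image, threshold=0.7):
    """Two-stage inference: exit early on 'easy' images, refine 'hard' ones.

    `model(image, patch_size=...)` is a hypothetical interface returning
    (logits, patch_scores); it does not correspond to a real library API.
    """
    logits, patch_scores = model(image, patch_size=32)      # coarse stage
    probs = logits.softmax(dim=-1)
    if probs.max().item() >= threshold:
        return probs                                         # easy sample

    # Hard sample: re-split the most informative coarse patches.
    topk = patch_scores.topk(k=patch_scores.numel() // 2).indices
    logits, _ = model(image, patch_size=16, refine_patches=topk)
    return logits.softmax(dim=-1)
```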

08

End-to-End Zero-Shot Human-Object Interaction Detection via Vision-Language Knowledge Distillation

End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

Most existing human-object interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection, detecting both seen and unseen human-object interactions. The fundamental challenge is to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel end-to-end zero-shot HOI detection framework based on vision-language knowledge distillation.

We first design an interaction scoring module combined with a two-stage bipartite matching algorithm to discriminate interactive human-object pairs in an action-agnostic manner. We then transfer the action probability distributions from a pre-trained vision-language teacher, together with the seen ground-truth annotations, to the HOI detection model for zero-shot HOI classification. Extensive experiments on the HICO-DET dataset show that our model discovers potential interactive pairs and can identify unseen HOIs. Our method outperforms previous state-of-the-art methods under various zero-shot settings, and it also generalizes to large-scale object detection data to further scale up the action set.
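
For illustration, the sketch below shows the distillation part only: the student's action logits for each human-object pair are matched to the teacher's action distribution with a temperature-scaled KL loss. How the frozen vision-language teacher produces those probabilities is abstracted away here.

```python
import torch
import torch.nn.functional as F

def vl_distillation_loss(student_action_logits, teacher_action_probs,
                         temperature=2.0):
    """KL-divergence distillation of action distributions.

    student_action_logits: (N, A) raw scores for N human-object pairs;
    teacher_action_probs: (N, A) probabilities from a frozen vision-language
    teacher (their computation, e.g. via image-text similarity, is omitted).
    """
    log_student = F.log_softmax(student_action_logits / temperature, dim=-1)
    return F.kl_div(log_student, teacher_action_probs,
                    reduction="batchmean") * temperature ** 2
```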

09

Open-vocabulary multi-label learning based on multi-modal knowledge transfer

Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer

In practical applications, classification models inevitably encounter a large number of labels that do not appear in the training set. To recognize these labels, traditional multi-label zero-shot learning methods transfer knowledge from labels seen during training to unseen labels by introducing language models such as GloVe. Although such unimodal language models capture the semantic consistency between labels well, they ignore the visual consistency information that is key to image classification.

Recently, open-vocabulary classification models based on image-text pre-training have achieved impressive results in single-label zero-shot learning, but how to transfer this ability to multi-label scenarios remains an open question.

In this paper, the authors propose a Multi-modal Knowledge Transfer (MKT) framework for open-vocabulary multi-label classification. Specifically, label prediction is based on the strong image-text matching ability of an image-text pre-trained model. To optimize the label embeddings and improve the consistency of the image-label mapping, the authors introduce prompt tuning and knowledge distillation.

In addition, the authors propose a simple yet effective two-stream module to capture local and global features simultaneously, improving the model's multi-label recognition ability. Experimental results on two public datasets, NUS-WIDE and OpenImages, show that the method effectively realizes open-vocabulary multi-label learning.
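
A minimal sketch of the open-vocabulary scoring step is given below: each label name is embedded by a text encoder and compared with the image embedding by cosine similarity. The encoders stand in for an image-text pre-trained model, and MKT's prompt tuning, distillation, and two-stream module are not shown.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def open_vocab_multilabel_scores(image_encoder, text_encoder,
                                 images, label_texts):
    """Score every label in an open vocabulary for each image.

    image_encoder / text_encoder are placeholders for a pre-trained
    image-text model; this is only the matching step, not the full MKT.
    """
    img_emb = F.normalize(image_encoder(images), dim=-1)       # (B, D)
    txt_emb = F.normalize(text_encoder(label_texts), dim=-1)   # (L, D)
    return (img_emb @ txt_emb.t()).sigmoid()                   # (B, L) scores
```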

10

An Online Knowledge Distillation Algorithm Based on Adaptive Hierarchy-Branch Fusion

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

*This paper was jointly completed by Tencent Youtu Lab and East China Normal University

Online knowledge distillation does not require a pre-trained teacher model, which greatly improves the flexibility of knowledge distillation. Existing methods mainly focus on improving the prediction accuracy of the ensemble of multiple student branches, often ignoring the homogenization problem, which makes the student model overfit quickly and hurts performance. The problem stems from using identical branch architectures and a naive branch ensemble strategy. To alleviate it, in this paper we propose a novel adaptive hierarchy-branch fusion framework for online knowledge distillation, abbreviated AHBF-OKD.

The framework mainly designs a hierarchical branch structure and an adaptive hierarchy-branch fusion module to improve model diversity, so that the knowledge of different branches can complement each other. In particular, to efficiently transfer knowledge from the most complex branch to the simplest target branch, the adaptive hierarchy-branch fusion module recursively creates auxiliary teacher modules across hierarchies. During training, the knowledge of the auxiliary teacher at the upper level is distilled into the auxiliary teacher and the student branch at the current level, and the importance coefficients of different branches are assigned adaptively to reduce branch homogenization.
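
A minimal online-distillation sketch follows: several student branches are trained jointly and each branch is distilled from the ensemble of all branches. The plain average used here is only a placeholder for AHBF-OKD's hierarchical fusion with adaptive importance coefficients.

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(branch_logits, labels, temperature=3.0,
                             kd_weight=1.0):
    """Joint loss for several student branches trained together.

    branch_logits: list of (N, C) outputs, one per branch. The ensemble is
    a plain average here rather than AHBF-OKD's adaptive hierarchical fusion.
    """
    ensemble = torch.stack(branch_logits).mean(0).detach()
    soft_target = F.softmax(ensemble / temperature, dim=-1)

    loss = 0.0
    for logits in branch_logits:
        ce = F.cross_entropy(logits, labels)
        kd = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                      soft_target, reduction="batchmean") * temperature ** 2
        loss = loss + ce + kd_weight * kd
    return loss / len(branch_logits)
```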

Extensive experiments verify the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, the distilled ResNet18 achieves 29.28% Top-1 error rate on ImageNet 2012.

11

A Multi-Person Pose Estimation Method Based on Inter-image Contrastive Consistency

Inter-image Contrastive Consistency for Multi-person Pose Estimation

In recent years, impressive progress has been made in multi-person pose estimation (MPPE). However, it is difficult for the model to learn consistent keypoint representations due to occlusion or large appearance differences between human bodies. In this paper, we propose an inter-image contrast consistency method to enhance the consistency of keypoint features between images in the MPPE task.

Specifically, we consider dual consistency constraints: Single Keypoint Contrastive Consistency (SKCC) and Pairwise Keypoint Contrastive Consistency (PRCC). SKCC strengthens the consistency of keypoints of the same category across images, improving robustness for each category. Although SKCC enables the model to effectively reduce localization errors caused by appearance changes, it remains challenging under extreme poses (e.g., occlusions) due to the lack of guidance from keypoint structural relations. Therefore, we propose PRCC to enforce the consistency of pairwise keypoint relations across images. PRCC works together with SKCC to further improve the model's ability to handle extreme poses.
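
As a rough illustration of the single-keypoint term, the sketch below applies an InfoNCE-style loss to same-type keypoint features taken from two different images; PRCC's pairwise relational term and the sampling strategy are omitted.

```python
import torch
import torch.nn.functional as F

def single_keypoint_consistency(feat_a, feat_b, temperature=0.1):
    """Cross-image consistency for keypoint features of the same type.

    feat_a, feat_b: (K, D) features of the K keypoint types extracted from
    two different images; row k of each tensor is the same keypoint type.
    A simplified stand-in for SKCC only.
    """
    feat_a = F.normalize(feat_a, dim=1)
    feat_b = F.normalize(feat_b, dim=1)
    logits = feat_a @ feat_b.t() / temperature       # (K, K)
    targets = torch.arange(feat_a.size(0), device=feat_a.device)
    return F.cross_entropy(logits, targets)
```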

Extensive experiments on three datasets (i.e., MS-COCO, MPII, CrowdPose) show that the proposed ICON achieves large improvements over baselines.

12

A Few-Shot Object Detection Model Based on Variational Feature Aggregation

Few-Shot Object Detection via Variational Feature Aggregation

Since few-shot object detectors are usually trained on base classes with more samples and fine-tuned on novel classes with fewer samples, their learned models are usually biased towards base classes and are sensitive to the variance of novel class samples. To address this issue, this paper proposes two feature aggregation algorithms based on the meta-learning framework.

Specifically, this paper first proposes a class-agnostic feature aggregation algorithm (CAA), which enables the model to learn class-agnostic feature representations by aggregating query and support features from different classes, reducing the confusion between base classes and novel classes.

Based on CAA, this paper proposes a variational feature aggregation algorithm (VFA), which achieves more robust feature aggregation by encoding samples into class-level distributions. A variational autoencoder (VAE) is used to estimate class distributions and to sample variational features that are more robust to sample variance.
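
A minimal sketch of the variational step follows: a support feature is encoded into a Gaussian, a variational feature is sampled with the reparameterization trick, and it is fused with the query feature; layer sizes, the fusion rule, and the KL weighting are illustrative rather than VFA's exact design.

```python
import torch
import torch.nn as nn

class VariationalAggregator(nn.Module):
    """Encode a support feature into N(mu, sigma) and fuse a sample with the query.

    A minimal sketch of variational feature aggregation, not the paper's
    exact architecture.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query_feat, support_feat):
        mu, logvar = self.to_mu(support_feat), self.to_logvar(support_feat)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # reparameterization trick
        fused = self.fuse(torch.cat([query_feat, z], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return fused, kl                              # kl regularizes the encoder
```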

Furthermore, we decouple classification and regression tasks so that feature aggregation can be performed on the classification branch without compromising object localization.

13

High-Resolution Iterative Feedback Networks for Camouflaged Object Segmentation

High-resolution Iterative Feedback Network for Camouflaged Object Detection

Spotting camouflaged objects that are visually assimilated into the background is a thorny problem for both object detection algorithms and humans, because both are easily confused or deceived by the high intrinsic similarity between the foreground object and the background.

To address this challenge, we extract high-resolution texture details to avoid the detail degradation that causes blurred edges and boundaries. We introduce HitNet, a novel framework that refines low-resolution representations with high-resolution features in an iterative feedback manner; its essence is a global, loop-based feature interaction across multiple resolutions.

Additionally, to design a better feedback feature flow and avoid feature collapse caused by recursive paths, we propose an iterative feedback strategy to impose more constraints on each feedback connection.
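
The feedback loop can be sketched as below, where `fuse` and `decode` are hypothetical modules that inject the previous prediction into high-resolution features and map them back to a mask; HitNet's specific constraints on each feedback connection are not reproduced.

```python
import torch

def iterative_feedback(decode, fuse, high_res_feat, init_pred, iters=3):
    """Refine a segmentation prediction over several feedback iterations.

    `decode` and `fuse` are placeholder callables, not HitNet's actual blocks:
    `fuse` feeds the previous prediction back into high-resolution features,
    `decode` produces a refined mask from the fused features.
    """
    preds = [init_pred]
    for _ in range(iters):
        feat = fuse(high_res_feat, preds[-1])   # feed the prediction back
        preds.append(decode(feat))              # produce a refined mask
    return preds                                # supervise every iteration
```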

Extensive experiments on four challenging datasets demonstrate that our HitNet breaks the performance bottleneck and achieves significant improvements compared to 35 state-of-the-art methods. Furthermore, to address the data scarcity problem in camouflage scenarios, we provide an application that converts salient objects into camouflage objects, thereby generating more camouflage training samples from different salient objects, and the code will be made public.

14

SpatialFormer: A Few-Shot Learning Method Based on Semantic and Target-Aware Attention

SpatialFormer: Semantic and Target Aware Attentions for Few-Shot Learning

Recent few-shot learning methods emphasize generating strongly discriminative embedding features to accurately compute the similarity between support and query sets. Current CNN-based cross-attention methods generate more discriminative features by enhancing mutually semantically similar regions of support and query image pairs. However, they suffer from two problems: the CNN structure produces inaccurate attention maps based on local features, and similar backgrounds cause interference.

To alleviate these problems, we design a novel SpatialFormer structure that generates more accurate attention regions based on global features. Modeling intrinsic instance-level similarity with a traditional Transformer degrades few-shot classification accuracy, whereas our SpatialFormer explores the semantic-level similarity between inputs to improve performance.

Then, we propose two attention modules, SpatialFormer Semantic Attention (SFSA) and SpatialFormer Target Attention (SFTA), to enhance target object regions while reducing background distraction. SFSA highlights regions that share the same semantic information between a pair of features, while SFTA finds potential foreground object regions of novel-class features that are similar to base classes.
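
For intuition, the sketch below uses a generic cross-attention block to let support features re-weight query features, approximating the spirit of SFSA only; SFTA and SpatialFormer's exact module design are not reproduced.

```python
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    """Re-weight query features by their similarity to support features.

    A generic cross-attention block, not SpatialFormer's actual module.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, support_feat):
        # query_feat: (B, Nq, D) query-image tokens,
        # support_feat: (B, Ns, D) support-image tokens.
        enhanced, _ = self.attn(query_feat, support_feat, support_feat)
        return query_feat + enhanced            # residual enhancement
```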

Extensive experiments demonstrate the effectiveness of our method, and we achieve superior performance on several benchmark datasets.

15

Sparsely Annotated Object Detection Based on a Calibrated Teacher Model

Calibrated Teacher for Sparsely Annotated Object Detection

Fully supervised object detection requires labeling all object instances in the training images, which incurs substantial labeling cost, and missing annotations are often unavoidable. Unlabeled objects in images provide misleading supervision and harm model training, so we study sparsely annotated object detection, which alleviates this problem by generating pseudo-labels for the missing objects.

Early sparsely annotated object detection methods often relied on a preset score threshold to filter out missing boxes, but the effective threshold differs across training stages, object categories, and detectors. Therefore, existing fixed-threshold methods still have room for improvement and require tedious hyperparameter tuning for different detectors.

To address this obstacle, we propose a Calibrated Teacher, in which the confidence estimates of predictions are calibrated to match the detector's actual precision. As a result, different detectors produce similar confidence distributions at different training stages, so they can share the same fixed threshold and achieve better performance.
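
The calibration idea can be approximated post hoc with classic temperature scaling, as sketched below on a held-out set of detections; the paper's calibration is learned during training, so treat this purely as an illustration of matching confidences to accuracy.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Fit a single temperature so softmax confidences match accuracy.

    Classic post-hoc temperature scaling on held-out detections; this is an
    approximation, not the paper's jointly trained calibration scheme.
    """
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()    # divide future logits by this temperature
```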

Furthermore, we propose a simple but effective FIoU mechanism to reduce the classification loss weight of false negative objects caused by missing annotations.

Extensive experiments show that our method achieves state-of-the-art performance under 12 different sparsely annotated object detection settings.

16

A High-Resolution GAN Inversion Method for Degraded Images in Large Diverse Datasets

High-Resolution GAN Inversion for Degraded Images in Large Diverse Datasets

Over the past few decades, large and diverse image datasets have shown increasing resolution and quality. However, some of the images we obtain suffer from multiple degradations, affecting perception and downstream applications. We need a general method to restore high-quality images from degraded ones. In this paper, we propose a new framework that addresses this problem by leveraging the powerful generative capability of StyleGAN-XL through inversion.

To alleviate the challenges StyleGAN-XL faces in inversion, we propose Cluster Regularized Inversion (CRI): (1) the huge, complex latent space is divided into multiple subspaces by clustering, which provides a better starting point for initialization and reduces the optimization difficulty; (2) exploiting the properties of the GAN latent space, a regularized offset is introduced during inversion to keep the latent vector within the region that produces high-quality images.
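
As a rough sketch of the clustering step, the code below clusters sampled latent codes with k-means and starts inversion from the best-matching cluster center; the `loss_fn` callable (a reconstruction loss given a latent and the degraded image) and the latent sampling interface are placeholders, not StyleGAN-XL's actual API.

```python
import torch
from sklearn.cluster import KMeans

def build_latent_clusters(sample_latents, n_clusters=64):
    """Cluster latent codes to get candidate starting points for inversion.

    sample_latents: (N, D) CPU tensor of latents sampled from the generator's
    mapping network (sampling interface not shown).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(sample_latents.numpy())
    return torch.from_numpy(km.cluster_centers_).float()   # (K, D)

def init_from_clusters(centers, degraded_img, loss_fn):
    """Pick the cluster center whose reconstruction best matches the input.

    loss_fn(latent, image) is a placeholder returning a scalar tensor. The
    inversion then optimizes a regularized offset `delta` around the chosen
    start, e.g. minimizing loss_fn(start + delta, image) + lam * delta.norm().
    """
    losses = torch.stack([loss_fn(c, degraded_img) for c in centers])
    return centers[losses.argmin()].clone()
```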

We validate our CRI scheme on multiple restoration tasks (completion, colorization, and super-resolution) on complex natural images, achieving strong quantitative and qualitative results. We further show that CRI is robust to different data and different GAN models.

To the best of our knowledge, this paper is the first work employing StyleGAN-XL to generate high-quality images from degraded natural images.

Note: The above data are laboratory data

Origin blog.csdn.net/qq_41050642/article/details/128305126