Tencent Youtu has 10 papers accepted at AAAI, a top artificial intelligence conference

AAAI 2020, a top international conference in the field of artificial intelligence, will be held in New York, USA from February 7th to February 12th. In recent years, with the rise of artificial intelligence, the annual AAAI conference has become more and more popular, attracting large numbers of researchers and developers from academia and industry to submit papers and attend.

Taking AAAI 2019 as an example, the number of paper submissions reached 7,745, a record high in AAAI's history at the time. AAAI 2020 proved even more popular: according to the official notification email sent by the conference, 8,800 valid submissions were received and 1,591 papers were accepted, an acceptance rate of only 20.6%.

As one of the oldest and broadest academic conferences in artificial intelligence, AAAI covers all areas of AI and machine learning. Traditional topics of interest include, but are not limited to, natural language processing and deep learning, as well as applied technical topics such as AI for industry.

Tencent Youtu Lab had a total of 10 papers accepted this time, covering arithmetic exercise correction, video recognition, and more.

The following is a detailed interpretation of each paper.

1. Rethinking Temporal Domain Fusion from Temporal and Semantic Levels for Video-Based Person Re-Identification (Oral)

Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect (Oral)

Keywords: person re-identification, temporal and semantic, temporal fusion

Download link: https://arxiv.org/abs/1911.12512

In recent years, research in the field of person re-identification (ReID) has continued to deepen, and more and more researchers have begun to pay attention to methods that aggregate information from the whole video to obtain human body features. However, existing person re-identification methods ignore the differences in semantic level among the information extracted by convolutional neural networks at different depths, which may leave the final video features with insufficient representational capability. In addition, traditional methods do not take inter-frame relationships into account when extracting video features, so temporal fusion produces redundant information that dilutes the key information.

To address these issues, this paper proposes a novel and general temporal fusion framework that aggregates frame information at both semantic and temporal levels. At the semantic level, this paper uses a multi-stage aggregation network to extract video information at multiple semantic levels, so that the finally obtained features can more comprehensively represent video information. On the temporal level, this paper improves the existing intra-frame attention mechanism, adds an inter-frame attention module, and effectively reduces information redundancy in temporal fusion by considering inter-frame relationships.
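To make the inter-frame attention idea concrete, below is a minimal PyTorch sketch of one plausible form: each frame is weighted by its average similarity to the other frames before fusion. The module name, shapes, and mean-similarity weighting are illustrative assumptions, not the paper's released code.

```python
# A minimal sketch of inter-frame attention for temporal fusion (assumptions,
# not the authors' implementation): redundant frames get lower weights.
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Weights each frame by its similarity to the other frames,
    down-weighting redundant frames before temporal fusion."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        q = self.proj(frame_feats)                       # query projection
        sim = torch.bmm(q, frame_feats.transpose(1, 2))  # pairwise frame similarity
        attn = torch.softmax(sim.mean(dim=2), dim=1)     # per-frame weight
        # Fuse frames into a single video-level feature.
        return (frame_feats * attn.unsqueeze(-1)).sum(dim=1)

feats = torch.randn(2, 8, 256)            # 2 videos, 8 frames, 256-d features
video_feat = InterFrameAttention(256)(feats)
print(video_feat.shape)                   # torch.Size([2, 256])
```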

The experimental results show that the proposed method effectively improves the accuracy of video-based person re-identification, achieving state-of-the-art performance.

2. Structured text recognition in arithmetic exercise correction

Accurate Structured-Text Spotting for Arithmetical Exercise Correction

Keywords: arithmetic exercise correction, expression detection and recognition

Correcting math homework has always been a labor-intensive task for elementary and middle school teachers. To ease this burden, this paper proposes Arithmetic Homework Checker, a system that automatically judges the correctness of every arithmetic expression in an image. The main challenge is that arithmetic expressions are usually a mixture of printed and handwritten text with special formats (e.g., multi-line or fractional), and traditional correction schemes have exposed many problems with them in real business. This paper proposes solutions to practical problems in two respects: expression detection and expression recognition.

For the problem of invalid expression candidates in detection, the paper designs a loss function focused on horizontal edges, built on the anchor-free detection method CenterNet. CenterNet locates an expression by capturing its two corner positions while also learning the information inside the object as a supplement, which avoids producing "hollow" detections and adapts well to the expression detection task. The horizontal-edge-focused loss further concentrates the loss update on the left and right edges of an expression, which are easy to generate but hard to localize, thereby avoiding plausible but invalid expression candidates. This method significantly improves detection recall and accuracy.

For expression recognition, to prevent meaningless context information from interfering with recognition results, the paper proposes a recognition method based on a context gate function. The gate function balances the input weights of the image representation and the context information, forcing the recognition model to rely more on the image representation and thus keeping meaningless context from corrupting the results.
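As a rough illustration of the context gate idea, here is a minimal PyTorch sketch in which a sigmoid gate scales the language-context vector before it is fused with the visual glimpse; all names and shapes are assumptions, not the paper's implementation.

```python
# A minimal sketch of a context gate for sequence recognition, assuming an
# attention decoder that mixes a visual glimpse with a language-context vector.
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    def __init__(self, visual_dim: int, context_dim: int):
        super().__init__()
        self.gate = nn.Linear(visual_dim + context_dim, context_dim)

    def forward(self, visual: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # visual: (batch, visual_dim) glimpse from the image encoder
        # context: (batch, context_dim) decoder language context
        g = torch.sigmoid(self.gate(torch.cat([visual, context], dim=-1)))
        # Scale (possibly misleading) context so the model leans on the image.
        return torch.cat([visual, g * context], dim=-1)

fused = ContextGate(512, 256)(torch.randn(4, 512), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 768])
```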

3. Fast learning of temporal action proposals based on a dense boundary generator

Fast Learning of Temporal Action Proposal via Dense Boundary Generator

Keywords: DBG action detection method, algorithm framework, open source

Download link: https://arxiv.org/abs/1911.04127

Video action detection technology underlies tasks such as highlight compilation, video caption generation, and action recognition, and it has seen increasingly wide industrial use with the rapid development of the Internet. It also faces many challenges, such as complex video scenes and large variation in action duration.

In response to these challenges, this paper proposes the DBG action detection algorithm, with three innovations:

(1) It proposes a fast, end-to-end Dense Boundary Generator (DBG), which estimates dense boundary confidence maps for all action proposals.

(2) It introduces an additional temporal action classification loss function to supervise the action score feature (ASF), which facilitates action-aware completeness regression (ACR).

(3) It designs an efficient Proposal Feature Generation layer (PFG), which effectively captures the global features of actions and facilitates the subsequent classification and regression modules.

The algorithm framework consists of three parts: video feature extraction (video representation), the dense boundary generator (DBG), and post-processing. Tencent Youtu's DBG code has been open-sourced on GitHub, and the method ranks first on ActivityNet.
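As an illustration of what a dense boundary confidence map looks like, here is a toy PyTorch sketch that scores every (start, end) proposal pair from a 2D proposal feature map; the head structure and channel sizes are assumptions for exposition, not the open-sourced DBG code.

```python
# An illustrative sketch of scoring a dense start/end boundary map over all
# candidate proposals, in the spirit of DBG; shapes and layers are assumptions.
import torch
import torch.nn as nn

T = 100  # number of temporal positions in the video feature sequence

class DenseBoundaryHead(nn.Module):
    """Maps a (batch, C, T, T) proposal feature map to start/end boundary
    confidences for every (start, end) pair."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels, 2, kernel_size=1),  # start & end confidence
        )

    def forward(self, proposal_feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(proposal_feats))  # (batch, 2, T, T)

maps = DenseBoundaryHead(128)(torch.randn(1, 128, T, T))
print(maps.shape)  # torch.Size([1, 2, 100, 100]); entry (s, e) scores proposal [s, e]
```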

4. TEINet: Towards an Efficient Architecture for Video Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

Keywords: TEI module, temporal modeling, temporal structure

Download link: https://arxiv.org/abs/1911.09435

This paper proposes an efficient temporal modeling module, the TEI module, which can easily be added to existing 2D CNNs. Unlike previous temporal modeling methods, TEI learns temporal features through channel-wise attention and channel-wise temporal interaction.

Specifically, the MEM module inside TEI enhances motion-related features while suppressing irrelevant ones (such as background), and the TIM module then supplements forward and backward temporal information along the channel dimension. Together, the two modules capture temporal structure flexibly and effectively while keeping inference efficient. The paper verifies their effectiveness through extensive experiments on multiple benchmarks.
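A rough PyTorch sketch of a MEM-style motion-enhancing channel attention is shown below: adjacent-frame feature differences drive per-channel gates that suppress static channels. The layer names and shapes are assumptions, not the TEI release.

```python
# A rough sketch of motion-enhancing channel attention in the spirit of TEI's
# MEM (assumptions for illustration): frame differences gate the channels.
import torch
import torch.nn as nn

class MotionEnhancedModule(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, H, W)
        diff = x[:, 1:] - x[:, :-1]                    # frame-to-frame motion
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad to keep length T
        gate = self.fc(diff.mean(dim=(3, 4)))          # (batch, time, channels)
        return x * gate.unsqueeze(-1).unsqueeze(-1)    # suppress static channels

out = MotionEnhancedModule(64)(torch.randn(2, 8, 64, 14, 14))
print(out.shape)  # torch.Size([2, 8, 64, 14, 14])
```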

5. Revisiting image aesthetic quality assessment through self-supervised feature learning

Revisiting Image Aesthetic Assessment via Self-Supervised Feature Learning

Keywords: aesthetic evaluation, self-supervision, computer vision

Download link: https://arxiv.org/abs/1911.11419

Image aesthetic quality assessment is an important research topic in the field of computer vision. In recent years, researchers have proposed many effective methods and made great progress on the aesthetic assessment problem. These methods essentially rely on large-scale image labels or attributes related to visual aesthetics, but such information comes at a huge annotation cost.

To reduce the cost of manual annotation, learning aesthetically expressive visual representations through self-supervision is one promising research direction, and this paper proposes a simple and effective self-supervised method along it. The core motivation is that if a representation space cannot identify the aesthetic quality changes caused by different image editing operations, it is not suitable for the task of image aesthetic quality assessment. Starting from this motivation, the paper designs two self-supervised tasks: one requires the model to identify the type of editing operation applied to an input image; the other requires it to distinguish the differences in aesthetic quality produced by the same type of operation under different control parameters, further refining the learned representation space.
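As a concrete (and simplified) illustration of the first pretext task, the sketch below builds training pairs by applying a random aesthetics-degrading edit and keeping the operation index as the label; the specific operations chosen here are assumptions, not the paper's exact edit set.

```python
# A simplified sketch of building self-supervised labels: apply a degrading
# edit and ask the model to classify which edit it was (illustrative ops).
import torch
from torchvision import transforms

EDIT_OPS = [
    ("blur", transforms.GaussianBlur(kernel_size=9)),
    ("gray", transforms.Grayscale(num_output_channels=3)),
    ("crop", transforms.CenterCrop(160)),
]

def make_self_supervised_pair(img: torch.Tensor):
    """Returns (edited image, op label, op name) for the pretext task."""
    idx = torch.randint(len(EDIT_OPS), (1,)).item()
    name, op = EDIT_OPS[idx]
    return op(img), idx, name

img = torch.rand(3, 224, 224)  # stand-in for a real photo tensor
edited, label, name = make_self_supervised_pair(img)
print(name, label, edited.shape)
```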

For experimental comparison, the paper evaluates the proposed method against classic self-supervised learning methods (such as Colorization, Split-Brain, and RotNet). The results show that it achieves competitive performance on three public aesthetic assessment datasets (i.e., AVA, AADB, and CUHK-PQ). Notably, the method outperforms representations learned directly from the labels of the ImageNet or Places datasets. Furthermore, on the AVA dataset, models based on this method achieve performance comparable to the best methods without using labels from the ImageNet dataset.

6. Video domain adaptation based on generative models

Generative Adversarial Networks for Video-to-Video Domain Adaptation

Keywords: video generation, unsupervised learning, domain adaptation

Endoscopic videos from different centers are usually captured under different imaging conditions, such as color and lighting, so a model trained on one domain often generalizes poorly to another. Domain adaptation is a potential solution to this problem, yet few existing works focus on domain adaptation for video data.

To address the above issues, this paper proposes a novel generative adversarial network (GAN), namely VideoGAN, to transform video data between different domains. Experimental results show that domain-adapted colonoscopy videos generated by VideoGAN can significantly improve the segmentation accuracy of colorectal polyps by deep learning networks on multi-center datasets. Since our VideoGAN is a general-purpose network architecture, this paper also tests it on the CamVid driving video dataset. Experiments show that our VideoGAN can narrow the inter-domain gap.
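For readers unfamiliar with the underlying objective, here is a schematic PyTorch sketch of the adversarial game behind such domain translation: a generator maps source-domain frames toward the target domain while a discriminator separates real target frames from translated ones. The toy networks are stand-ins, not VideoGAN's architecture.

```python
# A schematic adversarial training step for domain translation; the tiny
# networks below are placeholders, not VideoGAN's actual implementation.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1))      # toy generator
D = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(1))       # toy discriminator
bce = nn.BCEWithLogitsLoss()

src = torch.rand(4, 3, 64, 64)   # source-domain frames
tgt = torch.rand(4, 3, 64, 64)   # target-domain frames
fake = G(src)                    # source frames translated toward target style

# Discriminator: real target frames -> 1, translated frames -> 0.
d_loss = bce(D(tgt), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
# Generator: fool the discriminator into scoring translations as real.
g_loss = bce(D(fake), torch.ones(4, 1))
print(d_loss.item(), g_loss.item())
```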

7. Asymmetric Collaborative Teaching for Unsupervised Cross-Domain Person Re-Identification

Asymmetric Co-Teaching for Unsupervised Cross-Domain Person Re-Identification

Keywords: person re-identification, asymmetric co-teaching, domain adaptation

Download link: https://arxiv.org/abs/1912.01349

Person re-identification has always been a very challenging topic due to high sample variance and varying image quality. Although great progress has been made in re-ID in some fixed scenarios (the source domain), only a few works achieve good results on target domains the model has not seen. One effective current solution applies pseudo-labels to unlabeled data through clustering to help the model adapt to the new scenario. However, clustering often introduces label noise and discards low-confidence samples, hindering improvements in model accuracy.

This paper proposes an asymmetric co-teaching method to mine samples more effectively and improve domain adaptation accuracy. Specifically, two networks are used: one receives samples that are as pure as possible, while the other receives samples that are as diverse as possible. Under this co-teaching-like framework, the method filters out noisy samples while bringing more low-confidence samples into the training process. Experiments on multiple public datasets show that the method effectively improves current domain adaptation accuracy and can be used with different clustering methods.
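A schematic sketch of the asymmetric sample routing might look like the following: per-sample losses under the current pseudo-labels are sorted, one network gets the lowest-loss ("purest") slice, and the other keeps the full, diverse set including low-confidence samples. The fraction and routing rule are illustrative assumptions.

```python
# A schematic sketch of asymmetric sample routing for co-teaching;
# the threshold fraction and routing rule are assumptions.
import torch

def split_samples(losses: torch.Tensor, pure_frac: float = 0.5):
    """losses: per-sample loss under current pseudo-labels, shape (N,)."""
    order = torch.argsort(losses)              # low loss ~ likely clean label
    n_pure = int(pure_frac * len(losses))
    pure_idx = order[:n_pure]                  # network A: purest subset
    diverse_idx = order                        # network B: keep everything,
    return pure_idx, diverse_idx               # including low-confidence samples

losses = torch.rand(10)
pure, diverse = split_samples(losses)
print(pure.tolist(), len(diverse))
```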

8. Orientation Sensitive Loss with Angle Regularization for Person Re-Identification

Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification

Keywords: person re-identification, orientation, modeling

Download link: https://arxiv.org/abs/1912.01300

Significant progress has been made in supervised person re-identification (ReID) in recent years, but large orientation differences among pedestrian images keep the problem challenging. Most existing orientation-based feature learning methods map images from different orientations into separate, independent sub-feature spaces. This approach only models the identity-level feature distribution of pedestrian images under a single orientation, ignoring the potential correlations between orientations.

To solve this problem, this paper proposes a new method, viewpoint-aware loss with angular regularization (VA-ReID). Instead of learning a subspace for each orientation, this method maps features from different orientations onto the same hypersphere, so that identity-level and orientation-level feature distributions can be modeled simultaneously. On this basis, rather than modeling different orientations as hard labels as traditional classification methods do, the paper proposes a viewpoint-aware adaptive label smoothing regularization method (VALSR), which assigns adaptive soft orientation labels to feature representations, solving the problem that some orientations cannot be clearly labeled.
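As a toy illustration of soft orientation labels, the sketch below spreads probability mass from a hard viewpoint label over the remaining viewpoints; the uniform smoothing used here is an assumption, whereas VALSR assigns the softness adaptively.

```python
# A toy sketch of soft viewpoint labels; uniform smoothing is an assumption
# (VALSR adapts the softness per sample).
import torch

def soft_viewpoint_label(hard_view: int, num_views: int = 3,
                         eps: float = 0.2) -> torch.Tensor:
    """E.g., views {0: front, 1: side, 2: back}; ambiguous poses get soft mass."""
    label = torch.full((num_views,), eps / (num_views - 1))
    label[hard_view] = 1.0 - eps
    return label

print(soft_viewpoint_label(1))  # tensor([0.1000, 0.8000, 0.1000])
```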

Extensive experiments on the Market1501 and DukeMTMC datasets demonstrate the effectiveness of the method, whose performance significantly exceeds the best existing supervised ReID methods.

9. How to train conditional generative adversarial networks with weakly supervised information

Robust Conditional GAN from Uncertainty-Aware Pairwise Comparisons

Keywords: CGAN, weak supervision, pairwise comparison

Download link: https://arxiv.org/abs/1911.09298

Conditional GANs (CGANs) have achieved great success in recent years and have been successfully applied in fields such as image attribute editing. However, CGANs often require a large number of annotations. To address this, most existing methods rely on unsupervised clustering, for example first obtaining pseudo-labels with unsupervised learning and then training the CGAN with the pseudo-labels as real labels. However, when the target attribute is continuous rather than discrete, or does not represent the main differences within the data, such clustering-based methods struggle to achieve the desired effect. This paper instead considers training CGANs with weakly supervised information, in the form of pairwise comparisons. Compared with absolute labeling, pairwise comparison has the following advantages: (1) it is easier to label; (2) it is more accurate; (3) it is less susceptible to subjective influence.

We propose to first train a comparison network to predict a score for each image, and then use this score as the condition for training the CGAN. For the comparison network, we draw inspiration from the Elo rating system commonly used in chess and other games: each pairwise comparison annotation is treated as a match, and a network predicts each image's rating. We design the network so it can be trained with backpropagation, and we also consider a Bayesian version that can estimate uncertainty. For the image generation part, we extend the Robust Conditional GAN (RCGAN) to the case where the condition is a continuous value. Specifically, the predicted scores of the generated fake images are corrupted by a resampling process before being fed to the discriminator; this resampling relies on the uncertainty estimates from the Bayesian comparison network.
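The pairwise-comparison training signal can be illustrated with a Bradley-Terry/Elo-style objective, where the probability that image i beats image j is a sigmoid of their score difference. The toy scorer below is an assumption for exposition, not the paper's network.

```python
# A minimal sketch of training a scoring network from pairwise comparisons
# with an Elo/Bradley-Terry-style objective; the scorer is a placeholder.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))  # toy score net
bce = nn.BCEWithLogitsLoss()  # applies sigmoid to the score difference

img_i = torch.rand(8, 3, 32, 32)
img_j = torch.rand(8, 3, 32, 32)
wins = torch.randint(0, 2, (8, 1)).float()  # 1 if annotator preferred img_i

logit = scorer(img_i) - scorer(img_j)       # Elo-style score difference
loss = bce(logit, wins)
loss.backward()
print(loss.item())
```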

We conduct experiments on four datasets, editing the age and facial attractiveness of face images respectively. Experimental results show that the proposed weakly supervised method is comparable to the fully supervised baseline and far better than the unsupervised baselines.

10. Unsupervised Domain Adaptive Semantic Segmentation Based on Adversarial Perturbation

An Adversarial Perturbation Oriented Domain Adaptation Approach for Semantic Segmentation

Keywords: Unsupervised Domain Adaptation, Semantic Segmentation, Adversarial Training

Download link: https://arxiv.org/pdf/1912.08954.pdf

Nowadays, neural networks can achieve good results given a large amount of labeled data, but they often fail to generalize to new environments, and large-scale data labeling is very expensive. Unsupervised domain adaptation therefore tries to train a model on existing labeled data and transfer it to unlabeled data.

Adversarial alignment methods are widely used in unsupervised domain adaptation to globally match the marginal distributions of feature representations between two domains. However, because semantic segmentation data has a severe long-tail distribution and category-level adaptation lacks supervision, the cross-domain matching process ends up dominated by large object categories (e.g., roads, buildings), so this strategy tends to neglect the feature representations of tail categories and small objects (e.g., traffic lights, bicycles).

This paper proposes a framework that generates and then defends against perturbations. First, the framework designs several adversarial targets (classifiers and discriminators) and uses them to generate adversarial samples point by point in the feature spaces of the two domains. These adversarial samples connect the feature spaces of the two domains and carry information about where the network is vulnerable. The framework then forces the model to defend against these adversarial samples, yielding a model that is more robust to domain variation, object size, and the long-tail class distribution.
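A compact sketch of the generate-then-defend loop, using an FGSM-style perturbation on pointwise features as a stand-in for the paper's adversarial targets, might look like this (the single-classifier setup and all shapes are assumptions):

```python
# A schematic generate-then-defend step: craft an FGSM-style perturbation of
# pointwise features against a classifier, then also train on the perturbed
# features. A stand-in sketch, not the paper's exact losses.
import torch
import torch.nn as nn

classifier = nn.Linear(64, 19)               # e.g., 19 Cityscapes classes
ce = nn.CrossEntropyLoss()

feats = torch.randn(32, 64, requires_grad=True)   # pointwise features
labels = torch.randint(0, 19, (32,))

loss = ce(classifier(feats), labels)
grad, = torch.autograd.grad(loss, feats)
adv_feats = feats + 0.1 * grad.sign()              # generate adversarial samples

defense_loss = ce(classifier(adv_feats), labels)   # defend: train on them too
defense_loss.backward()
print(defense_loss.item())
```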

The adversarial perturbation framework proposed in this paper is validated on two synthetic-to-real adaptation tasks. The method not only achieves excellent overall segmentation performance, but also improves accuracy on small objects and tail categories, proving its effectiveness.
