Google research team finds ways to improve the robustness of self-supervised learning


Author | Google Brain    Planning | Yuying    Translator | Li Dongmei

How to further improve the robustness of self-supervised learning is a major open problem in computer vision. Researchers from Google Brain have published a paper addressing this question. This article is the 105th paper guide from AI Frontline; in it, we interpret the paper's method and results.

Recently, Google Brain researchers published a new study on how to improve the robustness of self-supervised learning. The work proposes a general framework for automatically removing shortcut features, which allows self-supervised models to outperform models trained in the conventional way.

Abstract

In self-supervised visual representation learning, a feature extractor is trained on a "pretext task" (an indirect, automatically labeled task designed so that training labels can be generated cheaply). A key problem with this approach is that the feature extractor often latches onto low-level visual features, such as chromatic aberration or watermarks, to solve the pretext task quickly, and therefore fails to learn useful semantic representations.

To address this problem, we propose a general framework for automatically removing shortcut features. Our main hypothesis is that the features exploited first to solve the pretext task are also the ones most susceptible to an adversary trained to make the task harder. We train a "lens" network to make small image changes that maximally reduce pretext-task performance, and show that this assumption holds for common pretext tasks and datasets. In all tests, representations learned from lens-modified images outperform those learned from unmodified images. In addition, the modifications made by the lens reveal how the choice of pretext task and dataset affects the features learned through self-supervision.

Method

We propose to process images with a lightweight image-to-image translation network (the "lens") to improve self-supervised visual representations. The lens is trained adversarially to reduce the feature extraction network's performance on the pretext task. Before describing the approach, we first define the notion of "shortcut" visual features.

Intuitively, given a pretext task and a downstream application of the learned representation, shortcut features are features that (i) allow the pretext task to be solved quickly and accurately by focusing on low-level visual cues, and (ii) are useless for the downstream application and prevent the learning of useful semantic representations.


Legend: Example of automatic shortcut removal for the rotation prediction pretext task. The lens has learned to remove features that make the pretext task too easy to solve (in this example, it conceals a watermark). Shortcut removal forces the network to learn higher-level features to solve the pretext task, which improves the quality of the semantic representation.

We first formalize the general setup of pretext-task-based self-supervised learning (SSL), and then describe how to modify this setup to prevent shortcut features.

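In the notation assumed here (not the paper's exact notation): F is the feature extractor, L is the lens, x an input image, ℓ the pretext-task loss, and λ the weight of a reconstruction penalty that keeps the lens output close to its input. The standard pretext setup and its lens-modified, adversarial variant can then be sketched as:

$$\min_{F}\; \mathbb{E}_{x}\big[\ell\big(F(x)\big)\big] \quad\longrightarrow\quad \min_{F}\; \mathbb{E}_{x}\big[\ell\big(F(L(x))\big)\big], \qquad \min_{L}\; \mathbb{E}_{x}\big[-\ell\big(F(L(x))\big) + \lambda\,\lVert L(x)-x\rVert^{2}\big]$$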

For classification pretext tasks, the lens can be trained to push the predicted class probabilities toward the least likely class, so the loss function becomes:

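A sketch of this lens objective, based on the description above (the exact formulation in the paper may differ in details such as the reconstruction penalty):

$$\mathcal{L}_{\text{lens}} \;=\; \ell_{\mathrm{CE}}\Big(F\big(L(x)\big),\; \arg\min_{c}\, p\big(c \mid L(x)\big)\Big) \;+\; \lambda\,\lVert L(x)-x\rVert^{2}$$

Here p(c | L(x)) denotes the class probabilities predicted by F, so the lens is rewarded for pushing the extractor toward its currently least likely class.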

The approach can be summarized as follows:

  • We propose a simple and general method for automatically removing shortcuts, which can be applied to almost any pretext task.
  • We validate the method on a wide range of pretext tasks and two different upstream training datasets (ImageNet and YouTube-8M frames). For all methods, both upstream training datasets, and both downstream/evaluation datasets (ImageNet and Places205), it yields improvements. Notably, the method can replace the manual preprocessing steps previously used to remove shortcut features.
  • We use the lens to compare shortcut features across different pretext tasks and datasets.
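To make the adversarial training procedure concrete, the following is a minimal PyTorch-style sketch of a single training step for a classification pretext task such as rotation prediction. The module, optimizer, and parameter names (`lens`, `extractor`, `opt_lens`, `opt_extractor`, `lam`) are placeholders assumed for this article, not code from the paper.

```python
import torch.nn.functional as F

def train_step(lens, extractor, images, pretext_labels,
               opt_lens, opt_extractor, lam=1.0):
    """One adversarial step: update the extractor on lens-processed images,
    then update the lens to push predictions toward the least likely class
    while staying close to the original images."""
    # 1) Feature extractor: minimize the pretext loss on lens output.
    processed = lens(images)
    logits = extractor(processed.detach())  # no gradient into the lens here
    loss_f = F.cross_entropy(logits, pretext_labels)
    opt_extractor.zero_grad()
    loss_f.backward()
    opt_extractor.step()

    # 2) Lens: cross-entropy toward the extractor's least likely class,
    #    plus a reconstruction penalty weighted by lam.
    processed = lens(images)
    logits = extractor(processed)
    least_likely = logits.argmin(dim=1)
    loss_lens = (F.cross_entropy(logits, least_likely)
                 + lam * ((processed - images) ** 2).mean())
    opt_lens.zero_grad()
    loss_lens.backward()
    opt_lens.step()
    return loss_f.item(), loss_lens.item()
```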

Experiment

In the experiments, the researchers trained a self-supervised model on the open-source CIFAR-10 dataset to predict the correct orientation of slightly rotated images. To test the lens, they added shortcut features carrying orientation information to the input images; these shortcuts allow the model to solve the rotation task without learning object-level features. The researchers report that, without the lens, the semantic representations learned in the presence of these synthetic shortcuts performed poorly, whereas feature extractors trained with the lens performed markedly better overall.
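As an illustration of what such a synthetic orientation shortcut might look like (an assumed example for this article, not necessarily the exact shortcut used in the paper), one can stamp a small marker into a corner chosen by the rotation label, so the rotation can be read off without looking at the image content:

```python
import numpy as np

def add_rotation_shortcut(image, rotation_label, size=3):
    """Stamp a bright square into a corner chosen by the rotation label.

    image: HxWxC uint8 array; rotation_label: int in {0, 1, 2, 3}
    (0, 90, 180, 270 degrees). Hypothetical shortcut for illustration only.
    """
    out = image.copy()
    h, w = out.shape[:2]
    corners = {
        0: (slice(0, size), slice(0, size)),          # top-left
        1: (slice(0, size), slice(w - size, w)),      # top-right
        2: (slice(h - size, h), slice(w - size, w)),  # bottom-right
        3: (slice(h - size, h), slice(0, size)),      # bottom-left
    }
    out[corners[rotation_label]] = 255
    return out
```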


Legend: Schematic of the model. In the experiments in this article, we use a U-Net architecture for the lens L and a ResNet50 v2 architecture for the feature extractor F.


Legend: Evaluation of representations trained on ImageNet with different self-supervised pretext tasks. Scores are logistic-regression accuracies (in %). Bold values are better than the next-best method at the 0.05 significance level. Training images were preprocessed as in their original publications.

Legend: Top: three example images from ImageNet, processed by lenses trained on different pretext tasks. The dotted square on the input image marks the region used for patch-based tasks. Bottom: mean reconstruction loss over 1280 images randomly sampled from the test set. For display, values are clipped at the 95th percentile.

In a second test, the team trained a model on one million images from the open-source ImageNet corpus and asked it to predict the relative position of patches within each image. The researchers report that adding the lens improved over the baseline for all tasks tested.
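For reference, the relative patch location pretext task can be set up roughly as follows: sample the centre patch of a 3x3 grid together with one of its eight neighbours, and use the neighbour's index as the pretext label. This is a simplified sketch with assumed patch sizes, not the paper's exact implementation:

```python
import numpy as np

def sample_relative_patch_pair(image, patch_size=64, gap=8):
    """Return (centre_patch, neighbour_patch, label) for the relative patch
    location task; label in {0, ..., 7} indexes the neighbour position.
    Assumes the image is at least 3 * (patch_size + gap) pixels per side."""
    step = patch_size + gap
    h, w = image.shape[:2]
    # Top-left corner of the centre cell of a 3x3 grid centred in the image.
    cy = (h - 3 * step) // 2 + step
    cx = (w - 3 * step) // 2 + step
    centre = image[cy:cy + patch_size, cx:cx + patch_size]
    neighbours = [(-1, -1), (-1, 0), (-1, 1),
                  (0, -1),           (0, 1),
                  (1, -1),  (1, 0),  (1, 1)]
    label = np.random.randint(8)
    dy, dx = neighbours[label]
    ny, nx = cy + dy * step, cx + dx * step
    neighbour = image[ny:ny + patch_size, nx:nx + patch_size]
    return centre, neighbour, label
```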

Conclusion

The researchers conclude: "The results show that the benefits of using adversarially trained lenses to automatically remove shortcuts apply broadly across pretext tasks and datasets. In addition, we found that this holds for all the types of feature extractors we tested. Beyond improving the representations, our method lets us see more directly which features are learned through self-supervision, and to quantify and compare them. We confirmed that the method can detect and mitigate shortcut features identified in previous work."

In future research, the Google Brain team plans to explore new lens architectures and to investigate whether the technique can also be used to improve supervised learning algorithms.


Origin blog.51cto.com/15060462/2676180