Paper Translation: S4L (Self-Supervised Semi-Supervised Learning)

Abstract

This work addresses the problem of semi-supervised learning of image classifiers. Our main point is that the field of semi-supervised learning can benefit from the rapidly developing field of self-supervised visual representation learning. Combining these two approaches, we propose the framework of self-supervised semi-supervised learning (S4L), and use it to derive two new semi-supervised image classification methods. We demonstrate the effectiveness of these methods compared to carefully tuned baselines and existing semi-supervised learning methods. We then demonstrate that S4L and existing semi-supervised methods can be jointly trained, leading to state-of-the-art results on semi-supervised ILSVRC-2012 with 10% labels.

Introduction

Modern computer vision systems have shown outstanding performance on various challenging computer vision benchmarks, such as image recognition [31], object detection [20], semantic image segmentation [7], etc. Their success relies on the availability of large amounts of annotated data, which is time-consuming and expensive to acquire. Moreover, the applicability of such systems is often limited within the bounds defined by the datasets on which they are trained. Many real-world computer vision applications involve visual categories that do not exist in standard benchmark datasets, or are of a dynamic nature, where the visual category or its appearance may change over time. Unfortunately, building large labeled datasets for all these cases is practically infeasible. Therefore, it is an important research challenge to devise a learning method that can successfully learn to recognize new concepts using only a small number of labeled examples. The fact that people can quickly understand new concepts after seeing only a few (labeled) examples shows that this goal is in principle achievable.

Notably, a large body of research is dedicated to learning from unlabeled data, which in many practical applications is much easier to obtain than labeled data. Among these efforts, the field of self-supervised visual representation learning has recently demonstrated the most promising results [15]. Self-supervised learning techniques define pretext tasks that can be formulated using only unlabeled data, but that do require higher-level semantic understanding to solve. As a result, models trained to solve these pretext tasks learn representations that can be used to solve other downstream tasks of interest, such as image recognition.

Despite demonstrating encouraging results [15], purely self-supervised techniques learn visual representations that are significantly worse than those learned by fully supervised techniques. Their practical applicability is therefore limited, and self-supervision alone is not enough so far. We hypothesize that self-supervised learning techniques can benefit significantly from a small number of labeled examples. By investigating various ways of bridging self-supervised learning and semi-supervised learning, we propose a framework of semi-supervised losses arising from self-supervised learning objectives. We call this framework self-supervised semi-supervised learning, or S4L for short; the resulting techniques can be seen as new semi-supervised learning techniques for natural images. Figure 1 illustrates the idea of the proposed S4L techniques. We evaluate our models both in the semi-supervised setting and in the transfer setting typically used to evaluate self-supervised performance. Furthermore, we design strong baselines for benchmarking methods that learn with only 10% or 1% of the labels in ILSVRC-2012.

We further investigate experimentally whether our S4L methods can additionally benefit from the regularization techniques proposed in the semi-supervised literature, and find that they are complementary: combining them leads to improved results. Our main contributions can be summarized as follows:
• We propose a new family of techniques for semi-supervised learning with natural images that leverage recent advances in self-supervised representation learning.
• We demonstrate that the proposed self-supervised semi-supervised (S4L) techniques outperform carefully tuned baselines trained without unlabeled data, and achieve performance competitive with previously proposed semi-supervised learning techniques.
• We further demonstrate that by combining our best S4L methods with existing semi-supervised techniques, we achieve state-of-the-art performance on the semi-supervised ILSVRC-2012 benchmark.

Related Work

In this work, we build on the current state-of-the-art in both semi-supervised learning and self-supervised learning. Therefore, in this section we review the most relevant developments in these areas.

Semi-Supervised Learning

Semi-supervised learning describes a class of algorithms that attempt to learn from both unlabeled and labeled samples, which are typically assumed to be sampled from the same or similar distributions. Approaches differ in which information they derive from the structure of the unlabeled data.

Given the wide variety of semi-supervised learning techniques proposed in the literature, we refer to [3] for an extensive survey; here we focus on recent developments based on deep neural networks. The standard protocol for evaluating semi-supervised learning algorithms works as follows: (1) take a standard labeled dataset; (2) keep only a fraction of the labels (say, 10%); (3) treat the remaining data as unlabeled. Although this approach may not reflect practical settings for semi-supervised learning [27], it remains the standard evaluation protocol, and we follow it in this work.

Many early results on semi-supervised learning for deep neural networks were based on generative models, such as denoising autoencoders [30], variational autoencoders [14] and generative adversarial networks [26, 32]. More recently, it was shown that adding a consistency regularization loss computed on unlabeled data improves over standard baselines. These consistency regularization losses measure the discrepancy between predictions made on perturbed versions of unlabeled data points. Further improvements have been obtained by smoothing the predictions before measuring these discrepancies. Methods in this family include the Π-model [17], temporal ensembling [17], Mean Teacher [38] and virtual adversarial training (VAT) [21]. More recently, fast-SWA [1] showed improved results by training with cyclic learning rates and measuring the discrepancy against an ensemble of predictions from multiple checkpoints. By minimizing the consistency loss, these models implicitly push the decision boundary away from dense regions of the unlabeled data, which could explain their success on typical image classification datasets, where points in each cluster usually share the same class.

Two other important approaches to semi-supervised learning, which have shown success both with deep neural networks and with other types of models, are pseudo-labeling [18], where a model trained only on labeled data is used to assign approximate (pseudo) labels to the unlabeled data, and conditional entropy minimization [10], which encourages the model to make confident predictions on all unlabeled examples. Semi-supervised learning algorithms are usually evaluated on small-scale datasets such as CIFAR-10 [16] and SVHN [22]; we are aware of only few examples in the literature that evaluate semi-supervised learning algorithms on larger, more challenging datasets such as ILSVRC-2012 [31]. To the best of our knowledge, Mean Teacher [38] currently holds the state-of-the-art result on ILSVRC-2012 with only 10% of the labels.
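For concreteness, here is a minimal sketch (not from the paper) of these two simple unlabeled losses in PyTorch; `logits`, `logits_a` and `logits_b` are assumed to be class logits produced by a model on unlabeled images and on two perturbed views, respectively.

```python
# Illustrative sketches of two unlabeled losses mentioned above:
# conditional entropy minimization [10] and a generic consistency loss.
import torch.nn.functional as F

def entropy_minimization(logits):
    # Encourage confident predictions on unlabeled examples: -sum_y p(y|x) log p(y|x).
    p = F.softmax(logits, dim=1)
    return -(p * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def consistency(logits_a, logits_b):
    # Penalize disagreement between predictions for two perturbed views
    # of the same unlabeled images (as in the Pi-model family).
    return F.mse_loss(F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1))
```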

Self-Supervised Learning

Self-supervised learning is a general learning framework that relies on pretext ("surrogate") tasks that can be formulated using only unsupervised data. A pretext task is designed in such a way that solving it requires learning a useful image representation. Self-supervised techniques have been applied in a wide variety of computer vision contexts [13, 34, 6, 28, 33]. In this paper, we employ self-supervised learning techniques that learn useful visual representations from image datasets; these techniques achieve state-of-the-art performance among methods that learn visual representations from unlabeled images alone. Below we provide a non-exhaustive summary of the most important developments in this direction. Doersch et al. proposed to train a CNN model that predicts the relative position of two randomly sampled, non-overlapping image patches [4]. Subsequent literature [23, 25] generalized this idea to predicting the permutation of multiple randomly sampled and shuffled patches.

In addition to the aforementioned patch-based methods, there are self-supervised techniques that use image-level losses. Among them, [39] proposes to use grayscale image colorization as a pretext task. Another example is the pretext task of [9], which predicts the angle of a rotation transformation applied to an input image. Some techniques go beyond solving surrogate classification tasks and enforce constraints on the representation space directly. A prominent example is the exemplar loss of [5], which encourages the model to learn representations that are invariant to heavy image augmentations. Another example is [24], which enforces an additivity constraint on visual representations: the sum of the representations of all image patches should be close to the representation of the whole image. Finally, [2] proposes a learning procedure that alternates between clustering images in a representation space and learning a model that assigns images to their clusters.

Method

In this section, we introduce our self-supervised semi-supervised learning (S4L) techniques. We first give a general description of our approach, and afterwards present concrete instantiations of the method. We focus on the problem of semi-supervised image classification. Formally, we assume an (unknown) data-generating joint distribution p(X, Y) over images and labels. The learning algorithm has access to a labeled training set Dl, sampled i.i.d. from p(X, Y), and an unlabeled training set Du, sampled i.i.d. from the marginal distribution p(X). The semi-supervised methods we consider in this paper have learning objectives of the following form:

$$\min_\theta \; \mathcal{L}_{sup}(D_l, \theta) + w \, \mathcal{L}_{self}(D_u, \theta)$$

where Lsup is the standard supervised classification loss, Lself is a loss defined on unsupervised images, w is a non-negative weight, and θ are the parameters of the model fθ(·).

We now describe our self-supervised semi-supervised learning techniques in detail. For simplicity, we present our algorithms in the context of multiclass image recognition, although they can be easily generalized to other scenarios such as dense image segmentation. Note that in practice, the objective above is optimized using stochastic gradient descent (or a variant), updating the parameters θ with mini-batches of data. In this case, the sizes of the supervised mini-batch xl, yl ⊂ Dl and of the unsupervised mini-batch xu ⊂ Du can be chosen arbitrarily; in our experiments we always use the simplest option of equally sized mini-batches. We can also choose whether to include the mini-batch xl in the self-supervised loss, i.e. to apply Lself to the union of xu and xl; we investigate the effect of this choice experimentally in Section 4.4. We demonstrate our framework with two prominent self-supervised techniques: predicting image rotations [9] and exemplar [5]. Note that more self-supervised losses can be explored within our framework in the future.
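The following is a minimal PyTorch sketch of one such optimization step, assuming a `model` that returns class logits and a `self_supervised_loss` callable such as those sketched below; it illustrates the objective and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def s4l_step(model, optimizer, x_l, y_l, x_u, self_supervised_loss,
             w=1.0, include_labeled=True):
    """One step on Lsup(xl, yl) + w * Lself(xu [plus xl])."""
    optimizer.zero_grad()
    # Supervised cross-entropy on the labeled mini-batch.
    sup = F.cross_entropy(model(x_l), y_l)
    # Self-supervised loss on the unlabeled mini-batch, optionally
    # also including the labeled images (the choice studied in Section 4.4).
    x_self = torch.cat([x_u, x_l]) if include_labeled else x_u
    self_loss = self_supervised_loss(model, x_self)
    loss = sup + w * self_loss
    loss.backward()
    optimizer.step()
    return float(loss)
```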

S4L-Rotation

The self-supervised loss is the rotation prediction loss of [9]:

$$\mathcal{L}_{rot} = \frac{1}{|R|} \sum_{x \in \mathcal{D}} \sum_{r \in R} \mathcal{L}\big(f_\theta(x^r), r\big)$$

where R is the set of the 4 rotations {0°, 90°, 180°, 270°}, x^r is the image x rotated by r, fθ(·) is the model with parameters θ, and L is the cross-entropy loss. This leads to a 4-class classification problem. Following the suggestion of [9], in a single optimization step we always apply and predict all four rotations for every image in the mini-batch. We also apply the self-supervised loss to the labeled images in each mini-batch. Since in this case we process rotated labeled images anyway, we propose to also apply the classification loss to these images. This can be seen as an additional way to regularize the model in the regime where few labeled images are available. We evaluate the effect of this choice in Section 4.4.
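As an illustration, a rotation loss along these lines could be implemented as follows; `model.features` and `model.rotation_head` are hypothetical module names for the shared backbone and the 4-way rotation classifier, not names from the paper.

```python
import torch
import torch.nn.functional as F

def rotation_loss(model, x):
    """4-class cross-entropy over all four rotations of each image (NCHW)."""
    batch = x.size(0)
    # Apply all four rotations to every image in the mini-batch.
    rotated = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
    # Rotation targets: 0 for the first batch of copies, 1 for the next, ...
    targets = torch.arange(4, device=x.device).repeat_interleave(batch)
    logits = model.rotation_head(model.features(rotated))
    return F.cross_entropy(logits, targets)
```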

S4L-Exemplar

The idea of exemplar self-supervision [5] is to learn a visual representation that is invariant to a wide range of image transformations. Specifically, we use "Inception" cropping [37], random horizontal mirroring, and HSV-space color randomization (as described in [5]) to produce 8 different instances of each image in a mini-batch. Following [15], we implement Lself as the batch hard triplet loss [12] with a soft margin. This encourages transformations of the same image to have similar representations and, conversely, transformations of different images to have dissimilar representations. As in the case of rotation self-supervision, Lsup is applied to all eight instances of each image.

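A sketch of this loss, under assumptions: `embed` maps images to embeddings, `augment` stands in for the Inception-crop/mirror/HSV pipeline (identity here for brevity), and each original image acts as its own "class" across its 8 copies.

```python
import torch
import torch.nn.functional as F

def exemplar_loss(embed, x, num_copies=8, augment=lambda t: t):
    batch = x.size(0)
    # 8 augmented instances of each image; copies of image i share id i.
    copies = torch.cat([augment(x) for _ in range(num_copies)])
    ids = torch.arange(batch, device=x.device).repeat(num_copies)
    z = embed(copies)                               # (num_copies * batch, d)
    dist = torch.cdist(z, z)                        # pairwise Euclidean distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)
    # Batch-hard mining: farthest positive and closest negative per anchor.
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    # Soft-margin triplet loss [12]: log(1 + exp(d_pos - d_neg)).
    return F.softplus(d_pos - d_neg).mean()
```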

Question: what exactly is the self-supervised loss Lself here?

In the following sections, we compare S4L against leading semi-supervised learning algorithms that are not based on self-supervised objectives. We now describe these methods. Our objective above also covers such semi-supervised learning methods, with Lself replaced by a standard semi-supervised loss as described below. Virtual Adversarial Training (VAT) [21]: the idea is to make the predictions robust against local perturbations in the vicinity of input data points. In effect, it approximates the maximal change in predictions within an ε-vicinity of unlabeled data points, where ε is a hyperparameter. Specifically, the VAT loss for a model fθ is:

$$\mathcal{L}_{vat}(x, \theta) = \mathrm{KL}\big(f_\theta(x)\,\|\,f_\theta(x + \Delta x)\big), \qquad \Delta x = \arg\max_{\|\delta\| \le \epsilon} \mathrm{KL}\big(f_\theta(x)\,\|\,f_\theta(x + \delta)\big)$$
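A condensed sketch of this loss using one step of power iteration to approximate Δx, following common VAT implementations rather than the reference code; the hyperparameter names `epsilon` and `xi` and the NCHW input layout are assumptions.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, epsilon=1.0, xi=1e-6):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)      # clean predictions, held constant
    # Random unit direction per example.
    d = torch.randn_like(x)
    d = d / d.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    d.requires_grad_(True)
    # One power-iteration step: gradient of the KL w.r.t. the perturbation.
    kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p, reduction='batchmean')
    grad = torch.autograd.grad(kl, d)[0]
    # Adversarial perturbation of norm epsilon.
    r_adv = epsilon * grad / grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1).clamp_min(1e-12)
    # KL between clean predictions and predictions at the adversarial point.
    return F.kl_div(F.log_softmax(model(x + r_adv.detach()), dim=1), p,
                    reduction='batchmean')
```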


ILSVRC-2012 Experiments and Results

In this section, we present the results of our main experiments. We use the ILSVRC-2012 dataset, which is widely used in the self-supervised learning literature and also lets us observe how well semi-supervised methods scale. Since the test set of ILSVRC-2012 is not available and numbers on the validation set are usually reported in the literature, we performed all hyperparameter selection for all models on a custom train/validation split of the public training set. This custom split contains 1 231 121 training and 50 046 validation images. We then retrain the model with the best hyperparameters on the full training set (1 281 167 images), possibly with fewer labels, and report final results obtained on the public validation set (50 000 images).

We follow standard practice [38, 29] and conduct experiments where labels are available for only 10% of the dataset. Note that 10% of ILSVRC-2012 still corresponds to roughly 128 000 labeled images, and that previous work used the full (public) validation set for model selection. Although we use a custom validation set extracted from the training set, using such a large validation set is not realistic in practice, as discussed in [30, 38, 27], and we want our evaluation to also cover a more realistic case. We therefore additionally conduct experiments with 1% labeled examples (about 13 000 labeled images) while using a validation set of only 5 000 images; we analyze the effect of the validation set size in Section 7.

We always define epochs in terms of the available labeled data, i.e. one epoch corresponds to one full pass over the labeled data, regardless of how many unlabeled examples have been seen. Unless otherwise stated, we optimize the models using stochastic gradient descent with momentum and a mini-batch size of 256. We tune the learning rate and keep the momentum fixed at 0.9 in all experiments. Table 1 summarizes our main results.

We found that for both S4L-Rotation and S4L-Exemplar, a self-supervised loss weight of w = 1 works best (though not by a large margin), and that the optimal weight decay and learning rate are the same as for the supervised baseline.

As described in Section 3.1, we apply the self-supervised loss to both labeled and unlabeled images. Additionally, both rotation and exemplar self-supervision generate 8 copies of each image, and we apply the supervised loss to all copies of the labeled images. To investigate these choices, we conduct an ablation study on S4L-Rotation and find that whether the self-supervised loss Lself is also applied to labeled images has no significant effect. On the other hand, applying the supervised loss Lsup to the augmented images generated by self-supervision does improve performance by almost 1%. This also makes it possible to use multiple transformed copies of an image (e.g. the four rotations) at inference time and average their predictions. While averaging the four rotations' predictions improves accuracy by 1% to 2%, our reported results do not make use of this, in order to keep the comparison fair.
The results in Table 1 show that our proposed self-supervised semi-supervised learning methods are indeed effective for the two self-supervised objectives we tried. We expect that analogous methods can be designed for other self-supervised objectives.

Using the above model, we assign pseudo-labels to the entire dataset by averaging predictions over five crops and four rotations per image. We then train the same network again in exactly the same way (i.e. with all losses), with the following three differences: (1) the network is initialized with the weights obtained in the first step; (2) every example now has a label, namely its pseudo-label; (3) one epoch therefore corresponds to a pass over the full dataset, so we train for 18 epochs and drop the learning rate after 6 and 12 epochs.
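A hypothetical sketch of this pseudo-labeling step; `five_crops` is an assumed helper returning the five standard crops of a batch, not a function from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(model, x, five_crops):
    """Average softmax predictions over 5 crops x 4 rotations, then argmax."""
    probs = 0.0
    for crop in five_crops(x):
        for k in range(4):                  # the four rotations
            probs = probs + F.softmax(model(torch.rot90(crop, k, dims=(2, 3))), dim=1)
    return probs.argmax(dim=1)              # pseudo-label per image
```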

The current standard practice in semi-supervised learning is to train on a large dataset using a subset of the labels, but to still use scores obtained on the full validation set for model selection. Worse, for ILSVRC-2012, the same validation set is sometimes used both to select hyperparameters and to report the final performance. Recall that we avoid this by performing all hyperparameter selection on a custom validation set held out from the training set. Still, having a large labeled validation set at hand contradicts the promised practicality of semi-supervised learning, which is precisely about having only few labeled examples. This fact has been acknowledged in [30] but has been all but ignored in the semi-supervised literature. Oliver et al. [27] questioned the feasibility of tuning on a small validation set by comparing model accuracies estimated on small validation sets. They found that the variance of the estimated accuracy gap between two models can be larger than the actual gap between those models, suggesting that using a small validation set for model selection may not be feasible. That said, they did not empirically assess whether a close-to-optimal model can actually be found using a small validation set, especially when selecting hyperparameters for a particular semi-supervised method. We now describe our analysis of this important question.
We examine a number of models trained on ILSVRC-2012 with 1% of the labels. For each model, we compute validation scores on a validation set of 1 000 labeled images (i.e. one labeled image per class) and of 5 000 labeled images (i.e. five labeled images per class), and compare these scores to those on the "full-size" validation set of 50 046 labeled images. The results, shown in Figure 3, are striking: there is a very strong correlation between performance on the tiny and the full validation sets. In particular, despite higher variance in some regions, the best-performing hyperparameters can be identified in either case. Most notably, the best models tuned on the small validation sets are also the best models tuned on the large validation set. We therefore conclude that a small validation set is sufficient for selecting the hyperparameters of our models.

Discussion

In this paper, we bridge the gap between self-supervised methods and semi-supervised learning by proposing a framework (S4L) that can be used to turn any self-supervised method into a semi-supervised learning algorithm. We instantiate two such methods, S4L-Rotation and S4L-Exemplar, and show that they are competitive with methods from the semi-supervised literature on the challenging ILSVRC-2012 dataset. We further demonstrate that S4L methods are complementary to existing semi-supervised techniques, and that the combination we propose, MOAM, leads to state-of-the-art performance. While all the methods we study show promising results for learning with 10% of the labels on ILSVRC-2012, the situation is less clear when only 1% of the labels is used. In this low-data regime, with only about 13 labeled examples per class, the setting almost fades into the few-shot scenario and may require a very different set of methods to reach better performance. Nonetheless, we hope this work inspires researchers in self-supervision to consider extending their methods into semi-supervised ones using our S4L framework, and researchers in semi-supervised learning to draw inspiration from the large number of recently proposed self-supervised methods.
