[Paper Notes] SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

We simplify recently proposed contrastive self-supervised learning algorithms without requiring
specialized architectures or a memory bank.

  1. Composition of data augmentations plays a critical role in defining effective predictive tasks.
  2. Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.
  3. Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Most mainstream approaches fall into one of two classes: generative or discriminative.

Generative approaches learn to generate or otherwise model pixels in the input space.


Discriminative approaches learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset.

Many such approaches have relied on heuristics to design pretext tasks, which could limit the generality of the learned representations.

Discriminative approaches based on contrastive learning in the latent space have recently shown great promise, achieving state-of-the-art results.

Composition of multiple data augmentation operations is crucial in defining the contrastive prediction tasks that yield effective representations. In addition, unsupervised contrastive learning benefits from stronger data augmentation than supervised learning.

Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.

Representation learning with a contrastive cross-entropy loss benefits from normalized embeddings and an appropriately adjusted temperature parameter.

Contrastive learning benefits from larger batch sizes and longer training compared to its supervised counterpart. Like supervised learning, contrastive learning benefits from deeper and wider networks.

Inspired by recent contrastive learning algorithms (see Section 7 of the paper for an overview), SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.

  • A stochastic data augmentation module transforms any given data example randomly, resulting in two correlated views x̃_i, x̃_j of the same example, which we consider a positive pair. We apply three simple augmentations in sequence: random cropping followed by resize back to the original size, random color distortion, and random Gaussian blur. As shown in Section 3 of the paper, random cropping and random color distortion are crucial for achieving a good result.

  • A neural network base encoder f(·) that extracts representation vectors from augmented data examples. The framework allows various choices of network architecture without any constraints. We opt for simplicity and adopt the commonly used ResNet, obtaining h_i = f(x̃_i) = ResNet(x̃_i), where h_i ∈ R^d is the output after the average pooling layer. (See the model sketch after this list.)

  • A small neural network projection head g(·) that maps representations into the space where the contrastive loss is applied, z_i = g(h_i).

    The contrastive prediction task is defined on pairs of augmented examples derived from a minibatch of N examples, resulting in 2N data points. Negative examples are not sampled explicitly; instead, given a positive pair, the other 2(N − 1) augmented examples in the minibatch are treated as negatives. The loss for a positive pair (i, j) is defined as
    ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) ],
    where sim(u, v) = uᵀv / (‖u‖ ‖v‖) is cosine similarity and τ denotes a temperature parameter. The final loss, NT-Xent (the normalized temperature-scaled cross-entropy loss), is computed across all positive pairs, both (i, j) and (j, i), in the minibatch:
    L = (1 / 2N) Σ_{k=1}^{N} [ ℓ(2k−1, 2k) + ℓ(2k, 2k−1) ].
    (A code sketch of this loss follows this list.)
    Training with a large batch size may be unstable when using standard SGD/Momentum with linear learning rate scaling. To stabilize the training, we use the LARS optimizer (You et al., 2017) for all batch sizes.
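To make the base encoder f(·) and projection head g(·) described above concrete, here is a rough sketch assuming PyTorch/torchvision. The 2048-dimensional hidden layer and 128-dimensional output follow the ResNet-50 setup described later in the experimental settings, but treat the exact layer shapes as assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision

class SimCLRModel(nn.Module):
    """ResNet-50 base encoder f(.) plus a 2-layer MLP projection head g(.)."""

    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet50()     # trained from scratch, no pretrained weights
        feat_dim = backbone.fc.in_features           # 2048 for ResNet-50
        backbone.fc = nn.Identity()                  # h_i is the average-pooled feature
        self.encoder = backbone
        self.projector = nn.Sequential(              # z_i = g(h_i), used only by the contrastive loss
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)      # representation used for downstream tasks
        z = self.projector(h)    # projection fed to the contrastive loss
        return h, z
```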
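And a minimal sketch of the NT-Xent loss defined above, assuming PyTorch: it l2-normalizes the projections, masks out self-similarity, and treats the other 2(N − 1) in-batch examples as negatives. This is one common way to implement the loss, not necessarily the authors' exact code; the default temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss for a batch of N positive pairs (2N augmented views in total).

    z_i, z_j: [N, d] projections of the two views of the same images.
    """
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # l2-normalized embeddings, shape [2N, d]
    sim = z @ z.t() / temperature                          # [2N, 2N] cosine similarities scaled by 1/tau

    # Mask out self-similarity so each anchor is compared against the other 2N - 1 examples,
    # of which 2(N - 1) act as negatives.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))

    # For anchor k in the first half the positive sits at index k + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # averaged over all 2N anchors
```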

In data-parallel distributed training, the mean and variance of BN are typically aggregated locally on each device. In our contrastive learning setting, since positive pairs are computed on the same device, the model can exploit this local information leakage to improve prediction accuracy without improving its representations. We address this issue by aggregating BN mean and variance over all devices during training (global BN).
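In PyTorch, one common way to obtain this "global BN" behavior in data-parallel training is to convert BatchNorm layers to SyncBatchNorm, which synchronizes BN statistics across devices. A minimal sketch, assuming torch.distributed has already been initialized and reusing the hypothetical SimCLRModel from the sketch above; this is the usual distributed-PyTorch recipe, not necessarily the authors' exact setup.

```python
import torch
import torch.nn as nn

# Assumes torch.distributed is initialized and each process owns one GPU.
model = SimCLRModel().cuda()
# Replace every BatchNorm with SyncBatchNorm so mean/variance are aggregated
# across all devices, closing the local-statistics leakage described above.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = nn.parallel.DistributedDataParallel(model, device_ids=[torch.cuda.current_device()])
```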

Experimental settings

We use ResNet-50 as the base encoder network and a 2-layer MLP projection head to project the representation into a 128-dimensional latent space. As the loss, we use NT-Xent, optimized using LARS with a learning rate of 4.8 (= 0.3 × BatchSize/256) and weight decay of 10^-6. We train at batch size 4096 for 100 epochs, using linear warmup for the first 10 epochs and decaying the learning rate with a cosine decay schedule without restarts.
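For clarity, here is a small plain-Python sketch of the learning-rate rule just described: linear scaling (0.3 × BatchSize/256, i.e. 4.8 at batch size 4096), linear warmup for 10 epochs, then cosine decay without restarts. The function name and signature are only for illustration.

```python
import math

def simclr_lr(epoch, total_epochs=100, warmup_epochs=10, batch_size=4096, base_lr=0.3):
    """Learning rate at a given (possibly fractional) epoch."""
    peak_lr = base_lr * batch_size / 256               # = 4.8 at batch size 4096
    if epoch < warmup_epochs:                          # linear warmup for the first 10 epochs
        return peak_lr * epoch / warmup_epochs
    # cosine decay (no restarts) over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. simclr_lr(0) == 0.0, simclr_lr(10) == 4.8, simclr_lr(100) == 0.0
```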

Three families of data augmentation:
  • spatial/geometric transformations
  • appearance transformations (color distortion: brightness, contrast, saturation)
  • filters (Gaussian blur, Sobel filter)

Our experiments show that unsupervised contrastive learning benefits from stronger data augmentation than supervised learning. Although previous work has reported that data augmentation is useful for self-supervised learning, we show that augmentations which yield no accuracy advantage in supervised learning can still be of great benefit to contrastive learning.

4.1 Unsupervised Contrastive Learning Benefits from Larger Models (Depth and Width)
We find that as the model size increases, the gap between a supervised model and a linear classifier trained on unsupervised representations shrinks, suggesting that unsupervised learning benefits more from larger models than supervised learning does.

5.1 The normalized cross-entropy loss with an adjustable temperature works better than the alternatives.
5.2 Contrastive learning benefits from a larger batch size and a longer training process.
When the number of training epochs is small (e.g., 100), a larger batch size has a clear advantage; as the number of training epochs increases, the gap shrinks or even disappears. In contrastive learning, a larger batch size provides more negative samples and speeds up convergence (fewer epochs are needed to reach a given accuracy). Similarly, training for longer also provides more negative samples.

What are pretext tasks

The term "pretext task" is sometimes also rendered as "proxy task" or "surrogate task". A pretext task is not solved for its own sake; its purpose is to learn a good representation.

  • The pretext task is not the task we ultimately care about; it is not the final objective of training.
  • Although it is not the final objective, solving it pushes the network to learn representations that help the real task and lead to better results.

The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.


Origin blog.csdn.net/baidu_41810561/article/details/123559756