Supervised Contrastive Learning

This paper is quite interesting: it proposes a supervised contrastive learning framework built on the contrastive methods that dominate self-supervised learning.

Introduction

The cross-entropy loss has shortcomings such as poor robustness to noisy labels and weak margins, yet none of its many proposed alternatives has universally replaced it. This work proposes a supervised loss that brings the machinery of contrastive self-supervised learning to bear on label information: instances from the same class are pulled together in embedding space while instances from different classes are pushed apart. The figure makes this clear:
[Figure: self-supervised vs. supervised contrastive embedding spaces]
The left side shows traditional self-supervision: samples from the same class can end up spread across the embedding space, even though ideally similar samples should lie close together. This is the motivation for the method.
The selection of positive and negative instances in supervised contrastive learning also differs: positives are drawn from other samples of the same class as the anchor, not just from augmentations of the anchor as in self-supervised learning. Each anchor thus has multiple positives and multiple negatives, eliminating the need for careful negative mining. The authors also analyze the effects of two variants of the contrastive loss.

Method

Given a batch of input data, data augmentation is first applied twice to obtain two copies of the batch. Both copies are forward-propagated through the encoder network to obtain a 2048-dimensional normalized embedding. During training, this representation is further propagated through a projection network, which is discarded at inference time; the supervised contrastive loss is computed on the projection network's output. The difference from standard supervised learning is that a stage for optimizing the contrastive loss is inserted before the classification head. The difference from self-supervised learning is that positives and negatives are distinguished by whether they share the anchor's class. As shown below:
[Figure: supervised contrastive training pipeline]

Data Augmentation:

For each input sample $x$, generate two random augmentations $\tilde{x} = \mathrm{Aug}(x)$, each representing a different view of the data.
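As a toy numpy sketch (the flip-plus-noise `aug` below is a hypothetical stand-in for the paper's crop/flip/color-jitter pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def aug(x, rng):
    # Toy augmentation: random horizontal flip plus small Gaussian noise.
    out = x[:, ::-1] if rng.random() < 0.5 else x
    return out + 0.01 * rng.standard_normal(out.shape)

x = rng.standard_normal((32, 32))   # one input "image"
x1, x2 = aug(x, rng), aug(x, rng)   # two independent views of the same sample
```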

Encoder Network:

This module encodes $x$ into a representation vector $r = \mathrm{Enc}(x)$. The two augmented samples are fed separately through the same encoder, producing a pair of representation vectors.

Projection Network:

Map the representation vector $r$ into a projection space: $z = \mathrm{Proj}(r)$.
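A minimal numpy sketch of the $r = \mathrm{Enc}(x)$, $z = \mathrm{Proj}(r)$ pipeline, with toy linear layers standing in for the paper's ResNet encoder and MLP projection head (all weights and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_REP, D_PROJ = 3072, 2048, 128   # flattened input, representation, projection dims

W_enc = 0.01 * rng.standard_normal((D_IN, D_REP))     # stand-in for a ResNet
W_proj = 0.01 * rng.standard_normal((D_REP, D_PROJ))  # stand-in for an MLP head

def enc(x):
    return x @ W_enc                     # r = Enc(x)

def proj(r):
    z = np.maximum(r, 0) @ W_proj        # z = Proj(r)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit hypersphere

x = rng.standard_normal((4, D_IN))       # a toy batch
z = proj(enc(x))                         # projections fed to the contrastive loss
```

At inference time only `enc` would be kept; `proj` is discarded, as described above.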

Loss function

For a batch of $N$ sample/label pairs $\{x_k, y_k\}_{k=1 \ldots N}$, training expands it to $2N$ augmented pairs $\{\tilde{x}_{2k-1}, \tilde{y}_{2k-1}\}$ and $\{\tilde{x}_{2k}, \tilde{y}_{2k}\}$ for $k = 1 \ldots N$, where $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$ are two random augmentations of $x_k$ and $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$. We refer to such an augmented batch as a multiview batch.
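Building a multiview batch can be sketched in numpy as follows (the interleaving convention and the noise-based `aug` are illustrative choices, not the paper's exact pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def aug(x, rng):
    # Toy augmentation stand-in: small additive noise.
    return x + 0.01 * rng.standard_normal(x.shape)

N = 4
x = rng.standard_normal((N, 8))          # N samples
y = np.array([0, 1, 0, 1])               # their labels
# Interleave two views so rows 2k and 2k+1 are the two augmentations of x[k].
x_mv = np.empty((2 * N, 8))
x_mv[0::2], x_mv[1::2] = aug(x, rng), aug(x, rng)
y_mv = np.repeat(y, 2)                   # each label appears twice
```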
Within a multiview batch, let $i \in I \equiv \{1 \ldots 2N\}$ index an arbitrary augmented sample.

$$\mathcal{L}_{\text{out}}^{\text{sup}} = \sum_{i \in I} \mathcal{L}_{\text{out},i}^{\text{sup}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(z_i \cdot z_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)}$$

$$\mathcal{L}_{\text{in}}^{\text{sup}} = \sum_{i \in I} \mathcal{L}_{\text{in},i}^{\text{sup}} = \sum_{i \in I} -\log \left\{ \frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp\left(z_i \cdot z_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)} \right\}$$
The above two equations are the two losses proposed by the authors, where $A(i) \equiv I \setminus \{i\}$ is the set of all samples in the multiview batch other than the anchor $i$, $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ is the set of positives sharing the anchor's label, and $\tau$ is a temperature hyperparameter. The meaning of both formulas is very intuitive: in the projection space, the anchor is pulled toward its positives and pushed away from its negatives. The only difference is the position of the $\log$, so the authors also compare the two losses:
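The $\mathcal{L}_{\text{out}}^{\text{sup}}$ formula can be implemented directly; here is a small numpy reference sketch (the function name and the random batch below are my own, not from the paper):

```python
import numpy as np

def supcon_loss_out(z, labels, tau=0.1):
    """L_out^sup over a multiview batch: z is (2N, d) L2-normalized, labels is (2N,)."""
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Similarities z_i . z_a / tau, with the anchor itself excluded from A(i).
    sim = np.where(self_mask, -np.inf, z @ z.T / tau)
    # Log-softmax over A(i) for every anchor i.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # P(i): same-label samples other than the anchor.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability over P(i), summed over anchors, sign flipped.
    return -(np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)).sum()

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
labels = np.repeat(np.arange(2), 4)      # two classes, four views each
loss = supcon_loss_out(z, labels)
```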
[Figure: comparison of $\mathcal{L}_{\text{out}}^{\text{sup}}$ and $\mathcal{L}_{\text{in}}^{\text{sup}}$]
$\mathcal{L}_{\text{out}}^{\text{sup}}$ is clearly better than $\mathcal{L}_{\text{in}}^{\text{sup}}$. The authors attribute this to the positive-normalization term $1/|P(i)|$. In $\mathcal{L}_{\text{in}}^{\text{sup}}$ this term sits inside the $\log$, so it only shifts the loss by an additive constant; in $\mathcal{L}_{\text{out}}^{\text{sup}}$ it sits outside the $\log$, where it properly scales the gradient. Without this normalization effect, the gradient of $\mathcal{L}_{\text{in}}^{\text{sup}}$ is more susceptible to bias from the positives, leading to suboptimal training.
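The ordering of the two losses can also be checked numerically: by Jensen's inequality the log of a mean is at least the mean of the log, so per anchor $\mathcal{L}_{\text{in},i}^{\text{sup}} \leq \mathcal{L}_{\text{out},i}^{\text{sup}}$. A small numpy sketch, with random unit vectors standing in for real projections:

```python
import numpy as np

def per_anchor_losses(z, labels, tau=0.1):
    """Return (L_out_i, L_in_i) for every anchor i in a multiview batch."""
    n = len(labels)
    sim = np.where(np.eye(n, dtype=bool), -np.inf, z @ z.T / tau)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & np.isfinite(sim)
    mean_log = np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)
    mean_p = np.where(pos, np.exp(log_prob), 0.0).sum(axis=1) / pos.sum(axis=1)
    # L_out averages the log; L_in takes the log of the average.
    return -mean_log, -np.log(mean_p)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
labels = np.repeat(np.arange(2), 4)      # |P(i)| = 3 for every anchor
l_out, l_in = per_anchor_losses(z, labels)
assert np.all(l_in <= l_out + 1e-9)      # Jensen: L_in never exceeds L_out
```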

Experiments

[Figure: ImageNet classification results]
Because labels are used, the main purpose is to compare against cross-entropy: robustness to data corruption, along with ablations over different networks and different data augmentations on ImageNet:
[Figure: robustness and ablation results]
At first glance it seems the gimmick outweighs the substance: the class labels are all known in advance, so what does wrapping them in a contrastive learning framework add over plain cross-entropy? Yet after a simple reimplementation, it does improve results, which is a bit mysterious. If I had to explain it, my personal feeling is that the contrastive learning space yields stronger representations than cross-entropy does.

Origin blog.csdn.net/juggle_gap_horse/article/details/124865181