An introduction to self-supervised learning


Table of contents

1. Definition

2. The significance of self-supervised learning and why it works

1) Utilize prior information existing in nature

2) Coherence between data

3) Data internal structure information

3. Analysis of the two stages of self-supervised learning

3.1 The first stage: pretrain

3.2 The second stage

4. Understanding pretext tasks in the first stage of self-supervised learning

4.1 Blog 1 mentioned two main technical routes: contrastive learning and generative learning

4.1.1 Contrastive Learning

4.1.2 Generative Learning

4.2 Blog 2 mentioned these four pretext tasks

4.2.1 Generation-based methods

4.2.2 Context-based methods

4.2.3 Free semantic label-based methods

4.2.4 Cross-modality-based methods

4.3 Blog 3 also mentioned three types of pretext task methods

5. Understanding the losses in the two stages of self-supervised learning


1. Definition

Self-Supervised Learning sits within the familiar split of machine learning into supervised learning, unsupervised learning and reinforcement learning. Self-supervised learning is a form of unsupervised learning whose main goal is to learn a general feature representation that can serve downstream tasks (Downstream Tasks).

Self-supervised learning mainly uses pretext (auxiliary) tasks to mine supervision signals from large-scale unlabeled data, and trains the network with this constructed supervision so that it learns representations that are valuable for downstream tasks. In other words, self-supervised learning does not require any manually labeled data; the labels are derived from the input data itself. Self-supervised learning still follows the Pretrain-Finetune paradigm: in the first stage, the network is pre-trained on a pretext task, and in the second stage, the learned parameters are transferred to the downstream task network and fine-tuned to obtain the final model.
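
To make the Pretrain-Finetune mode concrete, here is a minimal PyTorch-style skeleton of the two stages. The encoder, data loaders, and pretext loss are hypothetical placeholders rather than any specific method: the first stage trains on unlabeled data with a pretext loss, and the second stage transfers the learned encoder and fine-tunes it with labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical backbone; any encoder architecture could be used here.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())

# Stage 1: task-agnostic pre-training on unlabeled data with a pretext task.
def pretrain(encoder, unlabeled_loader, pretext_loss, epochs=10):
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x in unlabeled_loader:              # no labels: the target comes from x itself
            loss = pretext_loss(encoder, x)     # e.g. a mask-and-reconstruct loss
            opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: transfer the pre-trained weights and fine-tune on a labeled downstream task.
def finetune(encoder, labeled_loader, num_classes, epochs=10):
    head = nn.Linear(256, num_classes)          # new task-specific layer
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for x, y in labeled_loader:
            loss = F.cross_entropy(head(encoder(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
```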

2. The significance of self-supervised learning and why it works

Reducing labeling costs: the performance of a deep neural network largely depends on the capacity of the model and the amount of training data, but labeling large-scale datasets requires a great deal of human and financial resources. A more forward-looking motivation is to explore how future AI systems might learn.

Why does it work? This blog gives some thoughts, as follows:

1) Utilize prior information existing in nature

There are associations between an object's category and its color, between its category and its shape and texture, and between its category and its default orientation; kinematic properties also carry prior information.

2) Coherence between data

Images have spatial coherence; videos have temporal coherence.

3) Data internal structure information

Omitted.

3. Analysis of the two stages of self-supervised learning

Self-supervised learning roughly consists of two training stages. The first stage trains the model to extract general representations, which in formal terms is described as working in a task-agnostic way. The second stage fine-tunes the model on a specific downstream task (a labeled downstream dataset), formally described as working in a task-specific way.

3.1 The first stage: pretrain

The first stage does not involve any downstream task: a large amount of unlabeled data is used for pre-training without any specific task, which in formal terms is called working in a task-agnostic way. This first stage is still a complete training process, just like supervised training, and it also needs two sets of inputs. Although the dataset is unlabeled, researchers found a way to train the initial network: make a copy of the original data X to serve as the label, denoted Y, then apply a mask or another pretext task to the original data X (pseudo-labels can be generated automatically from properties of the data), and feed the transformed data into the network as training input. The network is expected to restore the original data X as completely as possible. The loss at this stage is computed between the data predicted by the network and the copied original data Y. Note: I still have doubts about the choice of loss function at this stage and need to check the literature and code; what I have learned so far is that different losses are chosen for different pretext tasks. To be supplemented later. The figure below is an example of the mask operation.
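
Before the BERT example, here is a minimal sketch of the mask-and-reconstruct step just described. It assumes flattened image-like inputs, a tiny autoencoder, and an MSE reconstruction loss; as noted above, the actual loss depends on the chosen pretext task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in network; real methods use much larger encoders/decoders.
class TinyAutoencoder(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

def masked_reconstruction_step(model, x, mask_ratio=0.3):
    y = x.clone()                              # copy of the original data X serves as the label Y
    mask = torch.rand_like(x) < mask_ratio     # randomly choose positions to mask
    x_masked = x.masked_fill(mask, 0.0)        # the pretext task: hide part of the input
    pred = model(x_masked)                     # the network tries to restore the original X
    return F.mse_loss(pred, y)                 # loss between the prediction and the copy Y

# usage sketch
model = TinyAutoencoder()
loss = masked_reconstruction_step(model, torch.rand(16, 784))
```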

Example of BERT's pretext task:

When training the model, tokens of the input data are randomly covered with a mask, as shown in the figure. The output vector at the position of the covered token then goes through a Linear Transformation followed by a softmax, producing a distribution over the vocabulary, i.e. the probability of each word. BERT does not know that the word covered by the mask is "Bay", but we do, so the loss pushes this output distribution as close as possible to the covered word "Bay".

The pretext task here is essentially fill-in-the-blank. It has nothing to do with downstream tasks and may even look trivial, but through such pretext tasks BERT learns a good language representation and adapts well to downstream tasks.
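
A minimal sketch of the masked-token loss just described (the encoder internals are omitted and the tensors are stand-ins): the output vector at the masked position goes through a linear transformation, softmax gives a distribution over the vocabulary, and cross entropy pulls that distribution towards the covered word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 30000, 768
lm_head = nn.Linear(hidden, vocab_size)          # the Linear Transformation on top of the encoder

# Stand-in for BERT's output: one hidden vector per input token (batch, seq_len, hidden).
token_hidden_states = torch.randn(1, 12, hidden)
masked_position = 5                              # index of the covered ([MASK]) token
true_token_id = torch.tensor([4242])             # hypothetical id of the covered word "Bay"

# Take the output vector at the masked position, map it to vocabulary logits,
# and let cross entropy (log-softmax + NLL) push the distribution towards the true word.
logits = lm_head(token_hidden_states[:, masked_position, :])   # shape (1, vocab_size)
loss = F.cross_entropy(logits, true_token_id)
```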

3.2 The second stage

The second stage mainly fine-tunes the pre-trained network for a specific downstream task. For a segmentation task, for example, we feed the pre-trained network with images and their segmentation labels and make small adjustments to the network. The loss at this stage is mostly cross entropy (the same choice of loss function as in supervised learning). My personal understanding is that this stage is comparable to transfer learning in supervised learning, where the pre-trained weights are fixed and only the last few layers of the network are fine-tuned.
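
A minimal sketch of this fine-tuning view, assuming a pre-trained encoder from the first stage: the pre-trained weights are frozen and only a new task-specific head is trained with cross entropy (the feature dimension and class count are arbitrary assumptions).

```python
import torch
import torch.nn as nn

def build_finetune_model(pretrained_encoder, feat_dim=256, num_classes=10):
    """Freeze the pre-trained encoder and attach a new head for the downstream task."""
    for p in pretrained_encoder.parameters():
        p.requires_grad = False                 # fix the weights learned in the first stage
    head = nn.Linear(feat_dim, num_classes)     # only this new layer is trained
    model = nn.Sequential(pretrained_encoder, head)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()           # same loss choice as in supervised learning
    return model, optimizer, criterion
```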

4. Understanding pretext tasks in the first stage of self-supervised learning

4.1 Blog 1 mentioned two main technical routes: contrastive learning and generative learning

Self-supervised learning can be divided into two main technical routes: contrastive learning and generative learning.

4.1.1 Contrastive Learning

The core idea of contrastive learning is to compare positive and negative samples in the feature space and thereby learn feature representations of the samples; the difficulty lies in how to construct the positive and negative samples. Contrastive learning first learns a general representation of images on an unlabeled dataset, which can then be fine-tuned with a small number of labeled images to boost performance on a given task (e.g., classification). Contrastive representation learning can be thought of as learning representations by comparison: instead of learning a signal from a single data sample at a time, the model learns by comparing different samples, where comparisons are made between positive pairs of "similar" inputs and negative pairs of "dissimilar" inputs. Generative learning, on the other hand, learns a model that maps the input to some (pseudo) labels and then reconstructs the input samples.

Contrastive learning trains the model by simultaneously maximizing the similarity between representations of samples of the same class and minimizing the similarity between representations of samples of different classes. Put simply, the goal is for samples of the same category to have similar representations, so the similarity between their representations is maximized; conversely, for samples of different categories, the similarity between their representations is minimized. Through such contrastive training, the encoder learns higher-level general features of samples (sample-level representations) rather than an attribute (pixel) level generative model (attribute-level generation).
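
As a concrete illustration, here is a minimal sketch of an InfoNCE-style contrastive loss over a batch, assuming two augmented views of each sample form the positive pair and all other samples in the batch act as negatives. This is one common construction of positives and negatives, not the only one; the encoder and augmentations are left abstract.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same samples.
    The matching row in the other view is the positive; every other row is a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z1.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# usage sketch: z1 = encoder(augment(x)); z2 = encoder(augment(x))
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```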

4.1.2 Generative Learning

Generative approaches to self-supervised learning mainly include autoencoders (AE), variational autoencoders (VAE), and generative adversarial networks (GAN). The VAE is a generative model built on the AE, while the GAN has been a popular and effective generative method in recent years. The more recent diffusion models are also a hot topic in generative modeling.
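
As a sketch of the generative route, the following toy VAE (written in PyTorch; the architecture and dimensions are arbitrary assumptions) shows the typical objective: a reconstruction term plus a KL term that pulls the latent distribution towards a standard normal prior. AE and GAN variants differ mainly in the objective they optimize.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Toy VAE: the encoder outputs a mean and log-variance, the decoder reconstructs the input."""
    def __init__(self, dim=784, latent=32):
        super().__init__()
        self.enc_mu = nn.Linear(dim, latent)
        self.enc_logvar = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_term = F.mse_loss(recon, x, reduction="sum")                 # reconstruction error
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return recon_term + kl_term
```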

4.2 Blog 2 mentioned these four pretext tasks

Generation-based, context-based, free semantic label-based, and cross-modality-based, as shown in the figure below.

4.2.1 Generation-based methods:

This type of approach learns visual features by solving pretext tasks that involve image or video generation.

• Image generation: visual features are learned through image generation tasks. Methods of this type include image colorization [18], image super-resolution [15], image inpainting [19], and image generation with generative adversarial networks (GAN) [83], [84].

• Video generation: visual features are learned through video generation tasks. Methods of this type include video generation with GANs [85], [86] and video prediction [37].

4.2.2 Context-based methods:

The design of context-based pretext tasks mainly exploits contextual features of images or videos, such as contextual similarity, spatial structure, and temporal structure.

Contextual similarity: the pretext task is designed based on the contextual similarity between image patches. Methods of this type include methods based on image clustering [34], [44] and methods based on graph constraints [43].

Spatial context structure: pretext tasks are designed according to the spatial relationships between image patches. Methods of this type include image jigsaw puzzles [20], [87], [88], [89], context prediction [41], and geometric transformation recognition [28], [36] (see the rotation-prediction sketch after this list).

Temporal context structure: the temporal order of the video is used as a supervisory signal. A ConvNet is trained to verify whether the input frame sequence is in the correct order [40], [90] or to recognize the order of the frame sequence [39].
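
As a concrete example of the geometric transformation recognition mentioned above, here is a minimal rotation-prediction sketch in PyTorch (the tiny ConvNet is an arbitrary assumption): each image is rotated by 0/90/180/270 degrees, the rotation index serves as a free pseudo-label, and the network is trained with cross entropy to recognize which rotation was applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation index is the pseudo-label."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))       # rotate the H and W axes
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Hypothetical tiny ConvNet predicting which of the 4 rotations was applied.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
x, y = rotation_pretext_batch(torch.randn(8, 3, 32, 32))
loss = F.cross_entropy(net(x), y)
```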

4.2.3 Free semantic label-based methods:

This type of approach trains the network with automatically generated semantic labels. The labels are produced by traditional hard-coded algorithms [50], [51] or by game engines [30]. Examples include moving object segmentation [81], [91], contour detection [30], [47], and relative depth prediction [92].

4.2.4 Cross-modality-based methods:

This type of pretext task trains a ConvNet to verify whether two different channels of the input data correspond to each other. Methods of this type include video and audio correspondence verification [25], [93], RGB stream correspondence verification [24], and self-sensing [94], [95].

4.3 Blog 3 also mentioned three types of pretext task methods

Omitted.

5. Understanding the losses in the two stages of self-supervised learning

Placeholder; to be filled in later.

References:

http://t.csdn.cn/Sf9wg

Self-Supervised Learning Super Detailed Interpretation (4): MoCo Series Interpretation (1) - Zhihu (zhihu.com)

Self-supervised learning review | Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey - Jianshu (jianshu.com)
