Self-Supervised Learning (SSL)

Table of contents

Supervised, unsupervised, weakly supervised, semi-supervised, self-supervised learning

Main idea

Pretext Tasks

View Prediction (Cross-modal-based)

Downstream tasks

Analyzing the effectiveness of pretext tasks

Visualize Kernels and Feature Maps

Nearest neighbor retrieval

Choosing pretext tasks

Main methods

Mainstream classification

Generative methods

Contrastive methods

In 2022, where will pre-training go?

Supervised, unsupervised, weakly supervised, semi-supervised, self-supervised learning

Supervised: trained with labeled data.

Unsupervised: trained with unlabeled data.

Weakly supervised: trained with labeled data whose labels are noisy or imprecise.

Semi-supervised: trained with a mix of labeled and unlabeled data.

Semi-supervised learning has attracted a lot of attention recently and the field is developing rapidly.

It previously relied on two-stage training:

first train a Teacher model on the (smaller) labeled data,

then use the (larger-scale) unlabeled data to predict pseudo-labels, which serve as training data for a Student model.

There are now many methods that train directly end-to-end, which greatly reduces the effort of semi-supervised training.

Self-supervised: trained on unlabeled data.

Through some pretext task, the model learns an internal representation of the data, which is then connected to downstream tasks, for example by adding an MLP as a classifier.

For the downstream task it is still necessary to finetune on task-specific labeled data, but one can choose to completely freeze the earlier layers and only finetune the parameters of the later network.
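As an illustration of the freeze-and-finetune option, here is a minimal PyTorch sketch; the torchvision ResNet-18 backbone, layer sizes, and 10-class head are assumptions standing in for a real self-supervised pretrained encoder and a real downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for a self-supervised pretrained encoder (hypothetical; weights=None here).
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()               # drop the original classifier, keep the feature extractor
backbone.eval()                           # also freezes batch-norm statistics

# Freeze the pretrained layers so only the new head is trained.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Sequential(                     # small MLP classifier for the downstream task
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),                   # 10 downstream classes (assumed)
)
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head's parameters are updated
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)           # dummy labeled batch
y = torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Passing only `head.parameters()` to the optimizer is what makes this a linear/MLP probe rather than full fine-tuning; unfreezing the backbone and lowering the learning rate would give the full-finetune variant.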
 

Self-supervised learning is mainly evaluated through the Pretrain-Finetune paradigm.

Supervised Pretrain - Finetune process:

1. Train on a large amount of labeled data to obtain a pre-trained model.

2. For a new downstream task, transfer the learned parameters (such as the parameters of the layers before the output layer) and "fine-tune" on the new labeled task, so as to obtain a network adapted to the new task.

Self-supervised Pretrain - Finetune process:

1. From a large amount of unlabeled data, train the network through a pretext task (supervisory signals constructed automatically from the data itself) to obtain a pre-trained model.

2. For new downstream tasks, fine-tune the transferred parameters just as in supervised learning.

Therefore, the ability of self-supervised learning is mainly reflected in the performance on downstream tasks.

Features of supervised learning:

  1. For each image, the machine predicts a category or a bounding box
  2. The training data is manually labeled
  3. Each sample provides very little information (for example, a label from 1024 categories carries only 10 bits)

Features of self-supervised learning:

  1. For an image, the machine can predict any part of it (the supervision signal is constructed automatically)
  2. For video, future frames can be predicted
  3. Each sample can provide a lot of information

Main idea

The main idea of Self-Supervised Learning:

1. Use unlabeled data to train the network parameters from random initialization into a preliminary Visual Representation.

2. Depending on the downstream task (Downstream Task), use a labeled data set to train the parameters the rest of the way.

The amount of labeled data needed at this stage does not have to be large, because the parameters are already mostly trained after the first stage.

The first stage does not involve any downstream task: the model is pre-trained on a pile of unlabeled data with no specific task in mind. In the literature this is called training in a task-agnostic way.

The second stage involves the downstream task: the model is fine-tuned on labeled data for that task. This is called training in a task-specific way.

Areas involved

Self-Supervised Learning is used not only in NLP but also in CV and speech. It can be divided into three categories: Data Centric, Prediction (also called Generative), and Contrastive.

Shallow layers capture low-level features such as edges, corners, and textures, while deeper layers capture task-related high-level features. Therefore, only the visual features of the first few layers are transferred when the downstream task is trained with supervision.

Pretext Tasks

The essence of a pretext task is that the model learns to treat transformations of a sample as equivalent to the original (after transformation, the data is still regarded as the same sample and lies in the same embedding space), while remaining able to distinguish it from other data samples.

But pretext tasks are a double-edged sword: a particular pretext task may work well for some problems but not for others.

The original image is regarded as an anchor, its augmented versions are regarded as positive samples, and the remaining images are regarded as negative samples.
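A rough sketch of how such anchor/positive pairs are usually constructed with torchvision augmentations; the specific transforms and their parameters are assumptions loosely following common (SimCLR-style) recipes, not a prescription:

```python
import torch
from torchvision import transforms
from PIL import Image

# Two random augmentations of the same image form a positive pair;
# augmented views of other images in the batch serve as negatives.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # geometric: crop + resize
    transforms.RandomHorizontalFlip(),                  # geometric: flip
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),            # blur
    transforms.ToTensor(),
])

img = Image.open("example.jpg").convert("RGB")   # hypothetical input image
view_1 = augment(img)   # anchor view
view_2 = augment(img)   # positive view of the same underlying image
```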

Most pretext tasks can be divided into four categories:

  • Color transformation: original image, Gaussian noise, Gaussian blur, color distortion
  • Geometric transformation: original image, crop, rotate, flip
  • Context-based tasks:

Jigsaw puzzles (spatial): the original image is regarded as the anchor, it is split into small patches, the shuffled version is regarded as a positive sample, and the remaining images are regarded as negative samples.

Temporal ordering (time):

Frames from the same video are taken as positive samples, and frames from other videos as negative samples.

Alternatively, randomly sample two clips from a long video, or apply geometric transformations to each video clip.

A positive pair is then two augmented video clips from the same video.


  • Cross-modal-based tasks

View Prediction (Cross-modal-based)

The view prediction task is generally used when the data itself has multiple views.

In [23], the anchor and its positive images come from simultaneous viewpoints; they should be as close as possible in the embedding space and as far as possible from the negative images taken from other positions in the timeline.

In [24], multiple views of one sample are treated as positive samples (intra-sample), and the rest, sampled across different samples (inter-sample), as negatives.


Downstream tasks

The downstream tasks focus on specific applications. When optimizing a downstream task, the model makes use of the knowledge learned while optimizing the pretext task. These tasks can be classification, detection, segmentation, prediction, etc.

[Figure: transfer learning]

Analyzing the effectiveness of pretext tasks

Common downstream high-level vision tasks:
Semantic Segmentation: assigning a semantic label to each pixel in an image.
Object Detection: locating objects in an image and identifying their classes.
Image Classification: identifying the class of the objects in each image, typically with one class label per image. The self-supervised model is applied to each image to extract features, these features are used to train a classifier (such as an SVM), and the classifier's performance on the test set measures the quality of the learned features (a sketch of this linear-probe protocol follows this list).
Human Action Recognition: recognizing what people are doing in a video, from a list of predefined action categories. Usually used to evaluate the quality of features learned from videos.
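As referenced in the image-classification entry above, here is a sketch of the linear-probe protocol under the assumption that a frozen `encoder` and labeled `train_loader`/`test_loader` already exist; the scikit-learn LinearSVC is just one possible choice of simple classifier:

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a labeled dataset and collect features."""
    encoder.eval().to(device)
    feats, labels = [], []
    for x, y in loader:                        # loader yields (images, labels)
        feats.append(encoder(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# `encoder`, `train_loader`, and `test_loader` are assumed to exist.
# X_train, y_train = extract_features(encoder, train_loader)
# X_test,  y_test  = extract_features(encoder, test_loader)
# clf = LinearSVC().fit(X_train, y_train)
# print("linear-probe accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```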

To test the effect of features learned by self-supervised learning on downstream tasks, besides the quantitative evaluation above there are several qualitative visualization methods (Qualitative Evaluation):

  • kernel visualization
  • feature map visualization
  • nearest-neighbor based approaches

Kernel Visualization: qualitatively visualize the kernels of the first convolutional layer learned from the pretext task and compare them with the kernels of supervised models. The effectiveness of the self-supervised model is evaluated by how similar its kernels are to those of the supervised model.

Feature Map Visualization: visualize feature maps to show which regions the network attends to. Larger activations indicate that the network pays more attention to the corresponding region of the image. The feature maps are usually visualized qualitatively and compared with those of a supervised model.

Nearest Neighbor Retrieval: images with similar appearance are usually close together in feature space. The nearest neighbor method retrieves the top-K nearest neighbors in the feature space learned by the self-supervised model.

Visualize Kernels and Feature Maps

Here, the kernels of the first convolutional layer (from self-supervised training and supervised training, respectively) are compared.

Similarly, attention maps of different layers can also be used to test the effectiveness of the model.

[Figure: attention maps of a trained AlexNet]
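A small sketch of dumping first-layer kernels for such a visual comparison; the untrained torchvision AlexNet here is only a placeholder for whatever self-supervised or supervised model is being inspected:

```python
import matplotlib.pyplot as plt
from torchvision import models

model = models.alexnet(weights=None)           # stand-in for a pretrained AlexNet
kernels = model.features[0].weight.detach()    # first conv layer: (64, 3, 11, 11)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    k = (k - k.min()) / (k.max() - k.min())    # normalize each kernel to [0, 1] for display
    ax.imshow(k.permute(1, 2, 0))              # (H, W, C) layout for imshow
    ax.axis("off")
plt.tight_layout()
plt.savefig("conv1_kernels.png")
```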

Nearest neighbor retrieval

In general, samples of the same class should lie close together in the latent space. For an input sample, top-K nearest neighbor retrieval over the data set can be used to analyze whether the self-supervised learning model is effective.
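A minimal sketch of top-K retrieval with cosine similarity, assuming feature matrices have already been extracted with a frozen encoder; the dimensions and random tensors are purely illustrative:

```python
import torch
import torch.nn.functional as F

def topk_neighbors(query_feats, gallery_feats, k=5):
    """Return indices of the k most similar gallery samples for each query."""
    q = F.normalize(query_feats, dim=1)    # cosine similarity = dot product of L2-normalized vectors
    g = F.normalize(gallery_feats, dim=1)
    sim = q @ g.t()                        # (num_query, num_gallery) similarity matrix
    return sim.topk(k, dim=1).indices

# Toy usage with random 128-dim features.
gallery = torch.randn(1000, 128)
queries = torch.randn(4, 128)
print(topk_neighbors(queries, gallery, k=5))
```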

Using the learned parameters as a pre-trained model and then fine-tuning on downstream high-level tasks: the performance of this transfer learning demonstrates the generalization ability of the learned features.


 

Choosing pretext tasks

1. Shortcuts: design auxiliary tasks according to the characteristics of your own data and task; this often yields twice the result with half the effort. For example, for lens-related detection tasks, cues such as chromatic aberration, lens distortion, and vignetting can be used to construct auxiliary tasks.

2. Complexity of the auxiliary task: experiments show that more complex auxiliary tasks are more effective only up to a point. For example, in the image-reconstruction (jigsaw) task the optimal number of patches is 9; too many patches leave too few features per patch, and adjacent patches become too similar, so the model learns poorly.

3. Ambiguity: the label of the designed auxiliary task must be uniquely determined, otherwise it introduces noise into training and hurts the model's performance. For example, in motion prediction a half-squat is ambiguous, because the next state may be squatting down or standing up, so the label is not unique.

Main methods

1. Context based

2. Temporal-based

3. Contrastive Based

Mainstream classification

Generative methods

If the input can be reconstructed, the model must have extracted a good feature representation.

e.g., MAE, BERT
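A toy sketch of the masked-reconstruction idea behind such models; this is not the actual MAE or BERT architecture, just a tiny MLP that reconstructs randomly masked image patches with an MSE loss on the masked positions:

```python
import torch
import torch.nn as nn

patch_dim, num_patches, mask_ratio = 16 * 16 * 3, 196, 0.75   # 224x224 image as 14x14 patches

encoder = nn.Sequential(nn.Linear(patch_dim, 256), nn.ReLU(), nn.Linear(256, 256))
decoder = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, patch_dim))

patches = torch.randn(8, num_patches, patch_dim)       # dummy batch of patchified images
mask = torch.rand(8, num_patches) < mask_ratio         # True = patch is hidden from the model

visible = patches.clone()
visible[mask] = 0.0                                    # crude masking: zero out the hidden patches

recon = decoder(encoder(visible))                      # try to reconstruct every patch
loss = ((recon - patches) ** 2)[mask].mean()           # reconstruction loss only on masked patches
loss.backward()
```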
 

Contrastive methods

Discriminate between different inputs in feature space

Figure 1: Intuitive understanding of contrastive learning: pull the original image and its augmented versions closer together, and push the original image and other images farther apart.
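A compact sketch of the InfoNCE-style objective that formalizes this intuition, assuming z1[i] and z2[i] are embeddings of two augmented views of the same image; the temperature and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: each sample's two views attract, all other samples repel."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))          # the positive for row i is column i
    return F.cross_entropy(logits, targets)

# Toy usage: z1[i] and z2[i] are embeddings of two augmentations of image i.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(z1, z2))
```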

More recently, self-supervised learning has combined aspects of generative and contrastive models to learn representations from large amounts of unlabeled data.

A popular approach is to design various pretext tasks so that the model learns features from pseudo-labels. Examples include image inpainting, image colorization, jigsaw puzzles, super-resolution, video frame prediction, and audio-visual correspondence. These pretext tasks have been shown to learn good representations.

Figure 2: Comparing self-supervised learning training paradigms

Reference links:

Self-supervised learning | (1) Getting started with Self-supervised Learning - CoreJT's blog, CSDN

Self-supervised learning | (2) Reading Self-Supervised Learning - CoreJT's blog, CSDN

A Survey of Contrastive Self-Supervised Learning - Xovee's blog, CSDN

Self-supervised Learning (Self-supervised Learning)

The definitions and differences of supervised, semi-supervised, unsupervised, weakly supervised, and self-supervised - Programmer Sought

[23] Sermanet et al., Time-contrastive networks: Self-supervised learning from video, 2017.

[24] Tao et al., Self-supervised video representation learning using inter-intra contrastive framework, 2020.

Origin: blog.csdn.net/qq_28838891/article/details/125321606