Self-supervised contrastive learning + MoCo trilogy + self-supervised model evaluation method in deep learning

Summary

For a long time, computer vision achieved great success by relying on large-scale labeled datasets, and convolutional neural networks in particular drove leapfrog progress across its sub-fields, prompting heavy investment from both academia and industry and convincing many that the edifice of artificial intelligence was nearly complete. However, recent research on self-supervised learning (SSL), Transformers, MLP architectures, and related topics has become a hot spot in academia, and the advance of Transformers and MLPs in particular may well displace supervised learning and convolutional structures. Given this momentum, the author believes the field of computer vision (CV) is entering a new era of change.

This article focuses on self-supervised learning in the CV field: basic concepts, its relationships and applications across vision sub-fields, current progress, and some reflections. There are already many articles interpreting the principles and techniques of specific self-supervised methods, so this article does not cover them for now; instead it tries to examine the characteristics and current limitations of self-supervised learning from other angles and to summarize experience, in the hope of inspiring more innovative ideas. Since the author's own perspective is limited, some opinions are inevitably biased, and criticism and corrections are welcome.

 1. Introduction to Self-Supervised Learning

At the AAAI 2020 conference, Yann LeCun gave a talk on self-supervised learning, calling it the future of artificial intelligence. From the end of 2019 to the present, a series of methods such as the MoCo series, SimCLR, and BYOL has developed rapidly: models pre-trained on unlabeled datasets now match the performance of models trained with labels, and almost all downstream tasks benefit, making self-supervision a research hotspot in CV. Its advantage is that training can be completed on unlabeled data, whereas supervised learning requires large amounts of labeled data and reinforcement learning requires many interactions with the environment. In an era when data is king, this property convinces many that self-supervised learning is the direction in which artificial intelligence will develop.

Self-supervised learning is a newer term alongside the familiar supervised and unsupervised learning; such methods were originally classified under unsupervised learning. According to the definition given by Papers with Code [1], self-supervised learning uses unlabeled data to learn a representation, specifically by optimizing the objective of a pretext (surrogate) task to obtain a feature representation. The pretext task can be a predictive task, a generative task, or a contrastive learning task, and its supervision signal comes from the data itself. Examples of pretext tasks include image colorization, predicting the relative position of image patches, and predicting the order of video frames. Put another way: since the data itself carries no labels, we must construct labels by designing tasks ourselves. For example, in the figure below [2], nine patches are cropped from an image and the model predicts the position of each patch; automatically assigning a position label to each patch is the label-generation process, and predicting the positions is the pretext task.

Figure 1 Prediction of relative position of image blocks

The recently popular and highly effective self-supervised models such as the MoCo series and SimCLR are basically built by contrasting positive and negative sample pairs; BYOL and SimSiam discard the negative samples but still use two networks, so the comparison performed by all of these methods belongs to the category of contrastive learning. It is fair to say that the current popularity of self-supervised learning is the popularity of contrastive self-supervised methods. The basic principle is to adopt a Siamese-style network structure, feed positive and negative sample pairs into the two branches, and compute a loss on the two outputs so that the network learns features that pull similar samples closer and push dissimilar samples apart. Labels are constructed automatically through common data augmentation: as shown in the figure below [3], random cropping, color transformation, blurring, and so on are applied to an original image to construct similar (positive) pairs, while different original images or their augmented versions form dissimilar (negative) pairs. When transferred to downstream tasks such as classification, detection, and segmentation, networks trained with contrastive learning achieve results comparable to supervised models.

Figure 2 The data augmentation method used by SimCLR
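As a concrete illustration of how augmentation builds a positive pair, here is a minimal PyTorch/torchvision sketch; the specific augmentation parameters are illustrative, not taken from any particular paper:

```python
import torchvision.transforms as T
from PIL import Image

# Two random views of the same image form a positive pair;
# views of different images in the batch serve as negatives.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def make_positive_pair(img: Image.Image):
    """Apply the augmentation pipeline twice to get two views of one image."""
    return augment(img), augment(img)
```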

The development of contrast-based self-supervised methods is shown in the figure below; only the most widely noted methods are included, up to March 2021. The research teams at Facebook and Google have been trading blows, gradually stripping tricks and structures out of the contrastive learning framework and moving toward the idea, familiar from Chinese philosophy, that "the great way is simple".

Figure 3 The development process of self-supervised contrastive learning

Looking at it from another angle: if we set aside fine-tuning on downstream tasks and focus only on learning the pretext task, self-supervised learning is like a big dye vat. As long as a pretext task can be constructed and embedded into the self-supervised framework, the learned features and network acquire the discriminative ability required by that task. In this way, almost as if by magic, we can customize the capabilities of a neural network. Many results have already been published in which self-supervision is used for frame-order prediction, playback-speed judgment, image rotation prediction, and so on.

2. Relationships between self-supervised learning and other fields

Because contrastive learning has such strong momentum and accounts for the overwhelming share of current self-supervised research, the rest of this article uses "contrastive learning" in place of "self-supervised learning". Methods such as knowledge distillation and representation learning are similar or closely related, and are discussed one by one below.

1. Contrastive learning and knowledge distillation

The two share very similar network structures: both use a two-branch network and compute a loss on the outputs of the two branches. The difference is that knowledge distillation usually fixes a teacher network, with the student smaller than the teacher, whereas in contrastive learning the two branches typically have identical structures and their parameters are updated together, while in distillation the teacher's parameters stay fixed. There are of course further differences in inputs, losses, and parameter updates, but the distillation setup offers another way to think about the contrastive architecture. The momentum update and the stop-gradient trick commonly used in contrastive learning can be understood as a slowly updated teacher, a variant of distillation, so a contrastive network can be seen as two branches teaching each other and sparring with themselves. In the DINO paper [4], the two branches in the architecture diagram are even labeled teacher and student directly.

Figure 4 DINO algorithm network structure
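As a rough sketch of the stop-gradient idea shared by SimSiam/BYOL-style methods (this is not the exact DINO loss, which works on centered softmax outputs), the target branch is simply detached from the graph so it behaves like a frozen teacher:

```python
import torch
import torch.nn.functional as F

def stop_gradient_loss(p1, z2):
    """Negative cosine similarity with a stop-gradient on the target branch.
    p1: predictor output of the online branch, z2: output of the target branch."""
    z2 = z2.detach()  # stop-gradient: the target branch acts like a frozen teacher
    return -F.cosine_similarity(p1, z2, dim=-1).mean()
```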

2. Contrastive Learning and Representation Learning

Contrastive learning is a form of representation learning. Features obtained through contrastive learning can be transferred to downstream tasks, and after fine-tuning they reach the performance of supervised learning, much like the hand-crafted features of the early CV era. The loss design of contrastive learning also starts from the goal of representation learning: similar samples should be close in feature space and dissimilar samples far apart. A supervised network also learns a good feature representation, which is why it performs well on classification and other tasks; what contrastive learning does is learn a more generalizable representation without labels. It is foreseeable that contrastive models will replace the ImageNet pre-trained model as the starting point for training on all kinds of tasks, because the training set for contrastive learning can easily exceed ImageNet in size, and the training yields feature representations that generalize beyond classification.

Figure 5 Process of supervised learning

3. Contrastive Learning and Autoencoders

The autoencoder is another way of extracting image features in the unsupervised setting. An encoder maps the input to a feature representation, a decoder then reconstructs the original image from that feature, and the reconstruction error is the training objective.

Figure 6 Schematic diagram of the autoencoder network structure
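A minimal sketch of this encode/decode/reconstruct pattern in PyTorch (the layer sizes are illustrative only):

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal convolutional autoencoder: encode -> decode -> reconstruction loss."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)       # feature used for downstream tasks
        x_hat = self.decoder(z)   # reconstruction of the input
        return x_hat, z

# Training objective: minimize reconstruction error, e.g.
# loss = nn.functional.mse_loss(x_hat, x)
```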

The encoding step of an autoencoder can be viewed as a single branch of a contrastive network. The difference is that the autoencoder uses the reconstructed output as its self-supervision signal to avoid trivial (collapsed) solutions, while a contrastive network relies on comparing the outputs of its two branches. From the perspective of feature extraction, contrastive learning constrains and optimizes the features directly, maintaining alignment (similar instances have similar features) and uniformity (features retain more information and are distributed uniformly) in the embedding space. Combining the two approaches is also a direction worth trying: magic does not necessarily have to defeat magic, and adding two kinds of magic together might also create a magical world.
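The alignment and uniformity notions mentioned above have a commonly used formalization (Wang and Isola, 2020); a minimal sketch, assuming L2-normalized embeddings z, z1, z2 and the usual default exponents:

```python
import torch

def alignment(z1, z2, alpha=2):
    """Alignment: positive pairs (z1[i], z2[i]) should be close in feature space."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()

def uniformity(z, t=2):
    """Uniformity: features should spread out evenly on the unit hypersphere."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()
```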

4. Contrastive Learning and Natural Language Processing

The success of self-supervised learning in natural language processing (NLP) sparked the current wave of contrastive learning in CV. The success of word vectors (Word2Vec) and related methods drives everyone to explore whether it can be reproduced in the visual domain.

There are also differences between the two. The number of words or phrases is finite, while the space of images is effectively infinite. Sentences can be varied in many ways through masking; how to construct such variations efficiently in the image domain, obtain accurate sample pairs, and improve downstream performance are problems still to be solved and optimized. Some simple ideas can be transferred directly: for example, the sentence-order prediction (SOP) task proposed by ALBERT [5] maps directly onto order prediction for video clips.

5. Contrastive Learning and Generative Adversarial Networks (GANs)

Q: Can contrastive learning and GAN be related?

A: Yes, they can.

Consider the network architecture from the VideoMoCo [6] paper, where a generator is used to produce similar sample pairs and the discriminator role is played by the contrastive learning framework. The discriminator's task in a GAN, distinguishing real from fake, is essentially the same as distinguishing positive from negative pairs in contrastive learning.

Although the generator used by VideoMoCo is relatively naive, it opens up a huge space for imagination. One of the difficulties in contrastive learning is how to construct pretext tasks; currently, the various contrastive methods rely on mechanical data augmentation. If a network were used to generate the labels for positive and negative pairs, could it improve the effect of contrastive learning, or even broaden its scope of application? Everything can be contrasted, as long as it can be generated.

Figure 7 videoMoCo algorithm network structure

6. Contrastive learning, metric learning, and image retrieval

From conversations with colleagues who study metric learning, and judging by the related network architectures and loss functions, contrastive learning and metric learning are closely related, or can even be regarded as two names for the same idea: the goal is for the learned features to place similar objects closer together and dissimilar objects farther apart. Most of the contrastive learning field now uses the InfoNCE loss, while the various losses used in metric learning are rarely touched; borrowing from those losses is a possible direction for further optimization.
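For reference, here is a minimal, simplified InfoNCE sketch over a batch of paired views (one-directional, SimCLR-style; the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: (z1[i], z2[i]) are positives, all other pairs are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # [N, N] similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # positives lie on the diagonal
```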

Image retrieval is an important area where we have tried to apply contrastive learning in practice. Contrastive learning naturally produces image embeddings and is good at separating similar images from dissimilar ones, so under certain retrieval requirements it is an ideal fit. We have also compared a contrastive model with a model trained with ArcFace, applying both embeddings to image retrieval: in a simple evaluation the difference was small, and in terms of model adaptability, the diversity of the original data augmentation had the larger impact.

3. Development trends of contrastive self-supervised learning

1. The road to simplicity

You may have had some doubts after reading the earlier contrastive learning papers: why does stop-gradient work, and what exactly is the role of momentum? It does not seem so intuitive. In follow-up methods the momentum update was discarded, and negative samples could also be discarded, while Barlow Twins [7] sweeps all the tricks away and implements contrastive learning directly on the most intuitive object, the cross-correlation matrix, with maddening simplicity. Seen this way, the root of the various methods and losses is the cross-correlation matrix. It handles the sampling of sample pairs succinctly and, compared with other algorithms, offers a more efficient way to sample and scale the data. Earlier methods circled around the enemy's heart; Barlow Twins is the swordsman who stabs straight into it. Of course, the 8192-dimensional projection layer it uses is also a point worth discussing.
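As a concrete picture of what "implementing contrastive learning on the cross-correlation matrix" means, here is a minimal sketch of a Barlow Twins-style loss; the trade-off weight is the value reported in [7], used here illustratively, and the exact recipe (projection width, normalization details) is in the paper:

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """Redundancy-reduction loss on the cross-correlation of two views.
    z1, z2: [N, D] embeddings of two augmented views of the same batch."""
    N, D = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardize each dimension over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / N                  # [D, D] cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                   # push C_ii toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()      # push C_ij (i != j) toward 0
    return on_diag + lambd * off_diag
```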

It is already a consensus in contrastive learning that the number of negative samples matters a great deal for feature learning; Barlow Twins instead aims to reduce redundancy across feature dimensions. Thinking about it the other way around, we could turn the cross-correlation matrix into a similarity matrix over the images in a batch, thereby obtaining large-scale negative samples to improve the model without being limited by hardware, and still complete efficient contrastive training. Of course, the prior assumption of this approach is that all images within the same batch form negative pairs with one another.

The choice of loss function also has a sense of returning to tradition. Below are the loss used in Yann LeCun's 2006 paper [13] and the loss of Barlow Twins. Don't these two losses look like twins?

Contrastive loss proposed in 2006

Cross-correlation matrix vs loss used by Barlow Twins
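Since the original equation images are not reproduced here, the two losses can be written out; these are the standard forms from [13] and [7] respectively:

L(W, Y, X_1, X_2) = (1-Y)\,\tfrac{1}{2}\,D_W^2 + Y\,\tfrac{1}{2}\,\big(\max(0,\, m - D_W)\big)^2

where Y = 0 for a similar pair, Y = 1 for a dissimilar pair, D_W is the distance between the two embeddings, and m is a margin; and

\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where C is the cross-correlation matrix between the embeddings of the two augmented views and \lambda trades off the two terms.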

2. Transformer or MLP?

At the beginning of April 2021, Xinlei Chen, Kaiming He, and colleagues released MoCo v3 [8], introducing Vision Transformers (ViT) into contrastive learning. At the end of April, the DINO [4] paper was released, pointing out that self-supervised ViT features contain explicit semantic segmentation information, something not observed in supervised ViTs or in convolutional networks.

Figure 8 DINO algorithm segmentation effect display

In the vision field there is a trend of Transformers replacing convolutional networks, as if a fledgling youth had knocked out the old master with one punch. Having started from plain image classification, Transformers have already moved into self-supervised learning and shown even stronger features; more research built on self-supervised Transformers is sure to appear.

Methods such as the reborn MLP [9] may also show their strengths in self-supervision. Echoing the title of [9], "MLP-Mixer: An all-MLP Architecture for Vision", the author has even thought of a title for an MLP method under self-supervision: "An all-MLP Architecture for Self-supervised Learning".

3. Contrastive self-supervision in the video domain

There are also many applications of contrastive learning in the video domain. [10] feeds videos at different playback speeds into a contrastive network and trains the model to discriminate playback pace; Background Erasing [11] superimposes a random frame from the same video onto every frame to reduce the influence of the background on the model's judgment and improve action recognition accuracy, with the network taking the normal video and the superimposed video as input; [12] samples different clips of the same video, treats them as augmented views, and feeds them into the network as positive pairs to learn video feature representations.

Currently, in many video applications the pretext task is essentially the same as the downstream task, so the resulting model only works well on that specific task. At the same time, video representation learning largely copies image methods, often amounting to replacing 2D convolutions with 3D convolutions; more work tailored to the specific characteristics of video is still needed.

Advances in video representation learning will surely advance the field of video retrieval. There, self-supervised learning can be used to build video-to-video retrieval, as well as cross-modal retrieval such as searching video by text or by speech; conversely, one can imagine generating text or speech from video.

4. Supervised Contrastive Learning

Contrastive learning, which shines in the self-supervised setting, can also be applied to supervised learning, as paper [14] has done. In the self-supervised setting the criterion is whether two images come from the same source image; combined with supervision, it becomes whether two images belong to the same class. Using a supervised contrastive loss yields better performance than cross-entropy.

Figure 9 Self-supervised comparison and supervised comparison

However, the core of this method is still to train the embedding network with contrastive learning and then freeze the feature extractor while training a fully connected classifier, which is essentially the same as transferring a self-supervised network to a downstream task; the key lies in constructing a pretext task that incorporates the information in the supervised labels. This again confirms the magic of self-supervised learning, and also demonstrates that the contrastive loss extracts effective features better than the classification cross-entropy loss.
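A minimal sketch of a supervised-contrastive (SupCon-style [14]) loss, where samples sharing a label are treated as positives; the temperature is illustrative and the paper discusses additional variants:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """z: [N, D] embeddings, labels: [N] integer class labels."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                                  # [N, N] similarities
    mask_pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float() # same-class pairs
    mask_self = torch.eye(len(labels), device=z.device)
    mask_pos = mask_pos - mask_self                                # exclude the anchor itself
    logits = sim - 1e9 * mask_self                                 # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask_pos.sum(1).clamp(min=1)
    return -(mask_pos * log_prob).sum(1).div(pos_count).mean()
```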

4. Some thoughts

1. Theoretical principles

Although self-supervised learning has achieved good results, the mathematical principles and theory behind it are not particularly solid; most conclusions about model structures and strategies are reverse-engineered from experimental results, which may cause much research to take detours. Starting from a theoretical foundation and heading straight for the final goal might work better.

2. Construction of pretext tasks

Current pretext tasks, especially in the video direction, are mostly dictated by the downstream task, and there is no concrete paradigm or rule for constructing them. What the pretext task can accomplish is the boundary of what the self-supervised model can accomplish. Because pretext tasks vary so widely, performance across different task types cannot be compared directly; at best a network is simply applied to another task. In the image domain, pretext tasks are mostly built from combinations of data augmentation, whereas the video domain still lacks, and could yet propose, a unified construction method.

Pretext tasks that can be constructed even "semi-automatically" are very rare, which may become a stumbling block for the adaptability of self-supervised methods across image applications.

3. Can we build end-to-end learning that leads directly to downstream tasks?

Since [4] found that obvious semantic segmentation features emerge under self-supervision, whether adding a segmentation branch to the back end of a contrastive model would help the network learn, or whether a usable segmentation network could be trained directly, is a question worth studying.

4. Constructing feature extraction networks in forms other than contrast

In essence, a contrastive network is a way, beyond conventional supervised training, to learn a feature representation, similar to the autoencoder discussed above. The success of contrastive learning lies in the fact that the feature extractor it trains performs well on downstream tasks and yields effective representations. This suggests asking whether there are other ways to build training schemes that also extract effective features; a new paradigm of that kind would surely lead a wave of research, just as contrastive learning has.

5. A vast world with great potential

Self-supervised learning is still at an exploratory stage, and there is much that can be explored in depth. I believe it will find wide application in both academia and industry. As a kind of magic within deep learning, it needs more people to tap its potential and create more miracles.

Summary

This article has surveyed the currently popular research on self-supervised learning in CV, sorted out its similarities to and differences from other areas of CV, and discussed several cutting-edge research points. I hope it gives readers a clearer picture of where self-supervised learning stands; if it helps your research and ideas, that is the author's greatest comfort.

References:

[1] https://www.paperswithcode.com/task/self-supervised-learning

[2] Doersch C, Gupta A, Efros A A. Unsupervised visual representation learning by context prediction[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1422-1430.

[3] Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]//International conference on machine learning. PMLR, 2020: 1597-1607.

[4] Caron M, Touvron H, Misra I, et al. Emerging properties in self-supervised vision transformers[J]. arXiv preprint arXiv:2104.14294, 2021.

[5] Lan Z, Chen M, Goodman S, et al. Albert: A lite bert for self-supervised learning of language representations[J]. arXiv preprint arXiv:1909.11942, 2019.

[6] Pan T, Song Y, Yang T, et al. Videomoco: Contrastive video representation learning with temporally adversarial examples[J]. arXiv preprint arXiv:2103.05905, 2021.

[7] Zbontar J, Jing L, Misra I, et al. Barlow twins: Self-supervised learning via redundancy reduction[J]. arXiv preprint arXiv:2103.03230, 2021.

[8] Chen X, Xie S, He K. An empirical study of training self-supervised visual transformers[J]. arXiv e-prints, 2021: arXiv: 2104.02057.

[9] Tolstikhin I, Houlsby N, Kolesnikov A, et al. MLP-Mixer: An all-MLP architecture for vision[J]. arXiv preprint arXiv:2105.01601, 2021.

[10] Wang J, Jiao J, Liu Y H. Self-supervised video representation learning by pace prediction[C]//European Conference on Computer Vision. Springer, Cham, 2020: 504-521.

[11] Wang J, Gao Y, Li K, et al. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning[J]. arXiv preprint arXiv:2009.05769, 2020.

[12] Feichtenhofer C, Fan H, Xiong B, et al. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning[J]. arXiv preprint arXiv:2104.14558, 2021.

[13] Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping[C]//2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE, 2006, 2: 1735-1742.

[14] Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning[J]. arXiv preprint arXiv:2004.11362, 2021.


Appendix 1: The work of Kaiming He's team in self-supervision - the MoCo trilogy

Taking advantage of FAIR's recent release of MoCo v3, which completes the MoCo trilogy, let's review what the MoCo series has done from the beginning and explore why it has had such a great impact on the field of self-supervised learning.

For background on self-supervision and InfoNCE, see "Self-Supervised: How to Avoid Degenerate Solutions". This article only covers how the self-supervised methods of the MoCo series evolved.

01 MoCov1

Rewind to the end of 2019, when the Transformer from NLP was being applied to unsupervised representation learning, giving rise to the later far-reaching BERT and GPT model series. In contrast, the CV field had saturated ImageNet and seemed to have hit a wall it could not cross, spinning between different tasks in search of a way out. Just as CV was stagnating, Kaiming He came out with MoCo and swept seven major datasets including PASCAL VOC and COCO. From then on, CV opened a new chapter of self-supervision, which, alongside the Transformer, became a hot research direction in deep learning.

MoCo introduces three core designs: the dictionary as a queue, the momentum update, and shuffling BN.

(1)Dictionary as a queue

As mentioned in my previous article, the best way to avoid degenerate solutions is to satisfy alignment and uniformity at the same time, which requires both positive and negative pairs. Uniformity means distributing different features as evenly as possible on the unit hypersphere. To achieve this more efficiently, a very intuitive approach is to increase the number of negative pairs included in each gradient update (i.e., the batch size); before MoCo there was already a lot of research on how to increase the number of negative pairs.

Figure: comparison of the three contrastive mechanisms discussed in the MoCo paper ((a) end-to-end, (b) memory bank, (c) MoCo)

Mechanism (a) is the simplest and crudest: a direct end-to-end approach, where the batch size is limited by GPU memory. Mechanism (b) designs a memory bank that stores the features of every sample in the dataset, samples from it randomly at training time, and then applies a momentum update to the sampled features, so that multiple epochs can be viewed as approximating one large batch; the problem is that storing the features of the entire dataset consumes a lot of memory.


MoCo improves the memory bank into a dictionary maintained as a queue: like the memory bank, it stores data features, but in a queue, so that each training step enqueues the features of the newest batch and dequeues the batch that has been stored the longest. Overall, the total number of features held in the dictionary stays constant, while its contents are refreshed as training proceeds, and the dictionary does not need to be very large. That is the essence of it!
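A minimal sketch of the queue mechanics (the feature dimension 128 and queue size 65536 are the values used in the MoCo paper, kept here for illustration):

```python
import torch

class FeatureQueue:
    """Sketch of MoCo's 'dictionary as a queue': a fixed-size FIFO of key features."""
    def __init__(self, dim=128, size=65536):
        self.queue = torch.nn.functional.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        """Insert the newest batch of keys, overwriting the oldest entries."""
        n = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.queue.shape[0]
```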

(2)Momentum update

However, using the dictionary as a queue alone does not give good results, because the encoder's parameters change abruptly between training steps, so the features accumulated over many steps cannot be approximated as one static large-batch set of features. MoCo therefore adds a momentum encoder on top of the queue: the parameters of the key encoder are a moving average of the parameters of the query encoder, as follows:

 \theta_{\mathrm{k}} \leftarrow m \theta_{\mathrm{k}}+(1-m) \theta_{\mathrm{q}}

where \theta_{k} and \theta_{q} are the parameters of the key encoder and the query encoder respectively, and m is a momentum coefficient between 0 and 1.

Because of the momentum encoder, the key branch's parameters avoid abrupt changes, and the features collected over many steps can be approximated as one static large batch. Ingenious!
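In PyTorch this update is a few lines; a minimal sketch of the formula above (m = 0.999 is the value typically used in MoCo, treated here as an assumption):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q (exponential moving average)."""
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        param_k.data.mul_(m).add_(param_q.data, alpha=1 - m)
```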

(3)Shuffling BN

In addition, MoCo found that the BN layers in ResNet prevent the model from learning good features: computing the mean and std across samples within a batch leaks information between branches and leads to a degenerate solution. MoCo addresses this with multi-GPU training, computing BN statistics separately on each GPU and shuffling the sample-to-GPU assignment for the key branch so that the BN statistics seen by the two branches differ.

(4) Experiments

Experimental results

Comparative experiments on the three mechanisms (end-to-end, memory bank, and MoCo) show that MoCo has a clear advantage. Because the memory bank applies its momentum update to stored features rather than to encoder parameters, training is less stable and accuracy is much lower than end-to-end and MoCo; end-to-end cannot use a larger batch size because of GPU memory limits; MoCo, through the three ingenious designs of dictionary-as-a-queue, momentum encoder, and shuffling BN, can keep increasing the number of negatives K and fully unleash the power of self-supervision.

02 MoCov2

On top of MoCo v1, MoCo v2 added the tricks that had proven successful in the SimCLR experiments and then overtook SimCLR to become the SOTA of the time, with FAIR and Google Research competing head to head.

(1)SimCLR vs MoCo


The method SimCLR actually uses is the end-to-end mechanism described in MoCo. It, too, faces the GPU memory limit, but what is a GPU memory limit to Google, with its piles of TPUs? So SimCLR used a large batch size, long training schedules, more and stronger data augmentation, and an added MLP projection head to pull MoCo off the throne. MoCo, of course, was not convinced: "SimCLR, you cheat, I can also use stronger augmentation and an MLP!" And so MoCo v2 was born, in the form of an experiment report.

(2) Experiments


The experiments show that adding the MLP head, stronger augmentation, and longer training schedules all substantially improve MoCo's accuracy.
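The MLP head itself is a small change in code; a minimal sketch, where the hidden width and output dimension are the commonly used 2048 and 128 (treated here as assumptions):

```python
import torch.nn as nn

def projection_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    """Replace the encoder's final linear layer with a 2-layer MLP projection head."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )
```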


Compared with SimCLR, MoCo v2 achieves better results with a much smaller batch size.

03 MoCov3

The starting point of MoCo v3 is that unsupervised representation learning in NLP is built on the Transformer architecture, while self-supervision in CV still uses CNNs. Can the Transformer architecture be used for self-supervision as well? MoCo v3 therefore keeps exploring where the ceiling of self-supervision plus Transformer lies, giving it a bit of a "finance + computer science" flavor.

(1)Stability of Self-Supervised ViT Training

MoCo v3 replaces the backbone with ViT and then studies experimentally whether self-supervision works with the Transformer architecture. However, using ViT as the backbone makes self-supervised training unstable, and this instability is not visible in the final transfer results. To reveal what causes it, MoCo v3 uses kNN curves to monitor the result of every epoch of self-supervised training.
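A kNN monitor of this kind is straightforward to sketch: classify each validation embedding by the labels of its nearest training embeddings. This is a simplified majority-vote version, assuming L2-normalized features; MoCo v3 follows the usual weighted-kNN protocol:

```python
import torch

def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    """Simple kNN evaluation on frozen, L2-normalized features (majority vote)."""
    sim = val_feats @ train_feats.t()            # cosine similarity to all training features
    _, idx = sim.topk(k, dim=1)                  # indices of the k nearest neighbours
    neighbour_labels = train_labels[idx]         # [N_val, k]
    preds = neighbour_labels.mode(dim=1).values  # majority vote among neighbours
    return (preds == val_labels).float().mean().item()
```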

(2)Empirical Observations on Basic Factors

By controlling variables, the paper mainly explores the influence of batch size, learning rate, and optimizer on the self-supervised training process.

The experiments show that as the batch size or the learning rate increases, dips gradually appear in the kNN accuracy curve and grow deeper, with a roughly periodic pattern. With the LAMB optimizer, as the learning rate increases the kNN curve remains smooth but still declines in the middle of training.

(3)A Trick for Improving Stability


To explore the cause of the dips, the authors further plot how the gradients of the model's first and last layers change over training. They find that sudden gradient spikes occur in different layers and produce the dips; comparing the gradient peaks of each layer shows that the first layer spikes earliest, and the spike then propagates layer by layer to the last layer.

Based on this observation, the authors boldly speculate that the instability appears earlier in the shallow layers, and run an ablation comparing a fixed random patch projection with a learned patch projection.


The results show that, across different self-supervised algorithms, the fixed random patch projection is much more stable during training than the learned one, and kNN accuracy also improves somewhat.
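In code, the trick amounts to freezing the patch-embedding layer after random initialization. A minimal sketch assuming a timm-style ViT whose patch embedding is exposed as `patch_embed` (the model name and attribute are assumptions about the implementation, not taken from the MoCo v3 code):

```python
import timm

# Randomly initialized ViT backbone (placeholder model name).
model = timm.create_model("vit_base_patch16_224", pretrained=False)

# Freeze the random patch projection (first layer) so it receives no gradient
# updates during self-supervised training.
for param in model.patch_embed.parameters():
    param.requires_grad = False
```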

The authors also note that fixing the random patch projection only alleviates the instability to a certain extent; it does not solve it completely, and instability still appears when the learning rate is large enough. The first layer is unlikely to be the root cause; rather, the problem involves all layers. It is just that the gap between the first layer's convolutional patch projection and the subsequent self-attention layers is the largest, so it affects the instability more, and the first layer is the easiest to fix by freezing.

As expected, the experimental results beat the previous self-supervised algorithms. Overall, MoCo v3 gains insight into the problems of self-supervision plus Transformer through experimental exploration and alleviates them with a simple method, providing good inspiration for future self-supervised Transformer work.

(4) Summary

To offer a personal view: in the future, CV is likely to resemble NLP, with unsupervised pre-training at its core. The CNN architecture may not be able to support unsupervised pre-training on massive data, so a Transformer is needed as CV's unsupervised pre-training architecture. From the exploration in MoCo v3 it is clear that FAIR is trying to find CV's future direction in the two hot areas of self-supervision and Transformers. NLP came to dominate its field through the progression Transformer -> BERT -> GPT; MoCo seems to want to replicate that successful path, gradually probing the ceiling of unsupervised representation learning in CV through MoCo v1 -> MoCo v2 -> MoCo v3. Will self-supervised plus Transformer be the BERT of the CV field?

The last thing to say is that although it is called MoCo v3, it is actually less and less like MoCo: it drops MoCo's most essential component, the dictionary as a queue, and with it the spirit of MoCo. One can sense FAIR's helplessness and compromise in the pursuit of accuracy; the original brilliance of MoCo may live on only in history.


Appendix 2: Self-supervised model evaluation methods

From the paper "Exponential Moving Average Normalization for Self-supervised and Semi-supervised Learning".

1. Linear Classification and Finetuning

  • Linear classification, also known as the linear probe, is a method for evaluating the quality of a pre-trained model (linear probing evaluation). After pre-training, the backbone is frozen, its last layer is replaced with a linear layer, and only that linear layer is trained; the pre-trained parameters do not change. A minimal sketch follows after this list.
  • Finetuning refers to applying the pre-trained model to your own dataset and adapting the parameters to it; the parameters of the entire network change.
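As referenced above, a minimal sketch of the linear-probe setup; the backbone, feature dimension, class count, and optimizer settings are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn
import torchvision

# Backbone (placeholder, randomly initialized here; load your pre-trained weights in practice).
backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Identity()          # expose the 2048-d features
for p in backbone.parameters():
    p.requires_grad = False          # freeze the pre-trained parameters
backbone.eval()

# Linear probe: only this layer is trained on top of the frozen features.
probe = nn.Linear(2048, 1000)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)     # frozen features
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()                  # gradients flow only into the probe
    optimizer.step()
    return loss.item()
```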

2. kNN Classification

3. Image Retrieval

4. Low-shot Classification
