Self-supervised learning - contrastive learning - MoCo (Momentum Contrast) reading notes

Momentum Contrast for Unsupervised Visual Representation Learning

Paper: CVPR 2020 Open Access Repository: https://openaccess.thecvf.com/content_CVPR_2020/html/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.html

Published in: CVPR 2020

Code: GitHub - facebookresearch/moco (PyTorch implementation of MoCo); arXiv: https://arxiv.org/abs/1911.05722


1. Background knowledge

A. What is contrastive learning?


Suppose there are two categories of images, and the same feature extractor is used to extract features from each. Contrastive learning hopes that, in the feature space, features of different classes end up as far away from each other as possible (and features of the same class as close as possible).

B. What is unsupervised training?

In vision, people cleverly design pretext tasks: artificially defined rules that specify which images are similar and which are dissimilar, thereby providing a supervision signal to train the model. This is what is called self-supervised training.

Let's talk about the most widely used pretext task: instance discrimination. How does it define which images are similar and which are dissimilar? Instance discrimination does it like this: only crops taken from the same image count as positive samples and belong to the same class; everything else is a negative sample. (Alternatively, different views of an object, or the RGB and grayscale versions of the same image, can serve as positives.) Under this definition, every image is its own class: ImageNet no longer has 1,000 classes but more than 1 million (the total number of images).

Such a framework is the common way contrastive learning is implemented. It may look unremarkable, but the great strength of contrastive learning is its flexibility: as long as you can find a way to define what counts as a positive sample and what counts as a negative sample, the rest of the pipeline is fairly standard.

People have used their imagination to formulate many rules for positive and negative samples, for example:

In video, any two frames of the same video are positives, while frames from all other videos are negatives.
In NLP, SimCSE feeds the same sentence through the model twice with different dropout masks; the two resulting features form a positive pair, while the features of all other sentences are negatives.
The CMC paper: different views of an object (front and back, RGB and grayscale versions, different crops) can all serve as positive samples.
Contrastive learning is so flexible that it can be used in almost any field; extending it to the multi-modal setting gave rise to OpenAI's CLIP model.

C. Momentum

Momentum here means an exponential moving average: y_t = m · y_{t-1} + (1 - m) · x_t, where y_{t-1} is the output at the previous step, x_t is the current input, and m is a momentum hyperparameter in [0, 1]. The closer m is to 1, the less the update depends on the current input. MoCo exploits this property to update the key encoder, and hence the dictionary features, slowly and as consistently as possible.
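
To make this concrete, here is a tiny, self-contained illustration (toy numbers, not MoCo code) of how the momentum m controls how quickly the moving average reacts to its input:

```python
# Toy illustration: how the momentum m controls how quickly the moving
# average y follows the input x. With m close to 1, y changes very slowly.
def ema(xs, m):
    y = xs[0]
    history = []
    for x in xs:
        y = m * y + (1 - m) * x     # y_t = m * y_{t-1} + (1 - m) * x_t
        history.append(round(y, 3))
    return history

xs = [0.0] * 5 + [1.0] * 5          # the input jumps from 0 to 1 halfway through
print(ema(xs, m=0.5))    # follows the jump quickly
print(ema(xs, m=0.999))  # barely moves: the update hardly depends on the current input
```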


 Summary

MoCo is a method for unsupervised representation learning. It looks at contrastive learning from another perspective, namely as a dictionary look-up. MoCo builds a dynamic dictionary consisting of two parts: a queue and a moving-averaged (momentum) encoder.

These two components create a bigger and more consistent dictionary, which is very helpful for unsupervised contrastive learning. (The samples in the queue do not require gradient back-propagation, so many negative samples can be stored and the dictionary can become very large. The purpose of the moving-average encoder is to keep the features in the queue as consistent as possible, i.e., different samples are encoded by encoders that are as similar as possible.)

Under the common linear protocol, MoCo achieves very good results on ImageNet classification (the linear protocol means freezing the pre-trained backbone, adding a linear classification head, and training only that head; the backbone can then also be used for other tasks). More importantly, the features learned by MoCo transfer well to downstream tasks! (This is the essence of MoCo: the purpose of unsupervised learning is to obtain a good pre-trained model through large-scale unsupervised pre-training and then deploy it on downstream tasks, which usually do not have much labeled data for training.) MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes by large margins. In this way, the gap between supervised and unsupervised pre-training is largely bridged.


1. Introduction

GPT and BERT have proven the success of unsupervised pre-training in NLP, while supervised pre-training still dominates vision. The reason for this difference may come from the difference between the raw signal spaces of language and vision. In language tasks the raw signal space is discrete: the inputs are words or sub-word units, so it is relatively easy to build a tokenized dictionary (tokenize: map a word to a feature). With such a dictionary, unsupervised learning can easily be carried out on top of it. (All keys (entries) in the dictionary can be regarded as categories, so unsupervised language modeling resembles the supervised paradigm in that there is something like a label to help the model learn; this makes NLP models relatively easy to build and optimize.) The raw visual signal, however, lives in a continuous, high-dimensional space and does not carry the strong, concise semantic information that words do. Because the raw signal is continuous and high-dimensional, images are not well suited to building such a dictionary, which makes unsupervised learning hard to model; as a result, supervised models remain stronger in vision.

Some unsupervised learning methods based on contrastive learning have achieved good results, and they can be summarized as constructing a dynamic dictionary.

[Some concepts in contrastive learning: the anchor is a sample (an image); a positive is obtained from the anchor through some transformation; negatives are other samples. These samples are fed to an encoder to obtain features. The goal of contrastive learning is to pull the anchor as close as possible to the positive in feature space while pushing it as far as possible from the negatives.]

[Going further, in contrastive learning the positives and negatives together can be viewed as a dictionary in which each sample is a key, while the sample chosen as the anchor is viewed as the query. Contrastive learning then becomes a dictionary look-up problem: train encoders so that an encoded query is as similar as possible to its matching key (the feature of the positive) and far from the other keys (the features of the negatives). The whole self-supervised problem thus becomes a contrastive learning framework trained with a contrastive loss function.]

[In MoCo, contrastive learning is formulated as a dictionary look-up problem, so later in the paper "query" is used in place of "anchor" and "key" in place of positive/negative samples.]

Looking at contrastive learning from the dynamic-dictionary perspective, good learning requires the dictionary to have two properties: first, the dictionary should be as large as possible; second, the key representations in the dictionary should stay as consistent as possible during training. [Analysis: the larger the dictionary, the better it samples the high-dimensional visual space; the more keys it contains, the richer the visual information it represents, and comparing the query against these keys makes it more likely to learn essential, discriminative features. If the dictionary is small, the model may learn a shortcut solution, and the pre-trained model will not generalize well. As for consistency, we want the keys in the dictionary to be features produced by the same or a very similar encoder, so that comparisons with the query are meaningful. If the keys come from very different encoders, the query may simply match the key produced by the most similar encoder, which again introduces a shortcut and hampers learning.] Existing contrastive learning methods are limited in at least one of these two aspects.

MoCo makes two contributions; based on them, it builds a large and consistent dictionary, which yields better unsupervised visual representations.

1. Build a queue

[Analysis: this is the first time the queue data structure is used in contrastive learning. The queue stores the dictionary mentioned above. When the dictionary is large, GPU memory is not sufficient, so we want the dictionary size to be decoupled from the batch size of each forward pass.

Specifically, the queue can be very large, but it is updated a little at a time: in each forward pass during training, the features of the current batch are enqueued and the features of the oldest batch are dequeued, i.e., first in, first out. With the queue, the mini-batch size is decoupled from the dictionary (queue) size, so the dictionary can be set very large (up to 65536 in the paper), and not all elements in the queue need to be updated at every iteration. A good model can therefore be trained on an ordinary GPU (whereas SimCLR requires TPUs for training, which is very hardware-intensive).]

2. Momentum update for encoder

[Analysis: the keys in the dictionary should remain consistent, i.e., they should ideally be produced by the same or a very similar encoder. But as the queue is updated, the features of different batches are produced by the encoder at different times, which conflicts with this consistency requirement. To solve this, MoCo proposes the momentum encoder: with a large momentum m, the momentum encoder updates very slowly. Experiments confirm that a large momentum of 0.999 gives better results.]

Pretext task selection. MoCo provides a mechanism, a dynamic dictionary for unsupervised contrastive learning, so one also needs to choose a pretext task. MoCo is very flexible and can be combined with many pretext tasks. The paper uses the simple instance discrimination task, and the results are very good: the pre-trained model combined with a linear classifier achieves results comparable to supervised training on ImageNet. [Instance discrimination example: a query and a key are random crops (different views) of the same image; they form a positive pair, and everything else is a negative.]

The main purpose of unsupervised learning is to pre-train a model on a large amount of unlabeled data so that its features can be directly transferred to downstream tasks. MoCo achieves very good results on 7 downstream tasks (detection, segmentation, etc.), matching or even exceeding supervised methods. [MoCo is the first unsupervised method to achieve such good results.]

MoCo is pre-trained on ImageNet, with roughly 1 million (1.28M) images. To explore the upper limit of unsupervised performance, MoCo is also pre-trained on an Instagram dataset with about 1 billion images [Facebook's own data, closer to the real world, with many real-world issues such as class imbalance and long tails; one image may contain multiple objects, and the images are not as carefully curated and annotated as ImageNet]. The experiments show that the model trained on Instagram performs even better.

Therefore, MoCo largely closes the gap between supervised and unsupervised pre-training through experiments, achieving comparable or even better results; unsupervised pre-trained models may well replace supervised ones. [A great deal of work in academia and industry builds on ImageNet-supervised pre-training, so MoCo is highly influential.]


2. Related work

In unsupervised learning there are two main things to work on: the pretext task and the objective (loss) function. The ultimate goal of the pretext task is to obtain a good feature representation; the objective function can be designed somewhat independently of the pretext task. MoCo is mainly designed and improved from the objective-function side: the framework design in MoCo ultimately comes down to the InfoNCE objective.

Objective functions can be divided into discriminative, generative, contrastive, and adversarial ones. A generative model (reconstructing the entire image) can use an L1 or L2 loss; a discriminative model (e.g., predicting in which of eight relative positions a patch lies) can use a cross-entropy-style loss.

Contrastive loss functions measure the similarity of sample pairs in feature space: the goal is to pull the features of similar objects closer and push the features of dissimilar objects apart. Unlike generative or discriminative losses, whose targets are fixed, the target in contrastive learning keeps changing during training: it is determined by the features (the dictionary) produced by the encoder.

Adversarial loss functions mainly measure the difference between two probability distributions and are mainly used for unsupervised data generation; later they were also used for feature learning, and many transfer learning methods use adversarial losses to learn features. The idea is that if a model can generate realistic images, it has learned the underlying data distribution, and the features learned by such a model may also be very good.

Pretext tasks: denoising auto-encoders reconstruct the entire image; context auto-encoders reconstruct a patch of the image; cross-channel auto-encoders / colorization color an image; other tasks generate pseudo-labels from transformations of a single image, patch orderings (e.g., shuffling patches or predicting orientation), tracking in videos, or clustering.

Contrastive learning vs. pretext tasks: a given pretext task can be paired with a particular contrastive loss. Contrastive predictive coding (CPC) [46] is a form of context auto-encoding: it uses context information to predict the future. Contrastive multiview coding (CMC) uses different views of an object and is closer to colorization. [CPC and CMC are early classic works in contrastive learning.]

[The reason related work is organized around these two points is that the objective function and the pretext task are the main differences between unsupervised and supervised learning. Supervised tasks have label information, while unsupervised learning does not; self-supervised signals must be generated by pretext tasks to serve as ground-truth labels, and a loss is then needed to measure the difference between the output and this ground truth so that the model can learn.]


3. Method

3.1. Contrastive Learning as Dictionary Look-up

1) Loss function

       The loss function used in this paper is InfoNCE, the log loss of a (K+1)-way softmax classifier (K negative samples plus 1 positive sample):

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)} \qquad (1)$$

Looking at the contrastive objective, we want it to have the following properties: when the query is similar to its only positive key k+ and dissimilar to all other keys, the loss should be small. [The goal of contrastive learning is to pull the query close to the positive key k+ while pushing it away from all other samples; if this is achieved, training is essentially done and the loss should be small so the model is not updated further.] Conversely, when the query is not similar to the positive key k+, or is similar to some negative key, the loss should be large, so that the model is penalized and continues to update its parameters.

(In the softmax/cross-entropy analogy, the logits q·k_i play the role of the model output, and the index of the positive key plays the role of the label.)

This is almost the same as the cross-entropy loss, except that in cross-entropy K is the number of classes in the dataset. [In contrastive learning one could in theory use cross-entropy directly, but for many pretext tasks this is infeasible in practice. For example, with instance discrimination the number of classes K becomes the number of samples in the dictionary; softmax does not work well with a huge number of classes, and the exponential operations over a vector whose dimension is in the millions are computationally very expensive, so computing this at every iteration would be far too slow. This motivates NCE (noise contrastive estimation): simplify the multi-class problem into a binary one, data samples vs. noise samples, and contrast the data samples against the noise samples.

In addition, if the dictionary were the entire dataset, the computational complexity would still not be reduced. The solution is to sample only some negative samples from the dataset for the loss computation, using them to estimate the loss over the whole dataset; hence "estimation". The fewer negatives are sampled, the larger the bias of the approximation, so the dictionary size becomes a factor affecting model performance: the larger the dictionary, the better the approximation and the better the model.

InfoNCE is a simple variant of NCE. The idea is that treating the problem as pure binary classification, with only data samples and noise samples, is not so friendly for model learning, because the noise samples are probably not all one class; it is more reasonable to treat them as multiple classes, and NCE then becomes the InfoNCE formula above.

[In this formula, q and k are the model's output features (the dot products q·k are the logits), and τ is a temperature hyperparameter that controls the shape of the distribution: the larger τ is, the smoother the distribution; the smaller, the sharper. If the temperature is too large, the contrastive loss treats all negatives equally and learning becomes undiscriminating; if it is too small, the model focuses only on particularly hard negatives, which may actually be potential positives (e.g., samples of the same semantic category as the query). Paying too much attention to these hard negatives makes the model hard to converge, or makes the learned features generalize poorly.]
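
To make the effect of the temperature concrete, here is a minimal PyTorch illustration with made-up logits (one positive and four negatives), showing how dividing by τ changes the sharpness of the softmax over the keys:

```python
import torch
import torch.nn.functional as F

# One query's similarities to 1 positive key (first entry) and 4 negative keys.
logits = torch.tensor([[2.0, 1.0, 0.5, 0.2, 0.1]])

for tau in (1.0, 0.07, 0.01):
    probs = F.softmax(logits / tau, dim=1)
    print(f"tau={tau}:", [round(p, 3) for p in probs.squeeze().tolist()])
# Large tau -> a flat distribution (all keys treated almost equally);
# small tau -> almost all probability mass on the largest-logit (hardest) keys.
```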

2) Pretext task
       Since the focus of MoCo is not to design a new pretext task, it uses the instance discrimination task; of course, the pretext task can be arbitrary.
       Under random data augmentation, two views are randomly cropped from the same image and regarded as a positive pair. If a q and a k come from the same image, they form a positive pair; otherwise they form a negative pair. For each mini-batch, the encoded queries q and their corresponding keys k form positive pairs, and the negative samples come from the queue.

3.2. Momentum Contrast 

1) Treat a dictionary as a queue

In traditional contrastive learning, the dictionary size equals the mini-batch size, which is limited by GPU memory and compute and cannot be very large. MoCo therefore uses a queue to store the dictionary. The queue is updated as follows: the current mini-batch is enqueued after being encoded, and the oldest mini-batch in the queue is dequeued, so the dictionary is always a sampled subset of all data. In this way the queue size is decoupled from the mini-batch size, allowing a very large dictionary.
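
As a rough sketch of how such a queue can be implemented (a circular buffer with a write pointer; the dimension 128 and size 65536 follow the numbers quoted in these notes, everything else is illustrative rather than MoCo's exact code):

```python
import torch

class FeatureQueue:
    """A FIFO dictionary of encoded keys, decoupled from the batch size (sketch)."""

    def __init__(self, dim: int = 128, size: int = 65536):
        self.queue = torch.nn.functional.normalize(torch.randn(dim, size), dim=0)  # C x K
        self.size = size
        self.ptr = 0

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys: torch.Tensor) -> None:
        """Overwrite the oldest batch of keys with the newest one (first in, first out)."""
        batch = keys.shape[0]                         # keys: N x C
        assert self.size % batch == 0                 # assumed for simplicity
        self.queue[:, self.ptr:self.ptr + batch] = keys.T
        self.ptr = (self.ptr + batch) % self.size     # advance pointer, wrap around

# usage after each forward pass:
# queue.dequeue_and_enqueue(k)   # k = f_k(x_k), detached and L2-normalized
```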

2) Momentum update

Using a queue makes the dictionary large, but it also makes it intractable to update the key encoder by back-propagation, because gradients would have to flow to all the samples in the queue. MoCo therefore updates only the encoder fq by gradients; the encoder fk does not receive gradients. Simply copying the parameters of fq into fk does not work well, because fq is updated at every iteration and changes quickly, so the keys produced by a copied fk become inconsistent and accuracy drops. Instead, a momentum update is used to make the change of θk smoother:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)$$

       Here θq are the parameters of the encoder fq, θk are the parameters of the encoder fk, and m is the momentum coefficient in [0, 1]. Only θq is updated by back-propagation; a large m such as 0.999 works best.
       Initializing θk by copying θq and then applying the momentum update formula keeps MoCo's dictionary consistent with the query q.
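
A minimal PyTorch sketch of the copy-then-momentum-update scheme described above (the tiny linear layer stands in for the real ResNet-50 encoder, and the function name is my own; the same idea appears in the official implementation):

```python
import copy
import torch
import torch.nn as nn

encoder_q = nn.Linear(8, 4)              # stand-in for the ResNet-50 query encoder
encoder_k = copy.deepcopy(encoder_q)     # key encoder starts as an exact copy
for p in encoder_k.parameters():
    p.requires_grad = False              # f_k is never updated by back-propagation

@torch.no_grad()
def momentum_update(m: float = 0.999) -> None:
    """theta_k <- m * theta_k + (1 - m) * theta_q  (Eqn. (2) above)."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

# ... after each gradient step on encoder_q:
momentum_update(m=0.999)
```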

3) Relations to previous work

Previous contrastive learning methods can basically be summarized as dictionary look-up problems, and they are more or less limited by either the size or the consistency of the dictionary.

(a) End-to-end learning. As shown in (a) above, in the end-to-end method both the query encoder and the key encoder can be updated by gradient back-propagation. The two encoders can be different networks or the same network (in MoCo both encoders use the same ResNet-50 architecture). Because the queries and keys come from the same mini-batch, all their features are obtained in one forward pass and are highly consistent. The limitation is the dictionary size: here the dictionary size equals the mini-batch size, so a large dictionary requires a very large batch, which current GPUs cannot hold. Moreover, even when the hardware allows it, large-batch optimization is difficult; handled poorly, the model struggles to converge.

[SimCLR (Google) is an end-to-end method; it uses more data augmentations and adds a projector after the encoder to make the learned features better. It also requires large-memory TPUs to support a large batch size of 8192, giving 16382 negative samples.] The advantage of the end-to-end approach is that the key encoder can be updated in real time, so the consistency of the keys in the dictionary is very good; the disadvantage is that the dictionary size is tied to the batch size, and large batches do not fit in memory.

(b) Memory bank. This line of work cares more about dictionary size, at the cost of some consistency. There is only one encoder, the query encoder, which is updated by gradients. The dictionary stores the features of the entire dataset offline: for ImageNet, 1.28 million 128-dimensional features take only about 600 MB of memory, and nearest-neighbor queries over them are efficient. At each training step, keys are randomly sampled from the memory bank to form the dictionary, which can therefore be very large.

However, the feature consistency of the memory bank is poor. The features selected as keys for a given sample are refreshed by the (quickly changing) query encoder each time that sample is seen, so features updated at different times differ a lot, making the features in the memory bank inconsistent. In addition, since the memory bank stores the entire dataset, it takes a whole epoch of training before all of its features are updated once.

MoCo solves the problems of dictionary size and feature consistency discussed above through its dynamic dictionary (queue) and momentum encoder.

[MoCo is more similar to the memory-bank approach: both have only one gradient-updated encoder and need extra memory to store the dictionary. The memory-bank paper also proposed a proximal optimization loss to make training smoother, which is similar in spirit to MoCo's momentum update; the memory bank applies momentum to the features, while MoCo applies momentum to the key encoder. MoCo scales well and can be used on billion-scale datasets, whereas the memory-bank approach is still limited by memory when the dataset is large.]

MoCo is simple, efficient, and has good scalability.

3.3. Pretext Task

pseudocode:
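
The pseudocode figure is not reproduced here; below is a runnable PyTorch-style reconstruction of one MoCo training step in the spirit of the paper's Algorithm 1. The tiny linear "encoders", the 32×32 input size, and the variable names are placeholders for illustration; in the real setup both encoders are ResNet-50, and shuffling BN / multi-GPU details are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, K, m, t = 128, 65536, 0.999, 0.07       # feature dim, queue size, momentum, temperature
f_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, C))   # toy query encoder
f_k = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, C))   # toy key encoder
f_k.load_state_dict(f_q.state_dict())       # initialize key encoder from query encoder
for p in f_k.parameters():
    p.requires_grad = False

queue = F.normalize(torch.randn(C, K), dim=0)                  # dictionary as a queue of keys
optimizer = torch.optim.SGD(f_q.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)

def train_step(x_q: torch.Tensor, x_k: torch.Tensor) -> float:
    """One MoCo step: x_q, x_k are two augmented views of the same images (N x 3 x 32 x 32)."""
    global queue
    q = F.normalize(f_q(x_q), dim=1)                           # queries: N x C
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)                       # keys: N x C, no gradient

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)       # positive logits: N x 1
    l_neg = torch.einsum("nc,ck->nk", q, queue)                # negative logits: N x K
    logits = torch.cat([l_pos, l_neg], dim=1) / t              # N x (1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # the positive is index 0
    loss = F.cross_entropy(logits, labels)                     # InfoNCE, Eqn. (1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # gradient update of f_q only

    with torch.no_grad():                                      # momentum update of f_k, Eqn. (2)
        for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
        queue = torch.cat([k.T, queue], dim=1)[:, :K]          # enqueue newest, drop oldest
    return loss.item()

# x = torch.randn(256, 3, 32, 32); train_step(aug(x), aug(x))  # with your augmentation
```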

The default batch size used in MoCo is 256. Data augmentation yields the positive sample pairs. The feature dimension in the memory bank paper is 128; for comparability, MoCo also uses 128.

For the current mini-batch, we encode queries and their corresponding keys that form positive sample pairs. Negative samples come from the queue.

1) Technical details

  The encoder can be any convolutional neural network; fq and fk can be identical, partially shared, or different. We adopt ResNet [33] as the encoder, whose last fully connected layer (after global average pooling) has a fixed-dimensional output (128-D [61]). This output vector is normalized by its L2 norm [61]; this is the representation of the query or key. The temperature τ in Eqn. (1) is set to 0.07 [61]. The data augmentation follows [61]: a 224 × 224 crop is taken from a randomly resized image, followed by random color jittering, random horizontal flipping, and random grayscale conversion, all available in PyTorch's torchvision package.
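
As a sketch of these technical details (the exact jitter/grayscale strengths below are typical values from public MoCo implementations, an assumption rather than something stated in these notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# ResNet-50 encoder whose last fully connected layer outputs a 128-D representation,
# which is then L2-normalized as described above.
encoder = torchvision.models.resnet50()
encoder.fc = nn.Linear(encoder.fc.in_features, 128)

def encode(x: torch.Tensor) -> torch.Tensor:
    return F.normalize(encoder(x), dim=1)

# Augmentations following the description: random resized 224x224 crop, color jitter,
# random grayscale, random horizontal flip.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.ColorJitter(0.4, 0.4, 0.4, 0.4),   # strengths assumed, not given in the notes
    T.RandomGrayscale(p=0.2),            # probability assumed as well
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```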

2) Shuffling BN

Both our encoders fq and fk use Batch Normalization (BN) [37], as in the standard ResNet [33]. In experiments we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids BN). The model appears to "cheat" on the pretext task and easily finds a low-loss solution, possibly because the intra-batch communication among samples (caused by BN) leaks information.

We solve this problem with shuffling BN. We train with multiple GPUs and perform BN on the samples independently on each GPU (as in common practice). For the key encoder fk, we shuffle the sample order of the current mini-batch before distributing it across GPUs (and shuffle it back after encoding); the sample order of the mini-batch for the query encoder fq is not changed. This ensures that the batch statistics used to compute a query and its positive key come from two different subsets, which effectively solves the cheating problem and lets training benefit from BN.
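
A simplified, single-process sketch of the shuffle/unshuffle bookkeeping (function names are my own). Note that the real effect only appears in multi-GPU training, where the shuffle happens before the batch is split across GPUs, so that each GPU's BN statistics are computed over a different subset:

```python
import torch

@torch.no_grad()
def batch_shuffle(x: torch.Tensor):
    """Permute the mini-batch before the key encoder; return the inverse permutation."""
    idx_shuffle = torch.randperm(x.size(0))
    idx_unshuffle = torch.argsort(idx_shuffle)
    return x[idx_shuffle], idx_unshuffle

@torch.no_grad()
def batch_unshuffle(k: torch.Tensor, idx_unshuffle: torch.Tensor) -> torch.Tensor:
    """Restore the original sample order so keys stay aligned with their queries."""
    return k[idx_unshuffle]

# x_k_shuffled, idx = batch_shuffle(x_k)
# k = f_k(x_k_shuffled)            # in multi-GPU training, BN inside f_k sees a different subset
# k = batch_unshuffle(k, idx)      # restore alignment with the queries
```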

4. Experiments

Datasets: ImageNet-1M (about 1.28 million images) and Instagram-1B (about 1 billion images); the latter is used to verify that MoCo scales well. Since the instance discrimination pretext task is used, the number of classes equals the number of images. In addition, the Instagram dataset better reflects real-world data distributions: its samples are not carefully curated, and it has long-tail and class-imbalance issues, with one or more objects per image.

Training: [compared with later work such as SimCLR and BYOL, MoCo has the lowest hardware requirements and is relatively easy to reproduce and afford. Moreover, the generalization of the MoCo / MoCo v2 line of work is very good: the features learned in pre-training remain very strong on downstream tasks. SimCLR's paper has relatively more citations, but MoCo's method is more approachable.]

200 epochs, ResNet-50, about 53 hours of training. Optimizer: SGD, weight decay 0.0001, SGD momentum 0.9.
For ImageNet-1M: learning rate 0.03, 8 GPUs, batch size 256.
For Instagram-1B: learning rate 0.12, 64 GPUs, batch size 1024.
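
Note that the two learning rates are consistent with the linear scaling rule (learning rate proportional to total batch size); this observation is mine rather than something stated in the notes. A one-line helper makes it explicit:

```python
def scaled_lr(batch: int, base_lr: float = 0.03, base_batch: int = 256) -> float:
    """Linear learning-rate scaling: lr grows in proportion to the total batch size."""
    return base_lr * batch / base_batch

print(scaled_lr(256))    # 0.03  -> the ImageNet-1M setting above
print(scaled_lr(1024))   # 0.12  -> the Instagram-1B setting above
```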

4.1. Linear Classification Protocol

A linear classification head (a fully connected layer) is added on top of the frozen pre-trained model.

The authors did a grid search and found that the optimal learning rate for the classification head is 30, which is unusual: in typical deep learning work the learning rate rarely exceeds 1. The authors therefore argue that such a strange learning rate indicates that the feature distribution learned by unsupervised contrastive learning is very different from the one learned by supervised learning.
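
A hedged sketch of this linear classification protocol (the MoCo checkpoint loading is left as a comment; the zero weight decay is a common linear-probe choice I am assuming, not something stated in these notes):

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50()
# backbone.load_state_dict(moco_pretrained_state_dict)   # hypothetical: load MoCo weights here
for p in backbone.parameters():
    p.requires_grad = False                  # freeze the pre-trained backbone
backbone.fc = nn.Linear(2048, 1000)          # new linear head: the only trainable layer

optimizer = torch.optim.SGD(backbone.fc.parameters(),
                            lr=30.0,         # the grid-searched value mentioned above
                            momentum=0.9,
                            weight_decay=0.0)  # assumption: no weight decay on the probe
```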

Ablation: contrastive loss mechanisms

Ablation experiments: a comparison among the three contrastive learning mechanisms (end-to-end, memory bank, MoCo). The experimental results are shown in the figure below:

In the figure above, the horizontal axis K is the number of negative samples, which can be roughly understood as the dictionary size, and the vertical axis is top-1 accuracy on ImageNet. The end-to-end method is limited by GPU memory, so its dictionary size is bounded: with 32 GB of memory, the largest usable batch size is 1024. The memory-bank approach is generally worse than end-to-end and MoCo, mainly because of feature inconsistency. The trend of the black (end-to-end) curve beyond that point is unknown, because the hardware cannot support the experiments; the results might improve or get worse. Increasing MoCo's dictionary size from 16384 to 65536 brings only a small improvement, so larger dictionaries were not compared. Conclusion from the figure: MoCo performs well, has low hardware requirements, and scales well.
 

Ablation: momentum

Ablation experiment on the momentum value. The experimental results are shown in the figure below:

With a momentum parameter between 0.99 and 0.9999, the results are better; a large momentum ensures the consistency of the features in the dictionary. With momentum 0, i.e., the query encoder is directly used as the key encoder at every iteration, the model fails to converge: the loss keeps oscillating and training fails. This experiment strongly demonstrates the importance of dictionary consistency.

Comparison with previous results.

The larger the model capacity, the better the results; MoCo achieves good results with both small and large models.

4.2. Transferring Features

The main goal of unsupervised learning is to learn transferable features; this section verifies whether the features learned by MoCo transfer well to downstream tasks.

In addition, since the feature distribution of unsupervised pre-training differs greatly from that of supervised pre-training, one cannot do a per-task hyperparameter search for the classification head when applying the unsupervised model to downstream tasks; doing so would defeat the purpose of unsupervised learning. The solution is feature normalization, followed by fine-tuning the whole model. The BN layers use synchronized BN: during multi-GPU training, the statistics from all GPUs are gathered to compute the overall running mean and variance before updating the BN layers, which makes feature normalization more thorough and training more stable.
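
A minimal sketch of converting BN layers to synchronized BN before downstream fine-tuning, using torchvision's ResNet-50 as a stand-in for the actual downstream backbone (an initialized torch.distributed process group is required for the synchronization to take effect during multi-GPU training):

```python
import torch
import torchvision

# Stand-in for the downstream model whose BN layers should be synchronized across GPUs.
model = torchvision.models.resnet50()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# Each BN layer now aggregates mean/variance statistics from all GPUs during training.
```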

Schedules. If the downstream dataset is large, a model trained from scratch (without ImageNet pre-training) can still reach very good results given a long enough training schedule, in which case MoCo's advantage would not show. When the downstream training schedule is short, however, MoCo pre-training is still clearly effective. Therefore shorter fine-tuning schedules are used for the comparisons.

PASCAL VOC object detection. Dataset: PASCAL VOC; task: object detection. The experimental results are as follows:

COCO dataset; the experimental results are as follows:

Other tasks: human keypoint detection, dense pose estimation, instance segmentation, and semantic segmentation; the experimental results are as follows:

Summary: MoCo surpasses supervised ImageNet pre-training on many tasks and is slightly worse on a few, mainly instance segmentation and semantic segmentation. [Some people later argued that contrastive learning is therefore not well suited to dense prediction tasks, which require a prediction for every pixel; follow-up work such as dense contrastive learning and pixel-level contrastive learning tried to improve this.]

In all these tasks, the model pre-trained on Instagram is better than the one pre-trained on ImageNet, showing that MoCo scales well. This is similar to the conclusion in NLP: the more self-supervised pre-training data, the better, which is consistent with the ultimate goal of unsupervised learning.

5. References

1. MoCo paper read paragraph by paragraph [paper reading series] (Bilibili)

2. Detailed explanation of the MoCo paper (flowzheng, CSDN blog)

3. Self-supervised learning - MoCo paper notes (xueliang_, CSDN blog)
