Self-Supervised Learning of Pretext-Invariant Representations

1. Abstract

The paper proposes Pretext-Invariant Representation Learning (PIRL), a self-supervised method for learning semantic representations that do not change under the image transformations used in pretext tasks. The image representations learned this way are invariant and of higher semantic quality, and they exceed the performance of pre-training with many supervised learning tasks.

2. Method

The idea in other papers is to predict some property of the transformation after the original image has been transformed. The features learned that way are therefore low-level features that covary with the transformation, which performs poorly on semantic recognition tasks.

PIRL in this paper works differently: first define a representation network $N$. Image $I$ is encoded by $N$ as $V_I$; after image $I$ is transformed (e.g., by a jigsaw rearrangement of its patches), the transformed image $I^t$ is encoded by the same network $N$ as $V_{I^t}$. Training makes $V_I$ and $V_{I^t}$ as close as possible, while keeping $V_I$ far from the representations $V_{I'}$ of other images ($I' \neq I$).
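
To make the setup concrete, here is a minimal PyTorch-style sketch (not the paper's code: `TinyEncoder` and all names are hypothetical, and a toy convolution stands in for the paper's ResNet-50):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Toy stand-in for the shared representation network N."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(32, dim)

    def forward(self, x):
        h = self.pool(F.relu(self.conv(x))).flatten(1)
        return F.normalize(self.proj(h), dim=1)  # unit-norm embedding

encoder = TinyEncoder()                    # one shared network N for both views
img = torch.randn(4, 3, 224, 224)          # batch of original images I
img_t = torch.randn(4, 3, 224, 224)        # stand-in for jigsaw-transformed images I^t
v_i, v_it = encoder(img), encoder(img_t)   # V_I and V_{I^t} from the same weights
# Training pulls each (v_i, v_it) pair together and pushes v_i away from other images.
```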

The network parameters are trained by minimizing the empirical risk, where $\mathcal{D}$ is the image dataset, $p(\mathcal{T})$ is the distribution over image transformations, $I^t$ is the image after applying transformation $t$, $\theta$ denotes the network parameters, and $V_I$ is the representation the network produces for image $I$:

$$\min_\theta \; \mathbb{E}_{t \sim p(\mathcal{T})} \left[ \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} L\big(V_I, V_{I^t}\big) \right]$$

  • Loss Function
    A contrastive loss function $L$ is defined whose goal is to make the representation of image $I$ as similar as possible to that of its transformed version $I^t$, and as dissimilar as possible to the representations of other images. It is built from a matching score

    $$h(V_1, V_2) = \frac{\exp\big(s(V_1, V_2)/\tau\big)}{\exp\big(s(V_1, V_2)/\tau\big) + \sum_{I' \in \mathcal{D}_N} \exp\big(s(V_2, V_{I'})/\tau\big)}$$

    where $s(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, and $\mathcal{D}_N$ is a set of negative images. Before computing $s$, the features are passed through different "heads": $f(\cdot)$ for the original image and $g(\cdot)$ for the transformed one, giving

    $$L_{NCE}(I, I^t) = -\log\Big[h\big(f(V_I), g(V_{I^t})\big)\Big] - \sum_{I' \in \mathcal{D}_N} \log\Big[1 - h\big(g(V_{I^t}), f(V_{I'})\big)\Big]$$

    To increase the number of negative examples without increasing the batch size, a memory bank $\mathcal{M}$ is used: it holds one representation $m_I$ for every image $I$, updated with the $f(V_I)$ computed in earlier epochs by an exponential moving average. The final loss

    $$L(I, I^t) = \lambda\, L_{NCE}\big(m_I, g(V_{I^t})\big) + (1-\lambda)\, L_{NCE}\big(m_I, f(V_I)\big)$$

    mixes two terms: the second term makes $f(V_I)$ as similar as possible to its own memory representation $m_I$, and as dissimilar as possible to the memory representations $m_{I'}$ of other images. (A code sketch of this loss and the memory bank follows this list.)
  • Implementation details
    $f(V_I)$: image $I$ is passed through the ResNet-50 trunk up to res5, followed by average pooling and a linear projection, yielding a 128-dimensional vector representation.
    $g(V_{I^t})$: image $I$ is divided into a 3×3 grid of nine jigsaw patches; each patch is passed through the res5 network and average-pooled, then projected linearly to a 128-dimensional vector per patch. The nine patch vectors are concatenated in a random order, and a final linear projection produces a single 128-dimensional representation (see the sketch after this list).
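
As referenced in the loss-function bullet above, here is a hedged sketch of the contrastive objective and the memory bank. It uses a simplified InfoNCE-style form of $L_{NCE}$ (the paper's exact NCE formulation adds a further term over negatives); all names and the values of $\tau$, $\lambda$, and the EMA momentum are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def nce_loss(query, positive, negatives, tau=0.07):
    """Simplified InfoNCE-style contrastive term (tau = 0.07 is an assumption).

    query, positive: (D,) embeddings; negatives: (N, D) memory-bank entries.
    """
    pos = torch.exp(F.cosine_similarity(query, positive, dim=0) / tau)
    neg = torch.exp(F.cosine_similarity(query.unsqueeze(0), negatives, dim=1) / tau).sum()
    return -torch.log(pos / (pos + neg))

# Memory bank: one moving-average representation m_I per training image.
num_images, dim, momentum = 10, 128, 0.5
memory_bank = F.normalize(torch.randn(num_images, dim), dim=1)

def update_bank(idx, f_v):
    """Exponential moving average update of m_I with the fresh feature f(V_I)."""
    memory_bank[idx] = F.normalize(momentum * memory_bank[idx] + (1 - momentum) * f_v, dim=0)

# Final PIRL-style loss for one image I (index 3 here) and its jigsaw version.
lam, idx = 0.5, 3
m_i = memory_bank[idx]
negs = torch.cat([memory_bank[:idx], memory_bank[idx + 1:]])  # other images' m_{I'}
f_v = F.normalize(torch.randn(dim), dim=0)    # f(V_I), original-image head output
g_vt = F.normalize(torch.randn(dim), dim=0)   # g(V_{I^t}), jigsaw head output
loss = lam * nce_loss(g_vt, m_i, negs) + (1 - lam) * nce_loss(f_v, m_i, negs)
update_bank(idx, f_v)
```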

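And a sketch of the $g(\cdot)$ head from the implementation details: per-patch features are projected, shuffled, concatenated, and merged. Random tensors stand in for the pooled res5 features since the backbone is omitted; the dimensions and names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JigsawHead(nn.Module):
    """Sketch of g(.): project each patch, concatenate the patch vectors in
    random order, then map back to a single 128-d representation."""
    def __init__(self, feat_dim=2048, out_dim=128, n_patches=9):
        super().__init__()
        self.patch_proj = nn.Linear(feat_dim, out_dim)        # per-patch projection
        self.merge = nn.Linear(n_patches * out_dim, out_dim)  # final projection
        self.n_patches = n_patches

    def forward(self, patch_feats):
        # patch_feats: (batch, n_patches, feat_dim) average-pooled res5 features
        z = self.patch_proj(patch_feats)           # (B, 9, 128)
        order = torch.randperm(self.n_patches)     # random patch ordering
        z = z[:, order, :].flatten(1)              # concatenate shuffled patch vectors
        return F.normalize(self.merge(z), dim=1)   # final 128-d representation

head_g = JigsawHead()
feats = torch.randn(4, 9, 2048)   # pooled res5 features for 9 patches of 4 images
print(head_g(feats).shape)        # torch.Size([4, 128])
```
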
3. Experimental results

On the object detection task, PIRL surpasses other self-supervised learning methods and improves on pre-training with the original Jigsaw pretext task by five points. On other tasks, such as image classification with linear models, and across different datasets, it likewise outperforms other self-supervised pre-training methods.

By comparing the $\ell_2$ distance between the representation of an original image and that of its transformed version, the paper shows that the representations learned by PIRL are invariant.
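
A hedged sketch of that invariance probe, with random vectors standing in for real network outputs:

```python
import torch
import torch.nn.functional as F

# Unit-normalize representations of originals and their transformed versions,
# then measure per-image l2 distance: smaller distance => more invariant features.
v = F.normalize(torch.randn(100, 128), dim=1)     # representations of original images
v_t = F.normalize(torch.randn(100, 128), dim=1)   # representations of transformed images
dist = (v - v_t).norm(dim=1)
print(dist.mean())  # invariant features would give a small mean distance
```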

Origin: blog.csdn.net/pitaojun/article/details/108563762