Object Detection Pre-training Method Based on Contrastive Learning

Reference link: papers related to object detection pre-training models (based on contrastive learning)

This post mainly covers the following papers:
1. DenseCL (CVPR 21 oral)
2. DetCo (ICCV 21)
3. InstanceLoc (CVPR 21)
4. self-EMD
The core ideas of self-EMD are: a patch partitioning strategy (to better learn local information) and matching of positive/negative pairs (to obtain higher-quality contrastive pairs).

1、(DenseCL) Dense Contrastive Learning for Self-Supervised Visual Pre-Training (CVPR 21 oral)

Code link
Paper link
Dense (local) contrastive loss:
Two views M1 and M2 are generated from the same image via data augmentation, and the backbone produces feature maps F1 and F2. Each map is spatially partitioned into 7×7 = 49 cells, and each cell is passed through a dense projection head (an MLP), yielding two sets of 49 feature vectors, R and T. A pairwise cosine-similarity matrix is computed between R and T, and for each r_i the most similar t_j is selected as its positive (optimal positive selection) for contrastive learning.
Training combines the global and the dense (local) contrastive losses.
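As a minimal NumPy sketch (the function name and shapes are illustrative, not from the paper's released code), the optimal-positive selection between the two sets of 49 dense features can be written as:

```python
import numpy as np

def dense_positive_match(r, t):
    """For each of the 49 local features in r, pick the most similar
    feature in t (by cosine similarity) as its positive.
    r, t: arrays of shape (49, d)."""
    r_n = r / np.linalg.norm(r, axis=1, keepdims=True)
    t_n = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = r_n @ t_n.T          # (49, 49) cosine-similarity matrix
    return sim.argmax(axis=1)  # index of the optimal positive t_j for each r_i
```

The returned indices define the positive pairs; every other t_j (plus features from other images in the queue) then serves as a negative in the dense InfoNCE loss.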
Object detection performance of different pre-trained models:
On object detection tasks, DenseCL pre-training outperforms both MoCo v2 pre-training and supervised pre-training.

2、DetCo (ICCV 21)

Code link
Paper link

The biggest difference from DenseCL: DenseCL selects matching patches via the similarity matrix (optimal positive selection), whereas DetCo directly concatenates the patch features for contrastive learning.
DetCo proposes three principles: 1) contrastive learning works better than supervised or clustering-based pre-training; 2) keeping both low-level and high-level features benefits detection; 3) local features are very important.
Proposed method:
1) Combining low-level and high-level features: take the 4 feature levels from stages 2, 3, 4, and 5 of ResNet-50; each passes through an MLP to produce 4 groups of (q, k). Loss(GG) is the contrastive loss computed on these 4 groups of (q, k).
2) Local features: cut 9 patches from the original image and arrange them differently to form two views M1 and M2. Each patch goes through the encoder + MLP to obtain 9 local feature representations, which are concatenated to compute Loss(LL).
3) Global + local: contrast the global feature representation against the concatenated local representation.
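A minimal sketch of the local branch, assuming square inputs whose sides divide evenly by 3 and using a stand-in `encode` function in place of the real encoder + MLP:

```python
import numpy as np

def split_into_patches(img, grid=3):
    """Split an H x W x C image into grid*grid equal patches
    (H and W are assumed divisible by grid)."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [img[i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(grid) for j in range(grid)]

def local_representation(img, encode, grid=3):
    """Encode each patch independently, then concatenate the results,
    mirroring how DetCo builds its local representation for Loss(LL)."""
    return np.concatenate([encode(p) for p in split_into_patches(img, grid)])
```

In DetCo the two views M1 and M2 differ in patch arrangement; each view's concatenated local vector is contrasted against the other's (Loss(LL)) and against the global vector (global + local).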
Open questions:
1) How are the 9 patches selected, and should they be rearranged (shuffled)?
2) Must the features of the 9 patches match those of the whole image? Why?

3、InstanceLoc (CVPR 21)

PyTorch code link
Paper link
Blog link

Specific method: 1) Paste two random crops of the same image (which constitute a positive pair) onto different background images; RoIAlign then extracts RoI features at the known bounding-box positions for contrastive learning. (This guides the model to spatially ignore the background and acquire localization ability.) 2) Bounding-box augmentation: randomly sample anchors whose IoU with the pasted crop's bounding box is below 0.5; the features corresponding to those regions serve as negative samples.
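The IoU-threshold step for picking negative anchors can be sketched in plain Python (the helper names are illustrative; boxes are (x1, y1, x2, y2)):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def negative_anchors(crop_box, anchors, thresh=0.5):
    """Keep anchors whose IoU with the pasted crop's box is below thresh;
    the RoI features pooled from these regions act as negative samples."""
    return [a for a in anchors if iou(crop_box, a) < thresh]
```

In the actual pipeline, RoIAlign would pool features at both the crop box (positive) and the selected low-IoU anchors (negatives) before the contrastive loss.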

4、self-EMD

self-EMD can be trained directly on COCO, without pre-training on ImageNet.

Motivation:
1) Existing self-supervised training assumes that different views/crops of the same image depict the same object; otherwise, random crops introduce semantic noise. Object-centric datasets such as ImageNet satisfy this assumption, but multi-object datasets such as COCO do not. (Solution: use EMD for optimal patch matching.)
2) Global pooling destroys spatial structure and discards local information, hurting localization accuracy. (Solution: do not use an MLP head; use a convolutional head instead.)
Method: after random cropping, use SPP-style cropping to obtain patch features at different scales, and use EMD to compute the similarity between patches (instead of an L2 distance).
EMD (Earth Mover's Distance): the minimum effort required to transform one pile of earth into another.
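For intuition only, here is a brute-force EMD between two small, equal-weight sets of patch features; self-EMD solves the full transport problem with a proper solver, while this sketch enumerates assignments (only feasible for a handful of patches):

```python
import itertools
import numpy as np

def emd_distance(r, t):
    """EMD between two equal-size, equal-weight sets of patch features.
    Per-patch cost = 1 - cosine similarity; the optimal one-to-one
    transport plan is found by enumerating all assignments."""
    r_n = r / np.linalg.norm(r, axis=1, keepdims=True)
    t_n = t / np.linalg.norm(t, axis=1, keepdims=True)
    cost = 1.0 - r_n @ t_n.T   # (n, n) cost matrix
    n = cost.shape[0]
    best = min(sum(cost[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))
    return best / n  # lower distance = more similar views
```

Because EMD matches patches optimally, two crops that share only some objects still get a low distance through their matching patches, which is exactly the property that makes it suitable for multi-object images like COCO.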

Origin blog.csdn.net/sinat_34201763/article/details/125524725