Unsupervised contrastive learning pretending to be supervised: SwAV

The previously introduced MoCo and SimCLR focus mainly on increasing the number of negative samples, which is time-consuming and resource-hungry; SwAV goes back to basics and relies on cluster assignments instead.

Brief steps

  1. The input of each batch is $x \in \mathbb{R}^{N \times C \times H \times W}$; after two different augmentations we obtain $x_1$ and $x_2$.
  2. Feeding $x_1$ and $x_2$ into the network yields the outputs $z_1, z_2 \in \mathbb{R}^{N \times d}$.
  3. Given $K$ known cluster centers (prototypes) $C \in \mathbb{R}^{K \times d}$, compute the similarity between the outputs and the cluster centers to obtain the code matrix $Q \in \mathbb{R}^{K \times N}$. Ideally the similarity between a sample and its own cluster center is 1 and all others are 0, which is essentially the one-hot label of a supervised task, but the author found that soft labels work better. In this way each sample gets a new representation (its codes).
  4. Compute the loss. With $z$ and $q$ available, the feature of one view of an image should be able to predict the codes produced by the other view, so the author defines a swapped loss (a minimal sketch of this loss follows the list):
     $$L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t)$$
     where $\ell(z_t, q_s) = -\sum_k q_s^{(k)} \log p_t^{(k)}$ and $p_t^{(k)} = \dfrac{\exp(z_t^\top c_k / \tau)}{\sum_{k'} \exp(z_t^\top c_{k'} / \tau)}$.
     This is why the title says that SwAV looks as if it were supervised.
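
To make steps 3 and 4 concrete, below is a minimal PyTorch sketch of the swapped-prediction loss. It is a simplification rather than the official implementation: the soft codes q are produced here by a plain temperature softmax over prototype similarities, whereas the paper computes them with a Sinkhorn-Knopp-style equipartition procedure, and the prototype matrix in the toy usage lines is just a random tensor.

import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, tau=0.1):
    """z1, z2: L2-normalized features of two views, shape (N, d).
    prototypes: cluster centers C, shape (K, d)."""
    # Similarity scores between features and prototypes: (N, K)
    scores1 = z1 @ prototypes.t()
    scores2 = z2 @ prototypes.t()

    # Soft codes q, here simply a softmax (the paper instead uses a
    # Sinkhorn-Knopp procedure); stored as (N, K), i.e. the transpose
    # of the Q in R^{K x N} above
    with torch.no_grad():
        q1 = F.softmax(scores1 / tau, dim=1)
        q2 = F.softmax(scores2 / tau, dim=1)

    # p_t^(k) = softmax(z_t^T c_k / tau); cross-entropy against the
    # swapped codes implements l(z_t, q_s) = -sum_k q_s^(k) log p_t^(k)
    log_p1 = F.log_softmax(scores1 / tau, dim=1)
    log_p2 = F.log_softmax(scores2 / tau, dim=1)
    return -(q2 * log_p1).sum(dim=1).mean() - (q1 * log_p2).sum(dim=1).mean()

# Toy usage with random tensors
z1 = F.normalize(torch.randn(8, 128), dim=1)
z2 = F.normalize(torch.randn(8, 128), dim=1)
C = F.normalize(torch.randn(3000, 128), dim=1)
print(swapped_prediction_loss(z1, z2, C))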

At the same time, SwAV also proposes a new data augmentation method that mixes views of different resolutions.
The multi-crop strategy consists of:
(1) two standard RandomResizedCrop views;
(2) V additional small, low-resolution views.
For example, for the ImageNet dataset, in the code below:

  1. nmb_crops = [2, 6] means two standard random crops and six small views;
  2. size_crops = [224, 96] indicates that the standard RandomResizedCrop outputs 224×224 crops, while the small views are 96×96 after their RandomResizedCrop;
  3. min_scale_crops = [0.14, 0.05], max_scale_crops = [1.00, 0.14] means that the scale of small views in RandomResizedCrop is (0.05, 0.14), and the scale in standard RandomResizedCrop is (0.14, 1.00).
# Snippet from SwAV's multi-crop dataset class: size_crops, nmb_crops,
# min_scale_crops, max_scale_crops and pil_blur are constructor arguments,
# and get_color_distortion / RandomGaussianBlur / PILRandomGaussianBlur are
# helper transforms defined alongside it in the repository.
from torchvision import transforms

# Color distortion followed by Gaussian blur; pil_blur switches to the PIL-based blur
color_transform = [get_color_distortion(), RandomGaussianBlur()]
if pil_blur:
    color_transform = [get_color_distortion(), PILRandomGaussianBlur()]
# ImageNet normalization statistics
mean = [0.485, 0.456, 0.406]
std = [0.228, 0.224, 0.225]
trans = []
# Build one pipeline per crop resolution and repeat it nmb_crops[i] times
for i in range(len(size_crops)):
    randomresizedcrop = transforms.RandomResizedCrop(
        size_crops[i],
        scale=(min_scale_crops[i], max_scale_crops[i]),
    )
    trans.extend([transforms.Compose([
        randomresizedcrop,
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.Compose(color_transform),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std)])
    ] * nmb_crops[i])
self.trans = trans
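
For intuition, here is a rough sketch of how this list of pipelines is consumed: in the dataset's __getitem__, each transform is applied to the same source image, yielding two 3×224×224 views and six 3×96×96 views per sample. The file name below is a placeholder, not part of the original code.

from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # placeholder path
multi_crops = [t(image) for t in trans]           # list of 2 + 6 = 8 tensors
print([tuple(crop.shape) for crop in multi_crops])
# -> two (3, 224, 224) crops followed by six (3, 96, 96) crops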

SwAV further narrows the gap between self-supervised and supervised learning: its accuracy is only 1.2% below that of supervised training. The SwAV here is trained for 800 epochs with a large batch size (4096). In the end, the combination of the two methods (the swapped-assignment objective and the multi-crop augmentation) brings a 4.2-point improvement.
Comparing the performance of different self-supervised learning methods at batch_size = 256, SwAV still achieves the best, state-of-the-art results.
The role of Multi-crop
For the self-supervised learning methods compared, the 2×160 + 4×96 multi-crop strategy is consistently better than the plain 2×224 augmentation.


Origin: blog.csdn.net/weixin_42764932/article/details/112845236