《Vision Transformers with Patch Diversification》

Paper link: https://arxiv.53yu.com/pdf/2104.12753.pdf?ref=https://githubhelp.com
Code link: https://github.com/ChengyueGongR/PatchVisionTransformer

1. Motivation

[Figure 1: average absolute cosine similarity between patch representations across layers in DeiT and Swin Transformer; panel (b) shows the similarity increasing with depth]

Vision Transformers have shown good performance on challenging computer vision tasks. However, their training has been found to be not particularly stable, especially as the model becomes wider and deeper. To study the causes of this training instability, the authors extract the patch representations of each self-attention layer in two popular vision Transformer variants (DeiT and Swin Transformer) and compute the average absolute cosine similarity between patch representations. In both models, the similarity between patch representations increases significantly with depth, as shown in Figure 1(b) above. This behavior reduces the overall expressiveness of the patch representations and limits the learning capacity of otherwise powerful vision Transformers. More specifically, in deep vision Transformers the self-attention modules tend to map different patches to similar latent representations, resulting in information loss and performance degradation. (The problem is very similar to the one addressed in "REVISITING OVER-SMOOTHING IN BERT FROM THE PERSPECTIVE OF GRAPH", but the way it is solved is different.)
Note: if the input patch representation sequence is $h = [h_{class}, h_1, \cdots, h_n]$, the average absolute cosine similarity is computed as follows (the class patch is ignored here):
$$\mathcal{P}(h) = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{\left| h_i^\top h_j \right|}{\| h_i \|_2 \, \| h_j \|_2}$$
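This diagnostic is straightforward to compute. Below is a minimal PyTorch sketch, not taken from the official repository; the shapes in the example are assumptions (DeiT-S-like, 196 patches of dimension 384).

```python
import torch
import torch.nn.functional as F

def avg_abs_cosine_similarity(h: torch.Tensor) -> torch.Tensor:
    """Average absolute pairwise cosine similarity between patch tokens.

    h: (n, d) patch representations of one image, class token already removed.
    """
    n = h.size(0)
    h_norm = F.normalize(h, dim=-1)        # unit-normalize each patch vector
    sim = (h_norm @ h_norm.t()).abs()      # (n, n) matrix of |cos| values
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (n * (n - 1))        # average over all i != j pairs

# Example with assumed shapes: 196 patch tokens of dimension 384.
h = torch.randn(196, 384)
print(avg_abs_cosine_similarity(h).item())
```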

2. Method

In order to alleviate this problem, the paper does not modify the model architecture of the vision Transformer during training; it only introduces new loss terms that explicitly encourage different patch representations to extract more diverse features. Specifically, three losses are proposed (a PyTorch sketch of all three follows the detailed descriptions below):
1) Patch-wise cosine loss: directly improves the diversity among patch representations by penalizing the pairwise patch cosine similarity.
2) Patch-wise contrastive loss: encourages the representation of each patch in later layers to stay similar to its corresponding patch in the first layer and to differ from non-corresponding patches. (This is motivated by the observation that the input patch representations of the first self-attention layer depend only on the input pixels and therefore tend to be more diverse.)
3) Patch-wise mixing loss: a mixing loss similar to CutMix. Input patches from two different images are mixed, and the patch representations from each image are used to predict its corresponding class label. Under this loss, the self-attention layers are forced to attend only to the patches most relevant to their own class, thereby learning more discriminative features.

  • Patch-wise cosine loss
    [Figure (a): illustration of the patch-wise cosine loss]
    As a direct solution, the paper proposes to directly minimize the absolute cosine similarity between different patch representations, as shown in (a) above. Given an input $x$ with last-layer patch representations $h^{[L]}$, a patch-wise cosine loss is added to the training objective:
    $$\mathcal{L}_{cos}(h^{[L]}) = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{\left| (h_i^{[L]})^\top h_j^{[L]} \right|}{\| h_i^{[L]} \|_2 \, \| h_j^{[L]} \|_2}$$
    This regularization loss explicitly minimizes the pairwise cosine similarity between different patches, which can be viewed as minimizing an upper bound on the largest eigenvalue of $h h^\top$, thereby improving the expressiveness of the representation.

  • Patch-wise contrastive loss
    [Figure (b): illustration of the patch-wise contrastive loss between first-layer and last-layer patch representations]
    The representations learned in early layers are more diverse than those learned in deeper layers. The paper therefore proposes a contrastive loss that uses the representations from the first layer to regularize the deeper layers and reduce the similarity between patch representations. Specifically, given an input image $x$, let $h^{[1]} = \{ h_i^{[1]} \}_i$ and $h^{[L]} = \{ h_i^{[L]} \}_i$ denote the patch representations of the first and the last layer, respectively. Each $h_i^{[L]}$ is constrained to be similar to $h_i^{[1]}$ and dissimilar to any other patch $h_{j \neq i}^{[1]}$, that is,
    [Equation: patch-wise contrastive loss]
    In the experiments, the gradient through $h^{[1]}$ is stopped.

  • Patch-wise mixing loss
    [Figure (c): illustration of the patch-wise mixing loss with CutMix-style mixed inputs]
    The paper recommends training every patch to predict a class label, rather than using only the class token for the final prediction. This can be combined with CutMix-style data augmentation to provide additional training signals to the vision Transformer. As shown in (c) above, input patches from two different images are mixed, and a shared linear classification head is attached to each output patch representation for classification. The mixing loss forces each patch to attend only to the subset of patches coming from the same input image and to ignore irrelevant patches. It therefore effectively prevents simple averaging over different patches and yields more informative patch representations. The patch-wise mixing loss can be written as
    $$\mathcal{L}_{mixing}(h^{[L]}) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{ce}\big( g(h_i^{[L]}), \, y_i \big)$$
    where $h_i^{[L]}$ denotes the last-layer representation of patch $i$, $g$ is an additional linear classification head, $y_i$ denotes the patch-wise class label, and $\mathcal{L}_{ce}$ is the cross-entropy loss.
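Below is a minimal PyTorch sketch of how the three losses could be implemented. It is an illustration only, not the official implementation from the linked repository: the tensor shapes, the `temperature` argument, and the exact InfoNCE form of the contrastive loss are assumptions, and the mixing loss assumes per-patch labels derived from the CutMix mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_cosine_loss(h_last: torch.Tensor) -> torch.Tensor:
    """Patch-wise cosine loss: mean absolute cosine similarity over all pairs
    of different patches in the last layer.

    h_last: (B, N, D) last-layer patch representations, class token excluded.
    """
    n = h_last.size(1)
    h = F.normalize(h_last, dim=-1)
    sim = torch.einsum("bid,bjd->bij", h, h).abs()                  # (B, N, N)
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=1)
    return (off_diag / (n * (n - 1))).mean()

def patch_contrastive_loss(h_first: torch.Tensor, h_last: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Patch-wise contrastive loss, written here in a generic InfoNCE style:
    each last-layer patch should match its own first-layer patch and differ
    from all the others. The gradient through the first layer is stopped.

    h_first, h_last: (B, N, D).
    """
    h1 = F.normalize(h_first.detach(), dim=-1)   # stop-gradient on layer-1 patches
    hL = F.normalize(h_last, dim=-1)
    logits = torch.einsum("bid,bjd->bij", hL, h1) / temperature     # (B, N, N)
    targets = torch.arange(logits.size(1), device=logits.device)
    targets = targets.unsqueeze(0).expand(logits.size(0), -1)       # positive = same index
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

def patch_mixing_loss(h_last: torch.Tensor, patch_labels: torch.Tensor,
                      patch_head: nn.Linear) -> torch.Tensor:
    """Patch-wise mixing loss: a shared linear head predicts, for every patch,
    the class of the image that patch was taken from in the CutMix-style input.

    h_last:       (B, N, D) last-layer patch representations.
    patch_labels: (B, N) long tensor of per-patch class labels.
    """
    logits = patch_head(h_last)                                     # (B, N, num_classes)
    return F.cross_entropy(logits.flatten(0, 1), patch_labels.flatten())
```

In this sketch, `patch_labels` would assign each patch the class label of the image whose pixels occupy its location in the mixed input, mirroring the description above.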

Finally, the training of vision Transformers is improved by simply minimizing the weighted combination $\alpha_1 \mathcal{L}_{cos} + \alpha_2 \mathcal{L}_{contrast} + \alpha_3 \mathcal{L}_{mixing}$ jointly with the standard classification objective. No network modifications are required, and the method is not tied to any specific architecture. In the experiments, the paper simply sets $\alpha_1 = \alpha_2 = \alpha_3 = 1$ without any specific hyperparameter tuning.
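Using the sketch functions from the previous block, the combined regularizer could be assembled as follows; this is again a hypothetical helper, with the weights defaulting to 1 as in the paper.

```python
def patch_diversification_loss(h_first, h_last, patch_labels, patch_head,
                               alpha1: float = 1.0, alpha2: float = 1.0,
                               alpha3: float = 1.0) -> torch.Tensor:
    """Weighted sum of the three diversification losses; added to the usual
    classification loss during training (alpha1 = alpha2 = alpha3 = 1 in the paper)."""
    return (alpha1 * patch_cosine_loss(h_last)
            + alpha2 * patch_contrastive_loss(h_first, h_last)
            + alpha3 * patch_mixing_loss(h_last, patch_labels, patch_head))
```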

3. Some experimental results

  • Image classification results
    1) ImageNet dataset
    2) ImageNet-22K dataset
  • Transfer learning results on semantic segmentation
    1) ADE20K dataset
    2) Cityscapes dataset
  • Comparison of average absolute patch cosine similarity (ImageNet dataset)
  • Ablation studies
    1) Effectiveness of the regularization strategies
    2) Training stability

4. Conclusion

1) The core of the paper is to promote patch diversity when training vision Transformers, thereby improving the learning capacity of the model; this is achieved mainly through the three proposed losses.
2) The experiments show that diversifying the patch representations, without changing the Transformer model structure, makes it possible to train larger and deeper models and to obtain better performance on image classification tasks.
3) The paper only conducts experiments on supervised tasks.

Origin blog.csdn.net/weixin_43994864/article/details/123289613