LV-ViT: All Tokens Matter: Token Labeling for Training Better Vision Transformers


This article covers LV-ViT, a training-enhancement method for Vision Transformers. Previous ViT classification models only use the class token to gather global information for the final prediction. The authors propose to also include the patch tokens in the loss computation, which is equivalent to converting the classification of one image into a recognition problem over every token, where the classification label of each token is generated by a machine annotator.

Original link: All Tokens Matter: Token Labeling for Training Better Vision Transformers
Another version: Token Labeling: Training an 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
Code: https://github.com/zihangJiang/TokenLabeling

All Tokens Matter: Token Labeling for Training Better Vision Transformers [NeurIPS 2021]

Abstract

In this paper, a new training objective, Token Labeling, is proposed for training high-performance Vision Transformers (ViTs). The standard training objective of ViTs computes the classification loss on an additional trainable class token, while the proposed objective uses all image patch tokens to compute the training loss in a dense manner.

That is, the image classification problem is reformulated as a multiple token-level recognition problem, and each patch token is assigned an individual, location-specific supervision signal generated by a machine annotator.

With Token Labeling, a 26M-parameter Transformer model achieves 84.4% Top-1 accuracy on ImageNet.
By slightly scaling the model up to 150M parameters, the result can be further improved to 86.4%, making it the smallest model to reach 86% among previous models (which have 250M+ parameters).

1 Introduction

Recent vision transformers usually use the class token to predict the output class, while ignoring the role of the other patch tokens, which encode rich information about their respective local image patches.

In this paper, a new Vision Transformer training method called LV-ViT is proposed, which uses both the patch tokens and the class token. The method adopts the K-dimensional score maps generated by a machine annotator to supervise all tokens in a dense manner, where K is the number of categories of the target dataset. In this way, each patch token is explicitly associated with an individual, location-specific supervision signal indicating whether the target object is present in the corresponding image patch, improving the object recognition capability of Vision Transformers at negligible computational cost. This is the first work to show that dense supervision is beneficial for vision Transformers in image classification.

As shown in the figure below, LV-ViT with 56M parameters achieves 85.4% top-1 accuracy on ImageNet, outperforming all other Transformer-based models with up to 100M parameters. When the model size is scaled up to 150M, the result is further improved to 86.4%.

[Figure: ImageNet top-1 accuracy versus number of parameters for LV-ViT and other Transformer-based models]

2 Method

A conventional ViT divides the image into patches and then appends a class token. After multiple rounds of self-attention, the image information is aggregated into the class token. Finally, only the image-level label is used as supervision, ignoring the rich information embedded in each image patch. In the loss below, $X^{cls}$ is the class-token output of the last Transformer block, $H(\cdot,\cdot)$ is the softmax cross-entropy loss, and $y^{cls}$ is the class label:
$$L_{cls} = H(X^{cls}, y^{cls})$$
In this paper, a new training objective, token labeling, is proposed, which exploits the complementary information between patch tokens and the class token.

2.1 Token Labeling


Token Labeling emphasizes the importance of all output tokens and argues that each output token should be associated with an individual, location-specific label. The label of an input image therefore involves not only a single K-dimensional vector $y^{cls}$ (the image-level label), but also a $K \times N$ matrix, i.e., a K-dimensional score map $[y^1, \ldots, y^N]$, where N is the number of output patch tokens. In other words, every token has its own label, so each token can contribute an auxiliary loss. Each label indicates whether the target object is present in the corresponding image patch. How the K-dimensional score map is obtained is not covered in detail here, as it is outside the scope of this post; interested readers can refer to Token Labeling.

Each training image has a dense score map; during training, the cross-entropy loss between each output patch token and its aligned label in the score map is used as an auxiliary loss. The patch-token loss is defined as:
$$L_{aux} = \frac{1}{N}\sum_{i=1}^{N} H(X^{i}, y^{i})$$
The overall loss is the conventional class-token loss plus the patch-token loss, where β is a hyperparameter balancing the two terms (set to 0.5 in the experiments):
$$L_{total} = H(X^{cls}, y^{cls}) + \beta \cdot \frac{1}{N}\sum_{i=1}^{N} H(X^{i}, y^{i})$$
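
To make the objective concrete, here is a minimal PyTorch sketch of the combined loss, assuming soft (K-dimensional) targets for both the class token and the patch tokens; the tensor names and the soft_cross_entropy helper are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target):
    # Cross-entropy against (possibly soft) K-dimensional targets.
    return torch.sum(-target * F.log_softmax(logits, dim=-1), dim=-1).mean()

def token_labeling_loss(cls_logits, patch_logits, cls_label, token_labels, beta=0.5):
    """
    cls_logits:   (B, K)     output of the class token
    patch_logits: (B, N, K)  outputs of the N patch tokens
    cls_label:    (B, K)     image-level label (one-hot or soft)
    token_labels: (B, N, K)  machine-generated per-token labels from the score map
    """
    # L_cls: standard classification loss on the class token
    l_cls = soft_cross_entropy(cls_logits, cls_label)
    # L_aux: average cross-entropy over all patch tokens
    l_aux = soft_cross_entropy(
        patch_logits.reshape(-1, patch_logits.size(-1)),
        token_labels.reshape(-1, token_labels.size(-1)),
    )
    # L_total = L_cls + beta * L_aux   (beta = 0.5 in the paper's experiments)
    return l_cls + beta * l_aux
```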
Token Labeling has the following advantages:

  1. Knowledge distillation usually requires a teacher model to generate supervision labels online, while token labeling is a simpler operation: the dense score maps can be generated once by a pre-trained model (e.g., EfficientNet or NFNet). During training, the score map only needs to be cropped and interpolated to align spatially with the cropped image (see the sketch after this list), so the additional computational cost is negligible.
  2. Unlike most classification models and relabeling strategies that use a single label vector as supervision, this method also uses the score map to supervise the model in a dense manner, so the label of each patch token provides location-specific information. This helps the trained model locate target objects more easily and improves recognition accuracy.
  3. Due to the dense supervision employed during training, models pretrained with token labeling benefit downstream tasks that require dense predictions, such as semantic segmentation.
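
To illustrate the crop-and-interpolate step from the first point, below is a rough sketch of how a pre-generated score map could be aligned with a random crop; the function name, the (i, j, h, w) crop convention, and the bilinear resizing choice are my own assumptions, not the official implementation.

```python
import torch
import torch.nn.functional as F

def align_score_map(score_map, crop_box, img_size, num_patches_per_side):
    """
    score_map: (K, H, W)   dense score map pre-generated by a machine annotator
    crop_box:  (i, j, h, w) crop taken from the original image, in pixels
    img_size:  (H_img, W_img) size of the original image
    Returns (K, P, P) labels aligned with the P x P grid of patch tokens.
    """
    K, H, W = score_map.shape
    i, j, h, w = crop_box
    H_img, W_img = img_size
    # Map the pixel crop into score-map coordinates.
    top = int(round(i / H_img * H))
    left = int(round(j / W_img * W))
    bottom = int(round((i + h) / H_img * H))
    right = int(round((j + w) / W_img * W))
    cropped = score_map[:, top:max(bottom, top + 1), left:max(right, left + 1)]
    # Interpolate to one K-dimensional label per patch token.
    aligned = F.interpolate(
        cropped.unsqueeze(0),
        size=(num_patches_per_side, num_patches_per_side),
        mode="bilinear",
        align_corners=False,
    ).squeeze(0)
    return aligned  # later flattened to (N, K), with N = P * P
```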

2.2 Token Labeling with MixToken

Several previous augmentation methods, for comparison:

  1. Mixup: mixes two random samples in proportion, and the classification targets are mixed with the same proportion;
  2. Cutout: randomly cuts out a region of the sample and fills it with zero pixels, leaving the classification target unchanged;
  3. CutMix: also cuts out a region, but instead of filling it with zeros it fills it with the pixel values of a region from another training image, and the classification targets are mixed according to the area ratio (a mask-sampling sketch follows this list).
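
Since MixToken (below) reuses the region-based mask generation from CutMix, here is a hedged sketch of CutMix-style box sampling written for a token grid rather than pixels; the helper name rand_token_mask and the 14x14 grid in the usage line are my own choices.

```python
import numpy as np

def rand_token_mask(grid_h, grid_w, lam):
    """Sample a binary mask over a grid_h x grid_w token grid.
    lam is drawn from a Beta distribution as in CutMix; the sampled box
    covers roughly a (1 - lam) fraction of the grid. The mask is 1 inside
    the box and 0 outside; which value maps to which image is decided
    when the mask is applied (see the MixToken formulas below)."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(grid_h * cut_ratio), int(grid_w * cut_ratio)
    cy, cx = np.random.randint(grid_h), np.random.randint(grid_w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, grid_h), np.clip(cy + cut_h // 2, 0, grid_h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, grid_w), np.clip(cx + cut_w // 2, 0, grid_w)
    mask = np.zeros((grid_h, grid_w), dtype=np.float32)
    mask[y1:y2, x1:x2] = 1.0
    return mask

# Example: lam ~ Beta(alpha, alpha), e.g. alpha = 1.0
lam = np.random.beta(1.0, 1.0)
M = rand_token_mask(14, 14, lam)  # 14 x 14 tokens for a 224x224 image with 16x16 patches
```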

In this paper, the authors propose a new augmentation method, MixToken, and compare it with CutMix. When CutMix operates on the input image, patches that straddle the boundary of the cut region contain content mixed from the two images (the red part in the figure below). MixToken instead mixes tokens after patch embedding, so that every token contains clean content from a single image.
[Figure: comparison of CutMix and MixToken. CutMix mixes images at the pixel level, so boundary patches (in red) contain content from both images; MixToken mixes whole tokens after patch embedding]

Generation method:
For two images $I_1$ and $I_2$, the corresponding token labels $Y_1 = [y_1^1, \ldots, y_1^N]$ and $Y_2 = [y_2^1, \ldots, y_2^N]$ are generated in advance.

Feed the two images into the patch embedding module to obtain the token sequences $T_1 = [t_1^1, \ldots, t_1^N]$ and $T_2 = [t_2^1, \ldots, t_2^N]$. A new token sequence is then generated with a binary mask $M$ according to the formula below, where $\odot$ denotes element-wise multiplication and $M$ is generated following the CutMix paper "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features":
$$\hat{T} = T_1 \odot M + T_2 \odot (1 - M)$$
Do the same for labels:

$$\hat{Y} = Y_1 \odot M + Y_2 \odot (1 - M)$$
The class label becomes a combination of the two image-level labels weighted by $\bar{M}$, the mean of all elements in $M$:

$$\hat{y}^{cls} = \bar{M}\, y_1^{cls} + (1 - \bar{M})\, y_2^{cls}$$
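
Putting the formulas together, here is a minimal sketch of MixToken under the assumptions above; the mask could come from the rand_token_mask sketch earlier (converted to a torch tensor and flattened), and all names are illustrative rather than the official implementation.

```python
import torch

def mix_token(t1, t2, y1, y2, y1_cls, y2_cls, mask):
    """
    t1, t2:         (B, N, D) token sequences after patch embedding
    y1, y2:         (B, N, K) aligned token labels (dense score maps)
    y1_cls, y2_cls: (B, K)    image-level labels
    mask:           (N,)      binary mask M, flattened from the token grid
    """
    m = mask.view(1, -1, 1)                # broadcast over batch and channel dims
    t_hat = t1 * m + t2 * (1 - m)          # T_hat = T1 (.) M + T2 (.) (1 - M)
    y_hat = y1 * m + y2 * (1 - m)          # same mixing applied to the token labels
    m_bar = mask.mean()                    # M_bar: mean of all elements of M
    y_cls_hat = m_bar * y1_cls + (1 - m_bar) * y2_cls
    return t_hat, y_hat, y_cls_hat
```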

3 Conclusion

  1. The core idea of this paper is to use the previously ignored patch tokens as an auxiliary loss.
  2. The label of each patch token is pre-generated by a machine annotator; this is not described in detail here. For the specific method, see Token Labeling.
  3. A new data augmentation method, MixToken, is used to avoid mixed content inside the patches at the boundary where the two images are combined.

Finally, I wish you all success in scientific research, good health, and success in everything~


Origin: blog.csdn.net/qq_45122568/article/details/125545947