ICML 2021: "Training data-efficient image transformers & distillation through attention"

Paper link: http://proceedings.mlr.press/v139/touvron21a/touvron21a.pdf
Code link: https://github.com/facebookresearch/deit

1. Motivation

ViT training requires a large amount of computing resources and takes a long time. Furthermore, ViT generalizes poorly when it is trained without enough data.

2. Contribution

  • The authors show that neural networks containing no convolutional layers can achieve results competitive with the state of the art on ImageNet without external data, and that they can be trained on a single node with 4 GPUs in three days. The two new models in this paper, DeiT-S and DeiT-Ti, have fewer parameters and can be seen as counterparts of ResNet-50 and ResNet-18.
  • A new distillation procedure based on a distillation token is introduced. It plays the same role as the class token, except that its objective is to reproduce the label estimated by the teacher. The two tokens interact in the Transformer through attention. This Transformer-specific strategy works considerably better than vanilla distillation.
  • Models pre-trained on ImageNet are competitive when transferred to different downstream tasks, such as fine-grained classification, on several popular public benchmarks: CIFAR-10, CIFAR-100, Oxford-102 Flowers, Stanford Cars, and iNaturalist 2018/2019.

3. Method

3.1 Vision Transformer

  • Multi-head Self Attention layers (MSA)
    The attention mechanism is based on a trainable associative memory of (key, value) vector pairs. A query vector $q \in \mathbb{R}^d$ is matched against a set of $k$ key vectors (packed together into a matrix $K \in \mathbb{R}^{k \times d}$) using inner products. These inner products are then scaled and normalized with a softmax function to obtain $k$ weights. The output of the attention is the weighted sum of a set of $k$ value vectors (packed into $V \in \mathbb{R}^{k \times d}$). For a sequence of $N$ query vectors (packed into $Q \in \mathbb{R}^{N \times d}$), this produces an output matrix of size $N \times d$:
    $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(QK^{\top}/\sqrt{d}\right) V$$
    where the Softmax function is applied over each row of the input matrix and $\sqrt{d}$ provides appropriate normalization. Vaswani et al. (2017) propose a self-attention layer, in which the query, key, and value matrices are themselves computed from a sequence of $N$ input vectors (packed into $X \in \mathbb{R}^{N \times D}$): $Q = XW_Q$, $K = XW_K$, $V = XW_V$, using linear transformations $W_Q$, $W_K$, $W_V$ with the constraint $k = N$, meaning that the attention is between all the input vectors.
    Finally, the multi-head self-attention layer (MSA) is defined by considering $h$ attention heads, i.e. $h$ self-attention functions applied to the input. Each head provides a sequence of size $N \times d$. These $h$ sequences are rearranged into an $N \times dh$ sequence, which is reprojected by a linear layer into $N \times D$.
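The formulation above maps directly to a few lines of code. Below is a minimal sketch of a multi-head self-attention layer in PyTorch, not the authors' implementation; the class name `SimpleMSA` and the fused Q/K/V projection are illustrative choices.

```python
import torch
import torch.nn as nn

class SimpleMSA(nn.Module):
    """Minimal multi-head self-attention: Softmax(QK^T / sqrt(d)) V, heads concatenated and reprojected."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # W_Q, W_K, W_V fused into one projection
        self.proj = nn.Linear(dim, dim)      # reprojects the concatenated heads back to D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                                   # batch, tokens, embedding dim
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, N, N)
        attn = attn.softmax(dim=-1)                         # row-wise softmax
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate heads -> (B, N, D)
        return self.proj(out)

# Example with DeiT-B-like dimensions: D = 768, 12 heads, 196 patch tokens + 1 class token
msa = SimpleMSA(dim=768, num_heads=12)
print(msa(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```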

  • Transformer block for images
    To obtain a complete Transformer block (Vaswani et al., 2017), the authors add a feed-forward network (FFN) on top of the MSA layer. This FFN is composed of two linear layers separated by a GeLU activation (Hendrycks & Gimpel, 2016). The first linear layer expands the dimension from $D$ to $4D$, and the second reduces it from $4D$ back to $D$. Thanks to skip connections, both MSA and FFN operate as residual operators, with layer normalization (Ba et al., 2016).
    To obtain a Transformer that processes images, this work builds on the ViT model (Dosovitskiy et al., 2020). It is a simple and elegant architecture that processes input images as if they were a sequence of input tokens. The fixed-size input RGB image is decomposed into a batch of $N$ patches of fixed size $16 \times 16$ pixels ($N = 14 \times 14$ for a $224 \times 224$ image). Each patch is projected with a linear layer that preserves its overall dimension of $3 \times 16 \times 16 = 768$. The Transformer block described above is invariant to the order of the patch embeddings and therefore does not take their relative positions into account. Positional information is incorporated as fixed (Vaswani et al., 2017) or trainable (Gehring et al., 2017) positional embeddings. They are added to the patch tokens before the first Transformer block, and the result is fed to the stack of Transformer blocks.
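A hedged sketch of the patchification step described above, assuming the 224 × 224 input / 16 × 16 patch setting; `PatchEmbed` is an illustrative name, and the strided convolution is the standard equivalent of a per-patch linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into 16x16 patches and projects each to a D-dimensional token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to a linear projection of each flattened patch
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # One trainable positional embedding per patch token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                   # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, N=196, D)
        return x + self.pos_embed          # positions added before the first block

patches = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```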

  • The class token
    The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the Transformer layers and is then projected with a linear layer to predict the class. This class token is inherited from NLP (Devlin et al., 2018) and departs from the typical pooling layers used in computer vision to predict the class. The Transformer therefore processes batches of $(N+1)$ tokens of dimension $D$, of which only the class vector is used to predict the output. This architecture forces the self-attention to spread information between the patch tokens and the class token: at training time the supervision signal comes only from the class embedding, while the patch tokens are the model's only variable input.
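A minimal sketch of how the class token is prepended and later read out; the names are illustrative and the sketch assumes patch tokens produced as in the `PatchEmbed` example above.

```python
import torch
import torch.nn as nn

class ClassTokenHead(nn.Module):
    """Prepends a trainable class token and classifies from its final state."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def prepend(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)        # (B, 1, D)
        return torch.cat([cls, patch_tokens], dim=1)  # (B, N+1, D)

    def classify(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(tokens[:, 0])                # only the class vector predicts the output

# usage: tokens = prepend(patch_tokens); tokens = transformer_blocks(tokens); logits = classify(tokens)
```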

  • Fixed positional encoding across resolutions
    Touvron et al. (2019) show that it is desirable to use a lower training resolution and to fine-tune the network at a larger resolution. This speeds up the full training and improves accuracy under prevailing data augmentation schemes. When increasing the resolution of an input image, the patch size is kept the same, so the number $N$ of input patches changes. Due to the architecture of the Transformer blocks and the class token, the model and the classifier do not need to be modified to process more tokens. In contrast, the positional embeddings need to be adapted, because there are $N$ of them, one for each patch. Dosovitskiy et al. (2020) interpolate the positional encoding when changing the resolution and demonstrate that this method works with the subsequent fine-tuning stage.
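A hedged sketch of that interpolation step: the patch positional embeddings are viewed as a 2D grid and resized, here with bicubic interpolation as is commonly done for ViT-style models; the function name and choice of interpolation mode are illustrative, not taken from the official code.

```python
import math
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_num_patches: int) -> torch.Tensor:
    """pos_embed: (1, N, D) patch positional embeddings laid out on a square grid."""
    n, dim = pos_embed.shape[1], pos_embed.shape[2]
    old_size = int(math.sqrt(n))                # e.g. 14 for 224px images with 16px patches
    new_size = int(math.sqrt(new_num_patches))  # e.g. 24 for 384px images with 16px patches
    grid = pos_embed.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)  # (1, D, 14, 14)
    grid = F.interpolate(grid, size=(new_size, new_size), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_size * new_size, dim)

pos = torch.randn(1, 14 * 14, 768)
print(interpolate_pos_embed(pos, 24 * 24).shape)  # torch.Size([1, 576, 768])
```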

3.2 Distillation through attention

In this section, the authors assume access to a strong image classifier that can be used as a teacher model. It could be a convolutional neural network or a mixture of classifiers. The question addressed is how to learn a Transformer by exploiting this teacher. As the table below shows by comparing the trade-off between accuracy and image throughput, it can be beneficial to replace the convolutional neural network with a Transformer. This section covers two axes of distillation: hard distillation versus soft distillation, and classical distillation versus the distillation token.
[Table: accuracy versus image-throughput trade-off on ImageNet for convnets and DeiT models]

  • Soft distillation
    Soft distillation, as used in several prior works, minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model. Let $Z_t$ be the logits of the teacher model and $Z_s$ the logits of the student model. Let $\tau$ denote the distillation temperature, $\lambda$ the coefficient balancing the Kullback-Leibler divergence loss (KL) and the cross-entropy ($\mathcal{L}_{\mathrm{CE}}$) on ground-truth labels $y$, and $\psi$ the softmax function. The distillation objective is:
    $$\mathcal{L}_{\mathrm{global}} = (1-\lambda)\,\mathcal{L}_{\mathrm{CE}}\!\left(\psi(Z_s), y\right) + \lambda \tau^{2}\,\mathrm{KL}\!\left(\psi(Z_s/\tau), \psi(Z_t/\tau)\right)$$
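A hedged sketch of this objective in PyTorch; the default values of `lam` and `tau` and the function name are illustrative. Note that `F.kl_div` expects log-probabilities for the student input and probabilities for the target.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, labels, lam=0.1, tau=3.0):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL between tempered softmaxes."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),   # student log-probabilities
        F.softmax(teacher_logits / tau, dim=-1),       # teacher probabilities
        reduction="batchmean",
    ) * tau * tau
    return (1.0 - lam) * ce + lam * kl

# usage (shapes only): logits (B, C), integer labels (B,)
loss = soft_distillation_loss(torch.randn(4, 1000), torch.randn(4, 1000),
                              torch.randint(0, 1000, (4,)))
```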

  • Hard-label distillation
    The authors introduce a variant of distillation in which the teacher's hard decision is taken as the true label. Let $y_t = \operatorname{argmax}_c Z_t(c)$ be the hard decision of the teacher; the objective associated with this hard-label distillation is:
    $$\mathcal{L}_{\mathrm{global}}^{\mathrm{hardDistill}} = \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\!\left(\psi(Z_s), y\right) + \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\!\left(\psi(Z_s), y_t\right)$$
    For a given image, the hard label associated with the teacher may change depending on the specific data augmentation. As shown later, this choice is better than the traditional one, while being parameter-free and conceptually simpler: the teacher prediction $y_t$ plays the same role as the true label $y$.
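The corresponding hard-label objective is even simpler to express; a minimal sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, labels):
    """0.5 * CE(student, y) + 0.5 * CE(student, argmax of teacher logits)."""
    teacher_labels = teacher_logits.argmax(dim=-1)  # the teacher's hard decision y_t
    return 0.5 * F.cross_entropy(student_logits, labels) \
         + 0.5 * F.cross_entropy(student_logits, teacher_labels)
```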

  • Label smoothing
    Hard labels can also be converted into soft labels with label smoothing (Szegedy et al., 2016), where the true label is considered to have a probability of $1-\varepsilon$ and the remaining $\varepsilon$ is shared among the other classes. In all experiments that use true labels, $\varepsilon = 0.1$. Note that the pseudo-labels provided by the teacher (e.g., in hard distillation) are not smoothed.
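In recent PyTorch versions, this ε = 0.1 smoothing on the true labels can be expressed directly; a hedged sketch, assuming the `label_smoothing` argument of `torch.nn.functional.cross_entropy` (available from PyTorch 1.10 onward):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 1000)
labels = torch.randint(0, 1000, (4,))

# The true label gets probability 1 - eps; the remaining eps is spread over the other classes.
loss_true = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Teacher pseudo-labels (hard distillation) are used without smoothing;
# here `labels` would be the argmax of the teacher logits instead of ground truth.
loss_teacher = F.cross_entropy(logits, labels)
```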

  • Distillation token
    As shown in Figure 2, the authors add a new token, the distillation token, to the initial embeddings (patch and class tokens). The distillation token is used similarly to the class token: it interacts with the other embeddings through self-attention and is output by the network after the last layer. Its target objective is given by the distillation component of the loss. The distillation embedding allows the model to learn from the output of the teacher, as in regular distillation, while remaining complementary to the class embedding.
    [Figure 2: the distillation procedure — the distillation token interacts with the class and patch tokens through self-attention]
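A minimal sketch of the token layout with the extra distillation token and its separate head; the names are illustrative, and the actual implementation is in the linked DeiT repository.

```python
import torch
import torch.nn as nn

class DistilledTokens(nn.Module):
    """Prepends class + distillation tokens; each one gets its own linear head."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)        # supervised by the true label y
        self.head_dist = nn.Linear(dim, num_classes)   # supervised by the teacher's y_t

    def prepend(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        B = patch_tokens.shape[0]
        return torch.cat([self.cls_token.expand(B, -1, -1),
                          self.dist_token.expand(B, -1, -1),
                          patch_tokens], dim=1)        # (B, N+2, D)

    def heads(self, tokens: torch.Tensor):
        # class head reads token 0, distillation head reads token 1
        return self.head(tokens[:, 0]), self.head_dist(tokens[:, 1])
```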

  • Fine-tuning with distillation
    The authors use both the true labels and the teacher predictions during the fine-tuning stage at higher resolution. They use a teacher with the same target resolution, typically obtained from the lower-resolution teacher via the method of Touvron et al. (2019). They also tested using only the true labels, but this reduces the benefit of the teacher and leads to lower performance.

  • Classification with our approach: joint classifiers
    At test time, both the class embedding and the distillation embedding produced by the Transformer are associated with linear classifiers and are able to infer the image label. The reference method in this paper is the late fusion of these two separate heads: the softmax outputs of the two classifiers are added to make the prediction. These three options are evaluated in the table below.
    The proposed strategy further improves performance, showing that the two tokens provide complementary information that is useful for classification: the classifier on both tokens is clearly better than the independent class and distillation classifiers, which by themselves already outperform the distillation baseline. The embedding associated with the distillation token performs slightly better than the one associated with the class token. It is also more correlated with the convnet predictions. In all cases, including it improves the performance of the different classifiers.
    [Table: accuracy of the class classifier, the distillation classifier, and their late fusion]
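The late fusion described above amounts to adding the two softmax outputs at inference time; a minimal sketch:

```python
import torch

def joint_prediction(cls_logits: torch.Tensor, dist_logits: torch.Tensor) -> torch.Tensor:
    """Late fusion of the class head and the distillation head at test time."""
    probs = cls_logits.softmax(dim=-1) + dist_logits.softmax(dim=-1)
    return probs.argmax(dim=-1)
```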

3.3 Three variants of DeiT architecture

The three variants share the same depth of 12 layers and differ only in the embedding dimension and number of heads: DeiT-Ti (dim 192, 3 heads, 5M parameters), DeiT-S (dim 384, 6 heads, 22M parameters), and DeiT-B (dim 768, 12 heads, 86M parameters); DeiT-B has the same architecture as ViT-B.
[Table 1: hyper-parameters of the DeiT-Ti, DeiT-S, and DeiT-B variants]

4. Some experimental results

4.1 Distillation results with different teacher architectures

As Abnar et al. (2020) explain, convolutional neural networks are better teachers, probably because the Transformer inherits an inductive bias through distillation. In all subsequent distillation experiments in this paper, the default teacher network is a RegNetY-16GF (Radosavovic et al., 2020) with 84M parameters, trained with the same data and the same data augmentation as DeiT. This teacher reaches 82.9% top-1 accuracy on ImageNet.
[Table 2: distillation results with different teacher architectures]

4.2 Comparison of distillation methods

Table 3 compares the performance of different distillation strategies. For Transformers, hard distillation significantly outperforms soft distillation, even when only a class token is used: at a resolution of $224 \times 224$, hard distillation reaches 83.0% top-1 accuracy, versus 81.8% for soft distillation.
[Table 3: comparison of distillation strategies]

4.3 Agreement with the teacher & inductive bias?

As mentioned above, the architecture of the teacher has an important impact. Does the distilled model inherit existing inductive biases that would facilitate training? While the authors believe it is difficult to answer this question formally, Table 4 analyzes the agreement between the decisions of the convnet teacher, the image Transformer DeiT learned from labels only, and the distilled DeiT. The distilled model is more correlated with the convnet than a Transformer learned from scratch. As expected, the classifier associated with the distillation embedding is closer to the convnet than the classifier associated with the class embedding, while the latter is more similar to DeiT learned without distillation. Unsurprisingly, the joint class + distillation classifier offers a middle ground.
[Table 4: decision agreement between the convnet teacher, DeiT without distillation, and the distilled DeiT classifiers]

4.4 Token analysis

The learned class and distillation tokens converge toward different vectors: the average cosine similarity between these tokens equals 0.06. The class and distillation embeddings computed at each layer gradually become more similar through the network, up to the last layer where the similarity is high (cos = 0.93), yet still lower than 1. This is expected, since they aim at producing targets that are similar but not identical.
To verify that the distillation token adds something to the model, compared to simply adding an extra class token associated with the same target label, the authors trained a Transformer with two class tokens instead of a teacher pseudo-label. Even when initialized randomly and independently, the two class tokens converge toward the same vector during training (cos = 0.999) and the output embeddings are quasi-identical. In contrast with the distillation strategy, this additional class token does not bring anything to the classification performance.

4.5 Transfer learning to downstream tasks

Although DeiT performs well on ImageNet, it is important to evaluate transfer learning on other datasets in order to measure DeiT's generalization ability. The authors evaluate this on transfer learning tasks by fine-tuning on the datasets listed in Table 8. Table 6 compares DeiT transfer learning results with ViT and EfficientNet. DeiT is on par with competitive convnet models, which is consistent with the previous conclusions on ImageNet-1k.
[Table: datasets used for the downstream tasks]
[Table: transfer learning results compared with ViT and EfficientNet]

4.6 Data augmentation

Transformers require a larger amount of data than models that incorporate more priors (such as convolutions). Hence, to train with datasets of the same size, this paper relies on extensive data augmentation. The authors evaluate different types of strong data augmentation with the goal of reaching a data-efficient training regime. They also considered different optimizers and cross-validated different learning rates and weight decays: Transformers are sensitive to the setting of the optimization hyper-parameters. An illustrative augmentation pipeline is sketched after the table below.
[Table: ablation of data augmentation, regularization, and optimizer choices on ImageNet]
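A hedged sketch of a strong augmentation pipeline in the spirit of the one described above, using torchvision only (RandAugment and RandomErasing are available in torchvision >= 0.11); Mixup and CutMix operate at the batch level and are omitted here, and the exact recipe used by DeiT is in the official repository.

```python
from torchvision import transforms

# Illustrative training-time augmentation pipeline, not the authors' exact configuration
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),                     # strong automatic augmentation
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),             # applied on tensors, hence after ToTensor
])
```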

4.7 Training time

For DeiT-B, a typical training run of 300 epochs takes 37 hours with 2 nodes or 53 hours on a single 8-GPU node. For comparison, a similar training with RegNetY-16GF (84M parameters) is 20% slower. DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPUs. The model can then be fine-tuned at a larger resolution: it takes 20 hours on 8 GPUs to produce a DeiT-B model at resolution $384 \times 384$, which corresponds to 25 epochs. Since DeiT does not rely on batch normalization, the batch size can be reduced without impacting performance, which makes it easier to train larger models. Note that, because repeated augmentation with 3 repetitions is used, only one third of the images are seen during a single epoch.

5. Conclusion

  • Using a convnet teacher gives better performance than using a Transformer teacher.
  • For Transformers, hard distillation is significantly better than soft distillation, even when only the class token is used.
  • As the Transformer gets deeper, the tokens gradually become more similar, i.e., over-smoothing occurs.
  • On small datasets, training from scratch without ImageNet pre-training performs worse than pre-training, because the network sees much less diverse data.
  • The experiments in this paper confirm that Transformers require strong data augmentation: almost all of the data augmentation methods evaluated by the authors prove useful. One exception is dropout, which the authors exclude from the training procedure.
  • Regularization methods like Mixup and CutMix improve Transformer performance.

Original post: blog.csdn.net/weixin_43994864/article/details/123589610