ViT-Method

3 METHOD
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

3.1 VISION TRANSFORMER (VIT)
An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.
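The patch-extraction and projection step is easy to express in code. The following is a minimal PyTorch sketch of this step, not the authors' implementation; the reshape-based patching, the default sizes (224×224 input, 16×16 patches, D = 768), and the class/variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each one to D dimensions (Eq. 1's projection E)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        assert img_size % patch_size == 0, "image size must be divisible by patch size"
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2                    # N = HW / P^2
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)   # trainable projection E

    def forward(self, x):                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # (B, C, H/P, P, W/P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
        x = x.reshape(B, C, H // P, P, W // P, P)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, P * P * C)
        return self.proj(x)                   # (B, N, D) patch embeddings

# quick shape check: a batch of two 224x224 RGB images -> (2, 196, 768)
# PatchEmbedding()(torch.randn(2, 3, 224, 224)).shape
```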
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches ($z_0^0 = x_{\text{class}}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
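For reference, the two equations cited here and in the previous paragraph can be reconstructed from the surrounding definitions (with $E$ the patch projection and $E_{\text{pos}}$ the learnable position embeddings added to the token sequence); the notation follows the paper, but this reconstruction is ours:

$$z_0 = [\,x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E\,] + E_{\text{pos}}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} \tag{1}$$

$$y = \mathrm{LN}(z_L^0) \tag{4}$$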
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before
every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
The MLP contains two layers with a GELU non-linearity.
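In equation form, the cited blocks are $z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$ (Eq. 2) and $z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell$ (Eq. 3), for $\ell = 1 \dots L$. A minimal PyTorch sketch of one such pre-norm block follows; the default widths (D = 768, 12 heads, MLP size 3072) match the ViT-Base row of Table 1, while the dropout placement and the use of nn.MultiheadAttention are implementation choices of ours, not the authors' code.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer block: LN -> MSA -> residual, then LN -> MLP (two layers, GELU) -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, z):                                   # z: (B, N+1, D), class token included
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # Eq. (2)
        z = z + self.mlp(self.ln2(z))                       # Eq. (3)
        return z
```

A full encoder simply stacks L such blocks and applies the final Layernorm of Eq. (4) to the class-token output.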
Inductive bias.
We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
Hybrid Architecture.
As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1×1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
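A sketch of this hybrid input path is shown below, assuming a plain torchvision ResNet-50 truncated after stage 3; the paper's actual backbone is modified (GroupNorm, standardized convolutions, and the stage-3/stage-4 variants described in Section 4.1), so this is only an illustration of how 1×1 "patches" reduce to flattening the feature map.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """Use a CNN feature map as the input sequence: 1x1 'patches' are just spatial positions."""
    def __init__(self, dim=768):
        super().__init__()
        backbone = resnet50()
        # keep everything up to (and including) stage 3; stage 4, pooling and the head are dropped
        self.backbone = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Linear(1024, dim)   # stage-3 output of ResNet-50 has 1024 channels

    def forward(self, x):                  # x: (B, 3, H, W)
        f = self.backbone(x)               # (B, 1024, H/16, W/16) feature map
        f = f.flatten(2).transpose(1, 2)   # (B, N, 1024): flatten the spatial dimensions
        return self.proj(f)                # (B, N, D) token embeddings for the Transformer
```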

 
3.2 FINE-TUNING AND HIGHER RESOLUTION
Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images
of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
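A sketch of the position-embedding interpolation is given below. The interpolation mode and the handling of the class-token embedding (kept unchanged) are our assumptions; the text only states that 2D interpolation is performed.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """2D-interpolate pre-trained position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_grid*old_grid, D) -- class-token embedding first.
    Returns:   (1, 1 + new_grid*new_grid, D)
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid) so spatial interpolation can be applied
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. going from 224x224 pre-training (14x14 grid at P=16) to 384x384 fine-tuning (24x24 grid):
# new_pos = resize_pos_embed(old_pos, old_grid=14, new_grid=24)
```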
4 EXPERIMENTS
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the
hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.
4.1 SETUP
Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).


We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates
low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into
three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite
imagery, and Structured – tasks that require geometric understanding like localization.



Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as
summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we
add the larger “Huge” model. In what follows we use brief notation to indicate the model size and
the input patch size: for instance, ViT-L/16 means the “Large” variant with 16 × 16 input patch size.
Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
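As a concrete illustration of this trade-off (assuming the standard 224×224 pre-training resolution):

$$N_{\text{ViT-L/16}} = \frac{224 \cdot 224}{16^2} = 196, \qquad N_{\text{ViT-H/14}} = \frac{224 \cdot 224}{14^2} = 256.$$

Since self-attention cost grows quadratically in the sequence length, halving the patch size roughly quadruples N and makes the attention layers roughly 16× more expensive.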
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and used standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β₁ = 0.9, β₂ = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
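A rough PyTorch sketch of this training setup follows. It is a simplification, not the paper's training code: the base learning rate, step counts, and the way weight decay is applied through torch.optim.Adam (an L2 penalty) are assumptions; see Appendix B.1 of the paper for the actual schedules.

```python
import torch

def make_optimizer_and_schedule(model, base_lr=1e-3, warmup_steps=10_000, total_steps=100_000):
    """Adam with the reported betas and weight decay, plus linear warmup followed by linear decay.

    base_lr, warmup_steps and total_steps are placeholder values, not the paper's exact settings.
    """
    opt = torch.optim.Adam(model.parameters(), lr=base_lr,
                           betas=(0.9, 0.999), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup
            return step / max(1, warmup_steps)
        # linear decay to zero over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```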
Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to $\{-1, 1\}^K$ target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.
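The closed-form few-shot evaluation can be sketched with NumPy as follows; the ridge regularization strength `reg` is an assumed hyper-parameter, not a value reported in the paper.

```python
import numpy as np

def fewshot_linear_eval(feats, labels, num_classes, reg=1e-3):
    """Closed-form regularized least-squares 'few-shot' classifier on frozen features.

    feats:  (n, d) frozen representations of a subset of training images
    labels: (n,)   integer class labels
    Targets are {-1, +1}^K as described; returns a (d, K) weight matrix.
    Predictions are argmax of feats @ W.
    """
    n, d = feats.shape
    targets = -np.ones((n, num_classes))
    targets[np.arange(n), labels] = 1.0          # {-1, 1}^K target vectors
    # W = (X^T X + reg * I)^{-1} X^T Y  -- the exact solution in closed form
    gram = feats.T @ feats + reg * np.eye(d)
    W = np.linalg.solve(gram, feats.T @ targets)
    return W
```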



4.2 COMPARISON TO STATE OF THE ART

We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from
the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which
performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.



Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.



Figure 2 breaks down the VTAB tasks into their groups and compares ViT against BiT, VIVI (a ResNet co-trained on ImageNet and YouTube), and S4L (supervised plus semi-supervised learning on ImageNet). On the Natural and Structured task groups, ViT-H/14 outperforms BiT-R152x4 and the other methods; on the Specialized group its performance is close to that of BiT-R152x4.
 

4.3 PRE-TRAINING DATA REQUIREMENTS
The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer
inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of  experiments.
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-
300M. To boost the performance on the smaller datasets, we optimize three basic regularization
parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.

 

Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-
300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.

Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work.

4.4 SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2–4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.


4.5 INSPECTING VISION TRANSFORMER
To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
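The similarity analysis in Figure 7 (center) amounts to comparing every learned position embedding with every other one. A small sketch follows, assuming cosine similarity as the comparison measure (the specific measure is not restated here):

```python
import torch

def pos_embed_similarity(pos_embed, grid):
    """Similarity of each patch position embedding with every other one (cf. Figure 7, center).

    pos_embed: (1 + grid*grid, D) learned position embeddings, class token first.
    Returns a (grid, grid, grid, grid) tensor: entry [i, j] is the similarity map of the
    patch at row i, column j with all other patches.
    """
    p = pos_embed[1:]                                        # drop the class-token embedding
    p = torch.nn.functional.normalize(p, dim=-1)
    sim = p @ p.T                                            # (N, N) cosine similarities
    return sim.reshape(grid, grid, grid, grid)
```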

Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
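A simplified sketch of the attention-distance computation follows; the handling of the class token and any averaging across images or layers are left out, so this only illustrates the basic quantity (the attention-weighted average pixel distance per head).

```python
import torch

def mean_attention_distance(attn, grid, patch_size):
    """Average pixel distance over which each head attends (cf. Figure 7, right).

    attn: (num_heads, N, N) attention weights over the N = grid*grid patch tokens
          (class token assumed already removed); each row sums to 1.
    Returns: (num_heads,) mean attention distance per head, in pixels.
    """
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    # pairwise Euclidean distances between patch centres, converted to pixels
    dist = torch.cdist(coords, coords) * patch_size          # (N, N)
    # expected distance under each head's attention, averaged over query positions
    return (attn * dist).sum(dim=-1).mean(dim=-1)
```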

4.6 SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
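The masked patch prediction objective can be sketched roughly as follows; the exact recipe is in Appendix B.1.2 of the paper, and the choices here (zeroing out masked patch embeddings, regressing the mean colour of each masked patch, a 50% mask ratio) are illustrative assumptions rather than the authors' setup.

```python
import torch

def masked_patch_prediction_loss(patch_pixels, patch_embeds, encoder, head, mask_ratio=0.5):
    """One illustrative variant of masked patch prediction (not the paper's exact recipe).

    patch_pixels: (B, N, 3*P*P) raw flattened patches, channel-major (used only for targets)
    patch_embeds: (B, N, D) patch embeddings fed to the Transformer
    encoder:      the Transformer encoder; head: e.g. a linear layer mapping D -> 3
    Targets are the mean RGB colour of each masked patch, a simple self-supervised signal.
    """
    B, N, D = patch_embeds.shape
    mask = torch.rand(B, N, device=patch_embeds.device) < mask_ratio   # True = corrupted token
    corrupted = patch_embeds.masked_fill(mask.unsqueeze(-1), 0.0)      # zero out masked tokens
    z = encoder(corrupted)                                             # (B, N, D)
    target = patch_pixels.reshape(B, N, 3, -1).mean(dim=-1)            # mean colour per patch
    pred = head(z)                                                     # (B, N, 3)
    return ((pred - target) ** 2)[mask].mean()                         # loss on masked patches only
```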


5 CONCLUSION

We have explored the direct application of Transformers to image recognition. Unlike prior works
using self-attention in computer vision, we do not introduce image-specific inductive biases into
the architecture apart from the initial patch extraction step. Instead, we interpret an image as a
sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.


While these initial results are encouraging, many challenges remain. One is to apply ViT to other
computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.
