4.1 Setup
Datasets.
To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).
We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1,000 training examples per task. The tasks are divided into three groups: Natural (tasks like the above, Pets, CIFAR, etc.), Specialized (medical and satellite imagery), and Structured (tasks that require geometric understanding, like localization).
Model Variants.
We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT, and we add the larger “Huge” model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the “Large” variant with a $16 \times 16$ input patch size. Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
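The inverse-square relationship between patch size and sequence length can be illustrated with a few lines of Python. This is a minimal sketch: the 224-pixel input resolution is an assumption chosen for illustration, the helper name `sequence_length` is hypothetical, and any extra tokens the model may prepend to the sequence are not counted.

```python
def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# 224 is an assumed input resolution, used here for illustration only.
for name, patch in [("ViT-B/32", 32), ("ViT-L/16", 16), ("ViT-H/14", 14)]:
    print(name, sequence_length(224, patch))
```

Halving the patch size from 32 to 16 quadruples the number of tokens, which is why the smaller-patch variants are the more expensive ones.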
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and use standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with a patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
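A patch size of one “pixel” means that every spatial position of the feature map becomes one token. The sketch below shows this flattening on nested Python lists. The spatial sizes 7 and 14 are assumptions (a 224-pixel input with total strides of 32 and 16), the channel count of 8 is arbitrary, and both helper names are hypothetical.

```python
def feature_map_to_tokens(feature_map):
    """Flatten an H x W x C feature map into H*W tokens of length C (1x1 'patches')."""
    return [list(pixel) for row in feature_map for pixel in row]


def make_feature_map(side, channels):
    """Build a dummy square feature map filled with zeros."""
    return [[[0.0] * channels for _ in range(side)] for _ in range(side)]


# Assumed spatial sizes for a 224-pixel input: stride 32 -> 7x7, stride 16 -> 14x14.
stage4_tokens = feature_map_to_tokens(make_feature_map(side=7, channels=8))
stage3_tokens = feature_map_to_tokens(make_feature_map(side=14, channels=8))
print(len(stage4_tokens), len(stage3_tokens))
```

Under these assumptions, taking the output one stage earlier doubles the spatial resolution per side and therefore yields four times as many tokens, matching the 4x figure quoted for option (ii).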
Training & Fine-tuning.
We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a batch size of 4096, and a high weight decay of $0.1$, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay; see Appendix B.1 for details. For fine-tuning we use SGD with momentum and a batch size of 512 for all models; see Appendix B.1.1. For the ImageNet results in Table 2, we fine-tuned at higher resolution, 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of $0.9999$ (Ramachandran et al., 2019; Wang et al., 2020b).
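The two schedule-related ingredients above can be sketched in plain Python. This is an illustration under stated assumptions, not the exact training code: the decay is assumed to end at zero, the averaging is assumed to be implemented as an exponential moving average of the weights, and the function names, step counts, and base learning rate are hypothetical.

```python
def linear_warmup_linear_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


def polyak_update(averaged, current, factor=0.9999):
    """One averaging step (assumed EMA form) over a flat list of parameters."""
    return [factor * a + (1.0 - factor) * c for a, c in zip(averaged, current)]


# Hypothetical schedule: 100 warmup steps out of 1000 total, base_lr = 1.0.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_linear_decay(step, 1.0, 100, 1000))
```

With a factor close to one, the averaged weights change slowly and smooth out step-to-step noise in the fine-tuned parameters.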
Metrics.
We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to $\{-1, 1\}^K$ target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.
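The closed-form few-shot evaluation can be sketched as ridge regression onto $\{-1, 1\}^K$ targets, $W = (X^\top X + \lambda I)^{-1} X^\top Y$. The code below is a dependency-free illustration: the regularization strength `l2`, the toy features, and all function names are assumptions, and a practical implementation would use a linear algebra library instead of the small solver written out here.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is n x n, b is n x k; returns x as an n x k nested list.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [rv - factor * cv for rv, cv in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def fit_few_shot(features, labels, num_classes, l2=0.1):
    """Closed-form ridge regression from frozen features to {-1, 1}^K targets."""
    dim = len(features[0])
    targets = [[1.0 if k == y else -1.0 for k in range(num_classes)] for y in labels]
    gram = [[sum(x[i] * x[j] for x in features) + (l2 if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]
    xty = [[sum(x[i] * t[k] for x, t in zip(features, targets))
            for k in range(num_classes)] for i in range(dim)]
    return solve_linear_system(gram, xty)


def predict(weights, x):
    """Return the class whose regression score is largest."""
    scores = [sum(x[i] * weights[i][k] for i in range(len(x)))
              for k in range(len(weights[0]))]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "frozen representations": two 2-D features per class (illustrative only).
toy_features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
toy_labels = [0, 0, 1, 1]
toy_weights = fit_few_shot(toy_features, toy_labels, num_classes=2)
print(predict(toy_weights, [3.0, 0.0]), predict(toy_weights, [0.0, 3.0]))
```

Because the solution needs only one linear solve on top of frozen features, it is far cheaper than fine-tuning, which is what makes it usable for on-the-fly evaluation.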
4.2 Comparison to SOTA
We first compare our largest models, ViT-H/14 and ViT-L/16, to state-of-the-art CNNs from the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.
Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets: ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also by other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.
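As a small worked example of the TPUv3-core-days measure (cores used multiplied by training time in days), the figures quoted for the ImageNet-21k ViT-L/16 run can be plugged in directly. The function name is hypothetical, and the 30-day figure is the approximate value given in the text.

```python
def tpu_core_days(num_cores: int, training_days: float) -> float:
    """Pre-training cost as number of TPU cores times training time in days."""
    return num_cores * training_days


# 8 cores for roughly 30 days, as quoted for ViT-L/16 on ImageNet-21k.
vit_l16_i21k_cost = tpu_core_days(num_cores=8, training_days=30)
print(vit_l16_i21k_cost)
```

Under these numbers the run costs on the order of a few hundred core-days.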
Figure 2 breaks the VTAB results down by task group and compares ViT-H/14 with BiT, VIVI (a ResNet co-trained on ImageNet and YouTube), and S4L (supervised plus semi-supervised learning on ImageNet). On the Natural and Structured groups, ViT-H/14 outperforms BiT-R152x4 and the other methods; on the Specialized group, its performance is close to that of BiT-R152x4.
4.3 Pre-training Data Requirements
The Vision Transformer performs well when pre-trained on the large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. To boost the performance on the smaller datasets, we optimize three basic regularization parameters: weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.
Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.
Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work.
4.4 Scaling Study
We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting, data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
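The model set listed above can be written down as a small configuration table, which makes the per-family counts easy to check. The dictionary layout and names below are purely illustrative; only the model names and epoch counts come from the text.

```python
# Keys of the inner dictionaries are the number of pre-training epochs.
SCALING_STUDY_MODELS = {
    "ResNet": {
        7: ["R50x1", "R50x2", "R101x1", "R152x1", "R152x2"],
        14: ["R152x2", "R200x3"],
    },
    "ViT": {
        7: ["ViT-B/32", "ViT-B/16", "ViT-L/32", "ViT-L/16"],
        14: ["ViT-L/16", "ViT-H/14"],
    },
    "Hybrid": {
        7: ["R50+ViT-B/32", "R50+ViT-B/16", "R50+ViT-L/32", "R50+ViT-L/16"],
        14: ["R50+ViT-L/16"],
    },
}


def count_models(family: str) -> int:
    """Number of (model, epochs) configurations in one family."""
    return sum(len(names) for names in SCALING_STUDY_MODELS[family].values())


for family in SCALING_STUDY_MODELS:
    print(family, count_models(family))
```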
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately $2$ to $4\times$ less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.
4.5 Inspecting Vision Transformer
To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
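The first-layer operation described here, flattening each patch and applying a linear projection, can be sketched on nested Python lists. This is a simplified illustration: the image is single-channel, the projection has no bias term, and the function names and the averaging weight matrix are hypothetical rather than learned filters.

```python
def extract_patches(image, patch_size):
    """Split an H x W image into flattened, non-overlapping patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch_size)
                            for dx in range(patch_size)])
    return patches


def project(patch, weights):
    """Linear projection of one flattened patch; weights has one row per output dim."""
    return [sum(w * v for w, v in zip(row, patch)) for row in weights]


# Toy 4x4 image with pixel values 0..15 and 2x2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = extract_patches(toy_image, patch_size=2)
averaging_filter = [[0.25, 0.25, 0.25, 0.25]]  # a single hand-picked filter
print(toy_patches[0], project(toy_patches[0], averaging_filter))
```

Each row of the weight matrix plays the role of one embedding filter, which is the object whose principal components are visualized in Figure 7 (left).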
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
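One way to probe this structure is to compare the position embedding of one patch with those of all other patches. The sketch below assumes cosine similarity as the measure and uses hand-constructed toy embeddings built from row and column phases, since no learned embeddings are available here; all names and the `theta` value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_map(pos_emb, grid_size, row, col):
    """Similarity between the embedding at (row, col) and every grid position."""
    query = pos_emb[row * grid_size + col]
    return [[cosine_similarity(query, pos_emb[r * grid_size + c])
             for c in range(grid_size)] for r in range(grid_size)]


def toy_position_embeddings(grid_size, theta=0.5):
    """Hand-built stand-in for learned embeddings: row and column phase features."""
    return [[math.cos(r * theta), math.sin(r * theta),
             math.cos(c * theta), math.sin(c * theta)]
            for r in range(grid_size) for c in range(grid_size)]


toy_emb = toy_position_embeddings(grid_size=4)
toy_sim = similarity_map(toy_emb, grid_size=4, row=0, col=0)
for sim_row in toy_sim:
    print(["%.2f" % value for value in sim_row])
```

For these toy embeddings, similarity falls off with distance from the query patch and stays higher along the query's own row and column, the same qualitative pattern the text reports for the learned embeddings.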
Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
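The attention distance of a single head can be sketched as the attention-weighted average of pixel distances between query and key patches, averaged over all queries. This is one plausible reading of the description above rather than the exact procedure used for Figure 7; the function name and the 2x2 toy grid are assumptions.

```python
import math


def mean_attention_distance(attention, grid_size, patch_size):
    """Average over queries of the attention-weighted distance to each key patch.

    attention[q][k] is the weight query patch q puts on key patch k; each row
    is assumed to sum to 1. Distances are measured in pixels between patches.
    """
    num_patches = grid_size * grid_size
    total = 0.0
    for q in range(num_patches):
        q_row, q_col = divmod(q, grid_size)
        for k in range(num_patches):
            k_row, k_col = divmod(k, grid_size)
            distance = patch_size * math.hypot(q_row - k_row, q_col - k_col)
            total += attention[q][k] * distance
    return total / num_patches


# Two extreme heads on a 2x2 grid of 16-pixel patches.
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]
print(mean_attention_distance(local_head, 2, 16))
print(mean_attention_distance(global_head, 2, 16))
```

A head that attends only to its own patch has distance zero, while a head that spreads its weight uniformly reaches the average inter-patch distance, which mirrors the contrast between localized and global heads described in the text.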
4.6 Self-supervision
Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large-scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
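The input corruption behind masked patch prediction can be sketched as follows. The details are assumptions made for illustration: the mask ratio, the mask token, the seeded sampling, and the function name are all hypothetical, and the actual setup is the one described in Appendix B.1.2.

```python
import random


def mask_patches(patches, mask_ratio, mask_token, seed=0):
    """Replace a random subset of patches with a mask token.

    Returns the corrupted sequence and the sorted indices that were masked;
    a model would then be trained to predict the original content there.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(patches))
    masked_indices = sorted(rng.sample(range(len(patches)), num_masked))
    masked_set = set(masked_indices)
    corrupted = [mask_token if i in masked_set else patch
                 for i, patch in enumerate(patches)]
    return corrupted, masked_indices


# Ten dummy one-value "patches"; the 0.5 ratio is an illustrative choice.
dummy_patches = [[i] for i in range(10)]
corrupted_patches, masked_at = mask_patches(dummy_patches, 0.5, ["MASK"])
print(masked_at)
```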
5. Conclusion
We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.