ViT-Method

3 METHOD
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

3.1 Vision Transformer (ViT)
An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ R^{H×W×C} into a sequence of flattened 2D patches x_p ∈ R^{N×(P^2·C)}, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P^2 is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size D through all of its layers, so we flatten the patches and map to D dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.
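For concreteness, the patchification and linear projection of Eq. 1 can be sketched as follows (a minimal PyTorch illustration, not the authors' implementation; the batch size, 224×224 resolution, P = 16 and D = 768 are example values only):

```python
import torch
import torch.nn as nn

# Example values, not prescribed by this excerpt: P = 16, D = 768, 224x224 input.
P, D = 16, 768
x = torch.randn(8, 3, 224, 224)                      # a batch of images (B, C, H, W)
B, C, H, W = x.shape
N = (H // P) * (W // P)                              # N = HW / P^2 patches per image

# Cut the image into non-overlapping PxP patches and flatten each one.
patches = x.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# Trainable linear projection E maps each flattened patch to D dimensions (Eq. 1).
proj = nn.Linear(C * P * P, D)
patch_embeddings = proj(patches)                     # (B, N, D)
```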
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
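A sketch of the [class] token and the two classification heads described above (illustrative PyTorch with assumed sizes; the hidden width and non-linearity of the pre-training MLP head are not specified in this excerpt and are placeholders):

```python
import torch
import torch.nn as nn

D, K = 768, 1000                       # latent size and number of classes (example values)
B, N = 8, 196
patch_embeddings = torch.randn(B, N, D)

cls_token = nn.Parameter(torch.zeros(1, 1, D))           # x_class
pos_embedding = nn.Parameter(torch.zeros(1, N + 1, D))   # learned position embeddings

# Prepend the [class] token and add position embeddings to form z_0.
z0 = torch.cat([cls_token.expand(B, -1, -1), patch_embeddings], dim=1) + pos_embedding

# ... z0 goes through the Transformer encoder; its output at the [class]
# position, z_L^0, serves as the image representation y.
zL0 = torch.randn(B, D)                # stand-in for the encoder output at that position

pretrain_head = nn.Sequential(         # MLP with one hidden layer (pre-training);
    nn.Linear(D, D), nn.Tanh(),        # the tanh non-linearity is an assumption
    nn.Linear(D, K))
finetune_head = nn.Linear(D, K)        # single linear layer (fine-tuning)
logits = finetune_head(zL0)
```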
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019). The MLP contains two layers with a GELU non-linearity.
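A minimal sketch of one such encoder layer (pre-norm MSA and MLP blocks with residual connections, as described in Eq. 2 and 3; the head count and MLP width are example values, not taken from this excerpt):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LN -> MSA -> residual, then LN -> MLP(GELU) -> residual (Eq. 2, 3)."""
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072):   # example sizes
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]       # Eq. 2
        z = z + self.mlp(self.ln2(z))                            # Eq. 3
        return z

block = EncoderBlock()
out = block(torch.randn(8, 197, 768))   # (batch, 1 + N patch tokens, D)
```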
Inductive bias.
We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.

 
Hybrid Architecture.
As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
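For the 1x1-patch special case, "patchification" reduces to flattening the feature map's spatial dimensions; a hedged sketch (the torchvision ResNet-50 truncated after stage 3, its 1024 output channels, and D = 768 are illustrative choices, not taken from the text):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Illustrative backbone: everything up to and including stage 3 of a ResNet-50,
# giving a (B, 1024, 14, 14) feature map for 224x224 inputs. The channel count
# and spatial size are properties of this particular choice, not of the excerpt.
backbone = nn.Sequential(*list(resnet50().children())[:-3])

D = 768                                    # Transformer width (example value)
proj = nn.Linear(1024, D)                  # patch embedding E applied to 1x1 "patches"

x = torch.randn(2, 3, 224, 224)
fmap = backbone(x)                         # (B, 1024, 14, 14)
tokens = proj(fmap.flatten(2).transpose(1, 2))   # flatten spatial dims -> (B, 196, D)
# The [class] embedding and position embeddings are then added as in the plain ViT.
```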

 
3.2 Fine-tuning and Higher Resolution
Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images
of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
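The 2D interpolation of pre-trained position embeddings might look roughly as follows (a sketch assuming a square patch grid and a leading [class] token; the interpolation mode and the 224 pre-training resolution in the usage line are assumptions, since the text only states that 2D interpolation is performed):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """pos_embed: (1, 1 + N, D) pre-trained embeddings, [class] token first.
    new_grid: number of patches per side at the fine-tuning resolution."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)             # assume a square patch grid
    d = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. 14 = 224/16 patches per side at pre-training (assumed) -> 32 = 512/16
# per side when fine-tuning ViT-L/16 at resolution 512 (D = 768 is a placeholder).
resized = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid=32)
```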
 
4 EXPERIMENTS
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the
hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.
4.1 Setup
Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).


We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates
low-data transfer to diverse tasks, using 1,000 training examples per task. The tasks are divided into
three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite
imagery, and Structured – tasks that require geometric understanding like localization.



Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as
summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we
add the larger “Huge” model. In what follows we use brief notation to indicate the model size and
the input patch size: for instance, ViT-L/16 means the “Large” variant with 16 × 16 input patch size.
Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
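A quick illustration of that scaling (the 224x224 input resolution is an assumed example, not stated in this excerpt):

```python
def num_patches(h, w, p):
    return (h * w) // (p * p)     # N = HW / P^2, the Transformer sequence length

print(num_patches(224, 224, 32))  # ViT-*/32 -> 49 tokens
print(num_patches(224, 224, 16))  # ViT-*/16 -> 196 tokens (4x longer, hence costlier)
```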

 
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and used standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
 
Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba,
2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which
we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common
practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
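The quoted optimizer settings could be wired up as below (illustrative PyTorch; the base learning rate and warmup/total step counts are placeholders, since the text defers the schedule details to Appendix B.1, and whether the weight decay is decoupled is not stated here):

```python
import torch

model = torch.nn.Linear(768, 1000)          # stand-in for the actual ViT
base_lr, warmup_steps, total_steps = 3e-4, 10_000, 100_000   # placeholder schedule values

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=0.1)

def linear_warmup_then_decay(step):
    if step < warmup_steps:                 # linear warmup ...
        return step / max(1, warmup_steps)
    # ... followed by linear decay to zero
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_warmup_then_decay)
```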

 
Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to {−1, 1}^K target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.


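The few-shot metric can be sketched as ridge regression from frozen features to {−1, 1}^K one-vs-rest targets, solved in closed form (the regularization strength below is a placeholder; the text only says the least-squares problem is regularized):

```python
import numpy as np

def few_shot_linear(features, labels, num_classes, reg=1.0):
    """features: (n, d) frozen representations; labels: (n,) integer class ids.
    Returns a (d, K) weight matrix from the closed-form ridge solution."""
    n, d = features.shape
    targets = -np.ones((n, num_classes))
    targets[np.arange(n), labels] = 1.0               # {-1, 1}^K target vectors
    A = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(A, features.T @ targets)   # exact solution in closed form

# Usage: scores = test_features @ W; predictions = scores.argmax(axis=1)
```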

4.2 Comparison to SOTA

We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from
the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which
performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.



Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.



Figure 2 shows the performance of BiT, VIVI (a ResNet co-trained on ImageNet and YouTube), and S4L (supervised plus semi-supervised training on ImageNet) on the VTAB tasks. On the Natural and Structured task groups, ViT-H/14 outperforms BiT-R152x4 and the other methods; on the Specialized group it is close to BiT-R152x4.
 

4.3 Pre-training Data Requirements
The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer
inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-
300M. To boost the performance on the smaller datasets, we optimize three basic regularization
parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.

 

Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.

Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work.

4.4 Scaling Study
We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2–4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.


 

4.5 Inspecting Vision Transformer
To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
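The similarity structure described here can be probed by, e.g., cosine similarity between every pair of learned position embeddings, reshaped back onto the patch grid (a sketch assuming a square grid and a leading [class] token; the grid and width in the usage line are example values):

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed):
    """pos_embed: (1, 1 + N, D) learned embeddings, [class] token first.
    Returns (N, grid, grid): for each patch position, its cosine similarity
    to the embeddings of all other positions, laid out on the patch grid."""
    pe = F.normalize(pos_embed[0, 1:], dim=-1)     # drop [class], unit-normalize
    grid = int(pe.shape[0] ** 0.5)
    sim = pe @ pe.T                                # (N, N) cosine similarities
    return sim.reshape(-1, grid, grid)             # one similarity map per position

maps = pos_embed_similarity(torch.randn(1, 1 + 14 * 14, 768))
```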

Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
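The “attention distance” statistic, as verbally defined above, might be computed roughly as follows (a sketch: the patch-grid geometry, the dropping of the [class] token, and the averaging over query positions are assumptions not spelled out in this excerpt):

```python
import torch

def attention_distance(attn, grid, patch_size):
    """attn: (heads, N, N) attention weights over the N = grid*grid patch tokens
    (the [class] token is assumed to have been dropped). Returns the mean pixel
    distance per head, i.e. an analogue of the CNN receptive-field size."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    centers = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size
    dist = torch.cdist(centers, centers)              # (N, N) pairwise pixel distances
    return (attn * dist).sum(dim=-1).mean(dim=-1)     # weight by attention, average queries

# e.g. for a 14x14 grid of 16-pixel patches and 12 heads (example sizes):
d = attention_distance(torch.softmax(torch.randn(12, 196, 196), dim=-1), 14, 16)
```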
 

4.6 Self-Supervision
Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.

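For intuition only, a toy version of masked patch prediction in the spirit of BERT’s masked language modeling (the masking ratio, the corruption scheme, and the pixel-regression target below are placeholders; the actual setup is described in Appendix B.1.2, which is not part of this excerpt):

```python
import torch
import torch.nn as nn

def masked_patch_loss(encoder, patches, mask_ratio=0.5):
    """patches: (B, N, P*P*C) flattened image patches. Mask a random subset,
    run the encoder on the corrupted sequence, and regress the original pixel
    values of the masked patches (an illustrative target choice). The encoder
    is assumed to return one prediction per patch, shape (B, N, P*P*C)."""
    B, N, dim = patches.shape
    mask = torch.rand(B, N) < mask_ratio          # which patches to corrupt
    corrupted = patches.clone()
    corrupted[mask] = 0.0                         # replace masked patches with zeros
    pred = encoder(corrupted)
    return nn.functional.mse_loss(pred[mask], patches[mask])
```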

5. Conclusion

We have explored the direct application of Transformers to image recognition. Unlike prior works
using self-attention in computer vision, we do not introduce image-specific inductive biases into
the architecture apart from the initial patch extraction step. Instead, we interpret an image as a
sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.


While these initial results are encouraging, many challenges remain. One is to apply ViT to other
computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.

