AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (notes from 李沐's paper-reading lecture: Abstract, Introduction, Related Work)

Abstract:
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place.
Although the Transformer architecture has become the standard in NLP, its application in computer vision is still very limited. In vision, attention is either used together with convolutional networks, or used to replace certain components of convolutional networks while keeping the overall structure unchanged.
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
We will show that, for image classification, this reliance on CNNs is not necessary: applying a Transformer directly to a sequence of image patches can also perform very well.
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

When ViT is pre-trained on large datasets and transferred to mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), it attains excellent results compared to state-of-the-art convolutional networks, while requiring substantially fewer computational resources to train.

Lecture notes:
Transformers are already the standard in NLP, e.g. the BERT model, GPT-3, and T5.
Using Transformers for CV, however, is still limited. In vision, self-attention is either used together with convolutional networks, or used to replace some of the convolutions inside a CNN while keeping the overall structure unchanged. For example, a residual network such as ResNet-50 still keeps its four stages (res2, res3, res4, res5); the stages themselves stay, and only the operations inside each stage or block are replaced. This paper shows that such reliance on convolutional networks is unnecessary: a pure Vision Transformer applied directly to a sequence of image patches can also perform very well on image classification, especially when it is pre-trained on a large dataset and then transferred to mid-sized or small datasets, where it achieves results comparable to the best convolutional networks.

1 INTRODUCTION

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.

Self-attention-based architectures, especially Transformers, have become the standard choice in NLP. The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Thanks to the computational efficiency and scalability of Transformers, it has become possible to train extremely large models with over 100 billion parameters. As model size and dataset size grow, there is still no sign of performance saturating.

Lecture notes:

Self-attention networks, especially Transformers, have become the go-to models in natural language processing. The mainstream approach is to first pre-train on a large dataset and then fine-tune on specific smaller datasets. Thanks to the computational efficiency and scalability of Transformers, it is now possible to train models with more than 100 billion parameters, such as GPT-3, and as models and datasets keep growing, no performance saturation has been observed yet.
In many cases, simply scaling up the dataset or the model does not give better results; in particular, enlarging the model easily runs into overfitting. For Transformers, this bottleneck has not been observed yet.
Microsoft and NVIDIA jointly released a very large language generation model called Megatron-Turing, with 530 billion parameters, and it still improves substantially across tasks with no sign of performance saturation.
In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).
In computer vision, however, convolutional architectures remain dominant. Inspired by the success in NLP, several works try to combine CNN architectures with self-attention, while others replace the convolutions entirely (stand-alone self-attention, axial self-attention). The latter models, although theoretically efficient, use specialized attention patterns and therefore have not yet been scaled effectively on modern hardware accelerators. As a result, classic ResNet-like architectures are still state of the art for large-scale image recognition.
Lecture notes:
Because of the computational complexity issues discussed below, convolutional networks still dominate. Many works want to bring self-attention into vision: some mix CNNs with self-attention, while others replace the entire CNN and use self-attention only. All of these approaches are essentially trying to reduce the sequence length.
Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.
Inspired by the success of Transformer scaling in NLP, we experiment with applying a standard Transformer directly to images, with as few modifications as possible. To do so, we split an image into patches and feed the sequence of linear embeddings of these patches into the Transformer. Image patches play the same role as tokens (words) in NLP. We train the model on image classification in a supervised fashion.
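The patch-splitting and linear-embedding step described above can be sketched in a few lines. This is a minimal illustration with assumed shapes and a random matrix standing in for the learned projection; it is not the authors' implementation:

import numpy as np

H = W = 224          # input resolution
P = 16               # patch size -> (224/16)^2 = 196 patches ("16x16 words")
C = 3                # RGB channels
D = 768              # embedding dimension (ViT-Base uses 768)

image = np.random.rand(H, W, C)

# Split into non-overlapping P x P patches and flatten each patch to a vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)           # (196, 768)

# Linear projection (learned in practice, random here) maps each patch to a D-dim token.
E = np.random.randn(P * P * C, D) * 0.02
tokens = patches @ E                               # (196, D): the sequence fed to the Transformer

print(patches.shape, tokens.shape)                 # (196, 768) (196, 768)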
                                                                                                                                                                                                 
When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

When trained on mid-sized datasets such as ImageNet without strong regularization, ViT achieves accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging result is expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
 

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

However, the picture changes when the models are trained on larger datasets (14 million to 300 million images): we find that large-scale training trumps the CNN inductive biases. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with less data. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats the current state of the art on multiple image recognition benchmarks. In particular, the best model reaches 88.55% accuracy on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.
 

Lecture notes:

Large-scale pre-training beats these inductive biases: as long as the Vision Transformer is pre-trained on enough data, it transfers very well to downstream tasks.
Lecture notes:
The difficulty of applying Transformers to vision: the core operation of a Transformer is self-attention, in which every element interacts with every other element pairwise to compute an attention map, and this map is then used to take a weighted average that produces the output. Because self-attention is pairwise, its computational cost is quadratic in the sequence length, and the sequence lengths current hardware can support are generally only a few hundred to about a thousand.
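To make the quadratic cost concrete, here is a minimal single-head self-attention sketch (toy dimensions, random weights; not any particular library's API). The score matrix is n x n, so doubling the sequence length roughly quadruples the compute and memory:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) attention map: quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of the values

n, d = 196, 64                                       # e.g. 196 patch tokens, toy head size
X = np.random.rand(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                     # (196, 64); the score matrix was 196 x 196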
Q1: How do you turn a 2D image into a 1D sequence (or a set)?
The most straightforward way is to treat every pixel as a sequence element: flatten the 2D image and feed it into the Transformer to attend over itself. The idea is appealing, but the complexity is far too high. For classification in vision, the input image is typically about 224x224; if every pixel is treated as an element, the sequence length is not 512 but 224x224 = 50176, roughly 100 times the 512-token length BERT uses. And that is only the classification setting with 224x224 inputs.
For detection or segmentation, many models already take inputs of 600x600 or 800x800, which makes the problem even worse.
Since using raw pixels makes the sequence too long, one option is not to feed the image into the Transformer directly but to use an intermediate feature map of a network as the input. For example, with ResNet-50, the feature map at the res4 stage is only 14x14, which flattens to a sequence length of just 196.
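The sequence-length arithmetic behind these notes, written out as a quick worked example:

# Sequence lengths for the different ways of feeding an image to a Transformer
# discussed above (simple arithmetic, no framework needed).
bert_len     = 512                       # typical BERT sequence length
pixels_224   = 224 * 224                 # every pixel as a token -> 50176 (~100x BERT)
pixels_800   = 800 * 800                 # detection/segmentation input -> 640000
res4_feature = 14 * 14                   # ResNet-50 res4 feature map -> 196
vit_patches  = (224 // 16) ** 2          # 16x16 patches of a 224x224 image -> 196

print(pixels_224, pixels_224 / bert_len) # 50176, ~98x
print(pixels_800, res4_feature, vit_patches)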
Axial self-attention: the complexity comes from the sequence length being H*W for a 2D image, so the 2D problem is decomposed into two 1D ones: first do self-attention along the height dimension, then do self-attention along the width dimension.
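A rough sketch of the axial idea, reusing the self_attention helper and weight matrices from the sketch above (toy shapes; not the original axial-attention implementation): attention is applied within each column (along H) and then within each row (along W), so an (H*W) x (H*W) attention map is never formed.

def axial_attention(feature_map, Wq, Wk, Wv):
    """feature_map: (H, W, d). Self-attention along H for every column, then along W for every row."""
    H, W, d = feature_map.shape
    # 1) attend along the height axis, one column at a time: sequences of length H
    along_h = np.stack([self_attention(feature_map[:, w, :], Wq, Wk, Wv) for w in range(W)], axis=1)
    # 2) attend along the width axis, one row at a time: sequences of length W
    along_w = np.stack([self_attention(along_h[h, :, :], Wq, Wk, Wv) for h in range(H)], axis=0)
    return along_w                                   # (H, W, d)

fmap = np.random.rand(14, 14, 64)                    # e.g. a res4-sized feature map
print(axial_attention(fmap, Wq, Wk, Wv).shape)       # (14, 14, 64)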
Stand-alone (local) self-attention: the high complexity comes from attending over the whole image, so attention is restricted to a local window; by controlling the window size, the computational cost can be kept within an acceptable range.
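Similarly, window-restricted attention can be sketched by running the same self_attention helper inside non-overlapping blocks (again a toy illustration, assuming the window size divides the feature map evenly):

def windowed_attention(feature_map, window, Wq, Wk, Wv):
    """Self-attention restricted to non-overlapping window x window blocks of an (H, W, d) map."""
    H, W, d = feature_map.shape
    out = np.zeros_like(feature_map)
    for i in range(0, H, window):
        for j in range(0, W, window):
            block = feature_map[i:i+window, j:j+window, :].reshape(-1, d)   # window*window tokens
            out[i:i+window, j:j+window, :] = self_attention(block, Wq, Wk, Wv).reshape(window, window, d)
    return out

print(windowed_attention(np.random.rand(14, 14, 64), 7, Wq, Wk, Wv).shape)  # (14, 14, 64)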
These works that replace convolution entirely with self-attention are theoretically very efficient, but they rely on rather specialized self-attention operations that have not been accelerated well on current hardware, which makes it hard to train large models with them.
So, for large-scale image recognition, classic residual networks still give the best results.
2 RELATED WORK
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
Transformers were proposed by Vaswani et al. in 2017 for machine translation and have since become the state-of-the-art method for many NLP tasks. Large Transformer-based models are usually pre-trained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task.
Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.
Naively applying self-attention to images would require each pixel to attend to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Therefore, to apply Transformers to image processing, several approximations have been tried. Parmar et al. applied self-attention only within a local neighborhood of each query pixel instead of globally; such local multi-head dot-product self-attention blocks can completely replace convolutions. In another line of work, Sparse Transformers use scalable approximations to global self-attention in order to be applicable to images. Another way to scale attention is to apply it in blocks of varying sizes, in the extreme case only along individual axes. Many of these specialized attention architectures show promising results on CV tasks, but implementing them efficiently on hardware accelerators requires complex engineering.
 
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2
from the input image and applies full self-attention on top. This model is very similar to ViT,
but our work goes further to demonstrate that large scale pre-training makes vanilla transformers
competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020)
use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution
images, while we handle medium-resolution images as well.
The work most related to ours is the model of Cordonnier et al., which extracts 2x2 patches from the input image and applies full self-attention on top. That model is very similar to ViT, but our work goes further and demonstrates that large-scale pre-training makes vanilla Transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. use a small patch size of 2x2 pixels, which makes the model applicable only to low-resolution images, while our method can also handle medium-resolution images.
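A quick way to see why a 2x2 patch size limits the model to small images while 16x16 patches handle medium resolutions (assuming square images and non-overlapping patches):

def num_patches(resolution, patch_size):
    """Number of tokens a square image of the given resolution produces with square patches."""
    return (resolution // patch_size) ** 2

for res in (32, 224):
    print(res, "px image:", num_patches(res, 2), "tokens with 2x2 patches,",
          num_patches(res, 16), "tokens with 16x16 patches")
# 32 px image: 256 tokens with 2x2 patches, 4 tokens with 16x16 patches
# 224 px image: 12544 tokens with 2x2 patches (quadratic attention cost blows up), 196 with 16x16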
 
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).
There has also been a lot of work combining CNNs with self-attention, for example augmenting feature maps for image classification, or further processing the output of a CNN with self-attention for tasks such as object detection, video processing, image classification, unsupervised object discovery, and unified text-vision tasks.
Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation can then be fine-tuned or probed linearly for classification performance, achieving a maximal accuracy of 72% on ImageNet.
Another model related to ViT is image GPT (iGPT), which first reduces the resolution and color space of images and then applies a Transformer to the pixels. iGPT is trained in an unsupervised fashion as a generative model; the resulting representation can then be fine-tuned or linearly probed for classification, reaching a top accuracy of 72% on ImageNet.
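Linear probing here means freezing the pre-trained representation and training only a linear classifier on top. A minimal sketch with a hypothetical frozen_features stand-in (the real iGPT representation is not reproduced here):

import numpy as np
from sklearn.linear_model import LogisticRegression

def frozen_features(images):
    """Stand-in for a frozen pre-trained encoder; in practice this would be the iGPT representation."""
    return images.reshape(len(images), -1)          # here: just flattened pixels

# Toy data: 200 tiny images with binary labels.
rng = np.random.default_rng(0)
images = rng.random((200, 8, 8))
labels = rng.integers(0, 2, size=200)

feats = frozen_features(images)                     # encoder weights stay fixed
probe = LogisticRegression(max_iter=1000).fit(feats, labels)   # only the linear head is trained
print("probe accuracy on the toy data:", probe.score(feats, labels))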
Our work adds to the increasing collection of papers that explore image recognition at larger scales than the standard ImageNet dataset. The use of additional data sources allows to achieve state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020). Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior works.
Our work adds to the growing set of papers that explore image recognition at larger scales than the standard ImageNet dataset. Using additional data sources makes it possible to reach state-of-the-art results on standard benchmarks. Moreover, Sun et al. study how CNN performance scales with dataset size, while Kolesnikov et al. and Djolonga et al. empirically explore CNN transfer learning from large datasets such as ImageNet-21k and JFT-300M. We also focus on these two datasets, but train Transformers instead of the ResNet-based models used in prior work.


Reposted from blog.csdn.net/qq_45828494/article/details/124664464