ConvMixer, a new model challenging ViT and MLP-Mixer: Patches Are All You Need? [Under Review, ICLR 2022]

Convolutions, Attention, MLPs: Patches Are All You Need?

[OpenReview] [GitHub]

Update 2021/11/13: it is now confirmed that the paper was rejected by ICLR 2022, on the grounds that the claim that patches are all you need was not sufficiently supported; all three reviews said the experiments were neither fair enough nor thorough enough. So treat this work as a source of ideas: when you put forward a bold hypothesis, the key is still to back it up with rigorous, careful theory or experiments.

Highlights of this post:

1. The original paper is very short, just over four pages, and the model is very simple, yet it challenges the accepted explanation for why ViT works.

2. It distills what recently popular architectures such as ViT, MLP-Mixer and ResMLP have in common that makes them work so well.

Andrej Karpathy, Senior Director of AI at Tesla, remarked on Twitter that he was blown away by the new ConvMixer architecture.

Contents

Convolutions, Attention, MLPs: Patches Are All You Need?

Abstract

Introduction

A Simple Model: ConvMixer

Experiments

Related Work

Conclusion

Appendix

A Comparison to other models

Experiments on CIFAR-10

Weight Visualizations

Implementation

References


Abstract

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?

In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps.

Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.

Motivation:

Although convolutional networks have dominated vision for years, Transformer-based models such as ViT can now beat them in some settings. Because self-attention scales quadratically in the number of pixels, ViT has to operate on patch embeddings, which group small image regions into single input features. This raises the question: does ViT's performance come from the inherently more powerful Transformer architecture, or at least partly from using patches as the input representation?

Method:

The paper offers evidence for the latter: it proposes ConvMixer, an extremely simple model that, in the spirit of ViT and the even simpler MLP-Mixer, operates directly on patches, separates spatial mixing from channel mixing, and keeps the same size and resolution throughout the network, but performs all of the mixing with standard convolutions only.

Results:

Despite its simplicity, ConvMixer outperforms ViT, MLP-Mixer and some of their variants at similar parameter counts and dataset sizes, and also beats classical vision models such as ResNet.

Introduction

For many years, convolutional neural networks have been the dominant architecture for deep learning systems applied to computer vision tasks. But recently, architectures based upon Transformer models, e.g., the so-called Vision Transformer architecture (Dosovitskiy et al., 2020), have demonstrated compelling performance in many of these tasks, often outperforming classical convolutional architectures, especially for large data sets. An understandable assumption, then, is that it is only a matter of time before Transformers become the dominant architecture for vision domains, just as they have for language processing. In order to apply Transformers to images, however, the representation had to be changed: because the computational cost of the self-attention layers used in Transformers would scale quadratically with the number of pixels per image if applied naively at the per-pixel level, the compromise was to first split the image into multiple “patches”, linearly embed them, and then apply the transformer directly to this collection of patches.

Background: in computer vision today, Transformers look poised to replace traditional convolution. However, the computational cost of self-attention makes pixel-level processing impractical, so the compromise is to split the image into patches, embed them linearly, and feed that collection of patches into the Transformer.

In this work, we explore the question of whether, fundamentally, the strong performance of vision transformers may result more from this patch-based representation than from the Transformer architecture itself. We develop a very simple convolutional architecture which we dub the “ConvMixer” due to its similarity to the recently-proposed MLP-Mixer (Tolstikhin et al., 2021). This architecture is similar to the Vision Transformer (and MLP-Mixer) in many respects: it directly operates on patches, it maintains an equal-resolution-and-size representation throughout all layers, it does no downsampling of the representation at successive layers, and it separates “channel-wise mixing” from the “spatial mixing” of information. But unlike the Vision Transformer and MLP-Mixer, our architecture does all these operations via only standard convolutions.

Approach: the paper argues that a large part of ViT's strength comes from using patches as the input representation, and it proposes a new architecture, ConvMixer. Like ViT and MLP-Mixer, ConvMixer 1) operates directly on patches, 2) keeps a representation of equal resolution and size in all layers, 3) does no downsampling at successive layers, and 4) separates channel-wise mixing from spatial mixing; unlike them, it does all of this with standard convolutions only.

The chief result we show in this paper is that this ConvMixer architecture, despite its extreme simplicity (it can be implemented in ≈ 6 lines of dense PyTorch code), outperforms both “standard” computer vision models such as ResNets of similar parameter counts and some corresponding Vision Transformer and MLP-Mixer variants, even with a slate of additions intended to make those architectures more performant on smaller data sets. This suggests that, at least to some extent, the patch representation itself may be the most critical component to the “superior” performance of newer architectures like Vision Transformers. While these results are naturally just a snapshot, we believe that this provides a strong “convolutional-but-patch-based” baseline to compare against for more advanced architectures in the future.

Takeaway: the patch representation itself appears to be one of the most critical components behind the "superior" performance of Vision Transformers. ConvMixer, implementable in about six lines of dense PyTorch code, outperforms ResNets of similar parameter counts as well as some ViT and MLP-Mixer variants, even when those are augmented to perform better on smaller datasets, and it provides a strong "convolutional-but-patch-based" baseline for future architectures.

A Simple Model: ConvMixer

Our model, dubbed ConvMixer, consists of a patch embedding layer followed by repeated applications of a simple fully-convolutional block. We maintain the spatial structure of the patch embeddings, as illustrated in Fig. 2. Patch embeddings with patch size p and embedding dimension h can be implemented as convolution with c_{in} input channels, h output channels, kernel size p, and stride p:

z_0 = BN (\sigma \{Conv_{c_{in}\rightarrow h}(X, stride=p, kernel\_size=p)\})      (1)

The proposed ConvMixer consists of a patch embedding layer followed by repeated fully-convolutional blocks. The patch embedding is implemented as the convolution in (1): for an input with c_{in} channels, an embedding dimension h and patch size p, the convolution has h output channels, kernel size p and stride p.
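As a minimal sketch of eq. (1), not taken from the paper's released code, the patch embedding is just a strided convolution followed by the activation and BatchNorm; the input size and embedding dimension below are assumptions for illustration:

import torch
import torch.nn as nn

# Patch embedding from eq. (1): a convolution with kernel size p and stride p,
# followed by GELU (the sigma in the paper's implementation) and BatchNorm.
c_in, h, p = 3, 768, 7                      # input channels, embedding dimension, patch size
patch_embed = nn.Sequential(
    nn.Conv2d(c_in, h, kernel_size=p, stride=p),
    nn.GELU(),
    nn.BatchNorm2d(h),
)

x = torch.randn(1, c_in, 224, 224)          # dummy 224x224 RGB image
z0 = patch_embed(x)
print(z0.shape)                             # torch.Size([1, 768, 32, 32]): internal resolution 224 / 7 = 32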

The ConvMixer block itself consists of depthwise convolution (i.e., grouped convolution with groups equal to the number of channels, h) followed by pointwise (i.e., kernel size 1 × 1) convolution. As we will explain in Sec. 3, ConvMixers work best with unusually large kernel sizes for the depthwise convolution. Each of the convolutions is followed by an activation and post-activation BatchNorm:

z'_l = BN (\sigma \{Conv_{Depthwise}(z_{l-1})\}) + z_{l-1}       (2)

z_{l+1} = BN (\sigma \{Conv_{Pointwise}(z'_l)\})        (3)

After many applications of this block, we perform global pooling to get a feature vector of size h, which we pass to a softmax classifier. See Fig. 3 for an implementation of ConvMixer in PyTorch.

import torch.nn as nn

def ConvMixer(h, depth, kernel_size=9, patch_size=7, n_classes=1000):
    Seq, ActBn = nn.Sequential, lambda x: Seq(x, nn.GELU(), nn.BatchNorm2d(h))
    Residual = type('Residual', (Seq,), {'forward': lambda self, x: self[0](x) + x})
    return Seq(ActBn(nn.Conv2d(3, h, patch_size, stride=patch_size)),                           # patch embedding, eq. (1)
               *[Seq(Residual(ActBn(nn.Conv2d(h, h, kernel_size, groups=h, padding="same"))),   # depthwise conv, eq. (2)
                     ActBn(nn.Conv2d(h, h, 1))) for i in range(depth)],                         # pointwise conv, eq. (3)
               nn.AdaptiveAvgPool2d((1,1)), nn.Flatten(), nn.Linear(h, n_classes))

The ConvMixer block itself consists of a depthwise convolution (i.e., a grouped convolution with the number of groups equal to the number of channels h) followed by a pointwise (1 × 1 kernel) convolution. ConvMixers work best with unusually large kernels for the depthwise convolution. Each convolution is followed by an activation and a post-activation BatchNorm, giving equations (2) and (3).

After many repetitions of this block, global pooling yields a feature vector of size h, which is passed to a softmax classifier; see the PyTorch implementation above.
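As a quick sanity check, the fixed snippet above can be instantiated and run on a dummy batch; the kernel size of 7 for ConvMixer-768/32 is our assumption, and the parameter count is only meant to land near the 21M quoted for that model:

import torch

model = ConvMixer(h=768, depth=32, kernel_size=7, patch_size=7, n_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))         # dummy batch of two 224x224 RGB images
print(logits.shape)                                 # torch.Size([2, 1000])
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")          # ~21M, in line with ConvMixer-768/32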

Design parameters

An instantiation of ConvMixer depends on four parameters: (1) the “width” or hidden dimension h (i.e., the dimension of the patch embeddings), (2) the depth d, or the number of repetitions of the ConvMixer layer, (3) the patch size p which controls the internal resolution of the model, (4) the kernel size k of the depthwise convolutional layer. We name ConvMixers after their hidden dimension and depth, like ConvMixer-h/d. We refer to the original input size n divided by the patch size p as the internal resolution; note, however, that ConvMixers support variable-sized inputs.

Design parameters:

A ConvMixer instantiation depends on four parameters: (1) the width or hidden dimension h (the dimension of the patch embeddings), (2) the depth d, i.e., the number of repetitions of the ConvMixer block, (3) the patch size p, which controls the internal resolution of the model, and (4) the kernel size k of the depthwise convolution. ConvMixers are named after their hidden dimension and depth, e.g., ConvMixer-h/d. The original input size n divided by the patch size p is called the internal resolution; note, however, that ConvMixers support variable-sized inputs, as the sketch below illustrates.
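Because a global average pool sits in front of the classifier, the same ConvMixer runs on different input resolutions; the sketch below reuses the ConvMixer function defined earlier, and the model size and resolutions are assumptions for illustration:

import torch

model = ConvMixer(h=512, depth=16, kernel_size=9, patch_size=7, n_classes=1000)
for n in (224, 288):                                # two different input resolutions
    out = model(torch.randn(1, 3, n, n))
    print(f"n={n}: internal resolution {n // 7}x{n // 7}, logits {tuple(out.shape)}")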

Motivation

Our architecture is based on the idea of mixing, as in Tolstikhin et al. (2021). In particular, we chose depthwise convolution to mix spatial locations and pointwise convolution to mix channel locations. A key idea from previous work is that MLPs and self-attention can mix distant spatial locations, i.e., they can have an arbitrarily large receptive field. Consequently, we used convolutions with an unusually large kernel size to mix distant spatial locations.

While self-attention and MLPs are theoretically more flexible, allowing for large receptive fields and content-aware behavior, the inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency. By using such a standard operation, we also get a glimpse into the effect of the patch representation itself in contrast to the conventional pyramid-shaped, progressively downsampling design of convolutional networks.

As in Tolstikhin et al. (2021), the architecture is built around the idea of mixing: depthwise convolution mixes spatial locations and pointwise convolution mixes channel locations. A key observation from prior work is that MLPs and self-attention can mix distant spatial locations, i.e., they can have an arbitrarily large receptive field; hence ConvMixer uses convolutions with unusually large kernels to mix distant spatial locations.

While MLPs and self-attention are theoretically more flexible, allowing large receptive fields and content-aware behaviour, the inductive bias of convolution is well suited to vision tasks and leads to high data efficiency. Using only this standard operation also isolates the effect of the patch representation itself, in contrast to the conventional pyramid-shaped, progressively downsampling design of convolutional networks.
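To make the split concrete, here is a minimal sketch (channel count, kernel size and feature-map size are assumptions): the depthwise convolution mixes only spatial locations within each channel, while the 1 x 1 pointwise convolution mixes only channels at each location:

import torch
import torch.nn as nn

h = 256
depthwise = nn.Conv2d(h, h, kernel_size=9, groups=h, padding="same")   # spatial mixing, one filter per channel
pointwise = nn.Conv2d(h, h, kernel_size=1)                             # channel mixing at each location

x = torch.randn(1, h, 32, 32)
print(depthwise(x).shape, pointwise(x).shape)        # both keep the (1, 256, 32, 32) shape

# The parameter counts reflect the split: the depthwise conv has h * 9 * 9 weights (+ h biases),
# the pointwise conv has h * h weights (+ h biases).
print(sum(p.numel() for p in depthwise.parameters()),    # 20992
      sum(p.numel() for p in pointwise.parameters()))    # 65792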

Experiments

CIFAR-10 Experiments

We first perform smaller-scale experiments on CIFAR-10, where ConvMixers achieve over 96% accuracy with as few as 0.7M parameters, demonstrating the data efficiency of the convolutional inductive bias. Details of these experiments are presented in Appendix B.

Smaller-scale experiments are run first on CIFAR-10, where ConvMixers reach over 96% accuracy with as few as 0.7M parameters, demonstrating the data efficiency of the convolutional inductive bias; details are in Appendix B.

Training setup

We primarily evaluate ConvMixers on ImageNet-1k classification without any pretraining or additional data. We added ConvMixer to the timm framework (Wightman, 2019) and trained it with nearly-standard settings: we used RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), random erasing (Zhong et al., 2020), and gradient norm clipping in addition to default timm augmentation. We used the AdamW (Loshchilov & Hutter, 2018) optimizer and a simple triangular learning rate schedule. Due to limited compute, we did no hyperparameter tuning on ImageNet and trained for fewer epochs than competitors. Consequently, our models could be over- or under-regularized, and the accuracies we report likely underestimate the capabilities of our model.

ConvMixers are evaluated mainly on ImageNet-1k classification without any pretraining or extra data. ConvMixer was added to the timm framework (Wightman, 2019) and trained with nearly standard settings: RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), random erasing (Zhong et al., 2020) and gradient-norm clipping on top of the default timm augmentation, together with the AdamW optimizer (Loshchilov & Hutter, 2018) and a simple triangular learning-rate schedule.

Note: because of limited compute, no hyperparameter tuning was done on ImageNet, and the models were trained for fewer epochs than competitors. The models may therefore be over- or under-regularized, and the reported accuracies likely underestimate what the model can do.
 

Links to and abstracts of the papers behind these training tricks are given at the end of this post.
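For orientation only, here is a minimal sketch of the optimizer and schedule named above (AdamW plus a triangular, one-cycle-style learning-rate ramp). The paper trains through timm with the full augmentation stack; every numeric value below, and the use of OneCycleLR as the triangular schedule, is a placeholder assumption rather than the paper's setting:

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

model = ConvMixer(h=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000)   # small placeholder model
optimizer = AdamW(model.parameters(), lr=1e-2, weight_decay=0.05)                # placeholder hyperparameters
epochs, steps_per_epoch = 150, 1000                                              # placeholders
scheduler = OneCycleLR(optimizer, max_lr=1e-2, epochs=epochs,
                       steps_per_epoch=steps_per_epoch,
                       anneal_strategy="linear")          # linear ramp up then down, i.e. roughly triangular

for _ in range(3):                                        # stand-in for the real training loop
    images = torch.randn(8, 3, 224, 224)                  # dummy batch (real training uses augmented ImageNet)
    labels = torch.randint(0, 1000, (8,))
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient-norm clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()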

Results

A ConvMixer-1536/20 with 52M parameters can achieve 81.4% top-1 accuracy on ImageNet, and a ConvMixer-768/32 with 21M parameters 80.2% (see Table 1). Wider ConvMixers seem to converge in fewer epochs, but are memory- and compute-hungry. They also work best with large kernel sizes: ConvMixer-1536/20 lost ≈ 1% accuracy when reducing the kernel size from k = 9 to k = 3 (we discuss kernel sizes more in Appendix A & B). ConvMixers with smaller patches are substantially better in our experiments, similarly to Sandler et al. (2019); we believe larger patches require deeper ConvMixers. We trained one model with ReLU to demonstrate that GELU (Hendrycks & Gimpel, 2016), which is popular in recent isotropic models, isn’t necessary.

A few important conclusions:

A ConvMixer-1536/20 with 52M parameters reaches 81.4% top-1 accuracy on ImageNet, and a ConvMixer-768/32 with 21M parameters reaches 80.2% (see Table 1). Wider ConvMixers seem to converge in fewer epochs but are memory- and compute-hungry. They also work best with large kernels: ConvMixer-1536/20 lost about 1% accuracy when the kernel size was reduced from k = 9 to k = 3 (kernel sizes are discussed further in Appendices A and B). As in Sandler et al. (2019), ConvMixers with smaller patches are substantially better; larger patches probably require deeper ConvMixers. One model was trained with ReLU to show that GELU (Hendrycks & Gimpel, 2016), popular in recent isotropic models, is not actually necessary.

Comparisons

Our model and training setup closely resemble that of (ImageNet1k-only) DeiT (Touvron et al., 2020); among recent isotropic models, we think that DeiT and ResMLP are the most fair comparisons, in contrast to models like CaiT (Touvron et al., 2021b) that have additional refinements. We trained ResNets using the same process as ours, as the original results are now antiquated.

Looking at Table 1 and Fig. 1, ConvMixers achieve competitive accuracies for a given parameter budget: ConvMixer-1536/20 outperforms both ResNet-152 and ResMLP-B24 despite having substantially fewer parameters and is competitive with DeiT-B. ConvMixer-768/32 uses just a third of the parameters of ResNet-152, but is similarly accurate. However, ConvMixers are substantially slower at inference than the competitors, likely due to their smaller patch size; hyperparameter tuning and optimizations could narrow this gap. For more discussion and comparisons, see Table 2 and Appendix A.

The model and training setup closely resemble those of (ImageNet-1k-only) DeiT (Touvron et al., 2020); among recent isotropic models, DeiT and ResMLP are the fairest comparisons, unlike models such as CaiT (Touvron et al., 2021b) that have additional refinements. The ResNets were retrained with the same procedure, since the original results are now outdated.

Looking at Table 1 and Fig. 1, ConvMixers achieve competitive accuracy for a given parameter budget: ConvMixer-1536/20 outperforms both ResNet-152 and ResMLP-B24 despite having far fewer parameters, and it is competitive with (slightly below) DeiT-B. ConvMixer-768/32 uses only a third of the parameters of ResNet-152 but is similarly accurate. However, ConvMixers are substantially slower at inference than these competitors, likely because of their smaller patch size; hyperparameter tuning and optimization could narrow the gap. See Table 2 and Appendix A for more discussion and comparisons.

Related Work

Isotropic architectures

Vision transformers have inspired a new paradigm of “isotropic” architectures, i.e., those with equal size and shape throughout the network, which use patch embeddings for the first layer. These models look similar to repeated transformer-encoder blocks (Vaswani et al., 2017) with different operations replacing the self-attention and MLP operations. For example, MLP-Mixer (Tolstikhin et al., 2021) replaces them both with MLPs applied across different dimensions (i.e., spatial and channel location mixing); ResMLP (Touvron et al., 2021a) is a data-efficient variation on this theme. CycleMLP (Chen et al., 2021), gMLP (Liu et al., 2021), and vision permutator (Hou et al., 2021), replace one or both blocks with various novel operations. These are all quite performant, which is typically attributed to the novel choice of operations. However, as our investigation of ConvMixers suggests, these works may conflate the effect of the new operation with the effect of the use of patch embeddings and the resulting isotropic architecture.

Vision Transformers have inspired a new paradigm of "isotropic" architectures, i.e., architectures that keep the same size and shape throughout the network and use patch embeddings for the first layer. These models resemble repeated Transformer-encoder blocks with different operations replacing self-attention and the MLP. For example, MLP-Mixer (Tolstikhin et al., 2021) replaces both with MLPs applied across different dimensions (spatial and channel mixing); ResMLP (Touvron et al., 2021a) is a data-efficient variant of this theme; CycleMLP (Chen et al., 2021), gMLP (Liu et al., 2021) and Vision Permutator (Hou et al., 2021) replace one or both blocks (i.e., self-attention and the MLP) with various novel operations. All of these perform quite well, which is usually attributed to the novel choice of operations. However, as the ConvMixer study suggests, these works may be conflating the effect of the new operation with the effect of using patch embeddings and the resulting isotropic architecture.

A study predating vision transformers investigates isotropic (or “isometric”) MobileNets (Sandler et al., 2019), and even implements patch embeddings under another name. Their architecture simply repeats an isotropic MobileNetv3 block. They identify a tradeoff between patch size and accuracy that matches our experience, and train similarly performant models (see Appendix A, Table 2). However, their block is substantially more complex than ours; simplicity and motivation sets our work apart.

A study predating vision transformers investigated isotropic (or "isometric") MobileNets (Sandler et al., 2019) and even implemented patch embeddings under another name; their architecture simply repeats an isotropic MobileNetv3 block. They identified a tradeoff between patch size and accuracy that matches the experience here, and trained similarly performant models (see Appendix A, Table 2). However, their block is substantially more complex; simplicity and motivation set this work apart.

Patches aren’t all you need

Several papers have increased vision transformer performance by replacing standard patch embeddings with a different stem: Xiao et al. (2021) and Yuan et al. (2021a) use a standard convolutional stem, while Yuan et al. (2021b) repeatedly combines nearby patch embeddings. However, this conflates the effect of using patch embeddings with the effect of adding convolution or similar inductive biases e.g., locality. We attempt to focus on the use of patches.

Several papers improve vision transformers by replacing the standard patch embedding with a different stem: Xiao et al. (2021) and Yuan et al. (2021a) use a standard convolutional stem, while Yuan et al. (2021b) repeatedly combine nearby patch embeddings. However, this conflates the effect of using patch embeddings with the effect of adding convolution or similar inductive biases such as locality; this paper tries to focus on the use of patches.

CNNs meet ViTs

Many efforts have been made to incorporate features of convolutional networks into vision transformers and vice versa. Self-attention can emulate convolution (Cordonnier et al., 2019) and can be initialized or regularized to be like it (d’Ascoli et al., 2021); other works simply add convolution operations to transformers (Dai et al., 2021; Guo et al., 2021), or include downsampling to be more like traditional pyramid-shaped convolutional networks (Wang et al., 2021). Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Bello et al., 2019; Ramachandran et al., 2019; Bello, 2021). While all of these attempts have been successful in one way or another, they are orthogonal to this work, which aims to emphasize the effect of the architecture common to most ViTs by showcasing it with a less-expressive operation.

This section reviews work that combines convolution with Transformers. Many efforts have been made to bring convolutional features into vision transformers and vice versa, including:

Self-attention can emulate convolution (Cordonnier et al., 2019);

Self-attention can be initialized or regularized to behave like convolution (d'Ascoli et al., 2021);

Convolution operations can simply be added to transformers (Dai et al., 2021; Guo et al., 2021);

Downsampling can be included to make the model more like a traditional pyramid-shaped convolutional network (Wang et al., 2021);

Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Bello et al., 2019; Ramachandran et al., 2019; Bello, 2021).

While all of these attempts have been successful in one way or another, they are orthogonal to this work, which aims to highlight the effect of the architecture common to most ViTs by showcasing it with a less expressive operation.

Conclusion

We presented ConvMixers, an extremely simple class of models that independently mixes the spatial and channel locations of patch embeddings using only standard convolutions. While neither our model nor our experiments were designed to maximize accuracy or speed, ConvMixers outperform the Vision Transformer and MLP-Mixer, and are competitive with ResNets, DeiTs, and ResMLPs.

The paper presents ConvMixers, an extremely simple class of models that independently mixes the spatial and channel locations of patch embeddings using only standard convolutions. Although neither the model nor the experiments were designed to maximize accuracy or speed, ConvMixers outperform ViT and MLP-Mixer and are competitive with ResNets, DeiTs and ResMLPs.

We provided evidence that the increasingly common “isotropic” architecture with a simple patch embedding stem is itself a powerful template for deep learning. Patch embeddings allow all the downsampling to happen at once, immediately decreasing the internal resolution and thus increasing the effective receptive field size, making it easier to mix distant spatial information. Our title, while an exaggeration, points out that attention isn’t the only export from language processing into computer vision: tokenizing inputs, i.e., using patch embeddings, is also a powerful and important takeaway.

The paper provides evidence that the increasingly common "isotropic" architecture with a simple patch-embedding stem is itself a powerful template for deep learning. Patch embeddings let all of the downsampling happen at once, immediately reducing the internal resolution and thereby increasing the effective receptive field, which makes it easier to mix distant spatial information (this is why the patch embedding is such a powerful representation). The title, though an exaggeration, points out that attention is not the only export from language processing to computer vision: tokenizing the input, i.e., using patch embeddings, is also a powerful and important takeaway.

While our model is not state-of-the-art, we find its simple patch-mixing design to be compelling. We hope that ConvMixers can serve as a baseline for future patch-based architectures with novel operations, or that they can provide a basic template for new conceptually simple and performant models.

Although the model is not state of the art, its simple patch-mixing design is compelling. The hope is that ConvMixers can serve as a baseline for future patch-based architectures with novel operations, or provide a basic template for new, conceptually simple and performant models.

Future work

We are optimistic that a deeper ConvMixer with larger patches could reach a desirable tradeoff between accuracy, parameters, and throughput after longer training and more regularization and hyperparameter tuning. Low-level optimization of large-kernel depthwise convolution could substantially increase throughput. Similarly, small enhancements to our architecture like the addition of bottlenecks or a more expressive classifier could trade simplicity for performance.

The authors are optimistic that a deeper ConvMixer with larger patches could reach a good tradeoff between accuracy, parameters and throughput after longer training, more regularization and hyperparameter tuning. Low-level optimization of large-kernel depthwise convolution could substantially increase throughput, and small architectural additions such as bottlenecks or a more expressive classifier could trade some simplicity for performance.

A note on paper length

Expecting more text in this paper? Wondering if it’s a workshop paper we hastily submitted to ICLR? No. This paper presents a simple idea, one where we genuinely believe that a short paper presentation is more effective. Do we really need exactly 8 (now 9? 10?) pages to describe every machine learning architecture and algorithm in existence? We proposed an incredibly simple architecture and made a very simple point that we think is worth more discussion: patches work well in convolutional architectures. We think that four pages is more than enough space for this. The details of the experiments and architectures are in the appendix for those who want to read through it all.

Expecting more text in this paper? Wondering whether it is a workshop paper hastily submitted to ICLR? No. The paper presents a simple idea, one the authors genuinely believe is presented more effectively in a short paper. Do we really need exactly 8 (now 9? 10?) pages to describe every machine learning architecture and algorithm in existence? The paper proposes an extremely simple architecture and makes one very simple point worth more discussion: patches work well in convolutional architectures. Four pages is more than enough for that; the experimental and architectural details are in the appendix for those who want to read it all.

Appendix

A  Comparison to other models

​ 

Experiment overview

We did not design our experiments to maximize accuracy: We chose “common sense” parameters for timm and its augmentation settings, found that it worked well for a ConvMixer-1024/12, and stuck with them for the proceeding experiments. We admit this is not an optimal strategy, however, we were aware from our early experiments on CIFAR-10 that results seemed robust to various small changes. We did not have access to sufficient compute to attempt to tune hyperparameters for each model: e.g., larger ConvMixers could probably benefit from more regularization than we chose, and smaller ones from less regularization. Keeping the parameters the same across ConvMixer instances seemed more reasonable than guessing for each.

The experiments were not designed to maximize accuracy: "common sense" parameters were chosen for timm and its augmentation settings, found to work well for a ConvMixer-1024/12, and kept for the subsequent experiments. The authors admit this is not an optimal strategy, but the early CIFAR-10 experiments suggested the results were robust to various small changes, and there was not enough compute to tune hyperparameters per model: larger ConvMixers would probably benefit from more regularization than was chosen, and smaller ones from less. Keeping the parameters the same across ConvMixer instances seemed more reasonable than guessing for each.

However, to some extent, we changed the number of epochs per model: for earlier experiments, we merely wanted a “proof of concept”, and used only 90–100 epochs. Once we saw potential, we increased this to 150 epochs and trained some larger models, namely ConvMixer-1024/20 with p = 14 patches and ConvMixer-1536/20 with p = 7 patches. Then, believing that we should explore deeper-but-less-wide ConvMixers, and knowing from CIFAR-10 that the deeper models converged more slowly, we trained ConvMixer-768/32s with p = 14 and p = 7 for 300 epochs. Of course, training time was a consideration: ConvMixer-1536/20 took about 9 days to train (on 10× RTX8000s) 150 epochs, and ConvMixer-768/32 is over twice as fast, making 300 epochs more feasible.

However, the number of epochs did change somewhat across models: the early experiments only needed a proof of concept and used 90-100 epochs. Once the potential was clear, this was increased to 150 epochs to train some larger models, namely ConvMixer-1024/20 with p = 14 patches and ConvMixer-1536/20 with p = 7 patches. Then, to explore deeper-but-narrower ConvMixers, and knowing from CIFAR-10 that deeper models converge more slowly, ConvMixer-768/32 with p = 14 and p = 7 were trained for 300 epochs. Training time was a consideration: ConvMixer-1536/20 took about 9 days (on 10 RTX8000s) for 150 epochs, while ConvMixer-768/32 is more than twice as fast, making 300 epochs feasible.

If anything, we believe that in the worst case, the lack of parameter tuning in our experiments resulted in underestimating the accuracies of ConvMixers. Further, due to our limited compute and the fact that large models (particularly ConvMixers) are expensive to train on large data sets, we generally trained our models for fewer epochs than competition like DeiT and ResMLP (see Table 2).

The authors believe that, if anything, the lack of parameter tuning led to underestimating ConvMixer accuracy. Moreover, because of limited compute and the cost of training large models (especially ConvMixers) on large datasets, the models were generally trained for fewer epochs than competitors such as DeiT and ResMLP (see Table 2).

A note on throughput

We measured throughput using batches of 64 images in half precision on a single RTX8000 GPU, averaged over 20 such batches. In particular, we measured CUDA execution time rather than “wall-clock” time. We noticed discrepancies in the relative throughputs of models, e.g., Touvron et al. (2020) reports that ResNet-152 is 2× faster than DeiT-B, but our measurements show that it is only 1.25× faster. We therefore speculate that our throughputs may underestimate the performance of ResNets and ConvMixers relative to the transformers. The difference may be due to using RTX8000 rather than V100 GPUs, or other low-level differences. Our throughputs were similar for batch sizes 32 and 128.

Throughput was measured with batches of 64 images in half precision on a single RTX8000 GPU, averaged over 20 such batches, measuring CUDA execution time rather than wall-clock time. There are discrepancies with previously reported relative throughputs: Touvron et al. (2020) report ResNet-152 as 2× faster than DeiT-B, whereas these measurements show only 1.25×. The throughputs here may therefore underestimate ResNets and ConvMixers relative to the transformers; the difference may come from using RTX8000 rather than V100 GPUs, or from other low-level details. Throughput was similar for batch sizes 32 and 128.
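A hedged sketch of this kind of measurement, timing CUDA execution with events rather than wall-clock time and using half precision; the function name, warm-up count and default arguments are our choices, not the paper's script:

import torch

def cuda_images_per_second(model, batch_size=64, image_size=224, n_batches=20):
    model = model.cuda().half().eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda", dtype=torch.half)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    seconds = []
    with torch.no_grad():
        for _ in range(5):                     # warm-up batches, not timed
            model(x)
        for _ in range(n_batches):
            start.record()
            model(x)
            end.record()
            torch.cuda.synchronize()           # wait for the GPU before reading the timer
            seconds.append(start.elapsed_time(end) / 1000.0)   # elapsed_time returns milliseconds
    return batch_size / (sum(seconds) / len(seconds))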

ResNets

As a simple baseline to which to compare ConvMixers, we trained three standard ResNets using exactly the same training setup and parameters as ConvMixer-1536/20. Despite having fewer parameters and being architecturally much simpler, ConvMixers substantially outperform these ResNets in terms of accuracy. A possible confounding factor is that ConvMixers use GELU, which may boost performance, while ResNets use ReLU. In an attempt to rule out this confound, we used ReLU in a later ConvMixer-768/32 experiment and found that it still achieved competitive accuracy. We also note that the choice of ReLU vs. GELU was not important on CIFAR-10 experiments (see Table 3). However, ConvMixers do have substantially less throughput.

As a simple baseline, three standard ResNets were trained with exactly the same setup and parameters as ConvMixer-1536/20. Despite having fewer parameters and a much simpler architecture, ConvMixers substantially outperform these ResNets in accuracy. A possible confound is that ConvMixers use GELU, which may boost performance, while ResNets use ReLU; to rule this out, a later ConvMixer-768/32 experiment used ReLU and still reached competitive accuracy, and the ReLU-vs-GELU choice did not matter in the CIFAR-10 experiments (see Table 3). However, ConvMixers do have substantially lower throughput.
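One way to make the activation a controlled variable is to parameterize it; a minimal sketch in the style of the terse implementation above (the activation argument and the ConvMixerAct name are our additions, not the paper's code):

import torch.nn as nn

def ConvMixerAct(h, depth, kernel_size=9, patch_size=7, n_classes=1000, activation=nn.GELU):
    # Same structure as the ConvMixer above, with a swappable activation for the ReLU-vs-GELU check.
    Seq = nn.Sequential
    ActNorm = lambda x: Seq(x, activation(), nn.BatchNorm2d(h))
    Residual = type('Residual', (Seq,), {'forward': lambda self, x: self[0](x) + x})
    return Seq(ActNorm(nn.Conv2d(3, h, patch_size, stride=patch_size)),
               *[Seq(Residual(ActNorm(nn.Conv2d(h, h, kernel_size, groups=h, padding="same"))),
                     ActNorm(nn.Conv2d(h, h, 1))) for i in range(depth)],
               nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(h, n_classes))

relu_variant = ConvMixerAct(768, 32, activation=nn.ReLU)   # e.g. a ReLU ConvMixer-768/32 variant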

DeiTs

We believe that DeiT is the most reasonable comparison in terms of vision transformers: It only adds additional regularization, as opposed to architectural additions in the case of CaiT (Touvron et al., 2021b), and is then essentially a “vanilla” ViT modulo the distillation token (we don’t consider distilled architectures). In terms of a fixed parameter budget, ConvMixers generally outperform DeiTs. For example, ConvMixer-1536/20 is only 0.43% less accurate than DeiT-B despite having over 30M fewer parameters; ConvMixer-768/32 is 0.36% more accurate than DeiT-S despite having 0.9M fewer parameters; and ConvMixer-512/16 is 0.39% more accurate than DeiT-Ti for nearly the same number of parameters. Admittedly, none of the ConvMixers are very competitive in terms of throughput, with the closest being the ConvMixer-512/16 which is 5× slower than DeiT-Ti.

DeiT is the most reasonable vision-transformer comparison: it only adds extra regularization, as opposed to architectural additions as in CaiT (Touvron et al., 2021b), and is essentially a "vanilla" ViT up to the distillation token (distilled architectures are not considered). For a fixed parameter budget, ConvMixers generally outperform DeiTs: ConvMixer-1536/20 is only 0.43% less accurate than DeiT-B despite having over 30M fewer parameters; ConvMixer-768/32 is 0.36% more accurate than DeiT-S with 0.9M fewer parameters; and ConvMixer-512/16 is 0.39% more accurate than DeiT-Ti at nearly the same parameter count. Admittedly, none of the ConvMixers are very competitive in throughput, the closest being ConvMixer-512/16, which is 5× slower than DeiT-Ti.

A confounding factor is the difference in patch size between DeiT and ConvMixer; DeiT uses p = 16 while ConvMixer uses p = 7. This means DeiT is substantially faster. However, ConvMixers using larger patches are not as competitive. While we were not able to train DeiTs with larger patch sizes, it is possible that they would outperform ConvMixers on the parameter count vs. accuracy curve; however, we tested their throughput for p = 7, and they are even slower than ConvMixers. Given the difference between convolution and self-attention, we are not sure it is salient to control for patch size differences.

A confounding factor is the difference in patch size: DeiT uses p = 16 while ConvMixer uses p = 7, which makes DeiT substantially faster, and ConvMixers with larger patches are not as competitive. DeiTs with larger patch sizes could not be trained here, and they might beat ConvMixers on the parameter-count-versus-accuracy curve; however, their throughput was tested at p = 7 and they are even slower than ConvMixers. Given the difference between convolution and self-attention, it is not clear that controlling for patch size is the salient comparison.

DeiTs were subject to more hyperparameter tuning than ConvMixers, as well as longer training times. They also used stochastic depth while we did not, which can in some cases contribute percent differences in model accuracy (Touvron et al., 2021a). It is therefore possible that further hyperparameter tuning and more epochs for ConvMixers could close the gap between the two architectures for large patches, e.g., p = 16.

DeiTs received more hyperparameter tuning and longer training than ConvMixers, and they used stochastic depth, which was not used here and can in some cases account for percent-level differences in accuracy (Touvron et al., 2021a). Further hyperparameter tuning and more epochs might therefore close the gap between the two architectures for large patches, e.g., p = 16.

ResMLPs

Similarly to DeiT for ViT, we believe that ResMLP is the most relevant MLP-Mixer variant to compare against. Unlike DeiT, we can compare against instances of ResMLP with similar patch size: ResMLP-B24/8 has p = 8 patches, and underperforms ConvMixer-1536/20 by 0.37%, despite having over twice the number of parameters; it also has similarly low throughput. ConvMixer-768/32 also outperforms ResMLP-S12/8 for millions fewer parameters, but 3× less throughput.

ResMLP did not significantly improve in terms of accuracy for halving the patch size from 16 to 8, which shows that smaller patches do not always lead to better accuracy for a fixed architecture and regularization strategy (e.g., training a p = 8 DeiT may be challenging).

Analogously to DeiT for ViT, ResMLP is the most relevant MLP-Mixer variant to compare against. Unlike DeiT, ResMLP instances with a similar patch size are available: ResMLP-B24/8 has p = 8 patches and underperforms ConvMixer-1536/20 by 0.37% despite having more than twice as many parameters; it also has similarly low throughput. ConvMixer-768/32 likewise outperforms ResMLP-S12/8 with millions fewer parameters, but 3× lower throughput.

ResMLP did not improve significantly in accuracy when halving the patch size from 16 to 8, which shows that smaller patches do not always bring better accuracy for a fixed architecture and regularization strategy (e.g., training a p = 8 DeiT may be challenging).

Isotropic MobileNets

These models are closest in design to ours, despite using a repeating block that is substantially more complex than the ConvMixer one. Despite this, for a similar number of parameters, we can get similar performance. Notably, isotropic MobileNets seem to suffer less from larger patch sizes than do ConvMixers, which makes us optimistic that sufficient parameter tuning could lead to more performant large-patch ConvMixers.

These models are the closest in design, despite using a repeated block substantially more complex than the ConvMixer block, and they reach similar performance at a similar parameter count. Notably, isotropic MobileNets seem to suffer less from larger patch sizes than ConvMixers do, which gives reason for optimism that sufficient parameter tuning could produce more performant large-patch ConvMixers.

Other models

We included ViT and MLP-Mixer instances in our table, though they are not competitive with ConvMixer, DeiT, or ResMLP, even though MLP-Mixer has comparable regularization to ConvMixer. That is, ConvMixer seems to outperform MLP-Mixer and ViT, while being closer in complexity to them in terms of design and training regime than the other competitors, DeiT and ResMLP.

ViT and MLP-Mixer instances are included in the table, though they are not competitive with ConvMixer, DeiT or ResMLP, even though MLP-Mixer has regularization comparable to ConvMixer. That is, ConvMixer seems to outperform MLP-Mixer and ViT while being closer to them in design and training regime than the other competitors, DeiT and ResMLP.

Kernel size

While we found some evidence that larger kernels are better on CIFAR-10, we wanted to see if this finding transferred to ImageNet. Consequently, we trained our best-performing model, ConvMixer-1536/20, with kernel size k = 3 rather than k = 9. This resulted in a decrease of 0.94% top-1 accuracy, which we believe is quite significant relative to the mere 2.2M additional parameters. However, k = 3 is substantially faster than k = 9 for spatial-domain convolution; we speculate that low-level optimizations could close the performance gap to some extent, e.g., by using implicit instead of explicit padding. Since large-kernel convolutions throughout a model are unconventional, there has likely been low demand for such optimizations.

Although there was some evidence that larger kernels are better on CIFAR-10, it was worth checking whether the finding transfers to ImageNet. The best-performing model, ConvMixer-1536/20, was therefore trained with kernel size k = 3 instead of k = 9, which cost 0.94% top-1 accuracy, quite significant relative to the mere 2.2M additional parameters. However, k = 3 is substantially faster than k = 9 for spatial-domain convolution; low-level optimizations, e.g., implicit rather than explicit padding, could close the gap to some extent. Since large-kernel convolutions throughout a model are unconventional, there has likely been little demand for such optimizations.
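The ≈2.2M figure can be checked against the parameter-count formula (4) given in Appendix B: only the depthwise kernels depend on k, so the extra parameters from going from k = 3 to k = 9 are

\Delta \#params = h \cdot d \cdot (9^2 - 3^2) = 1536 \cdot 20 \cdot 72 = 2{,}211{,}840 \approx 2.2M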

B  Experiments on CIFAR-10

Residual connections

We experimented with leaving out one, the other, or both residual connections before settling on the current configuration, and consequently chose to leave out the second residual connection. Our baseline model without the connection achieves 95.88% accuracy, while including the connection reduces it to 94.78%. Surprisingly, we see only a 0.31% decrease in accuracy for removing all residual connections. We acknowledge that these findings for residual connections may not generalize to deeper ConvMixers trained on larger data sets.

Before settling on the current configuration, one, the other, or both residual connections were left out, and the second residual connection was ultimately dropped. The baseline model without that connection reaches 95.88% accuracy, while including it reduces accuracy to 94.78%. Surprisingly, removing all residual connections costs only 0.31% accuracy. These findings about residual connections may not generalize to deeper ConvMixers trained on larger datasets.

Normalization

Our model is conceptually similar to the vision transformer and MLP-Mixer, both of which use LayerNorm instead of BatchNorm. We attempted to use LayerNorm instead, and saw a decrease in performance of around 1% as well as slower convergence (see Table 3). However, this was for a relatively shallow model, and we cannot guarantee that LayerNorm would not hinder ImageNet-scale models to an even larger degree. We note that the authors of ResMLP also saw a relatively small increase in accuracy for replacing LayerNorm with BatchNorm, but for a larger-scale experiment (Touvron et al., 2021a). We conclude that BatchNorm is no more crucial to our architecture than other regularizations or parameter settings (e.g., kernel size).

The model is conceptually similar to ViT and MLP-Mixer, both of which use LayerNorm rather than BatchNorm. Replacing BatchNorm with LayerNorm cost about 1% accuracy and slowed convergence (see Table 3). However, this was for a relatively shallow model, and LayerNorm might hinder ImageNet-scale models even more. The ResMLP authors also saw a relatively small accuracy gain from replacing LayerNorm with BatchNorm, but on a larger-scale experiment (Touvron et al., 2021a). The conclusion is that BatchNorm is no more crucial to the architecture than other regularization or parameter settings (e.g., kernel size).
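A hedged sketch of the kind of substitution being described: a LayerNorm over the channel dimension of an NCHW feature map that could stand in for the post-activation BatchNorm2d. The module name and the permute-based placement are our assumptions, not the paper's code:

import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of an NCHW feature map."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # NCHW -> NHWC, normalize over the last (channel) dimension, then back to NCHW.
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

# In the ConvMixer definition above one could then use, for example,
#   ActNorm = lambda x: nn.Sequential(x, nn.GELU(), ChannelLayerNorm(h))
# in place of the BatchNorm2d-based ActBn.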

Having settled on an architecture, we proceeded to adjust its parameters h, d, p, k as well as weight decay on CIFAR-10 experiments. (Initially, we took the unconventional approach of excluding weight decay since we were already using strong regularization in the form of RandAug and mixup.) We acknowledge that tuning our architecture on CIFAR-10 does not necessarily generalize to performance on larger data sets, and that this is a limitation of our study.

After settling on the architecture, the parameters h, d, p, k and the weight decay were adjusted in CIFAR-10 experiments. (Initially, weight decay was left out, an unconventional choice, since strong regularization in the form of RandAug and mixup was already in use.) The authors acknowledge that tuning the architecture on CIFAR-10 does not necessarily transfer to larger datasets, which is a limitation of the study.

B.1 Results

ConvMixers are quite performant on CIFAR-10, easily achieving > 91% accuracy for as little as 100,000 parameters, or > 96% accuracy for only 887,000 parameters (see Table 4). With additional refinements e.g., a more expressive classifier or bottlenecks, we think that ConvMixer could be even more competitive. For all experiments, we trained for 200 epochs on CIFAR-10 with RandAug, mixup, cutmix, random erasing, gradient norm clipping, and the standard augmentations in timm. We remove some of these augmentations in Table 3, finding that RandAug and random scaling (“default” in timm) are very important, each accounting for over 3% of the accuracy.

ConvMixers perform quite well on CIFAR-10, easily exceeding 91% accuracy with as few as 100,000 parameters and exceeding 96% with only 887,000 parameters (see Table 4). With additional refinements such as a more expressive classifier or bottlenecks, ConvMixer could be even more competitive. All experiments trained for 200 epochs on CIFAR-10 with RandAug, mixup, CutMix, random erasing, gradient-norm clipping and the standard timm augmentations. Removing some of these augmentations (Table 3) shows that RandAug and random scaling ("default" in timm) are very important, each accounting for over 3% of the accuracy.

Scaling ConvMixer

We adjusted the hidden dimension h and the depth d, finding that deeper networks take longer to converge while wider networks converge faster. That said, increasing the width or the depth is an effective way to increase accuracy; a doubling of depth incurs less compute than a doubling of width. The number of parameters in a ConvMixer is given exactly by:

\# params = h[d(k^2 + h + 6) + c_{in}p^2 + n_{classes} + 3] + n_{classes}    (4)

including affine scaling parameters in BatchNorm layers, convolutional kernels, and the classifier.

Adjusting the hidden dimension h and the depth d shows that deeper networks take longer to converge while wider networks converge faster. That said, increasing either width or depth is an effective way to raise accuracy, and doubling the depth costs less compute than doubling the width. The number of parameters in a ConvMixer is given exactly by equation (4), which counts the affine scaling parameters in the BatchNorm layers, the convolutional kernels and the classifier.
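Equation (4) is easy to check in code; the two calls below use the CIFAR-10 setting from this appendix (c_in = 3, n_classes = 10, and, per the patch-size discussion, p = 1) and reproduce the roughly 707K and 698K parameter counts quoted in the kernel-size comparison below:

def convmixer_params(h, d, k, p, c_in=3, n_classes=10):
    # Eq. (4): BatchNorm affine parameters, convolution kernels (with biases) and the classifier.
    return h * (d * (k ** 2 + h + 6) + c_in * p ** 2 + n_classes + 3) + n_classes

print(convmixer_params(h=256, d=8,  k=9, p=1))    # 706570  ~ 707K  (ConvMixer-256/8  with k = 9)
print(convmixer_params(h=256, d=10, k=3, p=1))    # 697866  ~ 698K  (ConvMixer-256/10 with k = 3)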

Kernel size

We initially hypothesized that large kernels would be important for ConvMixers, as they would allow the mixing of distant spatial information similarly to unconstrained MLPs or self-attention layers. We tried to investigate the effect of kernel size on CIFAR-10: we fixed the model to be a ConvMixer-256/8, and increased the kernel size by 2s from 3 to 15.

Using a kernel size of 3, the ConvMixer only achieves 93.61% accuracy. Simply increasing it to 5 gives an additional 1.50% accuracy, and further to 7 an additional 0.61%. The gains afterwards are relatively marginal, with kernel size 15 giving an additional 0.28% accuracy. It could be that with more training iterations or more regularization, the effect of larger kernels would be more pronounced. Nonetheless, we concluded that ConvMixers benefit from larger-than-usual kernels, and thus used kernel sizes 7 or 9 in most of our later experiments.

The initial hypothesis was that large kernels are important for ConvMixers because they allow the mixing of distant spatial information, similarly to unconstrained MLPs or self-attention layers. To investigate the effect of kernel size on CIFAR-10, the model was fixed to ConvMixer-256/8 and the kernel size was increased in steps of 2 from 3 to 15.

With a kernel size of 3, the ConvMixer reaches only 93.61% accuracy. Increasing it to 5 adds 1.50% accuracy, and going to 7 adds a further 0.61%. Gains after that are relatively marginal, with kernel size 15 adding another 0.28%. With more training iterations or more regularization, the effect of larger kernels might be more pronounced. Nonetheless, the conclusion is that ConvMixers benefit from larger-than-usual kernels, so kernel sizes 7 or 9 were used in most later experiments.

It is conventional wisdom that large-kernel convolutions can be “decomposed” into stacked small-kernel convolutions with activations between them, and it is therefore standard practice to use k = 3 convolutions, stacking more of them to increase the receptive field size with additional benefits from nonlinearities. This raises a question: is the benefit of larger kernels in ConvMixer actually better than simply increasing the depth with small kernels? First, we note that deeper networks are generally harder to train, so by increasing the kernel size independently of the depth, we may recover some of the benefits of depth without making it harder for signals to “propagate back” through the network. To test this, we trained a ConvMixer-256/10 with k = 3 (698K parameters) in the same setting as a ConvMixer-256/8 with k = 9 (707K parameters), i.e., we increased depth in a small-kernel model to roughly match the parameters of a large-kernel model. The ConvMixer-256/10 achieved 94.29% accuracy (1.5% less), which provides more evidence for the importance of larger kernels in ConvMixers. Next, instead of fixing the parameter budget, we tripled the depth (using the intuition that 3 stacked k = 3 convolutions have the receptive field of a k = 9 convolution), giving a ConvMixer-256/24 with 1670K parameters, and got 95.16% accuracy, i.e., still less.

Conventional wisdom says that large-kernel convolutions can be "decomposed" into stacked small-kernel convolutions with activations in between, so the standard practice is to use k = 3 convolutions and stack more of them to grow the receptive field, with extra benefit from the nonlinearities. This raises the question: is the benefit of larger kernels in ConvMixer actually greater than simply increasing depth with small kernels? First, deeper networks are generally harder to train, so increasing the kernel size independently of the depth may recover some of the benefits of depth without making it harder for signals to propagate back through the network. To test this, a ConvMixer-256/10 with k = 3 (698K parameters) was trained in the same setting as a ConvMixer-256/8 with k = 9 (707K parameters), i.e., the depth of the small-kernel model was increased to roughly match the parameter count of the large-kernel model. The ConvMixer-256/10 reached 94.29% accuracy (1.5% less), further evidence for the importance of larger kernels. Next, instead of fixing the parameter budget, the depth was tripled (using the intuition that three stacked k = 3 convolutions have the receptive field of one k = 9 convolution), giving a ConvMixer-256/24 with 1670K parameters and 95.16% accuracy, still lower. In other words, stacking three k = 3 layers is worse than directly using k = 9 in ConvMixers.

Patch size

CIFAR-10 inputs are so small that we initially only used p = 1, i.e., the patch embedding layer does little more than compute h linear combinations of the input image. Using p = 2, we see a reduction in accuracy of about 0.80%; this is a worthy tradeoff in terms of training and inference time. Further increasing the patch size leads to rapid decreases in accuracy, with only 92.61% for p = 4.

Since the “internal resolution” is decreased by a factor of p when increasing the patch size, we assumed that larger kernels would be less important for larger p. We investigated this by again increasing the kernel size from 3 to 11 for ConvMixer-256/8 with p = 2: however, this time, the improvement going from 3 to 5 is only 1.13%, and larger kernels than 5 provide only marginal benefit.

CIFAR-10 inputs are so small that p = 1 was used initially, i.e., the patch embedding layer does little more than compute h linear combinations of the input image. With p = 2, accuracy drops by about 0.80%, a worthwhile tradeoff for training and inference time. Increasing the patch size further causes accuracy to fall rapidly, to only 92.61% at p = 4.

Since the internal resolution shrinks by a factor of p as the patch size grows, larger kernels were expected to matter less for larger p. Repeating the kernel-size sweep from 3 to 11 for ConvMixer-256/8 with p = 2 confirms this: the improvement from k = 3 to k = 5 is only 1.13% (versus 1.5% at p = 1), and kernels larger than 5 give only marginal gains.

Weight decay

We did many of our initial experiments with minimal weight decay. However, this was not optimal: by tuning weight decay, we can get an additional 0.15% of accuracy for no cost. Consequently, we used weight decay (without tuning) for our larger-scale experiments on ImageNet.

Many of the initial experiments used minimal weight decay. This was not optimal: tuning the weight decay adds another 0.15% accuracy at no cost. Consequently, weight decay (without tuning) was used for the larger-scale ImageNet experiments.

C  Weight Visualizations

 

In Figure 4 and 5, we visualize the (complete) weights of the patch embedding layers of a ConvMixer-1536/20 with p = 14 and a ConvMixer-768/32 with p = 7, respectively. Much like Sandler et al. (2019), the layer consists of Gabor-like filters as well as “colorful globs” or rough edge detectors. The filters seem to be more structured than those learned by MLP-Mixer (Tolstikhin et al., 2021); also unlike MLP-Mixer, the weights look much the same going from p = 14 to p = 7: the latter simply looks like a downsampled version of the former. It is unclear, then, why we see such a drop in accuracy for larger patches. However, some of the filters essentially look like noise, maybe suggesting a need for more regularization or longer training, or even more data. Ultimately, we cannot read too much into the learned representations here.

Figures 4 and 5 visualize the complete patch-embedding weights of a ConvMixer-1536/20 with p = 14 and a ConvMixer-768/32 with p = 7, respectively. Much as in Sandler et al. (2019), the layer consists of Gabor-like filters as well as "colorful globs" or rough edge detectors. The filters appear more structured than those learned by MLP-Mixer (Tolstikhin et al., 2021); also unlike MLP-Mixer, the weights look much the same going from p = 14 to p = 7, the latter simply looking like a downsampled version of the former. It is therefore unclear why accuracy drops so much for larger patches. Some filters essentially look like noise, perhaps suggesting a need for more regularization, longer training or more data; ultimately, one should not read too much into the learned representations here.
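A hedged sketch of how such a visualization can be produced; the grid layout and normalization are our choices, and it assumes the readable Appendix D implementation below, where the patch embedding is the first module of the Sequential:

import matplotlib.pyplot as plt
import torch

def show_patch_embedding(model, n_cols=32):
    w = model[0].weight.detach().cpu()           # patch embedding weights, shape (h, 3, p, p)
    w = (w - w.min()) / (w.max() - w.min())      # rescale to [0, 1] for display
    n_rows = (w.shape[0] + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for ax in axes.flat:
        ax.axis("off")
    for ax, filt in zip(axes.flat, w):
        ax.imshow(filt.permute(1, 2, 0))         # each filter shown as a p x p x 3 RGB image
    plt.show()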

In Figure 6, we plot the hidden convolutional kernels for successive layers of a ConvMixer. Initially, the kernels seem to be relatively small, but make use of their allowed full size in later layers; there is a clear hierarchy of features as one would expect from a standard convolutional architecture.

Figure 6 plots the hidden convolutional kernels of successive ConvMixer layers. The kernels initially appear relatively small but exploit their full allowed size in later layers; there is a clear hierarchy of features, as one would expect from a standard convolutional architecture.

D  Implementation

A more readable PyTorch (Paszke et al., 2019) implementation of ConvMixer, where h = dim, d = depth, p = patch_size, k = kernel_size:

import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
            nn.GELU(),
            nn.BatchNorm2d(dim),
            *[nn.Sequential(
                    Residual(nn.Sequential(
                        nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                        nn.GELU(),
                        nn.BatchNorm2d(dim)
                    )),
                    nn.Conv2d(dim, dim, kernel_size=1),
                    nn.GELU(),
                    nn.BatchNorm2d(dim)
            ) for i in range(depth)],
            nn.AdaptiveAvgPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(dim, n_classes)
        )

An implementation of our model in exactly 280 characters, in case you happen to know of any means of disseminating information that could benefit from such a length. All you need to do to run this is from torch.nn import *. 

This section presents an expanded (but still quite compact) version of the terse ConvMixer implementation that we presented in the paper. The code is given in Figure 7. We also present an even more terse implementation in Figure 8, which to the best of our knowledge is the first model that achieves the elusive dual goals of 80%+ ImageNet top-1 accuracy while also fitting into a tweet.

def ConvMixr(h,d,k,p,n):
    S,C,A=Sequential,Conv2d,lambda x:S(x,GELU(),BatchNorm2d(h))
    R=type('',(S,),{'forward':lambda s,x:s[0](x)+x})
    return S(A(C(3,h,p,p)),*[S(R(A(C(h,h,k,groups=h,padding=k//2))),A(C(h,h,1))) for i
        in range(d)],AdaptiveAvgPool2d((1,1)),Flatten(),Linear(h,n))

References

 MLP-Mixer

MLP-Mixer: An all-MLP Architecture for Vision [2021]

Abstract  Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. “mixing” the per-location features), and one with MLPs applied across patches (i.e. “mixing” spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

ResMLP

ResMLP: Feedforward networks for image classification with data-efficient training [2021]

Abstract  We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity tradeoffs on ImageNet. We also train ResMLP models in a self-supervised setup, to further remove priors from employing a labelled dataset. Finally, by adapting our model to machine translation we achieve surprisingly good results.

We share pre-trained models and our code based on the Timm library.

CycleMLP

CycleMLP: A MLP-like Architecture for Dense Prediction  [2021]

Abstract  This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions, unlike modern MLP architectures, e.g., MLP-Mixer [49], ResMLP [50], and gMLP [35], whose architectures are correlated to image size and thus are infeasible in object detection and segmentation. CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have quadratic computations because of their fully spatial connections. We build a family of models that surpass existing MLPs and achieve a comparable accuracy (83.2%) on ImageNet-1K classification compared to the state-of-the-art Transformer such as Swin Transformer [36] (83.3%), but using fewer parameters and FLOPs. We expand the MLP-like models’ applicability, making them a versatile backbone for dense prediction tasks. CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models. In particular, CycleMLP achieves 45.1 mIoU on ADE20K val, comparable to Swin (45.2 mIoU). Code is available at https://github.com/ShoufaChen/CycleMLP.

 

gMLP

Pay Attention to MLPs  [2021]

Abstract  Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.

vision permutator

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition [2021]

Abstract  In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without the dependence on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.

 (Wightman, 2019)

PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

RandAugment (Cubuk et al., 2020)

Randaugment: Practical automated data augmentation with a reduced search space. CVPRW 2020

Abstract  Recent work on automated augmentation strategies has led to state-of-the-art results in image classification and object detection. An obstacle to a large-scale adoption of these methods is that they require a separate and expensive search phase. A common way to overcome the expense of the search phase was to use a smaller proxy task. However, it was not clear if the optimized hyperparameters found on the proxy task are also optimal for the actual task. In this work, we rethink the process of designing automated augmentation strategies. We find that while previous work required a search for both magnitude and probability of each operation independently, it is sufficient to only search for a single distortion magnitude that jointly controls all operations. We hence propose a simplified search space that vastly reduces the computational expense of automated augmentation, and permits the removal of a separate proxy task.

Despite the simplifications, our method achieves equal or better performance over previous automated augmentation strategies on CIFAR-10/100, SVHN, ImageNet and COCO datasets. With EfficientNet-B7, we achieve 85.0% accuracy, a 1.0% increase over baseline augmentation and a 0.6% improvement over AutoAugment on the ImageNet dataset. With EfficientNet-B8, we achieve 85.4% accuracy on ImageNet, which matches a previous result that used 3.5B extra images. On object detection, the same method as classification leads to 1.0-1.3% improvement over baseline augmentation. Code will be made available online.

mixup (Zhang et al., 2017)

mixup: Beyond empirical risk minimization, ICLR 2018

Abstract  Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

CutMix (Yun et al., 2019)

Cutmix: Regularization strategy to train strong classifiers with localizable features. ICCV 2019

Abstract  Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved to be effective for guiding the model to attend on less discriminative parts of objects (e.g. leg as opposed to head of a person), thereby letting the network generalize better and have better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels on training images by overlaying a patch of either black pixels or random noise. Such removal is not desirable because it leads to information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pretrained model, results in consistent performance gains in Pascal detection and MS-COCO image captioning benchmarks. We also show that CutMix improves the model robustness against input corruptions and its out-of-distribution detection performances. Source code and pretrained models are available at  https://github.com/clovaai/CutMix-PyTorch.

random erasing (Zhong et al., 2020)

Random erasing data augmentation. AAAI 2020

Abstract  In this paper, we introduce Random Erasing, a new data augmentation method for training the convolutional neural network (CNN). In training, Random Erasing randomly selects a rectangle region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models. Albeit simple, Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and yields consistent improvement over strong baselines in image classification, object detection and person reidentification. Code is available at: https://github. com/zhunzhong07/Random-Erasing.

AdamW (Loshchilov & Hutter, 2018)

Fixing weight decay regularization in adam. 2018.

Abstract  We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam’s generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW (Decoupled Weight Decay Regularization, ICLR 2019).
