Paper reading: ResMLP: Feedforward networks for image classification with data-efficient training

ResMLP: Feedforward networks for image classification with data-efficient training

[pdf]

Contents

Abstract

1  Introduction

2  Method

The overall ResMLP architecture

The Residual Multi-Perceptron Layer


Abstract

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification.

It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch.

When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity tradeoffs on ImageNet. We also train ResMLP models in a self-supervised setup, to further remove priors from employing a labelled dataset. Finally, by adapting our model to machine translation we achieve surprisingly good results.

We share pre-trained models and our code based on the Timm library.

Research focus: The paper proposes ResMLP, an image classification architecture built entirely upon multi-layer perceptrons.

Method overview: It is a simple residual network that alternates

(i) a linear layer in which image patches interact, independently and identically across channels, and

(ii) a two-layer feed-forward network in which channels interact independently per patch.

Experimental findings: When trained with a modern training strategy using heavy data augmentation and, optionally, distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet.

The paper also trains ResMLP models in a self-supervised setup, to further remove the priors that come from using a labelled dataset.

Finally, by adapting the model to machine translation, surprisingly good results are obtained.

1  Introduction

Recently, the transformer architecture [60], adapted from its original use in natural language processing with only minor changes, has achieved performance competitive with the state of the art on ImageNet-1k [50] when pre-trained with a sufficiently large amount of data [16]. Retrospectively, this achievement is another step towards learning visual features with fewer priors: Convolutional Neural Networks (CNN) had replaced the hand-designed choices from hard-wired features with flexible and trainable architectures. Vision transformers further remove several hard decisions encoded in the convolutional architectures, namely the translation invariance and local connectivity.

Background:

Recently, the transformer architecture (adapted from its original use in natural language processing with only minor changes) achieved performance on ImageNet-1k competitive with the state of the art when pre-trained with a sufficiently large amount of data. In retrospect, this achievement is another step towards learning visual features with fewer priors: Convolutional Neural Networks (CNNs) replaced hand-designed, hard-wired features with flexible and trainable architectures, and vision transformers further remove several hard decisions encoded in convolutional architectures, namely translation invariance and local connectivity.

This evolution toward less hard-coded prior in the architecture has been fueled by better training schemes [16, 56], and, in this paper, we push this trend further by showing that a purely multilayer perceptron (MLP) based architecture, called Residual Multi-Layer Perceptrons (ResMLP), is competitive on image classification.

ResMLP is designed to be simple and to encode little prior about images:

        + it takes image patches as input,

        + projects them with a linear layer, and

        + sequentially updates their representations with two residual operations:

            (i) a cross-patch linear layer applied to all channels independently; and

            (ii) a cross-channel single-layer MLP applied independently to all patches.

        + At the end of the network, the patch representations are average pooled, and

        + fed to a linear classifier.

We outline ResMLP in Figure 1 and detail it further in Section 2.

ResMLP method overview:

This evolution toward less hard-coded prior in the architecture has been driven by better training schemes. The paper pushes this trend further by showing that a purely multi-layer perceptron (MLP) based architecture, called Residual Multi-Layer Perceptrons (ResMLP), is competitive on image classification. ResMLP is designed to be simple and to encode little prior about images:

        + it takes image patches as input,

        + projects them with a linear layer, and

        + then sequentially updates their representations with two residual operations:

               (i) a cross-patch linear layer, applied independently to all channels; and

               (ii) a cross-channel single-layer MLP, applied independently to all patches.

        + At the end of the network, the patch representations are average pooled,

        + and fed to a linear classifier.

Figure 1 outlines ResMLP, and Section 2 describes it in further detail.

The ResMLP architecture is strongly inspired by the vision transformers (ViT) [16], yet it is much simpler in several ways: we replace the self-attention sublayer by a linear layer, resulting in an architecture with only linear layers and GELU non-linearity [25]. We observe that the training of ResMLP is more stable than ViTs when using the same training scheme as in DeiT [56] and CaiT [57], allowing us to remove the need for batch-specific or cross-channel normalizations such as BatchNorm, GroupNorm or LayerNorm. We speculate that this stability comes from replacing self-attention with linear layers. Finally, another advantage of using a linear layer is that we can still visualize the interactions between patch embeddings, revealing filters that are similar to convolutions on the lower layers, and longer range in the last layers.

Advantages of ResMLP over ViT:

The ResMLP architecture is inspired by vision transformers (ViT), but it is much simpler in several ways: the self-attention sublayer is replaced with a linear layer, resulting in an architecture with only linear layers and a GELU non-linearity.

When using the same training scheme as DeiT and CaiT, training ResMLP is more stable than training ViTs, which removes the need for batch-specific or cross-channel normalizations such as BatchNorm, GroupNorm or LayerNorm. The authors speculate that this stability comes from replacing self-attention with linear layers.

Finally, another benefit of using linear layers is that the interactions between patch embeddings can still be visualized, revealing filters that resemble convolutions in the lower layers and have a longer range in the last layers.

We further investigate if our purely MLP based architecture could benefit other domains beyond images, and particularly, ones with more complex output spaces. In particular, we adapt our MLP based architecture to take inputs with variable length, and show its potential on the problem of Machine Translation. To do so, we develop a sequence-to-sequence (seq2seq) version of ResMLP, where both the encoder and decoder are based on ResMLP with cross-attention between the encoder and decoder [2]. This model is similar to the original seq2seq Transformer with ResMLP layers instead of Transformer layers [60]. Despite not being originally designed for this task, we observe that ResMLP is competitive with Transformers on the challenging WMT benchmarks.

Effectiveness further demonstrated in natural language processing:

The paper further investigates whether a purely MLP-based architecture can benefit domains beyond images, particularly ones with more complex output spaces. In particular, the MLP-based architecture is adapted to accept variable-length inputs, and its potential is demonstrated on machine translation. To do so, a sequence-to-sequence (seq2seq) version of ResMLP is developed, in which both the encoder and the decoder are based on ResMLP, with cross-attention between the encoder and decoder. This model is similar to the original seq2seq Transformer, but with ResMLP layers instead of Transformer layers. Although ResMLP was not originally designed for this task, it is competitive with Transformers on the challenging WMT benchmarks.

In summary, in this paper, we make the following observations:

•  despite its simplicity, ResMLP reaches surprisingly good accuracy/complexity trade-offs with ImageNet-1k training only, without requiring normalization based on batch or channel statistics;

•  these models benefit significantly from distillation methods [56]; they are also compatible with modern self-supervised learning methods based on data augmentation, such as DINO [7];

•  A seq2seq ResMLP achieves competitive performance compared to a seq2seq Transformer on the WMT benchmark for Machine Translation.

In summary, the paper makes the following observations:

•  despite its simplicity, ResMLP reaches surprisingly good accuracy/complexity trade-offs with ImageNet-1k training only, without requiring normalization based on batch or channel statistics;

•  these models benefit significantly from distillation methods; they are also compatible with modern self-supervised learning methods based on data augmentation, such as DINO [7];

•  on the WMT benchmark for machine translation, a seq2seq ResMLP achieves competitive performance compared to a seq2seq Transformer.

2  Method

The overall ResMLP architecture

Our model, denoted by ResMLP, takes a grid of N × N non-overlapping patches as input, where the patch size is typically equal to 16 × 16. The patches are then independently passed through a linear layer to form a set of N^2 d-dimensional embeddings.

The resulting set of N^2 embeddings are fed to a sequence of Residual Multi-Layer Perceptron layers to produce a set of N^2 d-dimensional output embeddings. These output embeddings are then averaged ("average-pooling") into a d-dimensional vector that represents the image, which is fed to a linear classifier to predict the label associated with the image. Training uses the cross-entropy loss.

ResMLP takes a grid of N × N non-overlapping patches as input, where the patch size is typically 16 × 16. These patches are independently passed through a linear layer to form a set of N^2 d-dimensional embeddings.

The resulting set of N^2 embeddings is fed to a sequence of Residual Multi-Layer Perceptron layers to produce a set of N^2 d-dimensional output embeddings. These output embeddings are then averaged ("average pooling") into a d-dimensional vector representing the image, which is fed to a linear classifier to predict the label associated with the image. Training uses the cross-entropy loss.
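To make the data flow concrete, here is a minimal PyTorch-style sketch of this pipeline (patch projection, residual MLP layers, average pooling, linear classifier). It is not the authors' Timm-based implementation; the name ResMLPPipeline and the use of nn.Identity placeholders for the residual layers (detailed in the next subsection) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResMLPPipeline(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=384, num_blocks=12, num_classes=1000):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2              # N^2 patches
        # Linear projection of each non-overlapping patch, implemented as a strided conv.
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Placeholders for the residual multi-perceptron layers (see the block sketch below).
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(num_blocks)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images):                                          # images: (B, 3, H, W)
        x = self.patch_proj(images)                                     # (B, dim, N, N)
        x = x.flatten(2).transpose(1, 2)                                # (B, N^2, dim)
        for blk in self.blocks:
            x = blk(x)                                                  # per-layer update
        x = x.mean(dim=1)                                               # average pooling over patches
        return self.classifier(x)                                       # logits, trained with cross-entropy

model = ResMLPPipeline()
logits = model(torch.randn(2, 3, 224, 224))                             # (2, 1000)
```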

The Residual Multi-Perceptron Layer

Our network is a sequence of layers that all have the same structure: a linear sublayer applied across patches followed by a feedforward sublayer applied across channels. Similar to the Transformer layer, each sublayer is paralleled with a skip-connection [23]. The absence of self-attention layers makes the training more stable, allowing us to replace the Layer Normalization [1] by a simpler Affine transformation:

\texttt{Aff}_{\alpha,\beta}(x) = \mathrm{Diag}(\alpha)\, x + \beta \qquad (1)

where α and β are learnable weight vectors. This operation only rescales and shifts the input element-wise. It has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can be absorbed into the adjacent linear layer. Second, as opposed to BatchNorm [30] and Layer Normalization, the \texttt{Aff} operator does not depend on batch statistics. The operator closest to \texttt{Aff} is the LayerScale introduced by Touvron et al. [57], with an additional bias term. For convenience, we denote by \texttt{Aff}(X) the Affine operation applied independently to each column of the matrix X.

The network is a sequence of layers that all have the same structure: a linear sublayer applied across patches, followed by a feed-forward sublayer applied across channels. As in a Transformer layer, each sublayer is paralleled with a skip connection. The absence of self-attention layers makes training more stable, which allows the Layer Normalization to be replaced by the simpler Affine transformation shown in Eq. (1), where α and β are learnable weight vectors. This operation only rescales and shifts the input element-wise.

Compared with other normalization operations, this operation has several advantages:

First, as opposed to Layer Normalization, it has no cost at inference time, since it can be absorbed into the adjacent linear layer.

Second, as opposed to BatchNorm and Layer Normalization, the \texttt{Aff} operator does not depend on batch statistics.

The operator closest to \texttt{Aff} is LayerScale, introduced by Touvron et al., with an additional bias term.

For convenience, \texttt{Aff}(X) denotes the Affine operation applied independently to each column of the matrix X.

We apply the \texttt{Aff} operator at the beginning (“pre-normalization”) and end (“post-normalization”) of each residual block. As a pre-normalization, \texttt{Aff} replaces LayerNorm without using channel-wise statistics. Here, we initialize α = 1, and β = 0. As a post-normalization, \texttt{Aff} is similar to LayerScale and we initialize α with the same small value as in [57].

The \texttt{Aff} operator is applied at the beginning ("pre-normalization") and at the end ("post-normalization") of each residual block. As a pre-normalization, \texttt{Aff} replaces LayerNorm without using channel-wise statistics; here α is initialized to 1 and β to 0. As a post-normalization, \texttt{Aff} is similar to LayerScale, and α is initialized with the same small value as in [57].
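As a concrete illustration, below is a minimal PyTorch-style sketch of such an Aff operator with the two initializations described above (α = 1 for pre-normalization, a small LayerScale-like value for post-normalization). The class name and the exact value 1e-4 are assumptions for illustration, not taken from the official code.

```python
import torch
import torch.nn as nn

class Aff(nn.Module):
    """Element-wise rescale-and-shift: Aff(x) = alpha * x + beta, as in Eq. (1)."""
    def __init__(self, dim, init_alpha=1.0):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))                # learnable shift

    def forward(self, x):          # x: (..., dim); no batch or channel statistics involved
        return self.alpha * x + self.beta

# Pre-normalization: alpha = 1, beta = 0.
pre_norm = Aff(dim=384)
# Post-normalization: alpha starts at a small value, as in LayerScale (1e-4 is illustrative).
post_norm = Aff(dim=384, init_alpha=1e-4)
```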

Overall, our Multi-layer perceptron takes a set of N^2 d-dimensional input features stacked in a d × N^2 matrix X, and outputs a set of N^2 d-dimensional output features, stacked in a matrix Y, with the following set of transformations:

Z = X + \texttt{Aff}\left( \left( A\, \texttt{Aff}(X)^{\top} \right)^{\top} \right) \qquad (2)

Y = Z + \texttt{Aff}\left( C\, \mathrm{GELU}\left( B\, \texttt{Aff}(Z) \right) \right) \qquad (3)

where A, B and C are the main learnable weight matrices of the layer. Note that Eq (3) is the same as the feedforward sublayer of a Transformer with the ReLU non-linearity replaced by a GELU function [25]. The dimensions of the parameter matrix A are N^2 × N^2, i.e., this "cross-patch" sublayer exchanges information between patches, while the "cross-channel" feedforward sublayer works per location. Similar to a Transformer, the intermediate activation matrix Z has the same dimensions as the input and output matrices, X and Y. Finally, the weight matrices B and C have the same dimensions as in a Transformer layer, which are 4d×d and d×4d, respectively.

Overall, the multi-layer perceptron takes a set of N^2 d-dimensional input features stacked in a d × N^2 matrix X, and outputs a set of N^2 d-dimensional output features stacked in a matrix Y, via the transformations in Eq. (2) and Eq. (3), where A, B and C are the main learnable weight matrices of the layer.

Note that Eq. (3) is the same as the feedforward sublayer of a Transformer, except that the ReLU non-linearity is replaced by a GELU function. The parameter matrix A has dimensions N^2 × N^2, i.e., this "cross-patch" sublayer exchanges information between patches, while the "cross-channel" feedforward sublayer works per location. As in a Transformer, the intermediate activation matrix Z has the same dimensions as the input and output matrices X and Y. Finally, the weight matrices B and C have the same dimensions as in a Transformer layer, namely 4d × d and d × 4d respectively.
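Putting Eq. (2) and Eq. (3) together, here is a minimal PyTorch-style sketch of one residual multi-perceptron layer, reusing the Aff module sketched above. It operates on tensors of shape (batch, N^2, d) rather than on d × N^2 matrices, and the name ResMLPBlock and the LayerScale initialization value are illustrative assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class Aff(nn.Module):
    """Element-wise affine rescale-and-shift (as sketched above)."""
    def __init__(self, dim, init_alpha=1.0):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    """One residual multi-perceptron layer: cross-patch linear sublayer (matrix A),
    then cross-channel feed-forward sublayer (matrices B, C), following Eq. (2)-(3)."""
    def __init__(self, dim, num_patches, layerscale_init=1e-4):
        super().__init__()
        self.pre_aff1 = Aff(dim)
        self.cross_patch = nn.Linear(num_patches, num_patches)   # matrix A: N^2 x N^2
        self.post_aff1 = Aff(dim, init_alpha=layerscale_init)
        self.pre_aff2 = Aff(dim)
        self.linear_B = nn.Linear(dim, 4 * dim)                  # matrix B: 4d x d
        self.gelu = nn.GELU()
        self.linear_C = nn.Linear(4 * dim, dim)                  # matrix C: d x 4d
        self.post_aff2 = Aff(dim, init_alpha=layerscale_init)

    def forward(self, x):                                        # x: (batch, N^2, d)
        # Eq. (2): cross-patch sublayer, mixing patches independently per channel.
        t = self.pre_aff1(x).transpose(1, 2)                     # (batch, d, N^2)
        z = x + self.post_aff1(self.cross_patch(t).transpose(1, 2))
        # Eq. (3): cross-channel feed-forward sublayer, applied per patch.
        y = z + self.post_aff2(self.linear_C(self.gelu(self.linear_B(self.pre_aff2(z)))))
        return y

block = ResMLPBlock(dim=384, num_patches=14 * 14)                # 224 / 16 = 14 patches per side
out = block(torch.randn(2, 14 * 14, 384))                        # shape preserved: (2, 196, 384)
```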

Differences with the Vision Transformer architecture

Our architecture is closely related to the ViT model [16]. However, ResMLP departs from ViT with several simplifications:

no self-attention blocks: it is replaced by a linear layer with no non-linearity,

no positional embedding: the linear layer implicitly encodes information about patch positions,

no extra “class” token: we simply use average pooling on the patch embedding,

no normalization based on batch statistics: we use a learnable affine operator.

Differences from the Vision Transformer architecture:

The ResMLP architecture is closely related to the ViT model. However, ResMLP departs from ViT with several simplifications:

•  no self-attention blocks: they are replaced by a linear layer with no non-linearity,

•  no positional embedding: the linear layer implicitly encodes information about patch positions,

•  no extra "class" token: average pooling over the patch embeddings is used instead,

•  no normalization based on batch statistics: a learnable affine operator is used instead.

Class-MLP as an alternative to average-pooling

We propose an adaptation of the class-attention token introduced in CaiT [57]. In CaiT, this consists of two layers that have the same structure as the transformer, but in which only the class token is updated based on the frozen patch embeddings. We translate this method to our architecture, except that, after aggregating the patches with a linear layer, we replace the attention-based interaction between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. We refer to this pooling variant as “class-MLP”, since the purpose of these few layers is to replace average pooling.

Class-MLP as an alternative to average pooling:

The paper adapts the class-attention token introduced in CaiT. In CaiT, this consists of two layers with the same structure as the transformer layer, but in which only the class token is updated based on the frozen patch embeddings.

This method is carried over to the ResMLP architecture, except that, after aggregating the patches with a linear layer, the attention-based interaction between the class and patch embeddings is replaced by simple linear layers, while the patch embeddings are still kept frozen. This improves performance, at the expense of some additional parameters and computational cost. This pooling variant is referred to as "class-MLP", since the purpose of these few layers is to replace average pooling.
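The paper does not spell out the exact layer layout here, so the following PyTorch-style sketch is only one plausible reading of class-MLP: a learnable class embedding that interacts with the (frozen) patch embeddings through linear layers only, followed by a small per-channel MLP. The name ClassMLPPooling and the specific layer shapes are assumptions, not the official design.

```python
import torch
import torch.nn as nn

class ClassMLPPooling(nn.Module):
    def __init__(self, dim, num_patches):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Linear aggregation over the (class + patch) token axis, producing the updated class token.
        self.aggregate = nn.Linear(num_patches + 1, 1)
        # A small per-channel MLP refining the class representation, with a skip connection.
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patches):                        # patches: (B, N^2, d), kept frozen here
        cls = self.cls_token.expand(patches.size(0), -1, -1)
        tokens = torch.cat([cls, patches], dim=1)      # (B, N^2 + 1, d)
        cls = self.aggregate(tokens.transpose(1, 2)).transpose(1, 2)   # (B, 1, d)
        cls = cls + self.mlp(cls)
        return cls.squeeze(1)                          # (B, d) image representation
```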

Sequence-to-sequence ResMLP

Similar to the Transformer, the ResMLP architecture can be applied to sequence-to-sequence tasks. First, we follow the general encoder-decoder architecture from Vaswani et al. [60], where we replace the self-attention sublayers by the residual multi-perceptron layer. In the decoder, we keep the cross-attention sublayers, which attend to the output of the encoder. Also in the decoder, we adapt the linear sublayers to the task of language modeling by constraining the matrix A to be triangular, in order to prevent a given token representation from accessing tokens from the future. Finally, the main technical difficulty of using linear sublayers in a sequence-to-sequence model is dealing with variable sequence lengths. However, we observe that simply padding with zeros and extracting the submatrix A corresponding to the longest sequence in a batch works well in practice.

Variable sequence lengths in sequence-to-sequence tasks:

Similar to the Transformer, the ResMLP architecture can be applied to sequence-to-sequence tasks. The general encoder-decoder architecture of Vaswani et al. is followed, with the self-attention sublayers replaced by residual multi-perceptron layers. In the decoder, the cross-attention sublayers, which attend to the output of the encoder, are kept. Also in the decoder, the linear sublayers are adapted to language modeling by constraining the matrix A to be triangular, which prevents a given token representation from accessing tokens from the future. Finally, the main technical difficulty of using linear sublayers in a sequence-to-sequence model is handling variable sequence lengths; in practice, simply padding with zeros and extracting the submatrix of A corresponding to the longest sequence in the batch works well.
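As an illustration of the triangular constraint and the padding trick, here is a minimal PyTorch-style sketch of a decoder-side cross-token linear sublayer whose weight matrix A is masked to be lower-triangular, so a token at position t only mixes positions ≤ t. The name CausalCrossTokenLinear and the max_len parameter are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class CausalCrossTokenLinear(nn.Module):
    def __init__(self, max_len):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(max_len, max_len))   # full matrix A
        nn.init.xavier_uniform_(self.weight)
        # Lower-triangular mask: position t may only combine positions <= t.
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):                  # x: (B, T, d), zero-padded, with T <= max_len
        T = x.size(1)
        A = (self.weight * self.mask)[:T, :T]    # submatrix for the longest sequence in the batch
        return torch.einsum("ts,bsd->btd", A, x)

layer = CausalCrossTokenLinear(max_len=256)
out = layer(torch.randn(4, 37, 512))       # works for any padded length T <= max_len
```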

For further discussion, see the experiments section of the original paper, including comparisons of parameter counts and FLOPs, visualizations of the ResMLP linear layers, and ablations on normalization, pooling, patch size, and more.


Reposted from blog.csdn.net/u014546828/article/details/120730429