MetaFormer/PoolFormer study notes and code

MetaFormer Is Actually What You Need for Vision
Code: https://github.com/sail-sg/poolformer

Abstract

Transformers have shown great potential in computer vision tasks. It is commonly believed that their attention-based token mixer module contributes most to their competence. However, recent work has shown that the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformer, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing the well-tuned vision Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 50%/62% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and prompts us to propose the concept of "MetaFormer", a general architecture abstracted from Transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key to achieving the superior results of recent Transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. In addition, our proposed PoolFormer can serve as a starting baseline for future MetaFormer architecture design.

1. Introduction

Transformers have gained great interest and success in the field of computer vision [3, 8, 44, 55]. Since the pioneering work of the Vision Transformer (ViT) [17], which adapted pure Transformers to image classification tasks, many follow-up models have been developed to further improve them and achieve promising performance in various computer vision tasks [36, 53, 63].

As shown in Figure 1(a), the Transformer encoder consists of two components. One is the attention module that mixes information among tokens, which we call the token mixer. The other component contains the remaining modules, such as the channel MLP and the residual connections. By regarding the attention module as a specific token mixer, we further abstract the entire Transformer into a general architecture, MetaFormer, in which the token mixer is not specified, as shown in Figure 1(a).

[Figure 1: (a) MetaFormer as the general architecture abstracted from Transformers; (b) performance comparison of PoolFormer with Transformer/MLP-like baselines on ImageNet-1K.]

The success of Transformers has long been attributed to the attention-based token mixer [56]. Based on this common belief, many variants of the attention module [13, 22, 57, 68] have been developed to improve vision Transformers. However, a recent work [51] replaced the attention module completely with a spatial MLP as the token mixer and found that the derived MLP-like model still achieves competitive performance on image classification benchmarks. Follow-up works [26, 35, 52] further improved MLP-like models through data-efficient training and specific MLP module designs, gradually narrowing the performance gap with ViT and challenging the dominance of attention as the token mixer.

Some recent approaches [32, 39, 40, 45] have explored other types of token mixers within the MetaFormer architecture and demonstrated encouraging performance. For example, [32] replaces attention with the Fourier transform and still achieves about 97% of the accuracy of vanilla Transformers. Taking all these results together, it seems that as long as a model adopts MetaFormer as its general architecture, promising results can be obtained. We therefore hypothesize that MetaFormer is more essential than specific token mixers for a model to achieve competitive performance.

To verify this hypothesis, we apply an extremely simple non-parametric operator, pooling, as the token mixer to conduct only basic token mixing. Surprisingly, the derived model, termed PoolFormer, achieves competitive performance and even consistently outperforms well-tuned Transformer and MLP-like models, including DeiT [53] and ResMLP [52], as shown in Figure 1(b). More specifically, PoolFormer-M36 achieves 82.1% top-1 accuracy on the ImageNet-1K classification benchmark, surpassing the well-tuned vision Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1%, with 35%/52% fewer parameters and 50%/62% fewer MACs. These results show that MetaFormer can still deliver promising performance even with an extremely simple token mixer. We therefore argue that MetaFormer is what we actually need for vision models: it matters more for achieving competitive performance than any specific token mixer. Note that this does not mean token mixers are irrelevant; MetaFormer still keeps this component as an abstraction. It means that the token mixer is not restricted to a specific type, such as attention.

The contribution of this paper is two-fold. First, we abstract Transformers into a general architecture, MetaFormer, and empirically show that the success of Transformer/MLP-like models is largely attributed to the MetaFormer architecture. Specifically, by using only a simple non-parametric operator, pooling, as an extremely weak token mixer for MetaFormer, we build a simple model named PoolFormer and find that it still achieves highly competitive performance. We hope our findings inspire more future research devoted to improving MetaFormer instead of focusing on the token mixer modules. Second, we evaluate the proposed PoolFormer on multiple vision tasks, including image classification [14], object detection [34], instance segmentation [34], and semantic segmentation [67], and find that it achieves competitive performance compared with SOTA models that use complicated token mixer designs. PoolFormer can readily serve as a good starting baseline for future MetaFormer architecture design.

2. Related work

The Transformer was first proposed by [56] for translation tasks and then quickly became popular in various natural language processing tasks. In language pre-training, Transformers are trained on large-scale unlabeled text corpora and achieve remarkable performance [2, 15]. Inspired by the success of Transformers in natural language processing, many researchers have applied attention mechanisms and Transformers to vision tasks [3, 8, 44, 55]. Notably, Chen et al. introduced iGPT [6], in which a Transformer is trained to autoregressively predict image pixels for self-supervised learning. Dosovitskiy et al. proposed the Vision Transformer (ViT) [17], which takes hard-split patch embeddings as input. They showed that ViT pre-trained on a large proprietary dataset (the JFT dataset with 300 million images) can achieve excellent performance on supervised image classification tasks. DeiT [53] and T2T-ViT [63] further demonstrated that ViT can perform well when pre-trained only on ImageNet-1K (~1.3 million images) from scratch. A lot of work has been devoted to arguing which token mixer is better [7, 26]. However, the goal of this work is neither to join this debate nor to design new complicated token mixers to reach a new state of the art. Instead, we examine a fundamental question: what is really responsible for the success of Transformers and their variants? Our answer is the general architecture, MetaFormer. We simply use pooling as a basic token mixer to probe the power of MetaFormer.

Meanwhile, several works help answer the same question. Dong et al. demonstrated that, without residual connections or MLPs, the output converges doubly exponentially to a rank-one matrix [16]. Raghu et al. [43] compared the feature differences between ViT and CNNs and found that self-attention can gather global information early, while residual connections propagate features from lower layers to higher layers. Park et al. [42] showed that multi-head self-attention improves accuracy and generalization by flattening the loss landscape. Unfortunately, these works did not abstract the Transformer into a general architecture and study it from the perspective of a general framework.

3. Method

3.1. MetaFormer

We first present the core concept of this work, "MetaFormer". As shown in Figure 1, MetaFormer is a general architecture abstracted from Transformers [56], where the token mixer is not specified while the other components are kept the same as in Transformers. The input $I$ is first processed by an input embedding, such as ViT's patch embedding [17]:

$$X = \mathrm{InputEmb}(I) \tag{1}$$

where $X \in \mathbb{R}^{N \times C}$ denotes the embedded tokens with sequence length $N$ and embedding dimension $C$.
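
As a rough illustration of Eq. (1) (a minimal sketch, not the paper's official code), a ViT-style patch embedding can be implemented as a strided convolution that projects each non-overlapping patch into a token; the patch size of 16 and embedding dimension of 384 below are assumed example values:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: split the image into non-overlapping
    patches and project each patch into a C-dimensional token."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, img):                    # img: (B, 3, H, W)
        x = self.proj(img)                     # (B, C, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, C) with N = HW / p^2

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 384])
```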

The embedded tokens are then fed into repeated MetaFormer blocks, each of which includes two residual sub-blocks. Specifically, the first sub-block mainly contains a token mixer that communicates information among tokens, which can be expressed as

$$Y = \mathrm{TokenMixer}(\mathrm{Norm}(X)) + X \tag{2}$$

where $\mathrm{Norm}(\cdot)$ denotes a normalization such as layer normalization [1] or batch normalization [28], and $\mathrm{TokenMixer}(\cdot)$ denotes a module whose main purpose is to mix token information. It is implemented by various attention mechanisms in recent vision Transformer models [17, 63, 68] or by spatial MLPs in MLP-like models [51, 52]. Note that the main function of the token mixer is to propagate token information, although some token mixers, such as attention, can also mix channels.

The second sub-block mainly consists of a two-layer MLP with a nonlinear activation:

$$Z = \sigma(\mathrm{Norm}(Y)\, W_1)\, W_2 + Y \tag{3}$$

where $W_1 \in \mathbb{R}^{C \times rC}$ and $W_2 \in \mathbb{R}^{rC \times C}$ are learnable parameters with MLP expansion ratio $r$, and $\sigma(\cdot)$ is a nonlinear activation function, such as GELU [25] or ReLU [41].
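
The two sub-blocks in Eqs. (2) and (3) map directly to code. Below is a minimal PyTorch sketch of a generic MetaFormer block, assuming LayerNorm for Norm(·) and GELU for σ(·) (both are just common choices, not mandated by the architecture); the token mixer is passed in as an arbitrary module:

```python
import torch.nn as nn

class Mlp(nn.Module):
    """Two-layer channel MLP of Eq. (3), without the residual connection."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * mlp_ratio)   # W1: C -> rC
        self.act = nn.GELU()                         # sigma(.)
        self.fc2 = nn.Linear(dim * mlp_ratio, dim)   # W2: rC -> C

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: token-mixing sub-block (Eq. 2) followed by
    the channel-MLP sub-block (Eq. 3), each with a residual connection."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer               # attention, spatial MLP, pooling, ...
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = Mlp(dim, mlp_ratio)

    def forward(self, x):                            # x: (B, N, C)
        x = x + self.token_mixer(self.norm1(x))      # Y = TokenMixer(Norm(X)) + X
        x = x + self.mlp(self.norm2(x))              # Z = sigma(Norm(Y) W1) W2 + Y
        return x
```

Plugging in attention, a spatial MLP, or the pooling operator from Section 3.2 as `token_mixer` instantiates a Transformer, MLP-like, or PoolFormer block, respectively.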

Instantiations of MetaFormer

MetaFormer describes a general architecture from which different models can be obtained immediately by specifying a concrete design for the token mixer. As shown in Figure 1(a), if the token mixer is specified as attention or a spatial MLP, MetaFormer becomes a Transformer or an MLP-like model, respectively.

3.2. PoolFormer

Since the introduction of Transformers [56], many works have attached great importance to attention and focused on designing various attention-based token mixer components. In contrast, these works pay little attention to the general architecture, i.e., MetaFormer.

In this work, we argue that it is this MetaFormer general architecture that mainly contributes to the recent success of Transformer and MLP-like models. To demonstrate this, we deliberately use an embarrassingly simple operator, pooling, as the token mixer. This operator has no learnable parameters; it simply makes each token aggregate the features of its nearby tokens by averaging.

Since this work targets vision tasks, we assume the input is in channel-first format, i.e., $T \in \mathbb{R}^{C \times H \times W}$. The pooling operator can then be expressed as

$$T'_{:, i, j} = \frac{1}{K \times K} \sum_{p=1}^{K} \sum_{q=1}^{K} T_{:,\, i+p-\frac{K+1}{2},\; j+q-\frac{K+1}{2}} \;-\; T_{:, i, j} \tag{4}$$

where $K$ is the pooling size. Since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4). PyTorch-like code for the pooling operator is shown in Algorithm 1.

[Algorithm 1: PyTorch-like code for the pooling token mixer.]
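
The following is a sketch consistent with Equation (4) and the description of Algorithm 1 (stride-1 average pooling followed by subtracting the input); the default pool size of 3 is an assumed typical value:

```python
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer: average each token's K x K neighborhood and
    subtract the token itself (the residual is added in the block)."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W), channel-first format
        return self.pool(x) - x    # Eq. (4): neighborhood mean minus input
```

Because `stride=1` with symmetric padding keeps the spatial resolution, the output has the same shape as the input and can be dropped into the token-mixing sub-block directly.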

It is well known that the computational complexity of self-attention and spatial MLPs is quadratic in the number of tokens to be mixed. Even worse, spatial MLPs bring many more parameters when handling longer sequences. As a result, self-attention and spatial MLPs can usually only process hundreds of tokens. In contrast, pooling has computational complexity linear in the sequence length and no learnable parameters. We therefore take advantage of pooling by adopting a hierarchical structure, similar to traditional CNNs [24, 31, 49] and recent hierarchical Transformer variants [36, 57]. Figure 2 shows the overall framework of PoolFormer. Specifically, PoolFormer has four stages with $H/4 \times W/4$, $H/8 \times W/8$, $H/16 \times W/16$, and $H/32 \times W/32$ tokens respectively, where $H$ and $W$ denote the height and width of the input image. There are two groups of embedding sizes: 1) small-sized models with embedding dimensions of 64, 128, 320, and 512 for the four stages; 2) medium-sized models with embedding dimensions of 96, 192, 384, and 768. Assuming there are $L$ PoolFormer blocks in total, stages 1, 2, 3, and 4 contain $L/6$, $L/6$, $L/2$, and $L/6$ blocks, respectively. The MLP expansion ratio is set to 4. Following these simple scaling rules, we obtain five PoolFormer model sizes, whose hyperparameters are shown in Table 1.

[Figure 2: The overall framework of PoolFormer.]

[Table 1: Configurations of the different PoolFormer models.]
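
To make the scaling rule above concrete, here is a small hypothetical helper (the function name and the example depth L = 24 are illustrative, not taken from the paper) that reproduces the per-stage block counts and embedding dimensions described in the text:

```python
def poolformer_config(total_blocks, size="small"):
    """Return (blocks per stage, embedding dims) for a total depth L."""
    assert total_blocks % 6 == 0, "L must be divisible by 6"
    unit = total_blocks // 6
    blocks = [unit, unit, 3 * unit, unit]              # L/6, L/6, L/2, L/6
    dims = {"small":  [64, 128, 320, 512],
            "medium": [96, 192, 384, 768]}[size]
    return blocks, dims

# Example: a small-sized model with L = 24 blocks in total.
blocks, dims = poolformer_config(24, size="small")
print(blocks, dims)   # [4, 4, 12, 4] [64, 128, 320, 512]
```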

4. Experiments

[Experimental result tables and figures (image classification, object detection, instance segmentation, and semantic segmentation) omitted; see the paper.]

5. Conclusion and future work

In this work, we abstracted the attention in Transformers as a token mixer, and the Transformer overall as a general architecture termed MetaFormer, in which the token mixer is not specified. Instead of focusing on a specific token mixer, we point out that MetaFormer is actually what we need to guarantee reasonable performance. To verify this, we deliberately specified the token mixer as extremely simple pooling for MetaFormer. We found that the derived PoolFormer model achieves competitive performance on different vision tasks, which well supports the claim that "MetaFormer is actually what you need for vision".

In the future, we will further evaluate PoolFormer under more different learning settings, such as self-supervised learning and transfer learning. It would also be interesting to see whether PoolFormer still works on natural language processing tasks, to further support the claim that "MetaFormer is actually what you need" in the NLP domain. We hope this work inspires more future research devoted to improving the fundamental architecture, MetaFormer, rather than paying too much attention to the token mixer modules.

Appendix


Improved Layer Normalization

[Details of the improved layer normalization omitted.]

PoolFormer block

[PoolFormer block implementation details omitted; see the official code.]
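
As a rough sketch of how the pieces from Section 3 fit together in channel-first format (reusing the `Pooling` module sketched above; GroupNorm with a single group and 1×1 convolutions for the channel MLP are assumed convenient choices for (B, C, H, W) tensors, not necessarily the official implementation):

```python
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    """PoolFormer block on channel-first feature maps (B, C, H, W):
    pooling token mixer plus a 1x1-convolution channel MLP."""
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)              # Norm(.) for (B, C, H, W)
        self.token_mixer = Pooling(pool_size)          # sketched in Section 3.2
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),        # W1 as a 1x1 conv
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),        # W2 as a 1x1 conv
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))        # Eq. (2)
        x = x + self.mlp(self.norm2(x))                # Eq. (3)
        return x

out = PoolFormerBlock(dim=64)(torch.randn(1, 64, 56, 56))
print(out.shape)   # torch.Size([1, 64, 56, 56])
```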


Source: blog.csdn.net/charles_zhang_/article/details/126135353