[Paper Notes] MetaFormer/PoolFormer: Paper Notes and Hands-on Experience

Paper: MetaFormer Is Actually What You Need for Vision

GitHub: https://github.com/sail-sg/poolformer

AI Studio: No GPU? Try PoolFormer online on a Tesla V100

Transformers have demonstrated great potential in computer vision tasks, and a common view is that the attention-based token mixer is what makes them competitive. However, even after replacing attention with a spatial MLP, such models still perform very well. So is it the overall structure of the Transformer, rather than attention, that makes it effective? The authors replace attention with a simple pooling layer to build the PoolFormer model, which achieves very good results, reaching 82.1% top-1 accuracy on ImageNet-1K. This suggests that the effectiveness comes from the general Transformer structure (MetaFormer), not from attention.

I. Introduction

First, the paper defines MetaFormer (see the architecture figure in the paper): it is the general Transformer structure in which the token mixer is left unspecified, so it can be attention, a spatial MLP, or another module.

To verify that the MetaFormer structure is what matters, the authors propose PoolFormer, which uses an extremely simple pooling layer as the token mixer.

II. Network structure

The network structure of PoolFormer is very simple: just replace the attention module of the Transformer with pooling. (An astonishing choice: no parameters, trivial computation, and yet it works this well.)

The Pooling code is as follows (note the pool(x) - x: the input is subtracted because the MetaFormer block already has a residual connection around the token mixer):

import torch.nn as nn


class Pooling(nn.Module):
    """
    Implementation of pooling for PoolFormer
    --pool_size: pooling size
    """
    def __init__(self, pool_size=3):
        super().__init__()
        # stride=1 and padding=pool_size//2 keep the spatial resolution unchanged;
        # count_include_pad=False excludes the zero padding from the average at borders
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size//2, count_include_pad=False)

    def forward(self, x):
        # subtract the input: the residual connection in the MetaFormer block
        # adds x back, so the mixer itself only has to model the local difference
        return self.pool(x) - x
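To make the structure concrete, here is a minimal sketch of a single PoolFormer block: the attention of a Transformer block is swapped for the Pooling module above, while normalization, residual connections, and the channel MLP stay as in MetaFormer. This is not the authors' exact code; the Mlp and PoolFormerBlock names, the GroupNorm choice, and the mlp_ratio default are my assumptions for illustration (the official repository also adds details such as layer scale that are omitted here).

import torch
import torch.nn as nn


class Mlp(nn.Module):
    """Channel MLP implemented with 1x1 convolutions (operates on B, C, H, W tensors)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class PoolFormerBlock(nn.Module):
    """One MetaFormer block with pooling as the token mixer.

    Structure: x + TokenMixer(Norm(x)), then x + MLP(Norm(x)).
    """
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)        # per-sample normalization over channels
        self.token_mixer = Pooling(pool_size)    # the pooling module defined above
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = Mlp(dim, mlp_ratio)

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))  # token-mixing sub-block
        x = x + self.mlp(self.norm2(x))          # channel-MLP sub-block
        return x


blk = PoolFormerBlock(dim=64)
y = blk(torch.randn(2, 64, 56, 56))   # spatial shape is preserved: (2, 64, 56, 56)

A quick sanity check on the pool(x) - x design: for a spatially constant input the token mixer outputs zero, so it only responds to spatial variation, while the residual connection outside it carries the input through.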

The specific configurations of PoolFormer (five variants: S12, S24, S36, M36, and M48, which differ in embedding dimensions and the number of blocks per stage) are listed in the paper.
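As a quick usage sketch: the official repository registers its models with timm, so (assuming either the repository is on the Python path or a timm version recent enough to bundle PoolFormer) a variant can be created and run on a dummy image roughly like this; the model name 'poolformer_s12' follows the repository's naming and should be checked against the installed version.

import torch
import timm  # assumes a timm version that includes the PoolFormer models

# smallest variant; pretrained weights can be enabled with pretrained=True
model = timm.create_model('poolformer_s12', pretrained=False, num_classes=1000)

x = torch.randn(1, 3, 224, 224)   # dummy ImageNet-sized input
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([1, 1000])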

III. Experimental results

The experimental results of PoolFormer on ImageNet-1K are reported in the paper: the PoolFormer variants are competitive with well-tuned attention-based and MLP-based baselines, with the larger variants exceeding 82% top-1 accuracy.

IV. Summary

PoolFormer demonstrates the effectiveness of the general Transformer (MetaFormer) structure. It is hard to believe that a design as simple as pooling can be this effective; I am genuinely in awe.
