# RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

## Understanding the Paper from the Abstract

For the past ten years, CNNs have reigned supreme in computer vision, but recently Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice.

In this context, there has been much research on architectures that use neither convolution nor self-attention. In particular, MLP-Mixer is a simple architecture built from MLPs that achieved accuracy comparable to the Vision Transformer.

However, the only inductive bias in this architecture is the embedding of tokens.

Thus, there is still room to build a non-convolutional inductive bias into the architecture itself, and we built one in using two simple ideas:

1. One is to divide the token-mixing block into a vertical mixing step and a horizontal mixing step (a rough parameter-count sketch of this split follows the list).
2. The other is to make spatial correlations denser among some of the channels during token mixing.
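
As a rough, back-of-the-envelope illustration, the sketch below compares the weight counts of the token-mixing MLPs under the two designs (biases ignored, and assuming, for simplicity, the same expansion factor for both models); the grid size, expansion factor, and raft size are assumed values, not taken from the paper.

```python
# Rough, illustrative weight counts for the token-mixing MLPs only
# (biases ignored; h, w, e, r below are assumptions, not the paper's settings).
h = w = 14   # tokens per side
e = 4        # hidden-layer expansion factor
r = 2        # raft size (number of channel groups mixed together with a spatial axis)

# MLP-Mixer: a single two-layer MLP over all h*w tokens
mixer_params = 2 * e * (h * w) ** 2

# RaftMLP: a vertical MLP over r*h entries plus a horizontal MLP over r*w entries
raft_params = 2 * e * ((r * h) ** 2 + (r * w) ** 2)

print(mixer_params)  # 307328
print(raft_params)   # 12544
```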

Here, once again, we see the idea of splitting the computation along the vertical and horizontal directions. Similar ideas have already appeared in many methods, for example:

- Convolution-based methods
  - Dimension splitting
- Axial-attention Transformer methods
  - Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
  - CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
- MLP-based methods
  - Hire-MLP: Vision MLP via Hierarchical Rearrangement

With this approach, we were able to improve the accuracy of MLP-Mixer while reducing its number of parameters and computational complexity.

Compared to other MLP-based models, the proposed model, named RaftMLP, strikes a good balance among computational complexity, number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive biases. A PyTorch implementation is available at github.com/okojoalg/ra….

## Main Content

The index-form transformation in the Vertical-Mixing Block: `(rh*rw*sr, h, w) -> (sr, rh*h, rw*w) <=> (rw*sr*w, rh*h)`. The last two forms are equivalent because the remaining channel groups and the horizontal direction are only shared, batch-like dimensions that are not mixed; the figure in the paper draws the form on the left of the equivalence sign. The Horizontal-Mixing Block is analogous.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


class RaftTokenMixingBlock(nn.Module):
    # b: size of mini-batch, h: height, w: width,
    # c: channels, r: size of raft (number of groups), o: c // r,
    # e: expansion factor,
    # x: input tensor of shape (b, h * w, c)
    def __init__(self, h, w, c, r, e):
        super().__init__()
        assert c % r == 0
        self.h, self.w, self.r, self.o = h, w, r, c // r
        self.lnv = nn.LayerNorm(c)
        self.lnh = nn.LayerNorm(c)
        # vertical token-mixing MLP, applied along the (r * h) axis
        self.fcv1 = nn.Linear(r * h, r * h * e)
        self.fcv2 = nn.Linear(r * h * e, r * h)
        # horizontal token-mixing MLP, applied along the (r * w) axis
        self.fch1 = nn.Linear(r * w, r * w * e)
        self.fch2 = nn.Linear(r * w * e, r * w)

    def forward(self, x):
        """
        x: (b, h * w, c)
        """
        # Vertical-Mixing Block: group r channel slices with the height axis and mix them
        y = self.lnv(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o w) (r h)',
                      h=self.h, w=self.w, r=self.r, o=self.o)
        y = self.fcv1(y)
        y = F.gelu(y)
        y = self.fcv2(y)
        y = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)',
                      h=self.h, w=self.w, r=self.r, o=self.o)
        x = x + y  # residual connection

        # Horizontal-Mixing Block: group r channel slices with the width axis and mix them
        y = self.lnh(x)
        y = rearrange(y, 'b (h w) (r o) -> b (o h) (r w)',
                      h=self.h, w=self.w, r=self.r, o=self.o)
        y = self.fch1(y)
        y = F.gelu(y)
        y = self.fch2(y)
        y = rearrange(y, 'b (o h) (r w) -> b (h w) (r o)',
                      h=self.h, w=self.w, r=self.r, o=self.o)
        return x + y  # residual connection
```
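
As a quick sanity check, here is a minimal usage sketch; the grid size, channel count, raft size, and expansion factor below are arbitrary illustrative values, not settings from the paper.

```python
# Illustrative values only (not the paper's configuration).
block = RaftTokenMixingBlock(h=14, w=14, c=384, r=2, e=2)
x = torch.randn(8, 14 * 14, 384)  # (b, h * w, c)
y = block(x)
print(y.shape)  # torch.Size([8, 196, 384]) -- token mixing preserves the shape
```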

## Experimental Results

Although RaftMLP-36 has almost the same number of parameters and FLOPs as ResMLP-36, it is not more accurate than ResMLP-36. However, since RaftMLP and ResMLP differ in architectural details beyond the raft-token-mixing block, the effect of the raft-token-mixing block cannot be compared directly, unlike in the comparison with MLP-Mixer. Nevertheless, we can see that raft-token-mixing still works even at a greater depth than RaftMLP-12. (Regarding this last comparison with the 36-layer model, I am not sure what point the authors are making. Is the implication that with more layers, raft-token-mixing might stop being effective?)

## Extensions and Further Thoughts

- The token-mixing block could be extended to the 3D case to replace 3D convolutions, which would make it applicable to video (a rough, hypothetical sketch follows this list).
- This paper only introduces vertical and horizontal spatial inductive biases, plus a constraint that correlates some of the channels. However, the authors also mention that other inductive biases could be explored, such as parallel invariance (which I do not fully understand) and hierarchy.
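
To make the first bullet concrete, here is a purely hypothetical sketch of a temporal-mixing step for video in the spirit of the raft-token-mixing block; `TemporalMixingBlock` and all axis sizes are invented for illustration and do not appear in the paper.

```python
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange


# Hypothetical extension (not part of RaftMLP): mix tokens along the temporal
# axis, grouping r channel slices with it, exactly as the vertical/horizontal
# blocks group channels with the height/width axes.
class TemporalMixingBlock(nn.Module):
    def __init__(self, t, h, w, c, r, e):
        super().__init__()
        self.t, self.h, self.w, self.r, self.o = t, h, w, r, c // r
        self.ln = nn.LayerNorm(c)
        self.fc1 = nn.Linear(r * t, r * t * e)
        self.fc2 = nn.Linear(r * t * e, r * t)

    def forward(self, x):
        # x: (b, t * h * w, c); h, w and the remaining channels act as batch-like axes
        y = self.ln(x)
        y = rearrange(y, 'b (t h w) (r o) -> b (o h w) (r t)',
                      t=self.t, h=self.h, w=self.w, r=self.r, o=self.o)
        y = self.fc2(F.gelu(self.fc1(y)))
        y = rearrange(y, 'b (o h w) (r t) -> b (t h w) (r o)',
                      t=self.t, h=self.h, w=self.w, r=self.r, o=self.o)
        return x + y  # residual connection
```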