ECCV 2022 | MorphMLP: An MLP-like Backbone Network for Video Spatiotemporal Modeling


Reprinted from: Heart of the Machine

Meitu Imaging Research Institute (MT Lab) and the National University of Singapore propose an efficient MLP (multilayer perceptron) video backbone network to tackle the challenging problem of video spatiotemporal modeling. The method processes video data with nothing more than simple fully connected layers, improving efficiency while effectively learning fine-grained features in videos, thereby raising the accuracy of the video backbone. Adapting the network to image-domain tasks (classification and segmentation) also yields competitive results.


  • Paper link: https://arxiv.org/abs/2111.12527

  • GitHub link: https://github.com/MTLab/MorphMLP

Introduction

Thanks to the pioneering work on Vision Transformer (ViT) [1], attention-based architectures have demonstrated strong capabilities across a variety of computer vision tasks, achieving good results from the image domain to the video domain. However, recent studies suggest that self-attention may not be essential, since it can be replaced by a simple multilayer perceptron (MLP), and many MLP-like architectures have been developed as alternatives to attention on image-domain tasks, achieving promising results. Their application in the video domain, however, remains blank, which raises a new question: is it possible to design a general MLP architecture for video?

Meitu Imaging Research Institute (MT Lab) and the Show Lab at the National University of Singapore propose an MLP video backbone network for efficient video spatiotemporal modeling in video classification. Spatially, the model introduces the MorphFC layer, which attends to local details in early layers and gradually shifts to modeling long-range information as the network deepens, overcoming the limitation that current CNN and MLP models can model only locally or only globally. Temporally, the model introduces a temporal pathway to capture long-term information in the video: the features of all frames at the same spatial location are concatenated into a block, and each block is then processed by a fully connected layer to produce a new block.

Building on this spatial and temporal modeling, the researchers extensively explored ways of assembling a video backbone, finally modeling spatial and temporal information in serial order within an efficient spatiotemporal representation learning framework. The network is the first to achieve efficient video spatiotemporal modeling using only fully connected layers, without convolution or self-attention. Compared with previous video CNN and Transformer architectures, it improves accuracy while reducing computational cost. Adapting the network to the image domain (classification and segmentation) also yields competitive results. The paper has been accepted at ECCV 2022.

Background

Since MLP models had not yet been applied in the video domain, the researchers first analyzed the challenges of using MLPs in a spatiotemporal representation learning framework.

From the spatial perspective, current MLP models lack a deep understanding of semantic details, mainly because they apply MLPs globally over all spatial tokens and ignore the hierarchical learning of visual representations (see Figure 1 below). From the temporal perspective, learning long-term dependencies across frames is currently done with video Transformers, but at enormous computational cost. How to replace self-attention for long-range aggregation with fully connected layers is therefore crucial for saving computation.


Figure 1: Feature visualization

To address these challenges, the researchers propose an efficient MLP architecture for video representation learning, MorphMLP, built from two key layers: MorphFCs (spatial) and MorphFCt (temporal). The receptive field is expanded progressively along the horizontal and vertical directions, so that MorphFCs can effectively capture the core semantics in space (see Figure 2 below).


Figure 2: Operational overview

Compared with existing MLP designs, this progressive approach brings two advantages for spatial modeling.

  • First, by applying fully connected layers over spatial regions that grow from small to large, it learns hierarchical interactions and discovers discriminative details;

  • Second, this small-to-large region modeling effectively reduces the computational complexity of the fully connected layers used for spatial modeling (see the rough estimate below).
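As a rough back-of-envelope estimate (an illustrative simplification of this write-up, not a derivation from the paper), consider mixing the N = HW spatial tokens of a single channel. One fully connected layer over all N tokens costs on the order of N² operations, whereas splitting the tokens into blocks of length L and mixing each block independently costs:

```latex
\underbrace{O(N^{2})}_{\text{global FC over all } N = HW \text{ tokens}}
\quad\text{vs.}\quad
\underbrace{\tfrac{N}{L}\cdot O(L^{2}) \;=\; O(NL)}_{\text{FC over blocks of length } L}
```

With L ≪ N in the early layers, the block-wise design is far cheaper, and the cost grows only linearly as the block length increases with depth.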

Furthermore, MorphFCt adaptively captures long-range temporal dependencies across frames. The features of each spatial location across all frames are concatenated into a temporal block, so that a fully connected layer can efficiently process each block and model long-term temporal dependencies. Finally, a MorphMLP block is constructed by arranging MorphFC and MorphFCt in series, and these blocks are stacked into a general MorphMLP backbone network for video modeling.

On the one hand, this layered approach lets MorphFCs and MorphFCt cooperate to learn the complex spatiotemporal interactions in videos; on the other hand, this multi-scale, multi-dimensional decomposition achieves a better balance between accuracy and efficiency. MorphMLP is the first efficient MLP architecture built for the video domain; it significantly reduces computation while being more accurate than previous state-of-the-art video models.

Spatiotemporal Modeling in MorphMLP

Spatial Modeling

As noted above, mining core semantics is crucial for video recognition. Typical CNNs and previous MLP-like architectures focus on either local or global information modeling alone, and therefore cannot do this.

To address this challenge, the researchers propose a novel MorphFC layer that hierarchically expands the receptive field of the fully connected layer from small to large regions, processing each frame independently along the horizontal and vertical directions. Taking horizontal processing as an example (the blue blocks in Figure 3 below): given a frame, first split it along the horizontal direction into blocks, and divide each block into multiple groups along the channel dimension to reduce computational cost.

Next, each group is flattened into a 1D vector and transformed by a fully connected layer. After the transformation, all groups are reshaped back to the frame's original dimensions; the vertical direction is processed in the same way (the green blocks in Figure 3). Besides the horizontal and vertical splits, another fully connected layer processes each spatial location individually to ensure communication between groups along the channel dimension.

Finally, the horizontal, vertical, and channel features are summed. As the network deepens, the block length grows hierarchically, letting the fully connected layers gradually discover core semantics from small spatial regions up to large ones.
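To make the procedure concrete, here is a minimal PyTorch sketch of the spatial layer as described above. The (B, H, W, C) tensor layout, the class name MorphFCSketch, the group count, and weight sharing across blocks are assumptions of this sketch, not the official implementation (see the GitHub repo for that).

```python
import torch
import torch.nn as nn

class MorphFCSketch(nn.Module):
    """A minimal sketch of the spatial MorphFC layer (illustrative, not official)."""
    def __init__(self, dim, block_len, groups=4):
        super().__init__()
        assert dim % groups == 0
        self.block_len = block_len            # L: spatial block length (grows with depth)
        self.groups = groups                  # channel groups to cut computation
        chunk = block_len * dim // groups     # flattened length of one (block, group)
        self.fc_h = nn.Linear(chunk, chunk)   # horizontal mixing
        self.fc_w = nn.Linear(chunk, chunk)   # vertical mixing
        self.fc_c = nn.Linear(dim, dim)       # per-location channel mixing

    def _mix(self, x, fc):
        # x: (B, H, W, C); mix along W in blocks of length L, per channel group
        B, H, W, C = x.shape
        L, g = self.block_len, self.groups
        x = x.reshape(B, H, W // L, L, g, C // g)                    # split width into blocks, channels into groups
        x = x.transpose(3, 4).reshape(B, H, W // L, g, L * C // g)   # flatten each (block, group) into a 1D vector
        x = fc(x)                                                    # fully connected feature transform
        x = x.reshape(B, H, W // L, g, L, C // g).transpose(3, 4)
        return x.reshape(B, H, W, C)                                 # back to the frame layout

    def forward(self, x):
        # x: (B, H, W, C), one frame; H and W assumed divisible by block_len
        h = self._mix(x, self.fc_h)                                  # horizontal path
        v = self._mix(x.transpose(1, 2), self.fc_w).transpose(1, 2)  # vertical path
        c = self.fc_c(x)                                             # channel path
        return h + v + c                                             # sum the three paths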


Figure 3: Spatial Modeling

Temporal Modeling

In addition to the horizontal and vertical pathways, a temporal pathway is introduced to capture long-term temporal information at low computational cost using simple fully connected layers.

Specifically, given the input video, the features are first divided into groups along the channel dimension to reduce computational cost; the features of all frames at each spatial location are then concatenated into a block; a fully connected layer transforms the temporal features; and finally all blocks are reshaped back to their original dimensions. In this way, the fully connected layer can simply aggregate dependencies along the temporal dimension within each block (the orange blocks in Figure 4 below).
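Below is a matching PyTorch sketch of the temporal pathway; the (B, T, H, W, C) layout, the class name MorphFCtSketch, and the group count are again illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class MorphFCtSketch(nn.Module):
    """A minimal sketch of the temporal MorphFCt layer (illustrative, not official)."""
    def __init__(self, dim, num_frames, groups=4):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        chunk = num_frames * dim // groups
        self.fc_t = nn.Linear(chunk, chunk)   # mixes all frames of one group jointly

    def forward(self, x):
        # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        g = self.groups
        x = x.reshape(B, T, H, W, g, C // g).permute(0, 2, 3, 4, 1, 5)  # (B, H, W, g, T, C//g)
        x = x.reshape(B, H, W, g, T * C // g)   # concatenate all frames at a location into one block
        x = self.fc_t(x)                        # long-range temporal mixing with one FC layer
        x = x.reshape(B, H, W, g, T, C // g).permute(0, 4, 1, 2, 3, 5)
        return x.reshape(B, T, H, W, C)         # back to the video layout
```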


Figure 4: Temporal modeling

Spatiotemporal Modeling

The temporal and spatial fully connected layers are connected in series, which yields more stable spatiotemporal optimization convergence and lower computational complexity. The resulting backbone network, which extracts video features using only fully connected layers, is shown in Figure 5 below. On this basis, adaptation to the image domain is accomplished by simply discarding the temporal dimension.
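Putting the two together, here is a hedged sketch of one MorphMLP block, reusing the MorphFCSketch and MorphFCtSketch classes from the sketches above. The residual connections, LayerNorm placement, and omission of the block's channel-MLP sub-layers are assumptions patterned on common MLP blocks, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MorphBlockSketch(nn.Module):
    """Spatial MorphFC and temporal MorphFCt applied in series (illustrative sketch)."""
    def __init__(self, dim, num_frames, block_len, groups=4):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.fc_s = MorphFCSketch(dim, block_len, groups)     # spatial layer (sketch above)
        self.norm_t = nn.LayerNorm(dim)
        self.fc_t = MorphFCtSketch(dim, num_frames, groups)   # temporal layer (sketch above)

    def forward(self, x):
        # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        y = self.norm_s(x).reshape(B * T, H, W, C)            # process frames independently in space
        x = x + self.fc_s(y).reshape(B, T, H, W, C)           # spatial modeling + residual
        x = x + self.fc_t(self.norm_t(x))                     # temporal modeling + residual
        return x

if __name__ == "__main__":
    x = torch.randn(2, 8, 56, 56, 64)                         # (B, T, H, W, C)
    blk = MorphBlockSketch(dim=64, num_frames=8, block_len=14)
    print(blk(x).shape)                                       # torch.Size([2, 8, 56, 56, 64])
```

Under these assumed shapes, the block maps its input to an output of the same shape, so such blocks can be stacked (with downsampling between stages) into the full backbone of Figure 5.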


Figure 5: Network Architecture

Results


Table 1: Accuracy and computational cost on the Kinetics-400 dataset


Table 2: Accuracy and computational performance on the Something-Something dataset


Table 3: Accuracy and computational performance of image domain adaptation on ImageNet


Table 4: Image segmentation performance

Summary

This paper proposes MorphMLP, a self-attention-free, MLP-like backbone network for video representation learning. The method gradually discovers core semantics and captures long-term temporal information, and it is the first backbone to apply an MLP architecture in the video domain. Experiments show that this self-attention-free model can be as powerful as, or even better than, self-attention-based architectures.

