UniFormer video model (a combination of 3D CNN and transformer)

1 Introduction

This article is a translated summary of the February 2022 paper "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning".

Learning rich, multi-scale spatio-temporal semantics from high-dimensional videos is very challenging, because videos contain both heavy local redundancy and complex global dependencies between frames: object movement between adjacent frames is tiny, while objects in frames that are far apart are dynamically related.

Recent research mainly focuses on 3D convolutional neural networks and vision transformers. 3D convolution captures detailed local spatiotemporal features within a small 3D neighborhood (such as 3×3×3), which lets it suppress the spatiotemporal redundancy between adjacent frames, i.e., it processes local information efficiently, but its restricted receptive field leaves it unable to capture global dependencies. The vision transformer can capture long-range dependencies through self-attention, but because it blindly compares the similarity of all tokens in every layer, it cannot reduce local redundancy well.

Based on this, we propose the Unified transFormer (UniFormer), which integrates 3D convolution and the transformer so that spatiotemporal redundancy and dependency can be handled simultaneously, achieving a good balance between computation and accuracy. See https://github.com/Sense-X/UniFormer for the code.

2 Methods

The whole model consists of 4 stages, each of which stacks UniFormer blocks, with channel dimensions of 64, 128, 320, and 512 respectively. Each UniFormer block consists of three parts: Dynamic Position Embedding (DPE), Multi-Head Relation Aggregator (MHRA), and Feed-Forward Network (FFN). The first two (shallow) stages learn local relations to reduce the computational burden; the last two (deep) stages learn global relations.
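For concreteness, the stage layout described above can be summarized as in the sketch below. The channel widths come from the text; the per-stage block counts are only illustrative placeholders, so check the official repo for the exact depths of each model variant.

```python
# Stage layout of the UniFormer backbone described above.
# Channel widths are from the text; block counts are illustrative only.
stages = [
    {"channels": 64,  "relation": "local MHRA",  "blocks": 3},   # stage 1 (shallow)
    {"channels": 128, "relation": "local MHRA",  "blocks": 4},   # stage 2 (shallow)
    {"channels": 320, "relation": "global MHRA", "blocks": 8},   # stage 3 (deep)
    {"channels": 512, "relation": "global MHRA", "blocks": 3},   # stage 4 (deep)
]

for i, stage in enumerate(stages, start=1):
    print(f"stage {i}: {stage['blocks']} blocks, "
          f"{stage['channels']} channels, {stage['relation']}")
```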

The overall structure of one UniFormer block is formulated as follows:
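$$X = \mathrm{DPE}(X_{in}) + X_{in}$$

$$Y = \mathrm{MHRA}(\mathrm{Norm}(X)) + X$$

$$Z = \mathrm{FFN}(\mathrm{Norm}(Y)) + Y$$

where X_in is the input video token tensor.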

2.1 Multi-Head Relation Aggregator (MHRA)

Token relations are aggregated in a multi-head fashion:
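$$R_n(X) = A_n V_n(X)$$

$$\mathrm{MHRA}(X) = \mathrm{Concat}(R_1(X); R_2(X); \cdots; R_N(X))\,U$$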

Here, A_n is the token affinity, which is instantiated as a local or a global version below; V_n is a linear transformation that produces the token context; and U is a learnable parameter matrix that integrates the N heads R_n.

Local MHRA
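$$A_n^{local}(X_i, X_j) = a_n^{i-j}, \quad j \in \Omega_i^{t \times h \times w}$$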

where i indexes the anchor token and j indexes the tokens in its small spatiotemporal neighborhood Ω_i^{t×h×w}; the affinity a_n is a learnable parameter matrix that depends only on the relative position between the tokens.

The first two stages of the whole model use local MHRA.
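Because this local affinity depends only on the relative position inside a small neighborhood, it behaves like a depthwise convolution. Below is a minimal PyTorch-style sketch of the local aggregator under that interpretation; the module names, the 5×5×5 neighborhood size, and the single-head layout are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Sketch of the local relation aggregator: a learnable affinity
    applied over a small 3D spatiotemporal neighborhood."""
    def __init__(self, dim, neighborhood=5):
        super().__init__()
        self.v = nn.Conv3d(dim, dim, kernel_size=1)          # V: linear transformation of the tokens
        self.affinity = nn.Conv3d(dim, dim,                  # A: depends only on relative position,
                                  kernel_size=neighborhood,  # so it acts like a depthwise conv
                                  padding=neighborhood // 2,
                                  groups=dim)
        self.proj = nn.Conv3d(dim, dim, kernel_size=1)       # U: integrates the heads

    def forward(self, x):              # x: (B, C, T, H, W)
        return self.proj(self.affinity(self.v(x)))

x = torch.randn(2, 64, 8, 56, 56)      # stage-1 style tokens: 8 frames at 56x56, 64 channels
print(LocalMHRA(64)(x).shape)          # torch.Size([2, 64, 8, 56, 56])
```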

Global MHRA
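$$A_n^{global}(X_i, X_j) = \frac{e^{\,Q_n(X_i)^{T} K_n(X_j)}}{\sum_{j' \in \Omega_{T \times H \times W}} e^{\,Q_n(X_i)^{T} K_n(X_{j'})}}$$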

where Q_n and K_n are two different linear transformations, and the affinity of each token is compared (and softmax-normalized) against all tokens in the global T×H×W tube.

The last two stages of the whole model use global MHRA.
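The global affinity is essentially standard softmax self-attention computed over all T×H×W tokens. A minimal sketch follows; the head count and module names are illustrative assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class GlobalMHRA(nn.Module):
    """Sketch of the global relation aggregator: softmax affinity
    computed between all T*H*W spatiotemporal tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # Q, K and the linear transformation V in one projection
        self.proj = nn.Linear(dim, dim)      # U: integrates the N heads

    def forward(self, x):                    # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, N, C) with N = T*H*W
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)

        def heads(t):                                    # (B, num_heads, N, C // num_heads)
            return t.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        affinity = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)   # global token affinity
        out = (affinity.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, -1, C)
        out = self.proj(out)                             # (B, N, C)
        return out.transpose(1, 2).view(B, C, T, H, W)

x = torch.randn(2, 320, 8, 14, 14)   # stage-3 style tokens
print(GlobalMHRA(320)(x).shape)      # torch.Size([2, 320, 8, 14, 14])
```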

The norm used before the local MHRA is Batch Normalization (BN), while the norm before the global MHRA is Layer Normalization (LN).

2.2 Dynamic Position Embedding (DPE)

DPE maintains translation invariance and is friendly to input clips of different lengths.

We extend the conditional position encoding (CPE) to design DPE. The formula is as follows:
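$$\mathrm{DPE}(X_{in}) = \mathrm{DWConv}(X_{in})$$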

where DWConv is a simple 3D depthwise convolution with zero padding.
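Putting DPE, MHRA, and FFN together under the block formula from section 2, here is a minimal sketch of one UniFormer block. The FFN expansion ratio of 4 and the use of BN for both norms are simplifications for illustration (as noted above, the real model uses BN in the local stages and LN in the global ones).

```python
import torch
import torch.nn as nn

class UniFormerBlock(nn.Module):
    """Sketch of one UniFormer block:
    X = DPE(X_in) + X_in;  Y = MHRA(Norm(X)) + X;  Z = FFN(Norm(Y)) + Y."""
    def __init__(self, dim, mhra, mlp_ratio=4):
        super().__init__()
        # DPE: a simple 3D depthwise convolution with zero padding
        self.dpe = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm3d(dim)       # BN used for both norms in this sketch
        self.norm2 = nn.BatchNorm3d(dim)
        self.mhra = mhra                       # local or global relation aggregator
        self.ffn = nn.Sequential(              # pointwise feed-forward network
            nn.Conv3d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv3d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = x + self.dpe(x)                    # X = DPE(X_in) + X_in
        x = x + self.mhra(self.norm1(x))       # Y = MHRA(Norm(X)) + X
        x = x + self.ffn(self.norm2(x))        # Z = FFN(Norm(Y)) + Y
        return x

# Depthwise conv as a stand-in local aggregator (see the LocalMHRA sketch in 2.1).
block = UniFormerBlock(64, mhra=nn.Conv3d(64, 64, kernel_size=5, padding=2, groups=64))
x = torch.randn(2, 64, 8, 56, 56)
print(block(x).shape)                          # torch.Size([2, 64, 8, 56, 56])
```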

3 Experiments

On the Kinetics-400 and Kinetics-600 datasets, UniFormer outperforms both 3D CNN and pure transformer models.

On the Something-Something V1 and V2 datasets, which demand stronger temporal modeling, CNN-style models cannot capture long-range dependencies well and their results fall noticeably behind, while UniFormer again performs better.

