TEA: Temporal Excitation and Aggregation for Action Recognition (Reading Notes)

1. Introduction

The paper addresses the importance of temporal modeling in video action recognition. It proposes a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, embeds both into a standard ResNet block, and obtains a temporal excitation and aggregation (TEA) block; ME handles short-range motion while MTA handles long-range temporal aggregation. The ME module computes feature-level temporal differences from the spatiotemporal features and uses them to excite the motion-sensitive channels of the features. The MTA module deforms the local convolution into a group of sub-convolutions arranged in a hierarchical residual structure.
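As a rough illustration of the feature-level temporal difference idea (a minimal sketch based on these notes, not the paper's exact formulation; the [N, T, C, H, W] tensor layout and zero-padding of the last step are assumptions):

```python
import torch

# Sketch: feature-level temporal difference, the motion signal used by ME.
# Assumed layout: x has shape [N, T, C, H, W] (batch, frames, channels, spatial).
def temporal_difference(x: torch.Tensor) -> torch.Tensor:
    # The difference between neighbouring frames approximates motion at the
    # feature level, replacing pixel-level optical flow.
    diff = x[:, 1:] - x[:, :-1]             # [N, T-1, C, H, W]
    # Pad the last time step with zeros so the output keeps T frames.
    zeros = torch.zeros_like(x[:, :1])
    return torch.cat([diff, zeros], dim=1)  # [N, T, C, H, W]
```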

Innovations:

  1. The ME module replaces the traditional hand-crafted optical flow that the classic two-stream framework feeds into a 2D CNN for action recognition. With ME, spatiotemporal features and motion features do not need to be trained separately; motion modeling is integrated directly into spatiotemporal feature learning.
  2. The MTA module replaces the traditional way of handling long-range temporal aggregation. Although it also follows a (2+1)D-style decomposition, a group of sub-convolutions is used instead of a single 1D temporal convolution.

2. Comparison with existing methods

Current video-based action recognition methods fall into two typical families. The first is the two-stream structure, which contains a spatial 2D CNN that learns static features from individual frames and a temporal 2D CNN that models motion in the form of optical flow. The two streams are trained separately and their predictions are averaged at test time. These methods require extra computation and storage for optical flow, and the interaction between different frames and between the two models is limited, typically occurring only at the last layer.
The other family is 3D CNNs and their (2+1)D variants. The first work in this direction is C3D, which applies 3D convolutions over adjacent frames to model spatiotemporal features in a unified way. I3D goes further by inflating the 2D convolutions of a 2D CNN into 3D convolutions. To reduce the cost of 3D convolution, it is often decomposed into a 2D spatial convolution plus a 1D temporal convolution, or mixed 2D/3D designs are used. However, after many stacked local convolution operations, useful features from distant frames are weakened and cannot be captured well.
TEA abandons optical flow extraction and instead learns a motion representation at the feature level by computing temporal differences. Spatiotemporal feature learning and motion encoding are thus combined, and the motion features are used to discover and enhance the motion-sensitive components of the features, i.e., to recalibrate the features so that motion patterns are strengthened. The proposed multiple temporal aggregation module is simple and effective and requires no additional operators.

3. Method

The sparse temporal sampling strategy proposed by TSN is used to handle videos of variable length: the video is first divided into T segments, then one frame is randomly selected from each segment, forming an input sequence of T frames. A 2D-CNN ResNet with multiple stacked TEA blocks performs the spatiotemporal modeling. A TEA block contains an ME module that excites motion patterns and an MTA module that establishes long-range temporal relationships. Finally, simple temporal average pooling is applied over all frames, i.e., the per-frame predictions are averaged.
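A minimal sketch of the sparse sampling described above (frame indices only; the segment count T, the random generator, and the helper name are placeholders from these notes, not the paper's code):

```python
import random

def sample_tsn_indices(num_frames: int, num_segments: int, seed=None):
    """Split the video into num_segments equal segments and pick one random frame index from each."""
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))
        indices.append(rng.randrange(start, end))
    return indices

# Example: an 80-frame video sampled into T = 8 segments.
print(sample_tsn_indices(80, 8, seed=0))
```

The per-frame logits produced by the stacked TEA blocks are then averaged over the T sampled frames (temporal average pooling) to obtain the clip-level prediction.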

  1. ME module
    Motion modeling is extended from the original pixel level (optical flow) to the feature level, so that motion modeling and spatiotemporal feature learning can be carried out in a unified framework.

  2. Comparison with SENet
    1) SENet was designed for image-based tasks. When applied to spatiotemporal features, it ignores temporal information and computes its weights independently for each frame of the video.
    2) SENet is a self-gating mechanism whose learned weights enhance the informative channels of the feature X; the ME module instead aims to enhance the motion-sensitive components of the feature.
    3) SENet suppresses all channels it considers useless, whereas the ME module preserves static background information through a residual connection.

  3. MTA module
    Inspired by Res2Net, the spatiotemporal features and the corresponding local convolutional layer are split into a group of subsets. The subsets form a hierarchical residual structure: a series of sub-convolutions is applied to the features in turn, which correspondingly enlarges the equivalent receptive field in both the spatial and temporal dimensions.

  4. Integration with the ResNet block
    The ME module is inserted into the bottleneck after the first 1x1 convolutional layer, the MTA module replaces the original 3x3 convolutional layer, and the action recognition network is built by stacking TEA blocks (a rough sketch of such a block follows this list).
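A rough PyTorch sketch of how such a block could be assembled from the description above. This is a simplified reconstruction for these notes, not the authors' released code: the channel reduction ratio, the number of MTA groups, padding choices, and the [N*T, C, H, W] tensor layout are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionExcitation(nn.Module):
    """Sketch of ME: excite motion-sensitive channels with feature-level temporal differences."""
    def __init__(self, channels, n_segments, reduction=16):
        super().__init__()
        self.n_segments = n_segments
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=mid, bias=False)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)

    def forward(self, x):                          # x: [N*T, C, H, W]
        nt, c, h, w = x.shape
        t = self.n_segments
        r = self.squeeze(x).view(nt // t, t, -1, h, w)
        # Feature-level temporal difference between neighbouring frames.
        diff = self.transform(r[:, 1:].flatten(0, 1)).view(nt // t, t - 1, -1, h, w) - r[:, :-1]
        diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 0, 1))   # pad the temporal dim back to length T
        attn = torch.sigmoid(self.expand(F.adaptive_avg_pool2d(diff.flatten(0, 1), 1)))
        # Residual connection keeps static background information instead of suppressing it.
        return x + x * attn


class MultipleTemporalAggregation(nn.Module):
    """Sketch of MTA: split channels into groups and apply (2+1)D sub-convolutions hierarchically."""
    def __init__(self, channels, n_segments, n_groups=4):
        super().__init__()
        assert channels % n_groups == 0
        self.n_segments, self.n_groups = n_segments, n_groups
        cg = channels // n_groups
        self.spatial = nn.ModuleList(
            nn.Conv2d(cg, cg, kernel_size=3, padding=1, bias=False) for _ in range(n_groups - 1))
        self.temporal = nn.ModuleList(
            nn.Conv1d(cg, cg, kernel_size=3, padding=1, groups=cg, bias=False) for _ in range(n_groups - 1))

    def forward(self, x):                          # x: [N*T, C, H, W]
        nt, c, h, w = x.shape
        t = self.n_segments
        cg = c // self.n_groups
        splits = torch.chunk(x, self.n_groups, dim=1)
        out, prev = [splits[0]], splits[0]         # the first group is passed through unchanged
        for i in range(1, self.n_groups):
            y = splits[i] + prev                   # hierarchical residual connection between groups
            # 1D temporal convolution across the T frames of each video.
            y = y.view(nt // t, t, cg, h, w).permute(0, 3, 4, 2, 1).reshape(-1, cg, t)
            y = self.temporal[i - 1](y)
            y = y.view(nt // t, h, w, cg, t).permute(0, 4, 3, 1, 2).reshape(nt, cg, h, w)
            y = self.spatial[i - 1](y)             # 2D spatial convolution
            out.append(y)
            prev = y
        return torch.cat(out, dim=1)


class TEABlock(nn.Module):
    """Sketch of a TEA bottleneck: 1x1 reduce -> ME -> MTA (replacing the 3x3 conv) -> 1x1 expand."""
    def __init__(self, in_channels, mid_channels, n_segments):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.me = MotionExcitation(mid_channels, n_segments)
        self.mta = MultipleTemporalAggregation(mid_channels, n_segments)
        self.expand = nn.Conv2d(mid_channels, in_channels, kernel_size=1, bias=False)

    def forward(self, x):                          # x: [N*T, C, H, W]
        y = F.relu(self.reduce(x))
        y = self.mta(self.me(y))
        y = self.expand(F.relu(y))
        return F.relu(x + y)                       # ResNet-style residual connection
```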

4. Experiments

Test results corresponding to different baselines (result figures omitted).


Original post: https://blog.csdn.net/qq_41214679/article/details/107975761