Collaborative Spatiotemporal Feature Learning for Video Action Recognition


Summary

Temporal feature extraction is a crucial part of video recognition. Existing convolutional neural network models either learn spatial and temporal features separately (C2D) or learn them jointly without any constraint (C3D).

The authors propose a novel neural operation that encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters.

Specifically, 2D convolutions are performed along three orthogonal views of the video volume, so that spatial appearance cues and temporal motion cues are learned jointly.

By sharing the convolution kernels across views, the spatial and temporal features of the different views are learned collaboratively and mutually optimized; the complementary features are then fused by a weighted summation whose per-channel coefficients are learned end-to-end.

The algorithm achieves state-of-the-art performance and won the Moments in Time Challenge 2018.

Based on the coefficients learned for the different views, the authors quantify the contributions of spatial and temporal features. This analysis gives the model some interpretability and may provide guidance for the design of future video recognition algorithms.

I. Introduction

Jointly learning spatial and temporal features is critical for action recognition. Spatial features can be extracted relatively easily, much as in image recognition, but two problems remain unresolved: how to learn temporal features, and how to combine spatial and temporal features well. There have been several attempts to solve these problems: 1. Two-stream networks that take spatial features and temporal features as inputs to two separate branches. 2. 3D convolutions, in which temporal and spatial features are tightly entangled and learned together, so that temporal features can be learned directly by the network. However, the large number of parameters and the heavy computation of 3D convolutions limit the performance of such models.

The authors propose a collaborative spatiotemporal feature learning operation (CoST), which jointly learns spatial and temporal features under a weight-sharing constraint. Given a 3D video volume, it is first decomposed from different angles into three sets of 2D images, and each of the three sets is then convolved with 2D kernels. The three views of the video volume are:

  1. The usual \(H\text{-}W\) view, i.e. \(H\text{-}W\) is treated as the image plane and \(T\) is the dimension along which the planes are stacked.
  2. The \(T\text{-}W\) view, i.e. \(T\text{-}W\) is treated as the plane and \(H\) is the stacking dimension.
  3. The \(T\text{-}H\) view, i.e. \(T\text{-}H\) is treated as the plane and \(W\) is the stacking dimension.

This design makes each slice contain rich motion information within a single image, rather than spreading it across adjacent frames, so 2D convolutions can directly capture motion cues in the sequence. It also means that spatial and temporal features can be learned with 2D convolutions alone, without resorting to 3D convolutions.
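To make the three views concrete, here is a minimal sketch with an assumed single-channel clip of shape (T, H, W); the shapes and variable names are illustrative only:

```python
import torch

# Assumed single-channel clip of shape (T, H, W) = (8, 224, 224).
video = torch.randn(8, 224, 224)

hw_view = video                    # T slices of size H x W (the usual view)
tw_view = video.permute(1, 0, 2)   # H slices of size T x W
th_view = video.permute(2, 0, 1)   # W slices of size T x H
```

Each slice of `tw_view` and `th_view` is a 2D image in which motion shows up as visible spatial structure, which is why ordinary 2D kernels can pick up temporal cues from them.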

The reasons for sharing parameters across the different views are:

  1. The images produced by the different views are compatible: as shown in the figure, the \(T\text{-}H\) and \(T\text{-}W\) slices still look like natural images, with edge and corner structures, so the weights can be shared.
  2. The 2D convolution kernels of a network are inherently redundant, so features of different domains can be learned with shared weights.
  3. Sharing greatly reduces the number of model parameters, making the network easier to train.

In addition, features learned in the spatial domain can easily be transferred to the temporal domain through careful design of the network structure and pre-trained parameter initialization.

The complementary features of the different views are fused by weighted summation. The fusion coefficients are learned independently for each channel and each view, so temporal features can be learned on demand. Based on these coefficients, the authors quantitatively analyze the respective contributions of the spatial and temporal domains.

On the basis of the CoST operation, a convolutional neural network is built. Compared with C2D, CoST jointly learns spatiotemporal features; compared with C3D, CoST is based on 2D convolutions, so CoST bridges C2D and C3D. Experiments show that CoST performs better than both C2D and C3D.

The authors' contributions are summarized as follows:

  • CoST is proposed, which learns spatiotemporal features with 2D convolutions instead of 3D convolutions.
  • This work is the first to quantitatively analyze the importance of spatial and temporal features.
  • CoST outperforms C3D and its variants, achieving state-of-the-art results on large-scale datasets.

II. Related Work

Traditional algorithms based on hand-crafted features are reviewed, among which local features guided by optical-flow trajectories perform best, as well as two-stream network architectures and LSTMs for modeling temporal evolution, and C3D and its variants. The method most similar to the authors' is Slicing CNN for crowd video understanding, which also learns from different views of the video volume. The difference is that Slicing CNN learns the features of the three views independently in separate network branches and merges them only at the top of the network, whereas the authors learn spatiotemporal features jointly, aggregating spatial and temporal features at every layer.

III. Method

1. 2D ConvNets

The C2D model (block (a) in the figure below) can extract robust spatial features, but it combines spatial and temporal features only with a very simple strategy. C2D is used as a baseline model, built on a ResNet-50 backbone; the network structure is shown in the table below.

[Table: ResNet-50 backbone network structure of the C2D and CoST models]

2. 3D ConvNets

C3D improves on C2D by adding the temporal dimension: the \(h \times w\) convolution kernel becomes a \(t \times h \times w\) kernel. Blocks (b) and (c) in the figure below are two C3D variants. Clearly, (c) has far fewer parameters than (b), and experiments show that the two perform comparably, so structure (c) is used as the C3D baseline model.

[Figure: building blocks of C2D (a) and the two C3D variants (b) and (c)]

3. CoST

The figure below compares the CoST operation with \(C3D_{3 \times 3 \times 3}\) and \(C3D_{3 \times 1 \times 1}\). \(C3D_{3 \times 3 \times 3}\) extracts temporal and spatial features jointly with a single 3D convolution, while \(C3D_{3 \times 1 \times 1}\) first extracts temporal features with a \(3 \times 1 \times 1\) kernel and then extracts spatial features with a \(1 \times 3 \times 3\) kernel. The authors instead convolve the volume from the three views with three \(3 \times 3\) 2D kernels and fuse the three resulting feature maps by weighted summation. Note that the parameters of the three kernels are shared (how is this parameter sharing implemented in code? see the hedged sketch after the fusion formula below), and the parameters can be trained end-to-end.

[Figure: comparison of the CoST operation with \(C3D_{3 \times 3 \times 3}\) and \(C3D_{3 \times 1 \times 1}\)]

The input feature map has size \(T \times H \times W \times C_1\), where \(C_1\) is the number of input channels. The convolutions from the three views can be written as:
\[ \begin{aligned} \boldsymbol{x}_{h w} &=\boldsymbol{x} \otimes \boldsymbol{w}_{1 \times 3 \times 3} \\ \boldsymbol{x}_{t w} &=\boldsymbol{x} \otimes \boldsymbol{w}_{3 \times 1 \times 3} \\ \boldsymbol{x}_{t h} &=\boldsymbol{x} \otimes \boldsymbol{w}_{3 \times 3 \times 1} \end{aligned} \]
where \(\otimes\) denotes 3D convolution and \(w\) is the shared \(3 \times 3\) kernel expanded by one extra dimension for each of the three views.

The convolution can be understood as follows. For the \(T\text{-}W\) view, \(T\text{-}W\) is treated as a plane and \(H\) as a stack of such planes, each with \(C_1\) channels. If we convolve a single plane on its own, the kernel size is \(C_1 \times 3 \times 3\) and the result has size \(T \times W\). From this view there are \(H\) such planes, so every plane is convolved with exactly the same \(C_1 \times 3 \times 3\) kernel, giving a feature of size \(T \times W \times H\). Since there are \(C_2\) kernels, the resulting feature map has size \(T \times H \times W \times C_2\). In the formula above, \(w_{3 \times 1 \times 3}\) omits the channel dimension \(C_1\), and the \(H\) dimension is replaced by \(1\).

After the features of the three views are obtained, the final output of the layer is their weighted sum:
\[ y=\left[\alpha_{h w}, \alpha_{t w}, \alpha_{t h}\right]\left[\begin{array}{l}{\boldsymbol{x}_{h w}} \\ {\boldsymbol{x}_{t w}} \\ {\boldsymbol{x}_{t h}}\end{array}\right] \]
Here \(\boldsymbol{\alpha}=\left[\alpha_{h w}, \alpha_{t w}, \alpha_{t h}\right]\) is a \(C_2 \times 3\) matrix, where 3 corresponds to the three views. To prevent the responses aggregated from multiple views from blowing up, each row of \(\boldsymbol{\alpha}\) is normalized with a softmax.
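Regarding the question raised above about how the kernel sharing can be implemented: below is a minimal PyTorch-style sketch of the CoST(a) variant under my own assumptions (a \(3 \times 3\) kernel bank, stride 1, "same" padding, no bias), not the authors' released code. The single shared 2D kernel bank is viewed as \(1 \times 3 \times 3\), \(3 \times 1 \times 3\) and \(3 \times 3 \times 1\) 3D kernels, and \(\boldsymbol{\alpha}\) is a learned \(C_2 \times 3\) matrix normalized row-wise with a softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoST(nn.Module):
    """Minimal sketch of CoST(a): one 3x3 kernel bank shared by three views."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Shared 2D kernel bank: (C2, C1, 3, 3).
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        # Per-channel, per-view fusion coefficients alpha: (C2, 3).
        self.alpha = nn.Parameter(torch.zeros(out_channels, 3))

    def forward(self, x):
        # x: (N, C1, T, H, W). The same weights are viewed as three 3D kernels.
        x_hw = F.conv3d(x, self.weight.unsqueeze(2), padding=(0, 1, 1))  # 1x3x3
        x_tw = F.conv3d(x, self.weight.unsqueeze(3), padding=(1, 0, 1))  # 3x1x3
        x_th = F.conv3d(x, self.weight.unsqueeze(4), padding=(1, 1, 0))  # 3x3x1
        # Row-wise softmax over the three views, then weighted summation.
        a = F.softmax(self.alpha, dim=1).view(1, -1, 3, 1, 1, 1)
        views = torch.stack([x_hw, x_tw, x_th], dim=2)   # (N, C2, 3, T, H, W)
        return (a * views).sum(dim=2)                    # (N, C2, T, H, W)
```

Because all three convolutions read from the same `self.weight` tensor, the gradients from the three views accumulate on a single set of parameters, which is exactly the weight-sharing constraint described above.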

The authors design two CoST variants, as shown in the figure:

[Figure: the two CoST block variants, CoST(a) and CoST(b)]

CoST(a)

The coefficients \(\boldsymbol{\alpha}\) are treated as part of the model parameters: they are updated during back-propagation and fixed at inference time.

CoST(b)

The coefficients \(\boldsymbol{\alpha}\) are predicted by the network from the features, a design inspired by recent self-attention mechanisms; the coefficient values therefore depend on each individual sample. First, global pooling reduces each view's features to \(1 \times 1 \times 1\); they are then convolved with a \(1 \times 1 \times 1\) kernel (again with shared parameters), the three descriptors are concatenated and fed into an FC layer, and finally the output is normalized with a softmax.
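A hedged sketch of what this coefficient branch could look like is given below; only the overall pipeline (global pooling → shared \(1 \times 1 \times 1\) convolution → concatenation → FC → softmax) is from the text, while the layer widths, the use of average pooling, and all names are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoSTCoeff(nn.Module):
    """Hypothetical coefficient branch for CoST(b): predicts alpha per sample."""

    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels
        # Shared 1x1x1 convolution applied to each pooled view descriptor.
        self.conv = nn.Conv3d(channels, hidden, kernel_size=1)
        # FC layer mapping the concatenated descriptors to C2 x 3 coefficients.
        self.fc = nn.Linear(3 * hidden, 3 * channels)
        self.channels = channels

    def forward(self, x_hw, x_tw, x_th):
        # Each view feature (N, C2, T, H, W) is pooled to (N, C2, 1, 1, 1).
        feats = []
        for v in (x_hw, x_tw, x_th):
            p = F.adaptive_avg_pool3d(v, 1)
            feats.append(self.conv(p).flatten(1))       # (N, hidden)
        z = torch.cat(feats, dim=1)                     # (N, 3 * hidden)
        a = self.fc(z).view(-1, self.channels, 3)       # (N, C2, 3)
        return F.softmax(a, dim=2)                      # normalize over views
```

The returned \((N, C_2, 3)\) tensor would take the place of the fixed \(\boldsymbol{\alpha}\) of CoST(a) in the weighted fusion, so that the mixing of views can differ from sample to sample.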

Relation to C2D and C3D

As shown in the figure, if the coefficients of the \(T\text{-}W\) and \(T\text{-}H\) views are 0, CoST degenerates into C2D; CoST is therefore a strict generalization of C2D.

CoST can also be seen as a special case of C3D. A 3D kernel has \(k^{3}\) parameters and can be pictured as a \(k \times k \times k\) cube, whereas CoST covers only \(3 k^{2}-3 k+1\) positions, i.e. the cube with its eight corner regions removed. Without parameter sharing, CoST is very close to C3D, except that the corner parameters are fixed to 0 and cannot be learned; with parameter sharing, although the kernel footprint covers 19 positions, the corresponding 19 parameters are derived from only 9 learnable parameters shared across the views. In terms of computation, CoST is also far cheaper than C3D.
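The \(3 k^{2}-3 k+1\) count follows from inclusion-exclusion over the three orthogonal \(k \times k\) planes through the cube: each pair of planes shares a line of \(k\) cells, and all three share the single center cell:

\[
3 k^{2} - 3 k + 1 \;=\; 3 \cdot 3^{2} - 3 \cdot 3 + 1 \;=\; 19 \quad (k = 3), \qquad \text{compared with } k^{3} = 27 \text{ for C3D}.
\]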

IV. Experimental Results

The authors experiment on Moments in Time and Kinetics. Moments in Time contains 802,245 training videos and 39,900 validation videos from 339 action classes, each trimmed to 3 seconds. Kinetics contains 236,763 training videos and 19,095 validation videos from 400 human action classes, each lasting about 10 seconds.

1. Implementation Details

For each video, 64 consecutive frames are sampled and one frame is kept out of every 8, giving 8 frames per clip. A 224 × 224 patch is randomly cropped from a rescaled video whose shorter side is randomly sampled between 256 and 320 pixels. The model is initialized from a 2D model pre-trained on ImageNet. Training uses 8 GPUs; to speed it up, the 8 GPUs are split into two groups and the weights are updated asynchronously between the two groups. The mini-batch size is 8 videos per GPU, i.e. a total mini-batch of 32 for the four GPUs of one group. SGD is run for 600k iterations with momentum 0.9 and weight decay 0.0001. The learning rate is initialized to 0.005 and divided by 10 at 300k and 450k iterations.

During inference, fully convolutional spatial inference is performed on videos whose shorter side is rescaled to 256 pixels. In the temporal domain, 10 clips are uniformly sampled from the full video and their classification scores are computed separately; the final prediction is the average score over all clips.
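For the temporal part of inference, here is a minimal sketch of uniform clip sampling and score averaging; the `model` callable, the exact sampling grid, and all names are my assumptions, and only "10 uniformly sampled clips, scores averaged" comes from the text.

```python
import torch

def video_score(model, video, num_clips=10, clip_len=8, stride=8):
    """Average clip-level class scores over clips sampled uniformly in time."""
    T = video.shape[1]                               # video: (C, T, H, W)
    span = (clip_len - 1) * stride
    starts = torch.linspace(0, max(T - 1 - span, 0), num_clips).long()
    scores = []
    for s in starts:
        idx = (s + torch.arange(clip_len) * stride).clamp(max=T - 1)
        clip = video[:, idx]                         # (C, clip_len, H, W)
        scores.append(model(clip.unsqueeze(0)).softmax(dim=1))
    return torch.stack(scores).mean(dim=0)           # averaged class scores
```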

2. Ablation Study

Ablations on the two datasets show that CoST(b) works better than CoST(a), that sharing weights works better than not sharing, and that CoST outperforms both C2D and C3D.

[Tables: ablation results on Moments in Time and Kinetics]

3. Comparison with the State of the Art

[Tables: comparison with state-of-the-art methods on Moments in Time and Kinetics]

4. Importance of Different Views

The authors define the coefficient of the \(H\text{-}W\) view as the influence of spatial features, and the coefficients of the \(T\text{-}W\) and \(T\text{-}H\) views as the influence of temporal features. The figure shows that spatial features are very important on both datasets, and that Moments in Time relies more on temporal features than Kinetics does. As the layer depth increases, the influence of spatial features declines while that of temporal features rises, indicating that the lower layers of the network focus on learning spatial features while the higher layers focus on extracting temporal features. Example videos further illustrate that for some samples temporal features matter more, while for others spatial features are more important.

[Figures: learned view coefficients across layers and datasets, and example videos dominated by temporal or spatial features]

5. Discussion

For video analysis, how to encode spatial and temporal features efficiently remains an open question. Although the experiments show that sharing parameters improves recognition performance to some extent, treating the temporal dimension \(T\) as just another spatial dimension is counter-intuitive, since spatial appearance and temporal motion would seem to be two different modalities. The motivation for collaborative learning came from visualizing the different views (Figure 1). Interestingly, the results show that, at least to some extent, the views have similar characteristics and can be learned jointly by a single network with the same architecture and shared convolution kernels. In physics, according to Minkowski's space-time, three-dimensional space and time can be unified into a four-dimensional continuum; the findings here may indicate that spatiotemporal feature learning is consistent with such a space-time interpretation.

V. Conclusions

Learning features from 3D volumetric data is a major challenge in video action recognition. This paper proposes a novel feature learning operation that learns spatiotemporal features collaboratively from multiple views. It can serve as a drop-in replacement for both C2D and C3D. Experiments on large-scale benchmarks demonstrate the superiority of the proposed architecture over existing methods. Based on the coefficients learned for the different views, the individual contributions of spatial and temporal features to classification can be seen. This analysis suggests some promising directions for algorithm design, which are left for future work.


Origin www.cnblogs.com/shyern/p/11313109.html