TimeSformer: Is Space-Time Attention All You Need for Video Understanding? (Paper Speed-Read and Summary of Core Points)

Abstract

  • A convolution-free approach to video classification built entirely on self-attention over space and time.
  • The method is named "TimeSformer".
  • We adapt standard Transformer architectures to video by learning spatio-temporal features directly from a series of frame-level patches.
  • Among the self-attention schemes compared, "divided attention", in which temporal attention and spatial attention are applied separately within each block, achieves the best video classification accuracy.
  • Compared to 3D convolutional networks, the model is faster to train, achieves dramatically higher test efficiency (at a small drop in accuracy), and can also be applied to much longer video clips (over one minute).

Introduction

  • The Transformer model excels at capturing long-range dependencies between words, and its training scales well.
  • Video understanding shares strong similarities with NLP:
    • videos and sentences are both sequential
    • just as the meaning of a word can often only be understood in relation to the other words in the sentence, atomic actions in a short-term clip need to be related to the rest of the video to be fully disambiguated.
  • No one has attempted to use self-attention as the sole building block of a video recognition model.
  • Such a design has the potential to overcome some inherent limitations of convolutional models for video analytics.
  • First, while their strong inductive biases (e.g., local connectivity and translation invariance) are undoubtedly beneficial on small training sets, they may unduly limit the model's expressive power. Compared with CNNs, Transformers impose less restrictive inductive biases, which broadens their expressive power and makes them better suited to modern big-data regimes, where strong inductive priors are less necessary.
  • Second, although convolutional kernels are specifically designed to capture short-range spatio-temporal information, they cannot model dependencies beyond their receptive field. Deep stacking of convolutions naturally expands the receptive field, but such strategies are inherently limited to capturing long-range dependencies by aggregating shorter-range information. Instead, by directly comparing feature activations at all space-time locations, self-attention captures both local and global long-range dependencies, well beyond the receptive field of convolutional filters. [The point here: the receptive field of convolutions is inherently limited/too small.]
  • Finally, despite advances in GPU hardware acceleration, training deep CNNs remains prohibitively expensive, especially when applied to high-resolution and long videos. Transformers enjoy faster training and inference than CNNs, so models with larger learning capacity can be built for the same computational budget.
  • Think of a video as a sequence of patches extracted from the individual frames. Each patch is linearly mapped to an embedding and augmented with positional information, so the resulting sequence of vectors can be interpreted as token embeddings and fed to a Transformer encoder (see the sketch after this list).
  • A disadvantage of self-attention in the standard Transformer is that it computes a similarity measure for every pair of tokens, which is computationally expensive. The idea is therefore to apply temporal attention and spatial attention separately in each block of the network.
  • TimeSformer adopts a radically different design from the established paradigm of convolution-based video architectures, yet its accuracy is comparable to, and in some cases higher than, the state of the art in the field. It can also be used for long-range modeling of videos spanning many minutes.
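As a rough illustration of this patch-to-token pipeline, here is a minimal sketch of my own (not the authors' code), assuming ViT-Base-like dimensions and using a strided convolution as the linear patch projection; the class token is omitted for brevity:

```python
import torch
import torch.nn as nn

# Assumed sizes: batch of 2 clips, 8 frames of 224x224, 16x16 patches, 768-dim tokens.
B, T, H, W, P, D = 2, 8, 224, 224, 16, 768
video = torch.randn(B, 3, T, H, W)                      # (batch, channels, frames, height, width)

# A Conv2d with kernel = stride = P is equivalent to a linear map of each P x P x 3 patch.
to_patches = nn.Conv2d(3, D, kernel_size=P, stride=P)
frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, 3, H, W)
tokens = to_patches(frames).flatten(2).transpose(1, 2)   # (B*T, N, D) with N = (H/P)*(W/P) = 196
tokens = tokens.reshape(B, T * (H // P) * (W // P), D)   # one space-time token sequence per clip

# Learnable space-time position embedding (would be registered inside a module to be trained).
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], D))
tokens = tokens + pos_embed                              # ready for the Transformer encoder
print(tokens.shape)                                      # torch.Size([2, 1568, 768])
```

The resulting 1568 tokens (8 frames × 196 patches under these assumed sizes) are what the attention schemes discussed below operate on.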

Related Work

  • Prior work uses attention for image classification, either in conjunction with convolutional operators or even as a complete replacement for them.
  • Closest: image networks that use self-attention instead of convolutions, which however suffer from high memory consumption and computational cost. Some of the self-attention operators considered in the experiments employ similar sparse and axial computations, although generalized to the spatio-temporal volume.
  • Vision Transformer (ViT): the idea of splitting an image into patches and embedding them as tokens.
  • There is a large body of literature combining text Transformers with video CNNs to solve various video-language tasks.

TimeSformer Model

  • Input clip: $X \in \mathbb{R}^{H \times W \times 3 \times F}$, i.e. F RGB frames of size H × W.
  • Decomposition into patches: each frame is decomposed into N non-overlapping patches of size P × P, so each flattened patch $x_{(p,t)}$ has 3P² values.
  • Linear embedding: each patch is multiplied by a learnable matrix E and a learnable space-time position embedding is added, $\mathbf{z}^{(0)}_{(p,t)} = E\,x_{(p,t)} + \mathbf{e}^{pos}_{(p,t)}$, giving one token of dimension D per patch p and frame t; a classification (class) token is prepended, and this sequence is the input to the Transformer.
  • QKV calculation:

$$
\mathbf{q}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{Q}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big) \in \mathbb{R}^{D_h},\qquad
\mathbf{k}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{K}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big),\qquad
\mathbf{v}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{V}\,\mathrm{LN}\big(\mathbf{z}^{(\ell-1)}_{(p,t)}\big)
$$

where LN denotes LayerNorm, $\ell$ indexes the block, $a$ the attention head, and $D_h$ is the per-head dimension.

  • Self-attention weights: computed via scaled dot products.

$$
\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\,\top}_{(p,t)}}{\sqrt{D_h}}\cdot\Big[\mathbf{k}^{(\ell,a)}_{(0,0)}\;\;\big\{\mathbf{k}^{(\ell,a)}_{(p',t')}\big\}_{p'=1,\dots,N;\,t'=1,\dots,F}\Big]\right)
$$

In block $\ell$ and head $a$, the query of the token at spatial position p and time t is dot-multiplied with the keys of all the other tokens in the same layer (including the class token, indexed $(0,0)$), followed by a softmax (SM).

  • Encoding: for the token at position p and time t, in block $\ell$ and head $a$, take the weighted sum of the value vectors using the attention coefficients (attending only to tokens within the same block and the same head):

$$
\mathbf{s}^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,0)}\,\mathbf{v}^{(\ell,a)}_{(0,0)} + \sum_{p'=1}^{N}\sum_{t'=1}^{F}\alpha^{(\ell,a)}_{(p,t),(p',t')}\,\mathbf{v}^{(\ell,a)}_{(p',t')}
$$
The outputs of all heads in the same block are then concatenated, projected by a matrix $W_O$ with a residual connection, and passed through an MLP with another residual connection, which completes one multi-head attention block:

$$
\mathbf{z}'^{(\ell)}_{(p,t)} = W_O\begin{bmatrix}\mathbf{s}^{(\ell,1)}_{(p,t)}\\ \vdots\\ \mathbf{s}^{(\ell,\mathcal{A})}_{(p,t)}\end{bmatrix} + \mathbf{z}^{(\ell-1)}_{(p,t)},\qquad
\mathbf{z}^{(\ell)}_{(p,t)} = \mathrm{MLP}\Big(\mathrm{LN}\big(\mathbf{z}'^{(\ell)}_{(p,t)}\big)\Big) + \mathbf{z}'^{(\ell)}_{(p,t)}
$$

  • Classification embedding: the class token from the final block, after one more LN and a one-hidden-layer MLP, gives the final output.
  • Space-time self-attention models: the proposed divided (T+S) attention applies temporal attention and spatial attention one after the other within each block (see the PyTorch-style sketch after this list).
    • T: each token is compared with the tokens at the same spatial location but at different times

$$
\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\,\top}_{(p,t)}}{\sqrt{D_h}}\cdot\Big[\mathbf{k}^{(\ell,a)}_{(0,0)}\;\;\big\{\mathbf{k}^{(\ell,a)}_{(p,t')}\big\}_{t'=1,\dots,F}\Big]\right)
$$

  • S: analogously, each token is compared with the tokens in the same frame but at different spatial locations.
  • Sparse Local Global (L+G): first attend over the spatially and temporally neighboring patches (local), then over a sparse set of patches taken with a stride of 2 (global); this approximates full space-time attention via a local/global decomposition and a sparsity pattern.
  • Axial: attention is decomposed over three separate dimensions and applied in turn over time, width, and height.
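To make the divided T+S scheme concrete, the following is a minimal PyTorch-style sketch of one block. It is a simplification of my own, not the authors' reference implementation: the class token is omitted, nn.MultiheadAttention stands in for the per-head equations above, and the per-step projection details are folded into the attention modules.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One block of divided space-time attention: temporal attention, then spatial attention, then MLP."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, T, N):
        # x: (B, T*N, D) patch tokens for T frames with N patches each (class token omitted here).
        B, _, D = x.shape

        # Temporal attention: each patch attends to the same spatial location
        # across all T frames, so fold the spatial axis into the batch.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)

        # Spatial attention: each patch attends to all N patches of its own frame.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T * N, D)

        # Standard Transformer MLP with a residual connection.
        return x + self.mlp(self.norm_mlp(x))
```

Stacking a dozen such blocks over the embedded token sequence (plus the class token and the classification head described above, whose exact handling in divided attention I leave out here) would give the overall encoder structure.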

Experiments

  • Setup: clips of size 8 × 224 × 224 with a frame sampling rate of 1/32; patch size 16 × 16 pixels; ViT pretrained on ImageNet-21K (a small sketch of the frame sampling follows this list).
  • K400 and SSv2:
    • On K400, where spatial cues matter more than temporal information, good accuracy can be obtained even without any temporal modeling. Space-only attention, however, performs poorly on SSv2.
    • On SSv2, ImageNet-1K and ImageNet-21K pretraining lead to similar accuracy, because SSv2 requires complex spatio-temporal reasoning; K400, by contrast, leans more on spatial scene information and therefore benefits greatly from features learned on the larger pretraining dataset.
    • On K400, TimeSformer performs best in all cases. On SSv2, which requires more complex temporal inference, TimeSformer outperforms the other models only when enough training videos are used.
    • Using only spatial position embeddings produces solid results on Kinetics-400 but worse results on Something-Something-V2, because Kinetics-400 is more spatially biased while Something-Something-V2 requires complex temporal reasoning.
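As a side note, here is a tiny sketch of which frame indices the "8-frame clip at a sampling rate of 1/32" setting above reads from a source video; the starting offset and the exact convention are my assumptions, not taken from the paper:

```python
import torch

# 8 frames, one every 32 frames of the source video (sampling rate 1/32).
num_frames, stride = 8, 32
start = 0                                     # e.g. a random offset at training time (assumption)
frame_indices = start + stride * torch.arange(num_frames)
print(frame_indices.tolist())                 # [0, 32, 64, 96, 128, 160, 192, 224]
# Each selected frame is then resized/cropped to 224 x 224 before patchification.
```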

Computational cost comparison (the paper's figure plots the cost as spatial resolution increases, left, and as the number of input frames increases, right).

  • TimeSformer is more suitable for environments involving large-scale learning. In contrast, the huge computational cost of modern 3D CNNs makes it difficult to further increase their model capacity while maintaining efficiency.
  • Three variants (token counts per clip are worked out in the sketch after this list):
    • TimeSformer: 8 × 224 × 224 video clips
    • TimeSformer-HR: 16 × 448 × 448 video clips
    • TimeSformer-L: 96 × 224 × 224 video clips
  • On the Diving-48 dataset: TimeSformer's accuracy is below the best model on this dataset. However, considering that it uses a completely different design, these results suggest that TimeSformer is a promising approach even for challenging, temporally-heavy datasets such as SSv2.
  • Long-term video modeling: the HowTo100M dataset; only a subset is considered (categories with at least 100 videos).
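For the three variants above, the number of patch tokens the encoder must process per clip can be worked out from the 16 × 16 patch size quoted earlier (my own back-of-the-envelope arithmetic; the class token is not counted):

```python
def num_tokens(frames, height, width, patch=16):
    # Tokens per clip = frames x (height/patch) x (width/patch)
    return frames * (height // patch) * (width // patch)

print(num_tokens(8, 224, 224))    # TimeSformer:     8 * 14 * 14 =  1,568
print(num_tokens(16, 448, 448))   # TimeSformer-HR: 16 * 28 * 28 = 12,544
print(num_tokens(96, 224, 224))   # TimeSformer-L:  96 * 14 * 14 = 18,816
```

These counts make clear why attention over all token pairs is expensive and why the divided scheme matters at higher resolutions and longer clips.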

Conclusion

In this work, we introduce TimeSformer, a fundamentally different approach to video modeling compared to the established paradigm of convolution-based video networks.
We show that it is possible to design efficient, scalable video architectures based entirely on spatio-temporal self-attention. Our approach:
(1) is conceptually simple,
(2) achieves state-of-the-art results on major action recognition benchmarks,
(3) is cheap to train and infer,
(4) can be applied to clips longer than one minute, enabling long-term video modeling.
In the future, we plan to extend our method to other video analysis tasks such as action localization, video captioning, and question answering.

Source: blog.csdn.net/qq_41112170/article/details/130026964