Paper Interpretation: Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Language-based Action Concept Spaces Improve Video Self-Supervised Learning


Note: Due to recent research needs, I am posting this translated summary here first.

Paper address: https://arxiv.org/pdf/2307.10922v3.pdf


Abstract

Recent contrastive language-image pre-training has led to highly transferable and robust image representations. However, adapting these models to the video domain with minimal supervision remains an open problem. We explore a simple step in this direction, using language-tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained in a self-distillation setting, with training objectives operating in an action concept space. This space is constructed from feature vectors of various action concepts, extracted from a language encoder using relevant textual prompts. A large language model aware of actions and their attributes generates the relevant textual prompts. We introduce two training objectives, concept distillation and concept alignment, which retain the generality of the original representations while enforcing relations between actions and their attributes. Our method improves zero-shot and linear probing performance on three action recognition benchmarks.

Introduction

Actions in videos are defined by individual objects, their relationships, and their interactions [1, 2]. Video self-supervised learning focuses on discovering representations that understand such action attributes directly from video content, without human supervision [3]. This self-supervised approach is especially valuable for video, where human annotation can be expensive and noisy.

A recent self-supervised variant explores learning from loosely aligned image-caption pairs, resulting in highly transferable and robust representations such as CLIP [4]. The zero-shot performance achieved by these methods is often comparable to fully supervised methods. However, their counterparts in the video domain [5, 6, 7, 8, 9, 10, 11] do not exhibit the same generality. In fact, some methods that train CLIP on videos [11, 12] perform worse than image CLIP in the zero-shot setting (see Table 2). This behavior can be attributed to the limited availability and noisy nature of labeled (or caption-paired) video datasets [3]. This has motivated the exploration of self-supervised learning (SSL) techniques that can learn from videos with less supervision while leveraging existing representations such as image CLIP [4]. Existing state-of-the-art video SSL methods [13, 14] learn highly transferable representations from videos, but combining these representations with image CLIP representations is not straightforward. In fact, although methods like SVT [13] can leverage image SSL representations [15] as weight initialization to achieve better performance, using image CLIP representations for weight initialization instead results in performance below that of image CLIP (see Table 4). This raises the need for alternative video SSL methods that are compatible with CLIP-like image representations, and it is our main motivation.

In this work, we explore self-supervised learning techniques to adapt the image CLIP model [4] to the video domain in a fully self-supervised setting, without relying on any form of video-level labels or captions. Even in this setting, natural language can provide strong cues about the attributes that constitute an action category [16, 17]. We exploit this idea to propose a novel language-based self-supervised learning objective. Following standard self-distillation, multi-view SSL formulations [15, 13], we introduce language-aligned feature spaces, called action concept spaces, in which our SSL objectives operate. Large language models (LLMs) [18], given their extensive world knowledge [19, 20], serve as ideal tools for generating the necessary textual concepts for these spaces. We also introduce regularization suited to our language-based SSL objectives to prevent collapse during training. We call our final framework language-based self-supervision (LSS).

Compared with existing video self-supervised learning methods [13, 14], our proposed LSS better preserves and improves the transferability of image CLIP representations (see Table 1 and Table 4). Furthermore, our language-aligned learning framework allows direct zero-shot operation on downstream tasks. Unlike video CLIP methods with similar zero-shot capabilities [5, 6, 7, 8, 9, 10, 11], which use per-video labels/captions for learning, our proposed LSS requires only videos for training.

We summarize our main contributions as follows:
• A self-supervised learning paradigm that retains and improves the advantages of CLIP image representations when operating in the video domain
• Video-specific self-supervised learning objectives, namely concept distillation and concept alignment, that strengthen the relations between action categories and their visual attributes
• A novel language-based video self-supervised learning framework that performs zero-shot operation on downstream action classification tasks without requiring per-video labels/captions for training
• Experiments on action recognition datasets showing that our learned representations achieve state-of-the-art performance in linear probing, standard zero-shot, and transductive zero-shot settings

Language-based Self-Supervision (LSS)

In this section, we present our proposed approach: language-based self-supervision (LSS). The generality and robustness of a shared image-language representation space such as CLIP [4] allows interesting manipulations of visual representations using language. We explore such manipulations in the context of visual self-supervised learning focused on video understanding. Our self-supervision objectives operate within latent spaces constructed using language, maintaining the language alignment of the learned visual representations. This allows better interpretability of the representations as well as zero-shot inference. We discuss four key components of our approach: the backbone architecture, the concept distillation objective, modifications to avoid collapse, and the concept alignment objective.

—Backbone Architecture

Our method introduces a text classifier into self-distillation-based SSL frameworks [15, 13], replacing the projector network. Given a data sample x, let x1, x2 ∈ R^(C×T×H×W) be two augmented views generated using the video-specific transformations of [13], where C = 3, T = 8, and H = W = 224 are the channel, temporal, and spatial dimensions, respectively.

Visual encoder: The visual encoder θv processes each view xi to produce a feature fi ∈ R^768. We leverage the pre-trained image encoder of CLIP [4], modified to perform temporal modeling using factorized spatiotemporal attention. The vision transformer variant of CLIP is chosen because it allows such a factorization. In particular, we use the ViT-B/16 architecture as the image encoder, where for a given augmented view with H = W = 224 and T = 8, each transformer block sequentially processes 8 temporal tokens and 196 spatial tokens, each with an embedding dimension of 768. In addition to the input tokens from the data sample, a classification (CLS) token [64, 65], common to CLIP image encoders, serves as the final feature vector fi output by the network. This classification token is appropriately expanded and processed following [66] to accommodate our factorized spatiotemporal attention modification. Also following [66], we zero-initialize the additional temporal attention parameters, so that at the start of training the network produces the same outputs as the pre-trained CLIP image encoder.
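To make the factorization concrete, below is a minimal PyTorch sketch (not the authors' code) of a single transformer block with factorized spatiotemporal attention and a zero-initialized temporal projection, in the spirit of [66]. The MLP sub-layer and CLS-token handling are omitted for brevity, and all class and variable names are my own.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Sketch of one block with factorized space-time attention
    (ViT-B/16 sizes assumed: T=8 time steps, N=196 spatial patches, dim=768)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Extra temporal projection, zero-initialized so the block initially
        # behaves like the pre-trained (spatial-only) CLIP image block.
        self.temporal_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.temporal_proj.weight)
        nn.init.zeros_(self.temporal_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, time steps, spatial patches, embed dim
        B, T, N, D = x.shape
        # Temporal attention: each spatial location attends across the T time steps.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        qt = self.norm_t(xt)
        t_out, _ = self.temporal_attn(qt, qt, qt)
        xt = xt + self.temporal_proj(t_out)   # zero-init => identity at step 0
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its N patches.
        xs = x.reshape(B * T, N, D)
        qs = self.norm_s(xs)
        s_out, _ = self.spatial_attn(qs, qs, qs)
        xs = xs + s_out
        return xs.reshape(B, T, N, D)

# Example: 8 frames of 224x224 give 196 patches with ViT-B/16 (16x16 patches).
tokens = torch.randn(2, 8, 196, 768)
out = FactorizedSpaceTimeBlock()(tokens)   # same shape: (2, 8, 196, 768)
```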

Text classifier: Inspired by [67], a set of n language embeddings extracted from the CLIP text encoder θt is used as the weight parameters of a linear layer (without a bias term), which we call the text classifier θc. The role of this text classifier is to project the visual feature fi into the vector space defined by these n embeddings, yielding f̃i ∈ R^n. We next discuss the details of these vector spaces (called action concept spaces) and the text classifier module.
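A minimal sketch of how such a text classifier might be built, assuming the n concept embeddings have already been extracted with the CLIP text encoder; the tensor names and the decision to freeze the weights are my assumptions, not necessarily the authors' exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_text_classifier(concept_embeddings: torch.Tensor) -> nn.Linear:
    """Turn n text embeddings of shape (n, d) into a bias-free linear layer
    whose rows are the normalized concept-space basis vectors b_i."""
    basis = F.normalize(concept_embeddings, dim=-1)
    classifier = nn.Linear(basis.shape[1], basis.shape[0], bias=False)
    with torch.no_grad():
        classifier.weight.copy_(basis)
    classifier.weight.requires_grad_(False)   # assumed frozen (the text encoder is not trained)
    return classifier

# Usage with hypothetical tensors: n = 530 concepts, d = 768-dim features.
text_features = torch.randn(530, 768)           # stand-in for CLIP text-encoder outputs
theta_c = build_text_classifier(text_features)
f_i = F.normalize(torch.randn(1, 768), dim=-1)  # visual feature from the video encoder
f_tilde = theta_c(f_i)                          # (1, 530): projection onto the concept space
```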

—Action Concept Spaces

Self-supervised learning methods based on exponential moving average (EMA) self-distillation [68, 15, 13] use projection networks (MLPs) to operate in higher-dimensional feature spaces. This is expected to minimize the train-test domain gap, handle noisy positive pairs, and better distinguish subtle feature differences [69]. Building on these ideas, we propose an alternative: concept spaces consisting of sets of basis vectors defined by language-based action concepts. Our language-based self-supervision objectives operate within such concept spaces.
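For reference, the EMA-based teacher update that such self-distillation methods rely on can be sketched as below; this is a generic DINO/SVT-style update, and the momentum value here is a placeholder rather than the paper's setting.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Update teacher parameters as an exponential moving average of the
    student's, as in EMA-based self-distillation [68, 15, 13]."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

# Toy usage: the teacher starts as a copy of the student and trails it thereafter.
student = nn.Linear(768, 256)
teacher = nn.Linear(768, 256)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
ema_update(teacher, student)
```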

Concept space: Based on the premise that text encoder features capture subtle differences between action categories, we hypothesize that the necessary nuances between these actions will be better captured in our proposed concept spaces. The defining parameters of a concept space are its basis vectors bi. Normalized embeddings (extracted from the text encoder θt) of various natural language descriptions ci associated with action categories are used as these basis vectors.

Category concept space: We explore three different strategies to construct the category concept space. The basic setup uses action labels from the Kinetics-400 [70], UCF-101 [71], and HMDB-51 [72] datasets, resulting in a set of 530 (400 + 101 + 51, ignoring overlap) basis vectors. The latter two strategies aim to tap into the action awareness of LLMs: we utilize an LLM [18] and a visual LLM [73] to extract a larger set of action category labels. While we explore extending the basis vector set with additional LLM-generated action labels in Section 4, the basic set of a modest 530 categories is sufficient to improve downstream task performance.
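As an illustration, the category concept space could be built roughly as follows using the OpenAI CLIP package; the prompt template and the three-label stand-in list are my own placeholders (the actual set has roughly 530 unique labels drawn from Kinetics-400, UCF-101, and HMDB-51).

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Stand-in for the union of Kinetics-400, UCF-101 and HMDB-51 action labels
# (~530 unique categories once overlaps are removed).
labels = ["archery", "brushing teeth", "playing guitar"]

# Hypothetical prompt template; the paper's exact prompts may differ.
prompts = [f"a video of a person {name}" for name in labels]
tokens = clip.tokenize(prompts).to(device)
with torch.no_grad():
    basis = model.encode_text(tokens)             # (n, d) text features
basis = basis / basis.norm(dim=-1, keepdim=True)  # normalized basis vectors b_i
```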

Description concept space: This space is constructed conditioned on the category concept space above. For each action label used in that space, we use a large language model (LLM) to extract four different descriptions and a set of visual features related to that action label. The role of the LLM is to inject its knowledge of the world (i.e., awareness of videos, actions, and their attributes) into the representations we learn during self-supervised training. In detail, we prompt GPT-3 [18] to generate such descriptions and features using the procedure outlined in Appendix A. We emphasize that GPT-3 is used here as a knowledgeable LLM incorporating world knowledge of videos and actions, in order to create natural-language descriptions of given action category labels. The textual outputs generated for each action label are processed by our text encoder to produce multiple embeddings per action label. These embeddings are averaged to produce the corresponding basis vector of the description concept space. Note that this yields a one-to-one correspondence between the basis vectors of the two concept spaces (which share a common dimensionality), which we exploit in our self-supervision objectives.
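A sketch of the description concept space construction under the same assumptions: the hand-written descriptions below stand in for the GPT-3 outputs described in Appendix A, and each label's description embeddings are averaged into a single basis vector so that the two concept spaces stay aligned.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Stand-ins for the four LLM-generated descriptions per action label.
descriptions_per_label = {
    "archery": [
        "a person draws a bow and releases an arrow at a target",
        "an archer stands side-on and pulls the bowstring back",
        "aiming a bow steadily before letting the arrow fly",
        "shooting arrows at a distant target on a range",
    ],
    # ... one entry per label of the category concept space
}

basis_vectors = []
for label, descriptions in descriptions_per_label.items():
    tokens = clip.tokenize(descriptions).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)           # (4, d) description embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)
    basis_vectors.append(emb.mean(dim=0))         # average -> one basis vector per label
basis = torch.stack(basis_vectors)                # (n, d), aligned with the category space
```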

Conclusion

We introduce a novel language-based self-supervised learning (SSL) method for video, called LSS, capable of adapting powerful language-aligned image representations (CLIP [4]) to the video domain. In particular, we propose two SSL objectives based on self-distillation: concept distillation and concept alignment. Our method trains without video-level labels or paired captions, similar to prior video SSL works, yet preserves the language alignment of image CLIP, enabling direct zero-shot inference. We demonstrate state-of-the-art linear probing performance with the learned representations on downstream tasks. For zero-shot operation, LSS shows strong performance in both standard and transductive settings, suggesting a promising direction for video SSL.

Limitations, future work, and broader implications: The language alignment of LSS may be limited primarily to per-frame static information, since the alignment is derived from image CLIP [4]. LSS cannot differentiate motion-based categories such as "object moving from left to right". Furthermore, while highly discriminative and general at the image level, CLIP features lack object-level spatial awareness [75]. Our proposed model is built on these representations and is thus inherently limited in understanding object-level motion and interactions in videos. However, recent advances in localization-aware CLIP models [75, 97, 98] open avenues to leverage object-centric or pixel-level representations to better model such video motion patterns, an interesting direction for future work. In terms of broader implications, the datasets and pre-trained models we use may contain biases, which may be reflected in LSS. However, our reduced reliance on human annotation may limit additional bias.

Reproducibility statement: We built our code base on top of the SVT [13] and CLIP [4] source code and used the pre-trained CLIP weights from https://github.com/openai. All experiments use publicly available datasets. Details of our procedures will be publicly released along with our code base.


Origin blog.csdn.net/qq_40514113/article/details/135375837