Core vision tasks based on VLP(7)

        Computer vision is already ubiquitous in our society, with applications in areas such as visual search, image understanding, mapping, medicine, and self-driving cars. At the core of these applications are visual recognition tasks such as image classification and object detection, whose main goal is to assign semantically meaningful concepts to visual instances such as images or regions. Traditional computer vision systems are trained to predict a fixed set of predefined concepts, such as image class labels on ImageNet/JFT-300M or object categories on COCO. Although near-human-level performance has been reported on these tasks, this restricted, closed-set formulation limits the models' generalizability and usability, because additional annotated data is required to cover semantic concepts unseen in the training data. In this chapter, we describe how recent advances in VLP address core visual recognition problems.

1. Basic principles of paradigm shift

        Recent computer vision systems are trained with free-form natural language supervision, ranging from simple object category names to descriptive captions. These language-augmented visual models exhibit strong transfer capabilities. We believe the following two factors have contributed to this paradigm shift.

(1) Open-set recognition is achieved by transforming the problem from classification to retrieval.

        Traditional forms of classification define and learn a fixed set of embedding vectors, each representing an object category, and the model cannot predict or convey concepts beyond this closed set. Another option is to cast image classification as an image-to-text retrieval task, where, given an image (or image region), the matching concept is retrieved. Parametric models such as neural networks are employed to encode both images and language (concepts), and dense retrieval is performed to find the concepts most relevant to an image.

(2) Language supervision improves the generality and usability of the model, allowing a wide range of visual concepts to be represented.

        A fixed set of visual concepts is an oversimplified representation, a consequence of the need for compactness in the classification head. In contrast, the text encoder introduced by the retrieval formulation can handle a much larger concept pool. Natural language is semantically richer than any collection of concept labels (e.g., object categories). The text-sequence form of language also allows external knowledge (e.g., from WordNet or Wikipedia) to be represented in the same format as image captions and concept labels, further improving concept coverage.

Below, we introduce some representative models:

VLP models developed for core computer vision problems

        

A glossary of representative VLP models for core vision tasks. For data size, we report the number of image-text pairs, including image labels and image captions. IC: image classification. OD: object detection. LocNar: Localized Narratives. GoldG: the gold grounding annotation data curated in MDETR. ITC: image-text contrastive learning. WRA: word-region alignment. TP: tag prediction. SSL: self-supervised learning.

        We list a glossary of representative VLP models, described along multiple dimensions, and show in the accompanying figure how these VLP models have evolved over time. This line of research equips computer vision models with open-set visual recognition capabilities, opening up the possibility of building broadly applicable computer vision systems with strong task-level transfer capabilities, thus paving the way for Computer Vision in the Wild (CVinW).

2. Image classification

We use image-caption matching for image classification. In this method, we define a triplet dataset S = {(x_n, t_n, y_n)}_{n=1}^N containing images and their corresponding language descriptions, where x denotes an image, t its language description, and y a label indicating which images and descriptions correspond to the same concept (so that pairs sharing the same y are treated as matched).

Our goal is to learn a general and rich visual-semantic representation that correctly aligns images with their language descriptions, which is how image classification is performed. For each image x, we use an image encoder f_θ, parameterized by θ, to represent it as a visual feature vector ṽ = f_θ(x) ∈ R^{P×1}. For each language description t, we encode it with a text encoder f_φ, parameterized by φ, resulting in a feature vector ũ = f_φ(t) ∈ R^{P×1}. Note that ṽ and ũ are vector representations of the entire image and the entire sentence, respectively.

For the i-th image x_i and the j-th language description t_j in a batch B, we normalize their feature vectors onto the unit hypersphere to obtain v_i = f_θ(x_i) / ||f_θ(x_i)|| and u_j = f_φ(t_j) / ||f_φ(t_j)||, and then compute their similarity s_ij = v_i^⊤ u_j. In this way, we can score how well each image matches each language description, which is the basis of the image classification task.
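A minimal PyTorch sketch of this encoding-and-matching step, with simple linear layers standing in for the image encoder f_θ and the text encoder f_φ (any backbone producing P-dimensional features would do):

```python
import torch
import torch.nn.functional as F

# Placeholder encoders: in practice f_theta is a vision backbone (e.g. a ViT)
# and f_phi is a text Transformer; here we fake them with linear projections.
P = 512                                   # shared embedding dimension
f_theta = torch.nn.Linear(2048, P)        # hypothetical image encoder head
f_phi = torch.nn.Linear(768, P)           # hypothetical text encoder head

images = torch.randn(8, 2048)             # batch of 8 pre-pooled image features
texts = torch.randn(8, 768)               # their 8 language descriptions

# v_i = f_theta(x_i)/||f_theta(x_i)||, u_j = f_phi(t_j)/||f_phi(t_j)||
v = F.normalize(f_theta(images), dim=-1)  # image features on the unit hypersphere
u = F.normalize(f_phi(texts), dim=-1)     # text features on the unit hypersphere

# s_ij = v_i^T u_j: cosine similarity between every image and every description
s = v @ u.t()                             # shape [8, 8]
print(s.shape)
```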

These are the steps of the image-description matching approach to image classification.

Figure 1: Image description matching for image classification

2.1  UniCL

        UniCL (Unified Contrastive Learning) is trained with L_UniCL, a bidirectional supervised contrastive learning objective (an image-to-text term plus a text-to-image term) defined by the matching relationship between images and language descriptions:

        Here, τ is a temperature hyperparameter that controls how strongly hard negative samples are penalized, and P(i) and Q(j) denote the sets of positive descriptions for image i and positive images for description j, respectively. In Figure 1, two images share the same language concept "dog", so according to the UniCL formulation the corresponding elements in the contrastive target matrix are marked as positive. By scaling UniCL to 800M training samples, Microsoft obtained the Florence model, which achieved SoTA performance on many tasks at the time.
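A minimal PyTorch sketch of such a label-aware bidirectional contrastive objective; this is an illustrative implementation in which positives are all pairs that share a label, and details such as where the temperature enters may differ from the exact UniCL formulation:

```python
import torch
import torch.nn.functional as F

def unicl_style_loss(v, u, labels, tau=0.07):
    """Bidirectional supervised contrastive loss over an image-text-label batch.

    v: [B, P] L2-normalized image features
    u: [B, P] L2-normalized text features
    labels: [B] concept labels; entries sharing a label are treated as positives.
    """
    logits = v @ u.t() / tau                              # [B, B] scaled similarities
    pos = (labels[:, None] == labels[None, :]).float()    # positive-pair mask (P(i)/Q(j))

    # image-to-text term: average log-prob of all positive texts for each image
    log_prob_i2t = logits.log_softmax(dim=1)
    loss_i2t = -(pos * log_prob_i2t).sum(1) / pos.sum(1)

    # text-to-image term: symmetric, softmax taken over the image axis
    log_prob_t2i = logits.log_softmax(dim=0)
    loss_t2i = -(pos * log_prob_t2i).sum(0) / pos.sum(0)

    return (loss_i2t.mean() + loss_t2i.mean()) / 2

B, P = 8, 512
v = F.normalize(torch.randn(B, P), dim=-1)
u = F.normalize(torch.randn(B, P), dim=-1)
labels = torch.tensor([0, 1, 2, 2, 3, 4, 5, 2])   # three captions share concept "2"
print(unicl_style_loss(v, u, labels))
```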

2.2 CLIP/ALIGN

        CLIP and ALIGN are based on the assumption that, within a batch, there is a strict one-to-one mapping between an image and its paired description, i.e., P(i) = {i} and Q(j) = {j}. The training objectives of CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are the corresponding special case of the bidirectional contrastive objective, where each image has only its paired description as a positive (and vice versa).

        For the example in Figure 1, CLIP or ALIGN only treats diagonal elements as positive and all off-diagonal elements as negative. Ideally, CLIP or ALIGN should be applied to image-text pairs that do not have duplicates in either modality.
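With the one-to-one assumption, the positive mask collapses to the diagonal and the objective becomes the familiar symmetric InfoNCE loss; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(v, u, tau=0.07):
    """Symmetric image-text contrastive loss with only diagonal positives.

    v, u: [B, P] L2-normalized image and text features of paired data.
    """
    logits = v @ u.t() / tau
    targets = torch.arange(v.size(0), device=v.device)   # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)           # softmax over texts
    loss_t2i = F.cross_entropy(logits.t(), targets)       # softmax over images
    return (loss_i2t + loss_t2i) / 2

v = F.normalize(torch.randn(8, 512), dim=-1)
u = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_style_loss(v, u))
```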

Relation to the traditional classification formulation.

Note that L_UniCL is closely related to the standard cross-entropy loss used in supervised image classification. Specifically, the image-to-text contrastive term reduces to cross-entropy as a special case when the following three conditions are met: (i) the text encoder f_φ is a simple linear embedding layer W; (ii) the batch size |B| is much larger than the number of categories K, so that with random sampling the embedding vectors of all categories participate in contrastive learning; and (iii) τ = 1 and no normalization is performed, so that ũ = u and ṽ = v. In practice, these conditions are easily satisfied, and the objective simplifies to the standard cross-entropy loss,

        where ŷ_i is the ground-truth label of the i-th image in the batch.
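Written out, with w_k denoting the k-th row of W (the embedding of category k), the image-to-text term under conditions (i)-(iii) reduces to the familiar softmax cross-entropy (a sketch of the reduction, not necessarily the survey's exact equation):

```latex
\mathcal{L}_{i2t}
  = -\sum_{i \in \mathcal{B}}
    \log \frac{\exp\left(v_i^{\top} w_{\hat{y}_i}\right)}
              {\sum_{k=1}^{K} \exp\left(v_i^{\top} w_k\right)}
```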

Other language-image pre-training methods for image classification.

        Learning visual backbones from web-scale image-text pairs is an emerging research topic. Recently, an increasing number of papers have appeared that aim to improve the performance of zero-shot/few-shot image classification in practical scenarios.

• Improved contrastive pre-training objectives

FILIP (Yao et al., 2022) introduces a method that learns fine-grained region-word correspondences. PyramidCLIP (Gao et al., 2022b) constructs an input pyramid with different semantic levels and aligns the two modalities hierarchically through intra-level and cross-level alignment. Prefix conditioning (Saito et al., 2022) uses prefix prompts to combine image-caption and image-label data, selecting the appropriate prefix according to the data type. CyCLIP (Goel et al., 2022) shows that consistent representations can be learned by explicitly symmetrizing the similarity between the two mismatched image-text pairs (cross-modal consistency) as well as the similarity between the image-image pair and the text-text pair (in-modal consistency).

• Self-supervision + contrastive objectives

DeCLIP (Li et al., 2022j) comprehensively studies multiple uni-modal self-supervised signals within image-text pairs. SLIP (Mu et al., 2021) studies combining image-based self-supervised learning with image-text contrastive learning. Masked image/language modeling has also been combined with image-text contrastive learning, as in MultiMAE (Bachmann et al., 2022) and M3AE (Geng et al., 2022).

• Frozen models

LiT (Zhai et al., 2022) introduced "contrastive tuning" (locked-image tuning), showing that locking the pre-trained image encoder and tuning only the text encoder works best for zero-shot transfer. Flamingo (Alayrac et al., 2022) leverages frozen pre-trained models for each single modality and continues to pre-train cross-modal modules, achieving impressive image classification performance via in-context learning.

• Scaling up

Motivated by the good results of web-scale pre-training on computer vision tasks, more and more studies explore scaling up VLP models. BASIC (Pham et al., 2021) scales the contrastive learning framework of CLIP and ALIGN along three dimensions (data size, model size, and batch size), achieving 85.7% zero-shot accuracy on ImageNet. LIMoE (Mustafa et al., 2022) is a sparse mixture-of-experts model capable of language-image multimodal learning. The Pathways Language and Image model (PaLI) (Chen et al., 2022e) finds that jointly scaling the vision and language components is important. Because existing Transformer language models are much larger than their vision counterparts, PaLI trained the largest ViT to date (at the time) to quantify the benefits of higher-capacity vision models, pre-training on a large multilingual mix of tasks and a new image-text training set containing more than 10B images with text in over 100 languages.

In the literature, there are two different experimental setups to evaluate the open-set image classification capabilities of pre-trained models.

• Category-level transfer within a single domain

Traditional zero-shot transfer evaluation, which has been studied for decades, pre-defines a manual split within a given visual domain, ensuring that the evaluation concepts are not observed during training. Examples include Animals with Attributes (AwA), Caltech-UCSD Birds-200 (CUB), SUN, aPY (Farhadi et al., 2009), and ZS-ImageNet.

• Task-level transfer

To demonstrate the broad applicability and generalizability of CLIP, Radford et al. (2021) directly applied pre-trained checkpoints to recognize arbitrary concepts on about 30 public image classification datasets. Impressive results were reported even though the model had never observed images from these downstream datasets. This quickly popularized zero-shot task-transfer evaluation for computer vision foundation models. Many variants of CLIP have been proposed, but these works use different downstream datasets for evaluation, making their results hard to compare. The recent Image Classification in the Wild (ICinW) benchmark is an attempt at task-level evaluation and covers 20 public datasets (Li et al., 2022b).


Top: CLIP pre-trains an image encoder and a text encoder to predict which images are paired with which texts in a dataset/batch. This behavior lets us turn CLIP into a zero-shot classifier: we convert all categories into captions, such as "a photo of a dog", and predict the category whose caption best matches the given image.
Bottom: Prediction results of the zero-shot CLIP classifier on examples from four datasets. This figure is from Radford et al. (2021).
 

        Application examples of language-image models in IC. In the figure above, we illustrate how an image-text contrastively trained model like CLIP can be used for zero-shot image classification. Given a new IC dataset/task with a set of concept/category names, each concept is converted into a caption using various text templates. The captions serve as prompts for the text encoder to extract concept representations. The query image is fed into the image encoder to extract a visual representation, which is used to compute the similarity to all concepts; the concept with the highest similarity is the prediction. At the bottom of the figure, four cases are shown, one from ImageNet and the other three from ICinW, representing real-world IC scenarios.
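As a concrete illustration, this zero-shot pipeline can be run with the Hugging Face transformers CLIP wrappers; the sketch below assumes the openai/clip-vit-base-patch32 checkpoint, and the single prompt template, category list, and image path are placeholder choices:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Convert each category name of the new IC task into a caption via a text template.
class_names = ["dog", "cat", "guacamole", "television studio"]
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("query.jpg")   # placeholder path to the query image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarities (already temperature-scaled).
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(class_names[probs.argmax().item()], probs.max().item())
```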

3. Object detection

        A typical object detection task consists of two subtasks. First, the localization task aims to determine the presence of objects in an image and indicate their locations with bounding boxes. Second, the recognition task determines the object categories present in the regions of interest (bounding boxes). The recognition task is similar to the image classification task (Section 2), except that in IC classification is performed on the entire image, while in OD it is performed on individual regions/boxes. Therefore, by following the same conversion of classification into retrieval (as described in Section 2), the transferability of OD models to open-set recognition can be improved. Specifically, each region/box feature goes through two prediction heads, a box classifier and a box regressor, which are trained with a classification loss L_cls and a localization loss L_loc, respectively: L = L_cls + L_loc.

3.1 Single-stage model

        In the traditional formulation of object detection, the box classifier is implemented with a simple linear layer, and the classification logits and loss can be written as S_cls = O W^⊤ and L_cls = loss(S_cls; T).

Here, O ∈ R^{M×d} are the object/region/box features of the input image, W ∈ R^{K×d} is the weight matrix of the box classifier, S_cls ∈ R^{M×K} are the output classification logits, T ∈ {0, 1}^{M×K} is the target, and loss(S; T) is a loss function, such as the focal loss used in single-stage object detection models.

        GLIP (Li et al., 2022h) reformulates OD as a phrase grounding task: instead of classifying each region/box into one of K categories, it grounds/aligns each region in the image with one of the K phrases in a text prompt t, and computes the alignment scores S_ground between regions of image x and words of prompt t:

Here, P ∈ R^{L×d} are the contextualized word/token features from the language encoder, and L is the length of the language prompt t, so that S_ground = O P^⊤ ∈ R^{M×L}. P plays a role similar to the weight matrix W in the classification logits above. The grounding model, consisting of an image encoder f_θ and a language encoder f_φ, is trained end-to-end by minimizing the detection loss, simply replacing the classification logits S_cls with the region-word alignment scores S_ground. In Figure 4.4 of the survey, an example of S_ground is shown for 4 region-word pairs. It is worth noting that all bounding-box proposals used to compute S_ground come from a single image, and matched pairs receive higher scores than mismatched pairs.
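A minimal PyTorch sketch of the region-word alignment computation (shapes follow the notation above; the tensors are random stand-ins, GLIP's deep image-language fusion is omitted, and the phrase-to-token spans are hypothetical; aggregating token scores by averaging is only one simple choice, as GLIP handles sub-word tokens in its loss formulation):

```python
import torch

M, L, d, K = 100, 24, 256, 4   # boxes, prompt tokens, feature dim, phrases

# O: region/box features from the image encoder; P: contextualized token features
# from the language encoder over a prompt such as "person. bicycle. car. dog."
O = torch.randn(M, d)
P = torch.randn(L, d)

# S_ground = O P^T plays the role of S_cls = O W^T, with words instead of classes.
S_ground = O @ P.t()            # [M, L] region-word alignment scores

# Score a phrase spanning several tokens by aggregating its token columns,
# here simply by taking the mean over the tokens belonging to that phrase.
phrase_token_ids = [[1, 2], [4], [6, 7], [9]]          # hypothetical token spans
S_phrase = torch.stack(
    [S_ground[:, ids].mean(dim=1) for ids in phrase_token_ids], dim=1
)                                # [M, K] scores usable in place of class logits
print(S_ground.shape, S_phrase.shape)
```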

3.2 Two-stage model

By distilling knowledge from pre-trained CLIP/ALIGN models into a two-stage detector, ViLD (Gu et al., 2022d) and RegionCLIP (Zhong et al., 2022) enable zero-shot object detection. In a two-stage detector, an independent region proposal network (RPN) is used to distinguish foreground from background; since its loss L_rpn does not use the semantic information of object categories, it can be folded into the localization loss L_loc. In RegionCLIP, the RPN proposes image regions for all images in a batch, yielding N regions in total, denoted {r_i}_{i=1}^N. Given a proposed region r_i, a visual representation v_i is produced by the visual encoder with a feature pooling method such as RoIAlign. RegionCLIP also builds a pool of candidate concepts for image regions, which often differ from the concepts describing full images; these concepts are expressed in natural language and encoded into semantic representations {u_k}_{k=1,...,K} by a pre-trained text encoder, where K is the size of the concept pool. Using the pre-trained CLIP model, the concept u with the highest matching score is selected as the pseudo label for each region r, constructing positive pairs {u, v}. The object detection model is then trained with a similar contrastive learning framework plus an additional distillation loss.
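A minimal sketch of the pseudo-labeling step described above, with random tensors standing in for the RPN region features and the concept-pool embeddings (RegionCLIP's contrastive and distillation losses are not shown):

```python
import torch
import torch.nn.functional as F

N, K, d = 64, 1000, 512   # proposed regions, concept-pool size, embedding dim

# v_i: pooled (e.g. RoIAlign) region features from the visual encoder.
# u_k: concept embeddings produced once by the pre-trained CLIP text encoder.
v = F.normalize(torch.randn(N, d), dim=-1)
u = F.normalize(torch.randn(K, d), dim=-1)

# Match every region against the whole concept pool with the teacher CLIP model
# and take the best-scoring concept as the region's pseudo label.
scores = v @ u.t()                       # [N, K] region-concept similarities
pseudo_labels = scores.argmax(dim=1)     # index of the selected concept per region

# Positive pairs {u, v} that feed the region-text contrastive (and distillation)
# losses of the student detector.
positive_text = u[pseudo_labels]         # [N, d]
print(pseudo_labels[:5], positive_text.shape)
```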

Other language-image pre-training methods for object detection.

Learning universal open-set object detectors from image-text pairs has become an increasingly popular topic. Similar to GLIP, MDETR (Kamath et al., 2021) reformulates detection as a phrase grounding problem and uses a single text query for the entire image. FIBER (Dou et al., 2022a) improves on GLIP with a coarse-to-fine pre-training process and by performing fusion in the backbone rather than in the detection head. OVR-CNN (Zareian et al., 2021) fine-tunes an image-text model for detection on a limited vocabulary and relies on image-text pre-training to generalize to the open-vocabulary setting. Detic (Zhou et al., 2022e) improves long-tail detection performance under weak supervision by training the classification head on examples that have only image-level annotations. Other concurrent works include OV-DETR (Zang et al., 2022), X-DETR (Cai et al., 2022), FindIt (Kuo et al., 2022), PromptDet (Feng et al., 2022), and OWL-ViT (Minderer et al., 2022).

In the literature, there are two different experimental setups used to evaluate the open-set object detection capabilities of pre-trained object detection models.

• Category-level transfer within a single domain

A common zero-shot transfer evaluation in object detection follows the setup of Zareian et al. (2021), where a manual split is predefined within a given visual domain, ensuring no concept overlap between training and evaluation. For example, on LVIS (Gupta et al., 2019), 866 common categories are used as base categories for training and 337 rare categories are evaluated as novel categories. On COCO, a split of 48 base categories and 17 novel categories is used, with 15 categories that have no synonyms in the WordNet hierarchy removed.

• Task-level transfer

This is an increasingly popular setting, where pretrained object detection models are evaluated in a zero-shot manner on multiple datasets. For example, inspired by CLIP, the LVIS-trained model in ViLD (Gu et al., 2022d) was evaluated on 3 datasets, including PASCAL VOC, COCO, and Objects365. The recent ODinW benchmark generalizes task-level evaluation to a more comprehensive scope, with 13 datasets originating from Li et al. (2022h) and 35 datasets formally defined in Li et al. (2022b).

Application examples of language-image models in object detection.

Top: GLIP pre-trains an image encoder, a text encoder, and a fusion module to predict which box regions of an image are paired with which words/phrases of the text prompt. This behavior lets us turn GLIP into a zero-shot object detector: we concatenate all category names of a dataset into a caption and predict the word/phrase of the caption that GLIP estimates best pairs with each given box.
Bottom: Example predictions of the zero-shot GLIP object detector on six datasets from ODinW (Li et al., 2022b). This figure was created by Li et al. (2022h).

        

        In the figure above, we show how a GLIP-like region-phrase matching model can be used for zero-shot object detection. Given a new object detection dataset/task and its set of concept/category names, all concepts are concatenated into a caption, optionally with some simple user-defined text prompts added. The caption serves as the prompt from which the text encoder extracts concept representations. The query image is fed into the image encoder to extract visual representations, and the similarity to all concepts is then computed with a deep fusion module. Similarities exceeding a given threshold produce predictions: boxes for the regions of interest together with their matched concepts. At the bottom of the figure, six application cases are shown, all drawn from the ODinW benchmark, representing real-world object detection scenarios.

4. Image segmentation

Image segmentation involves grouping image pixels and assigning a class label to each pixel of the image. We take language-driven semantic segmentation (LSeg) (Li et al., 2022a) as an example to illustrate the process, where text categories and image pixels are embedded in a common space and each pixel is assigned to a semantic category. For a semantic segmentation task with K category labels, the text encoder embeds them into a continuous vector space R^d, producing an embedding matrix P = [p_1, · · · , p_K] ∈ R^{K×d} containing all categories. For an image x, the image encoder produces a dense grid representation O ∈ R^{H×W×d}, where H and W specify the spatial dimensions of the feature map. The word-grid similarity tensor is computed as the dot product S_seg = O P^⊤ ∈ R^{(H×W)×K}.
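A minimal PyTorch sketch of this word-grid matching, with random tensors standing in for the LSeg encoders (the dense-prediction decoder and the spatial regularization block are omitted, and the temperature value is a placeholder):

```python
import torch
import torch.nn.functional as F

H, W, d, K = 32, 32, 512, 5   # feature-map size, embedding dim, number of classes

O = F.normalize(torch.randn(H, W, d), dim=-1)   # dense per-pixel image embeddings
P = F.normalize(torch.randn(K, d), dim=-1)      # embeddings of the K class names

# S_seg = O P^T: similarity of every spatial location to every class embedding.
S_seg = torch.einsum("hwd,kd->hwk", O, P)       # [H, W, K]

# Per-pixel prediction and the standard per-grid cross-entropy with temperature.
tau = 0.07
pred = S_seg.argmax(dim=-1)                     # [H, W] predicted class map
target = torch.randint(0, K, (H, W))            # placeholder ground-truth labels
loss = F.cross_entropy((S_seg / tau).reshape(-1, K), target.reshape(-1))
print(pred.shape, loss.item())
```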

Figure 6: Pixel-phrase matching used for segmentation

        In Figure 6, we show a simplified example of S_seg computed for 4 word-grid pairs. Note that all grid features used to compute S_seg are extracted from a single image, and matched pairs are scored higher than mismatched pairs. For each grid location, we minimize a per-grid softmax cross-entropy loss with temperature scaling, which is standard in semantic segmentation. In LSeg, features are decoded with a dense prediction Transformer (Ranftl et al., 2021), and a final spatial regularization block spatially regularizes and cleans up the predictions. Since image-text paired data contains rich semantic information, many other methods use language-image models for segmentation, as discussed below:

• CLIP-based segmentation

Many segmentation models directly adapt pre-trained CLIP models to pixel-level visual recognition tasks, including PhraseCut (Wu et al., 2020), OpenSeg (Ghiasi et al., 2022), CLIPSeg (Lüddecke and Ecker, 2022), ZS-Seg (Xu et al., 2021d), MaskCLIP (Zhou et al., 2022a), DenseCLIP (Rao et al., 2021), and MaskCLIP (Ding et al., 2022b). OpenSeg (Ghiasi et al., 2022) also uses class-agnostic mask annotations for model learning to generate mask proposals.

• Training from scratch

GroupViT (Xu et al., 2022) is a new hierarchical grouping Transformer architecture that leverages the Transformer’s global self-attention mechanism to segment the input image into progressively larger groups of arbitrary shapes. It is pre-trained using a multi-label image-text contrastive loss on approximately 12 million image-text pairs. Since GroupViT automatically groups images into semantically similar segments, its output can be easily converted into semantic segmentation without fine-tuning.

5. Trends in computer vision in the wild

        In the three subsections above, we described how a closed-set recognition model can be extended to perform three open-set recognition tasks: image classification, object detection, and segmentation. The key is to use parametric functions, such as neural language models, to represent categories, instead of the traditional non-parametric representations such as one-hot vector embeddings. Although this enables open-set recognition, such models still lack the ability to perform well on a wide range of downstream tasks in the wild, where the visual appearance of input images and the semantics of output categories often vary significantly between applications.

Figure 7: Illustration of the Computer Vision in the Wild (CVinW) setting, compared with other settings. This two-dimensional space is constructed with two dimensions: input images and output concepts. The 2D diagram is divided into four quadrants based on the requirements between the model development phase and the model evaluation phase. In the example provided with the standard setting, natural images with the concepts "people, sheep, dogs" are presented. Image from Li et al. (2022b).

        

        In Figure 7, we use the definition from Li et al. (2022b) to compare four settings studied by the computer vision community: the traditional closed-set recognition setting (lower-left quadrant), the open-set recognition setting (upper-left quadrant), the domain adaptation or out-of-distribution setting (lower-right quadrant), and the CVinW setting (upper-right quadrant). Clearly, CVinW accounts for changes in both the visual domain and the concept domain. In fact, any visual recognition task can be naturally defined by a customized set of concepts and a given visual domain. From this perspective, CVinW considers task-level transfer, which goes beyond the concept/category-level transfer common in traditional open-set recognition settings. In the schematic below, we use the same image as above to illustrate the differences between these settings.


A schematic diagram of the different visual recognition setups is shown below.

        The goal of developing foundation models for computer vision in the wild is twofold:

• The ability to transfer to a wide range of new downstream tasks. This means the application scenarios of the foundation model are broad. Mature datasets like ImageNet and COCO represent closed-set tasks for image classification and object detection, respectively. In real-world settings, both the visual domain and the set of concepts can change significantly beyond ImageNet and COCO. The effectiveness of a foundation model is therefore better measured by its breadth of applicability than by its performance on any specific task.

• Low adaptation cost for task transfer. A major advantage of pre-trained foundation models is that they can be easily (or cheaply) transferred to downstream tasks, so adaptation efficiency is an important factor in measuring the usability of a foundation model; a good foundation model should be deployable with minimal adaptation effort. To measure adaptation cost, Li et al. (2022b) define it along two orthogonal dimensions: sample efficiency (measured by the number of training examples) and parameter efficiency (measured by the number of trainable parameters). Mature datasets like ImageNet and COCO do not provide ideal evaluation settings for foundation models: achieving SoTA performance on them often requires full fine-tuning of the complete model, which results in high adaptation cost. As a guideline, a foundation model with fixed weights should be able to perform zero-shot transfer well across many downstream tasks.

        Methods to achieve the above goals can be developed for a range of computer vision tasks either individually or jointly. When developed individually, the setup is to build a separate foundation model for each problem; most of the VLP models described in this chapter fall into this category. When developed jointly, the setup is to build a unified foundation model across all tasks. Computer vision tasks require processing images at different levels of granularity (image, region, pixel), which makes unification across tasks challenging. Building an AI system that can exploit vision-language data at different granularity levels to seek the best trade-off between data scale and semantic richness remains an attractive research topic.

6. Summary and advanced topics

As the VLP literature on core computer vision problems grows rapidly, more and more papers and interesting research topics are emerging. Below, we briefly discuss some important topics: knowledge-augmented visual models, multilingual language-image models, efficient and robust model adaptation, benchmarks, and more.

• Knowledge-augmented visual models. Text encoders are arguably the most unique component of recently developed language-augmented computer vision systems, so improving their ability to encode text is very important for core visual recognition tasks. K-LITE (Shen et al., 2022a) enriches entities in natural language with the WordNet/Wikipedia knowledge bases, providing a scalable way to learn a wide range of new tasks in a zero-shot and few-shot manner for image classification and object detection. Compared with CLIP/UniCL/GLIP, K-LITE is more efficient in pre-training. Tian et al. (2021) explore the use of external knowledge to improve long-tail visual recognition within a single domain, which falls under category-level transfer.

• Multilingual language-image contrastive learning. The success of image-text contrastive learning with English captions has inspired the use of other language sources. MURAL (Jain et al., 2021) is pre-trained from scratch on multilingual image-text pairs, with an image-to-text contrastive loss and a text-to-text contrastive loss between different languages. By distilling from the original English CLIP, Carlsson et al. (2022) trained language-specific text encoders while keeping the image encoder unchanged. Contrastive language-image models for other multilingual/bilingual/monolingual settings include Korean (Ko and Gu, 2022), Italian (Bianchi et al., 2021), Russian (Shonenkov et al., 2022), and Chinese (Gu et al., 2022a).

• Efficient adaptation methods. As model sizes grow, how to effectively adapt pre-trained models to various downstream tasks becomes a problem. Research covers both sample efficiency (e.g., zero-shot and few-shot adaptation) and parameter efficiency (e.g., prompt tuning, linear probing, and full-model fine-tuning). VLP models offer unique opportunities to leverage the text encoder for model adaptation, including conditional prompt learning (Zhou et al., 2022b), colorful prompt tuning (CPT) (Yao et al., 2021), VL-Adapter (Sung et al., 2022b), and CLIP-Adapter (Gao et al., 2021). A comprehensive study on parameter efficiency can be found in He et al. (2022).

• Robustness. Wortsman et al. (2022) studied robust fine-tuning of zero-shot models. Fang et al. (2022a) report that for CLIP, the pre-training data determines distributional robustness. Fine-tuning CLIP can distort pre-trained features and hurt out-of-distribution performance (Kumar et al., 2022). The original CLIP paper reports that with very few examples, few-shot adaptation can be less effective than zero-shot transfer. In contrast, Li et al. (2022b) show that few-shot CLIP consistently outperforms zero-shot CLIP when the pre-trained text encoder is used properly for model adaptation.

• Benchmarks. Efficiently transferring pre-trained language-augmented vision models to downstream datasets and tasks, and evaluating them fairly, remains challenging. ELEVATER (Li et al., 2022b) provides an evaluation platform for language-augmented visual models: it includes datasets and an easy-to-use toolkit for evaluating the task-level transfer capabilities of pre-trained vision models, unlike traditional benchmarks that evaluate category-level zero-shot transfer. It is used in the aforementioned ICinW and ODinW challenges to provide a common playing field for computer vision in the wild.

• Open visual relationship recognition. The idea of open-set recognition has been extended to more visual recognition tasks, such as relationship detection. Relational Language-Image Pre-training (RLIP) (Yuan et al., 2022) improves zero-shot, few-shot, and fine-tuned human-object interaction (HOI) detection performance and enhances robustness to learning from noisy annotations.

• Open video classification. Multimodal Open-Vocabulary Video Classification (MOV) (Qian et al., 2022) uses the visual encoder of a pre-trained text-image model, with minimal modifications, to encode video, optical flow, and audio spectrograms, and designs a cross-modal fusion mechanism to aggregate complementary multimodal information. X-CLIP (Ni et al., 2022) adapts pre-trained text-image models to video recognition: it uses a cross-frame attention mechanism to explicitly exchange information between frames and a video-specific prompting scheme that leverages video content to generate discriminative textual prompts. Readers interested in "Computer Vision in the Wild" (i.e., VLP for core vision tasks) can refer to the CVinW reading list on GitHub: Computer-Vision-in-the-Wild/CVinW_Readings, a collection of papers on the topic of "Computer Vision in the Wild (CVinW)".

Reference:

  Vision-Language Pre-training: Basics, Recent Advances, and Future Trends


Origin blog.csdn.net/qq_41458274/article/details/133280172