Cross-Modal Prompt Tuning

Paper name: CPT: Colorful Prompt Tuning for Pre-Trained Vision-Language Models

[1] Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.-S., & Sun, M. (2021). CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.

Paper link: https://arxiv.org/pdf/2109.11797.pdf

1. Background

       Grounding natural language in fine-grained image regions is important for many tasks, such as robot navigation, visual question answering (VQA), visual dialogue, and visual commonsense reasoning (VCR). Recent pre-trained vision-language models (VL-PTMs) have shown strong capability on visual grounding. The conventional recipe first pre-trains cross-modal representations on large-scale image-caption data in a self-supervised manner, then fine-tunes them on downstream tasks. This pre-train-then-fine-tune approach has achieved state-of-the-art results on many cross-modal tasks.

       However, there is a large gap between the objective forms used in downstream fine-tuning and in pre-training. As illustrated in the paper, pre-training uses a masked language modeling (MLM) objective, recovering the words at [MASK] positions from cross-modal context, whereas fine-tuning classifies the representations of unmasked tokens into semantic labels. This procedure typically introduces additional task-specific parameters.

       Influenced by recent advances in prompting pre-trained language models in NLP, this paper proposes Cross-modal Prompt Tuning (CPT), a new paradigm for tuning VL-PTMs. The key idea is to introduce the same color-based markers into both the image and the text, recasting visual grounding as a fill-in-the-blank problem and thereby minimizing the gap between pre-training and fine-tuning.

2. Preliminaries

2.1 Visual grounding task

       In the literature, visual grounding is usually formulated as referring expression comprehension (REC): given an image I and a query text q, the goal of REC is to locate the target region in I that q refers to. The common fine-tuning approach for VL-PTMs is introduced next.

2.2 Vanilla fine-tuning

       The conventional approach first uses an object detector to extract a set of region proposals {v1, v2, ..., vn}, then classifies and ranks these proposals to select the target region. Specifically, the visual and textual inputs are converted into a sequence of input tokens, typically of the form {[CLS], w1, w2, ..., wm, [SEP], v1, v2, ..., vn, [SEP]}, where w1, w2, ..., wm are the tokens of the query text q.

       {v1, v2, ..., vn} are encoded into embeddings by a visual encoder, while text tokens and special tokens are mapped to embeddings through a lookup table. The embedded input sequence is then fed into the pre-trained transformer to produce hidden representations for every position (h[CLS], hw1, ..., hwm, hv1, ..., hvn).
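       As a concrete illustration, below is a minimal PyTorch-style sketch of this input pipeline. All names (InputEmbedder, region_proj, etc.) are illustrative assumptions rather than the paper's code, and the visual encoder is reduced to a linear projection over detector features.

```python
import torch
import torch.nn as nn

class InputEmbedder(nn.Module):
    """Sketch: embed text tokens via a lookup table and region proposals
    via a visual encoder, then concatenate into one input sequence."""

    def __init__(self, vocab_size: int, region_feat_dim: int, hidden: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)       # lookup table for text/special tokens
        self.region_proj = nn.Linear(region_feat_dim, hidden)  # stand-in for the visual encoder

    def forward(self, token_ids: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # token_ids:    (batch, m+3) ids for [CLS] w1..wm [SEP] ... [SEP]
        # region_feats: (batch, n, region_feat_dim) detector features for v1..vn
        text_emb = self.word_emb(token_ids)
        region_emb = self.region_proj(region_feats)
        # The concatenated sequence is what the pre-trained transformer consumes
        # to produce hidden states h[CLS], hw1..hwm, hv1..hvn.
        return torch.cat([text_emb, region_emb], dim=1)
```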

       Finally, the representations of the candidate regions are scored and optimized with a classification loss. (The article does not say which parameters this fine-tuning method optimizes; since it is vanilla fine-tuning, presumably both the backbone parameters and the classification head are updated.) This process introduces task-specific parameters, so fine-tuning VL-PTMs requires large amounts of labeled data to elicit their visual grounding ability.
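       For intuition, here is a minimal sketch of the task-specific head this procedure introduces, assuming the common setup of scoring each region's hidden state with a newly initialized linear layer trained with cross-entropy; the paper does not spell out this exact form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Newly initialized, task-specific parameters -- exactly the kind of
# extra parameters that CPT later avoids.
region_head = nn.Linear(768, 1)

def grounding_loss(region_hidden: torch.Tensor, target_idx: torch.Tensor) -> torch.Tensor:
    # region_hidden: (batch, n, 768) transformer hidden states hv1..hvn
    # target_idx:    (batch,) index of the gold region among the n proposals
    logits = region_head(region_hidden).squeeze(-1)  # (batch, n), one score per region
    # In vanilla fine-tuning, gradients flow into both the backbone and this head.
    return F.cross_entropy(logits, target_idx)
```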

3. Method

       CPT (Colorful Prompt Tuning) consists of two parts: a visual sub-prompt, whose purpose is to uniquely mark image regions with colored blocks or segmentation masks, and a textual sub-prompt, which places the query text into a color-based query template. With these two parts, recovering the [MASK] token in the query template as a color word solves the visual grounding task. For choosing the optimal colors in CPT (each color consists of two parts: a visible visual appearance and a text name), the paper proposes a search method for high-quality color configurations.

3.1 Visual sub-prompt

       Given an image I and a set of region proposals R = {v1, v2, ..., vn}, the visual sub-prompt uniquely marks the image regions with natural visual markers. Note that this use of color appears in earlier literature: for visualization, objects in an image are uniquely marked with colored bounding boxes. Inspired by that practice, this paper uses a color set C to bridge image and text, where each element of C consists of two parts: a visual appearance cv (e.g., an RGB value) and a color name cw (e.g., "red"). Each region proposal in the image is then marked with a unique visual appearance, producing a set of colored image proposals.
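       A minimal sketch of what the visual sub-prompt could look like in code, using PIL to overlay each proposal with a semi-transparent colored block; the specific colors, alpha value, and function names are illustrative assumptions, not the paper's implementation (which also supports segmentation masks).

```python
from PIL import Image, ImageDraw

# Illustrative color set C: each element pairs a text name cw with a visual
# appearance cv (an RGB value).
COLORS = [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]

def apply_visual_sub_prompt(image: Image.Image, boxes: list) -> Image.Image:
    # boxes: list of (x0, y0, x1, y1) region proposals from the detector
    img = image.convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    for (name, rgb), box in zip(COLORS, boxes):
        # Mark each region with a unique semi-transparent colored block.
        draw.rectangle(box, fill=rgb + (127,))
    return Image.alpha_composite(img, overlay)
```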

3.2 Textual sub-prompt

       The purpose of this part is to construct the link between the query text and the image regions marked by the visual sub-prompt. Using a template, the query text q is transformed into a fill-in-the-blank query (e.g., "q is in [MASK] color"). The VL-PTM is then prompted to choose the most appropriate color for the target image region by filling the [MASK] token according to the likelihood

$$P(v_i \mid I, q) = \frac{\exp(\mathbf{c}_w^i \cdot \mathbf{h}_{[\mathrm{MASK}]})}{\sum_{j=1}^{n} \exp(\mathbf{c}_w^j \cdot \mathbf{h}_{[\mathrm{MASK}]})}$$

where v_i is a candidate region, h_[MASK] is the hidden representation at the [MASK] position, and the bold c_w^i is the embedding of the color name c_w^i in the pre-trained MLM head. Note that this process introduces no additional parameters (inference reuses the color embedding vectors learned during pre-training).
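       A small sketch of this scoring step, assuming the usual MLM-head form where rows of the decoder weight matrix serve as the color-word embeddings; names such as color_probs and mlm_decoder_weight are hypothetical.

```python
import torch

def color_probs(h_mask: torch.Tensor,
                mlm_decoder_weight: torch.Tensor,
                color_word_ids: torch.Tensor) -> torch.Tensor:
    # h_mask:             (batch, hidden) hidden state at the [MASK] token
    # mlm_decoder_weight: (vocab, hidden) pre-trained MLM head embeddings (the bold c_w)
    # color_word_ids:     (n,) vocab ids of the n color names c_w^1..c_w^n
    color_emb = mlm_decoder_weight[color_word_ids]  # (n, hidden)
    logits = h_mask @ color_emb.T                   # (batch, n)
    # Softmax over the n colors gives P(v_i | ...) -- no new parameters introduced.
    return torch.softmax(logits, dim=-1)
```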

3.3 Training and Inference

       With CPT, VL-PTMs can perform zero-shot visual grounding without any labeled data, since the cross-modal representations of colors and related concepts have already been well learned during pre-training. When a small amount of, or all, labeled data is available, the pre-trained vision-language model can be further tuned through CPT with an entropy-based objective.
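       A sketch of both modes, assuming the entropy-based objective reduces to a negative log-likelihood over the gold region's color; cpt_color_probs stands in for the full pipeline of Sections 3.1-3.2 and is a placeholder, not the paper's interface.

```python
import torch

def zero_shot_ground(cpt_color_probs: torch.Tensor) -> int:
    # cpt_color_probs: (n,) distribution over the n region colors at [MASK].
    # Zero-shot: simply pick the region whose color best fills the blank.
    return int(torch.argmax(cpt_color_probs))

def few_shot_loss(cpt_color_probs: torch.Tensor, gold_region: int) -> torch.Tensor:
    # Few-shot / fully supervised: maximize the likelihood of the gold
    # region's color word (negative log-likelihood, no new parameters).
    return -torch.log(cpt_color_probs[gold_region])
```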

4. Experiments

The experiments are not covered here; interested readers can check them via the paper link above.

My confusion points from reading + my own guesses
1. Is coloring the image regions also performed during the pre-training task? Only under that reading does the first sentence of Section 3.3 hold up: it says that because the cross-modal representations of colors and other concepts are well learned by the VL-PTM, the model can perform zero-shot visual grounding. By this analysis, prompt techniques would be applied during the model's pre-training. From what I knew before about prompt applications, whether on the text side, the visual side, or in multi-modal tasks, this would be the first time the prompt method is used during pre-training.
2. In NLP, prompt tuning is introduced to reduce the number of tuned parameters: only the introduced prompt vectors are tuned. This article does not say how the parameters are adjusted. My guess: a fixed prompt is added during the visual model's pre-training (the prompt parameters being the color blocks; one section of the paper proposes a method to find the optimal color representation), while a template is constructed on the text side; during pre-training, the model parameters and the MLM head parameters are updated, which would explain point 1. If a small amount of, or all, labeled data is available, the model parameters and the MLM head can then be tuned further. In short, the prompt parameters and prompt vectors are never updated, similar to the fix-the-prompt, update-the-model practice in NLP prompting.
