Paper name: CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED
VISION-LANGUAGE MODELS
[1] Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T. S., & Sun, M. (2021). Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.
Paper link: https://arxiv.org/pdf/2109.11797.pdf
1. Article background
Localizing natural language expressions on fine-grained image regions is important in many tasks, such as robot navigation, visual question answering (VQA), visual dialogue, and visual commonsense reasoning (VCR). Recent pre-trained vision-language models (VL-PTMs) have shown strong capabilities on visual localization tasks. Traditionally, cross-modal representations are first pre-trained on large-scale image-caption data in a self-supervised manner, and then fine-tuned on downstream tasks. This pre-train-then-fine-tune paradigm has achieved state-of-the-art results on many cross-modal tasks.
However, there is a large gap between the objective forms used in pre-training and in downstream fine-tuning. As shown in the figure below, pre-training uses the masked language modeling (MLM) objective, recovering the words at the [MASK] positions using cross-modal information, whereas fine-tuning classifies the representations of unmasked tokens into semantic labels. This procedure often introduces additional parameters.
Influenced by recent advances in prompt-based learning for pre-trained language models in NLP, this paper proposes Cross-modal Prompt Tuning (CPT), a new paradigm for tuning VL-PTMs. The key idea is to introduce the same color-based markers into both images and texts, reformulating visual localization as a fill-in-the-blank problem and minimizing the gap between pre-training and fine-tuning.
2. Preliminary knowledge
2.1 Visual localization task
In the literature, visual localization is usually formulated as a referring expression comprehension (REC) task: given an image I and a query text q, the goal of REC is to locate the target region of I that corresponds to q. The common VL-PTM fine-tuning method is introduced next.
2.2 Ordinary fine-tuning methods
The conventional method first uses an object detector to extract a set of region proposals {v1, v2, ..., vn}, and then classifies and ranks these proposals to select the target region. Specifically, the visual and textual inputs are first converted into a sequence of input tokens, as shown in the figure below, where w1, w2, ..., wm are the text tokens of the query text q.
{v1, v2, ..., vn} are encoded into embedded representations by a visual encoder; the text and special tokens are converted into embedding vectors through a lookup table. The resulting input sequence is then fed into the pre-trained transformer to produce hidden representations, as shown in the following figure:
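The pipeline above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: all sizes, the linear visual encoder, and the identity stand-in for the transformer are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper)
vocab_size, d_model, n_regions, n_text = 100, 16, 4, 5

# Text tokens -> embeddings via a lookup table
embed_table = rng.normal(size=(vocab_size, d_model))
text_ids = rng.integers(0, vocab_size, size=n_text)
text_emb = embed_table[text_ids]                  # (n_text, d_model)

# Region-proposal features -> embeddings via a linear visual encoder
region_feats = rng.normal(size=(n_regions, 2048))  # e.g. detector features
W_vis = rng.normal(size=(2048, d_model)) * 0.01
region_emb = region_feats @ W_vis                 # (n_regions, d_model)

# Concatenate text and region embeddings into one input sequence
input_seq = np.concatenate([text_emb, region_emb], axis=0)

# Stand-in for the pre-trained transformer (identity here)
hidden = input_seq

# Task-specific classification head (the extra parameters mentioned above):
# score each region's hidden state and pick the highest-scoring one
w_cls = rng.normal(size=(d_model,))
region_scores = hidden[n_text:] @ w_cls           # (n_regions,)
target_region = int(np.argmax(region_scores))
print(input_seq.shape, region_scores.shape, target_region)
```

The point of the sketch is the last step: `w_cls` is a new, task-specific parameter that did not exist during pre-training, which is exactly the gap CPT aims to remove.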
Finally, the representations of the candidate regions are optimized with a classification loss. (The article does not specify which parameters are optimized; since this is the common fine-tuning setup, both the backbone parameters and the classification-head parameters should be updated.) This process introduces task-specific parameters, so fine-tuning VL-PTMs requires a large amount of labeled data to elicit the model's visual localization ability.
3. The method of this paper
CPT (Colorful Prompt Tuning) consists of two parts: a visual sub-prompt, which uniquely marks image regions with colored blocks or segmentation masks, and a textual sub-prompt, which puts the query text into a color-based query template. With these two parts, the [MASK] token in the query template can be recovered as a color word, achieving visual localization. For color selection in CPT (each color consists of two parts: a visible visual appearance and a text name), the paper proposes a method to search for high-quality color configurations.
3.1 Visual sub-prompt
Given an image I and a set of region proposals R = {v1, v2, ..., vn}, the purpose of the visual sub-prompt is to uniquely mark image regions with natural visual markers. Note that this use of color appears in earlier literature, where objects in an image are uniquely marked with colored bounding boxes for visualization. Inspired by that work, this paper uses a color set C to bridge the image and the text, where each element of C consists of two parts: a visual appearance and a color name in text.
Each region proposal in the image is then marked with a unique visual appearance, finally producing a set of colored region proposals.
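A minimal sketch of marking regions with colored blocks follows. The color set, box format, and the alpha-blend marking are illustrative assumptions; the paper's searched color configurations and masking details differ.

```python
import numpy as np

# Illustrative color set: each entry pairs a text name with an RGB appearance
# (not the paper's searched color configuration).
COLOR_SET = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def apply_visual_subprompt(image, boxes, alpha=0.5):
    """Blend a solid color block over each region proposal.

    image: (H, W, 3) uint8 array; boxes: list of (x1, y1, x2, y2).
    Returns the colored image and a region-index -> color-name mapping.
    """
    out = image.astype(np.float32).copy()
    names = list(COLOR_SET)
    mapping = {}
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        name = names[i % len(names)]
        rgb = np.array(COLOR_SET[name], dtype=np.float32)
        # Semi-transparent tint so the underlying content stays visible
        out[y1:y2, x1:x2] = (1 - alpha) * out[y1:y2, x1:x2] + alpha * rgb
        mapping[i] = name
    return out.astype(np.uint8), mapping

img = np.zeros((8, 8, 3), dtype=np.uint8)
colored, region_colors = apply_visual_subprompt(img, [(0, 0, 4, 4), (4, 4, 8, 8)])
print(region_colors)  # {0: 'red', 1: 'green'}
```

The mapping returned here is what ties each colored region to the color word that the textual sub-prompt asks the model to predict.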
3.2 Textual sub-prompt
The purpose of this part is to build the connection between the query text and the image regions marked by the visual sub-prompt. Using a template, the query text q is transformed into a fill-in-the-blank query. The VL-PTM is then prompted to choose the most appropriate color for the target image region to fill the [MASK] token, via the following likelihood calculation formula.
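The formula itself appears only as an image in the original notes. Based on the surrounding description (the pre-trained MLM head scores the color-word embeddings at the [MASK] position), a plausible reconstruction is:

```latex
P(v = v_i \mid q) =
  \frac{\exp\!\left(\mathbf{c}_{w_i}^{\top}\, \mathbf{h}_{[\mathrm{MASK}]}\right)}
       {\sum_{j=1}^{n} \exp\!\left(\mathbf{c}_{w_j}^{\top}\, \mathbf{h}_{[\mathrm{MASK}]}\right)}
```

Here $\mathbf{h}_{[\mathrm{MASK}]}$ is the hidden state at the [MASK] position and $\mathbf{c}_{w_i}$ is the MLM-head embedding of the color word $w_i$ marking region $v_i$; the softmax is taken over the color words of the candidate regions.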
v is the target region, and the bold c_wi is the embedding of the color word c_wi in the pre-trained MLM head. Note that this process introduces no additional parameters (inference uses the color-word embeddings learned during pre-training).
3.3 Training and Inference
With CPT, VL-PTMs can perform zero-shot visual localization without any labeled data, since the cross-modal representations of colors and other concepts have already been well learned during pre-training. When a small amount of labeled data, or the full labeled dataset, is available, the pre-trained vision-language model can be further fine-tuned with CPT using an entropy-based objective.
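Both modes reduce to operations on the color-word distribution at the [MASK] position. The sketch below uses made-up logits (not real model outputs) to show zero-shot inference as an argmax over that distribution, and few-shot tuning as a cross-entropy-style loss on the gold region's color:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical MLM-head scores over the color words of 3 colored regions
# at the [MASK] position (illustrative values only).
mask_logits_over_colors = np.array([2.0, 0.5, -1.0])
probs = softmax(mask_logits_over_colors)

# Zero-shot inference: pick the region whose color word is most likely.
pred_region = int(np.argmax(probs))

# Few-shot / full fine-tuning: negative log-likelihood of the gold
# region's color word, backpropagated through the existing MLM head
# (no new task-specific parameters needed).
gold_region = 0
loss = -np.log(probs[gold_region])
print(pred_region, round(float(loss), 4))
```

The key contrast with ordinary fine-tuning is that the loss here is computed against pre-existing vocabulary embeddings rather than a freshly initialized classification head.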
4. Experimental part
The experimental section is not covered here; interested readers can check it themselves via the article link above.
5. Confusing points from reading this paper + my own guesses