CLIP follow-ups: LSeg, GroupViT, ViLD

This post has been sitting in drafts for two months without getting written; I finally buckled down to finish it~ day three of lying flat.

An overview of the areas where CLIP has been applied:

1. LSeg

Original paper address: https://arxiv.org/abs/2201.03546

Code: https://github.com/isl-org/lang-seg

This figure clearly shows the power of zero-shot segmentation: whatever class name you give it, it segments it out for you.

 Model structure:

 

The overall model looks very similar to CLIP, except that the single global image feature is replaced by dense, pixel-by-pixel features for semantic segmentation. Apart from multiplying the text features from the text encoder with the dense image features to compute a pixel-level image-text similarity, the whole network is exactly the same as a traditional supervised segmentation network: an image goes through a segmentation model to produce a feature map, which is then enlarged by some upscaling operations, because for a dense prediction task the final output has to be the same size as the original image. The output of the model is then trained with a cross-entropy loss against the ground-truth masks.

The image encoder is a DPT structure, i.e. a ViT backbone followed by a decoder whose job is to upscale the bottleneck feature. It turns an H×W×3 image into an $\widetilde{H}\times\widetilde{W}\times C$ feature map, where $\widetilde{H}$ and $\widetilde{W}$ may be downsampled relative to the original image and C is the feature dimension, usually 512 or 768; that is all for the visual features. The text branch passes N labels through the text encoder to obtain N text features of dimension N×C, where N can be changed at will. Multiplying the visual features by the text features gives an $\widetilde{H}\times\widetilde{W}\times N$ tensor, which is no different from the output of a traditional segmentation model. So although the paper talks about zero shot, the training itself is actually supervised.
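A minimal PyTorch-style sketch of this pixel-level similarity step, assuming the dense image features and the frozen CLIP text features have already been computed (all shapes and tensor names here are illustrative, not taken from the official repo):

```python
import torch
import torch.nn.functional as F

# Assumed inputs (illustrative): dense image features from the DPT decoder
# and text features from the frozen CLIP text encoder.
img_feat = torch.randn(2, 512, 120, 160)   # B x C x H~ x W~
txt_feat = torch.randn(7, 512)             # N labels x C

# Normalize both modalities so the dot product is a cosine similarity.
img_feat = F.normalize(img_feat, dim=1)
txt_feat = F.normalize(txt_feat, dim=1)

# Pixel-level image-text similarity: B x N x H~ x W~
logits = torch.einsum('bchw,nc->bnhw', img_feat, txt_feat)

# Upscale to the original image size, then train with cross-entropy
# against the ground-truth segmentation mask (labels in [0, N-1]).
logits = F.interpolate(logits, size=(480, 640), mode='bilinear', align_corners=False)
target = torch.randint(0, 7, (2, 480, 640))
loss = F.cross_entropy(logits, target)
```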

The significance of this paper is that it adds a text branch to the pipeline of a supervised segmentation model and combines text features with visual features, so the model learns language-aware visual features and can then produce whatever segmentation you want simply by changing the text prompt.

Other details are as follows:

(1) The entire text encoder of LSeg uses the model and weights of the CLIP text encoder, and it stays frozen during both training and inference (see the sketch after this list);

(2) The image encoder of LSeg can be any network (CNN/ViT), which needs to be trained;

(3) Spatial Regularization Blocks are a module proposed in this paper; they add some learnable parameters that further process the result after the pixel-level image-text similarity has been computed, and are built from ordinary convolutions and depthwise convolutions.
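As a rough illustration of point (1), freezing the CLIP text encoder could look like the sketch below, assuming OpenAI's open-source clip package; the label strings are just examples:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Freeze the CLIP weights: the text encoder is used as-is during training and inference.
for p in model.parameters():
    p.requires_grad = False

labels = ["dog", "grass", "sky", "other"]          # illustrative label set
tokens = clip.tokenize(labels).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)      # N x C text features
```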

2. GroupViT

Original paper address: https://arxiv.org/abs/2202.11094

Code: https://github.com/NVlabs/GroupViT

Although LSeg uses CLIP's pre-trained parameters, it still relies on manually annotated semantic masks as the supervisory signal rather than using text as supervision, and semantic masks are very expensive to label. This paper is a real contribution in the direction of using text as the supervisory signal. Why is it called GroupViT? On the vision side, a long time ago, unsupervised segmentation often used grouping: start from some cluster center points and expand outward, absorbing similar neighboring points into a group, which is equivalent to a semantic mask; this is a bottom-up approach. The authors believe this idea can be reused in the current framework. They propose a computation unit called the grouping block, shown on the right of the figure below, together with some learnable group tokens. The goal is for the model, as it learns, to gradually group neighboring pixels into one semantic mask after another. From the figure you can see that in the shallow layers of the model the segmentation is not yet very good, but by the deep layers, after learning, the result is already very good.

The image encoder is a vision transformer with 12 transformer layers in total. Its input has two parts: one is the patch embeddings of the original image, i.e. the output of the linear projection layer; the other is the learnable group tokens proposed in this paper, shown as the colored rectangles on the right.

For the patch embedding: given a 224×224 image and a 16×16 patch size, you get a 14×14 grid of patches, i.e. a sequence of length 196; after the linear projection this yields the patch embeddings, with dimension 196×384.
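A tiny sketch of where those numbers come from, using ViT-Small-style dimensions and the standard Conv2d-as-linear-projection trick (not code from the GroupViT repo):

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)  # the linear projection layer

img = torch.randn(1, 3, 224, 224)          # 224x224 input image
x = patch_embed(img)                       # 1 x 384 x 14 x 14
x = x.flatten(2).transpose(1, 2)           # 1 x 196 x 384: sequence of patch embeddings
print(x.shape)                             # torch.Size([1, 196, 384])
```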

The group tokens can be understood like the earlier cls token, i.e. a token that represents the whole image, but there are 64 of them instead of 1: a single token means the whole image gets one feature, whereas here we want segmentation, so each category or each small region should get its own feature. The two kinds of token are learned in the same way, through the self-attention in the transformer layers, which figures out which group token each image patch belongs to. After 6 transformer layers, a grouping block is used for clustering: the image patch embeddings are assigned to the 64 group tokens, which is essentially a cluster assignment. Since there are 64 cluster centers, 64 tokens remain at the end. Another advantage of the grouping block is that it effectively shortens the sequence, so the computational complexity and running time drop accordingly.

The grouping block first computes a similarity matrix in a way similar to self-attention and uses it to assign each image token to a cluster center, thereby reducing the sequence length. The hard assignment of cluster centers is not differentiable, so a trick, the Gumbel softmax, is used to make it differentiable, and the whole model can then be trained end to end.
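A simplified sketch of that assignment idea: an attention-like similarity plus a straight-through Gumbel softmax so the hard assignment stays differentiable. This is a schematic under assumed shapes, not the exact GroupViT code:

```python
import torch
import torch.nn.functional as F

B, N_img, N_grp, C = 2, 196, 64, 384
patch_tokens = torch.randn(B, N_img, C)   # image tokens after 6 transformer layers
group_tokens = torch.randn(B, N_grp, C)   # learnable group tokens (cluster centers)

# Similarity between every group token and every image token (like attention scores).
sim = torch.einsum('bkc,bnc->bkn', group_tokens, patch_tokens) / C ** 0.5

# Hard one-hot assignment of each image token to one group, kept differentiable
# via the straight-through Gumbel-softmax trick (hard forward, soft gradients).
assign = F.gumbel_softmax(sim, tau=1.0, hard=True, dim=1)   # B x 64 x 196

# Aggregate: each group token becomes the normalized sum of its assigned patches.
new_groups = torch.einsum('bkn,bnc->bkc', assign, patch_tokens)
new_groups = new_groups / (assign.sum(dim=-1, keepdim=True) + 1e-6)
print(new_groups.shape)   # torch.Size([2, 64, 384]) -> sequence length reduced to 64
```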

Since most datasets and images do not contain that many categories, the authors want to reduce the 64 cluster centers further, so they add 8 new group tokens and, through 3 more transformer layers and another grouping block, map the 64 group tokens onto these 8 cluster centers, ending up with tokens of size 8×384; in other words, the image is divided into 8 large segments, each with its own feature.

The training process is similar to CLIP: image-text pairs are used to compute a contrastive loss that trains the whole network. There is one problem, though. In CLIP an image is one feature and a text is one feature, but here the text is one feature while the image has 8 segment features, so the authors fuse the 8 segment features with an average pooling, pass the result through an MLP to get a feature for the whole image, and from then on it is exactly the same as CLIP.
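A hedged sketch of that pooling step and the CLIP-style contrastive loss (the MLP, temperature, and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C = 32, 384
group_tokens = torch.randn(B, 8, C)    # 8 segment features per image
text_feat = torch.randn(B, 512)        # one text feature per caption

# Average-pool the 8 group features into one image feature, then project
# into the joint image-text space with a small MLP.
mlp = nn.Sequential(nn.Linear(C, C), nn.GELU(), nn.Linear(C, 512))
img_feat = mlp(group_tokens.mean(dim=1))

# Standard CLIP-style symmetric contrastive loss over the batch.
img_feat = F.normalize(img_feat, dim=-1)
text_feat = F.normalize(text_feat, dim=-1)
logits = img_feat @ text_feat.t() / 0.07           # temperature is illustrative
labels = torch.arange(B)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```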

During inference, the text encoder generates one feature per category, and the image side produces 8 group embeddings that are compared against them. Since there are only 8 group embeddings, at most 8 categories can be detected in one image.
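Inference then reduces to comparing each of the 8 group embeddings with each class's text feature and taking the argmax; a small sketch (names illustrative):

```python
import torch
import torch.nn.functional as F

group_emb = F.normalize(torch.randn(8, 512), dim=-1)    # 8 group embeddings from the image
class_emb = F.normalize(torch.randn(20, 512), dim=-1)   # text features of the candidate classes

sim = group_emb @ class_emb.t()          # 8 x num_classes cosine similarities
group_class = sim.argmax(dim=-1)         # class index for each group / segment
# The pixels assigned to group k (via the recorded grouping-block assignments)
# are then labeled with class group_class[k].
```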

Limitation: only 8 categories can be recognized per image; more than that does not work, although the number is adjustable, and the authors found 8 to work best.

3. ViLD

Original paper address: https://arxiv.org/abs/2104.13921

Code: https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild

What ViLD does: on top of an existing dataset, without extra annotation, it gains the ability to detect object categories it was never labeled with.

In the figure, (a) is the supervised baseline method, while (b), (c), and (d) are the network structures of ViLD.

The baseline is in fact a Mask R-CNN, a two-stage detector. In the first stage, N proposals are generated; in the second stage, the detection head turns the N proposals into region embeddings, which then pass through a classification head to get the category of each bounding box, completing the object detection.

From the perspective of the objective function, object detection can be split into two parts: localization and classification. Localization is about whether the bounding box is drawn accurately; classification is about whether the object inside the box is identified correctly. This paper decouples these two pieces, so all the framework diagrams here start from the second stage: the input is the N proposals, and the first stage is not drawn.

(1) ViLD-text: to do zero-shot object detection you have to bring in text, so how is the text added? The easiest way is to do what CLIP does: use an image backbone to extract image features, use some text network to extract text features, and then take the dot product of the two to get their similarity. That is exactly what the authors do here, in the simplest way possible. The N proposals go into the detection head and, after some operations, come out as N region embeddings, much like the N region embeddings of the baseline network. The next step is to compute the text embeddings: take the object categories, wrap them in some prompts to form sentences, and feed those sentences to any text encoder. Note that the text comes from the object categories, i.e. this is still supervised learning, and the categories are still the base categories, the basic classes of the dataset. Both the baseline and ViLD-text are trained with supervision on the same dataset, on the base classes, so at this stage ViLD-text only connects the text features with the image features, and the zero-shot performance still needs strengthening. In the figure the text embedding is drawn in blue, meaning the model parameters there are always locked and never take part in training; like LSeg, the text side is frozen. Once you have the image features and the text features, you take their dot product, the similarity serves as the logits for the final classification, and you compute a cross-entropy loss to train the model.

Background class: because training is supervised on the base classes, every class that is not one of the base classes has to be stuffed into the background class, so learning this background class well is crucial; a dedicated background embedding is learned specifically for it.
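A schematic of the ViLD-text classification step, including the learned background embedding; this is a sketch under assumed shapes and an illustrative temperature, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, C, num_base = 100, 512, 48              # proposals, embedding dim, base classes
region_emb = torch.randn(N, C)             # region embeddings from the detection head
text_emb = torch.randn(num_base, C)        # frozen CLIP text embeddings of base-class prompts
bg_emb = nn.Parameter(torch.randn(C))      # learnable background embedding

region_emb = F.normalize(region_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
bg = F.normalize(bg_emb, dim=-1).unsqueeze(0)

# Similarities to [background, base classes] serve as the classification logits.
logits = region_emb @ torch.cat([bg, text_emb], dim=0).t() / 0.01   # temperature illustrative
targets = torch.randint(0, num_base + 1, (N,))                      # 0 = background
loss = F.cross_entropy(logits, targets)
```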

(2) ViLD-image: the CLIP pre-trained image encoder is very good and already aligns very well with text, so the authors want the image embeddings output by their model to be as consistent as possible with those output by CLIP, and the natural way to achieve this is knowledge distillation.

Once you have some proposals, i.e. the obtained bounding boxes, you can crop them out, resize them, and feed them to the CLIP pre-trained image encoder to get image features. The pre-trained image encoder is also frozen, which ensures the extracted features stay as good as CLIP's. This branch, the right-hand branch of ViLD-image, is used as the teacher network, while the student network is the object detection framework from before: take the proposals, pass them through the detection head, and extract features. The authors want these features to be as close as possible to the features extracted by CLIP, and a simple L1 loss is enough to do the distillation. It is worth noting that the supervisory signal here is no longer manual annotation but the image embeddings produced by CLIP, so it is not restricted to the base classes: the extracted proposals can come from the base classes or from new classes.
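The distillation itself is just an L1 loss between the detector's region embeddings and the CLIP image embeddings of the cropped-and-resized proposals; a minimal sketch, assuming the teacher features have been precomputed:

```python
import torch
import torch.nn.functional as F

M, C = 1000, 512
student_emb = torch.randn(M, C, requires_grad=True)  # region embeddings from the detection head
with torch.no_grad():
    teacher_emb = torch.randn(M, C)                  # CLIP embeddings of cropped, resized proposals

# Knowledge distillation: pull the student features toward the frozen CLIP features.
distill_loss = F.l1_loss(F.normalize(student_emb, dim=-1),
                         F.normalize(teacher_emb, dim=-1))
```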

The final framework (d) combines the two. The left side is the object detection branch and the right side is the CLIP image embedding branch; the right side is only used during training and is discarded at test time. To keep the computation simple, the authors feed the N proposals and the M pre-computed proposals to the detection head together, obtain N+M embeddings, and then split them: the N embeddings are used for the cross-entropy loss, and the M embeddings from the pre-computed proposals are used for the distillation L1 loss.
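Roughly, the combined training step looks like the sketch below (the detection head stand-in, shapes, and loss weighting are all illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, M, C, num_base = 100, 1000, 512, 48
detection_head = nn.Linear(1024, C)                 # stand-in for the real detection head

# One pass over the N proposals plus the M pre-computed proposals, then split.
roi_feats = torch.randn(N + M, 1024)
emb = detection_head(roi_feats)
emb_text, emb_image = emb.split([N, M], dim=0)

# The N embeddings feed the ViLD-text cross-entropy (background + base classes)...
class_emb = F.normalize(torch.randn(num_base + 1, C), dim=-1)
ce_loss = F.cross_entropy(F.normalize(emb_text, dim=-1) @ class_emb.t(),
                          torch.randint(0, num_base + 1, (N,)))

# ...while the M embeddings feed the L1 distillation toward the CLIP image features.
clip_emb = F.normalize(torch.randn(M, C), dim=-1)
l1_loss = F.l1_loss(F.normalize(emb_image, dim=-1), clip_emb)

total_loss = ce_loss + l1_loss   # the weighting between the two losses is illustrative
```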

There are still a few papers left, and I'm tired just from watching other people's explainer videos. I marvel at the big shots' brains; why can't mine remember anything! smile.jpg

 
