CLIP: Opening Up New Heights for Text-Image Transfer Models

1. Introduction

2021 has witnessed an explosion of vision transformers. After Google proposed ViT, a large number of vision transformers have swept across computer vision tasks. Besides vision transformers, another line of work with a great impact on computer vision is DALL-E and CLIP, released by OpenAI in January 2021. Both are multimodal models that combine images and text: DALL-E generates images from text, while CLIP uses text as a supervision signal to train transferable visual models. These two works have set off a new wave of research, much like ViT did. This article first introduces the principle of CLIP and how to implement zero-shot classification with it, then discusses the motivation behind CLIP, and finally covers CLIP variants and some other application scenarios.

2. Principle

The full name of CLIP is Contrastive Language-Image Pre-training: a pre-training method (and model) based on contrastive learning over text-image pairs. Unlike contrastive learning methods in CV such as MoCo and SimCLR, CLIP's training data consists of text-image pairs: an image together with its corresponding text description. The hope is that, through contrastive learning, the model learns the matching relationship between text and images. As shown in the figure below, CLIP consists of two models: a Text Encoder and an Image Encoder. The Text Encoder extracts text features and can be a text transformer commonly used in NLP, while the Image Encoder extracts image features and can be a common CNN or a vision transformer.
[Figure: CLIP's contrastive pre-training framework (Text Encoder, Image Encoder and the N x N similarity matrix)]
The extracted text features and image features are then used for contrastive learning. For a training batch containing N text-image pairs, combining the N text features and N image features pairwise gives N^2 possible text-image pairs, and CLIP predicts the similarity of each of them, where the similarity is simply the cosine similarity between the text feature vector and the image feature vector (each encoder's pooled global feature); this is the matrix shown in the figure above. There are N positive samples, i.e. the text and image that truly belong together (the diagonal elements of the matrix), while the remaining N^2 - N text-image pairs are negative samples. The training goal of CLIP is then to maximize the similarity of the N positive samples while minimizing the similarity of the N^2 - N negative samples. The corresponding pseudocode (from the paper) is as follows:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract image features and text features respectively
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# linearly project both features into the same embedding dimension and L2-normalize
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# compute the scaled pairwise cosine similarities: [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric contrastive loss: equivalent to a cross_entropy_loss over n classes
labels = np.arange(n) # labels of the diagonal (positive) pairs
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
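
For reference, here is a minimal runnable PyTorch sketch of the same symmetric contrastive loss. The function name and the assumption that the input features are already projected and L2-normalized are ours, not from the paper:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: [n, d_e], already L2-normalized
    logits_per_image = logit_scale * image_features @ text_features.t()  # [n, n] similarity matrix
    logits_per_text = logits_per_image.t()
    labels = torch.arange(image_features.shape[0], device=image_features.device)  # diagonal = positives
    loss_i = F.cross_entropy(logits_per_image, labels)  # each image should match its own text
    loss_t = F.cross_entropy(logits_per_text, labels)   # each text should match its own image
    return (loss_i + loss_t) / 2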

To train CLIP, OpenAI collected 400 million text-image pairs from the Internet; the paper calls this dataset WebImageText. In terms of word count it is similar in size to the WebText corpus used to train GPT-2, and in terms of quantity it is even 100 million larger than Google's JFT-300M dataset, so it is a very large dataset. Although CLIP is a multimodal model, it is mainly used to train transferable vision models. In the paper, the Text Encoder is fixed to a text transformer with 63M parameters, while the Image Encoder adopts two different architectures: the commonly used CNN architecture ResNet and the transformer-based ViT. ResNet comes in 5 sizes (ResNet50, ResNet101, RN50x4, RN50x16 and RN50x64, where the latter three are scaled up from ResNet50 by roughly 4x, 16x and 64x compute following the EfficientNet scaling rule), and ViT comes in 3 sizes (ViT-B/32, ViT-B/16 and ViT-L/14). All models are trained for 32 epochs with the AdamW optimizer and a very large batch size of 32768. Due to the large amount of data, the largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs, while the largest ViT model, ViT-L/14, took 12 days on 256 V100 GPUs, which shows how much compute training CLIP requires. ViT-L/14 was additionally fine-tuned for one more epoch at a resolution of 336 to further improve performance; the paper found this model to work best, denotes it ViT-L/14@336px, and uses it as the CLIP model in the comparative experiments.
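
OpenAI released these pre-trained weights through the official CLIP repository (linked in the references). Below is a minimal sketch of listing and loading a checkpoint with that package, assuming it is installed; the zero-shot examples in the next section build on this setup, and the choice of ViT-B/32 here is just for speed:

import os

import clip            # pip install git+https://github.com/openai/CLIP.git
import numpy as np
import torch
from PIL import Image

# list the released checkpoints, e.g. 'RN50', ..., 'ViT-L/14@336px'
print(clip.available_models())

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)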

3. Using CLIP for zero-shot classification

We introduced the principle of CLIP above. As you can see, the trained CLIP actually consists of two models: a visual model and a text model. So how is the pre-trained visual model transferred? Unlike the pre-train-then-fine-tune paradigm commonly used in CV, CLIP can directly perform zero-shot image classification, i.e. classification on a specific downstream task without any training data, which is the highlight and power of CLIP. Implementing zero-shot classification with CLIP only requires two simple steps:

  • Construct a description text for each category from the task's classification labels, e.g. A photo of {label}, and feed these texts into the Text Encoder to get the corresponding text features; if there are N categories, N text features are obtained;
  • Feed the image to be predicted into the Image Encoder to get its image feature, then compute the scaled cosine similarity with the N text features (consistent with the training process), and take the category whose text has the highest similarity as the classification result. Further, these similarities can be regarded as logits, and applying a softmax gives the predicted probability of each category.

[Figure: using CLIP for zero-shot classification]
It can be seen that we use the multimodal nature of CLIP to build a dynamic classifier for a specific task: the text features extracted by the Text Encoder act as the classifier's weights, while the image features extracted by the Image Encoder are the classifier's input. Below is an example based on CLIP (adapted from the official notebook) with 6 categories: "dog", "cat", "bird", "person", "mushroom", "cup". First we create a text description for each category, then extract the text features:

# first generate the text description for each category
labels = ["dog", "cat", "bird", "person", "mushroom", "cup"]
text_descriptions = [f"A photo of a {label}" for label in labels]
text_tokens = clip.tokenize(text_descriptions).cuda()

# extract text features
with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

Then we load the images to be predicted, feed them into the Image Encoder to extract image features, and compute the cosine similarity with the text features:

# load the images
original_images = []
images = []
texts = []

for label in labels:
    image_file = os.path.join("images", label+".jpg")
    name = os.path.basename(image_file).split('.')[0]

    image = Image.open(image_file).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(name)

image_input = torch.tensor(np.stack(images)).cuda()

# extract image features
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    image_features /= image_features.norm(dim=-1, keepdim=True)

# compute cosine similarity (unscaled)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T # matrix multiplication

The resulting similarities are shown below. It can be seen that each of the 6 images to be predicted matches its correct text label according to the maximum similarity:
[Figure: cosine similarities between the 6 test images and the 6 text descriptions]
Further, we can apply a softmax to the obtained cosine similarities to get the predicted probability of each category; note that the similarities need to be scaled here:

logit_scale = np.exp(model.logit_scale.data.item())
text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)

The resulting predicted probabilities are shown below; for all 6 images, the CLIP model gives the correct classification with very high confidence:
[Figure: predicted probabilities for the 6 test images]

Another important point when using CLIP for zero-shot classification is the generation of the text descriptions. In the example above we used A photo of {label}, but there are other options; for example, we could use the category labels directly. This is actually related to a recently popular research topic in NLP: prompt learning, or prompt engineering (see the survey Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing). In short, the core of prompt learning is to construct a suitable prompt so that a pre-trained model can be applied directly to downstream tasks, which is a different paradigm from the previous pre-training + fine-tuning. The paper notes that if we use the category label directly as the text description, much of the text is a single word lacking context, which is inconsistent with CLIP's training data, so the effect is worse than using A photo of {label} (which improves accuracy on ImageNet by 1.3%). The paper also experimented with ensembling 80 different prompts and found this brings a further 3.5% improvement on ImageNet; see the notebook published with CLIP for details. The figure below compares ResNet-based CLIP models using just the category name versus prompt engineering and ensembling:
[Figure: prompt engineering and ensembling vs. contextless class names for ResNet-based CLIP]
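
As an illustration, here is a minimal sketch of prompt ensembling in the spirit of the official notebook: the text embeddings of several templates are averaged per class to form the classifier weights. The two templates below are placeholders rather than the 80 used in the paper, and `model`/`clip` are the objects loaded earlier:

templates = ["a photo of a {}.", "a low resolution photo of a {}."]  # illustrative templates

def build_zeroshot_classifier(labels):
    with torch.no_grad():
        class_weights = []
        for label in labels:
            texts = clip.tokenize([t.format(label) for t in templates]).cuda()
            embs = model.encode_text(texts).float()
            embs /= embs.norm(dim=-1, keepdim=True)           # normalize each prompt embedding
            mean_emb = embs.mean(dim=0)                       # average over templates
            class_weights.append(mean_emb / mean_emb.norm())  # re-normalize the class embedding
        return torch.stack(class_weights, dim=1)              # [d, num_classes] classifier weights

# usage: logits = 100.0 * image_features @ build_zeroshot_classifier(labels)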

Above we introduced how to use CLIP for zero-shot classification. Below we briefly summarize how CLIP compares with other methods, which is also the most extensive part of the paper. The first comparison is of zero-shot performance between CLIP and a 2017 work, Learning Visual N-Grams from Web Data, on three classification datasets. As shown in the table below, CLIP far outperforms the earlier model and reaches 76.2% on ImageNet, comparable to a fully supervised ResNet50, which is quite amazing given that no training data from these datasets is used.
[Table: zero-shot comparison of CLIP with Learning Visual N-Grams from Web Data]
Further, the paper compares zero-shot CLIP with a ResNet50 linear probe (pre-trained on ImageNet, fine-tuned with a linear classification layer) on 27 datasets. As shown in the figure below, CLIP outperforms ResNet50 on 16 of them. However, on some specialized, complex or abstract datasets CLIP performs poorly, such as satellite image classification, lymph node metastasis detection, and counting objects in synthetic scenes; there CLIP is not as good as the fully supervised ResNet50, which shows that CLIP is not a panacea and still has room for improvement. Looking carefully at the figure, CLIP also performs poorly on MNIST, with a classification accuracy of only 88%, which seems incredible because the task is so simple. By analyzing the training data, the authors found that the 400 million training pairs contain basically no data resembling MNIST, so MNIST is out-of-domain data for CLIP and the poor performance becomes easier to understand. This also shows that CLIP still cannot solve the out-of-domain generalization problem of deep learning.
[Figure: zero-shot CLIP vs. ResNet50 linear probe across 27 datasets]
In addition to the zero-shot comparison, the paper also compares few-shot performance, i.e. training with only a small number of labeled samples per class (the paper does this with a linear classifier on frozen features). Three models are compared: BiT-M (ResNet-152x2) trained on ImageNet-21K, a SimCLRv2 self-supervised ResNet50, and a supervised ResNet50. Zero-shot CLIP is comparable to the best of these models (BiT-M) at 16 shots, while 16-shot CLIP improves further. Another interesting result is that although CLIP's few-shot performance improves as the number of samples increases, its 1-shot and 2-shot performance is worse than zero-shot; the authors attribute this mainly to the difference between CLIP's training and conventional supervised training.
[Figure: few-shot comparison of CLIP, BiT-M, SimCLRv2 and supervised ResNet50]
In addition, the paper conducted representation learning experiments, i.e. the linear probe commonly used in self-supervised learning: first extract features with the trained model, then train a supervised linear classifier on top. The figure below shows the average linear probe score across 27 datasets for different models; CLIP outperforms the other models while also being more compute-efficient:
[Figure: average linear probe score across 27 datasets]
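
A rough sketch of this linear-probe setup on frozen CLIP image features, assuming torchvision-style `train_loader`/`test_loader` DataLoaders exist; the logistic-regression hyperparameters follow the example in the official repository, but this is not the paper's exact evaluation code:

import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(loader):
    all_feats, all_labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            feats = model.encode_image(images.cuda()).float()  # frozen CLIP features
            all_feats.append(feats.cpu().numpy())
            all_labels.append(targets.numpy())
    return np.concatenate(all_feats), np.concatenate(all_labels)

train_x, train_y = extract_features(train_loader)
test_x, test_y = extract_features(test_loader)

classifier = LogisticRegression(C=0.316, max_iter=1000)  # hyperparameters from the official example
classifier.fit(train_x, train_y)
print("linear probe accuracy:", classifier.score(test_x, test_y))
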
In addition, the paper found that CLIP is much more robust to natural distribution shift. For example, zero-shot CLIP and a supervised ResNet101 both reach about 76.2% on the ImageNet validation set, but CLIP already exceeds ResNet101 on ImageNetV2, and on the other 4 distribution-shifted datasets ResNet101's performance drops dramatically while CLIP largely maintains its accuracy; on ImageNet-A, ResNet101 reaches only 2.7% while CLIP reaches 77.1%.

[Figure: robustness to natural distribution shift, CLIP vs. supervised ResNet101]
Given such good zero-shot performance, one might suspect that CLIP's training set contains examples from the test datasets, i.e. data leakage. The paper therefore uses a duplicate detector to check the overlap between the training data and the evaluated datasets, finding a median overlap of 2.2% and an average of 3.2%; for most datasets, performance before and after de-duplication does not change much, as shown below:
[Figure: effect of de-duplication on evaluation performance]

At the end of the paper, the limitations of CLIP are also discussed; here is a brief summary of the more important points:

  • Although the zero-shot performance of CLIP is comparable to a supervised ResNet50, it is far from SOTA; the authors estimate that reaching SOTA would require roughly 1000x more compute, which is currently unaffordable;
  • CLIP's zero-shot performance is poor on some types of datasets, such as fine-grained classification and abstract tasks;
  • CLIP is robust to natural distribution shift, but it still has an out-of-domain generalization problem: if the distribution of the test data differs greatly from the training data, CLIP performs poorly;
  • CLIP does not solve the data-inefficiency problem of deep learning; training CLIP requires a huge amount of data.

4. Why CLIP

The principle and applications of CLIP were introduced above; here we look back at another question: why CLIP, i.e. the motivation behind the work. In the field of computer vision, the most common transfer learning approach is to pre-train on a large-scale dataset such as ImageNet and then fine-tune on specific downstream tasks. Such pre-training is supervised and requires a large amount of annotation, so the cost is high. In recent years, self-supervised methods have emerged, including contrastive methods such as MoCo and SimCLR and masked-image methods such as MAE and BEiT; their benefit is that they no longer need annotation. But whether supervised or self-supervised, these models still require supervised fine-tuning when transferred to downstream tasks and cannot do zero-shot. Supervised models use a classifier with a fixed number of classes on the pre-training dataset, so a new classifier must be defined and retrained on a new dataset; self-supervised models rely on proxy tasks for representation learning and likewise need a new classifier trained with supervision when migrating to other datasets. In NLP, however, pre-training based on autoregression or masked language modeling has become quite mature, and pre-trained models can transfer zero-shot to downstream tasks, e.g. OpenAI's GPT-3. This difference comes partly from the fact that text and images are completely different modalities, and partly from the fact that NLP models can exploit the large amounts of text collected from the Internet. So the question arises: can a vision model be pre-trained on a large amount of Internet text?

In fact, there have been earlier studies using text as a supervision signal to train visual models. For example, the 2016 work Learning Visual Features from Large Weakly Supervised Data casts this as a multi-label classification task that predicts the bag of words corresponding to an image, and the 2017 work Learning Visual N-Grams from Web Data extends this approach to predicting n-grams. More recent works adopt new architectures and pre-training methods to learn visual features from text, such as VirTex (a transformer-based language model), ICMLM (based on language masking) and ConVIRT (based on contrastive learning). Overall there has not been much work in this direction, mainly because these methods struggled to reach high performance; for example, the 2017 work achieved only 11.5% zero-shot accuracy on ImageNet, far below the ImageNet SOTA. There is also another direction: improving performance with weak text supervision. For example, Google's BiT and ViT pre-train on the JFT-300M dataset to reach SOTA on ImageNet; JFT-300M was collected by Google from the Internet, with web text automatically converted into 18,291 categories, albeit with some noise. Although Google achieved good results with JFT-300M, these models are still pre-trained with a fixed-class softmax classifier, which greatly limits their transferability and scalability.

The authors argue that an important difference between Google's weakly supervised approach and the earlier works is scale, i.e. the amount of compute and data. JFT-300M contains hundreds of millions of images and Google used substantial compute for pre-training, whereas VirTex, ICMLM and ConVIRT were only trained for a few days on data at the 100K scale. To make up for the data gap, OpenAI collected 400 million text-image pairs from the Internet. A new question then arises: which training method to use. OpenAI first tried a VirTex-style model, jointly training a CNN and a text transformer to predict the caption of an image, but found that its training efficiency (evaluated by zero-shot performance on ImageNet) is lower than simply predicting a bag of words; as shown in the figure below, the two differ by about 3x in training efficiency. Switching further to the ConVIRT-style contrastive approach improves training efficiency by another 4x. The reason for this difference is not hard to understand: the text-image pairs collected from the Internet are noisy, meaning the text and image may not match exactly, so appropriately lowering the difficulty of the training objective leads to better convergence. In terms of task difficulty: transformer language model > bag-of-words prediction > bag-of-words contrastive objective (CLIP). Given the huge amount of training data and model compute, training efficiency becomes a crucial factor, which is why the authors ultimately chose contrastive learning.

[Figure: training efficiency of the transformer language model, bag-of-words prediction and contrastive objectives]

Essentially, CLIP is not really innovative, it just simplifies the ConVIRT method and uses a larger-scale text-image pair dataset for training.

At the end of the paper, the authors also note that they adopted contrastive learning because of training-efficiency constraints, but what they would still like to do is generate text directly from images, which would close the loop: text -> image -> text. A generatively trained model could also achieve zero-shot classification, e.g. by predicting the word (label) in the sentence: A photo of [?].

5. What else can CLIP do

Although the paper only experiments with zero-shot classification, the application value of CLIP goes far beyond that. Since its release, many application studies based on CLIP have appeared; here we list some application scenarios.

zero-shot detection

CLIP can be applied to object detection to achieve zero-shot detection, i.e. detecting categories not included in the training set. For example, ViLD, proposed by Google, builds open-vocabulary object detection on top of CLIP; its main architecture is shown below. The basic idea is similar to zero-shot classification, except that the similarity is computed between text features and ROI features (see the sketch after the figure).
[Figure: ViLD architecture]
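
A conceptual sketch of that idea (not the ViLD implementation): region embeddings produced by a detector are classified against CLIP text embeddings exactly as in zero-shot classification. Here `roi_features` ([num_rois, d], L2-normalized) and `text_weights` ([d, num_classes]) are assumed torch-tensor inputs:

def classify_regions(roi_features, text_weights, logit_scale=100.0):
    # per-region logits over the open vocabulary defined by the text prompts
    logits = logit_scale * roi_features @ text_weights  # [num_rois, num_classes]
    return logits.softmax(dim=-1)                       # per-region class probabilities
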
Meta AI's recent work Detic can detect twenty thousand classes, and CLIP is also used behind it:
[Figure: Detic]

image search

Searching for images with text is one of the most direct applications of CLIP. In fact, CLIP is also used as a ranking model for DALL-E, i.e. to select the images most relevant to the text from the generated candidates.
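
A minimal sketch of such text-to-image search with CLIP, assuming `gallery_features` is a pre-computed, L2-normalized [num_images, d] tensor of image embeddings from the same model:

def search_images(query, gallery_features, top_k=5):
    with torch.no_grad():
        tokens = clip.tokenize([query]).cuda()
        q = model.encode_text(tokens).float()
        q /= q.norm(dim=-1, keepdim=True)             # normalize the query embedding
        scores = (q @ gallery_features.T).squeeze(0)  # cosine similarity to every image
        return scores.topk(top_k)                     # values and indices of the best matches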

video understanding

CLIP is based on text-image pairs, but it can be extended to text-video. For example, VideoCLIP applies CLIP to the video domain to achieve some zero-shot video understanding tasks.

image editing

CLIP can be used to guide image editing tasks; for example, HairCLIP uses CLIP to customize hairstyles:
[Figure: HairCLIP]

image generation

CLIP can also be applied to image generation. For example, StyleCLIP uses CLIP to implement a text-guided StyleGAN:
[Figure: StyleCLIP]
CLIP-GEN trains a text-to-image generation model based on CLIP, without directly using any text data during training:
[Figure: CLIP-GEN]

self-supervised learning

Recently, Huawei's work MVP uses CLIP for visual self-supervised training:
[Figure: MVP]

VL tasks

CLIP itself is a multimodal model, so it can also be used in image-text multimodal tasks such as image captioning and visual question answering. The paper How Much Can CLIP Benefit Vision-and-Language Tasks? systematically evaluates the benefits of CLIP on VL tasks.
[Figure: CLIP on vision-and-language tasks]

The power of CLIP can be further seen from these specific applications.
In addition to applied research, there is also work improving CLIP itself; the recent paper Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision summarizes several improvements to CLIP:
[Figure: summary of CLIP improvements from the benchmark paper]

6. Summary

This article has systematically summarized the principle of CLIP and its applications. After pre-training, CLIP performs well on downstream tasks (zero-shot or few-shot), and prediction is no longer limited to a fixed set of classes: labels can be added dynamically at prediction time without retraining or fine-tuning, and direct zero-shot prediction can already reach high accuracy, although CLIP still cannot handle out-of-domain datasets well. CLIP and ViT are works of the same magnitude: both break the previous paradigm of computer vision and will surely leave their mark in the history of CV.

7. Reference Links

Learning Transferable Visual Models From Natural Language Supervision
https://github.com/openai/CLIP
https://www.zhihu.com/zvideo/1475706654562299904
https://zhuanlan.zhihu.com/p/493489688

Note: This article is reproduced from https://zhuanlan.zhihu.com/p/493489688
