Interpretation of the paper: Learning Transferable Visual Models From Natural Language Supervision

I haven't read papers for a while and they have piled up again, so I'm taking some time to read them and take notes. My understanding is limited and mistakes are inevitable; corrections and discussion are welcome.

This paper applies natural language supervision to the vision domain, mainly by jointly training on images and text (for video, which would take too long to process directly, the data amounts to sets of center frames extracted from the videos). This kind of pre-training turns out to work very well for zero-shot transfer to downstream tasks.

1 Overview

While deep learning has revolutionized computer vision, current approaches suffer from several major problems: typical vision datasets are labor-intensive and costly to create, since collecting and labeling images takes a lot of time and money, and a standard vision model is only good at the type of task its dataset covers. In other words, beyond the cost, such models are not versatile enough and are tied to a relatively specific classification problem.
Learning about images directly from raw text is therefore a promising option, since it can take advantage of a much broader source of supervision. After pre-training, natural language is used to reference the learned visual concepts (or to describe new ones), enabling zero-shot transfer of the model to downstream tasks.
We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, covering faces, animals, OCR (optical character recognition), action recognition in videos, geo-localization, and many types of fine-grained object classification.
So the paper proposes a neural network aimed at solving these problems: it is trained on a wide variety of images with a wide variety of natural language supervision, which is abundantly available on the Internet. By design, the network can be instructed in natural language to perform a wide variety of classification benchmarks without directly optimizing for the benchmark's performance, similar to the "zero-shot" capabilities of GPT-3.
The paper demonstrates that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million (image, text) pairs collected from the Internet.

2. CLIP architecture

A standard image model jointly trains an image feature extractor and a linear classifier to predict a fixed set of labels, whereas CLIP (Contrastive Language-Image Pre-Training) jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training samples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
As shown in the paper's architecture figure (not reproduced here): contrastive pre-training over a batch of (image, text) pairs, then building a classifier from the label text, then zero-shot prediction.

Here we collect a new dataset of 400 million (image, text) pairs from a variety of publicly available sources on the Internet, and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, is an effective method for learning from natural language supervision.

Contrastive pre-training: the text embeddings and image embeddings of a batch form an N×N similarity matrix. The diagonal entries, which have the highest cosine similarity, are the positive samples, and all other entries are negatives.
Creating a classifier from the label text: each class label {object} is turned into a sentence by wrapping it in the template "A photo of a {object}". This is a very valuable trick, discussed in more detail later; a minimal sketch of how such a classifier is used follows below.
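
To make the "classifier from label text" idea concrete, here is a minimal illustrative sketch of zero-shot prediction. The image_encoder and text_encoder functions are stand-ins for the trained CLIP encoders; this is not OpenAI's implementation.

# Minimal illustrative sketch of zero-shot classification with CLIP-style encoders.
# image_encoder and text_encoder are stand-ins for the trained encoders.
import numpy as np

def zero_shot_predict(image, class_names, image_encoder, text_encoder):
    # 1. Turn each class label into a sentence with the prompt template.
    prompts = [f"A photo of a {name}." for name in class_names]

    # 2. Embed the image and the prompts, then L2-normalize.
    img = image_encoder(image)                           # [d_e]
    txt = np.stack([text_encoder(p) for p in prompts])   # [num_classes, d_e]
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    # 3. The text embeddings act as the weights of a linear classifier:
    #    cosine similarity between the image and each prompt gives the class scores.
    scores = txt @ img                                   # [num_classes]
    return class_names[int(np.argmax(scores))]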

Comparing zero-shot transfer to ImageNet, this approach is more effective and more efficient to train, and the model is more robust (the corresponding figure is not reproduced here).

This result also illustrates CLIP's strong generalization, especially for predictions on unseen tasks, and reflects its versatility.

3. Principle

3.1. Pre-training method

The current best computer vision systems are very computationally demanding, and even so, a model trained on ImageNet can only predict its 1,000 fixed categories.
Since learning an open set of visual concepts from natural language appears daunting, training efficiency is a critical factor. We abandon conventional predictive objectives and instead only predict which text as a whole is paired with which image.
The image encoder and text encoder are trained jointly to maximize the cosine similarity of the N correct (image, text) embedding pairs in a batch while minimizing the cosine similarity of the N²−N incorrect pairs, and a symmetric cross-entropy loss is optimized over these similarity scores. Since the pre-training dataset is large, overfitting is not a major issue and the training details of CLIP are simplified: CLIP is trained from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights; a linear projection maps each encoder's representation to the multi-modal embedding space; and the only data augmentation used throughout training is random cropping. Finally, the temperature parameter τ that controls the range of the logits in the softmax is optimized directly during training as a log-parameterized multiplicative scalar, to avoid having to treat it as a hyperparameter.

3.2. CLIP pseudocode

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned projection of image features to the embedding space
# W_t[d_t, d_e] - learned projection of text features to the embedding space
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) # [n, d_i]
T_f = text_encoder(T) # [n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function (mean of two cross-entropy losses)
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
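
The pseudocode above is not directly runnable (the encoders and loss are left abstract). Below is a minimal runnable NumPy sketch of the symmetric loss, with random features standing in for the image and text encoders; the batch size and feature dimensions are assumptions for illustration, not the paper's settings.

# Minimal runnable NumPy sketch of the symmetric contrastive loss.
# Random features stand in for the encoders -- a toy example, not the training code.
import numpy as np

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 2048, 512, 512        # batch size and assumed feature dims

I_f = rng.normal(size=(n, d_i))             # stand-in for image_encoder(I)
T_f = rng.normal(size=(n, d_t))             # stand-in for text_encoder(T)
W_i = rng.normal(size=(d_i, d_e)) * 0.01    # learned image projection
W_t = rng.normal(size=(d_t, d_e)) * 0.01    # learned text projection
t = np.log(1 / 0.07)                        # log-parameterized temperature (tau = 0.07)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # row-wise softmax; labels[i] is the index of the correct column in row i
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

I_e = l2_normalize(I_f @ W_i)               # [n, d_e]
T_e = l2_normalize(T_f @ W_t)               # [n, d_e]
logits = (I_e @ T_e.T) * np.exp(t)          # scaled pairwise cosine similarities [n, n]

labels = np.arange(n)                       # positives sit on the diagonal
loss_i = cross_entropy(logits, labels)      # each image classifies its matching text
loss_t = cross_entropy(logits.T, labels)    # each text classifies its matching image
loss = (loss_i + loss_t) / 2
print(loss)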

3.3. Model selection

Image encoder: two different architectures.
First, we use ResNet-50 as the base architecture of the image encoder because of its widespread adoption and proven performance. We make several modifications to the original version, adopting the ResNet-D improvements and antialiased rect-2 blur pooling, and replacing the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of "transformer-style" multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image (a rough sketch is given below).
For the second architecture, we experiment with the recently introduced Vision Transformer (ViT), adding an extra layer normalization to the combined patch and position embeddings before the transformer and using a slightly different initialization scheme.
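
Here is a rough PyTorch sketch of the attention pooling just described: a single multi-head QKV attention layer whose query is the global average-pooled feature map. It is an illustrative reading of the description above, not OpenAI's exact implementation, which differs in details such as positional embeddings.

# Rough sketch of attention pooling over the ResNet feature map (illustrative only).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feature_map):
        # feature_map: [n, h*w, feat_dim] -- the flattened spatial grid of ResNet features
        query = feature_map.mean(dim=1, keepdim=True)      # global average pool as the query
        pooled, _ = self.attn(query, feature_map, feature_map)
        return self.proj(pooled.squeeze(1))                # [n, embed_dim]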

Text encoder: a Transformer. As the base size we use a 12-layer, 512-wide model with 8 attention heads and 63M parameters. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a vocabulary size of 49,152, and the maximum sequence length is capped at 76 for computational efficiency. The text sequence is bracketed with [SOS] and [EOS] tokens; the activations of the transformer's highest layer at the [EOS] token are taken as the feature representation of the text, layer-normalized, and then linearly projected into the multi-modal embedding space (sketched below).
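
As an illustration of that readout (top-layer activation at the [EOS] token, layer-normalized and linearly projected), here is a rough PyTorch sketch. The module choices are assumptions for illustration only and differ from OpenAI's implementation, which for instance uses masked self-attention.

# Rough sketch of the text-encoder readout described above (illustrative only).
import torch
import torch.nn as nn

vocab_size, width, layers, heads, context_len, embed_dim = 49152, 512, 12, 8, 76, 512

token_embedding = nn.Embedding(vocab_size, width)
positional = nn.Parameter(torch.zeros(context_len, width))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True),
    num_layers=layers,
)
ln_final = nn.LayerNorm(width)
text_projection = nn.Linear(width, embed_dim, bias=False)

def encode_text(tokens, eos_positions):
    # tokens: [n, context_len] integer token ids; eos_positions: [n] index of [EOS] per sequence
    x = token_embedding(tokens) + positional                 # [n, context_len, width]
    x = encoder(x)                                           # [n, context_len, width]
    x = ln_final(x)
    eos_feats = x[torch.arange(x.size(0)), eos_positions]    # readout at the [EOS] token
    return text_projection(eos_feats)                        # [n, embed_dim]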

Compared to previous computer vision research that typically scales models by increasing width or depth alone, here we use a simple baseline that allocates additional compute evenly across the width, depth, and resolution of the model.
For the text encoder, we only scale the width of the model proportionally to the computed increase in ResNet width and do not scale the depth at all, as we found that CLIP's performance is less sensitive to the capacity of the text encoder.

3.4. Training

We train 5 ResNets and 3 Vision Transformers.
For the ResNets, we trained a ResNet-50, a ResNet-101, and three scaled-up variants with roughly 4x, 16x, and 64x the compute of a ResNet-50.
For the Vision Transformers, we trained a ViT-B/32, a ViT-B/16, and a ViT-L/14.
All models were trained for 32 epochs using the Adam optimizer, with decoupled weight decay regularization applied to all weights that are not gains or biases, and a cosine schedule to decay the learning rate.
Initial hyperparameters were set using a combination of grid search, random search, and manual tuning on the baseline ResNet-50 model trained for 1 epoch; the hyperparameters were then adapted heuristically for the larger models due to computational constraints. The learnable temperature parameter τ is initialized to the equivalent of 0.07 and clipped to prevent the logits from being scaled by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768.
Mixed precision is used to speed up training and save memory. To save additional memory, gradient checkpointing (which does not store all intermediate results of the computation graph for backpropagation, but recomputes them during the backward pass, trading time for memory), half-precision Adam statistics, and half-precision stochastically rounded text encoder weights are used. A sketch of two of these training details (the weight-decay grouping and the temperature clamp) follows below.
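
Here is a hedged PyTorch-style sketch of those two details: applying decoupled weight decay only to weights that are not gains or biases, and clamping the learnable temperature so that the logit scale exp(t) never exceeds 100. The model, learning rate, and Adam betas are placeholders, not the paper's exact values.

# Hedged sketch of two training details (not the paper's code).
import math
import torch

def build_optimizer(model, lr=5e-4, weight_decay=0.2):
    # lr, betas, eps, and weight_decay here are placeholder values
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # biases and 1-D "gain" parameters (e.g. LayerNorm scales) get no weight decay
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.98), eps=1e-6,
    )

# learnable temperature, initialized so that exp(t) = 1 / 0.07
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def clamp_logit_scale():
    # called after each optimizer step; keeps exp(t) from exceeding 100
    with torch.no_grad():
        logit_scale.clamp_(max=math.log(100.0))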

4. Ambiguity

Here is a more detailed explanation of the prompt "A photo of a {object}" shown in the architecture figure above:
A common problem is that words have multiple meanings. When the name of a class is the only information provided to CLIP's text encoder, it cannot tell which sense of the word is meant, due to the lack of context. In some cases, multiple meanings of the same word even appear as different classes in the same dataset.
In ImageNet, for example, the word "crane" refers both to the construction machine and to the bird.
Another example is found in the Oxford-IIIT Pet dataset, where the word "boxer" clearly refers, from context, to a breed of dog, but a text encoder lacking that context could just as well take it to mean the athlete.
For this kind of ambiguity, we found that the prompt template "A photo of a {label}." is a good default that helps specify that the text is about the content of the image. Using this prompt alone raised accuracy on ImageNet by 1.3%.
Similar to the "prompt engineering" discussion around GPT-3, we also observe that zero-shot performance can be significantly improved by customizing the prompt text for each task; a few non-exhaustive examples are sketched below. On several fine-grained image classification datasets it helps to specify the category: on Oxford-IIIT Pets, "A photo of a {label}, a type of pet." works well; likewise, specifying "a type of food" on Food101 and "a type of aircraft" on FGVC Aircraft helps. For optical character recognition (OCR) datasets, we found that adding quotes around the text or numbers to be recognized improves performance. Finally, on satellite image classification datasets it helps to specify "A satellite photo of a {label}.". That said, zero-shot CLIP is still quite weak on some specialized, complex, or abstract tasks, such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), object counting in synthetic scenes (CLEVR Counts), and tasks related to autonomous driving, such as German traffic sign recognition (GTSRB) and estimating the distance to the nearest car (KITTI Distance).
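
To make these prompt-template examples concrete, here is a small illustrative sketch. The dataset keys, any template wording beyond the phrases quoted above, and the make_prompts helper are hypothetical conveniences for this example, not CLIP's released prompt lists.

# Illustrative sketch of dataset-specific prompt templates (hypothetical).
TEMPLATES = {
    "imagenet":      "A photo of a {}.",
    "oxford_pets":   "A photo of a {}, a type of pet.",
    "food101":       "A photo of {}, a type of food.",
    "fgvc_aircraft": "A photo of a {}, a type of aircraft.",
    "eurosat":       "A satellite photo of a {}.",
}

def make_prompts(dataset, class_names):
    template = TEMPLATES[dataset]
    return [template.format(name) for name in class_names]

# e.g. make_prompts("oxford_pets", ["boxer", "beagle"])
#  -> ["A photo of a boxer, a type of pet.", "A photo of a beagle, a type of pet."]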

5. Robustness

Regarding robustness: given the wide range of categories in the training data and the sheer amount of it, zero-shot generalization should indeed be good.

The paper also visualizes 0-shot, 1-shot, 2-shot, 4-shot, etc. performance (figure not reproduced here).

It can be seen that, somewhat counter-intuitively, zero-shot accuracy is higher than the 1- and 2-shot few-shot results.
So how does CLIP compare to human performance and human learning?
To better understand how humans perform in an evaluation setting similar to CLIP's, we evaluated how well humans do on one of these tasks with zero samples, and how much their performance improves when they are shown one or two image examples. This helps compare task difficulty for humans and for CLIP, and identify correlations and differences between them.
We asked five different people to look at each of the 3,669 images in the Oxford-IIIT Pets test set and select which of the 37 cat or dog breeds best matched the image (or "I don't know" if they were completely unsure). In the zero-shot condition they received no examples of the breeds and labeled them to the best of their ability without an internet search; the experiment was then repeated with one and with two sample images per breed.
As for the results here, we already know that humans have prior knowledge: either they know a breed, or they clearly know that they don't know it. We therefore speculate that finding a way to properly integrate prior knowledge into few-shot learning could be an important step towards improving CLIP.
The most difficult problems for CLIP are often also the most difficult for humans. We rank image categories by their difficulty for CLIP, using the probability of the correct label as the measure (figure not reproduced here).

It can be seen that the per-category accuracy trend for CLIP is quite consistent with that of humans.

6. Limitations

Although the examples above show strong generalization and robustness, on some datasets the performance of this baseline is still well below the state of the art.
CLIP's zero-shot performance is still weak on some tasks, and on several types of fine-grained classification (such as distinguishing car models, flower species, and aircraft variants) CLIP performs worse than task-specific models.
For novel tasks that are unlikely to be covered by CLIP's pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP's performance can be close to random. CLIP also only achieves 88% accuracy on MNIST handwritten digits, which is easy to understand, since such images are essentially absent from Internet image searches.
Although CLIP offers the flexibility to generate zero-shot classifiers for a wide variety of tasks and datasets, it is still limited to choosing among the concepts in a given zero-shot classifier. This is a significant restriction compared with a truly flexible approach such as image captioning.
In addition, CLIP is trained on text paired with images from the Internet, and these image-text pairs are unfiltered and uncurated, which causes CLIP models to learn many social biases.
Of course, as for bias, beyond the data from the Internet, algorithmic decisions and the way classes are defined can also introduce and amplify the social biases and inequalities caused by the use of AI systems.

7. Looking forward to the future

For many datasets, CLIP performs significantly better than other models, suggesting that natural language supervision outperforms traditional pre-training methods based on image classification.
Our study of CLIP in a zero-shot setting shows that the model holds significant promise for widely applicable tasks such as image retrieval or search: for example, it can find relevant images in a database for a given text, or relevant text for a given image. Furthermore, the relative ease of steering CLIP toward custom applications with little or no additional data or training could unlock a variety of novel applications that are hard to imagine today, much like what has happened with large language models over the past few years.

Several works have explored the use of dense natural language supervision for videos by training systems to pair descriptive text with videos rather than images. Considered together with CLIP, these works suggest that large-scale natural language supervision is a promising approach to learning high-quality perceptual systems in many domains.
Other work extends this idea to an additional modality by adding raw audio as a further source of supervision and demonstrates the benefit of combining all three sources of supervision. Richly connecting vision and language in this way could help solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment.

References:

About fine-tuning: Fine-tuning in computer vision transfer learning
Original paper: https://arxiv.org/pdf/2103.00020.pdf
GitHub: https://github.com/OpenAI/CLIP
