Paper: Learning Transferable Visual Models From Natural Language Supervision
Code: https://github.com/OpenAI/CLIP
Official website: https://openai.com/research/clip
Source: OpenAI
Time: 2021.02
Contributions:
- Proposes a multimodal model based on image-text matching
- Jointly trains an image encoder and a text encoder, maximizing the cosine similarity between the two encoders' features to match images with texts
- Models based on image-text matching are far more efficient than models that directly predict the text content
Introduction:
- CLIP is a large-scale image-text pre-training model based on contrastive learning, open-sourced by OpenAI and trained on 400 million image-text pairs
- The text encoder is a Transformer and the image encoder is a ResNet or a ViT; cosine similarity measures the distance between the two encoders' features
- The text descriptions are in English
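Since cosine similarity between the two encoders' features is central to CLIP, here is a minimal NumPy sketch of the measure; the vectors below are toy stand-ins for encoder outputs, not real CLIP features:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

image_feat = np.array([1.0, 2.0, 2.0])
text_feat = np.array([2.0, 4.0, 4.0])   # same direction, different magnitude
print(cosine_similarity(image_feat, text_feat))  # → 1.0
```

Because the measure only depends on direction, CLIP L2-normalizes both embeddings so that a simple dot product gives the similarity.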
1. Background
Pre-training methods that learn directly from raw text have revolutionized NLP over the past few years.
Casting tasks as text-to-text, with a standardized input-output interface, enables zero-shot transfer of task-agnostic architectures to downstream datasets, without task-specific output heads or dataset-specific customization.
It has also been shown that pre-training on very large text corpora outperforms training on annotated NLP datasets.
So can super-large-scale pre-training data achieve similar results in computer vision?
Typical computer vision models are trained on a fixed set of categories, which limits generalization: the model cannot recognize categories it was never given.
One huge gap between NLP and CV lies in scale!
What the authors did:
- Investigated what image classifiers can achieve when trained on a large-scale natural-language dataset
- Collected roughly 400 million image-text pairs from the Internet and trained a simplified ConVIRT from scratch; the resulting model is called CLIP (Contrastive Language-Image Pre-training) and efficiently obtains supervision signals from natural language
- Studied the scalability of CLIP by training a series of 8 models
- Found that CLIP, like the GPT family, learns many tasks during the pre-training stage, such as OCR, geo-localization, and action recognition
- Tested zero-shot transfer on more than 30 datasets and found that CLIP can match the performance of supervised learning
2. Method
2.1 Using natural language to supervise training
A core point of this paper's method is learning perception from the supervision contained in natural language.
Benefits of learning from natural language:
- For image classification, natural language provides far more information than standard labels
- The model need not be built as a 1-of-N voting classifier; it can instead learn from the much richer supervision contained in large amounts of Internet text
- Learning from natural language yields not just a feature representation but one tied to language, enabling more flexible zero-shot transfer
2.2 Create a very large data set
The authors construct a new dataset WIT (WebImageText), which includes 400 million (image, text) pairs from publicly available sources on the Internet.
2.3 Select a pre-trained model
Many existing computer vision models consume enormous computing resources, so training efficiency is critical; this is even more true when the supervisory signal is natural language, which is harder to learn from than ImageNet's 1000 fixed labels.
The authors first tried a VirTex-like approach: jointly training an image CNN and a text Transformer from scratch to predict the exact caption of each image.
However, as shown in Figure 2, the 63-million-parameter text Transformer, which already uses twice the compute of the ResNet-50 image encoder, learns to recognize ImageNet classes three times more slowly than a simpler bag-of-words baseline.
Because every image has a different caption, and even the same image can be described in many ways, using natural language as an exact learning target (predicting each word of the caption) is very difficult.
Hence the switch to a contrastive learning approach.
Many generative models can also produce high-quality image descriptions, but at the same description quality a generative model requires far more compute than a contrastive one.
Therefore, the training system in this paper sets up a simpler proxy task: treat the natural-language caption as a whole and learn which image it matches, instead of predicting each word of the text. This change improved the efficiency of zero-shot transfer to ImageNet by a further 4x.
The idea of CLIP:
- Given a batch of N (image, text) pairs
- CLIP is trained to predict which of the N x N possible pairings in the batch actually occurred
- So CLIP learns a multimodal embedding space: the image encoder and text encoder are trained jointly to maximize the cosine similarity of the N real (image, text) pairs in the batch while minimizing the cosine similarity of the other N^2 - N incorrect pairings
- Optimize these similarity scores with a symmetric cross entropy loss
Figure 3 shows the pseudocode of CLIP:
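Since the figure itself is not reproduced here, below is a NumPy sketch mirroring the Figure 3 pseudocode: L2-normalize both embeddings, build the N x N cosine-similarity matrix scaled by a temperature, and average the cross-entropy loss over both the image-to-text and text-to-image directions. The fixed temperature of 0.07 is a simplification for illustration; in CLIP the temperature is a learned parameter.

```python
import numpy as np

def clip_loss(image_emb: np.ndarray, text_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    image_emb, text_emb: (N, d) arrays; row i of each comes from the
    i-th (image, text) pair, so the diagonal holds the true matches.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) scaled similarities
    labels = np.arange(len(logits))              # pair i matches column i

    def cross_entropy(l, y):
        # row-wise softmax, then negative log-probability of the true label
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.mean(np.log(p[np.arange(len(y)), y]))

    # average the image->text and text->image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

# Perfectly matched toy embeddings give a near-zero loss
emb = np.eye(4)
print(clip_loss(emb, emb))
```

Note that the N^2 - N off-diagonal pairings need no explicit term: pushing up the softmax probability of each diagonal entry automatically pushes down the similarity of the incorrect pairings in the same row and column.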
2.4 Model scaling and selection
In the selection of the image encoder model, the author considered two structures:
- ResNet-50
- ViT
The text encoder is a Transformer; the base size has 63M parameters, 12 layers, and 8 attention heads
In prior CV research, model size is usually scaled by increasing the width and depth of the network, and this paper does the same for the image encoder
For the text encoder, the authors scale only the width of the model
3. Results
Zero-shot transfer ability
Table 1 compares Visual N-Grams and CLIP on 3 different datasets
Visual N-Grams achieves 11.5% on ImageNet, while CLIP reaches 76.2%, almost matching a supervised ResNet-50
CLIP's top-5 accuracy reaches 95%, almost matching Inception-v4
This shows that CLIP has strong zero-shot transfer ability on classification tasks
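To make the zero-shot mechanism concrete: each class name is wrapped in a prompt (e.g. "a photo of a {label}"), encoded by the text encoder, and the image is assigned to the class whose prompt embedding has the highest cosine similarity with the image embedding. A toy sketch with hypothetical pre-computed embeddings standing in for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray,
                       class_text_embs: np.ndarray,
                       class_names: list) -> str:
    """Return the class whose prompt embedding is most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per class prompt
    return class_names[int(np.argmax(sims))]

# Toy 3-d embeddings; in practice these come from CLIP's encoders
class_names = ["cat", "dog", "car"]
text_embs = np.array([[1.0, 0.0, 0.0],    # "a photo of a cat"
                      [0.0, 1.0, 0.0],    # "a photo of a dog"
                      [0.0, 0.0, 1.0]])   # "a photo of a car"
image_emb = np.array([0.1, 0.9, 0.2])     # closest to the "dog" prompt
print(zero_shot_classify(image_emb, text_embs, class_names))  # → dog
```

Because the class set lives entirely in the text prompts, swapping in new class names requires no retraining, which is what makes the transfer "zero-shot".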
The figure below, shown on the official website, compares ResNet101 and CLIP ViT-L on ImageNet and related datasets:
CLIP's results on some datasets: