[AIGC] 6. CLIP | OpenAI's image-text matching model trained on 400 million samples


Paper: Learning Transferable Visual Models From Natural Language Supervision

Code: https://github.com/OpenAI/CLIP

Official website: https://openai.com/research/clip

Source: OpenAI

Time: 2021.02

Contributions:

  • Proposes a multimodal model based on image-text matching
  • Jointly trains an image encoder and a text encoder to maximize the cosine similarity between the encodings of matching image-text pairs
  • Learning by image-text matching is far more training-efficient than directly predicting the exact text of each image's caption

Introduction:

  • CLIP is a large-scale image-text pre-training model based on contrastive learning, trained on 400 million image-text pairs and open-sourced by OpenAI
  • The text encoder is a Transformer and the image encoder is either a ResNet or a Vision Transformer; cosine similarity measures the distance between the two encodings (see the usage sketch after this list)
  • The text descriptions are in English
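
As a concrete illustration of this image-text matching, here is a minimal usage sketch based on the open-source repository linked above (the `clip` Python package). The model name, image path, and candidate captions are placeholders for the example, not values from the paper.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a released CLIP checkpoint together with its matching image preprocessing
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any local image plus a few candidate English descriptions
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # [1, d]
    text_features = model.encode_text(texts)    # [len(texts), d]

# Cosine similarity between the image embedding and each text embedding
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T
print(similarity)  # the highest score marks the best-matching description
```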

1. Background

Pre-training methods that learn directly from raw text have revolutionized NLP over the past few years.

Task-agnostic text-to-text architectures with a standardized input-output interface enable zero-shot transfer to downstream datasets without task-specific output heads or dataset-specific customization.

These results also show that pre-training on very large text corpora can outperform training on annotated NLP datasets.

So can super-large-scale pre-training data achieve similar results in computer vision?

Typical computer vision models are trained on a fixed set of categories, which limits generalization: the model cannot recognize categories that were never defined.

A huge gap between NLP and CV pre-training lies in scale!

What the authors of the paper did:

  • Studied what image classifiers can achieve when trained at scale with natural-language supervision
  • Collected a dataset of roughly 400 million image-text pairs from the Internet and trained a simplified version of ConVIRT from scratch; the resulting model is called CLIP (Contrastive Language-Image Pre-training) and efficiently obtains its supervision signal from natural language
  • Studied the scalability of CLIP by training a series of 8 models of different sizes
  • Found that, like the GPT family, CLIP learns to perform many tasks during pre-training, such as OCR, geo-localization, and action recognition
  • Tested zero-shot transfer on more than 30 datasets and found performance competitive with supervised baselines

2. Method

[Figure 1: overview of the CLIP approach]

2.1 Using natural language to supervise training

A core idea of the method in this paper is to learn visual perception from the supervisory signal contained in natural language.

Benefits of learning from natural language:

  • For image classification, natural language can provide more information than standard labels
  • The model does not have to be built as a 1-of-N classifier; it can instead learn from the much richer supervision contained in large amounts of Internet text
  • Learning from natural language not only produces a feature representation but also connects that representation to language, enabling more flexible zero-shot transfer

2.2 Creating a very large dataset

The authors construct a new dataset WIT (WebImageText), which includes 400 million (image, text) pairs from publicly available sources on the Internet.

2.3 Selecting an efficient pre-training method

Many state-of-the-art computer vision models consume enormous amounts of compute, so training efficiency is critical, all the more so because learning from natural-language supervision is harder than training with the 1000 fixed labels of ImageNet.

The authors first tried an approach similar to VirTex: jointly training an image CNN and a text Transformer from scratch to predict the exact caption of each image.

However, as shown in Figure 2, a 63-million-parameter text Transformer, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times more slowly than a much simpler baseline that predicts a bag-of-words encoding of the same text.

[Figure 2: zero-shot ImageNet training efficiency of the caption-prediction, bag-of-words, and contrastive (CLIP) objectives]

Because descriptions vary widely across images, and even the same image can be described in many different ways, it is very hard to use natural language as an exact learning target (i.e., to predict every word of the description).

This motivates a contrastive learning approach.

Generative models can also produce high-quality image descriptions, but at the same level of description quality they require far more compute than contrastive models.

Therefore, the training objective proposed in this paper is a simpler proxy task: treat each natural-language description as a whole and learn which text is paired with which image, rather than predicting the individual words of the text. This change gives a further 4x improvement in the efficiency of zero-shot transfer to ImageNet.

The idea of CLIP:

  • Given a batch of N (image, text) pairs
  • CLIP is trained to predict which of the N × N possible (image, text) pairings in the batch actually occurred
  • To do so, CLIP learns a multimodal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the $N$ real (image, text) pairs in the batch while minimizing the cosine similarity of the other $N^2 - N$ incorrect pairings
  • These similarity scores are optimized with a symmetric cross-entropy loss

Figure 3 shows the pseudocode of CLIP:

[Figure 3: Numpy-like pseudocode for the core of an implementation of CLIP]
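
As a stand-in for the figure, here is a minimal PyTorch sketch of the symmetric contrastive objective described above. The random tensors simulate encoder outputs, and the fixed temperature is an illustrative value rather than the learned parameter used in the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so the dot products below are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] matrix of scaled pairwise cosine similarities
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the true pairings lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text (rows) and text-to-image (columns)
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy batch: N = 8 pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```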

2.4 Model scaling and selection

For the image encoder, the authors considered two architectures:

  • ResNet-50
  • ViT

The text encoder is a Transformer; the base configuration has 63M parameters, 12 layers, and 8 attention heads.

As in prior computer vision research, model size is scaled by increasing the width and depth of the model, and this paper takes a similar approach for the image encoder.

For the text encoder, the authors scale only the width of the model.
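
For reference, the encoder variants released with the open-source repository can be listed programmatically; this assumes the `clip` package from the linked repository is installed.

```python
import clip

# Released variants include ResNet-based image encoders (e.g. "RN50", "RN101")
# and ViT-based ones (e.g. "ViT-B/32", "ViT-L/14"),
# each paired with a Transformer text encoder
print(clip.available_models())
```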

3. Results

Zero-shot transfer ability

Table 1 compares Visual N-Grams and CLIP on 3 different datasets.

Visual N-Grams reaches 11.5% accuracy on ImageNet, while CLIP reaches 76.2%, almost matching a fully supervised ResNet-50.

CLIP's top-5 accuracy reaches 95%, roughly on par with Inception-V4.

This shows that CLIP has strong zero-shot transfer ability on classification tasks.

[Table 1: zero-shot comparison of Visual N-Grams and CLIP on three classification datasets]
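
To make the zero-shot setup concrete, here is a hedged sketch of classification without any task-specific training: the class names are written as English prompts, and the prediction is the prompt whose embedding best matches the image embedding. The class names, prompt template, and image path are illustrative placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Any label set can be used at test time by phrasing it as natural language
class_names = ["airplane", "dog", "guacamole", "television"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # The model returns similarity logits scaled by its learned temperature
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

best = probs.argmax(dim=-1).item()
print(f"predicted class: {class_names[best]} (p={probs[0, best].item():.2f})")
```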

The following figure, from the official website, compares a ResNet-101 and zero-shot CLIP ViT-L on the ImageNet dataset:

[Figure: ResNet-101 vs. zero-shot CLIP ViT-L accuracy on ImageNet]

The effect of CLIP on a number of other datasets:

[Figures: CLIP results on individual datasets]

Origin: blog.csdn.net/jiaoyangwm/article/details/130033758