Description of CLIP: Connecting Text and Images

Recently, OpenAI released DALL·E and CLIP. DALL·E has not yet been open-sourced, but CLIP has, so let's take a look at CLIP first.
This article organizes and summarizes the content of OpenAI's official blog post; please read the original for the full details. It also draws heavily on another blogger's introduction to CLIP, with thanks to that author.


One-sentence summary of CLIP: zero-shot transfer works well, tasks can be customized through natural-language prompts, and training is highly efficient.

Related links:
OpenAI CLIP blog
CLIP GitHub
Introduction to zero-shot learning (Zero-Shot Learning)
CLIP Colab
CLIP paper
OpenAI DALL·E blog
(links to the paper, Colab, and GitHub are also given in the blog post)


Problems with computer vision today:

  1. Datasets are costly to produce.
  2. Each model handles only one kind of task, adapting it is expensive, and it does not do well on other tasks: it performs well on benchmark tasks but poorly under stress tests (change the dataset and performance drops).

Therefore, the CLIP model is proposed, which alleviates three problems:

  1. Costly datasets: most previous models rely on human-labeled datasets, while CLIP's training data is collected from the internet, using plain text as the labels, which greatly reduces labeling cost.
  2. Narrow: a model trained on a labeled dataset is limited to that label set. If the dataset only teaches the model to predict cats and dogs, it cannot predict ducks; CLIP is not restricted in this way on common images (see the zero-shot sketch after this list).
  3. Poor real-world performance: there is a gap between benchmarks and the real world, so a good benchmark score does not guarantee good real-world behavior. Because CLIP is not learned from one specific dataset, it alleviates this problem. The authors also confirmed experimentally that fitting to ImageNet improves the ImageNet score, but performance on the other 7 datasets does not improve.
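To make the "not limited to a fixed label set" point concrete, here is a minimal zero-shot classification sketch using the openai/CLIP package linked above. The image path and the prompt strings are placeholders of my own; adding a new class such as "duck" is just one more prompt in the list, with no retraining.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate labels are just text prompts; extending the label set needs no retraining.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a duck"]
text = clip.tokenize(prompts).to(device)

# "animal.jpg" is a placeholder path for whatever image you want to classify.
image = preprocess(Image.open("animal.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # The model returns image-to-text similarity logits for the batch.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for prompt, p in zip(prompts, probs[0]):
    print(f"{prompt}: {p:.3f}")
```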

CLIP Advantages & Features:

  1. Summary: zero-shot transfer works well. After training on 400 million uncurated image-text pairs, CLIP performs well across different datasets, tasks can be customized, and training is efficient.
  2. OpenAI collected 400 million uncurated image-text pairs from the internet and trained with a contrastive learning objective: encode the images and the texts separately, compute the pairwise cosine similarities, then classify each row (one image against all texts) and each column (one text against all images) to find the matching positive pair (see the loss sketch after this list).
  3. Highly efficient: GPT-3 also does zero-shot well, but CLIP consumes fewer resources, needs less computation, and trains more efficiently. The best version of CLIP takes only about two weeks to train on 256 GPUs, similar to other current large models in the image field.
    Two ways efficiency is improved:
    Contrastive objective: as shown in the figure at the top of the article, compared with a language model that predicts the text description token by token, contrastive learning improves efficiency by 4 to 10 times.
    Vision Transformer: split the image directly into patches and feed them to a Transformer; compared with ResNet encoding this is another 3x more efficient (truly "Attention is all you need").
  4. Flexible and general: because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models, and they can perform many different tasks with ease. To test this, the authors measured CLIP's zero-shot performance on more than 30 different datasets, including tasks such as fine-grained object classification, geo-localization, video action recognition, and OCR.
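The row-and-column classification described in point 2 amounts to a symmetric cross-entropy over the cosine-similarity matrix. The sketch below is not the authors' actual training code: in the real model the temperature is a learned parameter, while here it is fixed for brevity.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matched image-text pairs.

    image_features, text_features: (N, d) tensors where row i of each encodes
    the i-th pair, so the diagonal of the similarity matrix holds the positives.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N cosine-similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Each image's matching text sits at the same index, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # classify along each row
    loss_t2i = F.cross_entropy(logits.t(), targets)  # classify along each column
    return (loss_i2t + loss_t2i) / 2
```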

CLIP Disadvantages:

  1. While CLIP generally performs well at recognizing common objects, it does worse on more abstract or systematic tasks, such as counting the number of objects in an image, and on more complex tasks, such as predicting how close the nearest car is in a photo. On both of these, zero-shot CLIP is only marginally better than random guessing. Compared with task-specific models, zero-shot CLIP also struggles with very fine-grained classification, such as distinguishing car models, aircraft variants, or flower species.
  2. CLIP also generalizes poorly to images not covered by its pre-training data. For example, even though CLIP learns a capable OCR system, on handwritten digits from the MNIST dataset zero-shot CLIP achieves only 88% accuracy, far below the 99.75% human accuracy on that dataset. (In fairness, that is still reasonable, since it was never trained specifically on MNIST; see the evaluation sketch below.)

The above shortcomings could be addressed by feeding in more of the corresponding data, at the cost of longer training.
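For reference, an MNIST evaluation like the one quoted above can be approximated with a short loop. This is a rough sketch assuming the openai/CLIP package and torchvision; the single digit prompt template is a simplification rather than the authors' full prompt ensemble, so the exact accuracy will differ.

```python
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One prompt per digit class; a simple assumed template, not the paper's prompt ensemble.
prompts = clip.tokenize([f'a photo of the number: "{d}".' for d in range(10)]).to(device)
with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# CLIP's preprocess converts the grayscale digits to RGB and resizes them.
test_set = MNIST(root="./data", train=False, download=True, transform=preprocess)
loader = DataLoader(test_set, batch_size=256)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.t()).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"Zero-shot MNIST accuracy: {correct / total:.2%}")
```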

Model diagram: [figure from the original post]
Zero-shot has high efficiency and high accuracy: [figure from the original post]


Original post: blog.csdn.net/Only_Wolfy/article/details/112675777