[Computer Vision] CLIP: Connecting Text and Images (Some Supplementary Notes on CLIP)

1. Introduction

We present a neural network called CLIP that efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.

Although deep learning has revolutionized computer vision, current approaches suffer from several major problems:

  • Typical vision datasets are labor-intensive and expensive to create, while teaching only a small subset of visual concepts;
  • Standard vision models are good at one and only one task, and require significant effort to adapt to new tasks;
  • Models that perform well on benchmarks often perform disappointingly on stress tests, casting doubt on the entire deep learning approach to computer vision.

We propose a neural network designed to address these problems:

It is trained on a wide variety of images with the natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a wide range of classification benchmarks without directly optimizing for benchmark performance, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.

This is a key change: by not optimizing directly for the benchmark, we show that the benchmark becomes far more representative: our system closes this "robustness gap" by up to 75%, while matching the performance of the original ResNet-50 on zero-shot ImageNet classification without using any of the original 1.28 million labeled examples.

2. Background and related work

CLIP (Contrastive Language-Image Pre-training) builds on extensive work on zero-shot transfer, natural language supervision, and multimodal learning.

The idea of zero-data learning dates back more than a decade, but until recently it was primarily studied in computer vision as a way to generalize to unseen object categories.

A key insight is to exploit natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word embedding space, and showed that the model could predict two unseen classes.

That same year, DeViSE extended this approach and demonstrated that it is possible to fine-tune an ImageNet model so that it can generalize to correctly predict objects outside the original 1,000-category training set.

Most instructive for CLIP is the 2016 work of Ang Li and his co-authors at FAIR, who demonstrated the use of natural language supervision to achieve zero-shot transfer to several existing computer vision classification datasets, including ImageNet. They achieved this by fine-tuning an ImageNet CNN to predict a much broader set of visual concepts (visual n-grams) from the titles, descriptions, and tags of 30 million Flickr photos, and reached 11.5% zero-shot accuracy on ImageNet.

Finally, CLIP is part of a group of papers that revisit learning visual representations from natural language supervision over the past year.

This line of work uses more modern architectures such as the Transformer, and includes VirTex, which explored autoregressive language modeling; ICMLM, which investigated masked language modeling; and ConVIRT, which studied the same contrastive objective we use for CLIP, but in the domain of medical imaging.

3. Method

We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a variety of image classification datasets.

Our approach uses an abundantly available source of supervision: the text paired with images found on the internet.

This data is used to create the following proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in our dataset.

Our intuition is that, to solve this task, a CLIP model needs to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, the CLIP model can then be applied to almost any visual classification task.

For example, if a dataset's task is to classify photos of dogs versus cats, we check for each image whether the CLIP model predicts that the text description "a photo of a dog" or "a photo of a cat" is more likely to be paired with it.
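
The following minimal sketch illustrates this dog-versus-cat check using the open-source CLIP package released alongside the paper; the checkpoint name and the image path "dog.jpg" are illustrative placeholders rather than details fixed by the text above.

```python
# Zero-shot dog-vs-cat check with the open-source CLIP package
# (pip install git+https://github.com/openai/CLIP.git). Image path is a placeholder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    # logits_per_image[i, j] is the scaled similarity between image i and text snippet j
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # higher probability on "a photo of a dog" means CLIP pairs the image with that caption
```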

CLIP aims to alleviate some of the major problems in standard deep learning methods for computer vision:

3.1 Costly datasets

Deep learning is data-intensive, and vision models have traditionally been trained on human-labeled datasets that are expensive to build and provide supervision for only a limited number of predetermined visual concepts.

The ImageNet dataset is one of the largest efforts in this field, requiring more than 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text-image pairs publicly available on the Internet. Reducing the need for expensive large labeled datasets has been extensively studied in previous work, particularly self-supervised learning contrastive methods, self-training methods, and generative modeling.

3.2 Narrow scope

An ImageNet model is good at predicting the 1,000 ImageNet categories, but that is all it can do "out of the box".

To perform any other task, a machine learning practitioner needs to build a new dataset, add an output head, and fine-tune the model.

In contrast, CLIP can be adapted to perform various visual classification tasks without additional training examples.

To apply CLIP to a new task, all we need to do is "tell" CLIP's text encoder the names of the task's visual concepts, and it will output a linear classifier over CLIP's visual representations.

The accuracy of such classifiers is often comparable to that of fully supervised models.
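
A rough sketch of this idea, assuming the same open-source CLIP package as above: the normalized text embeddings of the class-name prompts act as the weight rows of a linear classifier over CLIP's image features. The class names and the single prompt template are hypothetical placeholders.

```python
# Building a zero-shot classifier from class names: the normalized text embeddings
# serve as the weights of a linear classifier over CLIP image features.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "bird"]                      # hypothetical task categories
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def classify(images):
    """images: a batch of preprocessed image tensors of shape (N, 3, 224, 224)."""
    with torch.no_grad():
        image_features = model.encode_image(images.to(device))
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        # A plain linear map: cosine similarity against each class prompt, scaled and softmaxed.
        return (100.0 * image_features @ text_features.T).softmax(dim=-1)
```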

3.3 Poor real-world performance

Deep learning systems are often reported to achieve human or even superhuman performance on vision benchmarks, yet when deployed in the wild their performance can fall far below the expectations set by those benchmarks. In other words, there is a gap between "benchmark performance" and "real-world performance".

We speculate that this gap arises because the model "cheats" by optimizing only for benchmark performance, much like a student who passes an exam by studying only the questions from past years' exams.

In contrast, the CLIP model can be evaluated on benchmarks without training on their data, so it cannot "cheat" in this way. This makes its benchmark performance much more representative of its performance in the wild.

To test the "cheating" hypothesis, we also measured how CLIP's performance changes when it is given the ability to "study" for ImageNet.

When a linear classifier is fitted on top of CLIP's features, it improves CLIP's accuracy on the ImageNet test set by nearly 10%.

However, this classifier performs no better on average across an evaluation suite of 7 other datasets that measure "robust" performance.
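
A rough sketch of this linear-probe setup, under the assumption that images arrive through ordinary PyTorch data loaders yielding preprocessed (image, label) batches; the loader names are placeholders, and logistic regression stands in for the linear classifier.

```python
# Linear probe: freeze CLIP, extract image features, fit a logistic-regression classifier.
# `train_loader` / `test_loader` are placeholder DataLoaders of preprocessed (image, label) batches.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(loader, model, device):
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            features = model.encode_image(images.to(device))
            feats.append(features.float().cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# train_x, train_y = extract_features(train_loader, model, device)
# test_x, test_y = extract_features(test_loader, model, device)
# probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
# print("linear-probe accuracy:", probe.score(test_x, test_y))
```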

4. Key points

4.1 CLIP is highly efficient

CLIP learns from unfiltered, highly diverse and noisy data and is designed to be used in a zero-shot fashion.

We know from GPT-2 and GPT-3 that models trained on such data can achieve compelling zero-shot performance; however, such models require substantial compute to train. To reduce the compute needed, we focused on algorithmic ways to improve the training efficiency of our approach.

We report two algorithmic choices that lead to significant computational savings.

The first choice is the use of a contrastive objective to connect text and images. We initially explored an image-to-text approach, similar to VirTex, but ran into difficulty scaling it to achieve state-of-the-art performance. In small- to medium-scale experiments, we found that the contrastive objective used by CLIP is 4 to 10 times more efficient at zero-shot ImageNet classification.
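
A compact sketch of such a symmetric contrastive objective, adapted from the pseudocode in the CLIP paper; the feature shapes and the temperature value are illustrative, not the exact training configuration.

```python
# Symmetric contrastive loss over a batch of N (image, text) pairs: matching pairs
# sit on the diagonal of the similarity matrix and are treated as the correct class.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: (N, d), row i of each comes from the same pair
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() / temperature      # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = correct pairing

    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2
```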

The second option is to adopt the Vision Transformer, which further improves our computational efficiency by a factor of 3 compared to standard ResNet. Finally, our best performing CLIP model was trained for 2 weeks on 256 GPUs, which is similar to existing large-scale image models.

4.2 CLIP is flexible and general

Because they learn a wide range of visual concepts directly from natural language, CLIP models are more flexible and general than existing ImageNet models. We found that they are able to perform many different tasks zero-shot.

To test this, we measure the zero-shot performance of CLIP on more than 30 different datasets, including tasks such as fine-grained object classification, geolocation, action recognition in videos, and OCR.

In particular, learning to perform OCR is an example of exciting behavior that does not occur in standard ImageNet models. We also visualized a random, non-cherry-picked prediction from each zero-shot classifier.

This finding is also reflected in standard representation learning evaluations using linear probes. On 20 of the 26 different transfer datasets we tested, the best CLIP model outperformed the best publicly available ImageNet model, Noisy Student EfficientNet-L2.

5. Limitations

While CLIP generally performs well at recognizing common objects, it performs poorly at more abstract or systematic tasks (such as counting the number of objects in an image) and more complex tasks (such as predicting the distance to the nearest car in a photo).

On both datasets, zero-shot CLIP is only marginally better than random guessing. Compared to task-specific models, zero-shot CLIP also performs poorly on very fine-grained classifications, such as distinguishing car models, aircraft variants, or flower species.

CLIP still generalizes poorly to images not covered in its pre-training dataset.

For example, while CLIP learned a powerful OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP achieved only 88% accuracy, far below the 99.75% achieved by humans on the dataset.

Finally, we observed that CLIP's zero-shot classifiers can be sensitive to wording or phrasing, sometimes requiring trial-and-error "prompt engineering" to perform well.
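
One common mitigation, used in the released CLIP code, is to ensemble several prompt templates per class; the sketch below averages normalized text embeddings over a few illustrative templates.

```python
# Prompt ensembling: average normalized text embeddings over several templates so the
# zero-shot classifier depends less on any single phrasing. Templates are illustrative.
import torch
import clip

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def ensembled_class_embedding(model, class_name, device):
    tokens = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        embeddings = model.encode_text(tokens)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        mean = embeddings.mean(dim=0)
        return mean / mean.norm()
```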

6. Broader impacts

CLIP allows people to design their own classifiers and removes the need for task-specific training data. The way these classes are designed can significantly affect both model performance and model bias. For example, we found that when given a set of labels including FairFace race labels and a handful of egregious terms such as "criminal" and "animal", the model tends to classify images of people aged 0-20 into the egregious categories at a rate of about 32.3%. However, when we add the class "child" to the list of possible classes, this behavior drops to about 8.7%.

Also, because CLIP does not require task-specific training data, it can more easily unlock certain niche tasks. Some of these tasks may raise privacy- or surveillance-related risks, which we explore by studying CLIP's performance on celebrity identification.

CLIP achieves a top-1 accuracy of 59.2% for "in the wild" celebrity image classification when choosing from 100 candidates, and a top-1 accuracy of 43.3% when choosing from 1,000 possible options.

While achieving these results with task-independent pre-training is notable, this performance is not competitive with widely used production-grade models.

We further explore the challenges presented by CLIP in our paper, and we hope this work will inspire future research into the characterization of the capabilities, shortcomings, and biases of such models.

7. Conclusion

With CLIP, we tested whether the task-agnostic pre-training on internet-scale natural language that has driven recent breakthroughs in NLP can also be leveraged to improve the performance of deep learning in other domains.

We are very excited about what we have achieved so far applying this approach to computer vision.

Like the GPT family, CLIP learns a wide variety of tasks during pre-training, which we demonstrate through zero-shot transfer. We are also encouraged by our findings on ImageNet, which suggest that zero-shot evaluation is a more representative measure of a model's capability.
