OpenAI's most important model: CLIP

What do recent AI breakthroughs like DALL·E and Stable Diffusion have in common?

They both use components of the CLIP architecture. Therefore, understanding CLIP is a prerequisite if you want to grasp how these models work.

Additionally, CLIP has been used to index photos on Unsplash.

But what does CLIP do, and why is it a milestone for the AI community?

Let's start!


1. Overview of CLIP

CLIP stands for Contrastive Language-Image Pretraining:

CLIP is an open-source, multimodal, zero-shot model. Given an image and a set of textual descriptions, the model predicts the most relevant description for that image, without having been optimized for that specific task.

Let's break down this description:

  • Open Source: The model was created and open sourced by OpenAI. We'll see a programming tutorial on how to use it later.
  • Multimodal: Multimodal architectures leverage multiple domains to learn a specific task. CLIP combines natural language processing and computer vision.
  • Zero-shot: Zero-shot learning is a method that generalizes on unseen labels without requiring specialized training to classify them. For example, all ImageNet models are trained to recognize 1000 specific categories. CLIP is not subject to this limitation.
  • Contrastive: With this technique, CLIP is trained so that similar representations end up close together in the latent space, while dissimilar representations end up far apart. This will become clearer with the examples later.

Here are some interesting facts about CLIP:

  • CLIP is trained using a staggering 400 million image-text pairs. In comparison, the ImageNet dataset contains 1.2 million images.
  • The final tuned CLIP model was trained for two weeks on 256 V100 GPUs. That's at least $200,000 for on-demand training on AWS SageMaker!
  • The model is trained using mini-batches of 32,768 images.

2. What can CLIP do?

Let's visualize what CLIP does. We'll show a coding example in more detail later.

First, we choose a free image from Unsplash:

[Unsplash image: a girl wearing a beanie]

Next, we provide CLIP with the following text prompts:

  • ‘a girl wearing a beanie’.
  • ‘a girl wearing a hat’.
  • ‘a boy wearing a beanie’.
  • ‘a girl riding a bike’.
  • ‘a dog’.

Clearly, the first prompt describes the image best.

CLIP automatically finds which text prompt best describes the image by assigning each one a normalized probability. We get:
[Figure: normalized probabilities assigned to each prompt]

The model successfully finds the most appropriate image description.

In addition, CLIP can accurately identify classes and objects it has never seen before.

If you have a large dataset of images and you want to label those images with a specific class/category/description, CLIP will do it for you automatically!

Next, we'll show how CLIP works.

3. CLIP architecture

CLIP is a deep learning model that uses novel ideas from other successful architectures and introduces some of its own.

Let's start with the first part, contrastive pre-training:

3.1 Contrastive pre-training

Figure 1 shows an overview of the contrastive pre-training process.
[Figure 1: Overview of the contrastive pre-training process]

Suppose we have a batch of N images and their respective description pairs, e.g. <image1, text1>, <image2, text2>, …, <imageN, textN>.

Contrastive pre-training aims to jointly train image and text encoders that generate image embeddings [I1, I2 … IN] and text embeddings [T1, T2 … TN] in the following way:

  • The cosine similarity of the correct embedding pair <I1,T1>, <I2,T2> (where i=j) is maximized.
  • In contrast, the cosine similarity of dissimilar pairs <I1,T2>, <I1,T3>…<Ii,Tj> (where i≠j) is minimized.

Let's see what happens step by step:

  • The model receives a batch of N pairs.
  • The text encoder is a standard Transformer model with GPT-2-style modifications. The image encoder can be either a ResNet or a Vision Transformer.
  • For each image in the batch, the image encoder computes an image vector. The first image corresponds to the vector I1, the second to I2, and so on. Each vector has size d_e, where d_e is the embedding dimension. Therefore, the output of this step is an N x d_e matrix.
  • Similarly, the text descriptions are encoded into text embeddings [T1, T2 … TN], yielding another N x d_e matrix.
  • Finally, we multiply these matrices (after normalization) to compute the pairwise cosine similarity between every image and every text description. This produces an N x N matrix, as shown in Figure 1.
  • The goal is to maximize the cosine similarity along the diagonal - these are the correct pairs. In contrast, the similarity of the off-diagonal elements should be minimized (e.g. image I1 is described by T1 rather than T2, T3, etc.).

Some additional comments:

  • The model uses a symmetric cross-entropy loss as its optimization objective. The loss is applied in both the image-to-text direction and the text-to-image direction (remember, the similarity matrix contains both the <I1,T2> and <I2,T1> cosine similarities). A minimal sketch of this loss follows below.
  • Contrastive pre-training is not entirely new. It was introduced in earlier work and adapted for CLIP.
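
A minimal PyTorch sketch of this symmetric loss (an illustration, not OpenAI's actual training code) is shown below. It assumes the two encoders have already produced the N x d_e embedding matrices, and it fixes the temperature, which in the real model is a learned parameter:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, d_e) batches of image and text embeddings
    # L2-normalize so that a dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N matrix of pairwise cosine similarities, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # The correct text for image i sits in column i (the diagonal)
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in the image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2

Averaging the two cross-entropy terms is exactly what makes the loss symmetric between the two directions.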

3.2 Zero-shot classification

We now have image and text encoders pretrained and we are ready for zero-shot classification.

  • Baseline

First, let's provide some background. How was few-shot classification achieved in the pre-Transformer era?

It's simple:

  • Download a high-performance pre-trained CNN, such as ResNet, and use it for feature extraction to obtain image features.
  • These features are then used as input to standard classifiers such as logistic regression. The classifier is trained in a supervised manner, where image labels are used as target variables (Fig. 2).
  • For K-shot learning, the training set in the classification phase contains only K instances of each class.
    When K<10, the task is called few-shot learning; for K=1, we have one-shot learning. If we use all available data, this is just fully supervised learning (the old-fashioned way).

[Figure 2: A pre-trained CNN as a feature extractor followed by a supervised linear classifier]

Note the keyword "supervised" above - the classifier must know the class labels in advance. Using a frozen image feature extractor paired with a linear classifier is also known as linear-probe evaluation.
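
As a rough sketch of this baseline (not the exact setup used in the paper, and assuming a recent torchvision), the snippet below extracts features with a pre-trained ResNet-50 and fits a logistic-regression classifier on top. The variables train_images, train_labels and test_images are hypothetical lists of PIL images and label arrays:

import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression

# Pre-trained ResNet-50 as a frozen feature extractor (drop the final fc layer)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    return resnet(batch).numpy()   # (N, 2048) feature matrix

# train_images / train_labels / test_images are placeholders
clf = LogisticRegression(max_iter=1000).fit(extract_features(train_images), train_labels)
preds = clf.predict(extract_features(test_images))

For K-shot learning, train_images would simply contain K examples per class.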

  • CLIP's Competitive Advantage

The process of how CLIP performs zero-shot classification is shown in Figure 3:

[Figure 3: Zero-shot classification with CLIP]

Again, the process is simple:

  • First, we provide a set of text descriptions, such as 'a photo of a dog' or 'a cat eating an ice-cream' (text that we think best describes one or more images). These text descriptions are encoded into text embeddings.
  • Then, we do the same with images - the images are encoded into image embeddings.
  • Finally, CLIP computes the pairwise cosine similarity between the image and text embeddings. The text prompt with the highest similarity is chosen as the prediction.

Of course, we can input multiple images. CLIP cleverly caches the input text embeddings so they don't have to be recomputed for the rest of the input images.
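
Here is a minimal sketch of this zero-shot pattern using the Hugging Face CLIPModel (the same checkpoint used in the coding example of section 6); the prompts are illustrative. The text embeddings are computed once and cached, and every new image is only compared against them:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a dog", "a cat eating an ice-cream"]

# Encode the prompts once and cache the normalized text embeddings
with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(image):
    # Encode one image and compare it against the cached text embeddings
    with torch.no_grad():
        image_inputs = processor(images=image, return_tensors="pt")
        img_emb = model.get_image_features(**image_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(0)   # cosine similarities
    return prompts[sims.argmax().item()]

Because the text side is fixed, labeling a large image collection costs only one image-encoder forward pass per image.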

That's it! We have now summarized how CLIP works end-to-end.

4. Problems with data

CLIP is pre-trained on 400 million image-text pairs collected from the internet and evaluated on more than 30 public datasets. Feeding large models with large amounts of data is important.

However, robust datasets with paired image-text descriptions are hard to find. Most public datasets, such as CIFAR, pair each image with only a one-word label - the target category. But CLIP was built to work with full text descriptions.

To overcome this discrepancy, the authors did not exclude these datasets. Instead, they did some feature engineering: converting a single-word label (e.g. bird or car) into a sentence, such as 'a photo of a bird' or 'a photo of a car'. On the Oxford-IIIT Pets dataset, the authors used the prompt: 'A photo of a {label}, a type of pet.'
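
As a tiny illustration of this templating (the label lists below are examples, not the datasets' full label sets), the conversion is a simple string substitution:

# Convert bare class labels into full-sentence prompts (example labels only)
pet_labels = ["Abyssinian", "Bengal", "Birman"]
generic_labels = ["bird", "car", "dog"]

pet_prompts = [f"A photo of a {label}, a type of pet." for label in pet_labels]
generic_prompts = [f"a photo of a {label}" for label in generic_labels]

These sentences are then encoded exactly like any other text prompt during zero-shot classification.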

For more information on pre-training techniques, check out the original paper.

5. The impact of CLIP on AI

At the beginning of the article, we claimed that CLIP was a milestone for the AI community.

Let's see why:

5.1 Superior performance as a zero-shot classifier

CLIP is a zero-shot classifier, so it makes sense to test CLIP against few-shot learning models first.

Therefore, the authors tested CLIP against a model consisting of a linear classifier on top of a high-quality pretrained model such as ResNet.

The result is shown in Figure 4:

[Figure 4: Zero-shot CLIP versus few-shot linear probes]

CLIP significantly outperforms other classifiers.

Furthermore, CLIP is able to match the performance of the 16-shot linear classifier on BiT-M features. In other words, the BiT-M classifier had to be trained on at least 16 examples per class to match the score that CLIP achieves without any fine-tuning.

Interestingly, the authors also evaluate CLIP as a linear probe: they use CLIP's image encoder to obtain image features and feed them into a linear classifier, just like the other models. Even in this setting, CLIP's few-shot performance is excellent.
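
A sketch of such a linear probe on CLIP features might look like the following; as before, train_images, train_labels and test_images are hypothetical placeholders:

import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_features(pil_images):
    # CLIP's image encoder used as a frozen feature extractor
    inputs = processor(images=pil_images, return_tensors="pt")
    return model.get_image_features(**inputs).numpy()

probe = LogisticRegression(max_iter=1000).fit(clip_features(train_images), train_labels)
probe_preds = probe.predict(clip_features(test_images))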

5.2 Unrivaled robustness to distribution changes

Distribution drift is a big deal, especially for machine learning systems in production.

Note: You may think of distribution drift as concept drift, although technically they are not the same.

Distribution shift occurs when the data a model sees in production drifts away from the data it was trained on. As a result, the model's predictions become less accurate over time.

In fact, distribution drift is not something unexpected - it happens. The question is, how do you spot this phenomenon early on, and what do you need to do to "recalibrate" your model? This is not easy to fix and depends on many factors.
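
The article does not prescribe a detection method, but one simple and common check (an assumption on our part, not something CLIP-specific) is to compare the distribution of the model's confidence scores on recent production data against a reference window, for example with a two-sample Kolmogorov-Smirnov test. The file names below are hypothetical:

import numpy as np
from scipy.stats import ks_2samp

# reference_scores: confidences on a held-out set at deployment time (hypothetical file)
# live_scores: confidences collected from recent production traffic (hypothetical file)
reference_scores = np.load("reference_scores.npy")
live_scores = np.load("live_scores.npy")

stat, p_value = ks_2samp(reference_scores, live_scores)
if p_value < 0.01:
    print(f"Possible distribution shift detected (KS statistic = {stat:.3f})")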

Fortunately, new research in artificial intelligence is working to create models that can adapt to changes in the distribution.

This is why the authors also test CLIP's robustness. The result is shown in Figure 5:

[Figure 5: Robustness of zero-shot CLIP to natural distribution shift]

There are two very important points here about CLIP:

  • CLIP achieves the same accuracy on ImageNet as a fully supervised ResNet model, even though CLIP is a zero-shot model.
  • In addition to the original ImageNet, the authors evaluate on several ImageNet variants that serve as distribution-shift benchmarks. ResNet struggles on these datasets, while CLIP handles the unseen images very well - in fact, it maintains roughly the same level of accuracy across all of the ImageNet variants!

5.3 Computational Efficiency

Before GPT-2, computational efficiency was taken for granted (sort of).

Today, in an era where models take weeks to train on hundreds of $8,000 GPUs, the problem of computational efficiency is more acutely addressed.

CLIP is a more computationally friendly architecture. Part of this success is due to the fact that CLIP uses the Vision Transformer as the default image encoder component. The result is shown in Figure 6:
[Figure 6: Accuracy versus compute for CLIP and other models]

Clearly, CLIP is able to utilize hardware resources better than other models. This also means additional cost savings when training on cloud services like AWS SageMaker. Furthermore, Figure 6 shows that CLIP scales better than the other models: its accuracy improves more favorably as compute increases.

There is still the question of data efficiency. The authors show that CLIP is more data efficient than similar models in the zero-shot setting. However, they do not address the data efficiency of CLIP in the pre-training phase. That said, there may not be much that can be done in this regard, since CLIP uses Transformer-based encoders - and Transformers are inherently data-intensive models.

5.4 Increased research interest

The success of CLIP sparked interest in text-to-image models and popularized contrastive pre-training methods.

In addition to DALL·E and Stable Diffusion, CLIP can also be used as a discriminator in GANs.

Furthermore, the release of CLIP inspired similar CLIP-based publications that extend the model's capabilities, such as DenseCLIP and CoCoOp.

Additionally, Microsoft released X-CLIP, a minimal extension of CLIP for video language understanding.

Extra info: A Pictionary-like application called paint.wtf uses CLIP to rank your drawings. Give it a try - super fun!

6. How to use CLIP - coding example

Next, we'll show how to use CLIP with the Hugging Face transformers library.

First, let's choose 3 images from Unsplash. We used the first one before:

[Image 1: a girl wearing a beanie]
[Image 2: a dog]
[Image 3: a dog at the beach]

We will use the following libraries:

import pandas as pd
import torch
import requests
from PIL import Image

from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

Next, we load the CLIP model's weights, tokenizer and image processor:

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/clip-vit-base-patch32"

# we initialize a tokenizer, image processor, and the model itself
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to(device)

Next, we load the Unsplash images in Python:

urls=['https://images.unsplash.com/photo-1662955676669-c5d141718bfd?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=687&q=80',
    'https://images.unsplash.com/photo-1552053831-71594a27632d?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=662&q=80',
    'https://images.unsplash.com/photo-1530281700549-e82e7bf110d6?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=688&q=80']

images=[Image.open(requests.get(i, stream=True).raw)  for i in urls]

Finally, we provide some text prompts for CLIP.

The goal is to have CLIP match each of the 3 Unsplash images with a specific text description. Note that one of the prompts is misleading - let's see if we can confuse the model:

text_prompts = ["a girl wearing a beanie", "a boy wearing a beanie", "a dog", "a dog at the beach"]
# tokenize the prompts and preprocess the images, then move the tensors to the model's device
inputs = processor(text=text_prompts, images=images, return_tensors="pt", padding=True).to(device)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image   # image-text similarity scores
probs = logits_per_image.softmax(dim=1)       # normalized probabilities per image

# display the probabilities (in %) as a colour-coded table
pd.DataFrame(probs.detach().cpu().numpy() * 100, columns=text_prompts,
             index=["image1", "image2", "image3"]).style.background_gradient(
             axis=None, low=0, high=0.91).format(precision=2)

[Table: probability (in %) assigned to each prompt for each image]

The model successfully classified all 3 images!

Note two points:

  • CLIP can understand multiple entities and their actions in each image.
  • CLIP assigns the most specific description to each image. For example, both "a dog" and "a dog at the beach" could describe the second image. However, the model correctly decides that "a dog" is the better description, because there is no beach in that image.

Feel free to try this example. The full example is here. Use your own images and text descriptions to explore how CLIP works.

7. CLIP limitations and future work

Although CLIP is a revolutionary model, there is still room for improvement. The authors identify areas where further progress is likely.

  • Accuracy score: CLIP is a state-of-the-art zero-shot classifier that directly challenges task-specifically trained models. The fact that CLIP matches the accuracy of a fully supervised ResNet-101 on ImageNet is striking. However, there are still supervised models that achieve higher scores. The authors stress that, given its impressive scalability, CLIP could reach higher scores, but this would require a great deal of compute.
  • Ambiguity: The authors point out that CLIP can be ambiguous. Sometimes, the model cannot distinguish the meaning of a word due to a lack of context. Remember, we mentioned earlier that some images are labeled only with class labels and not with full-text prompts. The authors provide an example: in the Oxford-IIIT Pets dataset, the word "boxer" refers to a dog breed, but in other datasets a "boxer" is an athlete. Here, the culprit is the quality of the data, not the model itself.
  • Task-specific learning: While CLIP can distinguish complex image patterns, the model fails on some trivial tasks. For example, the model struggles with handwritten digit recognition tasks (Figure 7). The authors attribute this type of misclassification to the lack of handwritten digits in the training dataset.

[Figure 7: CLIP errors on handwritten digit recognition]

8. Conclusion

Without a doubt, CLIP is an important model for the AI community.

Essentially, CLIP paves the way for a new generation of text-to-image models that are revolutionizing AI research. Of course, don't forget that this model is open source.

Last but not least, there is a lot of room for improvement. Throughout the paper, the authors imply that many of the limitations of CLIP are due to the low quality of the training data.


Original link: OpenAI's most influential model - BimAnt
