CLIP paper reading, zero-shot experiment, and linear probe experiment record

Notes on reading the CLIP paper, a zero-shot experiment (direct inference with the pre-trained model), and a linear probe experiment (freeze CLIP as a feature extractor and train only a classification layer on top).

1. Paper reading

Paper: Learning Transferable Visual Models From Natural Language Supervision
Github: https://github.com/openai/CLIP
Reference video: a paragraph-by-paragraph close reading of the CLIP paper [paper reading series]

CLIP (Contrastive Language-Image Pre-training) is a 2021 work from OpenAI. Its goal is to use natural-language text as a supervisory signal to train a transferable vision model. The model is illustrated in the figure below:

[figure: CLIP model overview from the paper, contrastive pre-training and zero-shot inference]

  • Text Encoder: a Transformer with 12 layers, 8 attention heads, and 512-dimensional features; the tokenizer uses BPE (byte pair encoding);
  • Image Encoder: the experiments compare 5 ResNet variants (including EfficientNet-style scaled versions) and 3 ViTs; ViT-L/14@336px performed best and was finally chosen.

The specific model hyperparameters are as follows:
[table: hyperparameters of the CLIP models, from the paper]

  • Training phase: the pre-training objective is to learn the matching relationship between image-text pairs through contrastive learning; in the model diagram, the blue diagonal of the similarity matrix marks the matching (positive) image-text pairs. Training used OpenAI's own WIT (WebImageText) dataset of 400 million image-text pairs.

  • Inference stage: use prompt engineering (for example, to classify images of cats and dogs, feed in "A photo of a cat" and "A photo of a dog" and compute each text's similarity with the image features) and prompt ensembling (the paper designs more than 80 prompt templates; for example, build several different prompts for "cat" and "dog", encode each, and average the resulting scores). A minimal prompt-ensembling sketch follows this list.
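
To make prompt ensembling concrete, here is a minimal sketch (the two templates, the class names, and the image path cat.png are illustrative, not the paper's full 80-template setup) that averages the normalized text embeddings of several prompts per class and compares the result with the image embedding:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog"]                                     # illustrative class names
templates = ["a photo of a {}.", "a blurry photo of a {}."]  # illustrative subset of the ~80 templates

with torch.no_grad():
    # one ensembled text embedding per class: encode every prompt,
    # L2-normalize, average over templates, then normalize again
    class_embeddings = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        emb = emb.mean(dim=0)
        class_embeddings.append(emb / emb.norm())
    text_features = torch.stack(class_embeddings)            # [num_classes, d_e]

    image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)  # hypothetical test image
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # one probability per class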

The following is the pseudocode of the model workflow:

# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or text Transformer

# I[n, h, w, c] - minibatch of training images; n is the batch size, h/w/c are height/width/channels
# T[n, l]       - minibatch of training texts; n is the batch size, l is the text length
# W_i[d_i, d_e] - learned projection matrix for image embeddings
# W_t[d_t, d_e] - learned projection matrix for text embeddings
# t - learned temperature parameter (applied before the softmax)

# extract feature representations I_f and T_f
I_f = image_encoder(I) # [n, d_i]
T_f = text_encoder(T)  # [n, d_t]

# project I_f and T_f into the same joint embedding space and L2-normalize each row. [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities between image and text features. [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function (standard in contrastive learning)
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
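
The pseudocode above is NumPy-flavored. As a sanity check, here is a minimal runnable PyTorch sketch of the same symmetric loss, with random tensors standing in for the encoder outputs (the dimensions and the initial temperature are illustrative, not the paper's training setup):

import torch
import torch.nn.functional as F

n, d_i, d_t, d_e = 8, 768, 512, 512             # illustrative sizes

I_f = torch.randn(n, d_i)                       # stand-in for image_encoder(I)
T_f = torch.randn(n, d_t)                       # stand-in for text_encoder(T)
W_i = torch.randn(d_i, d_e, requires_grad=True) # learned image projection
W_t = torch.randn(d_t, d_e, requires_grad=True) # learned text projection
t = torch.tensor(2.659, requires_grad=True)     # log-temperature; exp(2.659) is about 1/0.07, CLIP's initial value

I_e = F.normalize(I_f @ W_i, dim=1)             # [n, d_e]
T_e = F.normalize(T_f @ W_t, dim=1)             # [n, d_e]

logits = I_e @ T_e.t() * t.exp()                # [n, n] scaled cosine similarities

labels = torch.arange(n)
loss_i = F.cross_entropy(logits, labels)        # image -> text direction (rows)
loss_t = F.cross_entropy(logits.t(), labels)    # text -> image direction (columns)
loss = (loss_i + loss_t) / 2
loss.backward()
print(loss.item())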

2. Zero-shot inference experiment

This part directly loads the pre-trained model weights for zero-shot inference.

  • Create a new project openai_clip and, following the GitHub repo, install clip and its dependencies from source:

    pip install git+https://github.com/openai/CLIP.git
    
  • Manually download the model weights (ViT-B/32) to a local ckpt/ directory:

    wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
    
  • Save the test image piano_dog.png, create a new notebook zero-shot.ipynb, and run:

    import torch
    import clip
    import os
    import numpy as np
    from PIL import Image
    
    os.environ['CUDA_VISIBLE_DEVICES']='1'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip.available_models()
    

    Test image: piano_dog.png (a dog playing a piano).
    The last line prints the list of available pre-trained model weight names.

  • Load the model and view the model information:

    model, preprocess = clip.load("ckpt/ViT-B-32.pt", device=device)
    
    input_resolution = model.visual.input_resolution
    context_length = model.context_length
    vocab_size = model.vocab_size
    
    print("Model parameters:", f"{
            
            np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
    print("Input resolution:", input_resolution)
    print("Context length:", context_length)
    print("Vocab size:", vocab_size)
    

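    As a quick check of the context_length value, clip.tokenize pads (or truncates) every prompt to that fixed token length. A minimal check:

    tokens = clip.tokenize(["a photo of a dog", "a photo of a cat"])
    print(tokens.shape)  # -> torch.Size([2, context_length]); for CLIP this is 77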

  • Extract the image and text features and compute their similarity:

    image = preprocess(Image.open("./dataset/piano_dog.png")).unsqueeze(0).to(device)
    text = clip.tokenize(["a dog eating an egg", "a dog singing a song", "a dog playing a piano"]).to(device)
    
    with torch.no_grad():
    
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        print("图文特征:", image_features.shape, text_features.shape)
    
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        print("图文logits:", image_features.shape, text_features.shape, probs.shape)
    
    print("Label probs:", np.around(probs, 3))  # prints: [[0.9927937  0.00421068 0.00299572]]
    

    It can be seen that "a dog playing a piano" has the highest probability.
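
    Equivalently, the call model(image, text) can be reproduced by hand from the encoded features: L2-normalize both feature sets and scale their cosine similarities by the learned temperature. A minimal sketch, reusing the image and text variables from the cell above:

    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        logits = model.logit_scale.exp() * img_f @ txt_f.t()  # same as logits_per_image
        print(logits.softmax(dim=-1).cpu().numpy())           # should match the probabilities above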

3. Linear Probe training experiment

The paper shows that freezing CLIP as a feature extractor and training only a linear classification layer on top (a linear probe) outperforms direct zero-shot inference once roughly 8 or more labeled examples per class are available. The following uses sklearn to try this out.
[figure: few-shot linear probe vs. zero-shot performance, from the paper]

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# load the pre-trained CLIP model
device = "cuda:4" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('./ckpt/ViT-B-32.pt', device)

# load the CIFAR-100 dataset (with download=True it is fetched into root if the extracted cifar-100-python folder is not already there)
root = os.path.expanduser("./dataset/cifar100/")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

# the frozen model is only used to extract features
def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# build train_X, train_y, test_X, test_y for sklearn
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# initialize a logistic regression classifier (C controls the regularization strength)
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# evaluate on the test set
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {
      
      accuracy:.3f}")
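
For reference, the zero-shot baseline on the same CIFAR-100 test set can be sketched as below. This is a minimal version that uses a single "a photo of a {class}" prompt per class rather than the paper's full prompt ensemble, and it reuses model, device, test, test_features, and test_labels from the code above:

# zero-shot classification of the pre-extracted test features
with torch.no_grad():
    prompts = clip.tokenize([f"a photo of a {c}" for c in test.classes]).to(device)
    text_features = model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

image_features = torch.from_numpy(test_features).to(device)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

zero_shot_preds = (image_features @ text_features.T).argmax(dim=-1).cpu().numpy()
zero_shot_acc = np.mean((test_labels == zero_shot_preds).astype(float)) * 100.
print(f"Zero-shot accuracy = {zero_shot_acc:.3f}")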

Origin: blog.csdn.net/muyao987/article/details/127043150