CLIP: Creating an Image Classifier


Introduction

Suppose you need to classify whether people wear glasses, but you don't have the data or resources to train a custom model.

In this tutorial, you will learn how to use a pretrained CLIP model to create a custom classifier without any training required. This approach, called zero-shot image classification, enables image classification of classes not explicitly observed during training of the original CLIP model.

For convenience, an easy-to-use Jupyter notebook is provided below with the full code.

CLIP: Theoretical Background

The CLIP (Contrastive Language-Image Pre-training) model is a multimodal vision and language model developed by OpenAI. It maps images and text descriptions to the same latent space, enabling it to determine whether images and descriptions match.

CLIP was trained contrastively on a dataset of over 400 million image-text pairs collected from the Internet [1]. Remarkably, classifiers built from the pretrained CLIP model have shown results competitive with supervised baselines, and in this tutorial we will use this pretrained model to build a glasses detector.

CLIP contrastive training

The CLIP model consists of an image encoder and a text encoder (Figure 1). During training, a batch of images is processed by the image encoder (a ResNet variant or a ViT) to obtain image representation tensors (embeddings). At the same time, the corresponding descriptions are processed by the text encoder (a Transformer) to obtain text embeddings.

The CLIP model is trained to predict which image embedding belongs to which text embedding in the batch. This is done by jointly training the image encoder and the text encoder to maximize the cosine similarity between the correctly paired image and text embeddings in the batch (Figure 1, blue squares along the diagonal) while minimizing the cosine similarity between incorrectly paired embeddings [2]. Optimization is performed with a symmetric cross-entropy loss over these similarity scores.
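
To make the training objective concrete, here is a minimal PyTorch sketch of the symmetric loss, adapted from the pseudocode in the CLIP paper [1]; the function and variable names are illustrative and not part of the clip package.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: (batch, d) tensors already projected into the shared space
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # pairwise cosine similarities, scaled by the learned temperature
    logits = logit_scale * image_features @ text_features.t()   # (batch, batch)

    # the matching image-text pairs lie on the diagonal
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, labels)                # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), labels)             # text -> image direction
    return (loss_images + loss_texts) / 2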

Figure 1: CLIP contrastive pre-training.

Create a custom classifier

When creating a custom classifier with CLIP, each category name is converted into a text embedding by the pretrained text encoder, while the image is converted into an image embedding by the pretrained image encoder (Figure 2). The cosine similarity between the image embedding and each text embedding is then computed, and the image is assigned to the category with the highest cosine similarity score.

Figure 2: Creating a custom zero-shot classifier with CLIP.

Code

Dataset

In this tutorial, we will create an image classifier that detects whether people wear glasses, and evaluate its performance using the 'Glasses or No Glasses' dataset [3] from Kaggle.

Although the dataset contains 5,000 images, we will only use the first 100 to speed up the demonstration. The dataset consists of a folder with all the images and a CSV file with the labels. To simplify loading image paths and labels, we subclass the PyTorch Dataset class to create a CustomDataset() class. You can find it in the provided notebook code.


Load the CLIP model

After installing and importing CLIP and its related libraries, we load the required model and the torchvision preprocessing pipeline. The text encoder is a Transformer, while the image encoder can be either a Vision Transformer (ViT) or a ResNet variant such as ResNet50. You can list the available image encoders with the command clip.available_models().

print( clip.available_models() )
model, preprocess = clip.load("RN50")

Extract text embedding vectors

First, the text labels are processed by the text tokenizer (clip.tokenize()), which converts the label words into token values. This produces a padded tensor of size N x 77 (N is the number of classes, two in our binary case; 77 is CLIP's fixed context length) that serves as input to the text encoder.

The text encoder converts this tensor into a text embedding tensor of size N x D, where D is the embedding dimension of the loaded model (e.g. 1024 for RN50) and each class is represented by a single vector. To encode the text and retrieve the embeddings, use the model.encode_text() method.

preprocessed_text = clip.tokenize(['no glasses','glasses'])
text_embedding = model.encode_text(preprocessed_text)
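
As a quick sanity check on the shapes described above, you can print them directly (the embedding dimension D depends on the encoder you loaded):

print(preprocessed_text.shape)   # (N, 77): one padded token sequence per class
print(text_embedding.shape)      # (N, D): one embedding vector per class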

Extract image embedding vectors

Before being passed to the image encoder, each image is preprocessed with resizing, center cropping, and normalization to meet the encoder's input requirements. After preprocessing, the image is passed to the image encoder, which produces an image embedding tensor of size 1 x D as output.

preprocessed_image = preprocess(Image.open(image_path)).unsqueeze(0)
image_embedding = model.encode_image(preprocessed_image)

Similarity results

To measure the similarity between the image embedding and each text label embedding, we use cosine similarity. model() takes the preprocessed image and text inputs, passes them through the image and text encoders, and computes the cosine similarity between the corresponding image and text features, scaled by a learned temperature factor of roughly 100 (the image logits). The logits are then normalized into a probability distribution over the classes using softmax.

Since we are not training the model, we disable gradient computation with torch.no_grad().

with torch.no_grad():
    image_logits, _ = model(preprocessed_image, preprocessed_text)
proba_list = image_logits.softmax(dim=-1).cpu().numpy()[0]
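
For reference, the following sketch shows roughly what model() computes under the hood; it is an illustrative equivalent, not the library's exact implementation.

with torch.no_grad():
    image_features = model.encode_image(preprocessed_image)   # (1, D)
    text_features = model.encode_text(preprocessed_text)      # (N, D)

    # L2-normalize so that the dot product equals cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # cosine similarities scaled by the learned temperature (roughly 100)
    manual_logits = model.logit_scale.exp() * image_features @ text_features.t()
    manual_probas = manual_logits.softmax(dim=-1).cpu().numpy()[0]
    print(manual_probas)   # matches proba_list above up to numerical precision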

Set the class with the greatest probability as the predicted class, and extract its index, probability, and corresponding label.

y_pred = np.argmax(proba_list)
y_pred_proba = np.max(proba_list)
y_pred_token = ['no glasses','glasses'][y_pred]

Wrapping up the code

We can wrap this code in a Python class called CustomClassifier. At initialization, the pretrained CLIP model is loaded and the class label prompts are tokenized so they can be encoded into text embeddings.

We'll define a classify() method that takes an image path as input and returns the predicted label and its probability score as a one-row DataFrame.

To evaluate the model's performance, we'll define a validate() method that uses a PyTorch dataset instance (CustomDataset()) to retrieve the images and labels, calls the classify() method on each image, and evaluates the predictions. It returns the accuracy along with a DataFrame (df_results) containing the predicted labels and probability scores for all images. The max_images parameter limits the number of images evaluated (100 in this demo).

class CustomClassifier:

    def __init__(self, prompts):

        self.class_prompts = prompts
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model, self.preprocess = clip.load("RN50", device=self.device) # "ViT-B/32"
        self.preprocessed_text = clip.tokenize(self.class_prompts).to(self.device)
        print(f'Classes Prompts: {self.class_prompts}')

    def classify(self, image_path, y_true = None):

        preprocessed_image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)

        with torch.no_grad():
            image_logits, _ = self.model(preprocessed_image, self.preprocessed_text)
            proba_list = image_logits.softmax(dim=-1).cpu().numpy()[0]

        y_pred = np.argmax(proba_list)
        y_pred_proba = np.max(proba_list)
        y_pred_token = self.class_prompts[y_pred]
        results = pd.DataFrame([{'image': image_path, 'y_true': y_true, 'y_pred': y_pred, 'y_pred_token': y_pred_token, 'proba': y_pred_proba}])
        return results

    def validate(self, dataset, max_images):

        df_results = pd.DataFrame()
        for sample in tqdm(range(max_images)):
            image_path, class_idx = dataset[sample]
            image_results = self.classify(image_path, class_idx)
            df_results = pd.concat([df_results, image_results])

        accuracy = accuracy_score(df_results.y_true, df_results.y_pred)
        print(f'Accuracy - {round(accuracy,2)}')
        return accuracy, df_results

Individual images can be classified using the classify() method:

prompts = ['no glasses','glasses']
image_results = CustomClassifier(prompts).classify(image_path)

The performance of a classifier can be evaluated with the validate() method:

accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images =100)

Note that with the simple ['no glasses', 'glasses'] class labels we achieved a decent accuracy of 0.82 without training any model, and with prompt engineering we can improve the result even further.

Prompt engineering

The CLIP classifier encodes the text labels into the learned latent space and compares their similarity with the image's latent representation. Changing the wording of the prompts produces different text embeddings, which affects the performance of the classifier.
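
You can observe this effect directly by comparing the embeddings of two wordings of the same class; a quick illustrative check:

with torch.no_grad():
    tokens = clip.tokenize(['glasses', 'photo of a man with glasses'])
    tokens = tokens.to(next(model.parameters()).device)  # match the device CLIP was loaded on
    embeddings = model.encode_text(tokens)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    print(f'Cosine similarity between the two wordings: {(embeddings[0] @ embeddings[1]).item():.3f}')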

To improve prediction accuracy, we explore multiple prompts by trial and error and choose the one that yields the best results. For example, using the prompts "photo of a man with no glasses" and "photo of a man with glasses" yields an accuracy of 0.94.

prompts = ['photo of a man with no glasses', 'photo of a man with glasses']
accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images =100)

Analyzing multiple prompts yielded the following results:

['no glasses', 'glasses'] - 0.82 accuracy

['face without glasses', 'face with glasses'] - 0.89 accuracy

['photo of a man with no glasses', 'photo of a man with glasses'] - 0.94 accuracy

As we've seen, tweaking the wording can significantly improve performance. By analyzing multiple prompts, we improved accuracy from a baseline of 0.82 to 0.94. However, it is important to avoid overfitting the prompts to the dataset.
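
One simple way to run such a comparison is to loop over candidate prompt pairs and record the accuracy of each; the snippet below is just one way to structure the search, using the CustomClassifier and glasses_dataset defined above.

candidate_prompts = [
    ['no glasses', 'glasses'],
    ['face without glasses', 'face with glasses'],
    ['photo of a man with no glasses', 'photo of a man with glasses'],
]

results = {}
for prompts in candidate_prompts:
    accuracy, _ = CustomClassifier(prompts).validate(glasses_dataset, max_images=100)
    results[tuple(prompts)] = accuracy

best_prompts = max(results, key=results.get)
print(f'Best prompts: {list(best_prompts)} (accuracy {results[best_prompts]:.2f})')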

Conclusion

CLIP models are very powerful tools for developing zero-shot classifiers for various tasks. Using CLIP, it is easy to generate on-the-fly classifiers with highly satisfactory accuracy.

However, CLIP may struggle with tasks such as fine-grained classification, abstract or systematic tasks such as counting objects, and predicting truly out-of-distribution images not covered in its pre-training dataset. Therefore, its performance in new tasks should be evaluated beforehand.

Using the Jupyter notebook provided below, you can easily create your own custom classifier. Just follow the instructions to add data, and you'll have your personalized classifier in no time.

Thanks for reading!

Jupyter Notebook

Install

Put this notebook into the desired directory. Install and import the required libraries:

!pip install clip-by-openai
!pip install pandas
!pip install -U scikit-learn
!pip install opendatasets
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension --sys-prefix
import numpy as np
import torch
import os
import clip
from PIL import Image
from tqdm.notebook import tqdm_notebook as tqdm
import pandas as pd
from sklearn.metrics import accuracy_score
import zipfile
import random

Dataset processing

Download the 'Glasses or No Glasses' dataset manually from Kaggle (https://www.kaggle.com/datasets/jeffheaton/glasses-or-no-glasses), or, if you have a kaggle.json API key file, download it with the Kaggle CLI as shown below (instructions for obtaining the key file: https://www.geeksforgeeks.org/how-to-download-kaggle-datasets-into-jupyter-notebook/).

!kaggle datasets download -d jeffheaton/glasses-or-no-glasses
# Extract zip dataset
with zipfile.ZipFile('glasses-or-no-glasses.zip', 'r') as zip_ref:
    zip_ref.extractall()

Helper function to display images

def display_random_images(dir_path, num_images, seed, save_path=None):
    random.seed(seed)
    image_paths = []
    for subdir, dirs, files in os.walk(dir_path):
        for file in files:
            file_path = os.path.join(subdir, file)
            if file_path.endswith(".png") or file_path.endswith(".jpg") or file_path.endswith(".jpeg"):
                image_paths.append(file_path)

    random_images = random.sample(image_paths, num_images)

    images = [Image.open(image_path) for image_path in random_images]
    widths, heights = zip(*(i.size for i in images))

    total_width = sum(widths)
    max_height = max(heights)

    new_im = Image.new('RGB', (total_width, max_height))

    x_offset = 0
    for im in images:
        new_im.paste(im, (x_offset,0))
        x_offset += im.size[0]

    if save_path:
        new_im.save(save_path)

    display(new_im)

Display a random selection of images from the "glasses" dataset

display_random_images(dir_path = 'faces-spring-2020', num_images=6, seed = 12, save_path = 'random_data.jpg')

Extract image paths and labels from the CSV file using a PyTorch Dataset class

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, csv_path, images_folder):
        self.df = pd.read_csv(csv_path)[['id','glasses']]
        self.images_folder = images_folder
        self.class2index = {"no glasses":0, "glasses":1}

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        filename = f'face-{self.df.iloc[index, 0]}.png'
        label = self.df.iloc[index, -1]
        image_path = os.path.join(self.images_folder, filename)
        return image_path, label

path_images = r"faces-spring-2020/faces-spring-2020"
path_csv = r"train.csv"
glasses_dataset = CustomDataset(path_csv, path_images)

CLIP custom classifier

The "CustomClassifier" class defines a custom zero-shot image classifier that uses a pre-trained CLIP model. The "classify" method classifies a single image, while the "validate" method classifies a catalog of images and evaluates performance.

class CustomClassifier:

    def __init__(self, prompts):

        self.class_prompts = prompts
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model, self.preprocess = clip.load("RN50", device=self.device) # "ViT-B/32"
        self.preprocessed_text = clip.tokenize(self.class_prompts).to(self.device)
        print(f'Classes Prompts: {self.class_prompts}')

    def classify(self, image_path, y_true = None):

        preprocessed_image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)

        with torch.no_grad():
            image_logits, _ = self.model(preprocessed_image, self.preprocessed_text)
            proba_list = image_logits.softmax(dim=-1).cpu().numpy()[0]

        y_pred = np.argmax(proba_list)
        y_pred_proba = np.max(proba_list)
        y_pred_token = self.class_prompts[y_pred]
        results = pd.DataFrame([{'image': image_path, 'y_true': y_true, 'y_pred': y_pred, 'y_pred_token': y_pred_token, 'proba': y_pred_proba}])
        return results

    def validate(self, dataset, max_images):

        df_results = pd.DataFrame()
        for sample in tqdm(range(max_images)):
            image_path, class_idx = dataset[sample]
            image_results = self.classify(image_path, class_idx)
            df_results = pd.concat([df_results, image_results])

        accuracy = accuracy_score(df_results.y_true, df_results.y_pred)
        print(f'Accuracy - {round(accuracy,2)}')
        return accuracy, df_results

Prediction for a single image:

image_path = r'faces-spring-2020/faces-spring-2020/face-1.png'
prompts = ['no glasses', 'glasses']
image_results = CustomClassifier(prompts).classify(image_path)

print(f"Prediction - '{image_results.y_pred_token[0]}'")
Classes Prompts: ['no glasses', 'glasses']
Prediction - 'no glasses'

Classification and evaluation of the entire image set:

accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images =100)
display(df_results)
Classes Prompts: ['no glasses', 'glasses']
Accuracy - 0.82


100 rows × 5 columns

Prompt engineering

prompts = ['face without glasses', 'face with glasses']
accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images =100)
Classes Prompts: ['face without glasses', 'face with glasses']
Accuracy - 0.89
prompts = ['photo of a man with no glasses', 'photo of a man with glasses']
accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images =100)
Classes Prompts: ['photo of a man with no glasses', 'photo of a man with glasses']
Accuracy - 0.94

References

[0] Code: https://gist.github.com/Lihi-Gur-Arie/844a4c3e98a7561d4e0ddb95879f8c11

[1] CLIP article: https://arxiv.org/pdf/2103.00020v1.pdf

[2] Cosine similarity review: https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a

[3] ‘Glasses or No Glasses’ dataset from Kaggle, license CC BY-SA 4.0: https://www.kaggle.com/datasets/jeffheaton/glasses-or-no-glasses
