Image similarity comparison: CLIP or DINOv2?

In the field of artificial intelligence, two giants of computer vision stand out: CLIP and DINOv2. CLIP changed the way images are understood by connecting them with natural language, while DINOv2 brought a new approach to self-supervised learning. In this article, we explore the strengths and subtleties of CLIP and DINOv2 and aim to discover which of these models truly excels at image similarity tasks. Let's witness the battle between these two giants and see which one comes out on top.


Image similarity in CLIP

Calculating the similarity between two images using CLIP is a simple process that requires only two steps: first extract the features of the two images, and then calculate their cosine similarity.

First, make sure you have the required packages installed. It is recommended to set up and work inside a virtual environment:

#Start by setting up a virtual environment
virtualenv venv-similarity
source venv-similarity/bin/activate
#Install required packages
pip install transformers Pillow torch

Next, calculate the image similarity:

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel
import torch.nn as nn


device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)


#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    image_features1 = model.get_image_features(**inputs1)


#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    image_features2 = model.get_image_features(**inputs2)


#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0],image_features2[0]).item()
sim = (sim+1)/2
print('Similarity:', sim)

The two similar images used for the comparison

For this pair of similar images, the similarity score obtained is an impressive 96.4%.

Image similarity in DINOv2

The process for calculating the similarity between two images with DINOv2 mirrors the CLIP workflow and requires the same set of packages, with no additional installation:

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch.nn as nn


device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)




image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    outputs1 = model(**inputs1)
    image_features1 = outputs1.last_hidden_state
    image_features1 = image_features1.mean(dim=1)


image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    outputs2 = model(**inputs2)
    image_features2 = outputs2.last_hidden_state
    image_features2 = image_features2.mean(dim=1)


cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0],image_features2[0]).item()
sim = (sim+1)/2
print('Similarity:', sim)

Working with the same image pair as in the CLIP example, the similarity score obtained with DINOv2 was 93%.

Test using COCO dataset

Before we evaluate their performance in depth, let us compare the results of CLIP and DINOv2 using images from the validation set of the COCO dataset. The process we use is as follows:

  • Traverse the dataset to extract features of all images.

  • Store embeddings in FAISS index.

  • Extract features of the input image.

  • Retrieve the three images with the highest similarity.

For those who want to know more about FAISS, please refer to this informative article. Make sure to install it first using: pip install faiss-[gpu|cpu].

Part 1: Feature extraction and creation of 2 indexes

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel, AutoImageProcessor, AutoModel
import faiss
import os
import numpy as np


device = torch.device('cuda' if torch.cuda.is_available() else "cpu")


#Load CLIP model and processor
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)


#Load DINOv2 model and processor
processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)


#Retrieve all filenames
images = []
for root, dirs, files in os.walk('./val2017/'):
    for file in files:
        if file.endswith('jpg'):
            images.append(os.path.join(root, file))




#Define a function that normalizes an embedding and adds it to an index
def add_vector_to_index(embedding, index):
    #convert embedding to numpy
    vector = embedding.detach().cpu().numpy()
    #Convert to float32 numpy
    vector = np.float32(vector)
    #Normalize vector: with unit-norm vectors, L2 distance gives the same ranking as cosine similarity
    faiss.normalize_L2(vector)
    #Add to index
    index.add(vector)


def extract_features_clip(image):
    with torch.no_grad():
        inputs = processor_clip(images=image, return_tensors="pt").to(device)
        image_features = model_clip.get_image_features(**inputs)
        return image_features


def extract_features_dino(image):
    with torch.no_grad():
        inputs = processor_dino(images=image, return_tensors="pt").to(device)
        outputs = model_dino(**inputs)
        image_features = outputs.last_hidden_state
        return image_features.mean(dim=1)


#Create 2 indexes.
index_clip = faiss.IndexFlatL2(512)
index_dino = faiss.IndexFlatL2(768)


#Iterate over the dataset, extract features with both models, and store them in the indexes
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    clip_features = extract_features_clip(img)
    add_vector_to_index(clip_features,index_clip)
    dino_features = extract_features_dino(img)
    add_vector_to_index(dino_features,index_dino)


#store the indexes locally
faiss.write_index(index_clip,"clip.index")
faiss.write_index(index_dino,"dino.index")
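One caveat: the FAISS indexes store only the vectors, in insertion order, not the filenames. To map search results back to actual images in Part 2, it helps to also save the list of file paths in the same order. A minimal sketch, using the images list built above (the image_paths.json filename is just an illustrative choice):

import json

#Save the file paths in the same order the vectors were added,
#so FAISS result indices can be mapped back to images later
with open("image_paths.json", "w") as f:
    json.dump(images, f)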

Part 2: Image Similarity Search

import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel, AutoProcessor, CLIPModel
from PIL import Image
import os


#Input image
source='laptop.jpg'
image = Image.open(source)


device = torch.device('cuda' if torch.cuda.is_available() else "cpu")


#Load the CLIP and DINOv2 models and processors
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)


processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)


#Extract features for CLIP
with torch.no_grad():
    inputs_clip = processor_clip(images=image, return_tensors="pt").to(device)
    image_features_clip = model_clip.get_image_features(**inputs_clip)


#Extract features for DINOv2
with torch.no_grad():
    inputs_dino = processor_dino(images=image, return_tensors="pt").to(device)
    outputs_dino = model_dino(**inputs_dino)
    image_features_dino = outputs_dino.last_hidden_state
    image_features_dino = image_features_dino.mean(dim=1)


def normalizeL2(embeddings):
    vector = embeddings.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    return vector


image_features_dino = normalizeL2(image_features_dino)
image_features_clip = normalizeL2(image_features_clip)


#Load the stored indexes
index_clip = faiss.read_index("clip.index")
index_dino = faiss.read_index("dino.index")


#Search the top 5 images: returns the distances and indices of the matches
d_dino,i_dino = index_dino.search(image_features_dino,5)
d_clip,i_clip = index_clip.search(image_features_clip,5)
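index.search returns, for each query vector, the distances and the positions of the nearest vectors in the index. Assuming the file-path list saved at the end of Part 1 (the illustrative image_paths.json), the matches can be mapped back to filenames like this:

import json

#Load the file paths saved in Part 1 (same order as the indexed vectors)
with open("image_paths.json") as f:
    image_paths = json.load(f)

#Map the returned indices back to image filenames
print("CLIP results:", [image_paths[i] for i in i_clip[0]])
print("DINOv2 results:", [image_paths[i] for i in i_dino[0]])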

Results

Using four different images as input, the search produced the following results:

Top similar images retrieved by CLIP and DINOv2 for the four query images

In this small subset, it seems that DINOv2 exhibits slightly superior performance.

Benchmarks on DISC21 dataset

To compare their performance, we will follow the same approach described in this article: https://medium.com/aimonks/image-similarity-with-dinov2-and-faiss-741744bc5804, reusing the scripts above to extract features and compute image similarity.

Dataset

To compare CLIP and DINOv2, we chose the DISC21 dataset, which was specially created for image similarity search. Due to its huge size of 350GB, we will use a subset of 150,000 images.
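One straightforward way to draw such a subset is to randomly sample file paths before feature extraction. The sketch below is only illustrative: the directory name is a placeholder, and the original benchmark may have selected its subset differently.

import os
import random

#Collect all image paths from the DISC21 reference images (placeholder directory)
all_images = [os.path.join(root, f)
              for root, _, files in os.walk('./disc21_references/')
              for f in files if f.endswith('.jpg')]

#Draw a reproducible 150,000-image subset
random.seed(0)
subset = random.sample(all_images, 150_000)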

Metrics used

In terms of metrics, we will calculate the following; a minimal sketch of how these can be computed from the FAISS search results appears after the list:

  • Accuracy: The ratio of the number of correctly predicted images to the total number of images.

  • Top three accuracy: The ratio of the number of times the correct image is found in the top three similar images to the total number of images.

  • Computation time: The time required to process the entire data set.
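As a minimal sketch of how the two accuracy metrics can be computed from FAISS output, assume I is the index matrix returned by index.search(query_features, 3) and ground_truth[i] holds the position of the correct reference image for query i (both names are illustrative, not taken from the original benchmark code):

import numpy as np

def accuracy_metrics(I, ground_truth):
    #I: (num_queries, k) matrix of retrieved indices from index.search
    #ground_truth: the correct reference index for each query
    ground_truth = np.asarray(ground_truth)
    top1 = float(np.mean(I[:, 0] == ground_truth))
    top3 = float(np.mean([gt in row[:3] for row, gt in zip(I, ground_truth)]))
    return top1, top3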

Benchmark results

  • Feature extraction

        CLIP: 70.7 images per second

        DINOv2: 69.7 images per second

  • Accuracy and top three accuracy


  • Analysis of results

1. Both models correctly predicted the image


2. Both models failed to find the correct image


3. Only CLIP predicted the image correctly; DINOv2 ranked it among its top three


4. Only DINOv2 predicted the image correctly


Analysis

DINOv2 takes a clear lead, achieving an impressive 64% accuracy on this apparently challenging dataset, while CLIP manages a more modest 28.45%.

In terms of computational efficiency, both models exhibit very similar feature extraction times. This balance doesn't put either model at a clear advantage in this regard.

Limitations

While this benchmark provides valuable insights, its limitations must be recognized. The evaluation was performed on 1,448 query images searched against a pool of 150,000 reference images. Given that the full dataset contains 2.1 million images, this narrower scope was necessary to conserve resources.

It is worth noting that Meta AI uses the DISC21 dataset as a benchmark for its model, which may give DINOv2 a favorable advantage. However, our tests on the COCO dataset revealed interesting behavior: DINOv2 showed an enhanced ability to identify the main elements of an image, while CLIP tended to focus on specific details within the input image (as in the bus images shown above).

Finally, the difference in embedding dimensions must be considered: CLIP uses a 512-dimensional embedding, while DINOv2 uses 768. One option is to use a larger CLIP model with a matching embedding dimensionality, but this comes at the cost of speed. A quick test on a small subset showed a slight performance improvement, though not enough to reach the level demonstrated by DINOv2.
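For reference, a CLIP variant with a 768-dimensional image embedding, matching DINOv2-base, can be loaded the same way; a minimal sketch using the ViT-L/14 checkpoint (expect noticeably slower feature extraction):

from transformers import AutoProcessor, CLIPModel

#CLIP ViT-L/14 produces 768-dimensional image embeddings, matching dinov2-base,
#but runs noticeably slower than ViT-B/32
processor_clip_l = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_clip_l = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")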
