[Computer Vision] DINOv2 (Facebook self-supervised visual learning) environment deployment and code demonstration (including source code)


My code demonstration has been uploaded to the Kaggle platform; the notebook address is:

https://www.kaggle.com/code/holmes0610/dinov2


1. Introduction

DINOv2: Learning Robust Visual Features Without Supervision

It is the first self-supervised training method for computer vision models whose results match or exceed the standard approaches used in the field.

Recent breakthroughs in natural language processing for model pretraining on large amounts of data have opened the way for similar foundational models in computer vision. These models can greatly simplify the use of images in any system by producing general purpose visual features (i.e., features that work across image distributions and tasks without fine-tuning).

This work shows that existing pre-training methods, especially self-supervised ones, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale pre-training in both data and model size. Most of the technical contributions aim at accelerating and stabilizing large-scale training. On the data side, we propose an automated pipeline to build a dedicated, diverse, and curated image dataset, instead of the uncurated data commonly used in the self-supervised literature.

On the model side, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available general-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at both the image and pixel levels.

Paper address:

https://arxiv.org/abs/2304.07193

Project address:

https://github.com/facebookresearch/dinov2/tree/main


Address of the demo:

https://dinov2.metademolab.com/demos

Result of depth estimation:

DINOv2 frozen features can be easily used for models that predict the per-pixel depth of a single image, either in-distribution or out-of-distribution.


Result of semantic segmentation:

DINOv2 frozen features can be easily used for models that predict per-pixel object classes in a single image.
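
To give a rough idea of how such frozen features are used, the sketch below trains a simple linear head on top of frozen DINOv2 patch tokens. This is only a minimal illustration under my own assumptions (dummy images, random per-patch labels, an arbitrary class count), not the exact heads behind the official demos:

# Minimal sketch: a linear per-patch prediction head on frozen DINOv2 features.
# The dummy batch, random labels, and class count are placeholders for illustration only.
import torch
import torch.nn as nn

backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')  # downloads weights via torch.hub
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # freeze the backbone; only the head is trained

num_classes = 21                     # arbitrary choice for this sketch
head = nn.Linear(384, num_classes)   # 384 = ViT-S/14 embedding dimension

images = torch.randn(2, 3, 224, 224)              # dummy batch; 224/14 = 16, so 16*16 = 256 patches
labels = torch.randint(0, num_classes, (2, 256))  # dummy per-patch labels

with torch.no_grad():
    patch_tokens = backbone.forward_features(images)['x_norm_patchtokens']  # (2, 256, 384)

logits = head(patch_tokens)  # (2, 256, num_classes)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
loss.backward()              # gradients only reach the linear head
print(loss.item())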


2. Environment deployment

First download the code locally:


Or you can use git directly to pull project files:

!git clone https://ghproxy.com/https://github.com/facebookresearch/dinov2.git

According to the source code requirements, CUDA 11.7 is needed. You can check whether your machine meets this requirement by running the following in cmd:

nvidia-smi

This shows the highest CUDA version supported by your driver.
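
If a PyTorch environment is already available, you can also run a quick check from Python (this is just a convenience, not part of the official instructions):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against (11.7 is expected here)
print(torch.cuda.is_available())  # True if a usable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU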

Open a cmd window in the project folder and create a new Python environment from conda.yaml:

conda env create -f conda.yaml 

Then activate the environment:

conda activate dinov2

If you encounter an error, open conda.yaml with Notepad and delete the dependencies that trigger the error.

Then rerun the command above; it prints done once the environment has been created.

Then install the remaining packages manually. I chose to install everything from requirements.txt directly:

!pip install -r /kaggle/working/dinov2/requirements.txt

Although pip reports an error at the end, the installed packages still work fine.

You also need to install scikit-learn:

!pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple


If there are no surprises, the environment is now deployed successfully!
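
As a quick sanity check that the deployment works, you can try loading the smallest model from the local clone. The repository path below is an assumption based on the clone step above; the pretrained weights are still downloaded on first use:

import torch

repo_dir = '/kaggle/working/dinov2'  # assumed location of the cloned repository
model = torch.hub.load(repo_dir, 'dinov2_vits14', source='local')
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))  # forward pass on a dummy image
print(out.shape)  # expected: torch.Size([1, 384]), the class-token embedding of ViT-S/14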

3. Example of use

The pretrained models provided by DINOv2 can be loaded as follows:

import torch

dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

Four models are provided: one ViT-g/14 trained from scratch, and three ViT-S/B/L models distilled from ViT-g.

Each model takes an image as input and returns a class token and patch tokens.

The embedding dimensions are 384 for ViT-S/14, 768 for ViT-B/14, 1024 for ViT-L/14, and 1536 for ViT-g/14.

These models follow the Vision Transformer architecture with a patch size of 14. For a 224x224 image, this gives 1 class token + 256 patch tokens.

The models can accept larger images as long as the height and width are multiples of the patch size (14). If this condition is not met, the input is cropped to the nearest multiple of the patch size.
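
To make the token bookkeeping concrete, here is a small sketch that prints the output shapes for a 224x224 input using the ViT-S/14 model. The random tensor only serves to show the shapes, and the dictionary keys are the ones returned by forward_features (x_norm_patchtokens is the same key used in the demos below):

import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

h = w = 224                  # must be multiples of the patch size, 14
x = torch.randn(1, 3, h, w)  # dummy image tensor

with torch.no_grad():
    feats = model.forward_features(x)

print(feats['x_norm_clstoken'].shape)     # torch.Size([1, 384]): 1 class token
print(feats['x_norm_patchtokens'].shape)  # torch.Size([1, 256, 384]): (224/14)**2 = 256 patch tokens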

The original test image we used is:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

image = mpimg.imread('/kaggle/input/demo-image/1 (4).png')

plt.imshow(image)
plt.axis('off')
plt.show()

# Print the image dimensions
print("Image size: {} x {} x {}".format(image.shape[0], image.shape[1], image.shape[2]))


We test several different models, starting with dinov2_vits14:

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA


patch_h = 60
patch_w = 40
feat_dim = 384  # vits14

transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# Load ViT-S/14 from the local clone ('' means the current directory; this assumes the working directory is the dinov2 repository)
dinov2_vits14 = torch.hub.load('', 'dinov2_vits14', source='local').cuda()

features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()  # batch of 4 slots; only the first is filled below

img_path = f'/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vits14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']  # per-patch features, shape (4, patch_h * patch_w, feat_dim)

features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()

pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)

# Use the first PCA component to split background from foreground patches
pca_features_bg = pca_features[:, 0] < 10
pca_features_fg = ~pca_features_bg

# PCA for only foreground patches
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    # scale with mean/variance; the author found this gives a nicer visualization than min-max scaling
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5

pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_bg] = 0
pca_features_rgb[pca_features_fg] = pca_features_rem

pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][..., ::-1])
plt.savefig('features1.png')
plt.show()
plt.close()

The output result is:

(Output image: features1.png)
Next, the dinov2_vitl14 model:

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.image as mpimg 
from PIL import Image
from sklearn.decomposition import PCA
import matplotlib
 
patch_h = 75
patch_w = 50
feat_dim = 1024 # vitl14
 
transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
 
# Load ViT-L/14 from the local clone
dinov2_vitl14 = torch.hub.load('', 'dinov2_vitl14', source='local').cuda()
 
features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()
 
img_path = f'/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vitl14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
    
features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()
# print(features)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
# Min-max normalize the first PCA component and threshold it to separate foreground from background
pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())

pca_features_fg = pca_features[:, 0] > 0.3
pca_features_bg = ~pca_features_fg
 
b = np.where(pca_features_bg)
# print("1",pca_features[:, 0])
# print(pca_features_fg)
# PCA for only foreground patches
pca.fit(features[pca_features_fg])
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5

pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb[b] = 0
# print("digtial",pca_features_rgb)
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][...,::-1])
plt.savefig('features3.png')
plt.show()
plt.close()

The output is:

(Output image: features3.png)

Finally, the dinov2_vitg14 model:

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.image as mpimg 
from PIL import Image
from sklearn.decomposition import PCA
import matplotlib
 
patch_h = 75
patch_w = 50
feat_dim = 1536 # vitg14
 
transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
 
# Load ViT-g/14 from the local clone
dinov2_vitg14 = torch.hub.load('', 'dinov2_vitg14', source='local').cuda()
 
features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()
 
img_path = f'/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vitg14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
    
features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()
# print(features)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
 
pca_features_fg = pca_features[:, 0] > 0.3
pca_features_bg = ~pca_features_fg
 
b = np.where(pca_features_bg)
# print("1",pca_features[:, 0])
# print(pca_features_fg)
# PCA for only foreground patches
pca.fit(features[pca_features_fg])
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5

pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb[b] = 0
# print("digtial",pca_features_rgb)
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][...,::-1])
plt.savefig('features2.png')
plt.show()
plt.close()

The final output is:

(Output image: features2.png)

Clearly, the largest model gives the best result!
