My demo code has been uploaded to Kaggle; the notebook is available at:
https://www.kaggle.com/code/holmes0610/dinov2
1. Introduction
DINOv2: Learning Robust Visual Features Without Supervision
DINOv2 is the first method for training computer vision models that uses self-supervised learning to achieve results that match or exceed the standard approach used in the field.
Recent breakthroughs in natural language processing for model pretraining on large amounts of data have opened the way for similar foundational models in computer vision. These models can greatly simplify the use of images in any system by producing general purpose visual features (i.e., features that work across image distributions and tasks without fine-tuning).
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale pretraining in terms of both data and model size. Most of the technical contributions aim at accelerating and stabilizing training at scale. On the data side, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset, instead of the uncurated data typically used in the self-supervised literature.
On the model side, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available general-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at the image and pixel levels.
Paper address:
https://arxiv.org/abs/2304.07193
Project address:
https://github.com/facebookresearch/dinov2/tree/main
Address of the demo:
https://dinov2.metademolab.com/demos
Result of depth estimation:
DINOv2 frozen features can be easily used for models that predict the per-pixel depth of a single image, either in-distribution or out-of-distribution.
The result of semantic segmentation:
DINOv2 frozen features can be easily used for models that predict per-pixel object classes in a single image.
2. Environment deployment
First download the code locally:
Or you can use git directly to pull project files:
!git clone https://ghproxy.com/https://github.com/facebookresearch/dinov2.git
The source code requires CUDA 11.7, so first check whether your machine meets this requirement by running the following in cmd:
nvidia-smi
Query the supported CUDA Version:
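The same check can also be scripted. The sketch below is a minimal example, assuming `nvidia-smi` prints its usual `CUDA Version: X.Y` header field; the helper names `parse_cuda_version` and `driver_cuda_version` are made up for illustration. Note that this value is the highest CUDA version the installed driver supports, not the version of an installed toolkit.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output):
    """Extract the 'CUDA Version: X.Y' value from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_cuda_version():
    """Run nvidia-smi if it is on PATH and return the reported CUDA version."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools found on this machine
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return parse_cuda_version(result.stdout)


version = driver_cuda_version()
if version is None:
    print("nvidia-smi not found or no CUDA version reported")
else:
    # Tuples compare element by element, so (12, 0) >= (11, 7) is True.
    print("Driver supports CUDA up to {}.{}; 11.7 OK: {}".format(*version, version >= (11, 7)))
```

If the reported version is 11.7 or higher, the driver should be able to run the CUDA 11.7 builds that the project asks for.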
Open a cmd window in the project folder downloaded above and run the following commands.
Create a new Python environment from conda.yaml:
conda env create -f conda.yaml
Then activate the environment:
conda activate dinov2
If you encounter an error, open conda.yaml in Notepad and first delete the part marked in red:
Then run the cmd command above again; it displays done when it completes.
Then manually install the remaining packages.
I chose to install all directly:
!pip install -r /kaggle/working/dinov2/requirements.txt
Although the installation reports an error at the end, the environment still works:
You also need to install scikit-learn:
!pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple
If there are no surprises here, the environment will be deployed successfully!
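A quick way to confirm the deployment is to check that the packages imported by the examples in this article can be found. This is a minimal sketch using only the standard library; the package list is taken from the imports used below, not from the project's requirements.txt, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Top-level modules imported by the example scripts in this article.
REQUIRED = ["torch", "torchvision", "sklearn", "matplotlib", "PIL"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

`find_spec` only locates a module without importing it, so the check is fast but does not prove that the package actually works with your CUDA version.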
3. Example of use
The pretrained models provided by DINOv2 are:
import torch
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')
DINOv2 provides 4 models: 1 ViT-g trained from scratch, and 3 ViT-S/B/L models distilled from the ViT-g.
Each model takes an image as input and returns a class token and patch tokens.
The embedding dimension is 384 for ViT-S, 768 for ViT-B, 1024 for ViT-L, and 1536 for ViT-g.
These models follow the Transformer architecture with a patch size of 14. For a 224x224 image, this results in 1 class token + 256 patch tokens.
The models can accept larger images provided the image height and width are multiples of the patch size (14). If this condition is not met, the model crops the image to the closest smaller multiple of the patch size.
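The arithmetic behind these numbers can be sketched in a few lines of plain Python. This is only an illustration, assuming the crop rounds each side down to a multiple of the patch size; `cropped_size` and `token_counts` are hypothetical helper names, not part of the DINOv2 code.

```python
PATCH_SIZE = 14  # patch size stated above for all four models


def cropped_size(height, width, patch_size=PATCH_SIZE):
    """Round each side down to the nearest multiple of patch_size."""
    return (height // patch_size) * patch_size, (width // patch_size) * patch_size


def token_counts(height, width, patch_size=PATCH_SIZE):
    """Return (class_tokens, patch_tokens) for an image of the given size."""
    h, w = cropped_size(height, width, patch_size)
    return 1, (h // patch_size) * (w // patch_size)


print(token_counts(224, 224))  # (1, 256): 16 x 16 patches plus one class token
print(cropped_size(500, 375))  # (490, 364): 35 x 26 patches after rounding down
print(token_counts(840, 560))  # (1, 2400): a 60 x 40 patch grid
```

The last case corresponds to the 60 x 40 patch grid used in the first example script below, where the image is resized to 840 x 560 pixels.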
The original test image we used is:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread('/kaggle/input/demo-image/1 (4).png')
plt.imshow(image)
plt.axis('off')
plt.show()
# Print the image dimensions
print("Image size: {} x {} x {}".format(image.shape[0], image.shape[1], image.shape[2]))
We test several of the models, starting with dinov2_vits14:
import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA
patch_h = 60
patch_w = 40
feat_dim = 384 # vits14
transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# source='local' with an empty path loads hubconf.py from the current directory,
# so run this from inside the cloned dinov2 folder
model = torch.hub.load('', 'dinov2_vits14', source='local').cuda()
features = torch.zeros(4, patch_h * patch_w, feat_dim)
# Batch of 4 image slots; only slot 0 is filled, the other three stay all-zero
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()
img_path = '/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = model.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features_bg = pca_features[:, 0] < 10
pca_features_fg = ~pca_features_bg
# PCA for only foreground patches
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5
pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_bg] = 0
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][..., ::-1])
plt.savefig('features1.png')
plt.show()
plt.close()
The output result is:
Next, the dinov2_vitl14 model:
import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.image as mpimg
from PIL import Image
from sklearn.decomposition import PCA
import matplotlib
patch_h = 75
patch_w = 50
feat_dim = 1024 # vitl14
transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# source='local' with an empty path loads hubconf.py from the current directory,
# so run this from inside the cloned dinov2 folder
model = torch.hub.load('', 'dinov2_vitl14', source='local').cuda()
features = torch.zeros(4, patch_h * patch_w, feat_dim)
# Batch of 4 image slots; only slot 0 is filled, the other three stay all-zero
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()
img_path = '/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = model.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()
# print(features)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
pca_features_fg = pca_features[:, 0] > 0.3
pca_features_bg = ~pca_features_fg
b = np.where(pca_features_bg)
# print("1",pca_features[:, 0])
# print(pca_features_fg)
# PCA for only foreground patches
pca.fit(features[pca_features_fg])
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5
pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb[b] = 0
# print("digtial",pca_features_rgb)
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][...,::-1])
plt.savefig('features3.png')
plt.show()
plt.close()
The output is:
Finally, the dinov2_vitg14 model:
import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.image as mpimg
from PIL import Image
from sklearn.decomposition import PCA
import matplotlib
patch_h = 75
patch_w = 50
feat_dim = 1536 # vitg14
transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# source='local' with an empty path loads hubconf.py from the current directory,
# so run this from inside the cloned dinov2 folder
model = torch.hub.load('', 'dinov2_vitg14', source='local').cuda()
features = torch.zeros(4, patch_h * patch_w, feat_dim)
# Batch of 4 image slots; only slot 0 is filled, the other three stay all-zero
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14).cuda()
img_path = '/kaggle/input/demo-image/1 (4).png'
img = Image.open(img_path).convert('RGB')
imgs_tensor[0] = transform(img)[:3]
with torch.no_grad():
    features_dict = model.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
features = features.reshape(4 * patch_h * patch_w, feat_dim).cpu()
# print(features)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
pca_features_fg = pca_features[:, 0] > 0.3
pca_features_bg = ~pca_features_fg
b = np.where(pca_features_bg)
# print("1",pca_features[:, 0])
# print(pca_features_fg)
# PCA for only foreground patches
pca.fit(features[pca_features_fg])
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5
pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb[b] = 0
# print("digtial",pca_features_rgb)
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
plt.imshow(pca_features_rgb[0][...,::-1])
plt.savefig('features2.png')
plt.show()
plt.close()
The final output is:
Of the three models tested, the ViT-g result is clearly the best.
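The scripts above follow broadly the same pipeline: flatten the patch tokens, project them with PCA, threshold the first component to separate foreground from background, recolor the foreground patches with three PCA components, and reshape the result to the patch grid. The following is a minimal NumPy-only sketch of the min-max variant used in the ViT-L and ViT-g scripts, run on random data instead of real DINOv2 features. The helper names (`pca_project`, `min_max`, `pca_foreground_rgb`) are hypothetical, and PCA is computed with an SVD as a stand-in for scikit-learn's `PCA`, so the output will not match the scripts exactly.

```python
import numpy as np


def pca_project(x, n_components=3):
    """Project the rows of x onto their top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def min_max(x):
    """Scale each column of x to the [0, 1] range."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    return (x - lo) / np.maximum(hi - lo, 1e-12)  # guard against zero range


def pca_foreground_rgb(features, n_images, patch_h, patch_w, threshold=0.3):
    """Return an (n_images, patch_h, patch_w, 3) array of PCA colors."""
    first = min_max(pca_project(features, 1))[:, 0]
    fg = first > threshold                  # foreground mask
    rgb = np.zeros((features.shape[0], 3))  # background patches stay black
    rgb[fg] = min_max(pca_project(features[fg], 3))  # PCA refit on foreground only
    return rgb.reshape(n_images, patch_h, patch_w, 3)


# Random stand-in for patch tokens: 1 image, 6 x 4 patches, 16-dim features.
rng = np.random.default_rng(0)
fake_tokens = rng.normal(size=(1 * 6 * 4, 16))
rgb_grid = pca_foreground_rgb(fake_tokens, n_images=1, patch_h=6, patch_w=4)
print(rgb_grid.shape)
```

The sign of a principal component is arbitrary, so which side of the threshold ends up as "foreground" can flip between models or images; the threshold of 0.3 is the value used in the scripts and may need adjusting for other images.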