Big news! Meta open-sources the DINOv2 visual model! No fine-tuning needed, and the results are impressive!

Reprinted from: Heart of the Machine

DINOv2 can be used for a variety of vision tasks without fine-tuning.

After open-sourcing SAM, its "segment anything" model, Meta has pushed further down the road of visual foundation models.

This time, they have open-sourced a set of models called DINOv2. These models produce high-performance visual representations that can be used, without fine-tuning, for downstream tasks such as classification, segmentation, image retrieval, and depth estimation.

This set of models has the following characteristics:

  • Uses self-supervised training, with no need for large amounts of labeled data;

  • Can serve as the backbone for almost any CV task without fine-tuning, such as image classification, segmentation, image retrieval, and depth estimation;

  • Learns features directly from images, without relying on textual descriptions, which helps the model better capture local information;

  • Can learn from any collection of images;

  • Pretrained versions of DINOv2 are already available and are comparable to CLIP and OpenCLIP on a range of tasks (a loading sketch follows the links below).

  • Paper link: https://arxiv.org/pdf/2304.07193

  • Code: https://github.com/facebookresearch/dinov2

  • Project link: https://dinov2.metademolab.com/
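
The pretrained checkpoints are distributed through the GitHub repository above. As a minimal sketch (not an official usage guide), loading a backbone for frozen feature extraction via torch.hub might look like this; the entry point name dinov2_vits14 is the small ViT-S/14 variant listed in the repository's README at the time of writing, so check there for the current list:

```python
import torch

# Pull a pretrained DINOv2 backbone through torch.hub; "dinov2_vits14" is the
# ViT-S/14 variant (other entry points cover the B/L/g variants).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# DINOv2 ViTs use a patch size of 14, so the input side length must be a
# multiple of 14 (224 = 16 * 14). A random tensor stands in for a real,
# normalized image batch here.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(image)  # global image representation, shape (1, embed_dim)

print(features.shape)
```

For dense tasks such as segmentation and depth estimation, it is the patch-level tokens rather than this global vector that are used.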

Paper Overview

Learning task-neutral pretrained representations has become a standard in natural language processing. You can use these features "as is" (without fine-tuning), and they perform significantly better than task-specific models on downstream tasks. This success is due to pretraining on large amounts of raw text using auxiliary objectives, such as language modeling or word embeddings, which do not require supervision.

With this paradigm shift in NLP, similar "foundation" models are expected to emerge in computer vision. These models should generate visual features that work "out of the box" on any task, whether at the image level (e.g. image classification) or the pixel level (e.g. segmentation).

Much of the hope for such foundation models has rested on text-guided pre-training, that is, using some form of textual supervision to guide the training of features. However, this form of text-guided pre-training limits the information about images that can be preserved: captions only approximate the rich information in images, and finer, pixel-level information may never be uncovered under this supervision. Furthermore, these image encoders require aligned text-image corpora, so they do not offer the flexibility of their text counterparts, i.e. they cannot learn from raw data alone.

An alternative to text-guided pre-training is self-supervised learning, where features are learned from images alone. These methods are conceptually closer to pretext tasks such as language modeling, and they can capture information at both the image and pixel level. However, despite their potential to learn general-purpose features, most of the improvements in self-supervised learning have come from pre-training on the small curated dataset ImageNet-1k. Some researchers have tried to extend these methods beyond ImageNet-1k, but they focused on unfiltered datasets, which typically caused a significant drop in feature quality. This is due to a lack of control over data quality and diversity, both of which are critical to producing good results.

In this work, the researchers explore whether self-supervised learning can produce general-purpose visual features when pre-trained on a large amount of curated data. They revisit existing discriminative self-supervised methods that learn features at both the image and patch level, such as iBOT, and reconsider some of their design choices at larger scale. Most of the technical contributions are aimed at stabilizing and accelerating discriminative self-supervised learning as model and data sizes grow. These improvements make the method around 2x faster and reduce memory use to roughly a third of that of similar discriminative self-supervised methods, allowing longer training runs and larger batch sizes.

Regarding the pre-training data, they built an automated pipeline for filtering and rebalancing a dataset drawn from a large unfiltered collection of images. This was inspired by pipelines used in NLP, where data similarity is used instead of external metadata and no manual annotation is required. A major difficulty when working with images is rebalancing concepts and avoiding overfitting to a few dominant modes. In this work, a naive clustering approach works well for this problem, and the researchers collected a small but diverse corpus of 142M images to validate their approach.
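
As an illustration of the kind of cluster-based rebalancing described above, here is a minimal sketch using k-means on hypothetical image embeddings. The embeddings, cluster count, and per-cluster cap are all illustrative stand-ins, not values from the paper's pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical embeddings for an uncurated image pool; in practice these would
# come from a self-supervised image encoder.
rng = np.random.default_rng(0)
uncurated_embeddings = rng.normal(size=(10_000, 256)).astype(np.float32)

# Cluster the pool so that sampling is driven by visual concepts rather than
# by raw image counts.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(uncurated_embeddings)

# Rebalance by drawing at most `per_cluster` images from each cluster, so a few
# dominant modes (e.g. near-duplicate web images) cannot swamp the dataset.
per_cluster = 50
selected = []
for c in range(kmeans.n_clusters):
    members = np.flatnonzero(cluster_ids == c)
    rng.shuffle(members)
    selected.extend(members[:per_cluster].tolist())

print(f"kept {len(selected)} of {len(uncurated_embeddings)} images")
```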

Finally, the researchers provide a family of pretrained vision models, called DINOv2, trained on their data with different Vision Transformer (ViT) architectures. They release all the models and the code needed to retrain DINOv2 on any data. They validate the quality of DINOv2 at scale on a variety of computer vision benchmarks at both the image and pixel level, as shown in Figure 2, and conclude that self-supervised pre-training alone is a good candidate for learning transferable frozen features, competitive with the best publicly available weakly-supervised models.

Data processing

The researchers assembled their curated LVD-142M dataset by retrieving, from a large pool of unfiltered data, images that are close to those in several curated datasets. The paper describes the main components of the data pipeline, including the curated/unfiltered data sources, the image deduplication step, and the retrieval system. The entire pipeline requires no metadata or text and works directly on images, as shown in Figure 3. The reader is referred to Appendix A of the paper for more details on the methodology.

Figure 3: Overview of the data processing pipeline. Images from curated and uncurated data sources are first mapped to embeddings. Uncurated images are then deduplicated before being matched with curated images. The resulting combination augments the initial dataset through a self-supervised retrieval system.
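
A toy version of the two embedding-based steps in Figure 3, deduplication and retrieval, could look like the sketch below. It uses plain NumPy with random embeddings and made-up thresholds; the real pipeline relies on large-scale similarity search rather than brute-force similarity matrices:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Random embeddings stand in for the outputs of a pretrained image encoder.
rng = np.random.default_rng(0)
curated = l2_normalize(rng.normal(size=(1_000, 256)))
uncurated = l2_normalize(rng.normal(size=(2_000, 256)))

# 1) Deduplicate the uncurated pool: drop images whose nearest neighbour within
#    the pool is almost identical (cosine similarity above a threshold).
self_sim = uncurated @ uncurated.T
np.fill_diagonal(self_sim, -1.0)
deduped = uncurated[self_sim.max(axis=1) < 0.95]

# 2) Retrieve: keep deduplicated images that are close to at least one curated
#    image, so the final corpus stays visually related to the curated seed sets.
retrieval_sim = deduped @ curated.T
retrieved = deduped[retrieval_sim.max(axis=1) > 0.5]

print(f"{len(deduped)} images after deduplication, {len(retrieved)} retrieved")
```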

Discriminative self-supervised pre-training

The researchers learn their features with a discriminative self-supervised approach that can be viewed as a combination of the DINO and iBOT losses with the centering used in SwAV. They also add a regularizer to spread features and a short high-resolution training phase.
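
The post does not spell these objectives out, so the following is a heavily simplified sketch of how an image-level (DINO-style) loss, a patch-level (iBOT-style) loss on masked tokens, and a feature-spreading (KoLeo-style) regularizer can be combined in PyTorch. The function names, temperatures, and simple pre-computed center are illustrative; the actual DINOv2 training code includes further components not shown here:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_cls, teacher_cls, center, tau_s=0.1, tau_t=0.04):
    # Image-level objective: cross-entropy between the teacher's and student's
    # [CLS]-token distributions; the teacher output is centered before softmax.
    t = F.softmax((teacher_cls - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_cls / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def ibot_loss(student_patches, teacher_patches, mask, center, tau_s=0.1, tau_t=0.04):
    # Patch-level objective: the same cross-entropy, computed only on the
    # patch tokens that were masked out in the student's input.
    t = F.softmax((teacher_patches - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_patches / tau_s, dim=-1)
    per_patch = -(t * s).sum(dim=-1)                  # (batch, num_patches)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def koleo_regularizer(features, eps=1e-8):
    # Spreading regularizer: pushes features within a batch apart by penalizing
    # small nearest-neighbour distances (KoLeo-style).
    f = F.normalize(features, dim=-1)
    sim = f @ f.t()
    sim.fill_diagonal_(-2.0)                          # exclude self-matches
    nn_dist = (2 - 2 * sim.max(dim=1).values).clamp(min=eps).sqrt()
    return -torch.log(nn_dist + eps).mean()

# Toy shapes: 8 images, 256 patches, 4096 prototype dimensions, 384-d features.
b, p, k, d = 8, 256, 4096, 384
center = torch.zeros(k)
mask = (torch.rand(b, p) < 0.3).float()
total = (dino_loss(torch.randn(b, k), torch.randn(b, k), center)
         + ibot_loss(torch.randn(b, p, k), torch.randn(b, p, k), mask, center)
         + 0.1 * koleo_regularizer(torch.randn(b, d)))
print(total.item())
```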

Efficient implementation

They considered several improvements to train models at larger scale. The models were trained on A100 GPUs using PyTorch 2.0, and the released code can also be used with the pretrained models for feature extraction. Details of the models are given in Table 17 of the Appendix. On the same hardware, the DINOv2 code uses only a third of the memory of the iBOT implementation and runs up to twice as fast.
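
The post does not detail where these savings come from. Purely as an illustration of the kind of fused, memory-efficient attention available in PyTorch 2.0 (which the post mentions), and not DINOv2's own implementation, a single attention call can be written as:

```python
import torch
import torch.nn.functional as F

# Toy shapes roughly matching ViT-S/14 at 224x224: 16x16 patches plus a [CLS]
# token gives 257 tokens, with 6 heads of dimension 64.
batch, heads, tokens, head_dim = 8, 6, 257, 64
q = torch.randn(batch, heads, tokens, head_dim)
k = torch.randn(batch, heads, tokens, head_dim)
v = torch.randn(batch, heads, tokens, head_dim)

# The fused kernel avoids materializing the full (tokens x tokens) attention
# matrix, which is where most of the memory savings come from.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # (batch, heads, tokens, head_dim)
```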

Experimental results

In this section, the researchers present empirical evaluations of the new models on a number of image understanding tasks. They evaluate both global and local image representations, covering category- and instance-level recognition, semantic segmentation, monocular depth prediction, and action recognition.
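
As a sketch of how frozen features are commonly evaluated at the image level, the snippet below fits a linear probe on hypothetical pre-extracted features. It is a simplified stand-in, not the paper's exact evaluation protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random arrays stand in for features extracted once from a frozen backbone
# (e.g. 384-dimensional ViT-S/14 outputs) and for the dataset's labels.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(2_000, 384)).astype(np.float32)
train_labels = rng.integers(0, 10, size=2_000)
val_feats = rng.normal(size=(500, 384)).astype(np.float32)
val_labels = rng.integers(0, 10, size=500)

# The backbone stays frozen; only this linear classifier is trained.
probe = LogisticRegression(max_iter=1_000)
probe.fit(train_feats, train_labels)
print("linear-probe accuracy:", probe.score(val_feats, val_labels))
```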

ImageNet classification

Other image and video classification benchmarks

Instance recognition

Dense recognition tasks

Qualitative results

Origin blog.csdn.net/amusi1994/article/details/130397745