Labs can now train (De)CLIP! SenseTime's ICLR 2022 DeCLIP is officially open source!


As one of the major milestones of 2021, CLIP drew researchers' attention as soon as it was released. However, its 400 million image-text pairs and the hundreds of GPUs needed for training are daunting for most researchers.

To address the data-efficiency problem of CLIP training, SenseTime proposed DeCLIP, which has been accepted to ICLR 2022. DeCLIP-ResNet50 achieves 60.4% zero-shot accuracy on ImageNet while using 7.1× less data than CLIP, 0.8% higher than CLIP-ResNet50. In addition, a benchmark for image-text pre-training is proposed on top of DeCLIP, integrating current related work such as CLIP, SLIP, and FILIP. The data, code, models, and training scripts for DeCLIP and the benchmark are now open source. Welcome to use them!

DeCLIP (ICLR 2022): 

https://arxiv.org/abs/2110.05208

CLIP-Benchmark: 

https://arxiv.org/abs/2203.05796

Code (open source): https://github.com/Sense-GVT/DeCLIP


1. Motivation

Large-scale language-image contrastive pre-training (e.g., CLIP) has achieved strong results on zero-shot recognition and downstream tasks. However, models such as CLIP require 400M image-text pairs for pre-training. To improve training efficiency and let the model achieve good results with less training data, this paper proposes DeCLIP, an efficient multi-modal pre-training paradigm. Unlike CLIP, which uses only image-text matching as its supervision signal, DeCLIP uses a variety of supervision signals:

  • Self-supervised learning within a modality;

  • Multi-view supervised learning across modalities;

  • Nearest neighbor supervised learning.

2. Method

As shown in the figure below, this paper proposes DeCLIP, a multi-modal pre-training paradigm with higher data efficiency: it exploits more supervision signals to make fuller use of each image-text pair.

[Figure: overview of the DeCLIP framework and its supervision signals]

2.1 CLIP Review

First, let's review CLIP. CLIP performs contrastive learning directly on image-text pairs, using two encoders to encode the image and the text respectively. The image encoder is usually a CNN or a ViT, and the text encoder is usually a Transformer. The image and text embeddings are then projected into the same space, and contrastive learning is used to pull matched image-text embeddings closer together and push unmatched embeddings apart.
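To make this concrete, here is a minimal PyTorch-style sketch of such a symmetric InfoNCE objective (the function name, temperature value, and tensor shapes are illustrative, not the official implementation):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text embeddings."""
    # L2-normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Contrast in both directions (image->text and text->image) and average
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Every image is contrasted against all texts in the batch and vice versa, so larger batches provide more negatives.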

2.2 Self-Supervision within each modality (SS)

Self-supervised learning is carried out separately in each modality, including self-supervised learning of images and self-supervised learning of text.

[Figure: intra-modality self-supervision for images and text]

(a) Image Self-Supervised Learning

Image self-supervised learning follows the approach of SimSiam. The image is transformed by two data augmentations to obtain two views, which are encoded by a weight-shared image encoder; one view's embedding is then passed through a two-layer MLP predictor, the cosine similarity with the other view's embedding is computed, and the gradient is backpropagated through the predictor branch (with a stop-gradient on the other branch).
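A minimal sketch of this SimSiam-style objective, assuming p1/p2 are the two-layer MLP predictor outputs and z1/z2 the encoder outputs of the two views (names are illustrative):

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Negative cosine similarity with stop-gradient, symmetrized over both views."""
    def neg_cosine(p, z):
        # Stop-gradient on the target branch: no gradient flows through z
        z = z.detach()
        return -F.cosine_similarity(p, z, dim=-1).mean()

    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```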

(b) Text Self-Supervised Learning

Text self-supervised learning follows the masked language modeling recipe of BERT. First, 15% of the tokens in each sequence are randomly selected; each selected token is then (1) replaced with [MASK] with 80% probability, (2) replaced with a random token with 10% probability, or (3) left unchanged with 10% probability. Finally, the language-model output at the corresponding position is used to predict the original token, optimized with a cross-entropy loss.
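A minimal sketch of this standard masking recipe (special tokens such as padding are ignored for brevity; the function and argument names are illustrative):

```python
import torch

def bert_mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking: select 15% of tokens, then 80% [MASK] / 10% random / 10% keep."""
    labels = input_ids.clone()
    input_ids = input_ids.clone()

    # Select which positions the model must predict
    selected = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~selected] = -100  # ignore unselected positions in the cross-entropy loss

    # 80% of selected positions -> [MASK]
    masked = selected & (torch.rand(input_ids.shape, device=input_ids.device) < 0.8)
    input_ids[masked] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    randomized = selected & ~masked & (torch.rand(input_ids.shape, device=input_ids.device) < 0.5)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape, device=input_ids.device)[randomized]

    # The remaining 10% keep their original token; labels hold targets at all selected positions
    return input_ids, labels
```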

2.3. Cross-Modal Multi-View Supervision (MVS)

The original CLIP computes a single InfoNCE loss between the image and text embeddings, while DeCLIP uses the augmented images and texts to compute InfoNCE four times, i.e., three more than CLIP. Specifically, for an original image-text pair (x_I, x_T), DeCLIP augments the image to obtain x̃_I and augments the text to obtain x̃_T, then computes the InfoNCE loss over the four combinations (x_I, x_T), (x̃_I, x_T), (x_I, x̃_T), and (x̃_I, x̃_T), giving three more supervision signals than CLIP.
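Reusing the clip_contrastive_loss sketch from Section 2.1, the four-view supervision can be sketched as follows (a simplified illustration, not the official implementation):

```python
def mvs_loss(img_emb, img_emb_aug, txt_emb, txt_emb_aug, temperature=0.07):
    """Multi-View Supervision: InfoNCE over all 2x2 image/text view combinations."""
    views = [
        (img_emb,     txt_emb),      # original image  vs original text (what CLIP uses)
        (img_emb_aug, txt_emb),      # augmented image vs original text
        (img_emb,     txt_emb_aug),  # original image  vs augmented text
        (img_emb_aug, txt_emb_aug),  # augmented image vs augmented text
    ]
    return sum(clip_contrastive_loss(i, t, temperature) for i, t in views)
```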

2.4. Nearest-Neighbor Supervision (NNS)

[Figure: nearest-neighbor supervision with a FIFO embedding queue]

Because similar images often have similar language descriptions, image-text pairs with similar descriptions can also serve as positives for contrastive learning. The overall data distribution is approximated by maintaining a first-in-first-out (FIFO) queue of text embeddings; for each sample, the most similar text embedding in this queue is selected as an additional positive, and an InfoNCE loss between the image and this nearest-neighbor text serves as the nearest-neighbor loss.
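A simplified sketch of this queue-based nearest-neighbor supervision, again reusing the clip_contrastive_loss sketch from Section 2.1 (the class name, queue size, and update details are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

class NearestNeighborSupervision:
    """FIFO queue of past text embeddings; each image is contrasted against
    the nearest-neighbor text retrieved from the queue (illustrative sketch)."""

    def __init__(self, dim, queue_size=8192):
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=-1)

    def loss(self, image_emb, text_emb, temperature=0.07):
        text_emb = F.normalize(text_emb, dim=-1)

        # For each caption in the batch, retrieve its most similar embedding in the queue
        sims = text_emb @ self.queue.t()             # (batch, queue_size)
        nn_text = self.queue[sims.argmax(dim=-1)]    # nearest-neighbor text embeddings

        # Extra positives: contrast images against the retrieved nearest-neighbor texts
        nn_loss = clip_contrastive_loss(image_emb, nn_text, temperature)

        # FIFO update: enqueue the current batch, drop the oldest entries
        self.queue = torch.cat([text_emb.detach(), self.queue], dim=0)[: self.queue.size(0)]
        return nn_loss
```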

Finally, the three losses are weighted and summed to obtain the final loss.

[Equation: the final loss, a weighted sum of the SS, MVS, and NNS losses]
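As a rough sketch of its form (the exact weighting coefficients are hyperparameters from the paper and are not reproduced here), the overall objective can be written as:

```latex
\mathcal{L}_{\text{DeCLIP}}
  = \mathcal{L}_{\text{MVS}}
  + \alpha \left( \mathcal{L}_{\text{ISS}} + \mathcal{L}_{\text{TSS}} \right)
  + \beta \, \mathcal{L}_{\text{NNS}}
```

where L_ISS and L_TSS are the image and text self-supervision losses from Section 2.2, and α and β are loss weights.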

3. Experiments

3.1. Datasets

[Table: composition of the DeCLIP pre-training data]

The DeCLIP pre-training dataset contains 29M image-text pairs from existing open-source datasets and 59M crawled from the Internet, 88M pairs in total.

3.2. Accuracy of Zero-Shot and Finetune

[Tables: zero-shot and fine-tuning accuracy comparisons]

3.3. Effect of the three supervision signals and training-speed comparison

[Table: ablation of the three supervision signals and training-speed comparison]

4. CLIP-Benchmark

At present, the data and hyperparameters used across CLIP-related papers differ, which makes fair comparison difficult. To make things easier for the community, this paper further proposes CLIP-Benchmark on top of DeCLIP. It includes the high-quality YFCC15M-V2 dataset, reproduced code and result comparisons for existing related work (CLIP, SLIP, FILIP, DeCLIP), and DeFILIP, an ensemble training method that combines their techniques. The specific methods and results are shown in the figures below.

[Tables: CLIP-Benchmark methods and results]


Source: blog.csdn.net/amusi1994/article/details/123836299