CLIP: Train a unified vector embedding of images and text

Learning Transferable Visual Models From Natural Language Supervision

Introduction

Over the past few years, pretraining methods that learn directly from raw text have revolutionized NLP. Task-agnostic objectives, such as autoregressive and masked language modeling, have scaled across many orders of magnitude of compute, model capacity, and data, steadily improving performance. The development of text-to-text as a standardized input-output interface lets task-agnostic architectures transfer zero-shot to downstream datasets, removing the need for specialized output heads or dataset-specific customization.

Aggregate supervision from web-scale text collections is accessible to modern pre-training methods and surpasses high-quality crowd-labeled NLP datasets. In other fields such as computer vision, however, it is still standard practice to pretrain models on crowd-labeled datasets such as ImageNet. Could a scalable pretraining approach that learns directly from web text achieve a similar breakthrough in computer vision? Prior work is encouraging:

More than 20 years ago, Mori et al. explored improving content-based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. demonstrated that more data-efficient image representations can be learned via manifold learning in the weight space of classifiers trained to predict the words in captions associated with images. Srivastava and Salakhutdinov explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. …

While exciting as proofs of concept, image representation learning supervised by natural language is still rare. A likely reason is that demonstrated performance on common benchmarks is much lower than that of alternative approaches: Li et al. reached only 11.5% accuracy on ImageNet in a zero-shot setting, far below the 88.4% accuracy of the state of the art and even below the 50% accuracy of classical computer vision methods.

Using a large amount of publicly available data on the Internet, the paper builds a new dataset of 400 million (image, text) pairs and shows that a simplified version of ConVIRT trained from scratch, called CLIP (Contrastive Language-Image Pre-training), is an efficient method for learning from natural language supervision. The scalability of CLIP is studied by training a series of 8 models spanning nearly 2 orders of magnitude of compute, and transfer performance is found to be a smoothly predictable function of compute. The paper benchmarks the zero-shot transfer performance of CLIP on more than 30 existing datasets and finds that it is competitive with prior task-specific supervised models.

An overview of this method is shown in Figure 1:

[Figure 1: Overview of the CLIP approach]

Method

Natural language supervision

The core idea of the method is to learn perception from supervision contained in natural language, which has several potential advantages over other training approaches. Compared with standard crowd-sourced labels for image classification, natural language supervision is far easier to scale because it does not require annotations in a classic "machine-learning-compatible format" such as the canonical 1-of-N majority-vote "gold label"; methods that work on natural language can instead learn passively from the vast amount of text on the web. Learning from natural language also has an important advantage over most unsupervised or self-supervised methods: it does not "just" learn a representation, it also connects that representation to language, which enables flexible zero-shot transfer.

Create a large enough dataset

A major motivation for natural language supervision is the large quantity of data of this form available publicly on the Internet. Because existing datasets do not adequately reflect this possibility, considering their results alone would underestimate the potential of this research direction. To address this, we construct a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet, called WIT (WebImageText).

Choose an efficient pre-training method

State-of-the-art computer vision systems require very large amounts of compute, and the task of learning an open set of visual concepts from natural language may seem daunting. During experimentation, we found that training efficiency was the key to successfully scaling natural language supervision, and we chose the final pre-training method on that basis. The initial approach, similar to VirTex, jointly trained an image CNN and a text transformer from scratch to predict the caption of an image.

However, we ran into difficulties efficiently scaling this approach. As shown in Figure 2, a 63-million-parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times more slowly than a much simpler baseline that predicts a bag-of-words encoding of the same text.

[Figure 2: Training efficiency of the three pre-training approaches for zero-shot transfer to ImageNet]

The paper proposes training on a potentially easier proxy task: predicting which text as a whole is paired with which image, rather than predicting the exact words of that text. Starting from the same bag-of-words encoding baseline, swapping the predictive objective for a contrastive objective in Figure 2 yields a further 4x improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across the batch actually occurred. To do this, CLIP learns a multimodal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings.

Pseudocode for the core of a CLIP implementation is shown in Figure 3. This batch-construction technique and objective was first introduced in the field of deep metric learning.

[Figure 3: Pseudocode for the core of a CLIP implementation]
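Since Figure 3 is reproduced here only as an image, the following is a minimal PyTorch sketch of the symmetric contrastive objective described above. The function name, the fixed temperature value, and the random stand-in features are illustrative assumptions; in the actual model the temperature is a learned parameter and the features come from the image and text encoders, so this is not the repository's training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N matrix of scaled pairwise similarities for the batch
    logits = image_features @ text_features.t() / temperature

    # The i-th image is paired with the i-th text, so targets lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over images (rows) and texts (columns)
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage: random features stand in for encoder outputs for a batch of 8 pairs
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
print(clip_contrastive_loss(image_features, text_features))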

Choose and scale the model

Two different architectures are considered for the image encoder. The first uses ResNet-50 as the base architecture, given its wide adoption and proven performance. The second experiments with the recently introduced Vision Transformer (ViT).

The text encoder is a Transformer; the base configuration is a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.
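The released openai/CLIP package (installed in the Usage section below) exposes both encoder families. A brief sketch, noting that the exact list of available checkpoints depends on the installed version:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# List the released checkpoints, which include both ResNet and ViT image encoders
print(clip.available_models())  # e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', ...]

# Load the ResNet-50 variant ...
model_rn50, preprocess_rn50 = clip.load("RN50", device=device)
# ... or the ViT-B/32 variant; both pair the image encoder with a Transformer text encoder
model_vit, preprocess_vit = clip.load("ViT-B/32", device=device)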

Source code analysis

Source code link: https://github.com/openai/CLIP

Usage

First install PyTorch 1.7.1 and torchvision, along with a few small additional dependencies. On a machine with a CUDA GPU:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

On a machine without a GPU, replace cudatoolkit=11.0 with the appropriate CUDA version for your machine, or with cpuonly.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]
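The same API supports prompt-based zero-shot classification, the setting discussed in the introduction. Below is a sketch adapted from the example above; the prompt template "a photo of {label}" and the label list are illustrative choices, not something fixed by the library:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative label set; in practice these would be a dataset's class names
labels = ["a diagram", "a dog", "a cat"]
# Wrap each label in a simple prompt template before tokenizing
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize so the dot product is a cosine similarity, mirroring training
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Zero-shot probs:", similarity.cpu().numpy())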


Origin juejin.im/post/7077828588614975519