CVPR 2023 | Google Proposes CLIPPO: Understanding Images and Language from Pixels Only


Reposted from: 机器之心 | Editor: 袁铭怿

CLIPPO is a unified model that performs image, text, and multimodal tasks with a single encoder and a contrastive loss, outperforming traditional NLP baselines and previous pixel-based masked language models.

In recent years, large-scale multimodal training based on Transformers has driven state-of-the-art improvements across domains, including vision, language, and audio. In computer vision and image-language understanding in particular, a single large pretrained model can outperform task-specific expert models.

However, large multimodal models typically use modality- or dataset-specific encoders and decoders, which leads to correspondingly involved training protocols. For example, such models often train different parts of the model in separate stages on their respective datasets, apply dataset-specific preprocessing, or transfer different parts in a task-specific manner. These modality- and task-specific components add engineering complexity and pose challenges when introducing new pretraining losses or downstream tasks.

Developing a single end-to-end model that can handle any modality, or combination of modalities, would therefore be an important step for multimodal learning. In this paper, researchers from Google Research (Brain Team, Zürich) focus mainly on images and text.


CLIPPO: Image-and-Language Understanding from Pixels Only
Paper: https://arxiv.org/abs/2212.08045

Code (open-sourced):

https://github.com/google-research/big_vision

Several key unifications have accelerated progress in multimodal learning. First, the Transformer architecture has been shown to work as a general-purpose backbone, performing well on text, vision, audio, and other domains. Second, many works have explored mapping different modalities into a single shared embedding space to simplify the input/output interface, or developing a single interface for multiple tasks. Third, alternative representations of modalities make it possible to reuse neural architectures or training procedures designed for one domain in another. For example, [54] and [26, 48] represent text and audio, respectively, by rendering them as images (spectrograms in the case of audio).

This paper explores the use of a purely pixel-based model for multimodal learning of text and images. The model is a single Vision Transformer that processes visual input, text, or both, all rendered as RGB images. The same model parameters are used for all modalities, including low-level feature processing; that is, there are no modality-specific initial convolutions, tokenization algorithms, or input embedding tables. The model is trained with a single task: contrastive learning, as popularized by CLIP and ALIGN. The model is therefore called CLIP-Pixels Only (CLIPPO).
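To make the "text as pixels" interface concrete, the following is a minimal sketch: a caption is rendered onto a blank RGB canvas and cut into ViT-style patches, exactly as a photograph would be. This is an illustration only; the canvas size, font, and patch size are assumptions, not the paper's rendering pipeline.

```python
# Minimal sketch (not the authors' pipeline): render a caption onto a blank RGB
# canvas and split it into ViT-style patches. Canvas size (224x224), the default
# PIL bitmap font, and 16x16 patches are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw

def render_text_as_image(text, size=224):
    canvas = Image.new("RGB", (size, size), color="white")
    ImageDraw.Draw(canvas).text((4, 4), text, fill="black")
    return np.asarray(canvas, dtype=np.float32) / 255.0   # (H, W, 3) in [0, 1]

def to_patches(image, patch=16):
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return patches   # one row per patch, ready for a shared linear patch embedding

text_image = render_text_as_image("A photo of a cat sitting on a mat.")
print(to_patches(text_image).shape)   # (196, 768): 14x14 patches of 16x16x3 pixels
```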

On the main tasks CLIP was designed for, namely image classification and text/image retrieval, CLIPPO performs similarly to CLIP (within 1-2%), despite having no modality-specific towers. Surprisingly, CLIPPO can perform complex language understanding tasks without any left-to-right language modeling, masked language modeling, or explicit word-level losses. In particular, on the GLUE benchmark, CLIPPO outperforms classic NLP baselines such as ELMo+BiLSTM+attention. CLIPPO also outperforms pixel-based masked language models and approaches BERT's score.

Interestingly, CLIPPO also achieves good performance on VQA when images and text are simply rendered together, despite never being pretrained on such data. An immediate advantage of pixel-based models over regular language models is that no vocabulary needs to be predetermined. As a result, multilingual retrieval performance improves compared to equivalent models that use classical tokenizers. Finally, the study also found that training CLIPPO reduces the previously observed modality gap in some cases.

Method overview

CLIP has emerged as a powerful, scalable paradigm for training versatile vision models on web-scale datasets. Specifically, this approach relies on image/alt-text pairs, which can be collected automatically from the web at large scale. As a result, the textual descriptions are often noisy and may consist of a single keyword, a set of keywords, or potentially lengthy descriptions. Using these data, two encoders are trained jointly: a text encoder that embeds the alt-text and an image encoder that embeds the corresponding image in a shared latent space. The two encoders are trained with a contrastive loss that encourages the embeddings of corresponding images and alt-texts to be similar, while being dissimilar from the embeddings of all other images and alt-texts.
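As a reference for what this contrastive loss looks like, here is a small numpy sketch of the symmetric CLIP-style objective described above. The embeddings and the temperature value are placeholders; in CLIPPO the same ViT would produce both sets of embeddings, with the text rendered as pixels.

```python
# Numpy sketch of a symmetric CLIP-style contrastive loss (illustrative only).
# Matching image/alt-text pairs sit on the diagonal of the similarity matrix and
# are pulled together; all off-diagonal pairs are pushed apart.
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature                 # (B, B) cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -(logits - logsumexp(logits, axis=1))[diag, diag].mean()  # image -> text
    loss_t2i = -(logits - logsumexp(logits, axis=0))[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))
```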

Once trained, such an encoder pair can be used in several ways: it can classify a fixed set of visual concepts described by text (zero-shot classification); the embeddings can be used to retrieve images given a textual description, and vice versa; or the visual encoder can be transferred to downstream tasks in a supervised manner, by fine-tuning on a labeled dataset or by training a head on frozen image-encoder representations. In principle, the text encoder could be used as a stand-alone text encoder, but to our knowledge this application has not been explored in depth, and some studies report poor language modeling performance of such text encoders owing to the low quality of alt-text.
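For instance, the zero-shot classification use case can be sketched as follows. `encode_image` and `encode_text` are hypothetical stand-ins for the trained encoders; in CLIPPO, `encode_text` would simply render the prompt as an image and reuse the same ViT.

```python
# Hypothetical usage sketch of zero-shot classification with a trained encoder
# pair; encode_image / encode_text stand in for the actual trained models.
import numpy as np

def zero_shot_classify(image, class_names, encode_image, encode_text):
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([encode_text(p) for p in prompts])
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)   # one embedding per class
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    scores = txt @ img                                        # cosine similarity per class
    return class_names[int(np.argmax(scores))]
```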

Previous work has shown that the image and text encoders can be implemented as a single shared transformer (also called a one-tower model, or 1T-CLIP), where images are embedded with a patch embedding and tokenized text with a separate word embedding. Apart from these modality-specific embeddings, all model parameters are shared between the two modalities. While this kind of sharing typically causes a small drop on image and image-language tasks, it also halves the number of model parameters.

CLIPPO takes this idea a step further: text input is rendered onto a blank image and is then processed entirely as an image, including the initial patch embedding (see Figure 1). In contrast to previous work, this yields a single Vision Transformer that can understand both images and text through a single visual interface, and provides a single representation for solving image, image-language, and pure language understanding tasks.

[Figure 1: CLIPPO renders text as an image and processes both modalities with the same Vision Transformer, trained with a contrastive loss.]

Beyond its multimodal versatility, CLIPPO alleviates a common difficulty in text processing: the design of an appropriate tokenizer and vocabulary. This is especially interesting in a massively multilingual setting, where the text encoder must handle dozens of languages.

CLIPPO trained on image/alt-text pairs turns out to perform comparably to 1T-CLIP on public image and image-language benchmarks, and to compete with strong baseline language models on the GLUE benchmark. However, learning language understanding from alt-text alone is fundamentally limited, since alt-texts are of low quality and are often not grammatical sentences. Language-based contrastive training can therefore be added to the image/alt-text contrastive pretraining. Specifically, the authors consider consecutive sentence pairs sampled from a text corpus, translated sentence pairs in different languages, back-translated sentence pairs, and sentence pairs with dropped words.
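To illustrate what such text/text pairs might look like, here is a rough sketch of constructing positive pairs from a plain-text corpus. The toy corpus, word-drop probability, and pairing strategy are illustrative assumptions rather than the paper's exact recipe; each pair would be rendered to two images and pushed together by the same contrastive loss.

```python
# Rough sketch (assumed recipe, not the paper's exact one) of building positive
# text/text pairs for language-based contrastive co-training.
import random

def consecutive_sentence_pairs(sentences):
    # Positive pairs: adjacent sentences from the same document.
    return list(zip(sentences[:-1], sentences[1:]))

def word_drop_pair(sentence, drop_prob=0.15, rng=random):
    # Positive pair: a sentence and a copy with some words dropped.
    words = sentence.split()
    kept = [w for w in words if rng.random() > drop_prob] or words
    return sentence, " ".join(kept)

corpus = [
    "CLIPPO renders text as an image.",
    "A single ViT then processes both modalities.",
    "The model is trained with a contrastive loss.",
]
pairs = consecutive_sentence_pairs(corpus) + [word_drop_pair(s) for s in corpus]
for a, b in pairs:
    print(f"{a!r}  <->  {b!r}")
```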

Experimental results

Vision and Vision-Language Understanding

Image classification and retrieval. Table 1 shows the performance of CLIPPO; compared to CLIP∗, CLIPPO and 1T-CLIP show an absolute drop of 2-3 percentage points.

[Table 1: Image classification and retrieval results for CLIPPO and baselines.]

VQA. Figure 2 reports the VQAv2 scores of the model and the baselines. CLIPPO outperforms CLIP∗, 1T-CLIP, and ViT-B/16, achieving a score of 66.3.

[Figure 2: VQAv2 scores for CLIPPO and baselines.]

Multilingual Vision-Language Understanding

Figure 3 shows that CLIPPO achieves retrieval performance comparable to these baselines. For mT5, using additional data improves performance; exploiting such additional parameters and data in the multilingual context would be an interesting future direction for CLIPPO.

[Figure 3: Multilingual image/text retrieval results for CLIPPO and baselines.]

Language Understanding

Table 2 shows the GLUE benchmark results for CLIPPO and the baselines. CLIPPO trained on WebLI is competitive with the BiLSTM+Attn+ELMo baseline, whose deep word embeddings are trained on a large language corpus. Furthermore, CLIPPO and 1T-CLIP outperform the language encoder trained with standard contrastive language-vision pretraining.

[Table 2: GLUE benchmark results for CLIPPO and baselines.]

For more details of the research, please refer to the original paper.
