CLIP & CLAP

CLIP

abstract

  • Conventional image classification models trained with supervised, fixed-label data lose generalization ability and usability when faced with new classification targets;

  • This paper proposes pre-training on massive web image–text pair data (400 million pairs). Similar to GPT-style pre-training in NLP, the resulting model can be transferred to many image tasks in a zero-shot manner, performing well on more than 30 image datasets (e.g., OCR, video action recognition, and fine-grained image classification). For example, on ImageNet classification it reaches accuracy comparable to a supervised ResNet-50 without using any of the ImageNet training data.

  • CLIP, Contrastive Language-Image Pre-training

intro

  • Inspired by large-scale pre-training in NLP, the paper asks whether massive web data can be used to pre-train a model for task-agnostic learning, making it suitable for a wide range of downstream tasks.
  • Previous work tried various ways of learning image representations from text descriptions, but the results were worse than classical supervised methods. The paper analyzes this prior work as a trade-off between limited labeled data and massive, unrestricted web text.

Approach

[Figure: CLIP approach overview]

  • The benefits of learning from natural language: (1) it can learn from massive Internet data; (2) instead of a traditional fixed set of N class labels, the model learns general representations tied to natural language, which makes it much easier to extend to zero-shot scenarios.

Creating a Sufficiently Large Dataset

  • The existing image dataset YFCC100M, after filtering for images with usable natural-language descriptions, shrinks to about 15 million images.
  • A new dataset is therefore collected from the Internet using about 500,000 queries, roughly class-balanced by keeping up to 20,000 (image, text) pairs per query, for a total of about 400 million pairs; this dataset is named WIT (WebImageText).

Selecting an Efficient Pre-Training Method

[Figure: training-efficiency comparison of the pre-training objectives]

  • Training efficiency is key to making natural-language supervision scale. Compared with predicting a specific word or generating an image's caption, switching the training objective to contrastive learning (the move from the orange curve to the green curve in the figure) greatly improves learning efficiency at the same transfer performance. The paper adopts this contrastive objective for image–text pre-training; the pseudocode of the procedure is in the figure below, followed by a runnable sketch.
    [Figure: numpy-style pseudocode for CLIP training]
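
The pseudocode itself is an image in the original post. As a stand-in, here is a minimal PyTorch-style sketch of the symmetric contrastive step it describes; `clip_loss` is a name made up here, and the projected per-modality embeddings are assumed to have been computed already:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of n aligned (image, text) pairs.

    image_emb, text_emb: [n, d] projected embeddings from the two encoders.
    temperature: softmax temperature (a learned parameter in the actual model).
    """
    # cosine similarities between every image and every text in the batch
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # [n, n]

    # the i-th image matches the i-th text
    labels = torch.arange(logits.shape[0], device=logits.device)

    # cross-entropy along both axes (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2
```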

  • The last layer of the original text/image encoder is removed, and a simple linear projection maps each encoder's representation into the shared multimodal space. The temperature parameter t in the softmax is learned during training rather than tuned by hand (see the sketch below).
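
A minimal sketch, assuming PyTorch, of what this projection-plus-temperature head could look like; the class name and the embedding width are placeholders invented here, while the 0.07 temperature initialization and the cap at 100 follow the paper:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPHead(nn.Module):
    """Maps encoder outputs into the shared multimodal space with plain linear
    projections, and holds the learned softmax temperature (kept in log space)."""

    def __init__(self, image_dim, text_dim, embed_dim=512, init_temp=0.07):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # learned temperature, optimized in log space, initialized to 0.07
        self.log_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_temp)))

    def forward(self, image_features, text_features):
        img = F.normalize(self.image_proj(image_features), dim=-1)
        txt = F.normalize(self.text_proj(text_features), dim=-1)
        # clamp keeps the logit scaling from exceeding 100, as in the paper
        scale = self.log_scale.exp().clamp(max=100.0)
        return scale * img @ txt.t()   # [n, n] similarity logits
```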

  • Two kinds of image encoder are used: (1) a modified ResNet-50 (the specific changes are described in the paper); (2) ViT, essentially following the original architecture.

  • Text encoder: Transformer-based, a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.

  • Experiments show that performance benefits from scaling up the image encoder, but is not very sensitive to the size of the text encoder.

experiment

  • Image encoders of several different sizes/configurations are trained; the paper mentions several engineering techniques needed for training models this large. The minibatch size is very large: 32,768.
  • The largest RN50x64 model (image encoder) took 18 days to train on 592 V100 GPUs; the largest ViT model took 12 days on 256 V100 GPUs

Zero-Shot Transfer
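
These notes leave this section empty. As context for the abstract's claim (ImageNet accuracy comparable to a supervised ResNet-50 with no ImageNet training data), here is a minimal sketch of how CLIP's zero-shot classification works: wrap each candidate label in a prompt such as "a photo of a {label}.", embed the prompts with the text encoder, and pick the label whose embedding is most similar to the image embedding. `encode_image` / `encode_text` are stand-ins for the trained encoders plus projections, and the fixed scale of 100 stands in for the learned temperature:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Score one image against arbitrary class names, with no extra training."""
    # wrap each label in the prompt template used by the paper
    prompts = [f"a photo of a {name}." for name in class_names]

    image_emb = F.normalize(encode_image(image), dim=-1)     # [1, d]
    text_emb = F.normalize(encode_text(prompts), dim=-1)     # [k, d]

    # cosine similarity to every class prompt; softmax gives class probabilities
    probs = (100.0 * image_emb @ text_emb.t()).softmax(dim=-1)  # [1, k]
    return class_names[probs.argmax(dim=-1).item()]
```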

CLAP: LEARNING AUDIO CONCEPTS FROM NATURAL LANGUAGE SUPERVISION

  • 2022.6
  • Microsoft
  • code

abstract

  • Contrastive Language-Audio Pretraining (CLAP): two separate encoders are used for text and audio, trained with a contrastive objective so that both map into the same multimodal embedding space.
  • 128k text–audio pairs are used for training; each audio clip is processed into 5-second segments (about 127 hours in total). The model is then evaluated in zero-shot and fine-tuned settings on 16 downstream tasks.

method

[Figure: CLAP architecture overview]

  • Inputs: batches of paired audio and text; each text is a 1×L token sequence.

  • After the audio encoder, the time dimension is pooled away, giving audio representations X_a ∈ R^{N×V}, where N is the batch size; the text encoder likewise outputs text representations X_t ∈ R^{N×U}.

  • After a learned linear transformation of each, these become E_a ∈ R^{N×d} and E_t ∈ R^{N×d}, embeddings in the joint multimodal space.

  • Compute the N×N similarity matrix between the two modalities, C = τ · (E_t · E_aᵀ), where τ is a temperature parameter, and train with a symmetric cross-entropy loss over C along the text and audio axes: L = ½ (ℓ_text(C) + ℓ_audio(C)). A sketch combining these steps follows below.
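
Putting the steps above together, a minimal PyTorch sketch of the joint-space projection, similarity matrix, and symmetric loss; the layer sizes, the L2 normalization, and the module/function names here are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAPHead(nn.Module):
    """Projects audio/text encoder outputs into a joint space and scores them."""

    def __init__(self, audio_dim, text_dim, joint_dim=1024):
        super().__init__()
        # the notes describe a learned linear transformation into the joint space
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.log_tau = nn.Parameter(torch.zeros(()))  # learnable temperature τ (log space)

    def forward(self, X_a, X_t):
        E_a = F.normalize(self.audio_proj(X_a), dim=-1)   # [N, d]
        E_t = F.normalize(self.text_proj(X_t), dim=-1)    # [N, d]
        C = self.log_tau.exp() * (E_t @ E_a.t())          # [N, N] similarity matrix
        return C

def clap_loss(C):
    """Symmetric cross-entropy over the similarity matrix (text axis + audio axis)."""
    labels = torch.arange(C.shape[0], device=C.device)
    return 0.5 * (F.cross_entropy(C, labels) + F.cross_entropy(C.t(), labels))
```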

experiment

[Figure: experimental results on the downstream tasks]
