Large model learning--CLIP

This article is a set of study notes on the CLIP algorithm: it moves from an introduction to CLIP, to its implementation principles, to how it is applied, and finally to several follow-up optimization methods in the CLIP family.

What is CLIP:

The full name of CLIP is Contrastive Language-Image Pre-training, a pre-training method based on contrastive learning over text-image pairs.

How does CLIP work?

CLIP mainly consists of two modules, a Text Encoder and an Image Encoder, which extract text and image features respectively; the model then learns the text-image matching relationship through contrastive learning. The original paper trains on large-scale data (400 million text-image pairs). With this massive dataset, the CLIP model can learn more general visual-semantic information, which improves performance on downstream tasks.
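
To make the training objective concrete, here is a minimal PyTorch-style sketch of the symmetric InfoNCE (contrastive) loss that CLIP is trained with. The encoders that produce the embeddings are omitted, and the fixed temperature of 0.07 is an illustrative assumption (CLIP actually learns this scale as a parameter).

```python
# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) objective.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (N, D) embeddings of N matched image-text pairs."""
    # L2-normalize so the dot product becomes cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix; the diagonal entries are the positive pairs
    logits_per_image = image_features @ text_features.t() / temperature
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_features.size(0), device=image_features.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```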

 

How to use CLIP

CLIP can be applied to numerous downstream tasks, for example image-text retrieval, text-video retrieval, visual question answering, and image-text generation.
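
As a concrete usage example, the sketch below scores one image against several candidate captions using the Hugging Face transformers implementation of CLIP; the checkpoint name and the local image path are illustrative choices, not part of the original post.

```python
# Hedged usage sketch: image-text matching with a pretrained CLIP checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image file
texts = ["a photo of a cat", "a photo of a dog", "a diagram of a network"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: (num_images, num_texts) image-text similarity scores
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # the matching caption should receive the highest probability
```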

CLIP series optimization methods:

This part reviews a series of follow-up works built on CLIP, covering the methods of several papers.

FLIP: Scaling Language-Image Pre-training via Masking

FLIP is a simple and efficient way to accelerate CLIP training: by masking out part of the image, it speeds up the CLIP training process by 2 to 3 times while achieving better performance.

The image fed into the image encoder (a transformer) must first be split into multiple patches, and each patch plays the role of a word in the sequence. FLIP applies a random mask to this sequence, masking out a proportion of the tokens at random, and feeds only the remaining tokens (image patches) into the encoder.

The benefit is that the sequence becomes shorter, which reduces computation and speeds up training; at the same time, it lowers memory usage and allows a larger batch size, which helps contrastive learning. This operation is also somewhat similar to dropout, acting as a form of regularization.
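
A rough sketch of this random patch masking (an assumed implementation, not the official FLIP code): given the patch embeddings, keep a random subset and feed only the visible tokens to the encoder. The mask ratio of 0.5 is just an example value.

```python
# Sketch of randomly dropping a fraction of image patch tokens before the encoder.
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.5):
    """patch_tokens: (B, N, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random permutation per sample; keep the first `num_keep` indices
    noise = torch.rand(B, N, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]            # (B, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, num_keep, D)
    visible = torch.gather(patch_tokens, dim=1, index=keep_idx)

    # The shorter sequence goes into the image encoder, cutting compute and memory
    return visible
```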

Finally, to reduce the distribution gap (training uses masking while inference does not), FLIP adds a small amount of unmasked training at the end, which further improves the model's performance.

FILIP: Fine-grained Interactive Language-Image Pre-Training

CLIP's twin-tower structure extracts the global features of images and texts separately for contrastive learning and lacks interaction at the local level. FILIP adds fine-grained interaction between image tokens and text tokens.
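
As an illustration of this token-level interaction, the sketch below computes a FILIP-style late-interaction similarity: each image token is matched to its most similar text token (and vice versa), and the maxima are averaged. Padding masks and the contrastive loss itself are omitted for brevity; this is an assumed simplification, not the official FILIP code.

```python
# Simplified sketch of FILIP-style token-wise (late) interaction similarity.
import torch
import torch.nn.functional as F

def filip_similarity(image_tokens, text_tokens):
    """image_tokens: (B, Ni, D), text_tokens: (B, Nt, D) -> two (B, B) similarity matrices."""
    image_tokens = F.normalize(image_tokens, dim=-1)
    text_tokens = F.normalize(text_tokens, dim=-1)

    # Token-level similarities for every image-text pair in the batch: (B, B, Ni, Nt)
    sim = torch.einsum("bnd,cmd->bcnm", image_tokens, text_tokens)

    # Image-to-text: max over text tokens, then mean over image tokens
    sim_i2t = sim.max(dim=-1).values.mean(dim=-1)   # (B, B)
    # Text-to-image: max over image tokens, then mean over text tokens
    sim_t2i = sim.max(dim=-2).values.mean(dim=-1)   # (B, B)
    return sim_i2t, sim_t2i
```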

DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Refer to the blog: DeCLIP, a data-efficient CLIP training method | Fight monsters and upgrade together

On top of CLIP, DeCLIP adopts various forms of supervision, including:

  • Single-modal self-supervised learning
    • Image Self-Supervision (ISS): the same image is augmented into two views $(x_I, \tilde{x}_I)$; both augmented views are passed through the same encoder to obtain two embedding vectors $(z_I, \tilde{z}_I)$, and one of the embeddings, $z_I$, is passed through a prediction (pred) layer to obtain the vector $p_I$. During training, $p_I$ and $\tilde{z}_I$ are pulled as close together as possible;
    • Text Self-Supervision (TSS): text self-supervision uses the MLM objective, which randomly masks out 15% of the tokens in the text and predicts the masked tokens from the surrounding context;

  • Cross-modal multi-view supervised learning (MVS)
    • CLIP only uses the original image-text pair $(z_I, z_T)$ to compute the InfoNCE loss, while DeCLIP also computes the InfoNCE loss with the augmented images and texts: $(z_I, z_T)$, $(\tilde{z}_I, z_T)$, $(z_I, \tilde{z}_T)$, $(\tilde{z}_I, \tilde{z}_T)$, giving three more supervision signals than CLIP (see the sketch after this list);
  • Nearest Neighbor Supervised Learning (NNS)
    • Considering that the same image may have several similar language descriptions, texts with similar descriptions are selected for contrastive learning. A first-in, first-out queue is maintained to approximate the distribution of the whole dataset, and the most similar sentence embedding $z_T'$ is selected from the queue as an extra positive sample; InfoNCE is then used to compute the nearest-neighbor loss over $(z_I, z_T')$ and $(\tilde{z}_I, z_T')$;
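
A rough illustration of the multi-view supervision (MVS) term mentioned above, reusing the clip_loss sketch from earlier in this note; this is an assumed simplification, not the official DeCLIP code.

```python
# Sketch of DeCLIP-style multi-view supervision: InfoNCE over all four
# pairings of the two image views and two text views.
def mvs_loss(z_i, z_i_tilde, z_t, z_t_tilde):
    """z_i, z_i_tilde: embeddings of two image views; z_t, z_t_tilde: two text views."""
    pairs = [(z_i, z_t), (z_i_tilde, z_t), (z_i, z_t_tilde), (z_i_tilde, z_t_tilde)]
    # clip_loss() is the symmetric InfoNCE sketch defined earlier in this note
    return sum(clip_loss(zi, zt) for zi, zt in pairs) / len(pairs)
```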

The training loss is the weighted sum of the above parts:
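
A plausible form of this weighted sum (the weights $\alpha$, $\beta$, $\gamma$ below are placeholders for illustration, not values taken from the paper):

```latex
\mathcal{L}_{\text{DeCLIP}} = \mathcal{L}_{\text{MVS}}
  + \alpha\,\mathcal{L}_{\text{ISS}}
  + \beta\,\mathcal{L}_{\text{TSS}}
  + \gamma\,\mathcal{L}_{\text{NNS}}
```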

Origin: blog.csdn.net/qq_30921029/article/details/130681401