Four ideas for scientific research in the era of large models

Background

How to do meaningful research work with limited resources in an era of increasingly large models.

Four directions

1. Efficient (PEFT)

Improve training efficiency; here PEFT (parameter-efficient fine-tuning) is taken as the example.

2. Existing stuff (pretrained models) and new directions

Use other people's pre-trained models and explore new research directions.

3. Plug-and-play

Build plug-and-play modules, such as model components, objective functions, new loss functions, data augmentation methods, etc.

4. Dataset, evaluation and survey

Construct datasets, publish analysis-focused articles or review papers

1. Efficient (PEFT) - the first direction

The paper AIM is used as an example of how to perform PEFT, that is, how to efficiently fine-tune a large model when hardware resources are limited.

  • Paper address: https://arxiv.org/abs/2302.03024
  • Paper Title: AIM: Adapting Image Models for Efficient Video Action Recognition

Question: Does a well-trained image model still need to be fine-tuned?

1. CLIP has shown that even zero-shot inference (keeping the model unchanged and running inference directly on each dataset) works very well. In other words, a well-trained image model generalizes well, and the visual features it extracts are effective.

2. Continued fine-tuning can lead to catastrophic forgetting. Fine-tuning a large model on a small amount of data can easily overfit, or destroy many of the features the large model has already learned.

Conclusion: The pre-trained image model does not need further fine-tuning.

Comparison chart of the traditional model and the improved fine-tuning method of the paper:

Therefore, the paper's approach is to freeze the image model's parameters and add temporal modules, objective functions, and other components around it (i.e., PEFT), so that the image model can handle video understanding without training a video model from scratch, saving time and compute.

Two PEFT methods

1. Adapter

Originally from this paper:

  • Paper address: https://arxiv.org/abs/1902.00751
  • Paper Title: Parameter-Efficient Transfer Learning for NLP

The structure of the Adapter layer is shown on the right side of the figure below: a down-projection FC layer + a nonlinear activation + an up-projection FC layer, with a residual connection.
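A minimal PyTorch sketch of such a bottleneck adapter (the module name, bottleneck width, and activation choice are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection FC -> nonlinearity -> up-projection FC,
    with a residual connection. A minimal sketch; dimensions are illustrative."""
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)  # down-projection FC layer
        self.act = nn.GELU()                        # nonlinear activation
        self.up = nn.Linear(bottleneck_dim, dim)    # up-projection FC layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection
```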

The PEFT approach here, shown on the left side of the figure below, adds two adapters to each Transformer block. During fine-tuning, the parameters of the original Transformer are frozen and only the adapter layers are trained.

Compared with the full model, the adapter layers contain very few parameters. For example, applying LoRA to the 175B-parameter GPT-3 requires training only about one ten-thousandth of the parameters, so the training cost drops dramatically.
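A hedged sketch of the freezing step, assuming adapter parameters can be identified by the substring "adapter" in their names (the helper and the naming convention are assumptions):

```python
import torch.nn as nn

def freeze_all_but_adapters(model: nn.Module) -> None:
    """Freeze every original parameter; leave only adapter parameters trainable.
    Assumes adapter submodules carry "adapter" in their attribute names."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name  # train adapters only

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} / {total} ({100 * trainable / total:.4f}%)")
```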

2. Prompt tuning

  • Paper address: https://arxiv.org/abs/2109.01134
  • Paper Title: Learning to Prompt for Vision-Language Models

Prompt tuning starts from the observation that the prompt text can be changed freely, and such changes have a large impact on final performance: whether you get the desired result depends on whether you choose a good prompt. For example, as shown in the figure below, different prompts lead to very different accuracies.

How does the figure above classify images through prompts? Each category name (CLASS) is filled into the prompt and given to the model, which then checks which text has the highest similarity to the image.
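As a rough illustration, CLIP-style zero-shot classification with a hand-written prompt template might look like the following, assuming the OpenAI `clip` package and an illustrative image path and class list:

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["dog", "cat", "basketball"]             # illustrative class names
prompts = [f"a photo of a {c}." for c in classes]  # hard prompt template

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).softmax(dim=-1)

print(classes[similarity.argmax().item()])  # class whose text best matches the image
```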

Prompts come in two types:

Hard Prompt: the prompt text is written by hand and cannot be modified or learned. Writing good prompts requires prior knowledge, which we don't always have.

Soft Prompt: the prompt is represented as a learnable vector. As shown in the figure below, the context fed to the text encoder together with CLASS is made learnable, and the model optimizes this context. This not only saves a lot of computation, but also avoids hand-crafting prompts for downstream tasks.
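A minimal sketch of a CoOp-style soft prompt, where only the context vectors are learnable (shapes, names, and initialization are illustrative, not CoOp's actual code):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable context: n_ctx context vectors are prepended to each class-name
    embedding and optimized, while the text encoder itself stays frozen."""
    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, n_classes: int = 10):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)  # learnable context
        # Placeholder: in practice these come from the frozen token embeddings of the class names.
        self.register_buffer("class_embeds", torch.zeros(n_classes, 1, ctx_dim))

    def forward(self) -> torch.Tensor:
        n_classes = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)  # [C, n_ctx, D]
        return torch.cat([ctx, self.class_embeds], dim=1)      # [C, n_ctx + 1, D]
```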

VPT

Apply the learnable Prompt method to pure vision tasks, as shown in the figure below.

  • Paper address: https://arxiv.org/abs/2203.12119
  • Paper Title: Visual Prompt Tuning

The blue part in the figure is the original trained model, and the red part is the prompt that needs to be fine-tuned. There are two ways to add prompt tuning:

1. VPT-Deep: prompts are added to the input of every layer.

2. VPT-Shallow: prompts are added only at the input layer (a sketch follows this list).
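A minimal sketch of the VPT-Shallow idea, assuming a frozen Transformer encoder that accepts token embeddings (names and dimensions are illustrative; the real VPT code also handles the [CLS] token and deep per-layer prompts):

```python
import torch
import torch.nn as nn

class ShallowVPT(nn.Module):
    """Prepend a few learnable prompt tokens to the patch embeddings;
    the backbone itself stays frozen."""
    def __init__(self, frozen_encoder: nn.Module, n_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                       # keep the backbone frozen
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: [B, N, D] token embeddings from the (frozen) patch embedding
        B = patch_embeds.shape[0]
        tokens = torch.cat([self.prompts.expand(B, -1, -1), patch_embeds], dim=1)
        return self.encoder(tokens)
```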

A paper that summarizes recent PEFT methods from a unified point of view:

  • Paper address: https://arxiv.org/abs/2110.04366

AIM model design

As shown in the figure above, AIM adds the Adapter from Figure (a) into the ViT block of Figure (b), in the three ways shown in Figures (c), (d), and (e):

1. Spatial Adaptation: an Adapter is added only after the S-MSA layer; this does not add any video-understanding ability, only some learnable parameters.

2. Temporal Adaptation: the same MSA layer is reused (as T-MSA), and an Adapter is added after each of the two MSA layers, so the model learns along both the spatial and the temporal dimensions and gains temporal modeling ability.

3. Joint Adaptation: on top of Temporal Adaptation, another Adapter is added in parallel with the MLP, so the three Adapters each have their own role and the optimization problem becomes easier.

Note: MSA stands for multi-head self-attention. S-MSA and T-MSA share weights but operate on different dimensions (spatial vs. temporal). A rough sketch of the joint-adaptation block follows.
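The sketch below shows one AIM-style block with joint adaptation; `make_adapter` is assumed to build a bottleneck adapter like the one sketched earlier, and the reshaping between temporal and spatial attention as well as the layer norms are omitted, so this is illustrative rather than the paper's code:

```python
import torch.nn as nn

class AIMBlock(nn.Module):
    """Rough sketch of a ViT block with joint adaptation: the frozen attention is
    reused for T-MSA and S-MSA, each followed by an adapter, plus a parallel
    adapter next to the frozen MLP."""
    def __init__(self, dim: int, msa: nn.Module, mlp: nn.Module, make_adapter):
        super().__init__()
        self.msa = msa                           # frozen; shared by T-MSA and S-MSA
        self.mlp = mlp                           # frozen
        self.t_adapter = make_adapter(dim)       # temporal adaptation
        self.s_adapter = make_adapter(dim)       # spatial adaptation
        self.p_adapter = make_adapter(dim)       # joint adaptation, parallel to the MLP

    def forward(self, x):
        x = x + self.t_adapter(self.msa(x))      # attention applied along time (T-MSA)
        x = x + self.s_adapter(self.msa(x))      # attention applied along space (S-MSA)
        x = x + self.mlp(x) + self.p_adapter(x)  # parallel adapter next to the MLP
        return x
```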

The results are shown in the figure below: AIM, tuning only 14M parameters, already outperforms the previous model that tuned 121M parameters.

2. Existing stuff (pretrained model) - the second direction

There are two points:

1. Cleverly use other people's pre-trained models to run few-shot or zero-shot experiments, or at most light fine-tuning.

2. New research directions.

This paper describes how these two points are used:

  • Paper address: https://arxiv.org/abs/2207.05027
  • Paper Title: Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations

The title already reveals these two tricks:

1. "Self-supervised" here refers to using pre-trained networks such as DINO, DeepUSPS, and BASNet.

2. The direction here is object-centric learning, an emerging topic with relatively few competitors and small datasets.

The figure above shows how to use several pre-trained models to find new objects without supervision. The steps are as follows:

1. Use the pre-trained DeepUSPS model to find the masks of salient objects.

For example, the basketball in the picture yields a round mask.

2. Crop the corresponding object from the image according to the mask, and resize it to 224×224.

For example, the basketball is cropped out of the picture and enlarged.

3. Feed the crop obtained in step 2 through the pre-trained DINO model to obtain a 1024×1024 feature (a global representation).

4. Cluster all the features, so that classification IDs for these objects are obtained without supervision.

Note: clustering can only group the same kind of objects together; it does not know what each object actually is.

5. Use the images and their corresponding classification IDs to train a semantic segmentation network.

Note: this is effectively supervised learning, where the labels come from step 4.

6. A picture may contain multiple objects, so add a self-training stage and run several more rounds.

In this way, objects can be discovered in images without manual labels; a rough code sketch of the pipeline is given below.
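A hedged sketch of the pipeline, assuming frozen pre-trained `saliency_model` (e.g., DeepUSPS) and `dino_model` callables; `crop_with_mask`, the cluster count, and all other names are hypothetical placeholders rather than the paper's code:

```python
import torch
from sklearn.cluster import KMeans
from torchvision.transforms.functional import resize

def pseudo_label_pipeline(images, saliency_model, dino_model, crop_with_mask, n_clusters=27):
    """Sketch of the pseudo-labeling pipeline described above. `crop_with_mask`
    is a hypothetical helper that cuts the masked object out of an image."""
    feats, items = [], []
    for img in images:
        mask = saliency_model(img)                            # step 1: salient-object mask
        crop = resize(crop_with_mask(img, mask), [224, 224])  # step 2: crop and resize
        with torch.no_grad():
            feats.append(dino_model(crop.unsqueeze(0)))       # step 3: DINO global feature
        items.append((img, mask))

    ids = KMeans(n_clusters=n_clusters).fit_predict(torch.cat(feats).numpy())  # step 4: cluster

    # Step 5: (image, mask, cluster id) triplets serve as pseudo-labels for training a
    # semantic segmentation network; step 6 repeats the whole process with self-training.
    return [(img, mask, cid) for (img, mask), cid in zip(items, ids)]
```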

3. Plug-and-play - the third direction

Building general-purpose, plug-and-play modules is very convincing: within a defined scope, adding such a module brings consistent performance gains, accompanied by appropriate analysis. The MixGen paper is used as the example:

  • Paper address: https://arxiv.org/abs/2206.08358
  • Paper title: MixGen: A New Multi-Modal Data Augmentation

Text models are very large and image models are relatively small, but the self-attention parameters can be shared, so the original idea was to use the large text model to distill the small image model.

Note on model distillation: first train a complete, complex teacher model on the training set, then design a small student model. The teacher's weights are frozen, and the student is trained on the training set together with the teacher's outputs. A suitable set of losses is designed so that, during distillation, the student gradually approaches the teacher's behavior and its prediction accuracy gradually approaches the teacher's.
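A generic soft-target distillation loss, as commonly formulated (a sketch of the general technique, not code from this paper; the temperature and weighting are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy on the ground-truth labels plus KL divergence between the
    temperature-softened teacher and student distributions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft
```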

Why didn't previous image-text models use data augmentation?

1. A huge number of images was already used in training, so data augmentation seemed unnecessary.

2. Or augmentation was used, but color jittering and random flipping were removed, because these two transformations can make the image no longer match the text.

For example, if the picture shows a white dog and a green tree, applying color jittering to the picture alone changes the colors: the picture no longer shows a white dog, but the text still says "white dog", so the text and the picture no longer match.

The paper's method: since the goal is to retain as much information as possible, the approach is deliberately simple and crude: directly concatenate the two sentences, so new training samples are obtained without losing any text information.

For example, in the figure below, the two images are mixed to obtain a third image, and the texts of the two images are concatenated to obtain the text of the third sample.
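A minimal sketch of this augmentation, based on the description above (the function name and mixing coefficient are illustrative):

```python
import torch

def mixgen(image_a: torch.Tensor, text_a: str,
           image_b: torch.Tensor, text_b: str, lam: float = 0.5):
    """MixGen-style multi-modal augmentation: linearly interpolate the two images
    and simply concatenate the two captions, so no textual information is lost."""
    new_image = lam * image_a + (1.0 - lam) * image_b  # image mixup
    new_text = text_a + " " + text_b                   # text concatenation
    return new_image, new_text
```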

A constructive suggestion from the reviewers: apply this data augmentation when downstream tasks have little data.

4. Dataset, evaluation and survey - the fourth direction

Construct datasets, or publish analysis-oriented articles or survey papers. Two papers are given here as examples.

A dataset paper, BigDetection, which integrates three existing detection datasets:

  • Paper address: https://arxiv.org/abs/2203.13249

A survey paper on video action detection:

  • Paper address: https://arxiv.org/abs/2012.06567
