[Paper Reading 72] Parameter-Efficient Transfer Learning for NLP

1. Basic information

Paper: Parameter-Efficient Transfer Learning for NLP
Authors and affiliation: Neil Houlsby et al., Google Research; Jagiellonian University, Poland
Source: PMLR (ICML)
Year: 2019

Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]//International Conference on Machine Learning. PMLR, 2019: 2790-2799.

Paper link: http://proceedings.mlr.press/v97/houlsby19a.html

Paper code:

2. Key points

Research topic: fine-tuning of large pre-trained models
Problem background: fully fine-tuning all parameters for every task is inefficient
Core method: propose the Adapter module; experiments are based on BERT across 26 classification tasks
Highlights: only a small number of trainable parameters are added per task; the pre-trained network's parameters are frozen and reused across tasks
Datasets: 26 classification tasks, including the GLUE benchmark
Conclusion: training only a few parameters approaches the performance of full fine-tuning; adapters score 80.0 on GLUE versus 80.4 for full fine-tuning
Paper type: model/method
Keywords: PETL, Adapter

The goal of introducing the Adapter: for N tasks, fully fine-tuning the model requires storing N× the parameters of the pre-trained model. Adapters aim to match the performance of full fine-tuning while keeping the total number of parameters close to 1×, by training only a small amount per task.
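To make this parameter budget concrete, here is a minimal back-of-the-envelope sketch. The backbone size (~110M, roughly BERT_BASE) and the ~2M adapter parameters per task are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope comparison of total stored parameters.
# The backbone and per-task adapter sizes are illustrative assumptions.
backbone_params = 110e6        # assumed BERT_BASE-sized backbone
adapter_params_per_task = 2e6  # assumed small per-task adapter budget
num_tasks = 26

full_finetune_total = num_tasks * backbone_params                      # N full copies
adapter_total = backbone_params + num_tasks * adapter_params_per_task  # 1 copy + N adapters

print(f"full fine-tuning: {full_finetune_total / backbone_params:.1f}x the backbone")
print(f"adapters:         {adapter_total / backbone_params:.2f}x the backbone")
```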

3. Model (core content)

How the Adapter module is combined with the Transformer:

Adapters are inserted at two points in each Transformer layer: one after the projection following multi-head attention, and one after the two feed-forward layers.

Each Adapter layer is a bottleneck: it projects down to a small dimension, applies a nonlinearity, and projects back up, with a skip-connection around it. Its parameters are far fewer than the original model's. During fine-tuning, only the green parts of the figure (the adapters and the layer-norm parameters) are updated.

[Figure: the Adapter module (a bottleneck with a skip-connection) and its placement inside the Transformer layer]
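As an illustration, here is a minimal PyTorch sketch of a Houlsby-style adapter layer: down-projection, nonlinearity, up-projection, and a skip-connection. The class name, bottleneck size, and the choice of GELU are my own assumptions, not the official code; the paper initializes the projections near zero so the adapter starts out close to the identity.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a skip-connection (sketch, not the official implementation)."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, bottleneck_size)  # project down
        self.activation = nn.GELU()                               # nonlinearity (assumed GELU)
        self.up_proj = nn.Linear(bottleneck_size, hidden_size)    # project back up
        # Near-zero initialization so the adapter starts close to the identity function.
        nn.init.normal_(self.down_proj.weight, std=1e-3)
        nn.init.normal_(self.up_proj.weight, std=1e-3)
        nn.init.zeros_(self.down_proj.bias)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip-connection: output = input + a small learned correction.
        return x + self.up_proj(self.activation(self.down_proj(x)))
```

In training, the pre-trained weights would be frozen and only the adapters (plus layer norms and the task head) left trainable, e.g. by setting `requires_grad = False` on every other parameter.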

4. Experiment and analysis

An AutoML platform is used for the experiments.

4.1 Dataset

GLUE benchmark

17 public text classification tasks

SQuAD question answering

4.2 GLUE benchmark results

Adapters score a GLUE average of 80.0, versus 80.4 for full fine-tuning.

Fully fine-tuning BERT_LARGE on the 9 GLUE tasks requires 9.0× the parameters of the original model in total, since every task needs its own fully fine-tuned copy.

The best adapter result is 80.0, while the total parameter count across all 9 tasks is only 1.3× the original model, with just 3.6% of the parameters trained per task.

[Table: GLUE benchmark results for adapters versus full fine-tuning of BERT_LARGE]
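The per-task fraction can be estimated from the adapter shape. The sketch below assumes a BERT_LARGE-sized backbone (hidden size 1024, 24 layers, about 340M parameters), a bottleneck of 64, and two adapters per layer; it will not reproduce the paper's 3.6% exactly, since the paper also retrains layer norms and picks the bottleneck size per task, but it lands in the same ballpark.

```python
# Rough estimate of the adapter parameter budget (illustrative assumptions).
hidden, layers, backbone = 1024, 24, 340e6  # BERT_LARGE-sized backbone (assumed)
bottleneck = 64                             # assumed adapter bottleneck size

# One adapter: down-projection + up-projection (weights and biases).
adapter_params = (hidden * bottleneck + bottleneck) + (bottleneck * hidden + hidden)
per_task = layers * 2 * adapter_params      # two adapters per Transformer layer

print(f"adapter params per task: {per_task / 1e6:.2f}M "
      f"({100 * per_task / backbone:.1f}% of the backbone)")
print(f"total for 9 GLUE tasks:  {(backbone + 9 * per_task) / backbone:.2f}x the backbone")
```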

5. Summary

The paper proposes an adapter module combined with the Transformer that approaches the performance of full fine-tuning while training only a small number of parameters. The idea is very good, and the results are quite strong.

Origin: blog.csdn.net/ld326/article/details/130827854