1. Basic information
| Topic | Authors and affiliation | Source | Year |
|---|---|---|---|
| Parameter-Efficient Transfer Learning for NLP | Neil Houlsby et al., Google Research; Jagiellonian University, Poland | PMLR (ICML) | 2019 |
Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]//International Conference on Machine Learning. PMLR, 2019: 2790-2799.
Paper link: http://proceedings.mlr.press/v97/houlsby19a.html
Code:
2. Key points
| Research topic | Problem background | Core method | Highlights | Datasets | Conclusion | Paper type | Keywords |
|---|---|---|---|---|---|---|---|
| Fine-tuning large models | Full fine-tuning is parameter-inefficient | Proposes the Adapter module; experiments are based on BERT across 26 classification tasks | Only a small number of trainable parameters are added per task; the pre-trained network's parameters stay frozen and are shared across tasks | 26 classification tasks, including the GLUE benchmark | With few trainable parameters, adapters approach full fine-tuning: 80.0 on GLUE vs. 80.4 for full fine-tuning | Model/method | PETL, Adapter |
Motivation for adapters: with N tasks, fully fine-tuning the model requires storing N× the parameters of the pre-trained model. Adapters aim to match fine-tuning performance while adding far fewer parameters per task, ideally keeping the total close to 1×.
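The scaling goal above can be written down directly: full fine-tuning stores one complete copy per task, while adapters store one shared body plus a small per-task module. A minimal sketch (the per-task fraction `f` below is a free parameter for illustration, not a value from the paper):

```python
def total_params_multiple(n_tasks: int, per_task_frac: float):
    """Return (full fine-tuning, adapter) total parameter counts,
    expressed as multiples of the pre-trained model size."""
    full_ft = float(n_tasks)                  # one full model copy per task
    adapters = 1.0 + n_tasks * per_task_frac  # shared body + small per-task adapters
    return full_ft, adapters

full, adapt = total_params_multiple(10, 0.01)
# full = 10.0x, adapt ~ 1.1x: adapters stay close to the ideal 1x
```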
3. Model (core content)
How adapters are combined with the Transformer:
Adapters are inserted at two points in each Transformer layer: one after the attention projection and one after the two feed-forward layers.
Each adapter layer is a bottleneck: it has far fewer parameters than the original model and includes a skip connection. During fine-tuning, only the adapter parameters (shown in green in the paper's figure) are updated.
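The bottleneck adapter described above can be sketched in NumPy (an illustration, not the paper's code; the dimensions are assumptions and ReLU stands in for the nonlinearity). The up-projection is initialized near zero so the adapter starts close to the identity function, as the paper suggests:

```python
import numpy as np

def adapter_forward(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection.
    h: (batch, d_model); W_down: (d_model, d); W_up: (d, d_model), with d << d_model.
    """
    z = np.maximum(h @ W_down + b_down, 0.0)  # ReLU stands in for the nonlinearity
    return h + (z @ W_up + b_up)              # skip connection around the bottleneck

# Hypothetical sizes: adapter adds ~2*d_model*d parameters, tiny when d << d_model
d_model, d = 1024, 64
rng = np.random.default_rng(0)
W_down = rng.normal(0.0, 0.01, (d_model, d)); b_down = np.zeros(d)
W_up   = np.zeros((d, d_model));              b_up   = np.zeros(d_model)  # near-identity init
h = rng.normal(size=(2, d_model))
out = adapter_forward(h, W_down, b_down, W_up, b_up)
```

With the near-identity initialization, the adapter initially passes its input through unchanged, so inserting it does not disturb the pre-trained network at the start of training.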
4. Experiment and analysis
Experiments were run on an AutoML platform.
4.1 Dataset
GLUE benchmark
17 additional public text classification datasets
SQuAD question answering
4.2 GLUE benchmark results
Adapters score 80.0 on GLUE versus 80.4 for full fine-tuning.
Fully fine-tuning BERT_LARGE across the 9 GLUE tasks requires 9.0× the base model's parameters in total, since each task fine-tunes a complete copy of the model.
Adapters reach their best score of 80.0 while the total parameter count is only 1.3× the original model, with just 3.6% of parameters trained per task.
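The 1.3× and 3.6% figures are consistent with each other: nine tasks, each adding roughly 3.6% new parameters on top of one shared model, give a total close to the reported 1.3×:

```python
n_tasks = 9
per_task_frac = 0.036  # trainable fraction per task reported for adapters

full_ft_total = float(n_tasks)                  # 9.0x: one full copy per task
adapter_total = 1.0 + n_tasks * per_task_frac   # shared body + per-task adapters
# adapter_total ~ 1.32, matching the reported 1.3x total
```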
5. Summary
The paper proposes an adapter module combined with the Transformer that approaches full fine-tuning performance while training only a small fraction of the parameters. The idea is simple, and the results are strong.