Overview of Pre-trained Language Models (3) - Practical Use of Pre-trained Language Models

This series of articles records the author's summary of pre-trained language models, with Qiu Xipeng's "Pre-trained Models for Natural Language Processing: A Survey" as the main reference. The works covered in that survey are shared here for discussion. This article is the third in the series and covers the practical use of pre-trained language models.

The first and second articles can be found here:
Overview of Pre-trained Language Models (1) - Pre-trained Language Models and Their History
Overview of Pre-trained Language Models (2) - Pre-training Tasks and Training Strategies

Practical Use of Pre-trained Language Models

The practical use of pre-trained language models is itself a complex issue, which I summarize in two aspects. On the one hand, applying pre-trained language models to downstream NLP tasks and improving performance on those tasks is the ultimate goal of the technology. On the other hand, although pre-trained language models have brought performance gains to a wide range of NLP tasks, they have long been criticized for their high energy consumption and heavy dependence on computing resources.

Transfer Learning

A large body of literature has shown that pre-trained language models bring welcome performance improvements on general benchmarks, question answering, sentiment analysis, named entity recognition, machine translation, summarization, and many other tasks. At present, applying a pre-trained language model to a downstream task relies mainly on transfer learning, which carries the general language knowledge learned from a large-scale corpus (the source dataset) over to a specific downstream task (the target dataset), as shown in the figure below.

(1) Selection of pre-training tasks, model structure, and corpus

As mentioned in Overview of Pre-trained Language Models (2) - Pre-training Tasks and Training Strategies, we need to consider the impact of the pre-training task on downstream tasks. For example, NSP teaches the pre-trained language model to understand the relationship between two sentences, so for downstream tasks such as question answering and natural language inference, NSP is worth considering as a pre-training task.
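As a minimal sketch of what NSP-style sentence-pair knowledge looks like in practice, the snippet below scores whether one sentence plausibly follows another. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the survey; the example sentences are made up for illustration.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# Load a BERT checkpoint that was pre-trained with the NSP objective.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The weather today is sunny."
sentence_b = "I plan to go for a walk in the park."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, 2)

# Index 0 means "sentence B follows sentence A", index 1 means "random sentence".
is_next = logits.argmax(dim=-1).item() == 0
print("B follows A:", is_next)
```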

The model structure and corpus also need to be chosen according to the downstream task. For example, because BERT lacks an Encoder-Decoder structure, it is difficult to use for generation tasks; in contrast, the GPT series of pre-trained models may bring better performance in generation. The same is true for the corpus: if conditions permit, a large amount of domain-specific or language-specific corpus can be used for pre-training, or ready-made models pre-trained on such corpora can be adopted, which may bring further improvements on related downstream tasks.
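To make the structural point concrete, here is a minimal sketch of open-ended generation with a decoder-only (GPT-style) model, assuming the HuggingFace transformers pipeline API and the gpt2 checkpoint (both assumptions of this sketch, not choices made in the survey):

```python
from transformers import pipeline

# A decoder-only model is a natural fit for left-to-right text generation,
# whereas an encoder-only model like BERT has no built-in decoding procedure.
generator = pipeline("text-generation", model="gpt2")
print(generator("Pre-trained language models are", max_length=30, num_return_sequences=1))
```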

(2) Layer selection

Different layers of a pre-trained language model may extract different features. There are several ways to use these features: using only the static embeddings, using the top-layer representation, or combining the representations of all layers (fused in some way). As for which is better, or which type of task each suits, there is no firm conclusion at present; the choice still has to be made through experiments on the specific task.
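The sketch below illustrates the three options side by side, assuming the HuggingFace transformers library and bert-base-uncased; the ELMo-style scalar mix at the end is just one possible fusion scheme, not the only one.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Layer selection example.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states        # tuple: embedding layer + 12 Transformer layers
static_embedding = hidden_states[0]          # option 1: static (non-contextual) embeddings
top_layer = hidden_states[-1]                # option 2: top-layer representation

# Option 3: fuse all layers with learnable scalar weights (ELMo-style scalar mix).
num_layers = len(hidden_states)
layer_weights = nn.Parameter(torch.zeros(num_layers))
stacked = torch.stack(hidden_states, dim=0)  # (num_layers, batch, seq_len, hidden)
fused = (torch.softmax(layer_weights, dim=0).view(-1, 1, 1, 1) * stacked).sum(dim=0)
```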

(3) Whether to freeze the pre-trained model parameters during fine-tuning

We know that transfer learning usually consists of two stages: pre-training and fine-tuning. For some tasks, the pre-trained model is used in the fine-tuning stage purely as a feature extractor, with its parameters frozen (Word Embedding and ELMo are typical examples). For most tasks, however, the parameters of the pre-trained model are not frozen and continue to be updated during fine-tuning to adapt to the downstream task. The advantages of the former approach over the latter are still unclear to me for now, and I prefer the latter.
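In code, the difference between the two approaches reduces to which parameters receive gradients. A minimal sketch, assuming the HuggingFace transformers library, bert-base-uncased, and a two-class classification head (all assumptions of this illustration):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Option A: feature extractor -- freeze the pre-trained encoder, train only the task head.
for param in model.bert.parameters():
    param.requires_grad = False

# Option B: full fine-tuning -- leave all parameters trainable (the default behavior).
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} parameter tensors will be updated during fine-tuning")
```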

Model Compression and Acceleration

To address the fact that large-scale pre-trained language models consume a great deal of energy and depend heavily on computing resources, and to make them usable by more people, industries, and computing platforms, model compression and acceleration are essential.

In fact, the topic of model compression is not limited to pre-trained language models; it has developed hand in hand with deep learning models in general. Common methods such as pruning, quantization, parameter sharing, low-rank decomposition, and knowledge distillation have all been applied to pre-trained models. For the specific literature, refer to Mr. Qiu's survey.
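As one example of these methods, the sketch below applies post-training dynamic quantization to a BERT model. It assumes PyTorch's torch.quantization module and a HuggingFace bert-base-uncased checkpoint; it is only an illustration of quantization as a compression technique, not a method advocated by the survey.

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Post-training dynamic quantization: Linear-layer weights are stored in int8,
# and activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # Rough on-disk size of the model's parameters, in megabytes.
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"fp32: {size_mb(model):.1f} MB, int8: {size_mb(quantized_model):.1f} MB")
```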

What the author wants to point out here is that work on hardware acceleration does not seem to have received Mr. Qiu's attention in this survey. Hardware acceleration makes it easier to deploy pre-trained models on more computing platforms (especially platforms with limited computing power). The author recommends the following works for reference:

[1] S. Pati, S. Aga, N. Jayasena, and M. D. Sinclair, "Demystifying BERT: implications for accelerator design."
[2] Y. J. Kim and H. H. Awadalla, "FastFormers: highly efficient transformer models for natural language understanding," arXiv:2010.13382 [cs], Oct. 2020. [Online]. Available: http://arxiv.org/abs/2010.13382
[3] Z. Liu, G. Li, and J. Cheng, "Hardware acceleration of fully quantized BERT for efficient natural language processing," arXiv:2103.02800 [cs], Mar. 2021. [Online]. Available: http://arxiv.org/abs/2103.02800
[4] Y. You et al., "Large batch optimization for deep learning: training BERT in 76 minutes," arXiv:1904.00962 [cs, stat], Jan. 2020. [Online]. Available: http://arxiv.org/abs/1904.00962
