Natural Language Processing from Entry to Application - Dynamic Word Vector Pre-training: ELMo Word Vectors

After pre-training of the bidirectional language model is complete, the encoder part of the model (including the input representation layer and the multi-layer stacked LSTMs) can be used to compute dynamic word vector representations for any text. The most natural approach is to take the output of the last hidden layer of the two LSTMs as the dynamic vector representation of each word. In the ELMo model, however, hidden-layer vectors at different levels capture text information of different kinds and granularities. For example, LSTM hidden-layer representations closer to the top usually encode more semantic information, while representations closer to the bottom (including the input representation $x_t$) focus more on lexical and syntactic information. Different downstream tasks have different requirements for word representations: tasks such as reading comprehension and question answering rely heavily on semantic information, whereas for tasks such as named entity recognition, lexical and syntactic information matter more. ELMo therefore adopts a weighted average over the vector representations at different levels, giving downstream tasks more freedom in how the layers are combined. Let $R_t$ denote the set of all intermediate state vector representations of $w_t$; then:
$$R_t=\{x_t,\ h_{t,j} \mid j=1,2,\cdots,L\}$$

where $h_{t,j}=[\overleftarrow{h}_{t,j},\overrightarrow{h}_{t,j}]$ denotes the vector obtained by concatenating the outputs of the forward and backward hidden layers at layer $j$ of the two multi-layer stacked LSTMs. Letting $h_{t,0}=x_t$, the ELMo word vector can be expressed as:
$$\text{ELMo}_t=f(R_t,\Psi)=\gamma^{\text{task}}\sum_{j=0}^{L}s^{\text{task}}_{j}h_{t,j}$$

where $\Psi=\{s^{\text{task}},\gamma^{\text{task}}\}$ is the set of additional parameters needed to compute the ELMo vector. $s^{\text{task}}$ holds the weight of each layer's vector and reflects how important each layer is to the target task; it is obtained by normalizing a set of parameters with the Softmax function, and this weight vector is learned during training of the downstream task. $\gamma^{\text{task}}$ is a task-specific scaling coefficient that allows the ELMo vector to be scaled appropriately when it is combined with other vectors. When ELMo vectors are used as word features for a downstream task, the parameters of the encoder are "frozen" and are not updated (a minimal code sketch of this weighted combination follows the list below). In summary, ELMo vector representations have the following three characteristics:

  • Dynamic (context-sensitive): the ELMo vector of a word is determined by its current context.
  • Robust: ELMo uses character-level input, which makes the representation robust to out-of-vocabulary words.
  • Hierarchical: ELMo word vectors are combined from the vector representations of every layer of the deep pre-trained model, giving downstream tasks greater freedom in how they are used.
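To make the weighted combination above concrete, here is a minimal PyTorch sketch of the scalar mix defined by the formula for $\text{ELMo}_t$: unnormalized layer weights (playing the role of $s^{\text{task}}$) are normalized with Softmax, the result is scaled by $\gamma^{\text{task}}$, and both are learned with the downstream task while the encoder outputs are treated as frozen features. The class name, tensor shapes, and the random tensors standing in for encoder outputs are illustrative assumptions, not the original ELMo implementation.

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Illustrative ELMo-style scalar mix: ELMo_t = gamma * sum_j softmax(s)_j * h_{t,j}."""

    def __init__(self, num_layers):
        super().__init__()
        # One unnormalized weight per layer (the input representation x_t plus L stacked LSTM layers).
        self.s = nn.Parameter(torch.zeros(num_layers))
        # Task-specific scaling coefficient gamma^task.
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, hidden) tensors, one per layer.
        weights = torch.softmax(self.s, dim=0)                # s^task, normalized with Softmax
        stacked = torch.stack(layer_outputs, dim=0)           # (num_layers, batch, seq_len, hidden)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # weighted average over layers
        return self.gamma * mixed                             # ELMo_t


# Usage sketch: random tensors stand in for the frozen encoder outputs
# (x_t plus the hidden states of a 2-layer biLSTM, so num_layers = 3).
batch, seq_len, hidden = 2, 7, 1024
frozen_layers = [torch.randn(batch, seq_len, hidden).detach() for _ in range(3)]  # encoder is "frozen"

mix = ScalarMix(num_layers=3)
elmo_vectors = mix(frozen_layers)  # (batch, seq_len, hidden)
print(elmo_vectors.shape)
```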

The figure below shows the overall structure of the ELMo model:
[Figure: ELMo model]

Advantages and disadvantages of ELMo

ELMo realizes the shift from static word embeddings to dynamic, context-dependent word embeddings, which better addresses the problem of polysemy. However, because ELMo uses a Bi-LSTM, it is still an autoregressive model, so its ability to parallelize computation is limited. Since a large corpus is needed as training data, this limitation directly affects its performance and scalability. ELMo has two main advantages:

  • It realizes the transition from simple word embeddings (Word Embedding) to contextualized word embeddings (Contextualized Word Embedding).
  • It realizes the transformation of pre-trained models from static to dynamic.

At the same time, ELMo also has disadvantages. Its feature extractor is a bidirectional recurrent neural network (Bi-LSTM), and a recurrent network must be trained sequentially from left to right or from right to left, which severely limits parallel processing. In addition, each layer of ELMo simply concatenates the vectors of the two directions, so each direction is in effect still learned one way; the model cannot condition on both the left and the right context at the same time.
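The point that concatenation is still one-directional learning can be illustrated with a short, hedged sketch using standard PyTorch (`nn.LSTM`, not the original ELMo code): at any position, the forward half of a bidirectional LSTM's output has only seen the left context and the backward half only the right context, so perturbing a future token leaves the forward half at earlier positions unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8
lstm = nn.LSTM(input_size=4, hidden_size=hidden, bidirectional=True, batch_first=True)

x = torch.randn(1, 5, 4)           # a sequence of 5 tokens
x_future_changed = x.clone()
x_future_changed[:, 4, :] += 10.0  # perturb only the last (future) token

out1, _ = lstm(x)
out2, _ = lstm(x_future_changed)

# The forward half of the output at position 0 depends only on tokens <= 0,
# so it is identical even though a future token changed:
print(torch.allclose(out1[:, 0, :hidden], out2[:, 0, :hidden]))  # True
# The backward half at position 0 does see the change:
print(torch.allclose(out1[:, 0, hidden:], out2[:, 0, hidden:]))  # False
```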

