Natural Language Processing from Entry to Application: Dynamic Word Embeddings / Contextualized Word Embeddings



As mentioned previously, word vectors are learned mainly from the co-occurrence information between words in a corpus, and the core idea behind them is the distributional semantic hypothesis. The static word vector algorithms introduced so far, whether word2vec, which predicts words from their local context, or GloVe, which regresses on explicit global co-occurrence statistics, essentially aggregate the co-occurrence contexts of a word across the entire corpus into a single vector representation. A word vector trained on a given corpus can therefore be considered static: for any word, its vector representation is fixed and does not change with the context in which the word appears. In natural language, however, the same word may take on different meanings, grammatical functions, or other attributes in different contexts. Take the Chinese word "下场" as an example: in a sentence like "he personally took the field to compete" it is a verb meaning "to step onto the field", while in "he came to such an end" it is a noun meaning "ending" or "fate"; the meanings are entirely different, and so are the parts of speech. Polysemy is a common phenomenon in natural language and a natural result of how language develops and changes. Because a static representation compresses and aggregates all of a word's contextual information into a single vector, it has difficulty capturing the different meanings a word expresses in different contexts.
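
To make the "static" part concrete, here is a minimal sketch (NumPy; the vocabulary and vectors are toy placeholders, not a trained word2vec or GloVe model) showing that a static embedding is nothing more than a table lookup: the same word id always returns the same row, regardless of the sentence it appears in.

```python
import numpy as np

# Toy static embedding table: one fixed vector per word id. The values are
# random placeholders; a real table would come from word2vec or GloVe training.
vocab = {"he": 0, "personally": 1, "took": 2, "part": 3, "a": 4, "bad": 5, "end": 6}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))   # |V| x d

def static_vector(word: str) -> np.ndarray:
    """A static embedding is a pure table lookup, independent of context."""
    return embedding_table[vocab[word]]

# "end" stands in for the polysemous word in the running example.
# Whatever sentence it appears in, the lookup result is identical.
v_in_sentence_1 = static_vector("end")
v_in_sentence_2 = static_vector("end")
print(np.array_equal(v_in_sentence_1, v_in_sentence_2))  # True
```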

To address this problem, researchers proposed contextualized word embeddings. As the name suggests, the vector of a word is computed from its current context and therefore changes dynamically with that context; in the following articles this representation will also be called a dynamic word embedding. With dynamic word vectors, the word "下场" in the example above receives two different vector representations in the two sentences. Note that dynamic word vectors still strictly satisfy the distributional semantic hypothesis.
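
By contrast, the sketch below (PyTorch; a tiny, randomly initialized bidirectional LSTM standing in for a trained encoder, with a made-up vocabulary) illustrates what "dynamic" means operationally: because each hidden state is computed from the whole input sequence, the same word receives different vectors in different sentences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = {"the": 0, "end": 1, "of": 2, "game": 3, "he": 4,
         "came": 5, "to": 6, "a": 7, "bad": 8}
emb = nn.Embedding(len(vocab), 16)
# A bidirectional LSTM composes each word with its left and right context.
encoder = nn.LSTM(input_size=16, hidden_size=16,
                  batch_first=True, bidirectional=True)

def dynamic_vectors(words):
    ids = torch.tensor([[vocab[w] for w in words]])  # (1, T)
    hidden, _ = encoder(emb(ids))                    # (1, T, 2 * 16)
    return hidden[0]                                 # one vector per token

# "end" again stands in for the polysemous word in the example.
s1 = ["the", "end", "of", "the", "game"]
s2 = ["he", "came", "to", "a", "bad", "end"]
v_end_1 = dynamic_vectors(s1)[1]   # vector of "end" in sentence 1
v_end_2 = dynamic_vectors(s2)[5]   # vector of "end" in sentence 2
print(torch.allclose(v_end_1, v_end_2))  # False: same word, different contexts
```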

In a text sequence, the dynamic word vector of each word is in effect the result of semantically composing the word with its context, and for sequence data such as text, recurrent neural networks (RNNs) provide exactly this kind of composition. When modeling sequences, the hidden state of an RNN at the final time step is used as the vector representation of an entire text segment (a sentence) for text classification, while the hidden states at every time step are used for sequence labeling tasks such as part-of-speech tagging. This means that the hidden state at each time step (position) of an RNN can serve as the vector representation of the word at that position under its current context, that is, as its dynamic word vector. Moreover, an RNN can be trained in a self-supervised way through the language modeling task, without any additional annotated data. Based on this idea, Matthew Peters et al. proposed TagLM, a language-model-augmented sequence tagging model that uses the hidden states of a pretrained RNN language model as additional word features and significantly improves sequence labeling performance. They subsequently extended this line of work, proposing deep contextualized word representations and the pretrained model ELMo (Embeddings from Language Models). Experiments on multiple natural language processing tasks, including question answering, textual entailment, and information extraction, showed that ELMo directly and substantially improved the best models of the time. ELMo has also been extended to multilingual settings and achieved excellent results in the CoNLL-2018 shared task on multilingual universal dependency parsing.
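
Concretely, ELMo forms a token's dynamic word vector as a softmax-weighted sum of the hidden states that the pretrained bidirectional language model produces for that token at every layer, scaled by a task-specific factor. The sketch below (PyTorch) shows only this combination step and assumes the per-layer representations have already been computed; the shapes and names are illustrative rather than taken from the ELMo implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style combination: a softmax-weighted sum of layer representations,
    multiplied by a task-specific scalar. The weights are learned jointly with
    the downstream model (e.g. a sequence tagger)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scale factor

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, dim)
        norm_weights = torch.softmax(self.weights, dim=0)
        mixed = (norm_weights.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed                               # (batch, seq_len, dim)

# Pretend these are the hidden states of a 3-layer pretrained biLM
# (embedding layer plus two biLSTM layers) for 2 sentences of length 5.
layer_states = torch.randn(3, 2, 5, 1024)
elmo_vectors = ScalarMix(num_layers=3)(layer_states)
print(elmo_vectors.shape)  # torch.Size([2, 5, 1024])
```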

When the conditions allow, recurrent neural networks can also be trained with richer supervision signals. For example, given a bilingual parallel corpus of sufficient size, an RNN can be trained as part of a sequence-to-sequence machine translation model; after training, the encoder of the translation model can be used to encode the source language and produce dynamic word vectors. The CoVe model proposed in [12] adopts this pre-training approach. However, bilingual parallel corpora are harder to obtain than monolingual data and cover relatively limited domains, so the approach is less general.
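
A rough sketch of this setup is given below (PyTorch): the source-side encoder of a sequence-to-sequence translation model, shown here as a small untrained bidirectional LSTM with invented names and sizes rather than the actual CoVe implementation, is kept after NMT training and reused to produce one context-dependent vector per source word.

```python
import torch
import torch.nn as nn

class TranslationEncoder(nn.Module):
    """Source-language encoder of a seq2seq translation model. After NMT
    training, its per-token hidden states are reused as CoVe-style dynamic
    word vectors for downstream tasks."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)

    def forward(self, src_ids: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.bilstm(self.embed(src_ids))
        return hidden            # (batch, src_len, 2 * hidden_dim)

# During NMT training the decoder attends over these states; afterwards the
# encoder alone maps a source sentence to one context-dependent vector per word.
encoder = TranslationEncoder(vocab_size=10000)
cove_vectors = encoder(torch.randint(0, 10000, (2, 7)))
print(cove_vectors.shape)  # torch.Size([2, 7, 600])
```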

