Training and applying a sentence vector (sentence embedding) model: sent2vec

Word embedding aims to map each word of natural language to a real-valued vector, and academia has studied this area in depth; anyone working in NLP will already be familiar with it (if you are not, please refer to [1]). This article discusses a closely related topic: the sentence vector model (sent2vec), that is, the technique of mapping a complete sentence to a real-valued vector. Many results have been published in this area; the technique covered in this article mainly comes from a paper at NAACL 2018 (see reference [2] for details).


The sent2vec model proposed in [2] can be understood as an extension of the classic CBOW method from word-embedding technology. Specifically, in order to capture the semantics of a whole sentence (rather than of a single word), two changes are made: 1) the context under consideration is the entire sentence, not a sequence of words boxed out by a fixed-size window within it; 2) word n-grams are introduced to strengthen the embedding's ability to encode word order. Note that the model is trained in an essentially unsupervised way: for a given sentence, the input to sent2vec during training is all the words and n-gram sequences of the sentence with one word removed, and the output the neural network is fitted to is that missing word. Reference [3] understands sent2vec from another perspective (as shown in the figure below), regarding it as an unsupervised version of fastText, where "the entire sentence is the context and possible class labels are all vocabulary words"; note that "all vocabulary words" here are exactly the missing words mentioned above (if you understand CBOW, the two views amount to the same thing). Once the embedding vector of each word in the sentence has been obtained, the embedding of the sentence is the average of all the word vectors it contains.
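The training setup described above can be sketched in a few lines of plain Python. This is an illustrative simplification, not the paper's implementation: it builds CBOW-style (context, target) pairs where the context is every remaining unigram and bigram of the whole sentence after one word is held out (in the real model, bigrams spanning the removed word are handled with more care):

```python
def ngrams(words, n=2):
    """Return the word n-grams (here bigrams) of a token list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def training_pairs(sentence):
    """Sketch of sent2vec's training pairs: for each target word, the
    context is all remaining unigrams plus the bigrams of the remaining
    words (the whole sentence, not a fixed window), and the label is
    the missing word itself."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        rest = words[:i] + words[i + 1:]
        context = rest + ngrams(rest)  # unigrams + bigrams of the rest
        pairs.append((context, target))
    return pairs

pairs = training_pairs("the cat sat on the mat")
# one (context, target) pair per word in the sentence
```

Each pair would then be fed to a shallow network (shared input embeddings, a softmax over the vocabulary), exactly as in CBOW, except that the context covers the full sentence.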

Let's try to train a sent2vec model. The original authors of the paper provide several pre-trained models in [4] (for English text), and also give the steps for training a sent2vec model yourself ([5] also provides a package based on this work, use [4


Source: blog.csdn.net/baimafujinji/article/details/50652798