Unsupervised methods for generating sentence vectors from word vectors

       Word vector technology is a foundational technique in the NLP field: it converts individual words into vectors of a fixed dimension, and by computing on these vectors, NLP tasks can capture the semantic relationships between words.

       We all know that a sentence is composed of words, and word vector technology only turns a single word into a vector of fixed dimension. How, then, do we obtain a vector for a sentence composed of multiple words? This is a good question: after all, what actually needs to be processed in real-world text is usually a sentence, not a single word. To help readers understand the concrete steps of generating sentence vectors from word vectors, this article introduces several unsupervised methods for building sentence vectors from word vectors: the addition method, the average method, the TF-IDF weighted average method, and SIF embedding.

1 Addition method

       The addition method is the simplest way to obtain a sentence vector. Suppose we have the following text:

There is no royal way to geometry.
——Euclid

       This line is a famous quote from the ancient Greek mathematician Euclid, meaning that there is no royal shortcut into geometry. When NLP processes a piece of text, it first needs to remove the stop words; common English stop words include be-verbs, prepositions, conjunctions, and so on. After stop-word removal, the text above yields the following words:

       {there, no, royal, way, geometry}
       In this article, the word vectors are obtained from the pretrained word vector file GoogleNews-vectors-negative300.bin with Python's gensim library. The word vectors of the words above are listed below (for reasons of space, 5-dimensional word vectors are used for demonstration):

Term        Word vector
there       [0.1, 0.2, 0.3, 0.4, 0.5]
no          [0.2, 0.3, 0.4, 0.5, 0.6]
royal       [0.3, 0.4, 0.5, 0.6, 0.7]
way         [0.4, 0.5, 0.6, 0.7, 0.8]
geometry    [0.5, 0.6, 0.7, 0.8, 0.9]
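
       To make the lookup concrete, here is a minimal Python sketch of stop-word removal and word-vector loading with gensim. The stop-word list is a toy assumption, and the snippet assumes GoogleNews-vectors-negative300.bin has been downloaded locally; the 5-dimensional vectors above are illustrative stand-ins for the real 300-dimensional vectors.

```python
from gensim.models import KeyedVectors

STOP_WORDS = {"is", "to", "a", "an", "the"}  # toy stop-word list (assumption)

def content_words(sentence):
    """Lowercase, tokenize, and drop punctuation and stop words."""
    tokens = sentence.lower().replace(".", "").split()
    return [t for t in tokens if t not in STOP_WORDS]

# Load the pretrained GoogleNews vectors (a local download is assumed).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

words = content_words("There is no royal way to geometry.")
vectors = {w: wv[w] for w in words if w in wv}  # {word: 300-dim vector}
```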

       In practice, the addition method simply sums up the word vectors of all non-stop words in the sentence. If the sentence contains n non-stop words, the sentence vector is obtained as follows:

       Vsentence = Vword1 + Vword2 + …… + Vwordn

       Using this method, the sentence vector of "There is no royal way to geometry." is obtained as follows:

       Vsentence = Vthere + Vno + Vroyal + Vway + Vgeometry

                     = [ 0.1, 0.2, 0.3, 0.4, 0.5] + [ 0.2, 0.3, 0.4, 0.5, 0.6] + … + [0.5, 0.6, 0.7, 0.8, 0.9]

                     = [1.5, 2.0, 2.5, 3.0, 3.5]
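
       As a minimal sketch of the addition method, the snippet below reproduces this computation with numpy, using the toy 5-dimensional vectors from the table (in practice they would come from gensim as shown above):

```python
import numpy as np

# The toy 5-dimensional word vectors from the table above.
word_vecs = {
    "there":    np.array([0.1, 0.2, 0.3, 0.4, 0.5]),
    "no":       np.array([0.2, 0.3, 0.4, 0.5, 0.6]),
    "royal":    np.array([0.3, 0.4, 0.5, 0.6, 0.7]),
    "way":      np.array([0.4, 0.5, 0.6, 0.7, 0.8]),
    "geometry": np.array([0.5, 0.6, 0.7, 0.8, 0.9]),
}

def sentence_vector_sum(words, vecs):
    """Addition method: sum the word vectors of all non-stop words."""
    return np.sum([vecs[w] for w in words], axis=0)

print(sentence_vector_sum(word_vecs.keys(), word_vecs))
# [1.5 2.  2.5 3.  3.5]
```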

2 Average method

       The average method is similar to the addition method: it also sums up the word vectors of all non-stop words in the sentence, but at the end it divides the sum by the number of non-stop words. The sentence vector is obtained as follows:

       Vsentence = (Vword1 + Vword2 + …… + Vwordn) / n

       Using this method, the sentence vector of "There is no royal way to geometry." is obtained as follows:

       Vsentence = (Vthere + Vno + Vroyal + Vway + Vgeometry) / 5

                     = ([ 0.1, 0.2, 0.3, 0.4, 0.5] + [ 0.2, 0.3, 0.4, 0.5, 0.6] + … + [0.5, 0.6, 0.7, 0.8, 0.9]) / 5

                     = [1.5, 2.0, 2.5, 3.0, 3.5] / 5

                     = [0.3, 0.4, 0.5, 0.6, 0.7]
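
       Continuing the sketch above, the average method only changes the reduction from a sum to a mean:

```python
def sentence_vector_mean(words, vecs):
    """Average method: sum the word vectors, then divide by the word count."""
    return np.mean([vecs[w] for w in words], axis=0)

print(sentence_vector_mean(word_vecs.keys(), word_vecs))
# [0.3 0.4 0.5 0.6 0.7]
```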

3 TF-IDF weighted average

       The TF-IDF weighted average method requires TF-IDF, a commonly used text-processing technique. TF-IDF is often used to assess how important a word is to a document, and it is widely applied in search and information retrieval. The TF-IDF value of a word is proportional to the frequency of its occurrence in the document and inversely proportional to the frequency of its occurrence across the corpus. TF-IDF is obtained by multiplying the term frequency TF (Term Frequency) by the inverse document frequency IDF (Inverse Document Frequency). For the word ti:

\[TFIDF_{i,j} = \frac{n_{i,j}}{\sum\nolimits_k n_{k,j}} \times \log \frac{\left| D \right|}{\left| \left\{ j:t_i \in d_j \right\} \right|} \qquad (1)\]

       where ni,j is the number of times the word ti occurs in document j, Σk nk,j is the total number of occurrences of all words in document j, |D| is the total number of documents in the training corpus, and |{j : ti ∈ dj}| is the number of documents in the training corpus that contain the word ti.
       It is also worth noting that if the word ti does not occur in the corpus, then |{j : ti ∈ dj}| in (1) is 0, which makes the denominator of IDFi zero, so IDFi cannot be computed. The formula is therefore improved as follows:

\[IDF_i = \log \frac{\left| D \right|}{1 + \left| \left\{ j:t_i \in d_j \right\} \right|}\]
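
       As a small sketch, the smoothed formula can be written directly in Python (the function names are illustrative; log is the natural logarithm, which is also what the worked example below uses):

```python
import math

def tf(count_in_doc, doc_length):
    """Term frequency: n_ij divided by the total word count of document j."""
    return count_in_doc / doc_length

def idf(num_docs, docs_with_term):
    """Smoothed IDF: log(|D| / (1 + |{j : t_i in d_j}|))."""
    return math.log(num_docs / (1 + docs_with_term))

def tfidf(count_in_doc, doc_length, num_docs, docs_with_term):
    return tf(count_in_doc, doc_length) * idf(num_docs, docs_with_term)
```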

       The TF-IDF weighting only needs, for each non-stop word of the sentence, its word vector and its TF-IDF value. The TF part of each non-stop word is easy to compute; the IDF part depends on which corpus the user chooses: if the task is query retrieval, the IDF corpus is the set of all query sentences; if the task is clustering texts by self-similarity, the IDF corpus is the set of all sentences to be clustered. The sentence vector is then obtained by TF-IDF weighting:

       Vsentence = TFIDFword1 * Vword1 + TFIDFword2 * Vword2 + …… + TFIDFwordn * Vwordn

       Assume that "There is no royal way to geometry." is used as a retrieval query, so the corpus for computing TF-IDF is the set of all query sentences. Suppose there are 100 query sentences in total, of which 60 contain the word there, 65 contain the word no, 7 contain the word royal, 72 contain the word way, and 9 contain the word geometry. Then the TF-IDF value of each non-stop word in this sentence is as follows (log denotes the natural logarithm, using the smoothed IDF above):

       There: 1/(1+1+1+1+1) * log(100/(1+60)) = 0.098

       No: 1/(1+1+1+1+1) * log(100/(1+65)) = 0.083

       Royal: 1/(1+1+1+1+1) * log(100/(1+7)) = 0.505

       Way: 1/(1+1+1+1+1) * log(100/(1+72)) = 0.063

       Geometry: 1/(1+1+1+1+1) * log(100/(1+9)) = 0.460

       Therefore, the TF-IDF weighted sentence vector is:

       Vsentence = TFIDFthere * Vthere + TFIDFno * Vno + …… + TFIDFgeometry * Vgeometry

                      = 0.098*[0.1, 0.2, 0.3, 0.4, 0.5] + 0.083*[0.2, 0.3, 0.4, 0.5, 0.6] + … + 0.460*[0.5, 0.6, 0.7, 0.8, 0.9]

                      ≈ [0.433, 0.554, 0.675, 0.796, 0.917]
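
       Reusing tfidf and the toy word_vecs from the earlier sketches, the whole weighted computation looks like this (the exact output differs slightly from the hand calculation above, which rounds the weights to three decimals):

```python
# Document frequencies from the worked example: how many of the 100
# query sentences contain each word.
doc_freq = {"there": 60, "no": 65, "royal": 7, "way": 72, "geometry": 9}

def sentence_vector_tfidf(words, vecs, doc_freq, num_docs):
    """TF-IDF weighted sum of the word vectors of the non-stop words."""
    n = len(words)
    weights = {w: tfidf(1, n, num_docs, doc_freq[w]) for w in words}
    return np.sum([weights[w] * vecs[w] for w in words], axis=0)

words = list(word_vecs.keys())
print(sentence_vector_tfidf(words, word_vecs, doc_freq, 100))
# approximately [0.434 0.555 0.676 0.797 0.918]
```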

4 SIF embedding

       The SIF weighted average method is similar to the TF-IDF weighted average method. SIF weighting comes from the Princeton University paper A Simple but Tough-to-Beat Baseline for Sentence Embeddings (https://openreview.net/forum?id=SyK00v5xx). According to the authors, this method produces a good vector for a whole sentence based on the word vector of each word it contains. SIF embedding requires an estimated probability for each word and makes use of a principal component; the specific procedure is as follows:




Figure 1: Generating sentence vectors with SIF embedding

       First, the inputs of the algorithm are:
       (1) the word vector of each word
       (2) all the sentences of the corpus
       (3) the tunable parameter a
       (4) the estimated probability of each word

       The output of the algorithm is:
       the sentence vector of each sentence

       The specific steps of the algorithm are:
       (1) Compute the preliminary sentence vectors

       Traverse each sentence in the corpus. Suppose the current sentence is s; the preliminary sentence vector of the current sentence s is computed by the following formula:

\[\frac{{\rm{1}}}{{\left| s \right|}}\sum\nolimits_{w \in s} {\frac{a}{{a + p\left( w \right)}}{v_w}} \]

       That is, it is a weighted averaging process: each word vector is multiplied by the coefficient a / (a + p(w)) before being summed, and the summed vector is finally divided by the number of words in sentence s. The parameter a is tunable, and the paper uses the two values 0.0001 and 0.001. p(w) is the unigram probability of the word over the whole corpus, i.e., the word frequency of w divided by the total word frequency of all words in the corpus.

       (2) Compute the first principal component
       Perform principal component analysis on all the preliminary sentence vectors and compute their first principal component u

       (3) Obtain the target sentence vectors
       Each preliminary sentence vector is post-processed by subtracting its projection onto the first principal component, giving the target sentence vector:

\[v_s \leftarrow v_s - uu^T v_s\]

       The authors have also released their source code on GitHub; interested readers can download it and experiment with it themselves.
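
       Below is a compact numpy sketch of the three steps, assuming the unigram probabilities p(w) have already been estimated from the corpus; the authors' released implementation differs in engineering details.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """A sketch of SIF sentence embeddings.

    sentences  -- list of sentences, each a list of (non-stop) words
    word_vecs  -- dict {word: word vector as np.ndarray}
    word_probs -- dict {word: unigram probability p(w) over the corpus}
    a          -- the tunable parameter (the paper uses 1e-4 and 1e-3)
    """
    # Step 1: preliminary vectors, v_s = (1/|s|) * sum_{w in s} a/(a+p(w)) * v_w
    V = np.array([
        np.mean([(a / (a + word_probs[w])) * word_vecs[w] for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: first principal component u of the stacked preliminary vectors.
    _, _, vt = np.linalg.svd(V, full_matrices=False)
    u = vt[0]
    # Step 3: remove each vector's projection onto u: v_s <- v_s - u u^T v_s.
    return V - np.outer(V @ u, u)
```

       Note that step 2 is only meaningful when the corpus contains many sentences: with a single sentence, its preliminary vector is itself the first principal component, and the projection removal would zero it out.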

Summary

       This article introduced four unsupervised methods for generating a sentence vector from the word vectors of a sentence. Besides unsupervised methods, supervised methods can also be used in practice to generate sentence vectors, for example training a CNN text classifier and taking the output of the last hidden layer as the sentence vector. Interested readers can google for more details.

References

       [1] Arora S, Liang Y, Ma T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017.
