[NLP] Word Embeddings: GloVe in Detail

What is GloVe?

GloVe's full name is Global Vectors for Word Representation, i.e. "global word vectors". Like word2vec, it is a method for representing words as vectors.

How is GloVe implemented?

The implementation of GloVe can be divided into the following three steps:

  • Construct a co-occurrence matrix (Co-occurrence Matrix) $X$ from the corpus. Each element $X_{ij}$ of the matrix represents the number of times word $i$ and context word $j$ co-occur within a context window (context window) of a given size. Normally each co-occurrence would add a count of 1, but GloVe disagrees: based on the distance $d$ between the two words within the context window, it proposes a decreasing weighting function, $decay = 1/d$, for computing the count. That is to say, the farther apart two words are, the smaller the weight they contribute to the total count (see the Python sketch after this list).
  • Construct an approximate relationship between the word vectors (Word Vectors) and the co-occurrence matrix (Co-occurrence Matrix). The authors propose the following equation to approximately express the relationship between the two:

    $$w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij}) \tag{1}$$

    where $w_i$ and $\tilde{w}_j$ are the word vector and context word vector we ultimately want to solve for, and $b_i$ and $\tilde{b}_j$ are their bias terms. With equation (1) we can then construct its loss function:

    $$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \tag{2}$$

  • The basic form of this loss function is the simplest mean-square loss; the only addition is the weighting function $f(X_{ij})$. So what role does this function play, and why add it? We know that in a corpus there are certainly word pairs that co-occur very often (frequent co-occurrences), so we want $f$ to satisfy three conditions (the concrete form the paper chooses is given right after this list):

    • 1. The weight of these frequent word pairs should be greater than that of word pairs that rarely occur together (rare co-occurrences), so the function should be non-decreasing (non-decreasing);
    • 2. But we do not want this weight to become too large (overweighted); once it reaches a certain level it should not increase further;
    • 3. If two words never appear together, i.e. $X_{ij} = 0$, then they should not participate in the computation of the loss function at all, i.e. $f$ must satisfy $f(0) = 0$.
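
The function the paper actually settles on, which satisfies all three conditions, is a piecewise power function with $\alpha = 3/4$ and $x_{max} = 100$ in its experiments:

$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

To make the three steps concrete, here is a minimal Python sketch, assuming a toy corpus of pre-tokenized sentences. The helper names (`build_cooccurrence`, `weight_fn`, `glove_loss`) are my own for illustration; they are not from the paper or from any GloVe library.

```python
import numpy as np
from collections import defaultdict

def build_cooccurrence(corpus, window_size=10):
    """Step 1: weighted co-occurrence counts using the 1/d distance decay."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}
    X = defaultdict(float)
    for sentence in corpus:
        ids = [vocab[w] for w in sentence]
        for center, wi in enumerate(ids):
            # Scan only to the right and add both directions, so X is symmetric.
            for d, wj in enumerate(ids[center + 1:center + 1 + window_size], start=1):
                X[(wi, wj)] += 1.0 / d  # decay = 1/d: distant pairs count less
                X[(wj, wi)] += 1.0 / d
    return vocab, X

def weight_fn(x, x_max=100.0, alpha=0.75):
    """f(x) from the formula above: non-decreasing, capped at 1, f(0) = 0."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Equation (2): weighted least-squares loss over the non-zero entries."""
    return sum(
        weight_fn(x) * (W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)) ** 2
        for (i, j), x in X.items()
    )

# Toy usage: two short "sentences" and randomly initialized parameters.
corpus = [["glove", "is", "a", "word", "vector", "model"],
          ["word2vec", "is", "a", "word", "vector", "model"]]
vocab, X = build_cooccurrence(corpus, window_size=5)
rng = np.random.default_rng(0)
V, dim = len(vocab), 50
W, W_tilde = 0.01 * rng.normal(size=(V, dim)), 0.01 * rng.normal(size=(V, dim))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```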

     

How is GloVe trained?

Although many people claim that GloVe is an unsupervised (unsupervised learning) method (because it requires no manually annotated labels), it does in fact have a label: the $\log(X_{ij})$ in equation (2). The vectors $w$ and $\tilde{w}$ in the formula are the parameters that are continually updated/learned, so in essence its training method is no different from supervised learning: it is based on gradient descent. Concretely, the experiments in the paper do the following: using the AdaGrad gradient descent algorithm, randomly sample all the non-zero elements of the matrix $X$, set the learning rate (learning rate) to 0.05, and iterate 50 times for vector sizes smaller than 300 and 100 times for other vector sizes, until convergence. What the training finally produces is two sets of vectors, $w$ and $\tilde{w}$. Because $X$ is symmetric (symmetric), in principle $w$ and $\tilde{w}$ should be symmetric as well; their only difference is that they are initialized differently, which leads to different final values. So the two are in fact equivalent, and either could be used as the final result. However, to improve robustness, the sum $w + \tilde{w}$ is finally chosen as the final vectors (the two different initializations are equivalent to adding different random noise, which can improve robustness). After training on a corpus of 40 billion tokens, the experimental results are as shown below:
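
As a sketch of what one such update looks like, here is a minimal illustration of a single stochastic AdaGrad step, assuming the parameters are numpy arrays and that `grad_sq` maps names to per-parameter accumulator arrays initialized to ones. `adagrad_step` is a hypothetical helper, not the paper's released implementation; the gradient expressions follow from differentiating equation (2) for one sampled entry.

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):  # same f(x) as in the earlier sketch
    return (x / x_max) ** alpha if x < x_max else 1.0

def adagrad_step(i, j, x_ij, W, W_tilde, b, b_tilde, grad_sq, lr=0.05):
    """One stochastic AdaGrad update on a single non-zero entry X_ij.

    grad_sq holds one accumulator array per parameter, initialized to ones
    so the very first update is a plain gradient step of size lr.
    """
    # Residual of equation (2) for this (i, j) pair, and its shared factor.
    inner = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x_ij)
    g = 2.0 * weight_fn(x_ij) * inner

    gw, gwt = g * W_tilde[j], g * W[i]  # gradients w.r.t. w_i and w~_j

    # AdaGrad scales each step by the root of that parameter's
    # accumulated squared gradients.
    W[i] -= lr * gw / np.sqrt(grad_sq["W"][i])
    W_tilde[j] -= lr * gwt / np.sqrt(grad_sq["Wt"][j])
    b[i] -= lr * g / np.sqrt(grad_sq["b"][i])
    b_tilde[j] -= lr * g / np.sqrt(grad_sq["bt"][j])

    grad_sq["W"][i] += gw ** 2
    grad_sq["Wt"][j] += gwt ** 2
    grad_sq["b"][i] += g ** 2
    grad_sq["bt"][j] += g ** 2
```

Looping this over randomly shuffled non-zero entries of $X$ for the stated number of iterations, and finally returning `W + W_tilde`, would reproduce the training scheme described above in spirit.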

This figure uses a total of three metrics: semantic accuracy, syntactic accuracy, and overall accuracy. From it we can easily see that a vector dimension of 300 achieves the best results, and that the best context window size is roughly between 6 and 10.

Comparison of GloVe with LSA and word2vec

LSA (Latent Semantic Analysis) is a relatively early count-based word-vector representation tool. It is also based on a co-occurrence matrix, but it uses a matrix factorization technique based on singular value decomposition (SVD) to reduce the dimensionality of the large matrix, and as we know the complexity of SVD is very high, so its computational cost is relatively large. Another point is that it gives the same statistical weight to every word. These disadvantages are overcome one by one in GloVe. The biggest drawback of word2vec, on the other hand, is that it does not make full use of the statistics of the whole corpus, since it learns from local context windows rather than global co-occurrence counts; GloVe in fact combines the advantages of both. Judging from the experimental results presented in the paper, GloVe's performance far exceeds that of LSA and word2vec, but there are also people online who say that in practice the performance of GloVe and word2vec is actually similar.
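
For contrast, an LSA-style baseline is essentially a truncated SVD of the same co-occurrence matrix. A minimal sketch follows; `lsa_vectors` is a hypothetical helper, and a real system would typically apply a log or TF-IDF transform first and use a sparse truncated SVD rather than the dense one shown here.

```python
import numpy as np

def lsa_vectors(X_dense, dim=50):
    """LSA-style embeddings: truncated SVD of a (V, V) co-occurrence matrix.

    Full SVD of a V x V matrix costs O(V^3), which is the computational
    burden the paragraph above refers to.
    """
    U, S, _ = np.linalg.svd(X_dense, full_matrices=False)
    return U[:, :dim] * S[:dim]  # keep only the top singular directions
```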

 

Reprinted from: https://blog.csdn.net/u014665013/article/details/79642083
