## Understanding GloVe in four steps (with code implementation)

As the title of the GloVe paper suggests, GloVe stands for Global Vectors for Word Representation. It is a word representation tool based on global word-frequency statistics (count-based, overall statistics). It represents each word as a vector of real numbers, and these vectors capture semantic properties of words such as similarity and analogy. Operating on the vectors, for example by computing the Euclidean distance or the cosine similarity, gives the semantic similarity between two words.
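As a minimal sketch of the two vector operations mentioned above, the following computes cosine similarity and Euclidean distance. The three vectors are made-up 3-dimensional values for illustration only, not real GloVe output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between vectors u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical word vectors (assumed values, for illustration only)
vec_king = [0.8, 0.3, 0.1]
vec_queen = [0.7, 0.4, 0.1]
vec_apple = [0.1, 0.2, 0.9]

sim_king_queen = cosine_similarity(vec_king, vec_queen)
sim_king_apple = cosine_similarity(vec_king, vec_apple)
dist_king_queen = euclidean_distance(vec_king, vec_queen)
dist_king_apple = euclidean_distance(vec_king, vec_apple)
```

With these toy values, the pair chosen to be "similar" has the higher cosine similarity and the smaller Euclidean distance.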

## 2. GloVe implementation steps

### 2.1 Constructing the co-occurrence matrix

What is a co-occurrence matrix?

As the name suggests, a co-occurrence matrix records how often items occur together. A word-document co-occurrence matrix is mainly used to discover document topics, as in topic models such as LSA.

A word-word co-occurrence matrix built over a local window can capture syntactic and semantic information. For example:

• I like deep learning.
• I like NLP.
• I enjoy flying.

Given the three sentences above and a sliding window of size 2, we obtain the dictionary: {"I like", "like deep", "deep learning", "like NLP", "I enjoy", "enjoy flying", "I like"}.

From this dictionary we obtain a co-occurrence matrix, which is symmetric.

Each cell of the matrix holds the number of times the row word and the column word appear together in the dictionary, which reflects their co-occurrence behaviour.
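The construction above can be sketched in a few lines of Python. This is a minimal illustration under two assumptions: the sentence-final periods are dropped, and a sliding window of 2 tokens is modelled as one neighbour on each side (`window=1`). The function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter

sentences = ["I like deep learning", "I like NLP", "I enjoy flying"]


def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


counts = cooccurrence_counts(sentences)

# Fix a row/column order and lay the counts out as a square matrix
vocab = sorted({w for s in sentences for w in s.split()})
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]

for row_word, row in zip(vocab, matrix):
    print(f"{row_word:>10}", row)
```

Because every pair is counted in both directions, the resulting matrix is symmetric. The pair ("I", "like") is counted twice because it appears in two sentences.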

GloVe's co-occurrence matrix

GloVe builds a co-occurrence matrix $X$ from the corpus. Each element $X_{ij}$ is the number of times word $i$ and context word $j$ co-occur within a context window of a given size. Usually each co-occurrence counts as 1, but GloVe does it differently: it applies a decreasing weighting based on the distance $d$ between the two words in the context window, decay = 1/d. In other words, the farther apart two words are, the less their co-occurrence contributes to the total count.
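A minimal sketch of this distance-weighted counting follows. The function name `weighted_cooccurrence` and the example sentence are assumptions for illustration; the reference implementation may differ in details such as symmetric versus one-sided windows.

```python
from collections import defaultdict


def weighted_cooccurrence(tokens, window_size):
    """Accumulate 1/d for every pair of tokens that are d positions apart."""
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window_size + 1):
            j = i + d
            if j >= len(tokens):
                break
            # Add the decayed count in both directions to keep X symmetric
            X[(word, tokens[j])] += 1.0 / d
            X[(tokens[j], word)] += 1.0 / d
    return X


X = weighted_cooccurrence("I like deep learning".split(), window_size=3)
```

Here the adjacent pair ("I", "like") contributes 1, the pair ("I", "deep") at distance 2 contributes 1/2, and ("I", "learning") at distance 3 contributes 1/3.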

### 2.2 Approximate relationship between word vectors and the co-occurrence matrix

Next we relate the word vectors to the co-occurrence matrix. The authors of the paper propose that the following equation approximately expresses the relationship between the two:

$w_i^T\tilde{w}_j+b_i+\tilde{b}_j=\log(X_{ij})$

Here, $w_i$ and $\tilde{w}_j$ are the word vectors we ultimately want to solve for, and $b_i$ and $\tilde{b}_j$ are the bias terms of the two word vectors. You probably have many questions about this equation: where does it come from, why use this particular form, and why construct two sets of word vectors $w_i$ and $\tilde{w}_j$? The derivation is given in the GloVe paper.

### 2.3 Constructing the loss function

With the equation from 2.2, we can construct the loss function:

$J=\sum_{i,j=1}^V f(X_{ij})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log(X_{ij}))^2$

The loss is basically a mean squared error, with an added weighting function $f(X_{ij})$. What does this function do, and why add it? In any corpus, some word pairs co-occur very often (frequent co-occurrences), so we want the following:

• Frequently co-occurring pairs should carry more weight than pairs that rarely appear together (rare co-occurrences), so the function should be non-decreasing;
• The weight should not become too large (overweighted), so it should stop increasing once it reaches a certain level;
• If two words never appear together, that is, $X_{ij}=0$, they should not contribute to the loss, so $f(x)$ must satisfy $f(0)=0$.

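Many functions satisfy these three conditions. The authors chose the following piecewise function:

$f(x)=\begin{cases}(x/x_{max})^{\alpha} & \text{if } x<x_{max}\\ 1 & \text{otherwise}\end{cases}$

The graph of this function rises from 0 at $x=0$ and stays flat at 1 once $x$ reaches $x_{max}$.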
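As a minimal sketch, the weighting function can be written as below. It assumes the piecewise form from the GloVe paper, $(x/x_{max})^{\alpha}$ below the cutoff and 1 from the cutoff onward, with the paper's reported settings $x_{max}=100$ and $\alpha=0.75$ as defaults. The name `weight` is chosen here for illustration.

```python
def weight(x, x_max=100.0, alpha=0.75):
    """Weight for a co-occurrence count x: grows with x, capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


# Evaluate at a few counts to see the three conditions
samples = {x: weight(x) for x in (0, 1, 10, 50, 100, 1000)}
for x, fx in samples.items():
    print(x, round(fx, 4))
```

The samples show the three conditions: the weight is 0 for a count of 0, it never decreases as the count grows, and it is capped at 1 for counts at or above `x_max`.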

### 2.4 Training the GloVe model

Although many people describe GloVe as unsupervised learning (because it needs no manually annotated labels), it does in fact have a label: the $\log(X_{ij})$ term in the equation above. The vectors $w$ and $\tilde{w}$ in the equation are the parameters that are continually updated and learned. So in essence its training is no different from supervised training: both are based on gradient descent.
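To make the "label" view concrete, here is a minimal pure-Python sketch that fits $w_i^T\tilde{w}_j+b_i+\tilde{b}_j$ to $\log(X_{ij})$ with plain stochastic gradient descent. It is an illustration under assumptions, not the reference implementation: the paper trains with AdaGrad, and the function name `train_glove`, the toy counts, the learning rate, and the weighting defaults ($x_{max}=100$, $\alpha=0.75$) are all choices made here.

```python
import math
import random


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x); form and defaults assumed from the GloVe paper."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def train_glove(X, vocab_size, dim=5, lr=0.05, epochs=200, seed=0):
    """X maps (i, j) index pairs to co-occurrence counts greater than 0."""
    rng = random.Random(seed)

    def init_vector():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    W = [init_vector() for _ in range(vocab_size)]        # word vectors w_i
    W_tilde = [init_vector() for _ in range(vocab_size)]  # context vectors
    b = [0.0] * vocab_size                                # word biases
    b_tilde = [0.0] * vocab_size                          # context biases
    history = []
    for _ in range(epochs):
        total = 0.0
        for (i, j), x in X.items():
            dot = sum(a * c for a, c in zip(W[i], W_tilde[j]))
            diff = dot + b[i] + b_tilde[j] - math.log(x)  # prediction - label
            fx = weight(x)
            total += fx * diff * diff
            grad = 2.0 * fx * diff                        # shared gradient factor
            for k in range(dim):
                w_ik, w_jk = W[i][k], W_tilde[j][k]
                W[i][k] -= lr * grad * w_jk
                W_tilde[j][k] -= lr * grad * w_ik
            b[i] -= lr * grad
            b_tilde[j] -= lr * grad
        history.append(total)
    return W, W_tilde, b, b_tilde, history


# Toy symmetric counts for a hypothetical 3-word vocabulary
X = {(0, 1): 20.0, (1, 0): 20.0, (0, 2): 5.0,
     (2, 0): 5.0, (1, 2): 2.0, (2, 1): 2.0}
W, W_tilde, b, b_tilde, history = train_glove(X, vocab_size=3)
print(history[0], history[-1])
```

Only pairs with a non-zero count are stored in `X`, which matches the requirement that pairs with $X_{ij}=0$ do not contribute to the loss.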

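## 3. Comparison of GloVe with LSA and Word2Vec

LSA (Latent Semantic Analysis) is an early count-based word representation tool. It is also based on a co-occurrence matrix, but it reduces the dimensionality of the large matrix with a matrix factorization technique, singular value decomposition (SVD). SVD is computationally expensive, so LSA is costly to compute. Another drawback is that LSA gives every word the same statistical weight. GloVe overcomes each of these shortcomings.

The biggest drawback of word2vec is that it does not make full use of the whole corpus, so GloVe in effect combines the advantages of both approaches. According to the experimental results in the paper, GloVe performs far better than LSA and word2vec, although some people online report that GloVe and word2vec perform about the same in practice.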
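As a rough sketch of the SVD step that LSA relies on, the following reduces a toy co-occurrence matrix to 2-dimensional word vectors with NumPy. The matrix values and the choice `k = 2` are assumptions for illustration. Real LSA pipelines typically work on much larger, sparse matrices.

```python
import numpy as np

# Toy symmetric co-occurrence matrix for 4 hypothetical words
X = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 1.0, 3.0, 0.0],
])

# Full decomposition: X = U @ diag(S) @ Vt, singular values in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                  # number of dimensions to keep
word_vectors = U[:, :k] * S[:k]        # one k-dimensional row per word
X_approx = word_vectors @ Vt[:k, :]    # rank-k approximation of X

print(word_vectors)
print(np.linalg.norm(X - X_approx))    # reconstruction error (Frobenius norm)
```

The decomposition is computed on the whole matrix at once, which is where the cost grows quickly as the vocabulary gets larger.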

## 4. Code implementation

Generating word vectors

After unpacking the archive, enter the directory and run

make

to compile the code.

Then run sh demo.sh to train the model and generate the word vector files vectors.txt and vectors.bin.
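Once training has finished, the text output can be loaded for further use. The sketch below assumes that each line of vectors.txt holds a word followed by its space-separated vector components. The function name `parse_vectors` and the sample lines are made up for illustration.

```python
def parse_vectors(lines):
    """Turn lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample lines standing in for the contents of vectors.txt
sample_lines = ["the 0.1 -0.2 0.3\n", "of 0.0 0.5 -0.4\n", "\n"]
vectors = parse_vectors(sample_lines)
print(vectors["the"])
```

For a real file, the same function can be given an open file handle, for example the object returned by open("vectors.txt", encoding="utf-8"), since iterating over a file yields its lines.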

GloVe code implementation

## 5. References

Author: @mantchs


Origin www.cnblogs.com/mantch/p/11403771.html