## 1. About GloVe

As the title of the GloVe paper suggests, **GloVe stands for Global Vectors for Word Representation. It is a word representation tool based on global word-frequency statistics (count-based, using overall corpus statistics). It represents each word as a vector of real numbers, and these vectors capture semantic properties of words such as similarity and analogy.** By operating on these vectors, for example by computing the Euclidean distance or the cosine similarity, we can measure the semantic similarity between two words.
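As a minimal sketch of these two vector operations, the snippet below computes the cosine similarity and the Euclidean distance in plain Python. The three-dimensional vectors are invented purely for illustration; real GloVe vectors typically have 50 to 300 dimensions, and the function names are not from any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vectors, made up for this example only.
vectors = {
    "king": [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

for other in ("queen", "apple"):
    print(
        "king vs", other,
        "cosine:", round(cosine_similarity(vectors["king"], vectors[other]), 3),
        "distance:", round(euclidean_distance(vectors["king"], vectors[other]), 3),
    )
```

A higher cosine similarity (or a smaller distance) means the two words are treated as more similar.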

## 2. GloVe implementation steps

### 2.1 Construction of co-occurrence matrix

**What is a co-occurrence matrix?**

As the name suggests, a co-occurrence matrix records how often items occur together. A word-document co-occurrence matrix is mainly used to discover topics, as in topic models such as LSA.

A word-word co-occurrence matrix built over a local window can capture syntactic and semantic information. **For example:**

- I like deep learning.
- I like NLP.
- I enjoy flying.

Given the three sentences above and a sliding window of size 2, we obtain the following dictionary of word pairs: **{"I like", "like deep", "deep learning", "like NLP", "I enjoy", "enjoy flying", "I like"}**.

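From these pairs we can build a co-occurrence matrix (a symmetric matrix):

| | I | like | enjoy | deep | learning | NLP | flying |
|---|---|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

Each cell holds the number of times the row word and the column word appear together as a pair in the dictionary, which reflects their **co-occurrence**.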
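The counts for the three example sentences can be reproduced with a few lines of Python. This is a minimal sketch, not code from the GloVe project. It assumes whitespace tokenization with the trailing period stripped, and it reads "a sliding window of 2" as "each word is paired with its immediate neighbours" (`window=1` on either side). The names `tokenize` and `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter

sentences = ["I like deep learning.", "I like NLP.", "I enjoy flying."]

def tokenize(sentence):
    # Assumption: split on whitespace and drop the trailing period.
    return sentence.rstrip(".").split()

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(sentences)
vocab = sorted({word for pair in counts for word in pair})
# Counter returns 0 for pairs that never co-occur.
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]

print(vocab)
for row_word, row in zip(vocab, matrix):
    print(row_word, row)
```

Because every pair is counted in both directions, the resulting matrix is symmetric.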

**The co-occurrence matrix in GloVe**

From the corpus, we construct a co-occurrence matrix X, where **each element \(X_{ij}\) is the number of times word i and context word j appear together within a context window of a given size.** Normally each co-occurrence counts as 1, but GloVe does it differently: based on the distance d between the two words in the context window, it applies a decreasing weighting function, decay = 1/d. In other words, **the farther apart two words are, the less they contribute to the total count.**
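The 1/d decay can be sketched as follows. This is an illustrative toy, not the GloVe project's own counting code: `weighted_cooccurrence` is a hypothetical helper, and counting each pair in both directions (so that X is symmetric) is an assumption of the sketch.

```python
from collections import defaultdict

def weighted_cooccurrence(tokens, window_size):
    """Add 1/d to X for every pair of words that are d positions apart, d <= window_size."""
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window_size + 1):
            j = i + d
            if j >= len(tokens):
                break
            # A neighbour at distance d contributes 1/d instead of a full count of 1.
            X[(word, tokens[j])] += 1.0 / d
            X[(tokens[j], word)] += 1.0 / d
    return X

tokens = "i like deep learning".split()
X = weighted_cooccurrence(tokens, window_size=3)

for pair in [("i", "like"), ("i", "deep"), ("i", "learning")]:
    print(pair, round(X[pair], 3))
```

Adjacent words contribute a full count, while words three positions apart contribute only a third of a count.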

### 2.2 Approximate relationship between word vectors and the co-occurrence matrix

To relate the word vectors to the co-occurrence matrix, the authors of the paper propose the following equation, which approximately expresses the relationship between the two:

\[w_i^T\tilde{w}_j+b_i+\tilde{b}_j=\log(X_{ij})\]

Here \(w_i\) and \(\tilde{w}_j\) are the word vectors we ultimately want to solve for, and \(b_i\) and \(\tilde{b}_j\) are the bias terms of the two word vectors. You probably have many questions about this formula: where does it come from, why use this form, and why construct two sets of word vectors \(w_i\) and \(\tilde{w}_j\)? Please see the references at the end of this article.
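For readers who want the gist first, here is a heavily condensed sketch of how the GloVe paper motivates this form. The notation \(X_i=\sum_k X_{ik}\) and \(P_{ik}=X_{ik}/X_i\) is the paper's and is not used elsewhere in this article; several intermediate arguments are skipped. The paper starts from ratios of co-occurrence probabilities and asks for a function \(F\) of the word vectors that reproduces them:

\[F\big((w_i-w_j)^T\tilde{w}_k\big)=\frac{P_{ik}}{P_{jk}}\]

Choosing \(F=\exp\) satisfies this requirement when

\[w_i^T\tilde{w}_k=\log P_{ik}=\log X_{ik}-\log X_i\]

The term \(\log X_i\) does not depend on \(k\), so it can be absorbed into a bias \(b_i\); adding a second bias \(\tilde{b}_k\) makes the expression symmetric in the two words:

\[w_i^T\tilde{w}_k+b_i+\tilde{b}_k=\log X_{ik}\]

The two sets of vectors reflect the two roles a word can play: \(w\) when it is the target word and \(\tilde{w}\) when it is the context word.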

### 2.3 Constructing the loss function

With the equation from Section 2.2, we can construct the loss function:

\[J=\sum_{i,j=1}^{V}f(X_{ij})\left(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log(X_{ij})\right)^2\]

The loss is essentially the simplest mean squared error, with an added weighting function \(f(X_{ij})\). What does this function do, and why add it? In any corpus, some word pairs appear together very often (frequent co-occurrences). We would like the following:

- These frequent pairs should carry more weight than pairs that rarely appear together (rare co-occurrences), so the function must be non-decreasing;
- But the weight should not become too large (overweighted): beyond a certain point it should stop increasing;
- If two words never appear together, that is, \(X_{ij}=0\), they should not contribute to the loss at all, so \(f\) must satisfy \(f(0)=0\).

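Many functions satisfy these three conditions. The authors chose the following piecewise function:

\[f(x)=\begin{cases}(x/x_{max})^{\alpha} & \text{if } x<x_{max}\\ 1 & \text{otherwise}\end{cases}\]

In the paper's experiments, \(\alpha=0.75\) and \(x_{max}=100\). Plotted, the function rises from 0 and levels off at 1 once \(x\) reaches \(x_{max}\).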
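To make the weighting and the loss concrete, here is a minimal Python sketch. It uses the piecewise form from the GloVe paper, a power of 3/4 below a cutoff of 100 and a constant 1 above it, with both values exposed as parameters. The data layout is an assumption for illustration: `X` is a dictionary from index pairs to counts, and the vectors are plain lists.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max) ** alpha below x_max, capped at 1 above it."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def glove_loss(X, w, w_tilde, b, b_tilde):
    """Weighted squared error J, summed over the non-zero entries of X.

    Pairs with a zero count are skipped, which matches f(0) = 0 and
    avoids evaluating log(0).
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        if x_ij <= 0:
            continue
        error = dot(w[i], w_tilde[j]) + b[i] + b_tilde[j] - math.log(x_ij)
        total += weight(x_ij) * error ** 2
    return total

# Hypothetical two-word example with hand-picked parameters.
X = {(0, 1): 10.0, (1, 0): 10.0, (0, 0): 0.0}
w = [[0.5, 0.1], [0.2, 0.4]]
w_tilde = [[0.3, 0.3], [0.6, 0.2]]
b = [0.1, 0.0]
b_tilde = [0.0, 0.2]

print("f(0), f(10), f(100), f(1000):", weight(0), weight(10), weight(100), weight(1000))
print("loss:", glove_loss(X, w, w_tilde, b, b_tilde))
```

The three conditions from the list above are visible here: the weight grows with the count, stops growing at the cutoff, and is zero for pairs that never co-occur.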

### 2.4 Training the GloVe model

Although many people describe GloVe as unsupervised learning (because it needs no manually annotated labels), it does in fact have a label: the \(\log(X_{ij})\) in the equation above. The vectors \(w\) and \(\tilde{w}\) in the equation are the parameters that are continually updated and learned. So in essence, its training is no different from supervised learning: both are based on gradient descent.

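Specifically, the experiments in the paper were run as follows: **they used AdaGrad gradient descent, sampling stochastically from all the non-zero elements of the matrix X, with the learning rate set to 0.05, running 50 iterations for vector sizes smaller than 300 and 100 iterations for other sizes, until convergence.** Training yields two sets of vectors, \(w\) and \(\tilde{w}\). Because X is symmetric, \(w\) and \(\tilde{w}\) are in principle symmetric as well; the only difference is that they are initialized with different values, which leads to different final values.

So the two are in fact equivalent, and either could be used as the final result. **To improve robustness, however, we take their sum** \(w+\tilde{w}\) **as the final vector (the different initializations act like different random noise, so summing them improves robustness).** After training on a corpus of 40 billion tokens, the experimental results are shown in the figure below:

The figure uses three metrics: semantic accuracy, syntactic accuracy, and overall accuracy. It is easy to see that accuracy is best at a vector dimension of 300, with a context window size roughly between 6 and 10.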
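The training procedure described in this section can be sketched in plain Python as below. It is a toy illustration under several assumptions, not the reference implementation: the data layout (`X` as a dictionary of index pairs), the random initialization, the AdaGrad accumulators starting at 1.0, and the function name `train_glove` are all choices made for this example. It does follow the description above in visiting only non-zero entries in random order, using a learning rate of 0.05, and returning \(w+\tilde{w}\).

```python
import math
import random

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max) ** alpha below x_max, 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def train_glove(X, vocab_size, dim=2, epochs=50, lr=0.05, seed=0):
    """Fit w, w_tilde and the biases to log(X[i, j]) with AdaGrad steps."""
    rng = random.Random(seed)

    def small_random_vector():
        return [(rng.random() - 0.5) / dim for _ in range(dim)]

    w = [small_random_vector() for _ in range(vocab_size)]
    w_t = [small_random_vector() for _ in range(vocab_size)]
    b = [0.0] * vocab_size
    b_t = [0.0] * vocab_size
    # AdaGrad accumulators (running sums of squared gradients). Starting at
    # 1.0 is an assumption of this sketch; it avoids division by zero.
    acc_w = [[1.0] * dim for _ in range(vocab_size)]
    acc_wt = [[1.0] * dim for _ in range(vocab_size)]
    acc_b = [1.0] * vocab_size
    acc_bt = [1.0] * vocab_size

    pairs = [(i, j, x) for (i, j), x in X.items() if x > 0]  # non-zero entries only
    history = []
    for _ in range(epochs):
        rng.shuffle(pairs)  # visit the non-zero entries in random order
        epoch_loss = 0.0
        for i, j, x in pairs:
            prediction = sum(p * q for p, q in zip(w[i], w_t[j])) + b[i] + b_t[j]
            diff = prediction - math.log(x)
            f = weight(x)
            epoch_loss += f * diff * diff
            g = f * diff  # shared gradient factor; the constant 2 is folded into lr
            for k in range(dim):
                grad_w, grad_wt = g * w_t[j][k], g * w[i][k]
                acc_w[i][k] += grad_w ** 2
                acc_wt[j][k] += grad_wt ** 2
                w[i][k] -= lr * grad_w / math.sqrt(acc_w[i][k])
                w_t[j][k] -= lr * grad_wt / math.sqrt(acc_wt[j][k])
            acc_b[i] += g * g
            acc_bt[j] += g * g
            b[i] -= lr * g / math.sqrt(acc_b[i])
            b_t[j] -= lr * g / math.sqrt(acc_bt[j])
        history.append(epoch_loss)

    # Final embedding: the sum w + w_tilde.
    vectors = [[p + q for p, q in zip(w[i], w_t[i])] for i in range(vocab_size)]
    return vectors, history

# Hypothetical toy counts for a 3-word vocabulary, indexed 0..2.
toy_X = {(0, 1): 2.0, (1, 0): 2.0, (0, 2): 1.0, (2, 0): 1.0}
vectors, history = train_glove(toy_X, vocab_size=3)
print(len(vectors), "vectors; loss in first and last epoch:", history[0], history[-1])
```

The `history` list holds the weighted loss accumulated in each epoch, which gives a quick way to check that training is making progress.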

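## 3. Comparing GloVe with LSA and Word2Vec

LSA (Latent Semantic Analysis) is an early count-based word representation tool. It is also based on a co-occurrence matrix, but it reduces the dimensionality of the large matrix with a matrix factorization technique based on singular value decomposition (SVD). SVD is computationally expensive, so LSA is costly to compute. Another drawback is that it gives every word the same statistical weight. GloVe overcomes each of these shortcomings.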

The biggest drawback of word2vec, on the other hand, is that it does not make full use of the whole corpus. GloVe in effect combines the advantages of both. Judging by the experimental results in the paper, GloVe performs far better than LSA and word2vec, although some people online report that GloVe and word2vec perform about the same in practice.

## 4. Code implementation

**Generating word vectors**

Download the GitHub project: https://github.com/stanfordnlp/GloVe/archive/master.zip

After unzipping, enter the directory and run `make` to compile.

Then run `sh demo.sh` to train the model and generate the word vector files `vectors.txt` and `vectors.bin`.
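Once `vectors.txt` exists, it can be read back in Python. The sketch below assumes the plain-text layout of one word per line followed by its space-separated vector components. `load_vectors` is a hypothetical helper, not part of the GloVe project, and the demo writes its own tiny sample file so that it runs without a trained model.

```python
import os
import tempfile

def load_vectors(path):
    """Parse a GloVe text file into a dict mapping each word to a list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

# Demo on a small hand-written file with made-up numbers.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
    demo = load_vectors(demo_path)

print(sorted(demo), len(demo["the"]))
```

To load real output, call `load_vectors("vectors.txt")` from the directory where `demo.sh` was run.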


## 5. References

Author: @mantchs

GitHub: https://github.com/NLP-LOVE/ML-NLP

You are welcome to join the discussion and help improve the project together. Group number: 541954936