Word Vector: GloVe Model Explained

  This article introduces the GloVe model for constructing word vectors.

1 Introduction

  Before the GloVe model was proposed, there were two main types of models for learning word vectors:

  • Global matrix factorization methods such as Latent semantic analysis (LSA).
  • Local context window methods, such as the skip-gram model proposed by Mikolov et al.

  However, both methods have obvious drawbacks. While methods like LSA exploit statistics effectively, they perform relatively poorly on the word analogy task, suggesting a suboptimal vector space structure. Methods like skip-gram may do better on the analogy task, but they make little use of corpus statistics because they train on individual local context windows rather than global co-occurrence counts.

2 GloVe model

  GloVe stands for Global Vectors for Word Representation; it is a count-based word representation tool built on global corpus statistics.

  The implementation of GloVe is divided into three main steps: (1) constructing the co-occurrence matrix; (2) deriving the approximate relationship between the word vectors and the co-occurrence matrix; (3) constructing the loss function.

2.1 Construction of co-occurrence matrix

  Suppose we have a corpus containing the following three sentences:

i like deep learning

i like NLP

i enjoy flying

  This corpus involves 7 words: i, like, enjoy, deep, learning, NLP, flying.

  Assume we use a window of size 3 (one word to the left and one to the right of the center word). Taking the first sentence, "i like deep learning", as an example, the following windows are generated:

window label   center word   window content
0              i             i like
1              like          i like deep
2              deep          like deep learning
3              learning      deep learning

  Taking window 1 as an example, the central word is like and the context words are i and deep, so we update the corresponding elements of the co-occurrence matrix:

$$X_{like,\, i} \mathrel{+}= 1, \qquad X_{like,\, deep} \mathrel{+}= 1$$

  Using the above method, we traverse the entire corpus to obtain the co-occurrence matrix $X$:

          i   like  enjoy  deep  learning  NLP  flying
i         0   2     1      0     0         0    0
like      2   0     0      1     0         1    0
enjoy     1   0     0      0     0         0    1
deep      0   1     0      0     1         0    0
learning  0   0     0      1     0         0    0
NLP       0   1     0      0     0         0    0
flying    0   0     1      0     0         0    0

Here, the first column lists the central words and the first row lists the context words.
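
  As a concrete illustration, the following Python sketch (my own, not the official GloVe code; names such as window_radius are chosen for this example) builds the same co-occurrence counts from the toy corpus:

    from collections import defaultdict

    # Toy corpus from the example above.
    corpus = [
        "i like deep learning",
        "i like NLP",
        "i enjoy flying",
    ]

    # A window of size 3 means one word to the left and one to the right of the center word.
    window_radius = 1

    # X[center][context] counts how often `context` appears in a window around `center`.
    X = defaultdict(lambda: defaultdict(float))

    for sentence in corpus:
        tokens = sentence.split()
        for center_pos, center in enumerate(tokens):
            lo = max(0, center_pos - window_radius)
            hi = min(len(tokens), center_pos + window_radius + 1)
            for context_pos in range(lo, hi):
                if context_pos != center_pos:
                    X[center][tokens[context_pos]] += 1.0

    print(X["like"]["i"])     # 2.0
    print(X["deep"]["like"])  # 1.0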

2.2 Approximate relationship between word vector and co-occurrence matrix

  Before we start, let's define some variables:

  • $X_{ij}$ denotes the number of times word $j$ occurs in the context of word $i$.
  • $X_i = \sum_k X_{ik}$ denotes the number of times any word occurs in the context of word $i$.
  • $P_{ij} = P(j \mid i) = X_{ij} / X_i$ denotes the probability that word $j$ occurs in the context of word $i$ (see the short sketch after this list).
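
  Continuing the earlier sketch (again my own illustration, not library code), these quantities can be read off directly from the counts:

    # Row sum X_i = sum_k X_ik: how many context words were seen around word i.
    def row_sum(X, i):
        return sum(X[i].values())

    # Conditional probability P(j | i) = X_ij / X_i.
    def cooccur_prob(X, i, j):
        return X[i][j] / row_sum(X, i)

    print(cooccur_prob(X, "like", "i"))  # 2 / 4 = 0.5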

  Let's take a look at a table provided by the author of the paper:

Table 1. Co-occurrence probabilities of the target words ice and steam with selected context words from a corpus of 6 billion tokens. Only in the ratio does the noise from non-discriminative words cancel out, so large values (much greater than 1) correlate with properties specific to ice, and small values (much less than 1) correlate with properties specific to steam.

  Table 1 shows these probabilities and their ratios for a large corpus, with $i$ = ice and $j$ = steam. For a word $k$ related to ice but not to steam, such as $k$ = solid, the ratio $P_{ik}/P_{jk}$ is much greater than 1. Similarly, for a word $k$ related to steam but not to ice, such as $k$ = gas, the ratio is much smaller than 1. For a word $k$ such as water or fashion, which is related to both ice and steam or to neither, the ratio is close to 1. Compared with the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion), and it also better discriminates between the two relevant words.

  The above argument suggests that the appropriate starting point for learning word vectors is the ratio of co-occurrence probabilities rather than the probabilities themselves. The ratio $P_{ik}/P_{jk}$ depends on three words $i$, $j$, and $k$, so the most general model takes the form:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{1}$$

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors. In this equation, the right-hand side is extracted from the corpus, and $F$ may depend on some as-yet unspecified parameters. Because vector spaces are inherently linear structures, the most natural way to encode the difference between two probabilities is with vector differences, which gives:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{2}$$

  Next, we note that the arguments of $F$ in equation (2) are vectors while the right-hand side is a scalar, so we take the dot product of the arguments of $F$:

$$F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}} \tag{3}$$

  We know that $X$ is a symmetric matrix, and the roles of word and context word are in fact interchangeable; that is, if we perform the exchange $w \leftrightarrow \tilde{w}$, $X \leftrightarrow X^T$, equation (3) should remain unchanged. Clearly, the current formula does not satisfy this. However, the symmetry can be restored in two steps. First, we require $F$ to satisfy the homomorphism property, namely:

$$F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)} \tag{4}$$

Combining this with equation (3), we get:

$$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i} \tag{5}$$

Equation (4) is solved by $F = \exp$; substituting this into equation (5) gives:

$$w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i) \tag{6}$$

Next, we note that equation (6) would exhibit the exchange symmetry if it were not for the term $\log(X_i)$ on the right-hand side. However, this term is independent of $k$, so it can be absorbed into a bias $b_i$ for $w_i$. Finally, adding a bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry:

$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}) \tag{7}$$

  In this way, we obtain an approximate relationship between the word vectors and the co-occurrence matrix.
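
  As a quick check (not spelled out in the original text), applying the exchange $w \leftrightarrow \tilde{w}$, $b \leftrightarrow \tilde{b}$, $X \leftrightarrow X^T$ to equation (7) gives

$$\tilde{w}_i^T w_k + \tilde{b}_i + b_k = \log(X^T_{ik}) = \log(X_{ki})$$

which is just equation (7) with the indices $i$ and $k$ exchanged (recall that $X$ is symmetric here), so the set of model equations is indeed invariant under the relabeling.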

2.3 Construct loss function

  In the paper, the authors propose a new weighted least squares regression model to construct the loss function. Treating equation (7) as a least squares problem and introducing a weighting function $f(X_{ij})$ into the loss gives:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 \tag{8}$$

where $V$ is the size of the vocabulary.

  Rare co-occurrences are noisy and carry less information than frequent ones, so the weighting function prevents all co-occurrence events from being weighted equally. The weighting function should satisfy the following properties:

  • $f(0) = 0$. If $f$ is viewed as a continuous function, it should vanish fast enough as $x \rightarrow 0$ that $\lim_{x \rightarrow 0} f(x) \log^2 x$ is finite.

  • $f(x)$ should be non-decreasing so that rare co-occurrences are not overweighted.

  • For large values of $x$, $f(x)$ should be relatively small so that frequent co-occurrences are not overweighted.

  Many functions satisfy these properties, and the authors found that functions of the following form perform well:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha}, & \text{if } x < x_{max} \\ 1, & \text{otherwise} \end{cases} \tag{9}$$

Figure 1. The weighting function $f$ with $\alpha = 3/4$.

The authors fix $x_{max}$ at 100 in all experiments and find that $\alpha = 3/4$ gives a modest improvement over the linear version with $\alpha = 1$.
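
  Putting equations (8) and (9) together, here is a minimal NumPy sketch (my own, not the reference implementation; the function and argument names are illustrative):

    import numpy as np

    # Weighting function f(x) from equation (9), with x_max = 100 and alpha = 3/4
    # as in the paper's experiments.
    def weight(x, x_max=100.0, alpha=0.75):
        x = np.asarray(x, dtype=float)
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    # Weighted least squares loss J from equation (8).
    #   W, W_tilde : (V, d) word and context embedding matrices
    #   b, b_tilde : (V,)   bias vectors
    #   X          : (V, V) dense co-occurrence matrix (fine for a toy vocabulary)
    def glove_loss(W, W_tilde, b, b_tilde, X):
        i_idx, j_idx = np.nonzero(X)  # only non-zero counts contribute (f(0) = 0)
        x = X[i_idx, j_idx]
        pred = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
        return np.sum(weight(x) * (pred - np.log(x)) ** 2)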

3 Summary

  GloVe is a new global log-bilinear regression model for the unsupervised learning of word vectors. It combines the advantages of global matrix factorization and local context window methods. GloVe makes efficient use of statistics by training only on the non-zero elements of the word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus, and it produces a vector space with meaningful substructure.
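
  To make the "training only on the non-zero elements" point concrete, here is a simplified gradient step (plain SGD rather than the AdaGrad optimizer used in the paper; it reuses the weight function from the earlier sketch and is only an illustrative sketch):

    # One pass of plain SGD over the non-zero entries of the co-occurrence matrix,
    # following the gradients of equation (8).
    def glove_sgd_epoch(W, W_tilde, b, b_tilde, X, lr=0.05):
        for i, j in zip(*np.nonzero(X)):
            x = X[i, j]
            diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)
            g = 2.0 * weight(x) * diff            # common factor in all gradients
            grad_wi, grad_wj = g * W_tilde[j], g * W[i]
            W[i] -= lr * grad_wi
            W_tilde[j] -= lr * grad_wj
            b[i] -= lr * g
            b_tilde[j] -= lr * g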

3.1 The difference between GloVe and LSA

  LSA is an earlier statistics-based word representation tool. It is also built on a co-occurrence matrix, but it uses Singular Value Decomposition (SVD) to reduce the dimensionality of the large matrix, and because the complexity of SVD is high, its computational cost is large. In addition, LSA gives all words the same statistical weight. GloVe overcomes these shortcomings.

3.2 The difference between GloVe and word2vec

  GloVe and word2vec differ mainly in the following ways:

  • word2vec uses a neural network to predict the central word from its context (CBOW) or the context from the central word (skip-gram); GloVe is a log-bilinear regression model that essentially performs a dimensionality reduction of the co-occurrence matrix.
  • Both use sliding windows, but word2vec uses them directly for training, while GloVe uses them only to build the co-occurrence matrix.
  • word2vec uses only local information, while GloVe exploits both local and global information.
  • GloVe is faster to train than word2vec.

References

[1] Pennington, Socher, and Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.

[2] Stanford University's word vector tool: GloVe

[3] Easy to Understand: The Principle of the GloVe Algorithm

[4] GloVe Homepage
