Understanding word2vec

This article records my personal understanding of word2vec. Most of it is translated from Dr. Xin Rong's paper, and I also referred to many blogs on the Internet, all of which are listed in the References.
Note on notation: in the formulas below, non-bold lowercase letters denote scalars, bold lowercase letters denote vectors, and bold uppercase letters denote matrices.

1. Introduction to word2vec

Word2vec is a tool for computing word vectors (word embeddings) that Google open-sourced in 2013. It addresses the problem of learning distributed representations of words, and it is widely used in natural language processing because it learns semantic knowledge from large amounts of text in an unsupervised manner.

2. What is a word embedding

We know that text is unstructured data: Chinese characters, for example, are logographic symbols that cannot be computed on directly. We therefore need to convert this unstructured information into a structured form so that computations can be carried out on the text. Simply put, we need to convert text into numerical vectors.

There are many ways to represent text; three common ones are:

  • One-hot encoding (one-hot representation)
  • Integer encoding
  • Word embedding

The so-called one-hot encoding allocates a vector whose length equals the size of the vocabulary, with each position of the vector representing one word. For example, assuming the vocabulary contains only the four words I, Love, China, and Country, a possible encoding is:

I:       [1 0 0 0]
Love:    [0 1 0 0]
China:   [0 0 1 0]
Country: [0 0 0 1]
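
As a minimal sketch (my own illustration, not from the original article), the one-hot vectors above can be built in a few lines of Python; the toy vocabulary and word order are assumptions:

```python
import numpy as np

vocab = ["I", "Love", "China", "Country"]          # assumed toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a V-dimensional one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("China"))   # [0. 0. 1. 0.]
```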

In reality, however, tens of thousands of different words are likely to appear in the text, so the vectors become very long, with more than 99% of their entries equal to 0. This leads to the so-called curse of dimensionality.

Integer encoding is simpler: each word is represented by a single integer, such as 1 for I, 2 for Love, and so on.

The disadvantages of integer encoding are as follows:

  • It cannot express relationships between words
  • The integer values are arbitrary, which makes them hard for a model to interpret

Word embedding is another method of text representation. It expresses each word as a low-dimensional vector, far shorter than a one-hot vector, and words with similar semantics end up close to each other in the vector space. Word2vec, proposed by Mikolov at Google in 2013, is a statistical method for obtaining such word embeddings.

The word2vec algorithm has two training modes (two network structures), CBOW and Skip-Gram:

  1. CBOW (Continuous Bag-of-Words Model): use the context words to predict the current word
  2. Skip-Gram: use the current word to predict its context words
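
To make the two modes concrete, here is a small sketch (my own illustration, not from the paper) of how training pairs could be extracted from a sentence with a context window of size 2; the tokenized sentence is an assumption:

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2  # number of context words on each side

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    # context = up to `window` words on each side of the center word
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                   # CBOW: context -> center
    skipgram_pairs.extend((center, c) for c in context)    # Skip-Gram: center -> each context word

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```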

First of all, word2vec is essentially a neural network. Its input is the one-hot vector of a word, and its output approximates the one-hot vector of the word to be predicted (not the so-called word embedding); the real word embedding of each word is implicit in the parameters of the network.
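
As a rough sketch of this point (my own illustration; the sizes $V$ and $N$ and the word index are assumptions), the whole network is just two weight matrices, and the rows of the first one are the word embeddings:

```python
import numpy as np

V, N = 10000, 300            # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.01, size=(V, N))   # input -> hidden weights; row i is word i's embedding
W2 = rng.normal(scale=0.01, size=(N, V))   # hidden -> output weights (W' in the text)

x = np.zeros(V); x[42] = 1.0               # one-hot input for word 42
h = W.T @ x                                # hidden layer: simply row 42 of W
u = W2.T @ h                               # one score per word in the vocabulary
y = np.exp(u - u.max()); y /= y.sum()      # softmax -> predicted distribution over V words

embedding_of_word_42 = W[42]               # the "real" word embedding lives in W
```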

Below I write down my own understanding; most of it follows the paper word2vec Parameter Learning Explained by Dr. Xin Rong.

3. CBOW

CBOW uses the context to predict the current word: the input consists of multiple context words, and the output is the word to be predicted. For simplicity, we first consider the case where there is only one context word as input.

3.1 One-word context

At this time, the model inputs a word and predicts a word. The following figure shows the structure of the network:
[Figure: network structure of the one-word-context CBOW model]
In fact, the network structure is very simple. There is only one hidden layer, and the hidden layer has no activation function; the output layer uses a softmax, so the network outputs a probability distribution over the predicted words, and the word with the highest probability is the prediction. We assume the vocabulary has length $V$, i.e., there are $V$ words in total. The hidden layer has $N$ nodes, and the output layer has $V$ nodes, the same as the input layer, because a probability must be output for every word. The true target is the one-hot vector of the word to be predicted, so the whole network is trained to make its output as close as possible to that one-hot vector.

There is a fully connected network between the input layer and the hidden layer. Since there is no activation function, the input-output relationship is very simple:
$$\mathbf{h} = \mathbf{W}^{T}\mathbf{x} = \mathbf{v}_{I}^{T}$$
where $\mathbf{x}$ is a $V \times 1$ column vector, $\mathbf{W}$ is a $V \times N$ matrix, and therefore $\mathbf{h}$ is an $N \times 1$ column vector. Since $\mathbf{x}$ has only one element equal to $1$, say the $k$-th dimension, with all others equal to $0$, this operation essentially takes the $k$-th row of $\mathbf{W}$, transposes it, and copies it into $\mathbf{h}$; this row is $\mathbf{v}_{I}^{T}$, where the subscript $I$ stands for "input".
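
A tiny numpy sketch (my own, with assumed sizes) confirms that multiplying $\mathbf{W}^{T}$ by a one-hot vector simply selects a row of $\mathbf{W}$:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))   # input -> hidden weight matrix

k = 2                         # index of the non-zero entry of the one-hot input
x = np.zeros(V); x[k] = 1.0

h = W.T @ x                   # hidden layer output
assert np.allclose(h, W[k])   # identical to simply reading out row k of W
```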

Similarly, the hidden layer and the output layer are fully connected through an $N \times V$ matrix $\mathbf{W}^{\prime}$ (note that $\mathbf{W}^{\prime}$ has nothing to do with $\mathbf{W}$!). Let the output be $\mathbf{u}$; then its $j$-th element is:
$$u_{j} = \mathbf{v}_{j}^{\prime T}\mathbf{h}$$
where $\mathbf{v}_{j}^{\prime T}$ is the $j$-th row of the matrix $\mathbf{W}^{\prime T}$ (equivalently, $\mathbf{v}_{j}^{\prime}$ is the $j$-th column of $\mathbf{W}^{\prime}$). Finally, a softmax is applied to obtain the posterior probability of each word (so that all the probabilities sum to $1$):
$$p\left(w_{j} \mid w_{I}\right) = y_{j} = \frac{\exp\left(u_{j}\right)}{\sum_{j^{\prime}=1}^{V}\exp\left(u_{j^{\prime}}\right)}$$
$p\left(w_{j} \mid w_{I}\right)$ is the probability of predicting the $j$-th output word given the input word. Substituting $u_{j}$, we have:
$$p\left(w_{j} \mid w_{I}\right) = \frac{\exp\left(\mathbf{v}_{j}^{\prime T}\mathbf{v}_{I}\right)}{\sum_{j^{\prime}=1}^{V}\exp\left(\mathbf{v}_{j^{\prime}}^{\prime T}\mathbf{v}_{I}\right)}$$
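
A minimal self-contained sketch (my own; the sizes and the input word index $k$ are assumptions) of this forward pass up to the softmax posterior:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(1)
W  = rng.normal(size=(V, N))      # input -> hidden weights
W2 = rng.normal(size=(N, V))      # hidden -> output weights (W' in the text)

k = 2                             # index of the input word w_I
h = W[k]                          # hidden layer = v_I (row k of W)

u = W2.T @ h                      # u_j = v'_j . h, one score per vocabulary word
y = np.exp(u - u.max())           # subtract the max for numerical stability
y /= y.sum()                      # softmax: posterior p(w_j | w_I)

assert np.isclose(y.sum(), 1.0)   # the probabilities sum to 1
```
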
We assume that the true word to be predicted sits at position $j^{*}$, so the target output $\mathbf{t}$ is a one-hot vector with a $1$ at position $j^{*}$ and $0$ everywhere else. Our goal is then for the $j^{*}$-th element of the output $\mathbf{y}$ to be as large as possible (as close as possible to $t_{j^{*}}$) and the values at the other positions to be as small as possible:
$$\max p\left(w_{O} \mid w_{I}\right) = \max y_{j^{*}} = \max \log y_{j^{*}} = u_{j^{*}} - \log\sum_{j^{\prime}=1}^{V}\exp\left(u_{j^{\prime}}\right) := -E$$
That is, $E = -\log p\left(w_{O} \mid w_{I}\right)$ is the loss function, and maximizing the probability above is equivalent to minimizing $E$.
Therefore, the parameters can be updated by backpropagation as follows:
$$\frac{\partial E}{\partial w_{ij}^{\prime}} = \frac{\partial E}{\partial u_{j}}\cdot\frac{\partial u_{j}}{\partial w_{ij}^{\prime}} = \left(y_{j}-t_{j}\right)\cdot h_{i}$$
$$\mathbf{v}_{j}^{\prime(\text{new})} = \mathbf{v}_{j}^{\prime(\text{old})} - \eta\cdot\left(y_{j}-t_{j}\right)\cdot\mathbf{h} \quad \text{for } j = 1,2,\cdots,V$$

$y_{j}, t_{j}, h_{i}$ are the $j$-th, $j$-th, and $i$-th elements of the vectors $\mathbf{y}, \mathbf{t}, \mathbf{h}$ respectively, $w_{ij}^{\prime}$ is the element in the $i$-th row and $j$-th column of the matrix $\mathbf{W}^{\prime}$, and $\eta$ is the learning rate.
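
A minimal sketch (my own; the sizes, indices, and learning rate are assumptions) of this output-side update for a single training pair:

```python
import numpy as np

V, N, eta = 5, 3, 0.05
rng = np.random.default_rng(1)
W  = rng.normal(size=(V, N))          # input -> hidden weights
W2 = rng.normal(size=(N, V))          # hidden -> output weights (W' in the text)

k, j_star = 2, 4                      # input word index and true output word index (assumed)
h = W[k]                              # hidden layer = v_I

u = W2.T @ h
y = np.exp(u - u.max()); y /= y.sum() # predicted distribution
t = np.zeros(V); t[j_star] = 1.0      # one-hot target

e = y - t                             # prediction error (y_j - t_j) for every j
W2 -= eta * np.outer(h, e)            # v'_j <- v'_j - eta * (y_j - t_j) * h, for all j at once
```
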
Similarly, the process of updating $\mathbf{W}$ is as follows. We first take the derivative of $E$ with respect to $h_{i}$:
$$\frac{\partial E}{\partial h_{i}} = \sum_{j=1}^{V}\frac{\partial E}{\partial u_{j}}\cdot\frac{\partial u_{j}}{\partial h_{i}} = \sum_{j=1}^{V}\left(y_{j}-t_{j}\right)\cdot w_{ij}^{\prime} := \mathrm{EH}_{i}$$

$:=$ means "is denoted as".

The parameter definitions are the same as above. The next step is to compute the partial derivative of $E$ with respect to $\mathbf{W}$. Note that, because of how the input layer works ($\mathbf{W}$'s $k$-th row is transposed and copied into $\mathbf{h}$), we have $h_{i} = w_{ki}$, where $k$ is the index of the element of $\mathbf{x}$ that equals $1$ (all others are $0$). Therefore, when updating $\mathbf{W}$ we only need to update its $k$-th row; the gradients for the other rows are $0$ and those rows stay unchanged:
$$\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_{i}}\cdot\frac{\partial h_{i}}{\partial w_{ki}} = \mathrm{EH}_{i}\cdot x_{k} \quad \text{for } i = 1,2,\cdots,N$$
In vector form, the representation of the input word is updated as follows:
$$\mathbf{v}_{I}^{(\text{new})} = \mathbf{v}_{I}^{(\text{old})} - \eta\,\mathrm{EH}^{T}$$
That is, only the $k$-th row of $\mathbf{W}$ is updated; the other rows remain unchanged.
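
Putting the two updates together, here is a minimal end-to-end sketch of one training step of the one-word-context model (my own illustration; sizes, indices, and the learning rate are again assumptions):

```python
import numpy as np

V, N, eta = 5, 3, 0.05
rng = np.random.default_rng(1)
W  = rng.normal(size=(V, N))              # input -> hidden weights (rows are input vectors v_w)
W2 = rng.normal(size=(N, V))              # hidden -> output weights (columns are output vectors v'_w)

def train_step(k, j_star):
    """One SGD step: input word index k, true output word index j_star."""
    h = W[k].copy()                       # hidden layer = v_I
    u = W2.T @ h                          # scores u_j
    y = np.exp(u - u.max()); y /= y.sum() # softmax
    t = np.zeros(V); t[j_star] = 1.0

    e  = y - t                            # e_j = y_j - t_j
    EH = W2 @ e                           # EH_i = sum_j e_j * w'_ij

    W2[:] -= eta * np.outer(h, e)         # update every output vector v'_j
    W[k]  -= eta * EH                     # update only row k of W (the input vector v_I)

train_step(k=2, j_star=4)
```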

3.2 Multi-word context

Now we extend the model to the case where there are multiple context words as input. The figure below shows the multi-word-context CBOW model.
[Figure: network structure of the multi-word-context CBOW model]
When computing the output of the hidden layer, the CBOW model does not simply copy the input vector of a single context word; instead, it takes the average of the input vectors of the context words and uses the product of the input-to-hidden weight matrix and this average vector as the hidden layer output:
$$\mathbf{h} = \frac{1}{C}\mathbf{W}^{T}\left(\mathbf{x}_{1} + \mathbf{x}_{2} + \cdots + \mathbf{x}_{C}\right) = \frac{1}{C}\left(\mathbf{v}_{w_{1}} + \mathbf{v}_{w_{2}} + \cdots + \mathbf{v}_{w_{C}}\right)^{T}$$
where $C$ is the number of context words (in the one-word context, $C = 1$), $w_{1}, \cdots, w_{C}$ are the context words, and $\mathbf{v}_{w_{1}}, \cdots, \mathbf{v}_{w_{C}}$ are their input vectors, defined as above. The loss function is defined as follows:
$$\begin{aligned} E &= -\log p\left(w_{O} \mid w_{I,1}, \cdots, w_{I,C}\right) \\ &= -u_{j^{*}} + \log\sum_{j^{\prime}=1}^{V}\exp\left(u_{j^{\prime}}\right) \\ &= -\mathbf{v}_{O}^{\prime T}\mathbf{h} + \log\sum_{j^{\prime}=1}^{V}\exp\left(\mathbf{v}_{j^{\prime}}^{\prime T}\mathbf{h}\right) \end{aligned}$$
Since the part from the hidden layer to the output layer is unchanged, the update formula for $\mathbf{W}^{\prime}$ is the same as before:
$$\mathbf{v}_{j}^{\prime(\text{new})} = \mathbf{v}_{j}^{\prime(\text{old})} - \eta\cdot\left(y_{j}-t_{j}\right)\cdot\mathbf{h} \quad \text{for } j = 1,2,\cdots,V$$
The update of $\mathbf{W}$ is also similar, except that we now need to update the input vector of each context word $w_{I,c}$:
$$\mathbf{v}_{I,c}^{(\text{new})} = \mathbf{v}_{I,c}^{(\text{old})} - \frac{1}{C}\cdot\eta\cdot\mathrm{EH}^{T} \quad \text{for } c = 1,2,\cdots,C$$
Here, $\mathbf{v}_{I,c}$ is the input vector of the $c$-th word in the input context, i.e., the $k_{c}$-th row of the matrix $\mathbf{W}$, where $k_{c}$ is the non-zero dimension of the one-hot encoding of the $c$-th input word.

For example, suppose the first context word ($c=1$) is "I" and $V=4$, encoded as $[1\ 0\ 0\ 0]$; then $k_{1} = 0$. If the second context word ($c=2$) is "Country", encoded as $[0\ 0\ 0\ 1]$, then $k_{2} = 3$.
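
A minimal sketch of one multi-word CBOW training step (my own; the vocabulary size, context indices, center-word index, and learning rate are assumptions):

```python
import numpy as np

V, N, eta = 4, 3, 0.05
rng = np.random.default_rng(2)
W  = rng.normal(size=(V, N))               # input -> hidden weights
W2 = rng.normal(size=(N, V))               # hidden -> output weights (W')

context = [0, 3]                           # k_c for each context word, e.g. "I" (0) and "Country" (3)
j_star  = 2                                # assumed index of the true center word
C = len(context)

h = W[context].mean(axis=0)                # h = average of the context words' input vectors
u = W2.T @ h
y = np.exp(u - u.max()); y /= y.sum()
t = np.zeros(V); t[j_star] = 1.0

e  = y - t
EH = W2 @ e                                # EH_i = sum_j e_j * w'_ij

W2[:] -= eta * np.outer(h, e)              # update all output vectors v'_j
for kc in context:                         # update each context word's input vector
    W[kc] -= (eta / C) * EH
```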

4. Skip-Gram Model

The figure below shows the structure of the Skip-Gram Model:
[Figure: network structure of the Skip-Gram model]
The derivation of the parameter update equation of the Skip-Gram Model is not much different from that of the one-word-context model. The loss function becomes:
$$\begin{aligned} E &= -\log p\left(w_{O,1}, w_{O,2}, \cdots, w_{O,C} \mid w_{I}\right) \\ &= -\log\prod_{c=1}^{C}\frac{\exp\left(u_{c,j_{c}^{*}}\right)}{\sum_{j^{\prime}=1}^{V}\exp\left(u_{j^{\prime}}\right)} \\ &= -\sum_{c=1}^{C}u_{j_{c}^{*}} + C\cdot\log\sum_{j^{\prime}=1}^{V}\exp\left(u_{j^{\prime}}\right) \end{aligned}$$
Since the Skip-Gram model is given an input word and predicts the probabilities of multiple words in its context, the loss function becomes the negative logarithm of $p\left(w_{O,1}, w_{O,2}, \cdots, w_{O,C} \mid w_{I}\right)$, where $j_{c}^{*}$ is the non-zero dimension of the one-hot encoding of the $c$-th output context word (explained as above).
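
As a sketch (my own, with assumed sizes and indices), this Skip-Gram loss for one input word and its $C$ context words can be computed as:

```python
import numpy as np

V, N = 6, 3
rng = np.random.default_rng(3)
W  = rng.normal(size=(V, N))                 # input -> hidden weights
W2 = rng.normal(size=(N, V))                 # hidden -> output weights (W')

k = 1                                        # index of the input word w_I
context = [0, 2, 5]                          # j_c* for each of the C output context words
C = len(context)

h = W[k]                                     # hidden layer = v_I
u = W2.T @ h                                 # shared scores u_j (same for every output panel)
log_Z = np.log(np.exp(u - u.max()).sum()) + u.max()   # log sum_j' exp(u_j'), stabilized

E = -sum(u[j] for j in context) + C * log_Z  # E = -sum_c u_{j_c*} + C * log sum_j' exp(u_j')
print(E)
```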

Since there are multiple outputs, we take the partial derivative of the loss $E$ with respect to each dimension $u_{c,j}$ of each output vector:
$$\frac{\partial E}{\partial u_{c,j}} = y_{c,j} - t_{c,j} := e_{c,j}$$

$:=$ means that $y_{c,j} - t_{c,j}$ is denoted as $e_{c,j}$.

For simplicity, we further define:
$$\mathrm{EI}_{j} = \sum_{c=1}^{C}e_{c,j}$$
As before, $C$ is the number of output context words. The next step is to compute the partial derivatives of $E$ with respect to the elements of $\mathbf{W}^{\prime}$ and update them:
$$\frac{\partial E}{\partial w_{ij}^{\prime}} = \sum_{c=1}^{C}\frac{\partial E}{\partial u_{c,j}}\cdot\frac{\partial u_{c,j}}{\partial w_{ij}^{\prime}} = \mathrm{EI}_{j}\cdot h_{i}$$
The update is then as follows:
$$w_{ij}^{\prime(\text{new})} = w_{ij}^{\prime(\text{old})} - \eta\cdot\mathrm{EI}_{j}\cdot h_{i}$$
Written in vector form:
$$\mathbf{v}_{j}^{\prime(\text{new})} = \mathbf{v}_{j}^{\prime(\text{old})} - \eta\cdot\mathrm{EI}_{j}\cdot\mathbf{h} \quad \text{for } j = 1,2,\cdots,V$$
It can be seen that, apart from the loss function, this is almost identical to the one-word-context model.

Finally, we need the update equation of the weight matrix from the input layer to the hidden layer. Since this part of the structure is exactly the same as in the one-word-context model, we directly give the update equation:
$$\mathbf{v}_{I}^{(\text{new})} = \mathbf{v}_{I}^{(\text{old})} - \eta\cdot\mathrm{EH}^{T}$$
where $\mathrm{EH}$ is an $N$-dimensional vector whose elements are:
$$\mathrm{EH}_{i} = \sum_{j=1}^{V}\mathrm{EI}_{j}\cdot w_{ij}^{\prime}$$
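
A minimal sketch (my own; sizes, indices, and the learning rate are assumptions) of one full Skip-Gram training step, accumulating the errors over the $C$ output panels and then updating both $\mathbf{W}^{\prime}$ and $\mathbf{W}$:

```python
import numpy as np

V, N, eta = 6, 3, 0.05
rng = np.random.default_rng(3)
W  = rng.normal(size=(V, N))           # input -> hidden weights (rows are input vectors)
W2 = rng.normal(size=(N, V))           # hidden -> output weights (W', columns are output vectors)

k, context = 1, [0, 2, 5]              # input word index and the C true context word indices
h = W[k].copy()                        # hidden layer = v_I
u = W2.T @ h
y = np.exp(u - u.max()); y /= y.sum()  # the same distribution y is used for every output panel

EI = np.zeros(V)                       # EI_j = sum_c e_{c,j} = sum_c (y_{c,j} - t_{c,j})
for j_c in context:
    t = np.zeros(V); t[j_c] = 1.0
    EI += y - t

EH = W2 @ EI                           # EH_i = sum_j EI_j * w'_ij (computed with the old W')

W2   -= eta * np.outer(h, EI)          # v'_j <- v'_j - eta * EI_j * h, for all j
W[k] -= eta * EH                       # v_I  <- v_I  - eta * EH^T (only row k of W changes)
```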

5. Finally

Looking again at the update formula:
$$\mathbf{v}_{j}^{\prime(\text{new})} = \mathbf{v}_{j}^{\prime(\text{old})} - \eta\cdot\left(y_{j}-t_{j}\right)\cdot\mathbf{h} \quad \text{for } j = 1,2,\cdots,V$$
word2vec has to iterate over every word in the vocabulary during each update: for every training step we must compute $u_{j}$, $y_{j}$, and $e_{j}$ for all $V$ words before each $\mathbf{v}_{j}^{\prime}$ can be updated, so the amount of computation is enormous. The actual word2vec implementation therefore uses a number of tricks (such as hierarchical softmax and negative sampling) to improve efficiency; the details are not expanded here. I recommend the blogs listed in the References.

References

  1. https://easyai.tech/ai-definition/word-embedding/#wordembedding
  2. Xin Rong, word2vec Parameter Learning Explained
  3. https://mp.weixin.qq.com/s/7dsjfcOfm9uPheJrmB0Ghw
