How does word2vec get word vectors

Reprinted from a well-reasoned answer on Zhihu: https://www.zhihu.com/question/44832436


Author: crystalajj
Link: https://www.zhihu.com/question/44832436/answer/266068967
Source: Zhihu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.

How does word2vec get word vectors? This is a fairly broad question. To start from the beginning, you first need a text corpus, and the corpus has to be preprocessed. The preprocessing depends on the type of corpus and on your own goals: for an English corpus you may need case normalization, spell checking, and so on; for a Chinese or Japanese corpus you additionally need word segmentation. That pipeline has been covered well in other answers and is not repeated here. Once you have the processed corpus you want, the one-hot vectors of its words are used as the input of word2vec, which then trains low-dimensional word embeddings.

Word2vec really is a great tool. It currently offers two training models (CBOW and Skip-gram) and two acceleration techniques (negative sampling and hierarchical softmax). This answer aims to explain how word2vec converts the corpus's one-hot vectors (the model's input) into low-dimensional word vectors (an intermediate product of the model, more precisely the input weight matrix), so that you can really see how the vectors change. The acceleration techniques are not covered; if readers ask, I will add them when I have time.
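As a side note before the walkthrough: in practice the training described below is usually delegated to a library. Here is a minimal sketch of how one might train word2vec with gensim (assuming gensim >= 4.0); the tiny corpus and the parameter values are illustrative choices of mine, not part of the original answer.

```python
# Minimal sketch: training word2vec with gensim (assumed gensim >= 4.0).
# The tiny corpus and parameter values below are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["i", "drink", "coffee", "everyday"],
    ["i", "drink", "tea", "sometimes"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # N: dimensionality of the word vectors
    window=2,         # size of the context window
    min_count=1,      # keep every word, even those that appear only once
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    negative=5,       # negative sampling; use hs=1, negative=0 for hierarchical softmax
)

print(model.wv["coffee"])                       # the learned 100-dimensional vector
print(model.wv.most_similar("coffee", topn=3))  # nearest neighbours in the vector space
```

The rest of this answer looks underneath such a call and traces what happens to the one-hot vectors.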

1 A general picture of the two Word2Vec models

As mentioned above, Word2Vec includes two training models: the CBOW model and the Skip-gram model.

The CBOW model predicts the center word W(t) from the words surrounding it, while the Skip-gram model predicts the surrounding words from the center word W(t).

Leaving aside the respective advantages and disadvantages of the two models, their structures differ only in the input layer and the output layer. Please see:

[Figure: CBOW model structure]
[Figure: Skip-gram model structure]

These two structure diagrams are simplified; readers only need a rough sense of the difference between the two models. Next we analyze how the CBOW model is constructed and how the word vectors are produced. Once the CBOW model is understood, the Skip-gram model poses no problem.

2 Understanding the CBOW Model

Readers with a solid background in mathematics and English can refer directly to Stanford's Deep Learning for NLP lecture notes.

Of course, readers who would rather save themselves that trouble can simply follow along with me step by step.

Let's first look at this structure diagram and describe the process of the CBOW model in natural language:

[Figure: CBOW model structure diagram]

NOTE: the text in curly brackets {} is explanatory commentary.

  1. Input layer: the one-hot vectors of the context words. {Assume the vocabulary size, i.e. the dimension of the one-hot vectors, is V, and the number of context words is C.}
  2. Each one-hot vector is multiplied by the shared input weight matrix W. {W is a V*N matrix, where N is a dimension you choose yourself; W is randomly initialized.}
  3. The resulting vectors {each product is a vector, namely one row of W, because the input is one-hot} are summed and averaged to form the hidden-layer vector of size 1*N.
  4. Multiply the hidden-layer vector by the output weight matrix W'. {W' is an N*V matrix.}
  5. This yields a 1*V vector, which is passed through the softmax activation to obtain a V-dimensional probability distribution. {Because the representation is one-hot, each dimension corresponds to one word.} The word whose index has the highest probability is the predicted center word (target word).
  6. Compare the prediction with the one-hot vector of the true label; the smaller the error, the better.
Therefore a loss function (usually the cross-entropy cost function) is defined, and gradient descent is used to update W and W'. After training, the vector obtained by multiplying each word's one-hot vector by the matrix W is exactly the word embedding we want. This matrix (the word embeddings of all the words) is also called the lookup table (and, as the sharp-eyed reader has already noticed, the lookup table is simply the matrix W itself). In other words, multiplying the one-hot vector of any word by this matrix yields that word's own 1*N word vector. With the lookup table in hand, a word's vector can be read off directly, without any further training. A small numerical sketch of this forward pass is given below.
By this point the picture should basically be clear.
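To make steps 1-6 concrete, here is a toy numpy sketch of a single CBOW forward pass; the dimensions, the random initialization, and the chosen context indices are illustrative assumptions, not values from the answer above.

```python
# Toy sketch of one CBOW forward pass (steps 1-6 above); all values are illustrative.
import numpy as np

V, N, C = 4, 3, 3                  # vocabulary size, embedding dimension, context words
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input weight matrix  (V*N): the future lookup table
W_prime = rng.normal(size=(N, V))  # output weight matrix (N*V)

context_ids = [0, 1, 3]            # indices of the C context words in the vocabulary
one_hots = np.eye(V)[context_ids]  # C one-hot row vectors (C*V)

hidden = (one_hots @ W).mean(axis=0)            # each product picks a row of W; average -> 1*N

scores = hidden @ W_prime                       # 1*V scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: V-dimensional probability distribution
predicted = int(np.argmax(probs))               # index of the predicted center word
print(probs, predicted)
```

Training would then compare `probs` with the one-hot vector of the true center word and backpropagate through W' and W; that part is omitted here.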

Now I can answer the question in the title! If you still feel it is not clear enough, don't worry: come with me and walk through the CBOW model with a concrete example!

3 A worked example of the CBOW model

Suppose our corpus is this simple document with only four words:
{I drink coffee everyday}
We choose "coffee" as the center word and set the window size to 2.
That is, we use the words "I", "drink", and "everyday" to predict one word, and we want that word to be "coffee".
[Figures: the step-by-step matrix computations for this example, following the forward pass described in section 2]

Assume that training has run for the set number of iterations and the probability distribution has converged. The lookup table we have trained is then just the matrix W: multiplying the one-hot representation of any word by this matrix yields that word's word embedding.
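To see this "one-hot times W equals table lookup" point in code, here is a tiny illustration using the four-word vocabulary above; the numbers in W are made up and are not a trained result.

```python
# Tiny illustration: multiplying a one-hot vector by W just selects a row of W,
# i.e. it is a table lookup. The values in W are made up, not a trained result.
import numpy as np

vocab = ["I", "drink", "coffee", "everyday"]   # V = 4
W = np.arange(12, dtype=float).reshape(4, 3)   # pretend this is the trained V*N matrix (N = 3)

one_hot_coffee = np.eye(4)[vocab.index("coffee")]   # [0., 0., 1., 0.]
via_product = one_hot_coffee @ W                    # matrix multiplication
via_lookup = W[vocab.index("coffee")]               # direct table lookup

assert np.allclose(via_product, via_lookup)
print(via_lookup)   # both give the same 1*N word vector for "coffee"
```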

If you have any questions, feel free to ask.

