Word Vectors Study Notes

  • Word Vectors

Words are encoded as vectors, each representing a point in a word space; each dimension can be seen as encoding some semantic information.

  • one-hot vector:
The simplest word vector: each word is represented as a $|V| \times 1$ vector, where $|V|$ is the vocabulary size; the entry at the word's index in the vocabulary is 1 and all other entries are 0.
However, because vocabularies are huge, the dimension of one-hot vectors is very high; moreover, the dot product of any two different one-hot vectors is zero, so the vectors are mutually orthogonal (independent) and carry no similarity information between words.
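As a quick illustration (not from the original notes), here is a minimal Python/NumPy sketch of one-hot vectors over an assumed toy vocabulary, showing that distinct one-hot vectors have zero dot product:

```python
import numpy as np

# A minimal sketch with an assumed toy vocabulary.
vocab = ["cat", "dog", "jumped", "over", "puddle"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return the |V|x1 one-hot vector for a word."""
    v = np.zeros(vocab_size)
    v[word_to_idx[word]] = 1.0
    return v

# The dot product of any two distinct one-hot vectors is 0,
# so they carry no similarity information.
print(one_hot("cat") @ one_hot("dog"))   # 0.0
print(one_hot("cat") @ one_hot("cat"))   # 1.0
```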

  • SVD (singular value decomposition):
1. Build a word co-occurrence matrix X (a Word-Document Co-occurrence Matrix or a Word-Word Co-occurrence Matrix)
2. Decompose $X = USV^T$
3. The rows of U give the word embeddings
Word-Document Co-occurrence Matrix:

Entry (i, j) of the matrix indicates whether word i appears in document j: 1 if it appears, 0 otherwise. The dimension of this co-occurrence matrix is therefore $|V| \times$ num(docs), i.e., the total number of words times the total number of documents, which is undoubtedly a very large matrix.

Word-Word Co-occurrence Matrix:

Entry (i, j) counts the number of times words i and j appear together. A window size is fixed, and only word pairs that fall within the same window are counted as co-occurrences.
For example, with window size = 1, only immediately adjacent words count as co-occurring; a sketch of building such a matrix from a small three-sentence corpus follows the SVD step below.
Next, perform SVD on the co-occurrence matrix, $X = USV^T$, and reduce the dimensionality by keeping only the first k singular values (and the corresponding first k columns of U).
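A minimal sketch, assuming a three-sentence toy corpus and embedding dimension k = 2 (neither taken from the original notes), of building the window-1 word-word co-occurrence matrix and truncating its SVD:

```python
import numpy as np

# Window-1 word-word co-occurrence matrix + truncated SVD.
# The three-sentence corpus and k = 2 are assumed for illustration.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences of word pairs that fall within the window.
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# SVD: X = U S V^T; keep the first k columns of U as the word embeddings.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k]

for w in vocab:
    print(w, word_vectors[idx[w]])
```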

The resulting word vectors contain rich syntactic and semantic information, but this approach has several problems:

  1. New words are added frequently, so the dimensions of the matrix keep changing
  2. The matrix is very sparse
  3. The matrix dimension is very high
  4. SVD decomposition is computationally expensive
  5. Word frequencies are severely imbalanced

  • word2vec:
It is Google's open-source model for computing word vectors. It is based on the distributional hypothesis, an important linguistic assumption that similar words have similar contexts. It includes two shallow neural-network algorithms, continuous bag-of-words (CBOW) and Skip-gram, and two training methods, negative sampling and hierarchical softmax.


  • CBOW: Input the context words and predict the center (middle) word.

For example, for the sentence: The cat jumped over the puddle.


the input context is "The", "cat", "over", "the", "puddle"

the output center word is "jumped"

The known quantities of the neural network model are:

input: $x^{(c)}$, the one-hot vectors of the context words

output: $y$, the one-hot vector of the center word

The unknown parameters to be learned are two matrices, $\mathcal{V} \in \mathbb{R}^{n \times |V|}$ and $\mathcal{U} \in \mathbb{R}^{|V| \times n}$

Here n is the chosen size of the embedding space, V is the vocabulary, and |V| is the total number of words

The i-th column of $\mathcal{V}$ is the embedded (input) vector of word $w_i$, denoted $v_i$

The j-th row of $\mathcal{U}$ is the embedded (output) vector of word $w_j$, denoted $u_j$

So the model actually learns two vectors, u and v, for each word

The calculation process of the model is as follows:

1. Generate the one-hot vectors of the context of size m: $x^{(c-m)}, \dots, x^{(c-1)}, x^{(c+1)}, \dots, x^{(c+m)} \in \mathbb{R}^{|V|}$

2. Multiply the matrix $\mathcal{V}$ by the one-hot vector of each word in the context to get the embedded word vectors: $v_{c-m} = \mathcal{V}x^{(c-m)}, \dots, v_{c+m} = \mathcal{V}x^{(c+m)}$

3. Average the embedded word vectors, which can be regarded as averaging the features of the context words: $\hat{v} = \frac{v_{c-m} + \dots + v_{c-1} + v_{c+1} + \dots + v_{c+m}}{2m}$

4. Compute the score vector $z = \mathcal{U}\hat{v} \in \mathbb{R}^{|V|}$. Since the dot product of two vectors is larger when the vectors are more similar, a larger value in the score vector z means the corresponding word is more similar to the feature average of the context vectors and is more likely to be the center word.

5. Convert the scores into probabilities: $\hat{y} = \operatorname{softmax}(z)$

6. Learn by comparing the computed probability vector $\hat{y}$ with the actual one-hot vector $y$

The objective function is the cross-entropy $H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log \hat{y}_j$; since $y$ is one-hot, minimizing it amounts to minimizing

$J = -\log P(w_c \mid w_{c-m}, \dots, w_{c+m}) = -u_c^\top \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^\top \hat{v})$

Iteratively update $\mathcal{U}$ and $\mathcal{V}$ with stochastic gradient descent; this yields the word vectors of all words in the vocabulary.
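A minimal Python/NumPy sketch of one CBOW forward pass and its cross-entropy loss; the toy vocabulary, embedding size n, and randomly initialized (untrained) matrices are assumptions for illustration:

```python
import numpy as np

# One CBOW forward pass and its loss on an assumed toy example.
rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle"]
idx = {w: i for i, w in enumerate(vocab)}
n = 10

V = rng.normal(scale=0.1, size=(n, len(vocab)))   # input matrix, columns are v_i
U = rng.normal(scale=0.1, size=(len(vocab), n))   # output matrix, rows are u_j

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context = ["the", "cat", "over", "the", "puddle"]
center = "jumped"

# Steps 1-3: look up the context embeddings and average them.
v_hat = np.mean([V[:, idx[w]] for w in context], axis=0)

# Steps 4-5: score vector and softmax probabilities over the vocabulary.
y_hat = softmax(U @ v_hat)

# Step 6: cross-entropy loss against the one-hot center word.
loss = -np.log(y_hat[idx[center]])
print(loss)
```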

CBOW model network structure diagram

  • Skip-gram: Input the center (middle) word and predict the probability distribution over the context words.

input: $x$, the one-hot vector of the center word

output: $y^{(j)}$, the one-hot vectors of the context words

The two matrices $\mathcal{V}$ and $\mathcal{U}$ are the same as in CBOW

The calculation process of the model is as follows:

1. Generate the one-hot vector of the center word: $x \in \mathbb{R}^{|V|}$

2. Generate the embedded vector: $v_c = \mathcal{V}x$

3. Compute the score vector: $z = \mathcal{U}v_c$

4. Convert the scores into probabilities: $\hat{y} = \operatorname{softmax}(z)$; the entries $\hat{y}_{c-m}, \dots, \hat{y}_{c-1}, \hat{y}_{c+1}, \dots, \hat{y}_{c+m}$ are compared with the actual one-hot vectors of the 2m context words

5. A Naive Bayes assumption is needed when forming the objective function: given the center word, the context words are assumed to be conditionally independent of each other, so

$J = -\log P(w_{c-m}, \dots, w_{c+m} \mid w_c) = -\sum_{j=0, j \neq m}^{2m} u_{c-m+j}^\top v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^\top v_c)$
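A corresponding minimal sketch of one Skip-gram forward pass and its loss, again with an assumed toy vocabulary and random, untrained matrices:

```python
import numpy as np

# One Skip-gram forward pass and its loss on an assumed toy example.
rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle"]
idx = {w: i for i, w in enumerate(vocab)}
n = 10
V = rng.normal(scale=0.1, size=(n, len(vocab)))   # input matrix, columns are v_i
U = rng.normal(scale=0.1, size=(len(vocab), n))   # output matrix, rows are u_j

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center = "jumped"
context = ["the", "cat", "over", "the", "puddle"]

v_c = V[:, idx[center]]       # steps 1-2: embedded vector of the center word
y_hat = softmax(U @ v_c)      # steps 3-4: probabilities over the vocabulary

# Step 5: with the independence assumption, the loss is the sum of
# cross-entropies over the 2m context words.
loss = -sum(np.log(y_hat[idx[w]]) for w in context)
print(loss)
```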

Skip-gram model network structure diagram

  • Negative Sampling

When computing the loss function J, normalizing the softmax requires summing over all |V| scores, which is very expensive, so the computation needs to be simplified with an approximate estimate.

Negative Sampling avoids traversing the entire vocabulary by instead sampling several negative examples.

Negative samples are drawn from the unigram distribution raised to the 3/4 power, $P_n(w) \propto U(w)^{3/4}$.

$P(D=1 \mid w, c)$ denotes the probability for a (word, context) pair.

$D=1$ means the pair (w, c) comes from the corpus data; conversely, $P(D=0 \mid w, c) = 1 - P(D=1 \mid w, c)$ is the probability that it does not.

The softmax is replaced with the sigmoid function: $P(D=1 \mid w, c; \theta) = \sigma(u_w^\top v_c) = \frac{1}{1 + e^{-u_w^\top v_c}}$

The new objective function is

$\theta = \arg\max_\theta \sum_{(w,c) \in D} \log \sigma(u_w^\top v_c) + \sum_{(w,c) \in \tilde{D}} \log \sigma(-u_w^\top v_c)$

Here $D$ refers to the set of (word, context) pairs from the corpus, and $\tilde{D}$ refers to the non-corpus (negative) pairs.

Maximizing the likelihood above is equivalent to minimizing the following loss, where $\tilde{u}_k$ are the output vectors of the K sampled negative words:

$J = -\log \sigma(u_{c-m+j}^\top v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^\top v_c)$
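A minimal sketch of the skip-gram negative-sampling loss, with an assumed toy vocabulary, assumed unigram counts, and random vectors; it also shows drawing negatives from the 3/4-power unigram distribution:

```python
import numpy as np

# Skip-gram negative-sampling loss on an assumed toy example.
rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle"]
idx = {w: i for i, w in enumerate(vocab)}
counts = np.array([4.0, 1.0, 1.0, 1.0, 1.0])   # assumed unigram counts

# Negative-sampling distribution: unigram counts raised to the 3/4 power.
p_neg = counts ** 0.75
p_neg /= p_neg.sum()

n = 10
V = rng.normal(scale=0.1, size=(n, len(vocab)))   # input vectors v (columns)
U = rng.normal(scale=0.1, size=(len(vocab), n))   # output vectors u (rows)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, observed, K = "jumped", "cat", 3
v_c = V[:, idx[center]]
u_o = U[idx[observed]]

# Draw K negative samples from the 3/4-power unigram distribution.
neg_ids = rng.choice(len(vocab), size=K, p=p_neg)

# J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -np.log(sigmoid(u_o @ v_c)) - sum(
    np.log(sigmoid(-U[k] @ v_c)) for k in neg_ids
)
print(loss)
```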

  • Hierarchical Softmax 

More efficient than the ordinary softmax: a Huffman tree is used to represent the words in the vocabulary, so more frequent words get shorter codes. The main advantage is that the computational cost drops to $O(\log |V|)$.

Each leaf node represents a word

$L(w)$ represents the number of nodes on the path from the root to the leaf $w$

$n(w, i)$ is the i-th node on the path from the root to $w$, with associated word vector $v_{n(w,i)}$

$ch(n)$ represents the (left) child of node $n$

The objective becomes maximizing the probability of the path to $w$ = the correct word:

$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\big([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^\top v_{w_i}\big)$

where $[x]$ equals 1 if $x$ is true and $-1$ otherwise, and $v_{w_i}$ is the input vector of the center word $w_i$.
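A minimal sketch of the hierarchical-softmax path probability over a hand-built binary tree (a stand-in for the Huffman tree), with an assumed 4-word vocabulary and random node vectors; it checks that the leaf probabilities sum to 1:

```python
import numpy as np

# Hierarchical-softmax path probability on an assumed toy tree.
rng = np.random.default_rng(0)
n = 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inner nodes of the tree, each with its own vector v_n.
node_vecs = {"root": rng.normal(size=n), "n1": rng.normal(size=n),
             "n2": rng.normal(size=n)}

# Path from the root to each leaf word: (inner node, +1 if we take the
# agreed "left" child, -1 otherwise).
paths = {
    "the":    [("root", +1), ("n1", +1)],
    "cat":    [("root", +1), ("n1", -1)],
    "jumped": [("root", -1), ("n2", +1)],
    "puddle": [("root", -1), ("n2", -1)],
}

def path_probability(word, v_center):
    """P(word | center) = product of sigmoids along the root-to-leaf path."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * node_vecs[node] @ v_center)
    return p

v_center = rng.normal(size=n)   # assumed input vector of the center word
print(path_probability("jumped", v_center))
# The leaf probabilities sum to 1, and each uses only O(log |V|) sigmoids.
print(sum(path_probability(w, v_center) for w in paths))
```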



