Talking about word2vec

The original NNLM is very time-consuming when training word vectors, especially on a large-scale corpus. The author had already sketched possible optimizations in his thesis, so the concern of word2vec is how to train word vectors more efficiently on a large-scale corpus.
Computational complexity per training example of the NNLM:
Q = N * D + N * D * H + H * V
where V is the size of the dictionary (each word is encoded as 1-of-V), N is the number of preceding words fed in as context for the current word, D is the dimension of the word vectors, and H is the number of neurons in the hidden layer.
The dominant part of Q is the last term H * V, but it can be reduced by optimization tricks (hierarchical softmax / negative sampling), after which the complexity is dominated by N * D * H. word2vec therefore removes the hidden layer of the neural network entirely to improve computational efficiency.
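As a rough back-of-the-envelope illustration (the hyperparameter values below are assumptions picked for the example, not numbers taken from the paper), the snippet compares the per-example cost of the full NNLM, of an NNLM whose output softmax is replaced by a binary tree, and of CBOW with the hidden layer removed:

```python
import math

# Illustrative hyperparameters (assumed for this example, not taken from the paper)
N, D, H, V = 10, 500, 500, 1_000_000

q_nnlm = N * D + N * D * H + H * V                 # full NNLM: projection + hidden + output
q_nnlm_tree = N * D + N * D * H + H * math.log2(V) # output softmax replaced by a tree
q_cbow = N * D + D * math.log2(V)                  # hidden layer removed as well (CBOW)

print(f"NNLM:        {q_nnlm:,.0f}")       # ~502,505,000 -> dominated by H*V
print(f"NNLM + tree: {q_nnlm_tree:,.0f}")  # ~2,514,966   -> now dominated by N*D*H
print(f"CBOW:        {q_cbow:,.0f}")       # ~14,966
```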
In previous work the author found that a neural network language model can be trained successfully in two steps: 1. first train word vectors with a simple model; 2. then train an N-gram NNLM on top of them. He also found that adding words after the current word (i.e., richer context information) gives better results. The paper proposes two model architectures:


5012681-c86ae6732604bc37.png
CBOW vs Skip-gram

1. CBOW: similar to the NNLM, but with the hidden layer removed, and it uses the context around the current word: the vectors of the words inside the context window are projected (summed/averaged) into a single vector, which is then used to predict the current word. The order of the words inside the window no longer affects the projection. The computational complexity becomes: Q = N * D + D * log2(V)
2. Skip-gram: similar to CBOW, but it uses the current word to predict the words inside its context. To improve efficiency, practical implementations sample the context positions. The computational complexity in this case is: Q = C * (D + D * log2(V)). A minimal sketch of both architectures follows below.
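The sketch below shows the structural difference between the two architectures, using a full softmax for clarity; the names `W_in`, `W_out` and the hyperparameter values are my own placeholders, not the paper's notation:

```python
import numpy as np

V, D = 10_000, 100                     # vocabulary size, vector dimension (assumed)
W_in = np.random.randn(V, D) * 0.01    # input (projection) vectors
W_out = np.random.randn(V, D) * 0.01   # output vectors

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_probs(context_ids):
    """CBOW: average the context vectors, then predict the center word."""
    h = W_in[context_ids].mean(axis=0)   # word order inside the window is lost here
    return softmax(W_out @ h)            # distribution over the center word

def skipgram_probs(center_id):
    """Skip-gram: use the center word to predict each context word."""
    h = W_in[center_id]
    return softmax(W_out @ h)            # the same distribution is reused for every context position
```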

Optimization
As mentioned before, the main computational cost of the original NNLM is in the output layer, i.e. the H * V term. The main idea of the optimizations is to avoid computing the full softmax over all V words. Two schemes are implemented: hierarchical softmax and negative sampling.
1. Hierarchical softmax
A Huffman tree is built from the word frequencies and replaces the output layer. Each inner node of the tree performs a binary classification on the output of the previous layer, each leaf node corresponds to a word, and the probability of a word is the product of the binary decisions along its path.

5012681-92d7705402dded20.png

Features: high-frequency words sit close to the root, which further reduces the required computation. For low-frequency words, however, the corresponding leaves are far from the root, the paths are long, and the amount of computation is still large, so the efficiency gain is smaller. A sketch of the path computation follows below.
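The sketch shows how the word probability is assembled along the Huffman path; the `word_path` / `word_code` layout is an assumed data representation, not the reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_prob(h, word_path, word_code, node_vecs):
    """P(word | h) = product of binary decisions along the word's Huffman path.

    h         : hidden/projection vector (e.g. the averaged context vectors in CBOW)
    word_path : indices of the inner nodes from the root to the word's leaf
    word_code : the word's Huffman code, one 0/1 decision per inner node
    node_vecs : parameter vector for every inner node of the tree
    """
    prob = 1.0
    for node, bit in zip(word_path, word_code):
        p_right = sigmoid(np.dot(node_vecs[node], h))
        prob *= p_right if bit == 1 else (1.0 - p_right)
    return prob   # only len(word_path) ~ log2(V) sigmoids instead of a V-way softmax
```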

2. Negative sampling
The output layer avoids a decision over the whole dictionary: instead, a small set of negative samples (relative to the current word) is drawn from a noise distribution built from prior knowledge. The candidate set is formed from the word frequencies, normalized after raising them to a certain power, and a few negative samples (n << V) are drawn at random from it. The final V-way softmax layer is thereby turned into a handful of sigmoid binary classifications, which improves both the quality and the efficiency of computing the word vectors.
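A sketch of the candidate (noise) distribution and the resulting binary objective; the 3/4 power on the unigram counts is the value used in the released word2vec code, everything else is a simplified assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_noise_dist(word_counts, power=0.75):
    """Candidate (noise) distribution: unigram frequency raised to the 3/4 power."""
    p = np.asarray(word_counts, dtype=np.float64) ** power
    return p / p.sum()

def ns_loss(h, target_id, noise_dist, W_out, k=5, rng=np.random):
    """One positive (the real word) vs. k sampled negatives, each a sigmoid binary task."""
    negatives = rng.choice(len(noise_dist), size=k, p=noise_dist)
    loss = -np.log(sigmoid(np.dot(W_out[target_id], h)))    # push the true word up
    for n in negatives:
        loss += -np.log(sigmoid(-np.dot(W_out[n], h)))       # push sampled words down
    return loss   # k + 1 dot products instead of V
```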

Training schemes
1. CBOW with hierarchical softmax

5012681-85731dc626c6bc05.png

where

5012681-ad7aeb7a7537d61d.png
5012681-758de7ce01665132.png
5012681-0778a9f4bb9294cd.png

Corresponding pseudo-code:

5012681-12690d33f1f6a309.png
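Since the pseudo-code above is only available as an image, here is a hedged sketch of one CBOW + hierarchical-softmax SGD step in the same spirit; it follows the standard derivation and is not claimed to match the figure's notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_hs_step(context_ids, path, code, W_in, node_vecs, lr=0.025):
    """One SGD step: average the context vectors, then walk the target word's Huffman path."""
    h = W_in[context_ids].mean(axis=0)
    grad_h = np.zeros_like(h)
    for node, bit in zip(path, code):
        p = sigmoid(np.dot(node_vecs[node], h))
        g = lr * (bit - p)                    # gradient of the binary log-likelihood at this node
        grad_h += g * node_vecs[node]
        node_vecs[node] += g * h              # update the inner-node vector
    W_in[context_ids] += grad_h / len(context_ids)   # propagate back to the context words
    return W_in, node_vecs
```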

2. Skip-gram with hierarchical softmax


5012681-be23c4dab0c50f83.png

Corresponding pseudo-code:


5012681-12fbcbc0a743789d.png

3. CBOW with negative sampling

5012681-9fb0e1c766d0618e.png

5012681-46f08884f73d1b37.png

Corresponding pseudo-code:

5012681-dc38ca2c854c479f.png

4. Skip-gram with negative sampling


5012681-b6eae03bbf2ffe5a.png

The expression for G is the same as in the CBOW-based scheme above, except that the outermost layer has an additional summation over the context words; the rest of the procedure is the same as for CBOW. A sketch of that objective follows below.
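To make that outer summation concrete, here is a sketch of the skip-gram + negative-sampling objective for one center word; the helper names and sampling details are my own simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_ns_loss(center_id, context_ids, noise_dist, W_in, W_out, k=5, rng=np.random):
    """Sum over the context words; each term is the same 1-positive / k-negatives
    binary objective used in the CBOW case, with the center word's vector as input."""
    h = W_in[center_id]
    loss = 0.0
    for c in context_ids:                                   # the outermost summation
        loss += -np.log(sigmoid(np.dot(W_out[c], h)))       # positive sample
        for n in rng.choice(len(noise_dist), size=k, p=noise_dist):
            loss += -np.log(sigmoid(-np.dot(W_out[n], h)))  # negative samples
    return loss
```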

Thinking:
1. Training word vectors is a fake (pretext) task: our goal is not the final language model but the vectors produced along the way. Wouldn't it be better to train them with a real task?
2. Since it is a fake task, how do we evaluate the task itself, and how do we evaluate the resulting word vectors? The paper uses word-similarity properties and linear translation (analogies); is there a better way?
3. What exactly does the "similarity" of word vectors mean?

A:
1. A real task will generally train better word vectors, but word vectors are usually built before the downstream task exists, so in practice word vectors are pre-trained first and then used as the initial parameters of the downstream task's embedding layer.
2. Besides similar words and linear translation (analogies), quality can also be assessed through the performance of a downstream task.
3. The "similarity" of word vectors is different in nature from the usual notion of synonyms or near-synonyms; it is closer to "paradigmatic" similarity, i.e. words that appear in similar contexts. From another angle, we can write the model in a functional form:

5012681-ca73ef6c3fa50c9d.png

where v denotes the word vector; the two words on the left and right correspond to two different vector spaces. Because the model is symmetric, either of a word's two vectors can be used in practice.

where
5012681-f1a01ba86eed8884.png
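For readability, the probability being discussed here is presumably the standard softmax form (the figure may use slightly different notation), where v denotes the input-side vectors and u the output-side vectors:

```latex
P(w_k \mid w_i) \;=\; \frac{\exp\!\left(u_{w_k}^{\top} v_{w_i}\right)}{\sum_{w=1}^{V} \exp\!\left(u_{w}^{\top} v_{w_i}\right)}
```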

The denominator is just the normalization term; ignoring it, maximizing P(wk | wi) ultimately means making the inner product of Vwk and Vwi larger. In other words, the inner product (direction) of word vectors in the model directly represents the distance between word vectors (semantic distance), so the cosine between word vectors can be used to find semantically close words. Furthermore, the left and right words belong to two different vector spaces, so minimizing the semantic distance between two words is turned into minimizing the distance between two words in two different spaces rather than in the same vector space. Why is this scheme feasible? I think mainly because the model is symmetric: although there are two different vector spaces, they can be regarded as differing only by rotation and scaling, so the relative positions of the words in vector space (the angles between word vectors) do not change.
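A small sketch of the corresponding cosine-similarity lookup (the `vocab` word-to-index dictionary and the `vectors` matrix are assumed inputs):

```python
import numpy as np

def most_similar(query, vocab, vectors, topn=5):
    """Rank words by cosine similarity to the query word's vector."""
    norms = np.linalg.norm(vectors, axis=1)
    q = vectors[vocab[query]]
    sims = vectors @ q / (norms * np.linalg.norm(q) + 1e-12)
    ranked = np.argsort(-sims)
    id2word = {i: w for w, i in vocab.items()}
    return [(id2word[i], float(sims[i])) for i in ranked if id2word[i] != query][:topn]
```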

Advantages:
1. There is no hidden neural network layer, so there is no time-consuming matrix multiplication; only a softmax layer remains, which makes computation efficient.
2. Optimization uses stochastic gradient descent, so rare words do not dominate the optimization objective.

References:
https://arxiv.org/pdf/1301.3781.pdf
https://arxiv.org/abs/1310.4546


Origin: blog.csdn.net/weixin_34336292/article/details/90993597