NLP (6): word embedding

Reference:

https://www.zhihu.com/question/45027109/answer/129387065

 

1. What does word embedding do?

Compared with representing a word by one-hot encoding, we would rather use a distributed representation. The sparsity of one-hot encoding makes it hard to capture similarity between words, including part of speech, semantics, and so on; for example, 'car' and 'bus' are more or less similar, but one-hot encoding has no way to measure that. After embedding, a sparse vector with up to hundreds of thousands of dimensions is mapped to a dense vector with a few hundred dimensions, and each dimension of the dense vector can be regarded as carrying some meaning.
After we get a text and one-hot encode its words, we can feed it into an Embedding layer, whose main job is to learn a distributed representation of the words and to reduce the dimensionality of the extremely sparse one-hot encodings.
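
As a concrete illustration (a minimal sketch, not from the original post; the vocabulary size, embedding dimension, and word indices are made up), an embedding layer in PyTorch is just a lookup that turns integer word indices into dense vectors:

```python
import torch
import torch.nn as nn

vocab_size = 100000  # size of the sparse one-hot space (made-up number)
embed_dim = 300      # size of the dense word vector (made-up number)

# The layer holds a vocab_size x embed_dim weight matrix;
# looking up index i returns row i, i.e. the dense vector of word i.
embedding = nn.Embedding(vocab_size, embed_dim)

word_ids = torch.tensor([3, 17, 42])  # three words, as integer indices
dense = embedding(word_ids)
print(dense.shape)  # torch.Size([3, 300])
```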

2. How to train the distributed representation of words

If there is one principle to follow, it is this: a word only makes sense in its context. This gives rise to two methods for training the distributed representation of words: the CBOW model and the Skip-gram model. Note that what the output layer of these two models produces is not the 'dense' word vector we want; somewhat deliberately, the real goal lies elsewhere (the drinker's intent is not in the wine, so to speak).

CBOW: predict the probability of the current word from its context, where every word in the context has the same weight in influencing that probability. The training input of the CBOW model is the word vectors of the context words around a given feature word, and the output layer has as many neurons as the vocabulary size.

Skip-gram: predict the probability of the context words from a given feature word. The training input is the word vector of that feature word, and the output layer again has as many neurons as the vocabulary size.
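
To make the difference between the two models concrete, here is a small sketch (not from the original post; the sentence and window size are made up) of how the training pairs might be generated:

```python
sentence = "i want to eat an apple".split()
window = 2  # number of context words taken on each side of the target

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    # CBOW: the whole context predicts the target word
    cbow_pairs.append((context, target))
    # Skip-gram: the target word predicts each context word separately
    skipgram_pairs.extend((target, c) for c in context)

print(cbow_pairs[3])       # (['want', 'to', 'an', 'apple'], 'eat')
print(skipgram_pairs[:3])  # [('i', 'want'), ('i', 'to'), ('want', 'i')]
```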

It looks as if what we get out is the probability of a word, not the word vector we want. Take a sentence like 'i want to eat apple'. With the CBOW model, we dig out the feature word 'eat' and feed the remaining words into the network; the goal is that, among the vocabulary-sized set of nodes in the output layer, the node corresponding to 'eat' gets the highest probability. If that task is achieved, the model at this point has learned to handle the words well.
In a network of the form input layer - hidden layer - output layer, the input is obviously the one-hot vectors. They are multiplied by a weight matrix W, the result is passed to the hidden layer, then through another matrix to the output layer, and a softmax gives the final probabilities. Since this model can predict the dug-out feature word 'eat', it is able to handle semantics, which means the trained weight matrix W is what matters: the result of multiplying the matrix of one-hot vectors by W can be used as dense word vectors and taken to other tasks.
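
Here is a minimal numpy sketch of that forward pass (all sizes, indices, and the random initialization are made up for illustration; a real model would learn W and W_out by backpropagation):

```python
import numpy as np

vocab_size, hidden_size = 10, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden_size))      # input -> hidden weights: the matrix we really want
W_out = rng.normal(size=(hidden_size, vocab_size))  # hidden -> output weights

# One-hot vectors of the context words ("i", "want", "to", "apple"); the target is "eat".
context_ids = [0, 1, 2, 4]
one_hot = np.zeros((len(context_ids), vocab_size))
one_hot[np.arange(len(context_ids)), context_ids] = 1

hidden = one_hot.mean(axis=0) @ W              # average the context, project to the hidden layer
scores = hidden @ W_out                        # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the whole vocabulary

print(probs.argmax())  # after training, this index should be the one for "eat"
```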
In other words, our goal is seemingly to predict the feature word, or to predict the context words, but once that goal is reached, what we actually want is nothing but that weight matrix W.
Since a word starts out as a sparse one-hot vector, multiplying it by W simply picks out one row (or one column) of W. If W has size n × (number of hidden units), then the dense vector obtained from the sparse input vector has the length of the hidden layer, not the original vocabulary length; this is the dimensionality reduction.
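
A short sketch of that reduction (again with made-up sizes): multiplying a one-hot vector by W is nothing more than selecting one row of W, so the result has the hidden-layer length rather than the vocabulary length.

```python
import numpy as np

vocab_size, hidden_size = 10, 5
W = np.random.default_rng(0).normal(size=(vocab_size, hidden_size))

word_id = 3                  # made-up index of "eat" in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1

dense = one_hot @ W          # shape (hidden_size,): exactly row 3 of W
assert np.allclose(dense, W[word_id])
print(dense.shape)           # (5,) -- reduced from vocab_size to hidden_size
```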
 
 


