word2vec improvement: Hierarchical Softmax

Hierarchical Softmax is an improvement on the original word2vec, which requires a huge amount of computation. The method has two main points of improvement:

1. For the mapping from the input layer to the hidden (projection) layer, instead of a neural network's linear transformation plus activation function, it simply sums all of the input word vectors and takes their average.

For example, if the input is three words with 4-dimensional word vectors (1,2,3,4), (9,6,11,8), (5,10,7,12), then the vector produced by the word2vec projection is their average (5,6,7,8). In this way, multiple word vectors are condensed into a single vector (a minimal sketch of this averaging is given right after this list).

2. The second improvement concerns the amount of computation from the hidden layer to the softmax output layer. To avoid computing softmax probabilities over all words, word2vec uses a Huffman tree instead to map from the hidden layer to the softmax output layer.
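As a minimal illustration of the averaging in point 1 (plain Python; the helper name average_vectors is my own):

```python
def average_vectors(vectors):
    """Average a list of equal-length word vectors element-wise,
    as the word2vec projection step does."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

# The three 4-dimensional word vectors from the example above
context = [(1, 2, 3, 4), (9, 6, 11, 8), (5, 10, 7, 12)]
print(average_vectors(context))  # [5.0, 6.0, 7.0, 8.0]
```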

So we divide the discussion into three parts: the Huffman tree, the CBOW model based on Hierarchical Softmax, and the Skip-Gram model based on Hierarchical Softmax.

I. Huffman tree

Steps to construct a Huffman tree:

Suppose there are n weights; the Huffman tree constructed from them has n leaf nodes. With the n weights denoted w1, w2, ..., wn, the construction rules for the Huffman tree are:


(1) Treat w1, w2, ..., wn as a forest of n trees (each tree has only a single node);


(2) Select the two trees in the forest whose root nodes have the smallest weights and merge them as the left and right subtrees of a new tree; the weight of the new tree's root node is the sum of the weights of its left and right subtree roots;


(3) Delete the two selected trees from the forest and add the new tree to the forest;


(4) Repeat steps (2) and (3) until only one tree remains in the forest; that tree is the Huffman tree.

Example: suppose there are six nodes a, b, c, d, e, f with weights 9, 12, 6, 3, 5, 15 respectively.

The Huffman tree structure is shown below:

[Figure: the Huffman tree built from the six weights 9, 12, 6, 3, 5, 15]

The coding convention here is that the left subtree is coded as 1 and the right subtree as 0, and we additionally agree that the weight of the left subtree is never less than the weight of the right subtree.
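To make these construction rules concrete, here is a minimal Python sketch (the helper build_huffman and the dict-based node representation are my own choices, not word2vec's implementation): it repeatedly merges the two lightest trees, places the heavier subtree on the left, and codes the left branch as 1 and the right branch as 0.

```python
import heapq
import itertools

def build_huffman(weights):
    """Build a Huffman tree from {symbol: weight} and return {symbol: code}.
    Left (heavier) branches are coded '1', right branches '0'."""
    counter = itertools.count()          # tie-breaker so heapq never compares nodes
    heap = [(w, next(counter), {"symbol": s}) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)  # lightest tree
        w2, _, n2 = heapq.heappop(heap)  # second lightest
        # heavier subtree on the left, lighter on the right, per the convention above
        merged = {"left": n2, "right": n1}
        heapq.heappush(heap, (w1 + w2, next(counter), merged))

    codes = {}
    def walk(node, prefix):
        if "symbol" in node:
            codes[node["symbol"]] = prefix
        else:
            walk(node["left"], prefix + "1")   # left subtree -> 1
            walk(node["right"], prefix + "0")  # right subtree -> 0
    walk(heap[0][2], "")
    return codes

# The example from the text: a..f with weights 9, 12, 6, 3, 5, 15
print(build_huffman({"a": 9, "b": 12, "c": 6, "d": 3, "e": 5, "f": 15}))
# codes: a=00, b=01, c=100, d=1010, e=1011, f=11
```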

II. CBOW model based on Hierarchical Softmax

When CBOW uses Hierarchical Softmax, the algorithm combines it with Huffman coding: each word w can be reached from the root of the tree along a unique path, and this path forms the word's code. Suppose n(w, j) is the j-th node on this path and L(w) is the length of the path, with j starting from 1, so that n(w, 1) = root and n(w, L(w)) = w. For the j-th node, Hierarchical Softmax defines the label as 1 - code[j].

Take a window of appropriate size as the context and read the words inside the window into the input layer; their vectors (of dimension K, randomly initialized) are added together (and averaged) to form the K hidden-layer nodes. The output layer is a huge binary tree whose leaf nodes represent all the words in the corpus (if the corpus contains V distinct words, the binary tree has |V| leaf nodes), and the algorithm used to construct this whole binary tree is the Huffman tree. Thus every word at a leaf node has a globally unique code of the form "010011"; as before we take the left subtree as 1 and the right subtree as 0. Next, every internal node of the binary tree is connected to the hidden layer, so each internal node has K incoming edges, and every edge carries a weight.

For example, given a context and the word to be predicted (this can be regarded as a positive sample, since the word is known in advance), we want the predicted probability of that word's binary code to be as large as possible (the bit probabilities are computed with the logistic function). For instance, if a word's code is "010001", we compute the probability that the first bit is 0, the probability that the second bit is 1, and so on. The probability of the word under the current network is then the product of these bit probabilities along the path from the root to the word. From this we obtain the error for the sample and solve for the parameters by gradient methods: the network is trained by repeatedly computing the error between the output and the true label and updating the edge weights of the internal nodes accordingly. The purpose of the binary tree here is to reduce the time complexity.
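To make the path-probability computation concrete, here is a minimal Python sketch (the names word_probability and path_thetas are my own; path_thetas is assumed to hold one parameter vector per internal node on the word's path). Consistent with the label convention above (label = 1 - code bit), the probability of a 0 bit at a node is σ(x_w·θ) and of a 1 bit it is 1 - σ(x_w·θ):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def word_probability(x_w, code, path_thetas):
    """Probability of a word under Hierarchical Softmax.

    x_w         : projection-layer vector (list of floats)
    code        : the word's Huffman code, e.g. "010001"
    path_thetas : one parameter vector per internal node on the path,
                  len(path_thetas) == len(code)
    """
    prob = 1.0
    for bit, theta in zip(code, path_thetas):
        s = sigmoid(sum(x * t for x, t in zip(x_w, theta)))
        prob *= s if bit == "0" else (1.0 - s)  # label = 1 - bit
    return prob
```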

This concludes the algorithm of the CBOW model based on Hierarchical Softmax; the gradient iteration uses stochastic gradient ascent.

Steps:

Input: the CBOW training samples from the corpus, the word-vector dimension M, the CBOW context size 2c, and the step size η.

Output: the model parameters θ of the Huffman tree's internal nodes, and all word vectors w.

1. Build the Huffman tree from the training corpus samples.

2. Randomly initialize all model parameters θ and all word vectors w.

3. Perform the gradient ascent iteration, processing each training sample (context(w), w) as sketched in the code below:
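The per-sample processing of step 3 can be sketched in Python as follows. This follows the standard CBOW Hierarchical Softmax gradient-ascent update described in the first reference below; the names (cbow_hs_update, thetas, eta) are my own, code is w's Huffman code, thetas are the parameter vectors of the internal nodes on w's path, and each bit's label is 1 - the bit, as above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cbow_hs_update(context_vectors, code, thetas, eta):
    """One gradient-ascent step for a CBOW sample (context(w), w).

    context_vectors : the 2c word vectors of context(w) (lists), updated in place
    code            : Huffman code of the centre word w, e.g. "0100"
    thetas          : parameter vectors of the internal nodes on w's path,
                      updated in place (len(thetas) == len(code))
    eta             : learning rate
    """
    dim = len(thetas[0])
    # projection layer: average of the context word vectors
    x_w = [sum(v[k] for v in context_vectors) / len(context_vectors)
           for k in range(dim)]
    e = [0.0] * dim                           # accumulated gradient for the context words
    for bit, theta in zip(code, thetas):
        f = sigmoid(sum(x * t for x, t in zip(x_w, theta)))
        g = (1 - int(bit) - f) * eta          # label is 1 - code bit
        for k in range(dim):
            e[k] += g * theta[k]              # gradient w.r.t. x_w (uses old theta)
            theta[k] += g * x_w[k]            # update this internal node's parameters
    for v in context_vectors:                 # finally update every context word vector
        for k in range(dim):
            v[k] += e[k]
```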


III. Skip-Gram model based on Hierarchical Softmax

The Skip-Gram model is actually the reverse of the CBOW model.

For the mapping from the input layer to the hidden (projection) layer, this step is simpler than in CBOW: since there is only one input word, x_w is simply the word vector corresponding to the word w.

In the second step, we update θ_{j-1}^w and x_w by gradient ascent. Note that x_w here has 2c surrounding words as context, so we would expect P(x_i | x_w), i = 1, 2, ..., 2c to be maximized. Since the context relation is mutual, while expecting P(x_i | x_w), i = 1, 2, ..., 2c to be maximized we can conversely expect P(x_w | x_i), i = 1, 2, ..., 2c to be maximized. So which should we use, P(x_i | x_w) or P(x_w | x_i)? word2vec uses the latter. The benefit is that within one iteration window we update not only x_w but all 2c word vectors x_i, i = 1, 2, ..., 2c, which makes the overall iteration more balanced. For this reason the Skip-Gram model does not iteratively update the input as CBOW does; instead it performs the iterative updates for the 2c outputs.

This concludes the algorithm of the Skip-Gram model based on Hierarchical Softmax; the gradient iteration uses stochastic gradient ascent.

Input: the Skip-Gram training samples from the corpus, the word-vector dimension M, the Skip-Gram context size 2c, and the step size η.

Output: the model parameters θ of the Huffman tree's internal nodes, and all word vectors w.

1. Build the Huffman tree from the training corpus samples.

2. Randomly initialize all model parameters θ and all word vectors w.

3. Perform the gradient ascent iteration, processing each training sample (w, context(w)) as sketched in the code below:
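The per-sample processing of step 3 for Skip-Gram can be sketched as follows, under the same assumptions as the CBOW sketch above (it reuses that sketch's sigmoid helper; the name skipgram_hs_update is my own). As discussed earlier, all 2c context word vectors are updated, while the Huffman code and the internal-node parameters are those of the centre word w.

```python
def skipgram_hs_update(context_vectors, code, thetas, eta):
    """One gradient-ascent step for a Skip-Gram sample (w, context(w)).

    context_vectors : the 2c word vectors of context(w) (lists), updated in place
    code            : Huffman code of the centre word w
    thetas          : parameter vectors of the internal nodes on w's path,
                      updated in place
    eta             : learning rate
    """
    dim = len(thetas[0])
    for x_i in context_vectors:               # word2vec updates all 2c context vectors
        e = [0.0] * dim
        for bit, theta in zip(code, thetas):
            f = sigmoid(sum(x * t for x, t in zip(x_i, theta)))
            g = (1 - int(bit) - f) * eta      # label is 1 - code bit
            for k in range(dim):
                e[k] += g * theta[k]          # gradient w.r.t. x_i (uses old theta)
                theta[k] += g * x_i[k]        # update this internal node's parameters
        for k in range(dim):
            x_i[k] += e[k]                    # update this context word vector
```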


Summary: the above describes the Hierarchical Softmax version of the word2vec model.

The main references are linked below:

https://www.cnblogs.com/pinard/p/7243513.html

https://blog.csdn.net/weixin_33842328/article/details/86246017


Origin: www.cnblogs.com/r0825/p/10964192.html