NLP: A Detailed Explanation of the Skip-gram Model in the word2vec Algorithm

1. Word Embedding (word2vec)

Natural language is a complex system used to express meaning, and in this system words are the basic units of meaning. As the name implies, word vectors are vectors used to represent words; they can also be regarded as feature vectors or representations of words. The technique of mapping words to real-valued vectors is called word embedding. In recent years, word embedding has gradually become basic knowledge in natural language processing.

2. Why not use one-hot vectors

  • [How to use one-hot]

        1. Assume that the number of distinct words in the dictionary (the size of the dictionary) is N; each word can then be assigned a unique integer from 0 to N−1. The integer assigned to a word is called the index of the word.

        2. Assume that the index of a word is i. To obtain the one-hot vector representation of the word, we create a vector of length N filled with 0s and set its i-th element to 1. In this way, each word is represented as a vector of length N, which can be fed directly to a neural network.

        3. Simply put: the vector has as many dimensions as there are distinct words. As above, if there are N distinct words in the dictionary, an N-dimensional vector is created, and the element at the word's index position i is set to 1. For example, treating each position in the word list [I, happy, happy, learning, learning] as a dictionary entry, "I" can be encoded as [1,0,0,0,0], the following "happy" as [0,1,0,0,0], and so on.
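
        To make this concrete, here is a minimal sketch in Python (the names vocab, word_to_index and one_hot are illustrative, not from the original post). Note that once duplicates are removed, this toy dictionary has only 3 distinct words, so its one-hot vectors are 3-dimensional:

            # A minimal one-hot sketch: build an index for each distinct word,
            # then turn a word into a length-N vector with a single 1.
            words = ["I", "happy", "happy", "learning", "learning"]
            vocab = sorted(set(words))                      # distinct words in the "dictionary"
            word_to_index = {w: i for i, w in enumerate(vocab)}
            N = len(vocab)                                  # size of the dictionary

            def one_hot(word):
                vec = [0] * N
                vec[word_to_index[word]] = 1
                return vec

            print(word_to_index)      # {'I': 0, 'happy': 1, 'learning': 2}
            print(one_hot("I"))       # [1, 0, 0]
            print(one_hot("happy"))   # [0, 1, 0]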

  • [Problems]

         1. This method cannot be used to calculate the similarity between words.

         2. The reason is that every one-hot vector is orthogonal to every other one in the vector space, so the vectors carry no information about how words relate to each other.

         3. For example, suppose we measure similarity with cosine similarity.

         4. For vectors x, y \in R^{d}, their cosine similarity is the cosine of the angle between them:

                         cos(x, y) = x^{T} y / (\|x\| \|y\|), which lies in [−1, 1].

            Since any two different one-hot vectors are orthogonal, their cosine similarity is always 0, so one-hot vectors cannot express how similar two words are.
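
         A quick numerical check of this point, as a sketch assuming numpy is available (not code from the original post):

            import numpy as np

            def cosine_similarity(x, y):
                # cos(x, y) = (x . y) / (||x|| * ||y||)
                return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

            the = np.array([1, 0, 0, 0, 0])
            man = np.array([0, 1, 0, 0, 0])

            # Any two different one-hot vectors are orthogonal, so their cosine
            # similarity is 0, no matter how related the underlying words are.
            print(cosine_similarity(the, man))   # 0.0
            print(cosine_similarity(the, the))   # 1.0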

  • [Solution Strategy]

         1. Since one-hot encoding cannot solve this problem, we turn to word embedding, which is exactly what the word2vec method discussed below does. There are currently two models that implement this method.

                     1.1. Skip-gram: predicts the context words within a certain window from the central word.

                     1.2. Continuous bag of words (CBOW): predicts the central word from its context.

3. Skip-gram model

  • The discussion below is developed around the following figure.

         [Figure: the skip-gram network structure: an input one-hot vector, a hidden layer (the lookup table), and an output layer followed by a softmax]

 

  • First, let's use an example to illustrate what this algorithm does.

        1. Suppose there are 5 words in our text sequence, ["the", "man", "loves", "his", "son"].

        2. Assuming the window size skip-window = 2 and the central word is "loves", the context words are "the", "man", "his", and "son". Context words are also called "background words", and the corresponding window is called the "background window".

        3. What the skip-gram model does is model the conditional probability of generating the background words "the", "man", "his" and "son", each at a distance of no more than 2 from the central word "loves". Expressed as a formula:

                                                       P(“the",“man",“his",“son"∣“loves").

          4. Further, assuming that the background words are generated independently of each other given the central word, the formula can be rewritten as:

                                                 P("the" | "loves") ⋅ P("man" | "loves") ⋅ P("his" | "loves") ⋅ P("son" | "loves").

          5. The transformation above is similar to the step from Bayes to Naive Bayes: a conditional independence assumption.

          6. A simple diagram of this is:

                [Figure: skip-gram predicts each background word from the single central word]

          7. As you can see, this is a one-to-many scenario: 2m words are predicted from one word (m is the size of the background window). The sketch after this list enumerates these (central word, background word) pairs.

          8. From the above example it should be fairly clear what skip-gram is doing, so let's now analyze the diagram above step by step.
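
          As promised above, here is a small sketch that enumerates the (central word, background word) pairs for this sentence; the function name and edge handling are illustrative assumptions, not code from the original post:

            # Enumerate (center, context) pairs for a given background window size m.
            sentence = ["the", "man", "loves", "his", "son"]
            m = 2  # skip-window / background window size

            def skip_gram_pairs(tokens, window):
                pairs = []
                for i, center in enumerate(tokens):
                    lo = max(0, i - window)
                    hi = min(len(tokens), i + window + 1)
                    for j in range(lo, hi):
                        if j != i:
                            pairs.append((center, tokens[j]))
                return pairs

            for center, context in skip_gram_pairs(sentence, m):
                if center == "loves":
                    print(center, "->", context)
            # loves -> the
            # loves -> man
            # loves -> his
            # loves -> son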

  3.1 One-hot encoding

  • The first step is to perform one-hot encoding. Some readers may be confused: at the beginning we said that one-hot has a fatal problem, namely that it cannot be used to compute similarity between words. But we should not ignore the fact that a computer cannot work with "characters" directly; all data must first be converted into a numerical encoding.
  • So, having chosen one-hot for the encoding, how should we proceed?

         1. This is actually very simple and a routine operation; anyone who has studied machine learning will know it well, but let me describe it for completeness.

         2. For example, if the text sequence is ["the", "man", "loves", "his", "son"], it can be encoded as follows.

                the :[1,0,0,0,0]

                man:[0,1,0,0,0]

                loves:[0,0,1,0,0]

                his:[0,0,0,1,0]

                son:[0,0,0,0,1]

  • This method is very simple, and the result of the encoding is a very sparse matrix. (How sparse? Each row contains only a single 1; every other entry is 0.)
  • For example, if there are N distinct words in the dictionary, the overall encoding is a large N∗N matrix, and each individual word is a 1∗N vector.
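
  For the 5-word example above, this N∗N encoding is simply the identity matrix. A quick sketch (assuming numpy; not from the original post):

            import numpy as np

            vocab = ["the", "man", "loves", "his", "son"]
            N = len(vocab)

            # The one-hot encoding of the whole vocabulary is the N x N identity matrix:
            # exactly one 1 per row, everything else 0.
            encoding = np.eye(N, dtype=int)
            print(encoding)

            # Sparsity: only N of the N*N entries are non-zero.
            print(int(encoding.sum()), "non-zero entries out of", N * N)   # 5 non-zero entries out of 25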

3.2 Lookup Table 

  • As mentioned above, one-hot vectors cannot express similarity. We use them only to solve the problem that a computer cannot work with "character" data directly, but we must be clear that the ultimate goal is to represent each word with a dense vector that captures its meaning in the vector space.
  • For example, "man" and "woman" should be relatively close to each other in that space.
  • To achieve this, we first initialize such a vector and then update its values (weights) through learning, eventually obtaining the vector we want.
  • With this idea in place, the first step is to map the words we expressed in one-hot form to dense vectors, that is, to perform a mapping.
  • This mapping process is called embedding; because it maps words, it is called word embedding. Now that we know we need such an embedding, how is it actually done?
  • Suppose we want to map a word to a 300-dimensional vector (300 is a value found through extensive experiments; the usual range is 200–500). The intuitive approach is a matrix operation: each word is currently an N-dimensional vector, so mapping it to 300 dimensions requires an N∗300 weight matrix, and the result is a 1∗300 matrix, which can be understood as a vector. This is shown in the figure below (assuming N=5).

                               [Figure: a 1∗5 one-hot vector multiplied by a 5∗3 weight matrix (embedding size 3 for readability) yields a 1∗3 vector]

  • You should be clear about how the matrix multiplication works: corresponding elements are multiplied and then summed. For example: 10 = 0∗17 + 0∗23 + 0∗4 + 1∗10 + 0∗11, and the other entries are obtained in the same way. But even for such a small matrix, 5∗3 = 15 multiplications are performed. If there are N = 100,000 distinct words in the vocabulary and each word vector has size 300, then mapping a single word costs N∗300 = 30 million multiplications. That may not sound like much, but vocabularies can easily exceed 100,000 words, so the amount of computation becomes enormous.
  • Careful readers may have noticed a pattern in the figure above:

         1. That's right: the resulting vector is determined entirely by the position of the word's 1 in the one-hot encoding table!

         2. If the word's 1 appears at position 3, then row 3 of the weight matrix is selected (indices start from 0). What does this mean for us?

         3. By now the answer should be clear: we do not need to compute anything; we can map a word to a vector of any dimension directly through its index position.

         4. This index-based mapping is the lookup table: no computation is needed, only a query, as the sketch below verifies.
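
   The points above can be checked directly: multiplying a one-hot row vector by the weight matrix just selects one row of that matrix, so the multiplication can be replaced by an index lookup. A minimal sketch; the first column of the weight matrix matches the example calculation above (17, 23, 4, 10, 11), while the remaining numbers are arbitrary stand-ins:

            import numpy as np

            # Illustrative 5 x 3 weight matrix (N = 5 words, embedding size 3).
            W = np.array([[17, 24,  1],
                          [23,  5,  7],
                          [ 4,  6, 13],
                          [10, 12, 19],
                          [11, 18, 25]])

            # One-hot vector with the 1 at index 3.
            one_hot = np.array([0, 0, 0, 1, 0])

            # Full matrix multiplication: N * d multiply-adds per word...
            by_matmul = one_hot @ W      # [10 12 19]

            # ...versus the lookup table: no arithmetic, just pick row 3.
            by_lookup = W[3]             # [10 12 19]

            print(np.array_equal(by_matmul, by_lookup))   # True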

  • After understanding the above, let's describe it in professional terms:

         1. This mapping is injective. What does injective mean? Simply put: there is a mapping from set A to set B such that any two different elements of A map to different elements of B. The purpose is to guarantee that every word gets its own distinct vector.
         2. After the mapping, the amount of information does not change.
         3. The process maps the one-hot encoded word vector into a low-dimensional space through a hidden layer of a neural network that has no activation function.

  • So far we have completed two steps: one-hot → lookup table. This finishes the initialization of our word vectors, and what remains is training. It gets a little more difficult from here!

3.3 Mathematical principles (parameter update)

  • Let's look at a figure to see where we are: [Figure: the skip-gram architecture, now focusing on the Hidden Layer → Output Layer stage]
  • As you can see, we have now reached the Hidden Layer → Output Layer stage. Simply put, the hidden layer and the output layer are fully connected, and a softmax over the output gives the probabilities.
  • The process is relatively simple: one forward propagation and one backward propagation complete a parameter update. To make this easier to understand, let's illustrate with the example from the beginning.
  • For example:

        1. Suppose the text sequence is "the", "man", "loves", "his", and "son".

        2. With "loves" as the central word, set the background window size to 2. As shown in the figure, the skip-gram model is concerned with the conditional probability of generating the background words "the", "man", "his", and "son", each no more than 2 words away from the central word "loves", namely

                                                                P(“the",“man",“his",“son"∣“loves").

  • Assuming that the background words are generated independently of each other given the central word, the formula above can be rewritten as

                                                       P(“the"∣“loves")⋅P(“man"∣“loves")⋅P(“his"∣“loves")⋅P(“son"∣“loves").

Original source: blog.csdn.net/devil_son1234/article/details/107300300