NLP News Text Classification - Task 5

Task 5: Word2vec for text classification based on deep learning

1. Word vector
This section learns word vectors through word2vec. The basic idea behind the word2vec model is to predict the words that appear in a context: for each input text, we select a context window and a central word, and based on this central word we predict the probability of the other words in the window. Because of this, the word2vec model can easily learn vector representations for new words from a new corpus, which makes it an efficient online learning algorithm. Its main idea is that words and their contexts predict each other, and the two corresponding algorithms are (a short code sketch follows the list):
1. Skip-grams (SG): predicts the context words from the central word
2. Continuous Bag of Words (CBOW): predicts the central word from its context
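
To make the two variants concrete, here is a minimal sketch using gensim's Word2Vec class (gensim 4.x). The toy corpus and all hyperparameter values below are illustrative assumptions, not part of the original post; the sg flag is what switches between the two architectures.

    # Minimal sketch: training Skip-gram vs. CBOW with gensim (assumed toy corpus).
    from gensim.models import Word2Vec

    corpus = [
        ["the", "dog", "barked", "at", "the", "mailman"],
        ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ]

    # sg=1 -> Skip-gram: predict the context words from the central word
    sg_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

    # sg=0 -> CBOW: predict the central word from its context
    cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

    print(sg_model.wv["dog"][:5])  # first 5 dimensions of the learned vector for "dog"
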

2. Skip-grams principle and network structure

[Figure: Skip-gram network structure]

The Word2vec model is actually divided into two parts: the first part builds the model, and the second part obtains the embedded word vectors through the model. The whole modeling process is very similar in spirit to an auto-encoder: we first build a neural network on the training data, but once the model is trained we do not use it to handle new tasks. What we really need are the parameters the model learned from the training data, such as the weight matrix of the hidden layer. We will see later that these weights in Word2vec are actually the "word vectors" we are trying to learn.

2.1 Skip-grams process

Suppose we have a sentence "The dog barked at the mailman"

1. First we choose a word in the middle of the sentence as our input word; for example, we choose "dog" as the input word.

2. With the input word chosen, we define a parameter called skip_window, which represents the number of words we select from one side (left or right) of the current input word. If we set skip_window=2, then the words we finally get in the window (including the input word) are ['The', 'dog', 'barked', 'at']. skip_window=2 means that the 2 words on the left of the input word and the 2 words on its right enter our window, so the entire window size is span=2x2=4. Another parameter is called num_skips, which represents how many different words we select from the entire window as the output word. With skip_window=2 and num_skips=2, we will get two training pairs of the form (input word, output word), namely ('dog', 'barked') and ('dog', 'the'). A small code sketch of this step follows.
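
Concretely, the pair-generation step can be sketched in a few lines of Python. The function name make_pairs and the random sampling of num_skips context words are illustrative assumptions, not code from the original post.

    # Sketch: generating (input_word, output_word) pairs with skip_window=2.
    import random

    sentence = "The dog barked at the mailman".split()
    skip_window = 2   # words taken from each side of the center word
    num_skips = 2     # how many (input, output) pairs to draw per center word

    def make_pairs(tokens, center_idx):
        lo = max(0, center_idx - skip_window)
        hi = min(len(tokens), center_idx + skip_window + 1)
        context = [tokens[i] for i in range(lo, hi) if i != center_idx]
        targets = random.sample(context, min(num_skips, len(context)))
        return [(tokens[center_idx], t) for t in targets]

    print(make_pairs(sentence, 1))  # e.g. [('dog', 'barked'), ('dog', 'The')]
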

3. The neural network will output a probability distribution based on these training data; this probability represents the likelihood of each word in our dictionary being the output word for the given input word. This sentence is a bit confusing, so let's look at an example. In the second step we obtained two pairs of training data with skip_window=2 and num_skips=2. If we first use the pair ('dog', 'barked') to train the neural network, then the model will tell us, for each word in the vocabulary, how likely it is to be the output word when 'dog' is the input word. In other words, the output probability of the model represents how likely each word in our dictionary is to appear together with the input word. For example: if we feed the word "Soviet" into the neural network, then in the output probabilities of the trained model, related words like "Union" and "Russia" will have a much higher probability than unrelated words like "watermelon" or "kangaroo", because "Union" and "Russia" are more likely to appear in the window of "Soviet" in the text.

We will train the neural network to perform the probability calculation described above by feeding it pairs of words from the text. The figure below shows some examples of our training samples. We select the sentence "The quick brown fox jumps over the lazy dog" and set our window size to 2 (window_size=2), which means we only combine the input word with the two words before and after it. In the figure below, blue represents the input word and the boxes represent the words in the window.
[Figure: training sample pairs generated from "The quick brown fox jumps over the lazy dog" with window_size=2]
Our model will learn these statistics from the number of times each pair of words occurs. For example, our neural network will probably see many more training pairs like ("Soviet", "Union") than combinations like ("Soviet", "Sasquatch"). Therefore, when our model has finished training and is given the word "Soviet" as input, it will assign a much higher output probability to "Union" or "Russia" than to "Sasquatch".

PS: Both the input word and the output word will be one-hot encoded. Think about it carefully: after our input is one-hot encoded, most of the dimensions are 0 (in fact, only one position is 1), so this vector is quite sparse. What does that imply? Multiplying a 1x10000 vector by a 10000x300 matrix would consume considerable computing resources. For efficient calculation, the multiplication simply selects the row of the matrix corresponding to the index whose value is 1.

[Figure: multiplying a one-hot vector by the weight matrix is equivalent to selecting a single row]
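
The following tiny numpy sketch (my own illustration, not code from the post) checks that multiplying a one-hot vector by the weight matrix gives exactly the same result as picking out the corresponding row.

    # Sketch: a one-hot input turns the matrix multiplication into a row lookup.
    import numpy as np

    vocab_size, embed_dim = 10000, 300
    W = np.random.rand(vocab_size, embed_dim)   # input-to-hidden weight matrix

    word_index = 42                             # position of the 1 in the one-hot vector
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0

    full_product = one_hot @ W                  # full 1x10000 by 10000x300 multiply
    lookup = W[word_index]                      # just take row 42

    print(np.allclose(full_product, lookup))    # True: the two results are identical
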

2.2 Skip-grams training

From the above we can see that the Word2vec model is a huge neural network (the weight matrices are very large). For example, suppose we have a vocabulary of 10,000 words and we want to embed them as 300-dimensional word vectors. Then our input-to-hidden weight matrix and our hidden-to-output weight matrix will each have 10,000 x 300 = 3 million weights, and gradient descent in such a huge network is quite slow. Worse, you need a lot of training data to tune these weights and avoid overfitting. Millions of weights combined with hundreds of millions of training samples mean that training this model naively would be a disaster.
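
As a rough sanity check on these numbers, here is a hedged PyTorch sketch that builds the two weight matrices of a skip-gram network and counts their parameters; the framework choice and layer names are my own assumptions, the sizes are the ones quoted in the text.

    # Sketch: the two weight matrices of a skip-gram network and their sizes.
    import torch.nn as nn

    vocab_size, embed_dim = 10_000, 300

    input_to_hidden = nn.Embedding(vocab_size, embed_dim)            # 10,000 x 300 weights
    hidden_to_output = nn.Linear(embed_dim, vocab_size, bias=False)  # 300 x 10,000 weights

    total = sum(p.numel() for p in input_to_hidden.parameters()) + \
            sum(p.numel() for p in hidden_to_output.parameters())
    print(total)  # 6000000: two matrices of 3 million weights each
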

Solutions:

1. Treat common word pairs or phrases as single "words"
2. Subsample high-frequency words to reduce the number of training samples
3. Use "negative sampling" for the optimization objective, so that each training sample only updates a small part of the model's weights, thereby reducing the computational burden (a hedged sketch illustrating these options follows the list)
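
A hedged gensim sketch of how these three ideas are commonly exposed as training options; the Phrases step, the sample and negative values, and the toy corpus are all illustrative assumptions rather than the post's own code.

    # Sketch: phrase merging, frequent-word subsampling, and negative sampling in gensim.
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    corpus = [
        ["the", "dog", "barked", "at", "the", "mailman"],
        ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ]

    # 1. Phrase detection merges frequent word pairs into single tokens
    #    (on this toy corpus nothing may merge; on real data pairs such as
    #    "new_york" would become one "word").
    bigrams = Phraser(Phrases(corpus, min_count=1, threshold=1.0))
    phrased_corpus = [bigrams[sentence] for sentence in corpus]

    # 2. sample=1e-3 randomly discards very frequent words during training.
    # 3. negative=5 draws 5 negative words per training sample, so each update
    #    touches only a small slice of the output weights instead of the whole matrix.
    model = Word2Vec(phrased_corpus, vector_size=50, window=2, min_count=1,
                     sg=1, sample=1e-3, negative=5)
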


Origin: blog.csdn.net/DZZ18803835618/article/details/107696554