[DataWhale Learning Record 15-05] Zero-Based Introduction to NLP - News Text Classification - Task 05: Text Classification Based on Deep Learning, Part 2

Text classification based on deep learning (Part 2)

5.1 Learning objectives

  1. Learn the use and basic principles of Word2Vec
  2. Learn to use TextCNN and TextRNN for text representation
  3. Learn to use HAN network structure to complete text classification

5.2 Text Representation, Part 3

5.2.1 Word Vector

This section uses Word2Vec to learn word vectors. The basic idea behind Word2Vec is to predict words from their context: for each input text we select a context window and a center word, and use the center word to predict the probability of the other words in the window. Because of this, Word2Vec can easily learn vector representations of new words from new corpora, making it an efficient online learning algorithm. The core idea of Word2Vec is that words and their contexts predict each other, and there are two corresponding algorithms:

  • Skip-gram (SG): predict the context words from the center word
  • Continuous Bag of Words (CBOW): predict the center (target) word from the context

In addition, two more efficient training methods are proposed:

  • Hierarchical softmax
  • Negative sampling

1. Skip-gram principle and network structure

The Word2Vec framework contains two models, Skip-Gram and CBOW. Intuitively, Skip-Gram predicts the context given an input word, while CBOW predicts the input (center) word given its context.
[Figure: Skip-Gram and CBOW model architectures]

The Word2Vec pipeline actually has two parts: the first part builds and trains the model, and the second part extracts the embedded word vectors from it.

The modeling process of Word2Vec is very similar in spirit to an auto-encoder: we first build a neural network on the training data, but once the model is trained we do not use it directly for the downstream task. What we really need are the parameters it learned on the training data, such as the hidden-layer weight matrix; as we will see later, these weights are precisely the "word vectors" we are trying to learn.

Skip-gram process

Suppose we have a sentence "The dog barked at the mailman".

  1. First, we choose a word in the middle of the sentence as our input word; suppose we choose "dog" as the input word.
  2. With the input word chosen, we define a parameter called skip_window, the number of words we select on each side (left or right) of the current input word. If we set skip_window=2, the words in the window (including the input word) are ['The', 'dog', 'barked', 'at']: skip_window=2 means taking the 2 words to the left of the input word and the 2 words to the right,
    so the total window span is 2x2=4. Another parameter, num_skips, controls how many (input word, output word) training pairs we draw from the window; with num_skips=2 we get two pairs, namely ('dog', 'barked') and ('dog', 'the'). (A small code sketch of this pair generation follows this list.)
  3. Based on these training pairs, the neural network outputs a probability distribution representing how likely each word in our vocabulary is to be the output word for the given input word. This sentence is a bit convoluted, so let's look at an example. In step 2 we obtained a set of training data with skip_window=2 and num_skips=2. Suppose we first take the pair ('dog', 'barked') to train the neural network; the model then learns from this sample and tells us, for every word in the vocabulary, the probability of that word being the output word when 'dog' is the input word.
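The pair-generation step above can be sketched in a few lines of Python (an illustrative sketch only, not the original word2vec code: it enumerates every pair in the window instead of sampling num_skips of them, and the function name is mine):

# Illustrative sketch of skip-gram training-pair generation.
def skipgram_pairs(tokens, skip_window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - skip_window), min(len(tokens), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, output word)
    return pairs

print(skipgram_pairs("The dog barked at the mailman".split()))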

In other words, the output probabilities of the model indicate how likely each word in the vocabulary is to co-occur with the input word. For example, if we feed the word "Soviet" into the neural network, then among the final output probabilities, related words such as "Union" and "Russia" will receive much higher probabilities than unrelated words such as "watermelon" and "kangaroo", because "Union" and "Russia" are far more likely to appear within the window of "Soviet" in the text.

We train the neural network to perform the probability calculation above by feeding it pairs of words from the text. The figure below shows some of our training samples. We use the sentence "The quick brown fox jumps over the lazy dog" and set the window size to 2 (window_size=2), which means we only pair the input word with the two words before and after it. In the figure, blue marks the input word and the boxes mark the words inside the window.
[Figure: training-sample pairs generated from "The quick brown fox jumps over the lazy dog" with window_size=2]
Our model learns statistics from the number of times each pair of words co-occurs. For example, the network will probably see many more training pairs like ("Soviet", "Union") than ("Soviet", "Sasquatch"). Therefore, once the model is trained, given the word "Soviet" as input, it will assign a higher output probability to "Union" or "Russia" than to "Sasquatch".

PS: Both the input word and the output word are one-hot encoded. Think about it: after one-hot encoding, most dimensions of the input are 0 (in fact only one position is 1), so the vector is extremely sparse. What happens if we multiply a 1 x 10000 vector by a 10000 x 300 matrix? It consumes considerable computing resources. For efficient computation, the implementation simply selects the row of the matrix whose index corresponds to the dimension that is 1 in the input vector:
[Figure: the one-hot input selects the corresponding row of the weight matrix]
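As a quick numerical check of this row-lookup trick (a small sketch with made-up sizes):

import numpy as np

vocab_size, embed_dim = 10000, 300
W = np.random.rand(vocab_size, embed_dim)   # hidden-layer weight matrix ("word vectors")
one_hot = np.zeros(vocab_size)
one_hot[42] = 1                             # pretend our input word has index 42
# Multiplying by the one-hot vector is the same as selecting row 42 directly.
assert np.allclose(one_hot @ W, W[42])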

2. Skip-gram training

As can be seen above, the Word2Vec model is a very large neural network (its weight matrices are huge). For example, with a vocabulary of 10,000 words and 300-dimensional word vectors, both the input-to-hidden and hidden-to-output weight matrices have 10,000 x 300 = 3 million weights. Running gradient descent on such a large network is quite slow. Worse still, you need a huge amount of training data to tune these weights and avoid overfitting. Millions of weights and hundreds of millions of training samples mean that training this model directly would be a disaster.

Solutions:

  • Treat common word pairs or phrases as single "words"

  • Sampling high-frequency words to reduce the number of training samples

  • The "negative sampling" method is adopted for the optimization target, so that the training of each training sample will only update a small part of the model weight, thereby reducing the computational burden

2.1 Word pairs and "phrases"

Some word combinations (phrases) have a completely different meaning once taken apart. For example, "Boston Globe" is the name of a newspaper, a meaning that the individual words "Boston" and "Globe" cannot express. Therefore, whenever "Boston Globe" appears in an article, we should treat it as a single word and generate a word vector for it, rather than splitting it up. Similar examples include "New York", "United States", and so on.

In the model released by Google, the training data contains 100 billion words from the Google News dataset, and besides single words the vocabulary includes 3 million word combinations (phrases).

2.2 Subsampling of high-frequency words

In the previous part, for the original text "The quick brown fox jumps over the lazy dog", using a window of size 2 we obtain the training samples shown in the figure.
[Figure: training samples that include the high-frequency word "the"]
But for frequently used high-frequency words such as "the", this way of generating samples has two problems:

  1. When we generate paired training samples, pairs like ("fox", "the") hardly give us any semantic information about "fox", because "the" appears in the context of almost every word.

  2. Since common words like "the" appear in the text with high probability, we end up with a huge number of training samples of the form ("the", ...), far more than we need to learn a good vector for "the".

Word2Vec solves this high-frequency-word problem through subsampling. The basic idea is: every word we encounter in the training text has a certain probability of being deleted from the text, and that deletion probability is related to the word's frequency.

Let ωi be a word and Z(ωi) its relative frequency in the whole corpus. For example, if the word "peanut" appears 1,000 times in a corpus of one billion words, then Z("peanut") = 1000/1000000000 = 1e-6.

P(ωi) is the probability of keeping the word; with the default sample threshold of 0.001 used in the common implementation it is

P(ωi) = (sqrt(Z(ωi)/0.001) + 1) * 0.001/Z(ωi)
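A small sketch of this keep-probability (the 0.001 value is the default sample threshold mentioned above; the function name is mine):

import math

def keep_prob(z, sample=0.001):
    # Probability of keeping a word whose relative corpus frequency is z.
    return (math.sqrt(z / sample) + 1) * sample / z

print(keep_prob(1e-6))   # a rare word like "peanut": value > 1, i.e. effectively always kept
print(keep_prob(0.05))   # a very frequent word: kept only a small fraction of the time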

2.3 Negative sampling

Training a neural network means feeding in training samples and continually adjusting the neuron weights so that the network predicts the target more and more accurately. Every time the network is trained on a sample, its weights are adjusted once.

Therefore, the size of the vocabulary means that our Skip-Gram network has very large weight matrices, and all of these weights would have to be adjusted with hundreds of millions of training samples. This is computationally very expensive, and in practice training would be very slow.

Negative sampling solves this problem. It is a technique for speeding up training and improving the quality of the word vectors. Instead of updating all of the weights for every training sample, negative sampling updates only a small part of them each time, which reduces the amount of computation during gradient descent.

When we use the training sample (input word: "fox", output word: "quick") to train our network, both "fox" and "quick" are one-hot encoded. If our vocabulary size is 10,000, then at the output layer we expect the neuron corresponding to "quick" to output 1 and the remaining 9,999 neurons to output 0. The words corresponding to these 9,999 neurons that we expect to output 0 are called "negative" words.

When using negative sampling, we will randomly select a small portion of negative words (for example, select 5 negative words) to update the corresponding weights. We will also update the weight of our "positive" word (in our example above, this word refers to "quick").

PS: In the paper, the authors point out that for small datasets it works best to choose 5-20 negative words, while for large datasets 2-5 negative words are enough.

We use "unigram distribution" to select "negative words". The probability of a word being selected as a negative sample is related to its frequency of appearance. The higher the frequency of occurrence, the easier it is to be selected as negative words.

The probability of each word being selected as a negative word is computed as

P(ωi) = f(ωi)^(3/4) / Σj f(ωj)^(3/4)

where f(ωi) is the word's frequency of occurrence; the 3/4 power in the formula is purely empirical.

In the C implementation of negative sampling, the unigram table is an array of 100 million elements filled with the vocabulary index of each word; indices are repeated, i.e. some words appear many times. How many times does each word's index appear in the array? There is a formula: the word's negative-sampling probability * 100 million = the number of times the word appears in the table.

With this table, every time we perform negative sampling we only need to generate a random integer in the range 0 to 100 million and take the word whose index sits at that position in the table as our negative word. The larger a word's negative-sampling probability, the more times it appears in the table and the more likely it is to be selected.
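A hedged sketch of the table-based sampler described above (a much smaller table than the real 100-million-element one, and all names are mine):

import random

def build_unigram_table(counts, table_size=1_000_000, power=0.75):
    # counts: dict word -> raw count; a word's share of the table is proportional to count**0.75
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    table = []
    for w, wt in weights.items():
        table.extend([w] * int(round(wt / total * table_size)))
    return table

def sample_negatives(table, k=5, positive=None):
    negatives = []
    while len(negatives) < k:
        w = table[random.randrange(len(table))]  # a random position in the table
        if w != positive:
            negatives.append(w)
    return negatives

table = build_unigram_table({"the": 1000, "fox": 30, "quick": 25, "sasquatch": 1})
print(sample_negatives(table, k=5, positive="quick"))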

3. Hierarchical Softmax

3.1 Huffman tree

Input: n nodes with weights (w1, w2,...wn)

Output: the corresponding Huffman tree

  1. Think of (w1, w2,...wn) as a forest with n trees, and each tree has only one node

  2. In the forest, select the two trees whose root nodes have the smallest weights and merge them into a new tree; the two trees become the left and right subtrees of the new tree, and the root weight of the new tree is the sum of the root weights of its left and right subtrees

  3. Remove the two trees with the smallest root node weight from the forest, and add the new tree to the forest

  4. Repeat steps 2 and 3 until there is only one tree in the forest

Below we use a concrete example to illustrate the process of building a Huffman tree. We have 6 nodes (a, b, c, d, e, f) with weights (16, 4, 8, 6, 20, 3).

First, b and f, which have the smallest weights, are merged, and the root of the new tree has weight 7. There are now 5 trees in the forest, with root weights 16, 8, 6, 20, and 7. Next, the two smallest roots, 6 and 7, are merged into a new subtree, and so on, until finally the Huffman tree is obtained (a code sketch of this merging procedure follows the example).
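The merging procedure in steps 1-4 can be sketched with a heap (an illustrative sketch; the tuple representation and names are mine):

import heapq

def build_huffman(weights):
    # weights: dict label -> weight. Returns (total weight, nested (left, right) tuples).
    heap = [(w, i, label) for i, (label, w) in enumerate(weights.items())]
    heapq.heapify(heap)                      # the "forest" of single-node trees
    counter = len(heap)                      # tie-breaker so tuples never compare nodes directly
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two trees with the smallest root weights
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (left, right)))  # merged tree goes back into the forest
        counter += 1
    return heap[0][0], heap[0][2]

print(build_huffman({"a": 16, "b": 4, "c": 8, "d": 6, "e": 20, "f": 3}))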

So what are the benefits of the Huffman tree? After obtaining it, we usually perform Huffman coding on the leaf nodes. Since leaves with larger weights are closer to the root and leaves with smaller weights are farther away, high-weight nodes get shorter codes and low-weight nodes get longer codes. This guarantees that the weighted path length of the tree is minimal, which also matches information theory: we want the more frequently used words to have shorter codes. How are the codes assigned? For a Huffman tree it is usually agreed that a left branch is coded 0 and a right branch is coded 1; in the example above, the code of c is then 00.

In word2vec, the convention is the opposite of the example above: a left branch is coded 1 and a right branch 0, and it is additionally stipulated that the weight of the left subtree is not less than that of the right subtree.

For more details, see: Huffman tree principle

3.2 Hierarchical Softmax process

To avoid computing the softmax probability over all words, word2vec uses the Huffman tree to replace the mapping from the hidden layer to the output softmax layer.

Building the Huffman tree:

  1. Build a Huffman tree based on the label and frequency (the higher the frequency of the label, the shorter the path of the Huffman tree)

  2. Each leaf node in the Huffman tree represents a label
    [Figure: Huffman tree whose leaves are the labels]

As shown in the figure above, each step from the root toward a leaf is a binary decision, so the probability of reaching a label is a product of per-node probabilities.
[Figures: Hierarchical Softmax probability and parameter-update formulas]
Note: θ here is an unknown parameter vector; its iterative update formula is obtained by taking derivatives of the maximum-likelihood objective.
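The exact formulas are in the figures above, which are not reproduced here; for reference, one standard way to write the hierarchical-softmax objective (which this derivation follows in spirit, though the notation may differ from the figures) is: with $d_j \in \{0,1\}$ the Huffman code of the $j$-th node on the path of word $w$, $\theta_{j-1}$ the parameter vector of its parent node, and $x_w$ the projection-layer vector,

$$P(d_j \mid x_w, \theta_{j-1}) = \big[\sigma(x_w^{\top}\theta_{j-1})\big]^{1-d_j}\,\big[1-\sigma(x_w^{\top}\theta_{j-1})\big]^{d_j}, \qquad P(w \mid \text{context}) = \prod_{j=2}^{l_w} P(d_j \mid x_w, \theta_{j-1}),$$

where $l_w$ is the path length. Maximizing the log of this product over the corpus and differentiating with respect to $\theta_{j-1}$ and $x_w$ gives the iterative update formulas mentioned in the note above.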
Use gensim to train Word2Vec:

from gensim.models.word2vec import Word2Vec
# sentences: an iterable of tokenized documents; note that in gensim >= 4.0 the parameter is named vector_size instead of size
model = Word2Vec(sentences, workers=num_workers, size=num_features)
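Once trained, the learned vectors live in model.wv (gensim's KeyedVectors interface); a small usage sketch, with 'word' and the file name as placeholders:

vector = model.wv['word']                        # num_features-dimensional vector of a token seen in training
similar = model.wv.most_similar('word', topn=5)  # nearest neighbours by cosine similarity
model.save('word2vec.model')                     # persist the trained model (path is illustrative)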
Word2Vec references:
  1. CS224n Note 2: Vector representations of words - word2vec
  2. Stanford University Deep Learning and Natural Language Processing Lecture 2: Word Vector
  3. (Stanford CS224d) Deep Learning and NLP Course Notes (3): Evaluation of GloVe and Model
  4. Principle of word2vec (3) Model based on Negative Sampling
  5. A detailed explanation of the Skip-Gram model of Word2vec (structure)

5.2.2 TextCNN

TextCNN uses a CNN (convolutional neural network) to extract text features. Convolution kernels of different sizes extract n-gram features; max pooling keeps the largest value of each feature map produced by the convolutions, and the pooled values are then concatenated into a single vector that serves as the text representation.

Following the original TextCNN paper, here we use 100 convolution kernels each of sizes 2, 3, and 4, so the final text vector has 100*3=300 dimensions.

[Figure: TextCNN architecture]

5.2.3 TextRNN

TextRNN uses an RNN (recurrent neural network) for text feature extraction. Since text is itself a sequence, an LSTM is naturally suited to modeling it. TextRNN feeds the word vector of each word in the sentence into a bidirectional two-layer LSTM in turn, and concatenates the hidden states at the last valid position in the two directions into one vector as the representation of the text.
[Figure: TextRNN architecture]

5.3 Text representation based on TextCNN and TextRNN

5.3.1 TextCNN

  • Model building
self.filter_sizes = [2, 3, 4]  # n-gram window
self.out_channel = 100
self.convs = nn.ModuleList([nn.Conv2d(1, self.out_channel, (filter_size, input_size), bias=True) for filter_size in self.filter_sizes])
  • Forward propagation
pooled_outputs = []
for i in range(len(self.filter_sizes)):
    filter_height = sent_len - self.filter_sizes[i] + 1
    # batch_embed: sen_num x 1 x sent_len x input_size (each sentence treated as a single-channel "image")
    conv = self.convs[i](batch_embed)
    hidden = F.relu(conv)  # sen_num x out_channel x filter_height x 1
 
    mp = nn.MaxPool2d((filter_height, 1))  # (filter_height, filter_width)
    # sen_num x out_channel x 1 x 1 -> sen_num x out_channel
    pooled = mp(hidden).reshape(sen_num, self.out_channel)
    
    pooled_outputs.append(pooled)
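The pooled vectors from the three kernel sizes are then concatenated into the 100*3=300-dimensional text representation described in 5.2.2 (a one-line sketch, assuming torch is imported alongside the surrounding code; the reference code may perform this step elsewhere):

reps = torch.cat(pooled_outputs, dim=1)  # sen_num x (out_channel * len(filter_sizes)) = sen_num x 300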

5.3.2 TextRNN

  • Model building
input_size = config.word_dims
 
# LSTM here is a custom wrapper (not torch.nn.LSTM): it accepts dropout_in/dropout_out and, in forward(), a mask tensor
self.word_lstm = LSTM(
    input_size=input_size,
    hidden_size=config.word_hidden_size,
    num_layers=config.word_num_layers,
    batch_first=True,
    bidirectional=True,
    dropout_in=config.dropout_input,
    dropout_out=config.dropout_hidden,
)
  • Forward propagation
hiddens, _ = self.word_lstm(batch_embed, batch_masks)  # sent_len x sen_num x hidden*2
hiddens.transpose_(1, 0)  # sen_num x sent_len x hidden*2
 
if self.training:
    hiddens = drop_sequence_sharedmask(hiddens, self.dropout_mlp)
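How the per-token hidden states are turned into a single sentence vector is not shown above. A hedged sketch of the "last valid position in both directions" idea from 5.2.3 (my own code, not necessarily what the reference repository does; it assumes torch is imported and batch_masks is a sen_num x sent_len tensor with 1 for real tokens and 0 for padding):

lengths = batch_masks.sum(dim=1).long()          # number of valid tokens per sentence
half = hiddens.size(-1) // 2                     # size of each direction's hidden state
fw_last = hiddens[torch.arange(hiddens.size(0)), lengths - 1, :half]  # forward state at the last valid token
bw_last = hiddens[:, 0, half:]                   # backward state at the first token
sent_reps = torch.cat([fw_last, bw_last], dim=-1)  # sen_num x hidden*2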

5.4 Use HAN for text classification

Hierarchical Attention Network for Document Classification (HAN) is based on hierarchical attention: it encodes the document at the word level and the sentence level separately, uses attention to obtain the document representation, and then classifies it with Softmax. The role of the word-level encoder is to obtain sentence representations; it can be replaced with the TextCNN or TextRNN described in the previous sections, or with the BERT described in the next section.

[Figure: HAN architecture]
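A hedged sketch of the attention layer HAN applies at the word level and again at the sentence level, following the formulation in the HAN paper (u = tanh(Wh + b), weights from a softmax over u·u_w); the module and parameter names are mine:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """HAN-style attention pooling over a sequence of hidden states."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)        # W, b
        self.context = nn.Parameter(torch.randn(hidden_size))  # context vector u_w

    def forward(self, hiddens, masks=None):
        # hiddens: batch x seq_len x hidden_size; masks: batch x seq_len (1 = real token, 0 = padding)
        u = torch.tanh(self.proj(hiddens))                 # batch x seq_len x hidden_size
        scores = u.matmul(self.context)                    # batch x seq_len
        if masks is not None:
            scores = scores.masked_fill(masks == 0, -1e9)  # ignore padded positions
        alpha = F.softmax(scores, dim=1).unsqueeze(2)      # batch x seq_len x 1
        return (alpha * hiddens).sum(dim=1)                # attention-weighted sum: batch x hidden_size

The same module can be applied once over word hidden states to get each sentence vector, and once over sentence hidden states to get the document vector.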

5.5 Summary of this chapter

This chapter introduces the use of Word2Vec, as well as the principles and training of TextCNN and TextRNN, and finally introduces HAN for long document classification.

5.6 Homework

  1. Try to train word vectors with Word2Vec
  2. Try to use TextCNN, TextRNN to complete text representation
  3. Try to use HAN for text classification

References:

  1. https://mp.weixin.qq.com/s/I-yeHQopTFdNk67Ir_iWiA
  2. https://github.com/hecongqing/2018-daguan-competition


Source: blog.csdn.net/qq_40463117/article/details/107655495