Tianchi NLP Competition-News Text Classification (5)-Text Classification Based on Deep Learning 2-TextCNN, TextRNN


Series of articles
Tianchi NLP Competition-News Text Classification (1)-Comprehension of the Competition Questions
Tianchi NLP Competition-News Text Classification (2)-Data Reading and Data Analysis
Tianchi NLP Competition-News Text Classification (3)-Text Classification Based on Machine Learning
Tianchi NLP Competition-News Text Classification (4)-Text Classification Based on Deep Learning 1-FastText
Tianchi NLP Competition-News Text Classification (5)-Text Classification Based on Deep Learning 2-TextCNN, TextRNN


5. Text classification based on deep learning 2-TextCNN, TextRNN

5.1 Text representation method-word vector

Here you can refer to: CS224n notes-Word Vectors and Word Senses (2)

This section learns word vectors through word2vec. The basic idea behind word2vec is to use a word to predict the words that appear in its context. For each input text we select a context window and a central word, and based on this central word we predict the probabilities of the other words in the window. Because of this, the word2vec model can easily learn vector representations for new words from a new corpus, which makes it an efficient online learning algorithm.

The main idea of word2vec is that words and their contexts predict each other. The two corresponding algorithms are:

  • Skip-grams (SG): predict the context words from the center word
  • Continuous Bag of Words (CBOW): predict the center word from the context words

In addition, two more efficient training methods are proposed:

  • Hierarchical softmax
  • Negative sampling

5.1.1 Skip-grams principle and network structure

The Word2Vec model comes in two flavors, Skip-Gram and CBOW. Intuitively, Skip-Gram predicts the context given an input word, while CBOW predicts the input word given its context.

[Figure: Skip-Gram and CBOW model architectures]

Working with Word2Vec actually has two parts: the first is building and training the model, and the second is obtaining the embedded word vectors from the trained model.

The modeling process of Word2Vec is very similar in spirit to an auto-encoder: we first build a neural network on the training data, but once the model is trained we do not use it directly on new tasks. What we really need are the parameters the model has learned from the training data, such as the weight matrix of the hidden layer. As we will see later, these weights are exactly the "word vectors" that Word2Vec is trying to learn.

Skip-grams process

Suppose we have a sentence "The dog barked at the mailman".

  1. First, we choose a word in the middle of the sentence as our input word; for example, we choose "dog";
  2. With the input word chosen, we define a parameter called skip_window, which is the number of words we select on each side (left or right) of the current input word. If we set skip_window=2, the words in the window (including the input word) are ['The', 'dog', 'barked', 'at']. In other words, skip_window=2 means taking the 2 words to the left and the 2 words to the right of the input word, so the whole window spans span=2x2=4 context words. Another parameter, num_skips, is the number of different words we select from the window as output words. With skip_window=2 and num_skips=2, we get two (input word, output word) training pairs: ('dog', 'barked') and ('dog', 'the');
  3. Based on these training pairs, the neural network outputs a probability distribution that represents, for every word in our dictionary, how likely it is to be the output word for the given input word (see the sketch after this list). For example, in step 2 we obtained two training pairs with skip_window=2 and num_skips=2. If we first use the pair ('dog', 'barked') to train the network, the model learns from this sample how likely each word in the vocabulary is to appear as the output word when 'dog' is the input word.
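
A minimal sketch of this pair generation, assuming a tokenized sentence. skipgram_pairs is a hypothetical helper written only for illustration; it is not part of word2vec or of the competition code:

import random

def skipgram_pairs(tokens, skip_window=2, num_skips=2):
    # illustrative helper: build (input word, output word) pairs as described in step 2
    pairs = []
    for i, center in enumerate(tokens):
        # words within skip_window positions of the center word, excluding the center itself
        window = [tokens[j] for j in range(max(0, i - skip_window), min(len(tokens), i + skip_window + 1)) if j != i]
        # pick num_skips distinct context words from the window as output words
        for context in random.sample(window, min(num_skips, len(window))):
            pairs.append((center, context))
    return pairs

print(skipgram_pairs("The dog barked at the mailman".split()))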

In other words, the output probabilities of the model represent how likely each word in our dictionary is to appear near the input word. For example, if we feed the word "Soviet" into the network, related words such as "Union" and "Russia" will receive much higher output probabilities than unrelated words such as "watermelon" and "kangaroo", because "Union" and "Russia" are far more likely to appear inside the window around "Soviet" in real text.

We train the neural network to perform the probability calculation above by feeding it pairs of words from the text. The following figure shows some examples of our training samples. We take the sentence "The quick brown fox jumps over the lazy dog" and set the window size to 2 (window_size=2), which means we only combine the two words before and after the input word with the input word. In the figure below, blue marks the input word and the boxes mark the words inside the window.

[Figure: training-sample pairs generated with window_size=2 for "The quick brown fox jumps over the lazy dog"]

Our model learns statistics from how often each pair of words appears together. For example, the network will see many more training pairs like ("Soviet", "Union") than ("Soviet", "Sasquatch"). Therefore, after training, given the word "Soviet" as input, the model assigns a much higher output probability to "Union" or "Russia" than to "Sasquatch".

PS: Both the input word and the output word are one-hot encoded. Think about it: after one-hot encoding, most dimensions of the input are 0 (in fact only one position is 1), so the vector is extremely sparse. What happens if we multiply a 1 x 10000 vector by a 10000 x 300 matrix? It would waste a considerable amount of computation. For efficiency, the multiplication effectively just selects the row of the matrix whose index corresponds to the dimension that is 1 in the vector:

[Figure: the one-hot input selects a single row of the hidden-layer weight matrix]
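
A tiny numpy sketch of this lookup behavior, using the same dimensions as the example above:

import numpy as np

vocab_size, embed_dim = 10000, 300
W = np.random.rand(vocab_size, embed_dim)        # input-to-hidden weight matrix
one_hot = np.zeros(vocab_size)
one_hot[42] = 1                                  # one-hot vector for the word at index 42
assert np.allclose(one_hot @ W, W[42])           # multiplying is equivalent to selecting row 42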

5.1.2 Skip-grams training

As can be seen above, the Word2Vec model is a very large neural network (the weight matrices are huge). For example, with a vocabulary of 10,000 words and 300-dimensional word vectors, both the input-to-hidden weight matrix and the hidden-to-output weight matrix have 10000 x 300 = 3 million weights. Running gradient descent on such a large network is quite slow. To make matters worse, you need a lot of training data to tune these weights and avoid overfitting. Weight matrices with millions of entries and hundreds of millions of training samples mean that training this model would be a disaster.

Solutions:

  • Treat common word pairs or phrases as single "words"
  • Subsample high-frequency words to reduce the number of training samples
  • Use "negative sampling" for the optimization objective, so that each training sample only updates a small fraction of the model weights, which reduces the computational burden

1. Word pairs and "phrases"

Some word combinations (phrases) have a meaning that is completely lost when they are split apart. For example, "Boston Globe" is the name of a newspaper, and the individual words "Boston" and "Globe" do not convey that meaning. Therefore, whenever "Boston Globe" appears in an article, we should treat it as a single word and generate a word vector for it rather than splitting it up. Other examples include "New York", "United States", and so on.

The model released by Google was trained on 100 billion words from the Google News data set, and besides single words its vocabulary contains 3 million word combinations (phrases).

2. Subsampling of high-frequency words

In the previous part, for the original text "The quick brown fox jumps over the lazy dog", a window of size 2 gives us the training samples shown in the figure.

[Figure: training samples generated with a window of size 2]
But for frequently used high-frequency words like "the", this way of generating samples has two problems:

  1. When we build paired training samples, a pair like ("fox", "the") gives us little semantic information about "fox", because "the" appears in the context of almost every word;
  2. Since common words like "the" appear in the text with high probability, we end up with a huge number of samples of the form ("the", ...), far more than we need to learn a good vector for the word "the".

Word2Vec solves this high-frequency-word problem through "subsampling". The basic idea is as follows: every word we encounter in the training text has a certain probability of being deleted from the text, and the probability of deletion is related to the word's frequency.

Note: how exactly are words deleted?

ωi is a word, and Z(ωi) is the fraction of all tokens in the corpus that are occurrences of ωi. For example, if the word "peanut" appears 1000 times in a corpus of one billion words, then Z("peanut") = 1000 / 1000000000 = 1e-6.

P(ωi) represents the probability of retaining a certain word:

P(ωi) = ( sqrt( Z(ωi) / 0.001 ) + 1 ) × 0.001 / Z(ωi)

(0.001 is the default "sample" threshold used in the original word2vec implementation.)
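
A small sketch of this subsampling rule, assuming the default sample threshold of 0.001:

import math

def keep_prob(z, sample=1e-3):
    # P(w) = (sqrt(Z(w)/sample) + 1) * sample / Z(w), where Z(w) is the word's corpus frequency
    return (math.sqrt(z / sample) + 1) * sample / z

print(keep_prob(1e-6))   # a rare word like "peanut": probability > 1, i.e. always kept
print(keep_prob(0.05))   # a very frequent word: kept only about 16% of the time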

3. Negative sampling

Training a neural network means feeding in training samples and constantly adjusting the neuron weights so that the prediction of the target keeps improving. Every time the network is trained on one sample, all of its weights get adjusted.

Therefore, the size of the dictionary means our Skip-Gram network has huge weight matrices, and all of these weights would have to be adjusted by hundreds of millions of training samples. This is very computationally expensive, and in practice training would be extremely slow.

Negative sampling solves this problem. It is a technique for speeding up training and improving the quality of the resulting word vectors. Instead of updating all the weights for every training sample, negative sampling updates only a small fraction of the weights per training sample, which reduces the amount of computation in gradient descent.

When we use the training sample (input word: "fox", output word: "quick") to train our network, both "fox" and "quick" are one-hot encoded. If our dictionary size is 10000, then at the output layer we expect the neuron corresponding to the word "quick" to output 1 and the remaining 9999 neurons to output 0. The words corresponding to those 9999 neurons that we expect to output 0 are called "negative" words.

When using negative sampling, we will randomly select a small portion of negative words (for example, select 5 negative words) to update the corresponding weights. We will also update the weight of our "positive" word (in our example above, this word refers to "quick").

PS: In the paper, the authors point out that for small data sets it is better to choose 5-20 negative words, while for large data sets 2-5 negative words are enough.

We use "unigram distribution" to select "negative words". The probability of each word being selected as a negative sample is related to its frequency of appearance. The higher the frequency of occurrence, the easier it is to be selected as negative words.

The formula for calculating the probability of each word being selected as "negative words":

P(ωi) = f(ωi)^(3/4) / Σj f(ωj)^(3/4)

Here f(ωi) is the frequency of word ωi, and raising the frequency to the 3/4 power is purely empirical.

In the code implementation of negative sampling, the unigram table is an array with 100 million elements, filled with the index of each word in the vocabulary. The array contains repetitions, i.e. some words appear many times. How many times does each word's index appear in the array? It follows from the formula above: the computed negative-sampling probability times 100 million gives the number of times the word appears in the table.

With this table, every time we need a negative sample we simply generate a random number between 0 and 100 million and take the word whose entry sits at that index in the table as our negative word. The larger a word's negative-sampling probability, the more often its index appears in the table, and the more likely it is to be selected.
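
A rough sketch of how such a unigram table could be built and sampled. build_unigram_table is a hypothetical helper (not the actual word2vec code), and a much smaller table size is used here for illustration:

import random

def build_unigram_table(counts, table_size):
    # each word fills a share of the table proportional to count**0.75 / sum(count**0.75)
    total = sum(c ** 0.75 for c in counts.values())
    table = []
    for word, c in counts.items():
        table.extend([word] * int(round(c ** 0.75 / total * table_size)))
    return table

table = build_unigram_table({"the": 1000, "fox": 30, "sasquatch": 1}, table_size=1000)
negative_words = [random.choice(table) for _ in range(5)]   # draw 5 negative words
print(negative_words)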

5.1.3 Hierarchical Softmax

1. Huffman tree

Input: n nodes with weights (w1, w2,...wn)

Output: the corresponding Huffman tree

  1. Treat (w1, w2, ... wn) as a forest of n trees, each with a single node
  2. Select the two trees in the forest whose root weights are smallest and merge them into a new tree; the two trees become the left and right subtrees of the new tree, and the weight of the new root is the sum of the two root weights
  3. Remove the two merged trees from the forest and add the new tree to the forest
  4. Repeat steps 2 and 3 until only one tree remains in the forest

Below we use a concrete example to illustrate the process of building a Huffman tree. We have 6 nodes (a, b, c, d, e, f) with weights (16, 4, 8, 6, 20, 3).

First, b and f have the smallest weights and are merged; the root of the new tree has weight 7. The forest now contains 5 trees with root weights 16, 8, 6, 20, and 7. Next, the two smallest root weights, 6 and 7, are merged into a new subtree, and so on, until we finally obtain the Huffman tree below.

[Figure: the Huffman tree built from the weights (16, 4, 8, 6, 20, 3)]

So what are the benefits of a Huffman tree? After building it, we usually assign a Huffman code to each leaf node. Leaves with higher weights are closer to the root and leaves with lower weights are farther away, so high-weight nodes get shorter codes and low-weight nodes get longer codes. This guarantees that the weighted path length of the tree is minimal, which matches the information-theoretic intuition that more commonly used words should have shorter codes. How are the codes assigned? For the nodes of a Huffman tree (except the root), we can agree that going to the left subtree is coded 0 and going to the right subtree is coded 1. In the figure above, the code of c is then 00.

In word2vec, the convention is the opposite of the example above: the left subtree is coded 1 and the right subtree is coded 0. It is also stipulated that the weight of the left subtree is not less than the weight of the right subtree.
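
The merging procedure above is easy to reproduce in code. The sketch below uses the left=0 / right=1 convention of the example (not word2vec's reversed convention) and a heap to always merge the two lightest trees; with the weights from the example, c indeed receives the code 00:

import heapq

def huffman_codes(weights):
    # weights: dict symbol -> weight; returns dict symbol -> Huffman code (left=0, right=1)
    heap = [(w, i, sym) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # lightest tree
        w2, _, right = heapq.heappop(heap)   # second lightest tree
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))  # merged tree
        next_id += 1
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # internal node: recurse into both subtrees
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code               # leaf: record the accumulated code
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 16, "b": 4, "c": 8, "d": 6, "e": 20, "f": 3}))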

For more details, refer to: Huffman tree principle

2. Hierarchical Softmax process

In order to avoid calculating the softmax probability of every word, word2vec uses a Huffman tree to replace the mapping from the hidden layer to the output softmax layer.

Building the Huffman tree:

  • Build a Huffman tree based on the labels and their frequencies (the higher the frequency of a label, the shorter its path in the Huffman tree)
  • Each leaf node in the Huffman tree represents a label

[Figure: Huffman tree used for hierarchical Softmax]

As shown in the figures:

[Figures: the hierarchical Softmax probability calculation along the Huffman tree path]

Note: θ here is an undetermined parameter attached to each internal node; its iterative update formula is obtained by writing down the likelihood along the path and maximizing it (taking derivatives and solving).

[Figure: the iterative update formula for θ]
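
To make the idea concrete, here is a minimal sketch of how the probability of a word is computed under hierarchical Softmax: one binary (sigmoid) decision per internal node on the word's Huffman path. The variable names are illustrative, and whether a code bit of 1 maps to σ(x·θ) or to 1-σ(x·θ) is just a convention that differs between write-ups:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_probability(hidden, path_thetas, path_code):
    # hidden: the hidden-layer vector for the context
    # path_thetas: list of θ vectors, one per internal node on the word's Huffman path
    # path_code: the word's Huffman code (one 0/1 bit per internal node)
    prob = 1.0
    for theta, bit in zip(path_thetas, path_code):
        s = sigmoid(np.dot(hidden, theta))
        prob *= s if bit == 1 else 1.0 - s   # one binary classification per node
    return prob

# e.g. a word whose path has two internal nodes and code "01":
p = hs_word_probability(np.random.rand(300), [np.random.rand(300), np.random.rand(300)], [0, 1])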

Use gensim to train word2vec

from gensim.models.word2vec import Word2Vec

# sentences: an iterable of tokenized texts; note that in gensim >= 4.0 the `size` argument is renamed `vector_size`
model = Word2Vec(sentences, workers=num_workers, size=num_features)
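
The trained vectors can then be looked up from model.wv, for example to build an embedding matrix for the models below. This is only a sketch: vocab is assumed to be the competition vocabulary built during preprocessing.

import numpy as np

embedding = np.zeros((len(vocab), num_features), dtype=np.float32)
for idx, word in enumerate(vocab):
    if word in model.wv:                 # words missing from word2vec keep an all-zero row
        embedding[idx] = model.wv[word]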

reference:

  1. CS224n Note 2 Vector representation of words: word2vec
  2. Stanford University Deep Learning and Natural Language Processing Lecture 2: Word Vector
  3. (Stanford CS224d) Deep Learning and NLP Course Notes (3): Evaluation of GloVe and Model
  4. Principle of word2vec (3) Model based on Negative Sampling
  5. A detailed explanation of the Skip-Gram model of Word2vec (structure)

5.2 TextCNN

TextCNN uses a CNN (convolutional neural network) for text feature extraction. Convolution kernels of different sizes extract n-gram features; max pooling over each feature map keeps only the largest feature value, and the pooled outputs are then concatenated into a single vector that serves as the text representation.

Here, following the setup of the original TextCNN, we use 100 convolution kernels for each of the kernel sizes 2, 3, and 4, so the final text vector has 100*3 = 300 dimensions.

5.3 TextRNN

TextRNN uses an RNN (recurrent neural network) for text feature extraction. Since text is itself a sequence, an LSTM is a natural fit for modeling it. TextRNN feeds the word vector of each word in the sentence into a bidirectional two-layer LSTM in turn, and concatenates the hidden states at the last valid position in the two directions into a single vector as the text representation.

[Figure: TextRNN architecture (bidirectional LSTM over the word vectors)]

5.4 Text representation based on TextCNN and TextRNN

TextCNN

  • Model building
self.filter_sizes = [2, 3, 4]  # n-gram window sizes
self.out_channel = 100  # number of kernels per size
# one Conv2d per kernel size; each kernel spans (filter_size words) x (the full embedding width)
self.convs = nn.ModuleList([nn.Conv2d(1, self.out_channel, (filter_size, input_size), bias=True) for filter_size in self.filter_sizes])
  • Forward propagation
pooled_outputs = []
# batch_embed: sen_num x 1 x sent_len x input_size (a single input channel)
for i in range(len(self.filter_sizes)):
    filter_height = sent_len - self.filter_sizes[i] + 1  # output height after convolution
    conv = self.convs[i](batch_embed)
    hidden = F.relu(conv)  # sen_num x out_channel x filter_height x 1

    mp = nn.MaxPool2d((filter_height, 1))  # (filter_height, filter_width)
    # sen_num x out_channel x 1 x 1 -> sen_num x out_channel
    pooled = mp(hidden).reshape(sen_num, self.out_channel)

    pooled_outputs.append(pooled)
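
The snippet above stops at the pooled outputs; the final text representation described in 5.2 would then be their concatenation, continuing the forward pass with something like (sketch):

reps = torch.cat(pooled_outputs, dim=1)   # sen_num x (out_channel * len(filter_sizes)) = sen_num x 300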

TextRNN

  • Model building
input_size = config.word_dims  # dimension of the input word vectors

# LSTM here is a custom bidirectional LSTM wrapper used by the baseline (not torch.nn.LSTM):
# it additionally supports input/output dropout and takes a mask in its forward pass
self.word_lstm = LSTM(
    input_size=input_size,
    hidden_size=config.word_hidden_size,
    num_layers=config.word_num_layers,
    batch_first=True,
    bidirectional=True,
    dropout_in=config.dropout_input,
    dropout_out=config.dropout_hidden,
)
  • Forward propagation
hiddens, _ = self.word_lstm(batch_embed, batch_masks)  # sent_len x sen_num x hidden*2
hiddens.transpose_(1, 0)  # sen_num x sent_len x hidden*2

if self.training:
    # dropout with a mask shared across the sequence (custom helper, not a torch built-in)
    hiddens = drop_sequence_sharedmask(hiddens, self.dropout_mlp)
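
The snippet again stops at the masked hidden states. One way to obtain the sentence vector described in 5.3 (the forward state at the last valid position spliced with the backward state at position 0) is sketched below, assuming hiddens is sen_num x sent_len x hidden*2 and batch_masks is sen_num x sent_len with 1 marking real tokens; the actual baseline may pool differently:

lengths = batch_masks.sum(dim=1).long()                      # number of real tokens per sentence
half = hiddens.size(-1) // 2                                 # hidden size of one direction
rows = torch.arange(hiddens.size(0), device=hiddens.device)
fwd_last = hiddens[rows, lengths - 1, :half]                 # forward direction at the last valid step
bwd_first = hiddens[rows, 0, half:]                          # backward direction at the first step
sent_reps = torch.cat([fwd_last, bwd_first], dim=1)          # sen_num x hidden*2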

Use HAN for text classification

The Hierarchical Attention Network for Document Classification (HAN) is based on hierarchical attention: it encodes the document at the word level and the sentence level, uses attention to obtain the document representation, and then classifies it with Softmax. The word-level encoder, whose job is to produce sentence representations, can be replaced with the TextCNN or TextRNN described in this section, or with the BERT described in the next section.

[Figure: HAN architecture (word-level and sentence-level encoders with attention)]
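
As a sketch of the attention part (not the exact baseline code), a word-level or sentence-level attention pooling layer can look like the module below; hidden_size and the masking convention are assumptions:

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    # additive attention pooling in the spirit of HAN: score every position,
    # softmax over the masked sequence, and return the weighted sum
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.query = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hiddens, masks):
        # hiddens: batch x seq_len x hidden_size, masks: batch x seq_len (1 = real token)
        scores = self.query(torch.tanh(self.proj(hiddens))).squeeze(-1)   # batch x seq_len
        scores = scores.masked_fill(masks == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return (hiddens * weights.unsqueeze(-1)).sum(dim=1)               # batch x hidden_size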
