A summary of practical text classification tricks

Table of Contents

Foreword

About the tokenizer

About Chinese character vectors

If the dataset is very noisy

CNN or RNN for the baseline? Follow the CNN route or the RNN route?

Where to add dropout

About binary classification

About multi-label classification

What to do if the classes are imbalanced

The don't-overthink-it series

I don't want to use tricks, I just want a good result


Foreword

A year ago, Xiao Xi asked a question like this on Zhihu:

Are there any tricks in text classification that are rarely mentioned in papers but have an important impact on performance?
Link: https://www.zhihu.com/question/265357659/answer/578944550

At the time I was grinding away at a rather interesting task and found that odd little tricks could bring large performance gains. On top of that, in order to verify a small idea, I ran a pile of public text classification datasets; the idea itself wasn't brilliant, but along the way I accumulated and explored a lot of performance-squeezing tricks ╮( ̄▽ ̄"")╭ Later I used these tricks in quite a few related competitions (even on special text classification problems such as text matching) and found that a baseline plus a pile of tricks plus simple ensembling could casually reach the top 10, even the top 3, of an easy text classification competition. That is when I really felt the importance of tuning and tricks.

However, I haven't touched the basic problem of text classification for quite a while and feel I have nearly forgotten it all, so while some fuzzy memory remains, let me sort these tricks out and share them with everyone, in the hope that they provide some help or inspiration.

First, a conclusion: with tricks used well and tuning done deftly, even TextCNN can beat most of the fancy deep models; with tricks used badly, even a SOTA model can perform so poorly that you start doubting your life. What follows is in no particular order of importance and follows no particular logic, so let's just start.

About the tokenizer

Whether the corpus is Chinese or English, the first unavoidable question when you pick up a dataset is whether to do word segmentation, and right after that comes the struggle of choosing a tokenizer.

Passerby C: Our in-house tokenizer beats every open-source tokenizer out there
Xiao Xi: Fine, you may step down now

First, a question: **is it really true that the fancier the segmentation algorithm, the better the downstream task performs?** Many people overlook one thing at this step: **the word vectors!!!** Compared with how advanced the segmentation algorithm itself is, once your network uses pre-trained word vectors, **making sure the token granularity of the tokenizer matches the word vector table is far more important!** After all, no matter how beautifully a word is segmented, if it is not in the word vector table it becomes OOV and the nice segmentation is wasted ╮( ̄▽ ̄"")╭ (unless you don't mind writing extra code to give special treatment to tokens that are OOV relative to the vector table; I usually find that too troublesome ╮(╯▽╰)╭). So there are two cases.

1. **The tokenizer behind the pre-trained word vectors is known.** Official releases of pre-trained vectors such as word2vec, GloVe, and fastText generally publish information about the training corpus, including preprocessing strategies such as word segmentation. Nothing could be better: no struggling required. If you have decided to use a certain set of word vectors, use the tokenizer that was used to train them! That tokenizer will do better on the downstream task than other, however fancy, tokenizers.

2. **The tokenizer behind the pre-trained word vectors is unknown.** Then you have to "guess" it. How? After getting the vector table, search it for some characteristic tokens — URLs, email addresses, idioms, person names, and in English things like "n't" — and see at what granularity they were split when the vectors were trained. Then run a few candidate tokenizers and see which one comes closest. If you are still not sure, plug them into the downstream task and compare.

Of course, the ideal situation is to first determine the segmentation that best suits the current task's dataset and then use pre-trained word vectors produced with exactly that segmentation. Unfortunately, there are not that many open word vector releases to choose from, so training your own word vectors on the downstream training set, or on a large identically distributed unsupervised corpus, is clearly better for squeezing out more performance. However, how to pre-train word vectors that actually help the current task is enough material for another article, so I won't go into it here; I'll write it up later~ (follow Xiao Xi so you don't miss it!)

Besides the tokenizer, the case convention and the definition of OOV must also match the word vector table. If the vector table is case sensitive but you lowercase every word in the downstream task, then don't even think about it: you are guaranteed to lose several percentage points.
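A quick way to check the granularity match is to measure the OOV rate each candidate tokenizer produces against the vector table. Below is a minimal sketch; `load_vocab`, the file-format assumption, and the toy corpus are placeholders rather than anything from the original post.

```python
# Minimal sketch: estimate how well a candidate tokenizer matches a
# pre-trained word vector table by measuring the OOV rate it produces.
# `load_vocab`, `tokenize`, and the file path are placeholders.

def load_vocab(vec_path):
    """Read the token column of a word2vec/GloVe-style text file."""
    vocab = set()
    with open(vec_path, encoding="utf-8") as f:
        for line in f:
            vocab.add(line.split(" ", 1)[0])  # skip a header line manually if present
    return vocab

def oov_rate(sentences, tokenize, vocab):
    """Fraction of tokens produced by `tokenize` that are missing from `vocab`."""
    total, oov = 0, 0
    for sent in sentences:
        for tok in tokenize(sent):
            total += 1
            oov += tok not in vocab
    return oov / max(total, 1)

# Toy usage with a whitespace "tokenizer"; lower is better.
corpus = ["an example sentence", "another one"]
vocab = {"an", "example", "sentence", "another"}
print(oov_rate(corpus, str.split, vocab))
```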

About Chinese character vectors

Passerby: Segmentation is too much trouble, I won't use words at all, I'm going with character vectors.
Xiao Xi: Don't run off ( ̄∇ ̄)

If you really use char-level as the main representation, then don't forget to pre-train the Chinese character vectors! And remember to use a larger window during pre-training; don't just copy the word-level window size. The other pre-training hyperparameters can be set fairly casually, and the result will definitely beat randomly initialized character vectors.
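For instance, a minimal pre-training sketch with gensim (assuming gensim 4.x; the toy corpus, dimensions, and window size are illustrative placeholders, not the author's settings):

```python
# Minimal sketch: pre-train Chinese character vectors with a wider
# window than the usual word-level setting.
from gensim.models import Word2Vec

raw_sentences = ["今天天气不错", "文本分类技巧总结"]   # toy corpus
char_sentences = [list(s) for s in raw_sentences]      # char-level tokens

model = Word2Vec(
    sentences=char_sentences,
    vector_size=200,   # embedding dimension (placeholder)
    window=10,         # wider than the typical word-level window of ~5
    min_count=1,
    sg=1,              # skip-gram
    epochs=10,
)
print(model.wv.most_similar("天"))
```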

If the dataset is very noisy

Noise can be severe in two ways. For a dataset D(X, Y), either X itself is very noisy (for example the text is highly colloquial, or was generated by the broad masses of internet users), or Y is very noisy (some samples are obviously mislabeled, some are hard to assign to any single category, and the categories themselves may even be ambiguous).

For the first kind of noise, a natural idea is to run a language model or edit-distance-based text correction. But because of proper nouns and far more "false noise" than you would imagine, this often does not work very well in real scenarios. Xiao Xi has two general ideas here. The first is to switch the model input directly to char-level (character granularity for Chinese) and train from scratch (without pre-trained word vectors), then compare against word-level. If char-level is clearly better, then for the near term just keep digging at char-level~

What if char-level isn't obviously better, or you have already squeezed char-level dry and still want to try word-level? Don't worry, first buy Xiao Xi a lollipop ( ̄∇ ̄). A trick that works well but few people seem to know is to train word vectors with fastText using a special hyperparameter setting. What's special about it? Normally the char n-gram window in fastText for English takes values from 3 to 6, but for Chinese, if the goal is to absorb noise in the input, we can restrict this window to 1 to 2. Such a small window helps the model capture typos (think about it: when we type a wrong character, we usually swap in a character with similar pronunciation or shape). For example, the nearest neighbor of "seems" learned by word2vec may be "looks like", whereas the nearest neighbors learned by fastText with a small n-gram window are very likely words that contain internal typos, such as "Yihu". That way, words containing a moderate number of typos get pulled back together, which even counteracts, to some extent, the noise introduced by the tokenizer (one word being cut into several pieces). Of course, if the dataset is very clean, training word vectors this way may well backfire.

As for the second kind of noise (noise in Y), a very direct idea is label smoothing, but having used it many times in practice, Xiao Xi found the effect is not very obvious. The trick I finally settled on is this: first ignore the noise and train the model as well as you possibly can; then run the trained model over the training set and the development set, and pull out the training samples the model gets wrong plus the development samples it gets wrong with high confidence (for example, predicting a sample labeled 0 as class 1 with 99% certainty); then do bad-case analysis on them. If the labeling errors show a strong regularity, correct them in batch with a script (just make sure the corrected labels are clearly more accurate than before the correction). If there is no regularity, but most of these high-confidence samples really are mislabeled, simply delete them~ this often buys a small performance gain too; after all, the test set is also human-annotated, and hard samples and mislabeled samples won't be that common there.
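A minimal sketch of that fastText setting with gensim (again assuming gensim 4.x; the toy corpus and every hyperparameter except the 1-to-2 char n-gram window are placeholders):

```python
# Minimal sketch: train fastText word vectors with a tiny char n-gram
# window (1-2) so that words differing only by a typo end up close together.
from gensim.models import FastText

sentences = [["文本", "分类", "技巧"], ["似乎", "好像", "一样"]]  # toy word-segmented corpus

model = FastText(
    sentences=sentences,
    vector_size=200,
    window=5,
    min_count=1,
    min_n=1,   # char n-gram lower bound
    max_n=2,   # char n-gram upper bound (vs. the usual 3-6 for English)
    epochs=10,
)
print(model.wv.most_similar("似乎"))
```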

CNN or RNN for the baseline? Follow the CNN route or the RNN route?

In text classification, really don't agonize over this. I personally prefer CNN, mainly because it runs fast. . . so you can afford to run a few more experiments. And in my experience a basic CNN model like TextCNN is not only particularly easy to implement, it also easily becomes a strong baseline on a dataset (unless the classification task is genuinely hard), so spending an hour or two on this baseline before moving on to other models is not a waste~ it also helps you correct the overall direction early. If you insist on an objective way to decide, spend an hour looking at the dataset first~ If you feel that many strong n-grams can directly determine the correct label, start with CNN. If you feel that many cases are the kind where you have to read the whole sentence two or three times before you can get the label right, start with RNN. Of course, if the data is large and you have a GPU, you can also try Transformer. If you have even more time, run both the CNN and RNN models and ensemble them.
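For reference, a minimal TextCNN-style baseline sketched with Keras; the vocabulary size, sequence length, filter sizes, and class count are placeholder values, not recommendations:

```python
# Minimal TextCNN-style baseline sketch (Keras). All sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMB_DIM, NUM_CLASSES = 30000, 128, 200, 5

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# One convolution branch per filter size, each followed by global max pooling.
pooled = []
for ksize in (3, 4, 5):
    conv = layers.Conv1D(128, ksize, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

x = layers.Concatenate()(pooled)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```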

Where to add dropout

After the word embedding layer, after the pooling layer, and after the FC (fully connected) layer. In the early stage keep the dropout probabilities identical everywhere; fine-tuning them separately is something for the very end, if you ever have the time (you never do). As for the word dropout strategy some people like to promote (randomly masking some tokens to [PAD], or to 0; note that this is not equivalent to adding dropout to the embedding layer), try it last if you have time: in my tests, once regular dropout is well tuned, it generally doesn't contribute much.
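To make the placement concrete, here is a sketch in Keras plus a small word-dropout helper; the rates, sizes, and PAD id are placeholder choices:

```python
# Sketch of the three dropout placements plus a word-dropout helper.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

PAD_ID = 0

def word_dropout(token_ids, drop_prob=0.1, pad_id=PAD_ID):
    """Randomly replace tokens with PAD (not the same as embedding dropout)."""
    token_ids = np.asarray(token_ids)
    mask = np.random.rand(*token_ids.shape) < drop_prob
    return np.where(mask, pad_id, token_ids)

noisy = word_dropout([[5, 8, 2, 0], [7, 7, 9, 0]])  # toy usage

inputs = layers.Input(shape=(128,), dtype="int32")
x = layers.Embedding(30000, 200)(inputs)
x = layers.Dropout(0.5)(x)                       # after the embedding layer
x = layers.Conv1D(128, 3, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)                       # after the pooling layer
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)                       # after the FC layer
outputs = layers.Dense(5, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```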

About binary classification

Does a binary classification problem have to use sigmoid as the output activation? Of course not: try a two-class softmax. Perhaps the extra branch simply carries a little more information. Although the latter is a bit uglier mathematically, in practice it often brings a bit of improvement, which is also rather metaphysical.
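The two output heads side by side, as a Keras sketch (the 128-dimensional encoder output is a stand-in for whatever your model actually produces):

```python
# Sketch: two output heads for binary classification.
import tensorflow as tf
from tensorflow.keras import layers

encoded = layers.Input(shape=(128,), name="encoder_output")  # placeholder features

# Option 1: single sigmoid unit + binary cross-entropy.
sigmoid_out = layers.Dense(1, activation="sigmoid")(encoded)
model_a = tf.keras.Model(encoded, sigmoid_out)
model_a.compile(optimizer="adam", loss="binary_crossentropy")

# Option 2: two-way softmax + (sparse) categorical cross-entropy.
softmax_out = layers.Dense(2, activation="softmax")(encoded)
model_b = tf.keras.Model(encoded, softmax_out)
model_b.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```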

About multi-label classification

If a sample can carry several labels at once, and the labels even form a DAG (directed acyclic graph), don't panic: first train a baseline with binary cross-entropy, that is, turn each category into an independent binary classification problem, so a multi-label problem over N categories becomes N binary problems. TensorFlow has a ready-made API for this, tf.nn.sigmoid_cross_entropy_with_logits, so the implementation cost is tiny. Then you may be pleasantly surprised to find that once the baseline is done, the multi-label part is not a big deal and the DAG issue has basically resolved itself (even though the model does nothing special about it), and you can carry on with confidence. What if the problem isn't solved? Go read the papers ╮( ̄▽ ̄"")╭ Xiao Xi has never run into a dataset that was that hard in this respect.
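A minimal sketch of that baseline loss; the logits and multi-hot labels below are random placeholders standing in for real model outputs and annotations:

```python
# Sketch: multi-label baseline loss with one sigmoid per class.
import tensorflow as tf

NUM_CLASSES = 6
logits = tf.random.normal([4, NUM_CLASSES])              # raw model outputs (no sigmoid)
labels = tf.constant([[1, 0, 1, 0, 0, 0],
                      [0, 1, 0, 0, 1, 1],
                      [0, 0, 0, 1, 0, 0],
                      [1, 1, 0, 0, 0, 0]], dtype=tf.float32)  # multi-hot targets

per_class = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_class)    # average over classes and batch
probs = tf.sigmoid(logits)          # threshold at e.g. 0.5 for predictions
print(loss.numpy())
```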

What to do if the classes are imbalanced

As the internet tells you, immediately reach for all kinds of upsampling, downsampling, and boosting strategies? No no. If the ratio of positive to negative samples is merely 9:1, just keep building your deep model and tuning your hyperparameters; once the model is done you will find that this level of imbalance is nothing to it, and the decision threshold needs no manual adjustment at all. But! If you find that a batch often consists of samples from a single class, or that samples of some classes rarely appear even after many batches, then re-balancing is very, very necessary. For more on the class imbalance problem, portal -> [Xiao Xi Selected] How to solve the imbalanced classification problem elegantly and with style
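For that second situation, a simple class-balanced batch sampler is often enough. A minimal sketch in plain numpy; the toy labels and batch size are placeholders:

```python
# Sketch: a class-balanced batch sampler for the case where ordinary
# shuffling leaves some classes missing from most batches.
import numpy as np

def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches that draw (roughly) evenly from every class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    by_class = {c: np.where(labels == c)[0] for c in classes}
    per_class = max(batch_size // len(classes), 1)
    n_batches = len(labels) // batch_size
    for _ in range(n_batches):
        batch = np.concatenate([
            rng.choice(by_class[c], size=per_class, replace=True)
            for c in classes
        ])
        rng.shuffle(batch)
        yield batch

labels = np.array([0] * 90 + [1] * 10)   # toy 9:1 imbalance
for batch in balanced_batches(labels, batch_size=8):
    pass  # feed X[batch], labels[batch] to the model
```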

The don't-overthink-it series

  1. Don't agonize over whether the text truncation length should be 120 or 150.

  2. Don't agonize over small development-set gains from hyperparameters the model is insensitive to.

  3. Don't agonize over whether to initialize the embeddings of out-of-vocabulary words to all zeros or randomly; just don't let them share an embedding with PAD.

  4. Don't agonize over whether the optimizer should be Adam or MomentumSGD. If you and SGD are not on close terms, just use Adam without thinking, and run a few final rounds with MomentumSGD at the end.

I don't want to use tricks, I just want a good result

Get to know BERT. The end. That's all I can remember for now; any remaining tricks I recall later will be posted as updates on Zhihu, portal:

https://www.zhihu.com/question/265357659/answer/578944550

That said, Xiao Xi has shared this many tricks with you; friends, do you have any secret tricks of your own to share with Xiao Xi in the comments? ( ̄∇ ̄)



Source: blog.csdn.net/xixiaoyaoww/article/details/105460364