Chinese Pre-training Model Generalization Ability Challenge (Part 2): Top score skills

1. Ideas for improving the competition score

  1. Modify calculate_loss.py to change how the loss is computed, starting from balancing the difficulty of the subtasks and the uneven class distribution within each subtask;
  2. Modify net.py to change the model structure, e.g. add an attention layer or other layers;
  3. Use tools such as cleanlab to clean the training text;
  4. Apply text data augmentation, or pre-train on additional data sets;
  5. After training, continue training the model for one epoch with a small learning rate on the complete data set (training set plus validation set);
  6. Adjust batchSize and a_step to change the degree of gradient accumulation; currently batchSize=16 and a_step=16 (see the sketch after this list);
  7. Use chinese-roberta-wwm-ext as the pre-trained model.
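
  A minimal sketch of the gradient-accumulation setup from item 6, assuming a standard PyTorch training loop; `model`, `train_loader`, `optimizer` and `criterion` are placeholders, not the competition baseline's actual objects.

```python
# Gradient accumulation: batchSize=16 with a_step=16 gives an effective batch of 256.
import torch

a_step = 16  # number of mini-batches to accumulate before one optimizer step

def train_one_epoch(model, train_loader, optimizer, criterion, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        loss = criterion(model(inputs), labels)
        (loss / a_step).backward()       # scale so gradients average over a_step mini-batches
        if (step + 1) % a_step == 0:
            optimizer.step()             # one parameter update per a_step mini-batches
            optimizer.zero_grad()
```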

2. Improvements to the BERT model

  For an introduction to the related BERT papers, see: https://blog.csdn.net/weixin_42691585/article/details/108956159

Four improvements to the BERT model are proposed, with the goal of achieving better results with fewer parameters and faster pre-training.

  1. Relative position embedding. The relative position of words matters far more for understanding natural language than their absolute position, yet the model needs many extra parameters and much extra time to learn absolute position information. The BERT model considers both absolute and relative position: in theory it has to capture, at every position, the influence of every word on every word at every other position, a number of distribution patterns determined by both the vocabulary size and the sequence length. If only relative position is considered, the model needs to learn far fewer patterns, which reduces the parameter count and training time to some extent and can improve generalization. Relative position embedding therefore lets the model ignore absolute position information and retain the more important relative position information (a sketch follows this list).

  2. Independent bidirectional multi-head self-attention. The multi-head self-attention layer is the core structure of the BERT model and plays a decisive role in extracting sentence meaning accurately. Word order should be taken into account in the interaction between words: the meaning of word B should differ depending on whether word A appears before or after it. In the BERT model, however, this effect is not realized at the word level; it is achieved only indirectly through absolute position embeddings. This design forces the model to learn redundant relationships between absolute positions and words, which leads to a large number of parameters and a long training time. To capture this key information directly, an independent bidirectional multi-head self-attention mechanism is used: two sets of "heads" independently process the preceding and following context, so that word A is represented by a different vector depending on whether it appears before or after word B (see the sketch after this list).

  3. Hierarchical densely connected network. Dense connection means establishing additional data paths between different network layers. This lets gradients propagate better to the deep layers, improves the utilization of the features extracted by each layer, and makes the overall loss surface "smoother", so the network can be optimized faster and better. In the BERT model there is only a residual connection inside each attention layer, with no additional data paths between layers; the network is deep, gradient propagation is difficult, and training takes a long time. In addition, the features extracted by each attention layer can only be used by the next layer, which is inefficient. Hierarchical dense connections between the attention layers are therefore introduced to reduce training time and improve the utilization of the features extracted by each attention layer, thereby improving model performance (see the sketch after this list).

  4. New pre-training task: out-of-order judgment. The BERT model's ability to capture word order is not well explained, and the position embedding mechanism is not very convincing. To make the model more sensitive to word order, a pre-training task of out-of-order judgment is proposed: randomly shuffle part of the word order in a sentence or keep it unchanged, and let the model judge whether the word order of the sentence is reasonable (a sketch of building such training examples follows).
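
  For item 1, a minimal sketch of a relative-position bias in the style of Shaw et al. (2018), added to the attention logits in place of absolute position embeddings. It illustrates the general idea only, not the thesis' exact formulation.

```python
# Learnable bias per relative offset and per head, added to attention scores.
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_len: int, num_heads: int):
        super().__init__()
        # one learnable bias for each relative offset in [-(max_len-1), max_len-1], per head
        self.bias = nn.Embedding(2 * max_len - 1, num_heads)
        self.max_len = max_len

    def forward(self, seq_len: int) -> torch.Tensor:
        # assumes seq_len <= max_len
        pos = torch.arange(seq_len, device=self.bias.weight.device)
        rel = pos[None, :] - pos[:, None] + self.max_len - 1  # offsets shifted to be non-negative
        return self.bias(rel).permute(2, 0, 1)                # (num_heads, L, L), added to attention logits
```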
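
  For item 2, a hedged sketch of an "independent two-way" attention block: one multi-head attention pass restricted to preceding tokens and one restricted to following tokens, concatenated so a word's representation depends on whether its context appears before or after it. The layer names and the output projection are illustrative assumptions, not the thesis' code.

```python
import torch
import torch.nn as nn

class TwoWayAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.fwd = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.bwd = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, L, d_model)
        L = x.size(1)
        # boolean masks: True = position may NOT be attended to
        mask_future = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        mask_past = torch.tril(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=-1)
        back, _ = self.fwd(x, x, x, attn_mask=mask_future)  # attends only to preceding tokens and self
        ahead, _ = self.bwd(x, x, x, attn_mask=mask_past)   # attends only to following tokens and self
        return self.proj(torch.cat([back, ahead], dim=-1))
```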
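
  For item 3, a sketch of dense connections between encoder layers: each layer receives the sum of all earlier layers' outputs (a simple variant; DenseNet proper concatenates), giving gradients and features extra paths into the deep layers. The layer internals are stand-ins.

```python
import torch
import torch.nn as nn

class DenselyConnectedEncoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        # e.g. layers = nn.ModuleList(nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(4))
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [x]
        for layer in self.layers:
            # dense connection: each layer sees the sum of all previous outputs, not just the last one
            dense_input = torch.stack(outputs, dim=0).sum(dim=0)
            outputs.append(layer(dense_input))
        return outputs[-1]
```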
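
  For item 4, a sketch of how training examples for the out-of-order judgment task could be built: randomly scramble a span of tokens (label 1) or keep the sentence intact (label 0), then train a binary classifier on top of the encoder. The parameter names and ratios are illustrative.

```python
import random

def make_word_order_example(tokens, shuffle_prob=0.5, span_ratio=0.3):
    tokens = list(tokens)
    if random.random() < shuffle_prob and len(tokens) > 3:
        span_len = max(2, int(len(tokens) * span_ratio))
        start = random.randrange(0, len(tokens) - span_len + 1)
        span = tokens[start:start + span_len]
        random.shuffle(span)                    # scramble only the chosen span
        tokens[start:start + span_len] = span
        return tokens, 1                        # 1 = word order disturbed
    return tokens, 0                            # 0 = original order kept
```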

3. Other improvement ideas

  1. Combine the advantages of the BERT and GRU models to obtain an improved multi-standard Chinese word segmentation model. The traditional multi-standard segmentation model uses only a Bi-LSTM, whose training time grows as the data set grows, so the simpler Bi-GRU model is used to speed up training. At the same time, to extract richer semantic features from the text, the BERT pre-trained model is added as a semantic feature extraction layer, improving segmentation quality. Experiments on the resulting model show improvements in both training time and segmentation quality, demonstrating the effectiveness of these changes (see the sketch after this list).

  2. Add a mixed-domain attention module from computer vision to the short-text classification model to obtain an improved short-text classifier. The convolutional neural network in traditional short-text classification treats all features equally when extracting them. To strengthen the model's ability to extract key features, a mixed-domain attention module, borrowed from computer vision, is added to the short-text classification model. A controlled experiment against the original model shows that the mixed-domain attention module genuinely helps extract the key features of the text (see the sketch after this list).

  3. Apply a multi-channel mechanism in the hierarchical attention model to obtain a multi-channel hierarchical attention classification model suited to long Chinese texts. When the hierarchical attention model is used for Chinese text classification, long texts increase the probability of word segmentation errors and thus cause information loss, so a feature extraction channel over a character-granularity representation is added to compensate for the loss introduced by the word-granularity representation. A controlled experiment against the original model shows an improved classification effect, confirming that the features extracted after adding the character-granularity representation are more comprehensive (see the sketch after this list).

  4. By combining the above improvements in Chinese word segmentation and text classification, a simple mixed long/short Chinese text classification system was designed and implemented, and was used to test the impact of the two key improvements on the overall Chinese text classification effect.
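
  For improvement 1, a sketch of the BERT + Bi-GRU sequence-labeling architecture using the Hugging Face transformers API; the tag set (B/M/E/S), hidden sizes and the checkpoint name are placeholders, not the thesis' exact configuration.

```python
import torch.nn as nn
from transformers import BertModel

class BertBiGRUSegmenter(nn.Module):
    def __init__(self, num_tags: int = 4, hidden: int = 256,
                 pretrained: str = "hfl/chinese-roberta-wwm-ext"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)   # semantic feature extraction layer
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)   # per-token segmentation tags, e.g. B/M/E/S

    def forward(self, input_ids, attention_mask):
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        gru_out, _ = self.gru(hidden_states)                # (batch, L, 2*hidden)
        return self.classifier(gru_out)                     # per-token tag logits
```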
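
  For improvement 2, the thesis is not quoted in detail here; mixed-domain attention in vision typically combines channel attention and spatial attention (CBAM-style), so the sketch below applies such a module, as an assumption, to the 1-D feature maps of a text CNN. Dimensions and structure are illustrative.

```python
import torch
import torch.nn as nn

class MixedDomainAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # channel attention: pool over positions, learn one weight per channel
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # spatial attention: learn one weight per position from pooled channel statistics
        self.spatial_conv = nn.Sequential(
            nn.Conv1d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, length)
        chan = self.channel_mlp(x.mean(dim=2))            # (batch, channels)
        x = x * chan.unsqueeze(-1)                        # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)  # (batch, 2, length)
        return x * self.spatial_conv(pooled)              # re-weight positions
```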
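
  For improvement 3, a sketch of the multi-channel idea: encode the text at character granularity and at word granularity in separate channels, then fuse the two sentence vectors before classification. Plain Bi-GRU encoders stand in for the hierarchical attention encoders of the thesis.

```python
import torch
import torch.nn as nn

class TwoChannelClassifier(nn.Module):
    def __init__(self, char_vocab: int, word_vocab: int, num_classes: int,
                 embed_dim: int = 200, hidden: int = 128):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, embed_dim)
        self.word_embed = nn.Embedding(word_vocab, embed_dim)
        self.char_enc = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.word_enc = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(4 * hidden, num_classes)

    def forward(self, char_ids, word_ids):
        # each channel: embed, encode, then mean-pool over time
        char_vec = self.char_enc(self.char_embed(char_ids))[0].mean(dim=1)
        word_vec = self.word_enc(self.word_embed(word_ids))[0].mean(dim=1)
        return self.fc(torch.cat([char_vec, word_vec], dim=-1))  # fuse both granularities
```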

4. Practical experience

  The gap between theory and practice is often large. Academic papers pay more attention to the novelty of model architecture design and, above all, to new ideas, while in practice what matters most is the effect in the deployment scenario; the points of attention and the methods are all different. This part briefly summarizes some lessons learned from actual projects.

  1. The model is clearly not the most important thing: it cannot be denied that good model design is essential to good results, and it is also the focus of academic attention, but in actual use the modelling work takes up relatively little of the time. Although the second part introduces five models (CNN/RNN and their variants), for a real Chinese text classification task a plain CNN is enough to achieve very good results; in our experiments RCNN improved accuracy by only about 1%, which is not very significant. The best practice is to first tune the overall task with a TextCNN model until it performs as well as possible, and only then try to improve the model (a minimal TextCNN sketch appears after this list).

  2. Understand your data: although a big advantage of deep learning is that it no longer requires cumbersome, inefficient manual feature engineering, if you treat the model as a black box you will inevitably end up doubting yourself. Be sure to understand your data, and remember that a feel for the data is always important, whether you use traditional methods or deep learning. Pay attention to bad-case analysis and understand whether your data is suitable and why the model is right or wrong.

  3. Pay attention to iteration quality, and record and analyze every experiment: iteration speed is key to the success or failure of an algorithm project, as anyone who has studied probability will readily agree. What matters is not only the speed of iteration but also its quality. If you do not build a routine for quick experimental analysis, then no matter how fast you iterate, you will only squander your company's precious computing resources. It is recommended to record every experiment, and the analysis should answer at least three questions: why run this experiment? what is the conclusion? what should the next experiment be?

  4. Hyperparameter tuning: hyperparameter tuning is the daily routine of every tuning engineer. "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification" is recommended; it contains comparative experiments on several hyperparameters.

  5. Be sure to use dropout: there are only two situations where you can skip it: the amount of data is very small, or you use a better regularization method, such as batch normalization. In practice we tried dropout with different rates and 0.5 worked best, so if your computing resources are very limited, the default 0.5 is a good choice.

  6. Fine-tuning is essential: as mentioned above, if you simply use word vectors trained by word2vec as a fixed feature representation, I bet you will lose a lot of performance.

  7. Softmax loss is not always required: it depends on your data. If your classes are not mutually exclusive, try training multiple binary classifiers instead, i.e. define the problem as multi-label rather than multi-class; after we made this change, accuracy still increased by more than 1% (see the sketch after this list).

  8. Class imbalance: it is a conclusion verified in many scenarios that if your loss is dominated by a few classes, the effect on the overall result is mostly negative. It is recommended to try a bootstrap-like method to adjust the sample weights in the loss (see the sketch after this list).

  9. Avoid training oscillation: by default, increase random sampling as much as possible so that the data distribution is close to i.i.d.; the default shuffle mechanism can make the training results more stable. If training is still very volatile, consider adjusting the learning rate or mini_batch_size.

  10. Don't draw conclusions before convergence: the best model is the one that wins in the end, especially for experiments that try some new angle. Don't dismiss an idea too easily; at least wait until it has converged.
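
  For point 1, a minimal TextCNN baseline in the style of Kim (2014); vocabulary size, embedding dimension and filter sizes are placeholders to tune.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size: int, num_classes: int,
                 embed_dim: int = 300, num_filters: int = 100,
                 kernel_sizes=(3, 4, 5), dropout: float = 0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids):                      # (batch, L)
        x = self.embedding(input_ids).transpose(1, 2)  # (batch, embed_dim, L)
        # convolve with several window sizes, max-pool over time, then concatenate
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(feats, dim=1)))
```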
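
  For point 7, a sketch of treating the task as multi-label rather than multi-class: an independent sigmoid/BCE loss per class instead of a single softmax. `logits` and `targets` are dummy tensors.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()              # applies a sigmoid internally, one binary loss per class

logits = torch.randn(8, 5)                      # (batch, num_classes), raw model outputs
targets = torch.randint(0, 2, (8, 5)).float()   # multi-hot labels: classes are not mutually exclusive
loss = criterion(logits, targets)

# at inference, threshold each class independently instead of taking an argmax
predictions = (torch.sigmoid(logits) > 0.5).int()
```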
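
  For point 8, one simple way to re-weight the loss when a few classes dominate: weight each class inversely to its frequency. The counts below are made up; bootstrap-style re-weighting of individual samples, as mentioned above, is another option.

```python
import torch
import torch.nn as nn

class_counts = torch.tensor([9000., 800., 200.])      # e.g. a heavily skewed label distribution
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)       # rare classes contribute more to the loss

logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = criterion(logits, labels)
```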

PS: My machine is too weak to run this to the end ε=(´ο`*))) alas

References

[1] Wang Nanti. Research on improved text representation model based on BERT [D]. Southwest University, 2019.
[2] Wang Xiaoming. Research on key technologies of Chinese text classification based on deep learning [D]. University of Electronic Science and Technology of China, 2020.
[3] https://zhuanlan.zhihu.com/p/25928551

Origin: https://blog.csdn.net/weixin_42691585/article/details/114107729