Notes from a beginner tuning the baseline: NLP Chinese pre-trained model generalization ability competition

Purpose

  
  A quick summary of the preliminary round: following the baseline model and the optimization directions provided by the Datawhale mentors, I tried to improve the baseline and raise my score. For a complete beginner this was really, really hard, so I'm recording the process here to push myself along; I'm still exploring. Thanks again to the mentors for their guidance!

Background

Personal configuration

  • Local graphics card: RTX 3070;
  • Currently renting two RTX 3090s to test the waters;
  • And a newbie tuning a baseline for the first time~

Challenge requirements

Optimization directions provided by the baseline

  1. Modify calculate_loss.py to change how the loss is calculated, starting from balancing the difficulty of the subtasks and the uneven class distribution within each subtask;
  2. Modify net.py to change the model structure, for example by adding an attention layer or other layers;
  3. Use tools such as cleanlab to clean the training text;
  4. Apply text data augmentation, or pre-train on additional datasets;
  5. Take the trained model and train it for one more epoch with a small learning rate on the complete dataset (training set plus validation set);
  6. Adjust batchSize and a_step to change the degree of gradient accumulation; currently batchSize=16, a_step=16 (see the sketch after this list);
  7. Use chinese-roberta-wwm-ext as the pre-trained model.
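
  Regarding direction 6, here is a minimal sketch of what gradient accumulation looks like in a PyTorch training loop. The model, optimizer, and data are toy stand-ins, not the baseline's actual code: with batchSize=16 and a_step=16, the weights are updated once every 16 mini-batches, giving an effective batch size of 256.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the baseline's real model and data loader (hypothetical).
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))
train_loader = DataLoader(dataset, batch_size=16)  # batchSize=16 as in the baseline

a_step = 16  # accumulate gradients over 16 mini-batches before each update
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    loss = criterion(model(inputs), labels)
    (loss / a_step).backward()      # scale so the accumulated gradient is an average
    if (step + 1) % a_step == 0:
        optimizer.step()            # one parameter update per a_step mini-batches
        optimizer.zero_grad()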

Tuning process (updated as I go~)

Adjust batchSize and epochs

  When I ran the baseline for the first time, batchSize=16 blew up my 8 GB of video memory; after lowering it to batchSize=8, the run went through successfully. The result is shown in the figure:
[figure: submission score screenshot]
  After that, I greatly reduced the epochs to epoch=6, and the result improved again.
[figure: submission score screenshot]

  In the subsequent training runs, epoch basically floated between 4~6 and batchSize between 4~8; after checking the submitted scores, I found I was right back where I started.
[figure: submission score screenshot]

cleanlab (still trying~)

  I started trying to use cleanlab to clean the data. In short, the idea is to automatically clean out "dirty data" of insufficient quality by some means (confident learning). cleanlab is an open-source data-cleaning tool based on an ICML 2020 paper (text link) jointly proposed by MIT and Google. A simple pip install cleanlab is all it takes to install and use it. For a detailed explanation, see this well-written article: Xi Xiaoyao: Don't let the data pit you!
  If we want to find the mislabeled samples, we only need to provide two inputs: one is the original sample labels (since the original labels may be wrong, they are called noisy labels here); the other is the predicted probability of each sample under every label category, obtained by cross-validation on the training set. This is an n × m probability matrix (n is the size of the dataset, m is the total number of label categories).

from cleanlab.pruning import get_noise_indices
# Inputs:
# s: the noisy labels
# psx: n x m matrix of predicted probabilities, obtained via cross-validation
ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # orders the label errors
)
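
  As for where the psx matrix above comes from, here is one possible way to obtain it: a minimal sketch using sklearn's cross_val_predict with a logistic-regression stand-in on made-up data. In the competition itself, these probabilities would have to come from cross-validated predictions of the BERT model instead.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Made-up features and noisy labels standing in for the real task (n=200, m=3).
X_train = np.random.randn(200, 20)
numpy_array_of_noisy_labels = np.random.randint(0, 3, size=200)

# Out-of-sample class probabilities: each sample is scored by a model that
# never saw that sample during training, which is what cleanlab expects.
numpy_array_of_predicted_probabilities = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X_train,
    numpy_array_of_noisy_labels,
    cv=5,
    method="predict_proba",
)
print(numpy_array_of_predicted_probabilities.shape)  # (200, 3), i.e. n x m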

If we not only want to find the mislabeled samples, but also want to clean up this label noise and then continue training:

from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression

# In fact, you can wrap any custom model you define here.
lnl = LearningWithNoisyLabels(clf=LogisticRegression())
lnl.fit(X=X_train_data, s=train_noisy_labels)
# Validate on real-world (held-out) data.
predicted_test_labels = lnl.predict(X_test)

  I'm still learning the code structure of the baseline, and I've been tinkering with generate_data.py (the data-preparation stage). After several attempts, I still haven't gotten the code to run.
  While writing this blog, I realized that I may not be clear enough about how data cleaning fits together with model training (I made a mistake when splitting the dataset), and that I need to further clarify the source and meaning of cleanlab's input parameters. In other words, I need to re-read the paper analysis and then consult the source-code interpretation.

Replace the pre-trained model

  After getting nowhere with cleanlab, I chose a direction that is easier to implement: replacing the original bert-base-chinese with chinese-roberta-wwm-ext (download link). Download the three files config.json, vocab.txt, and pytorch_model.bin into the tianchi-multi-task-nlp/bert_pretrain_model folder.
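
  As a quick sanity check that the three downloaded files are usable, something like the following can be run (a sketch assuming a recent version of the Hugging Face transformers library; the baseline's own loading code may differ):

from transformers import BertModel, BertTokenizer

# Load the locally downloaded chinese-roberta-wwm-ext files (path as described above).
model_dir = "./tianchi-multi-task-nlp/bert_pretrain_model"
tokenizer = BertTokenizer.from_pretrained(model_dir)  # reads vocab.txt
model = BertModel.from_pretrained(model_dir)          # reads config.json + pytorch_model.bin

tokens = tokenizer("测试一下预训练模型", return_tensors="pt")
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for the base-sized model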
[figure: screenshot of the bert_pretrain_model folder]
  Then, in order to experience the feeling of batchSize=32, I rented two 3090s (you can search for a rental solution that suits you, or use Colab or AIStudio once you understand the options) to test the waters. The speedup was instant; Lao Huang's (Jensen Huang's) "knife skills" live up to their reputation!
  After that, in order to test the strength of the two 3090s, I adjusted batchSize to 64, but unexpectedly the memory blew up again (it must have been a mistake in my operation~~). In the last test on the rented environment before writing this blog, the parameters were adjusted to batchSize=48, epoch=8; the submitted result is shown below:
[figure: submission score screenshot]
  No significant change... good-quality data is the key!

Summary

  Throughout the entire tuning process, I adjusted batchSize and epochs to fit the video memory. At the beginning, I tried to use cleanlab to clean the training set; after failing to grasp the essence of that method, I chose to replace the pre-trained model instead, swapping bert-base-chinese for chinese-roberta-wwm-ext. While writing this blog, I suddenly remembered method 5 (please forgive a newbie for being slow~), which is similar to the (N-fold) cross-validation approach mentioned in the Basic Concepts chapter on model training in Mr. Li Hongyi's course (homepage of the ML20 course). This method may balance variance and bias to reduce the total error. I feel I can rescue the model one more time, and I'll keep learning happily after finishing this post!
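
  For reference, here is a minimal sketch of the N-fold idea (N=5) using sklearn's KFold on toy data, not the competition's actual splitting code: every sample gets used for validation exactly once, which is what helps balance variance and bias.

import numpy as np
from sklearn.model_selection import KFold

# Toy data standing in for the competition's training set.
X = np.arange(20)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Train on X[train_idx], validate on X[val_idx], then average the fold scores.
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {val_idx.tolist()}")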
