Debugging a Low Score After Pretraining


After finishing pretraining for the Datagrand Cup task "Risk Event Label Recognition Based on Large-Scale Pretrained Models", I found that my score was unexpectedly low. This post is a write-up of the debugging process.

Comparing against the standard TensorFlow fine-tuning pipeline

As a baseline, we first fine-tune with the standard TensorFlow pipeline. The training log is as follows:

Epoch 1/20
2021-09-19 16:34:15.721296: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
875/875 [==============================] - 211s 228ms/step - loss: 1.5204 - acc: 0.5936
100%|███████████████████████████████████████| 1401/1401 [01:27<00:00, 16.07it/s]
final_score = 0.4637266561611361
self.best_score = 0.4637266561611361
Epoch 2/20
875/875 [==============================] - 199s 227ms/step - loss: 0.9490 - acc: 0.7278
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.86it/s]
final_score = 0.4715278770706616
self.best_score = 0.4715278770706616
Epoch 3/20
875/875 [==============================] - 199s 227ms/step - loss: 0.7054 - acc: 0.7865
100%|███████████████████████████████████████| 1401/1401 [01:20<00:00, 17.39it/s]
final_score = 0.48631696631087207
self.best_score = 0.48631696631087207
Epoch 4/20
875/875 [==============================] - 199s 227ms/step - loss: 0.4862 - acc: 0.8514
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.58it/s]
final_score = 0.4927837543840665
self.best_score = 0.4927837543840665
Epoch 5/20
875/875 [==============================] - 199s 227ms/step - loss: 0.2881 - acc: 0.9099
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.60it/s]
final_score = 0.4603489827324962
Epoch 6/20
875/875 [==============================] - 199s 227ms/step - loss: 0.2006 - acc: 0.9365
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.68it/s]
final_score = 0.5136259654205074
self.best_score = 0.5136259654205074
Epoch 7/20
875/875 [==============================] - 199s 227ms/step - loss: 0.1359 - acc: 0.9576
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.59it/s]
final_score = 0.5293800811021452
self.best_score = 0.5293800811021452
Epoch 8/20
875/875 [==============================] - 199s 227ms/step - loss: 0.1007 - acc: 0.9684
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.81it/s]
final_score = 0.4566600972331969
Epoch 9/20
875/875 [==============================] - 199s 227ms/step - loss: 0.0830 - acc: 0.9744
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.55it/s]
final_score = 0.484445120332004
Epoch 10/20
875/875 [==============================] - 199s 227ms/step - loss: 0.0796 - acc: 0.9761
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.93it/s]
final_score = 0.46480710031858224
Epoch 11/20
875/875 [==============================] - 199s 227ms/step - loss: 0.0588 - acc: 0.9827
100%|███████████████████████████████████████| 1401/1401 [01:19<00:00, 17.71it/s]
final_score = 0.4820417457316996
Epoch 12/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0660 - acc: 0.9812
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.87it/s]
final_score = 0.5076002583300518
Epoch 13/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0655 - acc: 0.9803
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.78it/s]
final_score = 0.4855591258396087
Epoch 14/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0553 - acc: 0.9848
100%|███████████████████████████████████████| 1401/1401 [01:20<00:00, 17.47it/s]
final_score = 0.47588983392087264
Epoch 15/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0643 - acc: 0.9827
100%|███████████████████████████████████████| 1401/1401 [01:17<00:00, 17.97it/s]
final_score = 0.49583256897288686
Epoch 16/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0468 - acc: 0.9863
100%|███████████████████████████████████████| 1401/1401 [01:20<00:00, 17.40it/s]
final_score = 0.4645668699854448
Epoch 17/20
875/875 [==============================] - 200s 228ms/step - loss: 0.0483 - acc: 0.9854
100%|███████████████████████████████████████| 1401/1401 [01:18<00:00, 17.74it/s]
final_score = 0.4835244217626551
Epoch 18/20
875/875 [==============================] - 199s 228ms/step - loss: 0.0494 - acc: 0.9860
100%|███████████████████████████████████████| 1401/1401 [01:20<00:00, 17.44it/s]
final_score = 0.5069050625288192

The best score here is 0.5293800811021452, a clear improvement over the best score from the earlier PyTorch fine-tuning run. This points the investigation squarely at the PyTorch fine-tuning code.
!!!Takeaway: localizing the error is the critical step. Only once the faulty component has been pinpointed can the error be analyzed and diagnosed!!!
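The best score can be pulled out of such a log mechanically rather than by eye. A minimal sketch in plain Python (assuming the log is available as a list of lines, with each score on a single `final_score = <value>` line):

```python
import re

def best_final_score(log_lines):
    """Scan training-log lines for 'final_score = <value>' entries
    and return the highest score seen (or None if there are none)."""
    scores = []
    for line in log_lines:
        m = re.match(r"\s*final_score\s*=\s*([0-9.]+)", line)
        if m:
            scores.append(float(m.group(1)))
    return max(scores) if scores else None

log = [
    "Epoch 7/20",
    "final_score = 0.5293800811021452",
    "Epoch 8/20",
    "final_score = 0.4566600972331969",
]
print(best_final_score(log))  # → 0.5293800811021452
```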

First attempted fix: adding a dropout layer. No noticeable improvement.

Second attempted fix: swapping the optimizer from AdamW to Adam. Also no noticeable improvement.
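For context, the only difference between Adam and AdamW is where weight decay enters the update: Adam folds it into the gradient (so it passes through the moment estimates), while AdamW applies it directly to the weights. A hand-rolled single-parameter sketch (not the torch.optim implementation; hyperparameter values are illustrative):

```python
import math

def adam_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    """One Adam step with L2 weight decay folded into the gradient."""
    grad = grad + wd * w                 # decay enters the moment estimates
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    """One AdamW step: decoupled weight decay applied directly to w."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v

w_adam, _, _ = adam_step(1.0, 0.1, 0.0, 0.0)
w_adamw, _, _ = adamw_step(1.0, 0.1, 0.0, 0.0)
print(w_adam, w_adamw)   # the two updates differ slightly
```

Since both variants make nearly the same first step here, it is unsurprising that swapping them did not move the score much.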

Finding the problem: an extra pooler layer

While running both pipelines, I found that my better-performing TensorFlow BERT model,

inputs = np.array([[1,2,3,4,5],[1,2,3,4,5]])
outputs = bertmodel(inputs)

and the PyTorch version,

output = model(torch.tensor([[1,2,3,4,5],
                             [1,2,3,4,5]]),None,None)

produce different outputs.
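A quick way to confirm such a mismatch is to feed both models the same token ids and diff the resulting hidden states numerically. A dependency-free sketch of the comparison helper (the sample values below are made up; in practice they would come from converting both models' outputs to nested lists):

```python
def max_abs_diff(a, b):
    """Recursively compute the maximum absolute elementwise difference
    between two equally-shaped nested lists of numbers."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    return max(max_abs_diff(x, y) for x, y in zip(a, b))

# Hypothetical [CLS] vectors from the two pipelines, e.g. obtained via
# bertmodel(inputs)[:, 0].numpy().tolist() and output[:, 0].detach().tolist()
tf_out = [[0.12, -0.53], [0.12, -0.53]]
pt_out = [[0.48,  0.91], [0.48,  0.91]]
print(max_abs_diff(tf_out, pt_out))  # a large value means the pipelines disagree
```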

class ClassificationModel(nn.Module):
    def __init__(self, model, config, n_labels):
        super(ClassificationModel, self).__init__()
        self.model = model                    # pretrained BERT encoder
        self.dropout1 = nn.Dropout(0.2)
        # fc1 + tanh below recompute what BERT's built-in pooler already does
        self.fc1 = nn.Linear(config.embedding_size, config.embedding_size)
        self.activation = torch.tanh
        self.dropout2 = nn.Dropout(0.2)
        self.fc2 = nn.Linear(config.embedding_size, n_labels)

    def forward(self, input_ids, segment_ids, input_mask):
        outputs = self.model(input_ids)
        # outputs: [batch, seq_len, hidden], e.g. [64, 128, 768]
        print('---outputs = ---')
        print(outputs)
        print('----------------')
        outputs = outputs[:, 0]               # take the [CLS] position
        # outputs = self.dropout1(outputs)
        # !!! a dropout here after the encoder output improves the model !!!
        outputs = self.fc1(outputs)
        outputs = self.activation(outputs)
        outputs = self.dropout2(outputs)
        outputs = self.fc2(outputs)
        # outputs = F.softmax(outputs)
        return outputs
    # an earlier version was missing this return, so forward returned None

After careful inspection, the PyTorch fine-tuning turned out to include an extra pooler layer: the fc1 + tanh pair above duplicates the pooler BERT already applies.
Removing this extra pooler layer, however, did not noticeably improve the results either.
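For reference, BERT's pooler is simply a dense layer followed by tanh applied to the [CLS] hidden state, which is exactly what fc1 + self.activation recompute. A dependency-free sketch of one pooler application (weights and inputs are made up for illustration):

```python
import math

def pooler(cls_hidden, weight, bias):
    """BERT-style pooler: tanh(W @ h_cls + b), written out with plain
    lists, one output unit at a time."""
    out = []
    for w_row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(w_row, cls_hidden)) + b
        out.append(math.tanh(z))
    return out

h_cls = [0.5, -0.25]                 # hypothetical [CLS] hidden state
W = [[1.0, 0.0], [0.0, 1.0]]         # hypothetical pooler weights (identity)
b = [0.0, 0.0]
print(pooler(h_cls, W, b))           # → [tanh(0.5), tanh(-0.25)]
```

Stacking a second copy of this on top of the model's own pooler adds a redundant nonlinearity, though as noted above it was not the main cause of the low score.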

In the end it was a parameter problem: maxlen was too small. After increasing maxlen, each epoch took noticeably longer to train, but the results also got better.

The TensorFlow setup used maxlen = 128, while my PyTorch code used only 32. I had deliberately shrunk it earlier because of limited GPU memory. The lesson: maxlen cannot be reduced casually, or model training will suffer.
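The effect of maxlen is easiest to see at the tokenization step: every sequence is truncated or padded to that length, so maxlen = 32 silently discards everything past the 32nd token of each document. A hand-rolled sketch (real tokenizers do this internally; pad_id = 0 is an assumption):

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Clip a token-id sequence to maxlen, or right-pad it with pad_id."""
    clipped = token_ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))

doc = list(range(1, 101))            # a 100-token document
short = pad_or_truncate(doc, 32)     # loses tokens 33..100
full = pad_or_truncate(doc, 128)     # keeps the whole document, then pads
print(len(short), short[-1])         # → 32 32
print(len(full), full[-1])           # → 128 0
```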

Continued pretraining on labeled-data after unlabeled-data made results worse

I first pretrained on 150 unlabeled files, then continued pretraining on the labeled-data for 300 epochs, and found the result was actually worse than skipping the labeled-data pretraining entirely. My first hypothesis was that 300 epochs on the labeled-data was too many, overfitting the model parameters.
After further debugging, the real cause turned out to be that the code was loading the weights of the original bertmodel rather than the weights produced by the new unlabeled-data pretraining.
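This class of bug is easy to reproduce in miniature: the fine-tuning stage builds its model from one checkpoint while the pretraining stage saved to another, so the pretraining is silently discarded. A toy illustration using dicts as stand-in state dicts (all names and values here are hypothetical):

```python
# Two checkpoints: the originally released weights and the ones we just pretrained.
original_ckpt = {"encoder.weight": 0.10}     # stand-in for real tensors
pretrained_ckpt = {"encoder.weight": 0.37}

def build_model(ckpt):
    """Stand-in for model construction plus load_state_dict(ckpt)."""
    return dict(ckpt)

model = build_model(original_ckpt)    # buggy: ignores the new pretraining
model = build_model(pretrained_ckpt)  # fixed: loads the fresh weights
print(model["encoder.weight"])        # → 0.37
```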

Closing notes

During training I also noticed run-to-run variance: the same model scored 0.5488 on one run and 0.560 on the next.
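Run-to-run variance like this usually comes from unseeded randomness: weight initialization, dropout masks, and data shuffling. Fixing the seeds makes runs repeatable; a minimal stdlib sketch (a real training script would additionally seed numpy, torch, and/or tensorflow, and the function name here is hypothetical):

```python
import random

def shuffled_batch_order(seed, n_batches=8):
    """Deterministically shuffle the batch order for a given seed."""
    rng = random.Random(seed)        # local RNG, independent of global state
    order = list(range(n_batches))
    rng.shuffle(order)
    return order

run_a = shuffled_batch_order(seed=42)
run_b = shuffled_batch_order(seed=42)
run_c = shuffled_batch_order(seed=7)   # a different seed gives a different run
print(run_a == run_b)                  # → True: same seed, same order
```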


Reposted from blog.csdn.net/znevegiveup1/article/details/120383246