PyTorch training freezes Ubuntu: a memory leak

Symptom: while training a multi-task learning model with PyTorch, the machine froze somewhere around epoch 30-60. Even though it was running Ubuntu, the whole system locked up.

Cause: after some rather unsystematic debugging, the likely culprit turned out to be a memory leak.

Fix:
When writing data to log files for TensorBoard visualization, remember to close the SummaryWriter when you are done. An unclosed writer keeps buffering events in memory for the lifetime of the process.

import os
from tensorboardX import SummaryWriter  # or: from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(os.path.join(ckptDir, 'logs'))
for epoch in range(num_epochs):
    ...
    # log metrics for TensorBoard
    writer.add_scalar('learning rate', lr, epoch + 1)
    writer.add_scalars('loss', {'train loss': train_loss, 'validation loss': val_loss}, epoch + 1)
    writer.add_scalars('accuracy', {'train accuracy': train_acc, 'validation accuracy': val_acc}, epoch + 1)
    writer.add_scalars('balanced accuracy', {'train bacc': train_bacc, 'validation bacc': val_bacc}, epoch + 1)
    ...

# This is the crucial line: flush buffered events and release the writer
writer.close()
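A more robust variant is to guarantee that close() runs even when training crashes mid-loop, for example via contextlib.closing, which calls .close() on exit. The sketch below uses a hypothetical stand-in writer class just to demonstrate the pattern; in real code you would wrap the actual SummaryWriter the same way.

```python
from contextlib import closing

class StubWriter:
    """Hypothetical stand-in for SummaryWriter, used only to show the pattern."""
    def __init__(self):
        self.closed = False

    def add_scalar(self, tag, value, step):
        pass  # a real writer would buffer this event to the log file

    def close(self):
        self.closed = True  # a real writer would flush and release the file handle

# closing() guarantees .close() is called on exit, even if the loop raises
with closing(StubWriter()) as writer:
    for epoch in range(3):
        writer.add_scalar('learning rate', 0.01, epoch + 1)
```

After the with-block, the writer is closed whether or not an exception interrupted the loop, so no explicit writer.close() call can be forgotten.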

Reference: Drux @ https://stackoverflow.com/questions/44831317/tensorboard-unble-to-get-first-event-timestamp-for-run

Reposted from blog.csdn.net/qxqxqzzz/article/details/107354508