Opening TensorBoard while a TensorFlow model is training

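During training I wanted to use TensorBoard to watch the training loss and validation accuracy in real time, but it kept failing: as soon as I started TensorBoard and opened it in the browser, training stopped with the following traceback: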

File "D:/ProgramData/PycharmProjects/tf_learn/mnist/mnist_train.py", line 88, in main
    train()
  File "D:/ProgramData/PycharmProjects/tf_learn/mnist/mnist_train.py", line 68, in train
    saver.save(sess, ckpt, global_step=i)
  File "D:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\saver.py", line 1662, in save
    save_relative_paths=self._save_relative_paths)
  File "D:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\training\saver.py", line 1013, in _update_checkpoint_state
    text_format.MessageToString(ckpt))
  File "D:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 436, in atomic_write_string_to_file
    rename(temp_pathname, filename, overwrite)
  File "D:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 415, in rename
    compat.as_bytes(oldname), compat.as_bytes(newname), overwrite, status)
  File "D:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: ./train\checkpoint.tmp6b8a317e1edd43e99a6ae35134e969ea to: ./train\checkpoint : \udcbeܾ\udcf8\udcb7\udcc3\udcceʡ\udca3
; Input/output error
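
The tail of the message above (`\udcbe...\udca3`) is unreadable as printed. A minimal sketch of how it can be recovered, assuming the text was originally GBK-encoded (as on a Chinese-locale Windows system) and was then decoded as UTF-8 with the `surrogateescape` error handler; the variable names are only illustrative:

```python
# The unreadable part of the error message, copied from the traceback.
garbled = "\udcbe\u073e\udcf8\udcb7\udcc3\udcce\u02a1\udca3"

# Undo the mistaken UTF-8 decoding to get the original bytes back.
raw = garbled.encode("utf-8", errors="surrogateescape")

# Decode the bytes with the assumed original encoding (GBK).
decoded = raw.decode("gbk")
print(decoded)  # 拒绝访问。 (the Windows message "Access is denied.")
```

Under that assumption, the rename of the temporary `checkpoint.tmp...` file onto `checkpoint` failed because access to the file was denied.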

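The error is tensorflow.python.framework.errors_impl.UnknownError: Failed to rename, raised while saving the model at saver.save(sess, ckpt, global_step=i). If TensorBoard is not running, the error does not occur: training proceeds normally, the model and the logs are saved correctly, and once training has finished the logs can be viewed in TensorBoard without problems. The relevant code is: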

# Every 100 training steps: print the loss and validation accuracy,
# write summaries, and save a checkpoint
if i > 0 and i % 100 == 0:
    validate_accuracy = eval(sess, accuracy_sum, x, y, training, mnist.validation)
    print('step: {}, loss: {}, validation accuracy: {}'.format(i, loss, validate_accuracy))
    ckpt = os.path.join(FLAGS.train_dir, 'model.ckpt')

    # Record the validation accuracy as a scalar summary
    summary = tf.Summary()
    summary.value.add(tag='val_acc', simple_value=validate_accuracy)
    summary_writer.add_summary(summary, i)

    # Record the loss summary, then flush the events to disk
    summary_str_loss = sess.run(summary_loss, feed_dict={x: xs, y: ys, training: False})
    summary_writer.add_summary(summary_str_loss, i)
    summary_writer.flush()
    # Save the checkpoint last (this is the call that fails)
    saver.save(sess, ckpt, global_step=i)

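Since training works fine without TensorBoard, the summary writing and the saver presumably interfere with each other, so I tried moving saver.save(sess, ckpt, global_step=i) above the summary code: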

# Every 100 training steps: print the loss and validation accuracy,
# save a checkpoint, and write summaries
if i > 0 and i % 100 == 0:
    validate_accuracy = eval(sess, accuracy_sum, x, y, training, mnist.validation)
    print('step: {}, loss: {}, validation accuracy: {}'.format(i, loss, validate_accuracy))
    ckpt = os.path.join(FLAGS.train_dir, 'model.ckpt')
    # Save the checkpoint first, before any summaries are written
    saver.save(sess, ckpt, global_step=i)

    # Record the validation accuracy as a scalar summary
    summary = tf.Summary()
    summary.value.add(tag='val_acc', simple_value=validate_accuracy)
    summary_writer.add_summary(summary, i)

    # Record the loss summary, then flush the events to disk
    summary_str_loss = sess.run(summary_loss, feed_dict={x: xs, y: ys, training: False})
    summary_writer.add_summary(summary_str_loss, i)
    summary_writer.flush()

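After this change everything works: TensorBoard can be opened to follow the training in real time without interrupting it. The problem is gone, but I still don't know why the order matters...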
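
One possible explanation, which is an assumption and not something confirmed here: on Windows a file usually cannot be replaced while another process holds it open, so if TensorBoard is reading files in the same directory that the saver writes to, the rename onto `checkpoint` can fail at unlucky moments. If that is the cause, reordering the calls only changes the timing. A more defensive option is to retry the save a few times. The helper below is a hypothetical sketch (`call_with_retry` and its parameters are not part of TensorFlow):

```python
import time


def call_with_retry(fn, retries=3, delay=0.5, exceptions=(Exception,)):
    """Call fn(); if it raises one of `exceptions`, wait `delay` seconds
    and try again, up to `retries` additional attempts.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except exceptions:
            if attempt >= retries:
                # Out of retries: re-raise the last error unchanged.
                raise
            attempt += 1
            time.sleep(delay)
```

In the training loop this could be used as call_with_retry(lambda: saver.save(sess, ckpt, global_step=i), exceptions=(tf.errors.UnknownError,)). Writing the summaries to a directory separate from the checkpoint directory and pointing TensorBoard only at the summary directory may also avoid the conflict, again assuming the explanation above is correct.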

Reposted from blog.csdn.net/qiumokucao/article/details/81503769