TensorFlow from single-machine to distributed training: tf.train.SyncReplicasOptimizer + MonitoredTrainingSession

# Average the per-GPU (tower) gradients, then wrap the base GPU optimizer
# with SyncReplicasOptimizer; apply_gradients is called on the wrapper.
reduce_grads = average_gradients(tower_grads)
opt = tf.train.SyncReplicasOptimizer(
    opt_gpu,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers,
    name="sync_replicas")
apply_gradient_op = opt.apply_gradients(reduce_grads, global_step=global_step)
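
The snippet above calls an average_gradients helper that is not shown in the post. A minimal sketch of such a helper, in the spirit of the TensorFlow multi-GPU tutorial, is given below; the exact shape of tower_grads is an assumption (a list with one entry per GPU tower, each entry a list of (gradient, variable) pairs):

def average_gradients(tower_grads):
    # tower_grads: one list of (gradient, variable) pairs per GPU tower.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Every tower refers to the same variable, so only the gradients
        # need to be stacked and averaged.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads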

...

hooks = [opt.make_session_run_hook(is_chief=(FLAGS.task_index == 0), num_tokens=0),
         tf.train.StopAtStepHook(last_step=1000000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': total_loss},
                                    every_n_iter=10)]

...
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(FLAGS.task_index == 0),
                                       checkpoint_dir="/weixue/my_bench/train_logs",
                                       hooks=hooks,
                                       scaffold=scaffold,
                                       config=config) as mon_sess:
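    # Sketch of the loop body elided in the original post:
    # MonitoredTrainingSession takes care of initialization, the hooks above,
    # and checkpointing, so each worker just runs the training op until a
    # hook (e.g. StopAtStepHook) requests a stop.
    while not mon_sess.should_stop():
        mon_sess.run(apply_gradient_op)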
Notes on issues encountered:

1. The order of the optimizer calls: wrap the base GPU optimizer with SyncReplicasOptimizer first, and only then call apply_gradients on the wrapper (as in the first snippet above).

2. Variable initialization: SyncReplicasOptimizer creates extra variables and init ops that must be initialized correctly; see the Scaffold sketch after this list.

3. Added num_tokens=0 to the hook arguments:

This is supposed to be executed in the beginning of the chief/sync thread so that even if the total_num_replicas is less than replicas_to_aggregate, the model can still proceed as the replicas can compute multiple steps per variable update. Make sure:
`num_tokens >= replicas_to_aggregate - total_num_replicas`.

When replicas_to_aggregate == total_num_replicas, do not keep adding extra tokens to the queue, i.e. num_tokens=0 is enough; a worked example follows after this list.
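
Regarding point 2: the code passes a scaffold to MonitoredTrainingSession without showing how it is built. The sketch below is an assumption, not the original author's code. SyncReplicasOptimizer exposes chief_init_op, local_step_init_op, and ready_for_local_init_op (available only after opt.apply_gradients has been called), and depending on the TensorFlow version the hook returned by make_session_run_hook may already run these ops itself, in which case a plain tf.train.Scaffold() is enough.

is_chief = (FLAGS.task_index == 0)
# The chief initializes the sync token queue and all local steps; the other
# workers initialize only their own local step and wait via
# ready_for_local_init_op until the chief has finished.
scaffold = tf.train.Scaffold(
    local_init_op=opt.chief_init_op if is_chief else opt.local_step_init_op,
    ready_for_local_init_op=opt.ready_for_local_init_op)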
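
For point 3, a small worked example of the constraint quoted above; treating num_workers as both replicas_to_aggregate and total_num_replicas matches the first snippet, and the max() is illustrative only:

replicas_to_aggregate = num_workers
total_num_replicas = num_workers
# num_tokens >= replicas_to_aggregate - total_num_replicas  ->  num_tokens >= 0
num_tokens = max(replicas_to_aggregate - total_num_replicas, 0)  # == 0 here
sync_hook = opt.make_session_run_hook(is_chief=(FLAGS.task_index == 0),
                                      num_tokens=num_tokens)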


Reposted from blog.csdn.net/qq_32110859/article/details/81303130