TensorFlow benchmark error log

When training a model in distributed mode across multiple machines and multiple GPUs:

The launch script was:

CUDA_VISIBLE_DEVICES='' nohup python -u tf_cnn_benchmarks.py --batch_size=2048 --data_dir=// --data_name=imagenet --model=alexnet --num_batches=100 --num_gpus=4 --train_dir=// --ps_hosts=wx-test2:50000,wx-test:50000 --worker_hosts=wx-test2:50001,wx-test:50001 --task_index=0 --job_name=ps > alexnet/8gpu/8gpu_ps0.txt 2>&1 &

sleep 5

nohup python -u tf_cnn_benchmarks.py --batch_size=2048 --data_dir=// --data_name=imagenet --model=alexnet --num_batches=100 --num_gpus=4 --train_dir=/weixue/new/scripts/tf_cnn_benchmarks/alexnet/8gpu/ --ps_hosts=wx-test2:50000,wx-test:50000 --worker_hosts=wx-test:50001,wx-test2:50001 --task_index=0 --job_name=worker > alexnet/8gpu/8gpu_worker0.txt 2>&1 &

worker0 kept reporting this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:ps/replica:0/task:0/device:CPU:0 unknown device.

Generating model
W0716 12:44:24.854329 139855663281920 tf_logging.py:126] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/image_ops_impl.py:968: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
I0716 12:44:26.083981 139855663281920 tf_logging.py:116] Create CheckpointSaverHook.
I0716 12:44:26.410391 139855663281920 tf_logging.py:116] Graph was finalized.
2018-07-16 12:44:27.423535: I tensorflow/core/distributed_runtime/master_session.cc:1142] Start master session 3d55ac115f2c4045 with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 32 gpu_options { } allow_soft_placement: true
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:ps/replica:0/task:0/device:CPU:0 unknown device.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 60, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 274, in run
    _run_main(main, argv)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 56, in main
    bench.run()
  File "/weixue/new/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1384, in run
    return self._benchmark_cnn()
  File "/weixue/new/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1667, in _benchmark_cnn
    max_wait_secs=7200)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 706, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 283, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:ps/replica:0/task:0/device:CPU:0 unknown device.

Changing:

--worker_hosts=wx-test:50001,wx-test2:50001 --task_index=0 --job_name=worker > alexnet/8gpu/8gpu_worker0.txt 2>&1 &

to:

 --worker_hosts=wx-test2:50001,wx-test:50001 --task_index=0 --job_name=worker > alexnet/8gpu/8gpu_worker0.txt 2>&1 &
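
The effect of this reordering can be checked by resolving task index 0 against each host list. The sketch below is plain Python and does not import TensorFlow. The helper name `task_address` is hypothetical, and the sketch assumes the usual convention that the entry at position i of a comma-separated host list is the address of task i.

```python
def task_address(hosts_flag, task_index):
    """Return the host:port that a task index resolves to in a comma-separated host list."""
    hosts = hosts_flag.split(",")
    return hosts[task_index]


# --worker_hosts as passed to the ps process in the launch script
ps_view = "wx-test2:50001,wx-test:50001"
# --worker_hosts as originally passed to the worker process
worker_view_before = "wx-test:50001,wx-test2:50001"
# --worker_hosts passed to the worker process after the change
worker_view_after = "wx-test2:50001,wx-test:50001"

for label, flag in [
    ("ps process", ps_view),
    ("worker process (before)", worker_view_before),
    ("worker process (after)", worker_view_after),
]:
    print(label, "-> worker task 0 is", task_address(flag, 0))
```

With the original order, the ps process and the worker process disagree about which host is worker task 0. After the change, both resolve it to the same address.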

With this change the program runs.

Open question: is this error caused by TensorFlow or by the k8s cluster?

Update, July 17, 2018:

It turned out to be a very basic mistake:

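The workers in --worker_hosts must be listed in task order 0, 1, 2, 3, ...: the entry at position i is the address of worker task i, so every process (ps and worker) has to be given the same list in the same order.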
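
One way to rule out this kind of mismatch is to generate the cluster flags for every process from a single pair of host lists, so the order cannot differ between the ps and worker commands. A minimal sketch under that assumption: `build_flags` is a hypothetical helper and not part of tf_cnn_benchmarks, and it only emits the four cluster-related flags that appear in the launch script above.

```python
def build_flags(job_name, task_index, ps_hosts, worker_hosts):
    """Build the cluster-related flags for one process from shared host lists."""
    hosts = {"ps": ps_hosts, "worker": worker_hosts}
    if job_name not in hosts:
        raise ValueError("unknown job name: %r" % job_name)
    if not 0 <= task_index < len(hosts[job_name]):
        raise ValueError("task_index %d out of range for job %r" % (task_index, job_name))
    return [
        "--ps_hosts=" + ",".join(ps_hosts),
        "--worker_hosts=" + ",".join(worker_hosts),
        "--job_name=" + job_name,
        "--task_index=%d" % task_index,
    ]


# Single source of truth: position i in each list is task i.
PS_HOSTS = ["wx-test2:50000", "wx-test:50000"]
WORKER_HOSTS = ["wx-test2:50001", "wx-test:50001"]

ps0 = build_flags("ps", 0, PS_HOSTS, WORKER_HOSTS)
worker0 = build_flags("worker", 0, PS_HOSTS, WORKER_HOSTS)

print(" ".join(ps0))
print(" ".join(worker0))
```

Because both commands are derived from the same lists, the --ps_hosts and --worker_hosts flags are identical across processes, and only --job_name and --task_index vary.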

Reposted from blog.csdn.net/qq_32110859/article/details/81071408