This article documents some errors the author has encountered while using TensorFlow, so that readers who hit the same problems later can solve them quickly.
CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-12-05 22:18:24.565303: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:24.565372: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:24.569212: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc:
2018-12-05 22:18:26.170901: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:26.170969: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:26.174856: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3330
2018-12-05 22:18:27.177003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:27.177071: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:27.180980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3331
2018-12-05 22:18:34.625459: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-12-05 22:18:34.625513: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-12-05 22:18:36.231936: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-12-05 22:18:36.231971: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-12-05 22:18:37.235899: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-12-05 22:18:37.235952: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
First, make sure the four parameters job_name, task_index, ps_hosts, and worker_hosts are correct. Consider the following incorrect case: a ps or worker process is started on the machine with IP 192.168.1.100, but task_index is specified as 1 on the command line. Since task_index is zero-based, index 1 refers to the second entry of ps_hosts or worker_hosts, i.e. 192.168.1.101, which does not match the IP of the host the process is actually running on.
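The mismatch above can be caught before launching the cluster. The following is a minimal sketch of such a sanity check; the function name, the ports, and the local-IP argument are illustrative, not part of TensorFlow's API:

```python
def check_task_index(job_name, task_index, ps_hosts, worker_hosts, local_ip):
    """Return True if the host list entry selected by (job_name, task_index)
    refers to this machine. task_index is zero-based, so index 1 means the
    second entry of the corresponding host list."""
    hosts = ps_hosts if job_name == "ps" else worker_hosts
    configured_ip = hosts[task_index].split(":")[0]
    return configured_ip == local_ip

# The broken case from the text: the process runs on 192.168.1.100,
# but it was launched with task_index=1, which points at 192.168.1.101.
ps_hosts = ["192.168.1.100:3333"]                            # ports are examples
worker_hosts = ["192.168.1.100:3330", "192.168.1.101:3331"]
print(check_task_index("worker", 1, ps_hosts, worker_hosts, "192.168.1.100"))
# prints False: the configured address and the actual host disagree
```

With task_index=0 the check passes, which is the correct launch parameter for a worker running on 192.168.1.100.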
Another situation can cause this problem: starting with TensorFlow 1.4, the proxy configured in environment variables is used automatically when nodes connect to each other. If the nodes can reach each other directly and no proxy is needed, remove the proxy environment variables, or add the following code at the beginning of the script. Note that this code must come before import tensorflow as tf or import moxing.tensorflow as mox:
import os
# Drop proxy settings; the second argument avoids a KeyError if they are unset.
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found
/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "trainer.py", line 14, in <module>
    import sklearn.datasets
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/__init__.py", line 134, in <module>
    from .base import clone
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 11, in <module>
    from scipy import sparse
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .csr import *
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/csr.py", line 15, in <module>
    from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks,
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so)
The system's libstdc++ is too old and does not provide CXXABI_1.3.9. You can add <Anaconda_PATH>/lib to LD_LIBRARY_PATH, like this:
export LD_LIBRARY_PATH=/home/../anaconda3/lib:$LD_LIBRARY_PATH
This way, the dynamic loader searches Anaconda's lib directory first, which provides the required symbol version.
2018-12-07 15:40:05.167922: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3333}
2018-12-07 15:40:05.167970: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.100.36:3333, 1 -> 192.168.100.37:3333}
2018-12-07 15:40:05.171857: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3333
Parameter server: waiting for cluster connection...
2018-12-07 15:40:05.213496: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
2018-12-07 15:40:05.213645: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
Traceback (most recent call last):
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _run_fn
    self._extend_graph()
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1352, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "trainer.py", line 364, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "trainer.py", line 70, in main
    num_classes=num_classes)
  File "trainer.py", line 138, in parameter_server
    sess.run(tf.report_uninitialized_variables())
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
As shown above, the parameter server process reports this error when running multi-machine distributed TensorFlow. An answer found here explains:
This has been troubling me for a while. I found out that the problem is that GRPC uses the native “epoll” polling engine for communication. Changing this to a portable polling engine solved this issue for me. The way to do is to set the environment variable, “GRPC_POLL_STRATEGY=poll” before running the tensorflow programs. This solved this issue for me. For reference, see, https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Accordingly, set this environment variable before starting the program:
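Following the quoted advice, a minimal way to set GRPC_POLL_STRATEGY from within the script, in the same style as the proxy workaround earlier (it must run before importing tensorflow, since gRPC reads the variable at load time):

```python
import os

# Switch gRPC from the native "epoll" engine to the portable "poll" engine,
# per the gRPC environment_variables documentation quoted above.
os.environ['GRPC_POLL_STRATEGY'] = 'poll'
```

Alternatively, export the variable in the shell before launching, which has the same effect.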