TensorFlow error highlights

This article collects some of the errors the author has run into while using TensorFlow, for convenient reference so that the problems can be solved quickly later.

    CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

    2018-12-05 22:18:24.565303: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:24.565372: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:24.569212: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc:
    2018-12-05 22:18:26.170901: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:26.170969: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:26.174856: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3330
    2018-12-05 22:18:27.177003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:27.177071: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:27.180980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3331
    2018-12-05 22:18:34.625459: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
    2018-12-05 22:18:34.625513: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
    2018-12-05 22:18:36.231936: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
    2018-12-05 22:18:36.231971: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
    2018-12-05 22:18:37.235899: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
    2018-12-05 22:18:37.235952: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

First, make sure the four parameters job_name, task_index, ps_hosts, and worker_hosts are correct. The following configuration, for example, is wrong for a worker or ps process started on the machine with IP 192.168.1.100:

    --job_name=worker
    --task_index=1
    --ps_hosts=192.168.1.100:2222,192.168.1.101:2222
    --worker_hosts=192.168.1.100:2223,192.168.1.101:2223

The process is started on 192.168.1.100, but task_index is set to 1, which maps to the second entry of ps_hosts or worker_hosts (task_index starts at 0), i.e. 192.168.1.101. That address does not match the IP of the host the process is actually running on, so the cluster never comes up.
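For reference, this is roughly how the four flags are consumed in TF 1.x distributed training. The flag parsing below is a sketch (an assumption, not the original script), but it shows why task_index must select the entry of ps_hosts or worker_hosts that belongs to the machine the process is started on:

import argparse
import tensorflow as tf  # TF 1.x distributed API

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", type=str)      # "ps" or "worker"
parser.add_argument("--task_index", type=int)    # 0-based index into this job's host list
parser.add_argument("--ps_hosts", type=str)      # e.g. "192.168.1.100:2222,192.168.1.101:2222"
parser.add_argument("--worker_hosts", type=str)  # e.g. "192.168.1.100:2223,192.168.1.101:2223"
FLAGS, _ = parser.parse_known_args()

cluster = tf.train.ClusterSpec({
    "ps": FLAGS.ps_hosts.split(","),
    "worker": FLAGS.worker_hosts.split(","),
})
# tf.train.Server binds the address at position task_index of its own job's host
# list, so that address must be an IP of the machine this process runs on.
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)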

Another situation can also cause this problem: starting with TensorFlow 1.4, distributed TensorFlow automatically uses the proxy environment variables when establishing connections. If the connections between nodes do not need to go through a proxy, remove the proxy environment variables by adding the following code at the beginning of the script.
Note that this code must appear before import tensorflow as tf or import moxing.tensorflow as mox:

import os
# Drop the proxy settings before TensorFlow (or moxing) is imported;
# the None default avoids a KeyError when a variable is not set.
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

- Excerpted from https://bbs.huaweicloud.com/blogs/463145f7a1d111e89fc57ca23e93a89f

ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found

    /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Traceback (most recent call last):
    File "trainer.py", line 14, in <module>
    import sklearn.datasets
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/__init__.py", line 134, in <module>
    from .base import clone
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 11, in <module>
    from scipy import sparse
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .csr import *
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/csr.py", line 15, in <module>
    from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks,
    ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so)

The system libstdc++ is too old and does not include CXXABI_1.3.9. You can add <Anaconda_PATH>/lib to LD_LIBRARY_PATH, like this:

export LD_LIBRARY_PATH=/home/../anaconda3/lib:$LD_LIBRARY_PATH

This way the dynamic linker finds the lib directory shipped with Anaconda first, which satisfies the requirement.

Reference: Stack Overflow question 2.
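To confirm the diagnosis, you can list the CXXABI version tags exported by the system libstdc++. Below is a small diagnostic sketch, roughly equivalent to running strings /lib64/libstdc++.so.6 | grep CXXABI; the library path is taken from the error message above, adjust it for your system:

import re

# Scan the shared library for the CXXABI version strings it provides.
with open("/lib64/libstdc++.so.6", "rb") as f:
    data = f.read()

versions = sorted(set(re.findall(rb"CXXABI_\d+\.\d+(?:\.\d+)?", data)))
for v in versions:
    print(v.decode())
# If CXXABI_1.3.9 is not in this list, the ImportError above is expected.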

Distributed TensorFlow: the ps process fails with tensorflow.python.framework.errors_impl.UnavailableError: OS Error

    2018-12-07 15:40:05.167922: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3333}
    2018-12-07 15:40:05.167970: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.100.36:3333, 1 -> 192.168.100.37:3333}
    2018-12-07 15:40:05.171857: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3333
    Parameter server: waiting for cluster connection...
    2018-12-07 15:40:05.213496: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
    2018-12-07 15:40:05.213645: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
    Traceback (most recent call last):
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _run_fn
    self._extend_graph()
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1352, in _extend_graph
    tf_session.ExtendSession(self._session)
    tensorflow.python.framework.errors_impl.UnavailableError: OS Error

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "trainer.py", line 364, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
    File "trainer.py", line 70, in main
    num_classes=num_classes)
    File "trainer.py", line 138, in parameter_server
    sess.run(tf.report_uninitialized_variables())
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.UnavailableError: OS Error

As shown above, the parameter server process hits this error when running multi-machine distributed TensorFlow.
An answer to this problem explains:

    This has been troubling me for a while. I found out that the problem
    is that GRPC uses the native “epoll” polling engine for communication.
    Changing this to a portable polling engine solved this issue for me.
    The way to do is to set the environment variable,
    “GRPC_POLL_STRATEGY=poll” before running the tensorflow programs. This
    solved this issue for me. For reference, see,
    https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.

Following this advice, set a new environment variable:

    export GRPC_POLL_STRATEGY=poll

This successfully resolved the problem.
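Equivalently, the variable can be set inside the training script itself; a small sketch, assuming it runs before TensorFlow (and therefore gRPC) is imported, just like the proxy variables earlier:

import os
# Ask gRPC for the portable "poll" polling engine instead of the native "epoll" one.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # import only after the variable is set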

    Origin www.cnblogs.com/lijianming180/p/12360826.html