TensorFlow distributed deployment: a record of the pits I stepped in

There were no pits in the world to begin with; once enough people have fallen in, a pit naturally forms.

My company recently decided to build a distributed cluster for training data. As an ignorant but knowledge-loving newbie, I got beaten up so badly that my hair fell out all over the floor.

After a full 2.5 weeks, and with a buddy's rough guess at the cause pointing the way, I finally found the answer in a technical post from Huawei. That moment was a real "Duang": watching the process finally run, I wanted to sprint off into the sunset to celebrate. The road was not easy. I turned Google and Baidu inside out, relevant and irrelevant results alike, without finding a real solution, and my morale was close to exploding. Then I changed the keywords a bit, and on the last page of the Google results there was a Huawei post containing exactly the relevant words. Opening it was like discovering a new world. Excited!!!!!

Enough rambling; here are the deployment details.

This deployment is based on Ubuntu 16.04 + Hadoop + Spark + TensorFlowOnSpark + TensorFlow 1.8.

The environment here is a cluster with one ps node and two worker nodes.
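To make the topology concrete, here is a minimal sketch of what that looks like in TF 1.x terms, with hypothetical hostnames and ports (TensorFlowOnSpark normally builds this ClusterSpec for you from the Spark executors, so this is just for orientation):

# Minimal TF 1.x sketch of a 1-ps, 2-worker cluster.
# Hostnames/ports are made-up placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node1:2222"],
    "worker": ["node2:2222", "node3:2222"],
})
# Each node starts a server for its own role and index, e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)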

Hadoop and Spark deployment

I won't describe the Hadoop and Spark deployment here. There is plenty of information online, and following it basically works; if you hit other problems, find a matching post and fix them one by one. What matters is that the cluster runs in the end.

TensorFlow and TensorFlowOnSpark installation

Here YARN is used as the resource manager; for the installation steps, refer to the official TensorFlowOnSpark documentation. Many articles online complain that the official documentation is too simple, but after stepping through all the pits I found that it really is that simple: when things fail to run, the cause is mostly your own cluster and its configuration. Those pits you have to fill yourself; the official docs can only give hints.

In short, the official documentation does work!


Well, it's time to talk about the pits.
The first one appeared while following the official documentation, at the step Run distributed MNIST training (using feed_dict).

The pit looked like this:

Timeout while feeding partition

Reading the message, it means feeding the input data timed out. Since it was a timeout, the first thought in my mind was: is the data even being downloaded from Hadoop? So I added logging to the code to verify, and to my surprise the download succeeded. GG. Now I was stumped: if the data can be pulled down, communication with Hadoop is fine, so where is the root of the problem? Google and Baidu again, and congratulations: the search results were either off the point, irrelevant, or ads (you know the kind). At this point I was desperate; searching on that log line solved nothing, so the only way forward was to look for other logs.
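For reference, the kind of logging check I mean is just trying to read the input path directly and printing the result. A minimal sketch with a made-up HDFS path (TF 1.x's tf.gfile understands hdfs:// URIs when the Hadoop environment is configured):

# Hypothetical sanity check: can this node actually read from HDFS?
# The path is a placeholder for the real MNIST input partition.
import tensorflow as tf

path = "hdfs://namenode:9000/user/me/mnist/csv/train/images/part-00000"
with tf.gfile.Open(path, "rb") as f:
    print("read ok, first 64 bytes:", f.read(64))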

Run yarn logs -applicationId <your applicationId> in the terminal and N screens of log lines will scroll past. If the job has been hanging long enough, you may see something like the following:

CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

Yes, this was the toughest battle, the pit of pits!!!

From the log, the current node is waiting for responses from ps0 and worker0, and the repeated lines show that those two nodes never answer. So where is the problem? Google and Baidu again: "restart everything", "worker0 finished quickly and worker1 hangs because it never finishes", "the ClusterSpec configuration is wrong", "the firewall is not disabled", and so on. None of it helped. At one point I even suspected ssh was the problem (forgive me, I'm a newbie), and of course changing that did nothing either. In short, I opened that pile of links N times to no effect and was ready to spit blood.

After several days stuck like this, I found a GitHub issue describing a phenomenon very similar to mine, but the poster disappeared from the thread without reporting the outcome. I followed the suggestion mentioned there and checked the permissions; also useless. With no options left I opened an issue of my own, and the maintainer replied fairly quickly, pointing out that the nodes could not reach each other. He did not give a concrete solution, but at least it indicated a direction for investigation. So I ran native distributed TensorFlow code, outside of Spark, and it failed too. That pinned the root cause inside TensorFlow itself, nothing to do with Hadoop, which narrowed the range a lot. Back to Google once more, and I finally found the reason in a Huawei technical post:
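For the record, the "native distributed TensorFlow" test is just starting bare TF servers and checking whether a session can connect, with Spark and Hadoop out of the picture. A minimal sketch with hypothetical hostnames; run it on every node with that node's own job name and task index:

# TF 1.x distributed smoke test, independent of Spark/Hadoop.
# Usage: python smoke_test.py <job_name> <task_index>
# If CreateSession hangs here too, the problem is TensorFlow
# connectivity, not the Spark layer.
import sys
import tensorflow as tf

job_name, task_index = sys.argv[1], int(sys.argv[2])
cluster = tf.train.ClusterSpec({
    "ps":     ["node1:2222"],
    "worker": ["node2:2222", "node3:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()
else:
    with tf.Session(server.target) as sess:
        print(sess.run(tf.constant("worker %d connected" % task_index)))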

Another situation can also cause this problem. Starting from TensorFlow 1.4, distributed mode automatically uses the proxy found in the environment variables when connecting. If the nodes do not need a proxy to reach each other, simply remove the proxy environment variables by adding the following code at the very start of the script:
# Note: this code must come before import tensorflow as tf (or import moxing.tensorflow as mox)

import os
# Drop any proxy settings before TensorFlow is imported; passing None
# as the default avoids a KeyError on nodes where the variable is unset.
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)
The proxy!!!!!!

I suddenly remembered being bitten by a proxy once before, back when I was writing a web crawler, but it never crossed my mind here.....

Then I added the proxy-removal code to the script, and it actually ran. Tears ~~

Hahaha, once the cause is identified, the rest is easy. I added the proxy-removal code on every node, re-ran the Run distributed MNIST training (using feed_dict) script, and: SUCCEED!!!
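If you want to confirm which executors actually carry proxy settings before patching every node, one quick check (my own sketch, not from the official docs) is to map a tiny function over a throwaway RDD and print each executor's proxy-related environment variables:

# Sketch: report proxy-related env vars from each executor.
# Assumes a working PySpark setup; the app name is a placeholder.
from pyspark import SparkContext

sc = SparkContext(appName="proxy-check")

def proxy_env(_):
    import os, socket
    return (socket.gethostname(),
            {k: v for k, v in os.environ.items() if "proxy" in k.lower()})

print(sc.parallelize(range(4), 4).map(proxy_env).collect())
sc.stop()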

Seriously, how come nobody on the internet had mentioned this case......

Besides the pit above, I also ran into the following:

waiting for 1 reservations

This pit appeared because, after a reboot, I restarted some nodes without re-running the distributed startup script, so the rebooted nodes never contacted the master and the master kept waiting for a response. There is also advice online to set spark.cores.max (total cores in the cluster) and spark.task.cpus (cores allocated per worker node) such that spark.cores.max / spark.task.cpus = the number of worker nodes. In short, if the message above appears, check the points listed here.
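As a sketch of that sizing rule for the cluster in this post (one ps plus two workers, three TF nodes in total), the settings would look something like this when building the config in code; the numbers are my assumptions for this particular cluster:

# Sizing rule sketch: spark.cores.max / spark.task.cpus should equal
# the number of TF nodes (here 1 ps + 2 workers = 3).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.cores.max", "3")    # total cores for the job
        .set("spark.task.cpus", "1"))   # cores per task: 3 / 1 = 3 nodes
sc = SparkContext(conf=conf)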

'AutoProxy[get_queue]' object has no attribute 'put'

I hit this pit several times as well. In the end, following a post online, adding --executor-cores 1 to the spark-submit command fixed it; I never got to the bottom of the underlying principle...
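For completeness, the same setting expressed in code rather than on the spark-submit command line (a sketch; the flag form above is what I actually used):

# Equivalent of passing --executor-cores 1 to spark-submit.
from pyspark import SparkConf

conf = SparkConf().set("spark.executor.cores", "1")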

Lessons learned

As an inexperienced newbie, I started out blindly following the official documentation, with no idea how to analyze a problem; all I could do was Google and Baidu, fiddle with the configuration at random, and hit wall after wall for nothing. Later I made up my mind to learn some basics first, such as the ps and worker concepts in distributed TensorFlow and the fundamental concepts and architecture of Spark and Hadoop, and after that getting started went much more smoothly. As the saying goes, 工欲善其事,必先利其器 ("to do a good job, first sharpen your tools"): it not only builds experience but actually solves problems. Skipping it just raises the time cost of learning.

Original link (大专栏): https://www.dazhuanlan.com/2019/08/23/5d5f27c7c8cf3/


Origin: www.cnblogs.com/chinatrump/p/11415199.html