topological sort failed with message: The graph couldn't be sorted in topological order.

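A while ago I was trying to run the code for Stochastic Adversarial Video Prediction and ran into a pile of problems. For someone barely past beginner level like me, the debugging nearly made me tear my hair out... Below is one problem I hit. I searched online for a long time without finding a fix; there is one answer on StackOverflow, but it did not solve the problem. In the end I asked the author in a GitHub issue, and the author replied really promptly (a huge thumbs-up to the author). The problem description and the fix follow.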

1. Error message

The server environment I was using was TensorFlow 1.11.0, CUDA 9.0.176, cuDNN 7.3.1, Ubuntu 16.04, with two NVIDIA TITAN Xp GPUs. The training script I ran was:

CUDA_VISIBLE_DEVICES=0,1 python scripts/train.py --input_dir data/bair --dataset bair \
  --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
  --output_dir logs/bair_action_free/ours_savp \
  --gpu_mem_frac 0.7

which then produced the following errors:

2018-11-08 08:14:24.668709: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2018-11-08 08:14:25.028003: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.

2. Solution

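According to the TensorFlow documentation, the directed graph may contain a cycle, which causes the topological sort to fail. But how do you fix that... Surely the code the author released would not have such a big bug.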
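To make the "cycle breaks topological sort" idea concrete, here is a minimal sketch using Kahn's algorithm in plain Python. It is only an illustration of the general technique, not TensorFlow's actual grappler implementation; the function name `topological_sort` and the edge-list input format are my own assumptions.

```python
from collections import deque


def topological_sort(edges):
    """Return the nodes in topological order, or raise ValueError on a cycle.

    `edges` is a list of (src, dst) pairs meaning "src must come before dst".
    This is an illustrative helper, not part of TensorFlow.
    """
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree[dst] = indegree.get(dst, 0) + 1
        indegree.setdefault(src, 0)

    # Start with every node that has no incoming edges.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Nodes on a cycle never reach in-degree 0, so they are never emitted.
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; no topological order exists")
    return order
```

A chain such as `[("a", "b"), ("b", "c")]` sorts fine, while `[("a", "b"), ("b", "a")]` raises `ValueError`, which is the same kind of failure the log message above reports.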

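In the end, the author explained on GitHub that this is caused by training on multiple GPUs; there may be a bug somewhere in the code, and I could try training on a single GPU instead. GPU memory will probably not be enough in that case, though, so you can reduce the batch size and also change the default discriminator model from a fully convolutional network to dna or flow (honestly I don't understand what that means; I just did what the author said). Changing the script to the following made it work.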

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair \
  --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
  --output_dir logs/bair_action_free/ours_savp \
  --gpu_mem_frac 0.7 \
  --model_hparams tv_weight=0.001,transformation=flow
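Since the fix hinges on exposing only one GPU, a quick sanity check before launching can help. The sketch below just parses the `CUDA_VISIBLE_DEVICES` environment variable in plain Python; the helper name `visible_gpu_ids` is hypothetical and it does not query the driver or TensorFlow, so treat it as a rough check rather than a definitive one.

```python
import os


def visible_gpu_ids(environ=None):
    """Return the device ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then sees every GPU),
    and an empty list when it is set to an empty string (no GPU visible).
    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: all GPUs are visible")
    else:
        print("Visible GPU ids:", ids)
```

With `CUDA_VISIBLE_DEVICES=0,1` this returns two ids (the multi-GPU setup that triggered the error), and with `CUDA_VISIBLE_DEVICES=0` it returns one.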

3. Caveat

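The author also mentioned that reducing the batch size may hurt training, so the final predictions may not be as good as those reported in the paper.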
