1. Running the command th main.lua -expID test-run produced the following error:
cudnnFindConvolutionForwardAlgorithm failed: 2 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA6,3,256,256 -filtA64,3,7,7 6,64,128,128 -padA3,3 -convStrideA2,2 CUDNN_DATA_FLOAT
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes: convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA6,3,256,256 -filtA64,3,7,7 6,64,128,128 -padA3,3 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:190: in function 'func'
/root/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
/root/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
/root/pose-hg-train/src/train.lua:45: in function 'step'
/root/pose-hg-train/src/train.lua:103: in function 'train'
main.lua:19: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670
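After some searching, the cause turned out to be insufficient GPU memory: the status code 2 in "cudnnFindConvolutionForwardAlgorithm failed: 2" corresponds to CUDNN_STATUS_ALLOC_FAILED. I therefore checked GPU usage with nvidia-smi
and ran the job on three GPUs that were not heavily used: CUDA_VISIBLE_DEVICES=2,5,6 th main.lua -expID test-run
This solved the problem.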
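Choosing the GPUs by reading the nvidia-smi table by eye can also be scripted. The following is only a sketch under assumptions: pick_gpus is a hypothetical helper name, and the 8000 MiB free-memory threshold is an illustrative value, not a measured requirement of pose-hg-train. It reads lines of the form "index, used MiB, total MiB" and prints the indices of the GPUs with enough free memory.

```shell
# Hypothetical helper (not part of Torch or nvidia-smi).
# stdin : lines of "index, memory_used_MiB, memory_total_MiB"
# $1    : minimum free memory in MiB (assumed threshold, tune for your model)
# stdout: comma-separated indices of GPUs with at least $1 MiB free
pick_gpus() {
  awk -F', *' -v min_free="$1" '
    ($3 - $2) >= min_free { out = out (out == "" ? "" : ",") $1 }
    END { print out }
  '
}

# Usage on a machine with an NVIDIA driver (left commented out here):
# gpus=$(nvidia-smi --query-gpu=index,memory.used,memory.total \
#          --format=csv,noheader,nounits | pick_gpus 8000)
# CUDA_VISIBLE_DEVICES=$gpus th main.lua -expID test-run
```

If no GPU meets the threshold, the helper prints an empty line, and an empty CUDA_VISIBLE_DEVICES hides every GPU from the process, so it is worth checking the result before launching training.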