A Guide to the Pitfalls of Installing and Testing TensorFlow Faster R-CNN on Ubuntu 16.04

1. Installation

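Some of the libraries depend on specific versions of each other, and changing them after installation is a hassle, so here are the versions I used: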

Anaconda 2
Python 2.7
CUDA 8.0
cuDNN 6.0
OpenCV 3.4
TensorFlow 1.3 (requires cuDNN 6.0)

Faster R-CNN (there are several implementations, and they behave slightly differently at run time; I used https://github.com/CharlesShang/TFFRCNN)
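Since version mismatches are the main source of trouble, a quick check of what is actually installed can save time. The following is a minimal sketch, not part of TFFRCNN; the helper names `version_matches` and `check_versions` are made up for illustration, and the expected versions are simply the ones listed above.

```python
import importlib

# Versions taken from the list above; adjust them to your own setup.
EXPECTED = {"tensorflow": "1.3", "cv2": "3.4"}


def version_matches(installed, expected_prefix):
    """Return True if `installed` starts with the components of `expected_prefix`."""
    wanted = expected_prefix.split(".")
    return installed.split(".")[:len(wanted)] == wanted


def check_versions(expected=EXPECTED):
    """Map each module name to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, prefix in expected.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = (None, False)  # module is not installed
            continue
        installed = getattr(module, "__version__", "")
        report[name] = (installed, version_matches(installed, prefix))
    return report
```

Calling `check_versions()` in the Anaconda environment returns one entry per library, so a missing or mismatched package shows up before any training run starts. Comparing whole version components avoids mistaking 1.13 for 1.3.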

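Also, if your machine's GPU has little memory, I do not recommend the GPU build of TensorFlow, because you will run into OOM (out of memory) errors. It is best to run on a server. You can check GPU usage with nvidia-smi.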
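If you would rather read the memory figures from a script than from the nvidia-smi table, something along these lines should work. It is only a sketch: the function names are invented here, and it assumes an nvidia-smi that supports the `--query-gpu` and `--format=csv,noheader,nounits` options.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines such as '512, 4096' into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = [int(field.strip()) for field in line.split(",")]
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB), one tuple per GPU."""
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_memory(output.decode("utf-8"))
```

On a machine with an NVIDIA driver installed, `query_gpu_memory()` returns one `(used, total)` pair per GPU, which makes it easy to see how much headroom is left before starting training.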

2. Problems at run time and how to fix them

(1) Not enough GPU memory

Error:

ResourceExhaustedError (see above for traceback): 
OOM when allocating tensor of shape [4096] and type float
[Node: fc6/biases/Momentum/Initializer/zeros = Const[_class=["loc:@fc6/biases"], 
dtype=DT_FLOAT, value=Tensor<type: float shape: [4096] values: 0 0 0...>, 
_device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Fix:

Edit TFFRCNN/lib/fast_rcnn/train.py and find this line:

config.gpu_options.per_process_gpu_memory_fraction = 0.40

Try changing the 0.40 here to 0.9 or some other value.
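To see why the fraction matters, it helps to turn it into an actual memory cap. The sketch below is illustrative only: `memory_cap_mib` and `make_session_config` are hypothetical helpers, and the `allow_soft_placement=True` argument is an assumption about how train.py builds its config, not something confirmed here. The TensorFlow calls are the TF 1.x API.

```python
def memory_cap_mib(total_mib, fraction):
    """Approximate upper bound on the GPU memory TensorFlow may claim."""
    return int(total_mib * fraction)


def make_session_config(fraction=0.9):
    """Build a TF 1.x session config with the given per-process memory fraction."""
    import tensorflow as tf  # imported lazily; requires TensorFlow 1.x

    # allow_soft_placement is an assumption; keep whatever train.py already uses.
    config = tf.ConfigProto(allow_soft_placement=True)
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config
```

On a 4096 MiB card, for example, a fraction of 0.40 caps the process at roughly 1638 MiB, while 0.9 allows roughly 3686 MiB, which can be the difference between hitting OOM and getting through initialization.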

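You can also change batch_size in TFFRCNN/lib/fast_rcnn/config.py. It is usually a multiple of 2 and can be reduced moderately, but a smaller batch size will also affect the results.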

The best option is to switch to a machine whose GPU has more memory.

(2) Error when running the demo:

tensorflow.python.framework.errors_impl.NotFoundError: /home/denis/WEB/DeepLearning/Faster-CNN_TF/tools/../lib/roi_pooling_layer/roi_pooling.so:
undefined symbol: _ZN10tensorflow7strings6StrCatB5cxx11ERKNS0_8AlphaNumE

Fix:

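Edit make.sh and add the flag below to the g++ command that builds roi_pooling.so. The full command with the flag in place follows it: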

-D_GLIBCXX_USE_CXX11_ABI=0

g++ -std=c++11 -shared -o roi_pooling.so roi_pooling_op.cc \
	roi_pooling_op.cu.o -I $TF_INC  -D GOOGLE_CUDA=1 -fPIC $CXXFLAGS  -D_GLIBCXX_USE_CXX11_ABI=0 \
	-lcudart -L $CUDA_PATH/lib64
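The flag makes g++ use the old libstdc++ ABI, which is what the prebuilt TensorFlow binaries of that era were compiled with. One way to recognise this class of error is to look for the `B5cxx11` marker in the missing symbol, which is how GCC mangles the `[abi:cxx11]` tag. A tiny sketch (the helper name is invented for illustration):

```python
# GCC encodes the [abi:cxx11] tag as "B5cxx11" inside mangled symbol names.
CXX11_ABI_TAG = "B5cxx11"


def needs_cxx11_abi(mangled_symbol):
    """Return True if the symbol was compiled against the new (C++11) ABI."""
    return CXX11_ABI_TAG in mangled_symbol
```

If the undefined symbol in the error carries the tag, the .so was most likely built with the new ABI while TensorFlow exports the old-ABI version, and rebuilding with `-D_GLIBCXX_USE_CXX11_ABI=0` is worth trying.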

(3) Error when running the demo:

tensorflow.python.framework.errors_impl.NotFoundError: ./faster_rcnn/../lib/roi_pooling_layer/roi_pooling.so: 
cannot open shared object file: No such file or directory

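Fix: find a roi_pooling.so (for example, one built by make.sh) or copy one from elsewhere, and put it in the directory named in the error message.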
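To find out whether a built copy already exists somewhere under the repository, a small directory walk is enough. This is just a sketch; `find_files` is a hypothetical helper, not something shipped with TFFRCNN.

```python
import os


def find_files(root, filename="roi_pooling.so"):
    """Return the sorted paths of every file called `filename` below `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return matches if not matches else sorted(matches)
```

An empty result for the repository root means the library has not been built yet, so copying is not an option and the build step has to be run first.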

3. Running Faster R-CNN on your own data

First prepare your data as a VOC2007-format dataset, then run:

cd $TFFRCNN

python ./faster_rcnn/train_net.py --gpu 0 --weights ./data/pretrain_model/VGG_imagenet.npy --imdb voc_2007_trainval --iters 70000 --cfg  ./experiments/cfgs/faster_rcnn_end2end.yml --network VGGnet_train --set EXP_DIR exp_dir
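Before launching a long training run, it can be worth checking that the dataset is complete. The sketch below assumes the standard PASCAL VOC layout (Annotations, JPEGImages and ImageSets/Main under the VOC2007 directory) and .jpg images; the function name and the `voc_root` argument are illustrative, so adapt them to wherever your dataset lives.

```python
import os


def missing_voc_files(voc_root, split="trainval"):
    """List annotation/image files referenced by the split but absent on disk."""
    split_file = os.path.join(voc_root, "ImageSets", "Main", split + ".txt")
    with open(split_file) as handle:
        image_ids = [line.strip() for line in handle if line.strip()]

    missing = []
    for image_id in image_ids:
        # Every image id needs both an XML annotation and a JPEG image.
        for subdir, ext in (("Annotations", ".xml"), ("JPEGImages", ".jpg")):
            path = os.path.join(voc_root, subdir, image_id + ext)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty list means every id in trainval.txt has both its annotation and its image; anything else is a file to fix before training.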

Error 1:

AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'

Fix:

Add --restore 0 to the command line.

The command then becomes:

python ./faster_rcnn/train_net.py --gpu 0 --restore 0 --weights ./data/pretrain_model/VGG_imagenet.npy --imdb voc_2007_trainval --iters 70000 --cfg ./experiments/cfgs/faster_rcnn_end2end.yml --network VGGnet_train --set EXP_DIR exp_dir
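A likely explanation for the AttributeError, offered here as an assumption rather than something verified against the TFFRCNN source, is that the training script looks up the latest checkpoint with `tf.train.get_checkpoint_state`, which returns None when the output directory has no checkpoint yet, and then reads `.model_checkpoint_path` from it. A defensive version of that lookup might look like this (the helper name is hypothetical):

```python
def checkpoint_to_restore(ckpt_state):
    """Return the checkpoint path to restore from, or None to start fresh.

    `ckpt_state` stands for the value returned by
    tf.train.get_checkpoint_state(output_dir), which is None on a first run.
    """
    if ckpt_state is None:  # no checkpoint has been written yet
        return None
    return ckpt_state.model_checkpoint_path
```

Passing --restore 0 sidesteps the problem by skipping the restore on a first run, which is why it makes the error go away.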

Error 2:

Running demo.py fails because the model cannot be found:

Unsuccessful TensorSliceReader constructor: Failed to get matching files

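Alternatively, the message may be "no model found". The cause is that different TensorFlow versions read saved models differently; some people call it a bug.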

Following the workflow in this guide, the generated model files look like this:

xxxx.ckpt.data-00000-of-00001

xxxx.ckpt.index

xxxx.ckpt.meta

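These are the data file, the index file, and the graph metadata file. There really is no standalone file whose name ends in .ckpt, and the trick of appending a slash to the path that is suggested online does not work either.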

What finally worked:

Simply rename the .meta file by dropping the .meta suffix, so that it becomes xxxx.ckpt. Problem solved!
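The same workaround can be scripted. Note that the sketch below copies the .meta file instead of renaming it, which is a small variation on the step above chosen so that the original .meta file stays in place; both function names are made up for illustration.

```python
import os
import shutil


def checkpoint_prefixes(directory):
    """Return prefixes such as '<dir>/xxxx.ckpt' for which an .index file exists."""
    suffix = ".index"
    return sorted(
        os.path.join(directory, name[:-len(suffix)])
        for name in os.listdir(directory)
        if name.endswith(suffix)
    )


def add_plain_ckpt_file(prefix):
    """Copy <prefix>.meta to <prefix> so a file ending in .ckpt exists."""
    shutil.copyfile(prefix + ".meta", prefix)
    return prefix
```

Running `add_plain_ckpt_file` on each prefix returned by `checkpoint_prefixes` for the output directory leaves you with an xxxx.ckpt file next to the three original files, and that path can then be passed to demo.py.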