My AI Journey (45) -- Training CenterNet on Your Own Dataset

    CenterNet is an anchor-free detector that combines high detection accuracy with fast inference. Judging from the numbers listed in the authors' paper, its overall metrics are quite impressive:

     The last entry, CenterNet-HG (the variant whose backbone is Hourglass-104), has an AP only slightly below FSAF (whose source code does not seem to have been released yet) and clearly beats the YOLO and R-CNN families. Its FPS is only 7.8, but that is still enough for video detection where real-time performance is not critical, so I decided to give it a try.

     First, download the authors' source code with git clone https://github.com/xingyizhou/CenterNet.git. According to the install guide https://github.com/xingyizhou/CenterNet/blob/master/readme/INSTALL.md, the required environment is:

      Ubuntu 16.04, with Anaconda Python 3.6 and PyTorch v0.4.1

The code was written against PyTorch 0.4.1, and the highest CUDA version PyTorch 0.4.1 supports is CUDA 9, which is incompatible with the CUDA 10.0/10.1 installed on our server. I first tried creating an isolated conda environment with a CUDA 10 build of PyTorch 1.3 or 1.0.0, but running CenterNet then failed because some of the APIs it uses no longer exist (more on that below). Installing another CUDA on a shared server is not something I wanted to risk either (anyone who has installed CUDA knows how easily a bad install leaves you unable to log in, staring at a black screen, and so on), so Docker was the safest option. First pull a devel image with PyTorch 0.4.1 + CUDA 9.0 from hub.docker.com:

     docker pull pytorch/pytorch:0.4.1-cuda9-cudnn7-devel

Then create and run a container. The default working directory inside the container is /workspace, so map the work_pytorch directory that holds the CenterNet source to /workspace; also map port 12000 in case the model needs to be wrapped as a server later, and pass --ipc=host to avoid shared-memory errors during multi-GPU distributed training:

      nvidia-docker run --ipc=host -d -it --name pytorch0.41 -v /home/fychen/AI/work_pytorch:/workspace -p 12000:12000 pytorch/pytorch:0.4.1-cuda9-cudnn7-devel bash

Inside the container (PyTorch is installed under /opt/conda), apply the following patch, which changes torch.backends.cudnn.enabled to False on line 1254 of torch/nn/functional.py:

      sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" /opt/conda/lib/python3.6/site-packages/torch/nn/functional.py

Then install pycocotools with the following commands, in order:

     git clone https://github.com/cocodataset/cocoapi.git
     cd cocoapi/PythonAPI
     pip install cython
     make
     python setup.py install --user

  Next, run the following commands to build the parts of CenterNet that need compiling:

     cd /workspace/CenterNet
     pip install -r requirements.txt

     cd src/lib/models/networks/DCNv2
     ./make.sh

     cd /workspace/CenterNet/src/lib/external
     make

Also install a few system libraries that CenterNet needs at runtime (without them you will get errors):

      apt-get install libglib2.0-dev  libsm6  libxrender1  libxext6

Then download the matching pre-trained model. The backbone I want is Hourglass, and the trained model will be used for object detection; according to https://github.com/xingyizhou/CenterNet/blob/master/readme/MODEL_ZOO.md :

      Download the ctdet_coco_hg model in the first row: click the model link on the right to get the model file ctdet_coco_hg.pth. The file is hosted on drive.google.com, so you need a way around the firewall to reach it; fortunately I had downloaded it earlier with a colleague's tool and have re-uploaded it here (the ctdet_coco_hg.pth file is split into four parts because of the upload size limit: part1, part2, part3, part4), since lately most such tools seem to have stopped working. Put the downloaded ctdet_coco_hg.pth under CenterNet/models/. Because ctdet_coco_hg expects a COCO-format dataset, if yours is in PASCAL VOC2007 format you still have to write a script to convert it to COCO2017 format (the authors' GitHub page does not say whether the COCO2014 or COCO2017 layout is expected, but reading the code in coco.py makes it clear it is COCO2017; a minimal conversion sketch is given after the format notes below). One more remark: I could not find any article explaining the difference between the COCO2014 and COCO2017 layouts, so based on my own conversion experience (I have converted PASCAL VOC2007 to both) here is an example:

# COCO2014-format dataset (object detection part):
coco/annotations/instances_train2014.json,instances_minival2014.json
coco/images/train2014/COCO_train2014_000000000001.jpg
coco/images/val2014/COCO_val2014_000000000001.jpg

# COCO2017-format dataset (object detection part). Whether the images sit directly under coco/ or under an extra images/ subdirectory holding train2017 and val2017 depends on the model's code: CenterNet, for instance, does not use one, while EfficientDet does. Alternatively, keep one fixed dataset layout and patch the data-reading code in the coco dataset implementation of any model that does not match it:

CenterNet expects this layout:
   coco/annotations/instances_train2017.json, instances_val2017.json
   coco/train2017/000000000001.jpg
   coco/val2017/000000010001.jpg

EfficientDet expects this layout:

COCO
├── annotations
│   ├── instances_train2017.json
│   └── instances_val2017.json
└── images
    ├── train2017
    └── val2017

As you can see, COCO2014 also requires a prefix in every image file name, which is rather silly; COCO2017 dropped that unreasonable requirement.
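
Since the conversion script itself is not included here, below is a minimal sketch of what a VOC2007-to-COCO2017 converter can look like. The paths, the single 'fire' category, and the output file name are assumptions for illustration only; adapt them to your own dataset and run it once per split (train2017 / val2017):

# voc2coco.py -- minimal VOC2007 -> COCO2017 conversion sketch (illustrative only).
# Assumed inputs: VOC-style XML files under VOC_ANN_DIR; assumed single class 'fire'.
import json
import os
import xml.etree.ElementTree as ET

VOC_ANN_DIR = 'VOCdevkit/VOC2007/Annotations'            # assumed input path
OUT_JSON = 'coco/annotations/instances_train2017.json'   # assumed output path
CATEGORIES = [{'id': 1, 'name': 'fire', 'supercategory': 'none'}]

def voc_to_coco(ann_dir, out_json):
    images, annotations = [], []
    ann_id = 1
    xml_files = sorted(f for f in os.listdir(ann_dir) if f.endswith('.xml'))
    for img_id, xml_name in enumerate(xml_files, start=1):
        root = ET.parse(os.path.join(ann_dir, xml_name)).getroot()
        size = root.find('size')
        images.append({'id': img_id,
                       'file_name': root.findtext('filename'),
                       'width': int(size.findtext('width')),
                       'height': int(size.findtext('height'))})
        for obj in root.iter('object'):
            cat_id = next((c['id'] for c in CATEGORIES
                           if c['name'] == obj.findtext('name')), None)
            if cat_id is None:
                continue
            box = obj.find('bndbox')
            xmin, ymin = float(box.findtext('xmin')), float(box.findtext('ymin'))
            xmax, ymax = float(box.findtext('xmax')), float(box.findtext('ymax'))
            w, h = xmax - xmin, ymax - ymin
            annotations.append({'id': ann_id, 'image_id': img_id,
                                'category_id': cat_id,
                                'bbox': [xmin, ymin, w, h],  # COCO bbox format: [x, y, w, h]
                                'area': w * h, 'iscrowd': 0})
            ann_id += 1
    os.makedirs(os.path.dirname(out_json), exist_ok=True)
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations,
                   'categories': CATEGORIES}, f)

if __name__ == '__main__':
    voc_to_coco(VOC_ANN_DIR, OUT_JSON)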

       Once your own COCO2017-format dataset is ready, copy or symlink it under CenterNet/data/ with the following layout:

        CenterNet/data/coco/annotations/...

        CenterNet/data/coco/train2017/...

        CenterNet/data/coco/val2017/...

With the pre-trained model and the COCO dataset in place, the next step is to adapt the CenterNet code to your own dataset:

  1) Modify these places in CenterNet/src/lib/datasets/dataset/coco.py:

     num_classes = 1    # change COCO's 80 classes to 1; my dataset has only one object class to detect

     ...

     self.annot_path = os.path.join(
          self.data_dir, 'annotations',
          #'image_info_test-dev2017.json').format(split)
          'instances_val2017.json')   # image_info_test-dev2017.json belongs to the official COCO2017 dataset; point this at my own annotation file instead
     class_name = ['__background__', 'fire']  # comment out or delete the 80 COCO classes and list my own class

2) Modify CenterNet/src/main.py:

    torch.backends.cudnn.benchmark = False   # otherwise an error is raised at startup; see the error list at the end for details

    ...

    if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:

      ...

      if log_dict_val[opt.metric] < best:
        best = log_dict_val[opt.metric]
        save_model(os.path.join(opt.save_dir, 'model_best.pth'),epoch, model)
      save_model(os.path.join(opt.save_dir, 'lite_model_last.pth'),epoch, model)

      best is initialized to 1e10, so the condition log_dict_val[opt.metric] < best is satisfied the first time epoch reaches opt.val_intervals (default 5), at which point best is updated to log_dict_val[opt.metric]. After that the condition is rarely satisfied again, so model_best.pth is rarely regenerated. My dataset has only a few thousand images, which is not large, yet one epoch still takes about 15 minutes, and sometimes I need a checkpoint for testing right away. It is therefore worth adding the extra line above so that every opt.val_intervals epochs a lighter model file without the optimizer state is written (lite_model_last.pth, a bit over 700 MB), whereas the model_last.pth saved every epoch includes the optimizer:

       save_model(os.path.join(opt.save_dir, 'model_last.pth'), epoch, model, optimizer)
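
To see why the file without the optimizer is so much smaller: save_model only stores the optimizer state when an optimizer is passed in. Its implementation in src/lib/models/model.py is roughly the following (paraphrased from memory, not copied verbatim, so check the file in your own checkout):

import torch

def save_model(path, epoch, model, optimizer=None):
  # Unwrap DataParallel so the checkpoint can be loaded without it.
  if isinstance(model, torch.nn.DataParallel):
    state_dict = model.module.state_dict()
  else:
    state_dict = model.state_dict()
  data = {'epoch': epoch, 'state_dict': state_dict}
  if optimizer is not None:
    # Only saved when an optimizer is passed, which is why model_last.pth is much larger.
    data['optimizer'] = optimizer.state_dict()
  torch.save(data, path)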

3) Modify lib/opts.py (used by test and demo; change the class count from 80 to 1):

     'ctdet': {'default_resolution': [512, 512], 'num_classes': 1,
                'mean': [0.408, 0.447, 0.470], 'std': [0.289, 0.274, 0.278],
                'dataset': 'coco'}

Now training can be started with the following command (using five GPUs, numbers 2-6):

      nohup python -u main.py ctdet --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4 --load_model ../models/ctdet_coco_hg.pth --gpus 2,3,4,5,6 &

To train from scratch without the pre-trained model, run:
      nohup python -u main.py ctdet --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4  --gpus 2,3,4,5,6 & 

       If training is interrupted unexpectedly, restart it with the command below (--resume makes the script automatically load the last saved model_last.pth so training resumes from where it left off):

     nohup python -u main.py ctdet --resume --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4  --gpus 2,3,4,5,6 & 

   To change the default total number of training epochs (140), say to 100, add --num_epochs=100 on the command line or edit CenterNet/src/lib/opts.py.

   So far training has run for a bit over three days and 265 epochs; train_loss is still decreasing steadily with no large oscillation or rebound, so accuracy appears to still be improving.

By epoch 300 the various loss values had stabilized, so I stopped training and ran validation with:

     python -u test.py ctdet --exp_id coco_hg --arch hourglass --keep_res --resume

This evaluates the images under data/coco/val2017/ and prints the final AP values.

Copy a few images to data/coco/test/ and run:

     python -u demo.py ctdet --arch hourglass --demo ../data/coco/test/ --load_model ../exp/ctdet/coco_hg/model_last.pth

The annotated detection results pop up after inference so you can inspect how well the model works; use the --load_model argument to test any specific checkpoint.

  Errors encountered and their fixes:

1. Traceback (most recent call last):
   File "main.py", line 12, in <module>
    from models.model import create_model, load_model, save_model
  File "/home/fychen/AI/CenterNet/src/lib/models/model.py", line 12, in <module>
    from .networks.pose_dla_dcn import get_pose_net as get_dla_dcn
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/pose_dla_dcn.py", line 16, in <module>
    from .DCNv2.dcn_v2 import DCN
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2.py", line 11, in <module>
    from .dcn_v2_func import DCNv2Function
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2_func.py", line 9, in <module>
    from ._ext import dcn_v2 as _backend
  File "/home/xfychenrt/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/__init__.py", line 3, in <module>
    from ._dcn_v2 import lib as _lib, ffi as _ffi
ImportError: /home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/_dcn_v2.so: undefined symbol: __cudaPopCallConfiguration

The CUDA version is wrong (CUDA 10); the matching CUDA 9 must be installed.

2. Traceback (most recent call last):
  File "main.py", line 12, in <module>
    from models.model import create_model, load_model, save_model
  File "/home/fychen/AI/CenterNet/src/lib/models/model.py", line 12, in <module>
    from .networks.pose_dla_dcn import get_pose_net as get_dla_dcn
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/pose_dla_dcn.py", line 16, in <module>
    from .DCNv2.dcn_v2 import DCN
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2.py", line 11, in <module>
    from .dcn_v2_func import DCNv2Function
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2_func.py", line 9, in <module>
    from ._ext import dcn_v2 as _backend
  File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/__init__.py", line 2, in <module>
    from torch.utils.ffi import _wrap_function
  File "/home/fychen/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
    raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
   ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.

The CenterNet code is written against the PyTorch 0.4.1 API but PyTorch 1.0.0 or 1.3 is installed; roll PyTorch back to 0.4.1.

 3. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
    data_queue.put((idx, samples))
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1552_747694723>
Traceback (most recent call last):
  File "main.py", line 102, in <module>
    main(opt)
  File "main.py", line 70, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 119, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 61, in run_epoch
    for iter_id, batch in enumerate(data_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/opt/conda/lib/python3.6/queue.py", line 164, in get
    self.not_empty.wait()
  File "/opt/conda/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1552) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

   Shared memory is insufficient. Commit the container as a new image to preserve all the changes made so far, exit the current container, then create a new container from the new image, this time adding --ipc=host to the run command so the container shares the host's IPC namespace (and therefore its shared memory).

4. Traceback (most recent call last):
  File "main.py", line 102, in <module>
    main(opt)
  File "main.py", line 70, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 119, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 69, in run_epoch
    output, loss, loss_stats = model_with_loss(batch)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
Arnold img_path=== /workspace/CenterNet/src/lib/../../data/coco/train2017/00001753.jpg
    result = self.forward(*input, **kwargs)
  File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 19, in forward
    outputs = self.model(batch['input'])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/CenterNet/src/lib/models/networks/large_hourglass.py", line 255, in forward
    inter = self.pre(image)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/CenterNet/src/lib/models/networks/large_hourglass.py", line 27, in forward
    conv = self.conv(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1532579805626/work/aten/src/THC/THCGeneral.cpp:663

Modify main.py:
   torch.backends.cudnn.benchmark = False

5. Warning when running demo.py:

Creating model...
loaded ../exp/ctdet/coco_hg/model_last.pth, epoch 300
Skip loading parameter hm.0.1.weight, required shapetorch.Size([80, 256, 1, 1]), loaded shapetorch.Size([1, 256, 1, 1]). 
If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.

Fix: set num_classes in default_dataset_info in lib/opts.py to your actual number of classes (1 in my case).

If you want to use the debugger, set --debug on the command line (or opt.debug in the code) to a value of at least 1, and remember to change the class names in coco_class_name in src/lib/utils/debugger.py to your own dataset's class names.
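
For my single-class dataset (the 'fire' class used earlier), that change amounts to replacing the list of 80 COCO names with:

     coco_class_name = ['fire']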
