Object Detection with YOLOv5 - Errors in Multi-Machine, Multi-GPU Training with YOLOv5:v6 and How to Fix Them

YOLOv5:v5 was released in April 2021, and YOLOv5:v6 followed in October 2021.
The v6 release introduced the smaller Nano models YOLOv5n and YOLOv5n6.

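1. Improvements
Roboflow is now integrated. Roboflow can be used to organize, label, prepare, version, and host the datasets used to train YOLOv5 models, and it also hosts many public datasets.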

2. Problem encountered during multi-machine, multi-GPU training

TypeError: barrier() got an unexpected keyword argument 'device_ids'

The failing code is in utils/torch_utils.py:

@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        dist.barrier(device_ids=[local_rank])
    yield
    if local_rank == 0:
        dist.barrier(device_ids=[0])

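The cause is the PyTorch version. YOLOv5 recommends Python>=3.6.0 and PyTorch>=1.7, but barrier() in PyTorch 1.7 has no device_ids parameter.
The function signature in PyTorch 1.7: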

torch.distributed.barrier(group=<object object>, async_op=False)
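
To make the failure concrete, here is a minimal sketch that reproduces the same kind of TypeError without needing PyTorch installed. `barrier_v17` and `call_with_device_ids` are hypothetical stand-ins written for this illustration; the stub only mimics the parameter names of the 1.7 signature shown above (its `group` default is simplified to `None`).

```python
def barrier_v17(group=None, async_op=False):
    """Hypothetical stub that mimics the PyTorch 1.7 barrier() parameter list."""
    return None


def call_with_device_ids(barrier_fn, local_rank):
    """Call barrier_fn with device_ids, as YOLOv5:v6 does, and return the error text if it fails."""
    try:
        barrier_fn(device_ids=[local_rank])
    except TypeError as exc:
        return str(exc)
    return None


# The stub has no device_ids parameter, so Python rejects the keyword argument.
message = call_with_device_ids(barrier_v17, 0)
print(message)
```

The printed message names the stub rather than `barrier`, but the cause is the same: the caller passes a keyword that the installed function does not define.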

Now the signature in the newer PyTorch 1.9:

torch.distributed.barrier(group=None, async_op=False, device_ids=None)

And the signature in PyTorch 1.8:

torch.distributed.barrier(group=None, async_op=False, device_ids=None)

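Comparing the signatures shows that YOLOv5:v6, released in October 2021, was not written against PyTorch 1.7: the device_ids parameter only exists from 1.8 onward. The simplest fix is to upgrade PyTorch to at least 1.8.
Alternatively, following YOLOv5's usual practice, a check_requirements() call could be added, mainly to verify torch>=1.8.0.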
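
As an illustration of such a check, the sketch below compares version strings numerically and fails early with a clear message. It is not YOLOv5's check_requirements(); `parse_version`, `version_at_least`, and `require_torch` are hypothetical helpers written for this post. It takes the version as a string so that it can be tried without PyTorch installed.

```python
import re


def parse_version(text):
    """Return the leading numeric fields as a tuple, e.g. '1.9.0+cu111' -> (1, 9, 0)."""
    release = re.match(r"\d+(\.\d+)*", text.strip())
    if release is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in release.group(0).split("."))


def version_at_least(current, minimum):
    """Compare versions numerically, so that '1.10.0' counts as newer than '1.8.0'."""
    cur, req = parse_version(current), parse_version(minimum)
    width = max(len(cur), len(req))
    # Pad the shorter tuple with zeros so that '1.8' and '1.8.0' compare equal.
    cur += (0,) * (width - len(cur))
    req += (0,) * (width - len(req))
    return cur >= req


def require_torch(current, minimum="1.8.0"):
    """Raise a readable error when the installed version is too old."""
    if not version_at_least(current, minimum):
        raise RuntimeError(
            f"torch>={minimum} is required for barrier(device_ids=...), found {current}"
        )
```

In a training script this would be called once at startup, for example `require_torch(torch.__version__)`, before any distributed setup runs.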

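Second option:
If upgrading is not possible, replace the code above with the following, which calls barrier() without the device_ids argument: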

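from contextlib import contextmanager

import torch


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        # Non-master processes wait here; no device_ids, so this also works on PyTorch 1.7
        torch.distributed.barrier()
    yield
    if local_rank == 0:
        # The local master finishes its work, then releases the waiting processes
        torch.distributed.barrier()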
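
A third possibility is to support both cases at once: pass device_ids only when the installed barrier() accepts it. The following is a rough sketch under that assumption, not code from YOLOv5. `accepts_kwarg`, `compat_barrier`, and `zero_first` are hypothetical names, and the barrier function is passed in as an argument so the logic can be tried with plain Python stubs.

```python
import inspect
from contextlib import contextmanager


def accepts_kwarg(func, name):
    """Return True if func declares a parameter called name."""
    try:
        return name in inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some callables expose no inspectable signature; assume the keyword is unsupported.
        return False


def compat_barrier(barrier_fn, local_rank):
    """Call barrier_fn with device_ids when supported, otherwise without it."""
    if accepts_kwarg(barrier_fn, "device_ids"):
        barrier_fn(device_ids=[local_rank])
    else:
        barrier_fn()


@contextmanager
def zero_first(barrier_fn, local_rank):
    """Same control flow as torch_distributed_zero_first, with a version-tolerant barrier."""
    if local_rank not in [-1, 0]:
        compat_barrier(barrier_fn, local_rank)
    yield
    if local_rank == 0:
        compat_barrier(barrier_fn, 0)
```

In real training code, `barrier_fn` would be `torch.distributed.barrier`, for example `with zero_first(torch.distributed.barrier, local_rank): ...`. Whether passing device_ids is appropriate also depends on the backend in use, so treat this as a starting point rather than a drop-in patch.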
