YOLOv5+BiSeNet: simultaneous object detection and semantic segmentation

Preface

I came across a project on GitHub in which the author adds a segmentation head to YOLOv5, bringing a BiSeNet-style semantic segmentation branch into the object detector so that a single model performs object detection and semantic segmentation at the same time.
Project address: https://github.com/TomMao23/multiyolov5

Effect preview

Here is the result I reproduced with the model weights provided by the original author:
(I originally wanted to show a video, but it disappeared inexplicably after I uploaded it to CSDN twice, so here is an animation instead.)
[Animation: detection and segmentation result]

Model architecture

The object detection model is YOLOv5. Its principles are explained in detail in my previous blog post, "[Target Detection] from YOLOv1 to YOLOX (Theoretical Review)".
The semantic segmentation model uses part of the BiSeNet structure. Since segmentation is not my research area, I will not go into its principles in detail. Here is the structure diagram of BiSeNet [1]:

[Figure: BiSeNet structure diagram]

Core code

The original author used the COCO dataset for object detection and the Cityscapes dataset for semantic segmentation.
The model is a modification of YOLOv5 v5.0, with YOLOv5m as the baseline. Semantic segmentation is implemented mainly by adding a segmentation head to the head section of the model config:
yolov5m_city_seg.yaml

# parameters
nc: 10  # number of classes
n_segcls: 19  # number of segmentation classes
depth_multiple: 0.67  # model depth multiple
width_multiple: 0.75  # layer channel multiple

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4  # PANet uses add, YOLOv5 uses concat
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)
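                  #[classes/output channels, n of C3, c2 of C3, shortcut of C3 (taking the base head as an example; meanings may differ for other heads)]  parsed in yolo.py; the first item in [] must be the number of output channels
   #[[4, 19], 1, SegMaskLab, [n_segcls, 3, 256, False]],  # segmentation head, channel setting 256; n in [] is 3
   [[16, 19, 22], 1, SegMaskPSP, [n_segcls, 3, 256, False]],  # segmentation head, channel setting 256
   #[[16, 19, 22], 1, SegMaskBiSe, [n_segcls, 3, 256, False]],  # segmentation head, channel setting has no effect
   #[[16], 1, SegMaskBase, [n_segcls, 3, 512, False]],  # segmentation head, channel setting 512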

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)  must be the last layer: much of the original code assumes Detect is last, and not all of it was changed
  ]
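
As a rough illustration of how depth_multiple and width_multiple act on this config, the sketch below follows the scaling rule used by upstream YOLOv5's parse_model: repeat counts are multiplied by the depth multiple and rounded, and channel counts are multiplied by the width multiple and rounded up to a multiple of 8. The helper names scale_depth and scale_width are hypothetical, and applying the same rule to the segmentation head's hidden channels is an assumption based on the author's comments in yolo.py, not something checked against the repository.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_depth(n, depth_multiple):
    """Scale a module repeat count; a count of 1 is left unchanged."""
    return max(round(n * depth_multiple), 1) if n > 1 else n


def scale_width(c, width_multiple):
    """Scale a channel count and round it up to a multiple of 8."""
    return make_divisible(c * width_multiple, 8)


# yolov5m values taken from the config above
depth_multiple, width_multiple = 0.67, 0.75

print(scale_depth(9, depth_multiple))     # repeats of a C3 block written as 9
print(scale_depth(3, depth_multiple))     # repeats of a C3 block written as 3
print(scale_width(256, width_multiple))   # channels written as 256
print(scale_width(1024, width_multiple))  # channels written as 1024
```

With a width multiple of 0.5 the same rule turns 256 into 128, which is consistent with the author's note that the s model ends up with c_hid=128.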

In the config, at the final output stage, the author adds a semantic segmentation head in parallel with the Detect head. SegMaskLab, SegMaskPSP, SegMaskBiSe, and SegMaskBase are different, independent head structures that the author used for experiments; only SegMaskPSP is enabled in the config above. Their detailed structure can be found in yolo.py:

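class SegMaskPSP(nn.Module):  # PSP head with added RFB2 and FFM; channels are likewise reduced. No suitable place was found for an auxiliary loss, so it was dropped
    # n is the C3 repeat count (kept in the interface but unused); c_hid is the hidden-layer output channel count (note the config's width scaling: s*0.5, m*0.75, l*1)
    def __init__(self, n_segcls=19, n=1, c_hid=256, shortcut=False, ch=()):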
        super(SegMaskPSP, self).__init__()
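        # Taking layers 16, 19, 22 costs some time for the deeper fusion, but detection improves and segmentation is still good.
        # Strict ablation experiments show that 17, 20, 23 may raise segmentation slightly further, but detection drops by more than 3 points; this holds for all heads
        self.c_in8 = ch[0]  # 16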
        self.c_in16 = ch[1]  # 19
        self.c_in32 = ch[2]  # 22
        # self.c_aux = ch[0]  # auxiliary loss: no suitable place to attach it, so it was abandoned
        self.c_out = n_segcls
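        # Note: the config file specifies 256 channels; for the s model this gives c_hid=128
        # Experiments show that segmenting from shallow, weakly nonlinear layers degrades the head into an aid for detection
        # (segmentation drops, e.g. from 72 to 70 or 71, while detection rises noticeably), so a more strongly nonlinear layer
        # with a suitably enlarged receptive field should be placed before PP
        self.out = nn.Sequential(
                                # 3*128//6=64. RFB2 is unrelated to RFB; the name is a historical leftover (the pretrained model worked well, so it was not renamed and retrained)
                                RFB2(c_hid*3, c_hid, d=[2,3], map_reduce=6),
                                # 1, 2, 3, 6 as in the original paper; PSP works better with global pooling, but ASPP with global pooling produced fragmented boundaries
                                PyramidPooling(c_hid, k=[1, 2, 3, 6]),
                                # FFM uses k=3, with some channels cut to reduce computation (principle: when fusing features that differ a lot, the first layer
                                # should be a 3*3 conv even at the cost of channels; FFM also fuses better than a plain conv; all heads except the base head follow this scheme)
                                FFM(c_hid*2, c_hid, k=3, is_cat=False),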
                                nn.Conv2d(c_hid, self.c_out, kernel_size=1, padding=0),
                                nn.Upsample(scale_factor=8, mode='bilinear', align_corners=True),
                               )
        self.m8 = nn.Sequential(
                                Conv(self.c_in8, c_hid, k=1),
        )
        self.m32 = nn.Sequential(
                                Conv(self.c_in32, c_hid, k=1),
                                nn.Upsample(scale_factor=4, mode='bilinear', align_corners=True),
        )
        self.m16 = nn.Sequential(
                                Conv(self.c_in16, c_hid, k=1),
                                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
        )
        # self.aux = nn.Sequential(
        #                        Conv(self.c_aux, 256, 3),  
        #                        nn.Dropout(0.1, False), 
        #                        nn.Conv2d(256, self.c_out, kernel_size=1),
        #                        nn.Upsample(scale_factor=8, mode='bilinear', align_corners=True),
        # )
    def forward(self, x):
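        # Ablation on this head's three-layer fused input: layer 16 alone gives 72.6, three-layer fusion gives 73.5.
        # Three-layer fusion is recommended for every head that works at 1/8 resolution; experiments on the Lab head show that a three-layer fused 1/16 input also improves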
        feat = torch.cat([self.m8(x[0]), self.m16(x[1]), self.m32(x[2])], 1)
        # return self.out(feat) if not self.training else [self.out(feat), self.aux(x[0])]
        return self.out(feat)
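
To make the fusion in forward easier to follow, here is a small shape trace in plain Python (no PyTorch needed). It only mirrors what the listing above shows: each of the three inputs is projected to c_hid channels and upsampled to the 1/8 grid, the results are concatenated along the channel axis, and the output is upsampled by 8. The function name seg_head_shapes is hypothetical, and the trace assumes that RFB2, PyramidPooling, and FFM keep the 1/8 spatial size, which the listing does not show.

```python
def seg_head_shapes(img_hw, c_hid, n_segcls=19):
    """Trace (channels, height, width) through a SegMaskPSP-style head."""
    h, w = img_hw
    assert h % 32 == 0 and w % 32 == 0, "input size assumed divisible by 32"

    # Each branch: 1x1 conv to c_hid channels, then upsample onto the 1/8 grid.
    branches = {}
    for name, stride, up in (("m8", 8, 1), ("m16", 16, 2), ("m32", 32, 4)):
        branches[name] = (c_hid, h // stride * up, w // stride * up)

    h8, w8 = h // 8, w // 8
    fused = (3 * c_hid, h8, w8)          # torch.cat(..., 1) stacks the channels
    logits = (n_segcls, h8 * 8, w8 * 8)  # 1x1 conv to classes, then x8 upsample
    return branches, fused, logits


# Example: a 512x1024 input with c_hid=192 (256 scaled by 0.75 for the m model)
branches, fused, logits = seg_head_shapes((512, 1024), c_hid=192)
for name, shape in branches.items():
    print(name, shape)
print("fused", fused)
print("logits", logits)
```

All three branches land on the same 64x128 grid here, which is what allows them to be concatenated; the fused tensor has 3*c_hid channels, matching the c_hid*3 input of RFB2 in the listing.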

The main changes to inference (detect.py) are shown below. From the model output, seg holds the semantic segmentation result. It is resized to the original frame size, the predefined color map Cityscapes_COLORMAP is used to color each segmented class, and the colored mask is blended with the original frame.

seg = F.interpolate(seg, (im0.shape[0], im0.shape[1]), mode='bilinear', align_corners=True)[0]
mask = label2image(seg.max(axis=0)[1].cpu().numpy(), Cityscapes_COLORMAP)[:, :, ::-1]
dst = cv2.addWeighted(mask, 0.4, im0, 0.6, 0)
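
The coloring and blending steps can be sketched with NumPy alone. This is a minimal illustration, not the repository's code: colorize stands in for what label2image appears to do (an assumption, a palette lookup by class index), blend reproduces the weighted sum that cv2.addWeighted computes for 8-bit images, and the 3-entry COLORMAP is a made-up palette, whereas Cityscapes_COLORMAP has one color per segmentation class.

```python
import numpy as np

# Hypothetical 3-class RGB palette, for illustration only.
COLORMAP = np.array([[128, 64, 128], [244, 35, 232], [70, 70, 70]], dtype=np.uint8)


def colorize(logits, colormap):
    """Map (C, H, W) class scores to an (H, W, 3) BGR color image."""
    labels = logits.argmax(axis=0)  # per-pixel class index, shape (H, W)
    rgb = colormap[labels]          # palette lookup via integer indexing
    return rgb[:, :, ::-1]          # RGB -> BGR, the channel order OpenCV uses


def blend(mask, image, alpha=0.4, beta=0.6):
    """Weighted sum of two uint8 images, rounded and clipped to [0, 255]."""
    out = alpha * mask.astype(np.float32) + beta * image.astype(np.float32)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


logits = np.random.rand(3, 4, 6).astype(np.float32)  # stand-in for the seg output
frame = np.full((4, 6, 3), 200, dtype=np.uint8)      # stand-in for im0
overlay = blend(colorize(logits, COLORMAP), frame)
print(overlay.shape, overlay.dtype)
```

The 0.4 and 0.6 weights match the ones passed to cv2.addWeighted above, so the segmentation colors tint the frame without hiding it.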

Code backup

There are many other changes; see the original author's repository for them.
I have backed up the code here, including the model weights provided by the author:
https://pan.baidu.com/s/1JtqCtlJwk5efkiTQqmNpVA?pwd=36bk

References

[1] https://blog.csdn.net/qq_40073354/article/details/120725919

Origin blog.csdn.net/qq1198768105/article/details/126122364