Object Detection Algorithm: YOLOv7

Background

As a CVPR 2023 paper in object detection, YOLOv7 is well worth taking the time to study.
paper:https://arxiv.org/abs/2207.02696
code:https://github.com/WongKinYiu/yolov7

Network Architecture

Background Knowledge

VoVNet

When designing lightweight networks, FLOPs and the number of parameters are the main considerations, but reducing model size and FLOPs is not the same as reducing inference time. Two further factors also matter: memory access cost (MAC) and GPU computational efficiency.

  1. MAC

The ShuffleNetV2 paper gives the following formula for the MAC of a convolutional layer:
$MAC = hw(c_i + c_o) + k^2 c_i c_o$
where $k, h, w, c_i, c_o$ denote the kernel size, the output feature-map height and width, and the input and output channel counts, respectively. The computational cost of the convolutional layer is $B = k^2 h w c_i c_o$; with $B$ fixed, we have:
$MAC \ge 2\sqrt{\frac{hwB}{k^2}} + \frac{B}{hw}$
By the AM-GM inequality, MAC attains its lower bound exactly when the input and output channel counts are equal, which is the most efficient design (a small numerical check is given after this list).

  2. GPU computational efficiency

The strength of GPU computation lies in its parallelism. Splitting a convolution with a large kernel into several smaller convolutions may produce the same result with fewer FLOPs, yet run inefficiently on a GPU. Therefore, instead of FLOPs alone, the more relevant metric is FLOPs per second, i.e. the total FLOPs divided by the total GPU inference time; the higher this value, the better the GPU utilization.
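As a quick numerical check of the MAC lower bound from point 1 (an illustrative Python snippet written for this post, not taken from any of the papers), keep c_in * c_out constant so that the FLOPs B are fixed and compare the MAC of different channel splits:

def conv_mac(h, w, c_in, c_out, k):
    # memory access cost: feature-map reads/writes plus kernel weights
    return h * w * (c_in + c_out) + k * k * c_in * c_out

h = w = 56
k = 3
# every pair below has c_in * c_out = 65536, so B = k^2 * h * w * c_in * c_out is identical
for c_in, c_out in [(64, 1024), (128, 512), (256, 256), (512, 128)]:
    print(c_in, c_out, conv_mac(h, w, c_in, c_out, k))
# MAC is smallest for (256, 256), i.e. when the input and output channel counts are equal

Likewise, FLOPs per second can be estimated by simply timing the forward pass; a minimal sketch (assuming a CUDA device and that the total FLOPs of one forward pass are already known, e.g. from a profiler) might look like this:

import time
import torch

def flops_per_second(model, x, flops, n_iters=100):
    model.eval()
    with torch.no_grad():
        for _ in range(10):                 # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()
    latency = (time.time() - start) / n_iters
    return flops / latency                  # higher means better GPU utilization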
VoVNet is an improvement over DenseNet. As shown in the figure below, every layer in a DenseNet block receives all preceding layers as additional input and aggregates their feature maps by concatenation, reusing features to improve accuracy. Since the input of each layer grows linearly while its output channel count is fixed, the input and output channel counts do not match, so MAC is not optimal. Moreover, because the input channel count is large, a 1×1 convolution is applied first to reduce the dimensionality, and this extra layer is unfavorable for efficient GPU computation.
image.png
The problem with DenseNet is that every layer aggregates the features of all preceding layers, which leads to feature redundancy. VoVNet therefore proposes the OSA (One-Shot Aggregation) module, shown in the figure below.
image.png
In short, the features of all preceding layers are aggregated only once, at the last layer of the block. This fixes the mismatch between input and output channel counts and removes the need for 1×1 dimension-reduction convolutions, so the module achieves the minimum MAC and efficient GPU computation.
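A minimal PyTorch sketch of an OSA block, written for illustration only (it is not VoVNet's official code; conv_bn_relu is a made-up helper):

import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class OSABlock(nn.Module):
    # One-Shot Aggregation: a chain of 3x3 convs whose outputs (plus the block input)
    # are concatenated only once, at the end of the block
    def __init__(self, c_in, c_mid, c_out, n=5):
        super().__init__()
        self.layers = nn.ModuleList([
            conv_bn_relu(c_in if i == 0 else c_mid, c_mid, 3) for i in range(n)
        ])
        self.aggregate = conv_bn_relu(c_in + n * c_mid, c_out, 1)   # single aggregation

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return self.aggregate(torch.cat(feats, 1))

Because every intermediate 3×3 convolution keeps the same channel count c_mid, its input and output channels match, which is exactly the MAC-friendly property discussed above.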

CSPNet

CSPNet (Cross Stage Partial Network) addresses the heavy inference computation from the perspective of network architecture design; it can cut computation by roughly 20% while maintaining or even improving accuracy.
The authors attribute the excessive inference cost to duplicated gradient information during network optimization. CSPNet integrates the gradient changes into the feature map from start to end, reducing computation while preserving accuracy.
image.png
The basic idea of CSPNet is to split the input feature map into two parts along the channel dimension: one part goes through the usual dense block operations, while the other is passed directly to the final concat. The CSP bottleneck block used in YOLOv5 is structured as follows:
image.png
The input is split into two branches. One branch first passes through a CBL block, then several residual blocks, then one more convolution; the other branch goes through a single convolution. The outputs of the two branches are concatenated, passed through BN and LeakyReLU, and finally through another CBL block.
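A simplified PyTorch sketch of this two-branch structure (an illustration written for this post, not YOLOv5's actual source; ConvBnAct and Residual are made-up helper names):

import torch
import torch.nn as nn

class ConvBnAct(nn.Module):
    # Conv2d + BatchNorm2d + LeakyReLU, i.e. the "CBL" block mentioned above
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn   = nn.BatchNorm2d(c_out)
        self.act  = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Residual(nn.Module):
    # 1x1 conv followed by 3x3 conv, with a skip connection
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBnAct(c, c, 1)
        self.cv2 = ConvBnAct(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class CSPBlock(nn.Module):
    # Branch 1: CBL -> residual blocks -> conv; Branch 2: a single conv;
    # the branches are concatenated, passed through BN + LeakyReLU, then a final CBL
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_ = c_out // 2
        self.branch1 = nn.Sequential(
            ConvBnAct(c_in, c_, 1),
            *[Residual(c_) for _ in range(n)],
            nn.Conv2d(c_, c_, 1, bias=False),
        )
        self.branch2 = nn.Conv2d(c_in, c_, 1, bias=False)
        self.bn   = nn.BatchNorm2d(2 * c_)
        self.act  = nn.LeakyReLU(0.1, inplace=True)
        self.fuse = ConvBnAct(2 * c_, c_out, 1)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x)], 1)
        return self.fuse(self.act(self.bn(y)))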

ELAN

ELAN (Efficient Layer Aggregation Network) is, at the network level, a gradient-path design. Its main purpose is to address the problem that the convergence of deep models gradually deteriorates when model scaling is performed.
image.png

Backbone

YOLOv7's backbone feature extractor consists mainly of the following two modules:

  • ELAN: used to extract image features.
  • Transition block: used to downsample the feature map. Downsampling is normally done with either a 3×3 convolution of stride 2 or a stride-2 MaxPooling layer; YOLOv7 combines the two. As shown in the figure below, the transition module has two branches: the left branch is a stride-2 MaxPooling followed by a 1×1 convolution, and the right branch is a 1×1 convolution followed by a 3×3 convolution of stride 2. The outputs of the two branches are concatenated along the channel dimension.

image.png
The basic building block of YOLOv7's backbone is Conv2d + BatchNorm2d + SiLU (CBS), implemented as follows:

import torch
import torch.nn as nn

class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)

def autopad(k, p=None):
    # Compute the padding from the kernel size so the spatial size is preserved at stride 1
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

class Conv(nn.Module):
    # Conv2d + BatchNorm2d + SiLU
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=SiLU()):
        # c1/c2: in/out channels, k: kernel_size, s: stride, p: padding, g: groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn   = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act  = nn.LeakyReLU(0.01, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

Building on this CBS module, the ELAN module is constructed as follows:

class ELAN(nn.Module):
    def __init__(self, c1, c2, c3, n=4, e=1, ids=[0]) -> None:
        super(ELAN, self).__init__()
        c_  = int(c2 * e)

        self.ids = ids      # which of the tensors in [0, 1, 2, 3, 4, 5] are used for the concat
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = nn.ModuleList([
            Conv(c_ if i==0 else c2, c2, 3, 1) for i in range(n)
        ])

        self.cv4 = Conv(c_ * 2 + c2 * (len(ids) -2), c3, 1, 1)
    
    def forward(self, x):
        x_1 = self.cv1(x)
        x_2 = self.cv2(x)

        x_all = [x_1, x_2]
        for i in range(len(self.cv3)):
            x_2 = self.cv3[i](x_2)
            x_all.append(x_2)
        # self.ids = [-1, -3, -5, -6] <=> [5, 3, 1, 0]
        out = self.cv4(torch.cat([x_all[id] for id in self.ids], 1))
        return out

Next, the Transition Block is implemented as follows:

class MP(nn.Module):
    def __init__(self, k=2) -> None:
        super(MP, self).__init__()
        self.m = nn.MaxPool2d(kernel_size=k, stride=k)

    def forward(self, x):
        return self.m(x)
    
class Transition_Block(nn.Module):
    def __init__(self, c1, c2) -> None:
        super(Transition_Block, self).__init__()
        self.cv1 = Conv(c1, c2, 1, 1)
        self.cv2 = Conv(c1, c2, 1, 1)
        self.cv3 = Conv(c2, c2, 3, 2)

        self.mp = MP()
    
    def forward(self, x):
        # h,w,c1-> h/2, w/2,c1 ->h/2, w/2, c2
        x_1 = self.mp(x)
        x_1 = self.cv1(x_1)

        # h,w,c1 -> h, w, c2 -> h/2, w/2 c2
        x_2 = self.cv2(x)
        x_2 = self.cv3(x_2)

        # concat: (h/2, w/2, c2) + (h/2, w/2, c2) -> (h/2, w/2, 2*c2)
        return torch.cat([x_2, x_1], 1)

With ELAN and the Transition Block, YOLOv7's backbone can then be assembled as follows:

class Backbone(nn.Module):
    def __init__(self, transition_channels, block_channels, n, pretrained=False) -> None:
        super(Backbone, self).__init__()
        #-------------------------------------#
        #   Input image size: 640, 640, 3
        #-------------------------------------#

        self.ids = [-1, -3, -5, -6]
        # 640, 640, 3 -> 640, 640, 32 -> 320, 320, 64 -> 320, 320, 64
        self.stem = nn.Sequential(
            Conv(3, transition_channels, 3, 1),
            Conv(transition_channels, transition_channels * 2, 3, 2),
            Conv(transition_channels * 2, transition_channels * 2, 3, 1)
        )
        # 320, 320, 64 -> 160, 160, 128 -> 160, 160, 256
        self.dark2 = nn.Sequential(
            Conv(transition_channels * 2, transition_channels * 4, 3, 2),
            ELAN(transition_channels* 4, block_channels * 2, transition_channels * 8, n=n, ids=self.ids)
        )
        # 160, 160, 256 -> 80, 80, 256 -> 80, 80, 512
        self.dark3 = nn.Sequential(
            Transition_Block(transition_channels* 8, transition_channels * 4),
            ELAN(transition_channels*8, block_channels*4, transition_channels*16, n=n, ids=self.ids)
        )
        # 80, 80, 512 -> 40, 40, 512 -> 40, 40, 1024
        self.dark4 = nn.Sequential(
            Transition_Block(transition_channels * 16, transition_channels * 8),
            ELAN(transition_channels * 16, block_channels * 8, transition_channels * 32, n=n, ids=self.ids)
        )
        # 40, 40, 1024 -> 20, 20, 1024 -> 20, 20, 1024
        self.dark5 = nn.Sequential(
            Transition_Block(transition_channels * 32, transition_channels * 16),
            ELAN(transition_channels * 32, block_channels * 8, transition_channels * 32, n=n, ids=self.ids)
        )

        if pretrained:
            url = 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_backbone_weights.pth'
            checkpoint = torch.hub.load_state_dict_from_url(url=url, map_location="cpu", model_dir="./model_data")
            self.load_state_dict(checkpoint, strict=False)
            print("Load weights from " + url)
        
    def forward(self, x):
        x = self.stem(x)
        x = self.dark2(x)
        #---------------------------------------------------#
        #   dark3 outputs an 80, 80, 512 feature map: an effective feature layer
        #---------------------------------------------------#
        x = self.dark3(x)
        feat1 = x
        #---------------------------------------------------#
        #   dark4 outputs a 40, 40, 1024 feature map: an effective feature layer
        #---------------------------------------------------#
        x = self.dark4(x)
        feat2 = x
        #---------------------------------------------------#
        #   dark5 outputs a 20, 20, 1024 feature map: an effective feature layer
        #---------------------------------------------------#
        x = self.dark5(x)
        feat3 = x

        return feat1, feat2, feat3

The structure of YOLOv7's backbone is shown in the figure below:
image.png

Neck

In the neck, YOLOv7 fuses features from multiple scales; three feature layers are taken from the backbone. For a (640, 640, 3) input, their shapes are (80, 80, 512), (40, 40, 1024), and (20, 20, 1024).
The main steps of YOLOv7's feature fusion are as follows:

  1. The (20, 20, 1024) feature map is first processed by the SPPCSPC module, which enlarges YOLOv7's receptive field; the result is denoted P5. The structure of SPPCSPC is shown in the figure below;

image.png
The corresponding implementation is as follows:

class SPPCSPC(nn.Module):
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13)):
        super(SPPCSPC, self).__init__()
        c_ = int(2 * c2 * e)
        # left branch
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 1, 1)
        self.m   = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
        self.cv5 = Conv(4 * c_, c_, 1, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        # right branch
        self.cv2 = Conv(c1, c_, 1, 1)
        # fuse the concatenated branches
        self.cv7 = Conv(2 * c_, c2, 1, 1)

    def forward(self, x):
        x1 = self.cv4(self.cv3(self.cv1(x)))
        y1 = self.cv6(self.cv5(torch.cat([x1] + [m(x1) for m in self.m], 1)))
        y2 = self.cv2(x)
        return self.cv7(torch.cat((y1, y2), dim=1))
  2. P5 is passed through a 1×1 convolution to adjust its channels and then upsampled (UpSample); the result is concatenated with the (40, 40, 1024) feature map after one convolution, and an ELAN module extracts features to give P4 with shape (40, 40, 256);
  3. P4 is passed through a 1×1 convolution and upsampled; the result is concatenated with the (80, 80, 512) feature map after one convolution, and an ELAN module extracts features to give P3 (also called P3_out) with shape (80, 80, 128);
  4. P3_out is downsampled by a Transition Block, concatenated with P4, and passed through an ELAN module to give P4_out with shape (40, 40, 256);
  5. P4_out is downsampled by a Transition Block, concatenated with P5, and passed through an ELAN module to give P5_out with shape (20, 20, 512).

The PANet-style feature fusion module fuses feature layers of different scales, which benefits feature extraction. The overall structure is shown in the figure below:
image.png
The implementation is as follows:

class YoloBody(nn.Module):
    def __init__(self) -> None:
        super(YoloBody, self).__init__()
        #-----------------------------------#
        #      Define the YOLOv7 parameters
        #-----------------------------------#
        transition_channels = 32
        block_channels      = 32
        panet_channels      = 32
        n = 4                   # number of convolutions in the right branch of the ELAN module
        e = 2
        ids = [-1, -2, -3, -4, -5, -6]
        #-----------------------------------#
        #     Input image size: 640, 640, 3
        #-----------------------------------#

        #-----------------------------------#
        #     Backbone feature extraction
        #     Outputs three effective feature layers with shapes:
        #     80, 80, 512
        #     40, 40, 1024
        #     20, 20, 1024
        #-----------------------------------#
        self.backbone = Backbone(transition_channels=transition_channels, block_channels=block_channels, n=n)

        #-----------------------------------#
        #     Feature fusion network
        #-----------------------------------#
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")     # nearest-neighbor upsampling

        self.sppcspc  = SPPCSPC(transition_channels*32, transition_channels*16)
        self.conv_for_P5  = Conv(transition_channels*16, transition_channels*8)
        self.conv_for_feat2 = Conv(transition_channels*32, transition_channels*8)
        self.elan_for_concat1 = ELAN(transition_channels*16, panet_channels*4, transition_channels*8, e=e, n=n, ids=ids)

        self.conv_for_P4 = Conv(transition_channels*8, transition_channels*4)
        self.conv_for_feat1 = Conv(transition_channels*16, transition_channels*4)
        self.elan_for_concat2 = ELAN(transition_channels*8, panet_channels*2, transition_channels*4, e=e, n=n, ids=ids)

        self.down_sample1 = Transition_Block(transition_channels*4, transition_channels*4)
        self.elan_for_concat3 = ELAN(transition_channels*16, panet_channels*4, transition_channels*8, e=e, n=n, ids=ids)
        

        self.down_sample2 = Transition_Block(transition_channels*8, transition_channels*8)
        self.elan_for_concat4 = ELAN(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)
    
    def forward(self, x):
        #-----------------------------------#
        #     Backbone feature extraction
        #     Outputs three effective feature layers with shapes:
        #     80, 80, 512
        #     40, 40, 1024
        #     20, 20, 1024
        #-----------------------------------#
        feat1, feat2, feat3 = self.backbone(x)
        #-----------------------------------#
        #     Feature fusion network; outputs three feature maps:
        #     P3_out: 80, 80, 128
        #     P4_out: 40, 40, 256
        #     P5_out: 20, 20, 512
        #-----------------------------------#
        P5         = self.sppcspc(feat3)
        P5_conv    = self.conv_for_P5(P5)
        P5_upsample=self.upsample(P5_conv)
        P4         = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        P4         = self.elan_for_concat1(P4)

        P4_conv    = self.conv_for_P4(P4)
        P4_upsample= self.upsample(P4_conv)
        P3         = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        P3_out     = self.elan_for_concat2(P3)

        P3_downsample = self.down_sample1(P3_out)
        P4            = torch.cat([P3_downsample, P4], 1)
        P4_out        = self.elan_for_concat3(P4)

        P4_downsample = self.down_sample2(P4_out)
        P5            = torch.cat([P4_downsample, P5], 1)
        P5_out        = self.elan_for_concat4(P5)

        return P3_out, P4_out, P5_out

Head

The feature fusion module produces three enhanced feature maps with shapes (80, 80, 128), (40, 40, 256), and (20, 20, 512). These are passed to the YOLO head to obtain the predictions.
image.png
Unlike YOLOv5, which uses a plain convolution for prediction, YOLOv7 applies a RepConv block before that convolution. RepConv comes from RepVGG: during training it adds an identity branch and a 1×1 convolution branch, and at inference time re-parameterization folds the whole block into an ordinary 3×3 convolution, which makes deployment and acceleration easier.
The implementation is as follows:

class RepConv(nn.Module):
    def __init__(self, c1, c2, k=3, s=1, p=None, g=1, act=SiLU()) -> None:
        super(RepConv, self).__init__() 
        self.groups = g
        self.in_channels = c1
        self.out_channels = c2
        assert k == 3
        assert autopad(k, p) == 1
        padding_11 = autopad(k, p) - k // 2

        self.rbr_identity = (nn.BatchNorm2d(num_features=c1, eps=0.001, momentum=0.03)  if c2==c1 and s == 1 else None)
        self.rbr_dense    = nn.Sequential(
            nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False),
            nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03)
        )
        self.rbr_1x1      = nn.Sequential(
            nn.Conv2d(c1, c2, 1, s, padding_11, groups=g, bias=False),
            nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03)
        )

        self.act = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
    
    def forward(self, x):

        if self.rbr_identity is None:
            identity_out = 0
        else:
            identity_out = self.rbr_identity(x)
        return self.act(self.rbr_dense(x) + self.rbr_1x1(x) + identity_out)
        
class YoloBody(nn.Module):
    def __init__(self, anchor_masks, num_classes) -> None:
        super(YoloBody, self).__init__()
        #-----------------------------------#
        #      Define the YOLOv7 parameters
        #-----------------------------------#
        transition_channels = 32
        block_channels      = 32
        panet_channels      = 32
        n = 4                   # number of convolutions in the right branch of the ELAN module
        e = 2
        ids = [-1, -2, -3, -4, -5, -6]
        #-----------------------------------#
        #     Input image size: 640, 640, 3
        #-----------------------------------#

        #-----------------------------------#
        #     Backbone feature extraction
        #     Outputs three effective feature layers with shapes:
        #     80, 80, 512
        #     40, 40, 1024
        #     20, 20, 1024
        #-----------------------------------#
        self.backbone = Backbone(transition_channels=transition_channels, block_channels=block_channels, n=n)

        #-----------------------------------#
        #     Feature fusion network
        #-----------------------------------#
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")     # nearest-neighbor upsampling

        self.sppcspc  = SPPCSPC(transition_channels*32, transition_channels*16)
        self.conv_for_P5  = Conv(transition_channels*16, transition_channels*8)
        self.conv_for_feat2 = Conv(transition_channels*32, transition_channels*8)
        self.elan_for_concat1 = ELAN(transition_channels*16, panet_channels*4, transition_channels*8, e=e, n=n, ids=ids)

        self.conv_for_P4 = Conv(transition_channels*8, transition_channels*4)
        self.conv_for_feat1 = Conv(transition_channels*16, transition_channels*4)
        self.elan_for_concat2 = ELAN(transition_channels*8, panet_channels*2, transition_channels*4, e=e, n=n, ids=ids)

        self.down_sample1 = Transition_Block(transition_channels*4, transition_channels*4)
        self.elan_for_concat3 = ELAN(transition_channels*16, panet_channels*4, transition_channels*8, e=e, n=n, ids=ids)
        

        self.down_sample2 = Transition_Block(transition_channels*8, transition_channels*8)
        self.elan_for_concat4 = ELAN(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)

        #-----------------------------------#
        #     YOLO head prediction module
        #-----------------------------------#
        self.rep_conv1 = RepConv(transition_channels * 4, transition_channels * 8, 3, 1)
        self.rep_conv2 = RepConv(transition_channels * 8, transition_channels * 16, 3, 1)
        self.rep_conv3 = RepConv(transition_channels * 16, transition_channels * 32, 3, 1)

        self.yolo_head_P3 = nn.Conv2d(transition_channels * 8, len(anchor_masks[2]) * (5 + num_classes), 1)
        self.yolo_head_P4 = nn.Conv2d(transition_channels * 16, len(anchor_masks[1]) * (5 + num_classes), 1)
        self.yolo_head_P5 = nn.Conv2d(transition_channels * 32, len(anchor_masks[0]) * (5 + num_classes), 1)
    
    def forward(self, x):
        #-----------------------------------#
        #     Backbone feature extraction
        #     Outputs three effective feature layers with shapes:
        #     80, 80, 512
        #     40, 40, 1024
        #     20, 20, 1024
        #-----------------------------------#
        feat1, feat2, feat3 = self.backbone(x)
        #-----------------------------------#
        #     Feature fusion network; outputs three feature maps:
        #     P3_out: 80, 80, 128
        #     P4_out: 40, 40, 256
        #     P5_out: 20, 20, 512
        #-----------------------------------#
        P5         = self.sppcspc(feat3)
        P5_conv    = self.conv_for_P5(P5)
        P5_upsample=self.upsample(P5_conv)
        P4         = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        P4         = self.elan_for_concat1(P4)

        P4_conv    = self.conv_for_P4(P4)
        P4_upsample= self.upsample(P4_conv)
        P3         = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        P3_out     = self.elan_for_concat2(P3)

        P3_downsample = self.down_sample1(P3_out)
        P4            = torch.cat([P3_downsample, P4], 1)
        P4_out        = self.elan_for_concat3(P4)

        P4_downsample = self.down_sample2(P4_out)
        P5            = torch.cat([P4_downsample, P5], 1)
        P5_out        = self.elan_for_concat4(P5)

        #-----------------------------------#
        #     YOLO head prediction module
        #-----------------------------------#
        P3_out = self.rep_conv1(P3_out)
        P4_out = self.rep_conv2(P4_out)
        P5_out = self.rep_conv3(P5_out)
        #-----------------------------------#
        #     Third prediction feature layer
        #-----------------------------------#
        out3 = self.yolo_head_P3(P3_out)
        #-----------------------------------#
        #     Second prediction feature layer
        #-----------------------------------#
        out2 = self.yolo_head_P4(P4_out)
        #-----------------------------------#
        #     First prediction feature layer
        #-----------------------------------#
        out1 = self.yolo_head_P5(P5_out)
        return [out1, out2, out3]
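The RepConv class above is the training-time form. At deployment, each Conv + BN branch is folded into an equivalent convolution with a bias, the 1×1 kernel is zero-padded to 3×3, the identity branch is expressed as a 3×3 kernel, and the three results are summed into a single 3×3 convolution. A rough NumPy sketch of the folding arithmetic, for illustration only (the official repositories perform this folding on the trained module weights):

import numpy as np

def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-3):
    # Fold BatchNorm (gamma, beta, running mean/var) into the preceding conv:
    # y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    std = np.sqrt(var + eps)
    fused_kernel = kernel * (gamma / std).reshape(-1, 1, 1, 1)
    fused_bias   = beta - mean * gamma / std
    return fused_kernel, fused_bias

def pad_1x1_to_3x3(kernel_1x1):
    # Place the 1x1 kernel at the center of an all-zero 3x3 kernel
    return np.pad(kernel_1x1, ((0, 0), (0, 0), (1, 1), (1, 1)))

# The fused (kernel, bias) pairs of the 3x3 branch, the padded 1x1 branch and the
# identity branch (an identity kernel passed through its BN) are summed element-wise,
# giving the single 3x3 convolution used at inference time.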

Decoding and NMS

The YOLO head outputs predictions at three scales, with shapes (N, 80, 80, 255), (N, 40, 40, 255), and (N, 20, 20, 255). Here 255 = 3 × 85 is the prediction for each grid cell's 3 anchors, and 85 = 4 + 1 + 80 consists of the box offsets relative to the anchor, the objectness score, and the 80 class probabilities.

  1. Decoding

The first four values of each box prediction are offsets relative to the anchor and the grid cell, so they must be decoded before they correspond to the network input. The first two encode the box-center offset and the last two the width and height. The decoding formulas are:
$b_x = 2\sigma(t_x) - 0.5 + c_x$
$b_y = 2\sigma(t_y) - 0.5 + c_y$
$b_w = p_w(2\sigma(t_w))^2$
$b_h = p_h(2\sigma(t_h))^2$
Combining the anchors with these offsets yields the predicted boxes, as illustrated below.
image.png
The decoding is implemented as follows (the functions are excerpted from a decoder class, hence the self parameter; numpy as np and torch are assumed to be imported):

    def sigmoid(self, x):
        return 1 / ( 1 + np.exp(-x))
    
    def decode_box(self, predictions):
        outputs = list()
        for i, pred in enumerate(predictions):
            #----------------------------------------------#
            #     predictions contains 3 elements with shapes:
            #     batch_size, 255, 20, 20
            #     batch_size, 255, 40, 40
            #     batch_size, 255, 80, 80
            #----------------------------------------------#
            batch_size     = pred.size(0)
            feature_height = pred.size(2)
            feature_width  = pred.size(3)

            stride_h = self.input_shape[0] / feature_height
            stride_w = self.input_shape[1] / feature_width

            # The scaled_anchors obtained here are relative to the feature map
            scaled_anchors = [(w / stride_w, h / stride_h) for w, h in self.anchors[self.anchors_mask[i]]]

            #----------------------------------------------#
            #     Reshape the predictions to:
            #     batch_size, 3, 20, 20, 85
            #     batch_size, 3, 40, 40, 85
            #     batch_size, 3, 80, 80, 85
            #----------------------------------------------#
            pred = pred.view(batch_size, len(self.anchors_mask[i]), self.bbox_attrs, feature_height, feature_width)
            pred = pred.permute(0, 1, 3, 4, 2).contiguous().cpu().numpy()   # convert to numpy for the decoding below

            #----------------------------------------------#
            #     Box regression parameters
            #----------------------------------------------#
            x = self.sigmoid(pred[..., 0])
            y = self.sigmoid(pred[..., 1])
            w = self.sigmoid(pred[..., 2])
            h = self.sigmoid(pred[..., 3])

            # Objectness confidence: whether the cell contains an object
            box_conf = self.sigmoid(pred[..., 4])
            # Per-class probabilities
            cls_conf = self.sigmoid(pred[..., 5:])

            # Build the grid from the feature map size; each cell's reference point is its top-left corner
            grid_x = np.repeat(np.expand_dims(np.repeat(np.expand_dims(np.linspace(0, feature_width - 1, feature_width), 0), feature_height, axis=0), 0), batch_size * len(self.anchors_mask[i]), axis=0)
            grid_x = np.reshape(grid_x, np.shape(x))
            grid_y = np.repeat(np.expand_dims(np.repeat(np.expand_dims(np.linspace(0, feature_height - 1, feature_height), 0), feature_width, axis=0).T, 0), batch_size * len(self.anchors_mask[i]), axis=0)
            grid_y = np.reshape(grid_y, np.shape(y))

            #----------------------------------------------------------#
            #   Broadcast the anchor widths and heights to the grid layout
            #   batch_size, 3, 20, 20
            #----------------------------------------------------------#
            anchor_w = np.repeat(np.expand_dims(np.repeat(np.expand_dims(np.array(scaled_anchors)[:, 0], 0), batch_size, axis=0), -1), feature_height * feature_width, axis=-1)
            anchor_h = np.repeat(np.expand_dims(np.repeat(np.expand_dims(np.array(scaled_anchors)[:, 1], 0), batch_size, axis=0), -1), feature_height * feature_width, axis=-1)
            anchor_w = np.reshape(anchor_w, np.shape(w))
            anchor_h = np.reshape(anchor_h, np.shape(h))

            #---------------------------------------------#
            # Combine the anchors and the offsets to obtain the predicted boxes
            #---------------------------------------------#
            pred_boxes = np.zeros(pred[..., :4].shape)
            pred_boxes[..., 0] = x * 2. - 0.5 + grid_x
            pred_boxes[..., 1] = y * 2. - 0.5 + grid_y
            pred_boxes[..., 2] = (w * 2.)** 2 * anchor_w
            pred_boxes[..., 3] = (h * 2.) ** 2 * anchor_h
            #----------------------------------------------------------#
            #   Normalize the outputs to the 0-1 range
            #----------------------------------------------------------#
            _scale = np.array([feature_width, feature_height, feature_width, feature_height])
            #----------------------------------------------------------#
            #    The 3 outputs have shapes:
            #    batch_size, 3 * 20 * 20, 85
            #    batch_size, 3 * 40 * 40, 85
            #    batch_size, 3 * 80 * 80, 85
            #----------------------------------------------------------#
            output = np.concatenate([np.reshape(pred_boxes, (batch_size, -1, 4)) / _scale,
                                np.reshape(box_conf, (batch_size, -1, 1)), np.reshape(cls_conf, (batch_size, -1, self.num_classes))], -1)
            outputs.append(output)
        return outputs
  2. NMS

NMS (non-maximum suppression) searches for local maxima and suppresses the rest. In object detection, the procedure is as follows:

  1. Group all predicted boxes by class;
  2. For each class, sort the boxes by confidence in descending order;
  3. Add the highest-confidence box to the final output list and compute its IoU with all remaining boxes;
  4. Remove the boxes whose IoU with it exceeds the threshold;
  5. Repeat steps 3-4 until no boxes are left.

The implementation below is written as a method of the decoder class; it relies on a bbox_iou helper, a sketch of which is given after the code.
def non_max_suppression(self, prediction, num_classes, conf_thres=0.5, nms_thres=0.4):
    #---------------------------------------------------------#
    #      prediction has shape [batch_size, num_anchors, 85]
    #      num_anchors = 20*20*3 + 40*40*3 + 80*80*3
    #---------------------------------------------------------#

    #---------------------------------------------------------#
    #     Convert the boxes from (center_x, center_y, width, height)
    #     format to (left, top, right, bottom)
    #---------------------------------------------------------#
    box_corner = torch.zeros_like(prediction)
    box_corner[..., 0] = prediction[..., 0] - prediction[..., 2] / 2
    box_corner[..., 1] = prediction[..., 1] - prediction[..., 3] / 2
    box_corner[..., 2] = prediction[..., 0] + prediction[..., 2] / 2
    box_corner[..., 3] = prediction[..., 1] + prediction[..., 3] / 2
    prediction[..., :4] = box_corner[..., :4]
    outputs = [None for _ in range(len(prediction))]
    for i, image_pred in enumerate(prediction):
        #----------------------------------------#
        # Each box predicts 80 classes; take the maximum
        # class confidence and the corresponding class
        # class_conf [num_anchors, 1]   class confidence
        # class_pred [num_anchors, 1]   class index
        #----------------------------------------#
        class_conf, class_pred = torch.max(image_pred[:, 5:5+num_classes], 1, keepdim=True)
        
        #-------------------------------------#
        # First round of filtering by confidence:
        # objectness * class probability must exceed the threshold
        #-------------------------------------#
        conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()

        #-------------------------------------#
        # Keep only the predictions that pass the confidence filter
        #-------------------------------------#
        image_pred = image_pred[conf_mask]
        class_conf = class_conf[conf_mask]
        class_pred = class_pred[conf_mask]
        # Skip this image if no boxes survive the confidence filtering
        if not image_pred.size(0):
            continue

        #--------------------------------------------#
        #  detections [num_anchors, 7]
        #  the 7 values are [x1, y1, x2, y2, obj_conf, class_conf, class_pred]
        #--------------------------------------------#

        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)
        
        #-----------------------------------#
        # Classes present in the remaining detections
        #-----------------------------------#
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()
            detections = detections.cuda()
        
        for c in unique_labels:
            #-----------------------------------#
            # Detections belonging to class c
            #-----------------------------------#
            detection_class = detections[detections[:, -1] == c]

            #-------------------------#
            # Sort by confidence (objectness * class confidence)
            #-------------------------#
            _, conf_sort_index = torch.sort(detection_class[:, 4] * detection_class[:, 5], descending=True)
            detection_class = detection_class[conf_sort_index]

            # Non-maximum suppression within this class
            max_detections = list()
            while detection_class.size(0):
                # Take the box with the highest confidence, compute its IoU with the
                # remaining boxes, and drop those whose overlap exceeds nms_thres
                max_detections.append(detection_class[0].unsqueeze(0))
                if len(detection_class) == 1:
                    break
                ious = bbox_iou(max_detections[-1], detection_class[1:])
                detection_class = detection_class[1:][ious < nms_thres]

            max_detections = torch.cat(max_detections).data
            outputs[i] = max_detections if outputs[i] is None else torch.cat((outputs[i], max_detections))
    return outputs
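The bbox_iou helper used above is not reproduced in the original post; a minimal sketch for boxes in (x1, y1, x2, y2) format could be:

def bbox_iou(box1, box2, eps=1e-7):
    # box1: [1, >=4], box2: [N, >=4]; returns the IoU of box1 with each row of box2
    inter_x1 = torch.max(box1[:, 0], box2[:, 0])
    inter_y1 = torch.max(box1[:, 1], box2[:, 1])
    inter_x2 = torch.min(box1[:, 2], box2[:, 2])
    inter_y2 = torch.min(box1[:, 3], box2[:, 3])
    inter    = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)

    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    return inter / (area1 + area2 - inter + eps)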

Model Inference

  1. Image preprocessing

Before inference, the input image goes through the following preprocessing steps (cv2 and numpy are used; cvtColor and resize_image are helper functions from the code this post is based on):
image.png

import cv2
import numpy as np

image_name = "example.jpg"
image      = cv2.imread(image_name)

#---------------------------------#
# Convert the image to RGB
#---------------------------------#
image      = cvtColor(image)

#---------------------------------#
# Resize while keeping the aspect ratio, padding with gray bars
#---------------------------------#
image_data = resize_image(image, (input_width, input_height))

#---------------------------------#
# Normalization
#---------------------------------#
image_data = np.array(image_data, dtype=np.float32)
image_data = image_data / 255

#--------------------------------#
# Convert HWC to NCHW
#--------------------------------#
image_data = np.expand_dims(np.transpose(image_data, (2, 0, 1)), 0)
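cvtColor and resize_image are not shown in the original post. As an illustration, a letterbox-style resize that keeps the aspect ratio and pads the rest with gray might look like this (assuming a PIL image; the same logic can be written with cv2):

from PIL import Image

def resize_image(image, size):
    # Resize while preserving the aspect ratio; pad the remainder with gray (128, 128, 128)
    iw, ih = image.size
    w, h   = size
    scale  = min(w / iw, h / ih)
    nw, nh = int(iw * scale), int(ih * scale)

    image     = image.resize((nw, nh), Image.BICUBIC)
    new_image = Image.new('RGB', size, (128, 128, 128))
    new_image.paste(image, ((w - nw) // 2, (h - nh) // 2))
    return new_image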
  2. Network inference

The preprocessed image is fed through the model for a forward pass, and the output then goes through decoding and NMS to obtain the predicted boxes.

  3. Rescaling the predictions

The boxes produced by the previous step are relative to the preprocessed image, i.e. 640×640 with gray padding bars. This step locates the top-left corner of the valid image region and rescales the boxes back onto the original image.
image.png

# ---------------------------------------#
# Rescale boxes from the network-input image back to the original image
# img1_shape: network input size
# boxes: boxes in the network-input image
# img0_shape: original image size
# ---------------------------------------#
def scale_boxes(img1_shape, boxes, img0_shape, ratio_pad=None):
    # Rescale boxes (xyxy) from img1_shape to img0_shape
    if ratio_pad is None:  # calculate from img0_shape
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])  # gain  = old / new
        pad = (img1_shape[1] - img0_shape[1] * gain) / 2, (img1_shape[0] - img0_shape[0] * gain) / 2  # wh padding
    else:
        gain = ratio_pad[0][0]
        pad = ratio_pad[1]

    boxes[..., [0, 2]] -= pad[0]  # x padding
    boxes[..., [1, 3]] -= pad[1]  # y padding
    boxes[..., :4] /= gain
    clip_boxes(boxes, img0_shape)
    return boxes
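clip_boxes is not shown in the original post; a minimal sketch that clamps the boxes to the image boundaries (working for numpy arrays as well as recent PyTorch tensors) might be:

def clip_boxes(boxes, shape):
    # Clamp (x1, y1, x2, y2) boxes to the image; shape is (height, width)
    boxes[..., [0, 2]] = boxes[..., [0, 2]].clip(0, shape[1])  # x1, x2
    boxes[..., [1, 3]] = boxes[..., [1, 3]].clip(0, shape[0])  # y1, y2
    return boxes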
  4. Rendering the results

Rendering means drawing the boxes on the original image and then saving or displaying it. The snippet below is an excerpt from a drawing helper class, so self.im and self.lw refer to the image and line width stored on that object.

p1, p2 = (int(box[0]), int(box[1])), (int(box[2]), int(box[3]))
cv2.rectangle(self.im, p1, p2, color, thickness=self.lw, lineType=cv2.LINE_AA)
if label:
    tf = max(self.lw - 1, 1)  # font thickness
    w, h = cv2.getTextSize(label, 0, fontScale=self.lw / 3, thickness=tf)[0]  # text width, height
    outside = p1[1] - h >= 3
    p2 = p1[0] + w, p1[1] - h - 3 if outside else p1[1] + h + 3
    cv2.rectangle(self.im, p1, p2, color, -1, cv2.LINE_AA)  # filled
    cv2.putText(self.im,
                label, (p1[0], p1[1] - 2 if outside else p1[1] + h + 2),
                0,
                self.lw / 3,
                txt_color,
                thickness=tf,
                lineType=cv2.LINE_AA)

The result is shown below:
zidane.jpg


Reposted from blog.csdn.net/hello_dear_you/article/details/129646502