Smart target detection - Pytorch builds YoloV7-OBB rotating target detection platform

Smart target detection - Pytorch builds [YoloV7-OBB] rotating target detection platform

study preface

Based on the YoloV7-Pytorch source code of the B-lead source, a rotating target detection version of Yolov7 was developed.

Source code download

https://github.com/Egrt/yolov7-obb
You can order a star if you like it.

YoloV7-OBB improved part (incomplete)

1. The main part: The innovative multi-branch stacking structure is used for feature extraction. Compared with the previous Yolo, the jump connection structure of the model is more dense. Using an innovative downsampling structure, Maxpooling and features with a step size of 2x2 are used for parallel extraction and compression.

2. Enhanced feature extraction part: Same as the main part, the enhanced feature extraction part also uses a multi-input stacking structure for feature extraction, and uses Maxpooling and features with a step size of 2x2 for parallel downsampling.

3. Special SPP structure: The SPP with CSP mechanism is used to expand the receptive field, and the CSP structure is introduced into the SPP structure. This module has a large residual edge to assist optimization and feature extraction.

4. Adaptive multi-positive sample matching: In the Yolo series before YoloV5, each real frame corresponds to a positive sample during training, that is, during training, each real frame is only predicted by one prior frame. In YoloV7, in order to speed up the training efficiency of the model, the number of positive samples is increased. During training, each real frame can be predicted by multiple prior frames. In addition, for each real frame, the iou and type are calculated based on the predicted frame adjusted by the prior frame to obtain the cost, and then find the most suitable prior frame for the real frame.

5. Drawing on the structure of RepVGG, introduce RepConv in a specific part of the network, after fuse, reduce the parameter amount of the network to ensure the network x

6. The auxiliary branch is used to assist the convergence, but it is not used in the smaller models YoloV7 and YoloV7-X.

The above is not all the improvements, there are some other improvements, here are just some of the improvements that I am more interested in and are very effective.

YoloV7-OBB implementation ideas

1. Overall structure analysis


Before learning YoloV7-OBB, we need to have a certain understanding of the work done by YoloV7-OBB, which will help us understand the details of the network later. YoloV7-OBB is not much different from the previous Yolo in terms of prediction methods. Still divided into three parts.

They are Backbone, FPN and Yolo Head respectively .

Backbone is the backbone feature extraction network of YoloV7-OBB . The input image will first be feature extracted in the backbone network . The extracted features can be called the feature layer, which is the feature set of the input image . In the backbone part, we obtained three feature layers for the next step of network construction. I call these three feature layers effective feature layers .

FPN is a strengthened feature extraction network of YoloV7-OBB . The three effective feature layers obtained in the backbone part will perform feature fusion in this part. The purpose of feature fusion is to combine feature information of different scales. In the FPN part, the obtained effective feature layers are used to continue to extract features. In YoloV7, the Panet structure is still used. We not only upsample the features to achieve feature fusion, but also downsample the features again to achieve feature fusion.

Yolo Head is the classifier and regressor of YoloV7-OBB . Through Backbone and FPN, we can already obtain three enhanced effective feature layers. Each feature layer has width, height and number of channels. At this time, we can regard the feature map as a collection of feature points one after another . There are three prior frames on each feature point, and each prior frame has the number of channels. feature . What Yolo Head actually does is to judge the feature points and judge whether there is an object corresponding to the prior frame on the feature points . Like the previous version of Yolo, the decoupling head used by YoloV7-OBB is together, that is, classification and regression are implemented in a 1X1 convolution.

Therefore, the work done by the entire YoloV7-OBB network is feature extraction-feature enhancement-predicting the object situation corresponding to the prior frame .

2. Analysis of network structure

1. Introduction to backbone network Backbone

The backbone feature extraction network used by YoloV7-OBB has two important features:
1. It uses a multi-branch stacking module . This module is not named in the paper, but I think this name is very appropriate after analyzing the source code. In this blog post, The multi-drop stacking module is shown in the figure.
After reading this picture, you should understand why I call this module a multi-branch stacking module, because in this module, the input of the final stacking module contains multiple branches, the first on the left is a convolution normalization activation function, and the second on the left is a Convolution normalized activation function, the second from the right is three convolutional normalized activation functions, and the first from the right is five convolutional normalized activation functions.
After the four feature layers are stacked, a convolution normalization activation function will be performed again for feature integration.

    class Multi_Concat_Block(nn.Module):
        def __init__(self, c1, c2, c3, n=4, e=1, ids=[0]):
            super(Multi_Concat_Block, self).__init__()
            c_ = int(c2 * e)
            
            self.ids = ids
            self.cv1 = Conv(c1, c_, 1, 1)
            self.cv2 = Conv(c1, c_, 1, 1)
            self.cv3 = nn.ModuleList(
                [Conv(c_ if i ==0 else c2, c2, 3, 1) for i in range(n)]
            )
            self.cv4 = Conv(c_ * 2 + c2 * (len(ids) - 2), c3, 1, 1)
    
        def forward(self, x):
            x_1 = self.cv1(x)
            x_2 = self.cv2(x)
            
            x_all = [x_1, x_2]
            for i in range(len(self.cv3)):
                x_2 = self.cv3[i](x_2)
                x_all.append(x_2)
                
            out = self.cv4(torch.cat([x_all[id] for id in self.ids], 1))
            return out

So many stacks actually correspond to a denser residual structure. The residual network is characterized by being easy to optimize , and can improve accuracy by increasing considerable depth . Its internal residual block uses skip connections, which alleviates the problem of gradient disappearance caused by increasing depth in deep neural networks.

2. Use the innovative transition module Transition_Block for downsampling. In a convolutional neural network, a common transition module for downsampling is a convolution with a kernel size of 3x3 and a step size of 2x2 or a step size of 2x2 max pooling . In YoloV7, the author has assembled two kinds of transition modules, and one transition module has two branches, as shown in the figure. The left branch is a maximum pooling with a step size of 2x2 + a 1x1 convolution, and the right branch is a 1x1 convolution + a convolution with a convolution kernel size of 3x3 and a step size of 2x2. The results of the two branches are output will stack.

    class MP(nn.Module):
        def __init__(self, k=2):
            super(MP, self).__init__()
            self.m = nn.MaxPool2d(kernel_size=k, stride=k)
    
        def forward(self, x):
            return self.m(x)
        
    class Transition_Block(nn.Module):
        def __init__(self, c1, c2):
            super(Transition_Block, self).__init__()
            self.cv1 = Conv(c1, c2, 1, 1)
            self.cv2 = Conv(c1, c2, 1, 1)
            self.cv3 = Conv(c2, c2, 3, 2)
            
            self.mp  = MP()
    
        def forward(self, x):
            x_1 = self.mp(x)
            x_1 = self.cv1(x_1)
            
            x_2 = self.cv2(x)
            x_2 = self.cv3(x_2)
            
            return torch.cat([x_2, x_1], 1)

The entire backbone implementation code is:

import torch
import torch.nn as nn


def autopad(k, p=None):
   if p is None:
       p = k // 2 if isinstance(k, int) else [x // 2 for x in k] 
   return p

class SiLU(nn.Module):  
   @staticmethod
   def forward(x):
       return x * torch.sigmoid(x)
   
class Conv(nn.Module):
   def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=SiLU()):  # ch_in, ch_out, kernel, stride, padding, groups
       super(Conv, self).__init__()
       self.conv   = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
       self.bn     = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
       self.act    = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

   def forward(self, x):
       return self.act(self.bn(self.conv(x)))

   def fuseforward(self, x):
       return self.act(self.conv(x))
   
class Multi_Concat_Block(nn.Module):
   def __init__(self, c1, c2, c3, n=4, e=1, ids=[0]):
       super(Multi_Concat_Block, self).__init__()
       c_ = int(c2 * e)
       
       self.ids = ids
       self.cv1 = Conv(c1, c_, 1, 1)
       self.cv2 = Conv(c1, c_, 1, 1)
       self.cv3 = nn.ModuleList(
           [Conv(c_ if i ==0 else c2, c2, 3, 1) for i in range(n)]
       )
       self.cv4 = Conv(c_ * 2 + c2 * (len(ids) - 2), c3, 1, 1)

   def forward(self, x):
       x_1 = self.cv1(x)
       x_2 = self.cv2(x)
       
       x_all = [x_1, x_2]
       for i in range(len(self.cv3)):
           x_2 = self.cv3[i](x_2)
           x_all.append(x_2)
           
       out = self.cv4(torch.cat([x_all[id] for id in self.ids], 1))
       return out

class MP(nn.Module):
   def __init__(self, k=2):
       super(MP, self).__init__()
       self.m = nn.MaxPool2d(kernel_size=k, stride=k)

   def forward(self, x):
       return self.m(x)
   
class Transition_Block(nn.Module):
   def __init__(self, c1, c2):
       super(Transition_Block, self).__init__()
       self.cv1 = Conv(c1, c2, 1, 1)
       self.cv2 = Conv(c1, c2, 1, 1)
       self.cv3 = Conv(c2, c2, 3, 2)
       
       self.mp  = MP()

   def forward(self, x):
       x_1 = self.mp(x)
       x_1 = self.cv1(x_1)
       
       x_2 = self.cv2(x)
       x_2 = self.cv3(x_2)
       
       return torch.cat([x_2, x_1], 1)
   
class Backbone(nn.Module):
   def __init__(self, transition_channels, block_channels, n, phi, pretrained=False):
       super().__init__()
       #-----------------------------------------------#
       #   输入图片是640, 640, 3
       #-----------------------------------------------#
       ids = {
    
    
           'l' : [-1, -3, -5, -6],
           'x' : [-1, -3, -5, -7, -8], 
       }[phi]
       self.stem = nn.Sequential(
           Conv(3, transition_channels, 3, 1),
           Conv(transition_channels, transition_channels * 2, 3, 2),
           Conv(transition_channels * 2, transition_channels * 2, 3, 1),
       )
       self.dark2 = nn.Sequential(
           Conv(transition_channels * 2, transition_channels * 4, 3, 2),
           Multi_Concat_Block(transition_channels * 4, block_channels * 2, transition_channels * 8, n=n, ids=ids),
       )
       self.dark3 = nn.Sequential(
           Transition_Block(transition_channels * 8, transition_channels * 4),
           Multi_Concat_Block(transition_channels * 8, block_channels * 4, transition_channels * 16, n=n, ids=ids),
       )
       self.dark4 = nn.Sequential(
           Transition_Block(transition_channels * 16, transition_channels * 8),
           Multi_Concat_Block(transition_channels * 16, block_channels * 8, transition_channels * 32, n=n, ids=ids),
       )
       self.dark5 = nn.Sequential(
           Transition_Block(transition_channels * 32, transition_channels * 16),
           Multi_Concat_Block(transition_channels * 32, block_channels * 8, transition_channels * 32, n=n, ids=ids),
       )
       
       if pretrained:
           url = {
    
    
               "l" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_backbone_weights.pth',
               "x" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_x_backbone_weights.pth',
           }[phi]
           checkpoint = torch.hub.load_state_dict_from_url(url=url, map_location="cpu", model_dir="./model_data")
           self.load_state_dict(checkpoint, strict=False)
           print("Load weights from " + url.split('/')[-1])

   def forward(self, x):
       x = self.stem(x)
       x = self.dark2(x)
       #-----------------------------------------------#
       #   dark3的输出为80, 80, 512,是一个有效特征层
       #-----------------------------------------------#
       x = self.dark3(x)
       feat1 = x
       #-----------------------------------------------#
       #   dark4的输出为40, 40, 1024,是一个有效特征层
       #-----------------------------------------------#
       x = self.dark4(x)
       feat2 = x
       #-----------------------------------------------#
       #   dark5的输出为20, 20, 1024,是一个有效特征层
       #-----------------------------------------------#
       x = self.dark5(x)
       feat3 = x
       return feat1, feat2, feat3

2. Construct FPN feature pyramid for enhanced feature extraction


In the feature utilization part, YoloV7-OBB extracts multiple feature layers for target detection , and extracts a total of three feature layers .
The three feature layers are located in different positions of the main body, respectively located in the middle layer, the middle and lower layers , and the bottom layer. When the input is (640,640,3), the shapes of the three feature layers are feat1=(80,80,512), feat2= (40,40,1024), feat3=(20,20,1024).

After obtaining three effective feature layers, we use these three effective feature layers to construct the FPN layer. The construction method is (in this blog post, the SPPCSPC structure is attributed to FPN):

  1. The feature layer of feat3=(20,20,1024) first uses SPPCSPC for feature extraction. This structure can improve the receptive field of YoloV7 and obtain P5.
  2. First perform a 1X1 convolution adjustment channel on P5, then perform upsampling UmSampling2d and combine it with the feat2=(40,40,1024) convolutional feature layer , and then use Multi_Concat_Block for feature extraction to obtain P4. At this time The obtained feature layer is (40,40,256).
  3. First perform a 1X1 convolution adjustment channel on P4, then perform upsampling UmSampling2d and combine it with feat1=(80,80,512) to perform a convolution feature layer , and then use Multi_Concat_Block for feature extraction to obtain P3_out, obtained at this time The feature layer is (80,80,128).
  4. The feature layer of P3_out=(80,80,128) performs a Transition_Block convolution for downsampling, after downsampling, stacks with P4 , and then uses Multi_Concat_Block for feature extraction P4_out, the feature layer obtained at this time is (40,40,256).
  5. The feature layer of P4_out=(40,40,256) performs a Transition_Block convolution for downsampling, and stacks it with P5 after downsampling , and then uses Multi_Concat_Block for feature extraction P5_out. The feature layer obtained at this time is (20,20,512).

The feature pyramid can perform feature fusion of feature layers of different shapes , which is conducive to extracting better features .

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi, pretrained=False):
        super(YoloBody, self).__init__()
        #-----------------------------------------------#
        #   定义了不同yolov7版本的参数
        #-----------------------------------------------#
        transition_channels = {
    
    'l' : 32, 'x' : 40}[phi]
        block_channels      = 32
        panet_channels      = {
    
    'l' : 32, 'x' : 64}[phi]
        e       = {
    
    'l' : 2, 'x' : 1}[phi]
        n       = {
    
    'l' : 4, 'x' : 6}[phi]
        ids     = {
    
    'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]
        conv    = {
    
    'l' : RepConv, 'x' : Conv}[phi]
        #-----------------------------------------------#
        #   输入图片是640, 640, 3
        #-----------------------------------------------#

        #---------------------------------------------------#   
        #   生成主干模型
        #   获得三个有效特征层,他们的shape分别是:
        #   80, 80, 512
        #   40, 40, 1024
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone   = Backbone(transition_channels, block_channels, n, phi, pretrained=pretrained)

        #------------------------加强特征提取网络------------------------# 
        self.upsample   = nn.Upsample(scale_factor=2, mode="nearest")

        # 20, 20, 1024 => 20, 20, 512
        self.sppcspc                = SPPCSPC(transition_channels * 32, transition_channels * 16)
        # 20, 20, 512 => 20, 20, 256 => 40, 40, 256
        self.conv_for_P5            = Conv(transition_channels * 16, transition_channels * 8)
        # 40, 40, 1024 => 40, 40, 256
        self.conv_for_feat2         = Conv(transition_channels * 32, transition_channels * 8)
        # 40, 40, 512 => 40, 40, 256
        self.conv3_for_upsample1    = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        # 40, 40, 256 => 40, 40, 128 => 80, 80, 128
        self.conv_for_P4            = Conv(transition_channels * 8, transition_channels * 4)
        # 80, 80, 512 => 80, 80, 128
        self.conv_for_feat1         = Conv(transition_channels * 16, transition_channels * 4)
        # 80, 80, 256 => 80, 80, 128
        self.conv3_for_upsample2    = Multi_Concat_Block(transition_channels * 8, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids)

        # 80, 80, 128 => 40, 40, 256
        self.down_sample1           = Transition_Block(transition_channels * 4, transition_channels * 4)
        # 40, 40, 512 => 40, 40, 256
        self.conv3_for_downsample1  = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        # 40, 40, 256 => 20, 20, 512
        self.down_sample2           = Transition_Block(transition_channels * 8, transition_channels * 8)
        # 20, 20, 1024 => 20, 20, 512
        self.conv3_for_downsample2  = Multi_Concat_Block(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)
        #------------------------加强特征提取网络------------------------# 

        # 80, 80, 128 => 80, 80, 256
        self.rep_conv_1 = conv(transition_channels * 4, transition_channels * 8, 3, 1)
        # 40, 40, 256 => 40, 40, 512
        self.rep_conv_2 = conv(transition_channels * 8, transition_channels * 16, 3, 1)
        # 20, 20, 512 => 20, 20, 1024
        self.rep_conv_3 = conv(transition_channels * 16, transition_channels * 32, 3, 1)

        # 4 + 1 + num_classes
        # 80, 80, 256 => 80, 80, 3 * 25 (4 + 1 + 20) & 85 (4 + 1 + 80)
        self.yolo_head_P3 = nn.Conv2d(transition_channels * 8, len(anchors_mask[2]) * (5 + 1 + num_classes), 1)
        # 40, 40, 512 => 40, 40, 3 * 25 & 85
        self.yolo_head_P4 = nn.Conv2d(transition_channels * 16, len(anchors_mask[1]) * (5 + 1 + num_classes), 1)
        # 20, 20, 512 => 20, 20, 3 * 25 & 85
        self.yolo_head_P5 = nn.Conv2d(transition_channels * 32, len(anchors_mask[0]) * (5 + 1 + num_classes), 1)

    def fuse(self):
        print('Fusing layers... ')
        for m in self.modules():
            if isinstance(m, RepConv):
                m.fuse_repvgg_block()
            elif type(m) is Conv and hasattr(m, 'bn'):
                m.conv = fuse_conv_and_bn(m.conv, m.bn)
                delattr(m, 'bn')
                m.forward = m.fuseforward
        return self
    
    def forward(self, x):
        #  backbone
        feat1, feat2, feat3 = self.backbone.forward(x)
        
        #------------------------加强特征提取网络------------------------# 
        # 20, 20, 1024 => 20, 20, 512
        P5          = self.sppcspc(feat3)
        # 20, 20, 512 => 20, 20, 256
        P5_conv     = self.conv_for_P5(P5)
        # 20, 20, 256 => 40, 40, 256
        P5_upsample = self.upsample(P5_conv)
        # 40, 40, 256 cat 40, 40, 256 => 40, 40, 512
        P4          = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        # 40, 40, 512 => 40, 40, 256
        P4          = self.conv3_for_upsample1(P4)

        # 40, 40, 256 => 40, 40, 128
        P4_conv     = self.conv_for_P4(P4)
        # 40, 40, 128 => 80, 80, 128
        P4_upsample = self.upsample(P4_conv)
        # 80, 80, 128 cat 80, 80, 128 => 80, 80, 256
        P3          = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        # 80, 80, 256 => 80, 80, 128
        P3          = self.conv3_for_upsample2(P3)

        # 80, 80, 128 => 40, 40, 256
        P3_downsample = self.down_sample1(P3)
        # 40, 40, 256 cat 40, 40, 256 => 40, 40, 512
        P4 = torch.cat([P3_downsample, P4], 1)
        # 40, 40, 512 => 40, 40, 256
        P4 = self.conv3_for_downsample1(P4)

        # 40, 40, 256 => 20, 20, 512
        P4_downsample = self.down_sample2(P4)
        # 20, 20, 512 cat 20, 20, 512 => 20, 20, 1024
        P5 = torch.cat([P4_downsample, P5], 1)
        # 20, 20, 1024 => 20, 20, 512
        P5 = self.conv3_for_downsample2(P5)
        #------------------------加强特征提取网络------------------------# 
        # P3 80, 80, 128 
        # P4 40, 40, 256
        # P5 20, 20, 512
        
        P3 = self.rep_conv_1(P3)
        P4 = self.rep_conv_2(P4)
        P5 = self.rep_conv_3(P5)
        #---------------------------------------------------#
        #   第三个特征层
        #   y3=(batch_size, 78, 80, 80)
        #---------------------------------------------------#
        out2 = self.yolo_head_P3(P3)
        #---------------------------------------------------#
        #   第二个特征层
        #   y2=(batch_size, 78, 40, 40)
        #---------------------------------------------------#
        out1 = self.yolo_head_P4(P4)
        #---------------------------------------------------#
        #   第一个特征层
        #   y1=(batch_size, 78, 20, 20)
        #---------------------------------------------------#
        out0 = self.yolo_head_P5(P5)

        return [out0, out1, out2]

3. Use Yolo Head to obtain prediction results


Using the FPN feature pyramid, we can obtain three enhanced features. The shapes of these three enhanced features are (20,20,512), (40,40,256), (80,80,128), and then we use the feature layers of these three shapes Pass in Yolo Head to get the prediction result.

Different from the previous Yolo series, YoloV7 uses a RepConv structure before Yolo Head. The idea of ​​this RepConv is taken from RepVGG. The basic idea is to introduce a special residual structure to assist training during training. This residual structure is After a unique design, the complex residual structure can be equivalent to an ordinary 3x3 convolution in the actual prediction. At this time, the complexity of the network is reduced, but the prediction performance of the network is not reduced.

For each feature layer, we can use a convolution to adjust the number of channels. The final number of channels is related to the number of types that need to be distinguished. In YoloV7, there are 3 prior frames for each feature point on each feature layer .
In the modification of the network model of Yolov7-OBB, it is mainly to increase the output of Yolo-Head to the dimension of the rotation angle theta (radian). If the
voc training set is used, there are 20 classes, and the final dimension should be 78 = 3x26 , the shape of the three feature layers is ( 20,20,78 ), ( 40,40,78 ), ( 80,80,78 ). If it is a rectangular box prediction, the final dimension should be 75 = 3 x25.
The last 78 can be split into three 26, corresponding to the 26 parameters of the three prior frames, and 26 can be split into 5+1+20.
The first 5 parameters are used to judge the regression parameters (xc, yc, w, h, theta) of each feature point, and the prediction frame can be obtained after the regression parameters are adjusted; the 6th parameter is used to judge whether each feature point contains an object
;
The last 20 parameters are used to judge the type of object contained in each feature point.

If you use the coco training set, the class is 80, the final dimension should be 258 = 3x86, the shape of the three feature layers is ( 20,20,258 ), ( 40,40,258 ), ( 80,80,258 )
the last 258 It can be split into three 86, corresponding to the 86 parameters of the three prior frames, and 86 can be split into 5+1+80.
The first 5 parameters are used to judge the regression parameters (xc, yc, w, h, theta) of each feature point, and the prediction frame can be obtained after the regression parameters are adjusted; the 6th parameter is used to judge whether each feature point contains an object
;
The last 80 parameters are used to judge the type of object contained in each feature point.

The implementation code is as follows:

import numpy as np
import torch
import torch.nn as nn

from nets.backbone import Backbone, Multi_Concat_Block, Conv, SiLU, Transition_Block, autopad


class SPPCSPC(nn.Module):
    # CSP https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13)):
        super(SPPCSPC, self).__init__()
        c_ = int(2 * c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
        self.cv5 = Conv(4 * c_, c_, 1, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        # 输出通道数为c2
        self.cv7 = Conv(2 * c_, c2, 1, 1)

    def forward(self, x):
        x1 = self.cv4(self.cv3(self.cv1(x)))
        y1 = self.cv6(self.cv5(torch.cat([x1] + [m(x1) for m in self.m], 1)))
        y2 = self.cv2(x)
        return self.cv7(torch.cat((y1, y2), dim=1))

class RepConv(nn.Module):
    # Represented convolution
    # https://arxiv.org/abs/2101.03697
    def __init__(self, c1, c2, k=3, s=1, p=None, g=1, act=SiLU(), deploy=False):
        super(RepConv, self).__init__()
        self.deploy         = deploy
        self.groups         = g
        self.in_channels    = c1
        self.out_channels   = c2
        
        assert k == 3
        assert autopad(k, p) == 1

        padding_11  = autopad(k, p) - k // 2
        self.act    = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

        if deploy:
            self.rbr_reparam    = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=True)
        else:
            self.rbr_identity   = (nn.BatchNorm2d(num_features=c1, eps=0.001, momentum=0.03) if c2 == c1 and s == 1 else None)
            self.rbr_dense      = nn.Sequential(
                nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )
            self.rbr_1x1        = nn.Sequential(
                nn.Conv2d( c1, c2, 1, s, padding_11, groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )

    def forward(self, inputs):
        if hasattr(self, "rbr_reparam"):
            return self.act(self.rbr_reparam(inputs))
        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)
        return self.act(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)
    
    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3  = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1  = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid    = self._fuse_bn_tensor(self.rbr_identity)
        return (
            kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid,
            bias3x3 + bias1x1 + biasid,
        )

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel      = branch[0].weight
            running_mean = branch[1].running_mean
            running_var = branch[1].running_var
            gamma       = branch[1].weight
            beta        = branch[1].bias
            eps         = branch[1].eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, "id_tensor"):
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros(
                    (self.in_channels, input_dim, 3, 3), dtype=np.float32
                )
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
            kernel      = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma       = branch.weight
            beta        = branch.bias
            eps         = branch.eps
        std = (running_var + eps).sqrt()
        t   = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def repvgg_convert(self):
        kernel, bias = self.get_equivalent_kernel_bias()
        return (
            kernel.detach().cpu().numpy(),
            bias.detach().cpu().numpy(),
        )

    def fuse_conv_bn(self, conv, bn):
        std     = (bn.running_var + bn.eps).sqrt()
        bias    = bn.bias - bn.running_mean * bn.weight / std

        t       = (bn.weight / std).reshape(-1, 1, 1, 1)
        weights = conv.weight * t

        bn      = nn.Identity()
        conv    = nn.Conv2d(in_channels = conv.in_channels,
                              out_channels = conv.out_channels,
                              kernel_size = conv.kernel_size,
                              stride=conv.stride,
                              padding = conv.padding,
                              dilation = conv.dilation,
                              groups = conv.groups,
                              bias = True,
                              padding_mode = conv.padding_mode)

        conv.weight = torch.nn.Parameter(weights)
        conv.bias   = torch.nn.Parameter(bias)
        return conv

    def fuse_repvgg_block(self):    
        if self.deploy:
            return
        print(f"RepConv.fuse_repvgg_block")
        self.rbr_dense  = self.fuse_conv_bn(self.rbr_dense[0], self.rbr_dense[1])
        
        self.rbr_1x1    = self.fuse_conv_bn(self.rbr_1x1[0], self.rbr_1x1[1])
        rbr_1x1_bias    = self.rbr_1x1.bias
        weight_1x1_expanded = torch.nn.functional.pad(self.rbr_1x1.weight, [1, 1, 1, 1])
        
        # Fuse self.rbr_identity
        if (isinstance(self.rbr_identity, nn.BatchNorm2d) or isinstance(self.rbr_identity, nn.modules.batchnorm.SyncBatchNorm)):
            identity_conv_1x1 = nn.Conv2d(
                    in_channels=self.in_channels,
                    out_channels=self.out_channels,
                    kernel_size=1,
                    stride=1,
                    padding=0,
                    groups=self.groups, 
                    bias=False)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.to(self.rbr_1x1.weight.data.device)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.squeeze().squeeze()
            identity_conv_1x1.weight.data.fill_(0.0)
            identity_conv_1x1.weight.data.fill_diagonal_(1.0)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.unsqueeze(2).unsqueeze(3)

            identity_conv_1x1           = self.fuse_conv_bn(identity_conv_1x1, self.rbr_identity)
            bias_identity_expanded      = identity_conv_1x1.bias
            weight_identity_expanded    = torch.nn.functional.pad(identity_conv_1x1.weight, [1, 1, 1, 1])            
        else:
            bias_identity_expanded      = torch.nn.Parameter( torch.zeros_like(rbr_1x1_bias) )
            weight_identity_expanded    = torch.nn.Parameter( torch.zeros_like(weight_1x1_expanded) )            
        
        self.rbr_dense.weight   = torch.nn.Parameter(self.rbr_dense.weight + weight_1x1_expanded + weight_identity_expanded)
        self.rbr_dense.bias     = torch.nn.Parameter(self.rbr_dense.bias + rbr_1x1_bias + bias_identity_expanded)
                
        self.rbr_reparam    = self.rbr_dense
        self.deploy         = True

        if self.rbr_identity is not None:
            del self.rbr_identity
            self.rbr_identity = None

        if self.rbr_1x1 is not None:
            del self.rbr_1x1
            self.rbr_1x1 = None

        if self.rbr_dense is not None:
            del self.rbr_dense
            self.rbr_dense = None
            
def fuse_conv_and_bn(conv, bn):
    fusedconv = nn.Conv2d(conv.in_channels,
                          conv.out_channels,
                          kernel_size=conv.kernel_size,
                          stride=conv.stride,
                          padding=conv.padding,
                          groups=conv.groups,
                          bias=True).requires_grad_(False).to(conv.weight.device)

    w_conv  = conv.weight.clone().view(conv.out_channels, -1)
    w_bn    = torch.diag(bn.weight.div(torch.sqrt(bn.eps + bn.running_var)))
    # fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape).detach())

    b_conv  = torch.zeros(conv.weight.size(0), device=conv.weight.device) if conv.bias is None else conv.bias
    b_bn    = bn.bias - bn.weight.mul(bn.running_mean).div(torch.sqrt(bn.running_var + bn.eps))
    # fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)
    fusedconv.bias.copy_((torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn).detach())
    return fusedconv

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi, pretrained=False):
        super(YoloBody, self).__init__()
        #-----------------------------------------------#
        #   定义了不同yolov7版本的参数
        #-----------------------------------------------#
        transition_channels = {
    
    'l' : 32, 'x' : 40}[phi]
        block_channels      = 32
        panet_channels      = {
    
    'l' : 32, 'x' : 64}[phi]
        e       = {
    
    'l' : 2, 'x' : 1}[phi]
        n       = {
    
    'l' : 4, 'x' : 6}[phi]
        ids     = {
    
    'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]
        conv    = {
    
    'l' : RepConv, 'x' : Conv}[phi]
        #-----------------------------------------------#
        #   输入图片是640, 640, 3
        #-----------------------------------------------#

        #---------------------------------------------------#   
        #   生成主干模型
        #   获得三个有效特征层,他们的shape分别是:
        #   80, 80, 512
        #   40, 40, 1024
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone   = Backbone(transition_channels, block_channels, n, phi, pretrained=pretrained)

        #------------------------加强特征提取网络------------------------# 
        self.upsample   = nn.Upsample(scale_factor=2, mode="nearest")

        # 20, 20, 1024 => 20, 20, 512
        self.sppcspc                = SPPCSPC(transition_channels * 32, transition_channels * 16)
        # 20, 20, 512 => 20, 20, 256 => 40, 40, 256
        self.conv_for_P5            = Conv(transition_channels * 16, transition_channels * 8)
        # 40, 40, 1024 => 40, 40, 256
        self.conv_for_feat2         = Conv(transition_channels * 32, transition_channels * 8)
        # 40, 40, 512 => 40, 40, 256
        self.conv3_for_upsample1    = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        # 40, 40, 256 => 40, 40, 128 => 80, 80, 128
        self.conv_for_P4            = Conv(transition_channels * 8, transition_channels * 4)
        # 80, 80, 512 => 80, 80, 128
        self.conv_for_feat1         = Conv(transition_channels * 16, transition_channels * 4)
        # 80, 80, 256 => 80, 80, 128
        self.conv3_for_upsample2    = Multi_Concat_Block(transition_channels * 8, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids)

        # 80, 80, 128 => 40, 40, 256
        self.down_sample1           = Transition_Block(transition_channels * 4, transition_channels * 4)
        # 40, 40, 512 => 40, 40, 256
        self.conv3_for_downsample1  = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        # 40, 40, 256 => 20, 20, 512
        self.down_sample2           = Transition_Block(transition_channels * 8, transition_channels * 8)
        # 20, 20, 1024 => 20, 20, 512
        self.conv3_for_downsample2  = Multi_Concat_Block(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)
        #------------------------加强特征提取网络------------------------# 

        # 80, 80, 128 => 80, 80, 256
        self.rep_conv_1 = conv(transition_channels * 4, transition_channels * 8, 3, 1)
        # 40, 40, 256 => 40, 40, 512
        self.rep_conv_2 = conv(transition_channels * 8, transition_channels * 16, 3, 1)
        # 20, 20, 512 => 20, 20, 1024
        self.rep_conv_3 = conv(transition_channels * 16, transition_channels * 32, 3, 1)

        # 4 + 1 + num_classes
        # 80, 80, 256 => 80, 80, 3 * 25 (4 + 1 + 20) & 85 (4 + 1 + 80)
        self.yolo_head_P3 = nn.Conv2d(transition_channels * 8, len(anchors_mask[2]) * (5 + 1 + num_classes), 1)
        # 40, 40, 512 => 40, 40, 3 * 25 & 85
        self.yolo_head_P4 = nn.Conv2d(transition_channels * 16, len(anchors_mask[1]) * (5 + 1 + num_classes), 1)
        # 20, 20, 512 => 20, 20, 3 * 25 & 85
        self.yolo_head_P5 = nn.Conv2d(transition_channels * 32, len(anchors_mask[0]) * (5 + 1 + num_classes), 1)

    def fuse(self):
        print('Fusing layers... ')
        for m in self.modules():
            if isinstance(m, RepConv):
                m.fuse_repvgg_block()
            elif type(m) is Conv and hasattr(m, 'bn'):
                m.conv = fuse_conv_and_bn(m.conv, m.bn)
                delattr(m, 'bn')
                m.forward = m.fuseforward
        return self
    
    def forward(self, x):
        #  backbone
        feat1, feat2, feat3 = self.backbone.forward(x)
        
        #------------------------加强特征提取网络------------------------# 
        # 20, 20, 1024 => 20, 20, 512
        P5          = self.sppcspc(feat3)
        # 20, 20, 512 => 20, 20, 256
        P5_conv     = self.conv_for_P5(P5)
        # 20, 20, 256 => 40, 40, 256
        P5_upsample = self.upsample(P5_conv)
        # 40, 40, 256 cat 40, 40, 256 => 40, 40, 512
        P4          = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        # 40, 40, 512 => 40, 40, 256
        P4          = self.conv3_for_upsample1(P4)

        # 40, 40, 256 => 40, 40, 128
        P4_conv     = self.conv_for_P4(P4)
        # 40, 40, 128 => 80, 80, 128
        P4_upsample = self.upsample(P4_conv)
        # 80, 80, 128 cat 80, 80, 128 => 80, 80, 256
        P3          = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        # 80, 80, 256 => 80, 80, 128
        P3          = self.conv3_for_upsample2(P3)

        # 80, 80, 128 => 40, 40, 256
        P3_downsample = self.down_sample1(P3)
        # 40, 40, 256 cat 40, 40, 256 => 40, 40, 512
        P4 = torch.cat([P3_downsample, P4], 1)
        # 40, 40, 512 => 40, 40, 256
        P4 = self.conv3_for_downsample1(P4)

        # 40, 40, 256 => 20, 20, 512
        P4_downsample = self.down_sample2(P4)
        # 20, 20, 512 cat 20, 20, 512 => 20, 20, 1024
        P5 = torch.cat([P4_downsample, P5], 1)
        # 20, 20, 1024 => 20, 20, 512
        P5 = self.conv3_for_downsample2(P5)
        #------------------------加强特征提取网络------------------------# 
        # P3 80, 80, 128 
        # P4 40, 40, 256
        # P5 20, 20, 512
        
        P3 = self.rep_conv_1(P3)
        P4 = self.rep_conv_2(P4)
        P5 = self.rep_conv_3(P5)
        #---------------------------------------------------#
        #   第三个特征层
        #   y3=(batch_size, 78, 80, 80)
        #---------------------------------------------------#
        out2 = self.yolo_head_P3(P3)
        #---------------------------------------------------#
        #   第二个特征层
        #   y2=(batch_size, 78, 40, 40)
        #---------------------------------------------------#
        out1 = self.yolo_head_P4(P4)
        #---------------------------------------------------#
        #   第一个特征层
        #   y1=(batch_size, 78, 20, 20)
        #---------------------------------------------------#
        out0 = self.yolo_head_P5(P5)

        return [out0, out1, out2]

3. Decoding of prediction results

1. Obtain prediction frame and score

From the second step, we can obtain the prediction results of the three feature layers, and the shapes are ( N,20,20,258 ), ( N,40,40,258 ), ( N,80,80,258 ) data.

However, this prediction result does not correspond to the position of the final prediction frame on the picture, and it needs to be decoded to complete. In YoloV7, there are 3 prior frames for each feature point on each feature layer.

The last 258 of each feature layer can be split into three 86, corresponding to the 86 parameters of the three prior frames , we first reshape it, the result is ( N,20,20,3,86 ), ( N ,40.40,3,86 ), ( N,80,80,3,86 ).

85 of them can be split into 5+1+80.
The first 5 parameters are used to judge the regression parameters of each feature point, and the prediction frame can be obtained after the regression parameters are adjusted; the
6th parameter is used to judge whether each feature point contains an object;
the last 80 parameters are used to judge each feature point The type of object to include.

The first five parameters are [x,y,w,h,θ]. Among them, x and y are the center coordinates of the rotating coordinate system, θ is the acute angle between the rotating coordinate system and the x-axis, and the counterclockwise direction is specified as a negative angle.
insert image description here
Taking ( N,20,20,3,86 ) as an example, this feature layer is equivalent to dividing the image into 20x20 feature points. If a feature point falls in the corresponding frame of the object, it is used to predict the object.

As shown in the figure, the blue points are 20x20 feature points. At this time, we will demonstrate the decoding operation of the three prior frames of the black points on the left:
1. Calculate the center prediction point, and use Regression to predict the first two The content of the serial number offsets the center coordinates of the three prior frames of the feature points, and after the offset, there are three red points in the right figure;
2. Calculate the width and height of the predicted frame, and use Regression to predict the contents of the last two serial numbers Obtain the width and height of the predicted frame after calculating the exponent;
3. The predicted frame obtained at this time can be drawn on the picture.
4. The fifth parameter theta indicates the offset angle of the rectangle frame rotation, and the value range is [-π/2, π/2).


In addition to such decoding operations, there are non-maximum suppression operations that need to be performed to prevent the accumulation of boxes of the same kind.

import numpy as np
import torch
import math
from utils.nms_rotated import obb_nms

class DecodeBox():
    def __init__(self, anchors, num_classes, input_shape, anchors_mask = [[6,7,8], [3,4,5], [0,1,2]]):
        super(DecodeBox, self).__init__()
        self.anchors        = anchors
        self.num_classes    = num_classes
        self.bbox_attrs     = 6 + num_classes
        self.input_shape    = input_shape
        #-----------------------------------------------------------#
        #   13x13的特征层对应的anchor是[142, 110],[192, 243],[459, 401]
        #   26x26的特征层对应的anchor是[36, 75],[76, 55],[72, 146]
        #   52x52的特征层对应的anchor是[12, 16],[19, 36],[40, 28]
        #-----------------------------------------------------------#
        self.anchors_mask   = anchors_mask

    def decode_box(self, inputs):
        outputs = []
        for i, input in enumerate(inputs):
            #-----------------------------------------------#
            #   输入的input一共有三个,他们的shape分别是
            #   batch_size = 1
            #   batch_size, 3 * (5 + 1 + 80), 20, 20
            #   batch_size, 255, 40, 40
            #   batch_size, 255, 80, 80
            #-----------------------------------------------#
            batch_size      = input.size(0)
            input_height    = input.size(2)
            input_width     = input.size(3)

            #-----------------------------------------------#
            #   输入为640x640时
            #   stride_h = stride_w = 32、16、8
            #-----------------------------------------------#
            stride_h = self.input_shape[0] / input_height
            stride_w = self.input_shape[1] / input_width
            #-------------------------------------------------#
            #   此时获得的scaled_anchors大小是相对于特征层的
            #-------------------------------------------------#
            scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors[self.anchors_mask[i]]]

            #-----------------------------------------------#
            #   输入的input一共有三个,他们的shape分别是
            #   batch_size, 3, 20, 20, 86
            #   batch_size, 3, 40, 40, 86
            #   batch_size, 3, 80, 80, 86
            #-----------------------------------------------#
            prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                    self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

            #-----------------------------------------------#
            #   先验框的中心位置的调整参数
            #-----------------------------------------------#
            x = torch.sigmoid(prediction[..., 0])  
            y = torch.sigmoid(prediction[..., 1])
            #-----------------------------------------------#
            #   先验框的宽高调整参数
            #-----------------------------------------------#
            w = torch.sigmoid(prediction[..., 2]) 
            h = torch.sigmoid(prediction[..., 3]) 
            #-----------------------------------------------#
            #   获取旋转角度
            #-----------------------------------------------#
            angle       = torch.sigmoid(prediction[..., 4])
            #-----------------------------------------------#
            #   获得置信度,是否有物体
            #-----------------------------------------------#
            conf        = torch.sigmoid(prediction[..., 5])
            #-----------------------------------------------#
            #   种类置信度
            #-----------------------------------------------#
            pred_cls    = torch.sigmoid(prediction[..., 6:])

            FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
            LongTensor  = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

            #----------------------------------------------------------#
            #   生成网格,先验框中心,网格左上角 
            #   batch_size,3,20,20
            #----------------------------------------------------------#
            grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)
            grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

            #----------------------------------------------------------#
            #   按照网格格式生成先验框的宽高
            #   batch_size,3,20,20
            #----------------------------------------------------------#
            anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
            anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
            anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
            anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

            #----------------------------------------------------------#
            #   利用预测结果对先验框进行调整
            #   首先调整先验框的中心,从先验框中心向右下角偏移
            #   再调整先验框的宽高。
            #   x 0 ~ 1 => 0 ~ 2 => -0.5, 1.5 => 负责一定范围的目标的预测
            #   y 0 ~ 1 => 0 ~ 2 => -0.5, 1.5 => 负责一定范围的目标的预测
            #   w 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => 先验框的宽高调节范围为0~4倍
            #   h 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => 先验框的宽高调节范围为0~4倍
            #----------------------------------------------------------#
            pred_boxes          = FloatTensor(prediction[..., :4].shape)
            pred_boxes[..., 0]  = x.data * 2. - 0.5 + grid_x
            pred_boxes[..., 1]  = y.data * 2. - 0.5 + grid_y
            pred_boxes[..., 2]  = (w.data * 2) ** 2 * anchor_w
            pred_boxes[..., 3]  = (h.data * 2) ** 2 * anchor_h
            pred_theta          = (angle.data - 0.5) * math.pi
            #----------------------------------------------------------#
            #   将输出结果归一化成小数的形式
            #----------------------------------------------------------#
            _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)
            output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale, pred_theta.view(batch_size, -1, 1),
                                conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
            outputs.append(output.data)
        return outputs

    def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5, nms_thres=0.4):
        #----------------------------------------------------------#
        #   prediction  [batch_size, num_anchors, 85]
        #----------------------------------------------------------#

        output = [None for _ in range(len(prediction))]
        for i, image_pred in enumerate(prediction):
            #----------------------------------------------------------#
            #   对种类预测部分取max。
            #   class_conf  [num_anchors, 1]    种类置信度
            #   class_pred  [num_anchors, 1]    种类
            #----------------------------------------------------------#
            class_conf, class_pred = torch.max(image_pred[:, 6:6 + num_classes], 1, keepdim=True)

            #----------------------------------------------------------#
            #   利用置信度进行第一轮筛选
            #----------------------------------------------------------#
            conf_mask = (image_pred[:, 5] * class_conf[:, 0] >= conf_thres).squeeze()
            #----------------------------------------------------------#
            #   根据置信度进行预测结果的筛选
            #----------------------------------------------------------#
            image_pred = image_pred[conf_mask]
            class_conf = class_conf[conf_mask]
            class_pred = class_pred[conf_mask]
            if not image_pred.size(0):
                continue
            #-------------------------------------------------------------------------#
            #   detections  [num_anchors, 8]
            #   8的内容为:x, y, w, h, angle, obj_conf, class_conf, class_pred
            #-------------------------------------------------------------------------#
            detections = torch.cat((image_pred[:, :6], class_conf.float(), class_pred.float()), 1)

            #------------------------------------------#
            #   获得预测结果中包含的所有种类
            #------------------------------------------#
            unique_labels = detections[:, -1].cpu().unique()

            if prediction.is_cuda:
                unique_labels = unique_labels.cuda()
                detections = detections.cuda()

            for c in unique_labels:
                #------------------------------------------#
                #   获得某一类得分筛选后全部的预测结果
                #------------------------------------------#
                detections_class = detections[detections[:, -1] == c]

                #------------------------------------------#
                #   使用官方自带的非极大抑制会速度更快一些!
                #   筛选出一定区域内,属于同一种类得分最大的框
                #------------------------------------------#
                _, keep = obb_nms(
                    detections_class[:, :5],
                    detections_class[:, 5] * detections_class[:, 6],
                    nms_thres
                )
                max_detections = detections_class[keep]
                
                # Add max detections to outputs
                output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))
            
            if output[i] is not None:
                output[i] = output[i].cpu().numpy()
        return output

2. Score screening and non-maximum suppression

After the final prediction results are obtained, score sorting and non-maximum suppression screening are performed .

Score screening is to screen out prediction boxes whose scores meet the confidence level.
Non-maximum suppression is to filter out the box with the highest score belonging to the same category in a certain area.

The process of score screening and non-maximum suppression can be summarized as follows:
1. Find the frame in the picture whose score is greater than the threshold function . Score screening before coincident box screening can greatly reduce the number of boxes.
2. Cycling the category , the function of non-maximum suppression is to filter out the frame that belongs to the same category with the highest score in a certain area . Cycling the category can help us perform non-maximum suppression on each category.
3. Sort the category from largest to smallest according to the score.
4. Take out the box with the highest score each time, and calculate the degree of overlap with all other predicted boxes, and remove the ones that overlap too much.

The results of score screening and non-maximum suppression can be used to draw prediction boxes.

The image below is non-maximally suppressed.
insert image description here

The figure below is without non-maximum suppression.
insert image description here

The implementation code is:

def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5, nms_thres=0.4):
        #----------------------------------------------------------#
        #   prediction  [batch_size, num_anchors, 85]
        #----------------------------------------------------------#

        output = [None for _ in range(len(prediction))]
        for i, image_pred in enumerate(prediction):
            #----------------------------------------------------------#
            #   对种类预测部分取max。
            #   class_conf  [num_anchors, 1]    种类置信度
            #   class_pred  [num_anchors, 1]    种类
            #----------------------------------------------------------#
            class_conf, class_pred = torch.max(image_pred[:, 6:6 + num_classes], 1, keepdim=True)

            #----------------------------------------------------------#
            #   利用置信度进行第一轮筛选
            #----------------------------------------------------------#
            conf_mask = (image_pred[:, 5] * class_conf[:, 0] >= conf_thres).squeeze()
            #----------------------------------------------------------#
            #   根据置信度进行预测结果的筛选
            #----------------------------------------------------------#
            image_pred = image_pred[conf_mask]
            class_conf = class_conf[conf_mask]
            class_pred = class_pred[conf_mask]
            if not image_pred.size(0):
                continue
            #-------------------------------------------------------------------------#
            #   detections  [num_anchors, 8]
            #   8的内容为:x, y, w, h, angle, obj_conf, class_conf, class_pred
            #-------------------------------------------------------------------------#
            detections = torch.cat((image_pred[:, :6], class_conf.float(), class_pred.float()), 1)

            #------------------------------------------#
            #   获得预测结果中包含的所有种类
            #------------------------------------------#
            unique_labels = detections[:, -1].cpu().unique()

            if prediction.is_cuda:
                unique_labels = unique_labels.cuda()
                detections = detections.cuda()

            for c in unique_labels:
                #------------------------------------------#
                #   获得某一类得分筛选后全部的预测结果
                #------------------------------------------#
                detections_class = detections[detections[:, -1] == c]

                #------------------------------------------#
                #   使用官方自带的非极大抑制会速度更快一些!
                #   筛选出一定区域内,属于同一种类得分最大的框
                #------------------------------------------#
                _, keep = obb_nms(
                    detections_class[:, :5],
                    detections_class[:, 5] * detections_class[:, 6],
                    nms_thres
                )
                max_detections = detections_class[keep]
                
                # Add max detections to outputs
                output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))
            
            if output[i] is not None:
                output[i] = output[i].cpu().numpy()
        return output

4. Training part

1. Calculate the content required for loss

Calculating loss is actually a comparison between the predicted results of the network and the real results of the network.
Like the prediction results of the network, the loss of the network is also composed of three parts, namely the Reg part, the Obj part, and the Cls part. The Reg part is the regression parameter judgment of the feature point, the Obj part is the judgment of whether the feature point contains an object, and the Cls part is the type of object contained in the feature point.

2. The matching process of positive samples

In YoloV7-OBB, the matching process of positive samples during training can be divided into two parts.
a. For each real frame, roughly match the prior frame and feature points by coordinates, width and height.
b. Use SimOTA to adaptively select how many a priori boxes each real box corresponds to.

The so-called positive sample matching is to find out which a priori boxes are considered to have corresponding real boxes, and is responsible for the prediction of this real box .

a. Match the prior frame and feature points

In this part, YoloV7-OBB performs rough matching on each ground truth box. Find which a priori boxes on which feature points can be responsible for the prediction of the real box.

First, match the prior frame. In the YoloV7-OBB network, a total of 9 prior frames of different sizes are designed. Each output feature layer corresponds to 3 prior boxes.

For any real frame gt, YoloV7-OBB no longer uses iou to match the positive samples, but directly uses the aspect ratio for matching, that is, uses the real frame and 9 a priori frames of different sizes to calculate the aspect ratio.

If the width-to-height ratio between the ground truth box and a priori box is greater than the set threshold, it means that the match between the ground truth box and the priori box is not enough, and the priori box is considered as a negative sample.

For example, there is a real frame at this time, its width and height are [200, 200], which is a square . The 9 a priori boxes set by default in YoloV7 are [12, 16], [19, 36], [40, 28], [36, 75], [76, 55], [72, 146], [142, 110 ] ], [192, 243], [459, 401] . Set Threshold Threshold to 4 .

At this point we need to calculate the width-to-height ratio of the real frame and the 9 prior frames . There are two situations when comparing the width and height, one is that the width and height of the real frame are larger than the prior frame, and the other is that the width and height of the prior frame are larger than the real frame . Therefore, we need to calculate at the same time: the width and height of the real frame/the width and height of the prior frame; the width and height of the prior frame/the width and height of the real frame. Then pick the maximum value among them.

The next list is the comparison result. This is a matrix with a shape of [9, 4]. 9 represents 9 a priori boxes, and 4 represents the width and height of the real box/the width and height of the a priori box; the width and height of the a priori box/ The width and height of the ground truth box.

[[16.66666667 12.5         0.06        0.08      ]
 [10.52631579  5.55555556  0.095       0.18      ]
 [ 5.          7.14285714  0.2         0.14      ]
 [ 5.55555556  2.66666667  0.18        0.375     ]
 [ 2.63157895  3.63636364  0.38        0.275     ]
 [ 2.77777778  1.36986301  0.36        0.73      ]
 [ 1.4084507   1.81818182  0.71        0.55      ]
 [ 1.04166667  0.82304527  0.96        1.215     ]
 [ 0.43572985  0.49875312  2.295       2.005     ]]

Then take the maximum value for the comparison result of each prior box. The following matrix is ​​obtained:

[16.66666667 10.52631579  7.14285714  5.55555556  3.63636364  2.77777778
  1.81818182  1.215       2.295     ]

Afterwards, we judge which a priori boxes have a comparison result whose value is less than the threshold . It can be known that [76, 55], [72, 146], [142, 110], [192, 243], [459, 401] five a priori boxes all meet the requirements.

[142, 110], [192, 243], [459, 401] belong to the feature layers of 20, 20.
[76, 55], [72, 146] belong to the feature layers of 40, 40.

At this point we can already judge which sizes of a priori boxes can be used for the prediction of the real box.

In the past Yolo of YoloV5, each real frame is predicted by the feature point in the upper left corner of the grid where its center point is located.

In YoloV7-OBB, like YoloV5, for the selected feature layer, first calculate which grid the real frame falls in. At this time, the feature point in the upper left corner of the grid is a feature point responsible for prediction.

At the same time, the rounding rule is used to find the nearest two grids, and all three grids are considered to be responsible for predicting the real box.

The red dot indicates the center of the real box , and besides the current grid, its two nearest neighbor grids are also selected. From here, it can be found that the value range of the XY axis offset part of the prediction frame is no longer 0-1, but 0.5-1.5.

After finding the corresponding feature points, the corresponding feature points are responsible for the prediction of the real frame in the prior frame that satisfies the aspect ratio.

But this step is only a rough screening, and we will use simOTA to accurately screen later.

def find_3_positive(self, predictions, targets):
        #------------------------------------#
        #   获得每个特征层先验框的数量
        #   与真实框的数量
        #------------------------------------#
        num_anchor, num_gt  = len(self.anchors_mask[0]), targets.shape[0] 
        #------------------------------------#
        #   创建空列表存放indices和anchors
        #------------------------------------#
        indices, anchors    = [], []
        #------------------------------------#
        #   创建8个1
        #   序号0,1为1
        #   序号2:7为特征层的高宽
        #   序号7为1
        #------------------------------------#
        gain    = torch.ones(8, device=targets.device)
        #------------------------------------#
        #   ai      [num_anchor, num_gt]
        #   targets [num_gt, 6] => [num_anchor, num_gt, 8]
        #------------------------------------#
        ai      = torch.arange(num_anchor, device=targets.device).float().view(num_anchor, 1).repeat(1, num_gt)
        targets = torch.cat((targets.repeat(num_anchor, 1, 1), ai[:, :, None]), 2)  # append anchor indices
        # targets (tensor): (na, n_gt_all_batch, [img_index, clsid, cx, cy, l, s, theta, anchor_index]])
        g   = 0.5 # offsets
        off = torch.tensor([
            [0, 0],
            [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
            # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
        ], device=targets.device).float() * g 

        for i in range(len(predictions)):
            #----------------------------------------------------#
            #   将先验框除以stride,获得相对于特征层的先验框。
            #   anchors_i [num_anchor, 2]
            #----------------------------------------------------#
            anchors_i = torch.from_numpy(self.anchors[i] / self.stride[i]).type_as(predictions[i])
            anchors_i, shape = torch.from_numpy(self.anchors[i] / self.stride[i]).type_as(predictions[i]), predictions[i].shape
            #-------------------------------------------#
            #   计算获得对应特征层的高宽
            #-------------------------------------------#
            gain[2:6] = torch.tensor(predictions[i].shape)[[3, 2, 3, 2]]
            
            #-------------------------------------------#
            #   将真实框乘上gain,
            #   其实就是将真实框映射到特征层上
            #-------------------------------------------#
            t = targets * gain
            if num_gt:
                #-------------------------------------------#
                #   计算真实框与先验框高宽的比值
                #   然后根据比值大小进行判断,
                #   判断结果用于取出,获得所有先验框对应的真实框
                #   r   [num_anchor, num_gt, 2]
                #   t   [num_anchor, num_gt, 7] => [num_matched_anchor, 7]
                #-------------------------------------------#
                r = t[:, :, 4:6] / anchors_i[:, None]
                j = torch.max(r, 1. / r).max(2)[0] < self.threshold
                t = t[j]  # filter
                
                #-------------------------------------------#
                #   gxy 获得所有先验框对应的真实框的x轴y轴坐标
                #   gxi 取相对于该特征层的右小角的坐标
                #-------------------------------------------#
                gxy     = t[:, 2:4] # grid xy
                gxi     = gain[[2, 3]] - gxy # inverse
                j, k    = ((gxy % 1. < g) & (gxy > 1.)).T
                l, m    = ((gxi % 1. < g) & (gxi > 1.)).T
                j       = torch.stack((torch.ones_like(j), j, k, l, m))
                
                #-------------------------------------------#
                #   t   重复5次,使用满足条件的j进行框的提取
                #   j   一共五行,代表当前特征点在五个
                #       [0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]
                #       方向是否存在
                #-------------------------------------------#
                t       = t.repeat((5, 1, 1))[j]
                offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
            else:
                t = targets[0]
                offsets = 0

            #-------------------------------------------#
            #   b   代表属于第几个图片
            #   gxy 代表该真实框所处的x、y中心坐标
            #   gwh 代表该真实框的wh坐标
            #   gij 代表真实框所属的特征点坐标
            #-------------------------------------------#
            b, c    = t[:, :2].long().T  # image, class
            gxy     = t[:, 2:4]  # grid xy
            gwh     = t[:, 4:6]  # grid wh
            gij     = (gxy - offsets).long()
            gi, gj  = gij.T  # grid xy indices

            #-------------------------------------------#
            #   gj、gi不能超出特征层范围
            #   a代表属于该特征点的第几个先验框
            #-------------------------------------------#
            a = t[:, -1].long()  # anchor indices
            indices.append((b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1)))  # image, anchor, grid indices
            anchors.append(anchors_i[a])  # anchors

        return indices, anchors

b. SimOTA adaptive matching

In YoloV7-OBB, we will calculate a Cost cost matrix, which represents the cost relationship between each real frame and each feature point. The Cost cost matrix consists of two parts: 1. Each real frame and current feature point
prediction 2. The type prediction accuracy
of each real frame and the current feature point prediction frame;

The higher the degree of overlap between each real frame and the current feature point prediction frame, it means that this feature point has tried to fit the real frame, so its Cost will be smaller.

The higher the type prediction accuracy of each real frame and the current feature point prediction frame, it also means that this feature point has tried to fit the real frame, so its Cost will be smaller.

The purpose of the cost matrix is ​​to adaptively find the real frame that the current feature point should fit. The higher the coincidence degree, the more fitting is required, the more accurate the classification is, the more fitting is required, and the more fitting is required within a certain radius.

In SimOTA, different targets set different numbers of positive samples (dynamick). Taking the ants and watermelons in Megvii’s official answer as an example, the traditional positive sample allocation scheme often assigns the same number of positive samples to watermelons and ants in the same scene. The number of positive samples, either the ants have many low-quality positive samples, or the watermelon only has one or two positive samples. It is not suitable for any distribution method.
The key to dynamic positive sample setting is how to determine k. The specific method of SimOTA is to first calculate the 10 feature points with the lowest cost of each target, and then add the prediction frame corresponding to these ten feature points with the IOU of the real frame to obtain the final the k.

Therefore, the process of SimOTA is summarized as follows:
1. Calculate the degree of overlap between each real frame and the predicted frame of the current feature point.
2. Calculation Add up the IOU of the 20 predicted frames with the highest coincidence degree and the real frame to obtain the k of each real frame, which means that each real frame has k feature points corresponding to it.
3. Calculate the type prediction accuracy of each real frame and the current feature point prediction frame.
4. Calculate the Cost matrix.
5. Take the k points with the lowest Cost as the positive samples of the real frame.

def build_targets(self, predictions, targets, imgs):
        #-------------------------------------------#
        #   匹配正样本
        #-------------------------------------------#
        indices, anch       = self.find_3_positive(predictions, targets)

        matching_bs         = [[] for _ in predictions]
        matching_as         = [[] for _ in predictions]
        matching_gjs        = [[] for _ in predictions]
        matching_gis        = [[] for _ in predictions]
        matching_targets    = [[] for _ in predictions]
        matching_anchs      = [[] for _ in predictions]
        
        #-------------------------------------------#
        #   一共三层
        #-------------------------------------------#
        num_layer = len(predictions)
        #-------------------------------------------#
        #   对batch_size进行循环,进行OTA匹配
        #   在batch_size循环中对layer进行循环
        #-------------------------------------------#
        for batch_idx in range(predictions[0].shape[0]):
            #-------------------------------------------#
            #   先判断匹配上的真实框哪些属于该图片
            #-------------------------------------------#
            b_idx       = targets[:, 0]==batch_idx
            this_target = targets[b_idx]
            #  targets (tensor): (n_gt_all_batch, [img_index clsid cx cy l s theta ])
            #-------------------------------------------#
            #   如果没有真实框属于该图片则continue
            #-------------------------------------------#
            if this_target.shape[0] == 0:
                continue
            
            #-------------------------------------------#
            #   真实框的坐标进行缩放
            #-------------------------------------------#
            txywh = this_target[:, 2:6] * imgs[batch_idx].shape[1]
            #-------------------------------------------#
            #   从中心宽高到左上角右下角
            #-------------------------------------------#
            txyxy = torch.cat((txywh, this_target[:,6:]), dim=-1)

            pxyxys      = []
            p_cls       = []
            p_obj       = []
            from_which_layer = []
            all_b       = []
            all_a       = []
            all_gj      = []
            all_gi      = []
            all_anch    = []
            
            #-------------------------------------------#
            #   对三个layer进行循环
            #-------------------------------------------#
            for i, prediction in enumerate(predictions):
                #-------------------------------------------#
                #   b代表第几张图片 a代表第几个先验框
                #   gj代表y轴,gi代表x轴
                #-------------------------------------------#
                b, a, gj, gi    = indices[i]
                idx             = (b == batch_idx)
                b, a, gj, gi    = b[idx], a[idx], gj[idx], gi[idx]       
                       
                all_b.append(b)
                all_a.append(a)
                all_gj.append(gj)
                all_gi.append(gi)
                all_anch.append(anch[i][idx])
                from_which_layer.append(torch.ones(size=(len(b),)) * i)
                
                #-------------------------------------------#
                #   取出这个真实框对应的预测结果
                #-------------------------------------------#
                fg_pred = prediction[b, a, gj, gi]                
                p_obj.append(fg_pred[:, 5:6]) # [4:5] = theta
                p_cls.append(fg_pred[:, 6:])
                
                #-------------------------------------------#
                #   获得网格后,进行解码
                #-------------------------------------------#
                grid    = torch.stack([gi, gj], dim=1).type_as(fg_pred)
                pxy     = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i]
                pwh     = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i]
                # 获取预测的旋转角度
                pangle  = (fg_pred[:, 4:5].sigmoid() - 0.5) * math.pi
                pxywh   = torch.cat([pxy, pwh, pangle], dim=-1)
                pxyxys.append(pxywh)
            
            #-------------------------------------------#
            #   判断是否存在对应的预测框,不存在则跳过
            #-------------------------------------------#
            pxyxys = torch.cat(pxyxys, dim=0)
            if pxyxys.shape[0] == 0:
                continue
            
            #-------------------------------------------#
            #   进行堆叠
            #-------------------------------------------#
            p_obj       = torch.cat(p_obj, dim=0)
            p_cls       = torch.cat(p_cls, dim=0)
            from_which_layer = torch.cat(from_which_layer, dim=0)
            all_b       = torch.cat(all_b, dim=0)
            all_a       = torch.cat(all_a, dim=0)
            all_gj      = torch.cat(all_gj, dim=0)
            all_gi      = torch.cat(all_gi, dim=0)
            all_anch    = torch.cat(all_anch, dim=0)
        
            #-------------------------------------------------------------#
            #   计算当前图片中,真实框与预测框的重合程度
            #   iou的范围为0-1,取-log后为0~inf
            #   重合程度越大,取-log后越小
            #   因此,真实框与预测框重合度越大,pair_wise_iou_loss越小
            #-------------------------------------------------------------#
            pair_wise_iou_loss = compute_kld_loss(txyxy, pxyxys, taf=1.0, fun='sqrt')
            pair_wise_iou      = 1 - pair_wise_iou_loss

            #-------------------------------------------#
            #   最多二十个预测框与真实框的重合程度
            #   然后求和,找到每个真实框对应几个预测框
            #-------------------------------------------#
            top_k, _    = torch.topk(pair_wise_iou, min(20, pair_wise_iou.shape[1]), dim=1)
            dynamic_ks  = torch.clamp(top_k.sum(1).int(), min=1)

            #-------------------------------------------#
            #   gt_cls_per_image    种类的真实信息
            #-------------------------------------------#
            gt_cls_per_image = F.one_hot(this_target[:, 1].to(torch.int64), self.num_classes).float().unsqueeze(1).repeat(1, pxyxys.shape[0], 1)
            
            #-------------------------------------------#
            #   cls_preds_  种类置信度的预测信息
            #               cls_preds_越接近于1,y越接近于1
            #               y / (1 - y)越接近于无穷大
            #               也就是种类置信度预测的越准
            #               pair_wise_cls_loss越小
            #-------------------------------------------#
            num_gt              = this_target.shape[0]
            cls_preds_          = p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
            y                   = cls_preds_.sqrt_()
            pair_wise_cls_loss  = F.binary_cross_entropy_with_logits(torch.log(y / (1 - y)), gt_cls_per_image, reduction="none").sum(-1)
            del cls_preds_
        
            #-------------------------------------------#
            #   求cost的总和
            #-------------------------------------------#
            cost = (
                pair_wise_cls_loss
                + 3.0 * pair_wise_iou_loss
            )

            #-------------------------------------------#
            #   求cost最小的k个预测框
            #-------------------------------------------#
            matching_matrix = torch.zeros_like(cost)
            for gt_idx in range(num_gt):
                _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
                matching_matrix[gt_idx][pos_idx] = 1.0

            del top_k, dynamic_ks

            #-------------------------------------------#
            #   如果一个预测框对应多个真实框
            #   只使用这个预测框最对应的真实框
            #-------------------------------------------#
            anchor_matching_gt = matching_matrix.sum(0)
            if (anchor_matching_gt > 1).sum() > 0:
                _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
                matching_matrix[:, anchor_matching_gt > 1]          *= 0.0
                matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0
            fg_mask_inboxes = matching_matrix.sum(0) > 0.0
            matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)

            #-------------------------------------------#
            #   取出符合条件的框
            #-------------------------------------------#
            from_which_layer    = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]
            all_b               = all_b[fg_mask_inboxes]
            all_a               = all_a[fg_mask_inboxes]
            all_gj              = all_gj[fg_mask_inboxes]
            all_gi              = all_gi[fg_mask_inboxes]
            all_anch            = all_anch[fg_mask_inboxes]
            this_target         = this_target[matched_gt_inds]
        
            for i in range(num_layer):
                layer_idx = from_which_layer == i
                matching_bs[i].append(all_b[layer_idx])
                matching_as[i].append(all_a[layer_idx])
                matching_gjs[i].append(all_gj[layer_idx])
                matching_gis[i].append(all_gi[layer_idx])
                matching_targets[i].append(this_target[layer_idx])
                matching_anchs[i].append(all_anch[layer_idx])

        for i in range(num_layer):
            matching_bs[i]      = torch.cat(matching_bs[i], dim=0) if len(matching_bs[i]) != 0 else torch.Tensor(matching_bs[i])
            matching_as[i]      = torch.cat(matching_as[i], dim=0) if len(matching_as[i]) != 0 else torch.Tensor(matching_as[i])
            matching_gjs[i]     = torch.cat(matching_gjs[i], dim=0) if len(matching_gjs[i]) != 0 else torch.Tensor(matching_gjs[i])
            matching_gis[i]     = torch.cat(matching_gis[i], dim=0) if len(matching_gis[i]) != 0 else torch.Tensor(matching_gis[i])
            matching_targets[i] = torch.cat(matching_targets[i], dim=0) if len(matching_targets[i]) != 0 else torch.Tensor(matching_targets[i])
            matching_anchs[i]   = torch.cat(matching_anchs[i], dim=0) if len(matching_anchs[i]) != 0 else torch.Tensor(matching_anchs[i])

        return matching_bs, matching_as, matching_gjs, matching_gis, matching_targets, matching_anchs

3. Calculate Loss

It can be seen from the first part that the loss of YoloV7 consists of three parts:
1. The Reg part. From the second part, we can know the prior frame corresponding to each real frame. After obtaining the prior frame corresponding to each frame, take out the prior frame The prediction frame corresponding to the inspection frame, using the real frame and the prediction frame to calculate the KLD loss, as the Loss composition of the Reg part.
2. In the Obj part, the prior frame corresponding to each real frame can be known from the second part. All the prior frames corresponding to the real frame are positive samples, and the remaining prior frames are negative samples. According to the positive and negative samples and feature points The cross-entropy loss is calculated as the prediction result of whether the object is included, as the Loss component of the Obj part.
3. In the Cls part, the prior frame corresponding to each real frame can be known from the third part. After obtaining the prior frame corresponding to each frame, the prediction result of the type of the prior frame is taken out, and according to the type of the real frame and the prior frame The type prediction result of the inspection box is used to calculate the cross entropy loss, which is used as the Loss component of the Cls part.

import math
from copy import deepcopy
from functools import partial

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from utils.kld_loss import compute_kld_loss, KLDloss

def smooth_BCE(eps=0.1):  # https://github.com/ultralytics/yolov3/issues/238#issuecomment-598028441
    # return positive, negative label smoothing BCE targets
    return 1.0 - 0.5 * eps, 0.5 * eps

class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, input_shape, anchors_mask = [[6,7,8], [3,4,5], [0,1,2]], label_smoothing = 0):
        super(YOLOLoss, self).__init__()
        #-----------------------------------------------------------#
        #   13x13的特征层对应的anchor是[142, 110],[192, 243],[459, 401]
        #   26x26的特征层对应的anchor是[36, 75],[76, 55],[72, 146]
        #   52x52的特征层对应的anchor是[12, 16],[19, 36],[40, 28]
        #-----------------------------------------------------------#
        self.anchors        = [anchors[mask] for mask in anchors_mask]
        self.num_classes    = num_classes
        self.input_shape    = input_shape
        self.anchors_mask   = anchors_mask

        self.balance        = [0.4, 1.0, 4]
        self.stride         = [32, 16, 8]
        
        self.box_ratio      = 0.05
        self.obj_ratio      = 1 * (input_shape[0] * input_shape[1]) / (640 ** 2)
        self.cls_ratio      = 0.5 * (num_classes / 80)
        self.threshold      = 4

        self.cp, self.cn                    = smooth_BCE(eps=label_smoothing)  
        self.BCEcls, self.BCEobj, self.gr   = nn.BCEWithLogitsLoss(), nn.BCEWithLogitsLoss(), 1
        self.kldbbox = KLDloss(taf=1.0, fun='sqrt')
    
    def __call__(self, predictions, targets, imgs): 
        #-------------------------------------------#
        #   对输入进来的预测结果进行reshape
        #   bs, 255, 20, 20 => bs, 3, 20, 20, 85
        #   bs, 255, 40, 40 => bs, 3, 40, 40, 85
        #   bs, 255, 80, 80 => bs, 3, 80, 80, 85
        #-------------------------------------------#
        for i in range(len(predictions)):
            bs, _, h, w = predictions[i].size()
            predictions[i] = predictions[i].view(bs, len(self.anchors_mask[i]), -1, h, w).permute(0, 1, 3, 4, 2).contiguous()
            
        #-------------------------------------------#
        #   获得工作的设备
        #-------------------------------------------#
        device              = targets.device
        #-------------------------------------------#
        #   初始化三个部分的损失
        #-------------------------------------------#
        cls_loss, box_loss, obj_loss    = torch.zeros(1, device = device), torch.zeros(1, device = device), torch.zeros(1, device = device)
        
        #-------------------------------------------#
        #   进行正样本的匹配
        #-------------------------------------------#
        bs, as_, gjs, gis, targets, anchors = self.build_targets(predictions, targets, imgs)
        #-------------------------------------------#
        #   计算获得对应特征层的高宽
        #-------------------------------------------#
        feature_map_sizes = [torch.tensor(prediction.shape, device=device)[[3, 2, 3, 2]].type_as(prediction) for prediction in predictions] 
    
        #-------------------------------------------#
        #   计算损失,对三个特征层各自进行处理
        #-------------------------------------------#
        for i, prediction in enumerate(predictions): 
            #-------------------------------------------#
            #   image, anchor, gridy, gridx
            #-------------------------------------------#
            b, a, gj, gi    = bs[i], as_[i], gjs[i], gis[i]
            tobj            = torch.zeros_like(prediction[..., 0], device=device)  # target obj

            #-------------------------------------------#
            #   获得目标数量,如果目标大于0
            #   则开始计算种类损失和回归损失
            #-------------------------------------------#
            n = b.shape[0]
            if n:
                prediction_pos = prediction[b, a, gj, gi]  # prediction subset corresponding to targets
                # prediction_pos [xywh angle conf cls ]
                #-------------------------------------------#
                #   计算匹配上的正样本的回归损失
                #-------------------------------------------#
                #-------------------------------------------#
                #   grid 获得正样本的x、y轴坐标
                #-------------------------------------------#
                grid    = torch.stack([gi, gj], dim=1)
                #-------------------------------------------#
                #   进行解码,获得预测结果
                #-------------------------------------------#
                xy      = prediction_pos[:, :2].sigmoid() * 2. - 0.5
                wh      = (prediction_pos[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
                angle   = (prediction_pos[:, 4:5].sigmoid() - 0.5) * math.pi
                box_theta = torch.cat((xy, wh, angle), 1)
                #-------------------------------------------#
                #   对真实框进行处理,映射到特征层上
                #-------------------------------------------#
                selected_tbox           = targets[i][:, 2:6] * feature_map_sizes[i]
                selected_tbox[:, :2]    -= grid.type_as(prediction)
                theta                   = targets[i][:, 6:7]
                selected_tbox_theta     = torch.cat((selected_tbox, theta),1)
                #-------------------------------------------#
                #   计算预测框和真实框的回归损失
                #-------------------------------------------#
                kldloss                 = self.kldbbox(box_theta, selected_tbox_theta)
                box_loss                += kldloss.mean()
                #-------------------------------------------#
                #   根据预测结果的iou获得置信度损失的gt
                #-------------------------------------------#
                tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * (1 - kldloss).detach().clamp(0).type(tobj.dtype)  # iou ratio

                #-------------------------------------------#
                #   计算匹配上的正样本的分类损失
                #-------------------------------------------#
                selected_tcls               = targets[i][:, 1].long()
                t                           = torch.full_like(prediction_pos[:, 6:], self.cn, device=device)  # targets
                t[range(n), selected_tcls]  = self.cp
                cls_loss                    += self.BCEcls(prediction_pos[:, 6:], t)  # BCE

            #-------------------------------------------#
            #   计算目标是否存在的置信度损失
            #   并且乘上每个特征层的比例
            #-------------------------------------------#
            obj_loss += self.BCEobj(prediction[..., 5], tobj) * self.balance[i]  # obj loss
            
        #-------------------------------------------#
        #   将各个部分的损失乘上比例
        #   全加起来后,乘上batch_size
        #-------------------------------------------#
        box_loss    *= self.box_ratio
        obj_loss    *= self.obj_ratio
        cls_loss    *= self.cls_ratio
        bs          = tobj.shape[0]
        
        loss    = box_loss + obj_loss + cls_loss
        return loss

    def build_targets(self, predictions, targets, imgs):
        #-------------------------------------------#
        #   匹配正样本
        #-------------------------------------------#
        indices, anch       = self.find_3_positive(predictions, targets)

        matching_bs         = [[] for _ in predictions]
        matching_as         = [[] for _ in predictions]
        matching_gjs        = [[] for _ in predictions]
        matching_gis        = [[] for _ in predictions]
        matching_targets    = [[] for _ in predictions]
        matching_anchs      = [[] for _ in predictions]
        
        #-------------------------------------------#
        #   一共三层
        #-------------------------------------------#
        num_layer = len(predictions)
        #-------------------------------------------#
        #   对batch_size进行循环,进行OTA匹配
        #   在batch_size循环中对layer进行循环
        #-------------------------------------------#
        for batch_idx in range(predictions[0].shape[0]):
            #-------------------------------------------#
            #   先判断匹配上的真实框哪些属于该图片
            #-------------------------------------------#
            b_idx       = targets[:, 0]==batch_idx
            this_target = targets[b_idx]
            #  targets (tensor): (n_gt_all_batch, [img_index clsid cx cy l s theta ])
            #-------------------------------------------#
            #   如果没有真实框属于该图片则continue
            #-------------------------------------------#
            if this_target.shape[0] == 0:
                continue
            
            #-------------------------------------------#
            #   真实框的坐标进行缩放
            #-------------------------------------------#
            txywh = this_target[:, 2:6] * imgs[batch_idx].shape[1]
            #-------------------------------------------#
            #   从中心宽高到左上角右下角
            #-------------------------------------------#
            txyxy = torch.cat((txywh, this_target[:,6:]), dim=-1)

            pxyxys      = []
            p_cls       = []
            p_obj       = []
            from_which_layer = []
            all_b       = []
            all_a       = []
            all_gj      = []
            all_gi      = []
            all_anch    = []
            
            #-------------------------------------------#
            #   对三个layer进行循环
            #-------------------------------------------#
            for i, prediction in enumerate(predictions):
                #-------------------------------------------#
                #   b代表第几张图片 a代表第几个先验框
                #   gj代表y轴,gi代表x轴
                #-------------------------------------------#
                b, a, gj, gi    = indices[i]
                idx             = (b == batch_idx)
                b, a, gj, gi    = b[idx], a[idx], gj[idx], gi[idx]       
                       
                all_b.append(b)
                all_a.append(a)
                all_gj.append(gj)
                all_gi.append(gi)
                all_anch.append(anch[i][idx])
                from_which_layer.append(torch.ones(size=(len(b),)) * i)
                
                #-------------------------------------------#
                #   取出这个真实框对应的预测结果
                #-------------------------------------------#
                fg_pred = prediction[b, a, gj, gi]                
                p_obj.append(fg_pred[:, 5:6]) # [4:5] = theta
                p_cls.append(fg_pred[:, 6:])
                
                #-------------------------------------------#
                #   获得网格后,进行解码
                #-------------------------------------------#
                grid    = torch.stack([gi, gj], dim=1).type_as(fg_pred)
                pxy     = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i]
                pwh     = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i]
                pangle  = (fg_pred[:, 4:5].sigmoid() - 0.5) * math.pi
                pxywh   = torch.cat([pxy, pwh, pangle], dim=-1)
                pxyxys.append(pxywh)
            
            #-------------------------------------------#
            #   判断是否存在对应的预测框,不存在则跳过
            #-------------------------------------------#
            pxyxys = torch.cat(pxyxys, dim=0)
            if pxyxys.shape[0] == 0:
                continue
            
            #-------------------------------------------#
            #   进行堆叠
            #-------------------------------------------#
            p_obj       = torch.cat(p_obj, dim=0)
            p_cls       = torch.cat(p_cls, dim=0)
            from_which_layer = torch.cat(from_which_layer, dim=0)
            all_b       = torch.cat(all_b, dim=0)
            all_a       = torch.cat(all_a, dim=0)
            all_gj      = torch.cat(all_gj, dim=0)
            all_gi      = torch.cat(all_gi, dim=0)
            all_anch    = torch.cat(all_anch, dim=0)
        
            #-------------------------------------------------------------#
            #   计算当前图片中,真实框与预测框的重合程度
            #   iou的范围为0-1,取-log后为0~inf
            #   重合程度越大,取-log后越小
            #   因此,真实框与预测框重合度越大,pair_wise_iou_loss越小
            #-------------------------------------------------------------#
            pair_wise_iou_loss = compute_kld_loss(txyxy, pxyxys, taf=1.0, fun='sqrt')
            pair_wise_iou      = 1 - pair_wise_iou_loss

            #-------------------------------------------#
            #   最多二十个预测框与真实框的重合程度
            #   然后求和,找到每个真实框对应几个预测框
            #-------------------------------------------#
            top_k, _    = torch.topk(pair_wise_iou, min(20, pair_wise_iou.shape[1]), dim=1)
            dynamic_ks  = torch.clamp(top_k.sum(1).int(), min=1)

            #-------------------------------------------#
            #   gt_cls_per_image    种类的真实信息
            #-------------------------------------------#
            gt_cls_per_image = F.one_hot(this_target[:, 1].to(torch.int64), self.num_classes).float().unsqueeze(1).repeat(1, pxyxys.shape[0], 1)
            
            #-------------------------------------------#
            #   cls_preds_  种类置信度的预测信息
            #               cls_preds_越接近于1,y越接近于1
            #               y / (1 - y)越接近于无穷大
            #               也就是种类置信度预测的越准
            #               pair_wise_cls_loss越小
            #-------------------------------------------#
            num_gt              = this_target.shape[0]
            cls_preds_          = p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
            y                   = cls_preds_.sqrt_()
            pair_wise_cls_loss  = F.binary_cross_entropy_with_logits(torch.log(y / (1 - y)), gt_cls_per_image, reduction="none").sum(-1)
            del cls_preds_
        
            #-------------------------------------------#
            #   求cost的总和
            #-------------------------------------------#
            cost = (
                pair_wise_cls_loss
                + 3.0 * pair_wise_iou_loss
            )

            #-------------------------------------------#
            #   求cost最小的k个预测框
            #-------------------------------------------#
            matching_matrix = torch.zeros_like(cost)
            for gt_idx in range(num_gt):
                _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
                matching_matrix[gt_idx][pos_idx] = 1.0

            del top_k, dynamic_ks

            #-------------------------------------------#
            #   如果一个预测框对应多个真实框
            #   只使用这个预测框最对应的真实框
            #-------------------------------------------#
            anchor_matching_gt = matching_matrix.sum(0)
            if (anchor_matching_gt > 1).sum() > 0:
                _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
                matching_matrix[:, anchor_matching_gt > 1]          *= 0.0
                matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0
            fg_mask_inboxes = matching_matrix.sum(0) > 0.0
            matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)

            #-------------------------------------------#
            #   取出符合条件的框
            #-------------------------------------------#
            from_which_layer    = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]
            all_b               = all_b[fg_mask_inboxes]
            all_a               = all_a[fg_mask_inboxes]
            all_gj              = all_gj[fg_mask_inboxes]
            all_gi              = all_gi[fg_mask_inboxes]
            all_anch            = all_anch[fg_mask_inboxes]
            this_target         = this_target[matched_gt_inds]
        
            for i in range(num_layer):
                layer_idx = from_which_layer == i
                matching_bs[i].append(all_b[layer_idx])
                matching_as[i].append(all_a[layer_idx])
                matching_gjs[i].append(all_gj[layer_idx])
                matching_gis[i].append(all_gi[layer_idx])
                matching_targets[i].append(this_target[layer_idx])
                matching_anchs[i].append(all_anch[layer_idx])

        for i in range(num_layer):
            matching_bs[i]      = torch.cat(matching_bs[i], dim=0) if len(matching_bs[i]) != 0 else torch.Tensor(matching_bs[i])
            matching_as[i]      = torch.cat(matching_as[i], dim=0) if len(matching_as[i]) != 0 else torch.Tensor(matching_as[i])
            matching_gjs[i]     = torch.cat(matching_gjs[i], dim=0) if len(matching_gjs[i]) != 0 else torch.Tensor(matching_gjs[i])
            matching_gis[i]     = torch.cat(matching_gis[i], dim=0) if len(matching_gis[i]) != 0 else torch.Tensor(matching_gis[i])
            matching_targets[i] = torch.cat(matching_targets[i], dim=0) if len(matching_targets[i]) != 0 else torch.Tensor(matching_targets[i])
            matching_anchs[i]   = torch.cat(matching_anchs[i], dim=0) if len(matching_anchs[i]) != 0 else torch.Tensor(matching_anchs[i])

        return matching_bs, matching_as, matching_gjs, matching_gis, matching_targets, matching_anchs

    def find_3_positive(self, predictions, targets):
        #------------------------------------#
        #   获得每个特征层先验框的数量
        #   与真实框的数量
        #------------------------------------#
        num_anchor, num_gt  = len(self.anchors_mask[0]), targets.shape[0] 
        #------------------------------------#
        #   创建空列表存放indices和anchors
        #------------------------------------#
        indices, anchors    = [], []
        #------------------------------------#
        #   创建7个1
        #   序号0,1为1
        #   序号2:6为特征层的高宽
        #   序号6为1
        #------------------------------------#
        gain    = torch.ones(8, device=targets.device)
        #------------------------------------#
        #   ai      [num_anchor, num_gt]
        #   targets [num_gt, 6] => [num_anchor, num_gt, 8]
        #------------------------------------#
        ai      = torch.arange(num_anchor, device=targets.device).float().view(num_anchor, 1).repeat(1, num_gt)
        targets = torch.cat((targets.repeat(num_anchor, 1, 1), ai[:, :, None]), 2)  # append anchor indices
        # targets (tensor): (na, n_gt_all_batch, [img_index, clsid, cx, cy, l, s, theta, anchor_index]])
        g   = 0.5 # offsets
        off = torch.tensor([
            [0, 0],
            [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
            # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
        ], device=targets.device).float() * g 

        for i in range(len(predictions)):
            #----------------------------------------------------#
            #   将先验框除以stride,获得相对于特征层的先验框。
            #   anchors_i [num_anchor, 2]
            #----------------------------------------------------#
            anchors_i = torch.from_numpy(self.anchors[i] / self.stride[i]).type_as(predictions[i])
            anchors_i, shape = torch.from_numpy(self.anchors[i] / self.stride[i]).type_as(predictions[i]), predictions[i].shape
            #-------------------------------------------#
            #   计算获得对应特征层的高宽
            #-------------------------------------------#
            gain[2:6] = torch.tensor(predictions[i].shape)[[3, 2, 3, 2]]
            
            #-------------------------------------------#
            #   将真实框乘上gain,
            #   其实就是将真实框映射到特征层上
            #-------------------------------------------#
            t = targets * gain
            if num_gt:
                #-------------------------------------------#
                #   计算真实框与先验框高宽的比值
                #   然后根据比值大小进行判断,
                #   判断结果用于取出,获得所有先验框对应的真实框
                #   r   [num_anchor, num_gt, 2]
                #   t   [num_anchor, num_gt, 7] => [num_matched_anchor, 7]
                #-------------------------------------------#
                r = t[:, :, 4:6] / anchors_i[:, None]
                j = torch.max(r, 1. / r).max(2)[0] < self.threshold
                t = t[j]  # filter
                
                #-------------------------------------------#
                #   gxy 获得所有先验框对应的真实框的x轴y轴坐标
                #   gxi 取相对于该特征层的右小角的坐标
                #-------------------------------------------#
                gxy     = t[:, 2:4] # grid xy
                gxi     = gain[[2, 3]] - gxy # inverse
                j, k    = ((gxy % 1. < g) & (gxy > 1.)).T
                l, m    = ((gxi % 1. < g) & (gxi > 1.)).T
                j       = torch.stack((torch.ones_like(j), j, k, l, m))
                
                #-------------------------------------------#
                #   t   重复5次,使用满足条件的j进行框的提取
                #   j   一共五行,代表当前特征点在五个
                #       [0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]
                #       方向是否存在
                #-------------------------------------------#
                t       = t.repeat((5, 1, 1))[j]
                offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
            else:
                t = targets[0]
                offsets = 0

            #-------------------------------------------#
            #   b   代表属于第几个图片
            #   gxy 代表该真实框所处的x、y中心坐标
            #   gwh 代表该真实框的wh坐标
            #   gij 代表真实框所属的特征点坐标
            #-------------------------------------------#
            b, c    = t[:, :2].long().T  # image, class
            gxy     = t[:, 2:4]  # grid xy
            gwh     = t[:, 4:6]  # grid wh
            gij     = (gxy - offsets).long()
            gi, gj  = gij.T  # grid xy indices

            #-------------------------------------------#
            #   gj、gi不能超出特征层范围
            #   a代表属于该特征点的第几个先验框
            #-------------------------------------------#
            a = t[:, -1].long()  # anchor indices
            indices.append((b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1)))  # image, anchor, grid indices
            anchors.append(anchors_i[a])  # anchors

        return indices, anchors

Train your own YoloV7-OBB model

First go to Github to download the corresponding warehouse. After downloading, use the decompression software to decompress it, and then use the programming software to open the folder.
Note that the opened root directory must be correct, otherwise the code will not run if the relative directory is incorrect.

It must be noted that the root directory after opening is the directory where the file is stored.

1. Data set preparation

This article uses the VOC format for training. Before training, you need to make your own data set. If you don’t have your own data set, you can download the VOC12+07 data set through the Github connection and try it out.
Before training, put the label file in the Annotation under the VOC2007 folder under the VOCdevkit folder.

Before training, put the picture file in JPEGImages under the VOC2007 folder under the VOCdevkit folder.

At this point, the placement of the data set has ended.

Second, the processing of data sets

After completing the arrangement of the data set, we need to process the data set in the next step. The purpose is to obtain 2007_train.txt and 2007_val.txt for training. We need to use voc_annotation.py in the root directory.

There are some parameters in voc_annotation.py that need to be set.
They are annotation_mode, classes_path, trainval_percent, train_percent, VOCdevkit_path, the first training can only modify classes_path

'''
annotation_mode用于指定该文件运行时计算的内容
annotation_mode为0代表整个标签处理过程,包括获得VOCdevkit/VOC2007/ImageSets里面的txt以及训练用的2007_train.txt、2007_val.txt
annotation_mode为1代表获得VOCdevkit/VOC2007/ImageSets里面的txt
annotation_mode为2代表获得训练用的2007_train.txt、2007_val.txt
'''
annotation_mode     = 0
'''
必须要修改,用于生成2007_train.txt、2007_val.txt的目标信息
与训练和预测所用的classes_path一致即可
如果生成的2007_train.txt里面没有目标信息
那么就是因为classes没有设定正确
仅在annotation_mode为0和2的时候有效
'''
classes_path        = 'model_data/voc_classes.txt'
'''
trainval_percent用于指定(训练集+验证集)与测试集的比例,默认情况下 (训练集+验证集):测试集 = 9:1
train_percent用于指定(训练集+验证集)中训练集与验证集的比例,默认情况下 训练集:验证集 = 9:1
仅在annotation_mode为0和1的时候有效
'''
trainval_percent    = 0.9
train_percent       = 0.9
'''
指向VOC数据集所在的文件夹
默认指向根目录下的VOC数据集
'''
VOCdevkit_path  = 'VOCdevkit'

classes_path is used to point to the txt corresponding to the detection category. Taking the ssdd dataset as an example, the txt we use is:
insert image description here

When training your own data set, you can create a cls_classes.txt by yourself, and write the categories you need to distinguish in it.

1. Dataset loading format modification

In Yolov7, the code obtains x1, y1, x2, y2 in the xml file , which are the coordinates of the lower left and upper right corners of the real frame, and the rotating target detection frame rbox is the coordinates x1, y1, x2, y2 of the four corners of the rotating frame , x3, y3, x4, y4 , which will be converted to xc, yc, w, h, theta in the dataloader later .

The code snippet for this section is:

xmlbox = obj.find('rotated_bndbox')
b = (int(float(xmlbox.find('x1').text)), int(float(xmlbox.find('y1').text)), \
	int(float(xmlbox.find('x2').text)), int(float(xmlbox.find('y2').text)), \
	int(float(xmlbox.find('x3').text)), int(float(xmlbox.find('y3').text)), \
	int(float(xmlbox.find('x4').text)), int(float(xmlbox.find('y4').text)))
list_file.write(" " + ",".join([str(a) for a in b]) + ',' + str(cls_id))

In addition to the rbox coordinates, the category cls_id of the target is added at the end.

2.dataloader data loading modification

First read the address of the image, the poly label box and cls_id of the image from the txt file, and use the np.zeros() method to create a matrix with a dimension of (the number of label boxes, 6) and a value of 0. And convert the poly label box to rbox[…, [xc, yc, w, h, theta, cls_id]], and perform normalization. In the following code box is […, [x1, y2, x2, y2, x3, y3, x4, y4, cls_id]], rbox is normalized […, [xc, yc, w, h, theta, cls_id]]. The value range of theta is [-π/2, π/2), and no normalization is performed.

rbox    = np.zeros((box.shape[0], 6))
rbox[..., :5] = poly2rbox(box[..., :8], (ih, iw), use_pi=True)
rbox[..., 5]  = box[..., 8]

The rotation rectangle is generally marked with eight coordinates of four points […, [x1, y2, x2, y2, x3, y3, x4, y4]], because there are many errors in the labeling process, many times the labeling is not Rectangle, but irregular quadrilateral (such as image edge area, dense small target), then it is inappropriate to directly use four coordinates to calculate the length, width and deflection angle at this time. The processing process in poly2rbox is as follows:

  1. Obtain a quadrilateral closed mask (binary matrix) from four marked points;
  2. Opencv finds the coordinates of the four points of the smallest circumscribed rectangle through the mask (the range of the opencv rotation rectangle angle is [-90,0), so the
    angle in poly2rbox takes a negative value, and the definition is ccw counterclockwise);
  3. Find the smallest circumscribed horizontal rectangle of the rotated rectangle, and calculate the corresponding points between the rotated rectangle and the horizontal rectangle (this calculation directly calculates the distance between each corresponding point, and finds the minimum value);
  4. After finding the best corresponding point in this way, get the center point, length, width, and angle of the rotating rectangle, that is, [cx, cy, w, h, θ];
  5. Finally, convert θ from angle to radian, and limit its value range ∈ [-pi/2, pi/2).
def poly2rbox(polys, img_size=(), num_cls_thata=180, radius=6.0, use_pi=False, use_gaussian=False):
    """
    Trans poly format to rbox format.
    Args:
        polys (array): (num_gts, [x1 y1 x2 y2 x3 y3 x4 y4]) 
        num_cls_thata (int): [1], theta class num
        radius (float32): [1], window radius for Circular Smooth Label
        use_pi (bool): True θ∈[-pi/2, pi/2) , False θ∈[0, 180)

    Returns:
        use_gaussian True:
            rboxes (array): 
            csl_labels (array): (num_gts, num_cls_thata)
        elif 
            rboxes (array): (num_gts, [cx cy l s θ]) 
    """
    assert polys.shape[-1] == 8
    img_h, img_w = img_size[0], img_size[1]
    if use_gaussian:
        csl_labels = []
    rboxes = []
    for poly in polys:
        poly = np.float32(poly.reshape(4, 2))
        (x, y), (w, h), angle = cv2.minAreaRect(poly) # θ ∈ [0, 90] # opencv>=4.5.1 若是< -90到0
        angle = -angle # θ ∈ [-90, 0] # 故 rbbox2poly 中 角度再 负 了一次  定义是 ccw 逆时针 
        # # 两者的闭集位置进行了调换,所以在边界角度处的转换和非边界角度处的转换越有所不同。
        # if angle >= 90:
        #     angle = angle - 180
        # else:
        #     w, h = h, w
        #     angle = angle -90
        theta = angle / 180 * pi # 转为pi制

        # trans opencv format to longedge format θ ∈ [-pi/2, pi/2]
        if w != max(w, h): 
            x = x / img_w
            y = y / img_h

            w, h = h, w
            w = w / img_h
            h = h / img_w
            theta += pi/2
            
        else:
            w = w / img_w
            h = h / img_h
            
            x = x / img_w
            y = y / img_h

        theta = regular_theta(theta) # limit theta ∈ [-pi/2, pi/2)
        angle = (theta * 180 / pi) + 90 # θ ∈ [0, 180)

        if not use_pi: # 采用angle弧度制 θ ∈ [0, 180)
            rboxes.append([x, y, w, h, angle])
        else: # 采用pi制
            rboxes.append([x, y, w, h, theta])
        if use_gaussian:
            csl_label = gaussian_label_cpu(label=angle, num_class=num_cls_thata, u=0, sig=radius)
            csl_labels.append(csl_label)
    if use_gaussian:
        return np.array(rboxes), np.array(csl_labels)
    return np.array(rboxes)

The following is part of the code of the data random loading process. It is worth mentioning that when the image is randomly flipped, theta in rbox needs to be reversed. For example, theta is π/4, and it is -π/4 after being reversed; rbox The coordinates of the center point are multiplied by the scaling factor and then the normalized offset distance is added; the width and height are directly multiplied by the scaling factor.

#------------------------------#
#   获得图像的高宽与目标高宽
#------------------------------#
iw, ih  = image.size
h, w    = input_shape
#------------------------------#
#   获得预测框
#------------------------------#
box     = np.array([np.array(list(map(int,box.split(',')))) for box in line[1:]])
#------------------------------#
#   将polygon转换为rbox
#------------------------------#
rbox    = np.zeros((box.shape[0], 6))
rbox[..., :5] = poly2rbox(box[..., :8], (ih, iw), use_pi=True)
rbox[..., 5]  = box[..., 8]

if not random:
    scale = min(w/iw, h/ih)
    nw = int(iw*scale)
    nh = int(ih*scale)
    dx = (w-nw)//2
    dy = (h-nh)//2

    #---------------------------------#
    #   将图像多余的部分加上灰条
    #---------------------------------#
    image       = image.resize((nw,nh), Image.BICUBIC)
    new_image   = Image.new('RGB', (w,h), (128,128,128))
    new_image.paste(image, (dx, dy))
    image_data  = np.array(new_image, np.float32)

    #---------------------------------#
    #   对真实框进行调整
    #---------------------------------#
    if len(rbox)>0:
        np.random.shuffle(rbox)
        rbox[:, 0] = rbox[:, 0]*nw/w + dx/w
        rbox[:, 1] = rbox[:, 1]*nh/h + dy/h
        rbox[:, 2] = rbox[:, 2]*nw/w
        rbox[:, 3] = rbox[:, 3]*nh/h

    return image_data, rbox

#------------------------------------------#
#   对图像进行缩放并且进行长和宽的扭曲
#------------------------------------------#
new_ar = iw/ih * self.rand(1-jitter,1+jitter) / self.rand(1-jitter,1+jitter)
scale = self.rand(.25, 2)
if new_ar < 1:
    nh = int(scale*h)
    nw = int(nh*new_ar)
else:
    nw = int(scale*w)
    nh = int(nw/new_ar)
image = image.resize((nw,nh), Image.BICUBIC)

#------------------------------------------#
#   将图像多余的部分加上灰条
#------------------------------------------#
dx = int(self.rand(0, w-nw))
dy = int(self.rand(0, h-nh))
new_image = Image.new('RGB', (w,h), (128,128,128))
new_image.paste(image, (dx, dy))
image = new_image
#------------------------------------------#
#   翻转图像
#------------------------------------------#
flip = self.rand()<.5
if flip: image = image.transpose(Image.FLIP_LEFT_RIGHT)

image_data      = np.array(image, np.uint8)
#---------------------------------#
#   对图像进行色域变换
#   计算色域变换的参数
#---------------------------------#
r               = np.random.uniform(-1, 1, 3) * [hue, sat, val] + 1
#---------------------------------#
#   将图像转到HSV上
#---------------------------------#
hue, sat, val   = cv2.split(cv2.cvtColor(image_data, cv2.COLOR_RGB2HSV))
dtype           = image_data.dtype
#---------------------------------#
#   应用变换
#---------------------------------#
x       = np.arange(0, 256, dtype=r.dtype)
lut_hue = ((x * r[0]) % 180).astype(dtype)
lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)
lut_val = np.clip(x * r[2], 0, 255).astype(dtype)

image_data = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
image_data = cv2.cvtColor(image_data, cv2.COLOR_HSV2RGB)
#---------------------------------#
#   对真实框进行调整
#---------------------------------#
if len(rbox)>0:
    np.random.shuffle(rbox)
    rbox[:, 0] = rbox[:, 0]*nw/w + dx/w
    rbox[:, 1] = rbox[:, 1]*nh/h + dy/h
    rbox[:, 2] = rbox[:, 2]*nw/w
    rbox[:, 3] = rbox[:, 3]*nh/h
    if flip: 
        rbox[:, 0] = 1 - rbox[:, 0]
        rbox[:, 4] *= -1

3. Mosaic data augmentation for rotating targets

This part modifies the mosaic data enhancement code referring to the above-mentioned processing of rbox, and the result is shown in the following figure:

insert image description here

3. Start network training

Environmental preparation

Before starting training, you also need to install the rotating target detection non-maximum suppression library:
insert image description here
enter the utils\nms_rotated directory and run the following command to install:

python setup.py build_ext --inplace

Through voc_annotation.py we have generated 2007_train.txt and 2007_val.txt, now we can start training.
There are many parameters for training. You can read the comments carefully after downloading the library. The most important part is still the classes_path in train.py.

classes_path is used to point to the txt corresponding to the detection category, this txt is the same as the txt in voc_annotation.py! Training your own data set must be modified!

After modifying the classes_path, you can run train.py to start training. After training for multiple epochs, the weights will be generated in the logs folder.
The functions of other parameters are as follows:

#---------------------------------#
#   Cuda    是否使用Cuda
#           没有GPU可以设置成False
#---------------------------------#
Cuda            = True
#---------------------------------------------------------------------#
#   distributed     用于指定是否使用单机多卡分布式运行
#                   终端指令仅支持Ubuntu。CUDA_VISIBLE_DEVICES用于在Ubuntu下指定显卡。
#                   Windows系统下默认使用DP模式调用所有显卡,不支持DDP。
#   DP模式:
#       设置            distributed = False
#       在终端中输入    CUDA_VISIBLE_DEVICES=0,1 python train.py
#   DDP模式:
#       设置            distributed = True
#       在终端中输入    CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
#---------------------------------------------------------------------#
distributed     = False
#---------------------------------------------------------------------#
#   sync_bn     是否使用sync_bn,DDP模式多卡可用
#---------------------------------------------------------------------#
sync_bn         = False
#---------------------------------------------------------------------#
#   fp16        是否使用混合精度训练
#               可减少约一半的显存、需要pytorch1.7.1以上
#---------------------------------------------------------------------#
fp16            = False
#---------------------------------------------------------------------#
#   classes_path    指向model_data下的txt,与自己训练的数据集相关 
#                   训练前一定要修改classes_path,使其对应自己的数据集
#---------------------------------------------------------------------#
classes_path    = 'model_data/voc_classes.txt'
#---------------------------------------------------------------------#
#   anchors_path    代表先验框对应的txt文件,一般不修改。
#   anchors_mask    用于帮助代码找到对应的先验框,一般不修改。
#---------------------------------------------------------------------#
anchors_path    = 'model_data/yolo_anchors.txt'
anchors_mask    = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
#----------------------------------------------------------------------------------------------------------------------------#
#   权值文件的下载请看README,可以通过网盘下载。模型的 预训练权重 对不同数据集是通用的,因为特征是通用的。
#   模型的 预训练权重 比较重要的部分是 主干特征提取网络的权值部分,用于进行特征提取。
#   预训练权重对于99%的情况都必须要用,不用的话主干部分的权值太过随机,特征提取效果不明显,网络训练的结果也不会好
#
#   如果训练过程中存在中断训练的操作,可以将model_path设置成logs文件夹下的权值文件,将已经训练了一部分的权值再次载入。
#   同时修改下方的 冻结阶段 或者 解冻阶段 的参数,来保证模型epoch的连续性。
#   
#   当model_path = ''的时候不加载整个模型的权值。
#
#   此处使用的是整个模型的权重,因此是在train.py进行加载的。
#   如果想要让模型从0开始训练,则设置model_path = '',下面的Freeze_Train = Fasle,此时从0开始训练,且没有冻结主干的过程。
#   
#   一般来讲,网络从0开始的训练效果会很差,因为权值太过随机,特征提取效果不明显,因此非常、非常、非常不建议大家从0开始训练!
#   从0开始训练有两个方案:
#   1、得益于Mosaic数据增强方法强大的数据增强能力,将UnFreeze_Epoch设置的较大(300及以上)、batch较大(16及以上)、数据较多(万以上)的情况下,
#      可以设置mosaic=True,直接随机初始化参数开始训练,但得到的效果仍然不如有预训练的情况。(像COCO这样的大数据集可以这样做)
#   2、了解imagenet数据集,首先训练分类模型,获得网络的主干部分权值,分类模型的 主干部分 和该模型通用,基于此进行训练。
#----------------------------------------------------------------------------------------------------------------------------#
model_path      = 'model_data/yolov7_weights.pth'
#------------------------------------------------------#
#   input_shape     输入的shape大小,一定要是32的倍数
#------------------------------------------------------#
input_shape     = [640, 640]
#------------------------------------------------------#
#   phi             所使用到的yolov7的版本,本仓库一共提供两个:
#                   l : 对应yolov7
#                   x : 对应yolov7_x
#------------------------------------------------------#
phi             = 'l'
#----------------------------------------------------------------------------------------------------------------------------#
#   pretrained      是否使用主干网络的预训练权重,此处使用的是主干的权重,因此是在模型构建的时候进行加载的。
#                   如果设置了model_path,则主干的权值无需加载,pretrained的值无意义。
#                   如果不设置model_path,pretrained = True,此时仅加载主干开始训练。
#                   如果不设置model_path,pretrained = False,Freeze_Train = Fasle,此时从0开始训练,且没有冻结主干的过程。
#----------------------------------------------------------------------------------------------------------------------------#
pretrained      = False
#------------------------------------------------------------------#
#   mosaic              马赛克数据增强。
#   mosaic_prob         每个step有多少概率使用mosaic数据增强,默认50%。
#
#   mixup               是否使用mixup数据增强,仅在mosaic=True时有效。
#                       只会对mosaic增强后的图片进行mixup的处理。
#   mixup_prob          有多少概率在mosaic后使用mixup数据增强,默认50%。
#                       总的mixup概率为mosaic_prob * mixup_prob。
#
#   special_aug_ratio   参考YoloX,由于Mosaic生成的训练图片,远远脱离自然图片的真实分布。
#                       当mosaic=True时,本代码会在special_aug_ratio范围内开启mosaic。
#                       默认为前70%个epoch,100个世代会开启70个世代。
#------------------------------------------------------------------#
mosaic              = True
mosaic_prob         = 0.5
mixup               = True
mixup_prob          = 0.5
special_aug_ratio   = 0.7
#------------------------------------------------------------------#
#   label_smoothing     标签平滑。一般0.01以下。如0.01、0.005。
#------------------------------------------------------------------#
label_smoothing     = 0

#----------------------------------------------------------------------------------------------------------------------------#
#   训练分为两个阶段,分别是冻结阶段和解冻阶段。设置冻结阶段是为了满足机器性能不足的同学的训练需求。
#   冻结训练需要的显存较小,显卡非常差的情况下,可设置Freeze_Epoch等于UnFreeze_Epoch,Freeze_Train = True,此时仅仅进行冻结训练。
#      
#   在此提供若干参数设置建议,各位训练者根据自己的需求进行灵活调整:
#   (一)从整个模型的预训练权重开始训练: 
#       Adam:
#           Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 100,Freeze_Train = True,optimizer_type = 'adam',Init_lr = 1e-3,weight_decay = 0。(冻结)
#           Init_Epoch = 0,UnFreeze_Epoch = 100,Freeze_Train = False,optimizer_type = 'adam',Init_lr = 1e-3,weight_decay = 0。(不冻结)
#       SGD:
#           Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 300,Freeze_Train = True,optimizer_type = 'sgd',Init_lr = 1e-2,weight_decay = 5e-4。(冻结)
#           Init_Epoch = 0,UnFreeze_Epoch = 300,Freeze_Train = False,optimizer_type = 'sgd',Init_lr = 1e-2,weight_decay = 5e-4。(不冻结)
#       其中:UnFreeze_Epoch可以在100-300之间调整。
#   (二)从0开始训练:
#       Init_Epoch = 0,UnFreeze_Epoch >= 300,Unfreeze_batch_size >= 16,Freeze_Train = False(不冻结训练)
#       其中:UnFreeze_Epoch尽量不小于300。optimizer_type = 'sgd',Init_lr = 1e-2,mosaic = True。
#   (三)batch_size的设置:
#       在显卡能够接受的范围内,以大为好。显存不足与数据集大小无关,提示显存不足(OOM或者CUDA out of memory)请调小batch_size。
#       受到BatchNorm层影响,batch_size最小为2,不能为1。
#       正常情况下Freeze_batch_size建议为Unfreeze_batch_size的1-2倍。不建议设置的差距过大,因为关系到学习率的自动调整。
#----------------------------------------------------------------------------------------------------------------------------#
#------------------------------------------------------------------#
#   冻结阶段训练参数
#   此时模型的主干被冻结了,特征提取网络不发生改变
#   占用的显存较小,仅对网络进行微调
#   Init_Epoch          模型当前开始的训练世代,其值可以大于Freeze_Epoch,如设置:
#                       Init_Epoch = 60、Freeze_Epoch = 50、UnFreeze_Epoch = 100
#                       会跳过冻结阶段,直接从60代开始,并调整对应的学习率。
#                       (断点续练时使用)
#   Freeze_Epoch        模型冻结训练的Freeze_Epoch
#                       (当Freeze_Train=False时失效)
#   Freeze_batch_size   模型冻结训练的batch_size
#                       (当Freeze_Train=False时失效)
#------------------------------------------------------------------#
Init_Epoch          = 0
Freeze_Epoch        = 50
Freeze_batch_size   = 8
#------------------------------------------------------------------#
#   解冻阶段训练参数
#   此时模型的主干不被冻结了,特征提取网络会发生改变
#   占用的显存较大,网络所有的参数都会发生改变
#   UnFreeze_Epoch          模型总共训练的epoch
#                           SGD需要更长的时间收敛,因此设置较大的UnFreeze_Epoch
#                           Adam可以使用相对较小的UnFreeze_Epoch
#   Unfreeze_batch_size     模型在解冻后的batch_size
#------------------------------------------------------------------#
UnFreeze_Epoch      = 300
Unfreeze_batch_size = 4
#------------------------------------------------------------------#
#   Freeze_Train    是否进行冻结训练
#                   默认先冻结主干训练后解冻训练。
#------------------------------------------------------------------#
Freeze_Train        = True

#------------------------------------------------------------------#
#   其它训练参数:学习率、优化器、学习率下降有关
#------------------------------------------------------------------#
#------------------------------------------------------------------#
#   Init_lr         模型的最大学习率
#   Min_lr          模型的最小学习率,默认为最大学习率的0.01
#------------------------------------------------------------------#
Init_lr             = 1e-2
Min_lr              = Init_lr * 0.01
#------------------------------------------------------------------#
#   optimizer_type  使用到的优化器种类,可选的有adam、sgd
#                   当使用Adam优化器时建议设置  Init_lr=1e-3
#                   当使用SGD优化器时建议设置   Init_lr=1e-2
#   momentum        优化器内部使用到的momentum参数
#   weight_decay    权值衰减,可防止过拟合
#                   adam会导致weight_decay错误,使用adam时建议设置为0。
#------------------------------------------------------------------#
optimizer_type      = "sgd"
momentum            = 0.937
weight_decay        = 5e-4
#------------------------------------------------------------------#
#   lr_decay_type   使用到的学习率下降方式,可选的有step、cos
#------------------------------------------------------------------#
lr_decay_type       = "cos"
#------------------------------------------------------------------#
#   save_period     多少个epoch保存一次权值
#------------------------------------------------------------------#
save_period         = 10
#------------------------------------------------------------------#
#   save_dir        权值与日志文件保存的文件夹
#------------------------------------------------------------------#
save_dir            = 'logs'
#------------------------------------------------------------------#
#   eval_flag       是否在训练时进行评估,评估对象为验证集
#                   安装pycocotools库后,评估体验更佳。
#   eval_period     代表多少个epoch评估一次,不建议频繁的评估
#                   评估需要消耗较多的时间,频繁评估会导致训练非常慢
#   此处获得的mAP会与get_map.py获得的会有所不同,原因有二:
#   (一)此处获得的mAP为验证集的mAP。
#   (二)此处设置评估参数较为保守,目的是加快评估速度。
#------------------------------------------------------------------#
eval_flag           = True
eval_period         = 10
#------------------------------------------------------------------#
#   num_workers     用于设置是否使用多线程读取数据
#                   开启后会加快数据读取速度,但是会占用更多内存
#                   内存较小的电脑可以设置为2或者0  
#------------------------------------------------------------------#
num_workers         = 4

#------------------------------------------------------#
#   train_annotation_path   训练图片路径和标签
#   val_annotation_path     验证图片路径和标签
#------------------------------------------------------#
train_annotation_path   = '2007_train.txt'
val_annotation_path     = '2007_val.txt'

4. Prediction of training results

Two files are required for training result prediction, namely yolo.py and predict.py.
We first need to modify model_path and classes_path in yolo.py, these two parameters must be modified.

model_path points to the trained weight file in the logs folder.
classes_path points to the txt corresponding to the detection category.


After completing the modification, you can run predict.py for detection. After running, enter the image path to detect.

FAQ summary

1. Unable to install the non-maximum suppression library nms_rotated for rotating target detection.

Please note that pytorchthe version is 1.10.1, torchvisionthe version is 0.11.2, the development environment of the Windows system needs to install the C++ compilation environment in Visual Studio. For details, see the environment configuration in my other blog post YOLOv7-OBB .

2. The part of the code that runs the data set loading format voc_annotation.pyreports an error.

If it is HRSC dataset, I am ready hrsc_annotation.py. The existing data set xml format is nothing more than two [cx, cy, w, h, w, theta], or [x1, y1, x2, y2, x3, y3, x4, y4], you can according to your own data set Refer to the annotation format, and note that the target name under objects is not necessarily rotated_bndbox.

Guess you like

Origin blog.csdn.net/weixin_43293172/article/details/128889755