Smart Object Detection 62 - Building a YoloV7 Object Detection Platform with Keras

Preface

I have also reproduced YoloV7 in Keras; reproducing SimOTA was quite painful.

Source code download

https://github.com/bubbliiiiing/yolov7-keras
Feel free to give it a star if you like it.

Improvements in YoloV7 (an incomplete list)

1. Backbone: an innovative multi-branch stacking structure is used for feature extraction; compared with previous Yolo versions, the skip connections are denser. An innovative downsampling structure is also used, in which max pooling and a convolution with a 2x2 stride extract and compress features in parallel.

2. Enhanced feature extraction network: as in the backbone, a multi-branch stacking structure is used for feature extraction, and max pooling together with a stride-2x2 convolution is used for parallel downsampling.

3. Special SPP structure: an SPP block with a CSP mechanism is used to enlarge the receptive field. The CSP structure is introduced into the SPP block, giving the module a large residual edge that assists optimization and feature extraction.

4. Adaptive multi-positive-sample matching: in the Yolo series before YoloV5, each ground truth box corresponds to one positive sample during training, i.e. it is predicted by only one prior box. In YoloV7, to improve training efficiency, the number of positive samples is increased: each ground truth box can be predicted by multiple prior boxes. In addition, for each ground truth box, the IoU and class accuracy of the predicted boxes obtained from the prior boxes are used to compute a cost, from which the most suitable prior boxes for that ground truth box are selected.

5. Drawing on the structure of RepVGG, RepConv is introduced in specific parts of the network; after fusion, the parameter count of the network is reduced without hurting its performance.

6. An auxiliary head is used to assist convergence during training, but it is not used in the smaller YoloV7 and YoloV7-X models.

These are not all of the improvements; there are others. The ones listed here are simply those I find most interesting and effective.

YoloV7 implementation ideas

1. Overall structure analysis

Before studying YoloV7, we need a general understanding of what YoloV7 does, which will help us understand the details of the network later. The prediction process of YoloV7 is not much different from previous Yolo versions and is still divided into three parts.

They are the Backbone, the FPN and the Yolo Head.

Backbone is the backbone feature extraction network of YoloV7. The input image first goes through the backbone for feature extraction. The extracted features, called feature layers, are feature sets of the input image. In the backbone, we obtain three feature layers for the next stage of network construction; I call these three feature layers the effective feature layers.

FPN is the enhanced feature extraction network of YoloV7. The three effective feature layers obtained from the backbone are fused in this part; the purpose of feature fusion is to combine feature information from different scales. In the FPN, the effective feature layers are used to continue extracting features. YoloV7 still uses the PANet structure: features are not only upsampled for fusion but also downsampled again for further fusion.

Yolo Head is the classifier and regressor of YoloV7. Through the Backbone and FPN we obtain three enhanced effective feature layers, each with a width, a height and a number of channels. We can regard such a feature map as a collection of feature points; each feature point has three prior boxes, and each prior box has a set of features given by the channels. What the Yolo Head actually does is judge these feature points and decide whether the prior boxes at each feature point contain an object. Like previous Yolo versions, YoloV7 keeps the head coupled: classification and regression are implemented together in a single 1x1 convolution.

Therefore, the work done by the entire YoloV7 network is feature extraction, feature enhancement, and prediction of the objects corresponding to the prior boxes.

2. Analysis of network structure

1. Introduction to backbone network Backbone


The backbone feature extraction network used by YoloV7 has two important features:
1. It uses a multi-branch stacking module. This module has no name in the paper, but after analyzing the source code I think this name is very appropriate, so in this post it is referred to as the multi-branch stacking module, shown in the figure below.
Looking at the figure, it should be clear why I call it a multi-branch stacking module: the input to the final stacking (concatenation) comes from multiple branches. The first branch from the left is one convolution + normalization + activation block, the second from the left is also one such block, the second from the right passes through three such blocks, and the first from the right passes through five such blocks.
After the four selected feature layers are stacked, another convolution + normalization + activation block is applied for feature integration.
(Figure: the multi-branch stacking module.)

def Multi_Concat_Block(x, c2, c3, n=4, e=1, ids=[0], name = ""):
    c_ = int(c2 * e)
        
    x_1 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv1')(x)
    x_2 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv2')(x)
    
    x_all = [x_1, x_2]
    for i in range(n):
        x_2 = DarknetConv2D_BN_SiLU(c2, (3, 3), name = name + '.cv3.' + str(i))(x_2)
        x_all.append(x_2)
    y = Concatenate(axis=-1)([x_all[id] for id in ids])
    y = DarknetConv2D_BN_SiLU(c3, (1, 1), name = name + '.cv4')(y)
    return y

Such heavy stacking effectively corresponds to a denser residual structure. Residual networks are easy to optimize and can gain accuracy from considerably increased depth; their internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
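
To make the role of the ids argument concrete, here is a tiny plain-Python sketch (no Keras involved; the branch labels are my own) of which branches the 'l' backbone stacks when n=4:

# Plain-Python sketch of the branch selection in Multi_Concat_Block for phi = 'l'.
# With n = 4, x_all holds six tensors and ids picks four of them for the concatenation.
branches = ["x_1 (one 1x1 conv)", "x_2 (one 1x1 conv)",
            "cv3.0 (x_2 + 1 extra 3x3 conv)", "cv3.1 (x_2 + 2 extra 3x3 convs)",
            "cv3.2 (x_2 + 3 extra 3x3 convs)", "cv3.3 (x_2 + 4 extra 3x3 convs)"]
ids_l = [-1, -3, -5, -6]   # ids used by the backbone of YoloV7-l
print([branches[i] for i in ids_l])
# ['cv3.3 (x_2 + 4 extra 3x3 convs)', 'cv3.1 (x_2 + 2 extra 3x3 convs)',
#  'x_2 (one 1x1 conv)', 'x_1 (one 1x1 conv)']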

2. It uses an innovative transition module, Transition_Block, for downsampling. In convolutional neural networks, a common transition module for downsampling is a 3x3 convolution with a 2x2 stride, or a 2x2 max pooling with a 2x2 stride. In YoloV7, the author combines both kinds of transition module into one with two branches, as shown in the figure below. The left branch is a max pooling with a 2x2 stride followed by a 1x1 convolution; the right branch is a 1x1 convolution followed by a 3x3 convolution with a 2x2 stride. The outputs of the two branches are stacked.
(Figure: the two branches of the Transition_Block.)

def Transition_Block(x, c2, name = ""):
    #----------------------------------------------------------------#
    #   Compress the height and width with ZeroPadding2D and a stride-2x2 convolution block
    #----------------------------------------------------------------#
    x_1 = MaxPooling2D((2, 2), strides=(2, 2))(x)
    x_1 = DarknetConv2D_BN_SiLU(c2, (1, 1), name = name + '.cv1')(x_1)
    
    x_2 = DarknetConv2D_BN_SiLU(c2, (1, 1), name = name + '.cv2')(x)
    x_2 = ZeroPadding2D(((1, 1),(1, 1)))(x_2)
    x_2 = DarknetConv2D_BN_SiLU(c2, (3, 3), strides=(2, 2), name = name + '.cv3')(x_2)
    y = Concatenate(axis=-1)([x_2, x_1])
    return y

The entire backbone implementation code is:

from functools import wraps

from keras import backend as K
from keras.initializers import random_normal
from keras.layers import (BatchNormalization, Concatenate, Conv2D, Layer,
                          MaxPooling2D, ZeroPadding2D)
from utils.utils import compose


class SiLU(Layer):
    def __init__(self, **kwargs):
        super(SiLU, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.sigmoid(inputs)

    def get_config(self):
        config = super(SiLU, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape

#------------------------------------------------------#
#   Single convolution: DarknetConv2D
#   If the stride is 2, the padding mode is set manually.
#------------------------------------------------------#
@wraps(Conv2D)
def DarknetConv2D(*args, **kwargs):
    darknet_conv_kwargs = {'kernel_initializer' : random_normal(stddev=0.02)}
    darknet_conv_kwargs['padding'] = 'valid' if kwargs.get('strides')==(2, 2) else 'same'
    darknet_conv_kwargs.update(kwargs)
    return Conv2D(*args, **darknet_conv_kwargs)
    
#---------------------------------------------------#
#   Convolution block -> convolution + batch normalization + activation
#   DarknetConv2D + BatchNormalization + SiLU
#---------------------------------------------------#
def DarknetConv2D_BN_SiLU(*args, **kwargs):
    no_bias_kwargs = {'use_bias': False}
    no_bias_kwargs.update(kwargs)
    if "name" in kwargs.keys():
        no_bias_kwargs['name'] = kwargs['name'] + '.conv'
    return compose(
        DarknetConv2D(*args, **no_bias_kwargs),
        BatchNormalization(momentum = 0.97, epsilon = 0.001, name = kwargs['name'] + '.bn'),
        SiLU())

def Transition_Block(x, c2, name = ""):
    #----------------------------------------------------------------#
    #   Compress the height and width with ZeroPadding2D and a stride-2x2 convolution block
    #----------------------------------------------------------------#
    x_1 = MaxPooling2D((2, 2), strides=(2, 2))(x)
    x_1 = DarknetConv2D_BN_SiLU(c2, (1, 1), name = name + '.cv1')(x_1)
    
    x_2 = DarknetConv2D_BN_SiLU(c2, (1, 1), name = name + '.cv2')(x)
    x_2 = ZeroPadding2D(((1, 1),(1, 1)))(x_2)
    x_2 = DarknetConv2D_BN_SiLU(c2, (3, 3), strides=(2, 2), name = name + '.cv3')(x_2)
    y = Concatenate(axis=-1)([x_2, x_1])
    return y

def Multi_Concat_Block(x, c2, c3, n=4, e=1, ids=[0], name = ""):
    c_ = int(c2 * e)
        
    x_1 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv1')(x)
    x_2 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv2')(x)
    
    x_all = [x_1, x_2]
    for i in range(n):
        x_2 = DarknetConv2D_BN_SiLU(c2, (3, 3), name = name + '.cv3.' + str(i))(x_2)
        x_all.append(x_2)
    y = Concatenate(axis=-1)([x_all[id] for id in ids])
    y = DarknetConv2D_BN_SiLU(c3, (1, 1), name = name + '.cv4')(y)
    return y

#---------------------------------------------------#
#   Main body of CSPDarknet
#   The input is a 640x640x3 image
#   The outputs are three effective feature layers
#---------------------------------------------------#
def darknet_body(x, transition_channels, block_channels, n, phi):
    #-----------------------------------------------#
    #   The input image is 640, 640, 3
    #-----------------------------------------------#
    ids = {
        'l' : [-1, -3, -5, -6],
        'x' : [-1, -3, -5, -7, -8],
    }[phi]
    #---------------------------------------------------#
    #   base_channels defaults to 64
    #---------------------------------------------------#
    # 640, 640, 3 => 320, 320, 64
    x = DarknetConv2D_BN_SiLU(transition_channels, (3, 3), strides = (1, 1), name = 'backbone.stem.0')(x)
    x = ZeroPadding2D(((1, 1),(1, 1)))(x)
    x = DarknetConv2D_BN_SiLU(transition_channels * 2, (3, 3), strides = (2, 2), name = 'backbone.stem.1')(x)
    x = DarknetConv2D_BN_SiLU(transition_channels * 2, (3, 3), strides = (1, 1), name = 'backbone.stem.2')(x)
    
    # 320, 320, 64 => 160, 160, 256
    x = ZeroPadding2D(((1, 1),(1, 1)))(x)
    x = DarknetConv2D_BN_SiLU(transition_channels * 4, (3, 3), strides = (2, 2), name = 'backbone.dark2.0')(x)
    x = Multi_Concat_Block(x, block_channels * 2, transition_channels * 8, n=n, ids=ids, name = 'backbone.dark2.1')
    
    # 160, 160, 256 => 80, 80, 512
    x = Transition_Block(x, transition_channels * 4, name = 'backbone.dark3.0')
    x = Multi_Concat_Block(x, block_channels * 4, transition_channels * 16, n=n, ids=ids, name = 'backbone.dark3.1')
    feat1 = x
    
    # 80, 80, 512 => 40, 40, 1024
    x = Transition_Block(x, transition_channels * 8, name = 'backbone.dark4.0')
    x = Multi_Concat_Block(x, block_channels * 8, transition_channels * 32, n=n, ids=ids, name = 'backbone.dark4.1')
    feat2 = x
    
    # 40, 40, 1024 => 20, 20, 1024
    x = Transition_Block(x, transition_channels * 16, name = 'backbone.dark5.0')
    x = Multi_Concat_Block(x, block_channels * 8, transition_channels * 32, n=n, ids=ids, name = 'backbone.dark5.1')
    feat3 = x
    return feat1, feat2, feat3
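
As a quick sanity check of these shapes, the backbone above can be instantiated on its own. This is only a sketch: it assumes a working TensorFlow/Keras environment and uses the 'l' hyperparameters (transition_channels = 32, block_channels = 32, n = 4), which are defined in yolo_body further below.

from keras.layers import Input
from keras.models import Model

inputs = Input((640, 640, 3))
feat1, feat2, feat3 = darknet_body(inputs, transition_channels=32, block_channels=32, n=4, phi='l')
backbone = Model(inputs, [feat1, feat2, feat3])
for t in backbone.outputs:
    print(t.shape)
# Expected: (None, 80, 80, 512), (None, 40, 40, 1024), (None, 20, 20, 1024)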

2. Construct FPN feature pyramid for enhanced feature extraction

In the feature utilization part, YoloV7 extracts multiple feature layers for object detection; a total of three feature layers are extracted.
The three feature layers are located at different depths of the backbone: the middle layer, the middle-lower layer, and the bottom layer. When the input is (640,640,3), the shapes of the three feature layers are feat1 = (80,80,512), feat2 = (40,40,1024) and feat3 = (20,20,1024).

After obtaining the three effective feature layers, we use them to construct the FPN. The construction process is as follows (in this post, the SPPCSPC structure is counted as part of the FPN):

  1. The feat3 = (20,20,1024) feature layer first goes through SPPCSPC for feature extraction. This structure enlarges the receptive field of YoloV7 and yields P5.
  2. P5 goes through a 1x1 convolution to adjust its channels, is upsampled with UpSampling2D, and is combined with feat2 = (40,40,1024) after feat2 passes through a convolution; Multi_Concat_Block is then used for feature extraction to obtain P4, whose shape is (40,40,256).
  3. P4 goes through a 1x1 convolution to adjust its channels, is upsampled with UpSampling2D, and is combined with feat1 = (80,80,512) after feat1 passes through a convolution; Multi_Concat_Block is then used for feature extraction to obtain P3_out, whose shape is (80,80,128).
  4. P3_out = (80,80,128) is downsampled with a Transition_Block and stacked with P4; Multi_Concat_Block is then used for feature extraction to obtain P4_out, whose shape is (40,40,256).
  5. P4_out = (40,40,256) is downsampled with a Transition_Block and stacked with P5; Multi_Concat_Block is then used for feature extraction to obtain P5_out, whose shape is (20,20,512).

The feature pyramid fuses feature layers of different shapes, which helps extract better features.

#---------------------------------------------------#
#   Build the PANet and obtain the prediction results
#---------------------------------------------------#
def yolo_body(input_shape, anchors_mask, num_classes, phi, mode="train"):
    #-----------------------------------------------#
    #   Parameters for the different YoloV7 versions
    #-----------------------------------------------#
    transition_channels = {'l' : 32, 'x' : 40}[phi]
    block_channels      = 32
    panet_channels      = {'l' : 32, 'x' : 64}[phi]
    e       = {'l' : 2, 'x' : 1}[phi]
    n       = {'l' : 4, 'x' : 6}[phi]
    ids     = {'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]

    inputs      = Input(input_shape)
    #---------------------------------------------------#   
    #   Build the backbone model and obtain three effective feature layers, whose shapes are:
    #   80, 80, 512
    #   40, 40, 1024
    #   20, 20, 1024
    #---------------------------------------------------#
    feat1, feat2, feat3 = darknet_body(inputs, transition_channels, block_channels, n, phi)

    # 20, 20, 1024 -> 20, 20, 512
    P5          = SPPCSPC(feat3, transition_channels * 16, name="sppcspc")
    P5_conv     = DarknetConv2D_BN_SiLU(transition_channels * 8, (1, 1), name="conv_for_P5")(P5)
    P5_upsample = UpSampling2D()(P5_conv)
    P4          = Concatenate(axis=-1)([DarknetConv2D_BN_SiLU(transition_channels * 8, (1, 1), name="conv_for_feat2")(feat2), P5_upsample])
    P4          = Multi_Concat_Block(P4, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids, name="conv3_for_upsample1")

    P4_conv     = DarknetConv2D_BN_SiLU(transition_channels * 4, (1, 1), name="conv_for_P4")(P4)
    P4_upsample = UpSampling2D()(P4_conv)
    P3          = Concatenate(axis=-1)([DarknetConv2D_BN_SiLU(transition_channels * 4, (1, 1), name="conv_for_feat1")(feat1), P4_upsample])
    P3          = Multi_Concat_Block(P3, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids, name="conv3_for_upsample2")
        
    P3_downsample = Transition_Block(P3, transition_channels * 4, name="down_sample1")
    P4 = Concatenate(axis=-1)([P3_downsample, P4])
    P4 = Multi_Concat_Block(P4, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids, name="conv3_for_downsample1")

    P4_downsample = Transition_Block(P4, transition_channels * 8, name="down_sample2")
    P5 = Concatenate(axis=-1)([P4_downsample, P5])
    P5 = Multi_Concat_Block(P5, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids, name="conv3_for_downsample2")
    
    if phi == "l":
        P3 = RepConv(P3, transition_channels * 8, mode, name="rep_conv_1")
        P4 = RepConv(P4, transition_channels * 16, mode, name="rep_conv_2")
        P5 = RepConv(P5, transition_channels * 32, mode, name="rep_conv_3")
    else:
        P3 = DarknetConv2D_BN_SiLU(transition_channels * 8, (3, 3), strides=(1, 1), name="rep_conv_1")(P3)
        P4 = DarknetConv2D_BN_SiLU(transition_channels * 16, (3, 3), strides=(1, 1), name="rep_conv_2")(P4)
        P5 = DarknetConv2D_BN_SiLU(transition_channels * 32, (3, 3), strides=(1, 1), name="rep_conv_3")(P5)

    # len(anchors_mask[2]) = 3
    # 5 + num_classes -> 4 + 1 + num_classes
    # 4 = regression parameters of the prior box, 1 = objectness (sigmoid keeps it in 0-1), num_classes = class of the object in the prior box
    # bs, 20, 20, 3 * (4 + 1 + num_classes)
    out2 = DarknetConv2D(len(anchors_mask[2]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P3')(P3)
    out1 = DarknetConv2D(len(anchors_mask[1]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P4')(P4)
    out0 = DarknetConv2D(len(anchors_mask[0]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P5')(P5)
    return Model(inputs, [out0, out1, out2])

3. Use Yolo Head to obtain prediction results

Using the FPN feature pyramid, we obtain three enhanced features with shapes (20,20,512), (40,40,256) and (80,80,128). These three feature layers are then passed into the Yolo Head to obtain the prediction results.

Different from the previous Yolo series, YoloV7 uses a RepConv structure before the Yolo Head. The idea of RepConv comes from RepVGG: during training, a specially designed residual structure is introduced to assist training, and at prediction time this complex residual structure can be re-parameterized into an ordinary 3x3 convolution. This reduces the complexity of the network without reducing its prediction performance.
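
The core of this re-parameterization is just the linearity of convolution: padding the 1x1 kernel into the centre of a 3x3 kernel and adding the two kernels gives exactly the same output as running the two branches separately and summing them. Below is a minimal single-channel NumPy sketch of that idea (it ignores the BatchNormalization folding, which fusion_rep_vgg in the code further below handles by scaling the kernels):

import numpy as np

def conv2d_same(x, k):
    # 'same'-padded, stride-1, single-channel 2D cross-correlation
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x  = rng.normal(size=(8, 8))
k3 = rng.normal(size=(3, 3))              # 3x3 branch
k1 = rng.normal(size=(1, 1))              # 1x1 branch
k1_as_3x3 = np.zeros((3, 3))
k1_as_3x3[1, 1] = k1[0, 0]                # embed the 1x1 kernel at the centre of a 3x3 kernel
fused = k3 + k1_as_3x3                    # the single fused 3x3 kernel

two_branches = conv2d_same(x, k3) + conv2d_same(x, k1)
one_branch   = conv2d_same(x, fused)
print(np.allclose(two_branches, one_branch))   # True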

For each feature layer, a convolution adjusts the number of channels. The final number of channels is related to the number of classes to distinguish. In YoloV7, each feature point on each feature layer has 3 prior boxes.

If the VOC training set is used, there are 20 classes and the final channel dimension is 75 = 3x25, so the shapes of the three feature layers are (20,20,75), (40,40,75) and (80,80,75).
The last dimension of 75 can be split into three groups of 25, corresponding to the 25 parameters of the three prior boxes, and each 25 can be split into 4+1+20:
the first 4 parameters are the regression parameters of each feature point, which are applied to obtain the prediction box;
the 5th parameter indicates whether each feature point contains an object;
the last 20 parameters give the class of the object contained at each feature point.

If the COCO training set is used, there are 80 classes and the final channel dimension is 255 = 3x85, so the shapes of the three feature layers are (20,20,255), (40,40,255) and (80,80,255).
The last dimension of 255 can be split into three groups of 85, corresponding to the 85 parameters of the three prior boxes, and each 85 can be split into 4+1+80:
the first 4 parameters are the regression parameters of each feature point, which are applied to obtain the prediction box;
the 5th parameter indicates whether each feature point contains an object;
the last 80 parameters give the class of the object contained at each feature point.
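
As a small sanity check of this channel arithmetic (head_channels is just a throwaway helper for this post, not part of the repo):

def head_channels(num_classes, anchors_per_point=3):
    # each prior box predicts 4 regression parameters, 1 objectness score and num_classes class scores
    return anchors_per_point * (4 + 1 + num_classes)

print(head_channels(20))   # VOC:  3 * 25 = 75
print(head_channels(80))   # COCO: 3 * 85 = 255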

The implementation code is as follows:

def SPPCSPC(x, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13), name=""):
    c_ = int(2 * c2 * e)  # hidden channels
    x1 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv1')(x)
    x1 = DarknetConv2D_BN_SiLU(c_, (3, 3), name = name + '.cv3')(x1)
    x1 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv4')(x1)
    
    y1 = Concatenate(axis=-1)([x1] + [MaxPooling2D(pool_size=(m, m), strides=(1, 1), padding='same')(x1) for m in k])
    y1 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv5')(y1)
    y1 = DarknetConv2D_BN_SiLU(c_, (3, 3), name = name + '.cv6')(y1)
    
    y2 = DarknetConv2D_BN_SiLU(c_, (1, 1), name = name + '.cv2')(x)
    out = Concatenate(axis=-1)([y1, y2])
    out = DarknetConv2D_BN_SiLU(c2, (1, 1), name = name + '.cv7')(out)
    
    return out

def fusion_rep_vgg(fuse_layers, trained_model, infer_model):
    for layer_name, use_bias, use_bn in fuse_layers:

        conv_kxk_weights = trained_model.get_layer(layer_name + '.rbr_dense.0').get_weights()[0]
        conv_1x1_weights = trained_model.get_layer(layer_name + '.rbr_1x1.0').get_weights()[0]

        if use_bias:
            conv_kxk_bias = trained_model.get_layer(layer_name + '.rbr_dense.0').get_weights()[1]
            conv_1x1_bias = trained_model.get_layer(layer_name + '.rbr_1x1.0').get_weights()[1]
        else:
            conv_kxk_bias = np.zeros((conv_kxk_weights.shape[-1],))
            conv_1x1_bias = np.zeros((conv_1x1_weights.shape[-1],))

        if use_bn:
            gammas_kxk, betas_kxk, means_kxk, var_kxk = trained_model.get_layer(layer_name + '.rbr_dense.1').get_weights()
            gammas_1x1, betas_1x1, means_1x1, var_1x1 = trained_model.get_layer(layer_name + '.rbr_1x1.1').get_weights()

        else:
            gammas_1x1, betas_1x1, means_1x1, var_1x1 = [np.ones((conv_1x1_weights.shape[-1],)),
                                                         np.zeros((conv_1x1_weights.shape[-1],)),
                                                         np.zeros((conv_1x1_weights.shape[-1],)),
                                                         np.ones((conv_1x1_weights.shape[-1],))]
            gammas_kxk, betas_kxk, means_kxk, var_kxk = [np.ones((conv_kxk_weights.shape[-1],)),
                                                         np.zeros((conv_kxk_weights.shape[-1],)),
                                                         np.zeros((conv_kxk_weights.shape[-1],)),
                                                         np.ones((conv_kxk_weights.shape[-1],))]
        gammas_res, betas_res, means_res, var_res = [np.ones((conv_1x1_weights.shape[-1],)),
                                                     np.zeros((conv_1x1_weights.shape[-1],)),
                                                     np.zeros((conv_1x1_weights.shape[-1],)),
                                                     np.ones((conv_1x1_weights.shape[-1],))]

        w_kxk = (gammas_kxk / np.sqrt(np.add(var_kxk, 1e-3))) * conv_kxk_weights
        b_kxk = (((conv_kxk_bias - means_kxk) * gammas_kxk) / np.sqrt(np.add(var_kxk, 1e-3))) + betas_kxk
        
        kernel_size = w_kxk.shape[0]
        w_1x1 = np.zeros_like(w_kxk)
        w_1x1[kernel_size // 2, kernel_size // 2, :, :] = (gammas_1x1 / np.sqrt(np.add(var_1x1, 1e-3))) * conv_1x1_weights
        b_1x1 = (((conv_1x1_bias - means_1x1) * gammas_1x1) / np.sqrt(np.add(var_1x1, 1e-3))) + betas_1x1

        w_res = np.zeros_like(w_kxk)
        b_res = np.zeros_like(b_kxk)

        weight = [w_res, w_1x1, w_kxk]
        bias = [b_res, b_1x1, b_kxk]
        
        infer_model.get_layer(layer_name).set_weights([np.array(weight).sum(axis=0), np.array(bias).sum(axis=0)])

def RepConv(x, c2, mode="train", name=""):
    if mode == "predict":
        out = Conv2D(c2, (3, 3), name = name, use_bias=True, padding='same')(x)
        out = SiLU()(out)
    elif mode == "train":
        x1 = Conv2D(c2, (3, 3), name = name + '.rbr_dense.0', use_bias=False, padding='same')(x)
        x1 = BatchNormalization(momentum = 0.97, epsilon = 0.001, name = name + '.rbr_dense.1')(x1)
        x2 = Conv2D(c2, (1, 1), name = name + '.rbr_1x1.0', use_bias=False, padding='same')(x)
        x2 = BatchNormalization(momentum = 0.97, epsilon = 0.001, name = name + '.rbr_1x1.1')(x2)
        
        out = Add()([x1, x2])
        out = SiLU()(out)
    return out

#---------------------------------------------------#
#   Build the PANet and obtain the prediction results
#---------------------------------------------------#
def yolo_body(input_shape, anchors_mask, num_classes, phi, mode="train"):
    #-----------------------------------------------#
    #   Parameters for the different YoloV7 versions
    #-----------------------------------------------#
    transition_channels = {'l' : 32, 'x' : 40}[phi]
    block_channels      = 32
    panet_channels      = {'l' : 32, 'x' : 64}[phi]
    e       = {'l' : 2, 'x' : 1}[phi]
    n       = {'l' : 4, 'x' : 6}[phi]
    ids     = {'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]

    inputs      = Input(input_shape)
    #---------------------------------------------------#   
    #   Build the backbone model and obtain three effective feature layers, whose shapes are:
    #   80, 80, 512
    #   40, 40, 1024
    #   20, 20, 1024
    #---------------------------------------------------#
    feat1, feat2, feat3 = darknet_body(inputs, transition_channels, block_channels, n, phi)

    # 20, 20, 1024 -> 20, 20, 512
    P5          = SPPCSPC(feat3, transition_channels * 16, name="sppcspc")
    P5_conv     = DarknetConv2D_BN_SiLU(transition_channels * 8, (1, 1), name="conv_for_P5")(P5)
    P5_upsample = UpSampling2D()(P5_conv)
    P4          = Concatenate(axis=-1)([DarknetConv2D_BN_SiLU(transition_channels * 8, (1, 1), name="conv_for_feat2")(feat2), P5_upsample])
    P4          = Multi_Concat_Block(P4, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids, name="conv3_for_upsample1")

    P4_conv     = DarknetConv2D_BN_SiLU(transition_channels * 4, (1, 1), name="conv_for_P4")(P4)
    P4_upsample = UpSampling2D()(P4_conv)
    P3          = Concatenate(axis=-1)([DarknetConv2D_BN_SiLU(transition_channels * 4, (1, 1), name="conv_for_feat1")(feat1), P4_upsample])
    P3          = Multi_Concat_Block(P3, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids, name="conv3_for_upsample2")
        
    P3_downsample = Transition_Block(P3, transition_channels * 4, name="down_sample1")
    P4 = Concatenate(axis=-1)([P3_downsample, P4])
    P4 = Multi_Concat_Block(P4, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids, name="conv3_for_downsample1")

    P4_downsample = Transition_Block(P4, transition_channels * 8, name="down_sample2")
    P5 = Concatenate(axis=-1)([P4_downsample, P5])
    P5 = Multi_Concat_Block(P5, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids, name="conv3_for_downsample2")
    
    if phi == "l":
        P3 = RepConv(P3, transition_channels * 8, mode, name="rep_conv_1")
        P4 = RepConv(P4, transition_channels * 16, mode, name="rep_conv_2")
        P5 = RepConv(P5, transition_channels * 32, mode, name="rep_conv_3")
    else:
        P3 = DarknetConv2D_BN_SiLU(transition_channels * 8, (3, 3), strides=(1, 1), name="rep_conv_1")(P3)
        P4 = DarknetConv2D_BN_SiLU(transition_channels * 16, (3, 3), strides=(1, 1), name="rep_conv_2")(P4)
        P5 = DarknetConv2D_BN_SiLU(transition_channels * 32, (3, 3), strides=(1, 1), name="rep_conv_3")(P5)

    # len(anchors_mask[2]) = 3
    # 5 + num_classes -> 4 + 1 + num_classes
    # 4 = regression parameters of the prior box, 1 = objectness (sigmoid keeps it in 0-1), num_classes = class of the object in the prior box
    # bs, 20, 20, 3 * (4 + 1 + num_classes)
    out2 = DarknetConv2D(len(anchors_mask[2]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P3')(P3)
    out1 = DarknetConv2D(len(anchors_mask[1]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P4')(P4)
    out0 = DarknetConv2D(len(anchors_mask[0]) * (5 + num_classes), (1, 1), strides = (1, 1), name = 'yolo_head_P5')(P5)
    return Model(inputs, [out0, out1, out2])
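
With the pieces above (plus the usual Keras imports that this excerpt omits, such as Input, UpSampling2D, Add and Model), the whole model can be built and its head shapes checked. A sketch for the VOC case:

anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
model = yolo_body((640, 640, 3), anchors_mask, num_classes=20, phi='l', mode="train")
for t in model.outputs:
    print(t.shape)
# Expected: (None, 20, 20, 75), (None, 40, 40, 75), (None, 80, 80, 75)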

3. Decoding of prediction results

1. Obtain the prediction boxes and scores

From step two we obtain the prediction results of the three feature layers, with shapes (N,20,20,255), (N,40,40,255) and (N,80,80,255).

However, this prediction result does not yet correspond to the positions of the final prediction boxes on the image; it still needs to be decoded. As in YoloV5, each feature point on each feature layer has 3 prior boxes.

The last dimension of 255 in each feature layer can be split into three groups of 85, corresponding to the 85 parameters of the three prior boxes. We first reshape the result into (N,20,20,3,85), (N,40,40,3,85) and (N,80,80,3,85).

Each 85 can be split into 4+1+80:
the first 4 parameters are the regression parameters of each feature point, which are applied to obtain the prediction box;
the 5th parameter indicates whether each feature point contains an object;
the last 80 parameters give the class of the object contained at each feature point.

Taking (N,20,20,3,85) as an example, this feature layer is equivalent to dividing the image into a 20x20 grid of feature points. If a feature point falls inside the box of an object, that point is used to predict the object.

As shown in the figure, the blue points are the 20x20 feature points. We now demonstrate the decoding of the three prior boxes of the black point on the left:
1. Compute the predicted center: the regression outputs at the first two positions offset the center coordinates of the three prior boxes of the feature point; after the offset, they become the three red points in the right part of the figure.
2. Compute the predicted width and height: the regression outputs at the next two positions are used, after activation, to scale the width and height of the prior boxes and obtain the width and height of the predicted boxes.
3. The resulting predicted boxes can then be drawn on the image.
(Figure: decoding the three prior boxes of a feature point.)
Besides this decoding, non-maximum suppression must also be performed to prevent boxes of the same class from piling up.
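
Before looking at the repo code, here is a small NumPy sketch of the decoding formulas used below, for a single made-up prediction on the 20x20 feature layer (stride 32):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

stride             = 32                    # 640 / 20
grid_x, grid_y     = 5, 7                  # the feature point this prediction sits on
anchor_w, anchor_h = 142, 110              # one of the prior boxes of the 20x20 layer
tx, ty, tw, th     = 0.2, -0.3, 0.1, 0.4   # made-up raw network outputs

# center: offset the grid coordinate by sigmoid(t) * 2 - 0.5, then scale by the stride
cx = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
cy = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride
# width and height: scale the prior box by (sigmoid(t) * 2) ** 2
w  = (sigmoid(tw) * 2) ** 2 * anchor_w
h  = (sigmoid(th) * 2) ** 2 * anchor_h
print(cx, cy, w, h)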

#---------------------------------------------------#
#   Adjust the boxes so that they match the original image
#---------------------------------------------------#
def yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image):
    #-----------------------------------------------------------------#
    #   The y axis is put first to make it convenient to multiply the boxes by the image height and width
    #-----------------------------------------------------------------#
    box_yx = box_xy[..., ::-1]
    box_hw = box_wh[..., ::-1]
    input_shape = K.cast(input_shape, K.dtype(box_yx))
    image_shape = K.cast(image_shape, K.dtype(box_yx))

    if letterbox_image:
        #-----------------------------------------------------------------#
        #   offset is the offset of the valid image area relative to the top-left corner of the image
        #   new_shape is the scaled width and height
        #-----------------------------------------------------------------#
        new_shape = K.round(image_shape * K.min(input_shape/image_shape))
        offset  = (input_shape - new_shape)/2./input_shape
        scale   = input_shape/new_shape

        box_yx  = (box_yx - offset) * scale
        box_hw *= scale

    box_mins    = box_yx - (box_hw / 2.)
    box_maxes   = box_yx + (box_hw / 2.)
    boxes  = K.concatenate([box_mins[..., 0:1], box_mins[..., 1:2], box_maxes[..., 0:1], box_maxes[..., 1:2]])
    boxes *= K.concatenate([image_shape, image_shape])
    return boxes

#---------------------------------------------------#
#   Decode each feature layer of the prediction into real values
#---------------------------------------------------#
def get_anchors_and_decode(feats, anchors, num_classes, input_shape, calc_loss=False):
    #---------------------------------------------------#
    #   Number of prior boxes, num_anchors = 3
    #---------------------------------------------------#
    num_anchors = len(anchors)
    #------------------------------------------#
    #   grid_shape is the height and width of the feature layer
    #------------------------------------------#
    grid_shape = K.shape(feats)[1:3]
    #--------------------------------------------------------------------#
    #   Obtain the coordinates of each feature point. Generated shape: (20, 20, num_anchors, 2)
    #--------------------------------------------------------------------#
    grid_x  = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]), [grid_shape[0], 1, num_anchors, 1])
    grid_y  = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]), [1, grid_shape[1], num_anchors, 1])
    grid    = K.cast(K.concatenate([grid_x, grid_y]), K.dtype(feats))
    #---------------------------------------------------------------#
    #   Tile the prior boxes. Generated shape: (20, 20, num_anchors, 2)
    #---------------------------------------------------------------#
    anchors_tensor = K.reshape(K.constant(anchors), [1, 1, num_anchors, 2])
    anchors_tensor = K.tile(anchors_tensor, [grid_shape[0], grid_shape[1], 1, 1])

    #---------------------------------------------------#
    #   Reshape the prediction into (batch_size, 20, 20, 3, 85)
    #   85 can be split into 4 + 1 + 80
    #   4: adjustment parameters for the box center and width/height
    #   1: confidence of the box
    #   80: class confidences
    #   batch_size, 20, 20, 3, 5 + num_classes
    #---------------------------------------------------#
    feats           = K.reshape(feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])
    #------------------------------------------#
    #   Decode the prior boxes and normalize
    #------------------------------------------#
    box_xy          = (K.sigmoid(feats[..., :2]) * 2 - 0.5 + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
    box_wh          = (K.sigmoid(feats[..., 2:4]) * 2) ** 2 * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
    #------------------------------------------#
    #   Obtain the confidence of the predicted boxes
    #------------------------------------------#
    box_confidence  = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.sigmoid(feats[..., 5:])
    
    #---------------------------------------------------------------------#
    #   When computing the loss, return grid, feats, box_xy, box_wh
    #   When predicting, return box_xy, box_wh, box_confidence, box_class_probs
    #---------------------------------------------------------------------#
    if calc_loss == True:
        return grid, feats, box_xy, box_wh
    return box_xy, box_wh, box_confidence, box_class_probs

2. Score screening and non-maximum suppression

After the final predictions are obtained, score screening and non-maximum suppression are performed.

Score screening keeps the prediction boxes whose scores exceed the confidence threshold.
Non-maximum suppression keeps, within a local area, only the highest-scoring box of each class.

The process of score screening and non-maximum suppression can be summarized as follows:
1. Find the boxes in the image whose scores are greater than the threshold. Screening by score before screening overlapping boxes greatly reduces the number of boxes.
2. Loop over the classes. Non-maximum suppression keeps, within a local area, the highest-scoring box of the same class, so looping over the classes lets us perform it per class.
3. Sort the boxes of that class by score, from largest to smallest.
4. Each time, take the box with the highest score, compute its overlap with all the remaining predicted boxes, and remove those that overlap too much.

The boxes remaining after score screening and non-maximum suppression are the ones drawn as final predictions.
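
The repo delegates the actual suppression to tf.image.non_max_suppression, but the greedy procedure described above is easy to sketch in NumPy (boxes here are plain (x1, y1, x2, y2) arrays):

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def nms(boxes, scores, iou_thres=0.3):
    # keep the highest-scoring box, drop everything that overlaps it too much, repeat
    order, keep = scores.argsort()[::-1], []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_thres]
    return keep

boxes  = np.array([[100, 100, 200, 200], [105, 105, 205, 205], [300, 300, 380, 380]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box is suppressed by the first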

(Figures: detection results with and without non-maximum suppression.)
The implementation code is:

#---------------------------------------------------#
#   Image prediction
#---------------------------------------------------#
def DecodeBox(outputs,
            anchors,
            num_classes,
            image_shape,
            input_shape,
            #-----------------------------------------------------------#
            #   The anchors for the 13x13 feature layer are [116,90],[156,198],[373,326]
            #   The anchors for the 26x26 feature layer are [30,61],[62,45],[59,119]
            #   The anchors for the 52x52 feature layer are [10,13],[16,30],[33,23]
            #-----------------------------------------------------------#
            anchor_mask     = [[6, 7, 8], [3, 4, 5], [0, 1, 2]],
            max_boxes       = 100,
            confidence      = 0.5,
            nms_iou         = 0.3,
            letterbox_image = True):

    box_xy = []
    box_wh = []
    box_confidence  = []
    box_class_probs = []
    for i in range(len(outputs)):
        sub_box_xy, sub_box_wh, sub_box_confidence, sub_box_class_probs = \
            get_anchors_and_decode(outputs[i], anchors[anchor_mask[i]], num_classes, input_shape)
        box_xy.append(K.reshape(sub_box_xy, [-1, 2]))
        box_wh.append(K.reshape(sub_box_wh, [-1, 2]))
        box_confidence.append(K.reshape(sub_box_confidence, [-1, 1]))
        box_class_probs.append(K.reshape(sub_box_class_probs, [-1, num_classes]))
    box_xy          = K.concatenate(box_xy, axis = 0)
    box_wh          = K.concatenate(box_wh, axis = 0)
    box_confidence  = K.concatenate(box_confidence, axis = 0)
    box_class_probs = K.concatenate(box_class_probs, axis = 0)

    #------------------------------------------------------------------------------------------------------------#
    #   Before the image is passed into the network, letterbox_image adds gray bars around it, so the resulting box_xy and box_wh are relative to the padded image.
    #   We need to correct them to remove the gray-bar part and convert box_xy and box_wh into y_min, y_max, x_min, x_max.
    #   If letterbox_image is not used, the normalized box_xy and box_wh still need to be rescaled to the original image size.
    #------------------------------------------------------------------------------------------------------------#
    boxes       = yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    box_scores  = box_confidence * box_class_probs

    #-----------------------------------------------------------#
    #   Check whether the scores are greater than score_threshold
    #-----------------------------------------------------------#
    mask             = box_scores >= confidence
    max_boxes_tensor = K.constant(max_boxes, dtype='int32')
    boxes_out   = []
    scores_out  = []
    classes_out = []
    #-----------------------------------------------------------#
    #   Keep the highest-scoring box of each class within a local area
    #-----------------------------------------------------------#
    for c in range(num_classes):
        #-----------------------------------------------------------#
        #   Take out all boxes and scores with box_scores >= score_threshold
        #-----------------------------------------------------------#
        class_boxes      = tf.boolean_mask(boxes, mask[:, c])
        class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])

        #-----------------------------------------------------------#
        #   Non-maximum suppression
        #   Keep the highest-scoring boxes within a local area
        #-----------------------------------------------------------#
        nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=nms_iou)

        #-----------------------------------------------------------#
        #   Obtain the results after non-maximum suppression
        #   The following three are the box positions, scores and classes
        #-----------------------------------------------------------#
        class_boxes         = K.gather(class_boxes, nms_index)
        class_box_scores    = K.gather(class_box_scores, nms_index)
        classes             = K.ones_like(class_box_scores, 'int32') * c

        boxes_out.append(class_boxes)
        scores_out.append(class_box_scores)
        classes_out.append(classes)
    boxes_out      = K.concatenate(boxes_out, axis=0)
    scores_out     = K.concatenate(scores_out, axis=0)
    classes_out    = K.concatenate(classes_out, axis=0)

    return boxes_out, scores_out, classes_out

4. Training part

1. Calculate the content required for loss

Computing the loss is essentially comparing the network's predictions with the ground truth.
Like the predictions, the loss of the network consists of three parts: the Reg part, the Obj part and the Cls part. The Reg part is the box regression at each feature point, the Obj part judges whether each feature point contains an object, and the Cls part is the class of the object contained at each feature point.

2. The matching process of positive samples

In YoloV7, the matching of positive samples during training can be divided into two parts:
a. For each ground truth box, roughly match prior boxes and feature points by coordinates, width and height.
b. Use SimOTA to adaptively decide how many prior boxes each ground truth box corresponds to.

Positive sample matching means finding which prior boxes are considered to have a corresponding ground truth box and are therefore responsible for predicting it.

a. Matching prior boxes and feature points

In this part, YoloV7 performs a rough match for each ground truth box, finding which prior boxes on which feature points can be responsible for predicting it.

First, match the prior boxes. The YoloV7 network defines a total of 9 prior boxes of different sizes, and each output feature layer corresponds to 3 of them.

For any ground truth box, YoloV7 no longer uses IoU for positive sample matching. Instead it matches directly by aspect ratio, computing width and height ratios between the ground truth box and the 9 prior boxes of different sizes.

If the width or height ratio between the ground truth box and a prior box is greater than the set threshold, the two do not match well enough and that prior box is treated as a negative sample.

For example, suppose there is a ground truth box whose width and height are [200, 200], i.e. a square. The 9 default prior boxes in YoloV7 are [12, 16], [19, 36], [40, 28], [36, 75], [76, 55], [72, 146], [142, 110], [192, 243], [459, 401], and the threshold is set to 4.

We then compute the width and height ratios between the ground truth box and the 9 prior boxes. Two cases are possible: the ground truth box may be larger than the prior box, or the prior box may be larger than the ground truth box. We therefore compute both the ground truth width and height divided by the prior width and height, and the prior width and height divided by the ground truth width and height, and take the maximum of these values.

The list below is the comparison result: a matrix of shape [9, 4], where 9 is the number of prior boxes and the 4 columns are the ground truth width and height divided by the prior width and height, followed by the prior width and height divided by the ground truth width and height.

[[16.66666667 12.5         0.06        0.08      ]
 [10.52631579  5.55555556  0.095       0.18      ]
 [ 5.          7.14285714  0.2         0.14      ]
 [ 5.55555556  2.66666667  0.18        0.375     ]
 [ 2.63157895  3.63636364  0.38        0.275     ]
 [ 2.77777778  1.36986301  0.36        0.73      ]
 [ 1.4084507   1.81818182  0.71        0.55      ]
 [ 1.04166667  0.82304527  0.96        1.215     ]
 [ 0.43572985  0.49875312  2.295       2.005     ]]

Then take the maximum of the comparison results for each prior box, giving the following vector:

[16.66666667 10.52631579  7.14285714  5.55555556  3.63636364  2.77777778
  1.81818182  1.215       2.295     ]

Next, we check which prior boxes have a maximum comparison value below the threshold. The five prior boxes [76, 55], [72, 146], [142, 110], [192, 243], [459, 401] all satisfy the requirement.

[142, 110], [192, 243], [459, 401] belong to the 20x20 feature layer.
[76, 55], [72, 146] belong to the 40x40 feature layer.

At this point we know which sizes of prior boxes can be used to predict this ground truth box.
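
These numbers can be reproduced with a few lines of NumPy that mirror the ratio computation inside preprocess_true_boxes shown below:

import numpy as np

wh      = np.array([[200., 200.]])                       # one ground truth box (width, height)
anchors = np.array([[12, 16], [19, 36], [40, 28], [36, 75], [76, 55],
                    [72, 146], [142, 110], [192, 243], [459, 401]], dtype=float)

ratios_of_gt_anchors = wh[:, None] / anchors[None]       # (1, 9, 2)
ratios_of_anchors_gt = anchors[None] / wh[:, None]       # (1, 9, 2)
ratios     = np.concatenate([ratios_of_gt_anchors, ratios_of_anchors_gt], axis=-1)
max_ratios = ratios.max(axis=-1)[0]                      # (9,)

threshold = 4
print(np.round(ratios[0], 3))                 # the [9, 4] comparison matrix above
print(np.round(max_ratios, 3))                # the per-anchor maxima above
print(np.where(max_ratios < threshold)[0])    # [4 5 6 7 8] -> [76,55], [72,146], [142,110], [192,243], [459,401]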

In the Yolo versions before YoloV5, each ground truth box was predicted by the feature point at the top-left corner of the grid cell containing its center.

In YoloV7, as in YoloV5, for each selected feature layer we first compute which grid cell the ground truth box falls in; the feature point at the top-left corner of that cell is one of the feature points responsible for prediction.

At the same time, a rounding rule is used to find the two nearest neighbouring cells, and all three cells are considered responsible for predicting this ground truth box.
(Figure: the grid cell containing the box center and its two nearest neighbouring cells.)
The red dot indicates the center of the ground truth box; besides the current cell, its two nearest neighbouring cells are also selected. As a result, the value range of the x and y offsets of the prediction box is no longer 0 to 1 but -0.5 to 1.5.

After the corresponding feature points are found, they become responsible for predicting the ground truth box with the prior boxes that satisfy the aspect-ratio condition.

But this step is only a rough screening; simOTA is used afterwards for precise screening.


def preprocess_true_boxes(self, true_boxes, input_shape, anchors, num_classes):
    assert (true_boxes[..., 4]<num_classes).all(), 'class id must be less than num_classes'
    #-----------------------------------------------------------#
    #   Obtain the box coordinates and the image size
    #   [640, 640]
    #-----------------------------------------------------------#
    true_boxes  = np.array(true_boxes, dtype='float32')
    input_shape = np.array(input_shape, dtype='int32')
    
    #-----------------------------------------------------------#
    #   There are three feature layers in total
    #-----------------------------------------------------------#
    num_layers  = len(self.anchors_mask)
    #-----------------------------------------------------------#
    #   m is the number of images, grid_shapes is the shape of the grids
    #   20, 20  640/32 = 20
    #   40, 40
    #   80, 80
    #-----------------------------------------------------------#
    m           = true_boxes.shape[0]
    grid_shapes = [input_shape // {0:32, 1:16, 2:8}[l] for l in range(num_layers)]
    #-----------------------------------------------------------#
    #   y_true has the shape
    #   m,20,20,3,2
    #   m,40,40,3,2
    #   m,80,80,3,2
    #   (the last dimension stores a positive-sample flag and the index of the matched ground truth box)
    #-----------------------------------------------------------#
    y_true = [np.zeros((m, grid_shapes[l][0], grid_shapes[l][1], len(self.anchors_mask[l]), 2),
                dtype='float32') for l in range(num_layers)]
    #-----------------------------------------------------#
    #   Used to help each prior box find the ground truth box it matches best
    #-----------------------------------------------------#
    box_best_ratios = [np.zeros((m, grid_shapes[l][0], grid_shapes[l][1], len(self.anchors_mask[l])),
                dtype='float32') for l in range(num_layers)]

    #-----------------------------------------------------------#
    #   Compute the centers and widths/heights of the ground truth boxes
    #   centers (m,n,2), widths/heights (m,n,2)
    #-----------------------------------------------------------#
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
    boxes_wh =  true_boxes[..., 2:4] - true_boxes[..., 0:2]
    #-----------------------------------------------------------#
    #   Normalize the ground truth boxes to fractions of the image size
    #-----------------------------------------------------------#
    true_boxes[..., 0:2] = boxes_xy / input_shape[::-1]
    true_boxes[..., 2:4] = boxes_wh / input_shape[::-1]

    #-----------------------------------------------------------#
    #   [9,2] -> [9,2]
    #-----------------------------------------------------------#
    anchors         = np.array(anchors, np.float32)

    #-----------------------------------------------------------#
    #   The width and height must be greater than 0 to be valid
    #-----------------------------------------------------------#
    valid_mask = boxes_wh[..., 0]>0

    for b in range(m):
        #-----------------------------------------------------------#
        #   Process each image
        #-----------------------------------------------------------#
        wh = boxes_wh[b, valid_mask[b]]

        if len(wh) == 0: 
            continue
        #-------------------------------------------------------#
        #   wh                          : num_true_box, 2
        #   np.expand_dims(wh, 1)       : num_true_box, 1, 2
        #   anchors                     : 9, 2
        #   np.expand_dims(anchors, 0)  : 1, 9, 2
        #   
        #   ratios_of_gt_anchors is the width/height ratio of each ground truth box to each prior box
        #   ratios_of_gt_anchors    : num_true_box, 9, 2
        #   ratios_of_anchors_gt is the width/height ratio of each prior box to each ground truth box
        #   ratios_of_anchors_gt    : num_true_box, 9, 2
        #
        #   ratios                  : num_true_box, 9, 4
        #   max_ratios is the maximum of these width/height ratios for each ground truth box and prior box
        #   max_ratios              : num_true_box, 9
        #-------------------------------------------------------#
        ratios_of_gt_anchors = np.expand_dims(wh, 1) / np.expand_dims(anchors, 0)
        ratios_of_anchors_gt = np.expand_dims(anchors, 0) / np.expand_dims(wh, 1)
        ratios               = np.concatenate([ratios_of_gt_anchors, ratios_of_anchors_gt], axis = -1)
        max_ratios           = np.max(ratios, axis = -1)
        
        for t, ratio in enumerate(max_ratios):
            #-------------------------------------------------------#
            #   ratio : 9
            #-------------------------------------------------------#
            over_threshold = ratio < self.threshold
            over_threshold[np.argmin(ratio)] = True
            #-----------------------------------------------------------#
            #   Find the feature layer each ground truth box belongs to
            #-----------------------------------------------------------#
            for l in range(num_layers):
                for k, n in enumerate(self.anchors_mask[l]):
                    if not over_threshold[n]:
                        continue
                    #-----------------------------------------------------------#
                    #   floor rounds down to find the x and y grid coordinates of the ground truth box on this feature layer
                    #-----------------------------------------------------------#
                    i = np.floor(true_boxes[b,t,0] * grid_shapes[l][1]).astype('int32')
                    j = np.floor(true_boxes[b,t,1] * grid_shapes[l][0]).astype('int32')
                    offsets = self.get_near_points(true_boxes[b,t,0] * grid_shapes[l][1], true_boxes[b,t,1] * grid_shapes[l][0], i, j)
                    for offset in offsets:
                        local_i = i + offset[0]
                        local_j = j + offset[1]

                        if local_i >= grid_shapes[l][1] or local_i < 0 or local_j >= grid_shapes[l][0] or local_j < 0:
                            continue

                        if box_best_ratios[l][b, local_j, local_i, k] != 0:
                            if box_best_ratios[l][b, local_j, local_i, k] > ratio[n]:
                                y_true[l][b, local_j, local_i, k, :] = 0
                            else:
                                continue
                        #-----------------------------------------------------------#
                        #   y_true has shape (m,20,20,3,2), (m,40,40,3,2), (m,80,80,3,2)
                        #   channel 0 is the positive-sample flag, channel 1 stores the
                        #   index (t + 1) of the matched ground truth box
                        #-----------------------------------------------------------#
                        y_true[l][b, local_j, local_i, k, 0] = 1
                        y_true[l][b, local_j, local_i, k, 1] = t + 1
                        box_best_ratios[l][b, local_j, local_i, k] = ratio[n]
    return y_true
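
The code above calls self.get_near_points, which is not reproduced in this post. Based on the "two nearest grids" rule described earlier, a plausible sketch of that helper is:

def get_near_points(x, y, i, j):
    # Sketch of the helper used above (not shown in this excerpt): besides the grid
    # cell (i, j) that contains the box centre, also return the offsets of the two
    # neighbouring cells the centre is closest to.
    sub_x = x - i
    sub_y = y - j
    if sub_x > 0.5 and sub_y > 0.5:
        return [[0, 0], [1, 0], [0, 1]]
    elif sub_x < 0.5 and sub_y > 0.5:
        return [[0, 0], [-1, 0], [0, 1]]
    elif sub_x < 0.5 and sub_y < 0.5:
        return [[0, 0], [-1, 0], [0, -1]]
    else:
        return [[0, 0], [1, 0], [0, -1]]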

b. SimOTA adaptive matching

In YoloV7 we compute a Cost matrix, which represents the cost relationship between each ground truth box and each feature point. The Cost matrix consists of two parts:
1. the overlap between each ground truth box and the prediction box of the current feature point;
2. the class prediction accuracy of the current feature point's prediction box for each ground truth box.

The higher the overlap between a ground truth box and the prediction box of the current feature point, the more that feature point has already tried to fit this ground truth box, so its Cost is smaller.

The higher the class prediction accuracy of the current feature point's prediction box for a ground truth box, the more that feature point has tried to fit this ground truth box, so its Cost is also smaller.

The purpose of the Cost matrix is to adaptively find the ground truth box that each feature point should fit: the higher the overlap, the more it should be fitted; the more accurate the classification, the more it should be fitted; and feature points within a certain radius should be fitted more.
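
As a toy illustration with made-up numbers (the 3.0 weighting on the IoU term follows the YoloX-style OTA cost; the actual weighting in the repo may differ), the cost of one ground truth box against three candidate predictions could be computed like this:

import numpy as np

pair_wise_iou      = np.array([0.80, 0.55, 0.20])    # overlap with the ground truth box
pair_wise_cls_loss = np.array([0.10, 0.40, 1.20])    # classification loss of each candidate
pair_wise_iou_loss = -np.log(pair_wise_iou + 1e-8)
cost = pair_wise_cls_loss + 3.0 * pair_wise_iou_loss
print(np.round(cost, 3))   # [0.769 2.194 6.028]: the best-fitting candidate has the lowest cost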

In SimOTA, different targets are assigned different numbers of positive samples (dynamic k). Taking the ants and the watermelon in Megvii's official explanation as an example, a traditional positive sample assignment scheme often gives the watermelon and the ants in the same scene the same number of positive samples: either the ants get many low-quality positive samples, or the watermelon gets only one or two. Neither assignment is appropriate.
The key to dynamic positive sample assignment is how to determine k. The specific approach of SimOTA is to first take the 10 feature points with the lowest cost for each target, and then sum the IoUs between the prediction boxes of these ten feature points and the ground truth box to obtain the final k.

Therefore, the SimOTA process can be summarized as follows (a toy sketch of step 2 follows this list):
1. Compute the overlap between each ground truth box and the prediction box of each feature point.
2. Sum the IoUs of the 20 prediction boxes with the highest overlap with each ground truth box to obtain its k, which means each ground truth box will have k feature points corresponding to it.
3. Compute the class prediction accuracy of each feature point's prediction box for each ground truth box.
4. Compute the Cost matrix.
5. Take the k feature points with the lowest Cost as the positive samples of that ground truth box.
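
Here is the toy sketch of step 2 referenced above: dynamic_k for one ground truth box is the sum of the IoUs of its best candidate predictions (at most 20 of them), clamped to at least 1. The numbers are made up.

import numpy as np

ious_with_candidates = np.array([0.82, 0.75, 0.64, 0.40, 0.31, 0.12, 0.05])
n_candidates = min(20, len(ious_with_candidates))
topk_ious    = np.sort(ious_with_candidates)[::-1][:n_candidates]
dynamic_k    = max(1, int(topk_ious.sum()))
print(dynamic_k)   # 3 -> this ground truth box gets 3 positive samples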

#---------------------------------------------------#
#   Loss calculation
#---------------------------------------------------#
def yolo_loss(
    args, 
    input_shape, 
    anchors, 
    anchors_mask, 
    num_classes, 
    balance         = [0.4, 1.0, 4], 
    label_smoothing = 0.01, 
    box_ratio       = 0.05, 
    obj_ratio       = 1, 
    cls_ratio       = 0.5
):
    num_layers = len(anchors_mask)
    #---------------------------------------------------------------------------------------------------#
    #   Separate the predictions from the ground truth. args is [*model_body.output, *y_true]
    #   y_true is a list of three feature layers, with shapes:
    #   (m,20,20,3,85)
    #   (m,40,40,3,85)
    #   (m,80,80,3,85)
    #   yolo_outputs is a list of three feature layers, with shapes:
    #   (m,20,20,3,85)
    #   (m,40,40,3,85)
    #   (m,80,80,3,85)
    #---------------------------------------------------------------------------------------------------#
    labels          = args[-1]
    y_true          = args[num_layers:-1]
    yolo_outputs    = args[:num_layers]

    #-----------------------------------------------------------#
    #   input_shape is 640, 640
    #-----------------------------------------------------------#
    input_shape = K.cast(input_shape, K.dtype(y_true[0]))

    loss        = 0
    outputs     = []
    layer_id    = []
    fg_masks    = []
    is_in_boxes_and_centers = []
    #---------------------------------------------------------------------------------------------------#
    #   y_true is a list of three feature layers, with shapes (m,20,20,3,85), (m,40,40,3,85), (m,80,80,3,85).
    #   yolo_outputs is a list of three feature layers, with shapes (m,20,20,3,85), (m,40,40,3,85), (m,80,80,3,85).
    #---------------------------------------------------------------------------------------------------#
    for l in range(num_layers):
        #-----------------------------------------------------------#
        #   Process the feature-layer output of yolo_outputs to obtain four return values:
        #   grid        (20,20,1,2) grid coordinates
        #   raw_pred    (m,20,20,3,85) raw, un-decoded predictions
        #   pred_xy     (m,20,20,3,2) decoded center coordinates
        #   pred_wh     (m,20,20,3,2) decoded width and height
        #-----------------------------------------------------------#
        grid, raw_pred, pred_xy, pred_wh = get_anchors_and_decode(yolo_outputs[l],
             anchors[anchors_mask[l]], num_classes, input_shape, calc_loss=True)
        
        #-----------------------------------------------------------#
        #   pred_box is the decoded position of the predicted box
        #   (m,20,20,3,4)
        #-----------------------------------------------------------#
        pred_box = K.concatenate([pred_xy, pred_wh])
        
        m       = tf.shape(pred_box)[0]
        scale   = tf.cast([[[input_shape[1], input_shape[0], input_shape[1], input_shape[0]]]], tf.float32)
        outputs.append(tf.concat([tf.reshape(pred_box, [m, -1, 4]) * scale, tf.reshape(raw_pred[..., 4:], [m, -1, num_classes + 1])], -1))
        layer_id.append(tf.ones_like(outputs[-1][:, :, 0]) * l)
        fg_masks.append(tf.reshape(y_true[l][..., 0:1], [m, -1]))
        is_in_boxes_and_centers.append(tf.reshape(y_true[l][..., 1:2], [m, -1]))
    
    outputs     = tf.concat(outputs, 1)
    layer_id    = tf.concat(layer_id, 1)
    fg_masks    = tf.concat(fg_masks, 1)
    is_in_boxes_and_centers = tf.concat(is_in_boxes_and_centers, 1)
        
    #-----------------------------------------------#
    #   [batch, n_anchors_all, 4]     coordinates of the prediction frames
    #   [batch, n_anchors_all, 1]     whether each feature point contains an object
    #   [batch, n_anchors_all, n_cls] class of the object at each feature point
    #-----------------------------------------------#
    bbox_preds  = outputs[:, :, :4]  
    obj_preds   = outputs[:, :, 4:5]
    cls_preds   = outputs[:, :, 5:]  
    
    #------------------------------------------------------------#
    #   labels                      [batch, max_boxes, 5]
    #   tf.reduce_sum(labels, -1)   [batch, max_boxes]
    #   nlabel                      [batch]
    #------------------------------------------------------------#
    nlabel = tf.reduce_sum(tf.cast(tf.reduce_sum(labels, -1) > 0, K.dtype(outputs)), -1)
    total_num_anchors = tf.shape(outputs)[1]
    
    num_fg      = 0.0
    loss_obj    = 0.0
    loss_cls    = 0.0
    loss_iou    = 0.0
    def loop_body(b, num_fg, loss_iou, loss_obj, loss_cls):
        # num_gt: number of ground-truth boxes in this image
        num_gt  = tf.cast(nlabel[b], tf.int32)
        #-----------------------------------------------#
        #   gt_bboxes_per_image     [num_gt, 4]
        #   gt_classes              [num_gt]
        #   bboxes_preds_per_image  [n_anchors_all, 4]
        #   obj_preds_per_image     [n_anchors_all, 1]
        #   cls_preds_per_image     [n_anchors_all, num_classes]
        #-----------------------------------------------#
        gt_bboxes_per_image     = labels[b][:num_gt, :4]
        gt_classes              = labels[b][:num_gt,  4]
        bboxes_preds_per_image  = bbox_preds[b]
        obj_preds_per_image     = obj_preds[b]
        cls_preds_per_image     = cls_preds[b]

        def f1():
            num_fg_img  = tf.cast(tf.constant(0), K.dtype(outputs))
            cls_target  = tf.cast(tf.zeros((0, num_classes)), K.dtype(outputs))
            reg_target  = tf.cast(tf.zeros((0, 4)), K.dtype(outputs))
            obj_target  = tf.cast(tf.zeros((total_num_anchors, 1)), K.dtype(outputs))
            fg_mask     = tf.cast(tf.zeros(total_num_anchors), tf.bool)
            return num_fg_img, cls_target, reg_target, obj_target, fg_mask
        def f2():
            fg_mask = tf.cast(fg_masks[b], tf.bool)
            gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg_img = get_assignments( 
                fg_mask, gt_bboxes_per_image, gt_classes, bboxes_preds_per_image, obj_preds_per_image, cls_preds_per_image, num_classes, num_gt, 
            )
            reg_target  = tf.cast(tf.gather_nd(gt_bboxes_per_image, tf.reshape(matched_gt_inds, [-1, 1])), K.dtype(outputs))
            cls_target  = tf.cast(tf.one_hot(tf.cast(gt_matched_classes, tf.int32), num_classes) * tf.expand_dims(pred_ious_this_matching, -1), K.dtype(outputs))
            obj_target  = tf.cast(tf.expand_dims(fg_mask, -1), K.dtype(outputs))
            return num_fg_img, cls_target, reg_target, obj_target, fg_mask
            
        num_fg_img, cls_target, reg_target, obj_target, fg_mask = tf.cond(tf.equal(num_gt, 0), f1, f2)
        num_fg      += num_fg_img
        # reg_target = tf.Print(reg_target, [num_fg_img, reg_target, tf.boolean_mask(bboxes_preds_per_image, fg_mask)], summarize=1000)

        _loss_iou   = 1 - box_ciou(reg_target, tf.boolean_mask(bboxes_preds_per_image, fg_mask))
        _loss_obj   = K.binary_crossentropy(_smooth_labels(obj_target, label_smoothing), obj_preds_per_image, from_logits=True)
        _loss_cls   = K.binary_crossentropy(cls_target, tf.boolean_mask(cls_preds_per_image, fg_mask), from_logits=True)
        for layer in range(len(balance)):
            num_pos = tf.maximum(K.sum(tf.cast(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), tf.float32)), 1)

            loss_iou += K.sum(tf.boolean_mask(_loss_iou, tf.boolean_mask(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), fg_mask))) * box_ratio / num_pos
            loss_obj += K.mean(tf.boolean_mask(_loss_obj, tf.equal(layer_id[b], layer)) * balance[layer]) * obj_ratio
            loss_cls += K.sum(tf.boolean_mask(_loss_cls, tf.boolean_mask(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), fg_mask))) * cls_ratio / num_pos / num_classes
        return b + 1, num_fg, loss_iou, loss_obj, loss_cls
    #-----------------------------------------------------------#
    #   Loop over every image in the batch
    #-----------------------------------------------------------#
    _, num_fg, loss_iou, loss_obj, loss_cls = tf.while_loop(lambda b,*args: b < tf.cast(tf.shape(outputs)[0], tf.int32), loop_body, [0, num_fg, loss_iou, loss_obj, loss_cls])
    
    num_fg      = tf.cast(tf.maximum(num_fg, 1), K.dtype(outputs))
    loss        = (loss_iou + loss_cls + loss_obj) / tf.cast(tf.shape(outputs)[0], tf.float32)
    # loss = tf.Print(loss, [num_fg, loss_iou / tf.cast(tf.shape(outputs)[0], tf.float32), loss_obj / tf.cast(tf.shape(outputs)[0], tf.float32), loss_cls / tf.cast(tf.shape(outputs)[0], tf.float32) ])
    return loss

def get_assignments(fg_mask, gt_bboxes_per_image, gt_classes, bboxes_preds_per_image, obj_preds_per_image, cls_preds_per_image, num_classes, num_gt):
    #-------------------------------------------------------#
    #   Get the predictions of the feature points that fall inside the ground-truth boxes
    #   fg_mask                 [n_anchors_all]
    #   bboxes_preds_per_image  [fg_mask, 4]
    #   cls_preds_              [fg_mask, num_classes]
    #   obj_preds_              [fg_mask, 1]
    #-------------------------------------------------------#
    bboxes_preds_per_image  = tf.boolean_mask(bboxes_preds_per_image, fg_mask, axis = 0)
    obj_preds_              = tf.boolean_mask(obj_preds_per_image, fg_mask, axis = 0)
    cls_preds_              = tf.boolean_mask(cls_preds_per_image, fg_mask, axis = 0)
    num_in_boxes_anchor     = tf.shape(bboxes_preds_per_image)[0]
    #-------------------------------------------------------#
    #   Compute the overlap (IoU) between ground-truth boxes and prediction frames
    #   pair_wise_ious      [num_gt, fg_mask]
    #-------------------------------------------------------#
    # gt_bboxes_per_image = tf.Print(gt_bboxes_per_image, [gt_bboxes_per_image, bboxes_preds_per_image], summarize=1000)
    pair_wise_ious      = box_iou(gt_bboxes_per_image, bboxes_preds_per_image)
    pair_wise_ious_loss = -tf.log(pair_wise_ious + 1e-8)
    #-------------------------------------------------------#
    #   Compute the cross entropy between the ground-truth classes and the predicted class confidences
    #   cls_preds_          [num_gt, fg_mask, num_classes]
    #   gt_cls_per_image    [num_gt, fg_mask, num_classes]
    #   pair_wise_cls_loss  [num_gt, fg_mask]
    #-------------------------------------------------------#
    gt_cls_per_image    = tf.tile(tf.expand_dims(tf.one_hot(tf.cast(gt_classes, tf.int32), num_classes), 1), (1, num_in_boxes_anchor, 1))
    cls_preds_          = K.sigmoid(tf.tile(tf.expand_dims(cls_preds_, 0), (num_gt, 1, 1))) *\
                          K.sigmoid(tf.tile(tf.expand_dims(obj_preds_, 0), (num_gt, 1, 1)))

    pair_wise_cls_loss  = tf.reduce_sum(K.binary_crossentropy(gt_cls_per_image, tf.sqrt(cls_preds_)), -1)
    #-------------------------------------------------------#
    #   When the predicted class is close to the real class, the cross entropy is low
    #   When the prediction frame overlaps the real frame well, the cost is low
    #   Only feature points that actually correspond to a real frame end up with a low cost
    #-------------------------------------------------------#
    cost = pair_wise_cls_loss + 3.0 * pair_wise_ious_loss

    gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg = dynamic_k_matching(cost, pair_wise_ious, fg_mask, gt_classes, num_gt)
    return gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg

def dynamic_k_matching(cost, pair_wise_ious, fg_mask, gt_classes, num_gt):
    #-------------------------------------------------------#
    #   matching_matrix     [num_gt, fg_mask]
    #   cost                [num_gt, fg_mask]
    #   pair_wise_ious      [num_gt, fg_mask] overlap between every real frame and prediction frame
    #   gt_classes          [num_gt]        
    #   fg_mask             [n_anchors_all]
    #-------------------------------------------------------#
    matching_matrix         = tf.zeros_like(cost)

    #------------------------------------------------------------#
    #   Select the n_candidate_k points with the highest IoU for each real frame
    #   (the 20 prediction frames that overlap the current real frame the most)
    #   IoU values lie in [0, 1], so after summing, dynamic_ks lies in [1, n_candidate_k]
    #   The sum determines how many points should be used to predict each real frame
    #   topk_ious           [num_gt, n_candidate_k]
    #   dynamic_ks          [num_gt]
    #   matching_matrix     [num_gt, fg_mask]
    #------------------------------------------------------------#
    n_candidate_k           = tf.minimum(20, tf.shape(pair_wise_ious)[1])
    topk_ious, _            = tf.nn.top_k(pair_wise_ious, n_candidate_k)
    dynamic_ks              = tf.maximum(tf.reduce_sum(topk_ious, 1), 1)
    # dynamic_ks              = tf.Print(dynamic_ks, [topk_ious, dynamic_ks], summarize = 100)
    
    def loop_body_1(b, matching_matrix):
        #------------------------------------------------------------#
        #   Select the dynamic_k points with the lowest cost for each real frame
        #------------------------------------------------------------#
        _, pos_idx = tf.nn.top_k(-cost[b], k=tf.cast(dynamic_ks[b], tf.int32))
        matching_matrix = tf.concat(
            [matching_matrix[:b], tf.expand_dims(tf.reduce_max(tf.one_hot(pos_idx, tf.shape(cost)[1]), 0), 0), matching_matrix[b+1:]], axis = 0
        )
        # matching_matrix = matching_matrix.write(b, K.cast(tf.reduce_max(tf.one_hot(pos_idx, tf.shape(cost)[1]), 0), K.dtype(cost)))
        return b + 1, matching_matrix
    #-----------------------------------------------------------#
    #   Loop over every real frame of the current image
    #-----------------------------------------------------------#
    _, matching_matrix = tf.while_loop(lambda b,*args: b < tf.cast(num_gt, tf.int32), loop_body_1, [0, matching_matrix])

    #------------------------------------------------------------#
    #   anchor_matching_gt  [fg_mask]
    #------------------------------------------------------------#
    anchor_matching_gt = tf.reduce_sum(matching_matrix, 0)
    #------------------------------------------------------------#
    #   When a single feature point is matched to several real frames,
    #   keep only the real frame with the lowest cost.
    #------------------------------------------------------------#
    biger_one_indice = tf.reshape(tf.where(anchor_matching_gt > 1), [-1])
    def loop_body_2(b, matching_matrix):
        indice_anchor   = tf.cast(biger_one_indice[b], tf.int32)
        indice_gt       = tf.math.argmin(cost[:, indice_anchor])
        matching_matrix = tf.concat(
            [
                matching_matrix[:, :indice_anchor], 
                tf.expand_dims(tf.one_hot(indice_gt, tf.cast(num_gt, tf.int32)), 1), 
                matching_matrix[:, indice_anchor+1:]
            ], axis = -1
        )
        return b + 1, matching_matrix
    #-----------------------------------------------------------#
    #   Loop over every feature point that is matched to more than one real frame
    #-----------------------------------------------------------#
    _, matching_matrix = tf.while_loop(lambda b,*args: b < tf.cast(tf.shape(biger_one_indice)[0], tf.int32), loop_body_2, [0, matching_matrix])

    #------------------------------------------------------------#
    #   fg_mask_inboxes  [fg_mask]
    #   num_fg is the number of feature points used as positive samples
    #------------------------------------------------------------#
    fg_mask_inboxes = tf.reduce_sum(matching_matrix, 0) > 0.0
    num_fg          = tf.reduce_sum(tf.cast(fg_mask_inboxes, K.dtype(cost)))

    fg_mask_indices         = tf.reshape(tf.where(fg_mask), [-1])
    fg_mask_inboxes_indices = tf.reshape(tf.where(fg_mask_inboxes), [-1, 1])
    fg_mask_select_indices  = tf.gather_nd(fg_mask_indices, fg_mask_inboxes_indices)
    fg_mask                 = tf.cast(tf.reduce_max(tf.one_hot(fg_mask_select_indices, tf.shape(fg_mask)[0]), 0), K.dtype(fg_mask))

    #------------------------------------------------------------#
    #   Get the object class assigned to each positive feature point
    #------------------------------------------------------------#
    matched_gt_inds     = tf.math.argmax(tf.boolean_mask(matching_matrix, fg_mask_inboxes, axis = 1), 0)
    gt_matched_classes  = tf.gather_nd(gt_classes, tf.reshape(matched_gt_inds, [-1, 1]))

    pred_ious_this_matching = tf.boolean_mask(tf.reduce_sum(matching_matrix * pair_wise_ious, 0), fg_mask_inboxes)
    return gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg
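For completeness, a typical way to attach a loss with this signature to a Keras training graph is through a Lambda layer that takes both the model outputs and extra Input placeholders for y_true and labels. The snippet below is only a sketch of that wiring, assuming model_body, anchors, anchors_mask, num_classes and input_shape are already defined as in the rest of this article; the exact placeholder shapes should be checked against the repository's train.py.

from keras.layers import Input, Lambda
from keras.models import Model

# Placeholders for the targets consumed by yolo_loss: three per-layer y_true tensors
# (shapes follow the yolo_loss docstring) plus one labels tensor of [max_boxes, 5].
y_true = [Input(shape=(s, s, 3, num_classes + 5)) for s in (20, 40, 80)]
labels = Input(shape=(None, 5))

model_loss = Lambda(
    yolo_loss, output_shape=(1,), name='yolo_loss',
    arguments={'input_shape': input_shape, 'anchors': anchors,
               'anchors_mask': anchors_mask, 'num_classes': num_classes}
)([*model_body.output, *y_true, labels])

model = Model([model_body.input, *y_true, labels], model_loss)
# The loss is already computed inside the graph, so compile with a pass-through loss.
model.compile(optimizer='sgd', loss=lambda y_t, y_p: y_p)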

3. Calculate Loss

As explained in the first part, the loss of YoloV7 consists of three parts:
1. The Reg part. From the second part we know which prior frames correspond to each real frame. After obtaining those prior frames, we take out the prediction frames they produce and compute the CIoU loss between the real frames and the prediction frames; this forms the Reg component of the loss.
2. The Obj part. From the second part we know which prior frames correspond to each real frame. All prior frames matched to a real frame are positive samples and the remaining prior frames are negative samples; the cross-entropy loss between this positive/negative assignment and the objectness predictions of the feature points forms the Obj component of the loss.
3. The Cls part. From the second part we know which prior frames correspond to each real frame. After obtaining those prior frames, we take out their class predictions and compute the cross-entropy loss between the real frames' classes and those class predictions; this forms the Cls component of the loss.

_loss_iou   = 1 - box_ciou(reg_target, tf.boolean_mask(bboxes_preds_per_image, fg_mask))
_loss_obj   = K.binary_crossentropy(_smooth_labels(obj_target, label_smoothing), obj_preds_per_image, from_logits=True)
_loss_cls   = K.binary_crossentropy(cls_target, tf.boolean_mask(cls_preds_per_image, fg_mask), from_logits=True)
for layer in range(len(balance)):
    num_pos = tf.maximum(K.sum(tf.cast(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), tf.float32)), 1)

    loss_iou += K.sum(tf.boolean_mask(_loss_iou, tf.boolean_mask(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), fg_mask))) * box_ratio / num_pos
    loss_obj += K.mean(tf.boolean_mask(_loss_obj, tf.equal(layer_id[b], layer)) * balance[layer]) * obj_ratio
    loss_cls += K.sum(tf.boolean_mask(_loss_cls, tf.boolean_mask(tf.logical_and(tf.equal(layer_id[b], layer), fg_mask), fg_mask))) * cls_ratio / num_pos / num_classes
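The Reg part above relies on box_ciou, which is not reproduced in this article. As a reference for what such a term computes, here is a minimal NumPy sketch of CIoU for boxes in (cx, cy, w, h) format; it illustrates the formula and is not necessarily line-for-line identical to the repository's box_ciou.

import numpy as np

def ciou(box1, box2, eps=1e-7):
    # box1, box2: [..., 4] boxes in (cx, cy, w, h) format
    b1_min, b1_max = box1[..., :2] - box1[..., 2:] / 2, box1[..., :2] + box1[..., 2:] / 2
    b2_min, b2_max = box2[..., :2] - box2[..., 2:] / 2, box2[..., :2] + box2[..., 2:] / 2

    # Intersection over union
    inter_wh   = np.clip(np.minimum(b1_max, b2_max) - np.maximum(b1_min, b2_min), 0, None)
    inter_area = inter_wh[..., 0] * inter_wh[..., 1]
    union_area = box1[..., 2] * box1[..., 3] + box2[..., 2] * box2[..., 3] - inter_area
    iou        = inter_area / (union_area + eps)

    # Normalized distance between centers (enclosing-box diagonal as the scale)
    center_dist  = np.sum((box1[..., :2] - box2[..., :2]) ** 2, axis=-1)
    enclose_wh   = np.maximum(b1_max, b2_max) - np.minimum(b1_min, b2_min)
    enclose_diag = np.sum(enclose_wh ** 2, axis=-1)

    # Aspect-ratio consistency term
    v     = (4 / np.pi ** 2) * (np.arctan(box1[..., 2] / (box1[..., 3] + eps))
                                - np.arctan(box2[..., 2] / (box2[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - center_dist / (enclose_diag + eps) - alpha * v

print(1 - ciou(np.array([320., 320., 100., 80.]), np.array([330., 315., 90., 85.])))  # CIoU loss of one pair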

Train your own YoloV7 model

First download the corresponding repository from Github. After downloading, unzip it and open the folder with your IDE.
Note that the opened root directory must be correct; if the relative paths are wrong, the code will not run.

Make sure the opened root directory is the directory where the files are stored.
insert image description here

1. Data set preparation

This article uses the VOC format for training. Before training, you need to prepare your own data set. If you do not have one, you can download the VOC12+07 data set through the Github link and try it out.
Before training, put the label files in the Annotation folder under VOC2007 under VOCdevkit.
insert image description here
Before training, put the image files in JPEGImages under VOC2007 under VOCdevkit.
insert image description here
At this point, the placement of the data set is finished.

2. Data set processing

After arranging the data set, we need to process it to obtain the 2007_train.txt and 2007_val.txt files used for training. This is done with voc_annotation.py in the root directory.

There are some parameters in voc_annotation.py that need to be set.
They are annotation_mode, classes_path, trainval_percent, train_percent, and VOCdevkit_path. For the first training you only need to modify classes_path.

'''
annotation_mode specifies what this script computes when it runs
annotation_mode = 0 runs the whole label pipeline: the txt files in VOCdevkit/VOC2007/ImageSets plus the 2007_train.txt and 2007_val.txt used for training
annotation_mode = 1 only generates the txt files in VOCdevkit/VOC2007/ImageSets
annotation_mode = 2 only generates the 2007_train.txt and 2007_val.txt used for training
'''
annotation_mode     = 0
'''
Must be modified: it is used to generate the target information in 2007_train.txt and 2007_val.txt
It only needs to be consistent with the classes_path used for training and prediction
If the generated 2007_train.txt contains no target information,
it is because the classes were not set correctly
Only effective when annotation_mode is 0 or 2
'''
classes_path        = 'model_data/voc_classes.txt'
'''
trainval_percent specifies the ratio of (training set + validation set) to test set; by default (training set + validation set) : test set = 9 : 1
train_percent specifies the ratio of training set to validation set inside (training set + validation set); by default training set : validation set = 9 : 1
Only effective when annotation_mode is 0 or 1
'''
trainval_percent    = 0.9
train_percent       = 0.9
'''
Points to the folder containing the VOC data set
By default it points to the VOC data set in the root directory
'''
VOCdevkit_path  = 'VOCdevkit'
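As a quick sanity check of what these two ratios mean, the split produced by the default values can be worked out as follows (illustrative arithmetic only, assuming a hypothetical data set of 10,000 annotated images):

num_images       = 10000   # hypothetical data set size
trainval_percent = 0.9
train_percent    = 0.9

num_trainval = int(num_images * trainval_percent)   # 9000 images for train + val
num_train    = int(num_trainval * train_percent)    # 8100 images for training
num_val      = num_trainval - num_train             # 900 images for validation
num_test     = num_images - num_trainval            # 1000 images for testing
print(num_train, num_val, num_test)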

classes_path points to the txt that lists the detection categories. Taking the VOC data set as an example, the txt we use is:
insert image description here
When training your own data set, you can create a cls_classes.txt yourself and write the categories you want to distinguish into it.
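For reference, a classes file simply lists one category name per line. The 20 standard VOC categories are shown below; when training your own data set, replace them with your own class names:

aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor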

3. Start network training

Through voc_annotation.py we have generated 2007_train.txt and 2007_val.txt, so we can now start training.
There are many training parameters; you can read the comments carefully after downloading the repository. The most important one is still classes_path in train.py.

classes_path points to the txt listing the detection categories; it is the same txt as in voc_annotation.py! It must be modified when training your own data set!
insert image description here
After modifying classes_path, you can run train.py to start training. After training for several epochs, the weights will be saved in the logs folder.
The functions of the other parameters are as follows:

#---------------------------------------------------------------------#
#   train_gpu   GPUs used for training
#               defaults to the first card; [0, 1] for two cards, [0, 1, 2] for three cards
#               with multiple GPUs, the batch on each card is the total batch divided by the number of cards.
#---------------------------------------------------------------------#
train_gpu       = [0,]
#---------------------------------------------------------------------#
#   classes_path    points to a txt under model_data, related to your own data set
#                   be sure to modify classes_path before training so that it matches your data set
#---------------------------------------------------------------------#
classes_path    = 'model_data/voc_classes.txt'
#---------------------------------------------------------------------#
#   anchors_path    txt file of the prior frames (anchors); usually not modified.
#   anchors_mask    helps the code find the corresponding prior frames; usually not modified.
#---------------------------------------------------------------------#
anchors_path    = 'model_data/yolo_anchors.txt'
anchors_mask    = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
#----------------------------------------------------------------------------------------------------------------------------#
#   See the README for downloading the weight file; it can be obtained from a network drive. The pretrained weights of the model are universal across data sets because the features are universal.
#   The important part of the pretrained weights is the weights of the backbone feature extraction network, which are used for feature extraction.
#   Pretrained weights are necessary in 99% of cases; without them the backbone weights are too random, feature extraction is poor, and the training results will not be good.
#
#   If training was interrupted, you can set model_path to a weight file in the logs folder to reload the partially trained weights.
#   At the same time, modify the freeze-stage or unfreeze-stage parameters below to keep the epochs continuous.
#
#   When model_path = '', the weights of the whole model are not loaded.
#
#   The weights of the whole model are used here, so they are loaded in train.py.
#   If you want to train the model from scratch, set model_path = '' and Freeze_Train = False below; training then starts from scratch without freezing the backbone.
#
#   In general, training from scratch gives poor results, because the weights are too random and feature extraction is ineffective, so it is very, very, very much not recommended!
#   There are two options for training from scratch:
#   1. Thanks to the strong augmentation ability of Mosaic, with a large UnFreeze_Epoch (300 or more), a large batch (16 or more), and a lot of data (tens of thousands of images or more),
#      you can set mosaic=True and train from randomly initialized parameters, but the result is still worse than with pretraining. (Large data sets such as COCO can do this.)
#   2. Understand the ImageNet data set: first train a classification model to obtain the backbone weights; the backbone of the classification model is shared with this model, and train based on that.
#----------------------------------------------------------------------------------------------------------------------------#
model_path      = 'model_data/yolov7_weights.h5'
#------------------------------------------------------#
#   input_shape     input shape; must be a multiple of 32
#------------------------------------------------------#
input_shape     = [640, 640]
#------------------------------------------------------#
#   phi             the YoloV7 version to use: l or x
#------------------------------------------------------#
phi             = 'l'
#------------------------------------------------------------------#
#   mosaic              mosaic data augmentation.
#   mosaic_prob         probability of using mosaic augmentation at each step, default 50%.
#
#   mixup               whether to use mixup augmentation; only effective when mosaic=True.
#                       mixup is only applied to images that have already been mosaic-augmented.
#   mixup_prob          probability of applying mixup after mosaic, default 50%.
#                       the total mixup probability is mosaic_prob * mixup_prob.
#
#   special_aug_ratio   following YoloX: images generated by Mosaic deviate far from the real distribution of natural images.
#                       when mosaic=True, this code enables mosaic only within the special_aug_ratio range.
#                       default is the first 70% of epochs, i.e. 70 out of 100 epochs.
#------------------------------------------------------------------#
mosaic              = True
mosaic_prob         = 0.5
mixup               = True
mixup_prob          = 0.5
special_aug_ratio   = 0.7
#------------------------------------------------------------------#
#   label_smoothing     label smoothing. Usually 0.01 or less, e.g. 0.01, 0.005.
#------------------------------------------------------------------#
label_smoothing     = 0

#----------------------------------------------------------------------------------------------------------------------------#
#   Training is divided into two stages: the freeze stage and the unfreeze stage. The freeze stage exists for users whose hardware is limited.
#   Freeze training needs less GPU memory. If your GPU is very weak, you can set Freeze_Epoch equal to UnFreeze_Epoch and Freeze_Train = True to do freeze training only.
#
#   Some suggested parameter settings; adjust them flexibly according to your needs:
#   (1) Training from the pretrained weights of the whole model:
#       Adam:
#           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True, optimizer_type = 'adam', Init_lr = 1e-3, weight_decay = 0. (frozen)
#           Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False, optimizer_type = 'adam', Init_lr = 1e-3, weight_decay = 0. (not frozen)
#       SGD:
#           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 300, Freeze_Train = True, optimizer_type = 'sgd', Init_lr = 1e-2, weight_decay = 5e-4. (frozen)
#           Init_Epoch = 0, UnFreeze_Epoch = 300, Freeze_Train = False, optimizer_type = 'sgd', Init_lr = 1e-2, weight_decay = 5e-4. (not frozen)
#       Here UnFreeze_Epoch can be adjusted between 100 and 300.
#   (2) Training from scratch:
#       Init_Epoch = 0, UnFreeze_Epoch >= 300, Unfreeze_batch_size >= 16, Freeze_Train = False (no freeze training)
#       Here UnFreeze_Epoch should preferably not be less than 300. optimizer_type = 'sgd', Init_lr = 1e-2, mosaic = True.
#   (3) Setting batch_size:
#       As large as your GPU allows. Running out of memory is unrelated to data set size; if you get an OOM / CUDA out of memory error, reduce batch_size.
#       Because of BatchNorm, batch_size must be at least 2, never 1.
#       Normally Freeze_batch_size is recommended to be 1-2 times Unfreeze_batch_size. A large gap is not recommended, since it affects the automatic learning-rate adjustment.
#----------------------------------------------------------------------------------------------------------------------------#
#------------------------------------------------------------------#
#   Freeze-stage training parameters
#   The backbone of the model is frozen, so the feature extraction network does not change
#   GPU memory usage is small; only the rest of the network is fine-tuned
#   Init_Epoch          the epoch the model starts from; it can be greater than Freeze_Epoch, e.g.:
#                       Init_Epoch = 60, Freeze_Epoch = 50, UnFreeze_Epoch = 100
#                       skips the freeze stage, starts directly from epoch 60, and adjusts the learning rate accordingly.
#                       (used when resuming from a checkpoint)
#   Freeze_Epoch        the number of epochs of freeze training
#                       (ignored when Freeze_Train=False)
#   Freeze_batch_size   the batch_size used during freeze training
#                       (ignored when Freeze_Train=False)
#------------------------------------------------------------------#
Init_Epoch          = 0
Freeze_Epoch        = 50
Freeze_batch_size   = 8
#------------------------------------------------------------------#
#   Unfreeze-stage training parameters
#   The backbone is no longer frozen, so the feature extraction network changes
#   GPU memory usage is larger and all parameters of the network are updated
#   UnFreeze_Epoch          the total number of training epochs
#                           SGD takes longer to converge, so use a larger UnFreeze_Epoch
#                           Adam can use a relatively small UnFreeze_Epoch
#   Unfreeze_batch_size     the batch_size after unfreezing
#------------------------------------------------------------------#
UnFreeze_Epoch      = 300
Unfreeze_batch_size = 4
#------------------------------------------------------------------#
#   Freeze_Train    whether to do freeze training
#                   by default the backbone is frozen first and unfrozen afterwards.
#------------------------------------------------------------------#
Freeze_Train        = True

#------------------------------------------------------------------#
#   Other training parameters: learning rate, optimizer, learning-rate decay
#------------------------------------------------------------------#
#------------------------------------------------------------------#
#   Init_lr         the maximum learning rate of the model
#                   Init_lr=1e-3 is recommended with the Adam optimizer
#                   Init_lr=1e-2 is recommended with the SGD optimizer
#   Min_lr          the minimum learning rate, by default 0.01 times the maximum learning rate
#------------------------------------------------------------------#
Init_lr             = 1e-2
Min_lr              = Init_lr * 0.01
#------------------------------------------------------------------#
#   optimizer_type  the optimizer to use: adam or sgd
#                   Init_lr=1e-3 is recommended with the Adam optimizer
#                   Init_lr=1e-2 is recommended with the SGD optimizer
#   momentum        the momentum parameter used inside the optimizer
#   weight_decay    weight decay, which helps prevent overfitting
#                   adam causes weight_decay errors; set it to 0 when using adam.
#------------------------------------------------------------------#
optimizer_type      = "sgd"
momentum            = 0.937
weight_decay        = 5e-4
#------------------------------------------------------------------#
#   lr_decay_type   learning-rate decay schedule: 'step' or 'cos'
#------------------------------------------------------------------#
lr_decay_type       = 'cos'
#------------------------------------------------------------------#
#   save_period     how many epochs between weight checkpoints
#------------------------------------------------------------------#
save_period         = 10
#------------------------------------------------------------------#
#   save_dir        folder where weights and log files are saved
#------------------------------------------------------------------#
save_dir            = 'logs'
#------------------------------------------------------------------#
#   eval_flag       whether to evaluate on the validation set during training
#                   the evaluation experience is better with pycocotools installed.
#   eval_period     how many epochs between evaluations; frequent evaluation is not recommended
#                   evaluation takes a lot of time, and evaluating frequently makes training very slow
#   The mAP obtained here differs from that of get_map.py for two reasons:
#   (1) The mAP obtained here is the mAP of the validation set.
#   (2) Conservative evaluation parameters are used here to speed up evaluation.
#------------------------------------------------------------------#
eval_flag           = True
eval_period         = 10
#------------------------------------------------------------------#
#   num_workers     whether to use multi-threaded data loading; 1 disables multi-threading
#                   enabling it speeds up data loading but uses more memory
#                   multi-threading in keras is sometimes much slower instead
#                   enable it only when IO is the bottleneck, i.e. GPU computation is much faster than image loading.
#------------------------------------------------------------------#
num_workers         = 1

#------------------------------------------------------#
#   train_annotation_path   training images and labels
#   val_annotation_path     validation images and labels
#------------------------------------------------------#
train_annotation_path   = '2007_train.txt'
val_annotation_path     = '2007_val.txt'
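To illustrate how the 'cos' option decays the learning rate from Init_lr down to Min_lr over UnFreeze_Epoch epochs, here is a small NumPy sketch of a plain cosine schedule. The repository's actual scheduler may add warm-up and other details, so treat this only as an approximation of the curve's shape.

import numpy as np

def cosine_lr(epoch, init_lr=1e-2, min_lr=1e-4, total_epochs=300):
    # Plain cosine decay from init_lr down to min_lr over total_epochs
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + np.cos(np.pi * progress))

for e in (0, 75, 150, 225, 300):
    print(e, round(cosine_lr(e), 5))   # 0.01 at the start, ~0.00505 halfway, 0.0001 at the end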

4. Prediction of training results

Two files are required for training result prediction, namely yolo.py and predict.py.
We first need to modify model_path and classes_path in yolo.py; these two parameters must be modified.

model_path points to the trained weight file in the logs folder.
classes_path points to the txt corresponding to the detection category.

insert image description here
After completing the modification, you can run predict.py for detection. Once it is running, enter an image path to detect it.
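The detection loop in predict.py follows the usual pattern of this series of repositories: build a YOLO object from yolo.py and call its detect_image method on a PIL image. The snippet below is only a sketch of that pattern; the exact class and method names should be checked against the downloaded code.

from PIL import Image
from yolo import YOLO

yolo = YOLO()   # uses the model_path and classes_path configured in yolo.py

while True:
    img_path = input('Input image filename:')
    try:
        image = Image.open(img_path)
    except Exception:
        print('Open Error! Try again!')
        continue
    r_image = yolo.detect_image(image)
    r_image.show()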


Origin blog.csdn.net/weixin_44791964/article/details/127503879