Detailed explanation of YOLOv7 network structure and training your own data

YOLOv7 is an excellent end-to-end detection algorithm, proposed in 2022 by Alexey Bochkovskiy, Chien-Yao Wang, et al. (the YOLOv4 team). YOLOv7 surpasses all known object detectors in both speed and accuracy in the range of 5 FPS to 160 FPS, and has the highest accuracy, 56.8% AP, among all known real-time object detectors at 30 FPS or higher.

Paper: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors (https://arxiv.org/abs/2207.02696)

GitHub: WongKinYiu/yolov7 (https://github.com/WongKinYiu/yolov7)


1. Overall introduction to the project

The yolov7 project folder and the dataset folder sit at the same level. The dataset is split into images and labels, and each of these is further split into three folders: train (training set), val (validation set), and test (test set).
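A typical layout looks like this (the folder names here are the conventional ones; adjust them to your own dataset):

parent/
├── yolov7/
└── dataset/
    ├── images/
    │   ├── train/
    │   ├── val/
    │   └── test/
    └── labels/
        ├── train/
        ├── val/
        └── test/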

Inside the yolov7 folder:

- cfg: model configuration files (model.yaml).
- data: the dataset configuration file (data.yaml) and the hyperparameter configuration files (hyp.*.yaml).
- deploy: a demo for deployment on the NVIDIA Triton Inference Server.
- figure: demo result images of yolov7 (3D detection, keypoint detection, etc.).
- inference: data used for inference (images, folders).
- models: the code that builds the yolov7 network structure.
- paper: the yolov7 paper.
- runs: training and testing results.
- tools: tools in ipynb format (model conversion, model comparison, etc.).
- utils: utility functions (activation functions, plotting functions, etc.).
- .gitignore: git's ignore file.
- LICENSE.md: the license file.
- README.md: the usage instructions.
- detect.py: detection code.
- export.py: model export code.
- hubconf.py: the PyTorch Hub entry file.
- requirements.txt: the environment dependency file.
- test.py: test code.
- train.py: the training script for yolov7-tiny and yolov7.
- train_aux.py: the training script for yolov7-w6 and yolov7-e6.

2. Introduction to network structure

1. Overall structure diagram

The overall structure of yolov7 consists of four parts: Input, Backbone, Head, and Detect. Input is the 640*640*3 image input. Backbone is the backbone network composed of CBS, ELAN, and MP-1 modules. Head consists of CBS, SPPCSPC, E-ELAN, MP-2, and RepConv. Detect is the three detection heads. Except for the Detect module, whose code is in models/yolo.py, all other modules are implemented in models/common.py.

2. CBS

The code is as follows:

import torch
import torch.nn as nn


def autopad(k, p=None):  # kernel, padding
    # Pad to 'same' output size when no padding is given
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


class Conv(nn.Module):
    # Standard convolution: Conv2d + BatchNorm2d + SiLU (hence CBS)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        # used after Conv2d and BatchNorm2d have been fused into a single layer
        return self.act(self.conv(x))

The structure is as follows:

In the diagram, (3, 2) denotes a 3*3 convolution kernel with a stride of 2. The CBS module is composed of a two-dimensional convolution, batch normalization, and the SiLU activation function.
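As a quick illustration (the shapes here are hypothetical), a CBS block with a 3*3 kernel and stride 2 halves the spatial resolution:

cbs = Conv(3, 32, 3, 2)          # 3*3 kernel, stride 2
x = torch.randn(1, 3, 640, 640)  # a dummy 640*640 RGB input
print(cbs(x).shape)              # torch.Size([1, 32, 320, 320])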

3. ELAN and E-ELAN

The ELAN module is designed around controlling the longest and shortest gradient paths: the shortest path is kept short, while more blocks are stacked along the longer paths, so the network can learn more features and still converge effectively.

  The structure is as follows (E-ELAN on the left, ELAN on the right):
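A minimal sketch of the ELAN idea, reusing the Conv module above (the branch count and channel widths are simplified assumptions, not the repo's exact graph):

class ELANSketch(nn.Module):
    # Two 1*1 branches plus a chain of 3*3 convolutions; the intermediate
    # outputs are all kept, concatenated, and fused by a final 1*1 conv.
    def __init__(self, c1, c2):
        super(ELANSketch, self).__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1)      # shortest path
        self.cv2 = Conv(c1, c_, 1, 1)      # start of the longest path
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 3, 1)
        self.cv5 = Conv(c_, c_, 3, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        self.cv7 = Conv(4 * c_, c2, 1, 1)  # fuse all collected features

    def forward(self, x):
        y1 = self.cv1(x)
        y2 = self.cv2(x)
        y3 = self.cv4(self.cv3(y2))
        y4 = self.cv6(self.cv5(y3))
        return self.cv7(torch.cat((y1, y2, y3, y4), 1))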

4. MP

The overall role of MP is to downsample while reducing feature loss. The MP-1 module consists of 2 branches, and the MP-2 module consists of 3 branches. The first branch first uses MaxPool (maximum pooling) to downsample, then uses a 1*1 convolution to change the number of channels. The other branch of MP-1 uses a 1*1 convolution to change the number of channels, followed by a 3*3 convolution with a stride of 2 that performs the downsampling. The branch outputs are then concatenated, as shown in the sketch after the figure below.

Network structure (MP-1 on the left, MP-2 on the right):
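A minimal sketch of MP-1's two-branch downsampling, reusing the Conv module above (the channel widths are simplified assumptions):

class MP1Sketch(nn.Module):
    # Branch 1: MaxPool downsampling, then a 1*1 conv to adjust channels.
    # Branch 2: a 1*1 conv, then a stride-2 3*3 conv for downsampling.
    def __init__(self, c1):
        super(MP1Sketch, self).__init__()
        c_ = c1 // 2
        self.mp = nn.MaxPool2d(kernel_size=2, stride=2)
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 2)

    def forward(self, x):
        y1 = self.cv1(self.mp(x))      # pooling branch
        y2 = self.cv3(self.cv2(x))     # strided-convolution branch
        return torch.cat((y1, y2), 1)  # same channels as input, half the resolution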

5. SPPCSPC

The function of SPP is to fuse information from different feature scales; it processes the input with maximum pooling at four different scales, using pooling kernels of size 13*13, 9*9, 5*5, and 1*1 (1*1 meaning the input is passed through unchanged). The CSP design splits the features into two parts: one part goes through the SPP structure, while the other only has its channel count adjusted by a 1*1 convolution. Finally, the two parts are concatenated. SPPCSPC thus fuses information across feature scales while reducing the amount of computation and improving speed.

The code is as follows:

class SPPCSPC(nn.Module):
    # CSP https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13)):
        super(SPPCSPC, self).__init__()
        c_ = int(2 * c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
        self.cv5 = Conv(4 * c_, c_, 1, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        self.cv7 = Conv(2 * c_, c2, 1, 1)

    def forward(self, x):
        x1 = self.cv4(self.cv3(self.cv1(x)))  # input to the SPP branch
        y1 = self.cv6(self.cv5(torch.cat([x1] + [m(x1) for m in self.m], 1)))  # multi-scale pooling, then fuse
        y2 = self.cv2(x)  # bypass branch: 1*1 conv only
        return self.cv7(torch.cat((y1, y2), dim=1))

The network structure is as follows: 
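A quick shape check (the sizes here are hypothetical): because all pooling layers use stride 1 with matching padding, SPPCSPC keeps the spatial size and only maps c1 channels to c2:

m = SPPCSPC(1024, 512)
x = torch.randn(1, 1024, 20, 20)
print(m(x).shape)  # torch.Size([1, 512, 20, 20])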

6. RepConv

yolov7 replaces Conv with RepConv at the end of the head. During training, RepConv consists of multiple branches: a 3*3 convolution, a 1*1 convolution, and an identity mapping. For inference, these branches are re-parameterized into a single 3*3 convolution, which reduces the number of parameters and speeds up inference. RepConv therefore learns richer features during training while staying fast at inference time.

The code is as follows:

class RepConv(nn.Module):
    # Represented convolution
    # https://arxiv.org/abs/2101.03697

    def __init__(self, c1, c2, k=3, s=1, p=None, g=1, act=True, deploy=False):
        super(RepConv, self).__init__()

        self.deploy = deploy
        self.groups = g
        self.in_channels = c1
        self.out_channels = c2

        assert k == 3
        assert autopad(k, p) == 1

        padding_11 = autopad(k, p) - k // 2

        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

        if deploy:
            self.rbr_reparam = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=True)

        else:
            self.rbr_identity = (nn.BatchNorm2d(num_features=c1) if c2 == c1 and s == 1 else None)

            self.rbr_dense = nn.Sequential(
                nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2),
            )

            self.rbr_1x1 = nn.Sequential(
                nn.Conv2d(c1, c2, 1, s, padding_11, groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2),
            )

    def forward(self, inputs):
        if hasattr(self, "rbr_reparam"):
            return self.act(self.rbr_reparam(inputs))

        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)

        return self.act(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)

The training process is structured as follows:
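A minimal sketch of the re-parameterization idea (an assumed simplification, not the repo's exact fusing code): each conv+BN pair folds into a single convolution with bias, and a 1*1 kernel is equivalent to a 3*3 kernel that is zero everywhere except the center, so the three branches can be summed into one 3*3 convolution for inference:

import torch.nn.functional as F

def fuse_conv_bn(conv_weight, bn):
    # fold BatchNorm statistics into the convolution weight and bias
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def pad_1x1_to_3x3(w_1x1):
    # zero-pad a 1*1 kernel to an equivalent 3*3 kernel
    return F.pad(w_1x1, [1, 1, 1, 1])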

 

7. Detect

The code is in class IDetect(nn.Module) in yolo.py. In the paper's illustration (d), a label assigner generates soft labels from the predictions of the Lead head and the GT (ground-truth labels), and these soft labels are used to train both the Aux head (auxiliary head) and the Lead head. The reasoning is that the Lead head has relatively strong learning ability, so the soft labels it produces should better represent the distribution of, and correlation between, the source data and the target data. This also acts as a form of residual learning: letting the shallower Aux head directly learn the information the Lead head has already learned frees the Lead head to focus on the residual information that has not yet been learned. Illustration (e) additionally applies a coarse-to-fine scheme to this lead-head-guided assignment.

3. Train your own data

The training process is quite different from that of other YOLO versions; see the README directly: https://github.com/WongKinYiu/yolov7#readme
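As an illustration (the paths and class names here are hypothetical), a minimal data.yaml for a custom dataset follows the familiar YOLO format, and each label file is a txt containing one "class x_center y_center width height" line per object, with coordinates normalized to [0, 1]:

train: ../dataset/images/train
val: ../dataset/images/val
test: ../dataset/images/test

nc: 2
names: ['cat', 'dog']

Training can then be launched with the single-GPU command from the repository README, pointing --data at your own file (check the README for the current flags):

python train.py --workers 8 --device 0 --batch-size 32 --data data/custom.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7-custom --hyp data/hyp.scratch.p5.yaml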

