[Neural Network] YoloV7

The Yolo series is a classic family of object detection networks. It is anchor-based: a set of prior (anchor) boxes is generated first, and these are then filtered and adjusted to obtain the prediction boxes. It is also a one-stage detector, meaning no additional network structure is required to filter the prior boxes. Together, these two characteristics give it fast inference speed and relatively high accuracy.

The more common Yolo networks include YoloV3, YoloV5, and YoloV7. This article focuses on the YoloV7 network.

1. YoloV7

The network structure of YoloV7 is shown in the figure. It can be divided into three parts: Backbone, FPN, and Yolo Head. The Backbone is responsible for extracting features from the input image and ultimately outputs three feature layers of different sizes. The FPN is responsible for enhanced feature extraction: it performs partial feature fusion on the three feature layers extracted by the Backbone (fusing feature information at different scales, mainly through upsampling and downsampling). The Yolo Head is responsible for classifying and regressing the anchors and outputting the final prediction boxes.
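Before going into each part, here is a minimal structural sketch of this three-part pipeline in PyTorch; all submodules are placeholders for the structures described below, and the names are illustrative:

```python
import torch.nn as nn

class YoloV7Skeleton(nn.Module):
    """Minimal sketch of the three-part pipeline; submodules are placeholders."""
    def __init__(self, backbone, fpn, heads):
        super().__init__()
        self.backbone = backbone            # feature extraction -> three feature layers
        self.fpn = fpn                      # enhanced feature extraction / fusion
        self.heads = nn.ModuleList(heads)   # one Yolo Head per feature layer

    def forward(self, x):
        feat1, feat2, feat3 = self.backbone(x)      # three sizes from the Backbone
        p3, p4, p5 = self.fpn(feat1, feat2, feat3)  # fused by up/downsampling
        return [h(p) for h, p in zip(self.heads, (p3, p4, p5))]  # raw predictions
```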

1. Backbone

As can be seen from the network structure diagram above, unlike Faster R-CNN and other networks that directly use an off-the-shelf backbone network, the backbone of YoloV7 is mainly composed of Multi_Concat_Block (a multi-branch stacking module) and Transition_Block (a transition module).

The structure of Multi_Concat_Block is shown in the figure. It is mainly composed of convolution + normalization + activation (Conv-BN-SiLU) units. There are four branches, which pass through 1, 1, 3, and 5 such units respectively. Finally, the feature layers of the four branches are stacked (concatenated) and integrated by one more Conv-BN-SiLU unit.
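A minimal PyTorch sketch of this module, assuming a Conv-BN-SiLU unit; the hidden channel counts are illustrative, since the exact widths vary by backbone stage:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # The "convolution + normalization + activation" unit: Conv2d + BN + SiLU
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Multi_Concat_Block(nn.Module):
    # Four branches passing through 1, 1, 3 and 5 Conv units, then concat + fuse
    def __init__(self, c1, c_, c2):
        super().__init__()
        self.cv1 = Conv(c1, c_, 1)   # branch 1: one 1x1 Conv unit
        self.cv2 = Conv(c1, c_, 1)   # branch 2: one 1x1 Conv unit
        self.cv3 = nn.ModuleList(Conv(c_, c_, 3) for _ in range(4))
        self.cv4 = Conv(4 * c_, c2, 1)  # final fusing Conv over the stacked branches

    def forward(self, x):
        x1, x2 = self.cv1(x), self.cv2(x)
        x3 = self.cv3[1](self.cv3[0](x2))  # branch 3: 3 Conv units in total
        x4 = self.cv3[3](self.cv3[2](x3))  # branch 4: 5 Conv units in total
        return self.cv4(torch.cat([x1, x2, x3, x4], dim=1))
```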

Transition_Block is an innovation of this network; it is a transition (downsampling) module. In a typical convolutional network, the transition module uses a 3x3 convolution plus a max pooling with stride 2. This module instead has two branches: the left branch is a max pooling with stride 2 followed by a 1x1 convolution; the right branch is a 1x1 convolution followed by a 3x3 convolution with stride 2. The outputs of the two branches are stacked (concatenated) at the end.
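A sketch of the two-branch transition, reusing the Conv unit from the previous sketch; note that the final concatenation doubles the channel count:

```python
import torch
import torch.nn as nn

class Transition_Block(nn.Module):
    # Two downsampling branches whose outputs are stacked along the channel axis
    def __init__(self, c1, c2):
        super().__init__()
        self.mp = nn.MaxPool2d(kernel_size=2, stride=2)  # left: max pooling, stride 2
        self.cv1 = Conv(c1, c2, 1)                       # left: then a 1x1 Conv
        self.cv2 = Conv(c1, c2, 1)                       # right: 1x1 Conv first
        self.cv3 = Conv(c2, c2, 3, s=2)                  # right: 3x3 Conv, stride 2

    def forward(self, x):
        left = self.cv1(self.mp(x))       # HxW -> H/2 x W/2 via pooling
        right = self.cv3(self.cv2(x))     # HxW -> H/2 x W/2 via strided conv
        return torch.cat([left, right], dim=1)  # output channels: 2 * c2
```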

2. FPN

YoloV7's Backbone extracts three feature layers in total. When the input is 640x640x3, their sizes are 80x80x512, 40x40x1024, and 20x20x1024. The FPN performs enhanced feature extraction on these three layers. The specific steps are as follows (see the code sketch after the list):

① The deepest feature layer (20x20x1024) passes through SPPCSPC to obtain P5; this module enlarges the receptive field of YoloV7.

② P5 passes through a 1x1 convolution to adjust its channels and is then upsampled; it is combined (Concat) with the second feature layer (40x40x1024, itself adjusted by a convolution), and a Multi_Concat_Block then performs feature extraction to obtain P4.

③ P4 likewise passes through a 1x1 convolution to adjust its channels and is then upsampled; it is fused with the first feature layer (80x80x512, itself adjusted by a convolution), and a Multi_Concat_Block then performs feature extraction to obtain P3.

④ P3 is downsampled by a Transition_Block and stacked with P4; a Multi_Concat_Block then extracts features to obtain P4_Out, whose size is 40x40x256.

⑤ P4_Out is downsampled by a Transition_Block and stacked with P5; a Multi_Concat_Block then extracts features to obtain P5_Out, whose size is 20x20x512.
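Putting the five steps together, a sketch of the FPN wiring might look as follows. It reuses the Conv, Multi_Concat_Block, and Transition_Block sketches above; SPPCSPC is heavily simplified here to just its receptive-field-enlarging parallel poolings, and the hidden channel widths are assumptions chosen to reproduce the sizes stated above:

```python
import torch
import torch.nn as nn

class SPPCSPC(nn.Module):
    # Simplified stand-in: parallel max-poolings of different kernel sizes
    def __init__(self, c1, c2):
        super().__init__()
        self.cv1 = Conv(c1, c2, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, 1, k // 2) for k in (5, 9, 13))
        self.cv2 = Conv(4 * c2, c2, 1)

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.pools], dim=1))

class YoloV7FPN(nn.Module):
    def __init__(self):
        super().__init__()
        self.sppcspc = SPPCSPC(1024, 512)                 # step ①
        self.conv_p5, self.conv_f2 = Conv(512, 256, 1), Conv(1024, 256, 1)
        self.mcb_p4 = Multi_Concat_Block(512, 128, 256)   # step ②
        self.conv_p4, self.conv_f1 = Conv(256, 128, 1), Conv(512, 128, 1)
        self.mcb_p3 = Multi_Concat_Block(256, 64, 128)    # step ③
        self.down_p3 = Transition_Block(128, 128)         # step ④ (outputs 256 ch)
        self.mcb_p4_out = Multi_Concat_Block(512, 128, 256)
        self.down_p4 = Transition_Block(256, 256)         # step ⑤ (outputs 512 ch)
        self.mcb_p5_out = Multi_Concat_Block(1024, 256, 512)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, feat1, feat2, feat3):  # 80x80x512, 40x40x1024, 20x20x1024
        p5 = self.sppcspc(feat3)                                                   # ①
        p4 = self.mcb_p4(torch.cat([self.conv_f2(feat2),
                                    self.up(self.conv_p5(p5))], 1))                # ②
        p3 = self.mcb_p3(torch.cat([self.conv_f1(feat1),
                                    self.up(self.conv_p4(p4))], 1))                # ③
        p4_out = self.mcb_p4_out(torch.cat([self.down_p3(p3), p4], 1))   # ④ 40x40x256
        p5_out = self.mcb_p5_out(torch.cat([self.down_p4(p4_out), p5], 1))  # ⑤ 20x20x512
        return p3, p4_out, p5_out
```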

3. Yolo Head

YoloV7 uses a RepConv structure before each Yolo Head. A specially designed multi-branch residual structure is used during training, but at prediction time the branches are re-parameterized into an equivalent single 3x3 convolution, so the extra branches do not degrade inference performance.
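The idea can be illustrated with a simplified two-branch version; the real RepConv also carries BatchNorm and, when shapes allow, an identity branch, which fuse in the same way:

```python
import torch.nn as nn

class RepConvSketch(nn.Module):
    """Simplified RepConv: multi-branch while training, one 3x3 conv after fusing."""
    def __init__(self, c, deploy=False):
        super().__init__()
        self.deploy = deploy
        if deploy:
            self.rep = nn.Conv2d(c, c, 3, 1, 1)        # inference: single 3x3 conv
        else:
            self.dense = nn.Conv2d(c, c, 3, 1, 1)      # training branch 1: 3x3
            self.pointwise = nn.Conv2d(c, c, 1, 1, 0)  # training branch 2: 1x1

    def forward(self, x):
        if self.deploy:
            return self.rep(x)
        return self.dense(x) + self.pointwise(x)       # branch outputs are summed

    def fuse(self):
        # A 1x1 kernel is a 3x3 kernel that is zero except at the center, so
        # both branches can be merged into one equivalent 3x3 convolution.
        c = self.dense.in_channels
        rep = nn.Conv2d(c, self.dense.out_channels, 3, 1, 1)
        w = self.dense.weight.data.clone()
        w[:, :, 1:2, 1:2] += self.pointwise.weight.data
        rep.weight.data = w
        rep.bias.data = self.dense.bias.data + self.pointwise.bias.data
        self.rep, self.deploy = rep, True
        return self
```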

After RepConv, the results can be decoded. From the FPN and the heads we obtain three outputs: (N, 20, 20, 255), (N, 40, 40, 255), and (N, 80, 80, 255), where N is the batch size. The 255 channels of each output can be decomposed into 3 × 85, corresponding to the 85 parameters of each of the 3 prior boxes at that feature point.

The 85 parameters of each prior box can be split into 4 + 1 + 80: 4 are the regression parameters of the feature point, used to obtain the adjusted prediction box; 1 is used to judge whether the feature point contains an object; and 80 are used to determine the class of the object contained at that feature point.
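A concrete example of this split, on a dummy head output:

```python
import torch

# One head output (N, 20, 20, 255); 255 = 3 anchors x 85 parameters,
# and 85 = 4 (box regression) + 1 (objectness) + 80 (classes)
out = torch.randn(1, 20, 20, 255)        # dummy output, batch size N = 1
out = out.view(1, 20, 20, 3, 85)         # (N, H, W, anchors, params)

box_reg    = out[..., 0:4]   # x, y offsets and w, h adjustments
objectness = out[..., 4:5]   # does this feature point contain an object?
class_prob = out[..., 5:85]  # scores for the 80 classes (e.g. COCO)

print(box_reg.shape, objectness.shape, class_prob.shape)
# torch.Size([1, 20, 20, 3, 4]) torch.Size([1, 20, 20, 3, 1]) torch.Size([1, 20, 20, 3, 80])
```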

Decoding can be roughly divided into the following steps (see the sketch after the list):

① Predict the center point: use the first two regression parameters to compute the center point offset.

② Predict the width and height: use the last two regression parameters to compute the width and height of the prediction box.

③ Draw the prediction boxes on the image.

At the same time, a non-maximum suppression (NMS) operation is required to remove duplicate prediction boxes of the same class around one object.
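A sketch combining the three steps with NMS for one feature level, assuming the YoloV5/V7-style sigmoid decoding formula; class-agnostic NMS from torchvision is used here for brevity, while implementations usually run NMS per class:

```python
import torch
from torchvision.ops import nms

def decode_and_nms(pred, anchors, stride, conf_thres=0.25, iou_thres=0.45):
    """pred: (H, W, 3, 85) raw head output; anchors: (3, 2) anchor w/h in pixels."""
    h, w = pred.shape[:2]
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((gx, gy), -1).float().unsqueeze(2)   # (H, W, 1, 2) cell coords

    p = torch.sigmoid(pred)
    xy = (p[..., 0:2] * 2.0 - 0.5 + grid) * stride          # step ①: center = cell + offset
    wh = (p[..., 2:4] * 2.0) ** 2 * anchors                 # step ②: w/h scaled from anchors
    conf = p[..., 4:5] * p[..., 5:]                         # objectness x class score

    boxes = torch.cat((xy - wh / 2, xy + wh / 2), -1).reshape(-1, 4)  # x1y1x2y2
    scores, labels = conf.reshape(-1, conf.shape[-1]).max(-1)
    keep = scores > conf_thres
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thres)                    # drop duplicate boxes
    return boxes[keep], scores[keep], labels[keep]          # step ③: ready to draw
```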

4. Loss

YoloV7's loss consists of three parts: a regression part (Reg), an objectness part (Obj), and a classification part (Cls).
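A sketch of how the three parts might be combined; the loss functions (an IoU-based box loss, BCE for objectness and classification) follow the general YoloV5/V7 recipe, and the weights are illustrative assumptions rather than the official values:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def yolov7_loss(iou, pred_obj, target_obj, pred_cls, target_cls):
    loss_reg = (1.0 - iou).mean()           # Reg: penalize low box/target overlap
    loss_obj = bce(pred_obj, target_obj)    # Obj: does the point contain an object?
    loss_cls = bce(pred_cls, target_cls)    # Cls: which of the 80 classes?
    return 0.05 * loss_reg + 1.0 * loss_obj + 0.5 * loss_cls  # assumed weights
```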

 
