Yolov7 paper study - innovation point analysis, network structure diagram

Innovation

1. E-ELAN is proposed, but it is only used in yolov7-e6e.
2. A compound scaling method for concatenation-based models is proposed and used in yolov7x.
3. Re-parameterized convolution is adapted so it can be applied to residual modules or concatenation-based modules: RepConvN.
4. Two new label assignment methods are proposed.

1. ELAN and E-ELAN

1. ELAN

yolov7 uses a large number of ELAN blocks as basic modules; this heavy stacking effectively corresponds to a denser residual structure. Residual networks are easy to optimize and gain accuracy from considerable depth: the skip connections inside the residual blocks alleviate the vanishing-gradient problem that comes with increasing depth in deep neural networks. In yolov7.yaml, one ELAN block looks like this:

   [-1, 1, Conv, [64, 1, 1]],
   [-2, 1, Conv, [64, 1, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]],  # 11
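The yaml above can be read off as a module. A minimal PyTorch sketch of that ELAN block (the class name `ELAN` and argument names are mine, and each `Conv` entry is simplified to a plain `nn.Conv2d`, dropping the BN and activation of yolov7's real Conv wrapper):

```python
import torch
import torch.nn as nn

class ELAN(nn.Module):
    # Minimal sketch of the ELAN block in the yaml above.
    # BN/activation omitted; each Conv entry becomes a plain nn.Conv2d.
    def __init__(self, c1, c_mid, c2):
        super().__init__()
        self.cv1 = nn.Conv2d(c1, c_mid, 1)      # [-1, Conv, [64, 1, 1]]
        self.cv2 = nn.Conv2d(c1, c_mid, 1)      # [-2, Conv, [64, 1, 1]]
        self.m1 = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1),
                                nn.Conv2d(c_mid, c_mid, 3, padding=1))
        self.m2 = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1),
                                nn.Conv2d(c_mid, c_mid, 3, padding=1))
        self.out = nn.Conv2d(4 * c_mid, c2, 1)  # Concat -> Conv [256, 1, 1]

    def forward(self, x):
        a = self.cv1(x)   # first 1x1 branch
        b = self.cv2(x)   # second 1x1 branch, start of the deep path
        d = self.m1(b)    # after two 3x3 convs
        f = self.m2(d)    # after four 3x3 convs
        # Concat of layers -1, -3, -5, -6 in the yaml indexing
        return self.out(torch.cat([f, d, b, a], dim=1))
```

With c1=128, c_mid=64, c2=256 this reproduces the channel counts of the yaml: four 64-channel feature maps are concatenated and projected to 256 channels.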

2. E-ELAN

E-ELAN is essentially an extension of ELAN based on group convolution. It is mentioned only for yolov7-e6e, where it is obtained by running the ELAN blocks of yolov7-e6 in parallel.

2. Model scaling

The authors designed a compound model scaling method that changes depth and width at the same time; yolov7x is yolov7 scaled in this way.
To increase depth, two convolutional layers are added to the stack; to increase width, the number of input channels, the number of channels after concatenation, and the number of channels output by each convolutional layer are all 1.25 times the original.

3. Improved re-parameterized convolution: RepConvN

1. RepConv

Re-parameterized convolution trains with 3 different parallel convolutional branches that are merged into a single convolution after training. Although re-parameterized convolution achieves good results on VGG-style networks, it does not achieve good results in residual networks.

2. RepConvN

RepConvN is re-parameterized convolution with the identity connection removed, so that it can be applied inside residual modules or concatenation-based modules. (In the released code, however, the plain re-parameterized convolution is used rather than this proposed variant.)
The idea comes from RepVGG. The basic idea is to introduce a specially designed residual structure that assists optimization during training; at inference time, this structure can be merged into an ordinary 3*3 convolution. The complexity of the network is reduced, but its prediction performance does not drop.

Why should we remove the identity connection?
Because the residual network itself already has an identity connection, and the original re-parameterized convolution RepConv also has one, the two conflict. Removing the identity connection from RepConv yields RepConvN.
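The merge step can be sketched numerically. This is a minimal illustration of fusing the two branches that remain in RepConvN (a 3*3 conv and a 1*1 conv; BatchNorm folding is omitted, and all weights are toy random values):

```python
import torch
import torch.nn.functional as F

# Sketch of branch fusion for RepConvN (no identity branch): a 3x3 conv
# and a 1x1 conv in parallel collapse into a single 3x3 conv at inference.
torch.manual_seed(0)
w3 = torch.randn(8, 4, 3, 3)   # 3x3 branch weights
w1 = torch.randn(8, 4, 1, 1)   # 1x1 branch weights
b3 = torch.randn(8)
b1 = torch.randn(8)

x = torch.randn(2, 4, 16, 16)
# Training-time structure: two parallel branches summed.
y_train = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1)

# Fuse: zero-pad the 1x1 kernel to 3x3, then add weights and biases.
w_fused = w3 + F.pad(w1, [1, 1, 1, 1])
b_fused = b3 + b1
y_deploy = F.conv2d(x, w_fused, b_fused, padding=1)

print(torch.allclose(y_train, y_deploy, atol=1e-4))  # True
```

The fused convolution produces the same output as the two-branch structure, which is why inference complexity drops without any loss in accuracy.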


4. Soft labels and hard labels

A hard label is the method used by yolov5: the loss is computed directly between the target value and the predicted value. A soft label is the method used by yolov7: the target value first passes through an assigner to produce a new target value, which is then used together with the predicted value to compute the loss.

5. Two new label assignment methods

Labels are assigned at two granularities: the coarse label uses 5 grids and the fine label uses 3 grids.

6. Yolov7 network structure diagram

The backbone produces 3 effective feature layers. The feature pyramid then fuses feature layers of different shapes, which helps extract better features; in yolov7 this neck feeds into the head. Downsampling uses the MP-1 module. In convolutional neural networks, downsampling is usually done with either a 3*3 convolution of stride 2 or a 2*2 max pooling of stride 2; MP-1 combines both. Its left branch is a stride-2 2*2 max pooling followed by a 1*1 convolution, and its right branch is a 1*1 convolution followed by a 3*3 convolution with stride 2; the results of the two branches are concatenated at the output. The number of output channels equals the input, but the spatial size is halved, making MP-1 a more expressive version of a max pooling layer: it compresses width and height while keeping the channel count unchanged. The SPPCSPC module enhances feature extraction; its function is to enlarge the receptive field, using max pooling kernels of sizes 5, 9 and 13 plus one branch with no pooling.
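The MP-1 structure just described can be sketched as follows (the class name `MP1` is mine, and each branch's Conv is simplified to a plain `nn.Conv2d` without BN/activation):

```python
import torch
import torch.nn as nn

class MP1(nn.Module):
    # Sketch of the MP-1 downsampling module:
    # left branch  = 2x2 max pooling (stride 2) + 1x1 conv,
    # right branch = 1x1 conv + 3x3 conv with stride 2,
    # outputs concatenated so channels stay equal to the input
    # while width/height are halved. BN/activation omitted.
    def __init__(self, c):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.cv_left = nn.Conv2d(c, c // 2, 1)
        self.cv_right1 = nn.Conv2d(c, c // 2, 1)
        self.cv_right2 = nn.Conv2d(c // 2, c // 2, 3, stride=2, padding=1)

    def forward(self, x):
        left = self.cv_left(self.pool(x))
        right = self.cv_right2(self.cv_right1(x))
        return torch.cat([left, right], dim=1)
```

Each branch outputs half the channels, so the concatenated result matches the input channel count at half the resolution.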


import torch
import torch.nn as nn
# Conv below is yolov7's Conv2d + BatchNorm + SiLU wrapper (models/common.py)

class SPPCSPC(nn.Module):
    # CSP https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13)):
        super(SPPCSPC, self).__init__()
        c_ = int(2 * c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
        self.cv5 = Conv(4 * c_, c_, 1, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        self.cv7 = Conv(2 * c_, c2, 1, 1)

    def forward(self, x):
        x1 = self.cv4(self.cv3(self.cv1(x)))
        y1 = self.cv6(self.cv5(torch.cat([x1] + [m(x1) for m in self.m], 1)))
        y2 = self.cv2(x)
        return self.cv7(torch.cat((y1, y2), dim=1))


In yolov7, each feature point on each feature layer has 3 prior (anchor) boxes.

The 255 output channels can be split into three groups of 85, corresponding to the 85 parameters of the three prior boxes; 85 splits into 4 + 1 + 80.
The first 4 parameters are the regression parameters of each feature point; after they adjust the prior box, the prediction box is obtained.
The 5th parameter indicates whether each feature point contains an object.
The last 80 parameters give the class of the object contained at each feature point.
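The 255 = 3 * (4 + 1 + 80) split above amounts to a reshape of the raw head output (toy tensor; the variable names are mine):

```python
import torch

# Splitting a 255-channel head output into 3 anchors x 85 values,
# where 85 = 4 box regression params + 1 objectness + 80 class scores.
bs, na, no, h, w = 1, 3, 85, 20, 20             # batch, anchors, outputs, grid h/w
pred = torch.randn(bs, na * no, h, w)           # raw head output, 255 channels
pred = pred.view(bs, na, no, h, w).permute(0, 1, 3, 4, 2)  # (bs, 3, h, w, 85)

box = pred[..., :4]    # regression parameters
obj = pred[..., 4:5]   # objectness
cls = pred[..., 5:]    # 80 class scores
```

Each of the 3 * 20 * 20 feature-point/anchor pairs then carries its own 85-value prediction vector.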

7. Network structure diagram of yolov7x


8. Detection module

The Detect module is used at inference time, while the IDetect and IAuxDetect modules are used during training. The I stands for implicit; see YOLOR to learn about implicit knowledge.

9. Relationship between documents


1. yolov7 and yolov7x are models for conventional GPUs. yolov7x is obtained from yolov7 by stack scaling on the neck and by using the proposed compound scaling method to scale the depth and width of the entire model.
2. yolov7-d6, yolov7-e6, yolov7-e6e and yolov7-w6 are cloud-GPU models. Starting from yolov7-w6, the newly proposed compound scaling method yields yolov7-e6 and yolov7-d6; applying the proposed E-ELAN to yolov7-e6 then completes yolov7-e6e.
3. yolov7-tiny and yolov7-tiny-silu are edge-GPU models. The only difference between them is the activation function: yolov7-tiny uses Leaky ReLU, while yolov7-tiny-silu uses SiLU (as the other models do).

10. Encoding of prediction results

For details, you can read this blog post: Smart target detection 61 - Pytorch builds YoloV7 target detection platform

1. Obtain prediction boxes and scores

The prediction results of the three feature layers do not directly correspond to the positions of the final prediction boxes on the image; they still need to be decoded.
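The decoding follows the yolov5/yolov7-style sigmoid formulation; a single-feature-point sketch, where the grid coordinates, anchor size and raw outputs are toy values of mine:

```python
import torch

# yolov5/yolov7-style decoding for one feature point (assumed form):
#   xy = (2 * sigmoid(tx, ty) - 0.5 + grid_xy) * stride
#   wh = (2 * sigmoid(tw, th)) ** 2 * anchor_wh
stride = 32.0
grid_xy = torch.tensor([5.0, 7.0])        # feature-point (cell) coordinates
anchor_wh = torch.tensor([142.0, 110.0])  # prior-box size in pixels
t = torch.tensor([0.2, -0.1, 0.3, 0.4])   # raw regression output

xy = (2 * torch.sigmoid(t[:2]) - 0.5 + grid_xy) * stride
wh = (2 * torch.sigmoid(t[2:])) ** 2 * anchor_wh
```

The sigmoid keeps the center offset within a bounded range around the cell and the width/height between 0 and 4 times the anchor size.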

2. Score screening and non-maximal suppression

After obtaining the final prediction results, score filtering and non-maximum suppression are required.
Score filtering: keep only the prediction boxes whose scores exceed the confidence threshold.
Non-maximum suppression: within a local region, keep only the highest-scoring box of each class.
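Both steps can be sketched with a small greedy NMS (a single-class toy example; the thresholds and helper names are mine, and a real pipeline would use an optimized routine such as torchvision.ops.nms):

```python
import torch

def iou(box, boxes):
    # IoU of one box against many; boxes are (x1, y1, x2, y2).
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    a1 = (box[2] - box[0]) * (box[3] - box[1])
    a2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a1 + a2 - inter)

def filter_and_nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    # Step 1: score filtering - drop boxes below the confidence threshold.
    keep = scores > conf_thres
    boxes, scores = boxes[keep], scores[keep]
    # Step 2: greedy NMS - keep the highest-scoring box, suppress overlaps.
    order = scores.argsort(descending=True)
    out = []
    while order.numel():
        i = order[0]
        out.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thres]
    return boxes[out], scores[out]

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
b, s = filter_and_nms(boxes, scores)
print(len(b))  # 2: the second box overlaps the first and is suppressed
```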

11. Loss value part

1. Contents required to calculate loss

Calculating loss is actually a comparison between the network’s prediction results and the network’s real results.
The loss of the network consists of three parts, namely the reg part, the obj part and the cls part.
reg: the regression parameters of the feature points;
obj: whether each feature point contains an object;
cls: the class of the object contained at each feature point.

2. Matching process of positive samples

(1) For each ground-truth box, roughly match prior boxes and feature points by coordinates, width and height.
(2) Use SimOTA to adaptively and precisely select which prior boxes correspond to each ground-truth box.
The concept of positive sample matching: find which prior boxes are considered to have a corresponding ground-truth box and are therefore responsible for predicting it.
a. Instead of using IoU to match positive samples, the aspect ratio is used directly for matching.
b. SimOTA adaptive matching
In yolov7, a cost matrix is computed, representing the cost between each ground-truth box and each feature point. The cost matrix consists of two parts:
(1) the overlap between each ground-truth box and the current feature point's prediction box;
(2) the class prediction accuracy between each ground-truth box and the current feature point's prediction box.
The higher these two values, the smaller the cost.
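One simplified way to picture this cost matrix (the weighting factor, the -log(IoU) form of the overlap term, and all toy numbers here are illustrative assumptions, not yolov7's exact formula):

```python
import torch

# Sketch of a SimOTA-style cost: classification loss plus a weighted
# IoU-loss term, so higher overlap and higher class accuracy both
# lower the cost. Rows are ground-truth boxes, columns are candidates.
iou_matrix = torch.tensor([[0.9, 0.2],
                           [0.3, 0.8]])    # predicted-box overlap per pair
cls_loss = torch.tensor([[0.1, 1.5],
                         [1.2, 0.2]])      # classification loss per pair
lam = 3.0                                  # assumed weighting factor
cost = cls_loss + lam * (-torch.log(iou_matrix + 1e-8))

# Each ground-truth box then picks its lowest-cost candidate(s).
best = cost.argmin(dim=1)
print(best)  # tensor([0, 1])
```

Here the first ground-truth box is assigned candidate 0 and the second candidate 1, since each pairing has both high overlap and low classification loss.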

3. Calculate loss

  1. Reg part: Part 2 determines the prior box corresponding to each ground-truth box. After obtaining it, the prediction box corresponding to that prior box is taken out, and the CIoU loss between the ground-truth box and the prediction box forms the reg part of the loss.
  2. Obj part: the prior boxes corresponding to all ground-truth boxes are positive samples, and the remaining prior boxes are negative samples. The cross-entropy loss between the positive/negative labels and the objectness predictions of the feature points forms the obj part of the loss.
  3. Cls part: after obtaining the prior box of each ground-truth box, its class prediction is taken out, and the cross-entropy loss between the ground-truth class and the predicted class forms the cls part of the loss.
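The CIoU used in the reg part penalizes center distance and aspect-ratio mismatch in addition to plain IoU; the reg loss itself would then be 1 - CIoU. A single-pair sketch (the function name and box format are mine):

```python
import math

def ciou(box1, box2, eps=1e-7):
    # CIoU between two (x1, y1, x2, y2) boxes: IoU minus a center-distance
    # penalty and an aspect-ratio penalty.
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # squared distance between box centers
    rho2 = ((box1[0] + box1[2] - box2[0] - box2[2]) ** 2 +
            (box1[1] + box1[3] - box2[1] - box2[3]) ** 2) / 4
    # squared diagonal of the smallest enclosing box
    cw = max(box1[2], box2[2]) - min(box1[0], box2[0])
    ch = max(box1[3], box2[3]) - min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

print(round(ciou([0, 0, 10, 10], [0, 0, 10, 10]), 4))  # 1.0 for identical boxes
```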


Origin blog.csdn.net/qq_43757976/article/details/131540697