YOLOv7 Backbone | Detailed explanation of the original source code

Detailed structure of YOLOv7 Backbone

In the previous article, we took YOLOv5 as our subject and, as the saying goes, dissected the sparrow: we examined its internal structure in detail, including the anchor mechanism, the structure of the backbone, the structure of the neck, and the structure of the head. In this article, we turn to the code of YOLOv7 v0.1. Combined with the author team's original YOLOv7 paper, we will introduce in detail the overall architecture of its backbone network and the implementation principles of each part, analyzing the network configuration file yolov7.yaml together with the network components in common.py.

The overall structure of the backbone

[Figure: overall structure of the YOLOv7 backbone]

First read the network architecture diagram.

1-P1/2; 16-P3/8: These are marks added while drawing the structure diagram to avoid mislabeling the intermediate feature sizes. The first number is the index of the current module; Pn indicates that the feature map at this module has been downsampled n times, so the feature-map size changes wherever a Pn appears; the number after the slash is the resulting cumulative downsampling multiple (2^n, e.g. P3/8).

CBS: The blue CBS module is the combined Conv+BatchNorm+SiLU block, mainly responsible for feature extraction. Its key parameters are the kernel size k and the stride s. The kernel size does not affect the height and width of the output feature map, because YOLO's autopad function automatically pads the input (p = k // 2); only the stride changes the feature-map size, specifically w_out = w_in / s and h_out = h_in / s.

ELAN1: The aggregation network unique to the backbone of YOLOv7, a module inspired by DenseNet and ResNet. The spatial size of the feature map does not change through the block, while the overall channel count is doubled (128 -> 256 in the first ELAN), as analyzed below.

MPConv: A somewhat special pooling structure whose main purpose is to reduce the size of the feature map, cutting the amount of computation and the parameter count, speeding up calculation, and helping to prevent overfitting.

Generally speaking, the backbone architecture of YOLOv7 differs considerably from that of v5 (the modules are ever more integrated, which also makes them ever more troublesome and harder for beginners to read), so we will follow the old routine and disassemble the modules in it step by step.

CBS: the main feature extraction module

In fact, no matter how magnificent a building is, it has to be built bit by bit with cement and bricks, and the CBS module is the cement and bricks of YOLOv7. Here I directly carry over the explanation of CBS from YOLOv5 and add a few new points.

[Figure: structure of the CBS module]

Conv

The CBS module is actually nothing special: it is just Conv+BatchNorm+SiLU. Here we focus on the parameters of Conv; it is a good time to review PyTorch's convolution operation. First, the source code of the Conv class:

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # nn.Identity() is a placeholder that performs no actual operation; when layers are added or removed it keeps the layer structure stable, which makes migrating weights easier. nn.SiLU() is the SiLU activation (sigmoid-weighted linear unit).
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):  # forward pass used during training: conv -> BN -> activation
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):  # forward pass after the BN has been fused into the conv weights
        return self.act(self.conv(x))
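
Before going through the parameters one by one, here is a quick sanity check (a minimal sketch, assuming the Conv class above and the autopad function below are in scope) showing that the kernel size leaves the spatial dimensions untouched while the stride halves them:

import torch

x = torch.randn(1, 32, 64, 64)  # N, C, H, W

print(Conv(32, 64, k=3, s=1)(x).shape)  # torch.Size([1, 64, 64, 64]): kernel size does not change H, W
print(Conv(32, 64, k=5, s=1)(x).shape)  # torch.Size([1, 64, 64, 64]): autopad keeps the 'same' output size
print(Conv(32, 64, k=3, s=2)(x).shape)  # torch.Size([1, 64, 32, 32]): only the stride halves H, W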

As the source code shows, Conv() takes 7 parameters, which are also the important parameters of the two-dimensional convolution Conv2d(). There is nothing special to say about ch_in, ch_out, kernel, and stride, so let's talk about the remaining parameters:

autopad

Looking at mainstream convolution usage, most researchers do not change the feature-map size through the kernel; for example, GoogLeNet pairs its 3x3 kernels with padding=1. So whenever kernel ≠ 1, the input feature map needs to be padded. If a p value is specified, that value is used; if p is left at its default, it is computed by the autopad function:

def autopad(k, p=None):  # kernel, padding
    # Pad to 'same'
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
        # if k is an int, p is k floor-divided by 2; if k is a list or similar, // 2 is applied to each of its elements
    return p

Here the author accounts for the fact that different kernel sizes used in different convolution operations require different padding, so before assigning a value to p the function checks whether k is an int; if k is a list, each of its elements is floor-divided by 2.
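
To make the behavior concrete, a few quick calls (a minimal sketch, assuming the autopad function above):

print(autopad(1))        # 0: a 1x1 kernel needs no padding
print(autopad(3))        # 1: 'same' padding for a 3x3 kernel
print(autopad(5))        # 2: 'same' padding for a 5x5 kernel
print(autopad(3, p=0))   # 0: an explicit p overrides the automatic value
print(autopad([3, 5]))   # [1, 2]: per-dimension padding for a non-square kernel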

act

Decides which activation is applied to the feature map: act=True uses SiLU; passing an nn.Module instance uses that module directly; anything else falls back to nn.Identity() (no activation).
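
Note that SiLU is the sigmoid-weighted linear unit, SiLU(x) = x * sigmoid(x), not a plain sigmoid. A quick check of both points (a minimal sketch, assuming the Conv class above is in scope):

import torch
import torch.nn as nn

x = torch.randn(4)
# SiLU (also called Swish) multiplies the input by its own sigmoid
assert torch.allclose(nn.SiLU()(x), x * torch.sigmoid(x))

# the three cases handled by Conv's act argument
print(Conv(3, 8, act=True).act)       # SiLU()
print(Conv(3, 8, act=nn.ReLU()).act)  # ReLU()
print(Conv(3, 8, act=False).act)      # Identity()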

ELAN: Efficient Layer Aggregation Network

This part of the structure is actually not complicated, but to give readers a deeper understanding of it I still plan to introduce it systematically, following the description in the original YOLOv7 paper. First, let us look at how the aggregation module proposed by the authors evolved.

ELAN Paper Reading

[Figure: evolution of the aggregation modules: (a) VoVNet, (b) CSPVoVNet, (c) ELAN]

VoVNet

VoVNet is a method proposed in 2019 [1]. It compares the main aggregation structures of the time, DenseNet and ResNet, and argues that DenseNet's aggregation style is better than ResNet's: because DenseNet uses concat where ResNet uses sum, it lets the network's feature maps gather information from multiple receptive fields (scales), which is extremely useful for downstream detection and segmentation tasks.

In fact, concat is a widely used information-aggregation method with a clear mechanism that is easy to implement. In single-stream multimodal architectures, the image embedding and text embedding are in most cases simply concatenated, yet this works better than processing the text and image information separately. I have seen the underlying reasoning in an answer on Zhihu; I will sort it out in a future post if I find the time.

But in practical applications the author found that DenseNet is not as efficient as the theory suggests, and concluded that there are the following reasons:

  • DenseNet is a densely connected structure: every layer's computation uses all of the preceding feature layers, as shown in Figure (a) below, which leads to heavy memory consumption, i.e., a high memory access cost (MAC).
  • DenseNet keeps growing the number of channels through its concats. To keep the amount of computation down, it can only do what ResNet's bottleneck block does: reduce the dimension with a 1x1 convolution before computing. But such small kernels do not parallelize well on GPUs.

[Figure: (a) dense aggregation in DenseNet vs. (b) one-shot aggregation in VoVNet]

Therefore VoVNet proposes the One-Shot Aggregation (OSA) method, shown in Figure (b) above and as (a) VoVNet in the evolution figure at the beginning of this section.

The legend already makes it clear: feature aggregation is performed only once, at the final stage of each module. This keeps the advantage of DenseNet's concat while improving GPU utilization and alleviating the MAC problem.
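
To make the contrast concrete, here is a minimal PyTorch sketch of the one-shot aggregation idea (my own simplified illustration, not the actual VoVNet code): each convolution feeds only the next one, and all intermediate outputs are concatenated exactly once at the end.

import torch
import torch.nn as nn

class OSABlock(nn.Module):
    # simplified one-shot aggregation: sequential convs, a single concat at the end
    def __init__(self, c_in, c_mid, n=4):
        super().__init__()
        self.convs = nn.ModuleList()
        c = c_in
        for _ in range(n):
            self.convs.append(nn.Sequential(
                nn.Conv2d(c, c_mid, 3, 1, 1, bias=False),
                nn.BatchNorm2d(c_mid),
                nn.ReLU(inplace=True)))
            c = c_mid
        # a 1x1 conv fuses the concatenated features (input + n intermediate maps)
        self.fuse = nn.Conv2d(c_in + n * c_mid, c_in, 1, bias=False)

    def forward(self, x):
        outs, y = [x], x
        for conv in self.convs:
            y = conv(y)
            outs.append(y)   # collected here, but concatenated only once below
        return self.fuse(torch.cat(outs, dim=1))

x = torch.randn(1, 64, 32, 32)
print(OSABlock(64, 32)(x).shape)  # torch.Size([1, 64, 32, 32])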

CSPVoVNet

Later structures such as PRN and CSPVoVNet approach model design from the perspective of gradient optimization, because once the gradients are optimized the network can reach more stable performance with fewer parameters.

CSP actually builds on PRN (which selects only part of the channels for the shortcut connection, obtaining richer gradient information and better training results). PRN demonstrated that rich gradients help train better models; following that reasoning, letting every channel receive different gradient information should train the model best, but doing so is inefficient. So, weighing the performance requirements, CSPNet was designed as a further step:

[Figure: structure of CSPDenseNet]

On the one hand, CSPDenseNet obtains richer gradient information, as DenseNet does; on the other hand, its cross-stage partial operation avoids excessive duplication of gradient information.
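
The cross-stage partial idea can be sketched in a few lines (again my own simplified illustration, not the official CSPNet code): the input channels are split, only one half goes through the heavy computation, and a transition layer merges the two halves.

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    # simplified cross-stage partial block
    def __init__(self, c, blocks):
        super().__init__()
        self.blocks = blocks                    # any stack of layers mapping c//2 -> c//2 channels
        self.transition = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)              # channel split: x1 crosses the stage untouched
        x2 = self.blocks(x2)                    # only half the channels take the heavy path
        return self.transition(torch.cat([x1, x2], dim=1))

blocks = nn.Sequential(nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU())
x = torch.randn(1, 64, 32, 32)
print(CSPBlock(64, blocks)(x).shape)            # torch.Size([1, 64, 32, 32])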

ELAN

The ELAN paper [6] appeared on arXiv quite recently, so it is relatively new. I mainly read the analysis in the author's third chapter; below is my understanding.

[Figure: aggregation structures analyzed in the ELAN paper]

The paper first points out that the reason DenseNet beats ResNet is that DenseNet combines different gradient sources at the same time stamp, rather than the same input source at the same time stamp as in ResNet. It then argues that the PRN structure, which differs from a plain residual network by selecting only some channels for the residual link, is effectively equivalent to a dropout mechanism (interested readers can see my earlier blog post on dropout): it enriches the gradient-source information and improves the generalization of the network.

Secondly, CSPNet uses only a simple channel split, cross-stage connections, and a small number of extra transition layers, and achieves its goal without changing the original network's computational units. In other words, the channel split lets CSPNet enrich the gradient-source information on the one hand and reduce computation on the other, killing two birds with one stone.

Finally, a piece of engineering experience: when building a network, it is not enough to consider the shortest gradient path; every layer must also be trained effectively through its own shortest gradient path. As for the longest gradient path of the whole network, it is greater than or equal to the longest gradient path of any single layer. Therefore, a network-level gradient-path design strategy has to take into account the longest and shortest gradient path lengths of every layer as well as the longest gradient path of the entire network.

ELAN Code Reading

Having said all that, the key point is that a structure like ELAN improves the robustness of the whole network, reduces the parameter count, and speeds up computation; the main reason for using ELAN here is simply that it is efficient enough. Let's now look at the code-level implementation of ELAN, starting with its network structure.

[Figure: network structure of ELAN in YOLOv7]

This network structure is actually the same as (c) ELAN in the paper, just drawn differently. Let's interpret it together with the ELAN entry in the network configuration:

# ELAN1
[-1, 1, Conv, [64, 1, 1]],# -6
[-2, 1, Conv, [64, 1, 1]],# -5
[-1, 1, Conv, [64, 3, 1]],
[-1, 1, Conv, [64, 3, 1]],# -3
[-1, 1, Conv, [64, 3, 1]],
[-1, 1, Conv, [64, 3, 1]],# -1
[[-1, -3, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [256, 1, 1]],  # 11

First look at the Concat part. We can determine that four feature maps are spliced at the end; their sources are the layers from one (-1) up to six (-6) positions before the concat (if these parameters are unfamiliar, see the detailed explanation in the YOLOv5 backbone article). Having located the concat inputs, look at their feature-map sizes: although the kernel sizes of these convolutions are not all the same, YOLO's autopad mechanism guarantees that each output feature map has the same size as its input (only s changes the feature-map size). So what is fused here is semantic information from different depths. After the fusion, only the channel count changes: the four 64-channel branches concatenate into 256 channels, i.e., the 128 channels entering the ELAN become 256 channels after the module. Conclusion: ELAN does not change the feature-map size; it only doubles the number of channels. This part looks troublesome, but read alongside the code it is not that hard to understand, and can be summed up in a few sentences.
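
The conclusion can be double-checked with a small shape test (a sketch that uses a plain Conv2d+BN+SiLU stand-in for CBS; the channel numbers follow the yaml excerpt above, so this mirrors the structure rather than reusing the official module):

import torch
import torch.nn as nn

def cbs(c1, c2, k, s):
    # plain Conv+BN+SiLU stand-in for the CBS module
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class ELAN1(nn.Module):
    def __init__(self):
        super().__init__()
        self.cv1 = cbs(128, 64, 1, 1)   # kept for the concat (-6 in the yaml)
        self.cv2 = cbs(128, 64, 1, 1)   # start of the conv chain (-5)
        self.chain = nn.ModuleList([cbs(64, 64, 3, 1) for _ in range(4)])
        self.out = cbs(256, 256, 1, 1)  # the 1x1 conv after the concat

    def forward(self, x):
        y1, y2 = self.cv1(x), self.cv2(x)
        feats = [y2]
        for m in self.chain:
            feats.append(m(feats[-1]))
        # concat of [-1, -3, -5, -6] = 4th and 2nd chain outputs, cv2, cv1
        return self.out(torch.cat([feats[4], feats[2], y2, y1], dim=1))

x = torch.randn(1, 128, 80, 80)
print(ELAN1()(x).shape)  # torch.Size([1, 256, 80, 80]): size unchanged, channels doubled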

MPConv

First paste the network structure diagram of the module:

[Figure: network structure of MPConv]

Then the network configuration of the module:

# MPConv
[-1, 1, MP, []],# maxpooling:k=2 s=2
[-1, 1, Conv, [128, 1, 1]],
[-3, 1, Conv, [128, 1, 1]],
[-1, 1, Conv, [128, 3, 2]],
[[-1, -3], 1, Concat, [1]],  # 16-P3/8

Again start from the Concat and trace its two inputs in the diagram: one branch is MaxPool (k=2, s=2) followed by a 1x1 convolution, the other is a 1x1 convolution followed by a 3x3 convolution with stride 2. Both branches halve the height and width (the "16-P3/8" comment marks the downsampling), and each carries half the channels, so after the concat the channel count equals the input's. Conclusion: the MPConv module halves the feature-map size while keeping the number of channels unchanged.
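
A few lines confirm this (a sketch following the yaml excerpt, with the same CBS stand-in as above):

import torch
import torch.nn as nn

def cbs(c1, c2, k, s):
    # plain Conv+BN+SiLU stand-in for the CBS module
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class MPConv(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.mp = nn.MaxPool2d(2, 2)          # maxpool k=2, s=2 halves H and W
        self.cv1 = cbs(c, c // 2, 1, 1)       # 1x1 conv after the maxpool
        self.cv2 = cbs(c, c // 2, 1, 1)       # 1x1 conv on the raw input (-3 in the yaml)
        self.cv3 = cbs(c // 2, c // 2, 3, 2)  # 3x3 stride-2 conv also halves H and W

    def forward(self, x):
        y1 = self.cv1(self.mp(x))             # pooled branch: c//2 channels
        y2 = self.cv3(self.cv2(x))            # conv branch: c//2 channels
        return torch.cat([y2, y1], dim=1)     # concat restores the original channel count

x = torch.randn(1, 256, 80, 80)
print(MPConv()(x).shape)  # torch.Size([1, 256, 40, 40]): H, W halved, channels unchanged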

That's all for now; I'll add more when something else comes to mind.

[1] Lee Y, Hwang J, Lee S, et al. An energy and GPU-computation efficient backbone network for real-time object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.

[2] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.

[3] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4700-4708.

[4] Wang C Y, Mark Liao H Y, Chen P Y, et al. Enriching variety of layer-wise learning information by gradient combination[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019.

[5] Wang C Y, Liao H Y M, Wu Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 390-391.

[6] Wang C Y, Liao H Y M, Yeh I H. Designing Network Design Strategies Through Gradient Path Analysis[J]. arXiv preprint arXiv:2211.04800, 2022.

[7] https://blog.csdn.net/weixin_43427721/article/details/123613944?spm=1001.2014.3001.5501
