Detailed Backbone of YOLOv5

Backbone design of YOLOv5

In the previous article on YOLOv5's anchor settings, we discussed the anchor generation principle and the detection process, giving a general picture of the YOLOv5 network structure. Next we will focus on the Backbone of YOLOv5 and walk through the underlying source code to see how v5's Backbone is designed.

1 Backbone overview and parameters

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

The backbone part of yolov5s is shown above. The network structure is configured in a yaml file, and the modules are built from it by the parsing code in ./models/yolo.py. Unlike the config files used by v3 and v4, modules in the yaml file do not need to be written out repeatedly; you only set their repeat count in the number field.

1.1 Param

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

nc: 80

Represents the number of classes in the dataset (80 here, matching COCO); for example, MNIST contains 10 classes, the digits 0 through 9.

depth_multiple: 0.33

Controls the depth of the model; it only takes effect when number > 1. For example, the first C3 layer (what C3 is will be introduced later) is configured as [-1, 3, C3, [128]], where number=3. In v5s this builds a single C3 block, since round(3 × 0.33) = 1; similarly, v5l keeps 3 C3 blocks because its depth_multiple is 1.0. A minimal sketch of this scaling follows.
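This sketch mirrors the depth-scaling logic in parse_model of ./models/yolo.py; the function name and wrapper here are mine, for illustration only:

def scaled_depth(number, depth_multiple):
    # Modules with number > 1 are scaled by depth_multiple and rounded,
    # but never below 1 repeat.
    return max(round(number * depth_multiple), 1) if number > 1 else number

print(scaled_depth(3, 0.33))  # 1 -> v5s builds a single C3 where the yaml says 3
print(scaled_depth(9, 0.33))  # 3
print(scaled_depth(3, 1.0))   # 3 -> v5l keeps the full count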

width_multiple: 0.50

Controls the width of the model, i.e. the ch_out values in args. For example, the first Conv layer specifies ch_out=64, but when v5s actually runs, the layer is built with 64 × 0.5 = 32 convolution kernels, so it outputs a 32-channel feature map.
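In the repo the scaled width is additionally rounded to a multiple of 8; a short sketch, assuming the make_divisible helper behaves as in utils/general.py:

import math

def make_divisible(x, divisor=8):
    # Round the channel count up to the nearest multiple of divisor.
    return math.ceil(x / divisor) * divisor

print(make_divisible(64 * 0.50))    # 32 channels actually built for the first Conv
print(make_divisible(1024 * 0.50))  # 512 channels for the last backbone stage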

1.2 backbone

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]
  1. from: which layer the input comes from; -1 means the output of the previous layer, and -n means the output from n layers back.
  2. number: the repeat count of the module; for example, [-1, 3, C3, [128]] stacks 3 C3 modules (before depth scaling).
  3. module: the name of the network module; Conv, C3, and SPPF are all already defined in ./models/common.py.
  4. args: the arguments passed to the module, namely [ch_out, kernel, stride, padding, groups]. Note that ch_in is omitted, because each module's input is simply the previous layer's output (the initial ch_in is 3). To save users from tedious manual edits, the input channels are tracked by the parse_model(md, ch) function in ./models/yolo.py; a simplified sketch of this parsing follows the list.
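This is an illustrative reduction of parse_model, not the real function; module names are kept as strings and only depth/width scaling plus channel tracking are shown:

import math

def make_divisible(x, divisor=8):
    return math.ceil(x / divisor) * divisor

gd, gw, ch = 0.33, 0.50, [3]  # depth_multiple, width_multiple, channels (RGB input)
backbone = [[-1, 1, 'Conv', [64, 6, 2, 2]],
            [-1, 3, 'C3', [128]]]

for f, n, m, args in backbone:
    n = max(round(n * gd), 1) if n > 1 else n      # scale depth
    c1, c2 = ch[f], make_divisible(args[0] * gw)   # ch_in from layer f, scaled ch_out
    print(f'{m}: ch_in={c1}, ch_out={c2}, repeats={n}')
    ch.append(c2)  # record this layer's output channels for the next layer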

1.3 Examples

[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2

input: 3x640x640
[ch_out, kernel, stride, padding] = [64, 6, 2, 2]
After width scaling, the number of output channels is 64 × 0.5 = 32.
Using the feature-map size formula Feature_new = ⌊(Feature_old − kernel + 2 × padding) / stride⌋ + 1,
the new feature map size is (640 − 6 + 2×2)/2 + 1 = 320.

[-1, 1, Conv, [128, 3, 2]],  # 1-P2/4

Input: 32x320x320
[ch_out, kernel, stride] = [128, 3, 2], with padding defaulting to 1 (see the autopad function below)
Similarly, the new channel count is 128 × 0.5 = 64, and the new feature map size is ⌊(320 − 3 + 2×1)/2⌋ + 1 = 160.
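These two shape calculations can be verified with plain nn.Conv2d (the BatchNorm and SiLU inside Conv do not change tensor shape):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 640, 640)
p1 = nn.Conv2d(3, 32, kernel_size=6, stride=2, padding=2)   # layer 0, P1/2
p2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)  # layer 1, P2/4

print(p1(x).shape)      # torch.Size([1, 32, 320, 320])
print(p2(p1(x)).shape)  # torch.Size([1, 64, 160, 160])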

2 Backbone composition

Backbone v6.0 removes the Focus module (to facilitate model export and deployment). The Backbone is mainly composed of CBS (formerly CBL), BottleneckCSP/C3, and SPPF (formerly SPP) modules, as shown in the following figure:
[Figure: YOLOv5 v6.0 Backbone built from CBS, C3, and SPPF modules]

2.1 CBS

[Figure: CBS module: Conv + BatchNorm2d + SiLU]

There is nothing unusual about the CBS module: it is just Conv + BatchNorm + SiLU. Here we will focus on the parameters of Conv, which is also a good opportunity to review PyTorch's convolution operation. First, the Conv source code:

import torch.nn as nn


class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # nn.Identity() is a placeholder with no actual operation; it keeps the layer
        # count unchanged when modules are added or removed, which makes transferring
        # weights easier. nn.SiLU() is the Sigmoid-weighted Linear Unit activation.
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):  # standard forward pass: conv -> BN -> activation
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):  # fused forward pass (BN folded into the conv weights)
        return self.act(self.conv(x))

As the source code shows, Conv() takes 7 parameters, which are also the important parameters of the two-dimensional convolution Conv2d(). ch_in, ch_out, kernel, and stride need no further explanation; let's talk about the last three:

padding

Looking at mainstream convolution designs, most researchers do not shrink the feature map through the kernel itself: for example, GoogLeNet pads its 3x3 kernels with padding=1, so whenever kernel ≠ 1 the input feature map needs to be padded to compensate. When a p value is specified, that value is used; when p is left at its default (None), it is computed by the autopad function:

def autopad(k, p=None):  # kernel, padding
    # Pad to 'same'
    if p is None:
        # If k is an int, p = k // 2; if k is a list (one kernel size per
        # dimension), p is each element floor-divided by 2.
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

The author allows for convolutions with different kernel sizes, where the padding must change accordingly, so when assigning p the function first checks whether k is an int; if k is a list instead, each element of the list is floor-divided by 2.
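A quick sanity check (my own snippet) that padding = k // 2 keeps the spatial size unchanged at stride=1:

import torch
import torch.nn as nn

# 'same' padding: the output stays 64x64 for any odd kernel size
for k in (1, 3, 5):
    conv = nn.Conv2d(8, 8, kernel_size=k, stride=1, padding=k // 2)
    print(k, conv(torch.randn(1, 8, 64, 64)).shape)  # always [1, 8, 64, 64]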

groups

Specifies grouped convolution, as shown in the figure below; a short parameter-count demo follows the list.
[Figure: grouped convolution]

groups – Number of blocked connections from input channels to output channels.

  • At groups=1, all inputs are convolved to all outputs.
  • At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, with both results subsequently concatenated.
  • At groups=in_channels, each input channel is convolved with its own set of filters, of size ⌊out_channels / in_channels⌋.
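A short demo (mine) of how grouping divides the weight count:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# 3x3 convolution, 64 -> 64 channels, no bias
print(n_params(nn.Conv2d(64, 64, 3, groups=1, bias=False)))   # 36864
print(n_params(nn.Conv2d(64, 64, 3, groups=2, bias=False)))   # 18432
print(n_params(nn.Conv2d(64, 64, 3, groups=64, bias=False)))  # 576 (depthwise)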

act

Decides whether to apply an activation to the feature map. When act=True, SiLU is used; SiLU (the Sigmoid-weighted Linear Unit) computes x · sigmoid(x), i.e. the input weighted by its own sigmoid.

One more thing: dilation

Another important parameter of Conv2d is dilation. Put simply, it controls the spacing between the kernel's sampling points: spreading the kernel out enlarges the receptive field while the feature map size and feature information are preserved. Dilated convolutions are especially effective in semantic segmentation tasks.
[Figure: dilated (atrous) convolution]
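A small illustration (my own snippet): a 3x3 kernel with dilation=2 covers a 5x5 receptive field, and setting padding=dilation keeps the output size unchanged at stride=1:

import torch
import torch.nn as nn

conv = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=2, dilation=2)
print(conv(torch.randn(1, 8, 64, 64)).shape)  # torch.Size([1, 8, 64, 64])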

2.2 CSP/C3

CSP here means the C3 module. The C3 blocks in the backbone contain a shortcut, while the C3 blocks in the neck do not, so the backbone's C3 layers are often labeled CSP1_x and the neck's CSP2_x.

2.2.1 CSP structure

Next, let's sort out the module composition of the C3 layer in the backbone, starting from the source code:

import torch
import torch.nn as nn


class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

As the source code shows, one branch of the input feature map first passes through .cv1 and then through .m to obtain sub-feature-map 1; the other branch passes through .cv2 to obtain sub-feature-map 2. Finally, the two sub-feature-maps are concatenated and fed into .cv3 to produce the output of the C3 layer, as shown in the figure below. The cv operations are easy to understand, being the Conv2d+BN+SiLU from before; the key is the .m operation.
[Figure: C3 module structure]
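As a quick structural check (assuming a YOLOv5 repo checkout so that models.common is importable):

import torch
from models.common import C3  # assumes the YOLOv5 repo is on PYTHONPATH

# The first C3 of v5s after width/depth scaling: 64 in, 64 out, n=1
m = C3(64, 64, n=1, shortcut=True)
x = torch.randn(1, 64, 160, 160)
print(m(x).shape)  # torch.Size([1, 64, 160, 160]) -- C3 preserves the spatial size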

The .m operation uses nn.Sequential to chain multiple Bottlenecks (named Resx in the illustration) into the network. The n in the for loop comes from the number in the network configuration file, i.e. number × depth_multiple Bottlenecks are connected in series. So, what is a Bottleneck?

2.2.2 Bottleneck

If you want to understand Bottleneck, you have to start with ResNet. Before ResNet appeared, it was generally believed that the deeper the network, the more information it could capture and the better the model would generalize. However, a large body of later work showed that once network depth passes a certain point, model accuracy degrades noticeably. This is not caused by overfitting but by exploding and vanishing gradients during backpropagation: the deeper the network, the harder it is to optimize, rather than the more features it learns.
[Figure: accuracy degradation as plain networks grow deeper]
To let deep networks train well, the residual network replaces the original plain mapping with a residual mapping. For an input x with desired output H(x), the network uses an identity connection to take x as a baseline, so the mapping becomes F(x) + x. Instead of asking a stack of convolutions to approximate H(x) directly, it is easier to approximate H(x) − x, i.e. the residual F(x). ResNet thus changes the learning target from H(x) itself to the difference between H(x) and x, and training drives the residual toward 0.
What are the benefits of the residual module?

1. Gradient flow. After adding the shortcut structure in ResNet, backpropagation passes gradients not only through each block's weights but also directly along the identity branch. This effectively strengthens the gradient handed back to earlier blocks and reduces the risk of vanishing gradients.
2. Feature redundancy. In the forward pass, each convolution layer extracts only part of the image's information, so the deeper the layer, the more of the original information has been lost, and the network ends up working from only a small fraction of the original image, which behaves much like underfitting. Adding the shortcut feeds the previous layer's complete information into each block, preserving more of the original signal.

In ResNet, residual modules with shortcuts let people build networks of hundreds or even a thousand layers. The shallow residual module is named BasicBlock (ResNet-18/34), while the module used by deeper networks is named Bottleneck (ResNet-50 and up).
[Figure: BasicBlock vs. Bottleneck residual modules]
The biggest difference between Bottleneck and BasicBlock is the convolution layout. BasicBlock consists of two 3x3 convolution layers, while Bottleneck is a 3x3 convolution sandwiched between two 1x1 convolutions: the 1x1 layers first reduce and then restore the channel dimension, so the 3x3 convolution runs on fewer channels, with fewer parameters and faster computation.
Concretely, the first 1x1 convolution reduces the 256 channels to 64, and the final 1x1 convolution restores them. The total parameter count is 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69632; without the bottleneck, two 3x3 convolutions at 256 channels would need 3x3x256x256x2 = 1179648 parameters, a difference of about 16.94x.
Bottleneck thus reduces the parameter count and computation while maintaining accuracy.
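The arithmetic is easy to confirm in code (bias terms omitted, matching the counts above):

import torch.nn as nn

def n_params(layers):
    return sum(p.numel() for layer in layers for p in layer.parameters())

# Bottleneck: 1x1 down to 64 channels, 3x3 at 64, 1x1 back up to 256
bottleneck = [nn.Conv2d(256, 64, 1, bias=False),
              nn.Conv2d(64, 64, 3, padding=1, bias=False),
              nn.Conv2d(64, 256, 1, bias=False)]
# Plain alternative: two 3x3 convolutions at 256 channels
plain = [nn.Conv2d(256, 256, 3, padding=1, bias=False),
         nn.Conv2d(256, 256, 3, padding=1, bias=False)]

print(n_params(bottleneck))  # 69632
print(n_params(plain))       # 1179648, roughly 16.94x more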

All of that background exists to set up the Bottleneck used in CSP. Looking back at the Bottleneck in CSP, it is now much clearer:

import torch.nn as nn


class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

As you can see, the Bottleneck in CSP is similar to the one in the ResNet module: first a 1x1 convolution layer (CBS), then a 3x3 convolution layer, and finally a shortcut adding the result to the original input. The difference from ResNet shows up at the C3 level: after halving the channel dimension, CSP does not use a 1x1 convolution to restore it; instead, the original input x is also reduced in dimension (by .cv2) and the two tensors are concatenated, yielding an output with the same channel count as the original input. One distinction is worth keeping straight: the shortcut in ResNet is realized by add, an element-wise sum at corresponding positions that leaves the channel count unchanged, while the cross-stage connection in CSP is realized by concat, which increases the channel count. Both are mainstream ways of fusing information, but the underlying tensor operations are different, as the snippet below shows.
[Figure: add vs. concat feature fusion]
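The difference in tensor terms:

import torch

x = torch.randn(1, 64, 80, 80)
y = torch.randn(1, 64, 80, 80)

print((x + y).shape)               # torch.Size([1, 64, 80, 80])  add: channels unchanged
print(torch.cat((x, y), 1).shape)  # torch.Size([1, 128, 80, 80]) concat: channels stack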

Second, the shortcut can be set according to the task requirements: shortcut=True in the backbone and shortcut=False in the neck.
When shortcut=True, Resx is as shown in the figure:
[Figure: Resx with shortcut=True]

When shortcut=False, Resx is as shown in the figure:
[Figure: Resx with shortcut=False]
This is actually one of the things YOLOv5 is praised for: the code is systematic and carries little redundancy. A single parameter switches between a residual Bottleneck and a plain convolution stack, which both shortens the code and improves its readability.

2.3 SPPF

import warnings

import torch
import torch.nn as nn


class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat([x, y1, y2, self.m(y2)], 1))

The SPPF module concatenates x (after a CBS), y1 (after one pooling), y2 (after two poolings), and self.m(y2) (after three poolings), and then extracts features from the result with another CBS. On closer inspection it is easy to see that although SPPF pools the feature map repeatedly, neither its size nor its channel count changes (stride=1 with padding=k//2), which is why the four outputs can be fused along the channel dimension. The main function of this module is to extract and fuse high-level features: by max-pooling repeatedly, it gathers as much high-level semantic information as possible, and with k=5 the chained poolings are equivalent to the parallel k=(5, 9, 13) poolings of the original SPP while reusing intermediate results.
[Figure: SPPF module structure]
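A quick check of both claims: stride-1 pooling with padding=k//2 preserves the feature map size, and chaining 5x5 max-pools reproduces the larger SPP kernels:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 20, 20)
m5 = nn.MaxPool2d(5, stride=1, padding=2)
m9 = nn.MaxPool2d(9, stride=1, padding=4)
m13 = nn.MaxPool2d(13, stride=1, padding=6)

print(m5(x).shape)  # torch.Size([1, 64, 20, 20]) -- size unchanged
# Two chained 5x5 pools equal one 9x9, three equal one 13x13, which is why
# SPPF(k=5) matches SPP(k=(5, 9, 13)) while reusing intermediate results.
print(torch.equal(m5(m5(x)), m9(x)))       # True
print(torch.equal(m5(m5(m5(x))), m13(x)))  # True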

3 Backbone overview of YOLOv5s

Finally, combining the explanations above, the backbone of v5s should no longer be hard to understand:

[Figure: overall Backbone of YOLOv5s]

OVER

Origin blog.csdn.net/weixin_43427721/article/details/123613944