Reproduction of the YOLOv5 backbone based on PyTorch (Part 1)


foreword

Let me start with a brief summary. I have been reviewing recently, taking stock of what I learned this semester and consolidating it. The series of notes is still being organized, and I have given myself about one month for review. This part covers deep learning, mainly YOLO; next week I will move on to my own coursework.

This post took about two days to prepare. I read the V1~V3 papers carefully; I have not read V4 and V5 yet, because from a development perspective the big changes happened from V1 to V3, while V4 and V5 mostly optimize the network structure. Our task today is to reproduce the backbone of YOLOv5.

No matter how good the theory is, it needs practice, which deepens understanding and reflection. Since we want to use YOLO to do cooler things later, this is a hurdle I have to clear.

This post explores the YOLOv5 5.5 release. To keep the length manageable, the reproduction is split into two posts.

network structure

Before we start, let's look at the network structure of the whole yolov5. (figure: the complete network structure, generated with netron.app) We will not use this picture directly, because you will find there is actually a lot of repetition in it; instead we will use the diagram below as our implementation reference.

reference design

Since the raw graph is not easy to read, let's refer to the diagram by Jiang Dabai, a well-known author on Zhihu. (figure: Jiang Dabai's YOLOv5 structure diagram)

Next we explain each module. (Note that in our current version the input is actually batch_size x 3 x 640 x 640.) Our actual network is not exactly the same as this reference diagram; the concrete structure still follows the netron graph above, and I will post what it roughly looks like.

Focus Module

Before we start, notice this part of the diagram. (figure: the Focus block in the network graph)

What it does is the following slicing operation (figure: the Focus slicing diagram), and the code looks like this:

class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)      # the input channel count becomes 4x here

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))

The original 3 × 640 × 640 image enters the Focus structure, where the slicing operation first turns it into a 12 × 320 × 320 feature map. In the code, one image is cut into 4 slices, each keeping the original 3 channels, which is why the input channel count is c1 * 4. You should also have noticed that the weight W of this first Conv is 32 x 12 x 3 x 3, which is easy to explain: an image is originally 3 x 640 x 640 and the kernel size is 3 x 3, so producing one output channel takes a 3 x 3 x 3 kernel, and 32 output channels would take a weight of 32 x 3 x 3 x 3. But since there are 4 slices you have 12 input channels, so the weight is 32 x 12 x 3 x 3. The detailed derivation is in this figure. (figure: Focus weight-shape derivation)
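The slicing step can be checked in isolation. This is a minimal sketch that applies only the four pixel-parity slices from the forward pass above to a dummy tensor, without the trailing Conv:

```python
import torch

x = torch.randn(1, 3, 640, 640)  # dummy batch: (batch, channels, height, width)

# The four slices used by Focus: each takes every other pixel along
# height and width, so each slice is (1, 3, 320, 320).
slices = [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]]
y = torch.cat(slices, dim=1)
print(y.shape)  # torch.Size([1, 12, 320, 320])
```

So 3 x 640 x 640 becomes exactly the 12 x 320 x 320 feature map described above, with no information thrown away.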

Next come our other modules.

Conv convolution module

To prevent various training problems, YOLOv5 does quite a bit of optimization here: data augmentation at the start of training, and batch normalization inside the convolution block to suppress the interference caused by large differences in parameter scale.


def autopad(k, p=None):  # kernel, padding
    # 'same'-style padding: keep the spatial size when no padding is given
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        # used after conv+bn fusion at inference time
        return self.act(self.conv(x))

(figure: the SiLU activation curve) Note the activation used here: SiLU is actually very similar to ReLU, except that it takes on small negative values for inputs below 0. (figure: SiLU vs ReLU comparison) One more remark: a convolution kernel is really similar to our linear weights, it just works through matrix operations; there is nothing mysterious about it.

In the reference diagram this corresponds to the CBL module, but in our version it is called Conv.

Residual module

This actually corresponds to the CSP1_x module in the diagram.

In our current version it looks like this:

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
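Putting the pieces so far together, here is a self-contained sketch (with a minimal Conv stand-in matching the one above) that checks the key detail of Bottleneck: the shortcut is only active when the input and output channel counts match.

```python
import torch
import torch.nn as nn

def autopad(k, p=None):
    return k // 2 if p is None else p

class Conv(nn.Module):
    # minimal Conv from above: Conv2d + BatchNorm + SiLU
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):
        super().__init__()
        c_ = int(c2 * e)                 # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2  # skip connection only if shapes match

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

m = Bottleneck(64, 64)
print(m.add)  # True: channels match, residual add is used
x = torch.randn(1, 64, 32, 32)
print(m(x).shape)  # torch.Size([1, 64, 32, 32]): spatial size and channels preserved
```

With c1 != c2 (e.g. Bottleneck(64, 128)), `add` is False and the module degenerates to a plain two-conv stack.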

(figure: the CSP1_x residual structure)

C3 module

This module looks like this; it is a bit like the residual module, but instead of adding, it widens. (figure: the C3 structure) It keeps part of the input aside, sends the rest through the residuals, and fuses the two at the end.

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


This "cutting in half" is implemented with convolutions: cutting the channels to 0.5x keeps half aside, the residual stack self.m(self.cv1(x)) handles the convolution part, and then cat merges the two halves.
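The channel bookkeeping can be verified with a stripped-down stand-in: plain Conv2d layers in place of Conv, and an Identity in place of the Bottleneck stack (an assumption purely for this sketch). Only the channel flow of C3 is kept: c1 -> c_ on each branch, cat to 2*c_, then 2*c_ -> c2.

```python
import torch
import torch.nn as nn

class C3Channels(nn.Module):
    # simplified C3: same channel flow as above, no BN/SiLU/Bottleneck body
    def __init__(self, c1, c2, e=0.5):
        super().__init__()
        c_ = int(c2 * e)                    # hidden channels: half of c2
        self.cv1 = nn.Conv2d(c1, c_, 1)     # branch into the residual stack
        self.cv2 = nn.Conv2d(c1, c_, 1)     # branch kept aside
        self.cv3 = nn.Conv2d(2 * c_, c_ * 2 // (2 * c_) * c2, 1) if False else nn.Conv2d(2 * c_, c2, 1)
        self.m = nn.Identity()              # stands in for the Bottleneck stack

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

x = torch.randn(1, 64, 80, 80)
print(C3Channels(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

So each branch carries half of c2, the cat restores c2 channels, and cv3 fuses them; the spatial size is untouched because all the glue convolutions are 1x1.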


With this we basically have everything that appears in our actual network architecture diagram. Of course there are many more subtleties; look carefully at the common.py file and you will see. Those details need to be explained together with the papers, which I cannot do justice to here — I read at least five or six papers for this and still need to organize them.

The 7x7, 13x13, 26x26 grid cells proposed up to the V3 version are also what we have here (of course here the grids are not that small). (figures: feature-map grids at different scales) This matches Jiang Dabai's diagram.

After that, the same structure just keeps repeating.

s / m / l / x differences

Now that we're here, let's talk about what the suffixes behind our yolov5 mean.

(figure: the yolov5 model variants)

In fact, you will see this as soon as you open one of these yolov5xx.yaml files.

You will find that, apart from these two parameters (figure: depth_multiple and width_multiple in the yaml), everything else is similar.

These two parameters represent the depth and width of your network; for example, width_multiple is 0.5 here.

How do they control the output? Simple.

The number of channels output by your convolutions is multiplied by width_multiple, which controls the width. For example, a layer originally configured to output 128 channels, multiplied by 0.5, outputs 64, so the network becomes narrower. As for the depth, it is just as simple — remember CSP1_X?

In Jiang Dabai's diagram, the X is how many residuals are repeated. Suppose the standard setting is CSP1_3 (assuming such a module exists). With depth_multiple = 0.33, 3 × 0.33 rounds to the integer 1, so my CSP1_3 actually keeps only one residual, and so on — that is how the depth comes down. The most "standard" model is yolov5l.pt, because both multiples are set to 1.0.
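The scaling rules above can be sketched in a few lines. This is a simplified version of the logic (the helper names `scale_width` and `scale_depth` are mine, not YOLOv5's; the real code lives in the model-parsing step and rounds widths with a `make_divisible` helper):

```python
import math

def scale_width(c, width_multiple, divisor=8):
    # Round the scaled channel count up to a multiple of `divisor`,
    # mirroring YOLOv5's make_divisible-style width rounding.
    return math.ceil(c * width_multiple / divisor) * divisor

def scale_depth(n, depth_multiple):
    # A repeated block never disappears entirely: at least one copy stays.
    return max(round(n * depth_multiple), 1)

# yolov5s-style settings: depth_multiple=0.33, width_multiple=0.50
print(scale_width(128, 0.50))  # 64 -> the layer is half as wide
print(scale_depth(3, 0.33))    # 1  -> a CSP1_3 block keeps only one residual
print(scale_depth(9, 0.33))    # 3
```

So the four variants share one architecture and differ only in these two numbers, with l being the 1.0/1.0 baseline.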
