YOLOv8

References

Wisdom of Object Detection 66: Building a YOLOv8 Object Detection Platform in PyTorch

CV Classic Backbone Series: Darknet-53

CV Classic Backbone Series: CSPNet

An In-Depth Explanation of YOLOv8: Understand It in One Read and Get Started Quickly

Official GitHub source code

https://github.com/ultralytics/ultralytics

Official documentation

https://docs.ultralytics.com/

Principle

Pre-processing (train)

In terms of data augmentation there is little difference from YOLOv5, but YOLOv8 adopts the trick introduced in YOLOX of disabling Mosaic augmentation for the last 10 epochs. For a 500-epoch run, Mosaic is applied for the first 490 epochs and then switched off for the remainder.
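A minimal sketch of that schedule in plain Python (the function name below is illustrative, not the library's API; ultralytics exposes a similar `close_mosaic` hyperparameter):

```python
def use_mosaic(epoch, total_epochs, close_mosaic=10):
    """Return True while Mosaic augmentation should stay enabled.

    Mosaic is switched off for the last `close_mosaic` epochs,
    following the trick introduced in YOLOX.
    """
    return epoch < total_epochs - close_mosaic

# With 500 training epochs, Mosaic runs for epochs 0..489
# and is disabled for the final 10 epochs.
print(use_mosaic(489, 500))  # True
print(use_mosaic(490, 500))  # False
```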

Network structure

Overall idea

Feature extraction --> feature enhancement --> prediction of the objects corresponding to each feature point.

Improvements

1. Backbone: largely unchanged from the YOLOv5 series, but the kernel of the first convolution is reduced from 6 to 3. In addition, the CSP module's preprocessing is changed from three convolutions to two: the first convolution's channel count is doubled, and its output is then split in half along the channel dimension. The multi-stack structure is also borrowed from YOLOv7.

2. Feature-enhancement part: the feature maps produced by the backbone are no longer convolved first (presumably to speed things up), and the CSP module's preprocessing is likewise changed from three convolutions to two, implemented the same way as in the backbone.

3. Prediction head: a DFL (Distribution Focal Loss) module is added. Intuitively, DFL obtains each regression value as the expectation of a discrete probability distribution rather than as a single direct prediction. For example, with a DFL length of 8, a regression value is computed as value = sum over i = 0..7 of i * softmax(logits)_i.

4. Adaptive multi-positive-sample matching: following YOLOX, YOLOv8 drops anchors entirely; anchor-free detection has an advantage for targets with irregular aspect ratios. When matching positive samples for the loss, a positive sample must satisfy two conditions: it lies inside the ground-truth box, and it is among the top-k candidates that best fit that ground-truth box (high overlap between the predicted and ground-truth boxes, and an accurate class prediction).
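The DFL expectation from item 3 can be sketched in plain Python (length 8 here for illustration; the actual YOLOv8 head uses length 16):

```python
import math

def dfl_expectation(logits):
    """Decode one DFL distribution: softmax over the bin logits,
    then the expected bin index is the regression value."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(i * p for i, p in enumerate(probs))

# A distribution sharply peaked at bin 3 decodes to a value near 3.0;
# a flat distribution over bins 0..7 decodes to the mean index 3.5.
print(dfl_expectation([0, 0, 0, 10, 0, 0, 0, 0]))
print(dfl_expectation([0] * 8))
```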

Specific structure

This section is best read alongside the source code from reference 1.

Backbone

(640, 640, 3) --> (80, 80, 256) --> (40, 40, 512) --> (20, 20, 1024 * deep_mul)

1. Stem structure

(1) Overview

YOLOv5 originally used a Focus layer for the initial feature extraction, later replaced by a convolution with a large kernel; neither is fast.

YOLOv7 uses three convolutions for the initial feature extraction, which is also not fast.

YOLOv8 uses an ordinary 3x3 convolution with stride 2 for the initial feature extraction (this sacrifices receptive field but gains speed; presumably the receptive field is still sufficient).

(2) Execution details

The stem is an ordinary 3x3 convolution with stride 2: (640, 640, 3) --> (320, 320, 64). All subsequent downsampling convolutions also use kernel size 3x3 and stride 2: (320, 320, 64) --> (160, 160, 128) --> (80, 80, 256) --> (40, 40, 512) --> (20, 20, 1024 * deep_mul).

Wout = (W + 2P - K) / S + 1

W + 2P: input size plus total padding; K: kernel size; S: stride.

With the same-style automatic padding used here (P = K // 2), the formula can come out fractional; PyTorch floors it, and the result works out to ceil(W / S), i.e. each stride-2 convolution halves the resolution, rounding up.
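A quick sanity check of that formula in plain Python, for a 3x3 stride-2 convolution with same-style padding:

```python
import math

def conv_out(w, k=3, s=2):
    """Output spatial size of a convolution with same-style padding p = k // 2.

    PyTorch computes floor((w + 2p - k) / s) + 1, which for this padding
    equals ceil(w / s).
    """
    p = k // 2
    out = (w + 2 * p - k) // s + 1
    assert out == math.ceil(w / s)
    return out

# The YOLOv8 stem and downsampling convolutions halve the resolution each time:
sizes = [640]
for _ in range(5):
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # [640, 320, 160, 80, 40, 20]
```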

(3) Module structure (shown as a figure in the original post)

2. CSP module

(1) Overview

The CSP module's preprocessing is changed from three convolutions to two, and the multi-stack structure is borrowed from YOLOv7. Concretely, the first convolution doubles the channel count, and its output is then split in half along the channel dimension; this saves one convolution and speeds up the network.

(2) Internal details

After a 1x1 stride-1 convolution, the result is split in half. Taking the first CSP module as an example: (160, 160, 128) --> (160, 160, 256) --> a1 (160, 160, 128) and a2 (160, 160, 128).

a2 passes through a Bottleneck to form a3 (160, 160, 128), and a3 through another Bottleneck to form a4 (160, 160, 128). a1, a2, a3, and a4 are concatenated into (160, 160, 512), and a final 1x1 stride-1 convolution produces A (160, 160, 128).

(3) Specific modules

The Bottleneck consists of two 3x3 stride-1 convolutions with an optional residual connection.

class Bottleneck(nn.Module):
    # Standard bottleneck with an optional residual connection
    # c1: input channels, c2: output channels
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
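The split-and-stack CSP flow described above corresponds to the `C2f` module in the ultralytics source. The sketch below is a self-contained approximation: the minimal `Conv` here (Conv2d + BatchNorm + SiLU with same-style padding) is a simplification of the library's own `Conv` block.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    # Conv2d + BatchNorm + SiLU; padding k // 2 keeps the spatial size ("same")
    def __init__(self, c1, c2, k=1, s=1, g=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    # Two 3x3 stride-1 convolutions with an optional residual connection
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):
        super().__init__()
        c_ = int(c2 * e)
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    # CSP module with two convolutions: cv1 doubles the (hidden) channel
    # count, the result is split in half, each successive Bottleneck output
    # is stacked, and everything is concatenated before the final 1x1 conv.
    def __init__(self, c1, c2, n=1, shortcut=False, e=0.5):
        super().__init__()
        self.c = int(c2 * e)  # half width
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1, 1)
        self.m = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut, e=1.0) for _ in range(n)
        )

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))   # a1, a2
        y.extend(m(y[-1]) for m in self.m)  # a3, a4, ...
        return self.cv2(torch.cat(y, 1))
```

With `n=2` and matching input/output channels, this reproduces the a1..a4 concatenation described above (e.g. 4 x 128 = 512 channels before the final 1x1 convolution).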

3. Activation function

YOLOv8 uses the SiLU activation function, an improvement over Sigmoid and ReLU. SiLU is unbounded above, bounded below, smooth, and non-monotonic; it outperforms ReLU on deep models and can be viewed as a smoothed ReLU.

f(x)=x⋅sigmoid(x)
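In plain Python the formula reads:

```python
import math

def silu(x):
    """SiLU (a.k.a. swish): f(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Bounded below, unbounded above, smooth, non-monotonic:
print(silu(-10.0))  # close to 0
print(silu(0.0))    # 0.0
print(silu(10.0))   # close to 10
```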

Summary: the backbone is essentially a progressive combination of 1x1 convolutions, 3x3 convolutions, and residual connections. The 1x1 convolutions handle channel alignment, while the 3x3 convolutions are mostly used for downsampling. It also borrows CSP's channel-deepening strategy.

FPN

(80, 80, 256), (40, 40, 512), (20, 20, 1024 * deep_mul) --> (80, 80, 256), (40, 40, 512), (20, 20, 1024 * deep_mul)

The FPN modules follow the same principles as the backbone. After fusing features through upsampling and downsampling, the three feature maps have shapes feat1 = (80, 80, 256), feat2 = (40, 40, 512), and feat3 = (20, 20, 1024 * deep_mul). deep_mul is simply a coefficient that scales the channel count of the deepest layer, presumably to balance computational cost in YOLOv8.

Summary

Feature enhancement borrows the idea of PANet (SSD was among the first methods to represent multi-scale features with a feature-pyramid-like structure; FPN relies on a top-down pyramid, and PANet adds an extra bottom-up path on top of it).

Yolo Head

(80, 80, 256), (40, 40, 512), (20, 20, 1024 * deep_mul) --> (4, 8400), (80, 8400), ((144, 80, 80), (144, 40, 40), (144, 20, 20)), (2, 8400), (1, 8400)

(4, 8400): the YOLOv8 head is decoupled: classification and regression are not produced by the same 3x3 convolution branch. The channel count of the regression branch depends on the DFL length, which YOLOv8 sets to 16, so the regression branch has 16 * 4 = 64 channels and the three feature maps have shapes (20, 20, 64), (40, 40, 64), (80, 80, 64). The 64 channels split into four groups of 16, one per regression coefficient; after decoding, the shapes become (20, 20, 4), (40, 40, 4), (80, 80, 4). 8400 = 80*80 + 40*40 + 20*20.

(80, 8400): the channel count of the classification branch equals the number of classes to distinguish.

((144, 80, 80), (144, 40, 40), (144, 20, 20)): 144 = 80 (classes) + 64 (4 * 16 DFL channels).

(2, 8400): grid-point coordinates at 0.5 intervals. For each level the x coordinates have shapes (80*80, 1), (40*40, 1), (20*20, 1), and likewise the y coordinates; concatenating the coordinate tensors gives (2, 8400).

For example:

[[ 0.5000,  0.5000],
 [ 1.5000,  0.5000],
 [ 2.5000,  0.5000],
 ...,
 [77.5000, 79.5000],
 [78.5000, 79.5000],
 [79.5000, 79.5000]]

(1, 8400): shapes (80*80, 1), (40*40, 1), (20*20, 1); each tensor holds the downscaling factor (stride) of its level, 8, 16, and 32 respectively. Concatenating the three tensors gives (1, 8400).
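A pure-Python sketch of building these anchor points and strides (tensors flattened to plain lists here for simplicity):

```python
def make_anchors(sizes=(80, 40, 20), strides=(8, 16, 32), offset=0.5):
    """Build per-cell anchor points (grid coordinates at 0.5 intervals)
    and the matching stride for every point."""
    points, point_strides = [], []
    for size, stride in zip(sizes, strides):
        for gy in range(size):
            for gx in range(size):
                points.append((gx + offset, gy + offset))
                point_strides.append(stride)
    return points, point_strides

points, strides = make_anchors()
print(len(points))                      # 8400 = 80*80 + 40*40 + 20*20
print(points[0], points[80 * 80 - 1])   # (0.5, 0.5) (79.5, 79.5)
```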

Summary

The same feature map is decoupled through separate 3x3 convolution branches, one producing regression features and the other classification features.

Post-processing

Score filtering & non-maximum suppression

(8400, 84) --> (10, 6): for example, 10 detected objects, each with 4 box coordinates, 1 confidence score, and 1 class index (84 = 4 box values + 80 class scores).

Score filtering keeps the predicted boxes whose score exceeds the confidence threshold.

Non-maximum suppression keeps, within a local region, only the highest-scoring box of each class.
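A minimal sketch of these two steps for a single class (plain Python; boxes as (x1, y1, x2, y2, score) tuples, not the library's implementation):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_and_nms(boxes, conf_thres=0.5, iou_thres=0.5):
    """Score filtering followed by greedy NMS for one class."""
    boxes = [b for b in boxes if b[4] >= conf_thres]  # score filtering
    boxes.sort(key=lambda b: b[4], reverse=True)      # best box first
    kept = []
    for b in boxes:
        # keep a box only if it does not overlap a better box too much
        if all(iou(b, k) < iou_thres for k in kept):
            kept.append(b)
    return kept

dets = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8),
        (20, 20, 30, 30, 0.7), (0, 0, 5, 5, 0.3)]
print(filter_and_nms(dets))
```

Here the 0.8-score box is suppressed by the overlapping 0.9-score box, and the 0.3-score box is dropped by the confidence filter.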

Loss function (train)

Positive sample matching (find the ground-truth box each feature point is responsible for predicting).

(a) Use spatial position to decide whether a feature point lies inside a ground-truth box.

Subtract the top-left corner of the ground-truth box from the feature-point coordinates, and subtract the feature-point coordinates from the bottom-right corner; if all four values are greater than 0, the feature point lies inside the box.

(b) Use a cost function to decide whether a feature point is among the top-k candidates for a ground-truth box, based on:

a) the overlap between each ground-truth box and the box predicted at the feature point;

b) the class-prediction accuracy for each ground-truth box at the feature point.

(c) Deduplication and other post-processing.

A single feature point may end up matched to multiple ground-truth boxes; such cases are resolved by keeping the match with the highest overlap.
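The inside-the-box test from step (a) in plain Python:

```python
def point_in_box(px, py, box):
    """True if (px, py) lies strictly inside box = (x1, y1, x2, y2).

    All four signed distances to the box edges must be positive:
    point minus top-left corner, and bottom-right corner minus point.
    """
    x1, y1, x2, y2 = box
    deltas = (px - x1, py - y1, x2 - px, y2 - py)
    return all(d > 0 for d in deltas)

print(point_in_box(4.5, 4.5, (0, 0, 10, 10)))   # True: inside
print(point_in_box(12.5, 4.5, (0, 0, 10, 10)))  # False: outside
```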

Tips

增加小目标检测层:yolov8_v7_v5


Origin blog.csdn.net/qq_41804812/article/details/129663475