A super-detailed line-by-line interpretation of the YOLOv8 source code, plus a detailed explanation of the network structure (beginner notes for self-use)

YOLOv8 network structure

Since my previous YOLOv8 reproduction received some positive feedback, I have decided to continue along the beginner's learning path: reproducing the code, grinding through it line by line, and studying the theory.
Code to reproduce:

Complete and detailed YOLOv8 reproduction + training on your own dataset (Gu Ge's blog, CSDN): https://blog.csdn.net/chenhaogu/article/details/131161374?spm=1001.2014.3001.5501 (If you find it useful and appreciate the effort, please like and bookmark, thank you!)

Tip: if you only care about the code explanation, skip straight to the code-explanation part. My original intention is to "unpack it, see the essence, and make it your own": after understanding the core code, rewrite the un-encapsulated network structure in your own way.


1. Main network structure

1. Backbone: The main body is the CSPDarkNet structure.

        Both YOLOv5 and YOLOv8 use this structure. The main building block in YOLOv5 is the C3 module; in YOLOv8 it is replaced by the C2f module.

2. CSPNet (Cross Stage Partial Network):

For a more detailed understanding, see this very thorough introduction by a blogger on Zhihu:

CSPNet: PyTorch implementations of CSPDenseNet and CSPResNeXt (Zhihu): https://zhuanlan.zhihu.com/p/263555330

Broadly speaking, CSPNet builds on DenseNet; it reduces the amount of computation while strengthening gradient flow (readers unfamiliar with DenseNet can search for the related papers on CSDN or Zhihu; adding a keyword such as "explanation" helps if you just want a quick overview). The idea can be described as: one part of the feature map enters the convolutional network for computation, while the other part is set aside for the final concat. The intermediate stages also borrow the ResNet idea. To sum it up with a well-known saying: the small intestine wrapped around the large intestine.
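To make the split-and-concat idea concrete, here is a minimal, hypothetical CSP-style block in PyTorch. This is not the Ultralytics implementation, just a sketch of the idea (the class name and layer choices are my own):

import torch
import torch.nn as nn

class TinyCSPBlock(nn.Module):
    """Illustrative CSP-style block: half the channels pass through conv layers,
    the other half bypasses them, and both halves are concatenated at the end."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.convs = nn.Sequential(  # the "processed" branch
            nn.Conv2d(half, half, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(half, half, 3, padding=1),
            nn.ReLU())

    def forward(self, x):
        a, b = x.chunk(2, dim=1)  # split along the channel dimension
        return torch.cat((a, self.convs(b)), dim=1)  # kept part + processed part

x = torch.randn(1, 64, 32, 32)
print(TinyCSPBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])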

3. Partial Transition Layer:

Comparing the four structures (from the CSPNet paper's figure), each has its own advantages. Figure (a) is the traditional dense-network design. Figure (b) is a compromise that weighs (c) against (d). In (c), a large amount of the feature information from part 1 and part 2 is used in the final fusion; in (d), only part of it is used, which reduces the amount of computation.

4. Code explanation:

(1) Network structure:

(2) The network model file is saved in models->v8->yolov8.yaml

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]]  # 9

from: the input source. -1 means the output of the previous layer is used as this layer's input.

repeats: the number of repetitions of the module.

module: the module to use.

args: the arguments passed to the module.

The output-size formula: out_size = (in_size - k + 2 * p) / s + 1 (integer division, i.e. floor).
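As a quick sanity check, here is a small sketch applying this formula to Layer 0 of the backbone (k=3, s=2, p=1 on a 640*640 input):

import torch
import torch.nn as nn

def out_size(in_size, k, s, p):
    return (in_size - k + 2 * p) // s + 1

print(out_size(640, k=3, s=2, p=1))  # 320

# cross-check against an actual conv layer
conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(conv(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 64, 320, 320])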

(3) Convolution code (nn->modules->conv.py)

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Perform transposed convolution of 2D data."""
        return self.act(self.conv(x))

The forward pass is conv(x) -> bn() -> SiLU().
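A quick usage sketch (assuming the Conv class above is in scope, e.g. imported from ultralytics' nn.modules):

import torch

# Layer 0 of the backbone: 3 input channels -> 64 output channels, k=3, s=2
layer0 = Conv(3, 64, k=3, s=2)  # autopad picks p = 3 // 2 = 1
x = torch.randn(1, 3, 640, 640)
print(layer0(x).shape)  # torch.Size([1, 64, 320, 320])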

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160

(4) Now we enter the core module, C2f; the code is in nn->modules->block.py.

class C2f(nn.Module):
    """CSP Bottleneck with 2 convolutions."""

    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        """Forward pass through C2f layer."""
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        """Forward pass using split() instead of chunk()."""
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

[-1, 3, C2f, [128, True]]: 3: the repeat count, which parse_model passes into C2f as n, i.e. 3 Bottlenecks inside this C2f; True: the Bottlenecks use a shortcut.

The trickiest line is:

self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

n: indicates n Bottlenecks. Assuming n=3, range(3) -> m = [Bottleneck, Bottleneck, Bottleneck]; m then behaves like one composite module.

There are three channel parameters: c1 = number of input channels; c2 = number of output channels; self.c = e * c2 = 0.5 * c2 (hidden channels). Below are the shapes at each step, assuming the input is [2, 128, 160, 160].

        x goes through cv1 first;

        Then comes the split: torch.chunk(chunks, dim), where chunk(2, 1) splits the tensor into two pieces along dimension 1 (the channel dimension);

        m(y[-1]) for m in self.m: this feeds the most recent block into the n Bottlenecks in turn; because y.extend appends each result as it is produced, each Bottleneck consumes the previous one's output. It is equivalent to:

## Example comparison
    m(y[-1]) for m in self.m                     x*x for x in range(3)
=>
    for m in self.m:                             for x in range(3):
        y.append(m(y[-1]))                           print(x*x)

## Flow, assuming 3 Bottlenecks

Bottleneck0 -> m;  m(y[-1]) appended by y.extend            0 -> x -> 0*0=0
Bottleneck1 -> m;  m(y[-1]) appended by y.extend            1 -> x -> 1*1=1
Bottleneck2 -> m;  m(y[-1]) appended by y.extend            2 -> x -> 2*2=4

y = [first chunk,
     second chunk,
     Bottleneck0's output (fed the second chunk),
     Bottleneck1's output (fed Bottleneck0's output),
     Bottleneck2's output (fed Bottleneck1's output)]

        Therefore, y grows to 2+n blocks;

        torch.cat(y, 1) concatenates the elements of y along dimension 1 (because the chunk above split along dimension 1);

        Finally, after cv2, the feature map size is 128*160*160.
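A small sketch tracing these shapes (assuming the C2f and Bottleneck classes from block.py are in scope):

import torch

# [-1, 3, C2f, [128, True]]: c1=128, c2=128, n=3, shortcut=True
m = C2f(128, 128, n=3, shortcut=True)
x = torch.randn(2, 128, 160, 160)

hidden = m.cv1(x)          # [2, 128, 160, 160]: 2 * self.c = 128 channels
a, b = hidden.chunk(2, 1)  # two [2, 64, 160, 160] chunks
print(a.shape, b.shape)

print(m(x).shape)  # torch.Size([2, 128, 160, 160])
# cv2 saw (2 + 3) * 64 = 320 input channels before projecting back to 128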

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160

After another Conv layer:

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80

After the 6-repeat C2f, this level's output feeds into the familiar feature-pyramid structure that connects to the detection head (the detection head is described in detail later).

The explanation of the C2f part is the same as above; this time there are 6 repeated modules.

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80

After another Conv layer:

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80
Layer 5 (Conv[3,2,1])  channel: 512   size: 40*40

After the 6-repeat C2f:

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80
Layer 5 (Conv[3,2,1])  channel: 512   size: 40*40
Layer 6 (C2f)*6        channel: 512   size: 40*40

After another Conv layer (per the yaml, this Conv outputs 1024 channels):

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80
Layer 5 (Conv[3,2,1])  channel: 512   size: 40*40
Layer 6 (C2f)*6        channel: 512   size: 40*40
Layer 7 (Conv[3,2,1])  channel: 1024  size: 20*20

After the 3-repeat C2f:

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80
Layer 5 (Conv[3,2,1])  channel: 512   size: 40*40
Layer 6 (C2f)*6        channel: 512   size: 40*40
Layer 7 (Conv[3,2,1])  channel: 1024  size: 20*20
Layer 8 (C2f)*3        channel: 1024  size: 20*20

(6) SPPF, the code is in nn->modules->block.py.

input map              channel: 3     size: 640*640
Layer 0 (Conv[3,2,1])  channel: 64    size: 320*320
Layer 1 (Conv[3,2,1])  channel: 128   size: 160*160
Layer 2 (C2f)*3        channel: 128   size: 160*160
Layer 3 (Conv[3,2,1])  channel: 256   size: 80*80
Layer 4 (C2f)*6        channel: 256   size: 80*80
Layer 5 (Conv[3,2,1])  channel: 512   size: 40*40
Layer 6 (C2f)*6        channel: 512   size: 40*40
Layer 7 (Conv[3,2,1])  channel: 1024  size: 20*20
Layer 8 (C2f)*3        channel: 1024  size: 20*20
Layer 9 (SPPF[k=5])    channel: 1024  size: 20*20

The code:

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher."""

    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        """Forward pass through Ghost Convolution block."""
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))

The output-size formula for max pooling is the same as for convolution: out_size = (in_size - k + 2 * p) / s + 1. With k=5, s=1, p=k//2=2, the spatial size is unchanged: (20 - 5 + 4)/1 + 1 = 20.
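A quick sketch checking that SPPF keeps the spatial size and only re-mixes channels (assuming the SPPF and Conv classes above are in scope):

import torch

sppf = SPPF(1024, 1024, k=5)
x = torch.randn(1, 1024, 20, 20)
print(sppf(x).shape)  # torch.Size([1, 1024, 20, 20])
# internally: cv1 -> 512 channels; three cascaded 5x5 max-pools (receptive
# fields equivalent to 5, 9, 13); concat -> 2048 channels; cv2 -> 1024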

2. Head

1. With the backbone covered, let's explain the head part.

The code is as follows: (the code is in ultralytics -> models -> v8 -> yolov8.yaml)

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4
  - [-1, 3, C2f, [512]]  # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3
  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]]  # cat head P4
  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]]  # cat head P5
  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)
(1) Layer 10: the SPPF output enters the upsampling layer (nn.Upsample).
None: no output size is specified.
2: the output is twice the size of the input (scale_factor=2).
nearest: nearest-neighbor interpolation. After upsampling, the size goes from 1024*20*20 -> 1024*40*40.
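A one-line sanity check of the upsample arguments [None, 2, 'nearest']:

import torch
import torch.nn as nn

up = nn.Upsample(size=None, scale_factor=2, mode='nearest')
x = torch.randn(1, 1024, 20, 20)
print(up(x).shape)  # torch.Size([1, 1024, 40, 40])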

The route is as shown in the figure:

 (2) Layer 11: the Concat module, which concatenates the upsampled previous layer with layer 6 (P4). (code in: ultralytics -> nn -> modules -> __init__.py)

[1]: concatenate along dimension 1;

At this point, the upsampled tensor (1024*40*40) is concatenated with the layer-6 output (512*40*40): the channels add up, 1024 + 512 = 1536, giving 1536*40*40.

class Concat(nn.Module):
    """Concatenate a list of tensors along dimension."""

    def __init__(self, dimension=1):
        """Concatenates a list of tensors along a specified dimension."""
        super().__init__()
        self.d = dimension

    def forward(self, x):
        """Forward pass for the YOLOv8 mask Proto module."""
        return torch.cat(x, self.d)
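A quick usage sketch for this layer-11 concat (assuming the Concat class above is in scope):

import torch

cat = Concat(dimension=1)
a = torch.randn(1, 1024, 40, 40)  # upsampled layer-10 output
b = torch.randn(1, 512, 40, 40)   # layer-6 (P4) output
print(cat([a, b]).shape)          # torch.Size([1, 1536, 40, 40])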

The route taken is shown in the figure below:

 (3) Layer 12: 3*C2f; the number of output channels is 512, with no shortcut. The size goes from 1536*40*40 -> 512*40*40.

(4) Layer 13: upsample, with layer 12 as input. Same principle as layer 10. Size goes from 512*40*40 -> 512*80*80.

(5) Layer 14: the Concat module, which concatenates the upsampled previous layer (layer 13) with layer 4 (P3).

At this point, the channels add up: 512*80*80 + 256*80*80 = 768*80*80.

The route is as shown in the figure:

(6) Layer 15: 3*C2f; the number of output channels is 256, with no shortcut. The size goes from 768*80*80 -> 256*80*80.

The route is as shown in the figure:

 (7) Layer 16: a Conv with 256 output channels, k=3, s=2 (size formula introduced above). 256*80*80 -> 256*40*40.

The route is as shown in the figure:

 (8) Layer 17: the Concat module, which concatenates layers 16 and 12.

Layer 16: 256*40*40 + layer 12: 512*40*40 = 768*40*40.

The route is as shown in the figure:

(9) Layer 18: 3*C2f; the number of output channels is 512, with no shortcut. The size goes from 768*40*40 -> 512*40*40.

The route is as shown in the figure:

 (10) Layer 19: a Conv with 512 output channels, k=3, s=2 (size formula introduced above). 512*40*40 -> 512*20*20.

The route is as shown in the figure:

 (11) Layer 20: the Concat module, which concatenates layers 19 and 9.

[[-1, 9], 1, Concat, [1]]

Layer 19: 512*20*20 + layer 9: 1024*20*20 = 1536*20*20.

The route is as shown in the figure:

 (12) Layer 21: 3*C2f; the number of output channels is 1024, with no shortcut. Size change: 1536*20*20 -> 1024*20*20.

The route is as shown in the figure:

 At this point the entire head has been traversed; next comes the Detect layer.

3. Detect

1. Layer 22

The code is as follows:

[[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)

class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""
    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):  # detection layer
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], self.nc)  # channels
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch)
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        shape = x[0].shape  # BCHW
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        if self.training:
            return x
        elif self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

        x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
        if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
            box = x_cat[:, :self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4:]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)
        dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides
        y = torch.cat((dbox, cls.sigmoid()), 1)
        return y if self.export else (y, x)

    def bias_init(self):
        """Initialize Detect() biases, WARNING: requires stride availability."""
        m = self  # self.model[-1]  # Detect() module
        # cf = torch.bincount(torch.tensor(np.concatenate(dataset.labels, 0)[:, 0]).long(), minlength=nc) + 1
        # ncf = math.log(0.6 / (m.nc - 0.999999)) if cf is None else torch.log(cf / cf.sum())  # nominal class frequency
        for a, b, s in zip(m.cv2, m.cv3, m.stride):  # from
            a[-1].bias.data[:] = 1.0  # box
            b[-1].bias.data[:m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)

The following explanation draws on another blogger's post, which is also worth reading on its own: "Detailed explanation of YOLOv8's Detect layer (change in output dimension)", Yinjiacheng's blog, CSDN. For the deployment side, the ONNX output needs post-processing, but the anchor-free YOLOv8 output is 1*(4+cls)*8400, which is not the common NCHW format, so he studied the Detect layer and shared his findings: https://blog.csdn.net/yjcccccc/article/details/130261153?ops_request_misc=&request_id=&biz_id=102&utm_term=yolov8%E7%9A%84detect%E6%A8%A1%E5%9D%97&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-0-130261153.nonecase&spm=1018.2226.3001.4187

First look at the initialization parameters:

nc: the number of classes;
nl: the number of detection layers used by the model;
reg_max: the number of DFL channels output per anchor point;
no: the number of outputs per anchor, covering both the class scores and the box position information;
stride: a tensor of shape (nl,) giving the stride of each detection layer;
cv2: an nn.ModuleList of convolutional branches that predict the box position information for each anchor;
cv3: an nn.ModuleList of convolutional branches that predict the class information for each anchor;
dfl: a DFL (Distribution Focal Loss) module that converts the predicted per-side distributions into box regression values (its code is introduced below);
shape: the input shape the model expects; if the model only accepts a fixed input shape, self.shape caches that shape.
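For reference, the DFL module (also in nn->modules->block.py) is roughly the following: a frozen 1x1 convolution whose weights are the bin indices 0..reg_max-1, so that applying it after a softmax computes the expected value of the distribution over the 16 bins for each box side:

import torch
import torch.nn as nn

class DFL(nn.Module):
    """Integral module of Distribution Focal Loss: softmax over the reg_max
    bins, then the expected value via a fixed-weight 1x1 convolution."""

    def __init__(self, c1=16):
        super().__init__()
        self.conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)
        x = torch.arange(c1, dtype=torch.float)
        self.conv.weight.data[:] = x.view(1, c1, 1, 1)  # weights = bin indices 0..15
        self.c1 = c1

    def forward(self, x):
        b, c, a = x.shape  # batch, 4 * c1, anchors
        # reshape to (b, c1, 4, a), softmax over the c1 bins, take the expectation
        return self.conv(x.view(b, 4, self.c1, a).transpose(2, 1).softmax(1)).view(b, 4, a)

x = torch.randn(2, 64, 8400)    # 4 * reg_max = 64 distribution logits per anchor
print(DFL(16)(x).shape)         # torch.Size([2, 4, 8400])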

 

Forward function:

(1) shape = the shape of x[0], i.e. (batch, channel, h, w):

 shape = x[0].shape  # BCHW

(2)

        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)

Suppose the input image is 3*640*640;
ch is a tuple, and nl is the length of this tuple.

x[0] = torch.cat((cv2[0](x[0]), cv3[0](x[0])), 1) (the 1 indicates concatenation along dimension 1);

Here cv2[0] is the box branch (its first layer is Conv(x, c2, 3)) and cv3[0] is the class branch (its first layer is Conv(x, c3, 3)).

The per-anchor output count: self.no = nc + self.reg_max * 4; assuming 80 classes, self.no = 80 + 4*16 = 144;

Then the three output feature maps should be 1*144*80*80 (640/8), 1*144*40*40 (640/16) and 1*144*20*20 (640/32);
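After the three maps are flattened and concatenated (the x_cat line in forward), the total anchor count is 8400, which is where the familiar 1*(4+cls)*8400 export shape comes from. A quick check:

# anchors contributed by the three feature maps
anchors = 80 * 80 + 40 * 40 + 20 * 20
print(anchors)  # 8400

# x_cat has self.no = 144 channels; after the DFL + split, the inference
# output y carries 4 box values + 80 class scores per anchor
print(4 + 80)  # 84 -> y has shape [1, 84, 8400]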

For the rest of the Detect details, please refer to the blogger linked above; his write-up is thorough, so consider giving him likes and bookmarks.


 

 

Summary

The above is what I wanted to cover today. This article only briefly walks through the source-code structure, and I will keep updating and explaining it. Since this post has grown long and may be hard to digest in one sitting, I will continue unpacking the code in a while and try to reproduce a slimmed-down YOLOv8 in plain PyTorch. There are probably mistakes above; corrections are welcome!


Source: https://blog.csdn.net/chenhaogu/article/details/131647758