yolov5-master source code detailed notes - yolo module

This article only briefly explains the structure of the yolov5 neural network; for the underlying theory, please study neural networks on your own. (It reads better on a computer.)

Table of contents

yolov5s.yaml:

yolo.py:

       if __name__ == '__main__':

       Model:

              __init__: build the network structure

              Define model: define the model

              Build:

              forward: predict the input image

       parse_model:

common.py:


Before parsing the yolo file, we need to understand what the network structure of yolov5 is like:

yolov5s.yaml:

(This file is just a configuration description that guides how the model is built; we can use it as a reference when writing our own model configuration file.)

First understand the yolov5 network structure as shown in the figure:

Let's first look at what the backbone and the head in the middle of the file are:

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

       backbone: the meaning of each field is as follows

              # 0-P1/2  first layer

                     from: -1 means the input comes from the previous layer; [-1, 6] means the inputs come from the previous layer and from layer 6

                     number: the number of times the module is repeated; if number > 1, it is scaled by depth_multiple (number = round(number * depth_multiple))

                     module: the module type (Conv, C3, etc.), i.e. the layer structure defined in common.py

                     args: the arguments passed to the module; check the corresponding class in common.py to see what each argument means

                     P1: feature level 1 (first layer)

                     /2: stride 2, so the image height and width are halved (this is why the input resolution must be a multiple of 32)

                 # 1-P2/4  second layer:

                the remaining layers follow the same pattern, layer by layer

        head: like the backbone, this is also a list of network layers

               nn.Upsample: upsampling layer

               Concat: concatenates the output features of the listed layers along the channel dimension (see the small example below)

               Detect: the detection (inference) layer
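As a quick illustration of what a Concat entry like [[-1, 6], 1, Concat, [1]] does, here is a minimal sketch using made-up tensor shapes (the Concat class in common.py is essentially a thin wrapper around torch.cat):

import torch

# hypothetical shapes: output of the previous layer and of backbone layer 6 (P4)
prev = torch.zeros(1, 256, 40, 40)
p4 = torch.zeros(1, 256, 40, 40)

# Concat([1]) concatenates along dim 1 (the channel dimension)
out = torch.cat((prev, p4), dim=1)
print(out.shape)  # torch.Size([1, 512, 40, 40])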

Notice that there are 25 module entries in total (indexed 0 to 24, with Detect as layer 24). So how are these layers connected to each other? Are they simply stacked one on top of the other?

Obviously not; the actual network structure looks like this: (many network-structure diagrams online draw the backbone stacked from top to bottom; here I follow the bottom-to-top layout used by the Bilibili uploader 480920279.)

The meaning of the picture is that an RGB three-channel image is passed into the network. After going through the 10-layer backbone, it enters the head for upsampling, feature fusion, and so on, and finally the outputs of three C3 layers are fed into the Detect layer. From top to bottom, these three C3 layers are what we call the high-level, middle-level, and low-level feature layers.

The difference between these feature levels is that the low level (P3/8) detects small targets, the middle level (P4/16) detects medium targets, and the high level (P5/32) detects large targets; their predictions are combined to detect objects of all sizes.

# Parameters
nc: 6  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

        nc: number of target classes (number of classes)

        anchors: the three rows correspond to the three feature levels

               [10,13, 16,30, 33,23]  3 anchors for the low level: 10×13, 16×30, 33×23

               [30,61, 62,45, 59,119]  3 anchors for the middle level

        depth_multiple: model depth multiple; when the model is created, number is scaled as round(number * depth_multiple)

        width_multiple: channel (width) multiple; each layer's channel argument * width_multiple = that layer's output channels. The depth result is rounded to the nearest integer (minimum 1), and the width result is rounded up to the nearest multiple of 8 (see the short example below)

        depth_multiple and width_multiple determine the complexity of the model: the larger the values, the more complex the model, the higher the accuracy, and the longer the inference time

Accuracy of the various model files: n < s < m < l < x (the higher the accuracy, the longer the inference time). The only difference between these files is the depth multiple and the width (channel) multiple.
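A minimal sketch of how these two multiples are applied, assuming make_divisible is the ceil-based helper from utils/general.py:

import math

def make_divisible(x, divisor=8):
    # round up to the nearest multiple of divisor (as in utils/general.py)
    return math.ceil(x / divisor) * divisor

gd, gw = 0.33, 0.50  # yolov5s: depth_multiple, width_multiple

# backbone entry [-1, 9, C3, [512]] after scaling:
n = max(round(9 * gd), 1)          # 9 * 0.33 = 2.97 -> 3 repeats
c2 = make_divisible(512 * gw, 8)   # 512 * 0.50 = 256 -> 256 output channels
print(n, c2)  # 3 256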

Therefore, by referring to this file, we can write a model configuration file of our own.

After understanding the neural network structure, let's go to yolo.py to see how the neural network is implemented:

yolo.py:

       if __name__ == '__main__':

              create model: create the yolov5 model

(Depending on the yolov5 version, this part may be written in a few different ways):

The first:

# Create model
im = torch.rand(opt.batch_size, 3, 640, 640).to(device)
model = Model(opt.cfg).to(device)

The second:

    # Create model
    model = Model(opt.cfg).to(device)
    model.train()

    # Profile
    if opt.profile:
        img = torch.rand(8 if torch.cuda.is_available() else 1, 3, 640, 640).to(device)
        y = model(img, profile=True)

                     im / img: a randomly generated dummy input image of shape (batch, 3, 640, 640)

                     model = Model(opt.cfg): build the model ( --> Model ); y = model(img, profile=True) then runs a profiling forward pass

              (The Profile branch is optional and only runs when opt.profile is set.)

       Model:

              __init__: build the network structure

    def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):  # model, input channels, number of classes
        super().__init__()
        if isinstance(cfg, dict):
            self.yaml = cfg  # model dict
        else:  # is *.yaml
            import yaml  # for torch hub
            self.yaml_file = Path(cfg).name
            with open(cfg, encoding='ascii', errors='ignore') as f:
                self.yaml = yaml.safe_load(f)  # model dict

                     cfg: configuration file (yolov5s.yaml)

                     ch: number of input image channels

                     super().__init__(): initialize the nn.Module base class

                            isinstance(cfg, dict): check whether cfg is already a dict; otherwise treat it as a *.yaml path

                            self.yaml_file stores the file name

                     with open(...): load the yaml file; its contents are stored as a dictionary in self.yaml
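To see what self.yaml ends up holding, you can load the file yourself; a minimal sketch (the path is an assumption, adjust it to your checkout):

import yaml

with open('models/yolov5s.yaml', encoding='ascii', errors='ignore') as f:
    d = yaml.safe_load(f)  # the whole file becomes a plain dict

print(d['nc'], d['depth_multiple'], d['width_multiple'])
print(len(d['backbone']), len(d['head']))  # 10 backbone entries, 15 head entries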

              Define model: define the model from the loaded dictionary

# Define model
        ch = self.yaml['ch'] = self.yaml.get('ch', ch)  # input channels
        if nc and nc != self.yaml['nc']:  # check whether the passed-in nc matches the value in the yaml
            LOGGER.info(f"Overriding model.yaml nc={self.yaml['nc']} with nc={nc}")
            self.yaml['nc'] = nc  # override yaml value
        if anchors:
            LOGGER.info(f'Overriding model.yaml anchors with anchors={anchors}')
            self.yaml['anchors'] = round(anchors)  # override yaml value
        self.model, self.save = parse_model(deepcopy(self.yaml), ch=[ch])  # model, savelist
        self.names = [str(i) for i in range(self.yaml['nc'])]  # default names
        self.inplace = self.yaml.get('inplace', True)

                     ch: the number of input channels (taken from the yaml if present, otherwise the default 3)

                     nc, anchors: override the class count / anchors from the yaml if values are passed in

                     .model: build the model ( --> parse_model )

                     .names: default class names (just the stringified class indices)

                     .inplace: read the inplace flag from the yaml (default True)

              Build:

# Build strides, anchors
        m = self.model[-1]  # Detect()
        if isinstance(m, Detect):
            s = 256  # 2x min stride
            m.inplace = self.inplace
            m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))])  # forward:[8, 16, 32]
            m.anchors /= m.stride.view(-1, 1, 1)
            check_anchor_order(m)
            self.stride = m.stride
            self._initialize_biases()  # only run once

                     Check whether the last layer of the model is a Detect layer

                            m.stride: run a dummy s×s image through the low/middle/high feature levels and divide the input size by each output feature-map size to get the strides [8, 16, 32]

                            m.anchors /=: divide the anchors by the stride, converting them from pixels to grid (feature-map) units

                            check_anchor_order: check that the anchor order matches the stride order
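A small numeric sketch of what this step computes (the feature-map sizes below are what a 256×256 input produces at P3/P4/P5):

import torch

s = 256                                    # size of the dummy image yolo.py forwards
map_sizes = torch.tensor([32., 16., 8.])   # P3/P4/P5 feature-map sizes for a 256x256 input
stride = s / map_sizes                     # -> tensor([ 8., 16., 32.])

# anchors are then rescaled from pixels to grid units, e.g. the first P3 anchor:
anchor_p3 = torch.tensor([10., 13.]) / stride[0]   # -> tensor([1.2500, 1.6250])
print(stride, anchor_p3)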

              forward: predict the input image
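The forward pass walks the module list built by parse_model and uses the save list to keep the intermediate outputs that later layers (Concat, Detect) need. Roughly, Model._forward_once looks like this (a simplified paraphrase of the yolov5 source, with profiling and visualization stripped out):

def _forward_once(self, x):
    y = []  # cached outputs of the layers listed in self.save
    for m in self.model:
        if m.f != -1:  # input does not come (only) from the previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)  # run the module
        y.append(x if m.i in self.save else None)  # only keep what later layers need
    return x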

       parse_model:

def parse_model(d, ch):  # model_dict, input_channels(3)
    LOGGER.info(f"\n{'':>3}{'from':>18}{'n':>3}{'params':>10}  {'module':<40}{'arguments':<30}")
    anchors, nc, gd, gw = d['anchors'], d['nc'], d['depth_multiple'], d['width_multiple']
    na = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors  # number of anchors
    no = na * (nc + 5)  # number of outputs = anchors * (classes + 5)

    layers, save, c2 = [], [], ch[-1]  # layers, savelist, ch out

              LOGGER.info: print the header of the model summary table

              Then read anchors, nc, depth_multiple (gd) and width_multiple (gw) from the yaml dict

              na: number of anchors per feature level (each anchor is a width/height pair, hence the // 2)

              no: number of output channels per detection layer = na * (nc + 5); the 5 is the 4 box coordinates plus the objectness confidence. With COCO's nc = 80 this gives the well-known 255

              layers (stores every created network layer), save (records which feature layers must be kept), c2 (output channels of the current layer, initialized to ch[-1])
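A quick check of those two formulas with the anchors from the yaml above:

anchors = [[10, 13, 16, 30, 33, 23],
           [30, 61, 62, 45, 59, 119],
           [116, 90, 156, 198, 373, 326]]

na = len(anchors[0]) // 2    # 6 numbers = 3 (width, height) pairs -> 3 anchors per level
no = na * (80 + 5)           # with COCO's nc = 80: 3 * 85 = 255
no_custom = na * (6 + 5)     # with nc = 6 as in the yaml above: 33
print(na, no, no_custom)     # 3 255 33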

    for i, (f, n, m, args) in enumerate(d['backbone'] + d['head']):  # from, number, module, args
        # get the module class; eval is used here instead of direct assignment to guard against format errors
        m = eval(m) if isinstance(m, str) else m  # eval strings
        for j, a in enumerate(args):
            try:
                # likewise, eval string args instead of assigning them directly
                args[j] = eval(a) if isinstance(a, str) else a  # eval strings, [64, 6, 2, 2]
            except NameError:
                pass

              Get the module class and its args. eval (wrapped in try/except) is used instead of direct assignment so that string entries in the yaml such as 'nc', 'anchors' or 'nn.Upsample' are resolved into real objects, while strings that cannot be resolved (e.g. 'nearest') are kept as-is.

The next step is to determine whether the current layer is a convolution-type layer, an upsampling layer, a detection layer, and so on, and handle each case accordingly:

        # if n > 1, multiply it by the depth multiple
        n = n_ = max(round(n * gd), 1) if n > 1 else n  # depth gain
        if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost]:
            c1, c2 = ch[f], args[0]
            if c2 != no:  # if not output
                c2 = make_divisible(c2 * gw, 8)

            args = [c1, c2, *args[1:]]   # args[3, 32, 6, 2, 2]
            if m in [BottleneckCSP, C3, C3TR, C3Ghost]:
                args.insert(2, n)  # number of repeats
                n = 1
        elif m is nn.BatchNorm2d:
            args = [ch[f]]
        elif m is Concat:
            c2 = sum(ch[x] for x in f)
        elif m is Detect:
            args.append([ch[x] for x in f])
            if isinstance(args[1], int):  # number of anchors
                args[1] = [list(range(args[1] * 2))] * len(f)
        elif m is Contract:
            c2 = ch[f] * args[0] ** 2
        elif m is Expand:
            c2 = ch[f] // args[0] ** 2
        else:
            c2 = ch[f]

              if m in [...]: branch on the module type

                     Convolution-type layers: if the output channel count is not the final detection output (no), multiply it by the width multiple and round it up to a multiple of 8 (multiples of 8 are friendlier to GPU computation)

                        C3-type layers (BottleneckCSP, C3, C3TR, C3Ghost): additionally insert the repeat count n into args and reset n to 1, since the repetition is handled inside the module itself
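For instance, with yolov5s the first backbone entry [-1, 1, Conv, [64, 6, 2, 2]] is rewritten as follows (a small sketch that reproduces the arithmetic, not the actual source):

import math

gw = 0.50                          # yolov5s width_multiple
ch = [3]                           # channel list starts with the RGB input
f, args = -1, [64, 6, 2, 2]

c1 = ch[f]                         # 3, the input channels
c2 = math.ceil(64 * gw / 8) * 8    # make_divisible(64 * 0.5, 8) -> 32
args = [c1, c2, *args[1:]]         # -> [3, 32, 6, 2, 2], matching the comment above
print(args)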

        m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type
        np = sum(x.numel() for x in m_.parameters())  # number params
        m_.i, m_.f, m_.type, m_.np = i, f, t, np  # attach index, 'from' index, type, number params
        LOGGER.info(f'{i:>3}{str(f):>18}{n_:>3}{np:10.0f}  {t:<40}{str(args):<30}')  # print
        save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)
        if i == 0:
            ch = []
        ch.append(c2)
    return nn.Sequential(*layers), sorted(save)

              save.extend: record the indices of the layers whose outputs must be kept: [4, 6, 10, 14, 17, 20, 23]

              ch.append(c2): record each layer's output channel count, so that the next layer can use the previous layer's output channels as its input channels
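To make the ch bookkeeping concrete, here is my own trace of how the list grows while the yolov5s backbone is built (width_multiple = 0.50; worth re-checking against a real run):

# layer:    0    1    2    3    4    5    6    7    8    9
# module:  Conv Conv  C3  Conv  C3  Conv  C3  Conv  C3  SPPF
# ch:       32   64   64  128  128  256  256  512  512  512
#
# Each iteration reads c1 = ch[f] and appends its own c2, so once the loop has
# passed layer i, ch[i] is the output channel count of layer i.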

common.py:

Take Conv as a simple example:

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

    Conv:

              __init__:

                     c1: number of input channels to this layer

                     c2: number of output channels from this layer

                     k=1: size of the convolution kernel

                     s=1: stride of the convolution
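The p=None default in the code above is resolved by the autopad helper at the top of common.py; as far as I recall it looks roughly like this (check your local copy):

def autopad(k, p=None):  # kernel, padding
    # if no padding is given, pad by k // 2 so the output keeps the same spatial size ("same" padding)
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p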

       C3:
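The C3 module is built from the same Conv blocks plus the Bottleneck class from common.py. Paraphrased from the v6.x source (check your local copy for the exact version), it looks like this:

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))

    def forward(self, x):
        # one branch goes through cv1 and the n Bottlenecks, the other only through cv2;
        # the two are concatenated on the channel dimension and fused by cv3
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

This is also where the n inserted by parse_model ends up: it controls how many Bottleneck blocks are stacked inside self.m.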

My knowledge is limited; if readers find any errors in this article, please let me know. I would be very grateful.

             


Origin blog.csdn.net/qq_68271367/article/details/127418353