Introduction to yolov5 (1): yolov5 architecture and source code debugging preparation

  In this first article, we focus on the yolov5 architecture, walk through its components and the ideas behind their design, and then use that as the basis for debugging yolov5 line by line. Note that the yolov5 architecture has kept evolving, so there are small differences between versions, but they are essentially the same.

  1. An analysis of the yolov5 architecture

  This is the architecture of yolov5s. The diagram comes from Baidu Images and appears to have been drawn by another author; credit goes to the original creator.

  The difference between yolov5s and yolov5x is simply how many times the various blocks are repeated. From an architectural point of view, the design of yolov5 is not complicated. It keeps the overall FPN design; an FPN strengthened by repeating the fusion once more (two passes) is called a PAN, and the structure can be repeated further, evolving into a BiFPN as in EfficientDet. The backbone is still darknet-style, deepened through residual structures. The Focus module introduced at the start of the model is a bit unusual: it directly reduces the resolution, which should be one of the reasons yolov5 is lightweight. The SPP module further integrates multi-scale feature extraction. There are two different CSP modules; the difference between them is whether they contain a residual structure.

  The overall design and the choice of which modules to add or remove look like the product of ablation experiments, but let's still go through the modules one by one:

  1. Focus

  The concept of Focus is quite special, and the implementation process is shown in the figure below:

  When looking at the design of Focus, the questions I kept in mind were: why design it this way, and why does such a design work?

  My understanding is that, first of all, it rearranges the four adjacent pixels of each 2x2 block from the spatial plane onto the channel dimension via concatenation.

  That is, the four sub-grids are stacked, going from 3 channels to 12 channels, so as to preserve position information as much as possible. Each channel still represents a local feature, so one grid position now carries the features of four grid positions in the original image. The receptive field is increased while position information is retained as much as possible.

  The effect is that the feature map becomes half the original size. The question this raises for me is: since convolution can maintain the relative positions of features, can this Focus operation be seen as a regularized replacement for pooling, or for a convolution with stride=2? We all know that pooling loses a lot of feature map information (although it also contributes some regularization), so in many current architectures pooling is replaced by stride-2 convolution. It also reminds me of the channel shuffle operation in ShuffleNet: after group convolution, to avoid channels never seeing global features and to enhance feature sharing between channels, the features are shuffled and fused. So after the Focus operation, should we also enhance the fusion of the grid-point feature vectors spliced together by concat, for example through a 1x1 group convolution? In yolov5 the fusion is done with an ordinary CBL (conv-bn-leakyrelu).
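
  To make this concrete, below is a minimal PyTorch sketch of the Focus idea. The slicing pattern follows the description above; the cbl helper and the channel counts are illustrative choices of mine, and the exact parameters in yolov5's models/common.py may differ.

import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    # Conv + BatchNorm + LeakyReLU ("CBL") block.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.1, inplace=True))

class Focus(nn.Module):
    # Slice each 2x2 neighbourhood onto the channel axis (3 -> 12 channels), then fuse with a CBL.
    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        self.conv = cbl(c_in * 4, c_out, k)

    def forward(self, x):
        # Four interleaved sub-grids, each at half resolution, concatenated on the channel dim.
        return self.conv(torch.cat([x[..., ::2, ::2],
                                    x[..., 1::2, ::2],
                                    x[..., ::2, 1::2],
                                    x[..., 1::2, 1::2]], dim=1))

# A 640x640 RGB image becomes a 320x320 feature map with c_out channels.
print(Focus()(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 320, 320])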

  2. CSP

  There are two CSP structures in the yolov5 design: one uses resunits (two CBL convolutions plus a residual connection), the other replaces the resunit with an ordinary CBL. The variant with residuals is used in the backbone, and the variant without residuals is used outside the backbone. This is presumably the result of experiments. The backbone is the deeper part of the network, and adding residual structures strengthens the gradients propagated between layers, avoids vanishing gradients as the network deepens, and lets finer-grained features be extracted without worrying about network degradation. In addition, some studies suggest that a residual network can be viewed as an ensemble of non-residual networks, which helps generalization to some degree.

  Of course, the above is all theoretical analysis. In a competition I tested residual structures in the encoder, bridge, and decoder parts of a unet, and found that the encoder depends on them the most, while adding residuals to the bridge or decoder brought no accuracy improvement. I mention my earlier experimental results here only for reference.

  The advantage of the CSP structure over an ordinary CBL is that it splits into two branches. Branches mean feature integration, and concat preserves the information of the different branches well, so the CSP design can extract richer feature information.
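
  A simplified sketch of the two CSP variants described above (this is my own simplification of the idea; yolov5's BottleneckCSP adds extra plain convolutions and a shared BN, so treat it as the concept rather than the exact module). The cbl helper is the Conv+BN+LeakyReLU block from the Focus sketch, repeated here so the snippet runs on its own.

import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    # Conv + BatchNorm + LeakyReLU ("CBL") block.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.1, inplace=True))

class ResUnit(nn.Module):
    # Two CBL convolutions; the optional residual add is what distinguishes the two CSP variants.
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = cbl(c, c, 1)
        self.cv2 = cbl(c, c, 3)
        self.shortcut = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.shortcut else y

class CSPBlock(nn.Module):
    # Split into two 1x1 branches, push one branch through n ResUnits, concat, then fuse.
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        c_hid = c_out // 2
        self.branch_main = cbl(c_in, c_hid, 1)
        self.branch_skip = cbl(c_in, c_hid, 1)
        self.res = nn.Sequential(*[ResUnit(c_hid, shortcut) for _ in range(n)])
        self.fuse = cbl(2 * c_hid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.res(self.branch_main(x)),
                                    self.branch_skip(x)], dim=1))

# shortcut=True corresponds to the backbone variant, shortcut=False to the variant used outside it.
print(CSPBlock(64, 64, n=2, shortcut=True)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])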

  3. SPP

  The SPP structure extracts features at different scales through pooling with different kernel sizes and then stacks them for feature fusion. The pooling kernel sizes in yolov5 are 1x1, 5x5, 9x9, and 13x13 respectively.
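
  A minimal sketch of this SPP idea (the untouched pass-through branch plays the role of the 1x1 case; the 5/9/13 kernels follow the description above, while the halved hidden channels are my own choice and may not match yolov5's exact settings; cbl is the same Conv+BN+LeakyReLU helper as before):

import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    # Conv + BatchNorm + LeakyReLU ("CBL") block.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.1, inplace=True))

class SPP(nn.Module):
    # Parallel max-pooling at several kernel sizes (stride 1, so the spatial size is unchanged),
    # plus the untouched input, concatenated on channels and fused with a 1x1 CBL.
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = cbl(c_in, c_hid, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.cv2 = cbl(c_hid * (len(kernels) + 1), c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [pool(x) for pool in self.pools], dim=1))

print(SPP(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # torch.Size([1, 512, 20, 20])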

  In many competitions I have tested SPP and ASPP at various positions in different networks. Essentially it comes down to finding the receptive field size that best fits the data, but in practice the process is fairly time-consuming because the pooling kernel sizes and dilation rates are hard to tune. For that reason I often use deformable convolution instead, letting the network adapt its receptive field automatically and take over the role of SPP/ASPP.

  4. PAN

  FPN, PAN, and the later BiFPN are all similar structures. The idea of FPN is to strengthen the fusion of features from different layers and make predictions at multiple scales. PAN adds a bottom-up fusion path on top of FPN.

  We all know that deep feature maps carry stronger semantic information but weaker localization information, while shallow feature maps carry strong localization information but weaker semantics. FPN passes deep semantic features down to the shallow layers, enhancing semantic expression at multiple scales; PAN does the opposite, passing shallow localization information up to the deep layers, enhancing localization ability at multiple scales.
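
  A rough sketch of the top-down plus bottom-up fusion just described, with three feature levels and the same channel count everywhere for simplicity. This is my own minimal version of the idea, not yolov5's actual neck definition (which also inserts CSP blocks after the concats); cbl is the same Conv+BN+LeakyReLU helper as before.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cbl(c_in, c_out, k=1, s=1):
    # Conv + BatchNorm + LeakyReLU ("CBL") block.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.1, inplace=True))

class FPNPAN(nn.Module):
    # Top-down pass (FPN): upsample deep features and fuse them into shallower levels.
    # Bottom-up pass (PAN): downsample shallow features and fuse them back into deeper levels.
    def __init__(self, c=256):
        super().__init__()
        self.td4 = cbl(2 * c, c, 3)    # fuse upsampled P5 into P4
        self.td3 = cbl(2 * c, c, 3)    # fuse upsampled P4 into P3
        self.down3 = cbl(c, c, 3, 2)   # stride-2 downsample of P3
        self.down4 = cbl(c, c, 3, 2)   # stride-2 downsample of P4
        self.bu4 = cbl(2 * c, c, 3)    # fuse downsampled P3 into P4
        self.bu5 = cbl(2 * c, c, 3)    # fuse downsampled P4 into P5

    def forward(self, p3, p4, p5):
        # p3/p4/p5: backbone features at strides 8/16/32, all with c channels here.
        p4 = self.td4(torch.cat([F.interpolate(p5, scale_factor=2.0), p4], dim=1))
        p3 = self.td3(torch.cat([F.interpolate(p4, scale_factor=2.0), p3], dim=1))
        p4 = self.bu4(torch.cat([self.down3(p3), p4], dim=1))
        p5 = self.bu5(torch.cat([self.down4(p4), p5], dim=1))
        return p3, p4, p5  # multi-scale features for the detection heads

c = 64
outs = FPNPAN(c)(torch.randn(1, c, 80, 80), torch.randn(1, c, 40, 40), torch.randn(1, c, 20, 20))
print([o.shape for o in outs])  # 80x80, 40x40, 20x20 maps, each with c channels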

  Looking ahead to BiFPN, semantic features and localization information get passed back and forth through the stacked FPN/PAN structures like a ball being kicked around... ahem, it is all a bit of black magic.

  The above covers the distinctive parts of the yolov5 architecture. We will identify each of them concretely in the code.

  2. yolov5 source code analysis preparation

  Source code link (PyTorch implementation)

  Before debugging we must make sure yolov5 actually runs. Follow the steps below to get it running:

  1. Download the yolov5 source code and install requirements.txt

  Note that Python >= 3.8 and PyTorch >= 1.7 are required.

  $ git clone

  $ cd yolov5

  $ pip install -r requirements.txt # install dependencies

  2. Configure the yaml file of the dataset and prepare the data

  yolov5 reads yaml configuration files, so a custom dataset should be described following the same template. Let's take the simplest case, coco128, as an example. coco128 is a small dataset made up of the first 128 images of the coco dataset, used to verify that the pipeline works. Its config lives in the yolov5/data folder as coco128.yaml. Let's look at the content of coco128.yaml:

# COCO 2017 dataset
# Train command: python train.py --data coco128.yaml
# Default dataset location is next to /yolov5:
#   /parent_folder
#     /coco128
#     /yolov5

# download command/URL (optional)
download:

# train and val data as 1) directory: path/images/, 2) file: path/images.txt, or 3) list: [path1/images/, path2/images/]
train: ../coco128/images/train2017/ # 128 images
val: ../coco128/images/train2017/ # 128 images

# number of classes
nc: 80

# class names
names: [ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
         'hair drier', 'toothbrush' ]

  As you can see, the configuration file that defines a dataset needs to contain the following fields (a quick sanity check of the yaml is sketched after this list):

  download: download address (optional)

  train/val: the relative paths to the train and val image sets

  nc: number of categories

  names: the name of each category
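
  Here is a quick way to sanity-check such a yaml before training (this is just a small helper of my own, not part of yolov5; it assumes PyYAML, which requirements.txt already installs, and is run from the yolov5 folder):

import yaml

# Load the dataset config and make sure its pieces are consistent with each other.
with open("data/coco128.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["train"], cfg["val"])  # relative paths to the train/val images
assert cfg["nc"] == len(cfg["names"]), "nc must match the number of entries in names"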

  First, manually download the coco128 dataset and put it in the directory at the same level as yolov5:

  coco128 download link

  Place it at the same directory level as yolov5, as shown in the figure above.

  The images folder needs no explanation, it holds the training images; for labels you need to pay attention to the format, and if you annotate data yourself, remember to convert the annotations to yolo format.

  The label of each picture is stored separately in a txt file, and each box occupies a separate line.

  The coordinates need to be normalized.

  As shown in the figure above, a yolo label is in the format [class, x_center, y_center, width, height]. Note that class indices start from 0, and the four coordinate values must be normalized to the range 0-1; that is, if your data is in pixels, you need to divide by the image width and height accordingly.
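
  As an illustration, here is a minimal sketch of converting one pixel-coordinate box (x_min, y_min, x_max, y_max) into a yolo label line; the function name and the numbers are made up for the example.

def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    # Convert corner coordinates in pixels to normalized center/size, as the yolo format expects.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# One line of a labels/*.txt file for a 640x480 image:
print(to_yolo_line(0, 100, 200, 300, 400, img_w=640, img_h=480))
# -> 0 0.312500 0.625000 0.312500 0.416667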

  Finally, the folders should line up in the corresponding positions, as shown in the figure below:

  Once the coco128 data is ready, we can run a test.

  3. Prepare the model file

  There are four default model sizes, yolov5s, m, l, and x, ordered from the fewest parameters to the most. Download the corresponding pretrained model from the official site.

  Model download address

  For the demonstration I downloaded yolov5s; download it and put it under the weights folder:

  $ python train.py --img 640 --batch 16 --epochs 5 --data coco128.yaml --weights yolov5s.pt

  Run the command above as a test, and you should see training start.

  Epoch gpu_mem box obj cls total targets img_size
  0/4 0G 0.0492 0.0805 0.03435 0.1641 35 640: 16%

  4. wandb

  yolov5 can be configured with wandb, a web dashboard that displays training status in real time so you can watch the loss and the hardware utilization.
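
  If you want to try it, the minimal setup I would expect is roughly the following (assuming wandb has already been installed with pip; how train.py picks it up depends on the yolov5 version):

# Log in once so that subsequent training runs can report to your wandb account.
import wandb
wandb.login()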
