Video object detection with YOLO: comparison of self-trained small models (yolov5, yolov7, yolov8)

 

Table of contents

 

1. Principle introduction: YOLOv5, YOLOX, YOLOv7, YOLOv8

【Yolov5】

【yoloX】

【Yolov7 】

【Yolov8】Principle

2. Introduction to the custom dataset

3. Training records

Yolov5_6.2

YoloX

Ⅰ Training preparation: arrange the dataset in VOC format

Ⅱ Training preparation: Modify training configuration parameters

Yolov7_main

yolov8_main

4. Test-set evaluation of the trained models

Yolov5_6.2

Yolov7

Yolov8

5. Comparison of the test results of the models trained by Yolov5, yolov7, and yolov8:


1. Principle introduction: YOLOv5, YOLOX, YOLOv7, YOLOv8

【Yolov5】

Pre-trained weights: yolov5s.pt (14.1M)

Anchor Base

Anchors: there are 3 groups, corresponding to the 3 outputs of the Neck. Each initial anchor is given by a (w, h) width/height pair in pixels of the original image; with 3 anchors per layer, there are 3 * 3 = 9 anchors in total.

  - [ 10 , 13 , 16 , 30 , 33 , 23 ]   #P3/8

  - [ 30 , 61 , 62 , 45 , 59 , 119 ]   # P4/16

  - [ 116 , 90 , 156 , 198 , 373 , 326 ]   # P5/32

backbone

The main function of the backbone network is to extract features while progressively downsampling the feature map. The main modules in the backbone are the Conv, C3, and SPPF modules.

After a feature map enters C3, it is split into two paths. The left path passes through a Conv and then a Bottleneck, while the right path passes through only one Conv. The two paths are then concatenated and passed through a final Conv. The three Conv modules inside C3 are all 1*1 convolutions; they only reduce or increase the channel dimension and contribute little to feature extraction. In the backbone, the Bottleneck uses a residual connection. It contains two Convs: the first is a 1*1 convolution that halves the channels, and the second is a 3*3 convolution that doubles them back. Reducing the dimension first helps the convolution kernels digest the feature information, and raising it back helps extract more detailed features. Finally, the residual structure adds the input to the output to mitigate vanishing gradients.
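To make the structure concrete, here is a minimal PyTorch sketch of the Bottleneck and C3 modules as described above. It is simplified (the real YOLOv5 Conv wrapper, autopad, and expansion arguments are omitted), so the names and defaults are illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv2d + BatchNorm + SiLU, the basic Conv block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 conv halves the channels, 3x3 conv restores them, plus a residual add."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c // 2, 1)
        self.cv2 = ConvBNSiLU(c // 2, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    """Two parallel 1x1 branches, Bottlenecks on one branch, concat, then a 1x1 fuse."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_h = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_h, 1)      # branch with Bottlenecks
        self.cv2 = ConvBNSiLU(c_in, c_h, 1)      # plain branch
        self.m = nn.Sequential(*(Bottleneck(c_h) for _ in range(n)))
        self.cv3 = ConvBNSiLU(2 * c_h, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

# quick shape check
if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(C3(64, 64, n=1)(x).shape)   # torch.Size([1, 64, 80, 80])
```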

SPP is spatial pyramid pooling: three parallel MaxPools are concatenated with the input. The first MaxPool has a 5*5 kernel, the second 9*9, and the third 13*13.

SPPF instead runs three MaxPools with a 5*5 kernel serially, which covers the same 5/9/13 receptive fields while being faster.
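A small sketch of the SPPF idea, assuming the usual channel-halving 1*1 convolutions around the pools; this is illustrative rather than the official YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_h = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_h, 1)          # reduce channels
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_h * 4, c_out, 1)     # fuse the 4 branches

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)        # 5x5 receptive field
        y2 = self.pool(y1)       # roughly 9x9
        y3 = self.pool(y2)       # roughly 13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

if __name__ == "__main__":
    print(SPPF(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # [1, 512, 20, 20]
```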

Neck

The Neck structure realizes the fusion of shallow graphic features and deep semantic features.

The Neck structure is essentially a feature pyramid (FPN), which combines shallow graphic features with deep semantic features.

As convolution proceeds, the network extracts more and more features and the layers become deeper. In the shallow layers of a convolutional neural network, the extracted features are relatively simple, such as color, contour, texture, and shape; these are reflected purely in the image itself, so they are called graphic features. As the network deepens, it keeps combining these features, raising their dimensionality and generating new ones, such as how colors and textures combine, where the sky and the ground are in the picture, and even features that humans cannot interpret; these are called semantic features.

head

The head layer is the Detect module. Its structure is very simple: just three 1*1 convolutions, one for each detection feature layer.

The three feature maps output by the Neck are effectively three grids: the first is 80*80, the second 40*40, and the third 20*20. Through the 1*1 convolutions, their shapes become 80*80*3x(5+80), 40*40*3x(5+80), and 20*20*3x(5+80). The 3 means each grid cell has 3 anchors, and (5+80) is the information carried by each anchor (x, y, w, h, objectness plus 80 class scores for COCO).
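A quick sketch of this Detect idea: one 1*1 convolution per feature level producing 3 x (5 + 80) = 255 output channels. The Neck channel counts below are assumed to be those of YOLOv5s and are illustrative.

```python
import torch
import torch.nn as nn

num_classes, num_anchors = 80, 3
out_ch = num_anchors * (5 + num_classes)          # 3 * 85 = 255

# assumed channel counts of the three Neck outputs (P3/P4/P5) in YOLOv5s
neck_channels = [128, 256, 512]
detect = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in neck_channels)

feats = [torch.randn(1, c, s, s) for c, s in zip(neck_channels, (80, 40, 20))]
for conv, f in zip(detect, feats):
    print(conv(f).shape)   # [1, 255, 80, 80], [1, 255, 40, 40], [1, 255, 20, 20]
```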

The above ref: https://zhuanlan.zhihu.com/p/609264977

【Anchor_Base && Anchor_Free】

Anchor_base

(20*20 + 40*40 + 80*80) = 8400 grid cells

8400 * 3 = 25200 predictions, because each cell has 3 anchor boxes

Output: 25200 * (11 + 5), i.e. 11 classes + (x, y, w, h, conf) per prediction

Anchor_Free

(20*20 + 40*40 + 80*80) = 8400

The 20*20 map gives 400 prediction boxes, each corresponding to a 32*32 region of the input.

The 40*40 map gives 1600 prediction boxes, each corresponding to a 16*16 region.

The 80*80 map gives 6400 prediction boxes, each corresponding to an 8*8 region.

8400*(11+5)

Converting YOLO to an anchor-free form is simple: reduce the number of predictions per position from 3 to 1 and directly predict four values, namely two offsets and the height and width.
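A quick arithmetic check of the counts quoted above, assuming a 640x640 input and strides 8/16/32:

```python
# prediction counts for the 80/40/20 grids of a 640x640 input (assumed size)
strides = [8, 16, 32]
img_size = 640
cells = [(img_size // s) ** 2 for s in strides]
print(cells)                     # [6400, 1600, 400]
print(sum(cells))                # 8400 anchor-free predictions
print(sum(cells) * 3)            # 25200 anchor-based predictions (3 anchors per cell)
```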

-----------------------------------------------------

【yoloX】

Anchor_Free mode

Baseline model: Yolov3_spp

When selecting the YOLOX baseline model, the authors considered that the YOLOv4 and YOLOv5 series may be somewhat over-optimized for anchor-based algorithms, so the YOLOv3 series was chosen instead. Rather than the standard YOLOv3, the YOLOv3_spp version with the SPP component added was selected.

The YOLOv3 baseline model uses the DarkNet53 backbone plus an SPP layer (the so-called YOLOv3-SPP). YOLOX adds EMA weight updating, a cosine learning-rate schedule, IoU loss, and an IoU-aware branch. BCE loss is used to train the cls and obj branches, and IoU loss is used to train the reg branch.

Mosaic and Mixup are used as the two data-augmentation strategies.

YOLOX's backbone is the same as the original YOLOv3 baseline backbone; both use the Darknet53 network structure.

Positive-sample selection: preliminary screening, then SimOTA.

Initial screening: models/yolo_head.py/get_in_boxes_info

Train

conda create -n yoloX

conda env list

conda activate yoloX

conda install pip

pip3 install -r requirements.txt

(1) Modify the label information in yolox/data/datasets/coco_classes.py. Change to

COCO_CLASSES = ( "pedes", "car", "bus", "truck", "bike", "elec", "tricycle", "coni", "warm", "tralight", "special_vehicle",)

(2) Modify self.num_classes, depth, width in exps/example/yolox_voc/yolox_voc_s.py

(3) Then modify self.num_classes, self.depth, self.width, input_size, max_epoch, print_interval, eval_interval and other parameters in yolox/exp/yolox_base.py according to your own needs

(4) Modify exps/example/custom/yolox_s.py

【Yolov7 】

Anchor_Base method

Looking at YOLOv7 as a whole: the input image is first resized to 640x640 and fed into the backbone network; the head then outputs three feature maps of different sizes, and the prediction results are produced through Rep and conv layers. Taking COCO as an example, the output has 80 classes, and each prediction (x, y, w, h, o) carries the coordinate position plus foreground/background confidence. The 3 refers to the number of anchors, so each layer outputs (80 + 5) x 3 = 255 channels; multiplied by the feature-map size, this gives the final output.

backbone

Looking at the backbone as a whole: after 4 CBS modules (Conv + BN + SiLU) it connects to an ELAN, and then three MP + ELAN stages produce the C3/C4/C5 outputs, with sizes 80*80*512, 40*40*1024, and 20*20*1024. Each MP has 5 layers and each ELAN has 8 layers, so the whole backbone has 4 + 8 + 13 * 3 = 51 layers; counting from 0, the last layer is layer 50.

BN(Batch Normalization)

SiLU(x) = x·sigmoid(x)
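A hedged sketch of the CBS block (Conv + BN + SiLU); the stem channel/stride values below follow the commonly cited YOLOv7 configuration but should be read as illustrative, not the official code.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()     # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# the 4 stem CBS blocks progressively downsample 640x640 -> 160x160
stem = nn.Sequential(CBS(3, 32, 3, 1), CBS(32, 64, 3, 2),
                     CBS(64, 64, 3, 1), CBS(64, 128, 3, 2))
print(stem(torch.randn(1, 3, 640, 640)).shape)   # [1, 128, 160, 160]
```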

Neck:

Explanation:

1 FPN fuses high-level features with low-level features, simultaneously exploiting the high resolution of the low-level features and the rich semantic information of the high-level features, and makes independent predictions at multiple scales, significantly improving small-object detection.

2 SPP (Spatial Pyramid Pooling) solves the problem of arbitrary input sizes; the purpose of using an SPP network in YOLOv4 is to enlarge the receptive field of the network.

4 PANet adds a DownSample path after the UpSample path and aggressively fuses features at different levels: a bottom-up feature pyramid is added on top of the FPN module, retaining more shallow location features and further improving the overall feature-extraction capability.

5 Optimization of the PAN module: an E-ELAN structure is introduced after each Concat layer, using strategies such as expand, shuffle, and merge cardinality to improve the learning ability of the network without destroying the original gradient path.

 

head

The head takes C5, the last output of the backbone (downsampled 32 times), and passes it through SPPCSP to reduce the channels from 1024 to 512. It is first fused top-down with C4 and C3 to obtain P3, P4, and P5, and then fused bottom-up with P4 and P5. This is basically the same as YOLOv5; the differences are that the CSP module of YOLOv5 is replaced by the ELAN-H module, and downsampling is done by the MP2 layer.

-----------------------------------------------------

【Yolov8】Principle

Ref: https://zhuanlan.zhihu.com/p/628313867


Anchor_Free mode

It is basically the same as YOLOv5: both have a backbone + PANet + Head structure, and the PANet part first upsamples and fuses, then downsamples and fuses.

The C2f structure used in YOLOv8's backbone and Neck draws on the design idea of YOLOv7's ELAN and replaces the CSP structure of YOLOv5. Because C2f has more residual connections, it has richer gradient flow. (However, the C2f module contains a Split operation, which is not friendly to some hardware deployments.)
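A rough PyTorch sketch of the C2f idea (a Split into two halves, a chain of bottlenecks whose intermediate outputs are all kept, then a concat); simplified and not the official ultralytics code.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convs with a residual add (source of the richer gradient flow)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))    # the Split operation
        for m in self.m:
            y.append(m(y[-1]))                   # keep every bottleneck output
        return self.cv2(torch.cat(y, dim=1))

if __name__ == "__main__":
    print(C2f(128, 128, n=2)(torch.randn(1, 128, 40, 40)).shape)  # [1, 128, 40, 40]
```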

    Loss calculation adopts TaskAlignedAssigner positive sample matching strategy, and introduces Distribution Focal Loss.

① Provides a new SOTA model, including P5 640 and P6 1280 resolution object-detection networks and a YOLACT-based instance-segmentation model. Like YOLOv5, models of different sizes (N/S/M/L/X) are also provided via scaling factors to suit different scenarios.

② The backbone and Neck likely draw on YOLOv7's ELAN design idea: the C3 structure of YOLOv5 is replaced by the C2f structure with richer gradient flow, and the channel numbers are tuned separately for each model scale. This is careful fine-tuning of the model structure rather than blindly applying one set of parameters to all models, and it noticeably improves performance. However, operations such as Split in the C2f module are less friendly to some hardware deployments than before.

③ The Head has changed significantly compared with YOLOv5: it now uses the mainstream decoupled-head structure, separating the classification and regression heads, and it switches from anchor-based to anchor-free.

④ Loss calculation adopts the TaskAlignedAssigner positive-sample assignment strategy and introduces Distribution Focal Loss.

⑤ For training data augmentation, it adopts YOLOX's practice of turning off Mosaic augmentation in the last 10 epochs, which effectively improves accuracy.

The matching strategy of TaskAlignedAssigner can be summarized simply as: select positive samples based on a score that weights the classification score together with the regression (IoU) score.

① s is the predicted score for the annotated category, and u is the IoU between the predicted box and the GT box; multiplying the two measures the degree of alignment.

② For each GT, the classification score of the GT category for every predicted box is weighted together with the IoU between that box and the GT, giving alignment_metrics, an alignment score that couples classification and regression.

For each GT, the topK predictions with the largest alignment_metrics are selected directly as positive samples.
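A hedged sketch of this alignment score and top-k selection; shapes and the alpha/beta exponents are illustrative, and the real TaskAlignedAssigner also applies an in-GT-box mask and resolves conflicts between GTs.

```python
import torch

def task_aligned_topk(cls_scores, ious, gt_labels, topk=10, alpha=1.0, beta=6.0):
    """
    cls_scores: (num_preds, num_classes) predicted class scores
    ious:       (num_preds, num_gt) IoU between each prediction and each GT
    gt_labels:  (num_gt,) class index of each GT
    returns:    (num_gt, topk) indices of selected positive predictions per GT
    """
    # s: the prediction's score for the GT's annotated category
    s = cls_scores[:, gt_labels]                 # (num_preds, num_gt)
    # alignment metric t = s^alpha * u^beta (plain s*u when alpha = beta = 1)
    alignment = s.pow(alpha) * ious.pow(beta)    # (num_preds, num_gt)
    # for each GT, pick the top-k most aligned predictions as positives
    return alignment.topk(topk, dim=0).indices.t()

if __name__ == "__main__":
    pos = task_aligned_topk(torch.rand(8400, 11), torch.rand(8400, 3),
                            torch.tensor([0, 1, 4]), topk=10)
    print(pos.shape)   # torch.Size([3, 10])
```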

Loss calculation includes 2 branches: classification and regression branches, without the previous objectness branch.

① The classification branch still uses BCE Loss

② The regression branch is tied to the integral-form representation proposed in Distribution Focal Loss, so DFL is used, together with CIoU Loss.

The three losses can be weighted with a certain weight ratio.
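For illustration only, the weighted combination might look like the sketch below; the weight values are placeholders, not the defaults of any particular YOLOv8 release.

```python
import torch

def total_loss(loss_cls, loss_ciou, loss_dfl,
               w_cls=0.5, w_box=7.5, w_dfl=1.5):
    # weighted sum of BCE classification loss, CIoU box loss, and DFL
    return w_cls * loss_cls + w_box * loss_ciou + w_dfl * loss_dfl

print(total_loss(torch.tensor(0.6), torch.tensor(0.2), torch.tensor(0.4)))
```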

2. Introduction to the custom dataset

Table 1

| ID | Description |
| --- | --- |
| 0 | Pedestrians (people riding balance bikes and flatbeds are also included in this category) |
| 1 | Cars (including SUV, MPV, VAN, pickup truck) |
| 2 | Bus |
| 3 | Truck, van |
| 4 | Bike |
| 5 | Motorcycle (electric motorcycle) |
| 6 | Tricycle (electric tricycle, gas tricycle) |
| 7 | Traffic cone |
| 8 | Warning posts, warning signs |
| 9 | Traffic light |
| 10 | Emergency or special vehicles (ambulances, fire trucks, engineering vehicles such as cranes, excavators, muck trucks, etc.) |

Remark: categories such as warning triangles, leftover objects on the road, and low-lying obstacles are not actually required by every project, so they are not considered for the time being.

Training platform information:

- System: Ubuntu 20.04
- Driver Version: 525.78.01
- CUDA Version: 12.0
- CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz

Dataset images: resolution 1920*1080

Comparison of the dataset split sizes:

| Split | Train | Val | Test |
| --- | --- | --- | --- |
| Images | 9285 | 1985 | 1987 |

Remark: the ratio is roughly 9:2:2, i.e. about 7:1.5:1.5 of the whole.

Table 2: Target categories and label counts per split in the current dataset

| ID | clsName | Train | Val | Test |
| --- | --- | --- | --- | --- |
| 0 | pedes | 16281 | 3448 | 3444 |
| 1 | car | 100547 | 21317 | 21432 |
| 2 | bus | 3268 | 648 | 675 |
| 3 | truck | 7810 | 1573 | 1627 |
| 4 | bike | 3333 | 704 | 668 |
| 5 | elec | 18418 | 3900 | 3929 |
| 6 | tricycle | 3834 | 835 | 871 |
| 7 | coni | 3548 | 773 | 786 |
| 8 | warm | 50835 | 10527 | 11160 |
| 9 | tralight | 12982 | 2781 | 2725 |
| 10 | specialVehicles | 200 | 44 | 41 |

The histogram corresponding to Table 2 is as follows for easy comparison.

Train:

 

 

 

3. Training records

Comparison of training and test results for the YOLO series: the small model of each version (yolov5, yoloX, yolov7, yolov8) is trained and compared.

Yolov5_6.2

Pre-trained model size: yolov5s.pt (official 14.1M)

 

YoloX

Ⅰ Training preparation: arrange the dataset in VOC format

——datasTrain3_More\images

——Annotations (xml files corresponding to the images)

————JPEGImages (the dataset images)

————ImageSets (the dataset is split into training and validation sets, which produces train.txt and val.txt)

——————Main

————————Train.txt

————————Val.txt

The xml files in Annotations above are generated by converting the dataset's YOLO-format txt label files (a conversion sketch is given below).
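A hedged sketch of such a conversion (YOLO-format txt lines of class cx cy w h, normalized, to VOC-style xml); the paths, image size, and class list follow this post's dataset but are assumptions of the example.

```python
from xml.etree.ElementTree import Element, SubElement, ElementTree

CLASSES = ["pedes", "car", "bus", "truck", "bike", "elec",
           "tricycle", "coni", "warm", "tralight", "specialVehicle"]
IMG_W, IMG_H = 1920, 1080   # image resolution of this dataset

def yolo_txt_to_voc_xml(txt_path, xml_path, img_name):
    root = Element("annotation")
    SubElement(root, "filename").text = img_name
    size = SubElement(root, "size")
    SubElement(size, "width").text = str(IMG_W)
    SubElement(size, "height").text = str(IMG_H)
    SubElement(size, "depth").text = "3"
    with open(txt_path) as f:
        for line in f:
            cls_id, cx, cy, w, h = line.split()
            cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
            obj = SubElement(root, "object")
            SubElement(obj, "name").text = CLASSES[int(cls_id)]
            box = SubElement(obj, "bndbox")
            SubElement(box, "xmin").text = str(int((cx - w / 2) * IMG_W))
            SubElement(box, "ymin").text = str(int((cy - h / 2) * IMG_H))
            SubElement(box, "xmax").text = str(int((cx + w / 2) * IMG_W))
            SubElement(box, "ymax").text = str(int((cy + h / 2) * IMG_H))
    ElementTree(root).write(xml_path)

# example: convert one label file
# yolo_txt_to_voc_xml("labels/000001.txt", "Annotations/000001.xml", "000001.jpg")
```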

The source code that generates the Main/train.txt file above can be found at:

Ⅱ Training preparation: modify the training configuration parameters

Ref: https://zhuanlan.zhihu.com/p/397499216

① Change the label information in yolox/data/datasets/voc_classes.py to VOC_CLASSES = ("pedes","car", "bus", "truck", "bike", "elec", "tricycle", "coni", "warm", "tralight", "specialVehicle",)

② In YOLOX_main/exps/example/yolox_voc/yolox_voc_s.py, modify self.num_classes, data_dir, and image_sets (a sketch is shown after the run commands below).

③ In YOLOX_main/yolox/exp/yolox_base.py, modify self.num_classes, self.print_interval = 1, and self.eval_interval = 1.

④ In yolox/data/datasets/voc.py, in the VOCDetection class where the txt files are read, remove the year-related parts of the paths.

⑤ In exps/default/yolox_s.py, keep self.depth = 0.33 and self.width = 0.50 consistent across the related files.

⑥ Parameters in tools/train.py:

python3 tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 0 -b 64 -c yolox_s.pth

Run command: python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 0 -b 32 -c yolox/yolox_s.pth
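As referenced in step ② above, here is a hedged sketch of what the Exp edits might look like, modeled on the structure of exps/example/yolox_voc/yolox_voc_s.py; the attribute values reflect this post's settings, and any names beyond those quoted in the steps are assumptions.

```python
import os
from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 11          # pedes, car, ..., specialVehicle
        self.depth = 0.33              # yolox-s depth multiplier
        self.width = 0.50              # yolox-s width multiplier
        self.max_epoch = 100
        self.input_size = (960, 960)
        self.print_interval = 1
        self.eval_interval = 1
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```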

(1)

Pre-trained weight file: yolox_tiny.pth (39M)

【Because the environment was never configured successfully, this run never got going; embarrassing... guidance from experienced colleagues would be greatly appreciated.】

(2)

Pre-trained weight file: yolox_nano.pth (8M)

【Because the environment was never configured successfully, this run never got going either.】

Yolov7_main

Pre-trained weights: Yolov7-tiny.pt (12.1M)

GPU used for training: AD102 [GeForce RTX 4090]

 

yolov8_main

(1) Pre-trained weights: yolov8n.pt (6M)

 

 

Run the program with the following commands from the yolov8_main/ultralytics directory.

Method 1:

yolo cfg=./yolo/cfg/default.yaml  

Method 2:

yolov8_main2307/ultralytics$yolo task=detect mode=train model=models/v8/yolov8n.yaml data=/home/user/hlj/MyTrain/yolov8_main2307/ultralytics/yolo/v8/detect/data/my_yolov8.yaml imgsz=960 batch=32 epochs=100 workers=2

The above command follows: https://blog.csdn.net/retainenergy/article/details/129199116
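An equivalent way to launch the same training run through the ultralytics Python API (the model and data paths are the ones used above and may need adjusting for another setup):

```python
from ultralytics import YOLO

model = YOLO("models/v8/yolov8n.yaml")   # or "yolov8n.pt" to start from the pre-trained weights
model.train(
    data="/home/user/hlj/MyTrain/yolov8_main2307/ultralytics/yolo/v8/detect/data/my_yolov8.yaml",
    imgsz=960,
    batch=32,
    epochs=100,
    workers=2,
)
```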

 

(2) Pre-trained weights: yolov8s.pt (22M)

Run command: /yolov8_main/ultralytics$ yolo cfg=./yolo/cfg/default.yaml


==================================================================

4. Test-set evaluation of the trained models

Test-set evaluation of the trained models: 1987 jpg images at 1120*1080.

Test set:

Yolov5_6.2

Run command: python val.py

Run command: python detect.py, which generates detection results for all images.
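For reference, fuller forms of these two commands might look as follows; the weight path and the data yaml name are assumptions based on the training runs above, while the flags themselves are standard YOLOv5 arguments:

Run command: python val.py --weights ./runs/train/exp/weights/best.pt --data ./data/my_yolov5.yaml --img 960 --batch-size 32

Run command: python detect.py --weights ./runs/train/exp/weights/best.pt --source /home/user/hlj/MyTrain/YOLOXDatas/test/ --img 960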

-----------------------------------------------------

Yolov7

To validate the model in val mode, run the following command:

python test.py --weights ./runs/train/base_yolov7tiny_pt12m/weights/best.pt --data ./data/my_yolov7.yaml

 

To run prediction on the test-set images (1987 images), enter the following command:

python detect.py --weights ./runs/train/base_yolov7tiny_pt12m/weights/best.pt --source /home/user/hlj/MyTrain/YOLOXDatas/test/

 

Yolov8  

(1) base_yolov8n_pt6M

To validate the model in val mode, run the following command:

yolo task=detect mode=val model=./runs/detect/base_yolov8n_pt6M/weights/best.pt data=./yolo/v8/detect/data/my_yolov8.yaml  batch=32 workers=0

 

 

To run prediction on the test-set images (1987 images), enter the following command:

yolo task=detect mode=predict model=./runs/detect/base_yolov8n_pt6M/weights/best.pt source=/home/user/hlj/MyTrain/YOLOXDatas/test/ save_crop=True save_conf=True

GPU status during predict

 

 

(2) base_yolov8s_pt22M

To validate the model in val mode, run the following command:

yolo task=detect mode=val model=./runs/detect/base_yolov8s_pt22M/weights/best.pt data=./yolo/v8/detect/data/my_yolov8.yaml  batch=8 workers=0

 

To run prediction on the test-set images (1987 images), enter the following command:

yolo task=detect mode=predict model=./runs/detect/base_yolov8s_pt22M/weights/best.pt source=/home/user/hlj/MyTrain/YOLOXDatas/test/ save_conf=True

 

 

 

5. Comparison of the test results of the models trained with Yolov5, Yolov7, and Yolov8:

| Metric | Yolov5_6.2_s (14M) | Yolov7_tiny (12M) | Yolov8_n (6M) | Yolov8_s (22M) | Remark |
| --- | --- | --- | --- | --- | --- |
| val speed | 0.3 ms pre-process + 1.5 ms inference + 0.4 ms NMS per image at shape (32, 3, 960, 960) = 2.2 ms | 0.9 / 0.6 / 1.4 ms inference / NMS / total per 960x960 image at batch-size 32 | 0.3 ms preprocess + 0.8 ms inference + 0.0 ms loss + 0.3 ms postprocess per image = 1.4 ms | 0.3 ms preprocess + 2.1 ms inference + 0.0 ms loss + 0.3 ms postprocess per image = 2.7 ms | Average per-image pre-processing, inference, and post-processing time |
| P | 0.914 | 0.898 | 0.864 | 0.928 | |
| R | 0.858 | 0.817 | 0.773 | 0.871 | |
| mAP50 | 0.911 | 0.81 | 0.842 | 0.925 | |
| Predict speed | 0.4 ms pre-process + 6.3 ms inference + 0.6 ms NMS per image at shape (1, 3, 960, 960) = 7.3 ms | 17.3 ms inference + 0.7 ms NMS = 18 ms | 2.0 ms preprocess + 7.4 ms inference + 0.8 ms postprocess per image at shape (1, 3, 544, 960) = 10.2 ms | 2.0 ms preprocess + 7.9 ms inference + 0.8 ms postprocess per image at shape (1, 3, 544, 960) = 10.7 ms | |

Per-class comparison of P, R, and mAP50:

P (precision) per category:

| Category | Yolov5_6.2_s (14M) | Yolov7_tiny (12M) | Yolov8_n (6M) | Yolov8_s (22M) |
| --- | --- | --- | --- | --- |
| ALL | 0.914 | 0.898 | 0.864 | 0.928 |
| pedes | 0.913 | 0.909 | 0.838 | 0.898 |
| car | 0.956 | 0.948 | 0.928 | 0.946 |
| bus | 0.919 | 0.929 | 0.908 | 0.936 |
| truck | 0.925 | 0.929 | 0.896 | 0.919 |
| bike | 0.836 | 0.838 | 0.738 | 0.855 |
| elec | 0.905 | 0.902 | 0.86 | 0.897 |
| tricycle | 0.918 | 0.882 | 0.848 | 0.917 |
| coni | 0.962 | 0.976 | 0.956 | 0.984 |
| warm | 0.965 | 0.967 | 0.949 | 0.97 |
| tralight | 0.987 | 0.983 | 0.98 | 0.986 |
| specialVehicle | 0.769 | 0.622 | 0.606 | 0.9 |

R (recall) per category:

| Category | Yolov5_6.2_s (14M) | Yolov7_tiny (12M) | Yolov8_n (6M) | Yolov8_s (22M) |
| --- | --- | --- | --- | --- |
| ALL | 0.858 | 0.817 | 0.773 | 0.871 |
| pedes | 0.7 | 0.629 | 0.569 | 0.709 |
| car | 0.934 | 0.925 | 0.921 | 0.946 |
| bus | 0.943 | 0.93 | 0.907 | 0.951 |
| truck | 0.918 | 0.907 | 0.885 | 0.928 |
| bike | 0.768 | 0.641 | 0.645 | 0.805 |
| elec | 0.883 | 0.837 | 0.846 | 0.901 |
| tricycle | 0.884 | 0.856 | 0.824 | 0.894 |
| coni | 0.962 | 0.919 | 0.891 | 0.927 |
| warm | 0.856 | 0.803 | 0.581 | 0.746 |
| tralight | 0.987 | 0.975 | 0.968 | 0.988 |
| specialVehicle | 0.585 | 0.561 | 0.463 | 0.78 |

mAP50 per category:

| Category | Yolov5_6.2_s (14M) | Yolov7_tiny (12M) | Yolov8_n (6M) | Yolov8_s (22M) |
| --- | --- | --- | --- | --- |
| ALL | 0.911 | 0.81 | 0.842 | 0.925 |
| pedes | 0.841 | 0.62 | 0.702 | 0.836 |
| car | 0.965 | 0.924 | 0.964 | 0.979 |
| bus | 0.962 | 0.928 | 0.94 | 0.967 |
| truck | 0.95 | 0.901 | 0.928 | 0.962 |
| bike | 0.844 | 0.621 | 0.718 | 0.883 |
| elec | 0.924 | 0.824 | 0.899 | 0.941 |
| tricycle | 0.928 | 0.845 | 0.872 | 0.931 |
| coni | 0.98 | 0.922 | 0.965 | 0.979 |
| warm | 0.929 | 0.803 | 0.777 | 0.889 |
| tralight | 0.992 | 0.979 | 0.989 | 0.994 |
| specialVehicle | 0.705 | 0.542 | 0.51 | 0.813 |

Origin: blog.csdn.net/qq_42835363/article/details/131817017