YOLOv8: full analysis of the principle and implementation

Table of contents

1. Introduction

2. Overview of YOLOv8

3. Model structure design

4. Loss calculation 

5. Training data augmentation

6. Training strategy

7. Model inference process

8. Summary


1. Introduction

YOLOv8 is the next major update to YOLOv5. It was open sourced by Ultralytics on January 10, 2023, and currently supports image classification, object detection and instance segmentation tasks. It had already received widespread attention from users before it was open sourced.

According to the official description, YOLOv8 is a SOTA model that builds on the success of previous YOLO versions and introduces new features and improvements to further improve performance and flexibility. Specific innovations include a new backbone network, a new Anchor-Free detection head and new loss functions, and the model can run on a variety of hardware platforms from CPU to GPU. However, Ultralytics did not name the open source library YOLOv8 but simply used the word Ultralytics, because it positions the library as an algorithm framework rather than a specific algorithm, with scalability as one of its main features. The hope is that the library can be used not only for the YOLO series but also for non-YOLO models and for tasks such as classification, segmentation and pose estimation. In summary, the two main advantages of the Ultralytics open source library are:

  • It integrates many current SOTA techniques in one place
  • It will support other YOLO series models and more non-YOLO algorithms in the future

The official results on the COCO val2017 dataset report mAP, parameter counts and FLOPs. The accuracy of YOLOv8 is considerably higher than that of YOLOv5, but the parameters and FLOPs of the N/S/M models have also increased significantly, and the inference speed of most models is slower than the corresponding YOLOv5 models.

At present, the various improved YOLO algorithms have significantly raised performance on COCO, but their generalization to custom datasets has not been widely verified, which is why YOLOv5 is still widely praised for its generalization ability.

2. Overview of YOLOv8

The core features and changes of the YOLOv8 algorithm can be summarized as follows:

  1. Provides a brand new SOTA model, including P5 640 and P6 1280 resolution object detection networks and a YOLACT-based instance segmentation model. Like YOLOv5, models of N/S/M/L/X scales are provided via scaling factors to meet the needs of different scenarios;

  2. The backbone network and Neck probably refer to the design idea of YOLOv7 ELAN: the C3 structure of YOLOv5 is replaced by the C2f structure with richer gradient flow, and the channel numbers are adjusted separately for models of different scales instead of blindly applying one set of parameters to all models, which greatly improves performance. However, operations such as Split in the C2f module are less friendly to specific hardware deployment than before;

  3. Compared with YOLOv5, the Head part has changed a lot: it is replaced with the currently mainstream decoupled head structure, separating the classification and regression heads, and it also changes from Anchor-Based to Anchor-Free;

  4. In terms of Loss calculation, the TaskAlignedAssigner positive sample assignment strategy is adopted, and Distribution Focal Loss is introduced;

  5. The data augmentation part of training introduces the operation from YOLOX of turning off Mosaic augmentation in the last 10 epochs, which can effectively improve accuracy.

It can be seen that YOLOv8 mainly draws on the designs of recently proposed algorithms such as YOLOX, YOLOv6, YOLOv7 and PPYOLOE. It does not introduce many innovations of its own and leans toward engineering practice.

The following introduces the improvements of YOLOv8 object detection in detail, in five parts: model structure design, Loss calculation, training data augmentation, training strategy and model inference process. The instance segmentation part is not covered here.

3. Model structure design

Setting aside the Head for the moment, a comparison of the YAML configuration files of YOLOv5-s and YOLOv8-s shows that the changes to the backbone and Neck are minor.

The specific changes of the backbone network and Neck are as follows:

  • The kernel of the first convolutional layer is changed from 6x6 to 3x3
  • All C3 modules are replaced with C2f, which has more skip connections and an additional Split operation (a minimal sketch is given after this list)

  • Removed the 2 convolutional connection layers in the Neck module
  • The number of C2f blocks in Backbone changed from 3-6-9-3 to 3-6-6-3
  • Looking at the N/S/M/L/X models, the two pairs N/S and L/X differ only in the scaling factor, while the backbone channel numbers of S/M/L are set individually and do not follow one set of scaling factors. The reason for this design is probably that the channel settings under a single set of scaling factors are not optimal; the YOLOv7 network design likewise does not apply one set of scaling factors to all models
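
To make the C2f change concrete, here is a minimal PyTorch sketch of a C2f-style block (an illustration written for this article, not the actual Ultralytics implementation; the class names are made up). The input is projected by a 1x1 conv, split into two chunks, and every intermediate bottleneck output is kept and concatenated before a final 1x1 conv, which is where the extra skip connections and the Split operation come from.

```python
# Minimal sketch of a C2f-style block (illustrative, not the Ultralytics code).
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """3x3 -> 3x3 bottleneck with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Split the features, run n bottlenecks, keep every intermediate output,
    and fuse everything with a final 1x1 conv (richer gradient flow than C3)."""
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # the Split operation
        for m in self.m:
            y.append(m(y[-1]))                  # keep each bottleneck output
        return self.cv2(torch.cat(y, dim=1))
```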

The Head part has changed the most: the original coupled head is replaced with a decoupled head, and the design changes from YOLOv5's Anchor-Based to Anchor-Free.

There is no longer the previous objectness branch; only the decoupled classification and regression branches remain, and the regression branch uses the integral-form representation proposed in Distribution Focal Loss.
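
As an illustration of the decoupled, Anchor-Free head described above, here is a minimal PyTorch sketch of one scale (the class name DecoupledHead and the layer arrangement are assumptions for illustration, not the real Ultralytics code; reg_max=16 DFL bins per box side is assumed):

```python
# Illustrative sketch of a per-scale decoupled head (not the real Ultralytics code).
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """One scale: a conv stack for classification and a separate conv stack for
    regression; the regression output has 4 * reg_max channels because each box
    side is predicted as a distribution over reg_max discrete bins (DFL)."""
    def __init__(self, c_in, num_classes=80, reg_max=16):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_classes, 1),            # class logits
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, 4 * reg_max, 1),            # DFL bins for l, t, r, b
        )

    def forward(self, x):
        # No objectness branch: only classification and regression outputs.
        return self.cls_branch(x), self.reg_branch(x)

head = DecoupledHead(c_in=256)
cls_out, reg_out = head(torch.randn(1, 256, 80, 80))
# cls_out: (1, 80, 80, 80), reg_out: (1, 64, 80, 80)
```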

4. Loss calculation 

The Loss calculation process includes two parts: the positive and negative sample assignment strategy and the Loss calculation itself.

Most modern object detectors put considerable effort into the positive and negative sample assignment strategy, for example YOLOX's simOTA, TOOD's TaskAlignedAssigner and RTMDet's DynamicSoftLabelAssigner. Most of these assigners are dynamic, whereas YOLOv5 still uses a static assignment strategy. Given the superiority of dynamic assignment, YOLOv8 directly adopts TOOD's TaskAlignedAssigner. Its matching strategy can be summarized simply as: select positive samples according to a score that combines the classification score and the regression quality:

t = s^α × u^β

where s is the predicted classification score for the ground-truth category, u is the IoU between the prediction box and the GT box, and α and β are weighting exponents. Multiplying the two measures the degree of task alignment.

  • For each GT, an alignment metric alignment_metrics is computed for every prediction box by combining the classification score of the GT's category with the IoU between the prediction box and the GT box
  • For each GT, the top-k prediction boxes with the largest alignment_metrics are selected directly as positive samples (a sketch follows this list)
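
Below is a hedged sketch of the core of this assignment, assuming the classification scores and pairwise IoUs have already been computed; the alpha/beta exponents and the topk value are illustrative defaults, and the conflict-resolution step (a prior matched to several GTs keeps only the one with the highest IoU) is omitted for brevity.

```python
# Simplified sketch of TaskAlignedAssigner-style selection (illustrative only).
import torch

def select_positives(cls_scores, ious, gt_labels, topk=10, alpha=0.5, beta=6.0):
    """cls_scores: (num_priors, num_classes) predicted classification scores.
    ious:       (num_gt, num_priors) IoU between each GT box and each prediction.
    gt_labels:  (num_gt,) class index of each GT.
    Returns a (num_gt, num_priors) bool mask of selected positive samples."""
    # Score of each prediction for the category of each GT: (num_gt, num_priors)
    s = cls_scores[:, gt_labels].T
    # Task alignment metric t = s^alpha * u^beta
    alignment_metrics = s.pow(alpha) * ious.pow(beta)
    # For every GT, keep the top-k predictions with the largest alignment metric
    topk_idx = alignment_metrics.topk(topk, dim=1).indices
    mask = torch.zeros_like(alignment_metrics)
    mask.scatter_(1, topk_idx, 1.0)
    return mask.bool()
```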

Loss calculation covers two branches, classification and regression, without the previous objectness branch.

  • The classification branch still uses BCE Loss
  • The regression branch is bound to the integral-form representation proposed in Distribution Focal Loss, so it uses Distribution Focal Loss, together with CIoU Loss

The three losses are combined with a weighted sum using certain weight ratios.
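
For reference, a minimal sketch of Distribution Focal Loss for a single box side: each continuous regression target (expressed in bin units) is supervised by cross-entropy on the two neighbouring discrete bins, weighted by how close the target is to each bin. This is a simplified illustration, not the exact library code.

```python
# Simplified Distribution Focal Loss for one box side (illustrative sketch).
import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_dist, target):
    """pred_dist: (N, reg_max) raw logits over the discrete bins.
    target: (N,) continuous regression targets in bin units, in [0, reg_max - 1]."""
    # Clamp so the right bin index stays inside the valid range.
    target = target.clamp(0, pred_dist.size(1) - 1 - 1e-4)
    yl = target.long()                  # left (lower) bin index
    yr = yl + 1                         # right (upper) bin index
    wl = yr.float() - target            # weight of the left bin
    wr = target - yl.float()            # weight of the right bin
    loss = (F.cross_entropy(pred_dist, yl, reduction='none') * wl
            + F.cross_entropy(pred_dist, yr, reduction='none') * wr)
    return loss.mean()
```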

5. Training data augmentation

In terms of data augmentation, there is not much difference from YOLOv5, except that the operation of turning off Mosaic in the last 10 epochs, proposed in YOLOX, is introduced. Assuming 500 training epochs in total, Mosaic is used for the first 490 epochs and turned off for the last 10.
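
A minimal sketch of this schedule (constants are illustrative): Mosaic is applied for the first 490 epochs and switched off for the last 10, so the network finishes training on undistorted images.

```python
# Illustrative sketch of the "turn off Mosaic in the last N epochs" schedule.
MAX_EPOCHS = 500          # total training epochs assumed in the text
CLOSE_MOSAIC_EPOCHS = 10  # last epochs trained without Mosaic

def use_mosaic(epoch: int) -> bool:
    """Return True if Mosaic augmentation should be applied at this (0-based) epoch."""
    return epoch < MAX_EPOCHS - CLOSE_MOSAIC_EPOCHS

# Epochs 0..489 use Mosaic (plus the other augmentations); epochs 490..499 do not.
```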

Considering that models of different sizes should use different augmentation strengths, some hyperparameters are adjusted for models of different scales; typically, MixUp and CopyPaste are enabled for large models.

6. Training strategy

The training strategy of YOLOv8 differs little from that of YOLOv5. The biggest difference is that the total number of training epochs is increased from 300 to 500, which leads to a large increase in training time. Taking YOLOv8-S as an example, its training strategy is summarized as follows:

 

7. Model inference process

The inference process of YOLOv8 is almost the same as that of YOLOv5. The only difference is that the integral-form bbox from Distribution Focal Loss needs to be decoded into a conventional 4-dimensional bbox; the subsequent computation is the same as in YOLOv5.

Taking the 80 COCO classes as an example and assuming an input image size of 640x640, the inference process as implemented in MMYOLO is described below.

Its inference and post-processing process is:

(1) Convert the bbox integral form to the 4d bbox format: the bbox branch output by the Head is in the DFL integral form and is converted to the 4-dimensional bbox format using Softmax and Conv calculations (i.e. taking the expectation over the discrete bins).
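
A minimal sketch of this conversion, assuming reg_max=16 bins per side and side-major channel ordering: a Softmax over the bins followed by a fixed 1x1 convolution whose weights are 0, 1, ..., reg_max-1, i.e. the expectation of each distribution. It mirrors the Softmax + Conv computation described above but is not the exact library code.

```python
# Sketch: decode the DFL integral form into 4 distances per anchor point.
import torch
import torch.nn as nn

def integral_to_distances(reg_out, reg_max=16):
    """reg_out: (b, 4 * reg_max, h, w) raw logits from the regression branch.
    Returns (b, 4, h, w) expected distances (left, top, right, bottom) in stride units."""
    b, _, h, w = reg_out.shape
    # A 1x1 conv with frozen weights 0, 1, ..., reg_max-1 computes the expectation.
    proj = nn.Conv2d(reg_max, 1, kernel_size=1, bias=False)
    proj.weight.data[:] = torch.arange(reg_max, dtype=torch.float).view(1, reg_max, 1, 1)
    x = reg_out.view(b * 4, reg_max, h, w).softmax(dim=1)  # distribution over bins
    return proj(x).view(b, 4, h, w)                        # expectation per side
```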

(2) Dimensional transformation: YOLOv8 outputs three feature maps at scales 80x80, 40x40 and 20x20, so the Head outputs 6 feature maps in total (classification and regression for each scale). The class prediction branches and bbox prediction branches of the three scales are concatenated, and the dimensions are permuted so that the channel dimension is moved to the end for the convenience of subsequent processing. The class prediction branch and the bbox prediction branch then have shapes (b, 80x80+40x40+20x20, 80) = (b, 8400, 80) and (b, 8400, 4), respectively.
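
A short sketch of this reshaping step, assuming the per-scale outputs have already been produced (shapes given for a 640x640 input; the function name is made up):

```python
# Sketch: flatten and concatenate the three scales, channels moved to the end.
import torch

def flatten_and_concat(per_scale_outputs):
    """per_scale_outputs: list of (b, c, h, w) tensors, e.g. h x w = 80x80, 40x40, 20x20.
    Returns (b, 80*80 + 40*40 + 20*20, c) = (b, 8400, c)."""
    flat = [o.flatten(2).permute(0, 2, 1) for o in per_scale_outputs]  # (b, h*w, c) each
    return torch.cat(flat, dim=1)

# Example with the class branch (c = 80) and the bbox branch (c = 4):
b = 1
cls_maps = [torch.randn(b, 80, s, s) for s in (80, 40, 20)]
box_maps = [torch.randn(b, 4, s, s) for s in (80, 40, 20)]
cls_pred = flatten_and_concat(cls_maps)   # (1, 8400, 80)
box_pred = flatten_and_concat(box_maps)   # (1, 8400, 4)
```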

(3) Decode and restore the image scale: the classification prediction branch goes through a Sigmoid, while the bbox prediction branch is decoded from per-point distances and restored to boxes in xyxy format at the input image scale.
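
A minimal sketch of this decoding under the usual anchor-point convention: each grid position predicts distances to the four box sides (in stride units), which are scaled by the stride and offset from the cell center. This follows the common distance-to-bbox decoding rather than the exact MMYOLO code; the helper names are made up.

```python
# Sketch: decode per-point (l, t, r, b) distances into xyxy boxes on the input scale.
import torch

def make_anchor_points(feat_sizes=((80, 8), (40, 16), (20, 32))):
    """Grid-cell centers (in input-image pixels) and strides for each scale."""
    points, strides = [], []
    for size, stride in feat_sizes:
        ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing='ij')
        pts = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() + 0.5
        points.append(pts * stride)
        strides.append(torch.full((size * size, 1), float(stride)))
    return torch.cat(points), torch.cat(strides)    # (8400, 2), (8400, 1)

def distance2bbox(points, distances, strides):
    """points: (n, 2) centers; distances: (b, n, 4) l, t, r, b in stride units."""
    d = distances * strides                          # back to pixels
    x1y1 = points - d[..., :2]
    x2y2 = points + d[..., 2:]
    return torch.cat([x1y1, x2y2], dim=-1)           # (b, n, 4) xyxy

points, strides = make_anchor_points()
boxes = distance2bbox(points, torch.rand(1, 8400, 4), strides)  # (1, 8400, 4)
```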

(4) Threshold filtering: iterate over each image in the batch and filter with the score_thr threshold. In this process, multi_label and nms_pre also need to be considered to ensure that the number of filtered detection boxes does not exceed nms_pre.

(5) Restore to the original image scale and NMS: based on the pre-processing information, rescale the remaining detection boxes back to the original image scale and then perform NMS. The final number of output detection boxes cannot exceed max_per_img.
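
A compact sketch of steps (4) and (5) for a single image, using torchvision's NMS. For brevity it takes the single-label path and class-agnostic NMS; scale_factor stands for the pre-processing resize ratio, and score_thr / nms_pre / max_per_img mirror the parameters mentioned above (the default values are illustrative).

```python
# Sketch of per-image post-processing: score filtering, rescaling, NMS, top-k capping.
import torch
from torchvision.ops import nms

def postprocess_single(boxes, scores, scale_factor,
                       score_thr=0.25, nms_pre=30000, iou_thr=0.7, max_per_img=300):
    """boxes: (n, 4) xyxy on the network input scale; scores: (n, num_classes) after Sigmoid.
    scale_factor: resize ratio used when mapping the original image to the network input."""
    cls_scores, labels = scores.max(dim=1)             # single-label case for simplicity
    keep = cls_scores > score_thr                      # score_thr filtering
    boxes, cls_scores, labels = boxes[keep], cls_scores[keep], labels[keep]
    if cls_scores.numel() > nms_pre:                   # cap candidates before NMS
        cls_scores, idx = cls_scores.topk(nms_pre)
        boxes, labels = boxes[idx], labels[idx]
    boxes = boxes / scale_factor                       # restore to original image scale
    keep = nms(boxes, cls_scores, iou_thr)[:max_per_img]
    return boxes[keep], cls_scores[keep], labels[keep]
```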

8. Summary

This article analyzes the latest YOLOv8 algorithm in detail, from the overall design to the model structure, Loss calculation, training data augmentation, training strategy and inference process, with schematic diagrams provided for convenience. In short, YOLOv8 is an efficient algorithm covering image classification, Anchor-Free object detection and instance segmentation. Its detection design draws on many excellent recent YOLO improvements and achieves a new SOTA. Beyond the algorithm itself, a brand new framework has been introduced, although it is still in its early stages and needs continuous improvement.

Reference: Algorithm principles and implementation with YOLOv8 — MMYOLO 0.5.0 documentation

