In-depth interpretation of YOLOv8 principle, super detailed

Overall structure

Insert image description here
Insert image description here

  • Backbone: the feature extractor. Its job is to extract information from the image for use by the subsequent parts of the network.

  • Neck: sits between the backbone and the head; it performs "feature fusion" so that the features extracted by the backbone can be used more effectively.

  • Head: uses the previously extracted features to make the final predictions.

Some common Backbone, Neck, Head networks

Insert image description here

We will use many terms and abbreviations when describing the YOLOv8 model below. Likewise, when reading YOLOv8 posts online, you will find that they (including this article) assume you already have a good understanding of the models from YOLOv1 to YOLOv7, or at least that you are very familiar with YOLOv5.

This is because YOLOv8 is a work that draws on many predecessors: it refers to the YOLOv1 to YOLOv7 series and integrates the strengths of each.

Backbone

Darknet-19 draws on the following design ideas from other algorithms:

  • Drawing on the idea of VGG, it uses more 3×3 convolutions and doubles the number of channels after each pooling operation;

  • Drawing on the idea of Network-in-Network, it uses global average pooling to make predictions and places 1×1 convolutions between the 3×3 convolutions to compress features; (I did not find where this step is reflected in YOLOv8)

  • Batch normalization layers are used to stabilize training, accelerate convergence, and provide regularization.

The above three points are what Darknet-19 borrows from other models.

Darknet-53: the "53" refers to 52 convolutional layers plus the output layer.

Darknet-53 naturally inherits the advantages of Darknet-19 and adds the following advantages of its own, which is why it is listed here:

  • Drawing on the ideas of ResNet, a large number of residual connections are used in the network, so the network can be made very deep while the vanishing-gradient problem during training is alleviated, making the model easier to converge.

  • A convolutional layer with stride 2 is used instead of a pooling layer to implement downsampling. (This is obvious in the classic Darknet-53: the spatial size of the output drops from 256 to 128, then to 64, and all the way to 8, each step achieved through a stride-2 convolution. The same pattern appears in the convolutional layers of YOLOv8, at the positions I marked in the picture; see the sketch after this list.)

  • Feature fusion
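
As a minimal sketch of that stride-2 downsampling (the channel counts here are illustrative, not taken from the YOLOv8 configuration), a single PyTorch layer halves the spatial resolution while increasing the channel count:

import torch
import torch.nn as nn

# A 3x3 convolution with stride 2 halves height/width, replacing a pooling layer.
# The channel counts (64 -> 128) are illustrative only.
downsample = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 256, 256)   # N, C, H, W
y = downsample(x)
print(y.shape)                      # torch.Size([1, 128, 128, 128])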

The model architecture diagram is as follows

The characteristics of Darknet-53 can be summarized as follows: (Conv convolution module + Residual Block) stacked serially 4 times.

One Conv convolutional layer plus one Residual Block is called a stage.

Insert image description here

As pointed out in red above, the original Darknet-53 has an extra convolutional layer at this position; in YOLOv8 this convolutional layer is removed.

Why remove it?

  • What does that middle convolutional layer in the original Darknet-53 do? It first raises the number of filters (convolution kernels) from the 512 of the previous convolutional layer to 1024, and the following convolutional layer then brings the number of kernels back down to 512.

  • After removing this layer there are 1024 fewer convolution kernels and 1024 fewer convolution operations. Each of those 3×3 kernels spans the 512 input channels, so roughly 3×3×512×1024 ≈ 4.7 million fewer parameters need to be fitted, which greatly reduces the size of the model (in effect, lightweighting it).

  • This convolutional layer was probably removed because the authors found that the model's score improved without it. Why would the score go up after removing it? Possibly because the extra parameters made overfitting easier: too many parameters let the model overfit the training set, perform poorly on the test set, and end up with a lower final score. With the layer removed, there are fewer parameters, overfitting is less severe, and generalization improves. Of course, this is taking the experimental result and reasoning backwards to justify the observation.

Insert image description here

From the diagram in the official MMDetection drawing, we can see that the input image passes through a "Feature Pyramid Network (FPN)", and the resulting P3, P4, and P5 are then passed to the Neck and Head for the recognition task; the Neck also uses a PAN (Path Aggregation Network).

Insert image description here

"FPN is top-down, passing down strong semantic features from high levels. PAN adds a bottom-up pyramid behind FPN to supplement FPN and pass down strong localization features from low levels.

FPN is from the top (small size, the result of a large number of convolutions, rich semantic information) to the downward (large size, the result of a small number of convolutions), passing down the strong semantic features of the high level to enhance the entire pyramid. However, it only enhances the semantic information and does not transfer the positioning information. PAN is aimed at this point. It adds a pyramid from the bottom (less number of convolutions, large size) to upward (more number of convolutions, small size, rich semantic information) behind FPN to supplement FPN and combine the strong localization features of the lower layers. Passing it on is also known as the "Twin Towers Tactics".

The FPN layer conveys strong semantic features from top to bottom, while the feature pyramid conveys strong localization features from bottom to top. The two work together to aggregate parameters of different detection layers from different backbone layers. This operation is indeed Very skinny.
Bottom-up enhancement

PAN (Path Aggregation Network) is an improvement on FPN. Its design idea is to add a bottom-up pyramid after FPN. PAN introduces path aggregation: it fuses shallow feature maps (high resolution but weak semantic information) with deep feature maps (low resolution but rich semantic information) and passes feature information along specific paths, carrying the strong localization features of the lower layers upward. This further strengthens the multi-scale feature representation, which is why PAN performs even better on object detection tasks.
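
As a rough sketch of the classic top-down FPN pathway described above (the layer names, channel counts, and input sizes are illustrative assumptions, not the YOLOv8 configuration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal top-down FPN: 1x1 lateral convs, then upsample-and-add."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)

# Backbone features at strides 8, 16, 32 of a 256x256 input (shapes are illustrative).
c3, c4, c5 = torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16), torch.randn(1, 512, 8, 8)
p3, p4, p5 = TinyFPN()(c3, c4, c5)
print(p3.shape, p4.shape, p5.shape)   # channels unified to 128, spatial sizes preserved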

FPN Explained | Papers With Code

https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c

The normal YOLOv8 object detection model has output layers P3, P4, and P5. To improve the detection of small targets, newer YOLOv8 configurations also include a P2 layer (the P2 layer has gone through fewer convolutions, and its feature map has a larger size/resolution, which is more conducive to small-target recognition), giving four output layers. The output of the Backbone part is unchanged, but the Neck and Head structures are adjusted accordingly. This is why the v8 model yaml directory https://github.com/ultralytics/ultralytics/tree/main/ultralytics/cfg/models/v8 contains a p2 model.

The P6 variant exists to introduce more parameters: an extra convolution stage is prepared for the xlarge scale, and it is a version intended specifically for high-resolution images (the image size is large and there is a lot of information that can be mined).

model=yolov8n.yaml      the normal version
model=yolov8n-p2.yaml   the small-target detection version
model=yolov8n-p6.yaml   the high-resolution version
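
A hedged usage sketch with the Ultralytics Python API (the dataset yaml and training settings below are placeholders):

from ultralytics import YOLO

# Build the small-target (P2) variant from its yaml and train it.
# "coco128.yaml", the epoch count, and imgsz are illustrative placeholders.
model = YOLO("yolov8n-p2.yaml")
model.train(data="coco128.yaml", epochs=100, imgsz=640)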

Insert image description here

The entire YOLOv8 backbone is drawn as follows:

We can see that this backbone is composed of three kinds of modules: CBS, C2f, and SPPF.

Insert image description here

The convolution module uses CBS

Three parts: (1) a two-dimensional convolution + (2) two-dimensional BatchNorm + (3) SiLU activation function
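
A minimal sketch of such a Conv + BatchNorm + SiLU block (an approximation, not the exact Ultralytics implementation; kernel size and stride are parameters):

import torch.nn as nn

class CBS(nn.Module):
    """Conv2d + BatchNorm2d + SiLU: the basic convolution block."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)  # bias is redundant before BN
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))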

Insert image description here

The activation of SiLU is calculated by multiplying its input by the sigmoid function, i.e. xσ(x).

Insert image description here

Advantages of SiLU:

  • No upper bound (helps avoid saturation) (Diss: which of the common activation functions has an upper bound anyway?)

  • Bounded below (produces a stronger regularization effect) (Diss: same as above, many activation functions are bounded below; is this point necessary?)

  • Smooth (differentiable everywhere, easier to train) (agree: indeed smoother than ReLU)

  • Non-monotonic for x < 0: the function value first decreases and then increases, decreasing before roughly x ≈ -1.28 (close to -1) and increasing after it (this matters for the distribution of activations, and it is also the biggest difference between Swish/SiLU and ReLU)

If you look at the brown line, d(SiLU), the derivative of the SiLU activation function: as x grows the derivative rises slightly above 1 (peaking at roughly 1.1) and then approaches 1 as x tends toward positive infinity.

Starting from the minimum point (the lowest function value, at an abscissa close to -1) and moving toward the positive half-axis, the function is monotonically increasing.
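
A quick numerical check of these claims (a sketch; the grid range and resolution are arbitrary):

import torch

x = torch.linspace(-6, 6, 12001, requires_grad=True)
y = torch.nn.functional.silu(x)          # SiLU(x) = x * sigmoid(x)
(dy,) = torch.autograd.grad(y.sum(), x)  # elementwise derivative

i_min = y.argmin()
print(f"minimum at x = {x[i_min].item():.3f}, SiLU = {y[i_min].item():.3f}")  # about -1.278, -0.278
print(f"max derivative = {dy.max().item():.3f}")                              # about 1.10
print(f"derivative at x = 6: {dy[-1].item():.3f}")                            # about 1.01, decaying toward 1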

Residual block uses C2f

Insert image description here

Insert image description here

First, a CBS convolutional layer.

Then a split: the height × width × c_out feature map is cut in half along the channel (c_out) dimension, so one half is h × w × 0.5·c_out and the other half is h × w × 0.5·c_out. This act of halving along the channel dimension is called a split.

The feature map before a Bottleneck and the feature map after the Bottleneck are fused by concat; this is what is referred to as the residual connection here.

n Bottlenecks are chained in series, and the output of each Bottleneck is kept and concatenated at the end, which acts as a form of feature fusion.

Advantage: this lets YOLOv8 obtain richer gradient-flow information while remaining lightweight.
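
A hedged sketch of the C2f idea (simplified relative to the real Ultralytics module; the helper block, channel split ratio, and Bottleneck layout are approximations):

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1, s=1):
    """Conv + BN + SiLU helper (same pattern as the CBS block sketched earlier)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class Bottleneck(nn.Module):
    """Two 3x3 CBS blocks with an optional residual add (add=True inside the backbone C2f)."""
    def __init__(self, c, add=True):
        super().__init__()
        self.block = nn.Sequential(cbs(c, c, 3), cbs(c, c, 3))
        self.add = add

    def forward(self, x):
        return x + self.block(x) if self.add else self.block(x)

class C2f(nn.Module):
    """Split the channels, run n Bottlenecks, keep every intermediate output, concat, then fuse."""
    def __init__(self, c_in, c_out, n=1, add=True):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = cbs(c_in, 2 * self.c, 1)          # produces the two halves of the split
        self.cv2 = cbs((2 + n) * self.c, c_out, 1)   # fuses everything that was kept
        self.m = nn.ModuleList(Bottleneck(self.c, add) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))        # split along the channel dimension
        for m in self.m:
            y.append(m(y[-1]))                       # each Bottleneck feeds the next; every output is kept
        return self.cv2(torch.cat(y, dim=1))         # the concat acts as the CSP-style fusion

x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128, n=2)(x).shape)                    # torch.Size([1, 128, 80, 80])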

SPPF

Insert image description here

The input first passes through a CBS convolutional layer and then through three MaxPool operations in series. The feature map before any MaxPool and the feature maps obtained after each successive MaxPool are concatenated to achieve feature fusion.
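
A sketch along those lines (the 5×5 pooling kernel and the halved hidden channel count follow the common YOLOv5/YOLOv8 convention, but treat the details as assumptions):

import torch
import torch.nn as nn

class SPPF(nn.Module):
    """CBS, then three chained 5x5 max-pools; concat the pre-pool map with all three pooled maps."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(4 * c_hidden, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)  # keeps the spatial size

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))

print(SPPF(512, 512)(torch.randn(1, 512, 20, 20)).shape)   # torch.Size([1, 512, 20, 20])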

Using the CSP "network design method" to make Darknet lightweight

Concatenating the original feature map with the result of applying convolution operations to that same feature map: this is the idea of CSP.

Insert image description here

Insert image description here

Insert image description here

If you look at the backbone alone, there seems to be no CSP idea between these modules: everything is serial, without any cross-layer fusion.

Insert image description here

In fact, the CSP idea is used inside the modules that make up the backbone, namely the C2f module and the SPPF module. The DarknetBottleneck (add=True) inside the C2f module also uses the CSP network design idea.

Insert image description here

Neck

The application of the CSP idea in the Neck.

There are four red "c" marks in the picture below, each using the CSP idea: an original feature map that has not been further convolved is fused with a feature map obtained by applying many convolution operations to that same original feature map. The fusion happens at the four red "c" positions.

Insert image description here

PAN-FPN

PAN-FPN means that the Neck part of YOLOv8 imitates the structure of PANet, that is, an FPN (feature pyramid network) plus a path-aggregation branch. A simple summary of the PAN-FPN structure: it downsamples first and then upsamples, with two cross-layer fusion connections between the upsampling and downsampling branches. (It can also be described the other way around: upsample first, then downsample.)
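
A hedged sketch of this top-down plus bottom-up fusion pattern (the fusion blocks here are plain CBS-style convolutions instead of C2f, and the channel counts and input sizes are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

def cbs(c_in, c_out, k=3, s=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class TinyPANFPN(nn.Module):
    """Top-down (FPN) pass followed by bottom-up (PAN) pass; fusion by channel concat."""
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.fuse_p4 = cbs(c4 + c5, c4)   # top-down: upsampled P5 + C4
        self.fuse_p3 = cbs(c3 + c4, c3)   # top-down: upsampled P4 + C3
        self.down_p3 = cbs(c3, c3, s=2)   # bottom-up: downsample P3
        self.fuse_n4 = cbs(c3 + c4, c4)
        self.down_p4 = cbs(c4, c4, s=2)
        self.fuse_n5 = cbs(c4 + c5, c5)

    def forward(self, c3, c4, c5):
        p4 = self.fuse_p4(torch.cat([F.interpolate(c5, scale_factor=2), c4], 1))
        p3 = self.fuse_p3(torch.cat([F.interpolate(p4, scale_factor=2), c3], 1))
        n4 = self.fuse_n4(torch.cat([self.down_p3(p3), p4], 1))
        n5 = self.fuse_n5(torch.cat([self.down_p4(n4), c5], 1))
        return p3, n4, n5   # fed to the three detection heads

c3, c4, c5 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
for f in TinyPANFPN()(c3, c4, c5):
    print(f.shape)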

Insert image description here

Use C2f module as residual block

There’s not much to say, just these four green and blue modules in the picture

Insert image description here

Head

At each detection scale, the head first forks into two branches of CBS convolution modules, each followed by a Conv2d, and finally the classification loss and the Bbox loss are computed separately.

Decoupled-Head (decoupled head)

The Head part has undergone major changes compared to YOLOv5: the original coupled head has been replaced by the now-mainstream decoupled head structure (Decoupled-Head), which separates the classification and regression heads. At the same time, because of the DFL (Distribution Focal Loss) idea, the number of channels of the regression head becomes 4*reg_max, where reg_max defaults to 16.
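
A rough sketch of one such decoupled head at a single scale (the channel widths and the number of CBS blocks per branch are simplifications, not the exact Ultralytics layout):

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class DecoupledHead(nn.Module):
    """One detection scale: separate branches for box regression (4*reg_max channels) and classification."""
    def __init__(self, c_in, num_classes=80, reg_max=16):
        super().__init__()
        self.box_branch = nn.Sequential(cbs(c_in, c_in), cbs(c_in, c_in),
                                        nn.Conv2d(c_in, 4 * reg_max, 1))   # DFL distribution bins
        self.cls_branch = nn.Sequential(cbs(c_in, c_in), cbs(c_in, c_in),
                                        nn.Conv2d(c_in, num_classes, 1))

    def forward(self, x):
        return self.box_branch(x), self.cls_branch(x)

box, cls = DecoupledHead(256)(torch.randn(1, 256, 40, 40))
print(box.shape, cls.shape)   # torch.Size([1, 64, 40, 40]) torch.Size([1, 80, 40, 40])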

Insert image description here

Insert image description here

As for how the Decoupled Head concretely separates the classification and regression tasks, you have to look at the code, which is the snippet below. (I still have not fully understood how the separation works; when I need this part later I will study it in detail.)

Insert image description here

One head does target localization and is measured by the Bbox loss, whose loss function includes two parts: CIoU and DFL.

The other head does classification and is measured with the BCE binary cross-entropy loss function (VFL, Varifocal Loss, is also defined in the code, but as shown below it is not the one actually used).

Insert image description here

Loss function design

Classification loss function

Note that although VarifocalLoss is defined as follows at line 11 of "\ultralytics\yolo\utils\loss.py", the classification loss actually used in training is the BCE loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VarifocalLoss(nn.Module):
    # Varifocal loss by Zhang et al. https://arxiv.org/abs/2008.13367
    def __init__(self):
        super().__init__()
 
    def forward(self, pred_score, gt_score, label, alpha=0.75, gamma=2.0):
        weight = alpha * pred_score.sigmoid().pow(gamma) * (1 - label) + gt_score * label
        with torch.cuda.amp.autocast(enabled=False):
            loss = (F.binary_cross_entropy_with_logits(pred_score.float(), gt_score.float(), reduction="none") *
                    weight).sum()
        return loss

I provide the evidence below. At line 187 of \ultralytics\yolo\v8\detect\train.py there is the following code; it sits inside the "class Loss:" definition in that train.py file.

First of all, we can see that for the cls loss, the YOLOv8 authors did not use varifocal loss (even though varifocal loss is defined in loss.py, as above), but used BCE loss instead.

Then the bbox loss consists of CIoU and DFL.

Finally, the three losses are weighted by their gains and summed to obtain the final loss.

# cls loss
# loss[1] = self.varifocal_loss(pred_scores, target_scores, target_labels) / target_scores_sum  # VFL way
loss[1] = self.bce(pred_scores, target_scores.to(dtype)).sum() / target_scores_sum  # BCE
 
# bbox loss
if fg_mask.sum():
    loss[0], loss[2] = self.bbox_loss(pred_distri, pred_bboxes, anchor_points, target_bboxes, target_scores,
                                              target_scores_sum, fg_mask)
 
loss[0] *= self.hyp.box  # box gain
loss[1] *= self.hyp.cls  # cls gain
loss[2] *= self.hyp.dfl  # dfl gain
 
return loss.sum() * batch_size, loss.detach()  # loss(box, cls, dfl)

Classification loss VFL Loss

The samples are unbalanced: there are very few positive samples and many negative samples, so the overall contribution of negative samples to the loss needs to be reduced, which is why focal loss is used. VFL of course keeps all the features of focal loss.

Unique to VFL:

  • (1) It learns the IACS score (a localization-aware, or IoU-aware, classification score)
  • (2) Positive samples with a high IoU against the ground truth contribute more to the loss, which lets the network focus on those high-quality samples. In other words, training on high-quality positive examples improves AP more than training on low-quality ones.

Insert image description here

Target recognition loss 1-DFL (Distribution Focal Loss)

Its formula is as follows (from the paper linked below, for a continuous target y lying between two adjacent discrete bins y_i and y_{i+1}): DFL(S_i, S_{i+1}) = -((y_{i+1} - y) log(S_i) + (y - y_i) log(S_{i+1}))

Insert image description here

https://arxiv.org/pdf/2006.04388.pdf

It models the position of each box edge as a general (discrete) distribution, letting the network quickly focus its probability mass on values near the target position.
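
A sketch of that formula in code (based on the published DFL definition, not the exact Ultralytics implementation; tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def dfl_loss(pred_dist, target, reg_max=16):
    """Distribution Focal Loss sketch: cross-entropy on the two integer bins surrounding the target.

    pred_dist: (N, reg_max) raw logits over the discrete bins 0..reg_max-1 for one box edge.
    target:    (N,) continuous regression target in [0, reg_max - 1].
    """
    tl = target.long()                     # left integer bin  y_i
    tr = tl + 1                            # right integer bin y_{i+1}
    wl = tr.float() - target               # weight toward the left bin
    wr = 1.0 - wl                          # weight toward the right bin
    return (F.cross_entropy(pred_dist, tl, reduction="none") * wl +
            F.cross_entropy(pred_dist, tr.clamp(max=reg_max - 1), reduction="none") * wr).mean()

pred = torch.randn(8, 16)                  # 8 box edges, reg_max = 16 bins
target = torch.rand(8) * 15                # continuous targets in [0, 15)
print(dfl_loss(pred, target))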

Insert image description here

Target recognition loss 2-CIOU Loss

The clearest article I have found on CIoU:

Loss functions IoU, GIoU, DIoU, CIoU, SIoU in object detection (-CHEN-'s CSDN blog)

CIoU additionally takes the aspect ratio into account.
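
A sketch of the CIoU computation for a pair of axis-aligned boxes (it follows the published formula; the Ultralytics implementation differs in details such as batching and how gradients flow through the alpha term):

import math
import torch

def ciou(box1, box2, eps=1e-7):
    """CIoU for boxes in (x1, y1, x2, y2) format: IoU - center-distance term - aspect-ratio term."""
    # Intersection and union
    inter_w = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    inter_h = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = inter_w * inter_h
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared distance between centers over squared diagonal of the smallest enclosing box
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((box1[0] + box1[2] - box2[0] - box2[2]) ** 2 +
            (box1[1] + box1[3] - box2[1] - box2[3]) ** 2) / 4

    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v    # the loss used in training is 1 - CIoU

pred = torch.tensor([50., 50., 150., 120.])
gt = torch.tensor([60., 55., 160., 130.])
print(ciou(pred, gt))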

Insert image description here

Triangles put into a pile, triangle matching problem

Insert image description here

sample matching

YOLOv8:

  • (1) abandons the Anchor-Based method and uses an Anchor-Free method instead;
  • (2) adopts TaskAligned, a matching method that replaces matching by side-length ratio.

What is Anchor-Based? Anchor-Based means using anchors to match positive and negative samples, which narrows the search space and makes gradient backpropagation and network training more accurate and simpler.

What are the disadvantages of the Anchor-Based approach? Because of the following drawbacks, the redundant anchor step was abandoned: anchors can also hurt the performance of the network, for example (1) matching during training carries a higher overhead, and (2) there are many hyperparameters that need to be tuned by hand.

What are the advantages of Anchor-Free? An Anchor-Free model discards or bypasses the concept of anchors and determines positive and negative samples in a more streamlined way, while reaching or even exceeding the accuracy of two-stage anchor-based models, and with faster speed.

To cooperate with NMS (non-maximum suppression), the assignment of training samples needs to satisfy the following two rules; this is the original motivation behind the TaskAligned design:

  • Well-aligned anchors should predict high classification scores with precise localization;

  • Misaligned anchors should have low classification scores and be suppressed in the NMS stage. Based on these two goals, TaskAligned designs a new anchor alignment metric to measure the degree of task alignment at the anchor level. This alignment metric is integrated into the sample assignment and the loss function so that the prediction of each anchor is dynamically optimized. (I did not fully understand this sentence at first.)

Anchor alignment metric:

The classification score and IoU represent the prediction effect of these two tasks, so TaskAligned uses a high-order combination of classification score and IoU to measure the degree of Task-Alignment. Use the following method to calculate the anchor-level alignment for each instance:

Insert image description here

Here s and u are the classification score and the IoU value respectively, and α and β are weighting hyperparameters, so t = s^α · u^β. From this formula we can see that t jointly controls the optimization of the classification score and the IoU toward task alignment, which guides the network to dynamically focus on high-quality anchors. The matching strategy of the TaskAlignedAssigner can be summarized simply as: select positive samples based on a score that weights the classification and regression scores together.

Training sample Assignment:

To improve the alignment of the two tasks, TOOD (Task-aligned One-stage Object Detection) focuses on task-aligned anchors and uses a simple assignment rule to select training samples: for each instance, the m anchors with the largest t values are selected as positive samples and the remaining anchors as negatives. Training then proceeds with a loss function designed around the alignment of classification and localization.
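
A hedged, single-ground-truth sketch of this selection rule (the real TaskAlignedAssigner is batched and handles multiple ground truths and candidate filtering; the α, β, and top-k values here are illustrative assumptions):

import torch

def task_aligned_assign(cls_scores, ious, alpha=0.5, beta=6.0, topk=10):
    """Sketch of TaskAligned positive-sample selection for a single ground-truth instance.

    cls_scores: (num_anchors,) predicted classification scores for the gt class, in [0, 1].
    ious:       (num_anchors,) IoU between each anchor's predicted box and the gt box.
    Returns a boolean mask of anchors selected as positives.
    """
    t = cls_scores.pow(alpha) * ious.pow(beta)          # alignment metric t = s^alpha * u^beta
    topk_idx = t.topk(min(topk, t.numel())).indices     # the m anchors with the largest t become positives
    pos_mask = torch.zeros_like(t, dtype=torch.bool)
    pos_mask[topk_idx] = True
    return pos_mask

cls_scores = torch.rand(100)
ious = torch.rand(100)
print(task_aligned_assign(cls_scores, ious).sum())      # tensor(10) positives selected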


Origin blog.csdn.net/weixin_45277161/article/details/134981882