The main innovations of YOLOv6

Key technologies of YOLOv6

YOLOv6 has made many improvements in BackBone, Neck, Head and training strategies:

1. Hardware-friendly backbone network design

The Backbone and Neck used by YOLOv5/YOLOX are both based on CSPNet, using a multi-branch design with residual structures. This kind of structure increases latency to a certain extent and reduces memory-bandwidth utilization. YOLOv6 introduces the RepVGG structure into the backbone and, with hardware-aware improvements, proposes the more efficient EfficientRep.

RepVGG

RepVGG Block

Question 1: Why use multiple branches during training?


Multiple branches increase the representational capacity of the model. The ablation results in the RepVGG paper show that adding either the identity branch or the 1×1 branch improves the accuracy of the model.

Question 2: Why use a single branch at inference time?
  • It is faster. The branches of a multi-branch block may have different computation costs, yet their results must be added together, which reduces parallelism and therefore speed; multi-branch structures also invoke more operators, which costs additional time.
  • It saves memory: the input of each branch must be kept alive until the addition, so a multi-branch block needs more memory than a plain single branch.
  • It is more flexible: a plain single-branch topology is easier to modify, prune and deploy.
Structural reparameterization

Structural reparameterization converts the multi-branch structure into a single-branch structure, i.e. it converts the whole RepVGG block into a single 3×3 convolution.
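As a concrete reference point, a training-time RepVGG block can be sketched in PyTorch roughly as follows (a simplified illustration, not the official implementation; channel count and activation are arbitrary):

```python
import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time RepVGG block: a 3x3 conv+BN branch, a 1x1 conv+BN
    branch and an identity BN branch are summed before the activation."""
    def __init__(self, channels):
        super().__init__()
        self.branch_3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_id = nn.BatchNorm2d(channels)  # identity branch
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.branch_3x3(x) + self.branch_1x1(x) + self.branch_id(x))

block = RepVGGBlock(8).eval()
out = block(torch.randn(1, 8, 16, 16))  # spatial size is preserved
```

At inference time, the three branches of this block are what gets folded into one 3×3 convolution by the steps described below.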

  • Converting the 1×1 convolution to a 3×3 convolution, and the identity BN branch to a 3×3 convolution

Converting a 1×1 convolution to a 3×3 convolution only requires zero-padding the 1×1 kernel by 1 on each side.
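This padding step can be checked numerically (a sketch with arbitrary channel counts):

```python
import torch
import torch.nn.functional as F

k1 = torch.randn(8, 8, 1, 1)          # a 1x1 convolution kernel
k3 = F.pad(k1, [1, 1, 1, 1])          # zero-pad to 3x3: the weight sits at the center
x = torch.randn(1, 8, 16, 16)
y1 = F.conv2d(x, k1, padding=0)
y3 = F.conv2d(x, k3, padding=1)       # padding=1 keeps the spatial size
assert torch.allclose(y1, y3, atol=1e-5)
```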

Converting BN to a convolution only requires constructing a convolution whose output equals its input, i.e. one that multiplies by an identity matrix.

Assuming both the input and the output have 2 channels, you need to construct two 3×3×2 convolution kernels, each of which passes one channel through unchanged.
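In code, these identity kernels are zero everywhere except for a 1 at the center of the matching channel (a small sketch):

```python
import torch
import torch.nn.functional as F

channels = 2
kernel = torch.zeros(channels, channels, 3, 3)
for c in range(channels):
    kernel[c, c, 1, 1] = 1.0          # center tap copies channel c through

x = torch.randn(1, channels, 5, 5)
y = F.conv2d(x, kernel, padding=1)
assert torch.allclose(x, y)           # the convolution acts as an identity
```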

  • Convolutional layer and BN layer fusion

For the convolutional layer: $y_i = w \cdot x_i + b$

For the BN layer: $y_i = \gamma \hat{x_i} + \beta$, where $\hat{x_i} = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$, $\mu_i$ is the channel mean and $\sigma_i^2$ is the channel variance.

Substituting the convolution output into the BN equation:

$BN_{\gamma,\beta}(x) = \gamma \frac{w \cdot x + b - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta = \frac{\gamma w}{\sqrt{\sigma_i^2 + \epsilon}} \cdot x + \frac{\gamma}{\sqrt{\sigma_i^2 + \epsilon}} \cdot (b - \mu_i) + \beta$

Let $\hat{w} = \frac{\gamma w}{\sqrt{\sigma_i^2 + \epsilon}}$ and $\hat{b} = \frac{\gamma}{\sqrt{\sigma_i^2 + \epsilon}} \cdot (b - \mu_i) + \beta$; then

$BN_{\gamma,\beta}(x) = \hat{w} \cdot x + \hat{b}$
After fusion, the convolutional layer and the BN layer are equivalent to a single new convolutional layer with weight $\hat{w}$ and bias $\hat{b}$.
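The derivation above translates directly into code. A minimal sketch (inference mode, assuming a conv followed by a BatchNorm2d):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # w_hat = gamma * w / sqrt(var + eps)
    # b_hat = gamma * (b - mu) / sqrt(var + eps) + beta
    std = torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.data = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = bn.weight * (b - bn.running_mean) / std + bn.bias
    return fused

conv = nn.Conv2d(4, 8, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(8).eval()
bn.running_mean.uniform_(-1, 1)       # give BN non-trivial statistics
bn.running_var.uniform_(0.5, 2.0)

x = torch.randn(1, 4, 10, 10)
with torch.no_grad():
    y_ref = bn(conv(x))
    y_fused = fuse_conv_bn(conv, bn)(x)
assert torch.allclose(y_ref, y_fused, atol=1e-5)
```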

  • Multi-branch fusion

After fusing each convolution with its BN layer, three 3×3 convolutional layers are obtained (the 1×1 and identity branches having been padded to 3×3); finally, the weights and biases of the three convolutional layers are simply added, fusing them into one convolutional layer.
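Putting the steps together, summing the padded kernels and biases reproduces the multi-branch output exactly (a sketch with arbitrary sizes):

```python
import torch
import torch.nn.functional as F

C = 4
k3x3, b3x3 = torch.randn(C, C, 3, 3), torch.randn(C)
k1x1 = F.pad(torch.randn(C, C, 1, 1), [1, 1, 1, 1])   # 1x1 kernel padded to 3x3
b1x1 = torch.randn(C)
k_id = torch.zeros(C, C, 3, 3)                         # identity as a 3x3 kernel
for c in range(C):
    k_id[c, c, 1, 1] = 1.0
b_id = torch.zeros(C)

x = torch.randn(1, C, 8, 8)
y_multi = (F.conv2d(x, k3x3, b3x3, padding=1)
           + F.conv2d(x, k1x1, b1x1, padding=1)
           + F.conv2d(x, k_id, b_id, padding=1))
# single fused convolution: just add the kernels and the biases
y_single = F.conv2d(x, k3x3 + k1x1 + k_id, b3x3 + b1x1 + b_id, padding=1)
assert torch.allclose(y_multi, y_single, atol=1e-4)
```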

EfficientRep Backbone

network structure

In the Backbone, the CBL modules of YOLOv5 are replaced with RepConv modules, and the original CSP-Block is redesigned as RepBlock, in which the first RepConv of each RepBlock performs channel-dimension transformation and alignment. In addition, the original SPPF is optimized into the more efficient SimSPPF.

YOLOv5 network structure diagram

The difference between SPP, SPPF and SimSPPF

SPP, or Spatial Pyramid Pooling, can convert feature maps of arbitrary size into fixed-size feature vectors.

  • SPP: applies max-pooling with several kernel sizes (5, 9, 13) in parallel and concatenates the results.
  • SPPF: replaces the parallel pools with a chain of three 5×5 max-pools, which produces the same output faster.
  • SimSPPF: the same structure as SPPF, but uses ReLU instead of SiLU in its Conv modules for faster inference.
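The equivalence between SPP's parallel pools and SPPF's chained 5×5 pools can be checked numerically (a sketch; the surrounding Conv modules are omitted):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 20, 20)
# SPP: parallel max-pools with kernel sizes 5, 9 and 13 (stride 1, same padding)
spp = torch.cat([x,
                 F.max_pool2d(x, 5, 1, 2),
                 F.max_pool2d(x, 9, 1, 4),
                 F.max_pool2d(x, 13, 1, 6)], dim=1)
# SPPF: three chained 5x5 pools cover the same receptive fields (5x5, 9x9, 13x13)
p1 = F.max_pool2d(x, 5, 1, 2)
p2 = F.max_pool2d(p1, 5, 1, 2)
p3 = F.max_pool2d(p2, 5, 1, 2)
sppf = torch.cat([x, p1, p2, p3], dim=1)
assert torch.allclose(spp, sppf)
```

The chained form reuses intermediate results, which is why SPPF (and SimSPPF) is faster than SPP for the same output.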

Rep-PAN

Rep-PAN is based on the PAN topology: it replaces the CSP-Blocks used in YOLOv5 with RepBlocks and adjusts the operators in the overall Neck, aiming for efficient inference on hardware while maintaining good multi-scale feature-fusion ability.

2. More concise and efficient Decoupled Head

In YOLOv6, the Decoupled Head structure is adopted and its design is simplified. The detection head of the original YOLOv5 fuses and shares the classification and regression branches, while the detection head of YOLOX decouples the classification and regression branches and adds two additional 3×3 convolutional layers to each; this improves detection accuracy but increases network latency to a certain extent.
YOLOv6 uses a Hybrid Channels strategy to redesign a more efficient decoupled head structure, which reduces latency while maintaining accuracy and alleviates the extra latency overhead caused by the 3×3 convolutions in the decoupled head. In an ablation experiment on the nano-size model, compared with a decoupled head with the same number of channels, accuracy increased by 0.2% AP and speed by 6.8%.
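To make the coupled-vs-decoupled distinction concrete, a minimal decoupled head might look as follows (a hypothetical sketch, not YOLOv6's exact head; the single 3×3 conv per branch mirrors the efficient-decoupled-head idea, and channel counts are arbitrary):

```python
import torch
import torch.nn as nn

class SimpleDecoupledHead(nn.Module):
    """Classification and regression run through separate branches after a
    shared 1x1 stem; one 3x3 conv per branch keeps the latency overhead low."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.SiLU())
        self.cls_conv = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU())
        self.reg_conv = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(in_ch, num_classes, 1)   # class scores
        self.reg_pred = nn.Conv2d(in_ch, 4, 1)             # box offsets
        self.obj_pred = nn.Conv2d(in_ch, 1, 1)             # objectness

    def forward(self, x):
        x = self.stem(x)
        c, r = self.cls_conv(x), self.reg_conv(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)

head = SimpleDecoupledHead(64, num_classes=80)
cls, reg, obj = head(torch.randn(1, 64, 20, 20))
```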

3. More effective training strategy

  • Anchor-free paradigm

YOLOv6 adopts a more concise anchor-free detection method. An anchor-based detector needs cluster analysis before training to determine the optimal anchor set, which increases the detector's complexity to a certain extent; moreover, in some edge deployments, carrying a large number of detection boxes between hardware steps also introduces extra latency. The anchor-free paradigm has been widely used in recent years because of its strong generalization ability and simpler decoding logic. Experimental investigation showed that, relative to the extra latency introduced by the complexity of anchor-based detectors, the anchor-free detector is 51% faster.

  • SimOTA Label Assignment Strategy

In order to obtain more high-quality positive samples, YOLOv6 introduces the SimOTA [4] algorithm to dynamically assign positive samples and further improve detection accuracy. YOLOv5's label-assignment strategy is based on shape matching and increases the number of positive samples through a cross-grid matching strategy so that the network converges quickly. However, this is a static assignment that is not adjusted during training.
In recent years, many dynamic label-assignment methods have emerged. Such methods assign positive samples according to the network's outputs during training, generating more high-quality positives and in turn promoting the optimization of the network. For example, OTA [7] models sample matching as an optimal transport problem and obtains the globally optimal matching strategy under global information to improve accuracy, but its Sinkhorn-Knopp solver lengthens training, while the SimOTA [4] algorithm uses a Top-K approximation to obtain the best matches, which greatly speeds up training. YOLOv6 therefore adopts the SimOTA dynamic assignment strategy; combined with the anchor-free paradigm, the average detection accuracy on the nano-size model increases by 1.3% AP.
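The core of SimOTA's dynamic-k assignment can be sketched with a toy cost matrix (a simplified illustration; the real algorithm also restricts candidates to a center region and computes classification and IoU costs from the network's actual outputs):

```python
import torch

torch.manual_seed(0)
num_gt, num_anchor = 3, 12
ious = torch.rand(num_gt, num_anchor)       # pairwise IoU(gt, candidate)
cls_cost = torch.rand(num_gt, num_anchor)   # e.g. per-pair classification loss
cost = cls_cost + 3.0 * (1.0 - ious)        # lower cost = better match

# dynamic k per GT: sum of its top-10 IoUs, clamped to at least 1
topk_ious, _ = ious.topk(min(10, num_anchor), dim=1)
dynamic_k = topk_ious.sum(dim=1).int().clamp(min=1)

matching = torch.zeros_like(cost)
for g in range(num_gt):                     # pick the k cheapest anchors per GT
    _, idx = cost[g].topk(int(dynamic_k[g]), largest=False)
    matching[g, idx] = 1.0

# resolve conflicts: an anchor claimed by several GTs goes to the cheapest one
for a in torch.nonzero(matching.sum(dim=0) > 1).flatten():
    best_gt = cost[:, a].argmin()
    matching[:, a] = 0.0
    matching[best_gt, a] = 1.0
```

Because k is recomputed from the current IoUs every iteration, the assignment adapts as the network's predictions improve, which is exactly what the static shape-matching strategy cannot do.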

  • SIoU Bounding Box Regression Loss

To further improve regression accuracy, YOLOv6 uses the SIoU [9] bounding-box regression loss to supervise the learning of the network. Training an object-detection network generally requires at least two loss functions, a classification loss and a bounding-box regression loss, and the choice of loss function often has a large impact on detection accuracy and training speed.
Commonly used bounding-box regression losses in recent years include IoU, GIoU, CIoU and DIoU loss. These losses measure the relationship between the predicted box and the target box through factors such as their overlap, center-point distance and aspect ratio, guiding the network to minimize the loss and improve regression accuracy, but they do not consider the match in direction between the predicted box and the target box. The SIoU loss redefines the distance loss by introducing the vector angle between the required regressions, which effectively reduces the degrees of freedom of the regression, speeds up network convergence, and further improves regression accuracy. In experiments on YOLOv6s, SIoU loss increased the average detection accuracy by 0.3% AP compared with CIoU loss.
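A simplified implementation can make the angle, distance and shape terms concrete (a sketch following the SIoU paper's definitions, with the shape exponent fixed at 4; this is not the exact YOLOv6 code):

```python
import math
import torch

def siou_loss(pred, target, eps=1e-7):
    """Simplified SIoU loss for boxes in (x1, y1, x2, y2) format."""
    # IoU
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = inter_w * inter_h
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # box centers and enclosing-box size
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0]) + eps
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1]) + eps

    # angle cost: penalizes deviation from horizontal/vertical alignment
    sigma = torch.sqrt((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) + eps
    sin_alpha = (cy2 - cy1).abs() / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha.clamp(max=1.0)) - math.pi / 4) ** 2

    # distance cost, modulated by the angle cost
    gamma = 2 - angle
    rho_x, rho_y = ((cx2 - cx1) / cw) ** 2, ((cy2 - cy1) / ch) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape cost
    omega_w = (w1 - w2).abs() / torch.max(w1, w2).clamp(min=eps)
    omega_h = (h1 - h2).abs() / torch.max(h1, h2).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + (dist + shape) / 2

box = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
shifted = torch.tensor([[15.0, 12.0, 55.0, 52.0]])
far = torch.tensor([[200.0, 200.0, 240.0, 240.0]])
# loss is near zero for a perfect match and grows with misalignment
```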


Origin blog.csdn.net/qq_40042726/article/details/126189924