Analysis of the YOLOv5 algorithm for target detection

YOLOv5 offers five versions of its network model and the corresponding weight files, namely n, s, m, l, and x.
(Figure: performance comparison of the YOLOv5 models, taken from the official YOLOv5 open-source project on GitHub.)
The n, s, m, l, and x variants share exactly the same network structure; they differ only in the depth and width multipliers that control the number of layers and channels, and hence the parameter count. The separate n6, s6, m6, l6, and x6 models target detection on larger-resolution images.
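As a rough illustration of how one structure yields five model sizes, the sketch below (my own simplification, not the official model-parsing code) scales a block's repeat count and channel width by the depth/width multipliers published in the official yolov5*.yaml configs:

```python
import math

# Depth/width multipliers from the official yolov5{n,s,m,l,x}.yaml configs.
MULTIPLIERS = {
    "n": (0.33, 0.25),
    "s": (0.33, 0.50),
    "m": (0.67, 0.75),
    "l": (1.00, 1.00),
    "x": (1.33, 1.25),
}

def scale(base_repeats: int, base_channels: int, variant: str):
    """Scale a block's repeat count (depth) and channel count (width)."""
    depth_multiple, width_multiple = MULTIPLIERS[variant]
    repeats = max(round(base_repeats * depth_multiple), 1)
    channels = math.ceil(base_channels * width_multiple / 8) * 8  # keep divisible by 8
    return repeats, channels

print(scale(9, 1024, "s"))  # a block with 9 repeats and 1024 channels -> (3, 512)
```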

Network structure

This network structure diagram comes from the CSDN blogger Jiang Dabai and is quoted here for study purposes only.
The diagram shows the network model architecture of YOLOv5s.

Backbone

CSP-Darknet53

Neck

SPPF + CSP-PAN

Head

YOLOv3 Head
The head outputs feature predictions at three scales, e.g. $(76, 76, 255)$, $(38, 38, 255)$, and $(19, 19, 255)$; the exact sizes differ between versions. (For COCO, the channel count is 255 = 3 anchors × (80 classes + 4 box offsets + 1 objectness score).)


Key analysis

Bounding box optimization
Compared with YOLOv4's $b_w = p_w \cdot e^{t_w}$ and $b_h = p_h \cdot e^{t_h}$, YOLOv5 decodes width and height as $b_w = p_w \cdot (2\sigma(t_w))^2$ and $b_h = p_h \cdot (2\sigma(t_h))^2$. The $\sigma$ function limits the value range of $t_w$ and $t_h$, avoiding the NaN cases that the unbounded exponential can produce.
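A minimal sketch of this decoding, following the logic of the Detect head in the official repo (function and variable names here are my own):

```python
import torch

def decode_boxes(raw, anchor_wh, grid_xy, stride):
    """Decode raw (tx, ty, tw, th) outputs into boxes on the input image.

    raw: (..., 4) tensor of raw network outputs; anchor_wh and grid_xy
    are broadcast against it.
    """
    xy = (torch.sigmoid(raw[..., 0:2]) * 2 - 0.5 + grid_xy) * stride  # box center
    wh = (torch.sigmoid(raw[..., 2:4]) * 2) ** 2 * anchor_wh          # b = p * (2*sigma(t))^2
    return torch.cat([xy, wh], dim=-1)
```

Since $(2\sigma(t))^2$ lies in $(0, 4)$, a predicted box can be at most 4× its anchor, which matches the ratio threshold used in the anchor matching described below.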

Data augmentation
Data augmentation methods include Mosaic, HSV color-space transformation, rotation, scaling, flipping, translation, shearing, and more; the HSV step is sketched below.
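As a concrete example, here is a hedged sketch of the HSV color-space augmentation, written in the spirit of augment_hsv in the official repo (the gain defaults and the helper name are illustrative):

```python
import cv2
import numpy as np

def hsv_jitter(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly jitter the hue, saturation, and value of an 8-bit BGR image."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1  # random channel gains
    h, s, v = cv2.split(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV))
    x = np.arange(256)
    lut_h = ((x * r[0]) % 180).astype(img_bgr.dtype)   # OpenCV hue range is [0, 180)
    lut_s = np.clip(x * r[1], 0, 255).astype(img_bgr.dtype)
    lut_v = np.clip(x * r[2], 0, 255).astype(img_bgr.dtype)
    hsv = cv2.merge((cv2.LUT(h, lut_h), cv2.LUT(s, lut_s), cv2.LUT(v, lut_v)))
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```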

Matching ground-truth boxes with anchors

$$
\begin{align}
r_w &= \frac{w_{gt}}{w_{at}} \\
r_h &= \frac{h_{gt}}{h_{at}} \\
r_w^{max} &= \max\left(r_w, \frac{1}{r_w}\right) \\
r_h^{max} &= \max\left(r_h, \frac{1}{r_h}\right) \\
r^{max} &= \max\left(r_w^{max}, r_h^{max}\right)
\end{align}
$$
where $w_{gt}$ is the width of the ground-truth box, $w_{at}$ the width of the anchor, $h_{gt}$ the height of the ground-truth box, and $h_{at}$ the height of the anchor.

Equations (3) and (4) measure the difference between the ground-truth box and the anchor: when the two boxes differ least, i.e. are most similar, $r_w^{max}$ and $r_h^{max}$ equal 1. Equation (5) then takes the larger of the two as the overall width/height difference $r^{max}$. This value is compared against a given threshold: if the threshold condition is met, the match succeeds; otherwise it fails. The principle is similar to the earlier IoU-based matching.
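A NumPy sketch of this rule (the actual implementation lives in the repo's build_targets; 4.0 is the default value of the anchor_t hyperparameter):

```python
import numpy as np

def match_anchor(wh_gt, wh_anchor, threshold=4.0):
    """True if the ground-truth box and anchor are similar enough to match."""
    r = wh_gt / wh_anchor                 # eqs (1)-(2): width and height ratios
    r_max = np.maximum(r, 1.0 / r).max()  # eqs (3)-(5): worst-case ratio
    return r_max < threshold

# A 90x60 box against a 30x40 anchor: the worst ratio is 3 < 4, so it matches.
print(match_anchor(np.array([90.0, 60.0]), np.array([30.0, 40.0])))  # True
```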

Loss function (v6.0 and later versions)
Loss = bounding-box localization loss + target classification loss + confidence loss (the confidence target is the CIoU between the predicted and ground-truth boxes):
$$Loss = \lambda_1 L_{loc} + \lambda_2 L_{cls} + \lambda_3 L_{ciou}$$
In the expanded form (shown as an image in the original post), $K$ is the number of prediction feature maps, $S^2$ is the number of grid cells, and $B$ is the number of anchors per cell.

To balance the losses across scales (on the COCO dataset), the CIoU confidence loss on the three prediction feature layers $\{P_3$ (small targets, e.g. $76 \times 76$), $P_4$ (medium targets, e.g. $38 \times 38$), $P_5$ (large targets, e.g. $19 \times 19$)$\}$ is given different weights:
$$Loss_{ciou} = 4.0 \cdot L_{ciou}^{small} + 1.0 \cdot L_{ciou}^{medium} + 0.4 \cdot L_{ciou}^{large}$$
Weighting the small-target layer most heavily increases the loss on small-target predictions and thus improves small-target accuracy; a sketch of this weighting follows below.
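The balance values [4.0, 1.0, 0.4] appear in the repo's loss computation; the function shape here is my own illustration:

```python
# Per-layer balance weights for the confidence loss on P3/P4/P5.
BALANCE = {"small": 4.0, "medium": 1.0, "large": 0.4}

def balanced_conf_loss(per_layer_loss):
    """Weight each prediction layer's confidence loss before summing."""
    return sum(BALANCE[k] * v for k, v in per_layer_loss.items())

print(balanced_conf_loss({"small": 0.5, "medium": 0.5, "large": 0.5}))  # 2.7
```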

Origin: blog.csdn.net/qq_44116998/article/details/128451800