Analysis of YOLOv4 algorithm for target detection

Fundamentals

Network structure

CSPDarknet53
(figure: CSPDarknet53 network structure)
The last three arrows point to the outputs: three feature maps at different scales.

SPP
SPP addresses the multi-scale problem.
The same input feature map is passed through three maxpool2d operations (with different kernel sizes), and the three pooled outputs are then concatenated with the input along the channel dimension.
(figure: SPP structure)
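The SPP block described above can be sketched as follows. This is a minimal sketch, not the reference implementation; the kernel sizes 5/9/13 are the values commonly used in YOLOv4, and the channel count is illustrative.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: parallel max-pools concatenated with the input."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 + symmetric padding keeps the spatial size H x W unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # concatenate the input with the three pooled maps along the channel axis
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)   # illustrative feature map
y = SPP()(x)
print(y.shape)  # torch.Size([1, 2048, 13, 13]) -- channels grow 4x
```

Because every branch preserves the spatial resolution, only the channel dimension grows (4× with three pooling branches plus the identity).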

PANet
PANet fuses features through upsampling, downsampling, and concatenation along the channel (depth) dimension.
(figure: PANet structure, panels a–e)
PANet consists of five core modules (a, b, c, d, e).

The red and green dotted lines in the figure are shortcuts spanning multiple layers, which enable feature fusion across different levels.

Figure a shows the top-down structure of FPN. The four feature maps in the first column are produced by successive convolutions: shallow layers capture details such as edges, while deep layers carry richer semantic features. The second column contains four feature maps, $P_5, P_4, P_3, P_2$, where the upsampling step uses bilinear interpolation. Why not use the first-column feature maps directly? Because any single layer on its own cannot reflect the overall features and weakens the expressive power, whereas the second-column features fuse shallow and deep information into a richer representation.

Figure b shows the bottom-up path, which produces four feature maps $N_2, N_3, N_4, N_5$. $N_2$ is simply a copy of $P_2$; $N_3$ is obtained by passing $N_2$ through a $3\times 3$ convolution with stride 2 and adding the result to $P_3$; the remaining maps are built the same way.

Figure c shows adaptive feature pooling, which fuses the feature maps of all levels and finally yields a $1\times 1\times n$ vector used for classification and localization.
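The top-down (figure a) and bottom-up (figure b) paths can be sketched as below. This is a toy illustration under simplifying assumptions: channels are already unified to 256 (the real network uses 1×1 convs for that), the spatial sizes are arbitrary, and a single stride-2 conv is reused for all levels.

```python
import torch
import torch.nn.functional as F

# toy backbone outputs C2..C5 at decreasing resolution, channels unified to 256
c2, c3, c4, c5 = (torch.randn(1, 256, s, s) for s in (152, 76, 38, 19))

# top-down path (FPN): upsample the deeper map and add it to the shallower one
p5 = c5
p4 = c4 + F.interpolate(p5, scale_factor=2, mode="bilinear", align_corners=False)
p3 = c3 + F.interpolate(p4, scale_factor=2, mode="bilinear", align_corners=False)
p2 = c2 + F.interpolate(p3, scale_factor=2, mode="bilinear", align_corners=False)

# bottom-up path (PAN): stride-2 3x3 conv to downsample, then add
# (one conv is reused here for brevity; the real network has one per level)
down = torch.nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
n2 = p2                 # N2 is just a copy of P2
n3 = p3 + down(n2)
n4 = p4 + down(n3)
n5 = p5 + down(n4)
print(n2.shape, n5.shape)
```

Note how each $N$ level sees both the semantic information propagated down through $P$ and the localization detail propagated back up from $N_2$.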

Loss function

$$Loss=\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}(2-w_i\,h_i)(1-CIOU)\\
-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\left[\hat{C_i}\log(C_i)+(1-\hat{C_i})\log(1-C_i)\right]\\
-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{noobj}\left[\hat{C_i}\log(C_i)+(1-\hat{C_i})\log(1-C_i)\right]\\
-\sum_{i=0}^{S^2}\sum_{j=0}^{M}I_{ij}^{obj}\sum_{c\in classes}\left[\hat{p_i}(c)\log(p_i(c))+(1-\hat{p_i}(c))\log(1-p_i(c))\right]$$
The first line is the coordinate loss of the positive samples; $2-w_i h_i$ is a penalty coefficient, and the CIOU loss builds on DIOU:
$$CIOU=IOU-\left(\frac{\rho^2(b,b^{gt})}{c^2}+\alpha\nu\right),\qquad
\nu=\frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^2,\qquad
\alpha=\frac{\nu}{(1-IOU)+\nu}$$
where $b$ is the center of the predicted box and $b^{gt}$ the center of the ground-truth box; $\rho$ is the Euclidean distance between $b$ and $b^{gt}$; $c$ is the diagonal length of the smallest rectangle enclosing both the predicted and the ground-truth box; and $w, h$ and $w^{gt}, h^{gt}$ are the width and height of the predicted and ground-truth boxes respectively.
The second line is the confidence loss of positive samples, the third line is the confidence loss of negative samples, and the fourth line is the classification loss.
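The CIOU term of the first line can be computed as below. This is a minimal sketch following the formula above, with boxes given as hypothetical `(cx, cy, w, h)` tuples.

```python
import math

def ciou(box, box_gt):
    """CIOU for two boxes in (cx, cy, w, h) form, per the formula above."""
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    x1, y1, x2, y2 = corners(box)
    g1, g2, g3, g4 = corners(box_gt)

    # intersection over union
    iw = max(0.0, min(x2, g3) - max(x1, g1))
    ih = max(0.0, min(y2, g4) - max(y1, g2))
    inter = iw * ih
    union = box[2] * box[3] + box_gt[2] * box_gt[3] - inter
    iou = inter / union

    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = (box[0] - box_gt[0]) ** 2 + (box[1] - box_gt[1]) ** 2
    cw = max(x2, g3) - min(x1, g1)
    ch = max(y2, g4) - min(y1, g2)
    c2 = cw ** 2 + ch ** 2

    # aspect-ratio consistency term nu and its weight alpha
    v = (4 / math.pi ** 2) * (
        math.atan(box_gt[2] / box_gt[3]) - math.atan(box[2] / box[3])) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)  # epsilon guards division by zero

    return iou - (rho2 / c2 + alpha * v)

# identical boxes: IOU = 1, distance and aspect terms vanish -> CIOU = 1
print(ciou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
```

The loss then uses $1-CIOU$, so a perfect overlap contributes zero loss, and boxes that are far apart or badly shaped are penalized even when their IOU is the same.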


Optimization

Bounding box regression
(figure: bounding box regression)

Both YOLOv2 and YOLOv3 solve bounding box regression by predicting four parameters $t_x, t_y, t_w, t_h$ that determine the position and size of the prediction box. YOLOv4 keeps this scheme with one small optimization.
The original method uses the sigmoid function $\sigma(x)=\frac{1}{1+e^{-x}}$ to constrain the center offsets $\sigma(t_x), \sigma(t_y)$ to the interval $(0, 1)$.
(figure: sigmoid curve)

Since the sigmoid's range is $(0, 1)$, the offset relative to the grid-cell reference point $(c_x, c_y)$ also lies in $(0, 1)$, so the center of the prediction box can only fall inside the grid cell.
A special case arises when the center of the ground-truth box falls on the boundary of a grid cell, for example exactly at $(c_x, c_y)$: the best prediction would then place the center of the prediction box at $(c_x, c_y)$ as well, which requires $\sigma(t_x)=0, \sigma(t_y)=0$. But the sigmoid only reaches 0 as $x$ approaches negative infinity, which is very hard for the network to achieve.
For this reason, YOLOv4 introduces a scaling factor $scale_{xy}$, which effectively enlarges the offset range:
$$b_x=\left(\sigma(t_x)\cdot scale_{xy}-\frac{scale_{xy}-1}{2}\right)+c_x,\qquad
b_y=\left(\sigma(t_y)\cdot scale_{xy}-\frac{scale_{xy}-1}{2}\right)+c_y$$
In practice $scale_{xy}$ is usually set to 2, so the formula becomes
$$b_x=(2\sigma(t_x)-0.5)+c_x,\qquad b_y=(2\sigma(t_y)-0.5)+c_y$$
That is, the original $\sigma(x)$ is first multiplied by 2: $2\sigma(x)=\frac{2}{1+e^{-x}}$
(figure: curve of $2\sigma(x)$)

and then shifted by 0.5: $2\sigma(x)-0.5=\frac{2}{1+e^{-x}}-0.5$
(figure: curve of $2\sigma(x)-0.5$)
The value range thus becomes $(-0.5, 1.5)$.
The center of the final prediction box can then fall some distance outside the grid cell: the x coordinate lies in $(c_x-0.5, c_x+1.5)$ and the y coordinate in $(c_y-0.5, c_y+1.5)$, so it no longer matters if the center of the ground-truth box sits on a boundary line.
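The decoding step above can be verified numerically. This is a small sketch of the formula with $scale_{xy}=2$; the grid-cell coordinates `cx, cy` are illustrative.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def decode_center(tx, ty, cx, cy, scale_xy=2.0):
    """Decode the box center with the scale_xy trick described above."""
    bx = sigmoid(tx) * scale_xy - (scale_xy - 1) / 2 + cx
    by = sigmoid(ty) * scale_xy - (scale_xy - 1) / 2 + cy
    return bx, by

# with scale_xy = 2 the offset covers (-0.5, 1.5), so a strongly negative tx
# can push the center onto (and slightly past) the left boundary of cell (3, 3)
bx, by = decode_center(-10.0, 0.0, 3, 3)
print(bx, by)  # bx just above 2.5, by exactly 3.5
```

With the original formula ($scale_{xy}=1$) the same inputs would give $b_x$ just above 3.0, never able to reach the cell boundary at 3.0 exactly.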

Mosaic data augmentation
(figure: mosaic augmentation example)

Mosaic is a data augmentation method that mixes 4 training images into one, making the model more robust.
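A minimal sketch of the idea is shown below. It only pastes four images around a random split point; a real implementation would also resize each image and remap its bounding box labels into the new canvas, which is omitted here.

```python
import numpy as np

def mosaic(imgs, size=416, seed=0):
    """Naive mosaic: paste 4 images into the 4 quadrants of one canvas."""
    rng = np.random.default_rng(seed)
    xc = int(rng.uniform(0.25, 0.75) * size)  # random split point, x
    yc = int(rng.uniform(0.25, 0.75) * size)  # random split point, y
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    regions = [(slice(0, yc), slice(0, xc)),        # top-left
               (slice(0, yc), slice(xc, size)),     # top-right
               (slice(yc, size), slice(0, xc)),     # bottom-left
               (slice(yc, size), slice(xc, size))]  # bottom-right
    for img, (ys, xs) in zip(imgs, regions):
        h, w = ys.stop - ys.start, xs.stop - xs.start
        canvas[ys, xs] = img[:h, :w]  # naive crop instead of resize
    return canvas

# four flat-colored dummy images stand in for real training images
imgs = [np.full((416, 416, 3), v, dtype=np.uint8) for v in (50, 100, 150, 200)]
out = mosaic(imgs)
print(out.shape)  # (416, 416, 3)
```

One mosaic sample exposes the model to objects from four different contexts and scales at once, which also reduces the need for a large batch size.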

IOU threshold handling
For the anchors of each grid cell, if several anchors have an IOU with the ground-truth box greater than the threshold, all of them are used for prediction, which increases the number of positive samples.

The specific implementation method is:
(figure: anchor vs. ground-truth IOU computation)

Align each anchor with the upper left corner of the real box and calculate the IOU value.
(figure: positive sample assignment across neighboring grid cells)
As shown in the figure above, if the center of a ground-truth box lies in the dark-green grid cell, then the anchor generated by that cell with the largest IOU against the ground-truth box (and meeting the threshold) is certainly a positive sample, but it is not the only one: anchors generated by the grid cells to the left of and above the dark-green cell that meet the threshold are also taken as positive samples for that ground-truth box. The reason can be understood together with the bounding-box optimization above: since the offset lies in $(-0.5, 1.5)$, the center of a prediction box may fall in the cell to the left or above. Because the goal of regression is to bring the prediction box closer to the ground-truth box, this allows more prediction boxes to be optimized toward the ground truth, improving the prediction quality.
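The corner-aligned IOU check described above can be sketched as follows. Only widths and heights matter once the boxes share a corner; the anchor sizes and the 0.5 threshold here are illustrative examples, not the values fixed by YOLOv4.

```python
def shape_iou(anchor, gt):
    """IOU of two boxes aligned at the same corner (widths/heights only)."""
    aw, ah = anchor
    gw, gh = gt
    inter = min(aw, gw) * min(ah, gh)  # overlap of corner-aligned boxes
    return inter / (aw * ah + gw * gh - inter)

anchors = [(10, 13), (16, 30), (33, 23)]  # example anchor (w, h) sizes
gt = (18, 28)                             # example ground-truth (w, h)

# every anchor above the threshold becomes a positive sample, not just the best
matched = [a for a in anchors if shape_iou(a, gt) > 0.5]
print(matched)  # [(16, 30)]
```

With a lower threshold, more than one anchor shape would match, and each matched anchor (in the center cell and its left/upper neighbors) contributes a positive sample.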


Origin blog.csdn.net/qq_44116998/article/details/128434962