YOLOv2 Meditations

Thinking through three questions.

First, what is the output value of the last layer of the model?
Second, how do we convert the model's output into the prediction box (a box like the hand-annotated ones)?
Third, how is the target computed during data processing? That is, how do we compute the ground-truth values that go into the loss?

These three questions bothered me for a long time. I read the material over and over; each time I figured it out, I would forget it again after a while, so I decided to write it down.

first question

First, what is the output value of the last layer of the model?

First look at the bounding-box prediction diagram given in the paper:
(Figure: the bounding-box prediction diagram from the YOLOv2 paper.)

The prediction box: what exactly does the model predict? It is the output value of the last layer of our network.

The output value of the last layer of the network,
the result of the first step: $t_x, t_y, t_w, t_h$

From YOLOv1 we know that
$t_x, t_y, t_w, t_h$ could be used directly as the final prediction box, but in YOLOv2 the author introduced the anchor mechanism: normalization and an anchor conversion are applied to obtain the final prediction box.

Normalization processing:
the result of the second step: $\sigma(t_x), \sigma(t_y), e^{t_w}, e^{t_h}$
$\sigma$ is the sigmoid function and $e$ is the exponential function; the paper does not state this directly.

Why do the normalization at all? Could we skip it? In principle yes, but experiments show a big difference in convergence and accuracy: the sigmoid bounds $\sigma(t_x), \sigma(t_y)$ to (0, 1), which keeps the predicted center inside its grid cell; the YOLOv2 paper reports that unconstrained offsets made early training unstable.

Conversion processing:
the result of the third step:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$

$b_x, b_y, b_w, b_h$ form the box we finally get, just like a hand-annotated box,

where

$p_w, p_h$ are the width and height of the anchor, and $c_x, c_y$ are the coordinates of the upper-left corner of the grid cell. (These four values are analyzed further below; they are known and fixed.)
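To make the three steps concrete, here is a minimal NumPy sketch of the decoding. The array layout, the tensor names (`raw`, `anchors_wh`), and the 13x13 grid are assumptions for illustration, not the layout of any particular YOLOv2 implementation:

```python
import numpy as np

def decode_boxes(raw, anchors_wh, grid=13):
    """raw: (grid, grid, num_anchors, 4) holding t_x, t_y, t_w, t_h.
    anchors_wh: (num_anchors, 2) anchor widths/heights in grid-cell units."""
    # c_x, c_y: upper-left corner of each grid cell, in cell units
    cy, cx = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    cx = cx[..., None]  # (grid, grid, 1), broadcasts over anchors
    cy = cy[..., None]

    tx, ty, tw, th = raw[..., 0], raw[..., 1], raw[..., 2], raw[..., 3]
    # Step 2: normalization -- sigmoid bounds the center offset to (0, 1)
    sx = 1.0 / (1.0 + np.exp(-tx))
    sy = 1.0 / (1.0 + np.exp(-ty))
    # Step 3: add the cell corner, scale the anchor by e^t
    bx = sx + cx                        # center x, in cell units
    by = sy + cy                        # center y, in cell units
    bw = anchors_wh[:, 0] * np.exp(tw)  # p_w * e^{t_w}
    bh = anchors_wh[:, 1] * np.exp(th)  # p_h * e^{t_h}
    return np.stack([bx, by, bw, bh], axis=-1)
```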

The first question, summary:
the raw output of the model goes through the first, second, and third steps above, and we finally get the prediction box we want (like a hand-annotated box). Note that these same steps also answer the second question: they are exactly the conversion from model output to prediction box.

third question

Data processing: how is the GT value calculated?

First, what exactly is the predicted value fed into the loss function?
We saw above that the prediction passes through three stages, any of which could in principle serve as the output, but which one is sent to the loss?

Answer:
$\sigma(t_x), \sigma(t_y), e^{t_w}, e^{t_h}$
These values are fed into the loss as the prediction.

Second, how do we get the corresponding GT value?

A GT box is split into 2 parts: x, y and w, h.
Why split it this way? Because the values of x, y and of w, h are obtained in different ways; this part is convoluted and easy to forget.

First look at x, y
On a 416x416 image, 13x13x5 anchors are generated; that is, in each grid cell, 5 anchors (anchor boxes) are drawn first.
(Figure: the 7x7 grid from the YOLOv1 paper.)

Suppose the target I want to feed into the loss looks like this:
$target_x, target_y, target_w, target_h$
Start by calculating
$target_x, target_y$

Calculation steps:
Step 1: Compute the center point of the GT box and determine which grid cell it falls in; call this the winning cell.
Step 2: The width of each grid cell (width and height are equal):
$grid_w = \frac{416}{13} = 32$

Step 3:
$target_x = \sigma(t_x) = \frac{\text{GT center } x - \text{winning cell upper-left } x}{grid_w}$
This value is the GT value we send to the loss function.
In the same way, we get:
$target_y = \sigma(t_y) = \frac{\text{GT center } y - \text{winning cell upper-left } y}{grid_h}$
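As a sanity check on the two formulas, here is a small sketch of the $target_x, target_y$ computation. A 416x416 image and 13x13 grid are assumed; `gt_cx`, `gt_cy` are hypothetical names for the GT box center in pixels:

```python
def xy_targets(gt_cx, gt_cy, img_size=416, grid=13):
    cell = img_size / grid                  # 416 / 13 = 32 pixels per cell
    col = int(gt_cx // cell)                # which grid cell "wins" the GT box
    row = int(gt_cy // cell)
    target_x = (gt_cx - col * cell) / cell  # fractional offset inside the cell,
    target_y = (gt_cy - row * cell) / cell  # always in [0, 1): matches sigma(t_x)
    return row, col, target_x, target_y

# Example: a GT center at (100, 200) falls in cell (row 6, col 3),
# with offsets (100 - 96) / 32 = 0.125 and (200 - 192) / 32 = 0.25.
print(xy_targets(100, 200))  # (6, 3, 0.125, 0.25)
```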

At this point, you can see that the entire calculation of $target_x, target_y$ has nothing to do with the 5 anchors; the anchor values are never used.

However, note one piece of information: the coordinates of the upper-left corner of the winning cell are actually the center coordinates of the anchor. This understanding matters a lot when reading code, because some implementations rely on this trick, so stay flexible when reading them.

This is also where grid and anchor are most closely related.

Now look at w, h.

We have already assumed that the target to feed into the loss looks like this:
$target_x, target_y, target_w, target_h$
Start by calculating
$target_w, target_h$

How do we get these two values?

  1. Are they directly equal to the w, h of the GT box? No!
  2. Are they equal to the w, h of an anchor? No!
  3. So what is the conversion relationship between grid, anchor, and GT box?

Why introduce the anchor, and how is it used?
Before talking about the anchor, let's be clear about one thing: in YOLO, the box and classification losses only consider samples that have a target, i.e., only positive samples are computed and negative samples are ignored. When computing the confidence, both positive and negative samples are considered.

Looking back at YOLOv1:
in YOLOv1, a GT box is used directly as the loss target. The network outputs 2 prediction boxes; which box is paired with the GT box to compute the loss? This is a YOLOv1 detail that is easy to forget:
compute the IoU of the two prediction boxes against the GT box; the one with the larger IoU wins and is used to compute the loss with the GT box, while the other only contributes to the no-object term.
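A minimal sketch of this "responsible box" selection, with a standard corner-format IoU written out for completeness (function names are illustrative):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def responsible_box(pred_boxes, gt_box):
    """Index of the predicted box with the larger IoU against the GT box;
    that box enters the coordinate loss, the other gets the no-object term."""
    return max(range(len(pred_boxes)), key=lambda i: iou(pred_boxes[i], gt_box))
```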

Faster RCNN
By YOLOv2, the GT box is no longer sent directly into the loss; it is first converted using the anchor. How is it converted?
The anchor idea comes from Faster RCNN, so first look at how Faster RCNN uses the anchor:
1) Each anchor computes an IoU with the GT boxes to produce the RPN's training input: anchors with IoU greater than 0.7 are positive samples (foreground), anchors with IoU less than 0.3 are negative samples (background), and anchors in between are ignored. The anchor box itself is used directly as the input, with category 1 (foreground, has target) or 0 (background, no target).
2) The RPN produces candidate boxes (ROIs) from the feature map and binary-classifies each ROI as foreground or background: foreground candidates are kept, background candidates are discarded, and regression fine-tunes the foreground BBox toward the GT.
3) There are two losses: the RPN loss (a box loss between anchor-relative predictions and targets, plus a 0/1 category loss for target/no-target) and the RCNN loss (between the foreground boxes predicted by the RPN and the GT boxes, over N categories).
The above is how the anchor is applied in Faster RCNN. You can see that the anchor lets the model obtain a large number of foreground boxes in the RPN stage that are already close to the GT boxes, while also removing a large number of background boxes; finally, the foreground boxes predicted by the RPN are sent to the RCNN stage for fine-tuning and correction. The accuracy is very high, but the speed is slower.
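A rough sketch of the RPN anchor labeling just described, assuming the commonly cited 0.7 / 0.3 thresholds (exact values vary across implementations):

```python
import numpy as np

def label_anchors(anchor_gt_iou, pos_thr=0.7, neg_thr=0.3):
    """anchor_gt_iou: (num_anchors,) max IoU of each anchor over all GT boxes.
    Returns 1 = foreground, 0 = background, -1 = ignored."""
    labels = np.full(anchor_gt_iou.shape, -1, dtype=np.int64)  # default: ignore
    labels[anchor_gt_iou < neg_thr] = 0   # background (negative sample)
    labels[anchor_gt_iou > pos_thr] = 1   # foreground (positive sample)
    return labels
```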


How does YOLOv2 use the anchor?
In YOLOv2, 5 anchors are preset per grid cell, of which only one anchor is matched to the GT box as having a target; the other 4 anchors have no target, and anchors without a target do not participate in the box loss.
Compare this with YOLOv1 again: there seems to be no essential difference between having anchors and not having them, since there is still only one positive sample, and the GT box is still located through the grid. The one new idea the anchor brings is that of a prior box:
the model changes from predicting a box from scratch to fine-tuning a prior box. At this point I could not understand why this would make the results better.

From the perspective of negative samples, YOLOv2 has 4 more negative samples per cell, and they come from explicit prior boxes, which is stronger than YOLOv1.

At the same time, when computing the confidence score, both positive and negative samples are considered: YOLOv1 has a 1:1 ratio, while YOLOv2 has a 1:4 ratio. There is also a subtlety in between: anchors with IoU > 0.6 are ignored. This makes sense, because there is already a largest-IoU anchor; the other anchors exceed the 0.6 threshold but can be treated neither as positive samples nor as negative samples, so they are ignored and do not participate in the loss calculation. These are small but very important details.
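A sketch of this confidence-target bookkeeping, under the assumptions above: one best-match anchor is positive, anchors over the 0.6 IoU threshold are ignored, and the rest are negatives (array names are illustrative; `ious` is assumed to hold each anchor's best IoU over all GT boxes, flattened to 1-D):

```python
import numpy as np

def objectness_mask(ious, best_idx, ignore_thr=0.6):
    """ious: (num_anchors,) best IoU per anchor; best_idx: matched anchor."""
    obj = np.zeros_like(ious)      # 1 where the object loss applies (positives)
    noobj = np.ones_like(ious)     # 1 where the no-object loss applies
    noobj[ious > ignore_thr] = 0   # high-IoU but non-best: ignored entirely
    obj[best_idx] = 1              # the single matched anchor is the positive
    noobj[best_idx] = 0
    return obj, noobj
```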

With the above understanding, look at the relationship between anchor and GT, and then calculate:
$target_w = e^{t_w} = \frac{GT_w}{anchor_w}, \quad target_h = e^{t_h} = \frac{GT_h}{anchor_h}$
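A small sketch of the $target_w, target_h$ computation, including picking the single matched anchor by shape-only IoU with both boxes co-centered; this shape matching is a common implementation choice, assumed here rather than quoted from the paper:

```python
def wh_targets(gt_w, gt_h, anchors):
    """anchors: list of (w, h) priors in the same units as gt_w, gt_h."""
    def shape_iou(w1, h1, w2, h2):
        # IoU of two boxes sharing the same center: intersection is min(w)*min(h)
        inter = min(w1, w2) * min(h1, h2)
        return inter / (w1 * h1 + w2 * h2 - inter)

    best = max(range(len(anchors)),
               key=lambda i: shape_iou(gt_w, gt_h, *anchors[i]))
    aw, ah = anchors[best]
    return best, gt_w / aw, gt_h / ah   # target_w = GT_w / anchor_w, etc.
```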

Through observation, we find that the upper-left corner point of each grid cell is the center point of the anchor.

Origin blog.csdn.net/u010006102/article/details/126759232