Faster RCNN Series 2 - Overview of RPN's True and Predicted Values

Faster RCNN series:

Faster RCNN series 1 - Anchor generation process
Faster RCNN series 2 - Overview of the true value and predicted value of RPN
Faster RCNN series 3 - Detailed explanation of the true value of RPN and loss value calculation
Faster RCNN series 4 - Generating Proposals and RoIs

  For the object detection task, the model needs to predict both the category and the position of each object, that is, five quantities: the category, the center-point coordinates $x$ and $y$ of the box, and the box width $w$ and height $h$. Based on the Anchor priors, RPN predicts the category of each Anchor as the category of the predicted box, and predicts the offset of the ground-truth box relative to the Anchor in order to recover the position of the ground-truth box.

  Therefore, RPN has two kinds of true and predicted values: the category and the offset.

  As shown in the figure below, there are 3 Anchors and 2 labels in the input image: Anchor A overlaps with label M, Anchor C overlaps with label N, and Anchor B does not overlap with any label.


Figure 1 The relationship between Anchors and labels

1.1 True values

  • Category true value

  The category true value here indicates whether an Anchor belongs to the foreground or the background. RPN decides this by computing the IoU between the Anchor and the labels. The IoU of Anchor A and label M in Figure 1 is computed as follows:

$$IoU(A, M) = \frac{A \cap M}{A \cup M}$$
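
For concreteness, here is a minimal PyTorch sketch of this IoU computation, assuming boxes in corner format $(x_1, y_1, x_2, y_2)$; `box_iou` is a hypothetical helper written for this article, not code from the original implementation:

```python
import torch

def box_iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """IoU of two boxes given in (x1, y1, x2, y2) corner format."""
    # Intersection rectangle: max of the top-left corners, min of the bottom-right
    lt = torch.max(box_a[:2], box_b[:2])
    rb = torch.min(box_a[2:], box_b[2:])
    wh = (rb - lt).clamp(min=0)               # zero width/height if no overlap
    inter = wh[0] * wh[1]
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)  # |A ∩ M| / |A ∪ M|
```

For example, `box_iou(torch.tensor([0., 0., 4., 4.]), torch.tensor([2., 2., 6., 6.]))` gives an intersection of 4 and a union of 28, so the IoU is about 0.143.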

  When the IoU is greater than a certain threshold, the Anchor's true category is the foreground; when the IoU is smaller than a certain threshold, it is the background. The specific criteria are as follows, with a code sketch after the list:

  • For any Anchor, if its maximum IoU with all labels is less than 0.3, it is regarded as a negative sample.

  • For any label, the Anchor that has the largest IoU with it is regarded as a positive sample.

  • For any Anchor, if its maximum IoU with all labels is greater than 0.7, it is regarded as a positive sample.
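
A minimal sketch of these three rules, assuming an IoU matrix of shape `[num_anchors, num_labels]` has already been computed; `assign_anchor_labels` is a hypothetical helper, and marking the in-between Anchors as ignored (-1) follows the common convention in Faster RCNN implementations:

```python
import torch

def assign_anchor_labels(ious: torch.Tensor,
                         pos_thresh: float = 0.7,
                         neg_thresh: float = 0.3) -> torch.Tensor:
    """Label each Anchor as foreground (1), background (0), or ignored (-1)."""
    labels = torch.full((ious.shape[0],), -1, dtype=torch.long)

    max_iou_per_anchor, _ = ious.max(dim=1)
    labels[max_iou_per_anchor < neg_thresh] = 0   # rule 1: negative samples
    labels[max_iou_per_anchor > pos_thresh] = 1   # rule 3: positive samples
    # Rule 2: the best Anchor for each label is positive, so every label
    # is matched by at least one Anchor even if all its IoUs are below 0.7
    labels[ious.argmax(dim=0)] = 1
    return labels
```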

  • Offset true value

  Suppose the center coordinates of Anchor A in Figure 1 are $x_a$ and $y_a$, and its width and height are $w_a$ and $h_a$; the center coordinates of label M are $x$ and $y$, and its width and height are $w$ and $h$. The true offsets are then computed as follows:

$$\left\{\begin{matrix} t_{x}=\frac{x-x_{a}}{w_{a}} \\ t_{y}=\frac{y-y_{a}}{h_{a}} \\ t_{w}=\log\left(\frac{w}{w_{a}}\right) \\ t_{h}=\log\left(\frac{h}{h_{a}}\right) \end{matrix}\right.$$
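
The formula translates directly into code. A minimal sketch, assuming both boxes are given in center-size format $(x, y, w, h)$; `encode_offsets` is a hypothetical name:

```python
import torch

def encode_offsets(anchor: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """True offsets (t_x, t_y, t_w, t_h) of a ground-truth box w.r.t. an Anchor."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return torch.stack([
        (x - xa) / wa,       # t_x
        (y - ya) / ha,       # t_y
        torch.log(w / wa),   # t_w
        torch.log(h / ha),   # t_h
    ])
```

During training, the regression branch is supervised to output exactly these four values for each positive Anchor.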

1.2 Predicted values

  RPN predicts the category and the offset with the network structure shown in Figure 2: the classification branch produces the category prediction, and the regression branch produces the offset prediction.


Figure 2 RPN network structure

  • Category prediction

  In the classification branch, a $1 \times 1$ convolution first outputs features of size $18 \times 37 \times 50$; since each point has 9 Anchors by default and each Anchor only predicts whether it belongs to the foreground or the background, the number of channels is 18. The features are then reshaped with the torch.view() function to $2 \times 333 \times 50$, so that the first dimension holds exactly one Anchor's foreground and background scores, and sent to the Softmax function for probability calculation; the result is reshaped back to $18 \times 37 \times 50$. The final output is the probability that each Anchor belongs to the foreground and to the background.
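
The following sketch reproduces this branch under the shapes given above; the 512 input channels are an assumption about the backbone feature map, not stated in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

cls_conv = nn.Conv2d(512, 18, kernel_size=1)  # 9 Anchors x 2 classes = 18 channels

feats = torch.randn(1, 512, 37, 50)           # assumed backbone feature map
scores = cls_conv(feats)                      # 1 x 18 x 37 x 50
scores = scores.view(1, 2, 9 * 37, 50)        # 2 x 333 x 50: dim 1 is fg/bg
probs = F.softmax(scores, dim=1)              # per-Anchor fg/bg probabilities
probs = probs.view(1, 18, 37, 50)             # back to 18 x 37 x 50
```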

  • Offset prediction

  In the regression branch, a $1 \times 1$ convolution outputs features of size $36 \times 37 \times 50$. The first dimension, 36, contains the predictions for the 9 Anchors at each point: each Anchor has 4 values, representing the predicted offsets of the Anchor's center-point coordinates, width, and height relative to the ground-truth box.
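
And a matching sketch of the regression branch under the same assumptions:

```python
import torch
import torch.nn as nn

reg_conv = nn.Conv2d(512, 36, kernel_size=1)  # 9 Anchors x 4 offsets = 36 channels

feats = torch.randn(1, 512, 37, 50)           # assumed backbone feature map
offsets = reg_conv(feats)                     # 1 x 36 x 37 x 50
# Rearrange so each row is one Anchor's (t_x, t_y, t_w, t_h) prediction
offsets = offsets.permute(0, 2, 3, 1).reshape(1, -1, 4)  # 1 x 16650 x 4
```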

