H2RBox: HORIZONTAL BOX ANNOTATION IS ALL YOU NEED FOR ORIENTED OBJECT DETECTION (paper reading notes)



It is sufficient to train a rotated object detector using horizontal boxes

abstract

Training a rotated-box detector with only horizontal-box (HBox) annotations saves annotation cost and makes a large number of existing HBox-annotated datasets usable for oriented detection.
The paper trains a rotated-box detector with weak supervision combined with self-supervision, and compares its performance against HBox-supervised instance segmentation models.
code:https://github.com/yangxue0827/h2rbox-mmrotate

intro

The introduction mentions several datasets:
DIOR-R
SKU110K-R
and some HBox-supervised instance segmentation methods:
BoxInst
BoxLevelSet

Since this task is proposed here for the first time, the paper compares against instance segmentation methods whose predicted masks are converted to rotated bounding rectangles.

Contributions of the paper:

  • The first HBox-annotation-based oriented object detector, with a proposed self-supervised angle prediction module
  • Compared with the HBox-supervised instance segmentation model BoxInst (Tian et al., 2021) (masks converted to rotated bounding rectangles), H2RBox achieves higher mAP50 (67.90% vs. 53.59%), is about 12× faster (31.6 fps vs. 2.7 fps), and uses less memory (6.25 GB vs. 19.93 GB)
  • Compared with the classic rotated detector FCOS, H2RBox is only 0.91% behind on DOTA-v1.0 (74.40% vs. 75.31%), even surpasses it by 1.7% on DIOR-R (34.90% vs. 33.20%), and runs at a comparable speed on DOTA-v1.0 (29.1 fps vs. 29.5 fps)

related work

Some related papers (to read later):
SDI
MCG
BBTP
Mask R-CNN
BoxInst
CondInst
BoxLevelSet

proposed method

This part is best read together with the paper; after all, the method is the most important part, and what follows here is a simplified version of the author's explanation.

  • Augmented view generation

The most interesting part of the paper is the use of self-supervision to enforce angular consistency of the predicted rotated box, as shown on the right side of the figure. Note that after the rotated image is fed into the network, either cropping or padding is required:
1: keep only the central region
2: pad the borders (blank/zero filling or reflection filling)
(the padded area does not participate in the regression loss, so there is no need to worry about objects appearing in the padded part)
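To make this concrete, here is a minimal PyTorch sketch (my own helper, not the repository code) of generating the rotated view with reflection filling via grid_sample:

```python
import math
import torch
import torch.nn.functional as F

def rotate_with_reflection_pad(img, theta):
    """Rotate a batch of images by theta (radians) about the image center,
    filling the exposed corners by reflection instead of leaving them blank.
    img: (N, C, H, W) float tensor; theta: Python float."""
    cos, sin = math.cos(theta), math.sin(theta)
    # inverse-mapping affine matrix expected by affine_grid (rotation about the center)
    mat = torch.tensor([[cos, -sin, 0.0],
                        [sin,  cos, 0.0]], dtype=img.dtype, device=img.device)
    grid = F.affine_grid(mat.repeat(img.size(0), 1, 1), list(img.shape), align_corners=False)
    return F.grid_sample(img, grid, padding_mode='reflection', align_corners=False)

# usage: one random angle per training step
theta = (torch.rand(1).item() * 2 - 1) * math.pi
aug_view = rotate_with_reflection_pad(torch.rand(2, 3, 256, 256), theta)
```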

  • weakly supervised branch

The ws branch uses a ResNet + FPN backbone with an FCOS-style head.
How can the FCOS regression head be supervised when only HBox labels are available? The circumscribed horizontal box of the predicted RBox is used against the GT HBox as supervision. As the author explains, this alone cannot uniquely determine the RBox, which leads to the ambiguous situations below.
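For reference, the circumscribed horizontal box of a rotated box has a simple closed form; a small sketch (my own helper, assuming (cx, cy, w, h, θ) boxes, not the repository API):

```python
import torch

def rbox_to_circumscribed_hbox(rboxes):
    """Circumscribed horizontal box of rotated boxes.
    rboxes: (..., 5) tensor of (cx, cy, w, h, theta in radians).
    returns: (..., 4) tensor of (cx, cy, W, H) of the enclosing HBox."""
    cx, cy, w, h, t = rboxes.unbind(dim=-1)
    cos_t, sin_t = torch.cos(t).abs(), torch.sin(t).abs()
    W = w * cos_t + h * sin_t   # horizontal extent of the rotated rectangle
    H = w * sin_t + h * cos_t   # vertical extent
    return torch.stack([cx, cy, W, H], dim=-1)
```

During ws training it is this circumscribed box that gets compared against the GT HBox (e.g. with an IoU-style loss).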

  • self-supervised branch

As a supplement to the ws branch, the ss branch only includes regression and no classification: ss is only used to improve the network's box regression. The branch predicts the rotated box (x, y, w, h, θ) through the same FCOS-style head (which regresses the four side distances plus an angle), and label assignment is done together with centerness. Since the circumscribed horizontal box must match the GT HBox and its center coincides with the center of the original rotated box, the parameters that actually remain to be learned are w, h, θ. Obviously, when w and h are accurate, there are only two possible values of θ that make the circumscribed rectangle equal to the GT HBox.
$B_{ss}^{s} = S(R \cdot B_{ws}^{c})$

Here the left-hand side is the mirrored box predicted by the ss branch; $S$ is the flip transformation (because the box is centrally symmetric, a horizontal and a vertical flip give the same result); $R$ is the rotation matrix built from $\theta$, the rotation angle of the ss view; and $B_{ws}^{c}$ is the coincident rbox predicted by the ws branch.

(I can't fully understand the next few sentences.)
The final SS learning consists of scale-consistent and spatial-location-consistent learning:

$Sim\langle R \cdot B_{ws}, B_{ss} \rangle$

Fig. 4(b) of the paper visualizes the effect of the SS loss, with accurate predictions; the appendix visualizes the feasible solutions under different combinations of constraints. (Let's see.)
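My rough reading of the consistency idea as code, a sketch under assumptions (boxes are (cx, cy, w, h, θ), rotation is about the origin, and the mirrored feasible solution discussed above is ignored for brevity):

```python
import torch

def ss_consistency_loss(ws_boxes, ss_boxes, theta):
    """Scale- and spatial-location-consistency between the ws-branch boxes
    (rotated by theta) and the ss-branch boxes. Both are (N, 5) tensors of
    (cx, cy, w, h, angle in radians); theta is the view rotation angle."""
    cos = torch.cos(torch.as_tensor(theta, dtype=ws_boxes.dtype))
    sin = torch.sin(torch.as_tensor(theta, dtype=ws_boxes.dtype))
    # spatial-location consistency: ws centers rotated by theta should match ss centers
    x, y = ws_boxes[:, 0], ws_boxes[:, 1]
    rot_centers = torch.stack([x * cos - y * sin, x * sin + y * cos], dim=-1)
    loc_loss = (rot_centers - ss_boxes[:, :2]).abs().mean()
    # scale consistency: width and height should be unchanged by the rotation
    scale_loss = (ws_boxes[:, 2:4] - ss_boxes[:, 2:4]).abs().mean()
    # angle consistency: the ss angle should equal the ws angle plus theta
    angle_loss = (ws_boxes[:, 4] + theta - ss_boxes[:, 4]).abs().mean()
    return loc_loss + scale_loss + angle_loss
```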

  • label assignment

The author's idea: the consistency between the ws branch and the ss branch is learned by setting a center-point loss and an angle loss.
The rbox predicted by the ws branch is used to supervise the rbox predicted by the ss branch, and the centerness/classification targets derived from the GT HBox at each location should stay consistent between the two views.
1) One-to-one: each point in the original image corresponds to a point in the ss image, and the HBox corresponding to its rbox is used as supervision (a rough sketch of the point mapping is given below)
2) One-to-many: the ws rbox whose center is closest to the point supervises the ss-branch rbox
(To be honest, I only half understand these few sentences; I'm not sure my reading is right and will have to confirm it from the code.)
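A sketch of the point-mapping part of the one-to-one assignment as I read it (names and details are my own, rotation about the image center):

```python
import torch

def map_points_to_rotated_view(points, theta, center):
    """Map feature-map locations of the original view into the rotated view
    by rotating them about the image center; the ws prediction at each
    original point can then supervise the ss prediction at the mapped point.
    points: (N, 2) tensor of (x, y); theta: float (radians); center: (cx, cy)."""
    cos = torch.cos(torch.as_tensor(theta, dtype=points.dtype))
    sin = torch.sin(torch.as_tensor(theta, dtype=points.dtype))
    cx, cy = center
    x, y = points[:, 0] - cx, points[:, 1] - cy
    return torch.stack([x * cos - y * sin + cx,
                        x * sin + y * cos + cy], dim=-1)
```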

  • loss combining the ws and ss

ws branch: a rotated object detector based on FCOS. Losses: $L_{cls}$, $L_{reg}$, $L_{cn}$ — focal loss, IoU loss, and (binary) cross-entropy loss respectively. (Why is there still an $L_{cn}$???)
Some parameter explanations (I won't retype them here; nothing out of the ordinary):

Importantly, let's take a look at the ss branch loss:
(the t in this $L_{xy}$ is not explained, and I don't understand the $B(-,-,+,+)$ notation either)

experiment

see paper

ablation studies


  • Effect of the border-generation method for the rotated view (cropping vs. padding) on the results
  • Effect of different label assigners on the results
  • Training and testing strategies for the special case of circular objects (ST = storage tank)
  • Improvement brought by the ss method

Interpretation at the code level

The code is at the link given in the paper (and above). Having read it, the overall code is very concise; the novelty mainly lies in the design of the loss.
The code is introduced below in three steps.

model.forward_train(self, img, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore)

This part is the forward pass of the network during training; it does the following:

  • Randomly generate an angle
  • Rotate the original image, then reflection-pad the borders
  • Send both the rotated image and the original image through backbone + neck (shared weights)
  • Send the resulting feature maps to head.forward_train

It can be seen that the only difference from an ordinary model.forward_train is the extra parallel branch; a toy sketch of this two-view structure is given below.
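A toy sketch of the parallel branch with shared weights (a single conv layer stands in for ResNet + FPN; everything here is illustrative, not the repository's API):

```python
import torch
import torch.nn as nn

class TwoViewSketch(nn.Module):
    """Toy illustration of the parallel branch: the same backbone weights
    process both the original view and the rotated (augmented) view."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)   # shared weights

    def forward(self, img, img_aug):
        x = self.backbone(img)          # features of the original view
        x_aug = self.backbone(img_aug)  # same weights applied to the rotated view
        return x, x_aug

# usage together with the rotation helper sketched earlier:
# x, x_aug = TwoViewSketch()(img, rotate_with_reflection_pad(img, theta))
```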

losses = self.bbox_head.forward_train(x, x_aug, rot, img_metas, gt_bboxes, gt_labels, gt_bboxes_ignore)

The feature maps are sent into the head network (Rotated_FCOS_head).
First, for the original x, a direct forward(x) produces bbox_pred, angle_pred, cls_score, and centerness.
Then x_aug only goes through the regression branch (shared weights) to get bbox_pred and angle_pred.

The following is the most important part: how is the loss designed?
According to the paper, for x the standard FCOS losses are used (classification + regression + angle + centerness), while the loss involving x_aug can be called the consistency loss. Let's take a closer look.

loss(outs, outs_aug, tf, gt_bboxes, gt_labels, img_metas)
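Schematically, combining the pieces above (the weighting factor $\lambda$ is my own shorthand; see the paper and code for the exact form):

$L_{total} = L_{ws} + \lambda \, L_{ss}$

where $L_{ws}$ collects the FCOS-style losses on the original view listed above, and $L_{ss}$ is the consistency loss between the ws predictions rotated by the view angle and the ss predictions on x_aug.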


Original post: blog.csdn.net/fei_YuHuo/article/details/127574833