Oriented R-CNN for Object Detection

1. Preface and related work 

 

Figure 1 Intuitive comparison between (a) rotated RPN, (b) RoI Transformer, and (c) oriented RPN

Figure 1 compares different schemes for generating oriented proposals. (a) Rotated RPN densely places rotated anchors of different scales, aspect ratios, and angles. (b) RoI Transformer+ learns oriented proposals from horizontal RoIs; it involves an RPN, RoI alignment, and regression. (c) Our oriented RPN generates high-quality proposals at almost no cost. The parameter count of the oriented RPN is about 1/3000 that of RoI Transformer+ and 1/15 that of rotated RPN.

(a) Rotated RPN: places 54 anchors with different angles, scales, and aspect ratios (3 scales × 3 ratios × 6 angles) at each location.
Rotated RPN pros: the rotated anchors improve recall and perform well when oriented objects are sparsely distributed.
Rotated RPN cons: the large number of anchors leads to heavy computation and memory usage.

(b) RoI Transformer: learns oriented proposals from horizontal RoIs through a complicated process involving an RPN, RoI alignment, and regression.
RoI Transformer pros: provides promising oriented proposals and greatly reduces the number of rotated anchors.
RoI Transformer cons: still brings an expensive computational cost.

In this paper, we propose an effective and simple oriented object detection framework called Oriented R-CNN, a general-purpose two-stage oriented detector with good accuracy and efficiency. Specifically, the first stage is an oriented region proposal network (oriented RPN), which directly generates high-quality oriented proposals in an almost cost-free manner. The second stage is the oriented R-CNN head, which refines and classifies the oriented regions of interest (oriented RoIs).

 2. Oriented R-CNN

Figure 2 The overall framework of the FPN-based two-stage Oriented R-CNN detector

The object detection method proposed in this paper, Oriented R-CNN, consists of an oriented RPN and an oriented R-CNN head (see Figure 2). It is a two-stage detector: the first stage generates high-quality oriented proposals through the oriented RPN in an almost cost-free manner, and the second stage classifies the proposals and refines their spatial locations through the oriented R-CNN head. The FPN backbone produces five levels of features {P2, P3, P4, P5, P6}. For simplicity, the FPN architecture and the classification branch of the oriented RPN are not shown.

2.1 Oriented RPN

The oriented RPN takes the five levels of FPN features {P_{2}, P_{3}, P_{4}, P_{5}, P_{6}} as input and attaches a head of the same design (a 3×3 convolutional layer and two sibling 1×1 convolutional layers) to each feature level. We assign three horizontal anchors with aspect ratios {1:2, 1:1, 2:1} to each spatial location of all feature levels. The anchors on {P_{2}, P_{3}, P_{4}, P_{5}, P_{6}} have pixel areas of 32^{2}, 64^{2}, 128^{2}, 256^{2}, 512^{2}, respectively. Each anchor a is represented by a four-dimensional vector a = (a_{x}, a_{y}, a_{w}, a_{h}), where (a_{x}, a_{y}) is the center coordinate of the anchor and a_{w}, a_{h} are its width and height. One of the two sibling 1×1 convolutional layers is the regression branch: it outputs the offsets \delta = (\delta_{x}, \delta_{y}, \delta_{w}, \delta_{h}, \delta_{\alpha}, \delta_{\beta}) of the proposals relative to the anchors. At each location of the feature map we generate A proposals (A is the number of anchors per location, equal to 3 in this work), so the regression branch has 6A outputs. Decoding the regression outputs yields the oriented proposals. The decoding process is as follows:

\left\{\begin{matrix} \Delta \alpha =\delta _{\alpha}\cdot w,\ \Delta \beta =\delta _{\beta}\cdot h \\ w=a_{w}\cdot e^{\delta _{w}},\ h=a_{h}\cdot e^{\delta _{h}} \\ x=\delta _{x}\cdot a_{w}+a_{x},\ y=\delta _{y}\cdot a_{h}+a_{y} \end{matrix}\right.


where (x, y) is the center coordinate of the predicted proposal, w and h are the width and height of the external rectangle of the predicted proposal, and \Delta \alpha and \Delta \beta are the offsets relative to the midpoints of the top and right sides of the external rectangle.
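To make the decoding concrete, here is a minimal NumPy sketch of this step (the function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def decode_proposals(anchors, deltas):
    """Decode oriented-RPN regression outputs into oriented proposals.

    anchors: (N, 4) array of (a_x, a_y, a_w, a_h) horizontal anchors.
    deltas:  (N, 6) array of (dx, dy, dw, dh, da, db) regression outputs.
    Returns: (N, 6) array of (x, y, w, h, d_alpha, d_beta).
    """
    ax, ay, aw, ah = anchors.T
    dx, dy, dw, dh, da, db = deltas.T

    # External rectangle of the oriented proposal.
    w = aw * np.exp(dw)
    h = ah * np.exp(dh)
    x = dx * aw + ax
    y = dy * ah + ay

    # Midpoint offsets, scaled by the predicted width and height.
    d_alpha = da * w
    d_beta = db * h
    return np.stack([x, y, w, h, d_alpha, d_beta], axis=1)
```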

2.1.1 Midpoint offset representation

Figure 3 Schematic diagram of the midpoint offset representation

(a) Schematic of the midpoint offset representation. (b) An example of the midpoint offset representation

The proposal representation is called the midpoint offset representation, as shown in Figure 3. The black dots are the midpoints of the sides of the horizontal box, which is the external rectangle of the oriented bounding box O; the orange dots are the vertices of the oriented bounding box O.

From O=(x,y,w,h,\Delta \alpha ,\Delta \beta ), the vertex set V = (v_{1}, v_{2}, v_{3}, v_{4}) of each oriented proposal can be obtained, with the four vertices expressed as:
\left\{\begin{matrix} v_{1}=(x,y-h/2)+(\Delta \alpha ,0) \\ v_{2}=(x+w/2,y)+(0,\Delta \beta ) \\ v_{3}=(x,y+h/2)+(-\Delta \alpha ,0) \\ v_{4}=(x-w/2,y)+(0,-\Delta \beta ) \end{matrix}\right.

The regression of each oriented proposal is thus achieved by predicting the parameters (x, y, w, h) of its external rectangle together with the midpoint offsets (\Delta \alpha, \Delta \beta).
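The vertex computation follows directly from the formula above; a short sketch (names again illustrative, image coordinates with y pointing down):

```python
import numpy as np

def midpoint_offset_to_vertices(proposal):
    """Convert (x, y, w, h, d_alpha, d_beta) to the four vertices v1..v4."""
    x, y, w, h, da, db = proposal
    v1 = (x + da,    y - h / 2)   # top-side midpoint shifted by d_alpha
    v2 = (x + w / 2, y + db)      # right-side midpoint shifted by d_beta
    v3 = (x - da,    y + h / 2)   # symmetric to v1 about the center
    v4 = (x - w / 2, y - db)      # symmetric to v2 about the center
    return np.array([v1, v2, v3, v4])
```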

2.1.2 Loss function

Figure 4 Diagram of bbox regression parameterization

The black points are the midpoints of the top and right sides, and the orange points are the vertices of the oriented bounding box. (a) Anchor. (b) Ground-truth box. (c) Predicted box.

Positive samples: ① anchors whose IoU with a ground-truth box is greater than 0.7; ② anchors with the highest IoU with a ground-truth box, provided that IoU is greater than 0.3.
Negative samples: anchors whose IoU with all ground-truth boxes is less than 0.3.
Anchors that are neither positive nor negative are ignored during training (see the sketch below). Note that the ground-truth box here refers to the external rectangle of the oriented ground-truth box.
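A minimal sketch of this label-assignment rule, assuming a precomputed (num_anchors, num_gt) IoU matrix between the horizontal anchors and the external rectangles of the ground-truth boxes:

```python
import numpy as np

def assign_anchor_labels(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored anchors."""
    labels = -np.ones(ious.shape[0], dtype=np.int64)   # ignore by default
    max_iou_per_anchor = ious.max(axis=1)

    # Negative: IoU with every ground-truth box below 0.3.
    labels[max_iou_per_anchor < neg_thresh] = 0
    # Positive rule 1: IoU above 0.7 with some ground-truth box.
    labels[max_iou_per_anchor > pos_thresh] = 1
    # Positive rule 2: the highest-IoU anchor of each ground-truth box,
    # kept only if that IoU exceeds 0.3.
    best_anchor_per_gt = ious.argmax(axis=0)
    keep = ious[best_anchor_per_gt, np.arange(ious.shape[1])] > neg_thresh
    labels[best_anchor_per_gt[keep]] = 1
    return labels
```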

The loss function of the oriented RPN is defined as:

L=\frac{1}{N}\sum_{i=1}^{N}F_{cls}(p_{i},p_{i}^{*})+\frac{1}{N}\sum_{i=1}^{N}p_{i}^{*}F_{reg}(\delta _{i},t_{i}^{*})

p_{i} is the output of the oriented RPN classification branch, indicating the probability that the candidate box belongs to the foreground; p_{i}^{*} is the ground-truth label of the i-th anchor.
\delta _{i}=(\delta _{x},\delta _{y},\delta _{w},\delta _{h},\delta _{\alpha },\delta _{\beta }) is the output of the oriented RPN regression branch, representing the predicted offsets of the candidate box.
t_{i}^{*}=(t_{x}^{*},t_{y}^{*},t_{w}^{*},t_{h}^{*},t_{\alpha }^{*},t_{\beta }^{*}) is the ground-truth offset. F_{cls} is the cross-entropy loss and F_{reg} is the smooth L1 loss.
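Assuming F_cls is a binary cross-entropy over the foreground probability and F_reg is the smooth L1 loss, a PyTorch sketch of this loss (for a batch of sampled positive and negative anchors) might look like:

```python
import torch
import torch.nn.functional as F

def oriented_rpn_loss(p, p_star, delta, t_star):
    """Sketch of the oriented-RPN loss: cross-entropy + smooth L1.

    p:      (N,) predicted foreground probabilities.
    p_star: (N,) labels, 1 for positive and 0 for negative anchors.
    delta:  (N, 6) predicted offsets.
    t_star: (N, 6) ground-truth offsets.
    """
    n = p.shape[0]
    cls_loss = F.binary_cross_entropy(p, p_star.float(), reduction="sum") / n
    # The regression term only counts for positive anchors (p_star == 1).
    reg_loss = (p_star.unsqueeze(1).float()
                * F.smooth_l1_loss(delta, t_star, reduction="none")).sum() / n
    return cls_loss + reg_loss
```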

2.2 Oriented R-CNN head

The oriented R-CNN head takes the feature maps {P2, P3, P4, P5} and a set of oriented proposals as input. For each oriented proposal, rotated RoI alignment (rotated RoIAlign) is used to extract a fixed-size feature vector from its corresponding feature map.

Each feature vector is fed into two fully connected layers (FC1 and FC2, see Figure 2), followed by two sibling fully connected layers: one outputs the classification scores over K+1 classes (K object classes plus one background class), and the other generates an offset for the proposal for each of the K object classes.
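A minimal PyTorch sketch of this head; the RoI feature size (256×7×7), hidden width, number of classes (15, as in DOTA), and the 5-parameter per-class oriented-box offset are illustrative assumptions:

```python
import torch.nn as nn

class OrientedRCNNHead(nn.Module):
    """Sketch of the oriented R-CNN head (sizes are illustrative)."""

    def __init__(self, in_features=256 * 7 * 7, hidden=1024, num_classes=15):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.relu = nn.ReLU(inplace=True)
        # K + 1 classification scores (K object classes plus background).
        self.cls_score = nn.Linear(hidden, num_classes + 1)
        # One oriented-box offset per object class (assumed 5 parameters).
        self.bbox_pred = nn.Linear(hidden, num_classes * 5)

    def forward(self, roi_feats):          # roi_feats: (N, in_features), flattened
        x = self.relu(self.fc1(roi_feats))
        x = self.relu(self.fc2(x))
        return self.cls_score(x), self.bbox_pred(x)
```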

2.2.1 Rotated RoIAlign

Rotated RoIAlign is an operation that extracts rotation-invariant features from each oriented proposal.

Figure 5 The process of rotated RoIAlign

V = (v_{1}, v_{2}, v_{3}, v_{4}) are the vertices of the parallelogram of a generated oriented proposal. For convenience of computation, each parallelogram is adjusted into an oriented rectangle by extending its shorter diagonal to the same length as the longer one. After this simple operation, the oriented rectangle (x,y,w,h,\theta ) is obtained from the parallelogram. Next, the oriented rectangle (x,y,w,h,\theta ) is projected onto the feature map F with stride s to obtain the rotated RoI (x_{r},y_{r},w_{r},h_{r},\theta ), defined by the following operation:

\left\{\begin{matrix} w_{r}=w/s,\ h_{r}=h/s \\ x_{r}=\left \lfloor x/s \right \rfloor ,\ y_{r}=\left \lfloor y/s \right \rfloor \end{matrix}\right.
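Putting the two operations together, a NumPy sketch (the angle convention and function names are my own, not from the paper):

```python
import numpy as np

def parallelogram_to_oriented_rect(v1, v2, v3, v4):
    """Adjust the proposal parallelogram into an oriented rectangle
    (x, y, w, h, theta) by stretching its shorter diagonal to the
    length of the longer one."""
    v1, v2, v3, v4 = (np.asarray(v, dtype=float) for v in (v1, v2, v3, v4))
    center = (v1 + v3) / 2.0               # the diagonals bisect each other
    d13, d24 = v3 - v1, v4 - v2
    l13, l24 = np.linalg.norm(d13), np.linalg.norm(d24)
    if l13 < l24:                          # extend the shorter diagonal
        d13 *= l24 / l13
        v1, v3 = center - d13 / 2, center + d13 / 2
    else:
        d24 *= l13 / l24
        v2, v4 = center - d24 / 2, center + d24 / 2
    # Equal diagonals that bisect each other define a rectangle.
    w, h = np.linalg.norm(v2 - v1), np.linalg.norm(v3 - v2)
    dx, dy = v2 - v1
    theta = np.arctan2(dy, dx)             # orientation of the v1 -> v2 side
    return center[0], center[1], w, h, theta

def project_to_rotated_roi(x, y, w, h, theta, s):
    """Project the oriented rectangle onto a feature map with stride s."""
    return np.floor(x / s), np.floor(y / s), w / s, h / s, theta
```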

2.3 Implementation details

Oriented R-CNN is trained end-to-end by jointly optimizing the oriented RPN and the oriented R-CNN head. During inference, the oriented proposals generated by the oriented RPN usually overlap heavily. To reduce redundancy, we keep 2000 proposals per FPN level in the first stage, followed by non-maximum suppression (NMS). For inference speed, horizontal NMS with an IoU threshold of 0.8 is adopted. The remaining proposals of all levels are merged, and the top 1000 are selected by classification score as the input to the second stage. In the second stage, poly NMS is performed per object class on the oriented bounding boxes whose predicted class probabilities are greater than 0.05; the poly NMS IoU threshold is 0.1.
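A sketch of the first-stage filtering at inference time, assuming torchvision's horizontal NMS applied to the proposals' external rectangles in (x1, y1, x2, y2) form (names illustrative):

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes_per_level, scores_per_level,
                     pre_nms_top_n=2000, nms_thresh=0.8, post_top_n=1000):
    """First-stage proposal filtering: top-k per level, NMS, global top-k."""
    kept_boxes, kept_scores = [], []
    for boxes, scores in zip(boxes_per_level, scores_per_level):
        # Keep the top 2000 proposals per FPN level, then apply NMS.
        scores, order = scores.topk(min(pre_nms_top_n, scores.numel()))
        boxes = boxes[order]
        keep = nms(boxes, scores, nms_thresh)
        kept_boxes.append(boxes[keep])
        kept_scores.append(scores[keep])
    boxes = torch.cat(kept_boxes)
    scores = torch.cat(kept_scores)
    # Merge all levels and keep the overall top 1000 by score.
    scores, order = scores.topk(min(post_top_n, scores.numel()))
    return boxes[order], scores
```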

 


Source: blog.csdn.net/weixin_42715977/article/details/130718367