Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors (BBAVectors): rotated-box object detection for remote sensing images

This article interprets the paper in combination with the original text and my own understanding.

Paper address: https://arxiv.org/pdf/2008.07043.pdf
Code address: GitHub - yijingru/BBAVectors-Oriented-Object-Detection: [WACV2021] Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors

First of all, why is rotated-box detection now so popular for remote sensing object detection? Objects in remote sensing (aerial) images usually appear in arbitrary orientations and are densely arranged, so dense prediction is required. Detecting remote sensing targets with rotated boxes greatly alleviates the under-detection and missed detection caused by dense prediction.

Some professional terms:

OBB: Oriented Bounding Box
HBB: Horizontal Bounding Box
RBB: Rotated Bounding Box (RBB refers to all oriented bounding boxes except the horizontal bounding box)

The paper "Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors" proposes oriented object detection in aerial images based on box boundary-aware vectors.

I. A brief introduction to the paper

Compared with the baseline that determines the rotated box by center+wh+θ, this paper proposes box boundary-aware vectors (BBAVectors): the network regresses the boundary-aware vectors to generate the rotated box.

Disadvantages of center+wh+θ:

(1) A small angle change contributes little to the total loss during training, but it may produce a large IoU difference between the predicted box and the ground-truth box. Detection is actually evaluated with IoU, and IoU and Smooth L1 are not equivalent: multiple predicted boxes can have the same Smooth L1 loss yet very different IoU, as shown in Figure 1 (a small numeric check follows the figure).
(2) w and h are always learned in a coordinate system that is rotated differently for each object, which makes it challenging for the network to jointly learn all box parameters. This is the rotated-box description shown in Figure 2(a).
 

Figure 1: Predicted boxes with the same loss value can have very different IoU.
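To make point (1) concrete, here is a small numeric check (my own example, not from the paper; it assumes raw, unnormalized box parameters). Two predictions with the same L1 parameter error against the ground truth can have very different IoU, especially for elongated boxes:

```python
# Compare IoU of two predictions that are equally "wrong" in parameter space.
# Requires shapely (pip install shapely).
import math
from shapely.geometry import Polygon
from shapely.affinity import rotate

def obb_polygon(cx, cy, w, h, theta_deg):
    """A w x h box centered at (cx, cy), rotated by theta_deg degrees."""
    box = Polygon([(cx - w / 2, cy - h / 2), (cx + w / 2, cy - h / 2),
                   (cx + w / 2, cy + h / 2), (cx - w / 2, cy + h / 2)])
    return rotate(box, theta_deg, origin=(cx, cy))

def iou(a, b):
    return a.intersection(b).area / a.union(b).area

gt = obb_polygon(0, 0, 100, 20, 0)
delta = 0.175  # identical raw parameter error for both predictions
pred_angle = obb_polygon(0, 0, 100, 20, math.degrees(delta))  # theta off by 0.175 rad
pred_width = obb_polygon(0, 0, 100 + delta, 20, 0)            # w off by 0.175 px

print(f"IoU with angle error: {iou(gt, pred_angle):.3f}")  # much lower
print(f"IoU with width error: {iou(gt, pred_width):.3f}")  # close to 1.0
```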

To address the above shortcomings, box boundary-aware vectors (BBAVectors) propose:

(1) All arbitrarily oriented objects share the same coordinate system: the network learns four vectors, one in each quadrant, so that more mutual information can be shared when some local features are blurred and weak. See Figure 2(b).
(2) On the basis of (1), the external box size w, h and an orientation parameter α are added, to solve the problem that boxes almost aligned with the xy-axes are hard to capture under (1) alone. See Figure 2(c).

Figure 2: The center point method (a) and the BBAVectors method (b)(c).

 Figure 2 illustrates: (a) Oriented bounding box (OBB) description of the baseline method, called center+wh+θ, where w, h, θ are the width, height and angle of the OBB. Note that w and h of OBB are measured in a different rotated coordinate system for each object; (b) the proposed method, where t, r, b, l are upper, right, lower and left box boundary-aware vectors. For all arbitrarily oriented objects, the box boundary-aware vectors are defined in the four quadrants of the Cartesian coordinate system; (c) shows the corner cases where the vectors are very close to the xy axis, which can be detected by the HBB method.

 Contributions of this paper:
(1) First detect the center keypoint of the object, and then regress the Box Boundary Aware Vectors (BBAVectors) on this basis to capture the oriented bounding box. For all arbitrarily oriented objects, the box boundary perception vectors are distributed in the four quadrants of the Cartesian coordinate system.
(2) To alleviate the difficulty of learning vectors in the corner case, oriented bounding boxes are further classified into horizontal bounding boxes and rotated bounding boxes.
(3) Experiments show that learning the box boundary-aware vectors outperforms directly predicting the width, height, and angle of an oriented bounding box.

II. Method introduction

Figure 3 BBAVectors network structure diagram

Illustration for Figure 3: the general architecture of the method and the oriented bounding box (OBB) description. Input images are resized to 608×608 before being fed to the network. The architecture is built on a U-shaped network; during upsampling, feature maps are combined through skip connections. The output consists of four maps: the heatmap P, the offset map O, the box parameter map B, and the orientation map α. The location of the center point is inferred from the heatmap and the offset map, and the box boundary-aware vector (BBAVector) is learned at the center point. The resolution of the output maps is 152×152. HBB refers to the horizontal bounding box; RBB denotes all oriented bounding boxes except HBBs. The symbols t, r, b, l refer to the top, right, bottom, and left vectors of the BBAVector, and w_e and h_e are the external width and height of the OBB. The decoded OBB is shown as the red bounding box.

1. Feature extraction network

Convolutional layers 1-5 of ResNet101 serve as the backbone of the model. First, the remote sensing image is resized to 608×608 and fed into the ResNet101 network; after 4-fold downsampling, the features go from 608×608×3 to 152×152×C, where C is the number of convolutional output channels. Then, after four downsampling stages and three upsampling stages with skip connections, a feature map of size 152×152×256 is output.
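As a concrete reference, here is a minimal PyTorch sketch of such a U-shaped backbone. It is my own reconstruction under assumptions, not the authors' implementation: the fusion style (1×1 lateral convolutions plus addition) and bilinear upsampling are choices I made where the text only says "skip connections".

```python
import torch.nn as nn
import torchvision

class UShapedBackbone(nn.Module):
    """ResNet101 conv1-conv5 downsampling, then three upsampling steps
    with skip connections back to stride 4 (152x152 for a 608x608 input)."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet101(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool)               # stride 4
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2  # strides 4, 8
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4  # strides 16, 32
        # 1x1 convs to unify skip-connection channels before fusion
        self.lat4 = nn.Conv2d(2048, out_channels, 1)
        self.lat3 = nn.Conv2d(1024, out_channels, 1)
        self.lat2 = nn.Conv2d(512, out_channels, 1)
        self.lat1 = nn.Conv2d(256, out_channels, 1)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, x):               # x: (B, 3, 608, 608)
        c1 = self.layer1(self.stem(x))  # (B, 256, 152, 152)
        c2 = self.layer2(c1)            # (B, 512, 76, 76)
        c3 = self.layer3(c2)            # (B, 1024, 38, 38)
        c4 = self.layer4(c3)            # (B, 2048, 19, 19)
        # three upsampling steps, each fused with a skip connection
        p3 = self.up(self.lat4(c4)) + self.lat3(c3)  # 38x38
        p2 = self.up(p3) + self.lat2(c2)             # 76x76
        p1 = self.up(p2) + self.lat1(c1)             # 152x152
        return p1                       # (B, 256, 152, 152)
```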

Divide the obtained 152×152×256 feature map x into four branches and obtain the corresponding parameters (a code sketch of these branches follows the list):
(1) The feature map x undergoes a 3×3 convolution and a 1×1 convolution, reducing the 256 channels to K channels, where K is the number of object categories;
(2) The feature map x undergoes a 3×3 convolution and a 1×1 convolution, reducing the 256 channels to 2 channels, giving the (x, y) offset of the center point;
(3) The feature map x undergoes two 7×7 convolutions, reducing the 256 channels to 10 channels, learning the vectors of the four quadrants plus the external box size of the detection frame, 10 parameters in total;
(4) The feature map x undergoes a 3×3 convolution and a 1×1 convolution, reducing the 256 channels to 1 channel, giving the parameter α that decides whether HBB or RBB is used.
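Below is a minimal PyTorch sketch of these four heads. It is my own reconstruction, not the authors' code: the intermediate channel width (256), the ReLU activations, and the placement of the sigmoids are assumptions, and `num_classes=15` simply matches the DOTA category count.

```python
import torch.nn as nn

class BBAVectorHeads(nn.Module):
    """Four output branches on the 152x152x256 feature map, as described above."""
    def __init__(self, in_ch=256, num_classes=15):
        super().__init__()
        # (1) heatmap: 3x3 conv + 1x1 conv -> one channel per category
        self.heatmap = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))
        # (2) offset: 3x3 conv + 1x1 conv -> (dx, dy) of the center point
        self.offset = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 1))
        # (3) box parameters: two 7x7 convs -> [t, r, b, l, w_e, h_e] = 10 channels
        self.box = nn.Sequential(
            nn.Conv2d(in_ch, 256, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(256, 10, 7, padding=3))
        # (4) orientation: 3x3 conv + 1x1 conv -> 1 channel (HBB vs RBB)
        self.alpha = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))

    def forward(self, x):
        return {'heatmap': self.heatmap(x).sigmoid(),  # per-class confidence
                'offset': self.offset(x),
                'box': self.box(x),
                'alpha': self.alpha(x).sigmoid()}
```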

2. Heatmap (used to locate the center keypoints of targets)

Heatmaps are often used to locate specific keypoints in an input image; this paper uses the heatmap to detect the center points of arbitrarily oriented objects in aerial images. The heatmap has K channels, each corresponding to one object category, and the map on each channel is passed through a sigmoid function. The predicted heatmap value at a particular center point is treated as the confidence of the detection.

Assume c = (c_x, c_y) is the center point of an oriented bounding box. A 2D Gaussian is placed around each center point c to form the ground-truth heatmap, thereby establishing the position of the center point. The specific Gaussian operation and the associated loss are standard practice and will not be explained in detail here; to help understand how the center point is established through the Gaussian, see Figure 4 below.

Figure 4: A Gaussian surface is fitted to determine the center point; the center point is an integer.
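For concreteness, the sketch below places a 2D Gaussian on a ground-truth heatmap in the standard CenterNet style. The fixed sigma and the 3-sigma patch radius are my assumptions; the paper's exact radius rule is not covered here.

```python
# Place exp(-(dx^2 + dy^2) / (2 * sigma^2)) around an integer center,
# keeping the elementwise max where Gaussians of nearby objects overlap.
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    x0, y0 = int(center[0]), int(center[1])
    radius = int(3 * sigma)  # beyond ~3 sigma the value is negligible
    h, w = heatmap.shape
    ys, xs = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(xs * xs + ys * ys) / (2 * sigma * sigma))
    # clip the Gaussian patch so it stays inside the heatmap
    top, bottom = min(y0, radius), min(h - y0, radius + 1)
    left, right = min(x0, radius), min(w - x0, radius + 1)
    patch = heatmap[y0 - top:y0 + bottom, x0 - left:x0 + right]
    np.maximum(patch,
               gaussian[radius - top:radius + bottom, radius - left:radius + right],
               out=patch)

heatmap = np.zeros((152, 152), dtype=np.float32)  # one category channel
draw_gaussian(heatmap, center=(40, 60), sigma=2.5)
```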

3. Center point offset (to compensate for the difference between the scaled floating-point center and the quantized integer center)

Peak points are extracted from the predicted heatmap P as the center point positions of objects. These center points c are integers. However, downscaling a point from the input image to the output heatmap produces a floating-point value. To compensate for the difference between the two, an offset map is predicted so that the gap between the scaled floating-point center and the quantized integer center is corrected, making the center point obtained from the heatmap more accurate.

The offset between the scaled floating-point center and the quantized center is defined as:

o=\left(\frac{\bar{c}_{x}}{s}-\left\lfloor \frac{\bar{c}_{x}}{s} \right\rfloor,\ \frac{\bar{c}_{y}}{s}-\left\lfloor \frac{\bar{c}_{y}}{s} \right\rfloor\right)

The offset is optimized with the smooth L1 loss.
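A tiny check of the formula, assuming the stride s = 4 implied by the 608→152 resolution (the example coordinates are made up):

```python
import numpy as np

s = 4.0
c = np.array([123.0, 310.0])   # center in input-image coordinates (example values)
scaled = c / s                 # floating-point center on the 152x152 map
quantized = np.floor(scaled)   # integer center used by the heatmap
offset = scaled - quantized    # ground-truth offset, each component in [0, 1)
print(offset)                  # [0.75 0.5 ]
```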

4. Box parameters

To address the disadvantages of the center+wh+θ method:
(1) A small angle change has a marginal impact on the total loss in training, but it may cause a large IoU difference between the predicted box and the ground-truth box.
(2) For each object, the w and h of its OBB are measured in a separate rotated coordinate system at an angle θ with respect to the y-axis, so it is challenging for the network to jointly learn the box parameters of all objects.
The paper proposes box boundary-aware vectors to describe OBBs.

Design of the BBAVector: (1) It consists of the top t, right r, bottom b, and left l vectors emanating from the object's center point, and these four vectors are distributed over the four quadrants of the Cartesian coordinate system. All arbitrarily oriented targets share the same coordinate system, which facilitates the mutual transfer of target information and thereby improves the generalization ability of the model. (2) Four vectors t, b, r, l are deliberately designed, rather than only top/bottom or left/right, so that more mutual information can be shared when some local features are blurred and weak.

The box parameter is defined as b=[t,r,b,l,w_{e},h_{e}], where the top t, right r, bottom b, and left l vectors are the BBAVectors, and w_{e} and h_{e} are the width and height of the external horizontal box of the OBB. The four vectors contribute 2×4 = 8 parameters and w_{e}, h_{e} contribute 2 more, 10 parameters in total, corresponding to the 10 channels learned by the third branch in Figure 3. The smooth L1 loss is again used to optimize these parameters.
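The sketch below decodes the four OBB corners at a center point from these parameters. It relies on the fact that, for a rectangle, the sum of two adjacent boundary vectors points from the center to a corner; the channel ordering and names are my assumptions, not necessarily the authors' exact decoding.

```python
import numpy as np

def decode_obb(center, box_params):
    """center: (cx, cy); box_params: [tx, ty, rx, ry, bx, by, lx, ly, w_e, h_e]."""
    c = np.asarray(center, dtype=np.float32)
    t, r = box_params[0:2], box_params[2:4]
    b, l = box_params[4:6], box_params[6:8]
    # each corner = center + the two adjacent boundary vectors
    tr = c + t + r  # top-right
    br = c + b + r  # bottom-right
    bl = c + b + l  # bottom-left
    tl = c + t + l  # top-left
    return np.stack([tr, br, bl, tl])

# an axis-aligned 80x40 example (image coordinates, y pointing down)
params = np.array([0, -20, 40, 0, 0, 20, -40, 0, 80, 40], dtype=np.float32)
print(decode_obb((76, 76), params))
```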

5. Orientation determination

For objects almost aligned with the xy-axes, i.e., targets that lie horizontally or vertically in the Cartesian coordinate system, rotated-box detection is unnecessary, and the RBB branch may even cause detection failure, as shown in Figure 5(b); Figure 5(c) shows HBB detection.

Figure 5: Input image (a), RBB detection (b), and HBB detection (c).

The reason RBB fails on objects with no angle change is that, at the quadrant boundaries, the type of each vector becomes hard to distinguish. To solve this problem, OBBs are divided into two categories and handled separately: HBB and RBB, where RBB covers all rotated bounding boxes except horizontal ones. The benefit of this classification strategy is that it converts the small-angle case into the horizontal case, which is easy to handle. When the network encounters a corner case, the orientation category and the external box size help it capture an accurate OBB.

Therefore, an orientation class parameter α is defined, learned by the fourth branch in Figure 3:

\hat{\alpha}=\begin{cases}1\ (\text{RBB}), & \text{if } \mathrm{IoU}(\mathrm{OBB},\ \mathrm{HBB})<0.95\\ 0\ (\text{HBB}), & \text{otherwise}\end{cases}

When the intersection over union (IoU) between the oriented bounding box (OBB) and its horizontal bounding box (HBB) is less than 0.95, the box is treated as an RBB; when it is greater than or equal to 0.95, it is treated as an HBB. The binary cross-entropy loss is used to train this branch.
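A sketch of generating the ground-truth α from this rule, using shapely for the polygon IoU (the helper name and the corner-list input format are mine):

```python
from shapely.geometry import Polygon

def orientation_class(obb_corners, threshold=0.95):
    """obb_corners: four (x, y) points. Returns 1 for RBB, 0 for HBB."""
    obb = Polygon(obb_corners)
    xs = [p[0] for p in obb_corners]
    ys = [p[1] for p in obb_corners]
    # the external horizontal bounding box (HBB) of the OBB
    hbb = Polygon([(min(xs), min(ys)), (max(xs), min(ys)),
                   (max(xs), max(ys)), (min(xs), max(ys))])
    iou = obb.intersection(hbb).area / obb.union(hbb).area
    return 1 if iou < threshold else 0

print(orientation_class([(10, 0), (20, 10), (10, 20), (0, 10)]))  # 45-degree box -> 1 (RBB)
print(orientation_class([(0, 0), (40, 0), (40, 20), (0, 20)]))    # axis-aligned -> 0 (HBB)
```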

III. Experimental results

 The DOTA dataset and HRSC2016 dataset are used for experimental verification.

On the DOTA dataset, the method reaches an mAP of 75.36.

On the HRSC2016 dataset, it achieves an mAP of 88.6.

IV. Conclusion

An oriented object detection method based on box boundary-aware vectors and center point detection is proposed. The method is single-stage and anchor-free. Compared with baseline methods that directly learn the width, height, and angle of oriented bounding boxes, the proposed box boundary-aware vectors capture oriented bounding boxes better. Results on the HRSC2016 and DOTA datasets show that the method outperforms the state of the art.

This article focuses on interpreting the method and its principles; for detailed experimental results, please refer to the paper directly. If you have any questions, please discuss in the comment area!

Origin: blog.csdn.net/weixin_42715977/article/details/130407821