Overview of rotated object detection (continuously updated)



Preface (all detection models)

Models are listed roughly in chronological order (with generally increasing performance).


1.R-DFPN

2.DRBox
3.S2ARN
4.R^2CNN
5.RRPN
6.RetinaNet-H/R
7.ICN
8.FADet
9.R^3Det
10.RSDet (2019.12, mAP 74.1 on DOTA)
11.SCRDet (ICCV 2019, mAP 75.35 on DOTA)
12.P-RSDet (CVPR 2020, mAP 72.3 on DOTA)

4. R^2CNN (2017)

Paper: R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
R2CNN uses Faster R-CNN as its basic framework.

1. Representation method

The ICDAR 2015 dataset is a scene text detection dataset in which each text instance is annotated with a slanted box given by four corner points (x1, y1, x2, y2, x3, y3, x4, y4) arranged clockwise, as in figure a below. This paper instead represents the box with two corner points plus the box height, (x1, y1, x2, y2, h): the first point is the top-left corner and the second point is the next corner in clockwise order, as in figures b and c below.
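A minimal sketch of this conversion, assuming the annotation is already a rotated rectangle with vertices in clockwise order starting from the top-left (names are illustrative):

import numpy as np

def quad_to_r2cnn(quad):
    # quad: (4, 2) array of clockwise corner points starting from the top-left.
    # Returns (x1, y1, x2, y2, h), where h is the length of the side adjacent to the first edge.
    p1, p2, p3, _ = np.asarray(quad, dtype=np.float32)
    h = np.linalg.norm(p3 - p2)
    return p1[0], p1[1], p2[0], p2[1], h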

2. Highlight 1: Add anchor

The first stage: horizontal proposals are obtained through the RPN. Since many text instances are very small, the anchor scales (8, 16, 32) of Faster R-CNN are changed to (4, 8, 16, 32); experiments show that adding the small scale clearly improves detection.

3. Highlight 2: Add multi-scale ROIPooling and oblique frame FC

The second stage: for each proposal, considering that the widths and heights of text boxes can differ greatly, two extra pooled sizes, 11×3 and 3×11, are added. ROI pooling is performed in parallel with the three pooled sizes (7×7, 11×3, 3×11), the pooled features are concatenated, and axis-aligned box prediction, inclined box prediction and classification are then performed through fc6 and fc7. Finally, inclined-box NMS is applied as post-processing to obtain the result.
Predicting the axis-aligned box together with the inclined box improves inclined-box detection, which the experiments also confirm.
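A minimal sketch of the parallel multi-size ROI pooling followed by concatenation, using torchvision's roi_pool (an illustration, not the paper's implementation):

import torch
from torchvision.ops import roi_pool

def multi_size_roi_features(feat, rois, spatial_scale=1.0 / 16):
    # feat: (N, C, H, W) feature map; rois: (num_rois, 5) with columns (batch_idx, x1, y1, x2, y2).
    pooled = []
    for size in [(7, 7), (11, 3), (3, 11)]:
        p = roi_pool(feat, rois, output_size=size, spatial_scale=spatial_scale)
        pooled.append(p.flatten(start_dim=1))
    return torch.cat(pooled, dim=1)  # concatenated features, fed to fc6/fc7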

4. Highlight 3: Inclined NMS

5. Loss function, experimental results

The definition of the loss function in the first stage is the same as that of Faster RCNN.

In the second-stage loss, an inclined-box regression loss is added. The coordinate targets (x1, y1, x2, y2) of the inclined box are defined in the same way as the x/y regression targets in Faster R-CNN, and the h target is defined in the same way as the h regression target in Faster R-CNN.

5. RRPN (2018, oblique text)

1 Overview

The algorithm is derived from Faster R-CNN; the differences lie in proposal extraction and bounding-box regression. For candidate target regions, Rotation Region-of-Interest (RRoI) pooling is used; for box regression, (x, y, h, w, θ) are regressed simultaneously, i.e. the rotation factor is integrated into the region proposal network so that regions at arbitrary angles can be extracted. RoI pooling is accordingly replaced by Rotation RoI pooling, and angle regression is used in the subsequent refinement, so the final detection results are strong.

The main content of the article:

1) A region-proposal-based text detection method, different from segmentation-based methods, is proposed; it combines RRoI (Rotation Region-of-Interest) pooling with rotated region-of-interest learning while keeping text detection efficient.
2) A new strategy for refining arbitrarily rotated text regions is proposed, improving the detection of rotated text.
3) The method is more accurate and efficient than previous methods on MSRA-TD500, ICDAR2013 and ICDAR2015.

2. Network structure


3. Rotated Bounding Box Representation

A text annotation is represented by (x, y, h, w, θ): the coordinates (x, y) are the geometric centre of the box, the height h is the short side, the width w is the long side, and the angle θ is the rotation of the long side with respect to the positive X axis, restricted to the range [−π/4, 3π/4). Using these five variables has three benefits:

1) the angle difference between two rotated boxes is relatively easy to compute;
2) compared with the traditional 8-point representation of the bounding box, this representation makes regressing rotated detection boxes easier;
3) the ground truth of a rotated training image can be computed efficiently.
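A minimal sketch of keeping θ inside [−π/4, 3π/4) and measuring the angle difference between two boxes (helper names are illustrative):

import numpy as np

def normalize_theta(theta):
    # Wrap an angle in radians into the range [-pi/4, 3*pi/4).
    return (theta + np.pi / 4) % np.pi - np.pi / 4

def angle_diff(theta_a, theta_b):
    # Smallest absolute difference between two box angles, given the period of pi.
    d = abs(normalize_theta(theta_a) - normalize_theta(theta_b))
    return min(d, np.pi - d)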

4. Rotation Anchors

Because real detection scenes are complex, a rotation angle is added to the anchor parameters instead of using the usual two-dimensional (scale, aspect ratio) anchors. Six angles are used to control the extraction of candidate regions: (−π/6, 0, π/6, π/3, π/2, 2π/3) (why exactly these six angles is explained later). Three aspect ratios are used (1:2, 1:5, 1:8) and three scales (8, 16, 32), giving 54 five-dimensional anchors per location (6 × 3 × 3, each parameterized as (x, y, h, w, θ)). For a feature map of width W and height H, W × H × 54 anchors are generated. Figure 3 illustrates the anchor strategy.
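A minimal sketch of enumerating these rotated anchors at one feature-map location (the mapping from scale to pixel size is an assumption for illustration):

import itertools
import numpy as np

ANGLES = [-np.pi / 6, 0, np.pi / 6, np.pi / 3, np.pi / 2, 2 * np.pi / 3]
RATIOS = [2, 5, 8]       # long side : short side of 2:1, 5:1, 8:1
SCALES = [8, 16, 32]

def anchors_at(cx, cy, stride=16):
    # Return the 54 anchors (x, y, h, w, theta) centred at one location.
    anchors = []
    for scale, ratio, theta in itertools.product(SCALES, RATIOS, ANGLES):
        base = scale * stride            # assumed base size in pixels
        h = base / np.sqrt(ratio)        # short side
        w = base * np.sqrt(ratio)        # long side
        anchors.append((cx, cy, h, w, theta))
    return np.array(anchors)             # shape (54, 5)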

5. Learning of Rotated Proposal

The RPN is trained on top of these rotated anchors; unlike the traditional Faster R-CNN, which relies solely on IoU, the criteria for positive and negative samples also involve the angle:

Positive sample: IoU with a GT box greater than 0.7, and angle difference with that GT box less than π/12.
Negative sample: IoU with every GT box less than 0.3, or IoU with a GT box greater than 0.7 but angle difference with that GT box greater than π/12.
Anchors that fall into neither case are not used during training.
The loss is defined as L = L(regression) + L(classification), using smooth L1 and cross entropy respectively.
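A minimal sketch of this assignment rule (iou is the rotated IoU with the best-matching GT box and angle_diff the angle difference with it, e.g. computed with the helpers sketched in this section):

import math

def assign_label(iou, angle_diff, ang_thr=math.pi / 12):
    # Returns 1 (positive), 0 (negative) or -1 (ignored during training).
    if iou > 0.7 and angle_diff < ang_thr:
        return 1
    if iou < 0.3 or (iou > 0.7 and angle_diff > ang_thr):
        return 0
    return -1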

Why only six anchor angles are chosen:
the paper fixes the angle range to [−π/4, 3π/4) and allows a margin of π/12 when dividing positive and negative samples, so partitioning the range accordingly yields exactly these six groups of angles.

To show that the angle can indeed be learned from the feature map, Figure 5 compares feature maps after different numbers of training rounds; the small white line segments mark the parts with higher responses to the anchors.

6. Optimization of the region extraction network (Accurate Proposal Refinement)

IoU computation for obliquely intersecting boxes: traditionally the rectangles involved in IoU computation are axis-aligned, an assumption that does not hold in this paper's setting, so the paper proposes a method for computing the overlap area of oblique rectangles (Algorithm 1, illustrated in Figure 6). In Figure 6 the overlap region is split into several triangles by the green dashed lines, and its area is obtained as the sum of the triangle areas.
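The paper's Algorithm 1 triangulates the intersection polygon. As a sketch of the same quantity, rotated-box IoU can also be computed with a general polygon library such as shapely (an illustrative alternative, not the paper's implementation):

import numpy as np
from shapely.geometry import Polygon

def rbox_to_polygon(x, y, h, w, theta):
    # Corner polygon of a rotated rectangle with centre (x, y), short side h, long side w.
    c, s = np.cos(theta), np.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(x + c * u - s * v, y + s * u + c * v) for u, v in pts])

def rotated_iou(box_a, box_b):
    pa, pb = rbox_to_polygon(*box_a), rbox_to_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)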
NMS for obliquely intersecting detections: traditional NMS considers only IoU, which is also insufficient here, so the paper gives a new algorithm that considers both IoU and rotation angle. It has two steps: 1) keep proposals whose IoU is greater than 0.7; 2) for proposals whose IoU lies between 0.3 and 0.7, keep the one with the smallest angle difference from the GT (which should be less than π/12).

7. RRoI Pooling Layer

RRoI pooling is proposed to avoid the loss caused by applying conventional RoI pooling to targets that have an angle. The principle is shown in Figure 7: the text region is divided into equal grids along the text direction (figure a), and the values in these grids are mapped to the final output (figure b).


6. DRBox (2017, detector using RBox)

Paper: Learning a Rotation Invariant Detector with Rotatable Bounding Box
Contribution: one of the earlier works to point out the difficulties of object detection in aerial imagery.

1. Framework

[1] The preceding convolutional layers extract features, and the final prediction layer outputs the predicted values;

[2] The final prediction layer has K channels, corresponding to K predefined RBoxes at each position (i.e. RBoxes of various fixed angles and sizes in the figure); for each RBox the prediction layer outputs a confidence (foreground/background probability) and a 5-dimensional vector (the offsets of the predicted RBox relative to the predefined RBox);
[3] A decoding step converts the offsets into the final predicted RBox;

[4] Non-maximum suppression;

[5] Multi-angle prior RBoxes: at each position the prior RBoxes take a series of rotation angles, while the aspect ratio of the prior RBoxes is fixed per object category, which reduces the total number of priors (though this may not adapt to more kinds of objects, since different categories have different aspect ratios).

With this strategy of predefined multi-angle prior RBoxes, training effectively divides the detection task into a series of subtasks (one per angle range): each subtask concentrates on a small angle range, which reduces the difficulty caused by object rotation.

2 Model training

[1] During training, each ground-truth RBox is assigned to predefined RBoxes according to the ArIoU between them: a pair with ArIoU(P, G) > 0.5 is a match and is treated as a positive sample, which is subsequently used to compute the position (x, y) and angle losses (see the sketch after this list).
[2] To balance the numbers of positive and negative samples, hard negative mining is applied.
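A minimal sketch of an angle-related IoU of this kind; my reading is that the ordinary overlap is computed with the prior's angle aligned to the ground truth and then weighted by the cosine of the angle difference, so the exact definition in the paper may differ:

import numpy as np
from shapely.geometry import Polygon

def rbox_polygon(x, y, h, w, theta):
    c, s = np.cos(theta), np.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(x + c * u - s * v, y + s * u + c * v) for u, v in pts])

def ariou(prior, gt):
    # prior, gt: (x, y, h, w, theta). Overlap is computed with the prior's angle set to the GT angle.
    aligned = (*prior[:4], gt[4])
    pa, pg = rbox_polygon(*aligned), rbox_polygon(*gt)
    inter = pa.intersection(pg).area
    iou = inter / (pa.area + pg.area - inter + 1e-9)
    return iou * abs(np.cos(prior[4] - gt[4]))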

3 Implementation details

[1] PYRAMID INPUT: the original image is rescaled to several resolutions, and a series of overlapping 300×300 sub-images is then cropped out. The DRBox model runs inference on each sub-image but only detects objects of an appropriate size; non-maximum suppression is applied to the detections of the whole image to suppress duplicate detections across sub-images. The pyramid-input strategy also lets the detection network share features between large and small objects (because they may be the same object at different scales).
[2] CONVOLUTION ARCHITECTURE: DRBox uses a truncated VGG network for detection, removing the fully connected layers and the convolution and pooling layers after conv4_3; a 3×3 convolutional layer is appended after conv4_3 of the truncated VGG. The receptive field of DRBox covers about 108×108 pixels, so objects larger than 108 pixels may not be detected.
[3] PRIOR RBOX SETTINGS: the pyramid-input strategy ensures that the predefined RBoxes cover most targets of different sizes, so the model can detect objects of different sizes. The author trains three separate DRBox models for car, ship and aircraft detection. Because the bow and stern of a ship are hard to distinguish, ship angles are set to 0:30:180; the head and tail of cars and aircraft are easy to distinguish, so their angles are set to 0:30:330.

7. R3Det (2019, single-stage)

R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object
FPN-based one-stage detector

1. Representation method and framework:

Boxes are represented as (x, y, w, h, θ) with θ restricted to [−90°, 0).
Stage 1: horizontal anchors are generated to provide more proposals for the next stage.
Stage 2 (refinement stage): rotated anchors are used. Many refined detectors perform multiple rounds of classification and regression on the same feature map, ignoring the feature shift caused by the changing bounding-box locations, which hurts categories with large aspect ratios or small sample sizes. This paper proposes to re-encode the location information of the refined bounding boxes into the corresponding feature points, thereby reconstructing the whole feature map and achieving feature alignment.

2. Rotation RetinaNet

The network is an advanced one-stage detector consisting of a backbone and classification/regression sub-networks; each FPN level is connected to a classification/regression sub-network. RetinaNet's focal loss addresses class imbalance, and an additional angle offset is predicted in the regression sub-network.
In the regression formulas, x, y, w, h, θ denote the box centre coordinates, width, height and angle; x, x_a and x' stand for the ground truth, the anchor box and the prediction, respectively.
In the multi-task loss, N is the number of anchors; t'_n takes the value 0 or 1 (1 for foreground, 0 for background, and background contributes no regression loss); v'_j is the predicted offset vector and v_j is the ground-truth target vector; t_n is the target class and p_n is the per-class probability computed by sigmoid. L_cls is focal loss and L_reg is smooth L1 loss.
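A minimal sketch of the usual five-parameter regression-target encoding used by this family of detectors (the exact normalization in the paper may differ slightly):

import numpy as np

def encode_targets(gt, anchor):
    # gt, anchor: (x, y, w, h, theta). Returns the regression targets (tx, ty, tw, th, ttheta).
    xg, yg, wg, hg, tg = gt
    xa, ya, wa, ha, ta = anchor
    tx = (xg - xa) / wa
    ty = (yg - ya) / ha
    tw = np.log(wg / wa)
    th = np.log(hg / ha)
    tt = tg - ta          # plain angle offset, in the same unit as theta
    return np.array([tx, ty, tw, th, tt])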

3. Refined Rotation RetinaNet

The refinement stage can be stacked: the foreground/background IoU thresholds are 0.5/0.4 for the first refinement and 0.7/0.6 for the second.
L_i is the loss of the i-th refinement stage and a_i is its trade-off coefficient, 1 by default.

4. Feature Refinement Module

Many refined detectors reuse the same feature map for multiple rounds of classification and regression, ignoring the feature shift caused by changing bounding-box locations, which hurts categories with large aspect ratios or small sample sizes. This paper proposes to re-encode the position information of the refined bounding box into the corresponding feature points (via bilinear interpolation), reconstructing the whole feature map and achieving feature alignment.

The feature interpolation is standard bilinear interpolation.
The concrete procedure is: a two-branch convolution is used and the resulting feature maps are added to obtain new features. In the refinement stage, only the highest-scoring bounding box at each feature point is kept, which improves speed and guarantees that each feature point corresponds to only one refined bounding box. For each feature point of the feature map, the feature vectors at the five locations given by the refined bbox are fetched from the feature map, made more accurate by bilinear interpolation, summed, and used to replace the current feature vector; after traversing all feature points the whole feature map is reconstructed, and finally the reconstructed feature map is added to the original feature map to complete the process.
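A minimal sketch of fetching a feature vector at a fractional location by bilinear interpolation, the basic operation FRM relies on (assumes 0 <= x < W and 0 <= y < H):

import numpy as np

def bilinear_sample(feat, x, y):
    # feat: (C, H, W) array; (x, y): fractional location in feature-map coordinates.
    c, h, w = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[:, y0, x0] + ax * feat[:, y0, x1]
    bottom = (1 - ax) * feat[:, y1, x0] + ax * feat[:, y1, x1]
    return (1 - ay) * top + ay * bottom   # interpolated feature vector of shape (C,)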

FRM preserves the fully convolutional structure, with higher efficiency and fewer parameters.

5. Experiment

The DOTA dataset contains 15 categories. In the experiments the images are divided into 600×600 sub-images and rescaled to 800×800. ResNet-FPN and MobileNetV2-FPN are used as backbones, all pretrained on ImageNet. The anchor areas on pyramid levels P3-P7 range from 32×32 to 512×512; each pyramid level uses 7 aspect ratios (1, 1/2, 2, 1/3, 3, 1/5, 5) and 3 scales (2^0, 2^(1/3), 2^(2/3)), and the rotated anchors add 6 angles (-90°, -75°, -60°, -45°, -30°, -15°).

8. CAD-Net (2019, mAP 69.9 on DOTA)

Paper: CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery
CAD-Net uses five-parameter regression.

1. Overall composition

The feature pyramid network (beige), the global context network (GCNet, highlighted in cyan) and the pyramid local context network (PLCNet, highlighted in purple) learn scene-level global context and object-level local context, respectively. A spatial-and-scale-aware attention module (light green) is designed to guide the network to focus on more informative regions at appropriate feature scales while suppressing irrelevant information. On top of the standard horizontal bounding box (HBB) regression, an oriented bounding box (OBB) regression branch is added so that the outputs match the arbitrary orientations of targets in remote sensing images.
[1] Global Context Network (GCNet): learns the correlation between the object of interest and its global scene, i.e. between object features and whole-image features. GCNet is motivated by the fact that optical remote sensing images usually cover large areas, and scene-level semantics often provide important clues about target locations and categories; for example, ships usually appear on oceans or rivers, and helicopters rarely appear around residential areas.
[2] Pyramid Local Context Network (PLCNet): learns multi-scale co-occurring features and/or co-occurring objects around the object of interest. Compared with images from ground-based sensors, top-down remote sensing images usually contain richer and more discriminative co-occurrence cues that are very useful for object classification and localization, such as vehicles appearing nearby, ships in a port, or bridges over rivers.
PLCNet extracts features of the proposed regions from each feature scale and learns the correlations between them as supplementary information for detection.
[3] Spatial-and-Scale-Aware Attention Module: this module learns to adaptively attend to more salient regions (spatial awareness) at the relevant feature-map scale (scale awareness). Spatially aware features help the network handle objects with sparse texture and low contrast against the background, while scale-aware features help handle objects at different scales; the combination of the two benefits detection in remote sensing images.
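A minimal sketch of a spatial attention gate of this kind (a conceptual illustration, not CAD-Net's exact module): a small convolutional head predicts a per-pixel weight in [0, 1] that re-weights the feature map.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Predict a per-pixel weight in [0, 1] and re-weight the input feature map.
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, kernel_size=1),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        attn = torch.sigmoid(self.score(x))    # (N, 1, H, W) attention map
        return x * attn                        # salient regions kept, background suppressed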

Contextual information is used to provide additional guidance for targets with low-contrast visual cues, while the spatial-and-scale-aware attention is designed for robustness to scale variation and noise.

For optical remote sensing images, due to various noises and loss of information, distinguishable features such as edges and texture details of images are often severely degraded. In this case, the global and local contexts, which are often closely related to the object of interest, become very important, and they need to be combined to compensate for feature degradation and information loss.

Figure 5 illustrates the proposed spatial-and-scale-aware attention responses at different feature scales; brighter areas indicate higher attention responses.
The proposed module is able to focus on informative regions at appropriate feature scales while suppressing irrelevant and noisy regions.

2. Experimental results

Data preprocessing: Optical remote sensing images usually have huge image sizes, for example, the size of DOTA images can reach 6000×6000 pixels.

To fit into hardware memory during training, the images are cropped into blocks of 1600×1600 pixels with 800 pixels of overlap between adjacent blocks.

During inference, image patches of 4096×4096 pixels are cropped from the test image with 1024 pixels of overlap between adjacent patches; zero padding is applied if the image is smaller than the crop size.
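A minimal sketch of this overlap-cropping scheme (illustrative; the padding and edge handling in the paper may differ):

import numpy as np

def crop_patches(image, patch=1600, overlap=800):
    # Yield (x0, y0, tile) patches covering the image with the given overlap.
    h, w = image.shape[:2]
    stride = patch - overlap
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            tile = image[y0:y0 + patch, x0:x0 + patch]
            if tile.shape[0] < patch or tile.shape[1] < patch:   # zero-pad border tiles
                padded = np.zeros((patch, patch) + image.shape[2:], dtype=image.dtype)
                padded[:tile.shape[0], :tile.shape[1]] = tile
                tile = padded
            yield x0, y0, tile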

9. RoI Transformer (CVPR 2019)

1. Disadvantages of rotating anchor

A large number of anchors increases the amount of computation in the network and also reduces the efficiency of matching candidate regions to ground truth. Moreover, because so many rotated anchors are redundant, matching oriented bounding boxes (OBBs) directly is more difficult than matching horizontal bounding boxes (HBBs). Both RRPN and DRBox therefore adopt relaxed matching strategies when designing rotated anchors: some anchors that do not reach an IoU of 0.5 with any ground truth are still labelled as true positive samples, which can still lead to misalignment.

2. STN and deformable convolution

Spatial Transformer Networks and Deformable Convolution

Because the ordinary convolutions used in conventional object detection networks generalize poorly to rotation and scale changes, some orientation and scale invariance is needed in the design of RoIs and the features extracted from them. To this end, the spatial transformer, deformable convolution and deformable RoI pooling have been proposed to model geometric changes; however, they are designed for general geometric deformation and do not exploit oriented-bounding-box annotations. STN consists of three parts, which together map the U matrix (the original image or a feature map of some layer) to the V matrix (a region of interest of custom size):
[1] Localization net: generates the transformation parameters θ;
[2] Grid generator: finds, for each coordinate of V, the corresponding coordinate in U;
[3] Sampler: fills in the values according to this mapping.
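A minimal sketch of the grid-generator and sampler steps using PyTorch (the localization net that predicts theta is omitted; here theta is the 2×3 affine matrix expected by affine_grid):

import torch.nn.functional as F

def spatial_transform(U, theta, out_size):
    # U: (N, C, H, W) feature map; theta: (N, 2, 3) affine matrices; out_size: (h, w).
    n, c = U.shape[:2]
    grid = F.affine_grid(theta, size=(n, c, *out_size), align_corners=False)  # grid generator
    V = F.grid_sample(U, grid, align_corners=False)                           # sampler
    return V   # (N, C, h, w) sampled region of interest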

3. RoI Transformer

This paper proposes a module named RoI Transformer (which models only rigid transformations, learned in the form (dx, dy, dw, dh, dθ)); through supervised RRoI learning and position-sensitive-alignment-based feature extraction, it enables the detection of oriented and dense objects within a two-stage framework.
It consists of two parts: [1] the RRoI Learner, which learns the transformation from HRoI to RRoI; [2] Rotated Position Sensitive RoI Align (a light-head structure is used for all RoI-wise operations), which extracts rotation-invariant features from the RRoI for subsequent object classification and position regression.
Figure 2: Architecture of RoI Transformer. For each HRoI, it is passed to the RRoI learner. The RRoI learner in our network is PS RoI Align followed by a fully connected layer of dimension 5 that regresses the offset of the RGT relative to the HRoI. The Box decoder is at the end of the RRoI Learner, which takes the HRoI and offset as input and outputs the decoded RRoI. The feature maps and RRoI are then passed to RRoI Warping for geometrically robust feature extraction. The combination of RRoI Learner and RRoI Warping constitutes RoI Transformer (RT). The geometrically robust pooled features from the RoI Transformer are then used for classification and RRoI regression.

3.1 RRoI Learner

To train the fully connected layer, each input HRoI is matched against the oriented-bounding-box (OBB) ground truth. For computational efficiency, the matching is done between the HRoI and the axis-aligned bounding boxes of the ground truth rather than the original OBBs. Once an HRoI is matched, t*_θ is set directly by the definition in Equation (1). Smooth L1 is used as the regression loss. For each prediction t in a forward pass, the offsets are decoded into the parameters of an RRoI; in this way the proposed RRoI Learner learns RRoI parameters from the HRoI feature map F.
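A minimal sketch of decoding such offsets into an RRoI, assuming a standard encoding (centre offsets scaled by the HRoI size, log-scale width/height factors, and a plain additive angle offset; the paper's exact normalization, e.g. of the angle term, may differ):

import numpy as np

def decode_rroi(hroi, offsets):
    # hroi: (xc, yc, w, h) horizontal RoI; offsets: (dx, dy, dw, dh, dtheta).
    xc, yc, w, h = hroi
    dx, dy, dw, dh, dtheta = offsets
    x = xc + dx * w
    y = yc + dy * h
    rw = w * np.exp(dw)
    rh = h * np.exp(dh)
    theta = dtheta          # the HRoI angle is 0, so the decoded angle equals the offset
    return x, y, rw, rh, theta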

3.2 Rotated Position Sensitive RoI Align

After obtaining the RRoI parameters, the Rotated Position Sensitive (RPS) RoI Align module is used to extract rotation-invariant features in the network.
The combination of the RRoI Learner and RPS RoI Align forms the RoI Transformer (RT) module, which can be used in place of the normal RoI warping operation. The features pooled by RT are rotation-invariant, and the RRoI provides a better initialization for the later regression, since a matched RRoI is closer to the rotated ground truth (RGT) than a matched HRoI. As mentioned before, an RRoI is a tuple of 5 elements.
To avoid ambiguity, h denotes the short side of the RRoI and w the long side.

4. Experimental results

Results of the ablation study. The Light-Head R-CNN OBB detector is used as the baseline; the leftmost column lists the optional settings of the RoI Transformer, and the four experiments on the right explore appropriate settings for it.

10. RSDet (2019.12, mAP 74.1 on DOTA)

Highlights:
uses eight-parameter (corner-point) regression.

Different parameter units affect network performance: in the five-parameter system, the angle, width, height and centre coordinates have different measurement units and show quite different relationships with IoU, as shown in the figure below; simply summing their losses leads to inconsistent regression. The eight-parameter form alleviates this problem because all parameters are corner coordinates with the same units.

To address this, the author designs a boundary-constrained rotation loss, denoted L_mr (the modulated rotation loss); its exact form is given in the paper.
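A minimal sketch of the idea behind such a corner-based rotation loss: since the same rotated box can be described by several corner orderings, the loss takes the minimum over cyclic shifts of the predicted corner sequence (my reading of the modulated loss; the paper's exact formulation may differ):

import numpy as np

def smooth_l1(x, beta=1.0):
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x * x / beta, x - 0.5 * beta).sum()

def modulated_corner_loss(pred, target):
    # pred, target: (4, 2) arrays of the corner points of a rotated box.
    losses = [smooth_l1(np.roll(pred, k, axis=0) - target) for k in range(4)]
    return min(losses)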

11. SCRDet (ICCV 2019, mAP 75.35 on DOTA)

1. Predecessor: R2CNN++

R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy

This is a paper on arbitrary-angle object detection; object coordinates are represented as (x, y, w, h, θ). The authors add two kinds of attention mechanism, channel attention and pixel-level attention, to filter the features.

%% Algorithm Overview
This paper's innovations fall into three parts:

  1. IF-Net: fuses feature maps from two different layers.
  2. MDA-Net: uses channel attention and pixel-level attention mechanisms.
  3. Rotation Branch: after RoI pooling, predicts the arbitrary-angle coordinates (x, y, w, h, θ).


%% MDA-Net Discussion
The channel attention here follows the SENet structure, while the pixel-level attention is trained as a pixel-wise binary-map branch: the target regions are binarized to obtain a binary map containing the targets, and a binary classification loss is constructed so that the model learns a pixel-level attention map.


%% loss
The loss is multi-task, with three terms: classification, coordinate regression and attention supervision.


2. SCRDet

SCRDet: Towards More Robust Detection for Small, Cluttered
and Rotated Objects

First, the article identifies the three problems above that remote sensing object detection faces, and improves on each of them:

  1. For small targets: a feature fusion structure is designed from the perspective of feature fusion and anchor sampling.
  2. For dense arrangements: a supervised multi-dimensional attention network is designed to reduce the adverse effect of background noise.
  3. For arbitrary orientations: an improved smooth L1 loss with an added IoU constant factor is designed, specifically to solve the boundary problem of rotated bounding-box regression.
    The whole framework is based on Faster R-CNN and mainly consists of SF-Net, MDA-Net and IoU-Smooth L1 Loss. The structure diagram is as follows:


1. SF-Net

Small object detection has always been difficult, especially in remote sensing images. The article argues that feature fusion and effective sampling are the keys to detecting small objects better. For anchor-based detectors, the way anchors are laid out directly affects the positive-sample sampling rate. The classic anchor layout is tied to the feature-map resolution, i.e. to the anchor stride (the strides on C2-C5 are 4, 8, 16, 32 respectively). As the network deepens, the feature-map resolution decreases and the anchor stride grows, which often causes small targets to be missed during sampling, as shown in the figure below.

Based on this observation, the article selects an appropriate feature-map resolution by resizing, ensuring that small objects are sampled as much as possible, and uses simple feature fusion to retain rich semantic and location information. C2 is not used because remote sensing detection sets many scales and ratios, so the anchors on the C2 feature map would become too numerous, and the smallest targets in remote sensing datasets are generally larger than 10 pixels (this refers to DOTA 1.0; DOTA 1.5 provides labels below 10 pixels).

The structure diagram of SF-Net is as follows:

Since the paper is based on Faster RCNN, FPN is not considered; but in actual applications, after using so many detection methods, I still find FPN really effective.
Although the structure of SF-Net is quite plain, it is still inspiring for remote sensing detection, especially small object detection: anchor-based methods should fully guarantee the recall of the RPN. Adding anchors is brute force, and a big side effect is that the detector becomes very slow, so I am still looking forward to applying anchor-free methods in remote sensing. My junior colleague has already made preliminary progress on this; see sections 12 and 13.

2. MDA-Net

Due to the complexity of the remote sensing image background, the proposed region generated by RPN may introduce a lot of noise information, as shown in the figure below.


Excessive noise can drown out object information and blur the boundaries between objects, leading to missed detections and more false alarms, so object features need to be enhanced and non-object features weakened. To capture the features of small objects in complex backgrounds more effectively, the article designs a supervised multi-dimensional attention network (MDA-Net), shown in the figure below. Specifically, in the pixel-based attention branch, the feature map F3 is convolved with kernels of different sizes to learn a two-channel saliency map (figure d above) giving foreground and background scores. One channel of the saliency map is selected and multiplied with F3 to obtain a new, more informative feature map A3 (figure c above). Note that after the softmax the saliency values lie in [0, 1]; in other words, noise is reduced while object information is relatively enhanced. Since the saliency map is continuous, the background information is not completely removed, which helps preserve some contextual information and improves robustness.
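A minimal sketch of such a supervised pixel attention branch (a conceptual illustration, not SCRDet's exact layer configuration): a small convolutional head predicts a two-channel foreground/background score map, a softmax turns it into a saliency map, the foreground channel re-weights the features, and the score map is additionally supervised by a binary mask.

import torch.nn as nn
import torch.nn.functional as F

class PixelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),    # background / foreground scores
        )

    def forward(self, f3, mask=None):
        scores = self.head(f3)                         # (N, 2, H, W)
        saliency = F.softmax(scores, dim=1)[:, 1:2]    # foreground channel, values in [0, 1]
        a3 = f3 * saliency                             # re-weighted feature map
        loss = F.cross_entropy(scores, mask) if mask is not None else None
        return a3, loss                                # mask: (N, H, W) binary labels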

In fact this module is nothing special now; it is a combination of spatial attention and channel attention. But in practice, spatial attention really is useful in remote sensing detection, and I basically always add it in competitions. Speaking of which, this is actually the article I submitted to CVPR 2019 (R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy); it was of course rejected, and at ICCV 2019 it was again called an incremental contribution, which I agree with, since it is a summary of my early work on the DOTA leaderboard. The reason it was finally accepted at ICCV 2019 is mainly the third part below.

3. IoU-Smooth L1 Loss

First, we need to understand the two common definitions of a rotated bounding box.

SCRDet uses the OpenCV representation. Under the angle definition commonly used by current rotation detectors, the boundary problem of the rotation angle produces unnecessary loss, as shown in the figure below:

The ideal regression route would rotate the blue box counter-clockwise to the red box, but because of the periodicity of the angle, the loss along that route is very large (see the example on the right of the figure above). The model is then forced to regress in a more complicated way (for example, rotating the blue box clockwise while rescaling w and h), which increases the difficulty of regression. To better address this problem, an IoU constant factor is introduced into the traditional smooth L1 loss: in the boundary cases the new loss is approximately zero, eliminating the sudden jump in loss. The new regression loss can be seen as two parts: the normalized smooth L1 term determines the direction of gradient propagation, and the IoU term determines the gradient magnitude, so the loss function becomes continuous. Furthermore, optimizing the regression with IoU is consistent with the evaluation metric, which is more direct and effective than pure coordinate regression. The IoU-Smooth L1 loss formula is given in the paper.
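A minimal sketch of one way to implement this (following the common open-source pattern where the IoU-derived factor is detached, so the smooth L1 term supplies only the gradient direction; this is an illustration, not the paper's exact code):

import torch

def iou_smooth_l1(reg_pred, reg_target, iou, beta=1.0, eps=1e-6):
    # reg_pred, reg_target: (N, 5) box offsets; iou: (N,) rotated IoU between predictions and GT.
    diff = torch.abs(reg_pred - reg_target)
    sl1 = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum(dim=1)
    # Magnitude comes from -log(IoU); the smooth L1 term only provides the direction.
    factor = (-torch.log(iou.clamp(min=eps))).detach() / (sl1.detach() + eps)
    return (sl1 * factor).mean()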
You can look at the comparison of the effects of the two losses in boundary conditions:

I think the root cause is that the predicted angle goes beyond the defined range. This is not the only way to solve the problem: RRPN and R-DFPN check in their loss formulas whether the angle lies within the defined range and mitigate the issue by adding or subtracting a period, but that approach is clearly not elegant and still problematic, mainly because it is hard to judge how many angle periods the prediction is off by. One could also add a periodic function, such as tan or cos, to the angle part of the loss, but in my experience this often fails to converge. I have actually studied other approaches to the boundary problem, which will be discussed in detail in future articles.

The IoU-Smooth L1 Loss was added at the last minute before the ICCV 2019 deadline; I did not expect it to become the key to getting the paper accepted.

Testing IoU-Smooth L1 Loss on a rotated RetinaNet detection codebase, the effect turned out to be surprisingly good, rising from 62.25 to 68.65; however, a slight change to the configuration file could cause NaN losses, which is difficult to handle.

4. Experimental results

The final experiments were mainly carried out on the DOTA dataset, and the paper was considered SOTA at the time.

Ablation experiment:

Comparative Experiment:

12. Gliding Vertex (CVPR 2020, mAP 75.02 on DOTA)

Paper title: Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection.
This is new work from Prof. Bai Xiang's group at Huazhong University of Science and Technology, released on November 21, 2019, and interestingly it brings the group's OCR expertise to object detection. General object detection represents an object with a non-rotated rectangle; the article argues that for elongated objects (such as slanted lines of Chinese text or ships in aerial images), an oblique object cannot be located precisely with that representation, while if a rotated rectangle is used instead, the rotation angle is difficult to learn. This article therefore learns the offsets of four points on the non-rotated rectangle and locates a quadrilateral to represent the object. The overall process is shown in the figure below.
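A minimal sketch of my reading of this parameterization: each quadrilateral vertex glides along one side of the horizontal box, starting from a corner, by a learned length ratio (the exact convention in the paper may differ):

def gliding_quad(xmin, ymin, xmax, ymax, alphas):
    # alphas: four gliding ratios in [0, 1], one per side of the horizontal box.
    a1, a2, a3, a4 = alphas
    w, h = xmax - xmin, ymax - ymin
    v1 = (xmin + a1 * w, ymin)    # glides along the top side
    v2 = (xmax, ymin + a2 * h)    # glides along the right side
    v3 = (xmax - a3 * w, ymax)    # glides along the bottom side
    v4 = (xmin, ymax - a4 * h)    # glides along the left side
    return [v1, v2, v3, v4]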

1. Network structure

The article uses the Faster R-CNN structure, but the predicted outputs are slightly different: several extra quantities are predicted, as shown below.

2. Label generation


3. Loss function


4. Test and experiment


13. P-RSDet (CVPR 2020, mAP 72.3 on DOTA)

Article title: Object Detection for Remote Sensing
Images Based on Polar Coordinates
Innovation points:
1. Polar coordinates are used, giving a simpler representation with fewer parameters: following CornerNet-style keypoint detection, the pole (x, y) is located and (ρ, θ1, θ2) are regressed.
2. A new loss function, the Polar Ring Area Loss, is added.

1. Framework


The input image is W×H and the output is C × W/d × H/d, where C is the number of categories. The four corner points (x1, y1), (x2, y2), (x3, y3), (x4, y4) in the Cartesian coordinate system can be converted to (ρ1, θ1), (ρ2, θ2), (ρ3, θ3), (ρ4, θ4) in polar coordinates, with:
ρ1 = ρ2 = ρ3 = ρ4
θ3 = θ1 + π, θ4 = θ2 + π
The conversion between the two coordinate systems (i runs from 1 to 4, with (x_p, y_p) the pole; n indexes the category in the paper's notation) is:
x_i = x_p + ρ_i cos(θ_i)
y_i = y_p + ρ_i sin(θ_i)
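A minimal sketch of recovering the four corners from the pole and the regressed (ρ, θ1, θ2), using the relations above:

import numpy as np

def polar_to_corners(xp, yp, rho, theta1, theta2):
    # Return the four corner points of the box given the pole and the polar parameters.
    thetas = [theta1, theta2, theta1 + np.pi, theta2 + np.pi]
    return [(xp + rho * np.cos(t), yp + rho * np.sin(t)) for t in thetas]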

2. Pole point extraction

The pole point is extracted using a Gaussian heat map, and the loss is a focal loss, where N is the number of images and the values of α and β are 2 and 4 respectively. The specific procedure is illustrated in the paper.
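A minimal sketch of a Gaussian heat-map target and the CornerNet-style variant focal loss with α = 2 and β = 4 (illustrative; the names are my own):

import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma):
    # Ground-truth heat map: a 2-D Gaussian bump centred on the pole point.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def variant_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6, n_images=1):
    # pred, gt: heat maps with values in (0, 1); gt equals 1 exactly at the pole points.
    pos = gt == 1
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    neg_loss = ((1 - gt[~pos]) ** beta * pred[~pos] ** alpha
                * np.log(1 - pred[~pos] + eps)).sum()
    return -(pos_loss + neg_loss) / n_images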

3. Loss function

The overall loss has two terms: the first is the focal loss described above; the second is weighted by α = 0.1 and itself consists of two parts:

The first part is the Polar Ring Area Loss proposed by the author, with balance factor λ = 0.01; the second part is an ordinary L1 loss.
Note that when the predicted ρ equals the ground-truth ρ, or the predicted θ equals the ground-truth θ, the Polar Ring Area Loss contributes nothing.
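A minimal sketch of a ring-sector-area penalty consistent with the note above (zero whenever either the radius or the angle matches); this is my reading, and the paper's exact expansion may differ:

def polar_ring_area_loss(rho_pred, rho_gt, theta_pred, theta_gt):
    # Area of the annular sector swept between the predicted and ground-truth (rho, theta).
    return 0.5 * abs(rho_pred ** 2 - rho_gt ** 2) * abs(theta_pred - theta_gt)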

4. Experimental results

