Visible Light Remote Sensing Image Target Detection (3): Arbitrary-Orientation Scene Text Detection

The previous article introduced the main problems faced by target detection in visible light remote sensing images. This article tackles one of them: detecting rotated targets. To make the differences from general target detection easier to follow, the improvements are built on the Faster R-CNN framework.


Note: This article uses scene text detection as its setting, but the core problem it solves is rotated target detection in scene text, and the method is equally applicable to rotated target detection in remote sensing images. The broader takeaway is to read widely and accumulate ideas: problems in one field are often easy to solve by transferring knowledge from another.

Background

Remote sensing image detection has received much attention in recent years. Although existing methods show competitive results, most rely on horizontal or near-horizontal annotations and return horizontal detection regions. In practice, however, a large number of target regions are not horizontal, and treating non-horizontally aligned regions as axis-aligned boxes is inaccurate, so horizontal-box methods cannot be widely applied. In this paper, the authors propose a rotation-based, end-to-end detection system for arbitrary-orientation detection. Its distinguishing feature is that rotation-angle information is incorporated, so the detection system can generate proposals in any orientation.

The authors propose the Rotated Region Proposal Network (RRPN), which generates inclined proposals using orientation-angle information. The angle information is then used in bounding-box regression so that proposals fit the target regions more accurately. A Rotated Region of Interest (RRoI) pooling layer projects arbitrarily oriented proposals onto the feature map. Finally, a two-layer network classifies each region as foreground or background.

Method

1. The overall structure of the model

Figure 1 Framework of rotated target detection based on Faster R-CNN

Anyone who has read the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks will be familiar with this figure. The figure below is the original framework diagram from Faster R-CNN.

Figure 2 Faster R-CNN target detection framework

Comparing the two figures, it is not difficult to see that rotated target detection differs from general target detection mainly in two parts: the region proposal part and the region-of-interest pooling part.

The RPN is the most innovative module of Faster R-CNN, and its importance to the target detection task is self-evident. Briefly, it works as follows. The RPN is a small convolutional network, generally composed of 3×3 and 1×1 convolutions, with an activation function after the 3×3 convolution. Because it contains no fully connected layers, it accepts images of any size and avoids the feature loss caused by operations such as crop and warp. (It is the fully connected layer that constrains the input size, a problem already solved by SPP-Net, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition".)

Convolution is applied over the feature map, and the center of each convolution window corresponds to a region of the original image; this is the concept of the receptive field. Suppose the network has downsampled by a factor of 2 four times: one point on the feature map then corresponds to a 16×16 region of the original image. In each such region, 9 anchor boxes are generated from 3 scales and 3 aspect ratios. For a 40×60 feature map, the total is 40×60×9 = 21,600 anchor boxes. Because the anchor boxes are numerous and many are of poor quality, they are screened using IoU: an anchor with IoU > 0.7 against the ground truth is a positive sample, an anchor with IoU < 0.3 is a negative sample, and anchors with IoU between 0.3 and 0.7 are discarded. If no anchor reaches an IoU of 0.7 for a ground-truth box, the anchor with the largest IoU is taken as a positive sample.

Classification and regression are then performed by a fully convolutional head (1×1 convolutions). Classification is binary: whether the anchor box contains an object. Regression encodes and decodes the offsets between the anchor coordinates and the ground-truth coordinates to produce the predicted box positions. After classification, the predicted boxes are screened with NMS: among boxes with large overlapping areas, low-confidence predictions are deleted and high-confidence ones are retained.

RoI Pooling projects the boxes generated on the original image onto the feature map and applies max pooling. It has two problems. First, if a projected box does not align exactly with the feature-map grid, its boundaries must be quantized onto the grid. Second, the bins are not evenly divided during pooling: for example, when a 5×7 region is pooled into a 2×2 output, 5 is not divisible by 2 and is split into 2 and 3, and 7 is not divisible by 2 and is split into 3 and 4, as shown in Figure 3. The resulting bins are 2×3, 2×4, 3×3 and 3×4. RoI Align was proposed to replace RoI Pooling and fix these quantization problems; it is not covered further here.

Figure 3 RoI Pooling
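
To make the uneven bin split above concrete, here is a minimal NumPy sketch of quantized RoI max pooling. The floor-based bin boundaries and the 5×7 example are illustrative assumptions for this sketch, not an exact reproduction of any particular framework's implementation.

```python
import numpy as np

def roi_max_pool(feat, out_h=2, out_w=2):
    """Quantized RoI max pooling over a 2D feature patch.

    Bin edges come from integer floor division, so a 5x7 patch pooled
    to 2x2 produces the uneven bins 2x3, 2x4, 3x3 and 3x4 from Figure 3.
    """
    h, w = feat.shape
    ys = [i * h // out_h for i in range(out_h + 1)]  # e.g. h=5 -> [0, 2, 5]
    xs = [j * w // out_w for j in range(out_w + 1)]  # e.g. w=7 -> [0, 3, 7]
    out = np.empty((out_h, out_w), dtype=feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feat[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

feat = np.arange(35, dtype=np.float32).reshape(5, 7)
print(roi_max_pool(feat))  # each output cell is the max over one uneven bin
```

RoI Align removes exactly this quantization by sampling the feature map at fractional coordinates with bilinear interpolation instead of snapping bins to the grid.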

2. RRPN module

A rotated box is represented by five parameters: (x, y, h, w, θ). Rotated anchor boxes (r-anchors) are parameterized by angle, scale and aspect ratio: the angles are -π/6, 0, π/6, π/3, π/2 and 2π/3; the scales are 8, 16 and 32; and the aspect ratios are 1:2, 1:5 and 1:8. Each point on the feature map therefore generates 6×3×3 = 54 r-anchors, as shown in Figure 4, compared with only 9 anchors for horizontal boxes. Before training, the labeling rules for positive and negative anchors are defined: 1) an anchor is a positive sample if its IoU with the ground truth is greater than 0.7 and its angle difference with the ground truth is less than π/12; 2) an anchor with IoU less than 0.3 is a negative sample; 3) an anchor with IoU greater than 0.7 but angle difference greater than π/12 is a negative sample; 4) the rest are neither positive nor negative and are not used for training. The loss function adds an angle-regression term to the traditional multi-task loss.

Figure 4 R-anchor generation mechanism
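
As a concrete illustration of the enumeration above, the following sketch builds the 54 r-anchors for one feature-map location. The (x, y, h, w, θ) layout follows the five-parameter representation in this section; the base anchor size is an assumed constant for illustration only.

```python
import itertools
import math

ANGLES = [-math.pi / 6, 0, math.pi / 6, math.pi / 3, math.pi / 2, 2 * math.pi / 3]
SCALES = [8, 16, 32]
RATIOS = [(1, 2), (1, 5), (1, 8)]  # h:w aspect ratios
BASE = 16  # assumed base anchor size (e.g. one feature-map stride)

def r_anchors_at(cx, cy):
    """Enumerate the 54 rotated anchors (x, y, h, w, theta) at one location."""
    anchors = []
    for theta, scale, (rh, rw) in itertools.product(ANGLES, SCALES, RATIOS):
        area = (BASE * scale) ** 2     # keep the area fixed per scale
        h = math.sqrt(area * rh / rw)  # so that h/w matches the aspect ratio
        w = area / h
        anchors.append((cx, cy, h, w, theta))
    return anchors

print(len(r_anchors_at(0.0, 0.0)))  # 54 = 6 angles x 3 scales x 3 ratios
```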

Skew IoU computation is simply IoU computation for rotated boxes. Because the boxes are inclined, their overlap is a general polygon, whose area is harder to compute than the overlap of horizontal boxes. The authors use a very clever method: decompose the intersection polygon into triangles, compute the area of each triangle, sum them, and apply the usual IoU formula to obtain the IoU value, as shown in Figure 5.

Figure 5 Calculation of IoU
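
The triangle-decomposition idea can be sketched in plain Python: clip one rotated rectangle against the other (Sutherland-Hodgman clipping, valid here because rectangles are convex), then compute the intersection area with the shoelace formula, which is itself a signed fan triangulation. This is a minimal illustration of the idea, not the authors' implementation; it assumes a y-up coordinate system so the corners come out counter-clockwise.

```python
import math

def rbox_corners(x, y, h, w, theta):
    """Corners of a rotated box (x, y, h, w, theta), counter-clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in pts]

def clip_polygon(subject, clipper):
    """Sutherland-Hodgman: clip convex `subject` by counter-clockwise `clipper`."""
    def inside(p, a, b):  # p lies on or to the left of directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def cross_point(p, q, a, b):  # segment p-q meets the line through a-b
        d = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / d
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        for j in range(len(inputs)):
            p, q = inputs[j], inputs[(j + 1) % len(inputs)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(cross_point(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(cross_point(p, q, a, b))
        if not output:
            return []
    return output

def polygon_area(poly):
    """Shoelace formula: the sum of signed fan triangles (p0, pi, pi+1)."""
    area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def skew_iou(box1, box2):
    """IoU of two rotated boxes given as (x, y, h, w, theta)."""
    inter_poly = clip_polygon(rbox_corners(*box1), rbox_corners(*box2))
    inter = polygon_area(inter_poly) if inter_poly else 0.0
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / (union + 1e-9)
```

For example, `skew_iou((0, 0, 2, 2, 0), (0, 0, 2, 2, math.pi / 4))` measures how much a square overlaps its 45°-rotated copy (the intersection is a regular octagon).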

Skew Non-Maximum Suppression (skew NMS) introduces angle information into ordinary NMS. It consists of two stages: (1) for proposals whose IoU is greater than 0.7, keep the one with the maximum IoU; (2) if the IoU of all proposals falls in the [0.3, 0.7] range, keep the proposal whose angle difference with the ground truth is minimal (the angle difference should be less than π/12).
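
For intuition, here is a minimal greedy NMS sketch that simply replaces the horizontal IoU with the `skew_iou` function from the previous block. The score-sorted greedy loop is the standard NMS formulation; the two-stage rule described above differs in detail, so treat this only as an approximation.

```python
def skew_nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS over rotated boxes (x, y, h, w, theta) using skew IoU."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining proposal that overlaps the kept one too much.
        order = [i for i in order if skew_iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```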

3. RRoI Pooling layer

For text detection tasks in arbitrary orientations, the conventional RoI pooling layer can only handle horizontal proposals, so the authors propose a rotated RoI pooling layer to handle the arbitrarily oriented proposals that RRPN generates. The layer's hyperparameters are the pooled height H_r and width W_r. A rotated proposal of height h and width w is divided into H_r × W_r sub-regions, each of size (h/H_r) × (w/W_r), as shown in Figure 6. An affine transformation then converts each rotated sub-region into a horizontal one, max pooling is performed within each sub-region, and the pooled value is stored in the corresponding cell of the output matrix. In this way, regions of any orientation are pooled into a fixed-size feature map. Finally, the fixed-length features are sent to the classifier, which outputs foreground or background.

Figure 6 RRoI Pooling Layer
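
Below is a minimal NumPy/OpenCV sketch of the idea, under a simplifying assumption: instead of transforming each sub-region separately, it rotates the whole feature map so the proposal becomes axis-aligned, crops it, and applies the quantized max pooling from earlier. The sign convention for θ, the 7×7 output size (the common Fast R-CNN setting), and the patch-level shortcut are assumptions of this sketch, not the paper's exact layer.

```python
import math

import cv2
import numpy as np

def rroi_pool(feature, box, out_h=7, out_w=7):
    """Pool a rotated proposal (x, y, h, w, theta) to a fixed out_h x out_w grid.

    Affine-rotates the feature map around the box center so the box becomes
    axis-aligned, crops the h x w patch, then max-pools each quantized bin.
    Assumes the box lies inside the image and h >= out_h, w >= out_w.
    """
    x, y, h, w, theta = box
    # Rotate the content so the box edges align with the axes.
    M = cv2.getRotationMatrix2D((x, y), math.degrees(theta), 1.0)
    rotated = cv2.warpAffine(feature, M, (feature.shape[1], feature.shape[0]))
    x0, y0 = max(int(round(x - w / 2)), 0), max(int(round(y - h / 2)), 0)
    patch = rotated[y0:y0 + int(h), x0:x0 + int(w)]
    # Quantized per-bin max pooling, as in ordinary RoI Pooling.
    ph, pw = patch.shape
    out = np.empty((out_h, out_w), dtype=patch.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ys = slice(i * ph // out_h, (i + 1) * ph // out_h)
            xs = slice(j * pw // out_w, (j + 1) * pw // out_w)
            out[i, j] = patch[ys, xs].max()
    return out
```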

Experiment

The experiments are all conducted on scene text detection datasets, and the method achieves strong, SOTA-level results on all of them.

Summary

First, understand the annotation format of rotated target detection and the meaning of each parameter.

Second, deeply understand the working principle of the RPN in Faster R-CNN.

Third, understand the difference between RRPN and RPN, the working principle of RoI Pooling, and the principle of RRoI Pooling.
