Remote sensing image target detection, paper reading notes: Fast and accurate multi-class geospatial object detection with large-size...


Hello everyone, I am a graduate student whose research direction is mainly computer vision. Here I share notes on some of the papers I read. Since my level is limited, I may misunderstand parts of a paper, so please think critically about its correctness when referring to this article. If you find an error, please point it out in the comments or by private message and I will correct it promptly. Thank you!
Paper name: Fast and accurate multi-class geospatial object detection with large-size remote sensing imagery using CNN and Truncated NMS
Paper address: link.
Paper notes and PPT: link. Extraction code: hnai

Introduction

The current difficulties in target detection in remote sensing images:

1. Multi-class geospatial object detection in remote sensing images has broad application prospects, such as urban management and planning, natural disaster early warning, and military surveillance. Object detection on natural images has reached a very high level, but remote sensing images still pose many difficulties, mainly because of their large size, the overhead viewing angle, densely packed targets, and sensitivity to illumination and weather.

2. If a large remote sensing image is directly scaled down to the model's input size, a lot of image detail is lost and the targets become too small for effective feature extraction, so the large image has to be cropped. However, cropping produces a large number of truncated objects, which cause a considerable loss of precision when the cropped results are stitched back together, as shown in the figure below.
Truncated target
The black box on the right captures the complete aircraft, while the green box on the left captures only its tail, truncating the target. The tail alone is still detected, so after the crops are stitched back together both prediction boxes exist at the same time; we want to keep only the blue box and remove the red one.

The method proposed in this paper

To balance accuracy and speed, this paper applies a pruning strategy to YOLO v4 to obtain multi-size YOLO v4 models. To improve detection accuracy, the MIOU loss function is introduced into YOLO v4, and to handle the overlapping detection boxes produced by truncated objects, the paper proposes the Truncated NMS algorithm on the basis of NMS.

The detection pipeline in this paper:
1. Crop the large image into tiles of a suitable size.
2. Feed each cropped tile into the improved YOLO v4 model for detection.
3. Use the Truncated NMS algorithm to remove duplicate boxes.
4. Map the predictions on the cropped tiles back onto the original image to complete the detection.
A small sketch of the tiling and coordinate mapping follows this list.
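To make steps 1 and 4 more concrete, here is a minimal sketch of how overlapping crop windows can be generated and how boxes predicted inside a crop can be mapped back to original-image coordinates. The 1024 window size, 20% overlap, and the (x1, y1, x2, y2) box format are illustrative assumptions, not necessarily the paper's exact settings.

```python
# Minimal sketch of steps 1 and 4: overlapping crop windows plus mapping
# crop-local boxes back to original-image coordinates (illustrative settings).
def crop_windows(img_w, img_h, crop=1024, overlap=0.2):
    """Return (x0, y0) offsets of overlapping crop windows covering the image."""
    stride = int(crop * (1 - overlap))
    xs = list(range(0, max(img_w - crop, 0) + 1, stride))
    ys = list(range(0, max(img_h - crop, 0) + 1, stride))
    # make sure the right/bottom border of the image is also covered
    if xs[-1] + crop < img_w:
        xs.append(img_w - crop)
    if ys[-1] + crop < img_h:
        ys.append(img_h - crop)
    return [(x, y) for y in ys for x in xs]

def to_global(box, offset):
    """Map a box predicted inside a crop back to original-image coordinates."""
    x0, y0 = offset
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)

windows = crop_windows(4096, 4096)
print(len(windows))                              # 25 crops for a 4096 x 4096 image
print(to_global((10, 20, 110, 80), windows[1]))  # box shifted by the crop offset
```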

Multi-size YOLO v4

YOLO v4 network structure

YOLOv4 structure
1. The backbone of YOLO v4 is CSPDarknet53. The input image is downsampled by factors of 8, 16, and 32 to obtain three feature maps. Taking a 1024 x 1024 input as an example, the backbone produces feature maps of 128 x 128, 64 x 64, and 32 x 32.
2. In the neck, YOLO v4 uses the PAN structure. The feature map obtained by 32x downsampling in the backbone goes through convolution, BN, and an activation function, is upsampled, and is fused with the feature map from 16x downsampling; similarly, the fused map is upsampled again and fused with the 8x downsampled feature map. This is the top-down path. The features are then downsampled level by level along a bottom-up path and fused again to obtain the final features. Readers unfamiliar with YOLO v4 can look up one of the many YOLO v4 explanation blogs.

PAN structure
The PAN network is an improvement over the FPN network. FPN propagates semantic information top-down, while PAN adds a bottom-up path; PAN can be understood as an FPN plus a reverse FPN. A minimal fusion sketch is given after the two figures below.
FPN network

PAN network
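To make the top-down plus bottom-up idea concrete, here is a minimal, hypothetical fusion sketch in PyTorch. It is not the actual YOLO v4 neck (which also contains SPP and CSP blocks and fuses by concatenation); the channel widths, addition-based fusion, and module names are simplifying assumptions for illustration only.

```python
# Minimal, hypothetical sketch of top-down (FPN) plus bottom-up (PAN) fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPAN(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        # 1x1 lateral convs bring every level to a common channel width
        self.lat = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in channels])
        # stride-2 convs implement the bottom-up (downsampling) path
        self.down = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                   for _ in range(2)])

    def forward(self, c3, c4, c5):
        # c3/c4/c5: backbone features at stride 8 / 16 / 32
        p5 = self.lat[2](c5)
        p4 = self.lat[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # bottom-up path: push fine-grained detail back to the coarser levels
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5

c3 = torch.randn(1, 256, 128, 128)   # 1024 x 1024 input, stride 8
c4 = torch.randn(1, 512, 64, 64)     # stride 16
c5 = torch.randn(1, 1024, 32, 32)    # stride 32
print([o.shape for o in TinyPAN()(c3, c4, c5)])
```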

Network pruning or network expansion

This paper uses network pruning or network expansion to change the depth and width of the network, trading speed against accuracy. The YOLO v4 network is divided into four sizes, of which YOLO v4-l is the original YOLO v4. As depth and width increase, the number of parameters grows, accuracy gradually increases, and speed gradually decreases.
Network pruning is divided into structured pruning and unstructured pruning. It is similar to dropout, except that pruning does not remove neurons randomly but according to some criterion, such as the absolute value of a parameter or the impact of removing it on the loss value. After pruning, the accuracy needs to be recovered (typically by fine-tuning). A small magnitude-pruning sketch follows the link below.
Multi-scale YOLOv4
You can read this blog about network pruning technology: link
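As a concrete illustration of unstructured, magnitude-based pruning (zero out the smallest-magnitude weights, then fine-tune to recover accuracy), here is a small PyTorch sketch. The layer, the 50% sparsity, and the thresholding rule are my own illustrative choices, not the pruning recipe used in the paper.

```python
# Sketch of magnitude-based (unstructured) pruning: weights with the smallest
# absolute value are zeroed out; in practice the model is then fine-tuned.
import torch
import torch.nn as nn

def magnitude_prune_(conv: nn.Conv2d, sparsity: float = 0.5) -> None:
    """In place: zero out the `sparsity` fraction of weights with smallest |w|."""
    with torch.no_grad():
        w = conv.weight
        k = int(w.numel() * sparsity)
        if k == 0:
            return
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).float())

conv = nn.Conv2d(64, 128, kernel_size=3)
magnitude_prune_(conv, sparsity=0.5)
print((conv.weight == 0).float().mean())  # roughly half of the weights are now zero
```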

Manhattan-Distance intersection over union loss

Let's divide the loss function into two parts:
loss function
First look at the first three terms: the first two are easy to understand, so the focus is on the third term, which mainly measures the difference in aspect ratio between the predicted box and the ground-truth box. I drew a picture to illustrate:
Loss function aspect ratio
Although the blue prediction box and the pink prediction box have the same IoU with the green ground-truth box, their aspect ratios are completely opposite. When computing the loss, the blue box is penalized more, because we want the aspect ratio of a prediction box to be roughly close to that of the real object; for example, the prediction box of a car should be wide and flat, while that of a person should be tall and thin. A small numerical illustration of this aspect-ratio idea follows.
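For reference, the aspect-ratio consistency term of the standard CIoU loss captures exactly this idea; I show it below only as an illustration, since the paper's exact formulation may differ.

```python
# Aspect-ratio consistency term of the standard CIoU loss, shown only to
# illustrate the idea above; the paper's exact formulation may differ.
import math

def aspect_ratio_term(w_pred, h_pred, w_gt, h_gt):
    return (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w_pred / h_pred)) ** 2

# The ground truth is a flat 2:1 box; one prediction matches it, one is a tall 1:2 box.
print(aspect_ratio_term(2, 1, 2, 1))  # 0.0   -> no extra penalty
print(aspect_ratio_term(1, 2, 2, 1))  # ~0.17 -> penalized despite a similar IoU
```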
Look at the second half of the loss function:
loss function
ρ(·) is the Euclidean distance, δ(·) is the Manhattan distance, b is the center point of the predicted box, and bgt is the center point of the ground-truth box. The Euclidean distance is the straight-line distance between two points, while the Manhattan distance is the sum of the absolute differences of the X and Y coordinates of the two points, as shown in the following figure:

Euclidean distance vs. Manhattan distance
Therefore, when computing the loss, we do not only consider the Euclidean distance between the two center points but also their Manhattan distance. I drew a picture to explain why this is done:
Distance explanation
The blue prediction box and the pink prediction box have the same Euclidean distance between their centers and the center of the ground-truth box, but the blue prediction box is obviously better, and this is where the Manhattan distance comes in handy: the Manhattan distance between the center of the blue box and the center of the ground-truth box is smaller, so its loss is smaller, which is fairer to the blue box. If this point is unclear, look back at the explanation of the two distances; a tiny numerical example follows.
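Here is a tiny numerical illustration of that point with made-up center coordinates: two predicted centers at the same Euclidean distance from the ground-truth center can have very different Manhattan distances.

```python
# Two predicted centers equally far from the ground-truth center in Euclidean
# distance, but different in Manhattan distance (made-up example coordinates).
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

gt = (0.0, 0.0)
pred_a = (5.0, 0.0)   # offset along one axis only
pred_b = (3.0, 4.0)   # offset along both axes

print(euclidean(pred_a, gt), manhattan(pred_a, gt))  # 5.0, 5.0
print(euclidean(pred_b, gt), manhattan(pred_b, gt))  # 5.0, 7.0
```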

Truncated NMS algorithm

Truncated Object
As mentioned earlier, because remote sensing images are large, the image must be cropped before detection, and a certain overlap rate is kept during cropping; for example, the green box and the black box in the figure below are the two crops obtained with a 20% overlap rate. However, truncated objects appear in the cropped images: the same object shows up partially in both crops, and after prediction the overlapping boxes remain when the crops are stitched back into the original image. The traditional NMS algorithm has difficulty removing these overlapping small boxes.
truncated target
In order to solve the above problems, this paper proposes a Truncated NMS algorithm, which can effectively remove the repeated small boxes in the truncated target prediction. The overall algorithm is shown in the figure:
Truncation Algorithm
Of course, looking at the algorithm directly is confusing; I only understood it after reading it several times, so I take it apart here and explain it with pictures I drew.
Algorithm 1
B is the set of all prediction boxes, S holds the confidence scores of the prediction boxes, IOUt is the IoU threshold between two boxes, IOIt is the threshold on the ratio of the intersection of two boxes to the area of the inner box, and IOOt is the threshold on the ratio of the intersection to the area of the outer box. Examples are as follows:

Example of inner and outer frames
The yellow shaded part is the intersection of the two boxes. A small sketch of these three ratios is given below.
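As a sketch of how the three ratios can be computed, assuming boxes in (x1, y1, x2, y2) format and taking "inner" and "outer" to mean the smaller and larger of the two boxes by area (the helper names and numbers below are mine, not the paper's code):

```python
# IoU, IoI (intersection over the inner/smaller box) and IoO (intersection over
# the outer/larger box) for two boxes in (x1, y1, x2, y2) format.
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def overlap_ratios(a, b):
    inter = intersection(a, b)
    area_a, area_b = box_area(a), box_area(b)
    iou = inter / (area_a + area_b - inter + 1e-9)
    ioi = inter / (min(area_a, area_b) + 1e-9)   # intersection over inner box
    ioo = inter / (max(area_a, area_b) + 1e-9)   # intersection over outer box
    return iou, ioi, ioo

# A small "tail" box lying mostly inside a large "whole aircraft" box:
big = (0, 0, 100, 100)
small = (78, 40, 104, 60)
print(overlap_ratios(big, small))  # IoU ~0.04, IoI ~0.85, IoO ~0.04
```

Note how the plain IoU is tiny here, which is exactly why ordinary NMS fails to suppress the small box, while the IoI clearly reveals that the small box lies almost entirely inside the big one.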
The next step is to select the box with the highest confidence score as M. The algorithm then splits into two cases: in the first, the small (inner) box is selected as M; in the second, the large (outer) box is selected as M:

Two M cases
Study the formulas for the two cases carefully; once you have worked through them, the rest is easy to understand.

condition2
If condition 2 holds, the current box is directly output as a final prediction box.

condition1
If condition 1 holds, the boxes that, together with the current M, are judged to be inner boxes are removed, including the current M itself. After all the inner boxes are removed, only the outer box remains, and it is then handled by condition 2.

Other cases
If neither of the above two cases applies, the boxes are processed by conventional NMS: if the IoU of two boxes is below the threshold both are kept, and if it is above the threshold the lower-scoring box is removed. A rough sketch of the whole procedure follows.
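Putting the cases together, here is a rough sketch of the whole procedure as I understand it, reusing box_area and overlap_ratios from the earlier snippet. The exact conditions, ordering, and thresholds in the paper may differ; this is only my reconstruction for illustration.

```python
# Rough, simplified reconstruction of the Truncated NMS idea (not the paper's
# exact pseudocode). Reuses box_area / overlap_ratios from the earlier sketch.
def truncated_nms(boxes, scores, iou_t=0.5, ioi_t=0.8, ioo_t=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        m = boxes[i]
        # Condition 1: M itself is an inner fragment of some bigger box -> drop M;
        # the bigger box will be handled when its own turn comes.
        if any(j != i and j not in suppressed
               and box_area(m) < box_area(boxes[j])
               and overlap_ratios(m, boxes[j])[1] > ioi_t
               and overlap_ratios(m, boxes[j])[2] < ioo_t
               for j in range(len(boxes))):
            suppressed.add(i)
            continue
        # Condition 2 (or no special case): M is kept as a final prediction.
        keep.append(i)
        for j in order:
            if j in suppressed or j == i:
                continue
            iou, ioi, ioo = overlap_ratios(m, boxes[j])
            if box_area(boxes[j]) < box_area(m) and ioi > ioi_t and ioo < ioo_t:
                suppressed.add(j)      # inner fragment inside M -> remove it
            elif iou > iou_t:
                suppressed.add(j)      # ordinary NMS suppression
    return keep
```

With the big/small aircraft boxes from the previous example, ordinary NMS would keep both (their IoU of about 0.04 is far below any usual threshold), whereas truncated_nms([big, small], [0.9, 0.8]) keeps only the big box.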

It took me a long time to figure out this algorithm, but the idea is really elegant. If you still do not understand it, you can message me privately and I will explain it to you.

Experiment

There is not much to say about the experiments; you can read the results directly in the paper. Here is a figure of truncated target detection from the paper:
Truncated target detection effect

Summary

This paper uses network pruning to obtain multi-size YOLO v4 models, so that a model of appropriate size can be selected according to different requirements, trading detection accuracy against speed. At the same time, the MIOU loss function is introduced: the Manhattan distance is added to the loss to compensate for the "unfair competition" caused by using only the Euclidean distance, which improves detection accuracy. To solve the detection errors caused by truncated objects, the Truncated NMS algorithm is proposed on the basis of NMS; by exploiting the overlap relationship between the large box and the small box and removing the small box, it effectively solves the problem of overlapping prediction boxes of truncated objects in the stitched image, while retaining the behavior of traditional NMS so that the prediction boxes of other targets are not removed by mistake. Judging from the comparisons with other algorithms reported in the paper, the method does improve the accuracy of remote sensing image target detection, and it also handles the prediction of truncated targets well.

This is my first time writing a blog, so it may be a bit rough. I hope everyone can offer suggestions, and if anything in the article is unclear, please feel free to contact me. I hope it is helpful to you.

Origin blog.csdn.net/KK7777777/article/details/127140588