Learning Deep Ship Detector in SAR Images From Scratch

Paper address: Baidu Netdisk link, extraction code: 5whq
Published journal: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 57, NO. 6, JUNE 2019
Authors' affiliation: National University of Defense Technology

An article full of substantive content, strongly recommended!

1. Main content

Existing problems:

  1. Pre-training: neural networks have many parameters and usually require large amounts of data for training, but labeled SAR data sets are relatively small. If ImageNet pre-trained weights are used, the large domain gap between SAR images and ImageNet images introduces a learning bias.
  2. Small-object detection: ships in SAR images are small and densely clustered targets, but most detectors predict on coarse feature maps, and the foreground and background are extremely unbalanced, so performance on small targets is poor.

Proposed method:

Training a ship detector from scratch:

  1. Condensed backbone: composed of dense blocks, so earlier layers receive additional supervision from the objective function through dense connections, which makes training easier. In addition, a feature-reuse strategy gives the backbone high parameter efficiency. It can therefore be freely designed and efficiently trained from scratch without a large number of annotated samples.
  2. Improved cross-entropy loss: alleviates the foreground/background imbalance; multi-scale ship proposals are output from multiple intermediate layers to improve recall.
  3. Position-sensitive score maps: encode position information into each ship proposal for discrimination.

Conclusions:

  1. Compared with an ImageNet pre-trained detector, the ship detector trained from scratch achieves better performance.
  2. The method is more effective than existing algorithms at detecting small and dense ships.

If you find this article useful, likes, bookmarks, and comments are welcome; they are also the motivation for continued updates!

2. Network structure

At the time, DSOD was the only detection network trained without pre-trained weights. DSOD uses dense layer-wise connections to reuse features, i.e., the condensed block. The authors are inspired by DSOD and follow its design approach.
However, DSOD is a one-stage detector based on SSD-style regression. The authors believe that a two-stage network is better suited to dense detection of small targets, so a proposal-based method is adopted.

[Figure: SSD network structure]

[Figure: DSOD network structure]

[Figure: the network structure proposed in this article]
The overall structure of the network = backbone + SPN (Ship Proposal Network) + SDN (Ship Discrimination Network)

The detailed layer configuration is given in the table below, for later reference:
[Table: detailed network configuration]

Backbone

[Figure: backbone structure]

The entire backbone is divided into 4 stages, each composed of dense blocks.

  • Before entering stage 1, the original input is down-sampled 2× by a stride-2 convolution
  • Before entering each dense block, 2× downsampling is performed by 2×2 max pooling
  • Two dense blocks are connected by a 1×1 convolution and 2×2 max pooling
  • Dense block = 3×3 conv + 1×1 conv + BN + Scale layer + ReLU

[What does the Scale layer do? In Caffe, the BatchNorm layer only normalizes; a separate Scale layer applies the learnable per-channel scale and shift that complete batch normalization.]

The internal structure of the dense block:
[Figure: internal structure of the dense block]
To ensure maximum information flow between layers, a dense block connects each layer to all subsequent layers in a feed-forward manner.
Since each layer can directly receive gradients from the loss function and the original input signal, a network built from dense blocks is easy to train.
Furthermore, features from shallow layers are reused by deep layers, reducing the number of parameters.
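
As a concrete illustration, here is a minimal PyTorch sketch of a DenseNet-style block consistent with the description above; the bottleneck width, growth rate, and exact layer order are common DenseNet conventions assumed for illustration, not necessarily the paper's configuration:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: BN + ReLU + 1x1 conv (bottleneck),
    then BN + ReLU + 3x3 conv. Note PyTorch's BatchNorm2d already
    includes the learnable scale/shift that Caffe implements as a
    separate Scale layer."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth_rate),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Concatenate the new features with all earlier ones (feature reuse),
        # so every later layer sees every earlier layer's output.
        return torch.cat([x, self.body(x)], dim=1)

class DenseBlock(nn.Sequential):
    def __init__(self, num_layers, in_channels, growth_rate):
        layers = [DenseLayer(in_channels + i * growth_rate, growth_rate)
                  for i in range(num_layers)]
        super().__init__(*layers)
```

Because each layer adds only `growth_rate` channels and reuses everything before it, the block stays parameter-efficient while every layer keeps a short gradient path to the loss.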

SPN—Ship Proposal Network

[Figure: comparison with the RPN]

Reasons why traditional CNN detectors struggle with small objects:

  • The feature map used for detection is too coarse to resolve small objects, which blurs localization (a single pixel of the feature map may cover several small objects)
  • Small objects often cause severe foreground/background imbalance:
    • Most regions are easily judged as negative samples, which provides little useful signal for learning, so training is inefficient;
    • Easy negative samples become the dominant factor in training, which leads to model degradation.
      [In Fast R-CNN, negative samples are obtained by sampling, but the sampled negatives cannot fully represent the background]

Solution to the above problems:

  • Use convolutions of different sizes to generate ship-like regions from different intermediate layers
  • More ship details can be extracted from shallow layers, and more global context information from deep layers
  • In this way, the recall of ships is effectively improved

SPN network structure:
[Figure: SPN network structure]
The SPN does not directly take the output of each dense block as input; instead, each branch combines the outputs of different dense blocks to generate ship proposals: after dimension reduction through 1×1 convolutions, the feature maps of the different dense blocks are concatenated.
On the one hand, this fuses feature maps of different scales to produce more accurate predictions; on the other hand, the 1×1 dimensionality reduction effectively reduces the number of parameters. A sketch of one branch is given below.
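
A minimal sketch of one such fusion branch in PyTorch; the channel widths, the use of max pooling to align spatial resolutions, and the module name FusionBranch are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBranch(nn.Module):
    """Reduce two dense-block outputs with 1x1 convs, resize the
    shallower (higher-resolution) map to the deeper map's resolution,
    and concatenate them into one fused feature map."""
    def __init__(self, c_shallow, c_deep, c_out=128):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(c_shallow, c_out, kernel_size=1)
        self.reduce_deep = nn.Conv2d(c_deep, c_out, kernel_size=1)

    def forward(self, f_shallow, f_deep):
        s = self.reduce_shallow(f_shallow)
        d = self.reduce_deep(f_deep)
        # Downsample the shallow map so spatial sizes match before concat.
        s = F.adaptive_max_pool2d(s, output_size=d.shape[-2:])
        return torch.cat([s, d], dim=1)  # fused multi-scale features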

The details of the proposal network are introduced below; it essentially follows the RPN of Faster R-CNN, with the improvements made in this paper.
First, briefly review the idea of Faster R-CNN.
Anchors in Faster R-CNN

In Faster R-CNN, every position of the feature map computed by the conv layers is equipped with 9 kinds of anchors as initial detection boxes, and the box positions are then corrected by two rounds of bounding-box regression: the first correction in the RPN, and the second in the detection network.
[Figure: anchors in Faster R-CNN]

Now, the method proposed in this article:
There are also anchors of 9 sizes, but they are divided into three branches; each branch is assigned anchors of three different sizes, and within each branch, filters of three different sizes generate proposals corresponding to the anchors of different sizes.
The generated proposals have 6 channels: the first two channels are used for classification and the last four for regressing the position.
Specifically, the first two channels classify anchors as positive or negative through a softmax (note: distinguish this from the positive/negative sample assignment used later for training); the last four channels compute the offsets of the proposals relative to the anchors to obtain precise proposals (corresponding to the first correction mentioned earlier).
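
For illustration, here is a sketch of how one branch's 6-channel output could be split and applied to its anchors. It assumes one anchor shape per position (with k anchors per position the head would output 6k channels) and the standard Faster R-CNN offset parameterization, which the paper may define differently:

```python
import torch

def decode_proposals(head_out, anchors):
    """head_out: (N, 6, H, W) branch output; anchors: (H*W, 4) as
    (x, y, w, h), where x, y is treated as the box reference point.
    The (dx, dy, dw, dh) transform below is the standard Faster R-CNN
    parameterization, assumed here for illustration."""
    n, _, h, w = head_out.shape
    out = head_out.permute(0, 2, 3, 1).reshape(n, h * w, 6)
    scores = out[..., :2].softmax(dim=-1)[..., 1]   # P(ship) per anchor
    dx, dy, dw, dh = out[..., 2], out[..., 3], out[..., 4], out[..., 5]
    ax, ay, aw, ah = anchors[:, 0], anchors[:, 1], anchors[:, 2], anchors[:, 3]
    px = ax + dx * aw                # shift the anchor position...
    py = ay + dy * ah
    pw = aw * torch.exp(dw)          # ...and rescale its width/height
    ph = ah * torch.exp(dh)
    return scores, torch.stack([px, py, pw, ph], dim=-1)
```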

At each window position, a region box (x, y, w, h) is predicted by each filter, where x, y are the coordinates of the top-left corner of the region and w, h are the size of the predicted region. In effect, the region box is obtained as anchor + offset. The specific settings are as follows:
[Figure: anchor settings for each branch]
To create training samples from the proposals, regions extending outside the image boundary are discarded, and the remaining regions are labeled according to the positive/negative assignment rule: a region box that has the highest IoU with some ground-truth box is labeled positive; a region box whose IoU is below 0.2 is labeled negative; all other regions are discarded and do not participate in subsequent training.
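
A sketch of this assignment rule in PyTorch; the (x1, y1, x2, y2) box format and the helper name assign_labels are assumptions for illustration:

```python
import torch

def assign_labels(boxes, gt_boxes, neg_thresh=0.2):
    """Label region boxes for SPN training following the rule above:
    the box with the highest IoU for each ground-truth box is positive (1),
    boxes with max IoU below 0.2 are negative (0), the rest are ignored (-1).
    boxes: (N, 4), gt_boxes: (M, 4), both as (x1, y1, x2, y2)."""
    x1 = torch.max(boxes[:, None, 0], gt_boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], gt_boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], gt_boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], gt_boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_b[:, None] + area_g[None, :] - inter)   # (N, M)

    labels = torch.full((boxes.shape[0],), -1, dtype=torch.long)  # ignored
    labels[iou.max(dim=1).values < neg_thresh] = 0                # negative
    labels[iou.argmax(dim=0)] = 1       # best match per ground truth: positive
    return labels
```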

This concludes the brief introduction to the SPN.
The improved CE loss that the authors propose to alleviate the foreground/background imbalance is also used in the SPN; it is introduced later.

SDN—Ship Discrimination Network

The necessity of the second-stage SDN:
On the one hand, the sliding-window operation does not cover ships well, and the ship proposals from the multiple branches overlap each other heavily.
On the other hand, the local feature representation of each predicted region box Bi is not sufficient for good discrimination.
Therefore, to improve detection accuracy, the SDN is added after the SPN.

[Figure: SDN network structure]
The process of the SDN:
The image regions given by the predicted boxes (generated by the SPN) are taken as input, and the refined category and position are output simultaneously. Since the ship proposals have different sizes, RoI pooling is applied to each bounding box to extract fixed-dimensional features (e.g., 7 × 7 × 512). These features are then fed into the following fully connected layers and split into two branches for further classification and bounding-box regression (corresponding to the second correction mentioned above). A sketch of this head is given below.
[In Faster R-CNN, RoI pooling takes the last conv feature map as input; in the network structure diagram of this article, shallow feature maps are used as input, which seems to differ from the paper?]
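
A rough PyTorch sketch of this stage using torchvision's RoI pooling; the 7 × 7 × 512 pooled size follows the text, while the FC width of 1024 and the module name SDNHead are assumptions:

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class SDNHead(nn.Module):
    """RoI-pool each proposal to a fixed 7x7x512 feature, pass it through
    shared FC layers, then split into a classification branch and a
    box-refinement branch (the second correction)."""
    def __init__(self, in_channels=512, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_branch = nn.Linear(1024, num_classes)  # refined category
        self.reg_branch = nn.Linear(1024, 4)            # refined box offsets

    def forward(self, feature_map, proposals, spatial_scale):
        # proposals: list of (K_i, 4) boxes in image coordinates per image;
        # spatial_scale maps image coordinates onto the feature map.
        pooled = roi_pool(feature_map, proposals, output_size=(7, 7),
                          spatial_scale=spatial_scale)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_branch(x), self.reg_branch(x)
```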

Focal loss

Cross-entropy loss function (S⁺ denotes positive samples, S⁻ denotes negative samples):

L_CE = −Σ_{i∈S⁺} log p(x_i) − Σ_{i∈S⁻} log(1 − p(x_i))

CE loss is effective for training only when the sampled ratio of positive to negative proposals is about 1:1. To handle the foreground/background imbalance, a focal loss in the style of Kaiming He's is adopted instead, so that all proposals can be used for training.

Focal loss (β is a balancing parameter that prevents negative samples from dominating the training loss):

L_FL = −Σ_{i∈S⁺} (1 − p(x_i))² log p(x_i) − β Σ_{i∈S⁻} p(x_i)² log(1 − p(x_i))

For misclassified samples, focal loss is equivalent to CE loss: when a positive sample is misclassified, p(x_i) is small and the modulating factor (1 − p(x_i))² is close to 1; when a negative sample is misclassified, p(x_i) is large and the modulating factor p(x_i)² is close to 1.
For correctly classified samples, the loss is smoothly down-weighted: when a positive sample is correctly classified, p(x_i) is large and (1 − p(x_i))² is close to 0; when a negative sample is correctly classified, p(x_i) is small and p(x_i)² is close to 0.
Therefore, focal loss prevents the large number of easy negative samples from dominating the training process.
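
A sketch of this loss as described above, assuming a per-anchor ship probability p; the default value of β and the normalization by the number of anchors are placeholder assumptions, not the paper's settings:

```python
import torch

def ship_focal_loss(p, labels, beta=0.5):
    """p: predicted ship probability per anchor, in (0, 1);
    labels: 1 = positive, 0 = negative (ignored anchors already removed).
    Positives are down-weighted by (1 - p)^2, negatives by p^2, and beta
    balances the much larger negative set. beta=0.5 is a placeholder."""
    eps = 1e-6
    pos, neg = labels == 1, labels == 0
    loss_pos = -((1 - p[pos]) ** 2) * torch.log(p[pos] + eps)
    loss_neg = -beta * (p[neg] ** 2) * torch.log(1 - p[neg] + eps)
    return (loss_pos.sum() + loss_neg.sum()) / max(labels.numel(), 1)
```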

3. Experimental results

[Figures: experimental results and comparisons from the paper]

Source: blog.csdn.net/weixin_46297585/article/details/123442293