RetinaNet network introduction

Foreword

  We introduced Focal Loss in the previous blog post, and its principle is relatively simple; if you are not familiar with it, you can jump back to that post: Introduction to Focal Loss. Now let's look at the paper that Focal Loss comes from, Focal Loss for Dense Object Detection, which proposes RetinaNet, a one-stage network whose accuracy surpasses that of two-stage networks.

1. RetinaNet network

Let's look at the performance of RetinaNet first; you can see that it is clearly superior to the Faster R-CNN network.

(Figure: RetinaNet accuracy comparison with other detectors)

Now let's look at the network structure: we can see that RetinaNet adopts a structure similar to FPN, with three main differences. For those not familiar with FPN, you can jump to my previous blog post (introduction to the FPN network):

(Figure: RetinaNet network structure)

  • FPN uses C2 to build P2, while RetinaNet does not use C2 (and therefore has no P2). The reason given in the paper is that C2 is a low-level feature map with a relatively large resolution, so using it would consume noticeably more computation.
  • In FPN, P6 is obtained by a max-pooling downsampling layer, while in RetinaNet it is obtained by a convolutional layer.
  • FPN uses P2-P6, while RetinaNet uses P3-P7; P7 is obtained from P6 by applying a ReLU activation followed by a convolution (see the code sketch after this list).
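
To make the last two differences concrete, here is a minimal PyTorch-style sketch of how the extra levels could be produced; the 256-channel width and the exact layer arrangement are assumptions for illustration, not details from this post.

```python
import torch.nn as nn
import torch.nn.functional as F

class ExtraLevels(nn.Module):
    """Sketch of how P6/P7 might be produced on top of the highest feature level."""
    def __init__(self, channels=256):
        super().__init__()
        # RetinaNet: P6 comes from a stride-2 3x3 convolution (not from max pooling).
        self.p6_conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # P7 comes from P6 after a ReLU, followed by another stride-2 convolution.
        self.p7_conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, top_feature):
        # FPN-style P6 for comparison: plain max-pool downsampling.
        p6_fpn = F.max_pool2d(top_feature, kernel_size=1, stride=2)

        # RetinaNet-style P6 and P7.
        p6 = self.p6_conv(top_feature)
        p7 = self.p7_conv(F.relu(p6))
        return p6_fpn, p6, p7
```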

  In FPN, each prediction feature layer uses only one scale and three ratios, while in RetinaNet each prediction feature layer uses three scales and three ratios. RetinaNet's scales and ratios are shown in the following table:

| layers | stride | anchor_sizes | anchor_aspect_ratios | generated anchors (×3 for the 3 ratios, 1024×1024 input) |
| --- | --- | --- | --- | --- |
| P3 | 8 (2^3) | 32 | 0.5, 1, 2 | (1024/8)^2 × 3 = 49152 |
| P4 | 16 (2^4) | 64 | 0.5, 1, 2 | (1024/16)^2 × 3 = 12288 |
| P5 | 32 (2^5) | 128 | 0.5, 1, 2 | (1024/32)^2 × 3 = 3072 |
| P6 | 64 (2^6) | 256 | 0.5, 1, 2 | (1024/64)^2 × 3 = 768 |
| P7 | 128 (2^7) | 512 | 0.5, 1, 2 | (1024/128)^2 × 3 = 192 |
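
The per-level anchor counts in the last column can be reproduced with a few lines of Python; the 1024×1024 input size and the "one anchor size × three ratios per position" counting are taken directly from the table and are only an illustration:

```python
# Reproduce the per-level anchor counts from the table for a 1024x1024 input.
image_size = 1024
strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}
num_ratios = 3  # aspect ratios 0.5, 1, 2

for name, stride in strides.items():
    positions = (image_size // stride) ** 2  # grid positions on this feature map
    print(f"{name}: {positions * num_ratios} anchors")
```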

Let's look at the predictor part of RetinaNet:

(Figure: RetinaNet predictor, with a class prediction branch and a box regression branch)

  The predictor is divided into two branches: one predicts the category, and the other predicts the bounding-box regression parameters. In the final output, K is the number of target categories (excluding background) and A is the number of anchors at each position of the prediction feature layer. In Faster R-CNN, each anchor generates a separate set of bounding-box regression parameters for every category, which is slightly different from the prediction here; RetinaNet, like SSD, predicts a single class-agnostic set of regression parameters per anchor, which is now the common approach and reduces the number of network parameters to train.
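
To make the output shapes concrete, here is a minimal sketch of the two branches; the four 3×3 convolutions of width 256 follow the commonly used RetinaNet head design and should be treated as assumptions rather than details from this post.

```python
import torch.nn as nn

class RetinaNetHead(nn.Module):
    """Classification and box regression branches, shared across all prediction feature layers."""
    def __init__(self, in_channels=256, num_anchors=9, num_classes=80):
        super().__init__()

        def branch(out_channels):
            layers = []
            for _ in range(4):  # small stack of 3x3 convolutions
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU()]
            layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
            return nn.Sequential(*layers)

        # K*A outputs per position: one score per class per anchor (no background class).
        self.cls_branch = branch(num_classes * num_anchors)
        # 4*A outputs per position: one class-agnostic set of box regression parameters per anchor.
        self.reg_branch = branch(4 * num_anchors)

    def forward(self, feature):
        return self.cls_branch(feature), self.reg_branch(feature)
```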

2. Calculation of losses

  First, each anchor is matched against the pre-annotated ground-truth (gt) boxes by computing their IoU. The matching rules are as follows (a small code sketch follows the list):

  • IoU ≥ 0.5: marked as a positive sample
  • IoU < 0.4: marked as a negative sample
  • IoU ∈ [0.4, 0.5): discarded
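
A minimal sketch of this matching step, assuming torchvision's box_iou for the IoU computation:

```python
import torch
from torchvision.ops import box_iou

def match_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label each anchor: 1 = positive, 0 = negative, -1 = discarded."""
    iou = box_iou(gt_boxes, anchors)      # shape: [num_gt, num_anchors]
    best_iou, best_gt = iou.max(dim=0)    # best matching ground-truth box for every anchor

    labels = torch.full((anchors.size(0),), -1, dtype=torch.int64)  # default: discard
    labels[best_iou < neg_thresh] = 0     # negative sample (background)
    labels[best_iou >= pos_thresh] = 1    # positive sample
    return labels, best_gt
```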

The total loss is still composed of a classification loss and a regression loss, as follows (a short code sketch follows the term definitions below):
$$\text{Loss} = \frac{1}{N_{pos}} \sum_i L_{cls}^i + \frac{1}{N_{pos}} \sum_j L_{reg}^j$$

  • $L_{cls}$: Sigmoid Focal Loss, introduced in the previous blog post; if you are not familiar with it, see: Introduction to Focal Loss.
  • $L_{reg}$: L1 Loss
  • $i$: all positive and negative samples
  • $j$: all positive samples
  • $N_{pos}$: the number of positive samples
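
Putting the pieces together, the total loss could be sketched like this; sigmoid_focal_loss is torchvision's implementation, and plain L1 is used for the regression term as described above:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def retinanet_loss(cls_logits, cls_targets, box_preds, box_targets, labels):
    """labels per anchor: 1 = positive, 0 = negative, -1 = discarded."""
    pos = labels == 1
    valid = labels >= 0                        # positives and negatives only
    num_pos = pos.sum().clamp(min=1).float()   # avoid dividing by zero

    # Classification: Sigmoid Focal Loss over all positive and negative samples.
    cls_loss = sigmoid_focal_loss(cls_logits[valid], cls_targets[valid], reduction="sum")

    # Regression: L1 loss over positive samples only.
    reg_loss = F.l1_loss(box_preds[pos], box_targets[pos], reduction="sum")

    # Both terms are normalized by the number of positive samples.
    return (cls_loss + reg_loss) / num_pos
```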

  The above is the introduction to the RetinaNet network; if there is any mistake, please correct me!

Origin blog.csdn.net/qq_38683460/article/details/131158221