Foreword
We introduced Focal Loss in the previous blog post, and its principle is relatively simple; if you are not familiar with it, you can jump back to that post: Introduction to Focal Loss. Now let's look at the paper Focal Loss comes from, Focal Loss for Dense Object Detection. This paper proposes RetinaNet, a one-stage network that surpasses two-stage networks.
1. RetinaNet network
Let's look at RetinaNet's performance first: you can see that it is far superior to the Faster R-CNN network. Next, let's look at the network structure: we can see that RetinaNet adopts a structure similar to FPN, with three main differences. (For those who don't know FPN, you can jump to my previous blog post: Introduction to the FPN network.)
- FPN uses C2 to build P2, while RetinaNet does not use C2. The reason given in the paper is that it would consume more computing resources, since C2 is a low-level feature with a relatively large resolution.
- In FPN, P6 is obtained with a max-pooling downsampling layer, while in RetinaNet it is obtained with a convolutional downsampling layer.
- FPN uses P2-P6, while RetinaNet uses P3-P7; P7 is obtained by applying a ReLU activation to P6 and then a convolution (see the sketch below).
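To make the last two differences concrete, here is a minimal PyTorch sketch of how RetinaNet's two extra levels could be built: P6 from a 3×3 stride-2 convolution (the paper builds it from C5; some implementations use P5 instead), and P7 from a ReLU followed by another stride-2 convolution. It assumes the input has already been reduced to the FPN's 256 channels; the module and variable names are my own.

```python
import torch
import torch.nn as nn

class ExtraLevels(nn.Module):
    """Sketch of RetinaNet's P6/P7 construction (names are illustrative)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # P6: 3x3 stride-2 convolution (FPN uses max pooling here instead)
        self.conv_p6 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # P7: ReLU on P6 followed by another 3x3 stride-2 convolution
        self.conv_p7 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor):
        p6 = self.conv_p6(x)
        p7 = self.conv_p7(torch.relu(p6))
        return p6, p7
```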
In FPN, each prediction feature layer uses only one scale and three ratios, while in RetinaNet each prediction feature layer uses three scales and three ratios. RetinaNet's scales and ratios are shown in the following table:
layers | stride | anchor_sizes | anchor_aspect_ratios | number of generated anchors (×3 for the 3 ratios) |
---|---|---|---|---|
P2 | 4 ($2^2$) | 32 | 0.5, 1, 2 | $(1024/4)^2 \times 3 = 196608$ |
P3 | 8 ($2^3$) | 64 | 0.5, 1, 2 | $(1024/8)^2 \times 3 = 49152$ |
P4 | 16 ($2^4$) | 128 | 0.5, 1, 2 | $(1024/16)^2 \times 3 = 12288$ |
P5 | 32 ($2^5$) | 256 | 0.5, 1, 2 | $(1024/32)^2 \times 3 = 3072$ |
P6 | 64 ($2^6$) | 512 | 0.5, 1, 2 | $(1024/64)^2 \times 3 = 768$ |
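As a quick sanity check on the table's last column, the anchor counts follow directly from a 1024×1024 input (the resolution the table's arithmetic implies):

```python
# Verify the anchor counts in the table above for a 1024x1024 input:
# each level has (1024 // stride)^2 positions, each with 3 aspect ratios.
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32), ("P6", 64)]:
    grid = 1024 // stride
    print(f"{name}: {grid}x{grid} positions x 3 ratios = {grid * grid * 3}")
# P2: 256x256 positions x 3 ratios = 196608
# P3: 128x128 positions x 3 ratios = 49152
# ...
```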
Let's look at the predictor part of RetinaNet again: the predictor is divided into two branches, one predicting the category and the other predicting the target bounding-box regression parameters. In the final output, K represents the number of target categories (excluding background), and A represents the number of anchors at each position of a prediction feature layer. In Faster R-CNN, each anchor generates a separate set of bounding-box regression parameters for every category, which differs slightly from the prediction here: like SSD, RetinaNet predicts a single class-agnostic set of regression parameters per anchor. This class-agnostic prediction method reduces the number of network training parameters.
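As a sketch of this two-branch design, the head below follows the paper's layout of four 3×3 convolutions before a final 3×3 prediction convolution, shared across all pyramid levels; make_head and the example K/A values are illustrative, not from the paper's code:

```python
import torch.nn as nn

def make_head(out_channels: int, in_channels: int = 256) -> nn.Sequential:
    """Four 3x3 conv + ReLU blocks, then a final 3x3 prediction conv."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

K, A = 80, 9                   # e.g. 80 COCO classes, 3 scales x 3 ratios
cls_branch = make_head(K * A)  # category branch: K scores per anchor
reg_branch = make_head(4 * A)  # regression branch: one class-agnostic box per anchor
```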
2. Calculation of losses
First, we match anchors: that is, for each anchor we compute the IoU with the pre-labeled GT boxes. The rules are as follows (sketched in code after the list):
- $\text{IoU} \geq 0.5$: marked as a positive sample
- $\text{IoU} < 0.4$: marked as a negative sample
- $\text{IoU} \in [0.4, 0.5)$: discarded
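A minimal sketch of this matching rule, assuming iou is a precomputed [num_anchors, num_gt] IoU matrix:

```python
import torch

def match_anchors(iou: torch.Tensor) -> torch.Tensor:
    """Label anchors as 1 (positive), 0 (negative), or -1 (discarded)."""
    best_iou, _ = iou.max(dim=1)            # best GT overlap for each anchor
    labels = torch.full_like(best_iou, -1.0)
    labels[best_iou >= 0.5] = 1.0           # positive samples
    labels[best_iou < 0.4] = 0.0            # negative samples
    # anchors with IoU in [0.4, 0.5) keep -1 and are ignored in training
    return labels
```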
The total loss is still composed of a classification loss and a regression loss, as follows (a code sketch follows the definitions below):

$$\text{Loss} = \frac{1}{N_{pos}} \sum_i L_{cls}^i + \frac{1}{N_{pos}} \sum_j L_{reg}^j$$
- $L_{cls}$: Sigmoid Focal Loss, which we introduced in the last blog post; if you don't understand it, you can go back and read: Introduction to Focal Loss.
- $L_{reg}$: L1 Loss
- $i$: iterates over all positive and negative samples
- $j$: iterates over all positive samples
- $N_{pos}$: the number of positive samples
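Putting these terms together, here is a sketch of the total loss using torchvision's sigmoid_focal_loss; the tensor shapes and the one-hot classification targets are my assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def retinanet_loss(cls_logits, cls_targets, reg_preds, reg_targets, labels):
    """cls_logits/cls_targets: [N, K] (one-hot float targets);
    reg_preds/reg_targets: [N, 4]; labels: [N] from match_anchors."""
    pos = labels == 1                       # j: all positive samples
    valid = labels >= 0                     # i: positives and negatives
    n_pos = pos.sum().clamp(min=1).float()  # N_pos (avoid division by zero)
    l_cls = sigmoid_focal_loss(cls_logits[valid], cls_targets[valid],
                               alpha=0.25, gamma=2.0, reduction="sum")
    l_reg = F.l1_loss(reg_preds[pos], reg_targets[pos], reduction="sum")
    return (l_cls + l_reg) / n_pos
```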
The above is the introduction to the RetinaNet network. If there are any mistakes, please correct me!