RetinaNet Paper Explained: Focal Loss for Dense Object Detection

1. Paper related information

1. Paper title: Focal Loss for Dense Object Detection

2. Publication year: 2017

3. Paper link: https://arxiv.org/pdf/1708.02002.pdf

4. Authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

5. Source code: https://github.com/facebookresearch/Detectron

2. Paper details

Background and introduction

At present, the most accurate detectors are two-stage methods in the line started by R-CNN: a classifier is applied to a sparse set of candidate object regions. One-stage detectors instead sample regularly and densely over the locations where objects may appear, and run the detector at every one of those positions.
One-stage detectors are faster and simpler, but they lag behind two-stage detectors in accuracy. The main cause of the lower accuracy is the extreme imbalance between positive (foreground) and negative (background) samples during training, i.e. class imbalance. Roughly speaking, because negatives vastly outnumber positives during training, most of the loss is contributed by uninformative negative samples, and the resulting loss cannot provide good guidance for training the model.
e.g.: in binary classification, a large gap between the proportions of positive and negative samples biases the model's predictions toward one class. If positives make up 1% of the data and negatives 99%, the model can reach 99% accuracy simply by predicting every sample as negative, so a model trained on such data is not actually accurate.

Solving class imbalance:
  • Two-stage detectors such as R-CNN narrow down the number of candidate regions in the first (proposal) stage, filtering out most of the background samples (negatives); in the second stage, heuristic sampling, such as keeping a fixed 1:3 positive-to-negative ratio, or OHEM (online hard example mining), maintains class balance quite well.

  • One-stage detectors must handle a much larger set of candidate regions, densely covering every spatial location, scale, and aspect ratio. Heuristic sampling can also be applied here, but it works poorly, because training is still dominated by a large number of easily classified background samples. This is a classic problem in detection, usually addressed by techniques such as hard example mining.

The authors propose a new loss function, Focal Loss, to address the class imbalance phenomenon. It reshapes the standard cross-entropy loss into a dynamically scaled cross-entropy loss: as confidence in the correct class increases, the scaling factor decays to zero. This loss function down-weights the loss assigned to well-classified samples and focuses training on the hard samples. The loss is detailed below.

A so-called hard sample is simply one for which it is not easy to tell whether it contains a target, for example an anchor covering only a single leg, where it is hard to judge whether it is a person.

To evaluate the effectiveness of Focal Loss, the authors designed and trained a dense one-stage detector, RetinaNet. The results show that Focal Loss handles class imbalance far better than the heuristic sampling or hard example mining previously used in one-stage detectors: with Focal Loss, RetinaNet keeps the speed of a one-stage detector while surpassing the accuracy of all the best two-stage detectors of the time.

Note: the specific form of Focal Loss is not very critical.

Focal Loss

To introduce Focal Loss, first recall the cross entropy (CE) loss for binary classification.

Cross entropy (CE):

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$
In the formula above, y takes the value +1 or −1: +1 marks the ground-truth class and −1 the rest. p ∈ [0, 1] is the model's estimated probability for the class y = 1. For notational convenience, define:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
so that CE(p, y) = CE(p_t) = −log(p_t). Here p_t can be read as the probability of the sample being classified correctly, covering both negatives correctly classified as background and positives correctly classified as their ground-truth class.

The CE(p_t) loss is the blue curve in the figure below. Its salient feature is that even easily classified samples (p_t > 0.5, i.e. easy examples) still incur a non-negligible loss. Summed over a large number of easy examples, these small losses add up to a large value that overwhelms the loss of the other samples.
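
For a sense of scale: a well-classified sample with p_t = 0.9 still contributes −log 0.9 ≈ 0.105 to the loss, and a dense detector evaluates on the order of 100k anchors per image (the figure cited in the paper), almost all of them easy negatives, so these small per-sample losses sum to a total that dwarfs the contribution of the few hard positives.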

Balanced Cross Entropy

One way to address class imbalance is to add a weighting factor α ∈ [0, 1] for positive samples and 1 − α for negative samples. In practice, α can be set by inverse class frequency or treated as a hyperparameter tuned by cross-validation. For convenience, define α_t analogously to p_t. This gives the α-balanced CE loss:
$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$$
This is a simple extension of cross entropy: through the α weighting, the imbalance between positive and negative samples can be addressed.

Focal Loss Definition

α balances positive against negative samples, but it cannot distinguish easy examples from hard ones. Focal loss therefore reshapes the cross-entropy loss to down-weight easy examples and focus training on hard examples.
Specifically, a modulating factor (1 − p_t)^γ is added to the cross entropy, with a tunable focusing parameter γ ≥ 0.
The focal loss is then:
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

[Figure: FL(p_t) curves for several values of γ; the γ = 0 curve is standard cross entropy]

The curves for different values of γ are shown in the figure above; when γ = 0 the function degenerates to the cross-entropy loss. The loss function has two properties:

  1. When a sample is misclassified, p_t is small, so the modulating factor is close to 1 and the loss value is almost unaffected. When a sample is well classified, p_t approaches 1, the modulating factor approaches 0, and the loss of well-classified samples is scaled down, so training focuses relatively more on the hard samples.
  2. As γ increases, the influence of the modulating factor also grows; in the experiments, γ = 2 works best.
  • Computing the probability p with a sigmoid is more accurate than using a softmax;
  • The formula of Focal Loss is not fixed; other forms perform about the same, so the exact expression of Focal Loss is not critical.

Intuitively, the modulating factor reduces the loss contribution of easy samples and widens the range of p_t over which a sample receives low loss.

In practice, the α-balanced variant of the focal loss gives slightly better accuracy. α and γ interact: as γ increases, α should decrease slightly. The best values are γ = 2, α = 0.25:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
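
To make the definition concrete, here is a minimal PyTorch sketch of the α-balanced focal loss with the paper's defaults γ = 2, α = 0.25. The function name and tensor shapes are illustrative, not taken from the paper's Detectron code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw classification scores, any shape
    targets: float tensor of 0/1 labels, same shape as logits
    """
    p = torch.sigmoid(logits)  # the paper computes p with a per-class sigmoid
    # binary_cross_entropy_with_logits gives -log(p_t) elementwise
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return alpha_t * (1 - p_t) ** gamma * ce                 # per-element loss
```

With γ = 0 and α_t = 1 this reduces to standard cross entropy, matching the curves above.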

Model Initialization

When class imbalance is severe, i.e. background examples vastly outnumber foreground ones, a freshly initialized network assigns roughly equal probability to positive and negative for any input, while focal loss only down-weights the samples judged correctly, so whichever class is misjudged more contributes more loss.
At initialization, roughly half the negatives are misclassified as positive and half the positives as negative; since the negatives vastly outnumber the positives, they contribute almost all of the loss. Training is therefore unstable in the early stage, and the model tends to classify every sample as negative.
To alleviate this bias, the authors make a small modification to the final classification convolution (see Figure 2 for its location): its bias is initialized to the special value b = −log((1 − π)/π). The paper takes π = 0.01, so at the start of training every anchor predicts foreground with probability about 0.01, which keeps the huge number of background anchors from generating a large, destabilizing loss.
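
A minimal PyTorch sketch of that initialization (the layer shapes and class count are illustrative):

```python
import math
import torch.nn as nn

num_anchors, num_classes = 9, 80   # illustrative values (e.g. COCO has 80 classes)
prior = 0.01                       # pi = 0.01 as in the paper
cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)
nn.init.constant_(cls_head.bias, -math.log((1 - prior) / prior))
# sigmoid(bias) == prior, so every anchor initially predicts foreground with p ~= 0.01
```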

RetinaNet Detector

The RetinaNet detector consists of a backbone and two task-specific subnetworks. The backbone extracts features from the whole image; the two subnetworks then run convolutions on the backbone's output feature maps, the first producing the object classification results and the second performing bounding-box regression.
Structure diagram:
[Figure: RetinaNet network architecture]
The backbone uses ResNet+FPN, and the FPN feature layers all have 256 channels.
Note that the FPN here is the same as in FCOS: the figure shows only three levels, but the FPN output is processed further to obtain the five levels P3–P7.
The code runs as follows:

  • The three ResNet feature maps c3, c4, and c5 each pass through a 1×1 convolution for channel transformation, giving m3–m5 with the output channels unified to 256.
  • Starting from m5 (the smallest feature map), apply 2× nearest-neighbor upsampling, then add the result to m4 to obtain a new m4.
  • Upsample the new m4 by 2× nearest neighbor and add it to m3 to obtain a new m3.
  • Apply separate 3×3 convolutions to m5 and to the newly merged m4 and m3 to get the final three-scale outputs P5–P3.
  • Apply a 3×3, stride-2 convolution to c5 to get P6.
  • Apply another 3×3, stride-2 convolution to P6 to get P7. The purpose of P6 and P7 is to provide strongly semantic feature maps, which helps detect large and very large objects. The FPN module of RetinaNet contains only convolutions, with no BN or ReLU.

Summary: the FPN module receives the three feature maps c3, c4, and c5, and outputs the five feature maps P3–P7, each with 256 channels and strides (8, 16, 32, 64, 128). The larger strides (smaller feature maps) are used to detect large objects, and the smaller strides (larger feature maps) are used to detect small objects.
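
Putting the steps above together, a minimal PyTorch sketch of the P3–P7 construction. The class name is illustrative, and the input channel widths 512/1024/2048 assume a ResNet-50/101 backbone:

```python
import torch.nn as nn
import torch.nn.functional as F

class RetinaFPN(nn.Module):
    """Sketch of RetinaNet's FPN neck: lateral 1x1 convs, top-down adds, 3x3 outputs."""

    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)  # 1x1 lateral convs -> m3..m5
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.out3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # 3x3 output convs
        self.out4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out5 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv6 = nn.Conv2d(c5_ch, out_ch, 3, stride=2, padding=1)   # c5 -> P6
        self.conv7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)  # P6 -> P7

    def forward(self, c3, c4, c5):
        m5 = self.lat5(c5)
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p3, p4, p5 = self.out3(m3), self.out4(m4), self.out5(m5)
        p6 = self.conv6(c5)
        p7 = self.conv7(p6)  # note: the official implementation inserts a ReLU before this conv
        return p3, p4, p5, p6, p7
```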

Each feature level uses three aspect ratios and three sizes of anchors, nine anchors per location in total. Each anchor produces a K-dimensional classification vector and a 4-dimensional box-regression vector. Anchors with IoU ≥ 0.5 are positive and are assigned to the corresponding ground truth; anchors with IoU in [0, 0.4) are negative, i.e. background; anchors with IoU in [0.4, 0.5) are ignored.
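
A sketch of this assignment rule, assuming a precomputed pairwise IoU matrix; the function and variable names are illustrative:

```python
import torch

def assign_anchors(iou):
    """IoU-based label assignment as described above.

    iou: (num_anchors, num_gt) pairwise IoU matrix between anchors and GT boxes.
    Returns per-anchor labels (1 = foreground, 0 = background, -1 = ignored)
    and the index of the matched GT box for each anchor.
    """
    max_iou, matched_gt = iou.max(dim=1)     # best-overlapping GT for each anchor
    labels = torch.full_like(max_iou, -1.0)  # default: ignored, IoU in [0.4, 0.5)
    labels[max_iou < 0.4] = 0.0              # background
    labels[max_iou >= 0.5] = 1.0             # foreground, assigned to matched_gt
    return labels, matched_gt
```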

Both the classification branch and the box-regression branch are FCN subnetworks, and their parameters are shared across all FPN levels.

  • The final output of the classification branch has A×K channels, i.e. number of anchors × number of classes.
  • The output of the box-regression branch has 4A channels; each anchor corresponds to 4 parameters predicting its relative displacement to the ground truth, using the box parameterization from R-CNN (a sketch of both heads follows this list).
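
As referenced above, a minimal sketch of the two heads. The four 3×3/256 conv + ReLU layers match the paper's description; the class count K is illustrative:

```python
import torch.nn as nn

def make_head(out_ch, in_ch=256, num_convs=4):
    """Four 3x3/256 conv+ReLU layers followed by a final 3x3 prediction conv;
    the same module is shared across all FPN levels."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(in_ch, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

A, K = 9, 80                   # anchors per location, class count (K is illustrative)
cls_subnet = make_head(A * K)  # classification branch: A x K output channels
box_subnet = make_head(A * 4)  # regression branch: 4A output channels
```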

Total loss

The total training loss is the sum of the classification focal loss and the smooth-L1 loss of the box regression. The focal loss is computed over all anchors, unlike other methods that select only a subset of anchors, and the normalization divides by the number of anchors assigned to a ground-truth box.
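
Combining the pieces, a sketch of the total loss that reuses the focal_loss function from the earlier sketch; all variable names are illustrative:

```python
import torch.nn.functional as F

def retinanet_loss(cls_logits, cls_targets, box_preds, box_targets, labels):
    """Total loss sketch: focal loss over anchors + smooth-L1 on positives only.

    labels follow the assignment sketch: 1 foreground, 0 background, -1 ignored.
    """
    valid = labels >= 0                        # ignored anchors contribute no loss
    pos = labels == 1
    num_pos = pos.sum().clamp(min=1).float()   # anchors assigned to a GT box
    cls_loss = focal_loss(cls_logits[valid], cls_targets[valid]).sum() / num_pos
    box_loss = F.smooth_l1_loss(box_preds[pos], box_targets[pos],
                                reduction="sum") / num_pos
    return cls_loss + box_loss
```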

Inference

Inference is simply a forward pass. For speed, each FPN level first keeps only the anchors whose score exceeds the detection threshold of 0.05, then decodes at most the top 1,000 of them (restoring anchor positions in the original image). Finally, the predictions from all levels are merged, and non-maximum suppression (NMS) gives the final result.
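
A sketch of that post-processing using torchvision's NMS. The names and tensor layouts are illustrative, boxes are assumed to be already decoded to (x1, y1, x2, y2) image coordinates, and the NMS threshold 0.5 is the paper's setting:

```python
import torch
from torchvision.ops import nms

def postprocess(level_scores, level_boxes,
                score_thresh=0.05, topk=1000, nms_thresh=0.5):
    """Per-level filtering, merging across levels, then a single NMS pass."""
    kept_scores, kept_boxes = [], []
    for scores, boxes in zip(level_scores, level_boxes):  # one pair per FPN level
        mask = scores > score_thresh                      # detection threshold 0.05
        scores, boxes = scores[mask], boxes[mask]
        scores, idx = scores.topk(min(topk, scores.numel()))  # top-1000 per level
        kept_scores.append(scores)
        kept_boxes.append(boxes[idx])
    scores, boxes = torch.cat(kept_scores), torch.cat(kept_boxes)
    keep = nms(boxes, scores, iou_threshold=nms_thresh)   # merge levels, then NMS
    return boxes[keep], scores[keep]
```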

Experimental results:


[Table: experimental results]
Comparison of the speed/accuracy trade-off with other detectors:
[Figure: speed vs. accuracy trade-off]

Source: blog.csdn.net/yanghao201607030101/article/details/110083394