RetinaNet and Focal Loss

Reprinted

These are paper-reading notes.

Paper link: https://arxiv.org/abs/1708.02002

Code link: https://github.com/facebookresearch/Detectron

First, a question: why is the accuracy of one-stage methods lower than that of two-stage methods?

This is the main problem discussed and addressed in the paper.

The authors conclude that a very important factor is the imbalance between positive and negative samples in one-stage methods.

So why are positive and negative samples so imbalanced in one-stage methods?

Here is an example:

First look at a typical one-stage method, SSD. In SSD, priors (default boxes) are generated according to fixed rules, producing roughly 10,000 candidate regions. Because the sampling is regular, the candidates are dense and redundant; even with hard example mining (sorting by the classification loss and keeping the top portion), many easy samples remain, so the positive and negative samples are unbalanced.
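As a minimal sketch of the hard example mining step mentioned above (a simplified version of the idea, not SSD's exact implementation; the per-anchor losses and the 3:1 cap are assumptions for illustration):

```python
import numpy as np

def hard_negative_mining(cls_loss, labels, neg_pos_ratio=3):
    """Keep all positive anchors and only the highest-loss negatives,
    capped at neg_pos_ratio * (#positives).
    cls_loss: per-anchor classification loss; labels: 1 = positive, 0 = negative."""
    pos_mask = labels == 1
    num_neg = neg_pos_ratio * max(int(pos_mask.sum()), 1)

    neg_loss = np.where(pos_mask, -np.inf, cls_loss)   # exclude positives
    hardest_neg = np.argsort(-neg_loss)[:num_neg]      # largest-loss negatives first

    keep = pos_mask.copy()
    keep[hardest_neg] = True
    return keep  # boolean mask of anchors that contribute to the loss

# Example: 10 anchors, 1 positive and 9 negatives of varying difficulty
loss   = np.array([2.0, 0.01, 0.02, 1.5, 0.03, 0.01, 0.9, 0.02, 0.01, 0.02])
labels = np.array([1,   0,    0,    0,   0,    0,    0,   0,    0,    0])
print(hard_negative_mining(loss, labels))
```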

Compare this with the two-stage approach, with Faster R-CNN as the representative. In Faster R-CNN, the RPN network first generates candidate regions for target boxes, roughly 1000-2000 candidate boxes, which filters out most of the simple negative samples; then, in the classification stage, a 1:3 positive-to-negative sampling ratio or OHEM is used to keep the positive and negative samples more balanced.

A brief summary of the advantages of the two-stage approach:

  • It uses a two-stage cascade: the RPN stage reduces the number of candidate regions to about 1-2k, far fewer than in a one-stage detector.

  • Biased mini-batch sampling: instead of sampling at random, samples are chosen according to their overlap with the ground truth (for example, taking negatives whose IoU with the ground truth lies in 0.1-0.3), which eliminates a large share of the easy samples; fixing the positive-to-negative ratio, e.g. 1:3, further keeps the classes balanced (see the sketch after this list).
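A minimal sketch of the biased mini-batch sampling described in the second point (the IoU thresholds and the 1:3 ratio are the values cited above; the batch size of 128 and the helper name are assumptions for illustration):

```python
import numpy as np

def sample_rois(max_ious, batch_size=128, pos_fraction=0.25,
                pos_iou=0.5, neg_iou_range=(0.1, 0.3)):
    """Pick a mini-batch of RoIs with roughly 1 positive : 3 negatives.
    max_ious: each proposal's maximum IoU with any ground-truth box."""
    pos_idx = np.where(max_ious >= pos_iou)[0]
    neg_idx = np.where((max_ious >= neg_iou_range[0]) &
                       (max_ious < neg_iou_range[1]))[0]

    num_pos = min(int(batch_size * pos_fraction), len(pos_idx))
    num_neg = min(batch_size - num_pos, len(neg_idx))

    pos_sample = np.random.choice(pos_idx, num_pos, replace=False)
    neg_sample = np.random.choice(neg_idx, num_neg, replace=False)
    return np.concatenate([pos_sample, neg_sample])

# Example: 2000 proposals coming out of the RPN
ious = np.random.rand(2000)
batch = sample_rois(ious)
print(len(batch))  # around 128, roughly a quarter of them positives
```

Proposals with IoU below 0.1 never enter the batch at all, which is how the second stage discards most of the trivially easy negatives.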

Problems caused by the class imbalance between positive and negative samples:

  • Training is inefficient: most locations are easy negatives that contribute no useful learning signal.
  • En masse, the easy negatives can overwhelm training and lead to degenerate models.

To solve this problem, the authors propose Focal Loss and design RetinaNet.


A method to address the positive/negative imbalance: Focal Loss

Focal Loss is a modification of the cross-entropy loss, so it is worth reviewing the cross-entropy loss first.

The cross-entropy formula is as follows; here we use the simple binary cross-entropy loss, i.e. there are only two classes:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise.} \end{cases}$$
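A rough back-of-the-envelope illustration of why this matters (the anchor count of roughly $10^5$ is the order of magnitude RetinaNet evaluates per image; treating a predicted foreground probability of $p = 0.1$ as an "easy" negative is just an assumption for illustration): even a negative that the model already classifies well contributes a non-trivial loss, and summed over a dense set of anchors the easy negatives dominate:

$$-\log(1 - 0.1) \approx 0.105, \qquad 10^5 \times 0.105 \approx 1.05 \times 10^4,$$

which dwarfs the loss contributed by the handful of positive anchors in the image.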

To simplify the notation, define

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases}$$

so that $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$. Focal loss adds a modulating factor $(1 - p_t)^{\gamma}$ (together with a class-balancing weight $\alpha_t$):

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t).$$

When $\gamma = 2$, the loss of most negative samples becomes very small and only a small fraction remains relatively large, which shows that focal loss suppresses simple negative samples so that most of them have little effect on training.
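A minimal NumPy sketch of the binary focal loss defined above, compared against plain cross-entropy ($\alpha = 0.25$ and $\gamma = 2$ are the defaults reported in the paper; the example probabilities are made up):

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is p for
    positives (y = 1) and 1 - p for negatives (y = 0)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)              # numerical stability
    p_t = np.where(y == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# One easy negative and one hard positive, both predicted with p = 0.1
p = np.array([0.1, 0.1])    # predicted foreground probability
y = np.array([0, 1])        # ground-truth labels
ce = -np.log(np.where(y == 1, p, 1 - p))
fl = binary_focal_loss(p, y)
print("CE:", ce)   # [0.105, 2.303]
print("FL:", fl)   # [~0.0008, ~0.466]
```

For the easy negative ($p_t = 0.9$) the modulating factor $(1 - p_t)^2 = 0.01$ cuts its loss by roughly two orders of magnitude, while the hard positive keeps a much larger share of the total loss, which is exactly the suppression of easy negatives described above.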


The following shows the network's detection accuracy and speed, together with a comparison against state-of-the-art networks.

Looking at the two tables below, you can see that the reported APs are not the same; the main reason is that some entries use extra strategies, such as scale transformations.

[Table: speed versus accuracy of RetinaNet compared with other detectors]

[Table: comparison with state-of-the-art methods on COCO test-dev]

The above is my understanding of Focal Loss and RetinaNet; if anything is wrong, corrections are welcome.



Origin blog.csdn.net/weixin_44523062/article/details/105189680