Paper information
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár, Focal Loss for Dense Object Detection, ICCV 2017.
https://arxiv.org/abs/1708.02002
Development history
We focus on work after the re-emergence of neural networks; earlier developments are only mentioned briefly.
Two schools
Detectors fall into two main schools, one-stage and two-stage. Both pursue accuracy and speed, but in practice end up trading one for the other.
| | one-stage | two-stage |
|---|---|---|
| algorithms | YOLO, YOLT | R-CNN, SPPnet |
| accuracy | low (30% mAP) | high (60%+ mAP) |
| speed | fast (100+ FPS) | slow (5 FPS) |
two stage
Its main feature is using an algorithm (selective search, RPN, etc.) to generate a series of region proposals, which are then fed to a pre-trained neural network (VGG-16, ResNet, etc.) for classification.
In particular, RPN pre-sorts candidates into background vs. foreground.
one stage
Most models tile the image with a dense set of "anchors", similar to sliding-window classifiers, and classify them directly.
RetinaNet
The model is a one-stage detector, and its purpose is evidently to improve one-stage accuracy.
Main contribution
It makes one key point: an important cause of the insufficient accuracy of one-stage detectors is class imbalance.
class imbalance
The so-called class imbalance arises as follows: two-stage models mostly pre-sort background (bg) vs. foreground (fg) in the proposal stage, so the number of bg examples fed to the classifier is not vastly larger than the number of fg examples. One-stage models abandon the proposal step to gain speed, so most of them have no such pre-sorting, and the bg/fg counts become extremely uneven, often differing by two orders of magnitude.
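A rough numeric sketch of why this matters (the anchor counts and probabilities below are illustrative assumptions, not numbers from the paper): even though each easy background anchor contributes a tiny loss, their sheer number lets them dominate the total cross entropy.

```python
import math

def ce(p_t):
    """Cross-entropy loss given the probability of the true class."""
    return -math.log(p_t)

# Illustrative numbers (assumptions for this sketch, not from the paper):
# many confidently classified background anchors vs. a few hard foreground ones.
n_easy_bg, p_easy = 100_000, 0.99  # easy bg: tiny loss each
n_hard_fg, p_hard = 100, 0.10      # hard fg: large loss each

easy_total = n_easy_bg * ce(p_easy)  # ~1005
hard_total = n_hard_fg * ce(p_hard)  # ~230

# The numerically dominant easy class overwhelms the total loss.
print(easy_total > hard_total)  # True
```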
Some previous solutions
OHEM simply discards a fraction of the easy examples, which inevitably makes the training data incomplete and thereby affects the results.
The solution in this paper
It proposes a new loss function:
$$
CE(p_t) = -\log(p_t) \\
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) = \alpha_t (1 - p_t)^\gamma \, CE(p_t)
$$
where
$$
p_t = \begin{cases} p & y = 1 \\ 1 - p & \text{otherwise} \end{cases} \\
\alpha_t = \begin{cases} \alpha & y = 1 \\ 1 - \alpha & \text{otherwise} \end{cases}
$$
In particular, $CE$ is the cross entropy; the experiments found that $\gamma = 2$, $\alpha = 0.25$ give the best results.
From the plot in the paper we can see that for categories the model is already confident about (easy samples, e.g. bg) the loss is set small, while for uncertain categories (hard samples, e.g. fg) the loss stays large; this prevents the class with a huge numerical advantage from dominating the total loss.
We analyze the loss function in four cases:
Positive, easy to classify: $y = 1,\ p \approx 1$
Here $p_t = p \approx 1$, so $(1 - p_t)^\gamma \approx 0$ and $FL(p_t) \ll CE(p_t)$.
Positive, hard to classify: $y = 1,\ p \approx 0$
Here $p_t = p \approx 0$, so $(1 - p_t)^\gamma \approx 1$ and $FL(p_t) \approx CE(p_t)$ (up to $\alpha_t$).
Negative, hard to classify: $y = -1,\ p \approx 1$
Here $p_t = 1 - p \approx 0$, so $(1 - p_t)^\gamma \approx 1$ and $FL(p_t) \approx CE(p_t)$ (up to $\alpha_t$).
Negative, easy to classify: $y = -1,\ p \approx 0$
Here $p_t = 1 - p \approx 1$, so $(1 - p_t)^\gamma \approx 0$ and $FL(p_t) \ll CE(p_t)$.
In a nutshell, the influence of the large number of easy examples is reduced.
In particular, the loss of hard (misclassified) examples is essentially unchanged, which is also very reasonable.
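The four cases above can be checked numerically with a minimal plain-Python version of the formula (a sketch, not the authors' implementation), using the paper's $\gamma = 2$, $\alpha = 0.25$:

```python
import math

def ce(p, y):
    """Cross entropy; p is the predicted probability that y = 1."""
    p_t = p if y == 1 else 1 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# Ratio FL/CE = alpha_t * (1 - p_t)**gamma for the four cases:
print(focal_loss(0.99, 1) / ce(0.99, 1))    # easy positive: ~2.5e-5, FL << CE
print(focal_loss(0.01, 1) / ce(0.01, 1))    # hard positive: ~0.245, FL ~ alpha_t * CE
print(focal_loss(0.99, -1) / ce(0.99, -1))  # hard negative: ~0.735, FL ~ alpha_t * CE
print(focal_loss(0.01, -1) / ce(0.01, -1))  # easy negative: ~7.5e-5, FL << CE
```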
Extra: network architecture
The left part of the network is easily recognized as an FPN; the right half is their own design. As the paper also mentions, it consists of two convolutional subnets whose parameters are not shared: one is used for classification, the other for anchor-box regression (4 dims).
Initialization
Bias initialization of the classification subnet:
$$
b = -\log\left(\frac{1 - \pi}{\pi}\right)
$$
Here $\pi$ means that at initialization every anchor is treated as foreground with probability $\pi$; the experiments found $\pi = 0.01$ to be appropriate.
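A quick numeric check of this initialization (a minimal sketch; the intended formula is $b = -\log((1-\pi)/\pi)$): with the final layer's weights near zero, a sigmoid unit with this bias outputs exactly $\pi$.

```python
import math

pi = 0.01  # prior probability assigned to "foreground" at initialization

# Bias of the classification subnet's final layer: b = -log((1 - pi) / pi).
b = -math.log((1 - pi) / pi)

# With the weights near zero, the pre-activation is just b,
# so the initial foreground probability is sigmoid(b) = pi.
p_init = 1 / (1 + math.exp(-b))
print(round(p_init, 6))  # 0.01
```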
Experimental results
One note: "OHEM 1:3" in the table means that, while selecting hard examples, the fg:bg ratio in the minibatch is enforced to be 1:3.
Judging from this figure, RetinaNet can be called genuinely state-of-the-art.
Appendix
A
First, define:
$$
x_t = yx, \quad y \in \{\pm 1\}
$$
where $x$ denotes the model's raw output (logit).
A variant of the Focal Loss (FL*) was also tried; the paper's own summary of the conclusion:
More generally, we expect any loss function with similar properties as FL or FL* to be equally effective.
B
$$
\frac{dCE}{dx} = y(p_t - 1) \\
\frac{dFL}{dx} = y(1 - p_t)^\gamma(\gamma p_t \log(p_t) + p_t - 1) \\
\frac{dFL^*}{dx} = y(p_t^* - 1)
$$
The conclusion: when $x_t > 0$ (the correctly classified side), the derivative of FL is closer to zero than that of CE.
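These derivatives can be sanity-checked numerically (a minimal sketch assuming $p = \sigma(x)$, so $p_t = \sigma(yx)$; $\alpha$ omitted, as in the appendix):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def fl(x, y, gamma=2.0):
    """Focal loss (alpha omitted) as a function of the logit x."""
    p_t = sigmoid(y * x)
    return -(1 - p_t) ** gamma * math.log(p_t)

def dfl_dx(x, y, gamma=2.0):
    """Analytic derivative: y * (1-p_t)^gamma * (gamma*p_t*log(p_t) + p_t - 1)."""
    p_t = sigmoid(y * x)
    return y * (1 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1)

def dce_dx(x, y):
    """Analytic derivative of CE: y * (p_t - 1)."""
    return y * (sigmoid(y * x) - 1)

# Central finite differences agree with the analytic formula.
eps = 1e-6
for x, y in [(2.0, 1), (-1.0, 1), (0.5, -1)]:
    numeric = (fl(x + eps, y) - fl(x - eps, y)) / (2 * eps)
    assert abs(numeric - dfl_dx(x, y)) < 1e-5

# For x_t = y*x > 0 (the correct side), FL's gradient is much closer to zero.
print(abs(dfl_dx(2.0, 1)) < abs(dce_dx(2.0, 1)))  # True
```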
Summary
Be good at discovering the essence of a problem: starting from the then-popular methods, the authors identified an important reason why one-stage algorithms lag behind two-stage ones.