Focal Loss for Dense Object Detection

Abstract
The highest-accuracy object detectors to date are R-CNN-based two-stage methods, in which a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors are faster and simpler, but so far they have trailed the accuracy of two-stage detectors. In this article, we investigate why. We find that the extreme foreground-background class imbalance encountered during training is a central cause. We propose to address this class imbalance by reshaping the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. Our novel focal loss concentrates training on a sparse set of hard examples and prevents the large number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector, which we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of existing state-of-the-art detectors.


1. Introduction
The current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework, the first stage generates a sparse set of candidate object locations, and the second stage uses a convolutional neural network to classify each candidate location into one of the foreground classes or as background. Through a series of improvements, this two-stage framework consistently achieves top accuracy on the COCO benchmark.
Given how successful the two-stage detector has been, a natural question is: can a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations with different scales and aspect ratios. Recent work on one-stage detectors, such as YOLO and SSD, shows promising results: they are faster than two-stage detectors and reach accuracy within 10-40% of state-of-the-art two-stage methods.
For the first time, we present a one-stage detector that matches the COCO accuracy of more complex two-stage detectors, such as FPN and the Faster R-CNN variant Mask R-CNN. To achieve this result, we identify class imbalance during training as the main obstacle preventing one-stage detectors from reaching state-of-the-art accuracy, and we propose a new loss function that removes this obstacle.
In R-CNN-like detectors, the class imbalance problem is addressed by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search, EdgeBoxes, DeepMask, RPN) rapidly narrows down the number of candidate object locations, filtering out most background samples. In the second, classification stage, sampling heuristics such as a fixed foreground-to-background ratio or OHEM are used to maintain a reasonable balance between foreground and background.

In contrast, a one-stage detector must process a much larger set of candidate object locations sampled regularly across an image. In practice this amounts to enumerating roughly 100k locations that densely cover spatial positions, scales, and aspect ratios. Although similar sampling heuristics can also be applied, they are inefficient because the training process is still dominated by easily classified background samples. This inefficiency is a classic problem in object detection, typically addressed through techniques such as bootstrapping or hard example mining.

In this paper, we propose a new loss function that handles class imbalance more effectively than previous approaches. The loss is a dynamically scaled cross-entropy loss, where the scaling factor decays to 0 as confidence in the correct class increases (see Figure 1).


Intuitively, this scaling factor automatically reduces the contribution of easy examples to the loss during training and quickly focuses the model on hard examples. Experiments show that our proposed focal loss lets us train a high-accuracy one-stage detector that significantly outperforms the alternatives of training with sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not critical, and we show that other forms can achieve similar results.
To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and the use of anchor boxes, drawing mainly on ideas from several prior works (Scalable object detection using deep neural networks; SSD: Single shot multibox detector; Faster R-CNN: Towards real-time object detection with region proposal networks; Feature pyramid networks for object detection). RetinaNet is both efficient and accurate: our best model, based on a ResNet-101-FPN backbone, achieves 39.1% AP on COCO test-dev while running at 5 fps, surpassing the previously published best results of both one-stage and two-stage detectors (see Figure 2).



2. Related Work
Classic object detectors: The sliding-window paradigm, in which a classifier is applied over a dense grid of image locations, has a long and rich history. One of the earliest successes was the application of convolutional neural networks to handwritten digit recognition by LeCun et al. Viola and Jones used boosted detectors for face detection, sparking widespread adoption of similar models. The introduction of HOG and integral channel features gave rise to effective methods for pedestrian detection. DPMs extended dense detectors to more general object classes and achieved top results on the PASCAL dataset for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning the two-stage detectors described next quickly came to dominate object detection.
Two-stage detectors: The dominant paradigm in modern object detection is based on a two-stage approach, pioneered by Selective Search. The first stage generates a sparse set of candidate proposals that should cover all objects while filtering out the majority of negative regions; the second stage classifies the proposals into foreground classes or background. R-CNN upgraded the second-stage classifier to a convolutional neural network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN has since been improved in terms of speed and by using learned object proposals. RPN integrates proposal generation with the second-stage classifier into a single convolutional network, forming the basic framework of Faster R-CNN. Numerous extensions of this framework have also been proposed (FPN, OHEM, top-down modulation for object detection, ResNet, Mask R-CNN).
One-stage detectors: OverFeat was the first modern one-stage object detector based on deep networks. More recently, SSD and YOLO have renewed interest in one-stage methods. These detectors are tuned for speed, so their accuracy trails two-stage methods: SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off (see Figure 2). Recent work shows that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, whereas one-stage methods trail in accuracy even with a larger compute budget. In contrast, the aim of this paper is to determine whether a one-stage detector can match or exceed the accuracy of two-stage detectors while running at similar or faster speeds.
Our RetinaNet shares many similarities with previous dense detectors, in particular the anchor concept introduced by RPN and the feature pyramids used in SSD and FPN. We emphasize that our better results are due not to innovations in network design but to our new loss function.
Class Imbalance: Both classic one-stage object detection methods, such as boosted detectors (Rapid object detection using a boosted cascade of simple features; Integral channel features) and DPMs, and more recent methods such as SSD face a large class imbalance during training. These detectors evaluate on the order of 10,000-100,000 candidate locations per image, yet only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient, since most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is some form of hard negative mining (Learning and example selection for object and pattern detection; Rapid object detection using a boosted cascade of simple features; Cascade object detection with deformable part models; Training region-based object detectors with online hard example mining; SSD: Single shot multibox detector), i.e., sampling hard examples during training, or more complex sampling/reweighting schemes. In contrast, our proposed focal loss naturally handles the class imbalance faced by one-stage detectors and allows us to train efficiently on all examples, without sampling and without easy negatives overwhelming the loss and the computed gradients.
Robust Estimation: There has been much interest in designing robust loss functions (e.g., the Huber loss) that reduce the contribution of outliers by down-weighting examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss addresses class imbalance by down-weighting inliers (easy examples), so that their contribution to the total loss remains small even though their number is large. In other words, the focal loss plays the opposite role of a robust loss: it focuses training on a sparse set of hard examples.


3. Focal Loss
The focal loss is designed to address the extreme imbalance between foreground and background classes (e.g., 1:1000) encountered during the training of one-stage object detectors. We introduce the focal loss starting from the binary cross-entropy (CE) loss (extending the focal loss to the multi-class case is straightforward; for simplicity we discuss only the binary case here):

CE(p, y) = -log(p) if y = 1, and -log(1 - p) otherwise.

In the above expression, y takes one of two values, +1 or -1, denoting the ground-truth class, and p in [0, 1] is the model's estimated probability for the class y = 1. For notational convenience, we define pt as:

pt = p if y = 1, and 1 - p otherwise.

With this definition we can rewrite CE(p, y) as CE(pt) = -log(pt).

The cross-entropy loss is shown as the blue curve in Figure 1 (the top one). A notable property of this loss, visible in the plot, is that even samples that are easy to classify (pt >> 0.5) incur a non-trivial loss. Summed over a large number of easy examples, these small loss values can overwhelm the loss of the rare class.
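To make this concrete, here is a minimal numeric sketch in Python; the anchor counts and probabilities are illustrative assumptions on our part, not numbers from the paper. It sums the per-example CE loss over many easy negatives and over a handful of hard examples.

    import math

    def ce(pt):
        # cross-entropy as a function of pt (the probability of the true class)
        return -math.log(pt)

    # Assumed, illustrative numbers: ~100,000 easy background anchors classified
    # with pt = 0.99, versus 20 hard foreground anchors with pt = 0.1.
    easy_total = 100_000 * ce(0.99)   # ~1005: tiny per-example loss, huge in aggregate
    hard_total = 20 * ce(0.1)         # ~46: the examples we actually want to learn from
    print(easy_total, hard_total)     # the many easy negatives dominate the summed loss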

3.1 Balanced Cross Entropy
A common way to deal with class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 - α for class -1. In practice, α can be set by inverse class frequency or treated as a hyperparameter chosen by cross-validation. For convenience, we define αt analogously to how we defined pt, and write the α-balanced CE loss as:

CE(pt) = -αt log(pt).
This loss is a simple extension of CE, and we treat it as an experimental baseline for our proposed focal loss.
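As a sketch of this baseline in Python (the helper name and the default α = 0.25 are our own assumptions for illustration, not values fixed by this section):

    import math

    def alpha_balanced_ce(p, y, alpha=0.25):
        """alpha-balanced cross entropy: CE(pt) = -alpha_t * log(pt).
        y is +1 or -1; alpha weights the positive class, (1 - alpha) the negative class.
        The default alpha is an assumed placeholder."""
        pt = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        return -alpha_t * math.log(pt)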


3.2 Focal Loss Definition
As our experiments will show, the large class imbalance encountered when training dense detectors overwhelms the CE loss: easily classified negatives make up the majority of the loss and dominate the gradient. While α balances the importance of positive and negative examples, it does not distinguish easy from hard examples. We therefore propose to reshape the loss function to down-weight easy examples, thereby focusing training on hard negatives.
More formally, we add a modulating factor (1 - pt)^γ to the CE loss, where γ ≥ 0 is a tunable focusing parameter. We define the focal loss as:

FL(pt) = -(1 - pt)^γ log(pt).
The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and pt is small, the modulating factor is close to 1 and the loss is nearly unaffected; as pt approaches 1, the factor goes to 0, and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ increases, the effect of the modulating factor increases accordingly.
Intuitively, the modulating factor reduces the loss contribution of easy examples and extends the range of pt over which an example receives low loss. For instance, with γ = 2, an example classified with pt ≈ 0.968 has a loss about 1000x smaller than under CE. This in turn increases the relative importance of correcting misclassified examples, whose loss is scaled down by at most 4x when pt ≤ 0.5 and γ = 2.
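Since the modulating factor (1 - pt)^γ is exactly the ratio FL/CE, these scaling claims are easy to check with a few lines of Python:

    gamma = 2.0
    for pt in (0.5, 0.9, 0.968):
        ratio = (1 - pt) ** gamma      # FL(pt) / CE(pt)
        print(pt, ratio)               # 0.25 (at most 4x down), 0.01 (~100x), ~0.001 (~1000x)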

In practice we use an α-balanced variant of the focal loss:

FL(pt) = -αt (1 - pt)^γ log(pt).
We adopt this form in our experiments because it yields slightly better accuracy than the non-α-balanced form. Finally, we note that our implementation of the loss layer combines the sigmoid operation used to compute p with the loss computation, which results in greater numerical stability.
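Below is a minimal sketch of this α-balanced focal loss in PyTorch (an assumed framework choice on our part; the function name and the defaults α = 0.25, γ = 2 are ours for illustration). It obtains -log(pt) from raw logits via binary_cross_entropy_with_logits, which fuses the sigmoid with the log for numerical stability, as described above.

    import torch
    import torch.nn.functional as F

    def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """FL(pt) = -alpha_t * (1 - pt)**gamma * log(pt), computed from raw logits.
        targets: 1.0 for foreground, 0.0 for background (same shape as logits)."""
        p = torch.sigmoid(logits)
        # -log(pt), computed in a numerically stable way directly from the logits
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        pt = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return alpha_t * (1 - pt) ** gamma * ce

    # Usage sketch: per-anchor loss, then sum (and normalize, e.g., by the number
    # of foreground anchors) before backpropagation.
    logits = torch.randn(8, requires_grad=True)
    targets = torch.tensor([1., 0., 0., 0., 1., 0., 0., 0.])
    loss = sigmoid_focal_loss(logits, targets).sum()
    loss.backward()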

Although our main experimental results use the loss defined above, its exact form is not critical. In the appendix we consider other instantiations of the focal loss and find them to be equally effective.
