Image Detection - RetinaNet: Focal Loss for Dense Object Detection (arXiv 2018)

Disclaimer: This translation is only a personal study record


Summary

  The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, in which a classifier is applied to a sparse set of candidate object locations. In contrast, single-stage detectors that regularly, densely sample possible object locations have the potential to be faster and simpler, but have so far lagged behind two-stage detectors in accuracy. In this paper, we investigate why this is the case. We find that the extreme foreground-background class imbalance encountered during the training of dense detectors is the main cause. We propose to address this class imbalance by reshaping the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the large number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector, which we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous single-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. The code is at: https://github.com/facebookresearch/Detectron.


Figure 1. We propose a new loss, which we call the focal loss, that adds a factor (1 − p_t)^γ to the standard cross-entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > .5), putting more focus on hard, misclassified examples. As our experiments demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of a vast number of easy background examples.


Figure 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple single-stage RetinaNet detector outperforms all previous single-stage and two-stage detectors, including the best reported Faster R-CNN [28] system from [20]. We show RetinaNet variants with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP < 25), RetinaNet forms the upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. See §5 for details.

1 Introduction

  Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations, and the second stage uses a convolutional neural network to classify each candidate location as one of the foreground classes or as background. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].

  Despite the success of two-stage detectors, a natural question is: could a simple single-stage detector achieve similar accuracy? Single-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on single-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% of state-of-the-art two-stage methods.

  This paper pushes this concept further: we propose a single-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle preventing single-stage detectors from achieving state-of-the-art accuracy, and propose a new loss function that removes this obstacle.

  The class imbalance problem in R-CNN-like detectors is addressed by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) quickly narrows the number of candidate object locations down to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3) or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.

  In contrast, single-stage detectors have to deal with a larger set of candidate object locations sampled regularly across the image. In practice, this usually amounts to enumerating about 100k locations densely covering spatial locations, scales and aspect ratios. While similar sampling heuristics can also be applied, they are inefficient because the training process is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection and is usually addressed by techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].

  In this paper, we propose a new loss function as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples. Experiments show that our proposed Focal Loss enables us to train high-accuracy single-stage detectors that significantly outperform the alternatives of training with sampling heuristics or hard example mining, the previous state-of-the-art techniques for training single-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show that other instantiations can achieve similar results.

  To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector named RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and the use of anchor boxes, borrowing a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 frames per second, surpassing the previously published best results of both single-stage and two-stage detectors, see Figure 2.

2. Related work

Classical Object Detectors : The sliding-window paradigm, in which a classifier is applied to a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al., who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to the widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPM [8] helped extend dense detectors to more general object categories and achieved top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classical computer vision, with the resurgence of deep learning [18], the two-stage detectors described next quickly came to dominate object detection.

Two-Stage Detectors : The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes or background. R-CNN [11] upgraded the second-stage classifier to a convolutional network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [15, 10] and by using learned object proposals [6, 24, 28]. The Region Proposal Network (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].

Single-stage detectors : OverFeat [30] was one of the first modern single-stage object detectors based on deep networks. More recently, SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors are tuned for speed, but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work has shown that two-stage detectors can be made fast simply by reducing the input image resolution and the number of proposals, but one-stage methods trail in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to see whether single-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

  The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of "anchors" introduced by RPN [28] and the use of feature pyramids as in SSD [22] and FPN [20]. We emphasize that the top results achieved by our simple detector are due not to innovations in network design but to our novel loss.

Class Imbalance : Both classic one-stage object detection methods, such as boosted detectors [37, 5] and DPM [8], and more recent methods, such as SSD [22], face a large class imbalance during training. These detectors evaluate 10^4-10^5 candidate locations per image, but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient, as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [33, 37, 8, 31, 22] that samples hard examples during training, or more complex sampling/reweighting schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by single-stage detectors and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.

Robust Estimation : There has been much interest in designing robust loss functions (e.g., Huber loss [13]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss tackles class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss plays the opposite role of a robust loss: it focuses training on a sparse set of hard examples.

3. Focal Loss

  The Focal Loss is designed to address the single-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross-entropy (CE) loss for binary classification (extending the focal loss to the multi-class case is straightforward and works well; for simplicity, we focus on the binary loss in this work):

CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise.     (1)

In the above, y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1. For notational convenience, we define p_t:

p_t = p if y = 1, and 1 − p otherwise,     (2)

and rewrite CE(p, y) = CE(p_t) = −log(p_t).

  The CE loss is shown as the blue (top) curve in Figure 1. One notable property of this loss, easily seen in its plot, is that even examples that are easily classified (p_t ≫ .5) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.

3.1 Balanced Cross Entropy

  A common approach to addressing class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 − α for class −1. In practice, α may be set by inverse class frequency or treated as a hyperparameter to set by cross-validation. For notational convenience, we define α_t analogously to how we defined p_t. We write the α-balanced CE loss as:

CE(p_t) = −α_t log(p_t).     (3)

This loss is a simple extension to CE, which we consider as an experimental baseline for our proposed focal loss.

3.2 Definition of Focal Loss

  As our experiments will show, the large class imbalance encountered during the training of dense detectors overwhelms the cross-entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.

  More formally, we propose to add a modulating factor (1 − p_t)^γ to the cross-entropy loss, with a tunable focusing parameter γ ≥ 0. We define the focal loss as:

FL(p_t) = −(1 − p_t)^γ log(p_t).     (4)

  The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and p_t is small, the modulating factor is near 1 and the loss is unaffected. As p_t → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ increases the effect of the modulating factor likewise increases (we found γ = 2 to work best in our experiments).

  Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For example, with γ = 2, an example classified with p_t = 0.9 would have 100× lower loss compared with CE, and with p_t ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4× for p_t ≤ .5 and γ = 2).
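These reduction factors follow directly from the modulating factor: relative to CE, the loss is scaled by (1 − p_t)^γ, i.e. reduced by a factor of 1/(1 − p_t)^γ. A minimal Python sketch (not from the paper's code) that checks the numbers above:

```python
import math

gamma = 2.0
for p_t in (0.5, 0.9, 0.968):
    ce = -math.log(p_t)                      # cross-entropy loss
    fl = (1.0 - p_t) ** gamma * ce           # focal loss (without alpha)
    print(f"p_t={p_t:.3f}  CE={ce:.4f}  FL={fl:.6f}  reduction={ce / fl:.0f}x")
```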

  In practice, we use an α-balanced variant of the focal loss:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t).     (5)

We adopt this form in our experiments as it yields slightly improved accuracy over the non-α-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.
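To make the formula concrete, a sketch of the α-balanced focal loss on sigmoid logits is given below in PyTorch, written as a thin wrapper around the numerically stable binary_cross_entropy_with_logits. This is only an illustration of Equation 5, not the Detectron implementation; the function name and defaults (α = 0.25, γ = 2) are chosen for this example.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="sum"):
    """Alpha-balanced focal loss (Eq. 5) on raw logits; targets are float 0/1 labels."""
    p = torch.sigmoid(logits)
    # BCE-with-logits fuses sigmoid and log for numerical stability.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # p_t as in Eq. 2
    loss = (1 - p_t) ** gamma * ce                     # modulating factor (Eq. 4)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * loss                              # alpha-balancing (Eq. 5)
    if reduction == "sum":
        return loss.sum()
    if reduction == "mean":
        return loss.mean()
    return loss
```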

  Although in our main experimental results we use the above definition of focal loss, its precise form is not critical. In the appendix, we consider other instances of focal loss and show that these are equally effective.

3.3 Class Imbalance and Model Initialization

  By default, binary classification models are initialized to output y=-1 or 1 with equal probability. With such an initialization, the loss due to frequent classes dominates the total loss and leads to instability in early training in the case of class imbalance. To deal with this, we introduce the notion of a "prior" for the model-estimated p-value for the rare class (foreground) at the beginning of training. We denote the prior by π and set it such that the model estimates p for rare class instances to be low, say 0.01. We note that this is a change in the model initialization (see §4.1), not a change in the loss function. We found that this improves the training stability of cross-entropy and focal loss in the presence of severe class imbalance.

3.4 Class Imbalance and Two-Stage Detectors

  Two-stage detectors are usually trained with the cross-entropy loss, without α-balancing or our proposed loss. Instead, they address class imbalance through two mechanisms: (1) a two-stage cascade and (2) biased minibatch sampling. The first cascade stage is an object proposal mechanism [35, 24, 28] that reduces the nearly infinite set of possible object locations to one or two thousand. Importantly, the selected proposals are not random but are likely to correspond to true object locations, which removes the vast majority of easy negatives. When training the second stage, biased sampling is typically used to construct minibatches containing, for instance, a 1:3 ratio of positive to negative examples. This ratio acts like an implicit α-balancing factor implemented via sampling. Our proposed focal loss is designed to address these mechanisms in a single-stage detection system directly via the loss function.

4. RetinaNet detector

  RetinaNet is a single, unified network consisting of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network. The first subnetwork performs convolutional object classification on the backbone's output; the second subnetwork performs convolutional bounding-box regression. The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to their exact values, as the experiments show. We describe each component of RetinaNet next.

Feature Pyramid Network Backbone : We adopt the Feature Pyramid Network (FPN) from [20] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections, so that the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image, see Figure 3(a)-(b). Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN) [23], as shown by its gains for RPN [28] and DeepMask-style proposals [24], as well as for two-stage detectors such as Fast R-CNN [10] and Mask R-CNN [14].

  Following [20], we build FPN on top of the ResNet architecture [16]. We construct a pyramid with levels P_3 through P_7, where l indicates the pyramid level (P_l has resolution 2^l lower than the input). As in [20], all pyramid levels have C = 256 channels. Details of the pyramid generally follow [20], with a few modest differences (RetinaNet uses feature pyramid levels P_3 to P_7, where P_3 to P_5 are computed from the output of the corresponding ResNet residual stage (C_3 through C_5) using top-down and lateral connections as in [20], P_6 is obtained via a 3×3 stride-2 convolution on C_5, and P_7 is computed by applying ReLU followed by a 3×3 stride-2 convolution on P_6. This differs slightly from [20]: (1) we do not use the high-resolution pyramid level P_2 for computational reasons, (2) P_6 is computed by strided convolution instead of downsampling, and (3) we include P_7 to improve large-object detection. These minor modifications improve speed while maintaining accuracy). While many design choices are not crucial, we emphasize that the use of an FPN backbone is; preliminary experiments using features only from the final ResNet layer yielded low AP.
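As a minimal sketch of the two extra levels described above, P_6 and P_7 can be produced from C_5 with two strided convolutions. This PyTorch snippet is illustrative (the module and variable names are not Detectron's), and the lateral/top-down construction of P_3-P_5 is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtraPyramidLevels(nn.Module):
    """Builds P6 and P7 (C = 256 channels) from the C5 feature map."""

    def __init__(self, c5_channels, out_channels=256):
        super().__init__()
        self.p6 = nn.Conv2d(c5_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c5):
        p6 = self.p6(c5)            # 3x3 stride-2 conv on C5
        p7 = self.p7(F.relu(p6))    # ReLU then 3x3 stride-2 conv on P6
        return p6, p7
```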

Anchors : We use translation-invariant anchor boxes similar to those in the RPN variant of [20]. The anchors have areas of 32^2 to 512^2 on pyramid levels P_3 to P_7, respectively. As in [20], at each pyramid level we use anchors at three aspect ratios {1:2, 1:1, 2:1}. For denser scale coverage than in [20], at each level we add anchors of sizes {2^0, 2^{1/3}, 2^{2/3}} times the original set of 3 aspect-ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level, and across levels they cover the scale range 32-813 pixels with respect to the network's input image.
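The anchor sizes implied above can be enumerated directly: a base area of 32^2 on P_3 doubling up to 512^2 on P_7, with three octave scales and three aspect ratios per level. A small Python sketch (the helper name is illustrative, not from Detectron):

```python
def anchor_sizes_per_level():
    """Return (size, aspect_ratio) pairs for each pyramid level P3..P7."""
    sizes = {}
    for level in range(3, 8):                    # P3 .. P7
        base = 2 ** (level + 2)                  # 32, 64, 128, 256, 512
        sizes[f"P{level}"] = [
            (base * 2 ** (k / 3), ratio)         # octave scales 2^0, 2^(1/3), 2^(2/3)
            for k in range(3)
            for ratio in (0.5, 1.0, 2.0)         # aspect ratios 1:2, 1:1, 2:1
        ]
    return sizes

print(len(anchor_sizes_per_level()["P3"]))       # A = 9 anchors per location
```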

  Each anchor is assigned a length-K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [28], but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5, and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length-K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen when its overlap lies in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
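A minimal sketch of this assignment rule, assuming boxes in (x1, y1, x2, y2) format and using torchvision's box_iou; the function name and the -1/-2 encoding are conventions for this example only:

```python
import torch
from torchvision.ops import box_iou

def assign_anchors(anchors, gt_boxes, fg_iou=0.5, bg_iou=0.4):
    """Per anchor: index of the matched ground-truth box, -1 for background,
    or -2 for anchors ignored during training (IoU in [0.4, 0.5))."""
    iou = box_iou(anchors, gt_boxes)         # [num_anchors, num_gt]
    max_iou, matched_gt = iou.max(dim=1)     # best ground-truth box per anchor
    assignment = matched_gt.clone()
    assignment[max_iou < fg_iou] = -2        # below 0.5: not foreground
    assignment[max_iou < bg_iou] = -1        # below 0.4: background
    return assignment
```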


Figure 3. The single-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [20] backbone on top of a feedforward ResNet architecture [16] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone, RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that closes the accuracy gap between our single-stage detector and state-of-the-art two-stage detectors such as Faster R-CNN with FPN [20], while running at faster speeds.

Classification Subnetwork : The classification subnetwork predicts the probability of object presence at each spatial location, for each of the A anchors and K object classes. This subnetwork is a small FCN attached to each FPN level; its parameters are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnetwork applies four 3×3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3×3 conv layer with KA filters. Finally, sigmoid activations are attached to output the KA binary predictions per spatial location, see Figure 3(c). We use C = 256 and A = 9 in most experiments. In contrast to RPN [28], our object classification subnet is deeper, uses only 3×3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of the hyperparameters.

Box regression subnetwork : In parallel with the object classification subnetwork, we attach another small FCN to each pyramid level, with the aim of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnetwork is identical to the classification subnetwork except that it terminates in 4A linear outputs per spatial location, see Figure 3(d). For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [11]). We note that, unlike most recent work, we use a class-agnostic bounding box regressor, which uses fewer parameters and which we found to be equally effective. The object classification subnetwork and the box regression subnetwork, though sharing a common structure, use separate parameters.
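Both heads can be sketched with the same small builder in PyTorch. The code below is illustrative only (K = 80 is used as an example class count), not the released model, and the sigmoid on the classification output is left to the loss as noted above:

```python
import torch
import torch.nn as nn

def head(in_channels, out_channels):
    """Four 3x3 conv + ReLU layers, then a final 3x3 prediction conv."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

C, A, K = 256, 9, 80                        # channels, anchors per location, classes
cls_subnet = head(C, K * A)                 # classification head (sigmoid in the loss)
box_subnet = head(C, 4 * A)                 # class-agnostic box regression head

# The same two heads are applied to every pyramid level (parameters shared).
feature = torch.randn(1, C, 50, 50)
print(cls_subnet(feature).shape, box_subnet(feature).shape)
# -> [1, K*A, 50, 50] and [1, 4*A, 50, 50]
```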

4.1 Inference and training

Inference : RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Figure 3. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.
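A simplified sketch of this decode path, using torchvision's NMS; it is class-agnostic and omits the conversion from regression outputs to boxes, so it illustrates only the filtering and merging steps (function names are placeholders):

```python
import torch
from torchvision.ops import nms

def filter_level(boxes, scores, score_thresh=0.05, topk=1000):
    """Per FPN level: threshold confidences at 0.05 and keep the top 1k."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    if scores.numel() > topk:
        scores, idx = scores.topk(topk)
        boxes = boxes[idx]
    return boxes, scores

def merge_levels(boxes_per_level, scores_per_level, iou_thresh=0.5):
    """Merge top predictions from all levels and apply NMS at 0.5."""
    boxes = torch.cat(boxes_per_level)
    scores = torch.cat(scores_per_level)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```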

Focal Loss : We use the focal loss introduced in this work as the loss on the output of the classification subnetwork. As we will show in §5, we find that γ = 2 works well in practice and that RetinaNet is relatively robust to γ ∈ [0.5, 5]. We emphasize that when training RetinaNet, the focal loss is applied to all ~100k anchors in each sampled image. This stands in contrast to the common practice of selecting a small set of anchors (e.g., 256) per minibatch via heuristic sampling (RPN) or hard example mining (OHEM, SSD). The total focal loss for an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not the total number of anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally, we note that the weight α assigned to the rare class also has a stable range, but it interacts with γ, making it necessary to select the two together (see Tables 1a and 1b). In general, α should be decreased slightly as γ increases (for γ = 2, α = 0.25 works best).
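A short sketch of this normalization, reusing the assignment convention from the earlier anchor-assignment sketch (>= 0 matched, -1 background, -2 ignored); the per-anchor focal loss is passed in, so this snippet only illustrates the sum-and-normalize step:

```python
import torch

def normalize_image_loss(per_anchor_focal_loss, assignment):
    """Sum the focal loss over non-ignored anchors and divide by the
    number of anchors assigned to a ground-truth box."""
    valid = assignment != -2                             # drop ignored anchors
    total = per_anchor_focal_loss[valid].sum()
    num_assigned = (assignment >= 0).sum().clamp(min=1)  # assigned (positive) anchors
    return total / num_assigned
```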

Initialization : We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [20]. The base ResNet-50 and ResNet-101 models are pretrained on ImageNet1k; we use the models released in [16]. New layers added for FPN are initialized as in [20]. All new conv layers except the final one in the RetinaNet subnets are initialized with a Gaussian weight fill with σ = 0.01 and bias b = 0. For the final conv layer of the classification subnet, we set the bias initialization to b = −log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of ~π. We use π = .01 in all experiments, although results are robust to the exact value. As explained in §3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iterations of training.
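A sketch of this initialization in PyTorch; `subnet` and `final_cls_conv` are placeholder names for the conv layers of a head and the final classification conv, not identifiers from the released code:

```python
import math
import torch.nn as nn

def init_retinanet_head(subnet, final_cls_conv, prior=0.01):
    """Gaussian weights (sigma = 0.01), zero biases, and a prior-based bias
    on the final classification conv so each anchor starts at ~prior confidence."""
    for m in subnet.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
            nn.init.constant_(m.bias, 0.0)
    # b = -log((1 - pi) / pi): sigmoid(b) = pi, the foreground prior.
    nn.init.constant_(final_cls_conv.bias, -math.log((1.0 - prior) / prior))
```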


Table 1. Ablation experiments for RetinaNet and Focal Loss (FL) . Unless otherwise noted, all models are trained on trainval35k and tested on minival. If not specified, the defaults are: γ = 2; anchors for 3 scales and 3 aspect ratios; a ResNet-50-FPN backbone; and training and testing image scales of 600 pixels. (a) RetinaNet with α-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to the exact γ/α settings. (c) Using 2-3 scale and 3 aspect-ratio anchors yields good results, after which performance saturates. (d) FL outperforms the best variants of online hard example mining (OHEM) [31, 22] by more than 3 points AP. (e) Accuracy/speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Figure 2).

Optimization : RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k iterations and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. A weight decay of 0.0001 and a momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [10]. Training time ranges between 10 and 35 hours for the models in Table 1e.
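The schedule above maps directly onto a standard SGD setup; a minimal PyTorch sketch (the placeholder module stands in for RetinaNet, and the 8-GPU synchronous setup and data loading are omitted):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 9 * 80, 3, padding=1)     # placeholder for the RetinaNet module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at 60k and again at 80k of 90k total iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60_000, 80_000], gamma=0.1)
```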

5. Experiments

  We present experimental results on the bounding box detection track of the challenging COCO benchmark [21]. For training, we follow common practice [1, 20] and use the COCO trainval35k split (the union of the 80k images from train and a random 35k subset of the 40k-image val split). We report ablation and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

5.1 Dense Detection Training

  We run extensive experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use 50- or 101-layer ResNets [16] with a Feature Pyramid Network (FPN) [20] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.

Network initialization : Our first attempt to train RetinaNet uses the standard cross-entropy (CE) loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is π = .01 (see §4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of π, so we use π = .01 for all experiments.


Figure 4. Cumulative distribution functions of the normalized loss for positive and negative samples for different values of γ for a converged model. The effect of changing γ on the loss distribution for positive examples is minor. For negatives, however, increasing γ concentrates the loss heavily on hard examples, shifting almost all attention away from easy negatives.

Balanced Cross-Entropy : Our next attempt to improve learning involves using the α-balanced CE loss described in §3.1. The results for various α are shown in Table 1a. Setting α=.75 gives a 0.9 AP gain.

Focal Loss : Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter γ, which controls the strength of the modulating term. When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that "easy" examples with low loss get further discounted, see Figure 1. FL shows large gains over CE as γ increases. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.

  For the experiments in Table 1b, for a fair comparison we find the best α for each γ. We observe that lower α's are selected for higher γ's (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing γ is much larger, and indeed the best α's ranged in just [0.25, 0.75] (we tested α ∈ [0.01, 0.999]). We use γ = 2.0 with α = 0.25 for all experiments, but α = 0.5 works nearly as well (0.4 AP lower).

Focal Loss Analysis : To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with γ = 2 (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for ~10^7 negative windows and ~10^5 positive windows. Next, separately for positives and negatives, we compute the FL for these samples and normalize the loss so that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings of γ (even though the model was trained with γ = 2).
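The analysis step itself is simple to reproduce in outline; a small NumPy sketch of normalizing and accumulating the per-sample losses (the synthetic data here is only to make the snippet runnable):

```python
import numpy as np

def normalized_loss_cdf(losses):
    """Normalize per-sample losses to sum to 1, sort ascending, and
    return the cumulative distribution over the sorted samples."""
    losses = np.sort(np.asarray(losses, dtype=np.float64))
    return np.cumsum(losses / losses.sum())

# Example with synthetic, skewed losses: fraction of the total loss
# carried by the easiest 80% of samples.
cdf = normalized_loss_cdf(np.random.rand(100_000) ** 4)
print(cdf[int(0.8 * len(cdf)) - 1])
```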

  The cumulative distribution functions for positive and negative samples are shown in Figure 4. If we look at the positive samples, we see that the CDFs for different values of γ look fairly similar. For example, approximately 20% of the hardest positive samples account for roughly half of the positive loss; as γ increases, more of the loss is concentrated in the top 20% of examples, but the effect is minor.

  The effect of γ on negative samples is dramatically different. For γ = 0, the positive and negative CDFs appear quite similar. However, as γ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with γ = 2 (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.

Online Hard Example Mining (OHEM) : [31] proposed to improve the training of two-stage detectors by constructing minibatches from high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (NMS) is then applied, and a minibatch is constructed from the highest-loss examples. The NMS threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [22]: after applying NMS to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives, to help ensure each minibatch has enough positives.

  We test both OHEM variants in our setting of one-stage detection, which has large class imbalance. Results for the original OHEM strategy and the "OHEM 1:3" strategy for selected batch sizes and NMS thresholds are shown in Table 1d. These results use ResNet-101; our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, NMS threshold of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter settings and variants for OHEM but did not achieve better results.

Hinge loss : Finally, in early experiments, we attempted to train with the hinge loss [13] on p_t, which sets the loss to 0 above a certain value of p_t. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternative loss functions are in the appendix.


Table 2. Object detection single-model results (bounding box AP) versus the state of the art on COCO test-dev. We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5× longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Figure 2.

5.2 Model architecture design

Anchor Density : One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. Two-stage detectors can classify boxes at any position, scale, and aspect ratio using a region pooling operation [10]. In contrast, as one-stage detectors use a fixed sampling grid, a popular approach for achieving high coverage of boxes in these methods is to use multiple "anchors" [28] at each spatial position to cover boxes of various scales and aspect ratios.

  We sweep over the number of scale and aspect-ratio anchors used at each spatial position and each pyramid level in FPN. We consider cases from a single square anchor per location to 12 anchors per location spanning 4 sub-octave scales (2^{k/4}, for k ≤ 3) and 3 aspect ratios [0.5, 1, 2]. Results using ResNet-50 are shown in Table 1c. A surprisingly good AP (30.3) is achieved using just one square anchor. However, the AP improves by nearly 4 points (to 34.0) when using 3 scales and 3 aspect ratios per location. We use this setting for all other experiments in this work.

  Finally, we note that increasing beyond 6-9 anchors did not show further gains. Thus, while two-stage systems can classify arbitrary boxes in an image, the saturation of performance with respect to density implies the higher potential density of two-stage systems may not offer an advantage.

Speed vs. Accuracy : Larger backbone networks yield higher accuracy but also slower inference speeds. The same is true for the input image scale (defined by the shorter image side). We show the effect of these two factors in Table 1e. In Figure 2, we plot the speed/accuracy trade-off curve for RetinaNet and compare it to recent methods using public numbers on COCO test-dev. The plot reveals that RetinaNet, enabled by our focal loss, forms an upper envelope over all existing methods, discounting the low-accuracy regime. RetinaNet with ResNet-101-FPN and a 600-pixel image scale (which we denote by RetinaNet-101-600 for simplicity) matches the accuracy of the recently published ResNet-101-FPN Faster R-CNN [20], while running in 122 ms per image compared to 172 ms (both measured on an Nvidia M40 GPU). Using larger scales allows RetinaNet to surpass the accuracy of all two-stage approaches while still being faster. For faster runtimes, there is only one operating point (500-pixel input) at which using ResNet-50-FPN improves over ResNet-101-FPN. Addressing the high-frame-rate regime will likely require special network design, as in [27], and is beyond the scope of this work. We note that after publication, faster and more accurate results can now be obtained by the Faster R-CNN variant in [12].

5.3 Comparison to the State of the Art

  We evaluate RetinaNet on the challenging COCO dataset and compare test-dev results to recent state-of-the-art methods, including both one-stage and two-stage models. Results are presented in Table 2 for our RetinaNet-101-800 model, which was trained with scale jitter and for 1.5× longer than the model in Table 1e (giving a 1.3 AP gain). Compared to existing one-stage methods, our approach achieves a healthy 5.9-point AP gap (39.1 vs. 33.2) with its closest competitor, DSSD [9], while also being faster, see Figure 2. Compared to recent two-stage methods, RetinaNet outperforms the top-performing Faster R-CNN model based on Inception-ResNet-v2-TDM [32] by 2.3 points. Plugging in ResNeXt-32x8d-101-FPN [38] as the RetinaNet backbone further improves results by 1.7 AP, surpassing 40 AP on COCO.

6 Conclusion

  In this work, we identify class imbalance as the primary obstacle preventing single-stage object detectors from surpassing the top-performing two-stage methods. To address this, we propose the focal loss, which applies a modulating term to the cross-entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [12].


Figure 5. Focal loss variants compared to the cross entropy as a function of x_t = yx. Both the original FL and the alternative variant FL* reduce the relative loss for well-classified examples (x_t > 0).


Table 3. FL and FL* versus CE for selected settings.

Appendix A: Focal Loss*

  The exact form of the focal loss is not crucial. We now show an alternative instantiation of the focal loss that has similar properties and yields comparable results. The following also gives more insight into properties of the focal loss.

  We begin by considering both cross entropy (CE) and the focal loss (FL) in a slightly different form than in the main text. Specifically, we define a quantity x_t as follows:

x_t = yx,

where y ∈ {±1} specifies the ground-truth class as before. We can then write p_t = σ(x_t) (this is compatible with the definition of p_t in Equation 2). An example is correctly classified when x_t > 0, in which case p_t > .5.

  We can now define an alternative form of the focal loss in terms of x_t. We define p_t* and FL* as follows:

p_t* = σ(γx_t + β)
FL* = −log(p_t*) / γ

FL* has two parameters, γ and β, which control the steepness and shift of the loss curve. We plot FL* alongside CE and FL for two selected settings of γ and β in Figure 5. As can be seen, like FL, FL* with the selected parameters diminishes the loss assigned to well-classified examples.
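A short sketch of FL* in PyTorch under the definition above; `y` is assumed to hold labels in {−1, +1}, and the default parameter values are placeholders rather than the settings reported in Table 3:

```python
import torch
import torch.nn.functional as F

def focal_loss_star(logits, y, gamma=2.0, beta=1.0):
    """FL* = -log(sigmoid(gamma * x_t + beta)) / gamma, with x_t = y * x."""
    x_t = y * logits
    # logsigmoid provides a numerically stable log(sigmoid(.)).
    return -F.logsigmoid(gamma * x_t + beta) / gamma
```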

  We train RetinaNet-50-600 using the same setup as before, but we replace FL with FL* with selected parameters. These models achieve almost the same AP as those trained with FL, see Table 3. In other words, FL* is a reasonable alternative to FL that works well in practice.

  We found that various γ and β settings give good results. In Figure 7 we show results for RetinaNet-50-600 with FL* for a wide set of parameters. The loss plots are color-coded such that effective settings (the model converged and AP is above 33.5) are shown in blue. We used α = 0.25 in all experiments for simplicity. As can be seen, losses that reduce the weight of well-classified examples (x_t > 0) are effective.

  More generally, we expect any loss function with similar properties to FL or FL* to be equally effective.


Figure 6. Derivatives of the loss functions from Figure 5 with respect to x.


Figure 7. Effectiveness of FL* with various settings of γ and β. The plots are color-coded such that effective settings are shown in blue.

Appendix B: Derivatives

  For reference, the derivatives of CE, FL and FL* with respect to x are:

dCE/dx = y(p_t − 1)
dFL/dx = y(1 − p_t)^γ (γ p_t log(p_t) + p_t − 1)
dFL*/dx = y(p_t* − 1)

Plots for selected settings are shown in Figure 6. For all loss functions, the derivative tends to −1 or 0 for high-confidence predictions. However, unlike CE, for effective settings of FL and FL*, the derivative is tiny as soon as x_t > 0.

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016. 6
[2] S. R. Bulo, G. Neuhold, and P. Kontschieder. Loss maxpooling for semantic image segmentation. In CVPR, 2017. 3
[3] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 2
[5] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009. 2, 3
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. 2
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 2
[8] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010. 2, 3
[9] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv:1701.06659, 2016. 1, 2, 8
[10] R. Girshick. Fast R-CNN. In ICCV, 2015. 1, 2, 4, 6, 8
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 2, 5
[12] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018. 8
[13] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer series in statistics Springer, Berlin, 2008. 3, 7
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017. 1, 2, 4
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV. 2014. 2
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 2, 4, 5, 6, 8
[17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017. 2, 8
[18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2
[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989. 2
[20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 1, 2, 4, 5, 6, 8
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 1, 6
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016. 1, 2, 3, 6, 7, 8
[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 4
[24] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015. 2, 4
[25] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016. 2
[26] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 1, 2
[27] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017. 1, 2, 8
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 2, 4, 5, 8
[29] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995. 2
[30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 2
[31] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016. 2, 3, 6, 7
[32] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016. 2, 8
[33] K.-K. Sung and T. Poggio. Learning and Example Selection for Object and Pattern Detection. In MIT A.I. Memo No.1521, 1994. 2, 3
[34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, 2017. 8
[35] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013. 2, 4
[36] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994. 2
[37] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 2, 3
[38] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. 8
[39] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 2


Origin blog.csdn.net/i6101206007/article/details/132132682