Paper Translation: Scalable Object Detection using Deep Neural Networks

Scalable Object Detection using Deep Neural Networks

Authors: Dumitru Erhan, Christian Szegedy, Alexander Toshev, et al.

Published: 2013

Abstract

Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.


1. Introduction

Object detection is one of the fundamental tasks in computer vision. A common paradigm to address this problem is to train object detectors which operate on a subimage and apply these detectors in an exhaustive manner across all locations and scales. This paradigm was successfully used within a discriminatively trained Deformable Part Model (DPM) to achieve state-of-the-art results on detection tasks [6].



The exhaustive search through all possible locations and scales poses a computational challenge. This challenge becomes even harder as the number of classes grows, since most of the approaches train a separate detector per class. In order to address this issue, a variety of methods were proposed, ranging from detector cascades to using segmentation to suggest a small number of object hypotheses [14, 2, 4].



In this paper, we ascribe to the latter philosophy and propose to train a detector, called "DeepMultiBox", which generates a few bounding boxes as object candidates. These boxes are generated by a single DNN in a class-agnostic manner. Our model has several contributions. First, we define object detection as a regression problem to the coordinates of several bounding boxes. In addition, for each predicted box the net outputs a confidence score of how likely this box contains an object. This is quite different from traditional approaches, which score features within predefined boxes, and has the advantage of expressing detection of objects in a very compact and efficient way.



The second major contribution is the loss, which trains the bounding box predictors as part of the network training. For each training example, we solve an assignment problem between the current predictions and the ground truth boxes and update the matched box coordinates, their confidences and the underlying features through backpropagation. In this way, we learn a deep net tailored towards our localization problem. We capitalize on the excellent representation learning abilities of DNNs, as recently exemplified in image classification [10] and object detection settings [13], and perform joint learning of representation and predictors.



Finally, we train our object box predictor in a class-agnostic manner. We consider this a scalable way to enable efficient detection of a large number of object classes. We show in our experiments that by post-classifying less than ten boxes, obtained by a single network application, we can achieve state-of-the-art detection results. Further, we show that our box predictor generalizes over unseen classes and as such is flexible enough to be re-used within other detection problems.


2. Previous work

The literature on object detection is vast, and in this section we will focus on approaches exploiting class-agnostic ideas and addressing scalability.



Many of the proposed detection approaches are based on part-based models [7], which have more recently achieved impressive performance thanks to discriminative learning and carefully crafted features [6]. These methods, however, rely on exhaustive application of part templates over multiple scales and as such are expensive. Moreover, they scale linearly in the number of classes, which becomes a challenge for modern datasets such as ImageNet.



To address the former issue, Lampert et al. [11] use a branch-and-bound strategy to avoid evaluating all potential object locations. To address the latter issue, Song et al. [12] use a low-dimensional part basis, shared across all object classes. A hashing-based approach for efficient part detection has shown good results as well [3].



A different line of work, closer to ours, is based on the idea that objects can be localized without having to know their class. Some of these approaches build on bottom-up classless segmentation [9]. The segments, obtained in this way, can be scored using top-down feedback [14, 2, 4]. Using the same motivation, Alexe et al. [1] use an inexpensive classifier to score object hypotheses for being an object or not, and in this way reduce the number of locations for the subsequent detection steps. These approaches can be thought of as multi-layered models, with segmentation as the first layer and segment classification as a subsequent layer. Despite the fact that they encode proven perceptual principles, we will show that having deeper models which are fully learned can lead to superior results.



Finally, we capitalize on the recent advances in Deep Learning, most noticeably the work by Krizhevsky et al. [10]. We extend their bounding box regression approach for detection to the case of handling multiple objects in a scalable manner. DNN-based regression to object masks, however, has been applied by Szegedy et al. [13]. This last approach achieves state-of-the-art detection performance but does not scale up to multiple classes due to the cost of a single mask regression.



3. Proposed approach

We aim at achieving class-agnostic scalable object detection by predicting a set of bounding boxes, which represent potential objects. More precisely, we use a Deep Neural Network (DNN), which outputs a fixed number of bounding boxes. In addition, it outputs a score for each box expressing the network's confidence that the box contains an object.



Model    To formalize the above idea, we encode the i-th object box and its associated confidence as node values of the last net layer:



Bounding box: we encode the upper-left and lower-right coordinates of each box as four node values, which can be written as a vector l_i \in R^4. These coordinates are normalized w.r.t. the image dimensions to achieve invariance to absolute image size. Each normalized coordinate is produced by a linear transformation of the last hidden layer.



Confidence: the confidence score for the box containing an object is encoded as a single node value c_i \in [0, 1]. This value is produced through a linear transformation of the last hidden layer followed by a sigmoid.



We can combine the bounding box locations l_i, i \in {1, ..., K}, as one linear layer. Similarly, we can treat the collection of all confidences c_i, i \in {1, ..., K} as the output of one sigmoid layer. Both these output layers are connected to the last hidden layer.

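As an illustration of this output encoding, the following is a minimal NumPy sketch of the two output layers; the weight names (W_loc, W_conf) and shapes are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def multibox_head(h, W_loc, b_loc, W_conf, b_conf, K):
    """Read K candidate boxes and confidences off the last hidden layer.

    h:      (D,) activations of the last hidden layer
    W_loc:  (4K, D), b_loc: (4K,)  -- linear box-coordinate layer
    W_conf: (K, D),  b_conf: (K,)  -- linear confidence layer
    """
    # One linear layer: 4 normalized coordinates (upper-left, lower-right) per box.
    l = (W_loc @ h + b_loc).reshape(K, 4)
    # One linear layer followed by a sigmoid: confidence c_i in [0, 1] per box.
    c = 1.0 / (1.0 + np.exp(-(W_conf @ h + b_conf)))
    return l, c
```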


At inference time, our algorithm produces K bounding boxes. In our experiments, we use K = 100 and K = 200. If desired, we can use the confidence scores and non-maximum suppression to obtain a smaller number of high-confidence boxes at inference time. These boxes are supposed to represent objects. As such, they can be classified with a subsequent classifier to achieve object detection. Since the number of boxes is very small, we can afford powerful classifiers. In our experiments, we use another DNN for classification [10].



Training Objective     We train a DNN to predict bounding boxes and their confidence scores for each training image, such that the highest scoring boxes match well the ground truth object boxes for the image. Suppose that for a particular training example, M objects were labeled by bounding boxes g_j, j \in {1, ..., M}. In practice, the number of predictions K is much larger than the number of ground truth boxes M. Therefore, we try to optimize only the subset of predicted boxes which match the ground truth ones best. We optimize their locations to improve their match and maximize their confidences. At the same time, we minimize the confidences of the remaining predictions, which are deemed not to localize the true objects well.



To achieve the above, we formulate an assignment problem for each training example. Let x_{ij} \in \{0,1\} denote the assignment: x_{ij} = 1 iff the i-th prediction is assigned to the j-th true object. The objective of this assignment can be expressed as

F_{match}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \|l_i - g_j\|_2^2, \quad \text{s.t. } \sum_i x_{ij} = 1, \; x_{ij} \in \{0, 1\},   (1)

where we use the L2 distance between the normalized bounding box coordinates to quantify the dissimilarity between bounding boxes.



Additionally, we want to optimize the confidences of the boxes according to the assignment x. Maximizing the confidences of assigned predictions can be expressed as

F_{conf}(x, c) = -\sum_{i,j} x_{ij} \log(c_i) - \sum_i \Big(1 - \sum_j x_{ij}\Big) \log(1 - c_i).   (2)

In the above objective, \sum_j x_{ij} = 1 if prediction i has been matched to a ground truth. In that case c_i is being maximized, while in the opposite case it is being minimized. A different interpretation of the above term is achieved if we view \sum_j x_{ij} as a probability of prediction i containing an object of interest. Then, the above loss is the negative of the entropy and thus corresponds to a max entropy loss.



The final loss objective combines the matching and confidence losses:

F(x, l, c) = \alpha F_{match}(x, l) + F_{conf}(x, c),   (3)

subject to the constraints in Eq. 1. \alpha balances the contribution of the different loss terms.

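As a sanity check of Eqs. 1-3, here is a NumPy sketch of the combined objective for a fixed assignment; this is our own illustrative implementation, not the authors' code.

```python
import numpy as np

def multibox_loss(x, l, c, g, alpha=0.3):
    """F(x, l, c) = alpha * F_match(x, l) + F_conf(x, c) for one example.

    x: (K, M) binary assignment, x[i, j] = 1 iff prediction i matches ground truth j
    l: (K, 4) predicted normalized box coordinates
    c: (K,)  predicted confidences, strictly inside (0, 1)
    g: (M, 4) ground truth normalized box coordinates
    """
    # F_match = 0.5 * sum_ij x_ij ||l_i - g_j||^2 (Eq. 1).
    sq_dist = np.sum((l[:, None, :] - g[None, :, :]) ** 2, axis=-1)  # (K, M)
    f_match = 0.5 * np.sum(x * sq_dist)

    # F_conf = -sum_ij x_ij log c_i - sum_i (1 - sum_j x_ij) log(1 - c_i) (Eq. 2).
    matched = x.sum(axis=1)  # sum_j x_ij: 1 for matched predictions, 0 otherwise
    f_conf = -np.sum(matched * np.log(c) + (1.0 - matched) * np.log(1.0 - c))

    return alpha * f_match + f_conf  # Eq. 3
```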


Optimization        For each training example, we solve for an optimal assignment x* of predictions to true boxes by

x^* = \arg\min_x F(x, l, c), \quad \text{s.t. } \sum_i x_{ij} = 1, \; x_{ij} \in \{0, 1\},   (4)

where the constraints enforce an assignment solution. This is a variant of bipartite matching, which is polynomial in complexity. In our application the matching is very inexpensive: the number of labeled objects per image is less than a dozen and in most cases only very few objects are labeled.

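The paper does not name a specific solver; one concrete possibility, sketched below, is to build the K × M cost matrix implied by Eqs. 1-3 and run the Hungarian algorithm via SciPy's linear_sum_assignment. A constant term -\sum_i \log(1 - c_i) is dropped from the cost since it does not affect the argmin.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment(l, c, g, alpha=0.3):
    """Solve x* = argmin_x F(x, l, c) subject to the Eq. 1 constraints (Eq. 4)."""
    # Cost of assigning prediction i to ground truth j: the matching term plus
    # the confidence-loss change caused by marking prediction i as matched.
    sq_dist = 0.5 * np.sum((l[:, None, :] - g[None, :, :]) ** 2, axis=-1)  # (K, M)
    conf_delta = -np.log(c) + np.log(1.0 - c)                              # (K,)
    cost = alpha * sq_dist + conf_delta[:, None]
    rows, cols = linear_sum_assignment(cost)  # rectangular K x M is supported
    x = np.zeros((l.shape[0], g.shape[0]))
    x[rows, cols] = 1.0                       # each ground truth matched exactly once
    return x
```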


Then, we optimize the network parameters via backpropagation. For example, the first derivatives of the backpropagation algorithm are computed w.r.t. l and c:

\frac{\partial F}{\partial l_i} = \alpha \sum_j x^*_{ij} (l_i - g_j), \qquad \frac{\partial F}{\partial c_i} = -\frac{\sum_j x^*_{ij}}{c_i} + \frac{1 - \sum_j x^*_{ij}}{1 - c_i}



Training Details         While the loss as defined above is in principle sufficient, three modifications make it possible to reach better accuracy significantly faster. The first such modification is to perform clustering of ground truth locations and find K such clusters/centroids that we can use as priors for each of the predicted locations. Thus, the learning algorithm is encouraged to learn a residual to a prior for each of the predicted locations.

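A sketch of how such priors could be computed; the choice of scikit-learn's KMeans is ours for illustration (the text only specifies that k-means is run on the training set, see Section 4.1).

```python
import numpy as np
from sklearn.cluster import KMeans

def compute_box_priors(gt_boxes, K=100, seed=0):
    """Cluster normalized ground truth boxes into K priors (centroids).

    gt_boxes: (N, 4) array of normalized [x1, y1, x2, y2] coordinates.
    Returns a (K, 4) array of prior boxes used as regression anchors.
    """
    return KMeans(n_clusters=K, random_state=seed).fit(gt_boxes).cluster_centers_
```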


A second modification pertains to using these priors in the matching process: instead of matching the N ground truth locations with the K predictions, we find the best match between the K priors and the ground truth. Once the matching is done, the target confidences are computed as before. Moreover, the location prediction loss is also unchanged: for any matched pair of (target, prediction) locations, the loss is defined by the difference between the ground truth and the coordinates that correspond to the matched prior. We call the usage of priors for matching "prior matching" and hypothesize that it enforces diversification among the predictions.

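Under the same assumptions, prior matching amounts to running the assignment between the fixed priors and the ground truth, then training each matched prediction slot against the ground truth box its prior was assigned to:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def prior_matching(priors, g):
    """Match the K fixed priors (not the current predictions) to M ground truths.

    Returns a mapping {prediction slot i -> ground truth index j}; slot i then
    regresses toward g[j], i.e. learns a residual to its prior.
    """
    cost = np.sum((priors[:, None, :] - g[None, :, :]) ** 2, axis=-1)  # (K, M)
    prior_idx, gt_idx = linear_sum_assignment(cost)
    return dict(zip(prior_idx.tolist(), gt_idx.tolist()))
```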


It should be noted that, although we defined our method in a class-agnostic way, we can apply it to predicting object boxes for a particular class. To do this, we simply need to train our models on bounding boxes for that class.



Further, we can predict K boxes per class. Unfortunately, this model would have a number of parameters growing linearly with the number of classes. Also, in a typical setting, where the number of objects of a given class is relatively small, most of these parameters will see very few training examples with a corresponding gradient contribution. We thus argue that our two-step process – first localize, then recognize – is a superior alternative in that it allows leveraging data from multiple object types in the same image using a small number of parameters.



4. Experimental results

4.1. Network Architecture and Experiment Details

The network architecture for the localization and classification models that we use is the same as the one used by [10]. We use Adagrad for controlling the learning rate decay, mini-batches of size 128, and parallel distributed training with multiple identical replicas of the network, which achieves faster convergence. As mentioned previously, we use priors in the localization loss – these are computed using k-means on the training set. We also use an \alpha of 0.3 to balance the localization and confidence losses.



The localizer might output coordinates outside the crop area used for inference. At the end, the coordinates are mapped and truncated back to the final image area. Boxes are additionally pruned using non-maximum suppression with a Jaccard similarity threshold of 0.5. Our second model then classifies each bounding box as an object of interest or "background".

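For reference, a self-contained sketch of this pruning step: Jaccard (intersection-over-union) similarity plus greedy non-maximum suppression. The paper does not specify its exact implementation; this is the standard greedy variant.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard (IoU) similarity of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep boxes in score order, drop any overlapping a kept box."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(jaccard(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```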


To train our localizer networks, we generated approximately 30 million images from the training set by applying the following procedure to each image in the training set. For each image, we generate the same number of square samples such that the total number of samples is about ten million. For each image, the samples are bucketed such that for each of the ratios in the ranges 0–5%, 5–15%, 15–50%, 50–100%, there is an equal number of samples in which the ratio covered by the bounding boxes is in the given range.

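The bucketing rule can be made concrete with a small sketch (our own formulation of the described procedure); each square sample is placed in one of the four coverage-ratio buckets, and equal numbers of samples are then drawn per bucket.

```python
# Coverage-ratio buckets from the text: 0-5%, 5-15%, 15-50%, 50-100%.
BUCKETS = [(0.00, 0.05), (0.05, 0.15), (0.15, 0.50), (0.50, 1.00)]

def bucket_of(coverage):
    """Index of the bucket for the fraction of a sample covered by bounding boxes."""
    assert 0.0 <= coverage <= 1.0
    for idx, (_, hi) in enumerate(BUCKETS):
        if coverage < hi or idx == len(BUCKETS) - 1:
            return idx
```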


The selection of the training set and most of our hyperparameters were based on past experiences with non-public data sets. For the experiments below we have not explored any non-standard data generation or regularization options. 



In all experiments, all hyper-parameters were selected by evaluating on a held out portion of the training set (10% random choice of examples).



4.2. VOC 2007

The Pascal Visual Object Classes (VOC) Challenge [5] is the most common benchmark for object detection algorithms. It consists mainly of complex scene images in which bounding boxes of 20 diverse object classes were labelled.

In our evaluation we focus on the 2007 edition of VOC, for which a test set was released. We present results by training on VOC 2012, which contains approximately 11,000 images. We trained a 100-box localizer as well as a deep net based classifier [10].



4.2.1 Training methodology 

We trained the classifier on a data set comprising:

• 10 million crops overlapping some object with at least 0.5 Jaccard overlap similarity. The crops are labeled with one of the 20 VOC object classes.

• 20 million negative crops that have at most 0.2 Jaccard similarity with any of the object boxes. These crops are labeled with the special "background" class label.

The architecture and the selection of hyperparameters followed that of [10].



4.2.2 Evaluation methodology 

In the first round, the localizer model is applied to the maximum center square crop in the image. The crop is resized to the network input size, which is 220 × 220. A single pass through this network gives us up to one hundred candidate boxes. After non-maximum suppression with overlap threshold 0.5, the top 10 highest scoring detections are kept and classified by the 21-way classifier model in separate passes through the network. The final detection score is the product of the localizer score for the given box and the score of the classifier evaluated on the maximum square region around the crop. These scores are passed to the evaluation and used for computing precision-recall curves.

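Putting the two passes together, a hedged sketch of this evaluation pipeline: localizer and classifier stand in for the two trained networks, nms is the routine sketched in Section 4.1, and the square-region helper reflects our reading of "maximum square region around the crop".

```python
def max_center_square(img):
    """Largest centered square crop of an (H, W, C) image array."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def square_region_around(img, box):
    """Square region enclosing the pixel box, centered on it (assumed helper)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    side = max(x2 - x1, y2 - y1)
    top = max(0, (y1 + y2) // 2 - side // 2)
    left = max(0, (x1 + x2) // 2 - side // 2)
    return img[top:top + side, left:left + side]

def detect(image, localizer, classifier, top_k=10, nms_threshold=0.5):
    """Two-stage inference: localize once, then post-classify the top boxes.

    localizer(crop)  -> (boxes, confidences), up to 100 candidates (in pixels)
    classifier(crop) -> scores of the 21-way model (20 VOC classes + background)
    """
    boxes, conf = localizer(max_center_square(image))   # single network pass
    keep = nms(boxes, conf, nms_threshold)[:top_k]      # top 10 after NMS
    detections = []
    for i in keep:
        cls_scores = classifier(square_region_around(image, boxes[i]))
        # Final detection score: localizer confidence times per-class score.
        detections.append((boxes[i], conf[i] * cls_scores))
    return detections
```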


4.3. Discussion

First, we analyze the performance of our localizer in isolation. We present the number of detected objects, as defined by the Pascal detection criterion, against the number of produced bounding boxes. In Fig. 1 we show results obtained by training on VOC2012. In addition, we present results by using the max-center square crop of the image as input, as well as by using two scales: the max-center crop and a second scale where we select 3×3 windows of size 60% of the image size.



As we can see, when using a budget of 10 bounding boxes we can localize 45.3% of the objects with the first model, and 48% with the second model. This shows better performance than other reported results, such as the objectness algorithm, which achieves 42% [1]. Further, this plot shows the importance of looking at the image at several resolutions. Although our algorithm manages to get a large number of objects by using the max-center crop, we obtain an additional boost when using higher resolution image crops.



Further, we classify the produced bounding boxes with the 21-way classifier, as described above. The average precisions (APs) on VOC 2007 are presented in Table 1. The achieved mean AP is 0.29, which is on par with the state of the art. Note that our running-time complexity is very low: we simply use the top 10 boxes.



Example detections and full precision-recall curves are shown in Fig. 2 and Fig. 3, respectively. It is important to note that the visualized detections were obtained by using only the max-centered square image crop, i.e., the full image was used. Nevertheless, we manage to obtain relatively small objects, such as the boats in row 2 and column 2, as well as the sheep in row 3 and column 3.



4.4. ILSVRC 2012 Detection Challenge

For this set of experiments, we used the ILSVRC 2012 detection challenge dataset. This dataset consists of 544,545 training images labeled with categories and locations of 1,000 object categories, relatively uniformly distributed among the classes. The validation set, on which the performance metrics are calculated, consists of 48,238 images.



4.4.1 Training methodology

In addition to a localization model that is identical (up to the dataset on which it is trained) to the VOC model, we also train a model on the ImageNet Classification challenge data, which will serve as the recognition model. This model is trained in a procedure that is substantially similar to that of [10] and is able to achieve the same results on the classification challenge validation set; note that we only train a single model, instead of 7 – the latter brings substantial benefits in terms of classification accuracy, but is 7× more expensive, which is not a negligible factor.



Inference is done as with the VOC setup: the number of predicted locations is K = 100, which are then reduced by non-maximum suppression (Jaccard overlap criterion of 0.4) and post-scored by the classifier: the score is the product of the localizer confidence for the given box and the score of the classifier evaluated on the minimum square region around the crop. The final scores (detection score times classification score) are then sorted in descending order and only the top scoring score/location pair is kept for a given class (as per the challenge evaluation criterion).



In all experiments, the hyper-parameters were selected by evaluating on a held out portion of the training set (10% random choice of examples).



4.4.2 Evaluation methodology

The official metric of the "Classification with localization" ILSVRC-2012 challenge is detection@5, where an algorithm is only allowed to produce one box per each of the 5 labels (in other words, a model is neither penalized nor rewarded for producing valid multiple detections of the same class), and the detection criterion is 0.5 Jaccard overlap with any of the ground-truth boxes (in addition to the matching class label).

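A sketch of our reading of the detection@5 criterion (not the official evaluation code); jaccard is the similarity routine sketched in Section 4.1.

```python
def correct_at_5(predictions, gt_boxes, gt_labels, iou_threshold=0.5):
    """True if any of the (at most 5) predicted (label, box) pairs hits a
    ground truth box of the same label with Jaccard overlap >= 0.5."""
    for label, box in predictions[:5]:
        for g_box, g_label in zip(gt_boxes, gt_labels):
            if label == g_label and jaccard(box, g_box) >= iou_threshold:
                return True
    return False
```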


Table 4.4.2 contains a comparison of the proposed method, dubbed DeepMultiBox, with classifying the ground-truth boxes directly and with the approach of inferring one box per class directly. The metrics reported are detection@5 and classification@5, the official metrics of the ILSVRC-2012 challenge. In the table, we vary the number of windows at which we apply the classifier (this number represents the top windows chosen after non-max-suppression, the ranking coming from the confidence scores). The one-box-per-class approach is a careful reimplementation of the winning entry of ILSVRC-2012 (the "classification with localization" challenge), with 1 network trained (instead of 7).



We can see that the DeepMultiBox approach is quite competitive: with 5-10 windows, it is able to perform about as well as the competing approach. While the one-box-per-class approach may come off as more appealing in this particular case in terms of raw performance, it suffers from a number of drawbacks: first, its output scales linearly with the number of classes, for which there needs to be training data. The multibox approach can in principle use transfer learning to detect certain types of objects on which it has never been specifically trained, but which share similarities with objects that it has seen. Figure 5 explores this hypothesis by observing what happens when one takes a localization model trained on ImageNet and applies it on the VOC test set, and vice versa. The figure shows a precision-recall curve: in this case, we perform class-agnostic detection: a true positive occurs if two windows (prediction and ground truth) overlap by more than 0.5, independently of their class. Interestingly, the ImageNet-trained model is able to capture more VOC windows than vice versa: we hypothesize that this is due to the ImageNet class set being much richer than the VOC class set.



Secondly, the one-box-per-class approach does not generalize naturally to multiple instances of objects of the same type (except via the method presented in this work, for instance). Figure 5 shows this too, in the comparison between DeepMultiBox and the one-box-per-class approach. Generalizing to such a scenario is necessary for actual image understanding by algorithms, thus such limitations need to be overcome, and our method is a scalable way of doing so. Evidence supporting this statement is shown in Figure 5, which shows that the proposed method is generally able to capture more objects more accurately than a single-box method.



5. Discussion and Conclusion

In this work, we propose a novel method for localizing objects in an image, which predicts multiple bounding boxes at a time. The method uses a deep convolutional neural network as the base feature extraction and learning model. It formulates a multiple-box localization cost that is able to take advantage of a variable number of ground truth locations of interest in a given image and learns to predict such locations in unseen images.



We present results on two challenging benchmarks, VOC2007 and ILSVRC-2012, on which the proposed method is competitive. Moreover, the method is able to perform well by predicting only very few locations to be probed by a subsequent classifier. Our results show that the DeepMultiBox approach is scalable and can even generalize across the two datasets, in terms of being able to predict locations of interest, even for categories on which it was not trained on. Additionally, it is able to capture multiple instances of objects of the same class, which is an important feature of algorithms that aim for better image understanding. 



In the future, we hope to be able to fold the localization and recognition paths into a single network, such that we would be able to extract both location and class label information in a one-shot feed-forward pass through the network. Even in its current state, the two-pass procedure (localization network followed by categorization network) entails 5-10 network evaluations, each at roughly 1 CPU-sec (modern machine). Importantly, this number does not scale linearly with the number of classes to be recognized, which makes the proposed approach very competitive with DPM-like approaches.



References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR. IEEE, 2010. 2, 4

[2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010. 1, 2

[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013. 2 

[4] I. Endres and D. Hoiem. Category independent object proposals. In ECCV. 2010. 1, 2 

[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010. 4

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010. 1, 6

[7] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. Computers, IEEE Transactions on, 100(1):67–92, 1973. 1 

[8] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/. 6

[9] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009. 2

[10] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012. 1, 2, 3, 4, 6 

[11] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008. 2 

[12] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV. 2012. 2 

[13] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems (NIPS), 2013. 1, 2, 6

[14] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011. 1, 2
[15] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1062–1069. IEEE, 2010. 6


Reposted from blog.csdn.net/m0_37857151/article/details/83476060