Object detection at 200 Frames Per Second

Rakesh Mehta and Cemalettin Ozturk
United Technologies Research Center-Ireland

Abstract

In this paper, we propose an efficient and fast object detector which can process hundreds of frames per second. To achieve this goal we investigate three main aspects of the object detection framework: network architecture, loss function and training data (labeled and unlabeled). To obtain a compact network architecture, we introduce various improvements, based on recent work, and develop an architecture which is computationally light-weight and achieves reasonable performance. To further improve the performance while keeping the complexity the same, we utilize a distillation loss function. Using the distillation loss we transfer the knowledge of a more accurate teacher network to the proposed light-weight student network. We propose various innovations to make distillation efficient for the proposed one stage detector pipeline: an objectness scaled distillation loss, feature map non-maximal suppression and a single unified distillation loss function for detection. Finally, building upon the distillation loss, we explore how much we can push the performance by utilizing unlabeled data. We train our model with unlabeled data using the soft labels of the teacher network. Our final network has 10x fewer parameters than the VGG based object detection network and achieves a speed of more than 200 FPS, and the proposed changes improve the detection accuracy by 14 mAP over the baseline on the Pascal dataset.
Practical applications place high demands on both the accuracy and the speed of object detection.

1. Introduction

Object detection is a fundamental problem in computer vision. In recent years we have witnessed a significant improvement in the accuracy of object detectors [25, 27, 22, 26, 7, 10, 21] owing to the success of deep convolutional networks [19]. It has been shown that modern deep learning based object detectors can detect a number of generic objects with considerably high accuracy and at a reasonable speed [22, 26]. These developments have triggered the use of object detection in various industrial applications such as surveillance, autonomous driving, and robotics. The majority of the research in this domain has focused on achieving state-of-the-art performance on public benchmarks [21, 10]. For these advancements the research has relied mostly on deeper architectures (Inception [33], VGG [32], Resnet [11]), which come at the expense of higher computational complexity and additional memory requirements. Although these results have demonstrated the applicability of object detection to a number of problems, scalability remains an open issue for full-scale industrial deployment. For instance, a security system with fifty cameras and a rate of 30 frames/sec would require a dedicated server with 60 GPUs even if we use one of the fastest detectors, SSD (22 FPS at 512 resolution) [22]. These numbers can grow quickly for a number of industrial problems, for example a security application in a big building. In these scenarios, speed and memory requirements become crucial, because a fast and compact detector makes it possible to process multiple streams on a single GPU. Surprisingly, researchers have given little importance to the design of fast and efficient object detectors with low memory requirements [17]. In this work we try to bridge this gap: we focus on the development of an efficient object detector with low memory requirements and a speed high enough to process multiple streams on a single GPU.

With the aim of designing a fast and efficient object detector, we start by asking ourselves a fundamental question: what are the essential elements of a deep learning object detector and how can we tailor them to develop the envisaged detector? Based on the related work [25, 27, 22, 26, 7] we broadly identify the key components of the deep learning based object detection framework as (1) network architecture, (2) loss function and (3) training data. We study each of these components separately, introduce a broad set of customizations, both novel and from related work, and investigate which of these play the most crucial role in achieving a speed-accuracy trade-off.

The network architecture is a key factor which determines the speed and the accuracy of the detector. Recent detectors [15] tend to be based on deeper architectures (VGG [32], Resnet [11]), which makes them accurate but increases the computational complexity. Network compression and pruning approaches [9, 5, 13] have been utilized with some success to make detection architectures more compact [36, 31, 38]. However, these approaches still struggle to improve the speed; for instance, the compact architecture of [31] with 17M parameters achieves a speed of 17 FPS. In this work, we develop design principles not only for a compact architecture but also for a high speed detector, as speed is of prime importance for us. We draw inspiration from the recent work on Densenet [14], Yolo-v2 [26] and Single Shot Detector (SSD) [22], and design an architecture which is deep but narrow. The deeper architecture allows us to achieve higher accuracy while the narrow layers enable us to control the complexity of the network. We observe that the architectural changes alone can result in a 5 mAP increase over the selected baseline. Building on these works, our main contribution in architecture design is the development of a simple yet efficient network architecture which can process more than 200 FPS, making it the fastest deep learning based object detector. Furthermore, our model contains only 15M parameters compared to 138M in the VGG-16 model [3], thus resulting in one of the most compact networks. The speed of the proposed detector in comparison to state-of-the-art detectors is shown in Fig. 1.

Figure 1: Speed and performance comparison of the proposed detector with other competing approaches. For SSD and Yolo-v2 we show results for more accurate models.

Given the restriction of a simple and fast architecture, we investigate efficient training approaches to improve the performance. Starting with a reasonably accurate lightweight detector, we leverage deeper networks with better performance to further improve the training strategy. For this purpose, we consider network distillation [12, 2, 1], where the knowledge of a larger network is utilized to efficiently learn the representation for the smaller network. Although the idea was applied to object detection recently [3, 20], our work makes key contributions in the way we apply distillation. (1) We are the first to apply distillation to a single pass detector (Yolo), which makes this work different from prior work that applies it to the region proposal network. (2) The key to our approach is the observation that object detection involves a non-maximal suppression (NMS) step which is outside the end-to-end learning. Prior to the NMS step, the last layer of the detection network consists of dense activations in the region of a detection; if these are transferred directly to the student network, they lead to overfitting and deteriorate the performance. Therefore, in order to apply distillation for detection, we propose Feature Map-NMS (FM-NMS), which suppresses the activations corresponding to overlapping detections. (3) We formulate the problem as an objectness scaled distillation loss by emphasizing the detections which have higher values of objectness in the teacher detection. Our results demonstrate that distillation is an efficient approach to improving the performance while keeping the complexity low.

Finally, we investigate “the effectiveness of data” [8] in the context of object detection. Annotated data is limited, but with the availability of highly accurate object detectors and an unlimited amount of unlabeled data, we explore how much we can push the performance of the proposed light-weight detector. Our idea follows the line of semi-supervised learning [29, 35, 4], which has not been thoroughly explored for deep learning object detectors. Closely related to our approach is the recent work of Radosavovic et al. [23], where annotations are generated using an ensemble of detectors. Our idea differs from theirs in two main ways: (1) We transfer the soft labels from the convolutional feature maps of the teacher network, which has been shown to be more efficient in network distillation [28]. (2) Our loss formulation, through objectness scaling and the distillation weight, allows us to control the weight given to the teacher label. This formulation provides the flexibility to give high importance to ground-truth detections and relatively less to inaccurate teacher predictions. Furthermore, our training loss formulation seamlessly integrates the detection loss with the distillation loss, which enables the network to learn from a mixture of labeled and unlabeled data. To the best of our knowledge, this is the first work which trains a deep learning object detector by jointly using labeled and unlabeled data.

2. Architecture customizations

Most of the recent successful object detectors depend on the depth of the underlying architecture to achieve good performance. They do achieve good performance; however, the speed is restricted to 20-60 FPS even for the fastest detectors [22, 26]. In order to develop a much faster detector, we select a moderately accurate but really high-speed detector, Tiny-Yolo [26], as our baseline. It is a simplified version of Yolo-v2 with fewer convolutional layers, but with the same loss function and optimization strategies such as batch normalization [16], dimension cluster anchor boxes, etc. Building upon this we introduce a number of architectural customizations to make it more accurate and even faster.

Dense feature map with stacking. Taking inspiration from recent works [14, 22], we observe that merging the feature maps from previous layers improves the performance. We merge the feature maps from a number of previous layers in the last major convolutional layer. The dimensions of the earlier layers are different from those of the more advanced ones. Prior work [22] utilizes max pooling to resize the feature maps for concatenation. However, we observe that max pooling results in a loss of information; therefore, we use feature stacking, where the larger feature maps are resized such that their activations are distributed along different feature maps [26].

Furthermore, we make extensive use of bottleneck layers while merging the features. The idea behind the use of a bottleneck layer [14, 34] is to compress the information in fewer layers. The 1x1 convolution layers in the bottleneck provide the flexibility to express the compression ratio while also adding depth at the same time. We observe that merging the feature maps of advanced layers provides more improvement; therefore, we use a higher compression ratio for the initial layers and a lower one for the more advanced layers.

Deep but narrow. The baseline Tiny-Yolo architecture has a large number (1024) of feature channels in its last few convolutional layers. With the feature map concatenation from the prior layers, we found that this large number of convolutional feature maps is not necessary. Therefore, the number of filters is reduced in our design, which helps to improve the speed.

Compared to other state-of-the-art detectors, our architecture lacks depth. Increasing depth by adding more convolutional layers results in computational overhead, so in order to limit the complexity we use 1x1 convolutional layers. After the last major convolutional layer, we add a number of 1x1 convolutional layers, which add depth to the network at little additional computational cost.

Overall architecture. Building on these simple concepts we develop our light-weight architecture for object detection. These modifications result in an improvement of 5 mAP over the baseline Tiny-Yolo architecture. Furthermore, our architecture is 20% faster than Tiny-Yolo because we use fewer convolutional filters in our design. The baseline Tiny-Yolo achieves 54.2 mAP on the Pascal dataset, while the proposed architecture obtains 59.4 mAP. The overall architecture is shown in Fig. 2. We refer to this network as F-Yolo.

Figure 2: Base architecture of our detector. To keep the architecture simple we limit the depth of the network, keep the number of feature maps low and use a small filter kernel (3 × 3 or 1 × 1).

Dense feature map with stacking. Merging feature maps from earlier layers improves detection performance; both SSD and DenseNet use this strategy. During fusion, the larger feature maps are resized and then stacked with the smaller ones. Applying max pooling to the larger feature maps and then concatenating them with the smaller ones would lose information. See the feature fusion in YOLOv2 for reference.

104 x 104 x 64 feature map: compressed with a 1 x 1 x 4 convolution to 104 x 104 x 4, then resized / reshaped to 13 x 13 x 256;
52 x 52 x 128 feature map: compressed with a 1 x 1 x 16 convolution to 52 x 52 x 16, then resized / reshaped to 13 x 13 x 256;
26 x 26 x 256 feature map: compressed with a 1 x 1 x 32 convolution to 26 x 26 x 32, then resized / reshaped to 13 x 13 x 128;
the 13 x 13 x (256 + 256 + 128) feature map is concatenated with the 13 x 13 x 1152 feature map to give a 13 x 13 x 1792 feature map.

Note: the resize / reshape step above corresponds to the space-to-depth operation (tf.space_to_depth in TensorFlow).
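
The stacking arithmetic in these notes can be checked with a small NumPy sketch. It is only an illustration: `space_to_depth` here is a hand-written helper for a single H x W x C array (not the TensorFlow call itself), the block sizes 8, 4 and 2 are inferred from the listed shapes, and random arrays stand in for real feature maps.

```python
import numpy as np

def space_to_depth(x, block):
    """Move each block x block spatial patch of an (H, W, C) array into channels."""
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two block axes next to the channels
    return x.reshape(h // block, w // block, block * block * c)

# Compressed feature maps with the sizes given in the notes (values are dummies).
early = space_to_depth(np.random.rand(104, 104, 4), 8)  # -> 13 x 13 x 256
mid = space_to_depth(np.random.rand(52, 52, 16), 4)     # -> 13 x 13 x 256
late = space_to_depth(np.random.rand(26, 26, 32), 2)    # -> 13 x 13 x 128
main = np.random.rand(13, 13, 1152)

merged = np.concatenate([early, mid, late, main], axis=-1)
print(merged.shape)  # (13, 13, 1792)
```

Unlike max pooling, this reshape keeps every activation; it only moves spatial positions into the channel dimension.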

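Deep but narrow. In general, a deeper and wider network gives better results, but the amount of computation and the number of parameters grow with it and slow the algorithm down, so a balance is needed.
In Tiny-YOLOv2 the last few convolutional layers are fairly wide (up to 1024 filters). Once feature fusion is introduced earlier in the network, this many filters are no longer needed, so the number of filters in these layers is reduced; this is the "narrow" part. As shown in Figure 2, the last 3 x 3 convolutional layer has 125 filters.
Depth is added by stacking a series of 1 x 1 convolutions, whose cost is far lower than that of 3 x 3 convolutions; see the last three 1 x 1 convolutional layers in Figure 2. The authors call the network F-Yolo.
When fusing features, 1 × 1 convolutions (bottleneck layers) are used extensively to compress the features and reduce their dimensionality. Fusing later layers helps more than fusing earlier ones, so earlier layers are fused with a higher compression ratio (fewer output channels from the 1 × 1 convolution) and later layers with a lower compression ratio (more output channels from the 1 × 1 convolution).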
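
To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The helper `conv_cost` is hypothetical, ignores biases and assumes stride 1 with "same" padding; the 13 x 13 map and 1024 channels follow the text, while the 512-channel "narrow" variant is an illustrative choice rather than the paper's exact layer width.

```python
def conv_cost(height, width, c_in, c_out, kernel):
    """Return (parameters, multiply-accumulates) of one convolutional layer."""
    params = kernel * kernel * c_in * c_out  # bias terms ignored
    macs = params * height * width           # one kernel application per output position
    return params, macs

p3, m3 = conv_cost(13, 13, 1024, 1024, 3)          # wide 3 x 3 layer
p1, m1 = conv_cost(13, 13, 1024, 1024, 1)          # 1 x 1 layer, same widths
p_narrow, m_narrow = conv_cost(13, 13, 512, 512, 3)  # hypothetical narrower 3 x 3 layer

print(m3 // m1)        # 9: a 3 x 3 layer costs nine times a 1 x 1 layer
print(p3 // p_narrow)  # 4: halving both widths cuts parameters fourfold
```

Under these assumptions, narrowing the layers and adding depth through 1 x 1 convolutions both reduce cost substantially compared with stacking more wide 3 x 3 layers.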

3. Distillation loss for Training

Since we restrict ourselves to a simple architecture to achieve high speed, we explore the network distillation approach [28, 12] to improve the performance, as it does not affect the computational complexity of the detector. The idea of network distillation is to train a light-weight (student) network using the knowledge of a large accurate network (teacher). The knowledge is transferred in the form of soft labels of the teacher network.

Before describing our distillation approach, we provide a brief overview of the Yolo loss function and the last convolutional layer of the network. Yolo is based on a one stage architecture; therefore, unlike the RCNN family of detectors, both the bounding box coordinates and the classification probabilities are predicted simultaneously as the output of the last layer. Each cell location in the last layer feature map predicts N bounding boxes, where N is the number of anchor boxes. Therefore, the number of feature maps in the last layer is set to N × (K + 5), where K is the number of classes for prediction of class probabilities and 5 corresponds to the bounding box coordinates and objectness value (4 + 1). Thus, for each anchor box and in each cell, the network learns to predict class probabilities, an objectness value and bounding box coordinates. The overall objective can be decomposed into three parts: regression loss, objectness loss and classification loss. We denote the multi-part objective function as:

[Equation (1): image missing in the source]

where $\hat{o}_i$ , $\hat{p}_i$ , $\hat{b}_i$ are the objectness, class probability and bounding box coordinates predicted by the student network and $o^{gt}_i$ , $p^{gt}_i$ , $b^{gt}_i$ are the values derived from the ground truth. The objectness is defined as the IOU between prediction and ground truth, the class probabilities are the conditional probability of a class given that there is an object, the box coordinates are predicted relative to the image size, and the loss functions are simple L₁ or L₂ functions; see [25, 26] for details.
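
The equation image did not survive in this copy. A plausible reconstruction from the definitions above is sketched below, with hats marking student predictions. The function names $f_{obj}$ and $f_{cl}$ are assumed for illustration (only $f_{bb}$ is named in the text), and any per-term weights of the original Yolo loss are omitted.

```latex
L_{Yolo} = f_{obj}(o^{gt}_i, \hat{o}_i) + f_{cl}(p^{gt}_i, \hat{p}_i) + f_{bb}(b^{gt}_i, \hat{b}_i) \tag{1}
```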

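The student network is the improved Tiny-YOLOv2 network; gt denotes the ground-truth value and ^ denotes the predicted value. o is the objectness score, p is the class probability and b is the bounding box regression value.
With network distillation, in principle one simply replaces the ground-truth values in the student's loss with the final output of the teacher network. However, applying the distillation loss directly to a single-stage detector causes problems, for example far too many negative (background) anchors.
The objectness loss measures whether a box contains an object, the classification loss is the class loss of a box, and the regression loss is the coordinate regression loss of a box.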
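
As a small illustration of the N × (K + 5) output layout, the sketch below splits a last-layer output into its three parts. The per-anchor channel order [x, y, w, h, objectness, classes] and the helper name `split_predictions` are assumptions made for this example; N = 5 anchors on a 13 x 13 grid and K = 20 Pascal classes follow values quoted elsewhere in this article.

```python
import numpy as np

def split_predictions(last_layer, num_anchors, num_classes):
    """Split an (H, W, N * (K + 5)) map into boxes, objectness and class scores."""
    h, w, c = last_layer.shape
    assert c == num_anchors * (num_classes + 5)
    per_anchor = last_layer.reshape(h, w, num_anchors, num_classes + 5)
    boxes = per_anchor[..., 0:4]        # assumed order: x, y, w, h
    objectness = per_anchor[..., 4]     # one objectness value per candidate
    class_scores = per_anchor[..., 5:]  # K class scores per candidate
    return boxes, objectness, class_scores

output = np.zeros((13, 13, 5 * (20 + 5)))  # 125 channels
boxes, objectness, class_scores = split_predictions(output, 5, 20)
print(output.shape[-1], objectness.size)   # 125 845
```

The 845 candidates per image are what makes the naive transfer of teacher outputs problematic: most of them lie on background.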

To apply distillation we can simply take the output of the last layer of the teacher network and use it in place of the ground truth $o^{gt}_i$ , $p^{gt}_i$ , $b^{gt}_i$ . The loss would then propagate the activations of the teacher network to the student network. However, the dense sampling of the single stage detector introduces certain problems which make the straightforward application of distillation ineffective. We discuss these problems below and provide simple solutions for applying distillation in a single stage detector.

3.1. Objectness scaled Distillation

Current distillation approaches for detectors (applied to the RCNN family) [20, 3] use the output of the last convolutional layer of the teacher network to transfer knowledge to the student network. Following a similar approach in Yolo, we encounter a problem because it is a single stage detector and predictions are made for a dense set of candidates. The Yolo teacher predicts bounding boxes even in the background regions of an image. During inference, these background boxes are ignored by considering the objectness value of the candidates. However, the standard distillation approach transfers these background detections to the student. This affects the bounding box training f_bb(), as the student network learns the erroneous bounding boxes that the teacher network predicts in background regions. The RCNN based object detectors circumvent this problem by using a region proposal network, which predicts relatively few region proposals. In order to avoid “learning” the teacher predictions for background regions, we formulate the distillation loss as an objectness scaled function. The idea is to learn the bounding box coordinates and class probabilities only when the objectness value of the teacher prediction is high. The objectness part of the function does not require objectness scaling because the objectness values for the noisy candidates are low; the objectness part is therefore given as:

[Equation (2): image missing in the source]

The objectness scaled classification function for the student network is given as:

[Equation (3): image missing in the source]

where the first part of the function corresponds to the original detection function while the second part is the objectness scaled distillation part. Following a similar idea, the bounding box coordinates of the student network are also scaled using the objectness:

[Equation (4): image missing in the source]

A large capacity teacher network assigns very low objectness values to the majority of the candidates, which correspond to the background. The objectness based scaling acts as a filter for distillation in a single stage detector, as it assigns a very low weight to the background cells. The foreground regions which appear like objects have higher values of objectness in the teacher network, and the formulated distillation loss utilizes the teacher knowledge of these regions. It should be noted that the loss function stays the same; for distillation, we only add a term that uses the teacher output in place of the ground truth. The loss function for training is given as:

[Equation (5): image missing in the source]

which considers the detection and distillation loss for classification, bounding box regression and objectness. It is minimized over all anchor boxes and all cell locations of the last convolutional feature maps.
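
The equation images for this subsection are not available in this copy. Based on the surrounding description (a detection term plus a distillation term, with the classification and box terms scaled by the teacher objectness), they can plausibly be reconstructed as below. This is a sketch, not a verbatim copy: the superscript $T$ for teacher outputs, the names $f_{obj}$, $f_{cl}$ and the "Comb" label are assumed notation, while $\lambda_D$ is the distillation weight mentioned in the implementation details.

```latex
\begin{aligned}
f^{Comb}_{obj}(o^{gt}_i, \hat{o}_i, o^{T}_i) &= f_{obj}(o^{gt}_i, \hat{o}_i) + \lambda_D \, f_{obj}(o^{T}_i, \hat{o}_i) && (2) \\
f^{Comb}_{cl}(p^{gt}_i, \hat{p}_i, p^{T}_i, o^{T}_i) &= f_{cl}(p^{gt}_i, \hat{p}_i) + o^{T}_i \, \lambda_D \, f_{cl}(p^{T}_i, \hat{p}_i) && (3) \\
f^{Comb}_{bb}(b^{gt}_i, \hat{b}_i, b^{T}_i, o^{T}_i) &= f_{bb}(b^{gt}_i, \hat{b}_i) + o^{T}_i \, \lambda_D \, f_{bb}(b^{T}_i, \hat{b}_i) && (4) \\
L_{final} &= f^{Comb}_{bb} + f^{Comb}_{cl} + f^{Comb}_{obj} && (5)
\end{aligned}
```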

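Single-stage detectors such as YOLO output a large number of candidate regions (in effect every anchor is one), and only a small fraction of the anchors are positive samples. As a result the student network learns a great deal about background detections; the teacher network also outputs box regression results for the background, and the student network learns this erroneous information as well. Two-stage detectors such as R-CNN avoid the problem by using region proposals.
The loss can therefore be adjusted according to the objectness score o of each anchor. Only when the objectness score is high does the student network learn the corresponding knowledge: background regions get a low weight and object regions a high weight.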
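
The weighting idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the Darknet implementation: every term is a plain squared error, the per-candidate layout [x, y, w, h, objectness, classes] is assumed, and `combined_loss` is a hypothetical helper name.

```python
import numpy as np

def combined_loss(student, teacher, gt, lambda_d=1.0):
    """Detection loss plus objectness scaled distillation loss.

    All inputs have shape (M, 5 + K): one row per candidate box.
    """
    teacher_obj = teacher[:, 4:5]           # teacher objectness, shape (M, 1)
    detection = (student - gt) ** 2         # ground-truth (detection) part
    distillation = (student - teacher) ** 2 # teacher (distillation) part

    # Box and class terms are scaled by the teacher objectness;
    # the objectness term itself (column 4) is left unscaled.
    scale = np.ones_like(student)
    scale[:, :4] = teacher_obj
    scale[:, 5:] = teacher_obj
    return float(np.sum(detection + lambda_d * scale * distillation))

student = np.zeros((2, 7))
gt = np.zeros((2, 7))
background_teacher = np.zeros((2, 7))
background_teacher[:, 0] = 0.5              # teacher box guess, but objectness 0
print(combined_loss(student, background_teacher, gt))  # 0.0
```

With teacher objectness at zero, the teacher's box and class guesses contribute nothing, which is the filtering effect described above.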

3.2. Feature Map-NMS

Another challenge that we face comes from the inherent design of the single stage object detector. The network is trained to predict a box from a single anchor box of a cell, but in practice a number of cells and anchor boxes end up predicting the same object in an image. Therefore, NMS is essential as a post-processing step in object detector architectures. However, the NMS step is applied outside the end-to-end network architecture, and highly overlapping predictions are represented in the last convolutional layer of the object detector. When these predictions are transferred from the teacher to the student network, the result is redundant information. We observed that the distillation loss described above therefore results in a loss of performance, as the teacher network ends up transferring redundant information for highly overlapping detections. The feature maps corresponding to highly overlapping detections end up propagating large gradients for the same object class and dimensions, thereby leading to network over-fitting.

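When training with network distillation, the design of the single-stage detector produces a large number of redundant candidate boxes. Normally NMS is used to remove redundant candidates, but NMS is a post-processing step and cannot be included in the distillation process, so the student network ends up learning redundant information.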

In order to overcome the problem arising from overlapping detections, we propose Feature Map-NMS (FM-NMS). The idea behind FM-NMS is that if multiple candidates in a neighbourhood of KxK cells correspond to the same class, then they are highly likely to correspond to the same object in an image. Therefore, we choose only the one candidate with the highest objectness value. In practice, we check the activations corresponding to the class probabilities in the last layer feature maps and set to zero the activations of the weaker candidates that correspond to the same class. The idea is demonstrated in Fig. 3, where we show the soft labels of the teacher network in the form of detections. The last layer of the teacher network predicts a number of bounding boxes in a region around the dog. To suppress the overlapping detections we pick the detection with the highest objectness value. The strongest candidate among the overlapping candidates is transferred to the student network. The idea for two cells is demonstrated in Fig. 4. In our experiments we use a cell neighbourhood of 3 × 3.

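In the prediction layer, within every K × K block of prediction cells, if the candidates predict the same class, the one with the highest objectness score is kept as the candidate region and the others are set to 0.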
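
One way to read this procedure is sketched below in NumPy. Several details are assumptions made for illustration, since the text does not pin them down: the neighbourhood is a K × K window centred on each cell, a candidate's class is the arg-max of its class scores, ties in objectness are both kept, suppression zeroes only the class scores, and the layout is (H, W, N, 5 + K) with per-candidate order [x, y, w, h, objectness, classes]. `fm_nms` is a hypothetical name.

```python
import numpy as np

def fm_nms(teacher_out, k=3):
    """Zero the class scores of candidates that have a stronger same-class neighbour."""
    h, w, n, _ = teacher_out.shape
    obj = teacher_out[..., 4]
    cls = teacher_out[..., 5:].argmax(axis=-1)
    out = teacher_out.copy()
    r = k // 2
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - r), min(h, i + r + 1)
            j0, j1 = max(0, j - r), min(w, j + r + 1)
            nb_obj = obj[i0:i1, j0:j1]
            nb_cls = cls[i0:i1, j0:j1]
            for a in range(n):
                same_class = nb_cls == cls[i, j, a]  # always includes the candidate itself
                if nb_obj[same_class].max() > obj[i, j, a]:
                    out[i, j, a, 5:] = 0.0           # suppress the weaker duplicate
    return out

# Tiny example: 3 x 3 grid, one anchor, two classes.
t = np.zeros((3, 3, 1, 7))
t[0, 0, 0, 4], t[0, 0, 0, 5] = 0.9, 0.8  # strong class-0 candidate
t[0, 1, 0, 4], t[0, 1, 0, 5] = 0.6, 0.7  # weaker class-0 neighbour
t[2, 2, 0, 4], t[2, 2, 0, 6] = 0.5, 0.9  # class-1 candidate elsewhere
suppressed = fm_nms(t)
print(suppressed[0, 0, 0, 5], suppressed[0, 1, 0, 5], suppressed[2, 2, 0, 6])
```

In this example the strong class-0 candidate and the class-1 candidate keep their scores, while the weaker class-0 neighbour is zeroed before the soft labels would be handed to the student.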

Figure 3: Overall architecture for the distillation approach. Distillation loss is used for both labeled and unlabeled data. FM-NMS is applied on the last layer feature maps of the teacher network to suppress the overlapping candidates.

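The idea is to use what a complex network (the teacher network) has learned to help train a simple network (the student network). Network distillation is one direction within model acceleration and compression, and it is introduced here to improve the accuracy of the small network.
Existing network distillation methods mainly target the RCNN family of object detectors (two-stage methods). For a two-stage detector, the number of ROIs finally passed to the detection network is small (128 by default), and most of them are boxes that contain an object, so applying a distillation loss to these boxes causes little trouble. For a one-stage method such as YOLO, however, assuming a 13 x 13 feature map with 5 boxes predicted per grid cell, a total of 13 x 13 x 5 = 845 boxes are generated, most of them background. If this large number of background regions is passed to the student network, the network keeps regressing the coordinates of these background regions and classifying them, which makes it hard for the model to converge. The authors therefore use the objectness output by the YOLO network to constrain the distillation loss: only boxes for which the teacher network outputs a high objectness contribute to the final loss of the student network. This is objectness scaled distillation.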

Figure 4: Teacher network predicts bounding box coordinates and class probabilities simultaneously in the last layer. Each column, shown in blue and green, corresponds to N detections, where N is the number of anchor boxes. Adjacent columns often result in highly overlapping bounding boxes with the same class label. The proposed FM-NMS retains only the strongest candidate from the adjacent cells. This candidate is transferred as a soft label to the student network.

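NMS is a common post-processing step in object detection, used to remove duplicate predicted boxes; traditional NMS is independent of network training. If the predicted boxes of the teacher network are passed to the student network without NMS, the object-related loss received by the student network becomes very large, and training overfits to these objects. A feature map NMS, analogous to ordinary NMS, is therefore used to remove the duplicate boxes.
In Yolo, each grid cell of the final output layer has dimension 1 x 1 x (N x (K + 5)); this is the blue or green bar in the figure, whose length is N x (K + 5), where N is the number of boxes, K is the number of classes, and 5 stands for the 4 coordinates and 1 objectness score. The number of grid cells is the product of the width and height of the output feature map. FM-NMS assumes that if several adjacent grid cells predict boxes of the same class, these boxes are very likely predicting the same object. Based on this assumption (or observation), FM-NMS takes 3 x 3 adjacent grid cells at a time, ranks the boxes of the same predicted class within these 9 cells by their objectness value, and passes only the highest-scoring box to the student network. Figure 4 illustrates FM-NMS for 2 grid cells.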

4. Effectiveness of data

Finally, in this paper we investigate how much we can improve the performance by using more training data.

Labeled data. The straightforward approach is to add more annotated data for training. It has been shown [37, 22] that increasing the amount of annotated data improves the performance of the models. However, earlier studies did not constrain the model to a limited capacity in their experiments. In this work, we restrict ourselves to a simple model and analyze whether simply adding more data can increase the performance.

Unlabeled data. Given the limited availability of annotated data, we utilize unlabeled data in combination with the distillation loss. The main idea behind our approach is to use both soft labels and ground truth labels when they are available. When the ground truth is not available, only the soft labels of the teacher network are utilized. In practice, we propagate only the teacher part of the loss when the ground truth is not present, and a combination of the losses described in (2)-(4) otherwise. Since the objective function seamlessly integrates soft labels and ground truth, it allows us to train a network with a mix of labeled and unlabeled data.
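As a rough illustration of how a single objective can serve both labeled and unlabeled images, the sketch below switches off the ground-truth term when no annotation exists. The function name `training_loss` and its arguments are hypothetical; the two input values stand in for the detection and teacher (distillation) parts of equations (2)-(4), which are not reproduced here, and applying `lambda_d` to the teacher part in both cases is an assumption.

```python
def training_loss(detection_loss, distillation_loss, has_ground_truth, lambda_d=1.0):
    """Combine the two loss parts for one training image.

    detection_loss:    loss against the ground truth (ignored if unlabeled)
    distillation_loss: loss against the teacher's soft labels
    has_ground_truth:  False for unlabeled images
    lambda_d:          weight of the teacher part (hypothetical default of 1)
    """
    teacher_part = lambda_d * distillation_loss
    if has_ground_truth:
        return detection_loss + teacher_part
    # Unlabeled image: only the teacher part is propagated.
    return teacher_part


labeled = training_loss(2.0, 0.5, has_ground_truth=True)
unlabeled = training_loss(0.0, 0.5, has_ground_truth=False)
print(labeled, unlabeled)
```

Because both cases return a value on the same scale, labeled and unlabeled images can be mixed freely within one batch.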

5. Experiments on Object detection

We perform experiments on the Pascal VOC 2007 dataset [6]. The dataset consists of 20 object classes and 16K training images.

5.1. Implementation Details

We use the Darknet deep learning framework [24] for our evaluation. Tiny-Darknet, trained on ImageNet [30] for the classification task, is used for initialization. We remove the last two layers of the pre-trained network and add additional convolutional layers at the end. For detection, the network is trained using SGD with an initial learning rate of 10^-3 for the first 120 epochs, 10^-4 for the next 20 epochs and finally 10^-5 for the last 20 epochs. We use standard training strategies such as a momentum of 0.9 and a weight decay of 0.0005. The batch size is set to 32 in all our experiments. The size of the input image is set to 416 × 416. For the network distillation experiments we set λ_D to 1, thereby giving equal weight to the distillation and detection losses; however, as the distillation part is scaled by the objectness, the final weight of the distillation part is always less than that of the detection loss.
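The step schedule described above is easy to misread, so here is a minimal sketch of it as a lookup function. The function name `learning_rate` and the zero-based epoch index are assumptions for the example; the actual Darknet configuration is expressed differently.

```python
def learning_rate(epoch):
    """Return the learning rate for a zero-based epoch index (0..159)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < 120:      # first 120 epochs
        return 1e-3
    if epoch < 140:      # next 20 epochs
        return 1e-4
    if epoch < 160:      # last 20 epochs
        return 1e-5
    raise ValueError("schedule covers 160 epochs in total")


# Print the rate at the boundaries of each phase.
for e in (0, 119, 120, 139, 140, 159):
    print(e, learning_rate(e))
```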

5.2. Architecture

In this section, we present the results of different architecture configurations. First, we evaluate the effect of feature map merging on the base architecture. The results for different layer combinations are shown in Table 1. It can be observed that the accuracy increases as the feature maps from more layers are merged together. Another important inference that we can draw from these results is that merging the more advanced layers yields a larger improvement than merging the initial layers of the network. There is a slight drop in performance when the first few convolutional layers are merged with the final combination, indicating that the initial layers capture quite rudimentary information. The Conv11 column of the table corresponds to the additional 1 × 1 convolutional layers added at the end to increase the depth of the network. These layers result in a gain of 0.5 mAP and provide a computationally efficient way of increasing the depth.

Table 1: The accuracy of the detector after merging different layers.

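Table 1 compares the effect of feature merging. Detection accuracy rises as more layers are merged, and merging high-level features helps more than merging low-level features. In the table, max denotes max pooling and stack denotes feature stacking.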

We also show the comparison of two different approaches for feature map merging. Max pooling layers were used in most of the prior works [22, 17], while feature stacking is a less common approach [26]. We found that feature stacking achieves much better results than max pooling. For all merging combinations, stacking achieves a better accuracy.
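To make the difference between the two merging strategies concrete, the sketch below applies both to a single 4 x 4 feature map. It is a minimal illustration under stated assumptions: the function names `stack_features` and `max_pool` are hypothetical, only one input channel is handled, and the ordering of the stacked output maps is chosen for readability and need not match the layout used by Darknet.

```python
def stack_features(fmap, block=2):
    """Space-to-depth: turn one H x W map into block*block maps of size
    (H/block) x (W/block), so every activation is preserved."""
    h, w = len(fmap), len(fmap[0])
    assert h % block == 0 and w % block == 0
    return [
        [[fmap[r + dr][c + dc] for c in range(0, w, block)]
         for r in range(0, h, block)]
        for dr in range(block)
        for dc in range(block)
    ]


def max_pool(fmap, block=2):
    """Max pooling: keep only the largest activation of each block."""
    h, w = len(fmap), len(fmap[0])
    assert h % block == 0 and w % block == 0
    return [
        [max(fmap[r + dr][c + dc] for dr in range(block) for dc in range(block))
         for c in range(0, w, block)]
        for r in range(0, h, block)
    ]


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(stack_features(fmap))  # four 2 x 2 maps, 16 values kept
print(max_pool(fmap))        # one 2 x 2 map, 4 values kept
```

Both outputs have the 2 x 2 spatial size needed for concatenation with a deeper layer, but stacking keeps all 16 activations while max pooling discards 12 of them.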

Table 2 shows the speed of the various improvements on the baseline detector. The speed is shown for a single Nvidia GTX 1080 GPU with 8 GB of GPU memory and 16 GB of CPU memory. The experiments are performed in batches; the small size of the network allows a larger batch size and also enables parallel processing. Given that the GTX 1080 is not the most powerful GPU currently available, we believe that these models would be even faster on a more advanced GPU such as the Nvidia Titan X. For the baseline Tiny-Yolo, we are able to achieve a speed of more than 200 FPS, as claimed by the original authors, using parallel processing and the batch implementation of the original Darknet library. All the speed measurements are performed on the Pascal VOC 2007 test images; we show the average time over 4952 images, which also includes the time for writing the detection results to file. From the results, it can be observed that merging operations reduce the speed of the detector, as they result in a layer with a fairly large number of feature maps, and the convolutional operation on this combined layer is expensive. Reducing the number of feature maps therefore has a big impact on the speed of the detector. We are able to push the speed beyond 200 FPS by reducing the number of filters to 512 instead of 1024. Finally, adding more 1 × 1 layers at the end of the architecture comes at a fairly low computational overhead. These simple modifications result in an architecture which is an order of magnitude faster than popular architectures available.
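A back-of-the-envelope count of multiply-accumulate operations helps explain why halving the number of filters and using 1 × 1 layers are both cheap ways to gain speed. The sketch below is an illustration only: the helper `conv_macs` is hypothetical, the 13 × 13 grid assumes the usual Yolo total stride of 32 on a 416 × 416 input, and the stacked input channel count is an invented placeholder, not a figure from the paper.

```python
def conv_macs(out_h, out_w, in_channels, out_channels, kernel):
    """Multiply-accumulate count of one convolutional layer (bias ignored)."""
    return out_h * out_w * in_channels * out_channels * kernel * kernel


GRID = 416 // 32            # 13, assuming a total stride of 32
STACKED_CHANNELS = 1792     # hypothetical channel count after feature stacking

wide = conv_macs(GRID, GRID, STACKED_CHANNELS, 1024, 3)
narrow = conv_macs(GRID, GRID, STACKED_CHANNELS, 512, 3)
pointwise = conv_macs(GRID, GRID, 512, 512, 1)
full_3x3 = conv_macs(GRID, GRID, 512, 512, 3)

print("1024 vs 512 filters:", wide / narrow)        # cost ratio of the merged layer
print("3x3 vs 1x1 layer:   ", full_3x3 / pointwise)  # cost ratio per extra layer
```

The cost of the layer that follows the merge scales linearly with its filter count, and a 1 × 1 layer needs one ninth of the operations of a 3 × 3 layer with the same channel counts.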

Table 2: Speed comparison for different architecture modifications.

5.3. Distillation with labeled data

First, we describe our teacher and student networks. Since we use the soft labels of the teacher network for training, we are confined to a Yolo based teacher network, which densely predicts the detections as the output of the last convolutional layer. Another constraint on the selection of the teacher/student networks is that they should have the same input image resolution, because transferring the soft labels requires feature maps of the same size. We therefore choose Yolo-v2 with the Darknet-19 base architecture as the teacher. We use the proposed F-Yolo as the student network, as it is computationally light-weight and very fast. To study the impact of more labeled data, we perform training with only Pascal data and with a combination of Pascal and COCO data. For Pascal we consider the training and validation sets from the 2007 and 2012 challenges, and for COCO we select the training images which contain at least one object of a Pascal category. We found that there are 65K such images in the COCO training dataset.
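The COCO selection rule (keep an image if it contains at least one object of a Pascal category) can be sketched as a simple set intersection. This is a hedged illustration: the function name and the dictionary input format are hypothetical, and the set below lists the COCO category names that are assumed to correspond to the 20 Pascal VOC classes (for example, COCO "couch" for VOC "sofa").

```python
# COCO names assumed to correspond to the 20 Pascal VOC classes.
PASCAL_IN_COCO = {
    "airplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "dining table", "dog", "horse", "motorcycle", "person",
    "potted plant", "sheep", "couch", "train", "tv",
}


def images_with_pascal_objects(image_categories):
    """Return the ids of images containing at least one Pascal-category object.

    image_categories maps an image id to the category names annotated in it.
    """
    return [
        image_id
        for image_id, categories in image_categories.items()
        if PASCAL_IN_COCO & set(categories)
    ]


annotations = {
    1: ["person", "kite"],   # kept: contains "person"
    2: ["kite"],             # dropped: no Pascal category
    3: ["couch"],            # kept: corresponds to VOC "sofa"
}
print(images_with_pascal_objects(annotations))
```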

To study the impact of the teacher network on distillation training, we also train our teacher models with two different datasets: Pascal data and a combination of COCO and Pascal. The baseline performance of these teacher networks is given in Table 3. It can be observed that simply training Yolo-v2 with COCO training data improves the performance by 3.5 points. With these two teachers, we would like to understand the effect of a more accurate teacher on the student performance.

Table 3: Comparison of performance for distillation with different strategies on Pascal VOC 2007. The results are shown for two teacher networks and for two sets of labeled training data (Pascal VOC and a combination of Pascal VOC and COCO).

In our first experiment, we compare different strategies to justify the effectiveness of our proposed approach. We introduce two main innovations for single stage detectors: objectness scaling and FM-NMS. We perform distillation without the FM-NMS step and without the objectness scaling step. The results are shown in Table 3. It can be observed that the performance of the distilled student detector drops below the baseline detector when distillation is performed without the FM-NMS step. For both teacher networks there is a significant drop in the performance of the network. Based on these experiments we find that FM-NMS is a crucial element to make distillation work on a single stage detector. In the experiments without objectness scaling, we again observe a drop in performance, although it is not very large.

The experiments with additional annotated data (COCO training images) show a similar trend, thus verifying the importance of FM-NMS and objectness scaling. However, it is interesting to observe that there is a significant improvement in the performance of the full distillation experiment with larger training data. The full distillation approach gains 2.7 mAP with the COCO training dataset. With larger training data there are soft labels which can capture much more information about the object-like sections present in the image.

We can also observe that the performance of the baseline detector improves significantly with larger training data. It shows that our light-weight models have the capacity to learn more when provided with more training samples. With the distillation loss and additional annotated data, the proposed detector achieves 67 mAP while running at a speed of more than 200 FPS.

Surprisingly, for a fixed student the teacher detector does not play a crucial role in the training. The COCO teacher performs worse than the VOC teacher when the combination of VOC and COCO data is used. We suspect that it is difficult to evaluate the impact of the teacher quality because the difference in the teacher performance is not large (< 4 mAP).

We show the detector's performance for the different classes of the Pascal VOC 2007 test set in Table 4. The performance of the proposed F-Yolo (architecture modifications only) and D-Yolo (architecture changes + distillation loss) is compared with the original Yolo and Yolo-v2. It is interesting to observe that with the distillation loss and more data there is a significant improvement (10 AP) for small objects such as bottle and bird. The difference in performance between Tiny-Yolo and the proposed approach is clearly visible in some of the sample images shown in Fig. 5.

Table 4: Comparison of the proposed approach with popular object detectors on the VOC-07 dataset.

Figure 5: Example images with teacher network (Yolo-v2), proposed approach and the Tiny-Yolo baseline.

5.4. Unlabeled data

The previous experiment with the combination of COCO and VOC data showed that F-Yolo has the capacity to learn from more training data. In this section we study how much our model can learn from unlabeled data. In this experiment we evaluate the accuracy of the detector as the amount of unlabeled data in the training set increases. We use labeled data only from the VOC dataset and use the images from COCO without their labels as additional data. The labeled and unlabeled images are combined and used together for training; for unlabeled images we only propagate the loss evaluated from the teacher soft labels. We train the network with different numbers of unlabeled images (16K, 32K, 48K and 65K) to evaluate the influence of the unlabeled data on the student network. The results are shown in Table 5. It can be observed that the performance of the baseline detector improves significantly (3-4 mAP) with additional data. As more unlabeled data is added, the performance of the detector improves.

Table 5: Performance comparison on Pascal 2007 using unlabeled data.

It is interesting to compare the change in performance with unlabeled data and with COCO labeled data separately to understand the importance of annotation in training. Using the complete COCO data with annotations our model achieves 64.2 mAP (Table 3, student baseline), and using Yolo-v2 as the teacher network with unlabeled COCO images the model obtains 62.3 mAP. These results indicate that although annotations are important, we can significantly improve the performance by using an accurate teacher network and simply adding more unlabeled training data.

Finally, we compare the performance of the proposed distillation approach with the competing distillation approach [3]. The competing approach employs the Faster-RCNN framework, which is a two stage detector, while our approach has been specifically designed for a one stage detector. The results are shown in Table 6. The performance is compared with the following architectures: Alexnet [19], Tuckernet [18] and VGG-M [32], which are distilled using the VGG-16 network [32]. It can be observed that the proposed distilled network is an order of magnitude faster than the RCNN based detectors. In terms of the number of parameters the proposed approach is comparable to the Tucker network; however, it is much faster than all the Faster-RCNN based networks shown here. The speed-up over these approaches can be attributed to the efficient design of the single stage detector, the underlying network optimized for speed, and the additional data that is used for training. The results show that these modifications lead to a gain of around 9 mAP over the comparable competing network while being much faster.

Table 6: Comparison of the proposed single stage distilled detector (Yolo) with Faster-RCNN distilled detectors on the VOC 2007 test set.

6. Conclusions

In this paper, we develop an architecture for efficient and fast object detection. We investigate the role of network architecture, loss function and training data in balancing the speed-performance trade-off. For network design, based on prior work, we identify some simple ideas to maintain computational simplicity, and following up on these ideas we develop a light-weight network. For training, we show that distillation is a powerful idea and that, with carefully designed components (FM-NMS and objectness scaled loss), it improves the performance of a light-weight single stage object detector. Finally, building on the distillation loss, we explore unlabeled data for training. Our experiments demonstrate that the design principles proposed in this paper can be used to develop an object detector which is an order of magnitude faster than state-of-the-art object detectors while achieving a reasonable performance.

References

SSD: Single Shot MultiBox Detector
PVANet: Lightweight Deep Neural Networks for Real-time Object Detection
Do We Need More Training Data?
Densely Connected Convolutional Networks
Rethinking the Inception Architecture for Computer Vision
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
The Unreasonable Effectiveness of Data
Deep Residual Learning for Image Recognition
Speed/accuracy trade-offs for modern convolutional object detectors
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

KEY POINTS

To achieve this goal we investigate three main aspects of the object detection framework: network architecture, loss function and training data (labeled and unlabeled).

In order to obtain compact network architecture, we introduce various improvements, based on recent work, to develop an architecture which is computationally light-weight and achieves a reasonable performance.

We draw inspiration from the recent work of Densenet, Yolo-v2 and Single Shot Detector (SSD), and design an architecture which is deep but narrow. The deeper architecture allows us to achieve higher accuracy while the narrow layers enable us to control the complexity of the network.
(Design a network that is deep but narrow: greater depth improves accuracy, while narrower layers reduce the complexity of the network.)

The idea of the network distillation is to train a light-weight (student) network using knowledge of a large accurate network (teacher). The knowledge is transferred in form of soft labels of the teacher network.

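Using distillation loss we transfer the knowledge of a more accurate teacher network to the proposed light-weight student network.
(Knowledge distillation is generally used for model compression. The main idea is to use the knowledge of a high-accuracy network that is usually slower and structurally more complex (Teacher) to train a light-weight network (Student). The premise is that "the nodes in a large network are redundant".)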

These results indicate that although annotations are important, we can significantly improve the performance by using an accurate teacher network and simply adding more unlabeled training data.

After the last major convolutional layer, we add a number of 1x1 convolutional layers which add depth to the network without increasing computational complexity.
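(To improve accuracy without hurting speed, the network is designed to be deep and narrow. 1 × 1 convolution kernels are used several times at the end of the network, adding depth without adding much computation.)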

It can be observed that the accuracy increases as the feature maps from more layers are merged together. Another important inference that we can draw from these results is that merging the more advanced layers yields a larger improvement than merging the initial layers of the network. There is a slight drop in performance when the first few convolutional layers are merged with the final combination, indicating that the initial layers capture quite rudimentary information.

Dense feature map with stacking. Taking inspiration from recent works [14, 22], we observe that merging the feature maps from the previous layers improves the performance. We merge the feature maps from a number of previous layers in the last major convolutional layer. The dimensions of the earlier layers are different from those of the more advanced ones. Prior work [22] utilizes max pooling to resize the feature maps for concatenation. However, we observe that max pooling results in a loss of information; therefore, we use feature stacking, where the larger feature maps are resized such that their activations are distributed along different feature maps [26].
(The more layers of feature maps are merged, the higher the detection accuracy. Using later layers works better than using earlier layers, because shallow layers capture only basic information. Resizing the shallow features and then stacking them onto the deep features works better than max pooling the shallow features; compared with max pooling, this uses the information at every point of the feature map.)

The convolutional operation on the combined feature maps layer reduces the speed of detector. Therefore, it can be observed that reducing the number of feature maps has a big impact on the speed of the detector.

All the speed measurements are performed on the Pascal VOC 2007 test images; we show the average time over 4952 images, which also includes the time for writing the detection results to file.

We choose Yolo-v2 with Darknet-19 base architecture as the teacher. We use the proposed F-Yolo as the student network, as it is computationally light-weight and is very fast.

The performance of the proposed F-Yolo (only with architecture modifications) and D-Yolo (architecture changes + distillation loss) is compared with original Yolo and Yolov2. It is interesting to observe that with distillation loss and more data there is a significant improvement for small objects such as bottle and bird (10 AP).

It can be observed that the performance of the distilled student detector drops below the baseline detector when distillation is performed without the FM-NMS step. For both teacher networks there is a significant drop in the performance of the network. Based on these experiments we find that FM-NMS is a crucial element to make distillation work on a single stage detector. In the experiments without objectness scaling, we again observe a drop in performance, although it is not very large.

However, it is interesting to observe that there is a significant improvement in the performance of full distillation experiment with larger training data.

Surprisingly, for a fixed student the teacher detector does not play a crucial role in the training.

It can be observed that the performance of the baseline detector improves significantly (3-4 mAP) with additional data. As more unlabeled data is added the performance of the detector improves.

It is observed that merging the feature maps of advanced layers provides more improvement; therefore, we use a higher compression ratio for the initial layers and a lower one for the more advanced layers.

Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, $z_i$ , computed for each class into a probability, $q_i$ , by comparing $z_i$ with the other logits.
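For multi-class classification, the simple network learns the probability outputs of the complex network. Since the probabilities produced by the complex network after softmax are often extremely small or extremely large, a parameter T can be used to shrink the order-of-magnitude gaps between the logit values as much as possible. The formula is:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$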

where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.

For instance:
logits = [0.01, 0.01, 0.05, 10.]

softmax(logits) = [4.58498133e-05, 4.58498133e-05, 4.77209797e-05, 9.99860579e-01]
softmax(logits/10) = [0.17483823, 0.17483823, 0.17553898, 0.47478456]
softmax(logits/100) = [0.24357804, 0.24357804, 0.24367549, 0.26916844]

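As can be seen, the larger T is, the smoother the probability distribution becomes. The softened softmax retains more information about the class distribution, which helps the Student network learn the data distribution. In the distillation loss, λ is the relative weight of the hard target and the soft target.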
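The numbers in the example above can be reproduced with a few lines of standard-library Python. The function name `softmax` and its `temperature` argument are choices made for this sketch; subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.01, 0.01, 0.05, 10.0]
for t in (1.0, 10.0, 100.0):
    # Higher temperatures flatten the distribution over classes.
    print(t, [round(p, 8) for p in softmax(logits, t)])
```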


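Network distillation: extract the data distribution captured by the complex network and let the simple network learn it.
1. Train the large model: first train the large model with hard targets, i.e., the normal labels.
2. Compute the soft targets: use the trained large model to compute the soft targets, i.e., the large model's output after "softening" and softmax.
3. Train the small model: add an extra soft-target loss function to the small model's loss, and use lambda to adjust the relative weight of the two loss functions.
4. At prediction time, use the trained small model in the usual way (right-hand figure).
Note: the Teacher network is not updated during backpropagation.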
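Step 3 of the procedure above can be illustrated with a small classification example. This is a sketch under stated assumptions, not the paper's detection loss: the function names are hypothetical, the λ and (1 - λ) weighting of the hard-target and soft-target cross-entropies is one common form chosen for illustration, and the teacher logits are treated as fixed constants since the Teacher is not updated.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (stable form)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between two distributions; probabilities are clamped
    away from zero to avoid log(0)."""
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, predicted) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, label, temperature=10.0, lam=0.5):
    """lam * hard-target loss + (1 - lam) * soft-target loss (assumed weighting)."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, softmax(student_logits))
    soft_term = cross_entropy(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    return lam * hard_term + (1.0 - lam) * soft_term


teacher = [0.01, 0.01, 0.05, 10.0]
agreeing_student = [0.0, 0.0, 0.1, 9.0]
disagreeing_student = [9.0, 0.0, 0.1, 0.0]
print(distillation_loss(agreeing_student, teacher, label=3))
print(distillation_loss(disagreeing_student, teacher, label=3))
```

A student whose logits resemble the teacher's receives the smaller loss; setting `lam` to 1 recovers ordinary training on hard labels only.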

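Because a distillation loss is used, unlabeled data can be used during training.
(1) For labeled data, train with the full proposed loss.
(2) For unlabeled data, use only the distillation part of the loss; that is, during training only the soft target part between the teacher network and the student network is propagated.
In this way the network can be trained on labeled and unlabeled data at the same time.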

distillation loss = ground truth (more important) + soft label (less important)
loss = detection loss (labeled data) + distillation loss (unlabeled data)