Scale-Transferrable Object Detection

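Shallow feature maps are larger. Detecting small objects requires a feature map large enough to provide fine-grained features and dense sampling, so small objects are detected on the shallow layers. Shallow layers, however, lack semantic information. Pooling layers not only reduce the number of parameters but also enlarge the receptive field.

Deep layers carry enough semantic information, but their feature maps are small, so they are enlarged, and the channels can be compressed in the process. The first three layers apply pooling to enlarge the receptive field and are mainly used to detect large objects; the last two layers use the STM to increase resolution and are mainly used to detect small objects.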

Abstract: We develop a novel Scale-transferrable detection network for detecting multi-scale objects in images.

Combine object predictions from multiple feature maps at different network depths, using the scale-transfer layer/module introduced in this work.

1. Introduction

Scale problem lies in the heart of object detection. In order to detect objects of different scales, a basic strategy is to use image pyramids [1] to obtain features at different scales. However, this will greatly increase memory and computational complexity, which will reduce the real-time performance of object detectors.

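Because the receptive field of each layer in a CNN is fixed, there is an inconsistency between the fixed receptive fields and objects of different scales. Shallow feature maps have small receptive fields and are used to detect small objects; deep feature maps have large receptive fields and can be used to detect large objects. However, shallow features carry less semantic information, which hurts small-object detection. FPN, ZIP, and DSSD integrate semantic information into the feature maps at all scales: as shown in Figure 1(b), a top-down architecture combines high-level semantic feature maps with low-level feature maps to produce more semantic feature maps at every scale. However, to improve detection performance the feature pyramid must be constructed carefully, and the extra layers add computational cost.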
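The point that each layer has a fixed receptive field can be made concrete with the standard receptive-field recurrence. The sketch below is illustrative only: the function name `receptive_field` and the toy layer stack are assumptions, not the network from the paper.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical toy stack: three 3x3 convs (the first with stride 2),
# followed by a 2x2 pooling layer with stride 2.
toy_stack = [(3, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(toy_stack))
```

Each layer's receptive field depends only on the layers before it, so it cannot adapt to the size of the object in a given image.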

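To obtain multi-scale feature maps with high-level semantics, we develop a scale-transfer module (STM) and embed it directly into a DenseNet. The STM consists of pooling and scale-transfer layers: the pooling layers produce small-scale feature maps, and the scale-transfer layers produce large-scale feature maps.

The feature maps at the end of DenseNet have a large number of channels. The scale-transfer layer enlarges the width and height of the feature map by compressing the number of channels, which effectively reduces the parameters of the next convolutional layer. (In my experience, the parameter count grows quickly with the number of feature maps.)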

We believe that the STM has two distinct advantages. First, combined with DenseNet [14], the feature maps naturally contain both low-level object detail features and high-level semantic features. We will prove that this improves the accuracy of object detection. Second, the STM is made up of pooling and super-resolution layers with no additional parameters and computation.

2. Related work

Approaches that combine multiple scales to detect objects:

1. Detect objects using combinations of multi-layer features (HyperNet, YOLOv2).

2. Use features from different layers to predict objects at different scales (SSD, MS-CNN, DSOD).

Our proposed method falls into the third class approach. We use DenseNet to combine features of different layers and use scale-transfer module to obtain feature maps with different resolutions. Our module can be embedded into DenseNet.

3. Scale-Transferrable Detection Network

The output of the last layer of the dense block has the highest number of channels and is suitable as input for our scale-transfer layer, which expands the width and height of the feature map by compressing the number of channels.

3.1 Base Network: DenseNet

DenseNet-169: integrates high-level and low-level features.

Inspired by DSOD [27], we replace the input layers (a 7 × 7 convolution layer with stride 2, followed by a 3 × 3 max pooling layer with stride 2) with three 3 × 3 convolution layers and one 2 × 2 mean pooling layer. The stride of the first convolution layer is 2 and the others are 1. The output channels for all three convolution layers are 64. We call these layers the “stem block”.
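To check how the stem block changes spatial resolution, the usual convolution output-size formula can be applied layer by layer. This is only a sketch: the padding values (1 for 3 × 3 layers, 3 for the 7 × 7 layer, 0 for the 2 × 2 pooling), the pooling stride of 2, and the helper names are assumptions not stated in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_block_size(size):
    """Spatial size after the stem block (padding values are assumed)."""
    size = conv_out(size, kernel=3, stride=2, padding=1)  # first 3x3 conv, stride 2
    size = conv_out(size, kernel=3, stride=1, padding=1)  # second 3x3 conv
    size = conv_out(size, kernel=3, stride=1, padding=1)  # third 3x3 conv
    size = conv_out(size, kernel=2, stride=2, padding=0)  # 2x2 mean pooling
    return size


def original_input_layers_size(size):
    """Spatial size after the original DenseNet input layers (padding assumed)."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # 7x7 conv, stride 2
    size = conv_out(size, kernel=3, stride=2, padding=1)  # 3x3 max pooling, stride 2
    return size


print(stem_block_size(300), original_input_layers_size(300))
```

Under these assumptions both variants downsample the input by the same overall factor, so the stem block changes how the early features are computed rather than the resolution seen by the dense blocks.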

3.2 High efficiency scale-transfer module

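Combining feature maps of different resolutions from multiple layers benefits multi-object detection.

With a 300 × 300 input, the last layer of DenseNet outputs a 9 × 9 feature map. A simple approach is to predict directly from the shallow, high-resolution feature maps, as SSD does. However, low-level feature maps lack semantic information about the objects, which leads to poor detection performance.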

Suppose that the dimensions of the input tensor of the scale-transfer layer are H × W × C·r², where r is the upsampling factor. The scale-transfer layer is a periodic rearrangement of elements. As Figure 3 shows, the scale-transfer layer expands the width and height by compressing the number of channels in the feature map, producing an output of size rH × rW × C.

Each feature map has a channel depth. The layer compresses that depth by spreading the channel elements across the spatial plane, so that what was a 1 × 1 region becomes an r × r region; in effect, the elements are rearranged.
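The rearrangement can be sketched in NumPy as a depth-to-space operation on a channels-last tensor. This is a minimal sketch, not the authors' implementation: the function name `scale_transfer` and the exact channel ordering (channel index `c*r*r + i*r + j` maps to sub-pixel offset `(i, j)`) are assumptions.

```python
import numpy as np


def scale_transfer(x, r):
    """Rearrange an (H, W, C*r*r) tensor into (r*H, r*W, C).

    Assumed channel layout: input channel c*r*r + i*r + j is written to
    output position (h*r + i, w*r + j, c).
    """
    h, w, c_in = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(h, w, c, r, r)    # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (h, i, w, j, c)
    return x.reshape(h * r, w * r, c)


# A 9 x 9 map with 16 channels becomes an 18 x 18 map with 4 channels.
features = np.random.rand(9, 9, 16)
print(scale_transfer(features, r=2).shape)
```

The operation only moves elements, so it has no learnable parameters and the total number of values is unchanged.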

We embed the scale-transfer module directly into DenseNet to obtain six feature maps and construct a one-stage object detector named the scale-transferrable detection network.

The scale-transfer layer can effectively reduce the number of channels in the last layer of the dense block in DenseNet, and reduce the parameters and computation of the next convolutional prediction layer. This improves the speed of the detector.
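The effect on the next prediction layer's parameter count follows from simple arithmetic: a convolution's weights scale linearly with its input channels, and the scale-transfer layer divides those by r². The channel counts below are hypothetical placeholders chosen for illustration, not values from the paper.

```python
def conv_params(c_in, c_out, kernel=3, bias=True):
    """Number of learnable parameters in a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Hypothetical sizes: 1024 input channels, upsampling factor 2, 256 output channels.
c_in, r, c_out = 1024, 2, 256

params_without = conv_params(c_in, c_out)          # conv applied directly
params_with = conv_params(c_in // (r * r), c_out)  # conv applied after scale transfer

print(params_without, params_with)
```

With r = 2 the prediction layer sees a quarter of the input channels, so its weight count drops by roughly a factor of four.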

3.3 Object Localization Module

The Scale-Transferrable Detection Network (STDN) consists of a base network and two task specific prediction subnetworks. The role of the base network is to do feature extraction. The first subnet is used for object classification, and the second subnet is used for bounding box position regression.

3.3.1 Anchor boxes

The anchor box setup is the same as in SSD. We use aspect ratios [1.6, 2.0, 3.0]. Anchors are matched to ground truth with an IoU threshold of 0.5, and the ratio of negatives to positives is at most 3:1.
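The matching rule and the 3:1 cap can be sketched in plain Python. This is a simplified sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, negatives are ranked by a per-anchor loss as in SSD-style hard negative mining, and the extra SSD step of forcing a match for every ground-truth box is omitted. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Split anchor indices into positives (best IoU > threshold) and negatives."""
    positives, negatives = [], []
    for i, anchor in enumerate(anchors):
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        (positives if best > threshold else negatives).append(i)
    return positives, negatives


def select_negatives(negatives, losses, num_positives, ratio=3):
    """Keep the highest-loss negatives, at most ratio * num_positives of them."""
    ranked = sorted(negatives, key=lambda i: losses[i], reverse=True)
    return ranked[: ratio * num_positives]
```

For example, with two positive anchors at most six negatives are kept, so easy background anchors do not dominate the classification loss.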

3.3.2 Classification subnet

One 1 × 1 conv layer and two 3 × 3 conv layers. Layers follow the BN + ReLU + Conv2d order.

3.3.3 Box Regression subnet

3.3.4 Training objective

3.3.5 Training setting
