DSOD: Learning Deeply Supervised Object Detectors from Scratch


Reprinted from: https://www.cnblogs.com/0x000/p/7406385.html

Original paper: https://arxiv.org/pdf/1708.01241

Abstract:

        We propose the Deeply Supervised Object Detector (DSOD), a framework that can learn object detectors from scratch. State-of-the-art object detectors rely heavily on off-the-shelf networks pre-trained on large-scale classification datasets such as ImageNet, which incurs learning bias due to both the difference in loss functions and the difference in category distributions between the classification and detection tasks. Fine-tuning the model on the detection task can alleviate this bias to some extent, but cannot fundamentally eliminate it. In addition, transferring a pre-trained model from classification to detection across different domains (e.g. from RGB to depth images) is even more difficult. A better solution to these problems is to train the object detector from scratch, which motivates our proposed DSOD. Previous efforts in this direction have mostly failed, due to the complex loss functions and limited training data in object detection. In DSOD, we contribute a set of design principles for training object detectors from scratch. One of the key findings is that deep supervision, realized through dense layer-wise connections, plays a critical role in learning a good detector. Combining several other principles, we develop DSOD following the single-shot detection (SSD) framework. Experiments on the PASCAL VOC 2007, 2012 and MS COCO datasets show that DSOD can achieve better results than the state-of-the-art solutions with much more compact models. For example, DSOD outperforms SSD on all three benchmarks with real-time detection speed, while requiring only 1/2 the parameters of SSD and 1/10 the parameters of Faster R-CNN. Our code and models are available at: https://github.com/szq0214/dsod .

1. Introduction

        Convolutional neural networks (CNNs) have produced dramatic performance improvements in many computer vision tasks, such as image classification [17, 28, 32, 9, 10], object detection [5, 4, 27, 19, 21, 25], image segmentation [23, 8, 2, 36] and so on. In the past several years, many innovative network structures have been proposed. Szegedy et al. [32] proposed the "Inception" module, which concatenates feature maps produced by filters of various sizes. He et al. [9] proposed residual learning blocks with skip connections, which make it possible to train very deep networks with more than 100 layers. Huang et al. [10] proposed DenseNets with dense layer-wise connections. Thanks to these excellent network structures, the accuracy of many vision tasks has been greatly improved. Among them, object detection is one of the fastest moving areas, owing to its wide applications in surveillance, autonomous driving, etc.
        In order to obtain good performance, most advanced object detection systems fine-tune networks pre-trained on ImageNet [3]. This fine-tuning process is also viewed as transfer learning [24]. There are at least two advantages to fine-tuning from pre-trained models. First, many state-of-the-art deep models are publicly available, and it is very convenient to reuse them for object detection. Second, fine-tuning can quickly generate the final model, and requires much less instance-level annotated training data than the classification task.
        However, using pre-trained networks for object detection also has critical limitations: (1) Limited structure design space. The pre-trained network models are mostly derived from ImageNet-based classification tasks and are usually very heavy, containing a large number of parameters. Existing object detectors directly adopt the pre-trained networks, so there is little flexibility to control/adjust the network structure (even for small changes to the network structure). The requirement of computing resources is also bounded by the heavy network structure. (2) Learning bias. Since both the loss functions and the category distributions differ between the classification and detection tasks, we argue that this leads to different search/optimization spaces. Therefore, learning may be biased towards a local minimum that is not optimal for the detection task. (3) Domain mismatch. As is well known, fine-tuning can mitigate the gap caused by different object category distributions. However, it is still a severe problem when the source domain (ImageNet) has a huge mismatch with the target domain (e.g. depth images, medical images, etc.) [7].
        Our work addresses the following two questions. First, is it possible to train an object detection network from scratch? Second, if the first answer is yes, are there any principles for designing a resource-efficient network structure for object detection while maintaining high detection accuracy? To meet this goal, we propose the Deeply Supervised Object Detector (DSOD), a simple yet efficient framework for learning object detectors from scratch. DSOD is quite flexible, so we can tailor various network structures for different computing platforms (such as servers, desktops, mobile and even embedded devices).
        We contribute a set of principles for designing DSOD. One key point is that deep supervision plays a critical role, which is motivated by the work of [18, 35]. In [35], Xie et al. proposed a holistically-nested structure for edge detection, which adds a side-output layer to each conv-stage of the base network for explicit deep supervision. Instead of using multiple cut-in loss signals with side-output layers, this paper adopts deep supervision implicitly through the dense layer-wise connections proposed in DenseNet [10]. The dense structure is used not only in the backbone sub-network but also in the front-end multi-scale prediction layers. Figure 1 shows the structure comparison of the front-end prediction layers. The fusion and reuse of multi-resolution prediction maps help maintain or even improve the final accuracy, while reducing the model parameters to a certain extent.
        Our main contributions are summarized as follows:
        (1) To the best of our knowledge, DSOD is the first framework that can train object detection networks from scratch with state-of-the-art performance.
        (2) We introduce and validate a set of principles for designing efficient object detection networks trained from scratch, through step-by-step ablation studies.
        (3) We show that our DSOD achieves state-of-the-art performance on three standard benchmarks (PASCAL VOC 2007, 2012 and the MS COCO dataset) with real-time processing speed and more compact models.

2. Related Work

        Object detection. The state-of-the-art CNN-based object detection methods can be divided into two categories: (i) region proposal-based methods and (ii) proposal-free methods.
        Proposal-based methods include R-CNN [5], Fast R-CNN [4], Faster R-CNN [27] and R-FCN [19]. R-CNN uses selective search [34] to first generate potential object regions in an image, and then performs classification on the proposed regions. R-CNN is computationally expensive because each region is processed by the CNN network separately. Fast R-CNN and Faster R-CNN improve efficiency by sharing computation and using a neural network to generate the region proposals. R-FCN further improves speed and accuracy by removing the fully connected layers and adopting position-sensitive score maps for final detection.
        Recently, proposal-free methods such as YOLO [25] and SSD [21] have been proposed for real-time detection. YOLO uses a single feed-forward convolutional network to directly predict object classes and locations. Compared with region-based methods, YOLO no longer requires a second per-region classification operation, so it is extremely fast. SSD improves YOLO in several aspects, including (1) using small convolutional filters to predict categories and anchor offsets for bounding box locations; (2) using pyramid features to make predictions at different scales; and (3) using default boxes and aspect ratios to adjust to various object shapes. The DSOD we propose is built on the SSD framework, and thus inherits the speed and accuracy advantages of SSD while producing smaller and more flexible models.
        Network architectures for detection. Significant effort has been devoted to the design of network architectures for image classification. Many diverse networks have emerged, such as AlexNet [17], VGGNet [28], GoogLeNet [32], ResNet [9] and DenseNet [10]. Meanwhile, several regularization techniques [29, 12] have been proposed to further enhance model capabilities. Most detection methods [5, 4, 27, 21] directly utilize pre-trained ImageNet models as the backbone network.
        Some other works design specific backbone network structures for object detection, but still require pre-training on the ImageNet classification dataset first. For example, YOLO [25] defines a network with 24 convolutional layers followed by 2 fully connected layers. YOLO9000 [26] improves YOLO by proposing a new network named Darknet-19, which is a simplified version of VGGNet [28]. Kim et al. [15] propose PVANet for object detection, which consists of simplified "Inception" blocks from GoogLeNet. Huang et al. [11] investigate various combinations of network structures and detection frameworks, and find that Faster R-CNN [27] with Inception-ResNet-v2 [31] achieves the highest performance. In this paper, we also consider network structures for generic object detection. However, the proposed DSOD no longer requires ImageNet pre-training.
        Learning deep models from scratch. To the best of our knowledge, there is no prior work that trains object detection networks from scratch. Compared with existing solutions, the proposed method has very appealing advantages, which we will explain and validate in detail in the following sections. In semantic segmentation, Jégou et al. [13] demonstrated that a well-designed network structure can outperform state-of-the-art solutions without using pre-trained models. Their method extends DenseNets to fully convolutional networks by adding an upsampling path to recover the full input resolution.

3. DSOD

        In this section, we first introduce our DSOD architecture and its components, and explain several important design principles. Then we describe the training settings.

3.1. DSOD Architecture

        The proposed DSOD is a multi-scale proposal-free detection framework similar to SSD [21]. The network structure of DSOD can be divided into two parts: the backbone sub-network for feature extraction and the front-end sub-network for prediction over multi-scale response maps. The backbone sub-network is a variant of the deeply supervised DenseNets [10] structure, which consists of a stem block, four dense blocks, two transition layers and two transition w/o pooling layers. The front-end sub-network (or the DSOD prediction layers) fuses multi-scale prediction responses with an elaborated dense structure. Figure 1 shows the proposed DSOD prediction layers alongside the plain structure of multi-scale prediction maps used in SSD [21]. The full DSOD network architecture is detailed in Table 1. We elaborate each component and the corresponding design principles below.
        Principle 1: Proposal-free. We investigated all the state-of-the-art CNN-based object detectors and found that they can be divided into three categories. First, R-CNN and Fast R-CNN require external object proposal generators such as selective search. Second, Faster R-CNN and R-FCN require an integrated region proposal network (RPN) to generate relatively fewer region proposals. Third, YOLO and SSD are single-shot, proposal-free methods, which handle object class prediction and bounding box coordinates as a regression problem. We observe that only the proposal-free methods (the third category) can converge successfully without pre-trained models. We conjecture this is due to the RoI (Region of Interest) pooling in the other two categories of methods: RoI pooling generates features for each region proposal, which hinders the gradients from back-propagating smoothly from the region level to the convolutional feature maps. The proposal-based methods work well with pre-trained network models because the parameter initialization is good for the layers before RoI pooling, while this does not hold when training from scratch.
        Thus we arrive at our first principle: training detection networks from scratch requires a proposal-free framework. In practice, we derive our multi-scale proposal-free framework from the SSD framework [21], as it can reach state-of-the-art accuracy while offering fast processing speed.
        Principle 2: Deep supervision. The effectiveness of deeply supervised learning has been demonstrated in GoogLeNet [32], DSN [18], DeepID3 [30], etc. The central idea is to provide an integrated objective function as direct supervision to the earlier hidden layers, rather than only at the output layer. These "companion" or "auxiliary" objective functions can alleviate the "vanishing gradients" problem across multiple hidden layers. The proposal-free detection framework involves both a classification loss and a localization loss. An explicit solution would require adding complex side-output layers to introduce "companion" objectives for the detection task at each hidden layer, similar to [35]. Here, we adopt an elegant and implicit solution: the dense layer-wise connections introduced in DenseNets [10]. A block is called a dense block when all preceding layers in the block are connected to the current layer. Therefore, earlier layers in DenseNet can receive additional supervision from the objective function through the skip connections. Although only a single loss function is required at the top of the network, all layers, including the earlier ones, can still share the unobstructed supervision signal. We will verify the benefit of deep supervision in Section 4.1.2.
        Transition w/o pooling layer. We introduce this layer to increase the number of dense blocks without reducing the final feature map resolution. In the original design of DenseNet, each transition layer contains a pooling operation to down-sample the feature maps. If one wants to maintain the same scale of outputs, the number of dense blocks is fixed (4 dense blocks in all DenseNet architectures). The only way to increase network depth is to add layers inside each block in the original DenseNet. The transition w/o pooling layer eliminates this restriction on the number of dense blocks in the DSOD architecture, and can also be used in standard DenseNet.
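        The following minimal PyTorch sketch illustrates the two building blocks just described: a dense layer whose output is concatenated onto all preceding feature maps (the implicit deep supervision), and the transition w/o pooling layer. The paper's implementation is in Caffe; the layer ordering and hyper-parameter names below are our own illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv1x1 (bottleneck), then BN-ReLU-Conv3x3, producing k new maps."""
    def __init__(self, in_ch, growth_rate, bottleneck_width=4):
        super().__init__()
        inter_ch = bottleneck_width * growth_rate  # e.g. "4k bottleneck channels"
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, inter_ch, 1, bias=False),
            nn.BatchNorm2d(inter_ch), nn.ReLU(inplace=True),
            nn.Conv2d(inter_ch, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new maps onto all previous ones,
        # which is what lets early layers receive gradients almost directly
        # from the single loss at the top of the network.
        return torch.cat([x, self.body(x)], dim=1)

class TransitionWithoutPooling(nn.Module):
    """1x1 conv only: channel compression without down-sampling, so extra
    dense blocks can be stacked while keeping the feature-map resolution."""
    def __init__(self, in_ch, theta=1.0):
        super().__init__()
        out_ch = int(in_ch * theta)  # theta is the compression factor
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )

    def forward(self, x):
        return self.body(x)
```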
        Principle 3: Stem block. Motivated by Inception-v3 [33] and v4 [31], we define the stem block as a stack of three 3×3 convolution layers followed by a 2×2 max pooling layer. The first conv layer works with stride = 2 and the other two with stride = 1. We find that adding this simple stem structure can noticeably improve detection performance in our experiments. We conjecture that, compared with the original design in DenseNet (a 7×7 conv layer with stride = 2, followed by a 3×3 max pooling with stride = 2), the stem block can reduce the information loss from the raw input images. We will show that the reward of this stem block for detection performance is significant in Section 4.1.2.
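        A sketch of this stem follows, under the same caveats (PyTorch illustration; the channel widths are our assumption, following the DS/64-... naming in which the first conv layer has 64 channels):

```python
import torch.nn as nn

def make_stem(in_ch=3, stem_ch=64):
    """Three 3x3 convs (the first with stride 2) then a 2x2 max pool,
    replacing DenseNet's original 7x7/stride-2 conv + 3x3/stride-2 pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, stem_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(stem_ch), nn.ReLU(inplace=True),
        nn.Conv2d(stem_ch, stem_ch, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_ch), nn.ReLU(inplace=True),
        nn.Conv2d(stem_ch, 2 * stem_ch, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(2 * stem_ch), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
```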
        Principle 4: Dense prediction structure. Figure 1 compares the plain structure (as in SSD) and the dense structure in our proposed front-end sub-network. SSD designs the prediction layers as an asymmetric hourglass structure. For 300×300 input images, six scales of feature maps are used for predicting objects. The scale-1 feature maps come from a middle layer of the backbone sub-network, which have the largest resolution (38×38) in order to handle small objects in an image. The remaining five scales are on top of the backbone sub-network. Then, a plain transition layer with a bottleneck structure (a 1×1 conv layer for reducing the number of feature maps plus a 3×3 conv layer) [33, 9] is adopted between two adjacent scales of feature maps.
        Learn half and reuse half. In the plain structure in SSD (see Figure 1), each later scale is directly transitioned from the adjacent previous scale. We propose a dense prediction structure that fuses multi-scale information for each scale. For simplicity, we restrict each scale to output the same number of channels for the prediction feature maps. In DSOD, at each scale (except scale-1), half of the feature maps are learned from the previous scale through a series of conv layers, while the remaining half are directly down-sampled from the adjacent high-resolution feature maps. The down-sampling block consists of a 2×2, stride = 2 max pooling layer followed by a 1×1, stride = 1 conv layer. The pooling layer matches the resolution to the current scale during concatenation. The 1×1 conv layer is used to cut the number of channels to 50%. The pooling layer is placed before the 1×1 conv layer for the sake of reducing computational cost. This down-sampling block actually brings each scale the multi-resolution feature maps from all of its preceding scales, which is essentially identical to the dense layer-wise connections introduced in DenseNets. For each scale, we only learn half of the new feature maps and reuse the remaining half of the previous ones. This dense prediction structure can yield more accurate results with fewer parameters than the plain structure, as will be studied in Section 4.1.
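        The down-sampling block and the "learn half, reuse half" fusion might look roughly as follows; again a PyTorch sketch with assumed channel widths and layer ordering, not the authors' Caffe code:

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """2x2/stride-2 max pool, then 1x1 conv cutting channels to `out_ch`.
    Pooling first keeps the 1x1 conv cheap (it runs at the lower resolution)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
        )

    def forward(self, x):
        return self.conv(self.pool(x))

class DensePredictionScale(nn.Module):
    """Output of one scale = [learned half || reused (down-sampled) half]."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.learned = nn.Sequential(   # bottleneck: 1x1 conv, then 3x3 stride 2
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, stride=2, padding=1, bias=False),
        )
        self.reused = DownsampleBlock(in_ch, half)

    def forward(self, prev_scale):
        # Half the maps are newly learned, half are reused from the previous
        # scale; concatenation mirrors DenseNet's layer-wise connections.
        return torch.cat([self.learned(prev_scale), self.reused(prev_scale)], dim=1)
```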

3.2. Training Settings

        We implement our detector based on the Caffe framework [14]. All our models are trained from scratch with the SGD solver on NVIDIA Titan X GPUs. Since each scale of DSOD's feature maps is concatenated from multiple resolutions, we adopt the L2 normalization technique [22] to scale the feature norm to 20 on all outputs. Note that SSD applies this normalization only to scale-1. Most of our training strategies follow SSD, including data augmentation, the scale and aspect ratios of the default boxes, and the loss function (e.g. smooth L1 loss for localization and softmax loss for classification), while we have our own learning rate scheduling and mini-batch size settings. Details are given in the experimental section.
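        For concreteness, the L2 normalization technique [22] can be sketched as a layer that L2-normalizes each spatial position across channels and rescales it with a learnable per-channel scale initialized to 20 (a PyTorch illustration, not the paper's Caffe layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    def __init__(self, n_channels, initial_scale=20.0):
        super().__init__()
        # One learnable scale per channel, initialized to 20 as in [22].
        self.scale = nn.Parameter(torch.full((n_channels,), initial_scale))

    def forward(self, x):
        # x: (N, C, H, W); normalize each spatial position's channel vector
        # to unit L2 norm, then rescale per channel.
        x = F.normalize(x, p=2, dim=1)
        return x * self.scale.view(1, -1, 1, 1)
```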

4. Experiments

        We conduct experiments on the widely used PASCAL VOC 2007, 2012 and MS COCO datasets, which have 20, 20 and 80 object categories, respectively. Object detection performance is measured by mean average precision (mAP).
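        For readers unfamiliar with the metric, a rough sketch of per-class average precision (AP) with all-point interpolation is given below; mAP is the mean of the per-class APs. This is our own illustration, not the exact PASCAL VOC evaluation code:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt_boxes):
    """scores: confidence of each detection; is_true_positive: 1/0 flags
    obtained by matching detections to ground truth (IoU >= 0.5 in VOC)."""
    order = np.argsort(-np.asarray(scores))           # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt_boxes, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap  # mAP = mean of per-class APs
```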

4.1. Ablation Study on PASCAL VOC 2007

        We first investigate each component and design principle of our DSOD framework. The results are mainly summarized in Table 2 and Table 3. We design several controlled experiments on PASCAL VOC 2007 with our DSOD300 (with 300×300 inputs) for this ablation study. A consistent setting is imposed on all the experiments, unless some components or structures are examined. In this study, we train the models with the combined training set from VOC 2007 trainval and 2012 trainval ("07+12"), and test on the VOC 2007 test set.

4.1.1Configurations in Dense Blocks

        We first investigate the impact of different configurations of the dense blocks in the backbone sub-network.
        Compression factor in transition layers. We compare two compression factor values (θ = 0.5, 1) in the transition layers of DenseNets. Results are shown in Table 3 (rows 2 and 3). Compression factor θ = 1 means there is no feature map reduction in the transition layer, while θ = 0.5 means half of the feature maps are reduced. Results show that θ = 1 yields 2.9% higher mAP than θ = 0.5.
        # Channels in bottleneck layers. As shown in Table 3 (rows 3 and 4), we observe that wider bottleneck layers (with more channels of response maps) improve the performance greatly (by 4.1% mAP).
        # Channels in the first conv layer. We observe that a large number of channels in the first conv layer is beneficial, which brings a 1.1% mAP improvement (rows 4 and 5 in Table 3).
        Growth rate. A large growth rate k is found to be much better. We observe a 4.8% mAP improvement in Table 3 (rows 5 and 6) when increasing k from 16 to 48 with 4k bottleneck channels. (The channel bookkeeping behind these settings is summarized in the formulas below.)
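        For reference, the channel bookkeeping probed by the four ablations above can be summarized as follows (standard DenseNet-style notation; the symbols are ours, not the paper's): inside a dense block the ℓ-th layer sees k₀ + k(ℓ−1) input maps and adds k new ones, its bottleneck uses 4k channels, and a transition layer with compression factor θ keeps ⌊θm⌋ of its m input maps.

```latex
\[
  c_{\text{in}}(\ell) = k_0 + k(\ell - 1), \qquad
  c_{\text{bottleneck}} = 4k, \qquad
  c_{\text{out}}^{\text{trans}} = \lfloor \theta \, m \rfloor
\]
```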

4.1.2 Effectiveness of Design Principles

        We now justify the effectiveness of the key design principles elaborated earlier.
        Proposal-free framework. We tried to train object detectors from scratch using the proposal-based frameworks Faster R-CNN and R-FCN. However, the training process failed to converge for all the network structures we attempted (VGGNet, ResNet, DenseNet). We then tried to train object detectors using the proposal-free framework SSD. The training converged successfully, but gave much worse results (69.6% with VGG) compared with the case of fine-tuning from a pre-trained model (75.8%), as shown in Table 4. This experiment validates our design principle of choosing a proposal-free framework.
        Deep supervision. We then tried training object detectors from scratch with deep supervision. Our DSOD300 achieves 77.7% mAP, which is much better than SSD300 trained from scratch with VGG16 and without deep supervision (69.6%). It is also much better than the fine-tuned result of SSD300 (75.8%). This validates the principle of deep supervision.
        Transition w/o pooling layer. We compare the case without this designed layer (only 3 dense blocks) against the case with the designed layer (4 dense blocks in our design). The backbone network is DS/32-12-16-0.5. Results are shown in Table 3. The network structure with the transition w/o pooling layer brings a 1.7% performance gain, which validates the effectiveness of this layer.
        Stem block. As can be seen in Table 3 (rows 6 and 9), the stem block improves the performance from 74.5% to 77.3%. This validates our conjecture that the stem block can protect against information loss from the raw input images.
        Dense prediction structure. We analyze the dense prediction structure from three aspects: speed, accuracy and parameters. As shown in Table 4, DSOD with the dense front-end structure runs slightly slower than the plain structure (17.4 fps vs. 20.6 fps) on a Titan X GPU, due to the overhead from the additional down-sampling blocks. However, the dense structure improves mAP from 77.3% to 77.7%, while reducing the parameters from 18.2M to 14.8M. Table 3 gives more details (rows 9 and 10). We also tried to replace the prediction layers in SSD with our proposed dense prediction layers. With VGG-16 as the backbone, the accuracy on the VOC 2007 test set can be improved from 75.8% (original SSD) to 76.1% with a pre-trained model, and from 69.6% to 70.4% when training from scratch. This verifies the effectiveness of the dense prediction layers.
        What if pre-training on ImageNet? It is interesting to see the performance of a DSOD backbone network pre-trained on ImageNet. We trained one lite backbone network, DS/64-12-16-1, on ImageNet, which obtains 66.8% top-1 accuracy and 87.8% top-5 accuracy on the validation set (slightly worse than VGG-16). After fine-tuning the whole detection framework on the "07+12" trainval set, we achieve 70.3% mAP on the VOC 2007 test set. The corresponding training-from-scratch solution achieves 70.7% accuracy, which is even slightly better. Future work will investigate this point more thoroughly.

4.1.3 Runtime Analysis

        The inference speeds are shown in the 6th column of Table 4. With 300×300 input, our full DSOD can process an image in 48.6ms (20.6 fps) on a single Titan X GPU with the plain prediction structure, and in 57.5ms (17.4 fps) with the dense prediction structure. For comparison, R-FCN runs at 90ms (11 fps) with ResNet-50 and 110ms (9 fps) with ResNet-101, while SSD300* runs at 82.6ms (12.1 fps) with ResNet-101 and 21.7ms (46 fps) with VGGNet. In addition, our model uses only about 1/2 the parameters of SSD300 with VGGNet, 1/4 of SSD300 with ResNet-101, 1/4 of R-FCN with ResNet-101, and 1/10 of Faster R-CNN with VGGNet. A lite version of DSOD (10.4M parameters, without any speed optimization) can run at 25.8 fps with only a 1% mAP drop.

4.2. Results on PASCAL VOC 2007

        Our models are trained on the union of VOC 2007 trainval and VOC 2012 trainval ("07+12"), following [21]. We use a batch size of 128. Note that this batch size exceeds the capacity of GPU memory (even on an 8-GPU server, each with 12GB memory). We use a trick to overcome the GPU memory constraint by accumulating gradients over two training iterations, which has been implemented on the Caffe platform [14]. The initial learning rate is set to 0.1, and then divided by 10 after every 20k iterations. The training finishes after 100k iterations. Following [21], we use a weight decay of 0.0005 and a momentum of 0.9. All conv layers are initialized with the "xavier" method [6].
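        A minimal PyTorch equivalent of this gradient-accumulation trick is sketched below; the paper implements it inside Caffe (whose solver exposes an iter_size option for exactly this), and the model, loss and data here are dummies standing in for the real detector:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3, padding=1)      # stand-in for the detection network
criterion = nn.MSELoss()                    # stand-in for the SSD-style loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)

accum_steps = 2                             # 2 iterations of 64 = effective batch 128
optimizer.zero_grad()
for step in range(100):
    images = torch.randn(64, 3, 32, 32)     # dummy mini-batch of 64
    targets = torch.randn(64, 8, 32, 32)
    loss = criterion(model(images), targets)
    (loss / accum_steps).backward()          # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()                     # one SGD update per 128 samples
        optimizer.zero_grad()
```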
        Table 4 shows our results on the VOC 2007 test set. SSD300* is the updated SSD result with the new data augmentation technique. Our DSOD300 with the plain connection reaches 77.3%, which is slightly better than SSD300* (77.2%). DSOD300 with the dense prediction structure improves the result to 77.7%. After adding COCO as training data, the performance is further improved to 81.7%.

4.3. Results on PASCAL VOC 2012

        For the VOC 2012 dataset, we use VOC 2012 trainval and VOC 2007 trainval + test for training, and test on the VOC 2012 test set. The initial learning rate is set to 0.1 for the first 30k iterations, and then divided by 10 after every 20k iterations. The total number of training iterations is 110k. Other settings are the same as those used in our VOC 2007 experiments. The results of our DSOD300 are shown in Table 4. DSOD300 achieves 76.3% mAP, which is consistently better than SSD300* (75.8%).

4.4. Results on MS COCO

        Finally, we evaluate our DSOD on the MS COCO dataset [20]. MS COCO contains 80k images for training, 40k for validation and 20k for testing (the test-dev set). Following [27, 19], we use the trainval set (train set + validation set) for training. The batch size is also set to 128. The initial learning rate is set to 0.1 for the first 80k iterations, and then divided by 10 after every 60k iterations. The total number of training iterations is 320k.
        The results are summarized in Table 6. Our DSOD300 achieves 29.3%/47.3% on the test-dev set, which outperforms the baseline SSD300* by a large margin. Our result is comparable to the single-scale R-FCN, and is close to the multi-scale R-FCN results that use ResNet-101 as the pre-trained model. Interestingly, we observe that our result at 0.5 IoU is lower than R-FCN's, but our [0.5:0.95] result is better or comparable. This indicates that our predicted locations are more accurate than R-FCN's under larger overlap settings. It is reasonable that our small-object detection accuracy is slightly lower than R-FCN's, since our input image size (300×300) is much smaller than R-FCN's (~600×1000). Even with this drawback, our large-object detection accuracy is still much better than R-FCN's. This further demonstrates the effectiveness of our method. Figure 2 shows some qualitative detection examples on COCO with our DSOD300 model.

5. Discussion

        Better model structure vs. more training data. An emerging idea in the computer vision community is that object detection and other vision tasks can be solved by deeper and larger neural networks backed with massive training data like ImageNet [3]. Thus more and more large-scale datasets have been collected and released recently, such as the Open Images dataset [16], which is 7.5x larger in the number of images and 6x larger in the number of categories than ImageNet. We definitely agree that, under the assumption of unlimited training data and unlimited computation power, deep neural networks should perform extremely well. However, our proposed approach and experimental results imply an alternative view on this problem: a better model structure can achieve similar or better performance compared with complex models trained from large data. In particular, our DSOD is only trained with 16,551 images on VOC 2007, yet it achieves competitive or even better performance than models trained with 1.2 million + 16,551 images.
        In this premise, it is worth reiterating that, as datasets grow larger, training deep neural networks becomes more and more expensive. Thus a simple yet efficient approach becomes increasingly important. Despite its conceptual simplicity, our approach shows great potential in this setting.
        Why train from scratch? There have been many successful cases of model fine-tuning, so one may ask why we should train object detectors from scratch. We argue that, as briefly discussed above, training from scratch is essential in at least two cases. First, there may be a big domain difference between the pre-trained model domain and the target domain. For instance, most pre-trained models are trained on large-scale RGB image datasets (e.g. ImageNet). It is very difficult to transfer an ImageNet model to domains such as depth images, multi-spectral images, medical images, etc. Some advanced domain adaptation techniques have been proposed, but it would be great to have a technique that can train object detectors from scratch. Second, model fine-tuning restricts the structure design space for object detection networks. This is critical for deploying deep neural network models in resource-limited Internet-of-Things (IoT) scenarios.
        Model compactness vs. performance. A trade-off between model compactness (in terms of the number of parameters) and performance is often reported. Most CNN-based detection solutions require a huge memory space to store the massive parameters. Therefore, those models are usually unsuitable for low-end devices like mobile phones and embedded electronics. Thanks to the parameter-efficient dense blocks, our model is much smaller than most competitive methods. For instance, our smallest dense model (DS/64-64-16-1, with dense prediction layers) achieves 73.6% mAP with only 5.9M parameters, which shows great potential for applications on low-end devices.

6. Conclusion

        We propose the Deeply Supervised Object Detector (DSOD), a simple yet efficient framework for training object detectors from scratch. Without using pre-trained models on ImageNet, DSOD demonstrates competitive accuracy against state-of-the-art detectors (such as SSD, Faster R-CNN and R-FCN) on the popular PASCAL VOC 2007, 2012 and MS COCO datasets, with only 1/2, 1/4 and 1/10 of the parameters compared to SSD, R-FCN and Faster R-CNN, respectively. DSOD has great potential in different scenarios involving depth, medical and multi-spectral images. Our future work will consider these domains, as well as learning ultra-efficient DSOD models to support resource-bounded devices.


Origin blog.csdn.net/sinat_39307513/article/details/86493977