ILSVRC2016 Object Detection Task Review: Image Object Detection (DET)

Companion: ILSVRC2016 Object Detection Task Review - Video Object Detection (VID)

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the most authoritative evaluation in computer vision, has attracted wide attention since it was first held in 2010. In the 2016 image object detection task, Chinese teams performed strongly, taking the top five places (as shown in Figure 1). Based on the abstracts submitted by the top five teams and their published papers or technical reports, we briefly analyze the image object detection methods used in the competition.

Figure 1. ILSVRC2016 object detection (no additional data) task results

Generally speaking, most teams used a ResNet/Inception backbone with the Faster R-CNN framework, paid attention to network pre-training, improved the RPN, exploited context information, applied common test-time techniques such as multi-scale testing, horizontal flipping, and window voting, and finally fused multiple models to obtain the result.

Below we break down the highlights of these entries.

1. Using Context Information

1.1 GBD-Net

GBD-Net (Gated Bi-Directional CNN) is the work of the CUImage team and a highlight of this year's DET task. The method uses a bidirectionally gated CNN to selectively pass information between context windows of different scales, thereby modeling context.

The motivation for GBD-Net stems from a careful analysis of the role context plays when classifying candidate windows. On one hand, context can be decisive: as shown in Figure 2(a)(b), the category of the red box (including whether it is background) can only be determined accurately by combining it with context, so in many cases we can rely on context to make the judgment shown in Figure 2(c). On the other hand, as Figure 2(d) shows, not all context provides correct guidance, so context must be used selectively.

Figure 2. Research motivation of GBD-Net

Based on this, CUImage proposed GBD-Net. As shown in Figure 3, GBD-Net gathers context in the same spirit as [2][3]: the candidate window is enlarged to collect more context, or shrunk to retain more object detail, producing multiple support regions. A bidirectionally connected network then lets information at different scales and resolutions flow between the support regions, so that optimal features are learned jointly. However, as the motivation above shows, not all context helps the decision, so a "gate" is added to each bidirectional connection to control how context propagates. With ResNet-269 as the base network, GBD-Net brings a 2.2% mAP improvement on the ImageNet DET dataset.

Figure 3. GBD-Net framework diagram
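To make the gating concrete, here is a minimal PyTorch sketch of a single gated connection between two support-region feature maps. It is an illustration only: the layer shapes, the choice to compute the gate from the source features, and the assumption that the two maps have already been resized to the same spatial size are ours, not details of the published GBD-Net.

```python
import torch
import torch.nn as nn

class GatedMessage(nn.Module):
    """One gated connection between two support-region feature maps
    (a sketch of the GBD-Net idea, not the published architecture)."""
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, h_src, h_dst):
        # The gate is computed from the source features and decides how
        # much of that region's context is allowed to reach the target;
        # both maps are assumed already resized to the same spatial size.
        g = torch.sigmoid(self.gate(h_src))
        return torch.relu(h_dst + g * self.transform(h_src))
```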

1.2 Dilation as context

The 360+MCG-ICT-CAS_DET team transferred the dilated-convolution approach to gathering context into the object detection task. They reduced the eight dilated convolution layers to three and aggregated the context corresponding to each pixel before ROI pooling, as shown in Figure 4, eliminating the need to repeatedly extract context features for each ROI. With ResNet-50 as the base network, this method brings a 1.5% improvement on the VOC07 dataset.

Figure 4. Diagram of the Dilation-as-context method
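A minimal sketch of this idea in PyTorch: a few dilated convolutions are inserted before ROI pooling so that every pixel of the shared feature map already carries wide context. The channel count and dilation rates below are assumptions; the article only states that eight dilated layers were reduced to three.

```python
import torch.nn as nn

# Three dilated 3x3 convolutions applied to the shared feature map
# before ROI pooling; each ROI then inherits per-pixel context for free.
# Channel count (1024) and dilation rates (2, 4, 8) are assumed.
context_module = nn.Sequential(
    nn.Conv2d(1024, 1024, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, 3, padding=8, dilation=8), nn.ReLU(inplace=True),
)
```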

1.3 Global context

In 2015, a method was proposed that applies ROI pooling to the whole image to obtain context information. Building on this, the Hikvision team refined it into the global context method shown in Figure 5(a), which brought a 3.8% mAP improvement on the ILSVRC DET validation set. This method has been described in detail in previous articles and is not repeated here.

In addition to the ROI-pooling-based global context, CUImage followed the global context method in [6], appending the global classification result to the features of each ROI, as shown in Figure 5(b). Adding this global context to GBD-Net's local context further improves mAP by 1.3%.

Figure 5. Schematic diagram of the global context methods
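As an illustration of the ROI-pooling-based variant in Figure 5(a), here is a hedged PyTorch sketch that pools the whole image as one extra "ROI" and concatenates it to each window's pooled features. The pooled size and the concatenation point are assumptions.

```python
import torch
import torchvision.ops as ops

def roi_features_with_global_context(feat, rois, pool=7):
    """Concatenate per-ROI pooled features with a whole-image pooled
    feature (sketch). feat: (1, C, H, W); rois: (R, 5) rows of
    (batch_idx, x1, y1, x2, y2) in feature-map coordinates."""
    _, _, h, w = feat.shape
    # A single 'ROI' covering the full image yields the global context.
    full = torch.tensor([[0.0, 0.0, 0.0, w - 1.0, h - 1.0]],
                        device=feat.device)
    g = ops.roi_pool(feat, full, (pool, pool))   # (1, C, pool, pool)
    r = ops.roi_pool(feat, rois, (pool, pool))   # (R, C, pool, pool)
    return torch.cat([r, g.expand(r.size(0), -1, -1, -1)], dim=1)
```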

2. Improving the Classification Loss

The 360+MCG-ICT-CAS_DET team proposed two improvements to the softmax loss: splitting the background class into several implicit sub-categories, and adding a "sink" class when necessary.

In Faster R-CNN, all windows whose IoU with the ground truth exceeds 0.5 are treated as positive samples, and those with IoU between 0.1 and 0.4 are treated as background samples. While the samples within each object category are quite similar to one another, the samples within the background category therefore differ enormously, and it is unfair to treat the background exactly like an object category. The implicit sub-category method thus divides the background into several sub-categories, giving the model more parameters to describe the varied background; before the softmax, all sub-categories are re-aggregated into a single background class, which avoids having to define each sub-category explicitly (as shown in Figure 6(a)).

Figure 6. Improved classification losses
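A minimal sketch of the implicit sub-category loss in PyTorch, assuming K background sub-class logits that are merged by log-sum-exp (equivalent to summing the sub-class probabilities) into one background class before the usual cross-entropy. The function and tensor layout are illustrative, not the team's code.

```python
import torch
import torch.nn.functional as F

def implicit_bg_loss(logits_obj, logits_bg_sub, labels):
    """Softmax loss with implicit background sub-categories (sketch).

    logits_obj:    (N, C) logits for the C object categories
    logits_bg_sub: (N, K) logits for K implicit background sub-categories
    labels:        (N,) integers in [0, C], where C means "background"
    """
    # Re-aggregate the K sub-categories into one background logit.
    # log-sum-exp keeps the merge differentiable, so the sub-categories
    # compete freely during training yet appear as one class in the loss.
    bg_logit = torch.logsumexp(logits_bg_sub, dim=1, keepdim=True)  # (N, 1)
    merged = torch.cat([logits_obj, bg_logit], dim=1)               # (N, C+1)
    return F.cross_entropy(merged, labels)
```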

In addition, the training data itself contains conflicts: the same image content may be a positive sample in one context and a negative sample in another, and samples of different categories can look very similar. For some windows the score of the ground-truth category therefore never becomes very high, while some wrong categories outscore it. The sink method adds a "sink" class: when the ground-truth class is not in the Top-K, the sink class and the ground-truth class are optimized together; otherwise the ground-truth class is optimized as usual. The scores of the wrong categories are thereby diverted into the sink class, so that even when the ground-truth score is not especially high it can still exceed the other categories, as shown in Figure 6(b).
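One possible reading of the sink mechanism in PyTorch follows; the Top-K test on the non-sink logits and the equal weighting of the two loss terms are our assumptions, as the article does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def sink_loss(logits, labels, k=5):
    """Softmax loss with an extra 'sink' class (sketch). The last column
    of `logits` is the sink class; `labels` never points at it."""
    topk = logits[:, :-1].topk(k, dim=1).indices            # sink excluded
    gt_in_topk = (topk == labels.unsqueeze(1)).any(dim=1)   # (N,) bool

    # Ground truth in the Top-K: ordinary cross-entropy on the gt class.
    loss_gt = F.cross_entropy(logits, labels, reduction="none")
    # Otherwise: also push probability mass toward the sink class,
    # draining the scores of the competing wrong categories.
    sink_labels = torch.full_like(labels, logits.size(1) - 1)
    loss_sink = F.cross_entropy(logits, sink_labels, reduction="none")

    loss = torch.where(gt_in_topk, loss_gt, 0.5 * (loss_gt + loss_sink))
    return loss.mean()
```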

3. Improving the RPN

Both CUImage and Hikvision proposed improvements to the RPN, and both strategies derive from CRAFT (shown in Figure 7), which appends a binary-classification Fast R-CNN after the RPN to further reduce the number of proposal windows and improve localization accuracy.

Figure 7. The CRAFT framework

CUImage upgraded CRAFT to CRAFT-v3: random crops were added during training, a multi-scale strategy was adopted at test time, the ratio of positive to negative samples was balanced, and two models were fused, raising proposal recall@300 on ILSVRC DET val2 to 95.3%.

Hikvision directly followed the idea of box refinement, cascading another stage on top of the RPN, as shown in Figure 8. They also observed that although Faster R-CNN ideally wants a 1:1 ratio of positive to negative RPN samples, in practice positives are often scarce and the ratio fluctuates widely; after forcing the positive-to-negative ratio to stay within 1:1.5, recall increased by 3%.

Figure 8. Cascaded RPN
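A minimal sketch of capping the negative-to-positive ratio when sampling an RPN mini-batch; the 1:1.5 cap comes from the text, while the batch size and sampling scheme are assumptions.

```python
import random

def sample_anchors(pos_idx, neg_idx, batch_size=256, max_neg_ratio=1.5):
    """Sample RPN anchors with the negative:positive ratio capped (sketch).

    pos_idx, neg_idx: lists of anchor indices labeled positive/negative.
    """
    n_pos = min(len(pos_idx), batch_size)
    # Cap negatives at max_neg_ratio times the positives actually kept,
    # and never exceed the mini-batch budget.
    n_neg = min(len(neg_idx), int(n_pos * max_neg_ratio), batch_size - n_pos)
    return random.sample(pos_idx, n_pos), random.sample(neg_idx, n_neg)
```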

4. Network Selection and Training Tricks

Since ILSVRC2015, ResNet and its successors Inception-v4 and Identity Mapping networks have been widely used in object detection, scene segmentation, and other applications thanks to their strong classification performance. Different networks usually converge to different local optima, and this diversity between networks is the key to the large gains from model fusion. CUImage, Hikvision, Trimps-Soushen, 360+MCG-ICT-CAS_DET, and NUIST all trained multiple models on different base networks for fusion.

Targeted pre-training before training the detection model usually lets the final detector converge to a better solution. Hikvision noted that initializing the global-context branch from a pretrained model works much better than random initialization. They also used the ILSVRC LOC data to first pre-train a fine-grained 1000-class detection model, then transferred it to the DET data to train the 200-class model. CUImage likewise stressed the importance of pre-training: after training the classification network in the usual 1000-class image-centric way, they continued training it in an object-centric way based on ROI pooling. This pre-training raised the final detector's mAP by about 1%.

Furthermore, Hikvision reported that forcing a balanced ratio of positive and negative samples during training has a positive effect, and that techniques such as OHEM and multi-scale training are simple and effective ways to improve mAP.

5. Testing Tricks

Many techniques can be applied at test time to add further gains on top of the trained detector. Multi-scale testing, horizontal flipping, window fine-tuning with multi-window voting, NMS threshold adjustment, and multi-model fusion are all widely used and have been broadly verified to be effective.

Trimps-Soushen and 360+MCG-ICT-CAS_DET integrated the Feature Maxout method into multi-scale testing, scaling each window to a size close to 224x224 at test time to make full use of the pre-trained network. The box refinement and box voting methods first exploit the window-regression step of the Fast R-CNN family, iterating it several times, and then let all windows vote to determine the final category and position of each target. How to fuse models for object detection was rarely discussed in earlier competitions; in ILSVRC2016, CUImage, Hikvision, Trimps-Soushen, and 360+MCG-ICT-CAS_DET all adopted almost the same strategy: one or more RPN models generate a fixed set of ROIs, and the classification and regression results these ROIs obtain from the different models are then summed to give the final result. Among the fusion variants tried, summing the scores gave the best fusion performance.
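A hedged sketch of this fusion over a shared ROI set. The text only says the per-model results are added; summing the class scores and averaging the regressed boxes, as done below, is our reading.

```python
import numpy as np

def fuse_detections(per_model_scores, per_model_boxes):
    """Fuse detectors that classified/regressed the SAME fixed ROIs
    (sketch), so results align index-wise across models.

    per_model_scores: list of (N, C) arrays, one per model
    per_model_boxes:  list of (N, C, 4) arrays, one per model
    """
    fused_scores = np.sum(per_model_scores, axis=0)   # add class scores
    fused_boxes = np.mean(per_model_boxes, axis=0)    # average boxes (assumed)
    return fused_scores, fused_boxes
```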

Summary

This article has given a general summary of the methods used in the ILSVRC2016 DET task. An object detection system has many stages and an intricate pipeline, and every detail matters. In research, beyond grasping the overall architecture, how the important details are handled often determines whether a method is effective.

About the author: Li Yu is a combined master's and doctoral student in the Intermedia Group of the Prospective Research Laboratory at the Institute of Computing Technology, Chinese Academy of Sciences, supervised by associate researcher Tang Sheng and doctoral advisor Li Jintao. In 2016, as a core member of the 360+MCG-ICT-CAS_DET team, he participated in the DET task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and placed fourth. The related object detection work was invited for a presentation at the ECCV 2016 ImageNet and COCO Visual Recognition Challenges joint workshop.
