ILSVRC2016 Object Detection Task Review - Video Object Detection (VID)

Companion: ILSVRC2016 Object Detection Task Review (Part 1) - Image Object Detection

Image object detection has made tremendous progress over the past three years, and detection performance has improved significantly. However, in fields such as video surveillance and driver assistance, video-based object detection is in even wider demand. Because videos suffer from motion blur, occlusion, diverse appearance changes, and diverse illumination changes, applying image object detection techniques alone to video frames does not produce good results. Exploiting the temporal and contextual information in video is therefore the key to improving video object detection performance.

ILSVRC2015 introduced a new video object detection task (object detection from video, VID), which provides researchers with good data support. The VID evaluation metric in ILSVRC2015 is the same as that for image object detection: the mAP over detection windows. For video object detection, however, a good detector must not only detect accurately on every frame but also produce consistent/continuous results (that is, for a specific object, a good detector should detect it continuously and not confuse it with other objects). ILSVRC2016 added a new subtask to VID for this problem (see Part 4, on ensuring the timing consistency of video object detection).

In ILSVRC2016, the top three places in both VID subtasks without external data were taken by teams from China (see Table 1 and Table 2). Based on the materials published by the four teams NUIST, CUVideo, MCG-ICT-CAS, and ITLab-Inha, this article summarizes the video object detection methods used in ILSVRC2016.


Table 1. ILSVRC2016 VID results (no external data)


Table 2. ILSVRC2016 VID tracking results (no external data)

A study of the participating teams' reports [2-5] shows that current video object detection algorithms mainly follow this framework:

  • Treat each video frame as an independent image and apply an image object detection algorithm to obtain per-frame detections;

  • Use the temporal and contextual information of the video to correct the detections;

  • Further revise the results based on tracking trajectories seeded from high-quality detection windows.

This article is divided into four parts. The first three introduce how to improve the accuracy of video object detection, and the last introduces how to ensure its consistency.

1. Single-frame image object detection

At this stage, the video is usually split into independent frames, and a relatively robust single-frame result is obtained by choosing a strong image object detection framework together with the various techniques that improve image detection accuracy. These were summarized in detail in "ILSVRC2016 Object Detection Task Review (Part 1) - Image Object Detection" and are not repeated here.

Combining our own experiments with the participating teams' materials, we believe that the choice of training data and the choice of network structure play a crucial role in detection performance.

  • Training data selection

First, consider the ILSVRC2016 VID training data: the VID dataset covers 30 categories, and the training set contains 3862 video clips with more than 1.12 million frames in total. From the numbers alone, this seems like enough data to train detectors for 30 classes. However, within a single video clip the background is monotonous and adjacent frames differ little, so for training existing detection models the VID training set is highly redundant and lacks diversity; it needs to be augmented. The competition rules allow extracting images containing the VID categories from the ILSVRC DET and ILSVRC LOC data for augmentation. CUVideo, NUIST, and MCG-ICT-CAS used ILSVRC VID+DET as their training set, while ITLab-Inha used ILSVRC VID+DET, COCO DET, and others. Note that when building the new training set, attention should be paid to balancing samples and removing redundancy (CUVideo and MCG-ICT-CAS sampled part of the VID training set to train their models, ITLab-Inha selected a fixed number of images per category, and NUIST used models trained on DET to filter the VID data). For the same network, training on the augmented dataset improves detection accuracy by about 10%.
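
To make the redundancy-removal and class-balancing step concrete, here is a minimal Python sketch; the annotation format, the frame stride, and the per-class cap are illustrative assumptions, not any team's published settings.

```python
import random
from collections import defaultdict

def balance_training_set(annotations, per_class_cap=2000, frame_stride=10, seed=0):
    """Subsample a VID-style training list to reduce redundancy and balance classes.

    annotations: list of (video_id, frame_idx, class_label) tuples
                 (hypothetical format; adapt to your annotation files).
    """
    random.seed(seed)
    # 1. Temporal subsampling: keep every `frame_stride`-th frame of a clip,
    #    since adjacent VID frames are nearly identical.
    sparse = [a for a in annotations if a[1] % frame_stride == 0]

    # 2. Per-class cap: draw at most `per_class_cap` samples per category
    #    so that frequent classes do not dominate training.
    by_class = defaultdict(list)
    for a in sparse:
        by_class[a[2]].append(a)
    balanced = []
    for cls, items in by_class.items():
        random.shuffle(items)
        balanced.extend(items[:per_class_cap])
    return balanced
```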

  • Network structure selection

Different network structures also have a great impact on detection performance. We conducted experiments on the VID validation set: with the same training data, a Faster R-CNN [7] model with a ResNet-101 [6] backbone achieves about 12% higher detection accuracy than one based on VGG16 [8]. This was also key to MSRA's success in the 2015 ILSVRC and COCO competitions. The top teams in this year's competition basically all used ResNet/Inception base networks, and CUVideo used the 269-layer GBD-Net [9].

2. Improving the classification loss

Objects in some video frames suffer from motion blur, low resolution, occlusion, and other problems; even the best current image object detection algorithms cannot detect them well. Fortunately, the temporal and contextual information in videos can help with these problems. Representative methods are the motion-guided propagation (MGP) and multi-context suppression (MCS) of T-CNN [10].

  • MGP

Single-frame detection results miss many objects, and these missed objects may appear in the detections of adjacent frames. We can therefore use optical flow to propagate the detections of the current frame forward and backward; this MGP processing improves recall. As shown in Figure 1, the detection windows at time T are propagated forward and backward, which fills in the objects missed at times T-1 and T+1.
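
A minimal sketch of the propagation step, assuming a dense optical flow field (e.g. computed with OpenCV's calcOpticalFlowFarneback) and a simple score decay for propagated windows; T-CNN's exact propagation rules may differ.

```python
import numpy as np

def propagate_detections(boxes, scores, flow, decay=0.8):
    """MGP sketch: push frame-t detections to frame t+1 along dense optical flow.

    boxes:  (N, 4) array of [x1, y1, x2, y2] detections at frame t
    scores: (N,) detection scores at frame t
    flow:   (H, W, 2) dense optical flow from frame t to t+1
    Returns propagated boxes plus decayed scores for frame t+1; merging
    them with frame t+1's own detections (e.g. via NMS) raises recall.
    """
    propagated = []
    for (x1, y1, x2, y2) in boxes.astype(int):
        patch = flow[max(y1, 0):y2, max(x1, 0):x2]  # flow vectors inside the box
        if patch.size == 0:
            propagated.append([x1, y1, x2, y2])
            continue
        dx, dy = patch.reshape(-1, 2).mean(axis=0)  # mean motion of the object
        propagated.append([x1 + dx, y1 + dy, x2 + dx, y2 + dy])
    # Propagated windows are less reliable than real detections, so decay
    # their scores (the decay factor is an assumption).
    return np.asarray(propagated, dtype=np.float32), scores * decay
```

Backward propagation works the same way with the flow computed from frame t to frame t-1.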


Figure 1. Schematic diagram of MGP [10]

  • MCS

Treating video frames as separate images does not take full advantage of the contextual information of the whole video. Although objects of any category may in principle appear in a video, a single clip contains only a few categories, and those categories co-occur (a clip with ships may also contain whales, but almost certainly not zebras). We can therefore run a statistical analysis over the detections of the entire clip: sort all detection windows by score and keep the high-scoring categories; the remaining low-scoring categories are likely false detections, and their scores are suppressed (Figure 2). After MCS, the correct categories rank at the front and the wrong categories at the back, improving detection accuracy.
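
The description above translates into a simple per-clip rescoring routine. The sketch below is one possible reading of MCS; the number of trusted classes and the penalty value are chosen arbitrarily and are not T-CNN's exact formulation.

```python
import numpy as np

def multi_context_suppression(all_scores, all_labels, keep_k=3, penalty=0.4):
    """MCS sketch over one video clip.

    all_scores: (N,) scores of every detection window in the clip
    all_labels: (N,) class index of each window
    keep_k:     number of dominant classes to trust per clip (assumption)
    penalty:    subtracted from scores of non-dominant classes (assumption)
    """
    # Rank classes by their strongest evidence across the whole clip.
    classes = np.unique(all_labels)
    best = {c: all_scores[all_labels == c].max() for c in classes}
    dominant = set(sorted(best, key=best.get, reverse=True)[:keep_k])

    # Suppress windows whose class rarely scores well in this clip.
    adjusted = all_scores.copy()
    for i, c in enumerate(all_labels):
        if c not in dominant:
            adjusted[i] = max(adjusted[i] - penalty, 0.0)
    return adjusted
```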


Figure 2. Schematic diagram of multi-context suppression [10]

3. Correction using tracking information

The MGP described above can fill in objects missed on some frames, but it is not very effective for objects missed over many consecutive frames; object tracking solves this problem well. CUVideo, NUIST, MCG-ICT-CAS, and ITLab-Inha all used tracking algorithms to further improve the recall of video object detection. The basic process of obtaining object sequences with a tracker is as follows:

  • Run an image object detection algorithm to obtain high-quality detections;

  • Select the object with the highest detection score as the starting anchor for tracking;

  • Starting from the selected anchor, track forward and backward over the entire clip to generate a trajectory;

  • Select the highest-scoring object from the remaining detections and track it; note that if that window already appears in a previous trajectory, skip it and pick the next object;

  • Iterate, using a score threshold as the termination condition.

The resulting trajectories can be used both to improve recall and, as long-sequence contextual information, to revise the results.
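
The iterative anchor-selection loop above can be sketched as follows. Here track_fn stands in for any single-object tracker and is an assumption, as is the IoU test used to skip anchors already covered by an earlier trajectory.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def build_tracklets(detections, track_fn, score_floor=0.5, iou_skip=0.5):
    """Iteratively grow tracklets from high-confidence anchors.

    detections: list of (frame_idx, box, score) over the whole clip
    track_fn:   any single-object tracker; track_fn(frame_idx, box) is
                assumed to return a trajectory {frame_idx: box} obtained
                by tracking forward and backward from the anchor.
    """
    tracklets = []
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    while remaining and remaining[0][2] >= score_floor:
        frame_idx, box, _ = remaining.pop(0)
        # Skip anchors already explained by an existing trajectory.
        if any(iou(box, t[frame_idx]) > iou_skip
               for t in tracklets if frame_idx in t):
            continue
        tracklets.append(track_fn(frame_idx, box))
    return tracklets
```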

4. Ensuring the timing consistency of video object detection

For video object detection, in addition to accurate per-frame detection, each object should also be tracked stably over long periods. To this end, ILSVRC2016 added a VID subtask that computes the mAP over each object's tracking trajectory (tracklet/tubelet) to evaluate the timing consistency, i.e. the tracking continuity, of the detection algorithm.

Evaluation metric: image object detection mAP evaluates whether each detection window is accurate, while the timing-consistency evaluation measures whether each tracking trajectory is accurate. In image object detection, a detection window is a positive if its category matches the ground truth and its IoU with the ground-truth window exceeds 0.5. When evaluating timing consistency, a detected trajectory is a positive if it corresponds to the same object as a ground-truth trajectory (same trackId) and the proportion of its windows whose IoU with the corresponding ground-truth windows exceeds 0.5 is above a given ratio; the trajectory's score is the average of the scores of all windows along the sequence. It follows that splitting one object's trajectory into several segments, or mixing other objects into a trajectory, lowers consistency.
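
A small sketch of this matching criterion, reusing the iou helper from the tracklet sketch above; the source does not state the required overlap ratio, so 0.5 below is an assumption.

```python
def tracklet_matches(pred, gt, iou_thresh=0.5, ratio=0.5):
    """True-positive test for a detected trajectory against a ground-truth
    trajectory of the same class/trackId, per the criterion above.

    pred, gt: dicts mapping frame_idx -> [x1, y1, x2, y2]
    ratio:    required fraction of frames with IoU > iou_thresh (assumption)
    """
    shared = [f for f in pred if f in gt]
    if not shared:
        return False
    hits = sum(iou(pred[f], gt[f]) > iou_thresh for f in shared)
    return hits / len(shared) >= ratio

def tracklet_score(window_scores):
    """A trajectory's score is the mean of its window scores."""
    return sum(window_scores) / len(window_scores)
```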

So how can the timing consistency of objects in video detection be ensured? This article suggests three angles of attack:

  • Make the per-frame detections in the image detection stage as accurate as possible;

  • Track the high-quality detection windows and keep the tracking quality high (minimizing drift during tracking);

  • The trajectories obtained in the first two steps will overlap or adjoin one another, which requires targeted post-processing.

The ITLab-Inha team proposed a multi-object tracking algorithm based on changing-point detection [11]. The algorithm first detects objects and then tracks them, analyzing the trajectory points during tracking; this mitigates drift and terminates tracking promptly when a trajectory becomes abnormal.

For the consistency problem of video object detection, the authors' team, MCG-ICT-CAS, proposed a target-tube generation method based on both detection and tracking.

a. Tracking-based target tube / trajectory

b. Detection-based target tube

c. Fused tube based on both detection and tracking

Figure 3. Schematic diagrams of tubes generated by tracking, by detection, and by detection + tracking

Figure 3-a shows target tubes obtained with a tracking algorithm (red bounding boxes); the green boxes are the ground truth. As time passes, the tracking window gradually drifts off the target and may eventually lose it entirely. MCG-ICT-CAS proposed a detection-based tube generation method: as shown in Figure 3-b, the detection-based tube windows (red boxes) are localized accurately, but motion blur causes the detector to miss some frames. The analysis shows that tracking-generated tubes have high recall but inaccurate localization, while detection-generated tubes localize accurately but have lower recall. Since the two are complementary, MCG-ICT-CAS further proposed a tube fusion algorithm that fuses detection tubes and tracking tubes, merging windows that appear in both and stitching interrupted tubes.
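
A minimal sketch of one way such a fusion could work, again reusing the iou helper defined earlier. The merge rule below (averaging overlapping windows, otherwise preferring detection windows and bridging gaps with tracking windows) is an illustrative assumption, not MCG-ICT-CAS's published algorithm.

```python
def fuse_tubes(det_tube, trk_tube, iou_merge=0.5):
    """Fuse a detection tube with a tracking tube for the same target.

    det_tube, trk_tube: dicts mapping frame_idx -> [x1, y1, x2, y2]
    """
    fused = {}
    for f in sorted(set(det_tube) | set(trk_tube)):
        if f in det_tube and f in trk_tube:
            d, t = det_tube[f], trk_tube[f]
            if iou(d, t) > iou_merge:
                # Overlapping windows describe the same object: average them
                # (one simple merge rule; weighted fusion is also possible).
                fused[f] = [(a + b) / 2 for a, b in zip(d, t)]
            else:
                fused[f] = d  # trust the better-localized detection window
        else:
            # One tube is interrupted here: stitch with whichever window exists.
            fused[f] = det_tube.get(f, trk_tube.get(f))
    return fused
```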

As shown in Figure 4, compared with tubes generated by detection or tracking alone, the detection windows of the fused tubes keep a high recall as the IoU threshold increases, showing that the fused windows maintain both high recall and accurate localization. The fused target tubes improved mAP by 12.1% on the VID test set.


Figure 4. Recall of target tubes generated by different methods

Summary

This article introduced video object detection algorithms in the context of the ILSVRC2016 VID competition task. Compared with image object detection, current video object detection pipelines are rather cumbersome, and the information contained in videos themselves has not been fully exploited. How to streamline the pipeline so it runs in real time, how to further mine the rich information in videos for higher detection accuracy, and how to ensure the consistency of video object detection may be the key problems the field needs to tackle next.

References

[1] ILSVRC2016 related reports

[2] CUVideo slides

[3] NUIST slides

[4] MCG-ICT-CAS slides

[5] ITLab-Inha slides

[6] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.

[7] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.

[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[9] Zeng X, Ouyang W, Yang B, et al. Gated bi-directional CNN for object detection[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 354-369.

[10] Kang K, Li H, Yan J, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos[J]. arXiv preprint arXiv:1604.02532, 2016.

[11] Lee B, Erdenee E, Jin S, et al. Multi-class multi-object tracking using changing point detection[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 68-83.

Thanks to the author Wang Bin, Ph.D. student in the Cross-Media Computing Group of the Frontier Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, and to researcher Zhang Yongdong. In 2016, under the leadership of Associate Researcher Tang Sheng, Wang Bin and Xiao Junbin, as core members of the institute's MCG-ICT-CAS team, participated in the video object detection (VID) task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and won third place. The related object detection work was invited for a presentation at the ECCV 2016 ImageNet and COCO Visual Recognition Challenges Joint Workshop.
