Object detection paper notes: YOLOV: Making Still Image Object Detectors Great at Video Object Detection

1. Introduction

Different from the traditional two-stage pipeline, the paper proposes to place region-level selection after the (single-stage) prediction, avoiding the processing of a large number of low-quality candidate regions. In addition, a new module is built to evaluate the relationship between the target frame and the reference frames and to guide the aggregation.

The authors conduct extensive experiments to verify the effectiveness of the proposed method, showing that it outperforms other state-of-the-art VID methods in both accuracy and efficiency. On the ImageNet VID dataset, it achieves 87.5% AP50 at over 30 frames per second on a single 2080Ti GPU.


2. Key Ideas

The region-based CNN family (R-CNN) is the forerunner of two-stage object detectors and has many follow-up works. Given region-level features, these still-image detectors can be easily transferred to more complex tasks such as segmentation and video object detection. However, due to their two-stage nature, efficiency is a bottleneck for practical applications. Single-stage object detectors, in contrast, produce localization and classification jointly through dense prediction on the feature maps.

Video object detection can be considered an advanced version of still-image object detection. A video sequence can be processed by feeding it frame by frame into a still-image object detector. However, in this way the temporal information across frames is wasted, and this information may be the key to removing or reducing the degradation that appears in a single frame.

As shown in Figure 1, degradations such as motion blur, camera defocus, and occlusion often appear in video frames, significantly increasing the difficulty of detection. For example, it is difficult or even impossible for humans to tell where and what the objects are by looking only at the last frame in Figure 1. On the other hand, a video sequence provides richer information than a single still image, so other frames in the sequence may support the prediction for a particular frame.
Figure 1: Video frames suffer from various degradations, such as motion blur and occlusion, which make the base YOLOX detector fail on this task.

There are two main types of frame aggregation, namely box-level and feature-level; the two technical routes improve detection accuracy from different angles. Box-level methods link the bounding boxes predicted by a still-image detector across frames to form tubelets, and then refine the results within each tubelet. Box-level approaches can be viewed as post-processing, which can be flexibly applied to both one-stage and two-stage detectors. A toy sketch of such linking is given below.
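As an illustration of the box-level route, here is a toy PyTorch sketch of greedy tubelet linking; the greedy highest-IoU matching scheme and the threshold are illustrative assumptions, not a specific published method:

```python
import torch
from torchvision.ops import box_iou

def link_tubelets(frame_boxes, iou_thr=0.5):
    """Toy box-level linking: greedily chain each tubelet to the
    highest-IoU box in the next frame. The resulting tubelets can then
    be refined jointly, e.g. by re-scoring all boxes in a tubelet.

    frame_boxes: list over frames of (Ni, 4) box tensors (x1, y1, x2, y2)
    """
    tubelets = [[box] for box in frame_boxes[0]]
    for boxes in frame_boxes[1:]:
        if boxes.numel() == 0:
            continue
        for tube in tubelets:
            ious = box_iou(tube[-1].unsqueeze(0), boxes)  # (1, Ni)
            best_iou, best_idx = ious.max(dim=1)
            if best_iou.item() >= iou_thr:
                tube.append(boxes[best_idx.item()])
    return tubelets
```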

For feature-level schemes, the features of the keyframe are enhanced by finding and aggregating similar features from other frames (called reference frames). In two-stage detectors, objects are explicitly represented by region proposals extracted from the backbone feature map via a Region Proposal Network (RPN); thanks to this property, two-stage detectors can be easily migrated to the video object detection problem. Therefore, most video object detectors are built on two-stage detectors.

However, such two-stage video object detectors are slowed down further by the cost of seeking relations among a large number of proposals, making it difficult to meet the needs of real-time scenarios. In contrast, a one-stage detector represents objects implicitly through the elements of its feature maps. Despite having no explicit object representation, these feature-map elements can still benefit from aggregating temporal information for the VID task.

Driven by these considerations, a question naturally arises: can such a region-level design be adapted to single-stage detectors, which only have pixel-level features, to build a practical (accurate and fast) video object detector?

This paper answers the above question by devising a simple yet effective strategy for aggregating the features generated by single-stage detectors.

3. Main Contributions

  1. A feature similarity measurement module is proposed to build an affinity matrix, which is then used to guide aggregation.

  2. To further alleviate the limitation of cosine similarity, an average pooling operator over the reference features is customized.

  3. YOLOV achieves 85.5% AP50 on the ImageNet VID dataset at over 40 FPS on a single 2080Ti GPU. By further introducing post-processing, its accuracy reaches 87.5% AP50 at more than 30 FPS.

4. Method

The method takes the characteristics of video (various degradations but rich temporal information) into account instead of processing frames individually. How to find supporting information for the target frame (keyframe) from other frames plays a key role in improving video detection accuracy. Most existing methods are based on two-stage techniques.

As mentioned before, **their main disadvantage is a relatively slow inference speed compared with one-stage detectors.** To alleviate this limitation, the authors place region/feature selection after the prediction head of a single-stage detector. The framework is shown in Figure 3.

Figure 3: The framework proposed in this paper. Built on the YOLOX detector, the resulting model is called YOLOV. A few frames are randomly sampled from a video and fed into the base detector to extract features.

The traditional two-stage pipeline first "selects" a large number of candidate regions as proposals, and then determines whether each proposal contains an object and which category it belongs to. The computational bottleneck mainly comes from processing a large number of low-confidence region candidates.

As shown in Figure 3, the proposed pipeline also consists of two stages. The difference is that its first stage is prediction (which discards a large number of low-confidence regions), while the second stage can be regarded as region-level refinement (which utilizes other frames through aggregation).

Following this rationale, the design can simultaneously benefit from the efficiency of single-stage detectors and the accuracy gained from temporal aggregation. The proposed strategy can be generalized to many base detectors such as YOLOX, FCOS and PPYOLOE.

FSM: Feature Selection Module
Since most predictions are of low confidence, the detection head of a single-stage detector is a natural and reasonable place to select (high-quality) candidates from the feature maps. The top-k (e.g., 750) predictions are first selected according to their confidence scores. Then, non-maximum suppression (NMS) keeps a fixed number a of predictions (e.g., a = 30) to reduce redundancy. To obtain features useful for video object classification, the accuracy of the base detector should be guaranteed accordingly.
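This selection step can be sketched in a few lines of PyTorch. The function below is a minimal illustration under assumed tensor layouts and an assumed IoU threshold, not the authors' exact code:

```python
import torch
from torchvision.ops import nms

def feature_selection(boxes, scores, feats, k=750, a=30, iou_thr=0.75):
    """FSM-style selection for one frame: keep the top-k predictions by
    confidence, then apply NMS and retain at most `a` candidates.

    boxes:  (N, 4) predicted boxes
    scores: (N,)   confidence scores
    feats:  (N, C) per-prediction features to be aggregated later
    """
    k = min(k, scores.numel())
    top_scores, top_idx = scores.topk(k)                 # coarse confidence filter
    keep = nms(boxes[top_idx], top_scores, iou_thr)[:a]  # redundancy removal
    sel = top_idx[keep]
    return boxes[sel], scores[sel], feats[sel]
```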

In practice, the authors found that directly aggregating the selected features from the classification branch and backpropagating the classification loss through the aggregated features leads to unstable training.

To address the above issue, the authors insert two 3×3 convolutional (Conv) layers into the model neck as a new branch, called the video object classification branch, which generates the features used for aggregation. The position-related features from the video classification branch and the regression branch are then fed into the feature aggregation module. A sketch of such a branch follows.
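A minimal sketch of this decoupled branch is shown below; the channel width and the BatchNorm/SiLU pattern are assumptions in YOLOX style rather than the exact architecture:

```python
import torch.nn as nn

class VideoClsBranch(nn.Module):
    """Two 3x3 conv layers appended to the neck output, producing the
    features used for aggregation (a sketch, not the exact design)."""
    def __init__(self, channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, neck_feat):
        # neck_feat: (B, C, H, W) feature map from the detector's neck
        return self.convs(neck_feat)
```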

FAM: Feature Aggregation Module
When the keyframe exhibits certain degradations, the reference features selected purely by similarity are likely to suffer from the same problem. This phenomenon is called the homogeneity problem.

To overcome this problem, the prediction confidences Pi from the base detector are further taken into account; each column of Pi contains two scores, namely the classification score and the IoU score from the classification and regression heads, respectively. Collecting the scores of all f frames yields a matrix P = [P1, P2, …, Pf] of size 2 × fa. Then, query, key and value matrices are constructed and fed into multi-head attention, and the scaled dot products yield the corresponding attention weights Ac and Ar for the two branches.

To fit these scores into the attention weights, the authors construct two matrices, Sc and Sr, from P. The self-attention results for the classification and regression branches are then obtained from the confidence-modulated attention (equation (3) in the paper).


Vc is then concatenated with the output of equation (3) to better preserve the initial representations.

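To make this concrete, below is a minimal single-head PyTorch sketch of one branch. Modulating the attention logits element-wise with broadcast confidence scores is an assumed simplification of the paper's Sc/Sr construction, and the final concatenation mirrors the step just described:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceWeightedAttention(nn.Module):
    """Sketch of the FAM idea for one branch: scaled dot-product
    self-attention over the candidates of all frames, with the attention
    logits modulated element-wise by per-candidate confidences."""
    def __init__(self, channels, dim=128):
        super().__init__()
        self.q = nn.Linear(channels, dim)
        self.k = nn.Linear(channels, dim)
        self.scale = dim ** -0.5

    def forward(self, feats, scores):
        # feats:  (N, C) candidate features from f frames (N = f * a)
        # scores: (N,)   classification or IoU confidences in [0, 1]
        logits = self.q(feats) @ self.k(feats).t() * self.scale  # (N, N)
        s = scores.unsqueeze(0).expand_as(logits)   # broadcast score matrix
        attn = F.softmax(logits * s, dim=-1)        # confidence-modulated
        out = attn @ feats                          # aggregated features
        # concatenate with the raw features to preserve the
        # initial representations, as described above
        return torch.cat([feats, out], dim=1)
```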

Furthermore, due to the nature of softmax, features with low weights tend to be ignored, which limits the diversity of the reference features that can be used subsequently.

To avoid this risk, the authors introduce an average pooling (AP) over the reference features. All references with a similarity score above a threshold τ are selected, and average pooling is applied to them. In this way, more information from related features can be retained. The average-pooled features and the key features are then fed into a linear projection layer for the final classification. The process is shown in Figure 4 and sketched in code below.
Figure 4: The average pooling process over reference features.
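A minimal sketch of this pooling step, assuming cosine similarity and an illustrative threshold value:

```python
import torch
import torch.nn.functional as F

def pool_references(key_feat, ref_feats, tau=0.75):
    """Average-pool all reference features whose cosine similarity to the
    key candidate exceeds tau (a sketch with assumed details).

    key_feat:  (C,)   feature of one key-frame candidate
    ref_feats: (M, C) candidate features from the reference frames
    """
    key = F.normalize(key_feat, dim=0)
    refs = F.normalize(ref_feats, dim=1)
    sim = refs @ key                      # (M,) cosine similarities
    mask = sim > tau
    pooled = ref_feats[mask].mean(dim=0) if mask.any() else key_feat
    # the pooled and key features would then pass through a linear
    # projection for the final classification
    return torch.cat([key_feat, pooled], dim=0)
```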

5. Experiments

To observe the effect of different sampling strategies, the number of reference frames is varied in both global and local modes. The numerical results are shown in Table 1, and a toy sketch of the two modes follows the table.
Table 1: Effect of the numbers of global (fg) and local (fl) reference frames.
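The two sampling modes can be illustrated as follows; the exact semantics of fg and fl in the paper may differ, so this is only a sketch:

```python
import random

def sample_references(num_frames, key_idx, fg=0, fl=0):
    """Sketch of reference-frame sampling: `fg` frames drawn uniformly
    from the whole video (global mode) and `fl` consecutive neighbours
    around the key frame (local mode)."""
    global_refs = random.sample(range(num_frames), min(fg, num_frames))
    lo = max(0, key_idx - fl // 2)
    hi = min(num_frames, lo + fl + 1)
    local_refs = [i for i in range(lo, hi) if i != key_idx]
    return global_refs, local_refs
```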

The number of most-confident proposals retained per frame, a, in the FSM is varied from 10 to 100 to see its impact on performance. As shown in Table 2, accuracy keeps improving as a increases and becomes stable once a reaches 75.

Table 2: Effect of the number of retained proposals a in the FSM.

To verify the effectiveness of the affinity manner (AM) and the average pooling over reference features (AP), performance with and without these modules is evaluated. The results in Table 4 show that both designs help feature aggregation capture better semantic representations from one-stage detectors. Compared with YOLOX-S (69.5% AP50), YOLOV-S with only AM improves accuracy by 7.4%.
Table 4: Ablation of the affinity manner (AM) and reference feature average pooling (AP).

Table 5: Effectiveness of the proposed strategy compared with the base detectors; it gives a detailed comparison between YOLOX and YOLOV.

Figure 5: Visual comparison of the reference proposals selected for a given key proposal by three different methods.

Table 6: Accuracy of detecting objects moving at different speeds. As shown in Table 6, the effectiveness of the model is verified for every motion category, and the improvement grows (brings a greater advantage) as the motion speed increases.

Table 7: Time cost in offline mode with batch inference. Post-processing is tested on an i7-8700K CPU.

6. Conclusion

The paper proposes a practical video object detector that takes both detection accuracy and inference efficiency into consideration. To improve detection accuracy, a feature aggregation module is designed to efficiently aggregate temporal information.

To save computational resources, the authors place region selection after the (coarse) prediction, unlike existing two-stage detectors. This small change leads to a dramatic increase in detector efficiency.

Source: blog.csdn.net/qq_53250079/article/details/127409875