The difference between video object detection and image object detection

I. Introduction

This article collects answers from several experienced researchers on how video object detection differs from image object detection. It covers the difference between the two tasks, recent research progress, and the main research ideas and methods in video object detection.

Authors: Naiyan Wang, Zha Zha, Yi Chen
https://www.zhihu.com/question/52185576/answer/155679253

Editor: CV Technical Guide

Disclaimer: For academic sharing only; please contact us for removal in case of infringement.

This article is reproduced from the CV Technical Guide


Author: Naiyan Wang https://www.zhihu.com/question/52185576/answer/155679253

Let me take the time to answer this question briefly, since it happens to be a direction we care about.

To put it simply, video detection has more Temporal Context information than single-image detection, and different methods use this context to solve different problems. One line of work focuses on using it to speed up video detection: since adjacent frames are highly redundant, it has real practical value if cheap methods can accelerate detection without sacrificing performance. Another line of work uses the same information to alleviate the difficulties of single-frame detection, such as motion blur and small object size, and thus improve performance. Ideally, of course, a method would be both fast and accurate.

There are also some very simple baseline methods, such as directly associating detections with tracking. Such methods do not go deep into the model itself and are usually limited to post-processing; although they bring some improvement, I personally do not find them very elegant. I pay more attention to the work from the following two groups.

  1. CUHK: I know of three papers from Xiaogang Wang's group. The earliest (a TPAMI short) post-processes the output of a single-frame detector using motion information and correlations between object classes, giving a small improvement over the baseline. Building on this, a follow-up paper (CVPR16) introduced a Temporal CNN to rescore each Tubelet, re-evaluating the confidence of each proposal using temporal information. The most recent work (CVPR17) also brings temporal information into the proposal-generation step, instead of generating proposals from static images alone, and classifies each Tubelet with the now-popular LSTM.

  2. MSRA: Jifeng Dai's work here is comparatively cleaner and clearer in its thinking, and I personally prefer it. The two papers from this group share the same core idea but target the two goals mentioned above: acceleration and performance improvement. The core is to quickly compute Optical Flow to capture motion in the video, and then use the flow to warp a previous feature map via Bilinear Sampling (that is, to predict the current frame's feature map from optical flow). With this, if we want speed, we can output results directly from the predicted feature map; if we want accuracy, we can combine the predicted feature map with the feature map actually computed on the current frame. It is worth mentioning that the latter is currently the only end-to-end video detection method. (A sketch of this warping step follows below.)
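
To make the warping step concrete, here is a minimal sketch in PyTorch. The tensor shapes, the flow direction convention (current frame back to key frame), and the helper name `warp_features` are my own illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of flow-guided feature warping, in the spirit of the MSRA
# work. Shapes and the flow convention are illustrative assumptions.
import torch
import torch.nn.functional as F

def warp_features(key_feat, flow):
    """Warp a key-frame feature map to the current frame via bilinear sampling.

    key_feat: (N, C, H, W) feature map computed on the key frame.
    flow:     (N, 2, H, W) flow from the current frame back to the key frame,
              in feature-map pixel units.
    """
    n, _, h, w = key_feat.shape
    # Base sampling grid: the (x, y) coordinates of every feature-map location.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(key_feat.device)  # (2, H, W)
    # Where each current-frame location should read from in the key frame.
    coords = base.unsqueeze(0) + flow                                # (N, 2, H, W)
    # grid_sample expects coordinates normalized to [-1, 1], shaped (N, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                 # (N, H, W, 2)
    return F.grid_sample(key_feat, grid, mode="bilinear", align_corners=True)
```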

In addition, there is some scattered work, mostly post-processing that rescores detections, such as Seq-NMS.

Finally, let me throw out a problem we observed in video detection to spark some discussion. We also wrote a paper about it ([1611.06467] On The Stability of Video Detection and Tracking): the stability issue in video detection. See the video below: the two detectors hardly differ in accuracy, yet to the human eye it is obvious which one is better. Video link:
(video omitted)
This stability problem causes real trouble in practical applications. In autonomous driving, for example, stable 2D detection boxes are needed to estimate vehicle distance and velocity, and unstable detections severely degrade the accuracy of those downstream tasks. So, in the paper, we first propose a quantitative metric for this stability and then evaluate several simple baselines with it. We also compute the correlation between this stability metric and the commonly used accuracy metric, and find that the two are only weakly correlated; in other words, they capture two different aspects of quality in video detection. I hope this work offers some inspiration: besides improving accuracy, we should also consider how to improve the equally important stability. A simplified illustration of such a stability measure follows.
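
To give a feel for the kind of quantity such a metric captures, here is a simplified, illustrative proxy in Python. It is not the exact metric defined in the paper; it only measures frame-to-frame jitter of box centers and scales along one track.

```python
# A simplified, illustrative proxy for detection stability. This is NOT the
# exact metric from arXiv:1611.06467; it is only a sketch of the idea.
import numpy as np

def box_jitter(boxes):
    """boxes: (T, 4) array of [x1, y1, x2, y2] for one object over T frames."""
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    # Normalize center displacement by box size so the score is scale-invariant.
    dcx = np.diff(cx) / w[:-1]
    dcy = np.diff(cy) / h[:-1]
    # Relative frame-to-frame change in scale.
    dw = np.diff(w) / w[:-1]
    dh = np.diff(h) / h[:-1]
    center_jitter = np.std(dcx) + np.std(dcy)
    scale_jitter = np.std(dw) + np.std(dh)
    return center_jitter + scale_jitter  # lower means more stable
```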

In summary, video detection is a very good topic, both practically and from the perspective of academic research. With the continued work of RBG and Kaiming, the room for improvement in still-image detection is shrinking. Rather than fighting over another 0.x points of mAP on still images, it may be better to step back and dig out some new settings, where the prospects are brighter.

Author: Zha Zha https://www.zhihu.com/question/52185576/answer/298921652

Naiyan Wang's answer is excellent and points out the core difference: in video-based object detection, temporal context can be used both to eliminate the redundancy between frames when the frame rate is high, and to supplement the insufficient information of a single frame, enabling faster and better detection. It also introduces the two most fashionable and elegant video detection algorithms; I benefited a lot from it.
Here I want to describe the mechanisms of the two and their differences from my own perspective. I spent the past two years on video-based object detection and tracking, so the methods I describe may look old-fashioned next to today's Long Short-Term Memory (LSTM) approaches, but since the asker is probably a beginner, reviewing these earlier classics is still meaningful and can serve as background.

Research problem

Whether based on video or images, the core research problem is object detection: recognizing the targets in an image (or in the frames of a video) and localizing them.

Object detection based on a single-frame image

Object detection on a static image is essentially a sliding-window + classification process: the former locates local regions where a target may exist, and the latter scores each candidate region with a classifier to decide whether it contains the target we are looking for. Research mostly focuses on the latter: which feature representation to use to describe the candidate region (HOG, C-SIFT, Haar, LBP, Deformable Part Models (DPM), etc.), and which classifier (SVM, AdaBoost, etc.) to feed these features into for scoring. A minimal sketch of this classic pipeline follows.
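
As a concrete illustration, here is a minimal sliding-window detector using HOG features and a linear SVM. The window size, stride, score threshold, and the pre-trained `svm` object are all illustrative assumptions.

```python
# A minimal sliding-window + HOG + linear-SVM detector sketch, assuming a
# classifier `svm` already trained on 64x128 HOG crops (e.g. an
# sklearn.svm.LinearSVC). All parameters here are illustrative.
from skimage.feature import hog

def detect(image_gray, svm, win=(128, 64), stride=16, thresh=1.0):
    """Slide a window over the image and score each crop with the classifier."""
    detections = []
    H, W = image_gray.shape
    for y in range(0, H - win[0] + 1, stride):
        for x in range(0, W - win[1] + 1, stride):
            crop = image_gray[y:y + win[0], x:x + win[1]]
            feat = hog(crop, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2))
            score = svm.decision_function(feat.reshape(1, -1))[0]
            if score > thresh:  # classifier says "this looks like the target"
                detections.append((x, y, x + win[1], y + win[0], score))
    return detections  # in practice: run over an image pyramid, then apply NMS
```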

Although the target may appear in many forms (due to intra-class variety, deformation, illumination, viewing angle, etc.), feature representations learned by training a CNN on large amounts of data can still support recognition and judgment very well. In some extreme cases, however, such as when the target is very small, too similar to the background, or badly distorted by blur or other causes in the given frame, even a CNN becomes powerless and fails to recognize it as the target we are looking for. Another failure case is when the scene contains distractors that resemble the target (say, airplanes versus large birds with spread wings), which can also cause misjudgments.

That is, in these cases, the appearance information of a single frame may not be enough for robust detection of the target.

Video-Based Object Detection

A single frame is not enough, so we need multiple frames. In video, targets usually exhibit motion, which arises from deformation of the target itself, the target's own movement, and camera motion. With multiple frames we obtain not only the target's appearance in many frames but also its motion between frames. This leads to several types of methods.

Type 1: motion information of the target

First, separate foreground from background based on motion segmentation or background modeling (optical flow, Gaussian mixture models, etc.); that is, use motion information to pick out regions that are likely to contain targets. Then exploit the target's persistence across consecutive frames (consistency of size, color, and trajectory) to discard unqualified candidate regions. Finally, score the remaining regions using the appearance information discussed in the single-frame case. A sketch of the candidate-extraction stage is given below.
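
Here is a minimal sketch of that stage using OpenCV's Gaussian-mixture background subtractor; the video path and the area threshold are placeholders.

```python
# A minimal sketch of Type 1 candidate extraction: a Gaussian-mixture
# background model (OpenCV's MOG2) followed by simple size filtering.
import cv2

cap = cv2.VideoCapture("video.mp4")  # placeholder path
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                       # foreground mask (shadows = 127)
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    mask = cv2.medianBlur(mask, 5)               # suppress isolated noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    candidates = [cv2.boundingRect(c) for c in contours
                  if cv2.contourArea(c) > 200]   # drop tiny blobs
    # `candidates` are (x, y, w, h) regions to pass to an appearance classifier.
cap.release()
```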

Type 2: combining dynamic and static information, i.e., adding the target's appearance deformation on top of Type 1

Some targets in video exhibit large, regular deformations, such as pedestrians and birds. We can learn these deformation patterns to summarize the target's characteristic motion and behavior, and then check whether a detected target matches such behavior. Common behavior features include 3D descriptors, Markov-based shape dynamics, and pose/primitive-action-based histograms. This way of combining the target's static and dynamic information to verify a specific target leans somewhat toward action classification.

Type 3: frequency-domain features

Besides the target's spatial and temporal information, its frequency-domain information can also play a large role in video-based detection. In bird-species detection, for example, species can be distinguished by analyzing the wing-flapping frequency, as in the toy sketch below.
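
As a toy illustration, the dominant frequency of a periodic signal extracted from the video, say the height of the detected bounding box over time, can be estimated with an FFT. The sampling rate and the signal here are assumptions.

```python
# A toy sketch of the frequency-domain idea: estimate a dominant periodic
# frequency (e.g. a wing-beat rate) from a 1-D per-frame signal.
import numpy as np

def dominant_frequency(signal, fps):
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()              # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[np.argmax(spectrum[1:]) + 1]    # skip the zero-frequency bin

# e.g. a ~5 Hz flapping signal sampled at 30 fps:
t = np.arange(90) / 30.0
print(dominant_frequency(np.sin(2 * np.pi * 5 * t), fps=30))  # ≈ 5.0
```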

It is worth noting that video-based detection comes in two flavors. One only asks whether such a target exists in the scene and, if so, roughly where it is; the other asks whether such a target exists and where it is in every frame. The methods presented here focus on the latter, more complex setting.

Deep learning is powerful and ubiquitous. I hope that visual feature modeling keeps developing and that computer vision as a field becomes more diverse, rather than being marginalized by machine learning.

Author: Yi Chen https://www.zhihu.com/question/52185576/answer/413306776

Having seen the answers from so many experts above, let me add some of my own understanding.
First, conceptually, video object detection aims to correctly recognize and localize the target in every frame of a video. So how does it differ from related fields such as image object detection and object tracking?

1. Difference from image object detection

Compared with still images, individual video frames often suffer from degraded appearance such as motion blur and defocus, while neighboring frames provide temporal context that can compensate for it.
(Figure omitted; from Flow-Guided Feature Aggregation for Video Object Detection.)

2. Difference from object tracking

Object tracking is usually divided into two types: single-object tracking and multi-object tracking. Like video object detection, it requires precise localization of the target in every frame; the difference is that tracking does not deal with the recognition problem.

3. Progress in Video Object Detection

  1. Methods combined with optical flow
    I have been following the work of Jifeng Dai at MSRA.

The starting point of this work is very simple. DFF (Deep Feature Flow) splits detection into two parts: a feature-extraction network Nfeat (ResNet-101) and a task network Ntask (R-FCN). Frames are divided into key frames and non-key frames. Nfeat extracts a feature map on each key frame; on non-key frames, a FlowNet estimates optical flow, and the non-key frame's feature map is obtained by bilinearly warping the key-frame features. The detection network then runs on the feature maps obtained either way.
The advantage of this work is that it exploits the redundancy of consecutive frames to save a large amount of computation, so detection is very fast. A schematic of the inference loop is sketched below.
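
The loop might look as follows; `backbone`, `flownet`, and `task_head` stand in for Nfeat, FlowNet, and Ntask, `warp_features` is the bilinear warp sketched earlier, and the fixed key-frame interval is an illustrative simplification.

```python
# A schematic of DFF-style inference, under the assumption of a fixed
# key-frame interval. All names are stand-ins, not the authors' code.
KEY_INTERVAL = 10  # illustrative; DFF treats this as a speed/accuracy trade-off

def dff_inference(frames, backbone, flownet, task_head):
    key_frame, key_feat = None, None
    for i, frame in enumerate(frames):
        if i % KEY_INTERVAL == 0:
            key_frame = frame
            key_feat = backbone(frame)            # expensive Nfeat, key frames only
            feat = key_feat
        else:
            flow = flownet(frame, key_frame)      # cheap flow: current -> key frame
            feat = warp_features(key_feat, flow)  # propagate features via warping
        yield task_head(feat)                     # Ntask (e.g. R-FCN detection head)
```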

The starting point of FGFA (Flow-Guided Feature Aggregation) is to improve feature quality, addressing motion blur and defocus in video; its distinctive approach is to better fuse information from preceding and following frames. Borrowing from attention models, it computes the cosine similarity at each spatial position between the current frame and its neighbors as an adaptive weight: the closer a warped feature map is to the current frame's, the larger its weight.

Since this work extracts features for every frame, the computation cost is high and detection is slow; the advantage is improved accuracy. The winning entry of the ImageNet VID task used both of the above methods. A sketch of the adaptive aggregation is given below.
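
The adaptive-weighting idea can be sketched as follows; the paper's embedding sub-networks and normalization details are simplified away, so this is only the gist.

```python
# A sketch of FGFA-style adaptive aggregation: warped neighbor features are
# weighted per spatial position by cosine similarity to the current frame's
# features, then summed. A simplification of the paper's scheme.
import torch
import torch.nn.functional as F

def aggregate(cur_feat, warped_feats):
    """cur_feat: (C, H, W); warped_feats: list of (C, H, W) warped neighbors."""
    weights = []
    for wf in warped_feats:
        # Cosine similarity at every spatial position -> one weight map (H, W).
        weights.append(F.cosine_similarity(cur_feat, wf, dim=0))
    # Softmax over the neighbor axis so weights at each position sum to 1.
    w = torch.softmax(torch.stack(weights), dim=0)        # (K, H, W)
    feats = torch.stack(warped_feats)                     # (K, C, H, W)
    return (w.unsqueeze(1) * feats).sum(dim=0)            # (C, H, W)
```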

  2. Methods combined with object tracking
    Link

  3. Methods combined with RNN
    Link: [1712.06317] Video Object Detection with an Aligned Spatial-Temporal Memory (arxiv.org)

Link: [1607.04648] Context Matters: Refining Object Detection in Video with Recurrent Neural Networks (arxiv.org)

  4. Other fusion methods
    Link: [1712.05896] Impression Network for Video Object Detection (arxiv.org)

  5. Non-end-to-end methods
    Link: [1604.02532v4] T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos (arxiv.org)

Link: [1602.08465v3] Seq-NMS for Video Object Detection (arxiv.org)

In summary, video object detection is not yet as hot a research area as image detection. Most research either exploits redundant information between frames to improve detection speed, or fuses context across consecutive frames to improve detection quality; there is little work that achieves both at once. (It may also be that I have not read enough papers; corrections are welcome.) For fusing context, 3D convolutions, RNNs, attention models, and other techniques common in action recognition are worth considering.


Source: blog.csdn.net/qq_53250079/article/details/127426768