[DeepVideo] Video Understanding Paper Series (Part 1): Paper Reading Notes

Source video on Bilibili (Video Understanding Paper Series (Part 1), paper reading): https://www.bilibili.com/video/BV1fL4y157yA/

0 Preface

In the video, video understanding is divided into four major directions:
1. Hand-crafted features --> CNN
2. Two-stream networks
3. 3D CNN
4. Video Transformer

1 DeepVideo

CVPR 2014

Paper PDF: Large-scale Video Classification with Convolutional Neural Networks

DeepVideo came after the emergence of AlexNet, in the deep learning era: it uses an ultra-large-scale dataset and a relatively deep convolutional neural network to do video understanding (DeepVideo is one of the earliest deep-learning works on video understanding).

The author team is from Google Research and Stanford University.

The team has made great contributions to the video field, for example the Sports-1M, YouTube-8M, and AVA (action detection) datasets, which have greatly promoted the development of the field.

The method in the DeepVideo paper is fairly straightforward. The core question is how to transfer the convolutional neural networks used for image recognition to video recognition. The difference between a video and an image is that a video has an extra time axis (a sequence of frames), so the paper tried the following variants:
First, the Single Frame method, which is essentially an image classification task: pick one frame from the video, pass it through a convolutional neural network, then two FC (fully connected) layers, and get the classification result. This is the baseline; it uses no temporal information and no real video information at all.
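A minimal sketch of this single-frame baseline (my own PyTorch illustration, not the paper's exact architecture; the 487 classes and 170x170 crop are assumptions based on the Sports-1M setup):

```python
import torch
import torch.nn as nn

class SingleFrameNet(nn.Module):
    """One frame -> small CNN backbone -> two FC layers -> class logits."""
    def __init__(self, num_classes=487):  # Sports-1M has 487 classes
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Sequential(
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, frame):              # frame: (B, 3, H, W)
        feat = self.backbone(frame).flatten(1)
        return self.fc(feat)

logits = SingleFrameNet()(torch.randn(2, 3, 170, 170))
print(logits.shape)                        # torch.Size([2, 487])
```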
Next the author tried Late Fusion, so called because the fusion happens at the network's output end. A few frames are sampled from the video, each frame passes through its own convolutional neural network, and the networks share weights; the features are then combined, passed through FC (fully connected) layers, and the output is produced. In this way there is a little temporal information.
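A minimal Late Fusion sketch (again my own illustration, assuming two sampled frames, one weight-shared backbone, and arbitrary feature sizes):

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Two frames share one CNN; features are concatenated before the FC head."""
    def __init__(self, num_classes=487):
        super().__init__()
        self.cnn = nn.Sequential(          # shared weights for both frames
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 2, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frame_a, frame_b):   # each: (B, 3, H, W)
        feat = torch.cat([self.cnn(frame_a), self.cnn(frame_b)], dim=1)
        return self.fc(feat)

net = LateFusionNet()
out = net(torch.randn(2, 3, 170, 170), torch.randn(2, 3, 170, 170))
print(out.shape)                           # torch.Size([2, 487])
```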
Then there is Early Fusion, where the author fuses at the input layer: the RGB channels of several frames are stacked directly along the channel dimension. A single image has 3 channels; stacking 5 frames gives 3x5 = 15 channels. This means the network structure has to change, specifically the first layer: the first convolutional layer now accepts 15 input channels instead of 3, while the rest of the network stays the same as before. In this way temporal changes can already be felt at the input layer, in the hope of learning some global motion or temporal information.
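A minimal Early Fusion sketch (my own illustration; only the first convolution differs from an ordinary image CNN):

```python
import torch
import torch.nn as nn

num_frames = 5
net = nn.Sequential(
    # first layer now takes 3*5 = 15 input channels instead of 3
    nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 487),                    # Sports-1M has 487 classes
)

clip = torch.randn(2, num_frames, 3, 170, 170)   # (B, T, C, H, W)
stacked = clip.flatten(1, 2)                     # (B, T*C, H, W) = (2, 15, 170, 170)
print(net(stacked).shape)                        # torch.Size([2, 487])
```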
Slow Fusion was proposed on top of Late Fusion and Early Fusion: Late Fusion fuses too late and Early Fusion fuses too early, so Slow Fusion does the fusion at the feature level.
Specifically, a small clip of 19 video frames is selected, and every 4 frames pass through a convolutional neural network; the initial layers also share weights. After the initial features are extracted, the 4 input segments are gradually merged into 2, further convolution operations learn deeper features, then the two features are fused, and finally the learned features are given to the FC layers.
In other words, the whole network learns from the clip as a whole. The author's final experiments show that Slow Fusion gives better results than the other three variants.
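A rough Slow Fusion sketch using 3D convolutions with small temporal extents (my own illustration; the clip length and filter sizes are chosen for simplicity and are not the paper's exact configuration):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    # stage 1: each filter sees 4 neighbouring frames (weights shared over time)
    nn.Conv3d(3, 32, kernel_size=(4, 7, 7), stride=(2, 2, 2), padding=(0, 3, 3)),
    nn.ReLU(),
    # stage 2: merge pairs of the resulting temporal segments
    nn.Conv3d(32, 64, kernel_size=(2, 3, 3), stride=(2, 1, 1), padding=(0, 1, 1)),
    nn.ReLU(),
    # stage 3: fuse what is left into a single feature and classify
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(64, 487),
)

clip = torch.randn(2, 3, 10, 170, 170)     # (B, C, T, H, W), a short clip
print(net(clip).shape)                     # torch.Size([2, 487])
```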
But overall there is not much difference between these methods. Even after pre-training on one million videos, when transferring to the small UCF101 dataset the results are still not as good as the earlier hand-crafted features.

This was very strange, so the author began to try other directions.

The author found that temporal features are simply very hard for a 2D convolutional neural network to learn.

The author tried a Multiresolution CNN: the input is split into two parts, one is the whole image at reduced resolution, and the other is a crop taken from the middle of the image at the original resolution. Whether in pictures or videos, the most useful objects tend to appear in the center of the frame. One stream is called the fovea stream and the other the context stream. The fovea is the central depression in the retina of the human eye, the area most sensitive to change; context refers to the overall information of the picture. Through these operations the author wants the network to learn both the most useful information and an overall understanding of the picture. This architecture can also be regarded as a two-stream structure, and the two streams in the Multiresolution CNN share weights. It can also be understood as an early use of attention: the author forces the network to focus on the central region of the picture.
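A minimal Multiresolution CNN sketch (my own illustration; I assume a 178x178 input so that both streams see 89x89 pixels, and a single weight-shared backbone as described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiresNet(nn.Module):
    """Context stream: downsampled full frame. Fovea stream: centre crop."""
    def __init__(self, num_classes=487):
        super().__init__()
        self.stream = nn.Sequential(       # shared between fovea and context
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64 * 2, num_classes)

    def forward(self, frame):              # frame: (B, 3, H, W), H and W even
        h, w = frame.shape[-2:]
        # context stream: whole frame at half resolution
        context = F.interpolate(frame, scale_factor=0.5,
                                mode="bilinear", align_corners=False)
        # fovea stream: centre crop of the same size at full resolution
        fovea = frame[:, :, h // 4: h // 4 + h // 2, w // 4: w // 4 + w // 2]
        feat = torch.cat([self.stream(context), self.stream(fovea)], dim=1)
        return self.fc(feat)

print(MultiresNet()(torch.randn(2, 3, 178, 178)).shape)  # torch.Size([2, 487])
```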


Experiments on Sports-1M show that the multi-resolution network brings a certain improvement: for example, Single-Frame + Multires is somewhat better than the Single-Frame baseline, but the gain is small. Both Early Fusion and Late Fusion are actually worse than the Single-Frame baseline, and even after a series of complex operations Slow Fusion is only a little higher than it.

On UCF101, the widely used benchmark in 2014, the author's transfer-learning variant (fine-tune top 3 layers) reaches only 65.4% accuracy, while the best hand-crafted method at the time could already reach about 87%. Such a huge gap makes people wonder: deep networks are so effective for image classification and detection, so why do they hit a wall on video classification?
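For reference, "fine-tune the top layers" style transfer learning looks roughly like this (my own illustration using torchvision's ResNet-18 as a stand-in backbone, not the paper's network):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():            # freeze the whole pretrained backbone
    p.requires_grad = False
for p in model.layer4.parameters():     # unfreeze only the top block ...
    p.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, 101)   # ... plus a new UCF101 head
```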

In fact, the significance of this paper does not lie in its results. The authors not only proposed the largest video-understanding dataset of the time, but also tried all the most direct approaches one could think of, laying a lot of groundwork for follow-up work. This drove the rapid development of deep learning in the video field, so that in 2018 and 2019 video understanding (or action recognition) became one of the top five or six keywords in the CV field.

Original post: https://blog.csdn.net/WhiffeYF/article/details/127420497