[Paper Reading] Intensive Reading of Video Understanding Papers

Video Understanding Paper Notes (Part 1) [Intensive Paper Reading]

1. Large-scale Video Classification with Convolutional Neural Networks

Summary

We investigate a variety of ways to extend the connectivity of CNNs in the temporal domain to exploit local spatio-temporal information, and suggest a multi-resolution, foveated architecture as a promising way to speed up training.

1 Introduction

Encouraged by positive results in the image field, we investigate the performance of CNNs in large-scale video classification, where the network can obtain not only appearance information in a single static image, but also its complex temporal evolution. In this context, there are several challenges in scaling and applying CNNs.

  • There are currently no video classification benchmarks that match the size and variety of existing image datasets, so in order to obtain enough data to train our CNN architectures, we collected a new Sports-1M dataset to support future work in this field.
  • From a modeling perspective, we investigate these issues empirically by evaluating multiple CNN architectures that all take different approaches to incorporate information across the temporal domain.
  • From a computational point of view, an effective way to speed up CNNs is to modify the architecture to contain two separate processing streams: a context stream that learns features on low-resolution frames, and a fovea stream that learns features only on a high-resolution crop of the center of the frame. Thanks to the reduced dimensionality of the input, we observe a 2-4x speed-up in the runtime performance of the network while preserving classification accuracy.
  • We empirically investigate the problem of transfer learning, achieving significantly better performance on UCF-101 by repurposing low-level features learned on the Sports-1M dataset.

2. Related work

A standard approach to video classification consists of three main stages. First, local visual features describing video regions are extracted either densely or over a sparse set of interest points. Next, these features are combined into a fixed-size video-level description. Finally, a classifier (such as an SVM) is trained on this description to distinguish among the visual categories of interest.
Compared with the image domain, relatively little work has applied CNNs to video classification. We speculate that this is partly due to the lack of large-scale video classification benchmarks, since all successful applications of CNNs in the image domain rely on large training sets. Our model is trained end-to-end with full supervision.

3. Model

3.1 Fusion of time information

We treat each video as a bag of short, fixed-size clips. Since each clip temporally consists of several consecutive frames, we can extend the connectivity of the network in the temporal dimension to learn spatio-temporal features. There are several options for the precise details of extended connectivity, and we describe below three broad classes of connectivity patterns (early fusion, late fusion, and slow fusion).
Figure: Red, green, and blue boxes denote convolutional, normalization, and pooling layers, respectively. In the slow-fusion model, the depicted columns share parameters.

  • Single Frame. We use a single-frame baseline architecture to understand the contribution of static appearance to classification accuracy. This network is essentially a standard image-classification CNN whose last layer is connected to a softmax classifier.
  • Late Fusion. Two separate single-frame towers are applied to frames a fixed distance apart, and their outputs are combined in the first fully connected layer. Neither single-frame tower can detect any motion on its own, but the first fully connected layer can compute global motion features by comparing the outputs of the two towers.
  • Early Fusion. Information is combined across an entire time window immediately, at the pixel level. This is achieved by modifying the filters of the first convolutional layer of the single-frame model, extending them to a size of 11 × 11 × 3 × T pixels, where T is the temporal extent. This early, direct connection to the pixel data allows the network to precisely detect local motion direction and speed.
  • Slow Fusion. The slow-fusion model is a balanced mix of the two approaches: temporal information is fused gradually throughout the network, so that higher layers have access to progressively more global information in both the spatial and temporal dimensions. A sketch of how the first layer differs across these variants follows below.
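
A minimal PyTorch sketch of how the first layer differs across these connectivity patterns. The 11 × 11 filter size follows the text above; the clip length, strides, and channel counts are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

T = 10  # temporal extent of a clip (illustrative)

# Single Frame: an ordinary 2D convolution over one RGB frame.
single_frame_conv1 = nn.Conv2d(in_channels=3, out_channels=96,
                               kernel_size=11, stride=3)

# Early Fusion: first-layer filters of size 11 x 11 x 3 x T, implemented here
# by stacking the T frames along the channel axis.
early_fusion_conv1 = nn.Conv2d(in_channels=3 * T, out_channels=96,
                               kernel_size=11, stride=3)

# Slow Fusion: 3D convolutions with a small temporal extent (4 frames here),
# so temporal information is merged gradually over several layers.
slow_fusion_conv1 = nn.Conv3d(in_channels=3, out_channels=96,
                              kernel_size=(4, 11, 11), stride=(1, 3, 3))

clip = torch.randn(1, 3, T, 170, 170)  # (batch, channels, time, H, W)
frames_as_channels = clip.permute(0, 2, 1, 3, 4).reshape(1, 3 * T, 170, 170)

print(single_frame_conv1(clip[:, :, 0]).shape)    # uses one frame only
print(early_fusion_conv1(frames_as_channels).shape)
print(slow_fusion_conv1(clip).shape)
```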

3.2 Multi-resolution CNN

Figure: Input frames are fed into two separate processing streams: a context stream that models low-resolution frames, and a fovea stream that processes a high-resolution center crop (since the objects of interest tend to occupy the central region). Both streams consist of alternating convolutional (red), normalization (green), and pooling (blue) layers, and converge into two fully connected layers (yellow).
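
A rough sketch of the input side of this design. The 178 × 178 frame size and 89 × 89 stream inputs are assumptions for illustration; only the preprocessing is shown, and the two stream networks themselves are omitted.

```python
import torch
import torch.nn.functional as F

def split_streams(frames):
    """Split frames (B, C, H, W) into context and fovea inputs.

    Context stream: the whole frame downsampled to half resolution.
    Fovea stream:   the central crop (half the height/width) at full resolution.
    """
    b, c, h, w = frames.shape
    context = F.interpolate(frames, scale_factor=0.5, mode="bilinear",
                            align_corners=False)
    top, left = (h - h // 2) // 2, (w - w // 2) // 2
    fovea = frames[:, :, top:top + h // 2, left:left + w // 2]  # center crop
    return context, fovea

context, fovea = split_streams(torch.randn(2, 3, 178, 178))
print(context.shape, fovea.shape)  # both roughly (2, 3, 89, 89)
```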

4. Experiment details, training results


5. Conclusions and future work

  • Multi-resolution and slow fusion can improve the performance of the network.
  • We also identify a mixed-resolution architecture, consisting of a low-resolution context stream and a high-resolution fovea stream, as an effective way to speed up CNNs without sacrificing accuracy.
  • Our transfer learning experiments on UCF-101 show that the learned features are general and generalize to other video classification tasks.

In future work, we hope to include a wider variety of categories in the dataset to obtain more powerful and general features, to investigate methods that explicitly reason about camera motion, and to explore recurrent neural networks as a more powerful technique for combining clip-level predictions into global video-level predictions.


2. Two-Stream Convolutional Networks for Action Recognition in Videos

Summary

We study architectures of discriminatively trained deep convolutional networks (ConvNets) for action recognition in video. The challenge is to capture complementary information about appearance from still frames and about motion between frames.
Our contribution is threefold. First, we propose a two-stream ConvNet architecture that incorporates both spatial and temporal networks. Second, we demonstrate that ConvNets trained on multi-frame dense optical flow can achieve very good performance despite limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance of both.

1 Introduction

Compared to static image classification, the temporal component of videos provides an additional (and important) cue for recognition, since many actions can be reliably identified from motion information. Additionally, video provides natural data augmentation (jittering) for single-image (video frame) classification.
We investigate an architecture based on two separate recognition streams (spatial and temporal), which are then combined by late fusion. The spatial stream is trained to recognize actions from still video frames, while the temporal stream is trained to recognize actions from motion in the form of dense optical flow.

2. Dual-stream architecture for video recognition

Videos can be naturally decomposed into spatial and temporal components. The spatial part, in the form of individual frames, carries information about the scene and objects described in the video. The temporal part, in the form of motion across frames, expresses the motion of the observer (camera) and objects.

3. Optical flow convolutional network

This section describes the ConvNet model that constitutes the temporal recognition stream of our architecture. Its input is formed by stacking the optical flow displacement fields between several consecutive frames; such input explicitly describes the motion between video frames, which makes recognition easier.

3.1 ConvNet input configurations

Optical flow stacking:
Dense optical flow can be viewed as a set of displacement vector fields d_t between pairs of consecutive frames t and t+1. The horizontal and vertical components of the vector field, d_t^x and d_t^y, can be viewed as image channels, which are well suited to recognition with a convolutional network. To represent the motion across a sequence of frames, we stack the flow channels of L consecutive frames, forming a total of 2L input channels.
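
A hedged sketch of assembling the 2L-channel temporal-stream input from per-frame flow fields; the flow fields themselves are assumed to come from an external dense optical flow estimator.

```python
import torch

def stack_optical_flow(flows):
    """Stack L consecutive flow fields into a 2L-channel input.

    flows: tensor of shape (L, 2, H, W), where channels 0/1 are the
           horizontal (d^x) and vertical (d^y) displacement components.
    Returns a (2L, H, W) tensor, interleaved as
    [d^x_t, d^y_t, d^x_{t+1}, d^y_{t+1}, ...].
    """
    L, _, H, W = flows.shape
    return flows.reshape(2 * L, H, W)

L = 10
flows = torch.randn(L, 2, 224, 224)   # placeholder for real flow estimates
net_input = stack_optical_flow(flows)
print(net_input.shape)                # torch.Size([20, 224, 224])
```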
Trajectory stacking:
Figure: Left: optical flow stacking samples the displacement vector d at the same position in multiple frames. Right: trajectory stacking samples the vectors along a trajectory. The frames and their corresponding displacement vectors are shown in the same color.
Tricks for boosting accuracy (to be reflected in the code):

  • Bidirectional optical flow.
  • Mean flow subtraction. In general, zero-centering the network input is beneficial, as it allows the model to better exploit the rectification nonlinearities.

4. Experiment details, training results

Many tricks are used for training: cropping, flipping, RGB jittering, multi-GPU acceleration, and rescaling the optical flow maps to [0, 255] so they can be saved as JPEGs.
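
The rescaling of flow maps to [0, 255] for JPEG storage might look roughly like the sketch below. The ±20-pixel clipping bound is an assumption borrowed from common two-stream implementations, not something stated in the notes above.

```python
import numpy as np
from PIL import Image

def flow_to_jpeg(flow, path_x, path_y, bound=20.0):
    """Quantize a flow field (H, W, 2) to two 8-bit JPEGs.

    The +/-bound clipping value is an implementation choice; 20 px is a
    common default in two-stream codebases.
    """
    flow = np.clip(flow, -bound, bound)
    quantized = ((flow + bound) / (2 * bound) * 255.0).astype(np.uint8)
    Image.fromarray(quantized[..., 0]).save(path_x)  # horizontal component
    Image.fromarray(quantized[..., 1]).save(path_y)  # vertical component

flow = np.random.uniform(-30, 30, size=(224, 224, 2)).astype(np.float32)
flow_to_jpeg(flow, "flow_x.jpg", "flow_y.jpg")
```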

When using transfer learning, if the whole network is fine-tuned, the dropout rate can be set higher to prevent overfitting. If only the parameters of the last layer are updated, the dropout rate should be set lower, because only the last layer's parameters participate in learning.


5 Conclusion

(i) Temporal and spatial recognition streams are complementary, as their fusion significantly improves on both (6% over the temporal net and 14% over the spatial net);
(ii) SVM-based fusion of the softmax scores outperforms fusion by averaging;
(iii) using bidirectional optical flow brings no benefit in the case of ConvNet fusion;
(iv) the temporal ConvNet trained with multi-task learning performs best, both alone and when fused with the spatial net.


3. Beyond Short Snippets: Deep Networks for Video Classification

Summary

We propose two methods capable of handling full-length videos. The first approach explores various convolutional temporal feature pooling architectures, examining the various design choices that need to be made when tuning a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this, we employ a recurrent neural network that uses long short-term memory (LSTM) units connected to the output of the underlying CNN.

1 Introduction

We evaluate two approaches that can meet this requirement: feature pooling and recurrent neural networks. Feature pooling networks process each frame independently using a CNN and then combine frame-level information using various pooling layers. The recurrent architecture we employ is based on long short-term memory (LSTM) cells, which use memory cells to store, modify, and access internal state, enabling the network to discover long-range temporal relationships. Like feature pooling, the LSTM network operates on frame-level CNN activations and can learn how to integrate information over time. By sharing parameters through time, both architectures maintain a constant number of parameters while capturing a global description of the video's temporal evolution.
To learn a global description of a video while keeping the computation low, we propose to process only one frame per second. To compensate for the loss of implicit motion information, we incorporate explicit motion information in the form of optical flow images computed from adjacent frames. Optical flow allows us to preserve the benefits of motion information (often achieved through high frame rate sampling), while still capturing global video information.
Our contributions can be summarized in the following points:

  1. We propose a CNN architecture for obtaining global video-level descriptors and demonstrate that using an increasing number of frames can significantly improve classification performance.
  2. By sharing parameters through time, the number of parameters remains constant with respect to video length in both the feature pooling and LSTM architectures.
  3. We confirm that optical flow images can greatly benefit video classification.

2. Related work

Instead of trying to learn spatio-temporal features over small time periods, we considered several different approaches to aggregating powerful CNN image features over long periods of video (tens of seconds), including feature pooling and recurrent neural networks. Standard recurrent networks have difficulty learning long sequences due to the problem of vanishing and exploding gradients. In contrast, long short-term memory (LSTM) uses memory cells to store, modify, and access internal state, enabling it to better discover long-distance temporal relationships.

3. Model

3.1 Feature pooling structure

Figure: Different feature pooling architectures. Stacked convolutional layers are denoted by "C". The blue, green, yellow, and orange rectangles represent max-pooling, temporal convolutional, fully connected, and softmax layers, respectively.
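
A minimal sketch of the feature pooling idea: run a 2D CNN on each frame, max-pool the resulting features over time (an order-invariant operation), and classify the pooled descriptor. The backbone and layer sizes below are placeholders, not the paper's GoogLeNet/AlexNet configuration.

```python
import torch
import torch.nn as nn

class ConvPooling(nn.Module):
    """Frame-level CNN features -> max pool over time -> classifier."""
    def __init__(self, frame_cnn, feat_dim, num_classes):
        super().__init__()
        self.frame_cnn = frame_cnn          # any 2D CNN producing (B, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):              # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_cnn(frames.flatten(0, 1))    # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        pooled, _ = feats.max(dim=1)        # order-invariant temporal max pool
        return self.classifier(pooled)

# toy backbone standing in for the paper's image CNN
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = ConvPooling(backbone, feat_dim=8, num_classes=101)
print(model(torch.randn(2, 30, 3, 224, 224)).shape)     # (2, 101)
```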

3.2 LSTM architecture

In contrast to Max-pooling, which produces order-invariant representations, we propose to use a recurrent neural network that explicitly takes into account the sequence of CNN activations. Since videos contain dynamic content, frame-to-frame changes may encode additional information that can be helpful in making more accurate predictions.


Here the authors introduce the LSTM model; note that LSTMs are rarely used for this purpose nowadays.

Figure: A deep video LSTM takes the output of the last CNN layer at each consecutive video frame as input. The CNN outputs are processed forward through time and upward through a five-layer stack of LSTMs. A softmax layer predicts the class at each time step. The parameters of the convolutional network (pink) and the softmax classifier (orange) are shared across time steps.
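
A rough sketch of the LSTM alternative under the same placeholder assumptions: per-frame CNN features feed a stacked LSTM, and a shared classifier produces class scores at every time step.

```python
import torch
import torch.nn as nn

class VideoLSTM(nn.Module):
    def __init__(self, frame_cnn, feat_dim, num_classes,
                 hidden=512, layers=5):
        super().__init__()
        self.frame_cnn = frame_cnn
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            batch_first=True)    # five-layer stack, as in the figure
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames):                   # (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)                # (B, T, hidden)
        return self.classifier(out)              # class scores per time step

backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
logits = VideoLSTM(backbone, feat_dim=8, num_classes=101)(
    torch.randn(2, 16, 3, 224, 224))
print(logits.shape)                              # (2, 16, 101)
```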

4. Training Results


5 Conclusion

Unlike previous work that trains on videos of a few seconds, our network utilizes videos as long as two minutes (120 frames) to achieve the best classification performance. If speed is a requirement, our method can process the entire video in a single shot. Training is done by scaling smaller networks to progressively larger ones and fine-tuning them. The resulting network achieves state-of-the-art performance on both the Sports-1M and UCF-101 benchmarks, supporting the idea that learning should take place across entire videos rather than short clips.
We also show that using optical flow does not always help, especially if the videos have not been preprocessed, as is the case for the Sports-1M dataset. To take advantage of optical flow in this context, it is necessary to employ more sophisticated sequence processing architectures such as LSTMs. Furthermore, using LSTMs on both image frames and optical flow yields the highest published performance on the Sports-1M benchmark.


4. Convolutional Two-Stream Network Fusion for Video Action Recognition

Summary

We study several ways of fusing the two ConvNet streams over space and time, and find that:
(i) fusing the two networks at a convolutional layer rather than at the softmax layer saves parameters without loss of accuracy;
(ii) spatially, fusing at the last convolutional layer works better than fusing earlier, and additionally fusing at the class prediction layer can further improve accuracy;
(iii) pooling abstract convolutional features over spatio-temporal neighborhoods further improves performance.

1 Introduction

ConvNets have not been as effective for action recognition as for other vision tasks. Possible reasons are that the available datasets are small and noisy, and that convolutional networks focus on spatial information and cannot make full use of temporal information.

Figure: The two-stream architecture (and any previous method) cannot exploit two very important cues for action recognition in video: (i) recognizing what is moving where, i.e. registering appearance recognition (spatial cue) with optical flow recognition (temporal cue); and (ii) how these cues evolve over time.

2. Related work

C3D learns 3D convolutions over a limited temporal extent, using 3 × 3 × 3 kernels. Another approach is to decompose the 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution.
As of the time of writing (2016), two-stream networks were the most effective way of applying deep learning to action recognition.

3. Method

The authors build their architecture on the two-stream network, which has two main drawbacks:

  • Fusion happens only at the final (softmax) layer, so the network cannot learn pixel-wise correspondences between spatial and temporal features.
  • The memory on the temporal scale is limited because spatial convolutions only operate on a single frame, while temporal convolutions only operate on stacks of L temporally adjacent optical flow frames.

3.1 Spatial Fusion

The authors list a series of ways to fuse the two streams at a spatial layer: sum fusion, max fusion, concatenation fusion, conv fusion, and bilinear fusion.
In the experimental section, these candidate fusion methods are evaluated and compared in terms of classification accuracy.
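
Minimal sketches of the simpler fusion functions applied to two feature maps of identical shape; bilinear fusion is omitted, and all shapes and names here are illustrative.

```python
import torch
import torch.nn as nn

x_spatial  = torch.randn(1, 512, 14, 14)   # appearance-stream feature map
x_temporal = torch.randn(1, 512, 14, 14)   # motion-stream feature map

y_sum = x_spatial + x_temporal                       # Sum fusion
y_max = torch.maximum(x_spatial, x_temporal)         # Max fusion
y_cat = torch.cat([x_spatial, x_temporal], dim=1)    # Concatenation fusion

# Conv fusion: concatenate, then mix the two streams with a 1x1 convolution
# so the network can learn weighted correspondences between their channels.
conv_fuse = nn.Conv2d(1024, 512, kernel_size=1)
y_conv = conv_fuse(y_cat)

print(y_sum.shape, y_cat.shape, y_conv.shape)
```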

3.2 Where to fuse the networks

Figure: The example on the left shows fusion after the fourth convolutional layer; from the fusion point onward, only a single network tower is used. The example on the right shows fusion at two layers (after conv5 and after fc8), where both network towers are kept, one as a hybrid spatio-temporal network and the other as a purely spatial network.
Figure: Different ways of fusing temporal information. (a) 2D pooling ignores time and simply pools over spatial neighborhoods, shrinking the feature map of each temporal sample individually. (b) 3D pooling pools over local spatio-temporal neighborhoods, first stacking the feature maps across time and then shrinking this spatio-temporal cube. (c) 3D convolution + 3D pooling additionally convolves with a fusion kernel spanning the feature channels, space, and time before the 3D pooling.

3.3 Temporal Fusion

To combine the feature maps x_t over time t, two options are considered: 3D pooling, and 3D convolution followed by 3D pooling.

Figure: Short-term information is captured at a fine temporal scale, while temporally adjacent inputs are captured at a coarser temporal scale.
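
A hedged sketch of the 3D options described above: fused per-frame feature maps are stacked into a spatio-temporal cube and reduced by 3D pooling, optionally preceded by a 3D convolution. Shapes are illustrative.

```python
import torch
import torch.nn as nn

T = 5
fused_maps = [torch.randn(1, 512, 14, 14) for _ in range(T)]  # one per frame
cube = torch.stack(fused_maps, dim=2)        # (B, C, T, H, W)

# (b) 3D pooling over local spatio-temporal neighborhoods
pool3d = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(2, 2, 2))

# (c) 3D convolution (fusion kernel over channels, space, and time) + 3D pooling
conv3d = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=1)

print(pool3d(cube).shape)            # time and space both shrink
print(pool3d(conv3d(cube)).shape)
```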

4. Experiment and training results


4.1 Ways to fuse the two streams

For all fusion methods shown in the table, fusing at the FC layers leads to lower performance than fusing at ReLU5, and the ranking of methods is the same as in Table 1, except that bilinear fusion is not possible at the FC layers. Among the FC layers, FC8 performs better than FC7 and FC6, reaching 85.9% with Conv fusion, followed by Sum fusion at 85.1%. We think ReLU5 performs slightly better because at this layer the spatial correspondence between appearance and motion can still be fused, whereas it has already been collapsed at the FC layers.

4.2 Where to fuse

Performance is best when fusing at ReLU5, or at ReLU5+FC8 (but with almost twice as many parameters).

4.3 Accuracy comparison


5 Conclusion

(1) Compared with fusing at the final softmax layer, fusing at an intermediate convolutional layer improves performance without adding many parameters (see the fusion methods).
(2) Fusing at the last convolutional layer (ReLU5) performs best, and additionally fusing at the last fully connected layer (FC8) improves performance a little further (see the fusion positions).
(3) Using 3D pooling instead of 2D pooling after fusion further improves performance (see 3D conv and 3D pooling).


5. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Summary

The authors' main goal is to train ConvNets for action recognition efficiently with relatively few training samples. There are two main contributions:

  1. A temporal segment network (TSN) model is proposed:
    TSN's sampling is sparse and global, so it can model temporal dependencies between frames that are far apart and capture video-level information.
    TSN contains both a spatial stream and a temporal stream, and the results of the two models are fused with a late-fusion scheme.
  2. A series of best-practice schemes is proposed, such as data augmentation, regularization, and cross-modality pre-training, achieving very good results.

1 Introduction

In action recognition, there are two key and complementary aspects: appearance and dynamics. The performance of a recognition system largely depends on whether it can extract and utilize relevant information from it. However, extracting such information is non-trivial due to many complex factors, such as scale changes, viewpoint changes, and camera motions.
Mainstream ConvNet frameworks usually focus on appearance and short-term motion, thus lacking the ability to integrate long-term temporal structures.

The application of ConvNets to video-based action recognition is hampered by two major obstacles. First, the long-range temporal structure plays an important role in understanding the dynamics of action videos.
However, mainstream ConvNet frameworks usually focus on appearance and short-term motion, thus lacking the ability to incorporate long-range temporal structure. Second, in practice, training deep ConvNets requires a large number of training samples to achieve optimal performance, but the available datasets are limited in size, so convolutional networks face the risk of overfitting.
We propose Temporal Segment Networks (TSN), a framework that extracts short snippets over a long video sequence with a sparse sampling scheme in which the samples are uniformly distributed along the temporal dimension. On top of this, a segmental structure is used to aggregate information from the sampled snippets. In this sense, temporal segment networks can model the long-range temporal structure of the entire video. Moreover, this sparse sampling strategy preserves the relevant information at drastically lower cost, enabling end-to-end learning on long video sequences under a reasonable budget of time and computational resources.

Several good practices are explored to overcome the aforementioned difficulties caused by the limited number of training samples, including: 1) cross-modality pre-training; 2) regularization; 3) enhanced data augmentation.

The authors argue that consecutive frames are highly redundant when training a video classification model, so dense sampling is unnecessary, and they adopt a sparse sampling strategy instead.

2. Related work

There are two main approaches:

  • A two-stream structure: one model learns image-level (appearance) information, the other learns temporal information, and the results of the two models are fused at the end.
  • 3D convolution kernels that extract image-level and temporal information simultaneously, which has led to variants with different 3D kernel designs.

3. Model

3.1 Temporal Segment Network (TSN)

Specifically, our proposed Temporal Segment Network framework, which aims to exploit the visual information of the entire video for video-level prediction, also consists of spatial stream ConvNets and temporal stream ConvNets.
Instead of working on single frames or stacks of frames, temporal segment networks work on sequences of short segments that are sparsely sampled throughout the video.
Figure: An input video is split into K segments, and a short snippet is randomly sampled from each segment. The class scores of the different snippets are fused by a segmental consensus function to produce the segmental consensus, which is the video-level prediction. Predictions from all modalities are then fused to produce the final prediction. The ConvNets of all snippets share parameters.
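
A minimal sketch of the TSN scheme: sparse, global snippet sampling plus an average segmental consensus, with a toy snippet network standing in for BN-Inception.

```python
import random
import torch
import torch.nn as nn

def sample_snippet_indices(num_frames, k=3):
    """Split the video into k equal segments and pick one frame index at
    random from each (TSN's sparse, global sampling)."""
    seg_len = num_frames // k
    return [i * seg_len + random.randrange(seg_len) for i in range(k)]

class TSN(nn.Module):
    def __init__(self, snippet_net):
        super().__init__()
        self.snippet_net = snippet_net           # shared across all segments

    def forward(self, snippets):                 # (B, K, C, H, W)
        b, k = snippets.shape[:2]
        scores = self.snippet_net(snippets.flatten(0, 1)).view(b, k, -1)
        return scores.mean(dim=1)                # segmental consensus: average

snippet_net = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                            nn.Linear(8, 101))
idx = sample_snippet_indices(num_frames=300, k=3)
video_scores = TSN(snippet_net)(torch.randn(2, 3, 3, 224, 224))
print(idx, video_scores.shape)                   # e.g. [.., .., ..] (2, 101)
```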

3.2 Learning Temporal Segment Networks

  1. Model architecture: The previous two-stream models used relatively shallow network structures. The authors choose BN-Inception, a deeper architecture, as the building block because it offers a good balance between accuracy and efficiency.

  2. Model input: The previous two-stream model used RGB as the input of the spatial stream and optical flow as the input of the temporal stream. The authors study two additional modalities, RGB difference and warped optical flow.

Figure: Examples of the four input modalities: RGB image, RGB difference, optical flow fields (x and y directions), and warped optical flow fields (x and y directions).
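
The RGB difference modality is simply the element-wise difference between consecutive RGB frames; a minimal sketch:

```python
import torch

def rgb_difference(frames):
    """frames: (T, C, H, W) -> (T-1, C, H, W) stack of frame differences,
    a cheap approximation of motion used as an alternative temporal input."""
    return frames[1:] - frames[:-1]

print(rgb_difference(torch.randn(6, 3, 224, 224)).shape)  # (5, 3, 224, 224)
```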

  3. Model training: Since the number of video samples is small, training a deep ConvNet is prone to overfitting. To address this, the authors propose the following strategies:
  • Cross-modality pre-training: RGB networks can be pre-trained on ImageNet, but there is no comparable pre-training dataset for optical flow networks, so the authors initialize the optical flow network with parameters pre-trained on ImageNet.
  • Regularization: Batch Normalization is used to address covariate shift. During learning, BN estimates the mean and variance of the activations in each batch and uses them to normalize the activations towards a standard Gaussian distribution. This speeds up convergence but also increases the risk of overfitting. The authors therefore freeze the mean and variance parameters of all BN layers except the first one (partial BN), and add an extra dropout layer after the global pooling layer of BN-Inception (see the sketch after this list).
  4. Data augmentation: random cropping, horizontal flipping, corner cropping, and scale jittering.
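
A hedged sketch of the partial-BN and extra-dropout tricks. BN-Inception is not bundled with torchvision, so an ImageNet-pretrained ResNet stands in, and the dropout rate is illustrative.

```python
import torch.nn as nn
import torchvision.models as models

def partial_bn(model):
    """Freeze the running mean/variance of every BN layer except the first.

    Note: model.train() re-enables the statistics updates, so this needs to
    be re-applied (or train() overridden) at the start of each epoch.
    """
    bn_count = 0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            bn_count += 1
            if bn_count > 1:
                m.eval()                          # stop updating running stats
                if m.affine:                      # optionally freeze scale/shift
                    m.weight.requires_grad = False
                    m.bias.requires_grad = False
    return model

# Illustrative: an ImageNet-pretrained backbone with an extra dropout layer
# inserted before the classifier.
net = models.resnet50(weights="IMAGENET1K_V1")
net.fc = nn.Sequential(nn.Dropout(p=0.8), nn.Linear(net.fc.in_features, 101))
partial_bn(net)
```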

4. Training Results

(1) Experiments on cross-modality pre-training and partial BN with dropout.
(2) Experiments on the new input modalities: RGB difference and warped optical flow fields.

RGB difference and RGB features are complementary to a certain extent.
Combining RGB and optical flow features achieves very good results.

Among the candidate aggregation (segmental consensus) functions, average pooling performs best, so in the following experiments average pooling is chosen as the default aggregation function.

"BN-Inception+TSN" refers to applying the temporal segment network framework on top of the best-performing BN-Inception architecture.
Component analysis of the proposed method on the UCF101 dataset: from left to right, components are added one by one, with BN-Inception as the ConvNet architecture.

Comparison of temporal segment network (TSN) based methods with other state-of-the-art methods, using two input modalities (RGB+Flow) and three input modalities (RGB+Flow+Warped Flow).

5 Conclusion

The authors propose Temporal Segment Networks, a video-level framework for modeling long-range temporal structure and motion features. With its sparsely sampled segmental structure and various data augmentation and regularization strategies, the model achieves very good results on HMDB51 and UCF101.

Origin blog.csdn.net/weixin_45751396/article/details/127544882