Optical Flow Guided Feature Paper

Paper: https://arxiv.org/pdf/1711.11152.pdf

Paper code (Caffe):

kevin-ssy/Optical-Flow-Guided-Feature: implementation code of the paper Optical Flow Guided Feature, CVPR 2018 — https://github.com/kevin-ssy/Optical-Flow-Guided-Feature

Summary

Motion representation plays a crucial role in human action recognition in videos. In this study, the authors introduce a new compact motion representation for video action recognition called Optical Flow Guided Features (OFF), which enables the network to extract temporal information in a fast and robust way. OFF comes from the definition of optical flow and is orthogonal to optical flow. The derivation also provides theoretical support for using the difference between two frames.

By directly computing pixel-wise spatio-temporal gradients of deep feature maps, OFF can be embedded into any existing CNN-based video action recognition framework with only a slight additional cost. It enables the CNN to extract spatial and temporal information simultaneously, especially the temporal information between frames. Experimental results validate this simple yet powerful idea.

On UCF-101, a network fed with only RGB inputs achieves a competitive accuracy of 93.3%, which is comparable to the results obtained with two streams (RGB and optical flow), but is 15 times faster. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it achieves 96.0% on UCF-101 and 74.2% on HMDB-51.

Introduction

How can we design and use a motion representation that is both fast and robust? To this end, the required computation should be economical, and the representation should be sufficiently guided by motion information. With these requirements in mind, we propose the Optical Flow guided Feature (OFF), which is fast to compute and comprehensively represents the motion dynamics in a video clip.

In this paper, we define a new feature representation at the feature level, derived from the space orthogonal to the optical flow. This definition brings the guidance of optical flow into the representation, and we therefore call it the Optical Flow guided Feature (OFF).

The features include spatial gradients of feature maps in the horizontal and vertical directions, and temporal gradients obtained from the differences between feature maps of different frames. Since all operations in OFF are differentiable, when OFF is inserted into a CNN architecture, the entire process is end-to-end trainable.

In fact, the OFF unit consists only of pixel-wise operators on CNN features. These operators are fast to apply and enable networks with RGB inputs to capture both spatial and temporal information. An important component of OFF is the difference between features from different images/segments. As shown in the figure below, the difference between the features of two images provides representative motion information that can be conveniently exploited by CNNs. Negative values in the difference image indicate where body parts/objects disappear, while positive values indicate where they appear.

This pattern of disappearing at one location and appearing at another can easily be treated as a specific motion pattern and captured by later CNN layers. The temporal difference can be further combined with the spatial gradients so that, according to our derivation in a later section, the OFF is guided by the optical flow at the feature level.

Left column: input frames.

Middle two columns: standard deep features of the two frames before the OFF is applied.

Right column: the temporal difference in the OFF, where red and cyan denote positive and negative values respectively. The feature-level difference between two frames is effective and comprehensive in representing motion information.

Related Work

Traditional methods extract hand-crafted local visual features such as 3D HOG, motion boundary histograms (MBH), and improved dense trajectories (iDT), encode them into sparse or compact feature vectors, and then classify them. Later, deeply learned features were found to outperform hand-crafted features for action recognition. Two-stream frameworks, which use deep CNNs to learn from hand-crafted motion representations such as optical flow and iDT, were a major breakthrough in action recognition. These attempts made significant progress in recognition accuracy, but they still rely on pre-computed optical flow or iDT, which limits the speed of the whole framework.

To obtain motion modalities quickly, recent works either use optical flow only at the training stage, or propose motion vectors as a simplified version of optical flow. These attempts yield degraded optical flow, and their results are still not on par with methods that use traditionally computed optical flow as an input stream.

Many methods use 3D CNNs to capture motion information directly from the input frames. Through temporal convolution and pooling operations, 3D CNNs can extract temporal information between consecutive frames without splitting them into short segments. In contrast to learning filters that capture motion information, our OFF is a principled representation mathematically derived from optical flow. Constrained by factors such as network design, training samples, and weight decay, 3D CNNs may not learn motion representations as good as OFF. Therefore, current state-of-the-art 3D-CNN-based algorithms still rely on traditional optical flow to help the network capture motion patterns.

The OFF proposed in this paper has the following characteristics:

        1) It captures motion patterns, so that an RGB-only stream equipped with OFF can match two-stream methods;

        2) It is also complementary to other motion representations such as optical flow.

To capture long-term temporal information from videos, an intuitive approach is to introduce long short-term memory (LSTM) modules as encoders that encode the relationship among the deep features of a frame sequence. LSTM can still be applied on top of OFF; therefore, our OFF is complementary to these methods. In parallel with our work, another recent approach applies a strategy called rank pooling that generates fast video-level descriptors, namely dynamic images. However, dynamic images differ from our design in nature and implementation: dynamic images are designed to summarize a sequence of frames, while our method is designed to capture motion information related to optical flow.

Optical Flow Guided Feature

Our proposed OFF is inspired by the well-known brightness constancy constraint used in traditional optical flow, which is formulated as follows:

$I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)$

where I(x, y, t) denotes the pixel at location (x, y) of the frame at time t. For frames t and t + Δt, Δx and Δy are the spatial displacements of the pixel along the x and y axes, respectively.

Assume that for any point moving from (x, y) at frame t to (x + Δx, y + Δy) at frame t + Δt, its brightness remains constant over time. Applying this constraint at the feature level gives:

$f(I; w)(x, y, t) = f(I; w)(x + \Delta x, y + \Delta y, t + \Delta t)$

where f is the mapping function used to extract features from the image I, and w denotes the parameters of the mapping function. The mapping function f can be any differentiable function.

In this paper, we adopt a trainable CNN composed of convolution, ReLU, and pooling operations as the mapping function. Following the definition of optical flow, let p = (x, y, t); a first-order Taylor expansion of the constraint above gives:

$\frac{\partial f(I;w)(p)}{\partial x}\,\Delta x + \frac{\partial f(I;w)(p)}{\partial y}\,\Delta y + \frac{\partial f(I;w)(p)}{\partial t}\,\Delta t = 0$

Dividing both sides of the above equation by Δt, we obtain:

$\frac{\partial f(I;w)(p)}{\partial x}\,v_x + \frac{\partial f(I;w)(p)}{\partial y}\,v_y + \frac{\partial f(I;w)(p)}{\partial t} = 0$

where p = (x, y, t) and (v_x, v_y) denotes the two-dimensional velocity of the feature point at p; $\frac{\partial f(I;w)(p)}{\partial x}$ and $\frac{\partial f(I;w)(p)}{\partial y}$ are the spatial gradients of f(I; w)(p) along the x and y axes respectively, and $\frac{\partial f(I;w)(p)}{\partial t}$ is the temporal gradient along the time axis.

As a special case, when f(I; w)(p) = I(p), f(I; w)(p) simply represents the pixel at p, and in this particular case (v_x, v_y) is exactly the optical flow.

For each p, the optical flow is obtained by solving an optimization problem under the constraint in the last equation above. In this case, the term $\frac{\partial I(p)}{\partial t}$ is simply the difference between RGB frames.

Previous research has shown that the temporal difference between frames is useful in video-related tasks; however, there has been little theoretical evidence explaining why this simple idea works well. Here we can see its connection to spatial features and optical flow. We generalize the optical flow representation from the pixel I(p) to the feature f(I; w)(p). In this general case, [v_x, v_y] is called the feature flow.

Define $\vec{F}(I; w)(p) = \left[ \frac{\partial f(I;w)(p)}{\partial x},\ \frac{\partial f(I;w)(p)}{\partial y},\ \frac{\partial f(I;w)(p)}{\partial t} \right]$. As can be seen from the above equation, $\vec{F}(I; w)(p)$ is orthogonal to the vector $[v_x, v_y, 1]$ that contains the feature-level optical flow. In other words, $\vec{F}(I; w)(p)$ changes with the feature-level optical flow and is therefore guided by it. We call it the Optical Flow guided Feature (OFF). OFF encodes spatio-temporal information that is orthogonal and complementary to the feature-level optical flow (v_x, v_y).
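To make the special case f(I; w)(p) = I(p) concrete, here is a small NumPy/SciPy sketch (illustrative only, not the paper's code) that builds the pixel-level OFF vector [∂I/∂x, ∂I/∂y, ∂I/∂t] for a synthetically translated image and checks that it is roughly orthogonal to [v_x, v_y, 1]. The Gaussian test image and the Sobel normalization factor are assumptions made purely for this example.

```python
import numpy as np
from scipy.ndimage import sobel, shift

# Synthetic frame at time t: a Gaussian blob. The blob is translated by a
# known flow (vx, vy) to produce the frame at time t + Δt.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
frame_t = np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / 50.0)
vx, vy = 2.0, 1.0
frame_t1 = shift(frame_t, shift=(vy, vx), order=3)   # content moves by (vx, vy)

# Pixel-level OFF: spatial gradients of frame t and the temporal difference.
Ix = sobel(frame_t, axis=1) / 8.0    # ∂I/∂x (Sobel response normalized by 8)
Iy = sobel(frame_t, axis=0) / 8.0    # ∂I/∂y
It = frame_t1 - frame_t              # ∂I/∂t approximated by the frame difference

# Brightness constancy says [Ix, Iy, It] · [vx, vy, 1] ≈ 0 at every pixel
# (only approximately: the first-order Taylor expansion drops higher-order terms).
residual = Ix * vx + Iy * vy + It
print("mean |gradient| :", np.abs(np.stack([Ix, Iy, It])).mean())
print("mean |residual| :", np.abs(residual).mean())   # noticeably smaller
```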

Using Optical Flow Guided Feature in Convolutional Neural Networks

Network Architecture 

 

The above figure shows the overall architecture of the network, which consists of three subnets used for different purposes: feature generation subnet, OFF subnet and classification subnet. The feature generation subnet uses common CNN structures to generate basic features. In the OFF subnet, OFF features are extracted using features from the feature generation subnet, and then several residual blocks are stacked to obtain fine features. Then, the classification subnet uses the features of the first two subnets to obtain action recognition results.

The feature generation sub-network extracts features for each frame sampled from the video. Based on the features of two adjacent frames extracted by the feature generation subnetwork, the OFF subnetwork is applied to generate OFF for further classification. Scores from all subnetworks are fused to obtain the final result.

The figure below shows a more detailed network structure with two segments as input. As shown, features of the same resolution from multiple layers at a given level are concatenated together and fed into an OFF unit. The entire network has 3 OFF units at different scales.

    The input is two segments, shown in blue and green, which are fed into the feature generation subnet to obtain basic features. The basic feature f(I) (equivalent to the representation f(I; w) in the previous section) is extracted from the input image by several convolutional layers, where rectified linear units (ReLU) are used as the nonlinearity and max pooling is used for downsampling.

    We choose BN-Inception as the network structure to extract feature maps. The feature generation subnetwork can be replaced by any other network architecture.

    Here K represents the maximum side length of the square feature map selected for going through the OFF subnet to obtain OFF features.

    The OFF subnet consists of several OFF units, and several residual blocks are connected between OFF units at different resolution levels. When viewed as a whole, these residual blocks constitute ResNet-20. Scores obtained by different subnetworks are independently supervised.  
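The wiring of the three subnets can be summarized with the following Python sketch. Everything here is a stand-in: the backbone is replaced by random feature maps, the OFF unit is reduced to a temporal difference (the full unit is described just below), and the classifier is a random linear layer, so only the data flow, not the learned behaviour, follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 101   # UCF-101

def feature_generation_subnet(frame):
    """Stand-in for the BN-Inception backbone: returns 'basic features'
    at three resolution levels (the input frame is ignored in this stub)."""
    return [rng.standard_normal((c, s, s)).astype(np.float32)
            for c, s in ((256, 28), (512, 14), (1024, 7))]

def off_subnet(feats_a, feats_b):
    """Stand-in for the OFF subnet: one OFF unit per resolution level.
    Here each unit is reduced to a temporal difference; the real unit also
    applies 1x1 conv, Sobel gradients and residual-block refinement."""
    return [fb - fa for fa, fb in zip(feats_a, feats_b)]

def classification_subnet(feature):
    """Stand-in classifier: global average pooling + a random linear layer."""
    pooled = feature.reshape(feature.shape[0], -1).mean(axis=1)
    weights = rng.standard_normal((NUM_CLASSES, pooled.shape[0])).astype(np.float32)
    return weights @ pooled

# Two sampled segments share the same feature generation subnet.
frame_1 = rng.random((224, 224, 3))
frame_2 = rng.random((224, 224, 3))
feats_1 = feature_generation_subnet(frame_1)
feats_2 = feature_generation_subnet(frame_2)

# The OFF subnet consumes the pair of basic features; the classification
# subnet scores both the appearance features and the OFF features, and the
# scores from the subnets are fused for the final prediction.
off_feats = off_subnet(feats_1, feats_2)
score_rgb = classification_subnet(feats_1[-1]) + classification_subnet(feats_2[-1])
score_off = classification_subnet(off_feats[-1])
final_score = score_rgb + score_off
print(final_score.shape)   # (101,)
```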

The detailed structure of OFF subnet and OFF unit is shown in the figure below:

The OFF subnet consists of several OFF units. Different units operate on basic features f(I) from different depths. As shown in Figure 4, the OFF unit contains OFF layers to produce the OFF. Each OFF layer contains a 1 × 1 convolutional layer for each feature, and a set of operators including the Sobel operator and element-wise subtraction for OFF generation. After obtaining the OFF, the OFF unit concatenates it with features from lower levels, and the combined features are then fed to the following residual block.

A 1x1 convolutional layer is connected to the input base features for dimensionality reduction.

After that, we use the Sobel operator and element-wise subtraction to calculate the spatial and temporal gradients respectively.

The combination of gradients constitutes OFF, and the Sobel operator, subtraction operator and the 1×1 convolutional layer before them constitute the OFF layer.

The OFF layer is responsible for generating OFF from the basic features f(I).  

OFF should include the spatial and temporal gradients of features.

Denote by f(I, c) the c-th channel of the basic feature f(I), and by F_x and F_y the OFF components for the gradients in the x and y directions, which correspond to the spatial gradients. The spatial gradients are generated by applying the Sobel operator as follows:

$F_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * \left\{ f(I, c) \right\}_{c=1}^{N_c}, \qquad F_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * \left\{ f(I, c) \right\}_{c=1}^{N_c}$

where * denotes the convolution operation and the constant N_c is the number of channels of the feature f(I). Denote by F_t the OFF component for the gradient in the temporal direction.

The temporal gradient is obtained by element-wise subtraction as follows:

$F_t = \left\{ f_{t+\Delta t}(I, c) - f_{t}(I, c) \right\}_{c=1}^{N_c}$

where f_t(I, c) denotes the c-th channel of the basic feature extracted from the segment at time t. With the F_x, F_y, and F_t obtained above, we concatenate them with lower-level features as the output of the OFF layer. 1×1 convolutional layers are used before the Sobel and subtraction operations to reduce the number of channels; in our experiments, the channel dimension is reduced to 128 regardless of the number of input channels. The reduced feature is then used inside the OFF unit to calculate the OFF defined in the previous section. After the OFF is obtained, several residual blocks connected between OFF units at different resolution levels serve as refinement.

In the residual block near the OFF unit, the dimensionality of OFF is further reduced to save the amount of calculation and the number of parameters. Residual blocks at different resolution levels finally form ResNet-20. Note that no batch normalization operation is applied in our residual network to avoid overfitting issues. OFF units can be applied to different levels of CNN layers. The input to an OFF unit includes the basic depth features of the two segments, and the features of the OFF unit at the previous feature level (if present). In this way, OFF at the previous semantic level can be used to refine OFF at the current semantic level.
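Below is a minimal NumPy sketch of the operations inside one OFF layer, under simplifying assumptions: the learned 1×1 convolution is replaced by a fixed random channel projection, the per-channel Sobel filtering uses scipy.ndimage, and the residual-block refinement is omitted. The 128-channel reduction and the concatenation with lower-level OFF follow the description above; everything else (shapes, random weights) is illustrative.

```python
import numpy as np
from scipy.ndimage import sobel

rng = np.random.default_rng(0)

def make_conv1x1(c_in, c_out=128):
    """1x1 convolution for channel reduction (fixed random weights stand in
    for the learned ones); shared by the two segments."""
    w = (0.01 * rng.standard_normal((c_out, c_in))).astype(np.float32)
    return lambda feat: np.tensordot(w, feat, axes=([1], [0]))   # (c_out, H, W)

def off_layer(feat_t, feat_t1, reduce, lower_level_off=None):
    """One OFF layer: channel reduction, per-channel Sobel spatial gradients,
    element-wise temporal subtraction, then concatenation along channels."""
    f_t, f_t1 = reduce(feat_t), reduce(feat_t1)

    fx = np.stack([sobel(ch, axis=1) for ch in f_t])   # F_x: ∂f/∂x per channel
    fy = np.stack([sobel(ch, axis=0) for ch in f_t])   # F_y: ∂f/∂y per channel
    ft = f_t1 - f_t                                    # F_t: temporal difference

    parts = [fx, fy, ft]
    if lower_level_off is not None:        # OFF from the previous level refines
        parts.append(lower_level_off)      # the OFF at the current level
    return np.concatenate(parts, axis=0)

# Basic features of two adjacent segments at one resolution level (C, H, W).
feat_seg1 = rng.standard_normal((480, 28, 28)).astype(np.float32)
feat_seg2 = rng.standard_normal((480, 28, 28)).astype(np.float32)
reduce = make_conv1x1(c_in=480)
off = off_layer(feat_seg1, feat_seg2, reduce)
print(off.shape)   # (384, 28, 28): three groups of 128 channels (F_x, F_y, F_t)
```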

Classification subnet

The classification subnetwork obtains features from different sources and uses multiple intrinsic classifiers to obtain multiple classification scores.

The classification scores of all sampled frames are then combined within each feature generation subnet or OFF subnet by averaging.

The OFF at each semantic level can be used to produce classification scores during the training phase, so that each level is learned with its own corresponding loss.

This strategy has proven useful in many tasks. During the testing phase, the scores from the different subnets can be fused for better performance.

Network Training

Action recognition is considered as a multi-class classification problem.

Following the setup in TSN, each segment produces its own classification scores, so we fuse them within each subnet to generate video-level scores for the loss calculation.

Here, for the OFF subnet, the feature produced by the OFF subnet of the t-th segment at level l is denoted by F_{t,l}.

The classification score of segment t at level l, computed from F_{t,l}, is denoted by G_{t,l}. The aggregated video-level score at level l is denoted by G_l.

The video-level action classification score G_l is obtained as follows:

$G_l = \mathcal{G}\left( G_{1,l},\ G_{2,l},\ \ldots,\ G_{N_t,l} \right)$

where N_t is the number of frames (segments) used to extract features.

The aggregation function, denoted $\mathcal{G}$, summarizes the predicted scores from the different segments over time.

Following the investigation in TSN, $\mathcal{G}$ is implemented as average pooling, which gives better performance.

The above equation also applies to the feature generation subnetwork.

Since we do not require intermediate supervision for the feature generation subnetwork, its feature F_{t,l} for segment t is simply the final feature output of that subnetwork.
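A short sketch of the aggregation described above, with average pooling as the aggregation function $\mathcal{G}$; the shapes (3 segments, 101 classes) are only an example.

```python
import numpy as np

def video_level_score(segment_scores):
    """Aggregate per-segment scores G_{t,l} into the video-level score G_l
    by average pooling over the segment (time) dimension."""
    # segment_scores: array of shape (N_t segments, C classes)
    return segment_scores.mean(axis=0)

scores = np.random.default_rng(0).standard_normal((3, 101))  # 3 segments, 101 classes
G_l = video_level_score(scores)
print(G_l.shape)   # (101,)
```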

To update the parameters of the entire network, the loss is set to the standard categorical cross-entropy loss.

Since the subnetwork at each feature level is independently supervised, a loss function is applied at each level:

$L_l = -\sum_{c=1}^{C} y_c \left( G_{l,c} - \log \sum_{j=1}^{C} e^{G_{l,j}} \right)$

where C is the number of action classes, G_{l,c} is the estimated score of class c computed from the features at level l, and y_c is the ground-truth class label.

By using this loss function, we can optimize the network parameters through backpropagation.
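The per-level loss can be sketched as follows. The softmax form of the categorical cross-entropy and the plain summation over supervised levels are assumptions made for illustration, since the text above only states that each level is independently supervised with a standard cross-entropy loss.

```python
import numpy as np

def cross_entropy(video_scores, label):
    """Categorical cross-entropy on the video-level scores G_l for one level,
    using a numerically stable log-softmax."""
    shifted = video_scores - video_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def total_loss(per_level_scores, label):
    """Each supervised level l contributes its own loss L_l; here the levels
    are simply summed (any weighting is not specified above)."""
    return sum(cross_entropy(G_l, label) for G_l in per_level_scores)

rng = np.random.default_rng(0)
per_level_scores = [rng.standard_normal(101) for _ in range(3)]   # 3 supervised levels
print(total_loss(per_level_scores, label=7))
```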

Detailed Implementation of Training – Two-Phase Training Strategy

The training of the entire network consists of two stages.

In the first stage, the feature generation subnetwork is trained with an existing method such as TSN.

In the second stage, we train the OFF and classification subnetworks, with all weights in the feature generation subnetwork frozen.

The weights of the OFF subnet and the classification subnets are learned from scratch.

The entire network can be further fine-tuned in an end-to-end manner; however, we did not observe significant gains from this extra stage.

To simplify the training process, we train the network using only the two stages described above.
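The released code is in Caffe; purely as an illustration of the second stage, here is a PyTorch-style sketch of freezing the feature generation subnetwork while the OFF and classification subnetworks are trained from scratch. The tiny module definitions are placeholders, and only the 0.02 learning rate comes from the training details given later.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three subnets.
feature_generation = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
off_subnet = nn.Sequential(nn.Conv2d(16, 16, 1), nn.ReLU())
classifier = nn.Linear(16, 101)

# Stage 1: the feature generation subnet is trained with an existing method
# such as TSN (not shown here).

# Stage 2: freeze the feature generation subnet; train the OFF and
# classification subnets from scratch.
for p in feature_generation.parameters():
    p.requires_grad = False

trainable = list(off_subnet.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(trainable, lr=0.02)   # lr taken from the RGB setting below
```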

Detailed Implementation of Training – Intermediate Supervision During Training

Intermediate supervision has proven to be a practical training strategy in many other computer vision tasks.

Since the OFF subnet is fed by intermediate inputs, here we add intermediate supervision at each level to obtain better OFF at each resolution level.

Detailed Implementation of Training – Reducing Memory Costs

Since our framework consists of several sub-networks, it consumes more memory than the original TSN framework, which extracts and stores motion frames before training the CNN and trains multiple networks independently.

To reduce computational and memory costs, we sample fewer frames in the training phase than in the testing phase and still obtain satisfactory results.

Network Testing

Since different sub-networks produce multiple classification scores, we need to fuse them together during the testing stage to obtain better performance.  

In this study, we fuse the scores of the feature generation subnet and the last-level OFF subnet through a simple summation.

We choose to test our model based on the state-of-the-art framework TSN.

The test settings under the TSN framework are as follows:

  • During the testing phase of TSN, 25 segments were sampled from RGB, RGB difference and optical flow.  
  • However, the number of frames in each segment is different in these modalities.
  • The original settings we adopted using TSN were 1, 5, and 5 frames per segment for RGB, RGB difference, and optical flow sampling respectively.  
  • The input to our network is 25 segments, where the t-th segment is treated as frame t in Figure 3.
  • In this case, the features extracted by a separate branch of our feature generation subnet are for segments rather than frames when using TSN.
  • Other settings remain consistent with those in TSN.

Testing and Evaluation - Datasets

Experimental results are evaluated on two popular video action datasets, UCF-101 and HMDB-51.

The UCF-101 dataset has 13320 videos divided into 101 classes, while HMDB51 contains 6766 videos and 51 classes.

Our experiments follow the official protocol, which divides the data set into 3 training and testing splits, and finally calculates the average accuracy of all 3 splits.

We prepare the optical flow between frames before training by directly using an algorithm implemented in OpenCV.
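The text only says that an OpenCV implementation is used, without naming the algorithm; the sketch below uses cv2.calcOpticalFlowFarneback purely as a stand-in (TV-L1, available in opencv-contrib, is another common choice in TSN-style pipelines). The video path and the ±20 clipping are illustrative assumptions.

```python
import cv2
import numpy as np

def dense_flow(prev_bgr, next_bgr):
    """Dense optical flow between two consecutive frames (Farneback is used
    here only as a stand-in for 'an algorithm implemented in OpenCV')."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Positional args: flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Clipping to [-20, 20] before storage is common practice in two-stream pipelines.
    return np.clip(flow, -20, 20)

cap = cv2.VideoCapture("some_video.avi")   # illustrative path
ok, prev_frame = cap.read()
flows = []
while ok:
    ok, next_frame = cap.read()
    if not ok:
        break
    flows.append(dense_flow(prev_frame, next_frame))
    prev_frame = next_frame
cap.release()
```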

Testing and Evaluation - Implementation Details

We train our model using 4 NVIDIA TITAN X GPUs, implemented on Caffe and OpenMPI.

We first train the feature generation subnetwork using the same strategy provided in the corresponding method.

Then in the second stage, we train the OFF subnetwork from scratch and freeze all parameters in the feature generation subnetwork.

A mini-batch stochastic gradient descent algorithm is used here to learn network parameters.

When the feature generation subnet is fed with RGB frames, training the OFF subnet takes 20,000 iterations to converge; the learning rate is initialized to 0.02 and decayed by a factor of 0.1 using a multi-step strategy at iterations 10,000 and 15,000.

When the input changes to temporal modalities such as optical flow, the learning rate is initialized to 0.05, and the other strategies are the same as for the RGB input.

The batch size is set to 128 and all training strategies described in the previous sections are applied.
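The multi-step schedule described above (base learning rate 0.02 for the RGB case, decayed by a factor of 0.1 at iterations 10,000 and 15,000 over 20,000 iterations) can be written as a small helper; the function itself is only a sketch.

```python
def learning_rate(iteration, base_lr=0.02, gamma=0.1, steps=(10000, 15000)):
    """Multi-step learning-rate schedule: multiply the base rate by gamma
    once for every step boundary that has been passed."""
    passed = sum(iteration >= s for s in steps)
    return base_lr * gamma ** passed

for it in (0, 9999, 10000, 15000, 19999):
    print(it, learning_rate(it))   # ≈ 0.02, 0.02, 0.002, 0.0002, 0.0002
```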

When evaluating on UCF-101 and HMDB-51, we add a dropout module on the spatial stream of OFF. There is no difference in training parameters between different methods.

However, when the input is the RGB difference or optical flow, more time is spent in the training and testing phases because more frames are read into the network.

Contributions

First, OFF is a fast and robust motion representation.

OFF can be computed quickly, at over 200 frames per second with only RGB as input, and it is derived from and guided by optical flow.

Experimental results show that by obtaining only RGB frames from videos, the performance of CNN with OFF is close to that of state-of-the-art optical flow-based algorithms.

A CNN with OFF can reach 93.3% accuracy on the UCF-101 dataset with only RGB frames as input, which is state-of-the-art among RGB-based action recognition methods.

When OFF is plugged into a state-of-the-art action recognition framework in a two-stream fashion (RGB + optical flow), our algorithm achieves 96.0% on UCF-101 and 74.2% on HMDB-51.

Second, networks equipped with OFF can be trained in an end-to-end manner. In this way, spatial and motion representations are jointly learned by a single network. This property is well suited to video tasks on large datasets, since the network does not need to precompute and store motion modalities for training. Additionally, OFF can be applied between frames/segments of a video at both the image level and the feature level.
