Optical Flow Estimation using a Spatial Pyramid Network

We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. At each pyramid level, one image of a pair is warped by the current flow estimate and a flow update is computed, so that large motions are handled in a coarse-to-fine manner. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, these networks do not need to deal with large motions; those are handled by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler: 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and better suited to embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to the pair of warped images is appropriate. Third, unlike FlowNet, the learned convolutional filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction for combining classical flow methods with deep learning.

1. Introduction

In recent years, significant progress has been made on the problem of accurately estimating optical flow, as evidenced by improving performance on increasingly challenging benchmarks. Nonetheless, most top-performing methods derive from a "classical formulation" that makes a variety of assumptions about the image, from brightness constancy to spatial smoothness. These assumptions are only rough approximations to reality, and this likely limits performance. Recent work in the field has focused on improving these assumptions or making them more robust to violations [7].

Another approach abandons the classical formulation altogether and starts afresh with recent neural network architectures. Such methods take a pair (or sequence) of images and learn to compute flow directly from them. Ideally, such a network would learn to solve the correspondence problem (short and long range), learn filters relevant to the problem, learn what is constant in a sequence, and learn about the spatial structure of the flow and how it relates to image structure. Initial attempts are promising but are not yet as accurate as classical methods.

Goal: We argue for an alternative approach that combines the advantages of both. Decades of research on optical flow have produced well-engineered systems and principles that are effective. But in places, the assumptions behind these methods limit their performance. Consequently, we apply machine learning to address the weak points while keeping the engineered architecture, with the goals of 1) improving performance over existing neural networks and over the classical methods upon which our work is based; 2) achieving real-time flow estimation with accuracy approaching that of much slower classical methods; and 3) reducing memory requirements to make flow estimation more practical for embedded, robotic, and mobile applications.

Problem: A key problem with recent methods that learn to compute flow [16] is that they typically take two frames, stack them together, and apply a convolutional network architecture. When the motion between frames is larger than one (or a few) pixels, spatio-temporal convolutional filters will not receive a meaningful response. Put differently, if a convolutional window in one image does not overlap with related image pixels at the next time instant, no meaningful temporal filter can be learned.
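This observation also motivates the pyramid: each 2x downsampling halves the apparent motion, so a fixed number of levels brings any bounded displacement within the filters' reach. A minimal sketch of this arithmetic (the one-pixel limit is illustrative, not a quantity from the paper):

```python
def levels_needed(displacement_px: float, per_level_limit: float = 1.0) -> int:
    """Count how many 2x downsamplings are needed before a displacement
    falls within the per-level limit that small filters can resolve."""
    levels = 0
    while displacement_px > per_level_limit:
        displacement_px /= 2.0  # each pyramid level halves the motion
        levels += 1
    return levels

print(levels_needed(32.0))  # a 32-pixel motion needs 5 halvings
```

So a motion too large for any single convolutional window becomes a sub-pixel update at a sufficiently coarse level.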

There are two problems to solve. One is to address long-range correlations; the other is to capture detailed, sub-pixel optical flow and precise motion boundaries. FlowNet [16] attempts to learn both simultaneously. In contrast, we use deep learning to solve the latter and rely on existing methods to solve the former.

Approach: To handle large motions we adopt a traditional coarse-to-fine approach using a spatial pyramid. At the top level of the pyramid, the motion between frames should be less than a few pixels, so that convolutional filters can learn meaningful temporal structure. At each level of the pyramid we solve for a flow update and upsample the flow to the next level. As is standard in the classical formulation [36], we warp one image toward the other using the current flow, and repeat this process at each pyramid level. Instead of minimizing a classical objective function at each level, we train a convolutional network to predict the flow increment at that level. We train the networks from coarse to fine, learning the flow correction at each level and adding it to the flow output of the network above. The key idea is that, at every pyramid level, the displacement remains less than a few pixels.
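The coarse-to-fine loop above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `upsample2x` and `warp` are nearest-neighbor stand-ins for the operators u(·) and w(·,·) of Section 3.1 (SPyNet uses bilinear interpolation), and `nets[k]` stands for the trained convnet at level k:

```python
import numpy as np

def upsample2x(flow):
    """Nearest-neighbor stand-in for the upsampling operator u(.)."""
    return np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)

def warp(img, flow):
    """Nearest-neighbor stand-in for the warping operator w(I, V)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def coarse_to_fine_flow(pyr1, pyr2, nets):
    """pyr1, pyr2: image pyramids, coarsest level first.
    nets[k]: flow-update predictor at level k (any callable here)."""
    flow = np.zeros(pyr1[0].shape[:2] + (2,))   # zero flow at the top level
    for k, (im1, im2) in enumerate(zip(pyr1, pyr2)):
        if k > 0:
            # Doubled after upsampling: pixels at this level are twice as fine.
            flow = 2.0 * upsample2x(flow)
        # Predict only the residual flow between im1 and the warped im2,
        # then add it to the flow propagated from the coarser level.
        flow = flow + nets[k](im1, warp(im2, flow), flow)
    return flow
```

Because each network sees a warped pair, it only ever has to predict a small residual, which is what makes small per-level networks sufficient.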

We call the method SPyNet, for Spatial Pyramid Network, and train it using the same Flying Chairs data as FlowNet [16]. We report similar performance to FlowNet on Flying Chairs and Sintel [11], and significantly better accuracy than FlowNet on Middlebury [4] and KITTI [18] after fine-tuning. SPyNet is also 96% smaller than FlowNet in total size, meaning it runs faster and uses far less memory. It replaces the expensive iterative propagation of classical methods with a non-iterative computation in a neural network.

We do not claim to solve the full optical flow problem with SPyNet; we address the same problems as traditional methods and inherit some of their limitations. For example, large motions of small or thin objects are notoriously difficult to represent with pyramids. We regard this large-motion problem as separate, requiring different solutions. Rather, what we show is that the traditional problem can be reformulated, that parts of it can be learned, and that performance improves in many scenarios.

Furthermore, because our method bridges past methods and new tools, it provides insight into how to move forward. In particular, we find that SPyNet learns spatio-temporal convolutional filters that resemble traditional spatio-temporal derivative or Gabor filters [2, 23]. The learned filters also resemble biological models of motion-processing filters in MT and V1 [35]. This is in contrast to the less structured filters learned by FlowNet. It suggests that it is time to reexamine older spatio-temporal filtering approaches with new tools.

In summary, our contributions are: 1) the combination of traditional coarse-to-fine pyramid methods with deep learning for optical flow estimation; 2) the new SPyNet model, which is 96% smaller and 96% faster than FlowNet; 3) comparable or lower error than FlowNet on standard benchmarks (Sintel, KITTI, and Middlebury); 4) learned spatio-temporal filters that provide insight into what filters are needed for flow estimation; and 5) the trained network and related code, which are publicly available for research at https://github.com/anuragranj/spynet.

2. Related Work

Our formulation effectively combines ideas from "classical" optical flow and recent deep learning methods. Our review focuses on the work most related to this.

Spatial pyramids and optical flow: The classical formulation of the optical flow problem goes back to Horn and Schunck [24] and involves optimizing the sum of a data term based on brightness constancy and a spatial smoothness term. Classical methods suffer from the fact that they make very approximate assumptions about image brightness change and the spatial structure of the flow. Many methods focus on improving robustness by changing these assumptions; a full review would effectively cover the history of the field, and for this we refer the reader to [36]. A key advantage of learning to compute flow, as we do here, is that we do not hand-craft changes to these assumptions. Rather, the variation in image brightness and spatial smoothness is embodied in the learned network.

The idea of using spatial pyramids has a similarly long history, dating to [10], with its first use in the classical flow formulation appearing in [19]. Typically, Gaussian or Laplacian pyramids are used for flow estimation, the primary motivation being to deal with large motions. These methods are notoriously problematic when small objects move quickly. The work of [8] incorporates long-range matching into the traditional optical flow objective function. This approach of combining image matching to capture large motions with either variational [31] or discrete optimization [20] to capture fine motions can produce accurate results. Of course, spatial pyramids are widely used elsewhere in computer vision and have recently been used in deep neural networks [15] to learn generative image models.

Spatio-temporal filters

Adelson and Bergen [2] presented the theory of spatio-temporal models for motion estimation, and Heeger [23] provided a computational embodiment. Although inspired by human perception, these methods did not perform well at the time [6]. Various approaches have shown that spatio-temporal filters emerge from learning, for example using independent component analysis [41], sparse coding [30], and multi-layer models [12]. Memisevic and Hinton learned simple spatial transformations with a restricted Boltzmann machine [28], finding a wide variety of filters. Taylor et al. [39] used restricted Boltzmann machines to learn "flow-like" features from synthetic data, but did not evaluate flow accuracy.

Dosovitskiy et al. [16] learn spatio-temporal filters for flow estimation using a deep network, but these filters differ from the classical, neuroscience-inspired ones. Using a pyramid approach, we instead learn filters that are visually similar to classical spatio-temporal filters yet yield good flow estimates, because they are learned from data.

Learning to model and compute flow

Probably the first attempt to learn a model for optical flow estimation was the work of Freeman et al. [17] using an MRF. They considered a simple synthetic world of uniformly moving blobs with known ground-truth flow. The training data was not realistic, and they did not apply the method to real image sequences. Roth and Black [32] learned a Field of Experts (FoE) model to capture the spatial statistics of optical flow. The FoE can be viewed as a (shallow) convolutional neural network. The model was trained on flow fields generated from laser scans of real scenes and natural camera motions. They had no images of the scenes (only the flow fields), so the method learned only the spatial component of the problem.

.......

Deep learning: The learning approaches above use limited training data and relatively shallow models. In contrast, deep convolutional networks have emerged as a powerful class of models for solving recognition [22, 38] and dense estimation [13, 27] problems.

FlowNet [16] represents the first deep convolutional architecture trained end-to-end for flow estimation, and it shows promising results despite being trained on an artificial dataset of chairs composited onto randomly selected background images (Flying Chairs). Despite these promising results, the method lags behind the state of the art in accuracy. Deep matching methods [20, 31, 42] do not fully solve the problem either, since they resort to classical methods to compute the final flow field. It remains an open question which architectures are best suited to this problem and how best to train them.

......

Fast flow:

Some recent methods attempt to balance speed and accuracy, aiming for real-time processing with reasonable, though not top, accuracy. GPU-flow [43] began this trend, and several methods now outperform it. PCA-Flow [44] runs on a CPU but is slower than frame rate and produces overly smooth flow fields. EPPM [5] achieves similar middle-of-the-pack performance on Sintel (test) with similar speed on a GPU. The recent DIS-Fast [26] is a GPU method that is significantly faster than previous approaches, but also significantly less accurate.

Our method is also significantly faster than the best previous CNN flow method (FlowNet), which reports a runtime of 80 ms per frame for FlowNetS. The key to our speed is a small network that fits entirely on the GPU; in addition, all of our pyramid operations are implemented on the GPU. Model size is important as well, though it receives less attention than speed. For optical flow to run on embedded processors, aircraft, mobile phones, and the like, the algorithm needs a small memory footprint. Our network is 96% smaller than FlowNetS and uses only 9.7 MB, making it easily small enough to fit on a mobile-phone GPU.

3. Spatial Pyramid Network

Our method uses the coarse-to-fine spatial pyramid structure of [15] to learn the residual flow at each pyramid level. Here we describe the network and training procedure.

3.1. Spatial Sampling

Let d(·) be the downsampling operation that shrinks an m×n image I to the corresponding image d(I) of size m/2 × n/2. Let u(·) be the inverse, upsampling operation. These operators are also used to downsample and upsample the horizontal and vertical components of an optical flow field V. We additionally define a warping operator w(I, V) that warps image I according to the flow field V using bilinear interpolation.
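The warping operator w(I, V) can be made concrete as follows. This is a sketch of standard backward warping with bilinear interpolation, not the paper's implementation; border handling (here, clamping samples to the image edge) is an assumption:

```python
import numpy as np

def warp_bilinear(img, flow):
    """w(I, V): sample img at positions displaced by flow, with bilinear
    interpolation. img: (H, W) array; flow: (H, W, 2) of (dx, dy)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Backward warping: each output pixel reads from (x + dx, y + dy),
    # clamped to the image border.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighboring pixels by their fractional offsets.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With a zero flow field this reduces to the identity, which is a useful sanity check when wiring the operator into a pyramid.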

3.2. Inference

3.3. Training and Network Architecture

Origin blog.csdn.net/mytzs123/article/details/126902198