Classic literature reading - FastFlowNet (lightweight optical flow estimation)

0. Introduction

Dense optical flow estimation plays a key role in many robotic vision tasks. With the advent of deep learning, it can be predicted with higher accuracy than traditional methods. However, current networks often carry a large number of parameters and heavy computational costs, which hinders deployment on mobile devices with limited power or memory. To address these challenges, the paper "FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation" delves into designing efficient structures for fast and accurate optical flow prediction. The proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations. First, a new Head Enhanced Pooling Pyramid (HEPP) feature extractor is adopted to enhance high-resolution pyramid features while reducing parameters. Second, a novel Center Dense Dilated Correlation (CDDC) layer is introduced for constructing compact cost volumes that maintain a large search radius while reducing the computational burden. Third, an efficient Shuffle Block Decoder (SBD) is implanted at each pyramid level to accelerate flow estimation at the cost of only a slight drop in accuracy. FastFlowNet contains only 1.37 M parameters and runs at 90 fps and 5.7 fps on an NVIDIA GTX 1080 Ti and an embedded Jetson TX2 GPU, respectively, on Sintel-resolution images. At present, this lightweight network can basically cope with low-speed unmanned vehicle systems. The corresponding code has been open-sourced on GitHub, and some people on the NVIDIA Developer Forums are working on TensorRT acceleration.


1. Contributions

In order to speed up accurate optical flow estimation and facilitate practical applications, this paper proposes a lightweight fast network called FastFlowNet. The model is based on the widely used coarse-to-fine residual structure and improves it in three aspects: pyramid feature extraction, cost volume construction, and optical flow decoding, covering all components of the flow estimation pipeline.

  1. First, a Head Enhanced Pooling Pyramid (HEPP) feature extractor is proposed, which uses convolutional layers at the high-resolution (shallow) levels and parameter-free pooling at the low-resolution (deep) levels.

  2. Second, recent studies [13], [19] show that increasing the search radius of the correlation layer can improve flow accuracy. However, the number of feature channels in the cost volume grows with the square of the search radius, and the computational complexity of the subsequent decoder network grows with its 4th power. FastFlowNet keeps the search radius at 4, as in PWC-Net [13], to sense large motions. The difference is that, to reduce computation, it sparsely samples the correlation channels in the outer large-residual regions and proposes a novel Center Dense Dilated Correlation (CDDC) layer to construct a compact cost volume.

  3. Third, the flow decoder at each pyramid level accounts for a relatively large share of the parameters and computation of the entire network. To further reduce computation while maintaining good performance, a new Shuffle Block Decoder (SBD) module is constructed with reference to the lightweight ShuffleNet [21], which was designed for low computational cost and high classification accuracy. Different from ShuffleNet [21], which serves as a backbone network, the SBD module is used for optical flow regression and is placed only in the middle part of the decoder network.
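The channel and complexity figures in item 2 can be sanity-checked with a quick back-of-the-envelope calculation (this is arithmetic from the stated scaling laws, not code from the paper):

```python
# cost-volume channels for a dense square window of search radius r: (2r+1)^2
channels = {r: (2 * r + 1) ** 2 for r in (3, 4)}
print(channels)  # {3: 49, 4: 81}

# decoder complexity grows roughly with the 4th power of the search radius
print((4 ** 4) / (3 ** 4))  # about 3.16x more decoder compute for r=4 vs r=3
```

This is why simply keeping r = 4 with a dense window is expensive, and why CDDC's compact 53-channel cost volume (Section 2.2) matters.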

2. Specific algorithm

First of all, let's look at the overall framework of the paper. Given two temporally adjacent input images $I_1, I_2 \in \mathbb{R}^{H \times W \times 3}$, the proposed FastFlowNet uses a coarse-to-fine residual structure to estimate gradually refined optical flow fields $F_l \in \mathbb{R}^{H_l \times W_l \times 2}$, $l = 6, 5, \dots, 2$. The structure has been extensively improved to speed up inference by appropriately reducing parameters and computational cost. To this end, the double-convolution feature pyramid of PWC-Net [13] is first replaced with the head-enhanced pooling pyramid, which strengthens high-resolution pyramid features while reducing model size. On top of this, a new center dense dilated correlation layer is proposed to construct a compact cost volume while maintaining a large search radius. Finally, optical flow is regressed with the new shuffle block decoder at each pyramid level at a significantly lower computational cost.
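The coarse-to-fine residual loop can be sketched as follows. This is a structural sketch only: the decoder output is stubbed with zeros, and `upsample2x` is a nearest-neighbour stand-in for the interpolation used in practice.

```python
import numpy as np

def upsample2x(flow):
    # nearest-neighbour 2x spatial upsampling; flow values are doubled
    # because displacements are measured in pixels of the finer scale
    up = np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)
    return up * 2.0

H, W = 64, 128
flow = np.zeros((H // 64, W // 64, 2))      # level-6 flow initialized to zero
for level in [6, 5, 4, 3, 2]:
    if level < 6:
        flow = upsample2x(flow)             # up_2(F^{l+1})
    # in the real network: warp f2, build cost volume, run the SBD decoder
    residual = np.zeros_like(flow)          # stub for the decoded residual flow
    flow = flow + residual                  # F_l = up_2(F_{l+1}) + residual
print(flow.shape)  # (16, 32, 2): quarter-resolution flow at level 2
```

The final flow is produced at level 2 (1/4 resolution) and upsampled to full resolution, as is standard for coarse-to-fine networks.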
The structural details of the model are shown in Table 1

2.1 Head Enhanced Pooling Pyramid (HEPP)

Large channel numbers at low resolutions result in a large number of parameters. Furthermore, since the low-resolution pyramid features are only responsible for estimating the coarse flow field, which is refined later in the coarse-to-fine structure, parameter-free pooling is sufficient there. Therefore, the high-level convolutional feature pyramid is combined with pooling pyramids at the lower levels to obtain the strengths of both, as shown in Figure 2. On the other hand, the high-resolution pyramid features in PWC-Net are relatively shallow, because each pyramid level contains only two 3×3 convolutions and thus has a small receptive field. Therefore, one more convolution is added at the higher-resolution levels to enhance the pyramid features at a small extra cost. By balancing computation across different scales, the head-enhanced pooling pyramid feature extractor is obtained, as shown at the top of Table 1. Like FlowNetC, PWC-Net, and LiteFlowNet, HEPP generates 6 pyramid levels, from 1/2 resolution (level 1) to 1/64 resolution (level 6).
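The pyramid structure can be sketched as below. For simplicity, average pooling stands in for the strided convolutions at levels 1–3 as well; in the real HEPP only the coarse levels (4–6) are parameter-free pooling, while the shallow levels use (head-enhanced) convolutions.

```python
import numpy as np

def avg_pool2x(x):
    # parameter-free 2x2 average pooling, halving spatial resolution
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

img = np.random.rand(64, 128, 3)
pyramid = []
feat = img
for level in range(1, 4):      # levels 1-3: convolutions in the real network
    feat = avg_pool2x(feat)    # pooling used here only as a stand-in
    pyramid.append(feat)
for level in range(4, 7):      # levels 4-6: parameter-free pooling, as in HEPP
    feat = avg_pool2x(feat)
    pyramid.append(feat)

for lv, f in enumerate(pyramid, start=1):
    print(lv, f.shape)         # level 1: 1/2 resolution ... level 6: 1/64
```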

2.2 Center Dense Dilated Correlation (CDDC)

A key step in modern optical flow architectures is to compute feature correspondences via an inner-product based correlation layer [10]. Given the two pyramid features $f_1^l, f_2^l$ of level $l$, as in many coarse-to-fine residual methods, a warp operation based on bilinear interpolation [12], [11] is first applied: the second feature $f_2^l$ is warped according to the $2\times$ upsampled optical flow field $up_2(F^{l+1})$. The warped target feature $f_{warp}^l$ significantly reduces the displacement caused by large motions, which narrows the search area and simplifies the task of estimating the relatively small residual flow. Recent models [13], [14] build cost volumes by correlating the source features with the corresponding warped target features within a local square region, which can be formulated as
$$c_l(\mathbf{x}, \mathbf{d}) = \frac{1}{N} \, f_1^l(\mathbf{x}) \cdot f_{warp}^l(\mathbf{x} + \mathbf{d}), \quad \|\mathbf{d}\|_\infty \le r$$

where $\mathbf{x}$ and $\mathbf{d}$ represent spatial coordinates and offset coordinates, $N$ is the channel length of the input features, $r$ is the search radius, and $\cdot$ denotes the inner product.
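A minimal NumPy sketch of this dense square-window correlation (the real layer operates on GPU feature maps; this only mirrors the formula above):

```python
import numpy as np

def cost_volume(f1, f2_warp, r=4):
    """Inner-product correlation over a (2r+1) x (2r+1) square window,
    normalized by the feature channel length N."""
    H, W, N = f1.shape
    pad = np.pad(f2_warp, ((r, r), (r, r), (0, 0)))   # zero-pad the target
    out = np.empty((H, W, (2 * r + 1) ** 2))
    k = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = pad[r + dy : r + dy + H, r + dx : r + dx + W]
            out[..., k] = (f1 * shifted).sum(axis=-1) / N
            k += 1
    return out

f1 = np.random.rand(8, 8, 16)
f2_warp = np.random.rand(8, 8, 16)
cv = cost_volume(f1, f2_warp, r=4)
print(cv.shape)  # (8, 8, 81): 81 cost-volume channels for r = 4
```

Channel 40 (the window center, d = 0) is simply the normalized inner product of the two features at the same location.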

As shown in Figure 3a, many flow networks [13], [18], [15], [16] set $r = 4$, and the resulting large computational burden hinders low-power applications. A simple way to reduce the cost is to shrink $r$: setting $r = 3$, as in Figure 3b, reduces the cost volume from 81 to 49 feature channels, but at the price of a smaller perception range and lower accuracy. Unlike ASPP, which uses parallel atrous convolutions to obtain multi-scale context information, the proposed CDDC builds a large-radius cost volume with reduced computation. In FastFlowNet, it outputs 53 feature channels with a computational budget similar to the traditional $r = 3$ setting. The motivation is that the residual flow distribution is concentrated on small motions. Experimental results show that CDDC performs better than simply reducing the search radius.
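One way to see where 53 channels can come from is the offset pattern below: dense stride-1 sampling in the centre ($\|d\|_\infty \le 2$) plus checkerboard (dilated) sampling on the outer rings. This pattern is an illustrative assumption that happens to reproduce the 53-channel count; the exact sampling layout in the paper's figure may differ.

```python
# hypothetical CDDC-style offset pattern within search radius r = 4:
# dense centre (|d| <= 2) plus checkerboard sampling on the outer rings
offsets = [(dy, dx)
           for dy in range(-4, 5)
           for dx in range(-4, 5)
           if max(abs(dy), abs(dx)) <= 2 or (dy + dx) % 2 == 0]
print(len(offsets))  # 53 correlation channels, vs 81 for a dense r=4 window
```

Each retained offset contributes one cost-volume channel, so the decoder sees a compact 53-channel input while the effective search radius stays at 4.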

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/128170719