[Paper Description] UPFlow: Upsampling Pyramid for Unsupervised Optical Flow Learning (CVPR 2021)

1. Brief description of the paper

1. First author: Kunming Luo

2. Year of publication: 2021

3. Published at: CVPR (conference)

4. Keywords: optical flow, upsampling pyramid, unsupervised, deep learning

5. Motivation: There are two main problems in pyramid learning, referred to as the bottom-up problem and the top-down problem. The bottom-up problem concerns the upsampling module in the pyramid. Existing methods typically employ simple bilinear or bicubic upsampling, which interpolates across motion edges and thus produces blurring artifacts in the predicted optical flow. As the scale becomes finer, these errors are propagated and accumulated. The top-down problem concerns pyramidal supervision. Previous leading unsupervised methods usually add a guidance loss only on the final output of the network, leaving the intermediate pyramid levels without guidance. Due to this lack of training guidance, coarse-level estimation errors accumulate and corrupt the finer-level estimates.

6. Goal: solve the two problems above.

7. Core idea: an improved pyramid learning framework for unsupervised optical flow estimation, with two key components:

  1. We propose a self-guided upsampling module to tackle the interpolation problem in the pyramid network, which can generate sharp motion edges.
  2. We propose a pyramid distillation loss to enable robust supervision for unsupervised learning of coarse pyramid levels.

8. Experimental results:

We achieve superior performance over state-of-the-art unsupervised methods by a relatively large margin, validated on multiple leading benchmarks.

9. Paper and code download:

https://github.com/coolbeam/UPFlow_pytorch

https://openaccess.thecvf.com/content/CVPR2021/papers/Luo_UPFlow_Upsampling_Pyramid_for_Unsupervised_Optical_Flow_Learning_CVPR_2021_paper.pdf

2. Implementation process

1. Network structure

The network pipeline is shown in the figure. The network can be divided into two stages: pyramid encoding and pyramid decoding.

In the first stage, feature pairs at multiple scales are extracted from the input images by convolutional layers. In the second stage, the optical flow is estimated in a coarse-to-fine manner using the decoder module D and the upsampling module S↑. The decoder module D has the same structure as in UFlow, including feature warping, cost volume construction via a correlation layer, cost volume normalization, and flow decoding with fully convolutional layers. The parameters of D and S↑ are shared across all pyramid levels. In summary, the pyramid decoding stage can be expressed as:

V_f^i = D(F_t^i, F_{t+1}^i, V̂_f^{i−1})    (1)
V̂_f^{i−1} = S↑(V_f^{i−1})    (2)

Here i ∈ {0, 1, …, N} indexes the pyramid levels (the smaller the index, the coarser the level), F_t^i and F_{t+1}^i are the features extracted from I_t and I_{t+1} at level i, and V̂_f^{i−1} is the upsampled flow from level i−1. In practice, N is set to 4 as a trade-off between accuracy and efficiency. The final optical flow result is obtained by directly upsampling the output of the last pyramid level. Notably, in Equation 2, previous methods usually use bilinear interpolation to upsample the flow field, which can produce noisy or blurred results at object boundaries. This paper proposes a self-guided upsampling module to solve this problem.
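As a rough illustration of this coarse-to-fine loop, here is a minimal PyTorch sketch; the `decoder` and `upsampler` arguments stand in for the shared modules D and S↑, and all names and shapes are assumptions rather than the authors' code:

```python
import torch

def decode_pyramid(feats_t, feats_tp1, decoder, upsampler):
    """Coarse-to-fine decoding. feats_t[i], feats_tp1[i]: features of I_t and
    I_{t+1} at pyramid level i (index 0 = coarsest level)."""
    b, _, h, w = feats_t[0].shape
    flow = torch.zeros(b, 2, h, w, device=feats_t[0].device)  # zero init at the top
    flows = []
    for i in range(len(feats_t)):
        flow = decoder(feats_t[i], feats_tp1[i], flow)   # D: refine flow at level i
        flows.append(flow)                               # kept for pyramid-level losses
        if i + 1 < len(feats_t):
            # S↑: self-guided upsampling to the next (finer) level
            flow = upsampler(flow, feats_t[i + 1], feats_tp1[i + 1])
    return flows
```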

2. Self-guided upsampling module

The left of the figure shows the case of bilinear interpolation. Four points are shown, representing four flow vectors belonging to two motion sources, marked red and blue respectively. The missing regions are bilinearly interpolated without any semantic guidance, so a mixed result is produced in the red motion region, i.e., cross-edge interpolation. To alleviate this problem, a self-guided upsampling module (SGU) is proposed to change the interpolation source points by means of an interpolation flow. The main idea of SGU is shown on the right side of the figure: interpolation is first performed using nearby points from the same (red) motion region, and the learned interpolation flow then moves the result to the target position (green arrow in the figure). This avoids the cross-edge interpolation problem.

By design, to keep the interpolation in flat regions unchanged and apply the interpolation flow only at motion boundaries, a per-pixel weight map is learned to indicate where the interpolation flow should be disabled. The upsampling in SGU is therefore a weighted combination of the bilinearly upsampled flow and the corrected flow obtained by warping the upsampled flow with the interpolation flow. The detailed structure of the SGU module is shown in the figure.

Given a low-resolution flow V_f^{i−1} from level i−1, a higher-resolution initial flow Ṽ_f^{i−1} is first generated by bilinear interpolation:

Ṽ_f^{i−1}(p) = Σ_{k ∈ N(p/s)} w(p/s, k) · V_f^{i−1}(k)

where p is a pixel coordinate at the higher resolution, s is the scale factor, N(p/s) denotes the four pixels adjacent to p/s, and w(p/s, k) is the bilinear interpolation weight. Then, an interpolation flow U_f^i is computed from the features F_t^i and F_{t+1}^i and used to warp Ṽ_f^{i−1}:

V̄_f^{i−1}(p) = Ṽ_f^{i−1}(p + U_f^i(p))

where V̄_f^{i−1} is the result of warping Ṽ_f^{i−1} with the interpolation flow U_f^i. Since interpolation blur only occurs at object boundaries, there is no need to learn an interpolation flow in flat regions. Therefore, an interpolation map B_f^i is used to explicitly restrict the model to learning the interpolation flow only in motion boundary regions. The final upsampling result is the fusion of V̄_f^{i−1} and Ṽ_f^{i−1}:

V̂_f^{i−1} = V̄_f^{i−1} ⊙ B_f^i + Ṽ_f^{i−1} ⊙ (1 − B_f^i)

where V̂_f^{i−1} is the output of the module and ⊙ denotes element-wise multiplication.
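A minimal PyTorch sketch of this weighted fusion, assuming the interpolation flow and map have already been predicted; the `warp` helper follows the standard grid_sample backward-warping recipe and the flow-magnitude rescaling is an assumption about conventions, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp tensor x by a pixel-offset flow using grid_sample."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0   # normalize x coords to [-1, 1]
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0   # normalize y coords to [-1, 1]
    return F.grid_sample(x, torch.stack((gx, gy), dim=-1), align_corners=True)

def sgu_fuse(flow_lr, interp_flow, interp_map, scale=2):
    """Self-guided upsampling: blend bilinear and warped flows with the map B."""
    flow_bilinear = F.interpolate(flow_lr, scale_factor=scale,
                                  mode="bilinear", align_corners=True) * scale
    flow_warped = warp(flow_bilinear, interp_flow)
    # interp_map ~ 1 at motion boundaries (use warped flow), ~ 0 in flat regions
    return flow_warped * interp_map + flow_bilinear * (1.0 - interp_map)
```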

To generate the interpolation flow U_f^i and the interpolation map B_f^i, a dense block with 5 convolutional layers is used. Specifically, the feature map F_t^i and the warped feature map F_{t+1}^i are concatenated as the input of the dense block. The numbers of kernels in the convolutional layers of the dense block are 32, 32, 32, 16, and 8, respectively. The output of the dense block is a 3-channel tensor: the first two channels are taken as the interpolation flow, and the last channel is passed through a sigmoid layer to form the interpolation map. Note that no supervision is introduced for learning the interpolation flow and interpolation map. The figure below shows an example from the MPI Sintel Final dataset, where SGU produces sharper results at object boundaries than the bilinear method. Interestingly, the self-learned interpolation map is almost an edge map, and the interpolation flow is also concentrated around object edges.
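Under the same assumptions, the dense block might be sketched as follows; layer widths follow the description above, while the activation, padding, and the channel count of the concatenated input (`in_ch`) are guesses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolationBlock(nn.Module):
    """Dense block (conv layers with 32, 32, 32, 16, 8 kernels) predicting the
    2-channel interpolation flow U and the 1-channel interpolation map B."""
    def __init__(self, in_ch):
        super().__init__()
        convs, c = [], in_ch
        for out_c in (32, 32, 32, 16, 8):
            convs.append(nn.Conv2d(c, out_c, kernel_size=3, padding=1))
            c += out_c  # dense connectivity: each layer sees all previous outputs
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(c, 3, kernel_size=3, padding=1)  # 3-channel output

    def forward(self, feat_t, feat_tp1_warped):
        feats = [torch.cat((feat_t, feat_tp1_warped), dim=1)]
        for conv in self.convs:
            feats.append(F.leaky_relu(conv(torch.cat(feats, dim=1)), 0.1))
        out = self.head(torch.cat(feats, dim=1))
        interp_flow = out[:, :2]                  # first two channels: flow U
        interp_map = torch.sigmoid(out[:, 2:3])   # last channel -> map B in (0, 1)
        return interp_flow, interp_map
```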

3. Pyramid level loss guidance

Several losses are used to train the pyramid network: an unsupervised optical flow loss on the final output flow, and a pyramid distillation loss on the intermediate flows at the different pyramid levels.

3.1 Unsupervised optical flow loss

To learn the flow estimation model H in an unsupervised setting, a photometric loss L_m based on the photometric constancy assumption is used: the same object should have similar intensities in I_t and I_{t+1}. However, some regions may be occluded by moving objects, so their corresponding regions are not present in the other image. Since the photometric loss does not hold in these regions, L_m is added only in the non-occluded regions. The photometric loss can be expressed as:

L_m = Σ_p Ψ(I_t(p) − I_{t+1}(p + V_f(p))) · M_t(p) / Σ_p M_t(p)

where M_t is the occlusion mask and Ψ is the robust penalty function Ψ(x) = (|x| + ω)^q, with q = 0.4 and ω = 0.01. The occlusion mask M_t is estimated by a forward-backward consistency check; 1 denotes non-occluded pixels and 0 denotes occluded pixels.
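A minimal sketch of this masked photometric loss in PyTorch; the forward-backward criterion shown is the common UnFlow-style formulation, which this summary does not spell out, and all names are illustrative:

```python
import torch

def robust_penalty(x, q=0.4, w=0.01):
    """Psi(x) = (|x| + w)^q, the robust penalty from the paper."""
    return (x.abs() + w) ** q

def photometric_loss(img_t, img_tp1_warped, occ_mask):
    """Masked photometric loss: only non-occluded pixels (mask == 1) contribute."""
    diff = robust_penalty(img_t - img_tp1_warped).mean(dim=1, keepdim=True)
    return (diff * occ_mask).sum() / (occ_mask.sum() + 1e-6)

def occlusion_mask(flow_fw, flow_bw_warped, a1=0.01, a2=0.5):
    """Forward-backward consistency check: a pixel is marked occluded when the
    forward flow and the backward flow warped to frame t do not cancel out."""
    sq = (flow_fw ** 2 + flow_bw_warped ** 2).sum(dim=1, keepdim=True)
    mismatch = ((flow_fw + flow_bw_warped) ** 2).sum(dim=1, keepdim=True)
    return (mismatch < a1 * sq + a2).float()  # 1 = non-occluded, 0 = occluded
```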

To further improve performance, several previously proven unsupervised components are also added, including a smoothness loss L_s, a census loss L_c, an augmentation regularization loss L_a, and a boundary dilated warping loss L_b. For brevity, these components are not described here.

3.2 Pyramid distillation loss

To learn the intermediate flow of each pyramid level, the final output flow is distilled to the intermediate flows through a pyramid distillation loss L_d. Intuitively, one could instead compute all the unsupervised losses on each intermediate output. However, photometric consistency is not accurate enough for optical flow learning at low resolutions, so applying the unsupervised losses at intermediate levels, especially at the coarser pyramid levels, is inappropriate. Instead, the final output flow is used as a pseudo label, and a supervised loss rather than an unsupervised loss is added on the intermediate outputs.

To compute L_d, the final output flow is directly downsampled and its difference from each intermediate flow is measured. Since L_m excludes occluded regions, the flow estimates in those regions are noisy. To eliminate the influence of these noisy regions in the pseudo labels, the occlusion mask M_t is also downsampled and the occluded regions are excluded from L_d. The pyramid distillation loss can thus be expressed as:

L_d = Σ_i Ψ(S↓(V_f, s_i) − V_f^i) ⊙ S↓(M_t, s_i)

where s_i is the scale factor of the i-th pyramid level and S↓ is the downsampling function. Finally, the total training loss L is expressed as:

L = L_m + λ_d L_d + λ_s L_s + λ_c L_c + λ_a L_a + λ_b L_b

where λ_d, λ_s, λ_c, λ_a, and λ_b are hyperparameters, set to λ_d = 0.01, λ_s = 0.05, λ_c = 1, λ_a = 0.5, and λ_b = 1.
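A minimal sketch of the distillation term under these definitions, reusing robust_penalty from the photometric-loss sketch above; bilinear downsampling and the flow-magnitude rescaling are assumptions, and the other loss terms are left as placeholders:

```python
import torch.nn.functional as F

def pyramid_distillation_loss(final_flow, inter_flows, occ_mask):
    """Distill the (detached) final flow to each intermediate level as pseudo labels."""
    teacher = final_flow.detach()   # no gradient flows back through the pseudo label
    loss = 0.0
    for flow_i in inter_flows:
        h, w = flow_i.shape[-2:]
        scale = w / final_flow.shape[-1]
        # downsample the flow and rescale its magnitude to the coarse resolution
        flow_down = F.interpolate(teacher, size=(h, w), mode="bilinear",
                                  align_corners=True) * scale
        mask_down = F.interpolate(occ_mask, size=(h, w), mode="bilinear",
                                  align_corners=True)
        err = robust_penalty(flow_down - flow_i).sum(dim=1, keepdim=True)
        loss = loss + (err * mask_down).sum() / (mask_down.sum() + 1e-6)
    return loss

# Total loss with the weights given above:
# L = L_m + 0.01 * L_d + 0.05 * L_s + 1.0 * L_c + 0.5 * L_a + 1.0 * L_b
```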

4. Experiment

4.1 Comparison with state-of-the-art methods
