[Paper Brief] Global Matching with Overlapping Attention for Optical Flow Estimation (CVPR 2022)

1. Brief introduction of the paper

1. First author: Shiyu Zhao

2. Year of publication: 2022

3. Published venue: CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition)

4. Keywords: optical flow, deep learning, self-attention, matching optimization

5. Motivation: Earlier energy-based optimization methods often fail on large displacements because they cannot recover long-range motion correspondences. Matching-optimization methods address this by adding a matching step before optimization to find correspondences between pixels or patches, but their matching relies on complex hand-crafted features and is therefore slow and inaccurate. Direct-regression neural-network methods, in turn, do not explicitly capture long-range motion correlations and so handle large motions poorly.

6. Work goal: Inspired by the improvement that the matching step gives matching-optimization methods over purely energy-based optimization, this paper introduces a matching step into direct-regression methods to explicitly handle large displacements.

7. Core idea: Based on this idea, a new optical flow estimation framework is proposed: the global matching optical flow network GMFlowNet, which introduces a global matching step before direct regression.

  1. We introduce a global matching step to explicitly handle large displacement optical flow estimations for direct-regression methods. With typical 4D cost volumes, our global matching is effective and efficient.
  2. We propose a well-designed Patch-based OverLapping Attention (POLA) to address local ambiguities in matching and demonstrate its effectiveness via extensive experiments.
  3. Following traditional matching-optimization frameworks, we propose a learning-based matching-optimization framework named GMFlowNet that achieves state-of-the-art performance on standard benchmarks.

8. Experimental results:

  1. Extensive experiments demonstrate that GMFlowNet significantly outperforms the popular optimization-only model RAFT and achieves state-of-the-art performance.
  2. GMFlowNet provides better flow estimations especially for large motion areas and textureless regions.
  3. We thoroughly investigate our global matching and POLA, showing that they are both effective and efficient.

9. Paper download:

https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Global_Matching_With_Overlapping_Attention_for_Optical_Flow_Estimation_CVPR_2022_paper.pdf

https://github.com/xiaofeng94/GMFlowNet

2. Implementation process

1. Framework comparison

The main frameworks for optical flow estimation: (a) Traditional matching-optimization methods first establish sparse matches to obtain a coarse optical flow, then refine it with energy-based optimization. (b) Direct-regression methods mimic energy-based optimization with learned parameters; they can be viewed as learning-based optimization without matching. (c) Our framework introduces matching before learning-based optimization, further improving performance.

2. Overview of GMFlowNet

The new framework, GMFlowNet, introduces a simple and effective global matching step before learning-based optimization. It consists of three modules: large-context feature extraction, global matching, and learning-based optimization.

3. Large-context feature extraction

Large-context information is key to matching at locally ambiguous positions, such as repetitive patterns and textureless regions. GMFlowNet first uses three consecutive convolutional layers (3-Convs) to extract initial features, then applies Transformer blocks to incorporate long-range information. Because the feature maps are large, plain self-attention over the entire map is computationally prohibitive, so a local attention module tailored to optical flow estimation, POLA, is designed.

Patch-based overlapping attention. POLA divides the feature map into M×M non-overlapping patches and processes each patch together with its 8 adjacent patches using multi-head attention. Given a vectorized patch $P$ and its surrounding 3×3 patch window $S$, the $i$-th head first projects $P$ and $S$ to $d_k$ dimensions with linear projections, giving $P_i$ and $S_i$; attention between $P_i$ and $S_i$ yields the head output $H_i$; finally, the outputs of all heads are concatenated into $H$, which is projected back to $d$ dimensions as the result $O \in \mathbb{R}^{M^2 \times d}$:

$$H_i = \mathrm{softmax}\!\left(\frac{P_i S_i^{\top}}{\sqrt{d_k}}\right) S_i, \qquad O = \mathcal{L}\big([H_1; H_2; \dots; H_n]\big),$$

where $n$ is the number of heads and $\mathcal{L}$ is a linear projection. In the experiments, $n = 8$ and $d_k = d/n$.
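A minimal single-head NumPy sketch of this attention, purely for illustration: random matrices stand in for the learned linear projections, and the multi-head concatenation and final output projection are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pola_attention(feat, M=4, d_k=8):
    """Single-head sketch of POLA on a feature map feat of shape (H, W, d).
    Each M x M patch attends to the 3M x 3M window formed by itself and
    its 8 neighboring patches (the 'overlapping' part)."""
    H, W, d = feat.shape
    rng = np.random.default_rng(0)
    # random stand-ins for the learned query/key/value projections
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    pad = np.pad(feat, ((M, M), (M, M), (0, 0)))  # zero-pad so border patches have neighbors
    out = np.zeros((H, W, d_k))
    for i in range(0, H, M):
        for j in range(0, W, M):
            P = feat[i:i+M, j:j+M].reshape(-1, d)     # query patch: M*M tokens
            S = pad[i:i+3*M, j:j+3*M].reshape(-1, d)  # 3x3 patch window: 9*M*M tokens
            A = softmax(P @ Wq @ (S @ Wk).T / np.sqrt(d_k))
            out[i:i+M, j:j+M] = (A @ (S @ Wv)).reshape(M, M, d_k)
    return out
```

Each patch's queries see all tokens of the 3×3 neighborhood in a single attention block, which is the key difference from Swin's two-block window shifting.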

Why POLA is an improved attention method. Swin Transformer alternates fixed-window and shifted-window attention, so exchanging information between patches requires two separate attention blocks; this loses information and hurts matching, which relies heavily on contextual information to resolve local ambiguity. POLA exchanges information between neighboring patches within a single block, with less information loss. Compared with pixel-wise sliding-window attention, POLA consumes less memory, can be implemented efficiently on existing deep learning platforms, and achieves better performance by arranging features patch-wise.

4. Global matching

Contextual features F1 and F2 are extracted from the first input image I1 and the second input image I2, respectively. A 4D cost volume is then constructed from F1 and F2, global matching is computed on the cost volume, and the resulting coarse optical flow is output as the initial state of the optimization.

4D cost volume calculation. The 4D cost volume is built at 1/8 of the input resolution.
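The all-pairs cost volume can be written as a single einsum; any normalization (e.g. dividing by the feature dimension) is omitted in this sketch.

```python
import numpy as np

def cost_volume(F1, F2):
    """All-pairs 4D cost volume C[i, j, u, v] = <F1[i, j], F2[u, v]>,
    for feature maps F1, F2 of shape (H, W, d) at 1/8 input resolution."""
    return np.einsum('ijd,uvd->ijuv', F1, F2)
```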

Match confidence calculation. The cost volume is converted to a matching confidence with a dual-softmax operator, which is efficient and allows the matching to be supervised directly. The matching confidence Pc is computed as:

$$P_c(i,j,u,v) = \mathrm{softmax}\big(C(i,j,\cdot,\cdot)\big)_{(u,v)} \cdot \mathrm{softmax}\big(C(\cdot,\cdot,u,v)\big)_{(i,j)},$$

where C(i,j,·,·) denotes the costs over all (u,v) for a given (i,j), and C(·,·,u,v) denotes the costs over all (i,j) for a given (u,v).
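A NumPy sketch of the dual-softmax: one softmax over all (u, v) for each (i, j), one over all (i, j) for each (u, v), multiplied elementwise.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def match_confidence(C):
    """Dual-softmax matching confidence from a 4D cost volume
    C of shape (H, W, H, W)."""
    H, W, _, _ = C.shape
    Cf = C.reshape(H * W, H * W)             # rows: I1 pixels, cols: I2 pixels
    Pc = softmax(Cf, axis=1) * softmax(Cf, axis=0)
    return Pc.reshape(H, W, H, W)
```

A match is confident only when it wins the softmax in both directions, which suppresses one-sided (ambiguous) matches.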

Match selection and flow generation. According to the matching confidence Pc, the match of I1 at (i, j) is obtained as:

$$M_{1\to2}(i,j) = \mathop{\arg\max}_{(u,v)} P_c(i,j,u,v),$$

and M2→1 for I2 is obtained in the same way. Only mutually consistent matches, i.e. those on which M1→2 and M2→1 agree, are kept, defining the match set Mc:

$$M_c = \big\{(i,j) \;\big|\; M_{2\to1}\big(M_{1\to2}(i,j)\big) = (i,j)\big\}.$$

The coarse optical flow is then calculated as:

$$f_0(i,j) = \begin{cases} M_{1\to2}(i,j) - (i,j), & (i,j) \in M_c, \\ 0, & \text{otherwise.} \end{cases}$$
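The argmax selection, mutual-consistency check, and coarse-flow generation can be sketched as follows; the (dx, dy) channel order is a convention chosen here for illustration, not taken from the paper.

```python
import numpy as np

def coarse_flow(Pc):
    """Mutual-consistency matching from confidence Pc of shape (H, W, H, W).
    Returns the coarse flow (H, W, 2) and a boolean mask of matched pixels."""
    H, W, _, _ = Pc.shape
    flat = Pc.reshape(H * W, H * W)
    m12 = flat.argmax(axis=1)              # best I2 match for each I1 pixel
    m21 = flat.argmax(axis=0)              # best I1 match for each I2 pixel
    mutual = m21[m12] == np.arange(H * W)  # keep matches that agree both ways
    flow = np.zeros((H * W, 2))
    ij = np.arange(H * W)[mutual]
    uv = m12[mutual]
    flow[ij, 0] = uv % W - ij % W          # dx: column displacement
    flow[ij, 1] = uv // W - ij // W        # dy: row displacement
    return flow.reshape(H, W, 2), mutual.reshape(H, W)
```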

5. Optimization

The existing update operator of RAFT is used as the optimization. It predicts an incremental flow and adds it to the current flow estimate, iterating these additions and outputting a sequence of flow predictions. In this paper, the coarse optical flow f0 replaces the zero-flow initialization used in RAFT. The optimization part of GMFlowNet is replaceable; RAFT is adopted because it achieves the best performance, and any future optimization module can be plugged in here for further improvement.
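A toy sketch of this residual-update loop; the lambda below is a stand-in for RAFT's learned ConvGRU update block, purely to illustrate how the coarse flow f0 seeds the iteration.

```python
import numpy as np

def refine(f0, num_iters, update_block):
    """Iterative residual refinement in the style of RAFT's update operator:
    each step predicts a flow increment and adds it to the current estimate,
    producing a sequence of predictions. f0 is the coarse flow from global
    matching rather than RAFT's all-zero initialization."""
    f = f0.copy()
    flows = []
    for _ in range(num_iters):
        f = f + update_block(f)  # real RAFT learns this from correlation lookups
        flows.append(f.copy())
    return flows

# toy update operator: step halfway toward a fixed target flow
target = np.ones((4, 4, 2)) * 3.0
flows = refine(np.zeros((4, 4, 2)), 8, lambda f: 0.5 * (target - f))
```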

6. Supervision

Matching loss. The ground-truth optical flow fgt is rounded to pixel level to collect the ground-truth match set Mgt: a region visible in both frames is considered a match, and occluded regions are marked non-matching. As supervision for feature matching, the negative log-likelihood of the confidence Pc over the matched positions is minimized:

$$L_{match} = -\frac{1}{|M_{gt}|}\sum_{(i,j,u,v)\in M_{gt}} \log P_c(i,j,u,v).$$
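The matching loss amounts to averaging the negative log-confidence over the ground-truth matches; the epsilon floor below is an implementation detail added here for numerical safety, not from the paper.

```python
import numpy as np

def matching_loss(Pc, Mgt):
    """Negative log-likelihood of the matching confidence over the
    ground-truth match set Mgt, given as (i, j, u, v) index tuples."""
    eps = 1e-8  # avoids log(0); not specified by the paper
    return -np.mean([np.log(Pc[i, j, u, v] + eps) for (i, j, u, v) in Mgt])
```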

Optimization loss. Following RAFT, the l1 distance between each predicted flow and fgt supervises the optimization. With N iterative predictions f1, …, fN and a decay weight γ, the optimization loss is defined as:

$$L_{opt} = \sum_{k=1}^{N} \gamma^{\,N-k}\, \lVert f_{gt} - f_k \rVert_1.$$

The overall loss function of GMFlowNet is:

$$L = L_{opt} + \lambda\, L_{match},$$

where λ balances the two loss terms.
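The two losses combine as sketched below; γ = 0.8 is the value RAFT uses, while the λ value here is a hypothetical placeholder, not taken from the paper.

```python
import numpy as np

def optimization_loss(flow_preds, f_gt, gamma=0.8):
    """RAFT-style sequence loss: l1 distance of every intermediate flow
    prediction to ground truth, exponentially weighted toward later
    iterations (the last prediction gets weight gamma**0 = 1)."""
    N = len(flow_preds)
    return sum(gamma ** (N - 1 - k) * np.abs(f - f_gt).mean()
               for k, f in enumerate(flow_preds))

def total_loss(flow_preds, f_gt, L_match, lam=0.1):
    # lam is a hypothetical balancing weight for illustration only
    return optimization_loss(flow_preds, f_gt) + lam * L_match
```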

7. Experiment

7.1. Quantitative evaluation

Performance under different displacements. The Sintel training set is split into s0-10, s10-40, and s40+ subsets by displacement magnitude. GMFlowNet is trained on the C+T datasets and evaluated on each subset against RAFT as the baseline, with AEPE (average end-point error) as the metric. GMFlowNet improves most in regions with extremely large displacements, showing that global matching with large context is beneficial for handling larger motions.

Cross-domain evaluation on standard benchmarks. Models are trained on the C+T datasets and evaluated on the Sintel (S) and KITTI (K) datasets. The results show that GMFlowNet generalizes well, and the improvement in generalization is attributed to global matching.

7.2. Qualitative assessment

Optical flow visualization. GMFlowNet provides better predictions for locally ambiguous regions such as textureless areas.

Cost volume visualization. The confidence peak of GMFlowNet's cost volume is much higher than that of RAFT, indicating more discriminative matching.

7.3. Ablation experiments

7.4. Efficiency

Comparing RAFT against RAFT with global matching, adding global matching makes inference slightly slower but improves performance significantly. Compared with +Swin, GMFlowNet takes an extra 0.078 seconds, which is acceptable given the performance improvement.


Origin blog.csdn.net/qq_43307074/article/details/130659290