SelFlow: Self-Supervised Learning of Optical Flow

       This paper proposes a self-supervised learning method for optical flow. The method extracts reliable flow estimates from non-occluded pixels and uses these predictions as ground truth to learn optical flow for hallucinated occlusions.
       A simple CNN is also designed to exploit temporal information from multiple frames for better flow estimation. The method yields the best performance among unsupervised optical flow learning approaches on challenging benchmarks such as MPI Sintel, KITTI 2012, and KITTI 2015.
       Optical flow estimation is a core building block of various computer vision systems. Due to occlusions, accurate flow estimation is still an open problem. Traditional methods minimize energy functions that encourage association of visually similar pixels and penalize incoherent motion, propagating flow estimates from non-occluded to occluded pixels. However, this family of methods is usually time-consuming and not suitable for real-time applications.
       Recent research learns to estimate optical flow end-to-end from images using convolutional neural networks (CNNs). However, training a fully supervised CNN requires a large amount of labeled training data, which is extremely difficult to obtain for optical flow, especially in the presence of occlusions. The size of the training data is therefore a critical bottleneck for optical flow estimation.
       In the absence of large-scale real-world annotations, existing methods turn to pre-training on synthetic labeled datasets, followed by fine-tuning on small annotated datasets. However, there is usually a large gap between the distribution of synthetic data and that of natural scenes. To train a stable model, one must carefully follow specific learning schedules across different datasets.
       The basic idea of unsupervised optical flow learning methods that benefit from unlabeled data is to warp the target image towards the reference image according to the estimated optical flow, and then use a photometric loss to minimize the difference between the reference image and the warped target image. This idea works for non-occluded pixels but provides misleading information for occluded pixels. Recent approaches suggest excluding occluded pixels when computing the photometric loss, or adding spatial and temporal smoothness terms to regularize the flow estimates. DDFlow [26] proposes a data distillation method that employs random cropping to create occlusions for self-supervision. However, these methods do not generalize well to all natural occlusions, so there is still a large performance gap between unsupervised methods and state-of-the-art fully supervised methods.
       (Data distillation: perform multiple transformations on unlabeled data (similar to data augmentation), predict on each with a single model, and then ensemble the predictions to automatically generate labels.)
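The occlusion-masked photometric loss that these methods build on is simple to state in code. Below is a minimal single-channel numpy sketch; the Charbonnier-style robust penalty and the `eps` value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def photometric_loss(ref, warped, occ_mask, eps=0.01):
    """Robust photometric loss averaged over non-occluded pixels only.

    ref, warped: (H, W) float images (reference and warped target).
    occ_mask:    (H, W) map, 1 = occluded in the target frame, 0 = visible.
    """
    diff = np.sqrt((ref - warped) ** 2 + eps ** 2)  # robust per-pixel error
    visible = 1.0 - occ_mask
    # Occluded pixels are excluded, so they contribute no misleading signal.
    return float((diff * visible).sum() / max(visible.sum(), 1.0))

ref = np.ones((4, 4))
warped = np.ones((4, 4))           # a perfect reconstruction
occ = np.zeros((4, 4))
occ[0, 0] = 1.0                    # pretend one pixel is occluded
loss = photometric_loss(ref, warped, occ)
```

With a perfect reconstruction, the loss reduces to the `eps` floor of the robust penalty; masked pixels never contribute.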
       This paper demonstrates that a self-supervised approach can learn to estimate optical flow under any form of occlusion from unlabeled data. The work is based on extracting reliable flow estimates from non-occluded pixels and using these predictions to guide optical flow learning for occlusions. Figure 1 illustrates the idea of perturbing superpixels to create synthetic occlusions. The paper further exploits temporal information from multiple frames to improve flow prediction accuracy within a simple CNN architecture. The resulting learning method yields the highest accuracy among all unsupervised optical flow learning methods on the Sintel and KITTI benchmarks.
       Figure 1: Our NOC model is first trained with a classical photometric loss (measuring the difference between the reference image (a) and the warped target image (d)), guided by an occlusion map (g). Randomly selected superpixels in the target image (b) are then perturbed to generate hallucinated occlusions. Finally, reliable flow estimates from the NOC model are used to guide the OCC model to learn the flow of those newly occluded pixels, denoted by a self-supervised mask (i), where a value of 1 indicates that the pixel is non-occluded in (g) but occluded in (h). Note that the yellow area is part of the moving dog. Our self-supervised method learns optical flow for both moving objects and static scenes.
       With SelFlow, this is the first time a supervised learning method has achieved such excellent accuracy without using any externally labeled data.
       Related Work
               Classical optical flow estimation :
                        Classical variational methods model optical flow estimation as an energy minimization problem based on brightness constancy and spatial smoothness. This approach works well for small motions but tends to fail when the displacements are large. Later works integrate feature matching to find sparse matches, which are then interpolated into dense flow maps in a coarse-to-fine pyramidal fashion [6, 47, 38].
                       Recent work uses convolutional neural networks (CNNs) to improve sparse matching by learning efficient feature embeddings [49, 2]. However, these methods are usually computationally expensive and cannot be trained end-to-end.
                       A natural extension to improve the robustness and accuracy of flow estimation is to incorporate temporal information over multiple frames . A straightforward approach is to add temporal constraints  , such as constant velocity, constant acceleration, low-dimensional linear subspace, or rigid/nonrigid partitioning.
                        Our method, by contrast, is much simpler and does not rely on any assumptions about the data. SelFlow directly learns optical flow for a wider range of challenging cases present in the data.
               Supervised Learning of Optical Flow : Learning Optical Flow Using CNNs
                        FlowNet is the first end-to-end optical flow learning framework. It takes two consecutive images as input and outputs a dense flow map. FlowNet 2.0 stacks several basic FlowNet models for iterative refinement and significantly improves accuracy.
                       SpyNet proposes to warp images at multiple scales to account for large displacements, resulting in a compact spatial pyramid network. 
                        PWC-Net and LiteFlowNet warp features extracted from CNNs and achieve state-of-the-art results with lightweight frameworks. However, achieving high accuracy with these CNNs requires pre-training on multiple synthetic datasets and following a specific training schedule.
                       This paper reduces the reliance on pre-training on synthetic data and proposes an efficient method for self-supervised training on unlabeled data. 
       Unsupervised optical flow learning:
                The commonly used photometric loss, which measures the difference between the reference image and the warped image, is computed based on the assumptions of brightness constancy and spatial smoothness. However, this loss does not apply to occluded pixels.
                Recent studies suggest first estimating an occlusion map and then excluding the occluded pixels when computing the photometric difference. [18] introduced a multi-frame formulation with more advanced occlusion reasoning to estimate optical flow, achieving state-of-the-art unsupervised results. DDFlow [26] proposes a data distillation method to learn the optical flow of occluded pixels, which is especially effective for pixels close to the image boundary.
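A common way to obtain such an occlusion map is a forward-backward consistency check: a pixel is marked occluded when its forward flow and the backward flow sampled at the forward target do not cancel out. The sketch below uses nearest-neighbour sampling and a threshold form popularized by unsupervised flow methods; the constants `alpha1` and `alpha2` are illustrative assumptions.

```python
import numpy as np

def occlusion_map(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Mark pixels whose forward/backward flows are inconsistent as occluded.

    flow_fw, flow_bw: (H, W, 2) arrays holding (dx, dy) displacements.
    Returns an (H, W) map with 1 at occluded pixels.
    """
    H, W, _ = flow_fw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Nearest-neighbour lookup of the backward flow at the forward target.
    tx = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    bw = flow_bw[ty, tx]
    mismatch = ((flow_fw + bw) ** 2).sum(-1)      # ~0 when consistent
    bound = alpha1 * ((flow_fw ** 2).sum(-1) + (bw ** 2).sum(-1)) + alpha2
    return (mismatch > bound).astype(np.float32)
```

When the two flow fields agree, the mismatch stays below the bound and the map is empty; a grossly inconsistent backward field flags every pixel.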
                All of these unsupervised learning methods only deal with specific cases of occluded pixels and lack the ability to reason about the optical flow of all possible occluded pixels. This paper addresses the issue with a superpixel-based occlusion hallucination technique.
       Self-supervised learning :
                Supervision signals are generated purely from the data itself and are widely used to learn feature representations from unlabeled data. Pretext tasks such as image inpainting, image colorization, and solving jigsaw puzzles are often employed. [33] proposed exploring low-level motion-based cues to learn feature representations without human supervision. [9] combine multiple self-supervised learning tasks to train a single visual representation.
                In this paper, we exploit the domain knowledge of optical flow and use the reliable predictions for non-occluded pixels as a self-supervised signal to guide the learning of optical flow for occluded pixels.
       Method:
                This paper trains two CNNs (NOC-Model and OCC-Model) with the same network architecture. NOC-Model focuses on accurate flow estimation for non-occluded pixels, while OCC-Model learns to predict optical flow for all pixels.
                Reliable non-occluded flow estimates are extracted from NOC-Model to guide the learning of OCC-Model for the occluded pixels. Only the OCC model is required at test time. The network is built on PWC-Net and further extended to multi-frame optical flow estimation (Figure 2).
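This guidance step can be written as a loss over a self-supervised mask that selects pixels which are non-occluded in the original pair but occluded after hallucination; on those pixels the OCC model is pushed toward the NOC model's flow. A minimal numpy sketch, where the robust penalty and `eps` are assumptions:

```python
import numpy as np

def self_supervision_loss(flow_occ, flow_noc, occ_orig, occ_new, eps=0.01):
    """Penalize OCC-Model flow deviating from NOC-Model flow on pixels that
    only became occluded after the hallucinated perturbation.

    flow_occ, flow_noc: (H, W, 2) flow fields from the two models.
    occ_orig, occ_new:  (H, W) occlusion maps before/after perturbation.
    """
    mask = occ_new * (1.0 - occ_orig)   # self-supervised mask M
    diff = np.sqrt(((flow_occ - flow_noc) ** 2).sum(-1) + eps ** 2)
    return float((diff * mask).sum() / max(mask.sum(), 1.0))
```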
                After obtaining the optical flow, a Spatial Transformer Network is used to backward-warp the target image and reconstruct the reference image. Denote the occlusion map from Ii to Ij by Oi→j, where a value of 1 means that the pixel in Ii is not visible in Ij.
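Backward warping samples the target image at positions shifted by the flow, using bilinear interpolation so the operation stays differentiable in a real network. Below is a small numpy stand-in for the spatial-transformer warp (border pixels are clamped; this is a sketch, not the paper's implementation):

```python
import numpy as np

def backward_warp(target, flow):
    """Reconstruct the reference frame by bilinearly sampling `target`
    at (x + dx, y + dy) for every reference pixel (x, y).

    target: (H, W) image, flow: (H, W, 2) with (dx, dy).
    """
    H, W = target.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = target[y0, x0] * (1 - wx) + target[y0, x1] * wx
    bot = target[y1, x0] * (1 - wx) + target[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With an integer flow of (1, 0), each output pixel simply picks up the value one column to the right (clamped at the border).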
                We create new target images Ĩt+1 by injecting random noise into superpixels to generate occlusions. Noise can be injected into any one of the three consecutive frames, or even into multiple consecutive frames, as shown in Figure 1.
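The perturbation itself can be sketched as follows. The paper perturbs SLIC superpixels; to keep this illustration dependency-free, a plain rectangular grid stands in for the superpixel partition, and the cell size, fill fraction, and noise distribution are all assumptions.

```python
import numpy as np

def hallucinate_occlusions(img, cell=8, frac=0.2, seed=0):
    """Create a perturbed target image by filling randomly chosen regions
    with noise, returning the image and a mask of the new occlusions.

    img: (H, W) float image in [0, 1).
    """
    rng = np.random.default_rng(seed)
    H, W = img.shape
    out, mask = img.copy(), np.zeros((H, W), dtype=np.float32)
    for y0 in range(0, H, cell):
        for x0 in range(0, W, cell):
            if rng.random() < frac:      # perturb roughly `frac` of regions
                h, w = min(cell, H - y0), min(cell, W - x0)
                out[y0:y0 + h, x0:x0 + w] = rng.random((h, w))
                mask[y0:y0 + h, x0:x0 + w] = 1.0
    return out, mask
```

The returned mask marks exactly the pixels that were overwritten, which is what the self-supervised mask construction needs.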
               
                (Note: on the KITTI dataset, the occlusion map is sparse, containing only pixels that move outside the image boundary.)
               As shown in Figure 2, our three-frame flow estimation network structure is built on top of the two-frame PWC-Net with some modifications to aggregate temporal information.
               First, the network takes three images as input, resulting in three feature representations Ft−1, Ft and Ft+1.
                Then, in addition to the forward flow wt→t+1 and the forward cost volume, the network simultaneously computes the backward flow wt→t−1 and the backward cost volume at each level. When estimating the forward flow, the initial backward flow and backward cost volume are also utilized, because the past frame It−1 can provide valuable information, especially for regions that are occluded in the future frame It+1 but not in It−1. Combining all this information allows a more accurate estimate of the optical flow.
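The cost volume at each level stores, for every pixel, the feature correlation over a small search window, and is built once in the forward direction and once in the backward direction. A numpy sketch of PWC-Net-style correlation (the search radius and zero padding are illustrative choices):

```python
import numpy as np

def cost_volume(f1, f2, max_disp=1):
    """Correlation cost volume between two feature maps.

    f1, f2: (H, W, C) features.
    Returns (H, W, (2*max_disp+1)**2): one correlation per displacement.
    """
    H, W, _ = f1.shape
    pad = np.pad(f2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    slices = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[dy:dy + H, dx:dx + W]
            slices.append((f1 * shifted).mean(-1))  # mean over channels
    return np.stack(slices, axis=-1)
```

For identical all-ones features, interior pixels correlate perfectly at every displacement, while border entries that fall into the zero padding drop to zero.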
                Third, the level-l initial forward flow ẇ_t→t+1, the negated initial backward flow −ẇ_t+1→t, the reference image features F_t at level l, the forward cost volume, and the backward cost volume are stacked to estimate the forward flow at each level. For the backward flow, the flows and cost volumes are simply swapped as input. The forward and backward flow estimation networks share the same network structure and weights.
        At each level, the estimated flow is upsampled to initialize the next level, doubling both its resolution and its magnitude.
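Passing an estimate up the pyramid means doubling both the spatial resolution and the displacement values, since a motion of d pixels at one level corresponds to 2d pixels at the next finer level. A minimal sketch using nearest-neighbour upsampling (real networks typically use bilinear or learned upsampling):

```python
import numpy as np

def upsample_flow(flow):
    """Upsample an (H, W, 2) flow field to (2H, 2W, 2), scaling magnitudes
    by 2 so displacements stay correct at the finer resolution."""
    up = flow.repeat(2, axis=0).repeat(2, axis=1)
    return up * 2.0
```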


Origin blog.csdn.net/YoooooL_/article/details/130876444