GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose Paper Reading

Paper information

Title : GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
Authors : Zhichao Yin and Jianping Shi
Source : CVPR
Time : 2018

Abstract

We propose GeoNet, a joint unsupervised learning framework for monocular depth, optical flow and ego-motion estimation in videos.

These three components are coupled together through the properties of the 3D scene geometry and are jointly learned by our framework in an end-to-end manner. Specifically, geometric relationships are extracted based on the predictions of individual modules and then combined into image reconstruction losses to reason separately for static and dynamic scene parts.

Furthermore, we propose an adaptive geometric consistency loss that improves robustness to outliers and non-Lambertian regions, effectively resolving occlusions and texture ambiguities.

Introduction

In this paper, we propose GeoNet, an unsupervised learning framework for jointly estimating monocular depth, optical flow and camera motion in videos. The foundation of our approach is built on the properties of 3D scene geometry (see Section 3.1 for details).

The intuition is that most natural scenes are composed of rigid static surfaces, e.g. roads, houses, trees, etc. The 2D image motion they project between video frames can be completely determined by the depth structure and camera motion. At the same time, dynamic objects such as pedestrians and cars are common in such scenes, and they typically exhibit large displacements and less predictable arrangements.

Therefore, we capture these principles with deep convolutional networks. Specifically, our paradigm employs a divide-and-conquer strategy: a novel two-stage cascade architecture is designed to adaptively resolve the rigid scene flow and the object motion. In this way, the global motion field is progressively refined, and the overall learning task is decomposed into easier sub-tasks. A view synthesis loss guided by the fused motion field then provides natural regularization for unsupervised learning. An example prediction is shown in Figure 1.
[Figure 1]
As a second contribution, we introduce a novel adaptive geometric consistency loss to overcome factors not accounted for by the pure view synthesis objective, such as occlusions and photometric inconsistency. By mimicking traditional forward-backward (or left-right) consistency checks, our method automatically filters out likely outliers and occlusions. Prediction consistency is enforced across different views in non-occluded regions, while erroneous predictions are smoothed out, especially in occluded regions.

Related Work

Scene flow estimation is another topic closely related to our work, which estimates the dense 3D motion field of a scene from stereoscopic image sequences [49]. Top-ranked methods on the KITTI benchmark often involve joint reasoning of geometry, rigid motion, and segmentation [3, 51]. Markov random fields (MRFs) [27] are widely adopted to model these factors as a discrete labeling problem. However, these off-the-shelf methods are often too slow for practical use due to the large number of variables to be optimized. On the other hand, several recent methods emphasize rigidity regularities in generic scene flow. Taniai et al. [46] proposed segmenting moving objects from the rigid scene with a binary mask. Sevilla-Lara et al. [41] defined different image motion models according to semantic segmentation.

Method

In this section, we start with the nature of 3D scene geometry. We then give an overview of GeoNet, which consists of two components: a rigid structure reconstructor and a non-rigid motion localizer.

Finally, we present the geometric consistency enforcement, which lies at the core of GeoNet.

Nature of 3D Scene Geometry

A video or image is a snapshot of the 3D world projected onto a specific image plane. The 3D scene naturally consists of a static background and moving objects. Motion in the static parts of the video is caused entirely by the camera motion and the depth structure. The motion of dynamic objects is more complex, composed of the shared camera motion and object-specific motion.

Compared to full scene understanding, understanding the shared camera motion is relatively easy, because most regions are constrained by it.

To decompose the 3D scene understanding problem at its essence, we wish to learn the scene-level consistent motion governed by the camera motion, i.e., the rigid flow, and the object-specific motion separately.

To model the strictly constrained rigid flow, we define the static scene geometry by the collection of depth maps $D_i$ for frame $i$ and the relative camera motion $T_{t\to s}$ from the target frame to the source frame. The relative 2D rigid flow from the target image $I_t$ to the source image $I_s$ can be expressed as:
$$f^{rig}_{t\to s}(p_t) = K T_{t\to s} D_t(p_t) K^{-1} p_t - p_t \tag{1}$$

where $K$ denotes the camera intrinsic matrix and $p_t$ denotes the homogeneous coordinates of pixels in the target view.
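
As a concrete reading of Eq. (1), the following PyTorch sketch computes the rigid flow from a depth map, a relative pose and the intrinsics. The tensor shapes and the function name are my own assumptions for illustration, not the authors' code.

```python
# A minimal sketch of Eq. (1): rigid flow from depth, relative pose and intrinsics.
import torch

def rigid_flow(D_t, T_t2s, K):
    """D_t: [B,1,H,W] depth, T_t2s: [B,4,4] relative pose, K: [B,3,3] intrinsics."""
    B, _, H, W = D_t.shape
    device, dtype = D_t.device, D_t.dtype
    # Homogeneous pixel grid p_t = (u, v, 1)^T for every pixel of the target view.
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                          torch.arange(W, device=device, dtype=dtype),
                          indexing="ij")
    p_t = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1).expand(B, -1, -1)
    # Back-project to 3D camera coordinates: D_t(p_t) * K^-1 * p_t.
    cam = D_t.reshape(B, 1, -1) * torch.matmul(torch.inverse(K), p_t)            # [B,3,HW]
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)
    # Transform into the source camera and project with K: K * T_t2s * X.
    proj = torch.matmul(K, torch.matmul(T_t2s, cam_h)[:, :3, :])                 # [B,3,HW]
    pix = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)
    # The rigid flow is the displacement between projected and original coordinates.
    return (pix - p_t[:, :2, :]).reshape(B, 2, H, W)
```
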
On the other hand, we model the unconstrained object motion as the classical optical flow concept, that is, the two-dimensional displacement vector.

We learn the residual flow $f^{res}_{t\to s}$ rather than the full flow representation for the non-rigid case.

GeoNet Overview

Our proposed GeoNet perceives the essence of 3D scene geometry in an unsupervised manner.

In particular:

  1. We use separate components to learn rigid flow and object motion via a rigid structure reconstructor and a non-rigid motion localizer, respectively.
  2. Image appearance similarity is employed to guide unsupervised learning, which can be generalized to an unlimited number of video sequences without any labeling cost.

An overview of our GeoNet is shown in Figure 2.
[Figure 2]

It contains two stages, the rigid structure reasoning stage and the non-rigid motion refinement stage.

The first stage of inferring scene layout consists of two sub-networks, namely DepthNet and PoseNet. Depth maps and camera poses are regressed separately and fused to produce rigid flow.

The second stage is handled by ResFlowNet, which deals with dynamic objects. The residual non-rigid flow learned by ResFlowNet is combined with the rigid flow to obtain our final flow prediction. Since each of our sub-networks targets a specific sub-task, the complex scene geometry understanding goal is decomposed into simpler ones. View synthesis at each stage serves as the fundamental supervision of our unsupervised learning paradigm.
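
To make the cascade concrete, here is a hypothetical forward pass that wires the three sub-networks together; the module interfaces and the pose tensor layout are assumptions for illustration, and `rigid_flow` refers to the sketch given after Eq. (1) above.

```python
# A high-level sketch of the two-stage cascade (hypothetical module names and layouts).
import torch

def geonet_forward(depth_net, pose_net, res_flow_net, frames, K):
    """frames: list of [B,3,H,W] images, frames[0] is the target I_t."""
    I_t, sources = frames[0], frames[1:]
    # Stage 1: per-frame depth, plus all relative 6DoF poses regressed in one pass.
    D_t = depth_net(I_t)                                  # [B,1,H,W]
    poses = pose_net(torch.cat(frames, dim=1))            # [B, n-1, 4, 4] (assumed layout)
    full_flows = []
    for i, I_s in enumerate(sources):
        f_rig = rigid_flow(D_t, poses[:, i], K)           # rigid flow from Eq. (1)
        # Stage 2: ResFlowNet sees the image pair and the rigid flow
        # and predicts only the non-rigid residual.
        f_res = res_flow_net(torch.cat([I_t, I_s, f_rig], dim=1))
        full_flows.append(f_rig + f_res)                  # f_full = f_rig + f_res
    return D_t, poses, full_flows
```
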

Last but not least, we perform geometric consistency checks during training, which significantly enhance the consistency of our predictions and achieve impressive performance.

Rigid Structure Reconstructor

Our first stage aims to reconstruct the rigid scene structure while remaining robust to non-rigidity and outliers.

The training examples are temporally consecutive frames $I_i \ (i = 1 \sim n)$ with known camera intrinsics. Typically, the target frame $I_t$ is designated as the reference view, while the other frames serve as source frames $I_s$.
DepthNet takes a single view as input and leverages accumulated scene priors for depth prediction. During training, the entire sequence is treated as a mini-batch of independent images and input into DepthNet.

In contrast, to better exploit feature correspondences between different views, our PoseNet takes as input the entire sequence concatenated along the channel dimension, regressing all relative 6DoF camera poses $T_{t\to s}$ in one go.
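
As a small illustration of these two input layouts, assuming the sequence is stored as a tensor of shape [B, n, 3, H, W] (the layout and shapes are my own assumption):

```python
# Two input layouts: DepthNet sees independent frames, PoseNet sees the whole sequence.
import torch

B, n, C, H, W = 4, 3, 3, 128, 416
seq = torch.randn(B, n, C, H, W)

# DepthNet: the sequence is flattened into a mini-batch of independent single images,
# producing one depth map per frame.
depth_in = seq.reshape(B * n, C, H, W)      # [B*n, 3, H, W]

# PoseNet: all frames of one sequence are concatenated along the channel dimension,
# so feature correspondences across views can be exploited and the n-1 relative
# 6DoF poses are regressed in one go.
pose_in = seq.reshape(B, n * C, H, W)       # [B, n*3, H, W]
print(depth_in.shape, pose_in.shape)
```
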

Based on these basic predictions, we are able to derive the global rigid flow according to Eq. (1) and instantly synthesize the other view for any pair of target and source frames.

We denote by $\tilde{I}^{rig}_s$ the image obtained by inversely warping $I_s$ to the target image plane according to $f^{rig}_{t\to s}$.

Therefore, the supervision signal at this stage naturally takes the form of minimizing the difference between the synthesized view $\tilde{I}^{rig}_s$ and the original frame $I_t$ (or vice versa).
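
To make the warping step concrete, here is a minimal inverse-warping sketch based on bilinear sampling; the function name and tensor layout are my own choices for illustration, not the authors' implementation.

```python
# A minimal sketch of inverse warping I_s to the target plane with a dense flow field.
import torch
import torch.nn.functional as F

def inverse_warp(I_s, flow):
    """I_s: [B,C,H,W] source image, flow: [B,2,H,W] flow from target to source."""
    B, _, H, W = I_s.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=I_s.dtype, device=I_s.device),
                          torch.arange(W, dtype=I_s.dtype, device=I_s.device),
                          indexing="ij")
    grid = torch.stack([u, v], dim=0).unsqueeze(0) + flow      # sampling positions in I_s
    # Normalize the sampling grid to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack([gx, gy], dim=-1)                  # [B,H,W,2]
    return F.grid_sample(I_s, grid_norm, align_corners=True)   # warped view, e.g. \tilde{I}^{rig}_s
```
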

However, it should be pointed out that the rigid flow only dominates motion in non-occluded rigid regions and fails in non-rigid regions. Although this negative effect is somewhat mitigated in fairly short sequences, we adopt a robust image similarity measure [15] for the photometric loss, which keeps a balance between an appropriate assessment of perceptual similarity and modest resilience to outliers, and is differentiable in nature, as follows:
$$L_{rw} = \alpha \, \frac{1 - \mathrm{SSIM}(I_t, \tilde{I}^{rig}_s)}{2} + (1 - \alpha) \, \left\| I_t - \tilde{I}^{rig}_s \right\|_1 \tag{2}$$

where $\mathrm{SSIM}$ denotes the structural similarity index and $\alpha$ balances the two terms. To regularize the depth prediction in textureless regions, an edge-aware depth smoothness loss weighted by image gradients is also applied:

$$L_{ds} = \sum_{p_t} \left| \nabla D(p_t) \right| \cdot \left( e^{-\left| \nabla I(p_t) \right|} \right)^{T} \tag{3}$$
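
The snippet below sketches Eq. (2) and Eq. (3) in PyTorch, using the common 3x3 average-pooling approximation of SSIM adopted in much unsupervised depth work; the exact SSIM formulation and weights are assumptions for illustration, not the authors' code.

```python
# Sketches of the robust photometric loss (Eq. (2)) and edge-aware smoothness (Eq. (3)).
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # Local statistics via 3x3 average pooling (a common simplified SSIM).
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def rigid_warp_loss(I_t, I_s_warped, alpha=0.85):
    # Eq. (2): weighted combination of an SSIM term and an L1 term.
    ssim_term = ((1 - ssim(I_t, I_s_warped)) / 2).mean()
    l1_term = (I_t - I_s_warped).abs().mean()
    return alpha * ssim_term + (1 - alpha) * l1_term

def edge_aware_smoothness(D, I):
    # Eq. (3): depth (or flow) gradients, down-weighted where the image has strong edges.
    dD_x = (D[:, :, :, 1:] - D[:, :, :, :-1]).abs()
    dD_y = (D[:, :, 1:, :] - D[:, :, :-1, :]).abs()
    dI_x = (I[:, :, :, 1:] - I[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dI_y = (I[:, :, 1:, :] - I[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dD_x * torch.exp(-dI_x)).mean() + (dD_y * torch.exp(-dI_y)).mean()
```
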

Non-rigid Motion Localizer

The first stage provides us with a stereoscopic perception of rigid scene layout, but ignores the ubiquity of dynamic objects. Therefore, we propose a second component, ResFlowNet, to localize non-rigid motions.

Intuitively, generic optical flow can directly model unconstrained motion, and it is commonly adopted in off-the-shelf deep models [8, 18]. But these models do not fully exploit the well-constrained nature of rigid regions, which we have already taken advantage of in our first stage.

We formulate ResFlowNet to learn the residual non-rigid flow, i.e., the shift caused solely by object movement relative to the world plane. Specifically, we cascade ResFlowNet after the first stage, as recommended by [18]. For any given frame pair, ResFlowNet takes the output of the rigid structure reconstructor and predicts the corresponding residual signal $f^{res}_{t\to s}$. The final full flow prediction is composed as $f^{full}_{t\to s} = f^{rig}_{t\to s} + f^{res}_{t\to s}$.
[Figure 3]

As shown in Figure 3, our first stage, the rigid structure reconstructor, produces high-quality reconstructions in most rigid scenes, which provides a good starting point for our second stage. Therefore, ResFlowNet in our motion localizer only needs to focus on the remaining non-rigid residuals. Note that, thanks to our end-to-end learning protocol, ResFlowNet not only corrects wrong predictions on dynamic objects, but also refines imperfect results from the first stage, which may be caused by high saturation and extreme lighting conditions.

Similarly, we extend the first-stage losses to the second stage with simple modifications. Specifically, following the full flow $f^{full}_{t\to s}$, we again perform image warping between any pair of target and source frames. Replacing $\tilde{I}^{rig}_s$ in Eq. (2) with $\tilde{I}^{full}_s$ gives the full flow warping loss $L_{fw}$. Likewise, we extend the smoothness loss in Eq. (3) to the 2D optical flow field, which we denote as $L_{fs}$.
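
As a rough illustration of how the second-stage terms reuse the first-stage machinery, the snippet below applies the `inverse_warp`, `rigid_warp_loss` and `edge_aware_smoothness` sketches from above to the full flow; `I_t`, `I_s` and `f_full` stand for tensors produced as in the previous sketches (an illustration, not the authors' code).

```python
# Second-stage losses, reusing the helper sketches defined earlier.
I_s_full = inverse_warp(I_s, f_full)          # synthesized view \tilde{I}^{full}_s
L_fw = rigid_warp_loss(I_t, I_s_full)         # full flow warping loss (Eq. (2) with the full flow)
L_fs = edge_aware_smoothness(f_full, I_t)     # smoothness of the 2D flow field (Eq. (3) analogue)
```
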

Geometric Consistency Enforcement

Our GeoNet employs the rigid structure reconstructor for static scenes and the non-rigid motion localizer as compensation for dynamic objects. Both stages use view synthesis as supervision and implicitly assume photometric consistency. Although we employ a robust image similarity measurement such as Eq. (2), occlusions and non-Lambertian surfaces still cannot be handled perfectly in practice.

To further mitigate these effects, we apply forward-backward consistency checks within the learning framework without changing the network architecture. The work of Godard et al. [15] incorporates a similar idea into their deep learning scheme via a left-right consistency loss. However, we argue that such a consistency constraint, as well as the warping loss, should not be imposed in occluded regions (see Section 4.3). Instead, we optimize an adaptive consistency loss over the final motion field.

Specifically, our geometric consistency enforcement is achieved by optimizing the following objective:
$$L_{gc} = \sum_{p_t} [\delta(p_t)] \cdot \left\| \Delta f^{full}_{t\to s}(p_t) \right\|_1 \tag{4}$$

where $\Delta f^{full}_{t\to s}(p_t)$ is the forward-backward difference of the full flow at pixel $p_t$ (the flow from $I_t$ to $I_s$ at $p_t$ plus the reverse flow sampled at the pixel it maps to in $I_s$), and $[\delta(p_t)]$ is an indicator that equals 1 where the forward and backward flows are consistent, and 0 at pixels regarded as potentially occluded or erroneous.
Since such regions violate the photometric consistency as well as the geometric consistency assumptions, we handle them only with the smoothness loss $L_{fs}$. Therefore, both our full flow warping loss $L_{fw}$ and geometric consistency loss $L_{gc}$ are weighted pixel-wise by $[\delta(p_t)]$.
$$L = \sum_{l} \sum_{\langle t, s \rangle} \left\{ L_{rw} + \lambda_{ds} L_{ds} + L_{fw} + \lambda_{fs} L_{fs} + \lambda_{gc} L_{gc} \right\}$$

where $l$ indexes the image pyramid scales, $\langle t, s \rangle$ runs over all target-source pairs, and the $\lambda$ weights balance the individual loss terms.
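
For intuition, here is a minimal sketch of the forward-backward consistency check and the adaptive mask $[\delta(p_t)]$ used to weight $L_{fw}$ and $L_{gc}$; it reuses the `inverse_warp` helper from the warping sketch above, averages rather than sums over pixels, and the thresholds are illustrative assumptions, not values quoted in this post.

```python
# Forward-backward consistency check for the full flow (illustrative sketch).
import torch

def consistency_mask_and_loss(f_fwd, f_bwd, alpha=3.0, beta=0.05):
    """f_fwd: [B,2,H,W] full flow t->s, f_bwd: [B,2,H,W] full flow s->t."""
    # Sample the backward flow at the pixels the forward flow points to
    # (reuses the inverse_warp sketch defined earlier).
    f_bwd_warped = inverse_warp(f_bwd, f_fwd)
    # Where the prediction is consistent, the two flows should cancel out.
    delta_f = f_fwd + f_bwd_warped
    diff_sq = (delta_f ** 2).sum(1, keepdim=True)
    mag_sq = (f_fwd ** 2).sum(1, keepdim=True) + (f_bwd_warped ** 2).sum(1, keepdim=True)
    # [delta(p_t)]: 1 where the check passes, 0 in likely occluded/erroneous regions.
    mask = (diff_sq < torch.clamp(beta * mag_sq, min=alpha)).float()
    # Geometric consistency loss, enforced only on consistent pixels (mean, not sum).
    L_gc = (mask * delta_f.abs().sum(1, keepdim=True)).mean()
    return mask, L_gc
```
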


Reprinted from: blog.csdn.net/qin_liang/article/details/132744830