Real-time stereo matching network StereoNet

Overview

This paper introduces StereoNet, the first end-to-end deep architecture for real-time stereo matching. It runs at 60 fps on an NVIDIA Titan X and produces high-quality, edge-preserving, quantization-free disparity maps. A key insight of the paper is that the network's sub-pixel matching precision is an order of magnitude higher than that of traditional stereo matching methods. This is what makes real-time performance possible: a very low-resolution cost volume already encodes all the information needed to reach high disparity precision.

Open-source project link: https://github.com/meteorshowers/X-StereoLab

1. Introduction

Stereo matching is a classic computer vision problem: estimating depth from two slightly displaced images. With growing interest in virtual and augmented reality, depth estimation has recently moved to the center of research. It lies at the heart of many tasks, from 3D reconstruction to localization and tracking, and its applications span research and product areas including indoor mapping and construction, autonomous vehicles, and body and face tracking.

In this paper, we propose a novel deep network architecture, StereoNet, which generates state-of-the-art 720p depth maps at 60 Hz on an NVIDIA Titan X. In summary, the main contributions of this paper are as follows:

  • The sub-pixel matching precision of StereoNet is an order of magnitude better than that of "traditional" stereo methods.
  • Because the network's sub-pixel precision is so high, it can reach the depth precision of traditional stereo matching with a very low-resolution cost volume.
  • We show that previous work that introduced cost volumes into deep architectures over-parameterized the task, and that removing this over-parameterization significantly reduces the system's runtime and memory footprint at little cost in accuracy.
  • We propose a new hierarchical depth refinement layer capable of high-quality, edge-preserving upsampling.
  • Finally, we demonstrate that the proposed system achieves convincing results on several benchmarks while running in real time on high-end GPU architectures.
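
To see why sub-pixel precision matters for depth, the sketch below uses the standard triangulation relation for a rectified stereo pair, Z = b·f / d, and its first-order error ε ≈ δ·Z² / (b·f), where δ is the disparity precision in pixels. The baseline, focal length, and target distance are hypothetical values chosen only for illustration; they do not come from the paper.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Triangulated depth Z = b * f / d for a rectified stereo pair."""
    return baseline_m * focal_px / disparity_px


def depth_error(disparity_precision_px, depth_m, baseline_m, focal_px):
    """First-order depth error: eps = delta * Z^2 / (b * f)."""
    return disparity_precision_px * depth_m ** 2 / (baseline_m * focal_px)


# Hypothetical rig (assumption): 10 cm baseline, 1000 px focal length, target at 5 m.
BASELINE_M, FOCAL_PX, DEPTH_M = 0.1, 1000.0, 5.0

coarse = depth_error(0.25, DEPTH_M, BASELINE_M, FOCAL_PX)
fine = depth_error(0.025, DEPTH_M, BASELINE_M, FOCAL_PX)
print(f"depth error at 0.25 px precision:  {coarse:.5f} m")
print(f"depth error at 0.025 px precision: {fine:.5f} m")
```

Because the error is linear in δ, a tenfold gain in disparity precision gives a tenfold reduction in depth error at the same distance, which is why a lower-resolution matching stage can still deliver comparable depth precision.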

2. StereoNet network architecture

2.1 Overall structure

Our stereo matching method combines a design that exploits the structure of the problem with classical approaches to solving it. The overall framework of the network is shown in the following figure:
[Figure: overall framework of the StereoNet network]

2.2 Coarse prediction: Cost Volume Filtering

Stereo systems generally solve a correspondence problem. The problem usually boils down to forming a disparity map by finding pixel-to-pixel matches along the scan lines of two rectified images.

The desire for smooth, edge-preserving solutions has led to methods such as Cost Volume Filtering, which explicitly model the matching problem by forming and processing a 3D volume that jointly considers all candidate disparities at each pixel. Rather than using color values directly for matching, we compute a feature representation for each pixel and match on that.

Feature network. The first step in this pipeline is to find a meaningful representation of an image patch that can be matched accurately in later stages. Recall that stereo suffers in textureless regions, and that traditional methods address this by aggregating costs over large windows.

We replicate the same behavior in the network by ensuring that features are extracted from a large receptive field. In particular, we use a feature network whose weights are shared between the two input images. We first downsample the input image using a series of 5 × 5 convolutions with a stride of 2, keeping the number of channels at 32 throughout the downsampling.

We then apply 6 residual blocks, each using 3 × 3 convolutions, batch normalization, and leaky ReLU activations (α = 0.2). Finally, the last layer is a 3 × 3 convolution without batch normalization or activation. The output is a 32-dimensional feature vector at each pixel of the downsampled image. This low-resolution representation is important for two reasons:

  • It has a large receptive field, which is useful for textureless regions.
  • It keeps the feature vectors compact.
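
As a rough illustration of how small the feature map becomes, the sketch below computes the spatial size left after a stack of stride-2, 5 × 5 convolutions. The number of downsampling layers and the padding of 2 are assumptions made for this example; the text above fixes only the kernel size and the stride.

```python
def conv_out_size(size, kernel=5, stride=2, padding=2):
    """Output length of one convolution along a single image dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def downsampled_shape(height, width, num_layers):
    """Feature-map shape after num_layers stride-2, 5 x 5 convolutions."""
    for _ in range(num_layers):
        height = conv_out_size(height)
        width = conv_out_size(width)
    return height, width


# num_layers is a hypothetical setting, shown for two plausible values.
for num_layers in (3, 4):
    print(num_layers, downsampled_shape(720, 1280, num_layers))
```

With three such layers a 720p image shrinks by a factor of 8 in each dimension, so every later stage that operates on the cost volume touches 64 times fewer pixels.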

Cost volume. At this point, we form a cost volume at coarse resolution by taking the difference between the feature vector of a pixel and the feature vectors of its matching candidates. We note that asymmetric representations generally perform well, and that concatenating the two vectors achieved similar results in our experiments.
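
The difference-based cost volume can be sketched for a single scan line as follows. This is a minimal illustration in plain Python lists, not the authors' implementation: the function name, the data layout, and the zero-filling of positions where the candidate falls outside the image are all assumptions.

```python
def difference_cost_volume(left_feats, right_feats, max_disparity):
    """Build a scan-line cost volume from per-pixel feature differences.

    left_feats / right_feats: one feature vector (list of floats) per pixel x.
    Returns volume[d][x] = left[x] - right[x - d]. Positions where x - d lies
    outside the image are zero-filled (an arbitrary choice for this sketch).
    """
    width = len(left_feats)
    channels = len(left_feats[0])
    volume = []
    for d in range(max_disparity):
        row = []
        for x in range(width):
            if x - d >= 0:
                diff = [l - r for l, r in zip(left_feats[x], right_feats[x - d])]
            else:
                diff = [0.0] * channels
            row.append(diff)
        volume.append(row)
    return volume


# Toy data: the left scan line is the right one shifted by 2 pixels.
right = [[float(i), float(i * i)] for i in range(6)]
left = [[-1.0, -1.0], [-1.0, -1.0]] + right[:4]
volume = difference_cost_volume(left, right, max_disparity=4)
print(volume[2][2:])  # all-zero differences at the true disparity d = 2
```

In a full network this volume has one extra spatial dimension and is passed to the learned filtering stage, which decides how the per-channel differences are turned into a matching cost.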

At this stage, a traditional stereo method would use a winner-take-all approach, picking for each pixel the disparity with the smallest Euclidean distance between the two feature vectors. Instead, here we let the network learn the right metric by running multiple convolutions and nonlinearities.
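
For contrast, the traditional winner-take-all baseline mentioned above can be sketched on one scan line as below. This shows the classical step that StereoNet replaces with learned filtering, not part of the network itself; the function name and the toy data are hypothetical.

```python
import math


def winner_take_all(left_feats, right_feats, max_disparity):
    """Per pixel, pick the disparity whose feature distance is smallest."""
    disparities = []
    for x, left_vec in enumerate(left_feats):
        best_d, best_cost = 0, math.inf
        # Only test candidates x - d that stay inside the image.
        for d in range(min(max_disparity, x + 1)):
            cost = math.dist(left_vec, right_feats[x - d])  # Euclidean distance
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# Toy data: the left scan line is the right one shifted by 2 pixels.
right = [[float(i), float(i * i)] for i in range(8)]
left = [[100.0, 100.0], [100.0, 100.0]] + right[:6]
print(winner_take_all(left, right, max_disparity=4))
```

Winner-take-all returns integer disparities only and uses a fixed metric, which is exactly the part a learned cost-filtering stage is free to improve on.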

2.3 Hierarchical Refinement: Edge-Aware Upsampling

The disadvantage of relying on coarse matching is that the resulting myopic output lacks fine details. To keep our design compact, we address this by learning an edge-preserving refinement network. The network's job at this stage is to dilate or erode the disparity values so that they blend in high-frequency details, using the color input as a guide. A compact network that learns a pixel-to-pixel mapping, similar to the networks used in recent computational photography work, is therefore an appropriate choice. Specifically, our refinement network is tasked only with finding a residual (or delta disparity) to add to or subtract from the coarse prediction.
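
The residual formulation can be sketched in one dimension as follows. This is an illustration under stated assumptions: the 2x linear upsampling, the rescaling of disparity values with resolution, the clamp at zero, and the `toy_residual` stand-in (which replaces the learned, color-guided network) are all hypothetical choices made for the example.

```python
def upsample_linear_2x(disparity):
    """Linearly upsample a 1-D disparity row by 2x and rescale its values.

    Disparities are measured in pixels, so doubling the image width also
    doubles every disparity value (scale = 2.0).
    """
    scale = 2.0
    out = []
    for i, value in enumerate(disparity):
        nxt = disparity[min(i + 1, len(disparity) - 1)]
        out.append(scale * value)
        out.append(scale * 0.5 * (value + nxt))
    return out


def refine(coarse_disparity, predict_residual):
    """One refinement level: upsample, then add a predicted residual."""
    upsampled = upsample_linear_2x(coarse_disparity)
    residual = predict_residual(upsampled)
    # Clamp at zero so the refined disparity stays non-negative.
    return [max(0.0, u + r) for u, r in zip(upsampled, residual)]


def toy_residual(upsampled):
    """Hypothetical stand-in for the learned network: sharpen one edge."""
    return [0.0, 0.0, 0.0, -2.0, 0.0, 0.0]


print(upsample_linear_2x([1.0, 1.0, 3.0]))    # blurred edge
print(refine([1.0, 1.0, 3.0], toy_residual))  # edge restored by the residual
```

Because the network only has to output a correction on top of an already reasonable upsampled estimate, it can stay small, and stacking several such levels gives the hierarchical scheme described in this section.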

Figure 2 shows the output of the refinement layer at each level of the hierarchy, together with the residuals added at each level to recover high-frequency details. The behavior of this network is reminiscent of joint bilateral upsampling, and we believe that it is indeed a learned edge-aware upsampling function that makes use of a guide image.
[Figure 2: refinement-layer output and added residuals at each level of the hierarchy]

2.4 Loss function

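We train StereoNet in a fully supervised manner, using stereo data with ground-truth labels. We minimize a hierarchical loss function, which accumulates the error of the disparity predicted at every level of the hierarchy:
[Equation image: hierarchical loss function]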
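
The equation itself is not reproduced in this text. As an assumption based on the published StereoNet paper, the loss sums a robust per-pixel penalty ρ over all refinement levels k, L = Σ_k ρ(d_i^k − d̂_i), where ρ is a two-parameter robust function with α = 1 and c = 2 that approximates a smoothed L1 loss. A minimal sketch under that assumption, with hypothetical function names:

```python
import math


def robust_loss(x, c=2.0):
    """General robust loss at alpha = 1: sqrt((x / c)^2 + 1) - 1.

    Assumed form; behaves like a smoothed L1 penalty on the error x.
    """
    return math.sqrt((x / c) ** 2 + 1.0) - 1.0


def hierarchical_loss(predictions_per_level, ground_truth):
    """Sum the robust loss over every refinement level and every pixel.

    predictions_per_level: one flat disparity map (list of floats) per level,
    each assumed to be already at the ground-truth resolution.
    """
    total = 0.0
    for prediction in predictions_per_level:
        for pred, target in zip(prediction, ground_truth):
            total += robust_loss(pred - target)
    return total


truth = [1.0, 2.0, 3.0]
levels = [[3.0, 2.0, 3.0], [1.0, 2.0, 3.0]]  # coarse level is off by 2 px at one pixel
print(hierarchical_loss(levels, truth))
```

Supervising every level, rather than only the final output, gives the coarse stage and each refinement stage their own training signal.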

3. Experiments

Here we evaluate our system on several datasets and demonstrate that it achieves high-quality results at a fraction of the computational cost required by state-of-the-art techniques.

4. Conclusion

This paper presents StereoNet, the first real-time, high-quality, end-to-end architecture for passive stereo matching. We start from the insight that a low-resolution cost volume contains most of the information needed to generate high-precision disparity maps and, given enough training data, to recover thin structures. We demonstrate a sub-pixel precision of 1/30 of a pixel, surpassing the limits published in the literature. Our refinement method recovers high-frequency details hierarchically, using the color input as a guide, and draws parallels to a data-driven joint bilateral upsampling operator.

The main limitation of our method is the lack of supervised training data: we show that it achieves state-of-the-art results when enough examples are available. To mitigate this, our future work includes combining supervised and self-supervised learning to augment the training set.

Origin blog.csdn.net/wjinjie/article/details/122303148