[Paper Brief] IS-MVSNet: Importance Sampling-based MVSNet (ECCV 2022)

1. Brief introduction of the paper

1. First author: Likang Wang

2. Year of publication: 2022

3. Published venue: ECCV (conference)

4. Keywords: MVS, 3D reconstruction, importance sampling, unsupervised error distribution estimation

5. Exploration motivation: Coarse-to-fine MVS networks predict the depth map progressively, which partially alleviates the limitation on resolution. The basic assumption behind the coarse-to-fine strategy is that the predictions at the coarse stage are reliable estimates of the truth. However, even with this strategy, depth resolution remains a key factor preventing high precision and high efficiency from being attained simultaneously: existing coarse-to-fine algorithms do not fully exploit the reliable-coarse-prediction assumption, because they treat every candidate depth value equally across the depth range.

6. Work goals: This paper focuses on selecting the most promising candidate depth values. The new problem is then to distinguish which depths are most trustworthy. Although the coarse predictions are assumed to be close to the actual depth, they are not perfectly accurate. Therefore, estimating the error distribution of the coarse predictions becomes crucial for localizing the ground truth more precisely.

7. Core idea: Based on the above considerations, the importance sampling-based MVSNet (IS-MVSNet) is proposed, together with an effective candidate depth sampling strategy. At no extra cost, it significantly improves the depth resolution near the true value and thereby produces more accurate depth predictions.

  1. We proposed an importance-sampling module for sampling candidate depth, effectively achieving higher depth resolution and yielding better point-cloud results while introducing no additional cost.
  2. Furthermore, we proposed an unsupervised error distribution estimation method for adjusting the density variation of the importance-sampling module.
  3. Notably, the proposed sampling module does not require any additional training and works reasonably well with the pre-trained weights of the baseline model.

8. Experimental results:

Experiments on Tanks & Temples (TNT), ETH3D, and DTU demonstrate IS-MVSNet’s superiority over current SOTAs. With an F-score of 62.82%, IS-MVSNet surpasses all published MVS algorithms on TNT’s intermediate benchmark by a clear margin.

9. Paper download:

https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136920663.pdf

https://github.com/NoOneUST/IS-MVSNet

2. Implementation process

1. Overview of IS-MVSNet

IS-MVSNet follows the coarse-to-fine network structure, as shown in the figure below.

  1. IS-MVSNet employs Feature Pyramid Network (FPN) to extract multi-level representations of reference and source images.
  2. Sample a set of hypothesized depths for further evaluation. For the coarsest stage s = 1, hypothesized depths are sampled uniformly over a predefined depth range. For stages s > 1, a depth hypothesis selection strategy based on importance sampling is proposed, which provides a more effective sampling scheme for IS-MVSNet without sacrificing efficiency. An unsupervised method is also proposed to estimate suitable hyperparameters for the importance sampling.
  3. Construct the cost volume.
  4. A 3D CNN is used to regularize the cost volume and predict the probability of each hypothesized depth being the ground truth.
  5. Compute the inner product of the depth hypotheses and their predicted probabilities as the depth prediction for the current stage (a minimal sketch of this step follows below).
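
As an illustration of step 5, here is a minimal sketch of the probability-weighted depth regression, assuming PyTorch-style tensors; the function name and tensor shapes are our own assumptions, not the authors' exact implementation:

```python
import torch

def expected_depth(prob_volume: torch.Tensor, depth_hypotheses: torch.Tensor) -> torch.Tensor:
    """Regress per-pixel depth as the probability-weighted sum of depth hypotheses.

    prob_volume:      (B, N, H, W) softmax-normalized probabilities from the 3D CNN.
    depth_hypotheses: (B, N, H, W) per-pixel candidate depths (per-pixel, since in
                      IS-MVSNet each pixel centers its candidates on its own previous prediction).
    """
    # Inner product along the hypothesis dimension N.
    return (prob_volume * depth_hypotheses).sum(dim=1)  # (B, H, W)
```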

2. Hypothesis depth selection based on importance sampling

As a coarse-to-fine algorithm, IS-MVSNet gradually refines the depth prediction. For a stage s > 1, although the previous prediction D_p^{s−1} is roughly close to the actual depth D_GT, there is still a gap between them. If we can estimate the per-pixel depth prediction error, we can sample the hypothesized depths around the ground truth at a finer resolution. In this case, the model's ability to capture fine details can be greatly enhanced.

Since estimating the error of each individual pixel is difficult and impractical, we propose to estimate the error distribution over the dataset and adjust the hypothesized depth sampling accordingly. Existing MVS algorithms do not consider such error estimation and implicitly treat the prediction error as a uniform random variable. In IS-MVSNet, for each pixel at stage s > 1 we select N_s good candidate depth values d_i based on the previous-stage depth prediction D_p^{s−1} and the probability density function (PDF) F(δ) of the depth prediction error δ, estimated over all pixels in the dataset. We then sample at the d_i and regress a more accurate depth prediction:

D^s = Σ_i p(d_i) · d_i,

where p(d_i) represents the probability that the candidate depth d_i is the nearest neighbor of the ground-truth depth d_GT.

In this way, we can more precisely locate the most promising depth candidates and assign more attention to them. The result is higher depth accuracy, since the finest depth resolution is concentrated around the true value.

Error formulation. The first problem is how to formulate the error distribution. We consider it reasonable to approximate the error PDF as a unimodal function for three reasons. First, since many factors affect the prediction error, the central limit theorem suggests that the error tends toward a zero-mean unimodal distribution. Second, the coarse predictions are generated through uniform sampling, which yields an unbiased estimate. Third, our experiments verify that the error indeed follows a unimodal distribution with a mean close to zero. Notably, we do not require the previous stage to give a unimodal probabilistic prediction over the hypothesized depths of a given pixel. Instead, we expect the distances from the actual depth to the depth predictions computed from all hypothesized depths to follow a unimodal distribution.

When most pixel depth estimates from the previous stage are correct, our method significantly outperforms uniform sampling. Our experiments on real datasets (Figure 4d) show that sampling from a zero-mean Gaussian distribution is indeed significantly better than uniform sampling. Moreover, even in the extreme case where most of the previous stage's depth estimates are wrong, sampling from a Gaussian distribution still benefits most pixels by providing a higher sampling density at their actual depths. Even if we estimate the mean and sample from a Gaussian that is not zero-mean, our method still benefits more pixels than uniform sampling does. Our sampling method outperforms or is comparable to uniform sampling even in the regions containing the most false predictions, such as repetitive and textureless regions or small objects far from the background.

Discrete intervals. Discrete intervals have two advantages over sampling from a continuous PDF. First, given a finite number of depths, say 8, discrete intervals yield a more stable sampling density than i.i.d. sampling, closer to the actual error distribution. Second, discrete intervals are beneficial for convolutions, since adjacent pixels have similar sampling depths, and spatial correlation is crucial for convolutions.

Based on these considerations, we further propose to sample depth candidates non-uniformly following a sequence of predefined intervals. Precisely, the error PDF should control the depth interval: where the PDF is larger, the interval should be smaller, and vice versa. Let μ_e^{s−1} denote the average error of stage s−1; then the depth intervals close to D_p^{s−1} + μ_e^{s−1} should be smaller, and those far from it larger. We adopt a simple and typical geometric sequence to fit this interval pattern. Note that other sequences with a similar trend are also acceptable, as long as they share the key properties of the Gaussian N(μ_e^{s−1}, σ_e^{s−1}), i.e., a single mode at μ_e^{s−1} and a parameter that plays a role similar to σ_e^{s−1}. Furthermore, it is not necessary for the interval sequence to strictly converge to N(μ_e^{s−1}, σ_e^{s−1}) as the number of intervals tends to infinity; arithmetic progressions, for example, also work well. With this approach, we sample depths according to the error distribution while maintaining local consistency. The detailed importance sampling algorithm is described below.

Algorithm details. Depth hypotheses are placed over the depth range using discrete intervals, rather than being sampled directly from a continuous PDF. At the first stage, since no unbiased prior depth estimate is available at s = 1, the entire depth range R_1 is divided into n_1 − 1 equal intervals of size R_1/(n_1 − 1). At the later stages s ∈ {2, 3, …}, a geometric progression is employed to generate depth hypotheses and increase the sampling density in the central region. The discretized intervals are parameterized by k_s, a hyperparameter that determines the shape of the interval sequence. As shown in Figure 2, the minimum interval is shrunk by a factor of 1/k_s, and the speed at which the interval length changes is the common ratio c_s, which is controlled by k_s. A larger k_s means denser sampling around the corrected prior prediction D_p^{s−1} + μ_e^{s−1}. When k_s > 1, the interval between the central hypothesized depths is reduced while the intervals at the edges are enlarged; that is, the central interval r_s/k_s is 1/k_s times the uniform sampling interval r_s. When k_s = 1, importance sampling degrades to uniform sampling. When 0 < k_s < 1, the method can handle the case where most of the previous predictions are wrong.

Figure 2: Schematic diagram of depth selection when the number of depths is 6. In this sampling strategy, the depth range remains unchanged. The minimum depth interval is reduced by a factor of 1/k_s, and the interval length grows with the common ratio c_s, which is controlled by k_s through Eq. (1). The larger k_s is, the smaller the minimum interval, the larger c_s, and the faster the interval length changes.

Specifically, the depth intervals form a symmetrical geometric progression:

c_s is the common ratio of adjacent intervals. Since the network's depth range and the number of hypothesized depths should remain the same as in the baseline model, c_s is uniquely determined by k_s, R_s, and n_s according to Eq. (1). In practice, c_s is computed numerically as the root of Eq. (1).
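
Eq. (1) itself is not reproduced in this summary, but a minimal sketch of solving for c_s numerically might look as follows, assuming the constraint is simply that the symmetric geometric intervals sum to the depth range R_s (the function name and the exact constraint form are our assumptions, not the paper's code):

```python
import numpy as np
from scipy.optimize import brentq

def solve_common_ratio(R_s: float, n_s: int, k_s: float) -> float:
    """Numerically solve for the common ratio c_s of the symmetric geometric intervals.

    Assumed constraint (a stand-in for Eq. (1)): the n_s - 1 intervals, whose central
    (minimum) interval is r_s / k_s with r_s = R_s / (n_s - 1), must sum to R_s so that
    the depth range matches the baseline model. Only k_s >= 1 is handled here.
    """
    if k_s == 1.0:                        # importance sampling degrades to uniform sampling
        return 1.0
    m = n_s - 1                           # number of intervals
    r_min = (R_s / m) / k_s               # smallest (central) interval
    exps = np.abs(np.arange(m) - (m - 1) / 2.0)   # distance of each interval from the center

    def gap(c):                           # covered range minus the target range
        return r_min * np.sum(c ** exps) - R_s

    # For k_s > 1 the intervals must grow away from the center, so the root satisfies c_s > 1.
    return brentq(gap, 1.0 + 1e-9, 100.0)

# Example: depth range 100 mm, 8 hypotheses, k_s = 2
# c_s = solve_common_ratio(100.0, 8, 2.0)
```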

Each pixel is given its own unique set of depth candidates. Specifically, first, each pixel's discrete depth candidates are defined by the interval sequence; second, the interval sequence, and hence the depth range R (the sum of the intervals), has the same size for all pixels; third, the center of the depth range R along the depth axis is placed at each pixel's previous depth estimate D_p^{s−1}, so each pixel has a unique set of depth candidates whose spacing is identical across pixels; fourth, if the average error μ_e^{s−1} is estimated, the center of the range is further "corrected" to D_p^{s−1} + μ_e^{s−1}. A hedged sketch of this construction is given below.
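
Putting the pieces together, a minimal sketch of how each pixel's candidate set could be assembled (building on the solve_common_ratio sketch above; the helper name and shapes are hypothetical, not the paper's code):

```python
import numpy as np

def build_depth_candidates(prev_depth: np.ndarray, R_s: float, n_s: int,
                           k_s: float, c_s: float, mu_e: float = 0.0) -> np.ndarray:
    """Build per-pixel depth candidates with symmetric geometric intervals.

    prev_depth: (H, W) depth prediction D_p^{s-1} from the previous stage.
    Returns:    (n_s, H, W) candidate depths, denser around prev_depth + mu_e.
    """
    m = n_s - 1
    r_min = (R_s / m) / k_s
    exps = np.abs(np.arange(m) - (m - 1) / 2.0)
    intervals = r_min * c_s ** exps                      # symmetric geometric intervals
    offsets = np.concatenate([[0.0], np.cumsum(intervals)])
    offsets -= offsets[-1] / 2.0                         # center the candidate range on zero
    center = prev_depth + mu_e                           # "corrected" previous prediction
    return center[None, :, :] + offsets[:, None, None]
```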

3. Unsupervised Error Distribution Estimation

In IS-MVSNet, we introduce two new hyperparameters, k_s and μ_s, to adjust the shape of the sampling function g_s(x) at stages s > 1. In practice, the depth estimation error is concentrated around zero, so by default we set the mean error μ_s = 0 and only estimate k_s. However, the k_s estimation scheme proposed in this section is also applicable to μ_s: to estimate both k_s and μ_s, we first fix k_s and estimate μ_s, then fix μ_s and estimate k_s.

As analyzed in the previous section, when the true depth is known, the optimal k can be uniquely determined by minimizing the difference between the sampling function g_s(x) and the actual error distribution. However, the actual depth is unknown in real scenes, and the scale, lighting, and camera intrinsics differ across datasets, so k_s needs to be estimated for each dataset. We take the matching cost as a clue to the actual depth and show that estimating the error distribution is equivalent to minimizing the matching cost, which is always achievable. This section proposes a general unsupervised strategy for selecting the hyperparameter k_s, so that the importance sampling module is not constrained by hyperparameter tuning in any scenario.

Recall that in MVS, the input 2D images and camera parameters are always available, and photometric consistency holds between different views. Given a 3D point P with depth d_r and projection P_r in the reference view, the projection P_v of P in the v-th source view can be computed as P_v = H_v(d_r) P_r, where H_v(d_r) is a homography matrix.

If the depth estimate D_p^s is correct, then P_v^s = H_v(D_p^s) P_r^s should represent the same 3D point as P_r^s, i.e., the feature F_r^s should equal F_v^s. Since multiple views are given, we use the variance Var[F_v^s] across views to measure their similarity. Therefore, the best depth estimate is D_p^* = argmin Var[F_v^s]. A minimal sketch of this matching cost is given below.
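
The sketch below illustrates this variance-based matching cost, assuming the source-view features have already been warped to the reference view with H_v(D_p^s); the helper name and tensor shapes are our assumptions, not the paper's code:

```python
import torch

def variance_matching_cost(ref_feat: torch.Tensor, warped_src_feats: torch.Tensor) -> torch.Tensor:
    """Per-pixel matching cost as the feature variance across all views.

    ref_feat:         (C, H, W) reference-view feature map F_r^s.
    warped_src_feats: (V, C, H, W) source-view features warped into the reference view.
    A lower cost means the depth estimate agrees better with photometric consistency.
    """
    all_views = torch.cat([ref_feat.unsqueeze(0), warped_src_feats], dim=0)  # (V+1, C, H, W)
    return all_views.var(dim=0, unbiased=False).mean(dim=0)                  # (H, W)
```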

As mentioned in the previous section, k determines the estimated error distribution PDF. Specifically, a larger k corresponds to an error distribution with smaller variance. When k = 1, importance sampling is identical to uniform sampling; when k = ∞, only one candidate point has a chance of being sampled. Clearly, both k = 1 and k = ∞ lead to a non-minimal difference between the estimated and actual PDFs. Therefore, as k increases from 1, the model's performance first improves and then gradually decreases, so we approximate the performance-versus-k curve with a unimodal function. Based on this consideration, we propose an unsupervised hyperparameter selection algorithm for k based on ternary search, as shown in Algorithm 1, Algorithm 2, and Figure 3. Since ternary search shrinks the search range by a constant ratio in each iteration, it converges very quickly; generally, 3 to 5 iterations are enough to obtain a satisfactory k. Our experiments in Figure 4c show that randomly picking two reference views per scan is sufficient to determine k.
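
A minimal sketch of such a ternary search over k is shown below, assuming cost_fn(k) is a dataset-level matching cost (for example, the feature variance above averaged over a few reference views) that is unimodal in k; this is an illustrative stand-in, not the paper's Algorithm 1 or 2:

```python
from typing import Callable

def ternary_search_k(cost_fn: Callable[[float], float],
                     k_lo: float = 1.0, k_hi: float = 16.0, iters: int = 5) -> float:
    """Ternary search for the k that minimizes a unimodal cost function."""
    for _ in range(iters):
        m1 = k_lo + (k_hi - k_lo) / 3.0
        m2 = k_hi - (k_hi - k_lo) / 3.0
        if cost_fn(m1) < cost_fn(m2):
            k_hi = m2          # the minimum lies in [k_lo, m2]
        else:
            k_lo = m1          # the minimum lies in [m1, k_hi]
    return 0.5 * (k_lo + k_hi)
```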

Figure 3: Schematic diagram of the error distribution estimation module. We evaluate k with a photometric loss and apply a ternary search (Algorithm 1 and Algorithm 2) to find the optimal k.

4. Experiment

4.1. Datasets

4.2. Comparison with state-of-the-art methods

Following Vis-MVSNet, a half-size depth map is predicted, while the other mentioned methods predict full-size depth maps. Since the objects in DTU are very small, the depth maps require a higher in-plane resolution; therefore, the improvement on TNT is more significant than that on DTU. Although UCSNet shows a better overall distance, its advantage relies on a depth-range determination strategy, which does not conflict with our depth-range-agnostic sampling algorithm.

Origin blog.csdn.net/qq_43307074/article/details/130659310