[Paper Brief] Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo (CVPR 2023)

1. Brief introduction of the paper

1. First author: Yuesong Wang, Zhaojie Zeng

2. Year of publication: 2023

3. Published venue: CVPR

4. Keywords: MVS, 3D reconstruction, traditional methods, adaptive patch deformation, textureless regions

5. Exploration motivation: Traditional PM-based methods find matches through propagation and local refinement. They do not need to build a cost volume and therefore require little memory, but their limited receptive field makes textureless regions hard to handle. Thanks to the wide use of convolution, deep learning-based methods have a much larger receptive field than traditional methods, and both AA-RMVSNet and TransMVSNet enlarge it further by introducing deformable convolution. As the receptive field grows, unreliable pixels can draw enough geometric information from surrounding reliable pixels, which leads to better depth estimation. However, as the memory comparison figure in the paper shows, a larger receptive field also consumes more memory, which makes it difficult to process datasets with large-scale textureless regions or high-resolution images on mainstream GPUs.

6. Work goal: To develop a memory-friendly solution that also handles large-scale textureless regions well, the paper transplants the idea of deformable convolution into a traditional PM-based MVS pipeline.

7. Core idea: The paper implements a PatchMatch (PM) based MVS method, APD-MVS, which combines adaptive patch deformation with an NCC-based matching metric.

  1. For PM-based MVS, we propose to adaptively deform the patch of an unreliable pixel when computing the matching cost, which increases the receptive field when facing textureless regions to ensure robust matching.
  2. We propose to detect reliable pixels by checking the convergence of matching cost profiles, maintaining the accuracy of detection while being able to find more anchor pixels, which ensures better adaptive patch deformation.

8. Experimental results:

Our method achieves state-of-the-art results on ETH3D dataset and Tanks and Temples dataset with lower memory consumption.

9. Paper download:

https://github.com/whoiszzj/APD-MVS

https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_Adaptive_Patch_Deformation_for_Textureless-Resilient_Multi-View_Stereo_CVPR_2023_paper.pdf

2. Implementation process

1. Overview of APD-MVS

For each reference image I0 and its source images Ii, an image pyramid Li is built by downscaling, where the first layer corresponds to the original image. At the coarsest layer k, conventional PM is used to obtain an initial depth map, which is then used to evaluate the reliability of each pixel. For each unreliable pixel, its patch is adaptively deformed to cover enough high-reliability anchor pixels. At intermediate layers, the depth is inherited from the previous (coarser) layer through upsampling, and the matching strategy depends on pixel reliability: conventional PM is applied to reliable pixels, and deformable PM is applied to unreliable pixels. After the depth is obtained, pixel reliability and the deformable patches of unreliable pixels are computed again and passed to the next layer. Finally, at layer 1, the depth estimates are fused into a dense point cloud. The propagation and refinement modules are also modified to make better use of pixel reliability.

2. Deformable PatchMatch

First, a review of the traditional PM method. Under a planar assumption, a fixed-size patch centered on a pixel in the reference image can be projected onto the source image. The matching cost between the two patches is usually computed with the NCC metric. Specifically, assume a reference image Ii with camera parameters Pi = {Ki, Ri, Ci} and a source image Ij with camera parameters Pj = {Kj, Rj, Cj}, where K is the intrinsic matrix, R is the rotation matrix, and C is the camera center. For a pixel p in Ii with homogeneous coordinates p = [u, v, 1]T, let its plane hypothesis be fp = [nT, d]T, where n is the plane normal and d is the depth. The homography is defined as:
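The formula itself is missing from this text. A form consistent with the symbols defined above can be derived by writing a point on the plane as d·Ki^{-1}·p in the reference camera frame (assuming n is expressed in that frame) and transforming it into the source camera. The block below is that reconstruction, not a verbatim copy of the paper's equation:

```latex
H_{ij} = K_j \left( R_j R_i^{-1}
       + \frac{R_j \,(C_i - C_j)\, n^{T}}{n^{T}\, d\, K_i^{-1} p} \right) K_i^{-1}
```

Every pixel q in the reference window then maps to Hij·q (up to scale) in the source image, which is how the corresponding patch is obtained.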

Let a square window Bp centered at pixel p denote the reference patch. Its corresponding patch Bjp in the source image Ij can be found using Hij. The matching cost is calculated as 1 minus the NCC score:

where cov(X, Y) = E[(X − E(X))(Y − E(Y))] is the covariance, x and y are the color values in X and Y, and E(·) is the expected value. The per-view costs are then aggregated with view weights to obtain m(p, fp, Bp), which takes all source images into account.
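The NCC-based cost can be sketched in a few lines. This is a minimal pure-Python illustration, assuming both patches have already been sampled into equal-length lists of intensities (the homography warp and the view weighting are omitted) and assuming a constant patch receives the worst cost of 2.0. The names `ncc_cost`, `cov` and `mean` are hypothetical.

```python
import math


def mean(values):
    """E(X) over a list of samples."""
    return sum(values) / len(values)


def cov(xs, ys):
    """cov(X, Y) = E[(X - E(X)) (Y - E(Y))] over paired samples."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def ncc_cost(ref_patch, src_patch, eps=1e-12):
    """Matching cost 1 - NCC, which lies in [0, 2]; lower means a better match."""
    denom = math.sqrt(cov(ref_patch, ref_patch) * cov(src_patch, src_patch))
    if denom < eps:
        # Assumption: a textureless (constant) patch gets the worst cost.
        return 2.0
    return 1.0 - cov(ref_patch, src_patch) / denom
```

Because NCC is invariant to gain and offset, `ncc_cost([1, 2, 3, 4], [2, 4, 6, 8])` is close to 0, while a reversed patch gives a cost close to 2.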

Unlike traditional PM, which only considers pixels in a square window, deformable PM computes the matching cost of an unreliable pixel within a deformable patch that covers enough high-reliability anchor pixels. Suppose p is an unreliable pixel whose deformable patch contains the set of anchor pixels S. Given a plane hypothesis fp, the deformable PM matching cost mD(p, fp, S) is defined as

where λ is a weight that adjusts the influence of the anchor pixels on the center pixel p. The window Bp has size w×w and is sampled with an interval of θ. The window Bs has the same size as Bp but is sampled with an interval of w/2, which speeds up the calculation. In the experiments, λ = 0.25, w = 11, θ = 2, and |S| = 8.

The calculation of mD(p, fp, S) is shown in the figure below. To obtain robust local features, a local window Bs (s ∈ S) is generated for each anchor pixel s, which yields m(s, fp, Bs). The costs of the windows Bp and Bs (s ∈ S) are computed separately and then aggregated by weight, which retains more feature information. Weighted aggregation is used instead of computing m(p, fp, Ball) with Ball = Bp ∪ {Bs | s ∈ S} because the unreliable pixels in Ball usually outnumber the reliable ones. Computing m(p, fp, Ball) directly would cause the feature information from reliable pixels to be filtered out as noise.

Figure: Visualization of the deformable PM calculation. The green dot is the unreliable pixel at the center. The red dotted lines form the deformable patch, and the anchor pixels si (i = 1...8) are shown as red dots. The matching cost of the center pixel is obtained through weighted aggregation.
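The weighted aggregation can be illustrated with a short sketch. The aggregation formula is missing from this text, so the code assumes a convex combination in which λ weights the center window and the remaining 1 − λ is spread evenly over the anchor windows. That form, and the name `deformable_cost`, are assumptions made for illustration.

```python
def deformable_cost(center_cost, anchor_costs, lam=0.25):
    """Aggregate m(p, f_p, B_p) with the anchor costs m(s, f_p, B_s), s in S.

    Assumed form: lam * center + (1 - lam) * mean(anchor costs).
    All costs must have been evaluated under the same plane hypothesis f_p.
    """
    if not anchor_costs:
        # Without anchors there is nothing to aggregate: conventional PM cost.
        return center_cost
    anchor_mean = sum(anchor_costs) / len(anchor_costs)
    return lam * center_cost + (1.0 - lam) * anchor_mean
```

With λ = 0.25, an ambiguous center cost of 1.0 and eight anchors that all agree on the hypothesis with cost 0.2 aggregate to 0.4, so the anchors dominate the decision in textureless areas.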

3. Adaptive patch deformation

To compute the matching cost of unreliable pixels, their patches must be adaptively deformed in advance. The deformation should follow these principles as closely as possible:

  1. The deformable patch of an unreliable pixel should adaptively cover enough nearby anchor pixels;
  2. The depth within a deformable patch should be continuous, ensuring that the pixels in the patch are related;
  3. Anchor pixels should be as close as possible to the unreliable pixel so that the deformed patch fits better;
  4. Anchor pixels should be found in all directions to increase the robustness of deformable PM.

To meet these requirements, an adaptive patch deformation algorithm based on a spoke-like search and RANSAC is proposed. To simplify subsequent processing, the nearest reliable pixel N(p) of each pixel p is computed in advance as:

where Ω is the search range centered on p (in the experiments, Ω is a 100×100 square window), and R(p) is the reliability indicator of p, where 1 means reliable and 0 means unreliable.
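The nearest-reliable-pixel lookup can be sketched as a brute-force window scan. This is only an illustration of the definition, assuming Euclidean pixel distance and a window of half-width 50 (so 101×101, roughly the 100×100 window mentioned above). A real implementation would more likely use a distance transform or a GPU kernel. The name `nearest_reliable` is hypothetical.

```python
def nearest_reliable(reliable, p, half_window=50):
    """Return N(p): the nearest pixel q inside the window around p with R(q) == 1.

    reliable: 2D list of 0/1 flags indexed as reliable[row][col].
    p: (row, col). Returns a (row, col) tuple, or None if the window holds
    no reliable pixel.
    """
    rows, cols = len(reliable), len(reliable[0])
    pr, pc = p
    best, best_d2 = None, None
    for r in range(max(0, pr - half_window), min(rows, pr + half_window + 1)):
        for c in range(max(0, pc - half_window), min(cols, pc + half_window + 1)):
            if reliable[r][c] != 1:
                continue
            d2 = (r - pr) ** 2 + (c - pc) ** 2  # squared Euclidean distance
            if best_d2 is None or d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best
```

Precomputing this map once per iteration lets the later sector search query N(q) in constant time.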

The main idea of the algorithm is to obtain ϕ candidate reliable pixels {Ci}, i = 1…ϕ, and then keep the well-fitting ones through RANSAC. To search for candidates around the center pixel p in all directions, a spoke-like approach is adopted that divides the search space into ϕ sectors of equal angle. For each sector, given an initial search radius, a random direction vector is generated within the sector to obtain a search pixel q. If N(q) is valid and lies within the sector, it is recorded as the candidate reliable pixel for that sector. Otherwise, further random searches are made at the same radius. If no candidate is found, the radius is enlarged and the process is repeated, as shown in Figure 5. The algorithm aims to find a reliable pixel in every direction while keeping the candidates as close to the center as possible.
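The sector search described above can be sketched as follows. The initial radius, the maximum radius, the number of tries per radius and the radius doubling are illustrative assumptions, since the text does not give these values. `nearest` stands for any callable that returns N(q) or None, and both function names are hypothetical.

```python
import math
import random


def sector_of(center, q, num_sectors):
    """Index of the angular sector around `center` that contains pixel q."""
    angle = math.atan2(q[0] - center[0], q[1] - center[1]) % (2.0 * math.pi)
    return min(int(angle / (2.0 * math.pi / num_sectors)), num_sectors - 1)


def spoke_search(center, nearest, num_sectors=32, init_radius=4.0,
                 max_radius=64.0, tries=4, rng=None):
    """Collect at most one candidate reliable pixel per sector.

    Returns a dict {sector index: candidate pixel}. Sectors for which no
    candidate was found within max_radius are simply absent.
    """
    rng = rng or random.Random(0)
    width = 2.0 * math.pi / num_sectors
    candidates = {}
    for k in range(num_sectors):
        radius = init_radius
        while radius <= max_radius and k not in candidates:
            for _ in range(tries):
                # Random direction inside sector k at the current radius.
                theta = (k + rng.random()) * width
                q = (center[0] + round(radius * math.sin(theta)),
                     center[1] + round(radius * math.cos(theta)))
                n_q = nearest(q)
                if (n_q is not None and n_q != center
                        and sector_of(center, n_q, num_sectors) == k):
                    candidates[k] = n_q
                    break
            radius *= 2.0  # assumed growth rule: double the radius
    return candidates
```

Starting from a small radius and growing it only on failure is what keeps the candidates close to the center pixel.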

After {Ci}, i = 1…ϕ, is obtained, it is filtered with RANSAC to improve robustness to occlusion. Since the method is PM-based, the pixels in a deformable patch are implicitly required to share the same plane hypothesis, so candidates that cannot be fitted to a consistent plane are treated as outliers. At each iteration, three candidate pixels are randomly sampled, and the fitting plane π is built from their 3D points. The center pixel must lie inside the triangle formed by these three candidate pixels, which ensures that anchor pixels surround the center pixel in all directions. The distances {Di}, i = 1…ϕ, between π and the 3D points of the candidate pixels are then computed. The cost of the random sample, Cost(π), is given by

where ε is the threshold for filtering outliers, I(·) is an indicator function with I(true) = 1 and I(false) = 0, and Dp is the distance from the 3D point of the center pixel p to the fitting plane π. When the costs of different samples are compared, α is considered first: the smaller α is, the better the plane fits. If two costs are equal in α, β is compared. After a fixed number of iterations, if a best-fitting plane πbest is found (that is, enough candidates lie close to the plane), at most |S| reliable pixels are selected from {Ci | Di ≤ ε, i = 1…ϕ} according to their distance to πbest, and these form the deformable patch. Otherwise, the center pixel probably lies in a non-planar area, and conventional PM is still used to compute its matching cost.
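The RANSAC filtering step can be sketched as below. The cost equation is missing from this text, so the sketch assumes α is the number of candidates with Di > ε and β is Dp, compared lexicographically. Those definitions, the iteration count and all function names are assumptions for illustration only.

```python
import math
import random


def plane_from_points(a, b, c):
    """Plane (unit normal n, offset k) with n.x + k = 0, or None if degenerate."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(x * x for x in n))
    if norm < 1e-12:
        return None
    n = [x / norm for x in n]
    return n, -sum(n[i] * a[i] for i in range(3))


def distance(plane, x):
    """Unsigned distance from 3D point x to the plane."""
    n, k = plane
    return abs(sum(n[i] * x[i] for i in range(3)) + k)


def in_triangle(p, a, b, c):
    """True if 2D pixel p lies inside or on the triangle abc."""
    def side(u, v, w):
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
    s = [side(a, b, p), side(b, c, p), side(c, a, p)]
    return all(x >= 0 for x in s) or all(x <= 0 for x in s)


def ransac_plane(center_px, center_xyz, cand_px, cand_xyz, eps, iters=200, rng=None):
    """Return (cost, plane, inlier indices) for the best sample, or None."""
    if len(cand_xyz) < 3:
        return None
    rng = rng or random.Random(0)
    best = None
    for _ in range(iters):
        i, j, k = rng.sample(range(len(cand_xyz)), 3)
        # The center pixel must be enclosed by the three sampled candidates.
        if not in_triangle(center_px, cand_px[i], cand_px[j], cand_px[k]):
            continue
        plane = plane_from_points(cand_xyz[i], cand_xyz[j], cand_xyz[k])
        if plane is None:
            continue
        dists = [distance(plane, x) for x in cand_xyz]
        alpha = sum(1 for d in dists if d > eps)  # assumed: number of outliers
        beta = distance(plane, center_xyz)        # assumed: D_p
        cost = (alpha, beta)                      # tuples compare alpha first, then beta
        if best is None or cost < best[0]:
            best = (cost, plane, [m for m, d in enumerate(dists) if d <= eps])
    return best
```

Sorting the returned inliers by their distance to the plane and keeping the closest |S| of them would give the anchor pixels of the deformable patch.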

To handle non-planar regions better, the plane fitting threshold ε is relaxed at first so that such regions are treated as planar, and it is then gradually tightened as the optimization proceeds, which frees non-planar regions from the planar constraint. With this approach, and given a good initialization, deformable PM finds correct depth estimates in non-planar regions more easily. In the experiments, ϕ = 32 and ε is gradually reduced from 1% to 0.5% of the scene depth range.

Figure: The image on the right demonstrates adaptive patch deformation. The green dot is the center pixel, the blue dots around it form the conventional patch, and the red dots form the receptive field of the deformable patch. The left plot shows the matching cost around the ground truth (dashed green line). Compared with traditional PM, deformable PM converges markedly better toward the ground truth for unreliable pixels.

4. Pixel Reliability Evaluation

One problem remains: how to evaluate the reliability of pixels. It is inappropriate to split pixels strictly into reliable and unreliable categories from the start. As the optimization proceeds, the depth estimates of more pixels fall into a search range that is free of matching ambiguity, which makes them stable enough to serve as anchor pixels for deformable PM. More anchor pixels lead to better-fitting deformed patches, which further improves depth accuracy. The paper therefore proposes a per-pixel reliability evaluation mechanism that checks the matching cost profile during optimization. Specifically, for each pixel, a matching cost profile is built from the costs computed by conventional PM in the vicinity of the current depth estimate. Reliability is then assessed by examining the valleys and the convergence of the profile.

Since the depth range differs from scene to scene, sampling depth directly does not generalize, so sampling is performed in disparity space. For each pixel p in the reference image I0, given a plane hypothesis fp = [nT, d]T and the camera focal length f, the average disparity is calculated as:

where Sgood is the subset of {Ii} chosen by joint view selection, bIj is the baseline length between the reference image I0 and the source image Ij, and E(·) is the expected value. The matching cost is then computed with traditional PM to obtain the profile P(D):

where B is a regular square window centered at p and δ is the sampling range. Once the matching cost profile is obtained, the reliability of the pixel is evaluated from the geometric properties of the profile, as shown in Algorithm 1. The main idea of the algorithm is to measure how significant the global minimum is relative to the other local minima. For a reliable pixel, the global minimum should be more discriminative than it is for an unreliable pixel.
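The disparity-space sampling can be sketched as follows. Both equations are missing from this text, so the sketch assumes the standard stereo relation disparity = f · b / d averaged over the selected views, and a unit disparity step between samples. These choices and the function names are assumptions.

```python
def mean_disparity(depth, focal, baselines):
    """Average of f * b / d over the baselines of the selected source views."""
    return sum(focal * b / depth for b in baselines) / len(baselines)


def sample_depths(depth, focal, baselines, delta=30):
    """Depth hypotheses at integer disparity offsets -delta..delta.

    Returns a list of (offset, depth) pairs. Offsets that would give a
    non-positive disparity are skipped because they have no valid depth.
    """
    d0 = mean_disparity(depth, focal, baselines)
    samples = []
    for i in range(-delta, delta + 1):
        disp = d0 + i
        if disp > 0:
            # Disparity is inversely proportional to depth: d' = d * d0 / disp.
            samples.append((i, depth * d0 / disp))
    return samples
```

Evaluating the conventional PM cost at each sampled depth then yields the profile; uniform steps in disparity keep the sampling density comparable across scenes with very different depth ranges.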

Algorithm 1 assigns each pixel one of two states: reliable or unreliable. The function FindLMins finds the local minima of the profile, while FindGMin finds the global minimum. The parameters t1, t2 and t3 are thresholds, set to 0.50, 0.15 and 0.20 in the experiments. η (< δ) is the threshold used to judge whether the depth optimization has converged. As the optimization proceeds, η decreases dynamically, which makes the test for reliable pixels stricter. In the experiments, δ = 30 and η = max(6 − 2i, 2), where i is the iteration index.
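The building blocks named above can be sketched as simple helpers. Algorithm 1 itself is not reproduced in this text, so the threshold tests with t1, t2 and t3 are deliberately left out. The snake_case names are hypothetical stand-ins for FindLMins and FindGMin, and the convergence test follows the |DGM − D'| > η condition quoted later in this brief.

```python
def find_local_minima(profile):
    """Indices whose cost is strictly below both neighbours (interior points only)."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]]


def find_global_minimum(profile):
    """Index of the smallest cost in the profile."""
    return min(range(len(profile)), key=lambda i: profile[i])


def eta(iteration):
    """Convergence threshold eta = max(6 - 2 * i, 2)."""
    return max(6 - 2 * iteration, 2)


def has_converged(global_min_index, current_index, iteration):
    """True when |D_GM - D'| <= eta, measured in disparity samples."""
    return abs(global_min_index - current_index) <= eta(iteration)
```

For the profile `[0.9, 0.5, 0.7, 0.2, 0.6, 0.4, 0.8]` the local minima are at indices 1, 3 and 5 and the global minimum is at index 3. A current estimate three samples away counts as converged at iteration 0 (η = 6) but not at iteration 2 (η = 2).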

5. Propagation and refinement

Building on [37], the propagation and refinement process is modified to make better use of the deformable patch information of unreliable pixels. When joint view selection is performed on an unreliable pixel, the cost matrix M is computed directly from the anchor pixels in its deformable patch, which improves the robustness of view selection.

During propagation, the candidate plane hypotheses propagated to unreliable pixels are replaced by the plane hypotheses of their anchor pixels. In the refinement step, the fitted plane hypothesis of the deformable patch is added to the candidate hypotheses, which speeds up convergence. These changes make the depth estimates of unreliable pixels partly dependent on those of reliable pixels. Therefore, reliable pixels are processed first in each iteration, followed by unreliable pixels.

When the depth optimization of a truly reliable pixel has not yet converged, i.e. |DGM − D'| > η in Algorithm 1, the pixel is processed with deformable PM just like an unreliable pixel. This uses the surrounding information to speed up convergence, and it also prevents such pixels from harming the adaptive patch deformation of unreliable pixels. However, the imposed planar constraint reduces depth accuracy at these pixels, so a local optimization is added at the end of the optimization. Small samples are taken around the original depth, and the optimal depth is chosen according to the cost of traditional PM. If the cost of the optimal depth is much smaller than the cost of the original depth, the optimal depth is adopted. This improves accuracy at truly reliable pixels and has little effect on truly unreliable pixels.
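The final local optimization can be sketched as a small search around the current depth. The relative step, the number of samples and the improvement ratio that stands in for "much smaller" are illustrative assumptions, and `cost_fn` is a hypothetical callable that returns the conventional PM cost for a given depth.

```python
def local_refine(depth, cost_fn, rel_step=0.001, num_steps=5, improvement=0.8):
    """Sample depths near `depth` and switch only if the best cost is clearly lower.

    cost_fn: callable mapping a depth to the conventional PM matching cost.
    The new depth is accepted only when its cost is below
    improvement * cost_fn(depth); otherwise the original depth is kept.
    """
    base_cost = cost_fn(depth)
    best_depth, best_cost = depth, base_cost
    for k in range(-num_steps, num_steps + 1):
        candidate = depth * (1.0 + k * rel_step)
        cost = cost_fn(candidate)
        if cost < best_cost:
            best_depth, best_cost = candidate, cost
    if best_cost < improvement * base_cost:
        return best_depth
    return depth
```

Requiring a clear improvement is what keeps the step from disturbing truly unreliable pixels, whose conventional PM cost is nearly flat around the current depth.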

6. Experiment

6.1. Memory overhead

Memory cost is measured on an NVIDIA TITAN GPU. Learning-based methods still do not balance memory overhead and reconstruction quality well, whereas conventional methods usually consume little GPU memory. CasMVSNet is a widely used learning-based baseline, while PatchmatchNet, GBi-Net, and IterMVS are designed to reduce memory consumption. Even so, the performance of these learning-based methods remains unsatisfactory compared with the state-of-the-art traditional method ACMMP. Among all these state-of-the-art methods, APD-MVS achieves lower memory consumption and better reconstruction results.

6.2. Comparison with state-of-the-art methods
