[Paper Brief] DELS-MVS: Deep Epipolar Line Search for Multi-View Stereo (WACV 2023)

1. Brief introduction of the paper

1. Lead author: Mattia Rossi

2. Year of publication: 2023

3. Publication venue: WACV (conference)

4. Keywords: MVS, 3D reconstruction, epipolar search

5. Exploration motivation: Whether existing methods operate on depth or inverse depth, the depth range must be determined in advance, which degrades the reconstruction quality and increases the memory overhead.

The second issue is the need to discretize the depth search space, which requires both defining a depth range of interest and choosing a discretization scheme for it. Although the depth range can be inferred from the sparse reconstruction of a Structure from Motion (SfM) system, the sparsity of that reconstruction can lead to under- or over-estimating the depth range of the scene portion captured by a specific image, thus preventing the reconstruction of some scene areas. Moreover, some discretization strategies suit certain scenes better than others. For objects close to the camera, a fine-grained discretization toward the depth range minimum is preferable; conversely, objects farther away can be reconstructed even with a coarser discretization. The limitations of depth discretization approaches become obvious when considering scenes containing multiple objects at very different distances from the camera. A better alternative is to discretize the inverse depth range, but this still calls for a predefined depth range of interest.
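To make the trade-off concrete, here is a small NumPy sketch (my own illustration, not code from the paper) contrasting uniform depth sampling with inverse-depth sampling; note that both variants still require d_min and d_max up front, which is exactly the dependency DELS-MVS removes:

```python
import numpy as np

def depth_hypotheses(d_min, d_max, num, inverse=False):
    """Generate depth hypotheses either uniformly in depth or
    uniformly in inverse depth (finer sampling near the camera)."""
    if inverse:
        inv = np.linspace(1.0 / d_max, 1.0 / d_min, num)
        return np.sort(1.0 / inv)
    return np.linspace(d_min, d_max, num)

# Uniform depth sampling: constant spacing everywhere.
uniform = depth_hypotheses(1.0, 100.0, 64)
# Inverse-depth sampling: dense near d_min, sparse near d_max.
inv_depth = depth_hypotheses(1.0, 100.0, 64, inverse=True)
```

For a scene mixing near and far objects, neither fixed scheme fits both regions well, which motivates searching the epipolar line directly instead.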

6. Work goal: Address the above problems through a search along the epipolar line, avoiding depth-range discretization altogether.

7. Core idea: To tackle both of these disadvantages, the authors propose a new method for MVS depth estimation denoted Deep Epipolar Line Search (DELS). For each pixel in the reference image, the matching point in the source image is searched for by exploring the corresponding epipolar line. The search procedure is carried out iteratively and utilizes bi-directional partitions of the epipolar line around the current estimate to update the predicted matching point. These updates are guided by a deep neural network fed with features from both the reference and the source image.

  1. A deep, iterative and coarse-to-fine depth estimation algorithm operating directly on the epipolar lines, thus avoiding the drawbacks of depth discretization, e.g. no specification of a depth range is needed.
  2. A confidence prediction module and a geometry-aware fusion strategy that, coupled, permit a robust fusion of the multiple reference image depth maps from the different source images.

8. Experimental results:

The robustness of the method is verified by evaluating on the most popular MVS benchmarks, namely ETH3D, Tanks and Temples, and DTU, achieving competitive results.

9. Paper download:

arXiv:2212.06626 (arxiv.org)

2. Implementation process

1. Overview of DELS-MVS

The system takes one reference image R and N ≥ 1 source images Sn, 0 ≤ n ≤ N−1, as input. First, a convolutional neural network extracts deep features from the reference image and a given source image Sn; these features are then fed to the core algorithm, which estimates a depth map Dn for the reference image. For each reference image pixel, the goal of the algorithm is to estimate the residual, along the epipolar line, between the true projection of the pixel onto the source image and an initial guess. To avoid scale dependence, the algorithm estimates this residual through iterative classification steps in a coarse-to-fine manner; because the iterative classification resembles a search and relies on a deep neural network, called the Epipolar Residual Network (ER-Net), the algorithm is named Deep Epipolar Line Search (DELS). Finally, DELS-MVS also features a confidence network (C-Net), which associates a confidence map with each estimated depth map Dn.

After extracting depth features, deep epipolar search generates output depth maps and confidence maps for each source image, which are then fused into a single output depth map.

2. Depth estimation through epipolar residuals

The DELS algorithm is initialized with a constant depth map Dn0, where the constant is the midpoint between the minimum and maximum of the depth range obtained from the Structure from Motion (SfM) system.

Given a pixel (x, y) in the reference image, let pn0(x, y) be the projection of (x, y) onto the source image Sn according to the initial depth Dn0(x, y). Let DnGT be the ground-truth depth map and pnGT(x, y) the projection of (x, y) onto Sn according to DnGT(x, y). Both projections lie on the same epipolar line, as shown in the image above. The goal is to estimate the one-dimensional epipolar residual EnGT(x, y) ∈ R such that the following relationship holds:

pnGT(x, y) = pn0(x, y) + EnGT(x, y) · d(x, y),    (1)

where the unit vector d(x, y) ∈ R2 defines the epipolar direction. In fact, it is important to observe that epipolar lines do not inherently have any direction. In particular, the direction d(x, y) is chosen such that, moving along it, the depth hypothesis at (x, y) decreases monotonically; for completeness, observe that the mapping between positions along the oriented epipolar line and depth hypotheses is therefore monotonic.

Denoting by En(x, y) the epipolar residual estimated by DELS, Eq. (1) is used to correct the initial estimate pn0(x, y), yielding pn(x, y). The corrected position on the source image can then be converted back into a depth value Dn(x, y) for the reference pixel (x, y) using the known relative camera transformation TR→Sn. This process is shown in the figure below.
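As an illustration of this projection-and-correction step, here is a small NumPy sketch (my own illustration with hypothetical helper names, not code from the paper; the intrinsics K_ref, K_src and relative pose (R, t) are assumed inputs, and the final conversion back to depth via TR→Sn is omitted):

```python
import numpy as np

def project_to_source(px, depth, K_ref, K_src, R, t):
    """Project reference pixel `px` = (x, y) with a depth hypothesis into
    the source image: back-project with K_ref, transform with (R, t),
    then apply K_src and the perspective divide."""
    x, y = px
    X_ref = depth * np.linalg.inv(K_ref) @ np.array([x, y, 1.0])  # 3D point in reference frame
    X_src = R @ X_ref + t                                         # move to source camera frame
    p = K_src @ X_src
    return p[:2] / p[2]                                           # pixel coordinates in the source image

def correct_along_epipolar(p0, residual, d_hat):
    """Eq. (1): shift the initial projection p0 by the scalar epipolar
    residual along the unit epipolar direction d_hat."""
    return p0 + residual * d_hat
```

With an identity pose the projection is the identity on pixel coordinates, which makes the geometry easy to sanity-check.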

3. Deep Epipolar Line Search (DELS)

In MVS, the baseline between different source and reference images may vary greatly, both within the same scene and across different scenes. Furthermore, depth maps can exhibit very different ranges depending on the specific scene: from very small ranges (for the reconstruction of small objects) to very large ones (for the reconstruction of outdoor scenes). In most 3D reconstruction scenarios, the scene scale is unknown. Overall, this makes network training and the direct regression of the epipolar residual a challenging task. Therefore, the paper reframes the epipolar residual estimation problem as an iterative, coarse-to-fine classification scheme.

Let Eni(x, y), i = 1, 2, ..., denote the epipolar residual estimate after iteration i. In order to estimate the residual at a new iteration i, the epipolar line is divided into k partitions L = {0, 1, ..., k−1}, with k even, as shown in the figure below. In particular, two partition sets are defined: the inner set LI = {1, ..., k−2} and the outer set LO = {0, k−1}. The inner set is centered at the epipolar position pni−1(x, y) of the previous iteration, obtained from Eni−1(x, y) via Eq. (1). It extends from a left boundary OiL(x, y) to a right boundary OiR(x, y), with δi(x, y) ∈ R+ denoting the width of each inner partition. Partitions 0 and k−1 of the outer set model the case where the match lies on a segment of the epipolar line not covered by the current inner partitions, i.e., outside the segment enclosed by OiL(x, y) and OiR(x, y); thus, they provide an indication of the search direction for the next iteration.
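Under the layout just described, the partition boundaries can be sketched in a 1-D parameterization along the epipolar line (hypothetical helper names; the origin is placed at the current estimate):

```python
import numpy as np

def epipolar_partitions(p_prev, delta, k=12):
    """Split the epipolar line (1-D coordinate along the line) into k
    partitions: k-2 finite inner cells of width `delta` centred on the
    current estimate `p_prev`, plus two unbounded outer cells 0 and k-1."""
    half = (k - 2) / 2.0
    o_left = p_prev - half * delta    # left boundary O_L of the inner region
    o_right = p_prev + half * delta   # right boundary O_R
    # The k-1 edges delimiting the k-2 inner partitions.
    inner_edges = o_left + delta * np.arange(k - 1)
    return o_left, o_right, inner_edges
```

Everything left of O_L belongs to outer partition 0 and everything right of O_R to outer partition k−1.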

A neural network is used to predict the partition l*i(x, y) ∈ L that contains the ground-truth projection pnGT(x, y). Once the partition is determined, the relative residual eni(x, y) ∈ R can be computed and added to the residual of the previous iteration to obtain the current estimate:

Eni(x, y) = Eni−1(x, y) + eni(x, y),

where En0(x, y) = 0. In particular, two cases are considered:

Case 1: l*i(x, y) ∈ LI. This indicates that pnGT(x, y) lies somewhere within the finite partition l*i(x, y), so pni(x, y) is placed at its midpoint.

Case 2: l*i(x, y) ∈ LO. This only provides a general hint about the search direction toward pnGT(x, y), so pni(x, y) is placed at OiL(x, y) − δi(x, y)/2 for l*i(x, y) = 0 and at OiR(x, y) + δi(x, y)/2 for l*i(x, y) = k−1. In particular, this strategy compensates for incorrect predictions in the previous iteration, since some overlap is maintained between the inner partitions of consecutive iterations.

The two cases are summarized in the figure above, which sketches three iterations of DELS. In general, the algorithm distinguishes whether the predicted match on the epipolar line lies in the inner set LI (Case 1) or in the outer set LO (Case 2); this determines the partition width δi and the estimated position for the next iteration. For a predicted partition l*i(x, y) ∈ L, the generalized update strategy amounts to defining the relative residual as follows:
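One way to express this generalized update in code (the single closed form below is my own consolidation of Cases 1 and 2 under the partition layout above, not a formula quoted from the paper):

```python
def relative_residual(l_star, delta, k=12):
    """Relative residual e for the predicted partition l_star in {0,...,k-1},
    measured from the current estimate. For an inner partition this is its
    midpoint offset (Case 1); for the outer partitions 0 and k-1 it lands
    delta/2 beyond the inner boundary, i.e. O_L - delta/2 or O_R + delta/2
    (Case 2). Both cases collapse to the closed form (l - (k-1)/2) * delta."""
    return (l_star - (k - 1) / 2.0) * delta
```

For example, with k = 12 and δ = 1, partition 5 (an inner cell spanning [−1, 0]) maps to its midpoint −0.5, while outer partition 0 maps to −5.5, half a width beyond the left boundary O_L = −5.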

The DELS coarse-to-fine scheme updates the partition width δi at each iteration, again distinguishing Cases 1 and 2. In the first case, the partition width is halved at the beginning of iteration i+1, while in the second it remains constant:

δi+1(x, y) = max(δi(x, y)/2, ε) in Case 1, δi+1(x, y) = δi(x, y) in Case 2,

where ε ∈ R+ is a lower bound on the partition width. Finally, DELS adopts a multi-resolution strategy, where j = 0, 1, 2 denote full, half, and quarter resolution, respectively, and Ij denotes the number of iterations at level j. Level j is initialized with the upscaled epipolar residual of the previous level j+1, that is, Enj = NN↑(Enj+1) · 2, where NN↑ is the nearest-neighbor upsampling operator. Furthermore, iterations for a given pixel stop once the partition width lower bound εj of the level is reached and the selected partition belongs to the inner set, as the maximum achievable accuracy for the epipolar residual is then considered reached. The figure above (iteration 3, left) exemplifies this behavior.
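The width-update and stopping rules can be sketched as follows (hypothetical helper names, following the two cases described above):

```python
def update_width(delta, in_inner, eps):
    """Coarse-to-fine rule: halve the partition width after an inner-set
    prediction (Case 1), keep it unchanged after an outer-set prediction
    (Case 2); eps lower-bounds the width."""
    return max(delta / 2.0, eps) if in_inner else delta

def converged(delta, in_inner, eps):
    """Stop iterating for a pixel once the width floor is reached
    and the selected partition lies in the inner set."""
    return in_inner and delta <= eps
```

Keeping the width constant after an outer-set prediction is what lets the next iteration's inner region slide along the line without losing resolution.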

4. Epipolar residual network

The Epipolar Residual Network (ER-Net) performs the classification task at each DELS iteration. ER-Net receives the epipolar residual map Eni−1 of the previous iteration: this allows it, for each reference image pixel (x, y), to sample the source features FSn along the epipolar line around the latest estimate pni−1(x, y). To this end, deformable convolutions are incorporated into an adapted U-Net-like architecture. In particular, at the top level of the U-Net, the deformable kernels are laid out along the epipolar line, aligned so that their samples are spaced δi(x, y)/2 apart before and after pni−1(x, y). With this setup, the samples in the inner set fall at the start, middle, and end of each partition. The sampling scheme is represented by the gray dots in the figure above. The kernel size is chosen to cover all samples of the inner set and to extend further into the outer set: with k = 12 and a kernel size of 5, this yields 5 × 5 = 25 samples. The U-Nets at the lower resolution levels keep the same partition width and kernel size as the highest level, which increases the network's receptive field.
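The sampling layout along the line can be illustrated as follows (a simplified sketch of where the samples land, not the actual deformable-convolution implementation; the helper name is hypothetical):

```python
import numpy as np

def epipolar_sample_points(p_prev, d_hat, delta, num_samples=25):
    """Place `num_samples` feature-sampling locations on the epipolar line,
    spaced delta/2 apart and centred on the current estimate p_prev, mimicking
    how the deformable kernels are laid out along the line. With spacing
    delta/2, samples fall on the start, middle, and end of each inner cell."""
    offsets = (np.arange(num_samples) - (num_samples - 1) / 2.0) * (delta / 2.0)
    return p_prev[None, :] + offsets[:, None] * d_hat[None, :]
```

With 25 samples the grid spans ±6δ around the current estimate, covering the ±5δ inner region (for k = 12) and reaching into the outer set on both sides.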

The multi-channel output of the network is fed into a softmax activation layer, which produces a probability for each of the k partitions li(x, y) ∈ L:

During inference, select the partition with the highest softmax probability:

ER-Net is employed in a multi-resolution scheme involving full, half and quarter resolutions , but does not share parameters between layers.

5. Confidence network

The method computes N depth maps on a reference image , each computed using a different source image. This raises the problem of how to utilize all estimated depth maps to fuse them into a single depth map, since some reference image regions may be visible in one source image but not in another. To this end, a confidence network (C-Net) is introduced to assign a confidence map Cn to each estimated depth map Dn: the confidence map is then used to guide the fusion of multiple available depth maps.

At each level j of the multi-resolution scheme, a map akin to the pixel-wise entropy of the partition probabilities is computed, taking into account its evolution across the DELS iterations:

C-Net is supervised with C̃n = NN↑(C̃n2) + NN↑(C̃n1) + C̃n0, which accounts for the different resolution levels. Furthermore, C̃n(x, y) is set to the maximum entropy value at locations where the partition estimated in the last iteration satisfies l*i(x, y) ∈ LO. This encodes the knowledge that last-iteration predictions in the outer set LO yield unreliable depth estimates, indicating occlusion. The C-Net architecture consists of 4 convolutional layers (kernel size 3, leaky ReLU activations) and a sigmoid output activation, producing Cn in the range [0, 1].
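The entropy-like quantity underlying this supervision can be sketched as follows (a simplified per-pixel version of my own; the paper aggregates it across DELS iterations and resolution levels):

```python
import numpy as np

def partition_entropy(probs):
    """Pixel-wise entropy of the k partition probabilities (last axis).
    High entropy = uncertain classification, i.e. low confidence."""
    p = np.clip(probs, 1e-8, 1.0)  # guard against log(0)
    return -np.sum(p * np.log(p), axis=-1)

# A peaked distribution is confident (low entropy), a flat one is not.
peaked = partition_entropy(np.array([0.97] + [0.003] * 10 + [0.0]))
```

The entropy of a uniform distribution over k = 12 partitions is ln 12 ≈ 2.48, the maximum value used to overwrite pixels whose last prediction fell in the outer set.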

6. Geometry-aware multi-view fusion

At each pixel (x, y) there is a set of N depth candidates, one per source image. The depth candidate Dn(x, y) corresponds to the center of the final epipolar line partition predicted by the DELS algorithm, and the partition length corresponds to a depth range of size ∆Dn(x, y) ∈ R in the scene. It is worth noting that, although the epipolar partitions have the same length on the different source images, the corresponding depth range ∆Dn(x, y) differs from source image to source image, depending on the geometric relationship between the source and reference cameras. The method exploits this key fact and proposes a geometry-aware fusion scheme utilizing the learned confidence. In particular, depth candidates whose associated partitions have sufficient confidence and a narrow depth range are preferred, since they correspond to reliable and fine-grained depth predictions.

Therefore, the fusion proceeds as follows. First, the set Nω of candidates with confidence greater than a threshold ω is determined. Then, the candidate with the smallest partition depth range is selected, namely Dn*(x, y) with n* = argminn ∆Dn(x, y), n ∈ Nω (if Nω is empty, the candidate with the highest confidence is chosen). The final depth is the mean of all candidates within a relative distance threshold η of Dn*(x, y), which is expressed by introducing a mask:

The final output depth value can be expressed as:
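A minimal single-pixel sketch of this fusion rule (hypothetical function name; the ω and η values are placeholders, not the paper's settings):

```python
import numpy as np

def fuse_depth_candidates(depths, conf, delta_d, omega=0.5, eta=0.01):
    """Fuse N per-source depth candidates for one pixel:
    1) keep candidates with confidence > omega (fall back to the most
       confident one if none pass);
    2) pick the anchor with the smallest partition depth range delta_d;
    3) average all kept candidates within relative distance eta of it."""
    depths, conf, delta_d = map(np.asarray, (depths, conf, delta_d))
    keep = conf > omega
    if not keep.any():
        keep = conf == conf.max()          # fallback: most confident candidate
    idx = np.flatnonzero(keep)
    anchor = idx[np.argmin(delta_d[idx])]  # n* = argmin over the kept set
    close = np.abs(depths - depths[anchor]) <= eta * depths[anchor]
    return depths[keep & close].mean()
```

Averaging only the candidates that agree with the anchor suppresses outliers from source images where the pixel is occluded.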

7. Experiment

7.1. Implementation Details

The method is implemented in PyTorch and trained with the ADAM optimizer with batch size 1.

7.2. Comparison with the state of the art


Origin blog.csdn.net/qq_43307074/article/details/129618202