[Paper Brief] ATLAS-MVSNet: Attention Layers for Feature Extraction and Cost Volume Regularization ICPR 2022

1. Brief introduction of the paper

1. First author: Rafael Weilharter

2. Year of publication: 2022

3. Publication venue: ICPR (International Conference on Pattern Recognition)

4. Keywords: MVS, 3D reconstruction, local attention, feature extraction, cost volume regularization

5. Exploration motivation: Context-aware features are not exploited well enough, and accurate matching problems still remain in low-textured, repetitive, specular and reflective regions.

While these methods are able to achieve impressive results, accurate matching problems still remain in low-textured, repetitive, specular and reflective regions. A possible reason for this is that context-aware features have not been leveraged well enough yet.

6. Work goal: Use attention to address the above problems. However, global attention layers attend to all spatial locations of the input and are therefore limited to small inputs.

Nevertheless, these works rely on global attention layers, which attend to all spatial locations of an input and are therefore limited to a small input.

7. Core idea: ATLAS-MVSNet is proposed, which employs local attention layers (ATLAS) in both feature extraction and cost volume regularization to significantly improve on common CNN-only solutions.

We introduce a multi-stage feature extraction network with hybrid attention blocks (HABs) to extract dense features and capture important information for the later matching and depth inference tasks.
We extend the local 2D attention layers proposed by [26] to 3D in order to be able to adopt our HABs for the 3D regularization network.
We produce clean depth maps prior to applying any filtering technique with an end-to-end neural network that is fully trainable on a single consumer grade GPU with only 11GB of memory.

8. Experimental results:

We perform extensive evaluations to show that our ATLAS-MVSNet ranks amongst the top methods on the DTU and the more challenging Tanks and Temples (TaT) benchmarks.

9. Paper and code download:

Code: https://github.com/rafael-weilharter/ATLAS-MVSNet (Attention Layers for Feature Extraction and Cost Volume Regularization in Multi-View Stereo)

Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9956633

2. Implementation process

1. Overview of ATLAS-MVSNet

Architecture overview: First, a 2D multi-stage feature extraction network with HABs is applied to the given set of images. Features at the different scales are aggregated into cost volumes through homography warping. The cost volume at the coarsest scale (stage n − 1) is regularized by 3D CNNs and 3D HABs, and a depth estimate is produced by regression. This depth estimate is then used to initialize the cost volume of the subsequent stage. The process is repeated for n stages to obtain the final depth map.
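A minimal sketch of this coarse-to-fine loop, written in PyTorch style. Every argument after num_stages is a hypothetical callable standing in for a component described in the following subsections, not the authors' actual API:

```python
def coarse_to_fine_depth(images, proj_matrices, num_stages,
                         extract_features, init_hypotheses, refine_hypotheses,
                         build_cost_volume, regularize, regress_depth):
    """Sketch of the multi-stage pipeline: the coarsest stage covers the full
    depth range, finer stages refine around the previous depth estimate."""
    # Multi-scale features for every view: features[v][stage], stage n-1 = coarsest.
    features = [extract_features(img) for img in images]

    depth = None
    for stage in reversed(range(num_stages)):                  # n-1, ..., 1, 0
        stage_feats = [f[stage] for f in features]
        # Coarsest stage spans the full depth range; finer stages sample a
        # narrower range around the previous depth estimate.
        hyps = init_hypotheses(stage) if depth is None else refine_hypotheses(stage, depth)
        cost = build_cost_volume(stage_feats, proj_matrices[stage], hyps)  # warping + variance
        prob = regularize(cost, stage)                         # 3D CNN (+ 3D HABs at the coarsest stage)
        depth = regress_depth(prob, hyps)                      # soft argmin over the depth hypotheses
    return depth
```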

2. Feature extraction

The feature extractor is a multi-stage network with a U-Net-like structure. At the beginning, 4 convolutional layers are applied, where the stride of layer 1 is set to 2. The feature maps obtained in stage 0 are then passed through a 2D HAB.
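A possible reading of this stem as a PyTorch sketch; kernel sizes, channel width and normalization are assumptions not specified in this summary:

```python
import torch.nn as nn

class Stem(nn.Module):
    """Four initial convolutions; layer 1 uses stride 2 to halve the resolution."""
    def __init__(self, in_ch=3, ch=16):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.GroupNorm(4, ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),    nn.GroupNorm(4, ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),    nn.GroupNorm(4, ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),    nn.GroupNorm(4, ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)
```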

2D hybrid attention block. HABs are constructed as residual blocks, using a hybrid combination of convolutional and local attention layers. To reduce the memory requirements of the local attention layer, the input is first passed through a convolutional layer with stride 2, followed by group normalization (GN) and a ReLU layer. A local attention layer with LayerScale is then applied. The local attention layer works as follows.

Similar to convolution, the input is a local region R of size s × s (s = 3) centered on the pixel of interest xij. In contrast to a convolutional layer, which learns a single transformation, three different transformations are learned for query, key and value. Over R, the output yij for the pixel is computed with a softmax operation σ(·):

yij = Σ(a,b ∈ R) σ(qijᵀ kab) · vab

where query qij = Wq xij, key kab = Wk xab and value vab = Wv xab are learnable linear transformations with their respective weight matrices Wq, Wk and Wv. The disadvantage of this formulation is that no positional information is encoded, which leads to permutation equivariance and limits performance on vision tasks. Therefore, relative positional embeddings are introduced by adding learnable parameters to the keys: half of the output channel dimension encodes the row offset and the other half encodes the column offset. In practice, this can be done by arranging the 2D encoding as a vector rab, resulting in:

yij = Σ(a,b ∈ R) σ(qijᵀ kab + qijᵀ rab) · vab

In this way, attention layers can be integrated into the network like convolutional layers.
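A compact PyTorch sketch of such a local attention layer (single head, unfold-based neighborhood gathering, illustrative initialization); this follows the formulation above but is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention2D(nn.Module):
    """Local self-attention over an s x s neighborhood with relative positional
    embeddings: half of the channels encode the row offset, half the column offset."""
    def __init__(self, channels, s=3):
        super().__init__()
        assert channels % 2 == 0
        self.s, self.c = s, channels
        self.to_q = nn.Conv2d(channels, channels, 1, bias=False)  # W_q
        self.to_k = nn.Conv2d(channels, channels, 1, bias=False)  # W_k
        self.to_v = nn.Conv2d(channels, channels, 1, bias=False)  # W_v
        self.rel_row = nn.Parameter(torch.randn(channels // 2, s, 1) * 0.02)
        self.rel_col = nn.Parameter(torch.randn(channels // 2, 1, s) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        pad = self.s // 2
        q = self.to_q(x).view(b, c, 1, h * w)
        # Gather the s*s neighborhood of every pixel: (b, c, s*s, h*w)
        k = F.unfold(self.to_k(x), self.s, padding=pad).view(b, c, self.s * self.s, h * w)
        v = F.unfold(self.to_v(x), self.s, padding=pad).view(b, c, self.s * self.s, h * w)
        # Relative embedding r_ab, arranged as one c-dim vector per neighborhood offset.
        rel = torch.cat([self.rel_row.expand(-1, self.s, self.s),
                         self.rel_col.expand(-1, self.s, self.s)], dim=0)
        rel = rel.reshape(1, c, self.s * self.s, 1)
        # softmax(q^T k + q^T r) over the neighborhood, then weighted sum of the values.
        attn = (q * (k + rel)).sum(dim=1).softmax(dim=1)           # (b, s*s, h*w)
        out = (attn.unsqueeze(1) * v).sum(dim=2)                   # (b, c, h*w)
        return out.view(b, c, h, w)
```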

As a normalization strategy, LayerScale is applied. Formally, the output Xatt of the attention layer is multiplied by a diagonal matrix of learnable weights:

Y = Xdown + diag(λ1, …, λn) · Xatt

where Y is the final output of the HAB and Xdown is the downsampled input (the residual branch). The parameters λ1 to λn are learnable weights.
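A sketch of how a 2D HAB could then be assembled from the pieces above (strided convolution + GN + ReLU, local attention, LayerScale, residual over the downsampled input). The attention module is passed in, e.g. the LocalAttention2D sketch above; group count and the LayerScale initialization are assumptions:

```python
import torch
import torch.nn as nn

class HAB2D(nn.Module):
    """Hybrid attention block: strided conv + GN + ReLU, then a local attention
    layer whose output is scaled by LayerScale and added to the downsampled input."""
    def __init__(self, in_ch, out_ch, attention, init_scale=0.1):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = attention                                       # e.g. LocalAttention2D(out_ch)
        self.layer_scale = nn.Parameter(init_scale * torch.ones(out_ch))  # lambda_1 ... lambda_n

    def forward(self, x):
        x_down = self.down(x)                                       # X_down (residual branch)
        x_att = self.attn(x_down)                                   # X_att
        return x_down + self.layer_scale.view(1, -1, 1, 1) * x_att  # Y = X_down + diag(lambda) * X_att
```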

The last HAB output at the lowest scale produces the coarsest feature maps. For the subsequent stages, the previous HAB output is upsampled by a factor of 2 and concatenated with the HAB output of the current stage; additional convolutional layers are applied after the concatenation.
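A small sketch of this per-stage fusion step; the hypothetical StageFusion module name, bilinear upsampling and single fusion convolution are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageFusion(nn.Module):
    """Upsample the previous stage's features by 2x, concatenate with the current
    HAB output, and fuse with an extra convolution."""
    def __init__(self, prev_ch, cur_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(prev_ch + cur_ch, out_ch, 3, padding=1)

    def forward(self, prev_feat, cur_feat):
        prev_up = F.interpolate(prev_feat, scale_factor=2, mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([prev_up, cur_feat], dim=1))
```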

3. Cost volume construction

The cost volume is constructed by means of homography warping and a variance-based metric. The smallest cost volume, covering the entire depth range, is constructed at the coarsest stage. Subsequent cost volumes are built over narrower depth ranges, based on the depth map predicted at the previous stage.
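A sketch of the variance-based aggregation in the style of common MVSNet-family implementations; the homography warping itself is abstracted into a hypothetical warp_to_reference callable that resamples source features at each depth hypothesis:

```python
import torch

def variance_cost_volume(ref_feat, src_feats, depth_hypotheses, warp_to_reference):
    """Aggregate reference and warped source features into a variance cost volume.

    ref_feat:          (B, C, H, W) reference-view features
    src_feats:         list of (B, C, H, W) source-view features
    depth_hypotheses:  (B, D, H, W) per-pixel depth hypotheses
    warp_to_reference: hypothetical helper implementing the homography warp,
                       returning (B, C, D, H, W) for one source view
    """
    num_views = 1 + len(src_feats)
    d = depth_hypotheses.shape[1]
    ref_volume = ref_feat.unsqueeze(2).expand(-1, -1, d, -1, -1)   # (B, C, D, H, W)

    # Running sums of x and x^2 so the variance can be computed without
    # keeping all warped volumes in memory.
    vol_sum = ref_volume.clone()
    vol_sq_sum = ref_volume ** 2
    for i, src in enumerate(src_feats):
        warped = warp_to_reference(src, i, depth_hypotheses)        # (B, C, D, H, W)
        vol_sum = vol_sum + warped
        vol_sq_sum = vol_sq_sum + warped ** 2

    # Var = E[x^2] - E[x]^2 across views; low variance = good photometric consistency.
    return vol_sq_sum / num_views - (vol_sum / num_views) ** 2
```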

4. Cost volume regularization

Depth maps are predicted in a coarse-to-fine manner. Depth is regressed via a 3D regularization network followed by a soft argmin operation. The 3D regularization network consists of 5 blocks of two 3D convolutional layers with residual connections, followed by 3D HABs.
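A sketch of the soft argmin step in the usual MVSNet-style formulation (softmax over the depth dimension, then the expectation over the depth hypotheses); this is the standard operation, not necessarily the authors' exact code:

```python
import torch
import torch.nn.functional as F

def soft_argmin_depth(cost_volume, depth_hypotheses):
    """cost_volume:      (B, D, H, W) regularized matching costs
    depth_hypotheses: (B, D, H, W) depth value of each hypothesis
    returns:          (B, H, W) regressed depth map
    """
    # Lower cost should mean higher probability, hence the negation before softmax.
    prob = F.softmax(-cost_volume, dim=1)
    return torch.sum(prob * depth_hypotheses, dim=1)
```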

3D hybrid attention blocks. The design principle is the same as for the 2D HAB, but without downsampling, so the cost volume remains at a constant scale. The local 2D attention layer is extended to 3D by extending the weight matrices Wq, Wk and Wv to three dimensions. To extend the positional encoding to 3D, an additional learnable parameter vector is added for the depth direction. The factorization is thus done across 3 dimensions, with each encoded embedding occupying 1/3 of the output channel dimension.
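A small sketch of how this factorized 3D positional embedding could look, with one learnable embedding per axis, each occupying one third of the channels (purely illustrative; sizes and initialization are assumptions):

```python
import torch
import torch.nn as nn

class RelPosEmbedding3D(nn.Module):
    """Factorized relative positional embedding for an s x s x s neighborhood:
    depth, row and column offsets each encode 1/3 of the output channels."""
    def __init__(self, channels, s=3):
        super().__init__()
        assert channels % 3 == 0
        c3 = channels // 3
        self.rel_d = nn.Parameter(torch.randn(c3, s, 1, 1) * 0.02)
        self.rel_h = nn.Parameter(torch.randn(c3, 1, s, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(c3, 1, 1, s) * 0.02)
        self.s = s

    def forward(self):
        s = self.s
        # Broadcast each axis embedding over the full s x s x s neighborhood and
        # concatenate along the channel dimension: (channels, s, s, s).
        return torch.cat([self.rel_d.expand(-1, s, s, s),
                          self.rel_h.expand(-1, s, s, s),
                          self.rel_w.expand(-1, s, s, s)], dim=0)
```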

The network only uses 3D HABs at the coarsest stage for two reasons: 1) HABs come at the cost of increased GPU memory consumption, since different transformations have to be learned for queries, keys and values, which drastically increases the GPU memory requirements at finer stages. 2) Obtaining correct depth estimates is most critical at the coarsest stage, which covers the entire depth range, since this prediction is propagated to the finer stages.

5. Loss function

The network has n stages, producing n−1 intermediate outputs and 1 final depth prediction. The mean absolute error is computed at each stage and the weighted losses are summed:

L = Σ(k = 0 … n−1) λk · MAE(Dk, DGT)

where λk is the loss weight, which is reduced by 1/2 at each stage, Dk is the depth prediction at stage k and DGT is the ground-truth depth.
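A sketch of this weighted multi-stage L1 loss; the ordering of the stage list, the validity masking, and which stage receives the full weight are assumptions for illustration:

```python
import torch.nn.functional as F

def multi_stage_loss(depth_preds, depth_gts, masks, base_weight=1.0):
    """depth_preds: list of per-stage depth predictions (first entry gets the full weight)
    depth_gts:   ground-truth depth resampled to each stage's resolution
    masks:       boolean masks of valid ground-truth pixels per stage
    """
    loss = 0.0
    weight = base_weight
    for pred, gt, mask in zip(depth_preds, depth_gts, masks):
        # Mean absolute error over valid pixels, weighted per stage.
        loss = loss + weight * F.l1_loss(pred[mask], gt[mask])
        weight = weight * 0.5          # weight is halved from one stage to the next
    return loss
```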

6. Experiment

6.1. Implementation Details

The number of stages of the final network is set to 5, from the coarsest stage 4 to the finest stage 0, and the numbers of depth hypotheses are set to 32, 8, 8, 4, respectively. The number of training input images is 3, the image resolution is 1600×1152, and training takes a total of 18 epochs. The network can be trained end-to-end on a single consumer-grade GPU with 11GB of memory (e.g., Nvidia GeForce GTX 1080 Ti, Nvidia GeForce RTX 2080 Ti). The number of test input images is 5.

6.2. Comparison with state-of-the-art methods

Note that there is a trade-off between these measurements, which depends on the fusion parameter τ.


Origin blog.csdn.net/qq_43307074/article/details/130050926