[Paper Brief] Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation (AAAI 2021)

1. Brief introduction of the paper

1. First authors: Hongbin Xu, Zhipeng Zhou

2. Year of publication: 2021

3. Publication venue: AAAI (conference)

4. Keywords: MVS, self-supervision, segmentation, data augmentation

5. Exploration motivation: Self-supervised MVS methods that use image reconstruction as a proxy task all rely on a rough assumption, the Color Constancy Hypothesis: matching points across multiple views have the same color. In real scenarios, however, the color values of multi-view images can be disturbed by various external factors such as lighting changes, reflections, and noise, so matching points may end up with different colors. Self-supervised signals built on the color-constancy assumption are therefore likely to introduce false supervision in these cases and instead hurt the model's performance. We call this type of problem Color Constancy Ambiguity.

6. Work goal: The root cause of color constancy ambiguity in self-supervised MVS is that the image-reconstruction proxy task only considers correspondence in color space. A metric based on differences of RGB pixel values is not reliable enough to represent the correspondence between multiple views, and it also limits the performance of self-supervised methods. It is therefore natural to ask how additional prior knowledge can be introduced to provide a more robust proxy task as the self-supervision signal. This can be divided into the following two points:

  1. Semantic consistency: introduce abstract semantic information to provide robust correspondence, replacing the image reconstruction task with a semantic segmentation map reconstruction task to build the self-supervision signal.
  2. Data augmentation consistency: introduce data augmentation into self-supervised training to improve the network's robustness to different color variations.

7. Core idea: When constructing these self-supervision signals, some problems still cannot be ignored:

  1. For the semantic consistency prior, obtaining semantic segmentation map annotations is prohibitively expensive. In addition, the scenes in the training set change dynamically, so we cannot clearly define the semantic categories of all elements in all scenes as in autonomous driving. This is why previous self-supervised methods have not used semantic information to construct self-supervised losses. To this end, a self-supervised loss is built by unsupervised co-segmentation on the multi-view images to mine the semantic information shared across the views.
  2. For the data augmentation consistency prior, data augmentation itself changes the color distribution; in other words, it may in turn cause color constancy ambiguity and interfere with the self-supervision signal. To this end, the single-branch self-supervised training framework is split into two branches, and the predictions of the original branch are used as pseudo-labels to supervise the predictions of the data augmentation branch.
  3. The specific contributions are as follows:
  1. We propose a unified unsupervised MVS pipeline called the Joint Data-Augmentation and Co-Segmentation framework (JDACS), in which the extra priors of semantic consistency and data augmentation consistency provide reliable guidance to overcome the color constancy ambiguity.
  2. We propose a novel self-supervision signal based on semantic consistency, which can excavate mutual semantic correspondences from multi-view images of non-fixed scenes in a totally unsupervised manner.
  3. We propose a novel way to incorporate heavy data augmentation into unsupervised MVS, which provides regularization against color fluctuation.

8. Experimental results:

The experimental results show that our proposed method leads to a leap in performance among unsupervised methods and competes on par with some top supervised methods.

9. Paper and code download:

https://arxiv.org/pdf/2104.05374v1.pdf

https://github.com/ToughStoneX/Self-Supervised-MVS

2. Implementation process

1. Overview of JDACS

The whole framework is divided into three branches:

  1. Depth estimation branch: the reference view and source view images are fed into the network, and the predicted depth map together with the source view images is used to reconstruct the reference view image. The difference between the reconstructed image and the original image under the reference view is compared to construct the photometric consistency loss.
  2. Co-segmentation branch: the input multi-view images are fed into a pre-trained VGG network, and non-negative matrix factorization (NMF) is applied to its feature maps. Owing to the orthogonality constraint of NMF, this process can be viewed as clustering the semantics shared among the views, and it outputs co-segmentation maps. A segmentation-map reconstruction task is then built from the predicted depth map and the multi-view co-segmentation maps, i.e., the semantic consistency loss.
  3. Data augmentation branch: random data augmentation is applied to the original multi-view images, which are then fed into the network. The depth map predicted by the depth estimation branch serves as a pseudo-label to supervise the predictions of the data augmentation branch, forming the data augmentation consistency loss.

2. Depth estimation branch

Backbone networks such as MVSNet and CVP-MVSNet are adopted to predict the depth map.

Photometric consistency: the key idea of photometric consistency is to minimize the difference between the synthesized image and the original image under the same viewpoint. Denote the 1st view as the reference view and the remaining N−1 views as source views indexed by i (2 ≤ i ≤ N). For a particular pair of images (I1, Ii) with associated intrinsic and extrinsic parameters (K, T), the corresponding location p'_j in the source view can be computed from the reference-view coordinate p_j:

$$p'_j = KT\left(D(p_j)\,K^{-1}p_j\right)$$

where j (1 ≤ j ≤ HW) indexes the pixels and D is the predicted depth map. The warped image Î_i is then obtained via differentiable bilinear sampling.

During warping, a binary validity mask M_i is generated at the same time, marking the valid pixels in the new view, since some pixels may be projected outside the image boundary. In MVS, all N−1 source views are warped to the reference view to compute the photometric consistency loss:

$$L_{PC}=\sum_{i=2}^{N}\frac{1}{\|M_i\|_1}\left(\left\|(\hat{I}_i-I_1)\odot M_i\right\|_1+\left\|(\nabla\hat{I}_i-\nabla I_1)\odot M_i\right\|_1\right)$$

where ∇ denotes the gradient operator and ⊙ is the element-wise (Hadamard) product; the L1 loss is computed in both color space and gradient space.
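To make the photometric term concrete, here is a minimal PyTorch sketch of the masked L1 loss in color and gradient space. It is only an illustration: the function names are hypothetical, and the differentiable warping that produces `warped_src` and `mask` (e.g., via `torch.nn.functional.grid_sample`) is assumed to happen elsewhere, so this is not the authors' exact implementation.

```python
import torch

def gradient_x(img):
    # Horizontal image gradient, shape [B, C, H, W-1].
    return img[:, :, :, :-1] - img[:, :, :, 1:]

def gradient_y(img):
    # Vertical image gradient, shape [B, C, H-1, W].
    return img[:, :, :-1, :] - img[:, :, 1:, :]

def photometric_loss(warped_src, ref, mask, eps=1e-6):
    """Masked L1 loss in color space plus gradient space.

    warped_src: source image warped into the reference view, [B, 3, H, W]
    ref:        reference image, [B, 3, H, W]
    mask:       binary validity mask produced by the warping, [B, 1, H, W]
    """
    # Color-space term, normalized by the number of valid pixels.
    color = (torch.abs(warped_src - ref) * mask).sum() / (mask.sum() + eps)
    # Gradient-space term; crop the mask to match the gradient shapes.
    mx, my = mask[:, :, :, :-1], mask[:, :, :-1, :]
    grad = (torch.abs(gradient_x(warped_src) - gradient_x(ref)) * mx).sum() / (mx.sum() + eps) \
         + (torch.abs(gradient_y(warped_src) - gradient_y(ref)) * my).sum() / (my.sum() + eps)
    return color + grad
```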

3. Co-segmentation branch

Implicit common segmentation is mined from the multi-view images via unsupervised co-segmentation. The goal of co-segmentation is to localize the foreground pixels of the common objects in a given collection of images. Non-negative matrix factorization (NMF) has an inherent clustering property. Following a classic co-segmentation pipeline, applying NMF to the activations of a pre-trained CNN layer can be used to discover semantic correspondences across images.

1. Take the N images as input and pass them through an ImageNet-pretrained VGG model to obtain feature maps of shape [N, C, h, w] (here h and w are the feature-map size, not the original image size).

2. Reshape the N feature maps to [Nhw, C], and obtain a P matrix [Nhw, K] and a Q matrix [K, C] through non-negative matrix factorization; then reshape the P matrix to [N, h, w, K]. K is the preset number of categories, and the purpose is to cluster the pixels of the feature maps into K classes.

3. Convert the result into a one-hot map, then construct a soft semantic map through softmax (a sketch of steps 1-3 follows this list).

4. Calculate the co-segmentation loss: the co-segmentation maps are warped across views with the predicted depth map, and the semantic consistency loss compares the warped and original segmentation maps, analogous to the photometric term.
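The pipeline above maps onto a few tensor operations. Below is a hypothetical Python sketch using torchvision's VGG features and scikit-learn's NMF; the layer choice, the number of clusters K, and skipping the explicit one-hot step are simplifying assumptions rather than the authors' exact code.

```python
import torch
from torchvision.models import vgg16
from sklearn.decomposition import NMF

def co_segmentation(images, K=4):
    """images: ImageNet-normalized multi-view batch [N, 3, H, W]
    -> soft semantic maps [N, K, h, w]."""
    backbone = vgg16(pretrained=True).features.eval()
    with torch.no_grad():
        feats = backbone(images)       # [N, C, h, w]; ReLU outputs are non-negative
    N, C, h, w = feats.shape
    X = feats.permute(0, 2, 3, 1).reshape(N * h * w, C)   # [Nhw, C]
    # NMF factorizes X ~= P @ Q, clustering pixels from all views into K shared semantics.
    nmf = NMF(n_components=K, init='nndsvd', max_iter=200)
    P = nmf.fit_transform(X.cpu().numpy())                # [Nhw, K]
    P = torch.from_numpy(P).float().reshape(N, h, w, K).permute(0, 3, 1, 2)
    # Soft semantic maps via softmax over the K cluster channels.
    return torch.softmax(P, dim=1)
```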

4. Data augmentation branch

Some recent work on contrastive learning demonstrates the benefits of data augmentation in self-supervised learning. Intuitively, data augmentation brings challenging samples that stress the unsupervised loss and thereby encourage robustness to variations. Briefly, define a random vector θ that parameterizes an augmentation τθ: I → Īτθ. However, data augmentation is rarely applied in self-supervised MVS, since the color fluctuations in augmented images may interfere with self-supervised constraints that rely on color constancy. Therefore, instead of optimizing the original view-synthesis objective on the augmented data, data augmentation consistency is enforced by regularizing the outputs on the original and augmented samples.

Specifically, data augmentation is applied to the N images, the depth map at each pyramid level is obtained through the shared CVP-MVSNet of the depth estimation branch, and the loss is then computed.

Data augmentation consistency loss: the regular forward-pass prediction for the original images I in the depth estimation branch is denoted D, and the prediction for the augmented images Īτθ is denoted D̄τθ. In a contrastive fashion, data augmentation consistency is enforced by minimizing the difference between D and D̄τθ:

$$L_{DA}=\frac{1}{\|M_{\tau_\theta}\|_1}\left\|\left(D-\bar{D}_{\tau_\theta}\right)\odot M_{\tau_\theta}\right\|_1$$

where Mτθ denotes the non-occluded mask under the transformation τθ. Due to the epipolar constraints between different views, the augmentations integrated in this framework do not change the spatial locations of pixels. The augmentation methods are as follows (a sketch of the composed transformation appears after the list):

  1. Cross-view masking: to simulate occlusions in the multi-view setting, a binary cropping mask 1−Mτθ1 is randomly generated to occlude some regions on the reference view. The occlusion mask is then projected onto the other views, occluding the corresponding regions in those images. Assuming that the remaining region Mτθ1 is unaffected by the transformation, the valid regions of the original and augmented predictions can be compared.
  2. Gamma correction: gamma correction is a non-linear operation used to adjust the luminance of an image. To simulate various illuminations, a random gamma correction τθ2 parameterized by θ2 is applied to challenge the unsupervised loss.
  3. Color jitter and blur: many transforms attach color fluctuations to an image, such as random color jitter, random blur, and random noise. Color fluctuations make unsupervised MVS losses unreliable, since the photometric loss requires constant color across views. Conversely, these transformations, denoted τθ3, create challenging samples and regularize robustness to color fluctuations in self-supervision.

The entire transformation τθ can be expressed as the composition of the above augmentations: τθ = τθ3 ∘ τθ2 ∘ τθ1, where ∘ denotes function composition.
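Here is a hypothetical sketch of the composed augmentation τθ3 ∘ τθ2 ∘ τθ1; the parameter ranges and the simplified occlusion mask (the same rectangle cut in every view instead of a depth-projected one) are illustrative assumptions, not the paper's settings.

```python
import torch

def augment(imgs):
    """imgs: [N, 3, H, W] in [0, 1] -> (augmented images, validity mask [N, 1, H, W])."""
    N, _, H, W = imgs.shape
    # tau_1: random rectangular occlusion; the paper projects the reference-view
    # mask onto the other views, which we approximate here by a shared rectangle.
    mask = torch.ones(N, 1, H, W)
    y = torch.randint(0, H // 2, (1,)).item()
    x = torch.randint(0, W // 2, (1,)).item()
    mask[:, :, y:y + H // 4, x:x + W // 4] = 0.0
    out = imgs * mask
    # tau_2: random gamma correction to simulate illumination changes.
    gamma = torch.empty(1).uniform_(0.5, 2.0).item()
    out = out.clamp(min=1e-6) ** gamma
    # tau_3: random per-channel color jitter plus additive noise.
    gain = torch.empty(1, 3, 1, 1).uniform_(0.8, 1.2)
    out = (out * gain + 0.02 * torch.randn_like(out)).clamp(0.0, 1.0)
    return out, mask
```

The consistency loss then compares depth predictions only inside the valid mask, e.g. `((d_orig.detach() - d_aug).abs() * mask).sum() / mask.sum()`, where `detach()` makes the original prediction act as a pseudo-label.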

5. Overall loss

In addition to the basic self-supervised signal LPC based on photometric consistency, two additional self-supervision signals are added: the semantic consistency loss LSC and the data augmentation consistency loss LDA. Besides these, some common regularization terms for depth estimation are also used, such as the structural similarity loss LSSIM and the depth smoothness loss LSmooth. The final loss is:

$$L=\lambda_1 L_{PC}+\lambda_2 L_{SSIM}+\lambda_3 L_{Smooth}+\lambda_4 L_{SC}+\lambda_5 L_{DA}$$

The weights are empirically set as: λ1 = 0.8, λ2 = 0.1, λ3 = 0.1, λ4 = 0.2, λ5 = 0.0067. 
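As a one-line sketch of the weighted sum, assuming the λ indices pair with the terms in the order written above (an assumption based on the listed weights, not verified against the released code):

```python
def total_loss(l_pc, l_ssim, l_smooth, l_sc, l_da):
    # lambda_1..lambda_5 from the paper's reported settings (pairing assumed).
    return 0.8 * l_pc + 0.1 * l_ssim + 0.1 * l_smooth + 0.2 * l_sc + 0.0067 * l_da
```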

6. Experiment

Comparison with state-of-the-art methods: as summarized above, JDACS leads among unsupervised methods and competes on par with some top supervised methods (see the paper for the full result tables).

7. Limitations

There are still some problems to be solved. First, there is no effective self-supervision signal in texture-less areas such as black/white backgrounds, because the color and even the semantics of all background pixels are identical. Second, the co-segmentation approach only mines relatively coarse semantic information, because a VGG model pre-trained on the ImageNet classification task is not well suited to segmentation tasks that require attention to fine-grained semantics.
