Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV

Reference code: slowtv_monodepth

This paper proposes a method for self-supervised depth estimation on uncalibrated datasets: the network itself predicts the camera intrinsics, which completes the self-supervision loop without requiring calibration. To validate self-supervised depth estimation without camera calibration, videos were downloaded from the Internet to build the SlowTV dataset. Several refinements were also added to the self-supervised training process, such as data augmentation with arbitrary image aspect ratios.

Loss function:
The loss function is similar to the one in MonoDepth2. The photometric reconstruction error keeps the familiar form, a weighted sum of an SSIM term and an L1 term:

$$L_{ph}(I, I') = \lambda \frac{1 - L_{ssim}(I, I')}{2} + (1 - \lambda) L_1(I, I')$$

Moving objects are handled with the minimum reconstruction error over the previous and next frames, together with the Auto-Mask mechanism. These mitigate the problem but cannot fundamentally eliminate it:

$$L_{rec} = \sum_p \min_k L_{ph}(I_t, I'_{t+k})$$

where $I_t$ is the target frame, $t+k$ indexes the previous and next frames, and $I'_{t+k}$ is the target view reconstructed from frame $t+k$. Auto-Mask follows MonoDepth2: a pixel is kept only where the error after reconstruction is smaller than the error against the unwarped neighbouring frame:

$$\mathcal{M} = \left[ \min_k L_{ph}(I_t, I'_{t+k}) < \min_k L_{ph}(I_t, I_{t+k}) \right]$$
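
The three formulas above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code from slowtv_monodepth: the function names, the 3x3 box-filter SSIM, the weight `lam=0.85`, and the averaging over unmasked pixels (instead of the plain sum over p) are all choices made here for the example.

```python
import numpy as np

def box3(x):
    """3x3 mean filter with reflect padding; x has shape (h, w)."""
    p = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map for two grayscale images in [0, 1]."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric(a, b, lam=0.85):
    """L_ph = lam * (1 - SSIM) / 2 + (1 - lam) * L1, evaluated per pixel."""
    ssim_term = np.clip((1.0 - ssim(a, b)) / 2.0, 0.0, 1.0)
    return lam * ssim_term + (1.0 - lam) * np.abs(a - b)

def reconstruction_loss(target, warped, sources):
    """Minimum reconstruction error with auto-masking.

    target:  (h, w) target frame I_t
    warped:  list of (h, w) reconstructions I'_{t+k}
    sources: list of (h, w) unwarped neighbouring frames I_{t+k}
    """
    err_warped = np.min([photometric(target, w) for w in warped], axis=0)
    err_static = np.min([photometric(target, s) for s in sources], axis=0)
    mask = err_warped < err_static  # Auto-Mask M
    # Assumption: average over kept pixels rather than a raw sum over p.
    loss = (err_warped * mask).sum() / max(mask.sum(), 1)
    return loss, mask
```

If a neighbouring frame is identical to the target (a static camera, for example), `err_static` is zero and the mask rejects every pixel, which is exactly the case Auto-Mask is meant to filter out.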

Camera intrinsics prediction:
Videos collected from the Internet come without calibrated intrinsics, so the network has to predict them (this is unnecessary for data with accurate intrinsics). Only one prediction is needed per input sequence, and the code checks for this as well. As the excerpts below show, each prediction head is a small convolutional decoder followed by global average pooling, with a different output activation for the focal length and for the principal point. The focal length uses Softplus, whose curve resembles a smooth ReLU and keeps the output positive.

# src/networks/pose.py#L86
def _get_focal_dec(self, n_ch: int) -> nn.Sequential:
    """Return focal length estimation decoder. (b, c, h, w) -> (b, 2)"""
    return nn.Sequential(
        self.block(n_ch, n_ch, kernel_size=3, stride=1, padding=1),
        self.block(n_ch, n_ch, kernel_size=3, stride=1, padding=1),
        nn.Conv2d(n_ch, 2, kernel_size=1),  # (b, 2, h, w)
        nn.AdaptiveAvgPool2d((1, 1)),  # (b, 2, 1, 1)
        nn.Flatten(),  # (b, 2)
        nn.Softplus(),
    )

The principal point uses a sigmoid activation. Since the principal point normally lies near the image center (about 0.5 in normalized coordinates), the output stays in the near-linear part of the sigmoid, so gradient saturation is not a concern.

# src/networks/pose.py#L97
def _get_offset_dec(self, n_ch: int) -> nn.Sequential:
    """Return principal point estimation decoder. (b, c, h, w) -> (b, 2)"""
    return nn.Sequential(
        self.block(n_ch, n_ch, kernel_size=3, stride=1, padding=1),
        self.block(n_ch, n_ch, kernel_size=3, stride=1, padding=1),
        nn.Conv2d(n_ch, 2, kernel_size=1),  # (b, 2, h, w)
        nn.AdaptiveAvgPool2d((1, 1)),  # (b, 2, 1, 1)
        nn.Flatten(),  # (b, 2)
        nn.Sigmoid(),
    )
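
To make the role of the two activations concrete, here is a minimal sketch of how raw decoder outputs could be turned into a pinhole intrinsics matrix. It assumes the heads predict intrinsics normalized by image size, which the excerpts above do not confirm; `build_intrinsics` and its arguments are hypothetical names used only for this example.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return np.logaddexp(0.0, x)

def sigmoid(x):
    """Maps any real value into (0, 1); 0 maps to 0.5."""
    return 1.0 / (1.0 + np.exp(-x))

def build_intrinsics(focal_raw, offset_raw, width, height):
    """Build a 3x3 intrinsics matrix from raw (pre-activation) outputs.

    Assumption: focal length and principal point are predicted in
    normalized units and scaled by the image width and height.
    """
    fx, fy = softplus(np.asarray(focal_raw, dtype=float))
    cx, cy = sigmoid(np.asarray(offset_raw, dtype=float))
    return np.array([
        [fx * width, 0.0, cx * width],
        [0.0, fy * height, cy * height],
        [0.0, 0.0, 1.0],
    ])
```

With zero raw offsets the sigmoid returns 0.5, so the principal point lands exactly at the image center, which matches the intuition that the prediction only has to learn small deviations from it.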

Random image scale:
MiDaS has shown that image size affects depth estimation. To make the network more robust and improve its zero-shot generalization, the image is randomly cropped and resized. The crop covers between 50% and 100% of the image height, and the aspect ratio varies as well, for example 1:1 or 16:9.
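
The crop sampling described above might look like the following sketch. The list of aspect ratios, the function name `sample_crop`, and the handling of crops that would exceed the image width are assumptions made for illustration, not details taken from the paper or its code.

```python
import random

# Hypothetical set of width:height ratios to sample from.
ASPECT_RATIOS = [(1, 1), (4, 3), (16, 9)]

def sample_crop(img_w, img_h, rng):
    """Sample a crop box (x0, y0, crop_w, crop_h) inside the image.

    The crop height covers 50% to 100% of the image height and the
    width follows from a randomly chosen aspect ratio.
    """
    ratio_w, ratio_h = rng.choice(ASPECT_RATIOS)
    crop_h = round(img_h * rng.uniform(0.5, 1.0))
    crop_w = round(crop_h * ratio_w / ratio_h)
    if crop_w > img_w:
        # Assumption: keep the ratio and shrink the crop to fit the width,
        # even if the height then drops below 50% of the image.
        crop_w = img_w
        crop_h = min(img_h, round(crop_w * ratio_h / ratio_w))
    x0 = rng.randint(0, img_w - crop_w)  # randint is inclusive on both ends
    y0 = rng.randint(0, img_h - crop_h)
    return x0, y0, crop_w, crop_h

rng = random.Random(0)
box = sample_crop(1280, 720, rng)
```

The cropped region would then be resized to the network input resolution, so the network sees the same scene content at different scales and shapes.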

Ablation experiments on how the above components affect network performance:
[Figure: ablation results]

Factors that influence self-supervised depth estimation performance:
The following comes from another work by the same team, which studies how the backbone, the loss function, and other design choices in a self-supervised depth estimation algorithm affect its results:

Paper: Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter

The backbone has the greatest impact on network performance:
[Figure: comparison of backbones]

The choice of depth smoothness regularization causes only small fluctuations in depth performance:
[Figure: comparison of smoothness regularization variants]

Origin blog.csdn.net/m_buddy/article/details/132372909